Updated: 13 Mar 2024
You're trying to train a model, you have a large dataset, and you want to use your GPU. But you've spent several hours researching, reading blog posts, and being overwhelmed by documentation. tl;dr: go to the template repo here https://github.com/S010MON/tensorflow-gpu and follow the steps in the README.md.
Now that those in a rush to just get their model training are satisfied, read on for an explanation of the repository.
Training on a GPU is more difficult than simply running your code on a single thread on your machine because of the increased complexity of IO (you need to interface with a GPU) and parallelism (your work needs to be chunked, processed, and reduced). This guide is not a one-stop shop; it covers only the most common setup I have encountered: a Debian Linux host machine running Python code for Tensorflow.
Yes, you can run this on Windows with WSL, and I'm sure there are overlaps with macOS, but they are outside the scope of this guide.
You will need a compatible GPU. You can check the CUDA version required for the Python/Tensorflow version you are using here: https://www.tensorflow.org/install/source#gpu
Running Tensorflow on a GPU comes with a ton of software requirements; the number of device drivers, dependencies, and configurations needed to set everything up correctly is just too much to handle by hand. So Tensorflow and NVIDIA helpfully containerise and publish their products through docker to avoid the hassle. This leaves us with three software dependencies: the NVIDIA drivers, docker (with docker-compose), and the NVIDIA Container Toolkit.
I personally find Pop!_OS to have the best driver support for NVIDIA GPUs; they even provide a disk image with the drivers pre-installed. Although there are hundreds of guides out there for installing NVIDIA drivers on both Pop!_OS and Ubuntu, I recommend first trying the official guides.
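As an illustration (assuming an Ubuntu host; Pop!_OS images come with the drivers already in place), the stock ubuntu-drivers tool can detect your GPU and install the recommended driver:
ubuntu-drivers devices           # List detected hardware and the recommended driver
sudo ubuntu-drivers autoinstall  # Install the recommended driver
sudo reboot                      # Reboot so the new kernel module is loaded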
To test that the GPU is working, use the nvidia-smi command:
nvidia-smi
which should give you output with information on your hardware:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02 Driver Version: 545.29.02 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:2B:00.0 On | N/A |
| 0% 44C P8 16W / 170W | 909MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2986 G /usr/lib/xorg/Xorg 280MiB |
| 0 N/A N/A 3099 G /usr/bin/gnome-shell 56MiB |
| 0 N/A N/A 6074 G firefox 558MiB |
| 0 N/A N/A 185656 G ...ures=SpareRendererForSitePerProcess 2MiB |
+---------------------------------------------------------------------------------------+
If you are already using docker and docker-compose, you can skip this step; otherwise, follow the instructions from the official docker documentation here: https://docs.docker.com/engine/install/ubuntu/. I strongly recommend you don't use the APT install method without getting the official GPG key from the docker website, as the repository version can often be quite out of date.
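If you want a shortcut, docker also publishes a convenience script that sets up the repository and GPG key for you (it's good practice to read the script before running it):
curl -fsSL https://get.docker.com -o get-docker.sh  # Download the official install script
sudo sh get-docker.sh                               # Run it to install docker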
After installation, add your user to the docker group so you don't have to put sudo before all your commands. I have omitted sudo from the docker commands in this guide, so if you keep getting permission denied when copying commands, that's likely why.
https://docs.docker.com/engine/install/linux-postinstall/
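The post-install steps linked above boil down to two commands, followed by logging out and back in (or running newgrp docker) so the group change takes effect:
sudo groupadd docker           # Create the docker group (it may already exist)
sudo usermod -aG docker $USER  # Add your user to the group
newgrp docker                  # Apply the new group in the current shell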
The container toolkit allows docker containers to interface with the GPU. To install it, follow the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html.
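At the time of writing, the apt-based route boils down to the following (the guide above also covers adding NVIDIA's package repository first):
sudo apt-get install -y nvidia-container-toolkit    # Install the toolkit
sudo nvidia-ctk runtime configure --runtime=docker  # Register the NVIDIA runtime with docker
sudo systemctl restart docker                       # Restart docker to pick up the change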
To test the setup, run nvidia-smi inside a base ubuntu container to check that you have access to the gpu from within a container:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
Now that you have all the requirements, make a copy of the template repository here: https://github.com/S010MON/tensorflow-gpu. The project is structured as follows:
.
├── docker-compose.yml
├── Dockerfile
├── README.md
├── requirements.txt
└── src
I'll run through each file, highlighting what it does and where there are options you might want to change.
The Dockerfile's first three lines pull a container image from Tensorflow's repository and update it. There is the option to use the :latest-gpu-jupyter tag; just be mindful that this will always pull the latest version and might break code you have written for a specific version.
# Update tensorflow version if required from here: https://hub.docker.com/r/tensorflow/tensorflow/tags
# or use :latest-gpu-jupyter
FROM tensorflow/tensorflow:2.12.0-gpu-jupyter
RUN apt-get update -y # Update container
RUN apt-get install ffmpeg libsm6 libxext6 -y # Required for CV2
WORKDIR /tf/notebooks/ # Set working dir to default notebook dir
COPY requirements.txt . # Copy and install requirements
RUN python3 -m pip install --upgrade pip
RUN pip3 install --no-cache-dir --upgrade -r ./requirements.txt
Line 5 is only required if you are using the CV2 library to do image processing; otherwise, just remove it and enjoy a faster startup time. Next we set the working directory and copy over our requirements. Note that we don't actually copy over our source code: it is accessed inside the container by mounting a volume, more on that later.
The requirements.txt file lists all the dependencies that you need for your code. I have added common project dependencies, but feel free to change them to suit your needs. You should note that these dependencies are installed when you build the container and are cached when you run it. So if you add a new dependency to the requirements, you will get an error until you stop the container and restart it with the rebuild flag set. See section 7 on how to do that.
Jupyter
numpy
tensorflow==2.12.0
keras==2.12.0
Pillow==9.3
scikit-learn
matplotlib
scipy
opencv-python==4.7.0.72
pandas==2.0.1
If you want to share your project with others, or are thinking of publishing your work with the code alongside, it is best practice to include a version number or minimum version, i.e. >=1.2.3. The number of times I have found code from a published paper that does almost exactly what I want, but can't be run because I keep getting deprecation errors, is far higher than it should be. Be a good researcher, version your code!
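One low-effort way to do this, assuming your current environment already has the working versions installed, is to let pip write them out for you, then trim the list down to your direct dependencies:
pip3 freeze > requirements.txt  # Pin every installed package to its exact version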
Docker compose is a utility for building sets of containers that can communicate with each other. We are using it to set up the GPU access, and it also reduces the number of environment variables we need to type out when we run the container. The container name can be changed, which can be helpful if you are running more than one of these containers for different projects.
version: "3.1"
services:
  jupyter:
    container_name: tensorflow-gpu
    build: .
    ports:
      - "8888:8888"
    volumes:
      - "./:/tf/notebooks"
      # - "/your/file/path/:/tf/notebooks/data/" # Only include this line if you have data stored somewhere
                                                 # else on your machine that you want to access (like a
                                                 # different drive). This is handy if you have a lot of data
                                                 # or a network share.
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
The other point to note is the volume:
volumes:
  - "./:/tf/notebooks"
This mounts the repository root (where the docker-compose.yml file is) to the file path specified within the container. So, when you update your code, or add more data, you are updating the same files as the ones seen in your container. This means you can work on your code as normal in an IDE and run it in your container without restarting it to make a copy.
- "/your/file/path/:/tf/notebooks/data/"
If you have a huge dataset that you don't want to keep in your repository (perhaps it's on another hard drive), then you can add more volumes as required using the host:container format.
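A quick way to convince yourself the mount works, once the container is up (next section), is to create a throwaway script on the host and run it inside the container (hello.py is just a hypothetical example):
echo "print('hello from the host')" > src/hello.py  # Created on the host...
docker exec tensorflow-gpu python3 src/hello.py     # ...immediately visible and runnable in the container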
To start up a jupyter server, simply run:
docker compose up
and you will be presented with tokens for accessing the notebooks. Note: on older versions of docker-compose the command is docker-compose up
tensorflow-gpu | [I 11:53:33.697 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Serving notebooks from local directory: /tf
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Jupyter Notebook 6.5.3 is running at:
tensorflow-gpu | [I 11:53:33.822 NotebookApp] http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | [I 11:53:33.822 NotebookApp] or http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
tensorflow-gpu | [C 11:53:33.824 NotebookApp]
tensorflow-gpu |
tensorflow-gpu | To access the notebook, open this file in a browser:
tensorflow-gpu | file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
tensorflow-gpu | Or copy and paste one of these URLs:
tensorflow-gpu | http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | or http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
Although notebooks can be useful for rapid prototyping, sometimes you just can't do without a script. To run scripts you'll need a terminal, which you can attach to using the exec command. First, list your running containers using docker ps -a and find the name of the container you want to run code in. Then use the docker exec -it [CONTAINER_NAME] bash command. Here the -it flags give you an interactive terminal, and bash selects the type of shell you want to use.
$ docker exec -it tensorflow-gpu bash
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@4973a39b65ae:/tf/notebooks# ls
Dockerfile README.md docker-compose.yml requirements.txt src
You will be greeted with the tensorflow logo and can run python scripts from within the /src directory.
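From that shell (or a notebook cell), it's worth confirming that Tensorflow actually sees the GPU before you start a long training run; an empty list here means something in the driver/toolkit chain is broken:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected output similar to:
# [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]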
Rebuilding. docker compose up will run the container as it is, reusing the cached state it was last in. When you need to build the container from scratch, use the --build flag; this is handy when adding new requirements, as shown below.
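For example, after adding a new package to requirements.txt:
docker compose up --build  # Rebuild the image, then start the container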
Detached. If you are only using python scripts and don't need to access the jupyter server, you can run the container with the -d flag. This lets you close the terminal without stopping the container.
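For example:
docker compose up -d         # Start the container in the background
docker compose logs jupyter  # Check the logs (and the notebook token) later if needed
docker compose down          # Stop and remove the container when you're done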