No-Nonsense TensorFlow GPU Setup

1. Bottom Line Up Front

You're trying to train a model, you have a large dataset, and you want to use your GPU. But you've spent several hours researching, reading blog posts, and being overwhelmed by documentation. tl;dr: go to the template repo here https://github.com/S010MON/tensorflow-gpu and follow the steps in the README.md.

Now that those who are in a rush to just get their model training are satisfied, read on for an explanation of the repository.

2. Overview

Training on a GPU is more difficult than simply running your code on a single thread on your machine, because of the increased complexity of IO (you need to interface with a GPU) and parallelism (your work needs to be chunked, processed, and reduced). This guide is not a one-stop shop; it covers only the most common setup I have encountered: a Debian-based Linux host machine running Python code for TensorFlow.

Yes, you can run this on Windows with WSL, and I'm sure there are overlaps with macOS, but both are outside the scope of this guide.

2.1 Hardware Requirements

You will need a compatible GPU. You can check the CUDA version required for the Python/TensorFlow version you are using here: https://www.tensorflow.org/install/source#gpu
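
If you are not sure which NVIDIA GPU your machine has, you can check before installing anything else (a standard lspci query; the -i flag just makes the match case-insensitive):

lspci | grep -i nvidia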

2.2 Software Requirements

There are a ton of software requirements for running TensorFlow on a GPU. The number of device drivers, dependencies, and configurations needed to set everything up correctly is just too much to handle by hand, so TensorFlow and NVIDIA helpfully containerise and publish their products through Docker to avoid the hassle. This leaves us with three software dependencies:

  1. NVIDIA Drivers. Not all Linux distros come with them, so you need to ensure you have them installed.
  2. Docker. For running and managing containers.
  3. The NVIDIA Container Toolkit. Software that enables Docker containers to interface with your GPU hardware.

3. NVIDIA Drivers

I personally find Pop!_OS to have the best driver support for NVIDIA GPUs; they even provide a disk image with the drivers pre-installed. Although there are hundreds of guides out there for installing NVIDIA drivers on both Pop!_OS and Ubuntu, I recommend trying your distribution's official guide first.

To test that the drivers are working, run the nvidia-smi command, which should give you output with information on your hardware:

        
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:2B:00.0  On |                  N/A |
|  0%   44C    P8              16W / 170W |    909MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2986      G   /usr/lib/xorg/Xorg                          280MiB |
|    0   N/A  N/A      3099      G   /usr/bin/gnome-shell                         56MiB |
|    0   N/A  N/A      6074      G   firefox                                     558MiB |
|    0   N/A  N/A    185656      G   ...ures=SpareRendererForSitePerProcess        2MiB |
+---------------------------------------------------------------------------------------+
      

4. Docker

If you are already using docker and docker-compose, you can skip this step; otherwise follow the instructions from the official Docker documentation here: https://docs.docker.com/engine/install/ubuntu/. I strongly recommend that you don't use the APT install method without adding the official GPG key from the Docker website, as your distribution's repository version can often be quite out of date.

After installation, add your user to the docker group so you don't have to put sudo before all your commands. I have omitted sudo from this guide, so if you keep getting denied permissions when copying commands, that's likely why. https://docs.docker.com/engine/install/linux-postinstall/
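
The post-install steps amount to the following commands, taken from the Docker documentation linked above (log out and back in if the group change doesn't take effect):

sudo groupadd docker             # Create the docker group (it may already exist)
sudo usermod -aG docker $USER    # Add your user to the group
newgrp docker                    # Activate the new group in the current shell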

5. NVIDIA Container Toolkit

The Container Toolkit allows Docker containers to interface with the GPU. To install it, follow the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html.
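
After installing the package, the same guide has you register the NVIDIA runtime with Docker and restart the daemon:

sudo nvidia-ctk runtime configure --runtime=docker   # Writes the nvidia runtime into /etc/docker/daemon.json
sudo systemctl restart docker                        # Restart Docker to pick up the change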

To test the setup, run nvidia-smi in a disposable Ubuntu container to check that you have access to the GPU from within a container:

        
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
      

Source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
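
To go one step further, you can check that TensorFlow itself can see the GPU from inside a container. This uses the same image tag as the Dockerfile later in this guide; any of the -gpu tags should work:

docker run --rm --gpus all tensorflow/tensorflow:2.12.0-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"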

6. Setting Up Your Project

Now that you have all the requirements, make a copy of the template repository here: https://github.com/S010MON/tensorflow-gpu. The project is structured as follows:

        
.
├── docker-compose.yml
├── Dockerfile
├── README.md
├── requirements.txt
└── src

      

I'll run through each file, highlighting what it does and where there are options you might want to change.

6.1 Dockerfile

The Dockerfile starts by pulling a container image from TensorFlow's repository and updating it. There is the option to use the :latest-gpu-jupyter tag; just be mindful that this will always pull the latest version and might break code you have written for a specific version.

        
# Update tensorflow version if required from here: https://hub.docker.com/r/tensorflow/tensorflow/tags
# or use :latest-gpu-jupyter
FROM tensorflow/tensorflow:2.12.0-gpu-jupyter
RUN apt-get update -y                           # Update container
RUN apt-get install ffmpeg libsm6 libxext6  -y  # Required for CV2

WORKDIR /tf/notebooks/                          # Set working dir to default notebook dir

COPY requirements.txt .                         # Copy and install requirements
RUN python3 -m pip install --upgrade pip
RUN pip3 install --no-cache-dir --upgrade -r ./requirements.txt

      

Line 5 is only required if you are using the CV2 library to do image processing; otherwise, just remove it and enjoy a faster startup time. Next we set the working directory and copy our requirements over. Note that we don't actually copy over our source code; it is accessed inside the container by mounting a volume, more on that later.

6.2 Requirements

The requirements.txt file lists all the dependencies that you need for your code. I have added common project dependencies, but feel free to change them to suit your needs. Note that these dependencies are installed when you build the container and cached when you run it, so if you add a new dependency to the requirements, you will get an error until you stop the container and restart it with the rebuild flag set. See section 7 on how to do that.

        
Jupyter
numpy
tensorflow==2.12.0
keras==2.12.0
Pillow==9.3
scikit-learn
matplotlib
scipy
opencv-python==4.7.0.72
pandas==2.0.1
      

If you want to share your project with others, or are thinking of publishing your work and code alongside, it is best practice to include a version number or minimum version, e.g. >=1.2.3. The number of times I have found code from a published paper that does almost exactly what I want, but can't be run because I keep getting deprecation errors, is far higher than it should be. Be a good researcher, version your code!
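
For illustration, both pinning styles in requirements.txt syntax (package names and versions here are only examples):

numpy==1.23.5        # Exact pin: most reproducible
scikit-learn>=1.2    # Minimum version: more flexible, less reproducible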

6.3 Docker-compose

Docker Compose is a utility for building sets of composed containers that can communicate with each other. We are using it here to set up the GPU communication, and it also reduces the number of flags and environment variables we need to type out when we run the container. The container name can be changed, which can be helpful if you are running more than one container for different projects.
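
For orientation, the GPU reaches the container through a device reservation in the file's deploy section. The sketch below shows the general shape only; consult the template's docker-compose.yml for the authoritative version:

services:
  tensorflow-gpu:
    build: .                        # Build from the Dockerfile in this directory
    container_name: tensorflow-gpu  # Rename per project if you run several
    ports:
      - "8888:8888"                 # Expose the Jupyter server
    volumes:
      - "./:/tf/notebooks"          # Mount the repository root into the container
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia        # Hand all NVIDIA GPUs to the container
              count: all
              capabilities: [gpu]

The other point to note is the volume: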

        
volumes:
- "./:/tf/notebooks"
      

This mounts the repository root (where the docker-compose.yml file is) at the file path specified within the container. So, when you update your code or add more data, you are updating the same files as the ones seen in your container. This means you can work on your code as normal in an IDE and run it in your container without restarting it or copying files across.

        
- "/your/file/path/:/tf/notebooks/data/"
      

If you have a huge dataset that you don't want to keep in your repository (perhaps it's on another hard drive), then you can add more volumes as required using the same host:container format.

7. Running the code

7.1 Jupyter Notebooks

To start up a Jupyter server, simply run docker-compose up and you will be presented with tokenised URLs for accessing the notebooks.

        
tensorflow-gpu  | [I 11:53:33.697 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
tensorflow-gpu  | [I 11:53:33.822 NotebookApp] Serving notebooks from local directory: /tf
tensorflow-gpu  | [I 11:53:33.822 NotebookApp] Jupyter Notebook 6.5.3 is running at:
tensorflow-gpu  | [I 11:53:33.822 NotebookApp] http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu  | [I 11:53:33.822 NotebookApp]  or http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu  | [I 11:53:33.822 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
tensorflow-gpu  | [C 11:53:33.824 NotebookApp]
tensorflow-gpu  |
tensorflow-gpu  |     To access the notebook, open this file in a browser:
tensorflow-gpu  |         file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
tensorflow-gpu  |     Or copy and paste one of these URLs:
tensorflow-gpu  |         http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu  |      or http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
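
Once the server is up, open a notebook and confirm that TensorFlow can actually see the GPU before starting a long training run (a minimal check using TensorFlow's public API):

import tensorflow as tf

print("TensorFlow:", tf.__version__)
# An empty list here means the container cannot see the GPU
print("GPUs:", tf.config.list_physical_devices('GPU'))

If the list is empty, revisit sections 3 to 5; the container can only see what the host's driver and toolkit expose.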

      

7.2 Python Scripts

Although notebooks can be useful for rapid prototyping, sometimes you just can't do without a script. To run them you'll need a terminal inside the container, which you can attach to using the exec command. First list your containers using docker ps -a and find the name of the container you want to run code in. Then use the docker exec -it [CONTAINER_NAME] bash command. Here the -i and -t flags together give you an interactive terminal, and bash selects the shell you want to use.

        
$ docker exec -it tensorflow-gpu bash

________                               _______________
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ /
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@4973a39b65ae:/tf/notebooks# ls
Dockerfile  README.md  docker-compose.yml  requirements.txt  src
      

You will be greeted with the TensorFlow logo and can run Python scripts from within the src directory.
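
For example, with a hypothetical script at src/train.py (the file name is illustrative; place your own scripts in src):

root@4973a39b65ae:/tf/notebooks# python3 src/train.py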

7.3 Managing Containers

Rebuilding. docker compose up reuses the image from its last build, so changes to the Dockerfile or requirements.txt are not picked up automatically. When you need to build the container from scratch, use the --build flag; this is handy when adding new requirements.
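
The full rebuild cycle looks like this:

docker compose down           # Stop and remove the running container
docker compose up --build     # Rebuild the image and start it again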

Detached. If you are only using Python scripts and don't need to access the Jupyter server, you can run the container with the -d flag. This lets you close the terminal without stopping the container.
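
For completeness, a typical detached workflow:

docker compose up -d      # Start the container in the background
docker compose logs -f    # Tail the logs, e.g. to grab the Jupyter token
docker compose down       # Stop and remove the container when you're done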