No-Nonsense TensorFlow GPU Setup
Updated: 13 Mar 2024
1. Bottom Line Up Front
You're trying to train a model, you have a large dataset, and you want to use your GPU. But you've spent several
hours researching, reading blog posts, and being overwhelmed by documentation. tl;dr: go to the
template repo here https://github.com/S010MON/tensorflow-gpu
and follow the steps in the README.md.
Now that those in a rush to get their model training are satisfied, read on for an explanation of the repository.
2. Overview
Training on a GPU is more difficult than simply running your code on a single thread on your machine because of the added complexity of IO (you need to interface with a GPU) and parallelism (your work needs to be chunked, processed, and reduced). This guide is not a one-stop shop; it covers only the most common setup I have encountered: a Debian-based Linux host machine running Python code for TensorFlow.
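To make the chunk/process/reduce idea concrete, here is a toy sketch in plain Python. This is purely illustrative of the shape of data-parallel work; TensorFlow does the real thing on the GPU for you, and the function names here are my own.

```python
def chunk(data, n):
    """Split data into n roughly equal chunks."""
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

def process(part):
    """The per-chunk work, e.g. a partial sum of squares."""
    return sum(x * x for x in part)

def reduce_parts(partials):
    """Combine the partial results into one answer."""
    return sum(partials)

data = list(range(10))
result = reduce_parts(process(c) for c in chunk(data, 4))
print(result)  # same as computing sum(x*x for x in range(10)) in one go: 285
```

On a GPU the "process" step runs on thousands of cores at once, which is where the speed-up comes from.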
Yes, you can run this on Windows with WSL, and I'm sure there are overlaps with macOS, but both are outside the scope of this guide.
2.1 Hardware Requirements
You will need a compatible GPU. You can check the CUDA version required for the Python/TensorFlow version you are using here: https://www.tensorflow.org/install/source#gpu
2.2 Software Requirements
Running TensorFlow on a GPU has a ton of software requirements. The number of device drivers, dependencies, and configurations needed to get set up correctly is just too much to handle by hand, so TensorFlow and NVIDIA helpfully containerise and publish their products through Docker to avoid the hassle. This leaves us with three software dependencies:
- NVIDIA drivers. Not all Linux distros ship with them, so you need to ensure they are installed
- Docker. For running and managing containers
- The NVIDIA Container Toolkit. Software that enables Docker containers to interface with your GPU hardware
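A quick way to see which of the three you already have is to check whether each command is on your PATH. This is a sketch assuming the standard command names: nvidia-smi ships with the drivers, and nvidia-ctk is the CLI installed by the NVIDIA Container Toolkit.

```shell
# Report whether each dependency's CLI is installed, without failing if one is missing
check_dep() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: missing"
  fi
}

for cmd in nvidia-smi docker nvidia-ctk; do
  check_dep "$cmd"
done
```

Anything reported missing is covered by one of the three sections below.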
3. NVIDIA Drivers
I personally find Pop!_OS to have the best driver support for NVIDIA GPUs; they even provide a disk image with the drivers pre-installed. Although there are hundreds of guides out there for installing NVIDIA drivers on both Pop!_OS and Ubuntu, I recommend you first try the official guides listed below.
- Ubuntu https://ubuntu.com/server/docs/nvidia-drivers-installation
- Pop!_OS https://support.system76.com/articles/system76-driver/#installing-the-system76-nvidia-driver-for-systems-with-nvidia-gpus
To test that the GPU is working, run the nvidia-smi command, which should give you
output with information on your hardware.
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02 Driver Version: 545.29.02 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | 00000000:2B:00.0 On | N/A |
| 0% 44C P8 16W / 170W | 909MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 2986 G /usr/lib/xorg/Xorg 280MiB |
| 0 N/A N/A 3099 G /usr/bin/gnome-shell 56MiB |
| 0 N/A N/A 6074 G firefox 558MiB |
| 0 N/A N/A 185656 G ...ures=SpareRendererForSitePerProcess 2MiB |
+---------------------------------------------------------------------------------------+
4. Docker
If you are already using Docker and docker-compose, you can skip this step; otherwise follow the instructions from the official Docker documentation here: https://docs.docker.com/engine/install/ubuntu/. I strongly recommend you don't use the APT install method without first adding the official GPG key from the Docker website, as the repository version can often be quite out of date.
After installation, add your user to the docker group so you don't have to prefix all your
commands with sudo. I have omitted sudo from this guide, so if you keep getting permission errors when copying
commands, that's likely why.
https://docs.docker.com/engine/install/linux-postinstall/
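The post-install steps boil down to a few commands (taken from the Docker post-install guide linked above; group membership only takes full effect after you log out and back in):

```shell
sudo groupadd docker            # create the docker group if it doesn't already exist
sudo usermod -aG docker $USER   # add your user to the docker group
newgrp docker                   # pick up the new group without logging out
docker run hello-world          # verify docker now works without sudo
```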
5. NVIDIA Container Toolkit
The Container Toolkit allows Docker containers to interface with the GPU. To install it, follow the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html.
To test the setup, run nvidia-smi inside a container to check that you have access to the GPU:
docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Source: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
6. Setting Up Your Project
Now that you have all the requirements, make a copy of the template repository here: https://github.com/S010MON/tensorflow-gpu. The project is structured as follows:
.
├── docker-compose.yml
├── Dockerfile
├── README.md
├── requirements.txt
└── src
I'll run through each file, highlighting what it does and where there are options you might want to change.
6.1 Dockerfile
The Dockerfile's first line pulls a container image from TensorFlow's repository, and the next two update it. There is
the option to use the :latest-gpu-jupyter tag instead. Just be mindful that this will always pull the
latest version and might break code you have written against a specific version.
# Update tensorflow version if required from here: https://hub.docker.com/r/tensorflow/tensorflow/tags
# or use :latest-gpu-jupyter
FROM tensorflow/tensorflow:2.12.0-gpu-jupyter
RUN apt-get update -y # Update container
RUN apt-get install ffmpeg libsm6 libxext6 -y # Required for CV2
WORKDIR /tf/notebooks/ # Set working dir to default notebook dir
COPY requirements.txt . # Copy and install requirements
RUN python3 -m pip install --upgrade pip
RUN pip3 install --no-cache-dir --upgrade -r ./requirements.txt
Line 5 is only required if you are using the OpenCV (cv2) library to do image processing; otherwise, just remove it and enjoy a faster build. Next we set the working directory and copy over our requirements. Note that we don't actually copy over our source code: it is accessed inside the container by mounting a volume, more on that later.
6.2 Requirements
The requirements.txt file lists all the dependencies that your code needs. I have added
common project dependencies, but feel free to change them to suit your needs. Note that these
dependencies are installed when you build the container image, not each time you run it. So if
you add a new dependency to the requirements, you will get an import error until you stop the container and restart it
with the rebuild flag set. See section 7 for how to do that.
Jupyter
numpy
tensorflow==2.12.0
keras==2.12.0
Pillow==9.3
scikit-learn
matplotlib
scipy
opencv-python==4.7.0.72
pandas==2.0.1
If you want to share your project with others, or are thinking of publishing your code alongside your work, it
is best practice to include a version number or minimum version, i.e. >=1.2.3. The number of times I
have found code from a published paper that does almost exactly what I want, but can't be run because I keep
getting deprecation errors, is far higher than it should be. Be a good researcher, version your code!
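If you are not sure what versions you are actually running, a small stdlib-only helper can report them so you can copy the pins into requirements.txt. This is my own sketch (Python 3.8+); the package names in the loop are just examples.

```python
from importlib import metadata

def installed_version(package: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"

# Print pin-ready lines for the packages you care about
for pkg in ["numpy", "tensorflow", "opencv-python"]:
    print(f"{pkg}=={installed_version(pkg)}")
```

Alternatively, pip freeze dumps the exact versions of everything installed in the environment.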
6.3 Docker-compose
Docker Compose is a utility for building sets of containers that can communicate with each other. We are using it here to set up the GPU access, and it also reduces the number of environment variables we need to type out when we run the container. The container name can be changed, which can be helpful if you are running more than one of these containers for different projects.
version: "3.1"
services:
  jupyter:
    container_name: tensorflow-gpu
    build: .
    ports:
      - "8888:8888"
    volumes:
      - "./:/tf/notebooks"
      # Only include the line below if you have data stored somewhere else on your
      # machine that you want to access (like a different drive). This is handy
      # if you have a lot of data or a network share.
      # - "/your/file/path/:/tf/notebooks/data/"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
The other point to note is the volume mapping:
volumes:
- "./:/tf/notebooks"
This maps the repository root (where the docker-compose.yml file is) to the
file path specified within the container. So, when you update your code or add more data, you are updating
the same files as the ones seen in your container. This means you can work on your code as normal in an IDE and
run it in your container without restarting it to make a copy.
- "/your/file/path/:/tf/notebooks/data/"
If you have a huge dataset that you don't want to keep in your repository (perhaps it's on another hard drive), you can add more volumes as required using the host:container format.
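For example, with a dataset living at /mnt/datasets/ on the host (a hypothetical path, substitute your own), the volumes section would become:

```yaml
volumes:
  - "./:/tf/notebooks"
  - "/mnt/datasets/:/tf/notebooks/data/"
```

Inside the container the dataset then appears under /tf/notebooks/data/.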
7. Running the code
7.1 Jupyter Notebooks
To start up a Jupyter server, simply run docker compose up and you will be presented with tokens for
accessing the notebooks. Note: on older versions of docker-compose the command is docker-compose up
tensorflow-gpu | [I 11:53:33.697 NotebookApp] Writing notebook server cookie secret to
/root/.local/share/jupyter/runtime/notebook_cookie_secret
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Serving notebooks from local directory: /tf
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Jupyter Notebook 6.5.3 is running at:
tensorflow-gpu | [I 11:53:33.822 NotebookApp]
http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | [I 11:53:33.822 NotebookApp] or
http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | [I 11:53:33.822 NotebookApp] Use Control-C to stop this server and shut down all kernels
(twice to skip confirmation).
tensorflow-gpu | [C 11:53:33.824 NotebookApp]
tensorflow-gpu |
tensorflow-gpu | To access the notebook, open this file in a browser:
tensorflow-gpu | file:///root/.local/share/jupyter/runtime/nbserver-1-open.html
tensorflow-gpu | Or copy and paste one of these URLs:
tensorflow-gpu | http://4973a39b65ae:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
tensorflow-gpu | or http://127.0.0.1:8888/?token=8d4bba37dab29f0dc930937b3714f01bb78e6fc788370d36
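A good first cell to run in a notebook is a GPU visibility check. The list_physical_devices call is TensorFlow's standard API for this; the wrapper function is my own, added so the snippet degrades gracefully if you accidentally run it outside the container where TensorFlow isn't installed.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return None  # TensorFlow not installed in this environment
    return [d.name for d in tf.config.list_physical_devices("GPU")]

gpus = visible_gpus()
if gpus is None:
    print("TensorFlow is not installed here - run this inside the container")
elif gpus:
    print("GPU(s) visible:", gpus)
else:
    print("No GPU visible - check the NVIDIA Container Toolkit setup")
```

If this prints an empty list inside the container, retrace sections 3 to 5 before blaming your model code.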
7.2 Python Scripts
Although notebooks can be useful for rapid prototyping, sometimes you just can't do without a script. To run
one you'll need a terminal inside the container, which you can attach to using the exec command.
First list your running containers using docker ps -a and find the name of the container you
want to run code in. Then use the docker exec -it [CONTAINER_NAME] bash command. Here the
-it flags mean interactive with a terminal attached, and bash selects the type of shell you
want to use.
$ docker exec -it tensorflow-gpu bash
________ _______________
___ __/__________________________________ ____/__ /________ __
__ / _ _ \_ __ \_ ___/ __ \_ ___/_ /_ __ /_ __ \_ | /| / /
_ / / __/ / / /(__ )/ /_/ / / _ __/ _ / / /_/ /_ |/ |/ /
/_/ \___//_/ /_//____/ \____//_/ /_/ /_/ \____/____/|__/
WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.
To avoid this, run the container by specifying your user's userid:
$ docker run -u $(id -u):$(id -g) args...
root@4973a39b65ae:/tf/notebooks# ls
Dockerfile README.md docker-compose.yml requirements.txt src
You will be greeted with the TensorFlow logo and can run Python scripts from within the src
directory.
7.3 Managing Containers
Rebuilding. docker compose up reuses the image cached from the last build. When you need to rebuild the container from scratch, use the --build flag;
this is handy after adding new requirements.
Detached. If you are only running Python scripts and don't need to watch the Jupyter server output,
you can start the container with the -d flag. This lets you close the terminal without stopping the
container.
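For reference, the handful of compose commands this boils down to (these need the Docker daemon running; docker compose logs is an extra convenience not covered above):

```shell
docker compose up --build   # rebuild the image, e.g. after editing requirements.txt
docker compose up -d        # start detached: survives closing the terminal
docker compose logs -f      # follow the output of a detached container
docker compose down         # stop and remove the container
```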