
Configure NVLink

Use NVLink to accelerate your training workloads. Learn more about NVLink here.

Overview

Follow these steps to enable NVLink on your A100-80Gx8 Ubuntu machines. NVLink is not available on Windows or CentOS 7.

Script-based Setup

Download and run our helper script.

wget http://softupdate.paperspace.io/configure-nvlink.sh -O - | bash
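If you would rather read the script before executing it, one option is to save it to a file first instead of piping it straight into bash. This is a minimal sketch, not part of the official procedure: the function name fetch_nvlink_script and the default destination filename are illustrative assumptions, while the URL is the one given above.

```shell
# Download the helper script to a file instead of piping it into bash,
# so it can be reviewed first. The destination filename is arbitrary.
fetch_nvlink_script() {
  dest="${1:-configure-nvlink.sh}"
  wget http://softupdate.paperspace.io/configure-nvlink.sh -O "$dest" || return 1
  [ -s "$dest" ]   # fail if the download produced an empty file
}

# Usage:
#   fetch_nvlink_script && less configure-nvlink.sh && bash configure-nvlink.sh
```

Running the saved file with bash should be equivalent to the one-line pipe above, with a review step in between.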

Restart your machine.

After reboot you should see your GPUs connected with NVLink when running nvidia-smi topo -m:

paperspace@pspe7ld5x:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0    X       NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU1    NV12    X       NV12    NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU2    NV12    NV12    X       NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU3    NV12    NV12    NV12    X       NV12    NV12    NV12    NV12    0-95            N/A
GPU4    NV12    NV12    NV12    NV12    X       NV12    NV12    NV12    0-95            N/A
GPU5    NV12    NV12    NV12    NV12    NV12    X       NV12    NV12    0-95            N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12    X       NV12    0-95            N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12    X       0-95            N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
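To check the topology from a script rather than by eye, the matrix can be parsed for NV# entries. The sketch below assumes the layout shown above (one row per GPU, with X marking the GPU itself); the function name check_nvlink_topology is hypothetical.

```shell
# Succeeds when every GPU row read on stdin contains at least one NV# entry.
# Input is expected to be the text printed by `nvidia-smi topo -m`.
check_nvlink_topology() {
  awk '
    $1 ~ /^GPU[0-9]+$/ {
      self = 0; nv = 0
      for (i = 2; i <= NF; i++) {
        if ($i == "X") self = 1             # only matrix rows contain the self marker
        if ($i ~ /^NV[0-9]+$/) nv = 1       # NV12, NV4, ... all count as NVLink
      }
      if (self) { rows++; if (nv) linked++ }
    }
    END { exit !(rows > 0 && rows == linked) }
  '
}

# Usage:
#   nvidia-smi topo -m | check_nvlink_topology && echo "NVLink active"
```

A matrix that shows only PHB, PIX, or similar PCIe entries makes the function return a non-zero status.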

Step-by-Step Setup

Update your drivers to the latest version (515 at the time of writing).

sudo apt update
sudo apt install nvidia-driver-515

Install NVIDIA Fabric Manager (learn more here). The version installed must match your driver version.

sudo apt install cuda-drivers-fabricmanager-515
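Because the Fabric Manager version has to match the driver version, it can help to compare the two after installing. The following is a rough sketch: versions_match is a hypothetical helper, and stripping a trailing "-1" style revision from the package version is an assumption about how the apt package is versioned on your machine.

```shell
# Succeeds when the Fabric Manager package version matches the driver version.
# $1: driver version, e.g. from
#     nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1
# $2: package version, e.g. from
#     dpkg-query -W -f='${Version}' cuda-drivers-fabricmanager-515
versions_match() {
  driver="$1"
  fabric="${2%%-*}"   # assumption: drop a Debian revision suffix such as "-1"
  [ -n "$driver" ] && [ "$driver" = "$fabric" ]
}

# Usage:
#   versions_match "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" \
#                  "$(dpkg-query -W -f='${Version}' cuda-drivers-fabricmanager-515)" \
#     || echo "driver and Fabric Manager versions differ"
```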

Enable persistence mode for your GPUs.
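The command for this step is not shown on this page. One common way to do it, offered here as an assumption rather than the documented procedure, is the -pm (persistence mode) flag of nvidia-smi. The wrapper name enable_persistence_mode is illustrative.

```shell
# Turn on persistence mode for all GPUs (assumed command; the page omits it).
enable_persistence_mode() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; install the driver first" >&2
    return 1
  fi
  sudo nvidia-smi -pm 1   # -pm is short for --persistence-mode
}

# Usage:
#   enable_persistence_mode
```

A setting applied this way may not survive a reboot on its own, which is likely why the nvidia-persistenced daemon is enabled in a later step.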

At this point your nvidia-smi output should look like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:00:05.0 Off |                    0 |
| N/A   32C    P0    54W / 400W |    136MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:00:06.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:00:07.0 Off |                    0 |
| N/A   28C    P0    52W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   31C    P0    50W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:00:09.0 Off |                    0 |
| N/A   31C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:00:0A.0 Off |                    0 |
| N/A   28C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:00:0B.0 Off |                    0 |
| N/A   29C    P0    53W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:00:0C.0 Off |                    0 |
| N/A   31C    P0    52W / 400W |      4MiB / 81920MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
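The Persistence-M column in the table above can also be read in a script-friendly form through the query interface of nvidia-smi. A minimal sketch, assuming the CSV output has one "index, state" line per GPU; the function name all_persistence_enabled is hypothetical.

```shell
# Reads lines like "0, Enabled" on stdin, as printed by
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
# and fails if any GPU reports a value other than Enabled.
all_persistence_enabled() {
  awk -F', *' '
    NF >= 2 { seen++; if ($2 != "Enabled") bad++ }
    END { exit !(seen > 0 && bad == 0) }
  '
}

# Usage:
#   nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader \
#     | all_persistence_enabled || echo "persistence mode is off on some GPU"
```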

Running nvidia-smi topo -m at this stage shows every GPU pair connected through PHB (PCIe via the host bridge), which means NVLink is not active yet:

        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0    X       PHB     PHB     PHB     PHB     PHB     PHB     PHB     0-95            N/A
GPU1    PHB     X       PHB     PHB     PHB     PHB     PHB     PHB     0-95            N/A
GPU2    PHB     PHB     X       PHB     PHB     PHB     PHB     PHB     0-95            N/A
GPU3    PHB     PHB     PHB     X       PHB     PHB     PHB     PHB     0-95            N/A
GPU4    PHB     PHB     PHB     PHB     X       PHB     PHB     PHB     0-95            N/A
GPU5    PHB     PHB     PHB     PHB     PHB     X       PHB     PHB     0-95            N/A
GPU6    PHB     PHB     PHB     PHB     PHB     PHB     X       PHB     0-95            N/A
GPU7    PHB     PHB     PHB     PHB     PHB     PHB     PHB     X       0-95            N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

Enable the nvidia-persistenced and nvidia-fabricmanager daemons to start on boot:

sudo systemctl enable nvidia-persistenced
sudo systemctl enable nvidia-fabricmanager
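Before rebooting, you may want to confirm that both units are actually registered to start on boot. The sketch below relies on systemctl is-enabled and treats any state other than "enabled" as a failure, which is a simplifying assumption; the function name nvlink_services_enabled is illustrative.

```shell
# Succeeds only if both daemons are enabled at boot.
nvlink_services_enabled() {
  for unit in nvidia-persistenced nvidia-fabricmanager; do
    state=$(systemctl is-enabled "$unit" 2>/dev/null)
    if [ "$state" != "enabled" ]; then
      echo "$unit is not enabled (state: ${state:-unknown})" >&2
      return 1
    fi
  done
}

# Usage:
#   nvlink_services_enabled && echo "both services will start on boot"
```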

Restart your machine.

After reboot you should see your GPUs connected with NVLink when running nvidia-smi topo -m:

paperspace@pspe7ld5x:~$ nvidia-smi topo -m
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    CPU Affinity    NUMA Affinity
GPU0    X       NV12    NV12    NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU1    NV12    X       NV12    NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU2    NV12    NV12    X       NV12    NV12    NV12    NV12    NV12    0-95            N/A
GPU3    NV12    NV12    NV12    X       NV12    NV12    NV12    NV12    0-95            N/A
GPU4    NV12    NV12    NV12    NV12    X       NV12    NV12    NV12    0-95            N/A
GPU5    NV12    NV12    NV12    NV12    NV12    X       NV12    NV12    0-95            N/A
GPU6    NV12    NV12    NV12    NV12    NV12    NV12    X       NV12    0-95            N/A
GPU7    NV12    NV12    NV12    NV12    NV12    NV12    NV12    X       0-95            N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks