Configure NVLink
Use NVLink to accelerate your multi-GPU training workloads.
Overview
Follow these steps to enable NVLink on your A100-80Gx8 Ubuntu machines. NVLink is not available on Windows or CentOS 7.
Script-based Setup
Download and run our helper script.
wget http://softupdate.paperspace.io/configure-nvlink.sh -O - | bash
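If you prefer to inspect the script before executing it, the same step can be run in two stages (same URL as above):

```shell
# Download the helper script to a file instead of piping it to bash,
# review it, then run it explicitly.
wget http://softupdate.paperspace.io/configure-nvlink.sh -O configure-nvlink.sh
less configure-nvlink.sh    # review before running
bash configure-nvlink.sh
```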
Restart your machine.
After reboot, you should see your GPUs connected with NVLink when running nvidia-smi topo -m:
paperspace@pspe7ld5x:~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 0-95 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 0-95 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 0-95 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 0-95 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 0-95 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
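As an additional check, nvidia-smi can report the state and speed of each individual NVLink lane:

```shell
# Show per-link NVLink status for every GPU. On this A100-80Gx8 topology
# (NV12 in the matrix above), each GPU should report 12 active links.
nvidia-smi nvlink --status
```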
Step by Step Setup
Update your drivers to the latest version (515 at the time of writing).
sudo apt update
sudo apt install nvidia-driver-515
Install NVIDIA Fabric Manager. The installed version must match your driver version.
sudo apt install cuda-drivers-fabricmanager-515
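Since a driver/Fabric Manager version mismatch will prevent NVLink from coming up, it can be worth checking that the two branch numbers agree. A minimal sketch of that check, with an illustrative version string (on a live machine you would take it from nvidia-smi --query-gpu=driver_version --format=csv,noheader):

```shell
# Sketch: confirm the Fabric Manager package branch matches the driver
# branch. The version string below is illustrative, not queried live.
driver_version="515.65.01"
fm_package="cuda-drivers-fabricmanager-515"

driver_branch="${driver_version%%.*}"   # "515"
fm_branch="${fm_package##*-}"           # "515"

if [ "$driver_branch" = "$fm_branch" ]; then
  echo "match: $driver_branch"
else
  echo "mismatch: driver $driver_branch vs fabricmanager $fm_branch" >&2
fi
```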
Enable persistence mode for your GPUs.
sudo nvidia-smi -pm 1
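To confirm the setting took effect, you can query the persistence flag directly; it should report Enabled for each GPU:

```shell
# Quick check: list each GPU's index and persistence-mode state.
nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader
```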
At this point your nvidia-smi output should look like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:05.0 Off | 0 |
| N/A 32C P0 54W / 400W | 136MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... On | 00000000:00:06.0 Off | 0 |
| N/A 29C P0 53W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 52W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... On | 00000000:00:08.0 Off | 0 |
| N/A 31C P0 50W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A100-SXM... On | 00000000:00:09.0 Off | 0 |
| N/A 31C P0 53W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A100-SXM... On | 00000000:00:0A.0 Off | 0 |
| N/A 28C P0 53W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A100-SXM... On | 00000000:00:0B.0 Off | 0 |
| N/A 29C P0 53W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A100-SXM... On | 00000000:00:0C.0 Off | 0 |
| N/A 31C P0 52W / 400W | 4MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
And looking at nvidia-smi topo -m, you'll see everything is still connected with PHB:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X PHB PHB PHB PHB PHB PHB PHB 0-95 N/A
GPU1 PHB X PHB PHB PHB PHB PHB PHB 0-95 N/A
GPU2 PHB PHB X PHB PHB PHB PHB PHB 0-95 N/A
GPU3 PHB PHB PHB X PHB PHB PHB PHB 0-95 N/A
GPU4 PHB PHB PHB PHB X PHB PHB PHB 0-95 N/A
GPU5 PHB PHB PHB PHB PHB X PHB PHB 0-95 N/A
GPU6 PHB PHB PHB PHB PHB PHB X PHB 0-95 N/A
GPU7 PHB PHB PHB PHB PHB PHB PHB X 0-95 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Enable the nvidia-persistenced and nvidia-fabricmanager daemons to start on boot:
sudo systemctl enable nvidia-persistenced
sudo systemctl enable nvidia-fabricmanager
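You can verify the unit state with systemctl. Note that is-active will only report active once the services have actually started, i.e. after the reboot below:

```shell
# Confirm both daemons are registered to start on boot...
systemctl is-enabled nvidia-persistenced nvidia-fabricmanager
# ...and, after reboot, that they are running.
systemctl is-active nvidia-persistenced nvidia-fabricmanager
```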
Restart your machine.
After reboot, you should see your GPUs connected with NVLink when running nvidia-smi topo -m:
paperspace@pspe7ld5x:~$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 0-95 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 0-95 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 0-95 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 0-95 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 0-95 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X 0-95 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
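As a final sanity check, you can measure peer-to-peer bandwidth between GPUs, which should now reflect NVLink rather than PCIe speeds. One way is the p2pBandwidthLatencyTest from NVIDIA's CUDA samples; this sketch assumes the CUDA toolkit is installed and that the sample lives somewhere in the cuda-samples repository (its exact path varies between releases, hence the find):

```shell
# Build and run the peer-to-peer bandwidth/latency sample. Assumes git,
# make, and a CUDA toolkit matching the installed driver are available.
git clone --depth 1 https://github.com/NVIDIA/cuda-samples.git
cd "$(find cuda-samples -type d -name p2pBandwidthLatencyTest | head -1)"
make
./p2pBandwidthLatencyTest
```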