Setting Up an Ubuntu Deep Learning Environment on a Dell GPU Workstation

Overview

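  • Server configuration:

    • Dell Precision 7920 Tower
    • Dual Intel Xeon Silver 4210R (20 cores, 40 threads in total)
    • Two 32GB DDR4-3200 registered ECC DIMMs
    • One NVIDIA RTX 3090 graphics card
    • 1400W power supply
  • Operating system: ubuntu-18.04.5-server-amd64.iso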

Install Required Software

  • Tools that are used later in this guide
[root@ubuntu:~]# apt-get install apt-transport-https bash-completion ca-certificates curl git htop iotop jq nethogs tree vim
  • Install the build toolchain
[root@ubuntu:~]# apt-get install build-essential dkms

Install the NVIDIA Driver

Reference

NVIDIA CUDA Installation Guide for Linux

Add the Repository Manually

[root@ubuntu:~]# add-apt-repository "deb https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 /"

Add the Repository Public Key

[root@ubuntu:~]# apt-key adv --fetch-keys https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

Update the APT Cache

[root@ubuntu:~]# apt-get update

Install the Driver Packages

  • First check which package versions are available
[root@ubuntu:~]# apt-cache madison cuda-drivers-460
cuda-drivers-460 | 460.73.01-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
cuda-drivers-460 | 460.32.03-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
cuda-drivers-460 | 460.27.04-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
cuda-drivers-460 | 460.27.04-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
cuda-drivers-460 | 460.27.04-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
[root@ubuntu:~]# apt-cache madison nvidia-driver-460
nvidia-driver-460 | 460.73.01-0ubuntu1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
nvidia-driver-460 | 460.73.01-0ubuntu0.18.04.1 | https://opentuna.cn/ubuntu bionic-updates/restricted amd64 Packages
nvidia-driver-460 | 460.73.01-0ubuntu0.18.04.1 | https://opentuna.cn/ubuntu bionic-security/restricted amd64 Packages
nvidia-driver-460 | 460.32.03-0ubuntu1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
nvidia-driver-460 | 460.27.04-0ubuntu1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
nvidia-driver-460 | 460.27.04-0ubuntu1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
nvidia-driver-460 | 460.27.04-0ubuntu1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
[root@ubuntu:~]# apt-cache madison cuda-tools-11-3
cuda-tools-11-3 | 11.3.0-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
cuda-tools-11-3 | 11.3.0-1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
[root@ubuntu:~]# apt-cache madison libcudnn8
libcudnn8 | 8.2.0.53-1+cuda11.3 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.2.0.53-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.2.0.53-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.1.33-1+cuda11.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.1.33-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.1.33-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.0.77-1+cuda11.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.0.77-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.1.0.77-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.0.5.39-1+cuda11.1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.0.5.39-1+cuda11.0 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.0.5.39-1+cuda11.0 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.0.5.39-1+cuda10.2 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
libcudnn8 | 8.0.5.39-1+cuda10.1 | https://developer.download.nvidia.cn/compute/cuda/repos/ubuntu1804/x86_64 Packages
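When the install step is scripted, the version column can be read from the `apt-cache madison` output instead of being copied by hand. The following is only a minimal sketch: the helper name `first_madison_version` is hypothetical, and it assumes the first row is the version you want to pin, which should be checked against the listing before use.

```shell
# Hypothetical helper: read `apt-cache madison` output on stdin and print
# the version column (second '|'-separated field) of the first row.
first_madison_version() {
    awk -F'|' 'NR == 1 { gsub(/[ \t]/, "", $2); print $2 }'
}

# Possible usage (requires the repository added earlier):
# driver_version=$(apt-cache madison cuda-drivers-460 | first_madison_version)
# apt-get install "cuda-drivers-460=${driver_version}"
```

The helper only strips spaces and tabs around the field, so it returns the version string exactly as APT expects it after `package=`.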
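  • Install the pinned driver versions

Note: If Secure Boot is enabled in the BIOS, the driver installation asks you to enroll a key in UEFI and to set a password for it.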

[root@ubuntu:~]# apt-get install cuda-drivers-460=460.73.01-1 nvidia-driver-460=460.73.01-0ubuntu1 cuda-tools-11-3=11.3.0-1 libcudnn8=8.2.0.53-1+cuda11.3

Disable nouveau

[root@ubuntu:~]# cat > /etc/modprobe.d/blacklist-nouveau.conf <<EOF
blacklist nouveau
options nouveau modeset=0
EOF

Regenerate the Kernel initramfs

[root@ubuntu:~]# update-initramfs -u

Reboot the System

  • Reboot so that the DKMS-built NVIDIA driver module is loaded and nouveau is disabled
[root@ubuntu:~]# reboot

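Note: On the first reboot the screen shows "Perform MOK management". Select "Enroll MOK", then "Continue", then "Yes".

When prompted with "Enroll the key(s)?", enter the password that was set during the driver installation, select "OK", and reboot the machine.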
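After the reboot it can be worth confirming that nouveau is really gone and that the nvidia module is loaded. The sketch below is one way to do that by filtering `lsmod` output; the helper name `module_listed` is hypothetical and not part of any package.

```shell
# Hypothetical helper: succeed if the named kernel module appears in
# `lsmod`-style output read from stdin (the module name is the first column).
module_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Possible usage after the reboot:
# lsmod | module_listed nouveau && echo "nouveau is still loaded"
# lsmod | module_listed nvidia  || echo "nvidia module is not loaded"
```

Matching on the whole first column avoids false positives from related modules such as `nvidia_uvm` when checking for `nvidia`.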

Verify the Driver

[root@ubuntu:~]# nvidia-smi
Fri May 14 11:26:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    On   | 00000000:D5:00.0 Off |                  N/A |
|  0%   51C    P8    19W / 350W |      1MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
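The table is meant for reading by eye. For a scripted check, `nvidia-smi` can also print selected fields as CSV through `--query-gpu` and `--format=csv,noheader`. The sketch below compares the reported driver version with the one that was installed; the helper name `driver_version_is` is hypothetical, and it assumes exactly the two fields shown are queried, in that order.

```shell
# Hypothetical helper: read CSV rows of the form "name, driver_version"
# on stdin and succeed only if some row reports the expected driver version.
driver_version_is() {
    awk -F', ' -v want="$1" '$2 == want { ok = 1 } END { exit !ok }'
}

# Possible usage on the GPU host:
# nvidia-smi --query-gpu=name,driver_version --format=csv,noheader \
#     | driver_version_is 460.73.01 || echo "unexpected driver version"
```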

Check the GPU Persistence Service

[root@ubuntu:~]# systemctl status nvidia-persistenced.service
● nvidia-persistenced.service - NVIDIA Persistence Daemon
   Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled; vendor preset: enabled)
   Active: active (running) since Wed 2021-05-12 15:13:29 CST; 1 day 20h ago
 Main PID: 1305 (nvidia-persiste)
    Tasks: 1 (limit: 6143)
   CGroup: /system.slice/nvidia-persistenced.service
           └─1305 /usr/bin/nvidia-persistenced --verbose

May 12 15:13:27 ubuntu systemd[1]: Starting NVIDIA Persistence Daemon...
May 12 15:13:27 ubuntu nvidia-persistenced[1305]: Verbose syslog connection opened
May 12 15:13:27 ubuntu nvidia-persistenced[1305]: Started (1305)
May 12 15:13:27 ubuntu nvidia-persistenced[1305]: device 0000:d5:00.0 - registered
May 12 15:13:29 ubuntu nvidia-persistenced[1305]: device 0000:d5:00.0 - persistence mode enabled.
May 12 15:13:29 ubuntu nvidia-persistenced[1305]: device 0000:d5:00.0 - NUMA memory onlined.
May 12 15:13:29 ubuntu nvidia-persistenced[1305]: Local RPC services initialized
May 12 15:13:29 ubuntu systemd[1]: Started NVIDIA Persistence Daemon.

Verify CUDA

  • Install cuda-samples
[root@ubuntu:~]# apt-get install cuda-samples-11-3
  • Run deviceQuery
[root@ubuntu:~]# cd /usr/local/cuda-11.3/samples/1_Utilities/deviceQuery
[root@ubuntu:~]# make && ./deviceQuery
...
Detected 1 CUDA Capable device(s)

Device 0: "GeForce RTX 3090"
CUDA Driver Version / Runtime Version 11.2 / 11.3
CUDA Capability Major/Minor version number: 8.6
Total amount of global memory: 24260 MBytes (25438322688 bytes)
(082) Multiprocessors, (128) CUDA Cores/MP: 10496 CUDA Cores
GPU Max Clock rate: 1695 MHz (1.70 GHz)
Memory Clock rate: 9751 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 102400 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device supports Managed Memory: Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 213 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.3, NumDevs = 1
Result = PASS
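Note that deviceQuery reports a driver CUDA version of 11.2 and a runtime version of 11.3, and still passes. This appears consistent with the minor-version compatibility NVIDIA documents for CUDA 11, where a runtime from the same major release can run on a slightly older driver. The sketch below only compares the major numbers; the helper name `same_cuda_major` is hypothetical and this is not a complete compatibility check.

```shell
# Hypothetical helper: succeed when two CUDA versions such as "11.2" and
# "11.3" share the same major number (the part before the first dot).
same_cuda_major() {
    [ "${1%%.*}" = "${2%%.*}" ]
}

same_cuda_major 11.2 11.3 && echo "driver 11.2 and runtime 11.3 share major version 11"
```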
  • Run bandwidthTest
[root@ubuntu:~]# cd /usr/local/cuda-11.3/samples/1_Utilities/bandwidthTest
[root@ubuntu:~]# make && ./bandwidthTest
...
[CUDA Bandwidth Test] - Starting...
Running on...

Device 0: GeForce RTX 3090
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 12.3

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 9.0

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(GB/s)
32000000 787.5

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

Install nvidia-docker2

Reference

Installation Guide

Add the Docker CE Repository

[root@ubuntu:~]# curl -fsSL https://mirrors.aliyun.com/docker-ce/linux/ubuntu/gpg | apt-key add -
[root@ubuntu:~]# add-apt-repository "deb [arch=amd64] https://mirrors.aliyun.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"

Install Docker CE

[root@ubuntu:~]# apt-get update
[root@ubuntu:~]# apt-get install docker-ce docker-ce-cli containerd.io

Add the nvidia-docker Repository

[root@ubuntu:~]# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
[root@ubuntu:~]# curl -sSL https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
[root@ubuntu:~]# curl -sSL https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
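The `distribution` variable is built by sourcing `/etc/os-release` and joining `ID` with `VERSION_ID`, so on this Ubuntu 18.04.5 install it should come out as `ubuntu18.04`, which selects the matching repository list. The sketch below wraps the same idea in a function so that it can be tried against any os-release-style file; the helper name `distribution_of` is hypothetical.

```shell
# Hypothetical helper: print ID followed by VERSION_ID from an
# os-release-style file. The file is sourced in a subshell so that
# its variables do not leak into the calling shell.
distribution_of() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Possible usage:
# distribution_of /etc/os-release
```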

Install nvidia-docker2

[root@ubuntu:~]# apt-get update
[root@ubuntu:~]# apt-get install nvidia-docker2

Edit the Docker Configuration File

[root@ubuntu:~]# cat /etc/docker/daemon.json
{
  "data-root": "/var/lib/docker",
  "exec-opts": [
    "native.cgroupdriver=systemd"
  ],
  "insecure-registries": [],
  "log-driver": "json-file",
  "log-opts": {
    "max-file": "3",
    "max-size": "100m"
  },
  "max-concurrent-downloads": 10,
  "registry-mirrors": [
    "https://pqbap4ya.mirror.aliyuncs.com",
    "https://mirror.ccs.tencentyun.com"
  ],
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "storage-driver": "overlay2",
  "storage-opts": [
    "overlay2.override_kernel_check=true"
  ]
}
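A syntax error in `daemon.json` prevents the Docker daemon from starting, so it can help to validate the file before restarting the service. The sketch below uses `jq`, which was installed with the other tools at the start, to check that the file parses and that the `nvidia` runtime entry is present. The helper name `has_nvidia_runtime` is hypothetical.

```shell
# Hypothetical helper: succeed if the file is valid JSON and defines
# runtimes.nvidia.path. `jq -e` exits non-zero when the result is null
# or false, and jq also fails on a parse error.
has_nvidia_runtime() {
    jq -e '.runtimes.nvidia.path' "$1" > /dev/null 2>&1
}

# Possible usage:
# has_nvidia_runtime /etc/docker/daemon.json || echo "nvidia runtime missing or JSON invalid"
```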

Restart the Docker Service

[root@ubuntu:~]# systemctl restart docker.service

Verify nvidia-docker

[root@ubuntu:~]# docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
Unable to find image 'nvidia/cuda:11.0-base' locally
11.0-base: Pulling from nvidia/cuda
54ee1f796a1e: Pull complete
f7bfea53ad12: Pull complete
46d371e02073: Pull complete
b66c17bbf772: Pull complete
3642f1a6dfb3: Pull complete
e5ce55b8b4b9: Pull complete
155bc0332b0a: Pull complete
Digest: sha256:774ca3d612de15213102c2dbbba55df44dc5cf9870ca2be6c6e9c627fa63d67a
Status: Downloaded newer image for nvidia/cuda:11.0-base
Fri May 14 03:39:41 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 3090    On   | 00000000:D5:00.0 Off |                  N/A |
|  0%   51C    P8    18W / 350W |      1MiB / 24259MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+