DIGITS DevBox
Introduction
This page details the build of our DIGITS DevBox, affectionately called "Bastard". Using the DevBox provides other information.
The documentation from NVIDIA is here:
- https://docs.nvidia.com/dgx/digits-devbox-user-guide/index.html
- https://developer.nvidia.com/devbox
- https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf
However, the form to get help from NVIDIA is unfortunately closed [1][2][3], and most of the other published specs are limited to just the hardware [4][5][6][7]. The best instructions that I could find were:
The DevBox is currently unavailable from Amazon [8], and at around $15k buying one is prohibitive for most people. Some firms, including Lambda Labs [9] and Bizon-tech [10], are selling variants on it, but their prices are high too and the details on their specs are limited (the motherboard and config details are missing entirely).
But the parts cost is perhaps $4-5k now for the original spec! So this page goes through everything required to put one together and get it up and running.
Hardware
Description
We mostly followed the original hardware spec from NVIDIA, updating the capacity of the drives and other minor things, as we had many of these parts available as salvage from other boxes. We had to buy the ASUS X99-E WS motherboard (we got the ASUS X99-E WS/USB variant as the original wasn't available and this one has USB3.1), as well as some new drives, just for this project.
We opted to use a Xeon E5-2620 v3 processor rather than the Core i7-5930K. We had both available, and both support 40 PCIe lanes, mount in the LGA 2011-v3 socket, have 6 cores, 15MB caches, etc. The i7 has a faster clock speed, but the Xeon takes registered (buffered) ECC DDR4 RDIMMs, which means we can put 256GB on the board rather than just 64GB. For the GPUs we have a TITAN RTX and an older TITAN Xp available to start, and we can add a 1080 Ti later, or buy some additional GPUs if needed. We also put the whole thing in a Rosewill RSV-L4000 case.
Parts List
Quantity | Part |
---|---|
1 | ASUS X99-E WS/USB 3.1 LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.1 USB 3.0 CEB Intel Motherboard |
1 | Intel Haswell Xeon E5-2620 v3, 6 cores @ 2.4GHz, 6x256KB L2 cache, 15MB L3 cache, socket LGA 2011-v3 |
8 | Crucial DDR4 RDIMM, 2133MHz, registered (buffered) and ECC, 32GB |
1 | NVIDIA TITAN RTX DirectX 12 900-1G150-2500-000 SB 24GB 384-Bit GDDR6 HDCP Ready Video Card |
1 | NVIDIA TITAN Xp Graphics Card (900-1G611-2530-000) |
1 | SAMSUNG 970 EVO PLUS 500GB Internal Solid State Drive (SSD) MZ-V7S500B/AM |
1 | Samsung 850 EVO 500GB 2.5-Inch SATA III Internal SSD (MZ-75E500/EU) |
3 | WD Red 4TB NAS Hard Disk Drive - 5400 RPM Class SATA 6Gb/s 64MB Cache 3.5 Inch - WD40EFRX |
1 | DVDRW: Asus 24x DVD-RW Serial-ATA Internal OEM Optical Drive DRW-24B1ST |
1 | EVGA SuperNOVA 1600 T2 220-T2-1600-X1 80+ TITANIUM 1600W Fully Modular EVGA ECO Mode Power Supply |
1 | Rosewill RSV-L4000 - 4U Rackmount Server Case / Chassis - 8 Internal Bays, 7 Cooling Fans Included |
1 | Rosewill RSV-SATA-Cage-34 - Hard Disk Drives - Black, 3 x 5.25" to 4 x 3.5" Hot-Swap - SATA III / SAS - Cage |
1 | Rosewill RDRD-11003 2.5" SSD / HDD Mounting Kit for 3.5" Drive Bay w/ 60mm Fan |
3 | Corsair ML120 PRO LED CO-9050043-WW 120mm Blue LED 120mm Premium Magnetic Levitation PWM Fan |
2 | ARCTIC F8 PWM Fluid Dynamic Bearing Case Fan, 80mm PWM Speed Control, 31 CFM at 22dBA |
Build notes
Old notes on a prior look at a GPU Build are on the wiki too.
There wasn't anything particularly noteworthy about the hardware build. The GPUs need to go in slots 1 and 3, which means they sit tight against each other. We put the TITAN Xp in slot 1 (and plugged the monitor into its HDMI port) so that the fans of the TITAN RTX (which we expect will get heavier use) are in the clear. The fans were set up in a push-pull arrangement, and the hot-swap bay was put in the center position to allow as much airflow past the GPUs as possible.
BIOS
The initial BIOS boot was weird - the machine ran at full power for a short period then powered off multiple times before finally giving a single system beep and loading the BIOS. It may have been memory checking or some such.
We did NOT update the BIOS; it didn't need it. The M.2 drive is visible in the BIOS and will be used as a cache for the RAID 5 array (using bcache). The GPUs are recognized as PCIe devices in the tool section, and all of the SATA drives are recognized.
We then made the following changes:
- Set the three hard disks to hot-swap enabled
- Set the fans to PWM, which drastically cuts down the noise, and set the lower thresholds to 200 (not that it seemed to matter; they seem to idle at around 1k)
- Set the OS to "Other OS" rather than Windows, and set enhanced mode to disabled
- Delete the PK to disable secure boot
- Change the boot order to CD first (not as UEFI), then the Samsung 850
Notes:
- We will build the RAID 5 array in software, rather than using the X99 chipset RAID through the BIOS
Software
Main OS Install
Install Ubuntu 18.04 (note that the original DIGITS DevBox ran 14.04), not the live version, from a freshly burnt DVD. If you install the HWE version, you don't need to run apt-get install --install-recommends linux-generic-hwe-18.04 at the end.
In the installer
Choose the first network hardware option and make sure that the second (right-most) network port is connected to a DHCP-broadcasting router.
Under partitions:
- Put one large partition on the 850, formatted as ext4, mounted as /, and marked bootable
- Partition each SATA hard drive as RAID
- Put one large partition on the 970, formatted as ext4, not mounted (for later)
- Put software RAID5 over the 3 SATA drives, format the RAID as ext4 and mount it as /bulk
Install SSH and Samba. When prompted, add the MBR to the front of the 850.
First boot
After a reboot, the screen freezes if you didn't install HWE. Either change the bootloader, adding nomodeset (see https://www.pugetsystems.com/labs/hpc/The-Best-Way-To-Install-Ubuntu-18-04-with-NVIDIA-Drivers-and-any-Desktop-Flavor-1178/#step-4-potential-problem-number-1), or just SSH onto the box and fix that now.
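For reference, the nomodeset change amounts to editing the standard Ubuntu GRUB config and regenerating it (a minimal sketch of the fix described in the Puget link, not a record of the exact edit made here):
cd /etc/default
vi grub
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
update-grub
shutdown -r now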
Run as root:
apt-get update
apt-get dist-upgrade
apt-get install --install-recommends linux-generic-hwe-18.04
Check the release:
lsb_release -a
Give the box a reboot!
Video Drivers
Hardware check
Check that the hardware is being seen:
lspci -vk
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
        Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at d000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau

06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 12a3
        Flags: fast devsel, IRQ 24, NUMA node 0
        Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
        Memory at a0000000 (64-bit, prefetchable) [size=256M]
        Memory at b0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at c000 [size=128]
        Expansion ROM at f9000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nvidiafb, nouveau
This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.
You can also list the driver using ubuntu-drivers, which is supposed to tell you which NVIDIA driver is recommended:
apt-get install ubuntu-drivers-common
ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0 ==
modalias : pci:v000010DEd00001B02sv000010DEsd000011DFbc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [TITAN Xp]
driver   : nvidia-driver-390 - distro non-free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin
But the 390 is the only driver available from the main repo. Add the experimental repo for more options:
add-apt-repository ppa:graphics-drivers/ppa
apt update
ubuntu-drivers devices
== /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0 ==
modalias : pci:v000010DEd00001B02sv000010DEsd000011DFbc03sc00i00
vendor   : NVIDIA Corporation
model    : GP102 [TITAN Xp]
driver   : nvidia-driver-418 - third-party free
driver   : nvidia-driver-415 - third-party free
driver   : nvidia-driver-430 - third-party free recommended
driver   : nvidia-driver-396 - third-party free
driver   : nvidia-driver-390 - distro non-free
driver   : nvidia-driver-410 - third-party free
driver   : xserver-xorg-video-nouveau - distro free builtin
You could install the driver directly now using, say, apt install nvidia-driver-430. But don't!
CUDA
Get CUDA 10.1 and have it install its preferred driver (418.67):
- The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
- You can download CUDA from here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
Essentially, first install build-essential, which gets you gcc. Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal, if you have deviated from these instructions and already installed X Windows) so that it isn't loaded.
apt-get install build-essential
gcc --version
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run
vi /etc/modprobe.d/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0
update-initramfs -u
shutdown -r now
lspci -vk
    (now shows no kernel driver in use!)
Then run the installer script.
sh cuda_10.1.168_418.67_linux.run
===========
= Summary =
===========
Driver:   Installed
Toolkit:  Installed in /usr/local/cuda-10.1/
Samples:  Installed in /home/ed/, but missing recommended libraries

Please make sure that
 - PATH includes /usr/local/cuda-10.1/bin
 - LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root

To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall

Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
Logfile is /var/log/cuda-installer.log
Fix the paths:
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
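These exports only apply to the current shell. The original notes don't say where they were made permanent; one common approach (an assumption, not part of the original setup) is to append them to the shell profile:
cat >> ~/.bashrc << 'EOF'
export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
EOF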
Start the persistence daemon:
/usr/bin/nvidia-persistenced --verbose
This should be run at boot, so:
vi /etc/rc.local
    #!/bin/sh -e
    /usr/bin/nvidia-persistenced --verbose
    exit 0
chmod +x /etc/rc.local
Verify the driver:
cat /proc/driver/nvidia/version
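nvidia-smi, which is installed alongside the driver, gives a fuller check and should list both TITANs and the driver version:
nvidia-smi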
Test the installation
Build the samples:
cd /usr/local/cuda-10.1/samples
make
Change into the sample directory and run the tests:
cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest
And yes, it's a thing of beauty!
X Windows
Now install the X window system. The easiest way is:
tasksel

And choose your favorite. We used Ubuntu Desktop.
And reboot again to make sure that everything is working nicely.
Bcache
The RAID5 array is set up and mounted as /bulk. We need to add the cache on the m.2 drive. Begin by installing bcache:
apt-get install bcache-tools
    (it was already installed and the newest version)
See what we have:
fdisk -l
This gives us:
- /dev/nvme0n1p1 m.2
- /dev/sda RAID disk
- /dev/sdb RAID disk
- /dev/sdc RAID disk
- /dev/md0 RAID array
- /dev/sdd 850 (boot SSD)
The m.2 is not mounted. This can be seen by checking lsblk (or mount or df):
lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk
└─sda1        8:1    0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdb           8:16   0   3.7T  0 disk
└─sdb1        8:17   0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdc           8:32   0   3.7T  0 disk
└─sdc1        8:33   0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdd           8:48   0 465.8G  0 disk
└─sdd1        8:49   0 465.8G  0 part  /
sr0          11:0    1  1024M  0 rom
nvme0n1     259:0    0 465.8G  0 disk
└─nvme0n1p1 259:1    0 465.8G  0 part
Check the mdadm.conf file and fstab:
cat /etc/mdadm/mdadm.conf
...
ARRAY /dev/md/0 metadata=1.2 UUID=af515d37:8a0e05a1:59338d18:23f5af21 name=bastard:0

cat /etc/fstab
UUID=475ad41e-3d64-4c90-8fbc-9289c050acea /     ext4 errors=remount-ro 0 1
UUID=aa65554a-24d9-450a-b10c-63c5c6a4b48a /bulk ext4 defaults          0 2
/swapfile                                 none  swap sw                0 0
Note that the second UUID refers to /dev/md0, whereas the UUID in the contents of mdadm.conf is the UUID of the 3 RAID5 drives together:
blkid /dev/md0
/dev/md0: UUID="aa65554a-24d9-450a-b10c-63c5c6a4b48a" TYPE="ext4"
Note we have an active RAID5 array:
cat /proc/mdstat
Instructions for taking apart and/or (re-)creating a RAID array are here:
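For reference, a 3-disk RAID5 array like this one is typically created with mdadm along the following lines (a sketch using the partition names from the fdisk output above, not necessarily the exact commands used for this build):
mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
mdadm --detail --scan >> /etc/mdadm/mdadm.conf
update-initramfs -u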
Instructions on building a bcache are here:
Unmount the RAID array:
umount /dev/md0
Wipe both the m.2 and the RAID5 array:
wipefs -a /dev/nvme0n1p1
wipefs -a /dev/md0
Make the bcache, formatting both devices (md0 as backing, the m.2 as cache). Note that when you do it in one command, the cache is attached to the backing device automatically.
make-bcache -B /dev/md0 -C /dev/nvme0n1p1
If you screw up, cd to /sys/fs/bcache/whatever and then ls -l cache0. If there is an entry in there, echo 1 > stop. This unregisters the cache and should let you start over, as shown in the sketch below.
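As shell commands, that recovery sequence looks roughly like this (the directory name is the cache set UUID created by make-bcache, shown here as a placeholder):
cd /sys/fs/bcache/<cache-set-uuid>
ls -l cache0
echo 1 > stop    # unregisters the cache set so you can start over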
Check the new bcache array is there, format it and mount it:
ls /dev/bcache*
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /bulk
Now we need to update fstab (see https://help.ubuntu.com/community/Fstab) with the right UUID and spec:
blkid /dev/bcache0
    UUID="4c63f20b-ad35-477d-bfaa-82571beba841" TYPE="ext4"

cp /etc/fstab /etc/fstab.org
vi /etc/fstab
    (comment out the old RAID array entry and add a new entry:)
    UUID=4c63f20b-ad35-477d-bfaa-82571beba841 /bulk ext4 rw 0 0
And update your boot image and give it a reboot to check the new bcache array comes back up ok:
update-initramfs -u
shutdown -r now
Samba
These instructions are taken from the Research_Computing_Configuration#Samba page with only minor modifications. This guide is helpful: https://linuxconfig.org/how-to-configure-samba-server-share-on-ubuntu-18-04-bionic-beaver-linux
Check that Samba is installed:
samba --version
Then fix the conf file:
cp /etc/samba/smb.conf /etc/samba/smb.conf.bak
vi /etc/samba/smb.conf

workgroup = BASTARDGROUP
usershare allow guests = no
; comment out the [printers] and [print$] sections

[bulk]
    comment = Bulk RAID Array
    path = /bulk
    browseable = yes
    create mask = 0775
    directory mask = 0775
    read only = no
    guest ok = no
Test the parameters, change the permissions and ownership:
testparm /etc/samba/smb.conf
chmod 770 /bulk
groupadd smbusers
chown :smbusers /bulk
Now create the researcher account, and add it to the samba share group
cat /etc/group
groupadd -g 1002 researcher
useradd -g researcher -G smbusers -s /bin/bash -p 1234 -d /home/researcher -m researcher
passwd researcher
    (hint: littleamount)
smbpasswd -a researcher
Finally restart samba:
systemctl restart smbd
systemctl restart nmbd
Check it works:
smbclient -L localhost
    (no root password)
And add users to the samba group (if not already):
usermod -G smbusers researcher    # note: -G without -a sets the supplementary groups and will overwrite sudo or other group assignments
usermod -aG smbusers ed           # -a appends, so ed keeps his existing groups
Dev Tools
DIGITS
This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/
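In outline, the Docker CE install from that page looks like the following on Ubuntu 18.04 (a sketch; check the linked instructions for the current commands):
apt-get update
apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io
docker run hello-world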
Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use CUDA 10.1:
...
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd

# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):
docker pull nvidia/digits
Finally run DIGITS inside a docker container (see https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS for other options):
docker run --runtime=nvidia --name digits -d -p 5000:5000 nvidia/digits
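By default the container keeps datasets and jobs inside itself; the nvidia-docker DIGITS wiki page linked above also shows mounting host directories onto the container's /data and /jobs paths so they persist, for example (the /bulk paths here are our choice, not from the original):
docker run --runtime=nvidia --name digits -d -p 5000:5000 \
    -v /bulk/digits/data:/data \
    -v /bulk/digits/jobs:/jobs \
    nvidia/digits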
And open a browser to http://localhost:5000/ to see DIGITS.

Documentation:
- https://github.com/NVIDIA/DIGITS/blob/digits-6.0/docs/GettingStarted.md
- https://developer.nvidia.com/digits
cuDNN
Documentation on installing cuDNN is here:
- https://developer.nvidia.com/cuDNN
- https://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html
First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:
cd /bulk/install/
dpkg -i libcudnn7_7.5.1.10-1+cuda10.1_amd64.deb
dpkg -i libcudnn7-dev_7.5.1.10-1+cuda10.1_amd64.deb
dpkg -i libcudnn7-doc_7.5.1.10-1+cuda10.1_amd64.deb
And test it:
cp -r /usr/src/cudnn_samples_v7/ $HOME
cd $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN
    Test passed!
Python Based
Now install Anaconda, so that we have python 3, and can pip install everything else. Instructions for installing Anaconda on Ubuntu 18.04LTS (e.g., https://docs.anaconda.com/anaconda/install/linux/) all recommend using the shell script.
From https://www.anaconda.com/distribution/, the latest version is the Python 3.7 one, so:
cd /bulk/install
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
sha256sum Anaconda3-2019.03-Linux-x86_64.sh
As user researcher, run the installation (this installs python 3.7.3):
bash Anaconda3-2019.03-Linux-x86_64.sh
    (accept the install location: /home/researcher/anaconda3)
    (accept the initialization by running conda init)

Flush the local env:

source ~/.bashrc
Tensorflow
Now install tensorflow using pip (see https://www.tensorflow.org/install/pip):
As root:

apt install python3-pip
apt install virtualenv
pip3 install -U virtualenv
As researcher:

cd /home/researcher
virtualenv --system-site-packages -p python3 ./venv
source ./venv/bin/activate  # sh, bash, ksh, or zsh
pip install --upgrade tensorflow-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
And this doesn't work. It turns out that TensorFlow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10 within conda only (see https://github.com/tensorflow/tensorflow/issues/26182). Still as researcher (and in the venv):
conda install cudatoolkit
conda install cudnn
conda install tensorflow-gpu
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

AND IT WORKS!
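To confirm that TensorFlow actually sees both GPUs (and not just that the op ran), a quick check along these lines works in TF 1.x:
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"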
Note: to deactivate virtual environment:
deactivate
Theano
Theano v1 requires Python >=3.4 and <3.6, and we are currently running 3.7. If we decide to install Theano, we'll need to set up another version of Python in another virtual environment (see the sketch below). See:
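A separate conda environment pinned to an older Python would satisfy that constraint; a sketch (the environment name and the choice of Python 3.5 are assumptions):
conda create -n theano python=3.5
conda activate theano
pip install Theano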
PyTorch and SciKit
Run the following as researcher (in venv):
conda install -c anaconda numpy
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
conda install -c anaconda scikit-learn
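A quick check that PyTorch sees the GPUs (run as researcher in the same environment):
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0))"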
Refs:
Other packages
The following are not yet installed: