Difference between revisions of "DIGITS DevBox"

From edegan.com
 
(43 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page details the build of our [[DIGITS DevBox]]. There's also a page giving information on [[Using the DevBox]].
+
This page details the build of our [[DIGITS DevBox]]. There's also a page giving information on [[Using the DevBox]]. NVIDIA, famous for their incredibly poor supply-chain and inventory management, have been saying [https://developer.nvidia.com/devbo "Please note that we are sold out of our inventory of the DIGITS DevBox, and no new systems are being built"] since shortly after the [https://en.wikipedia.org/wiki/GeForce_10_series Titan X] was the latest and greatest thing (i.e., somewhere around 2016). But it's pretty straightforward to update [https://www.azken.com/download/DIGITS_DEVBOX_DESIGN_GUIDE.pdf their spec].
  
 
==Introduction==
 
==Introduction==
Line 5: Line 5:
 
===Specification===
 
===Specification===
  
[:File:Top1000.jpg|right|300px]
+
<onlyinclude>[[File:Top1000.jpg|right|300px]] Our [[DIGITS DevBox]], affectionately named after Lois McMaster Bujold's fifth God, has a XEON e5-2620v3 processor, 256GB of DDR4 RAM, two GPUs - one Titan RTX and one Titan Xp - with room for two more, a 500GB SSD hard drive (mounting /), and an 8TB RAID5 array bcached with a 512GB m.2 drive (mounting the /bulk share, which is available over samba). It runs Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.1, Anaconda3-2019.03, python 3.7, tensorflow 1.13, digits 6, and other useful machine learning tools/libraries.</onlyinclude>
<onlyinclude>Our [[DIGITS DevBox]], affectionately named "Bastard", has a XEON e5-2620v3 processor, 256GB of DDR4 RAM, two GPUs - one Titan RTX and one Titan XP - with room for two more, a 500GB SSD hard drive (mounting /), and an 8TB RAID5 array bcached with a 512GB m.2 drive (mounting the /bulk share, which is available over samba). It runs Ubuntu 18.04, CUDA 10.1 (and CUDA 10 under conda), cuDNN 7.5.1, Anaconda3-2019.03, python 3.7, tensorflow 1.13, digits 6, and other useful machine learning tools/libraries.</onlyinclude>
 
  
 
===Documentation===
 
===Documentation===
Line 21: Line 20:
 
The DevBox is currently unavailable from Amazon [https://www.amazon.com/Lambda-Deep-Learning-DevBox-Preinstalled/dp/B01BCDK1KC], and at around $15k buying one is prohibitive for most people. Some firms, including Lambda Labs [https://lambdalabs.com/deep-learning/workstations/4-gpu] and Bizon-tech [https://bizon-tech.com/us/bizon-g3000], are selling variants of it, but their prices are high too and the details on their specs are limited (the MoBo and config details are missing entirely).

The DevBox is currently unavailable from Amazon [https://www.amazon.com/Lambda-Deep-Learning-DevBox-Preinstalled/dp/B01BCDK1KC], and at around $15k buying one is prohibitive for most people. Some firms, including Lambda Labs [https://lambdalabs.com/deep-learning/workstations/4-gpu] and Bizon-tech [https://bizon-tech.com/us/bizon-g3000], are selling variants of it, but their prices are high too and the details on their specs are limited (the MoBo and config details are missing entirely).
  
But the parts cost is perhaps $4-5k now for the original spec! So this page goes through everything required to put one together and get it up and running.
+
But the parts' cost is perhaps $4-5k now for a massive update to the original spec! So this page goes through everything required to put one together and get it up and running.
  
 
==Hardware==
 
==Hardware==
Line 29: Line 28:
 
We mostly followed the original hardware spec from NVIDIA, updating the capacity of the drives and other minor things, as we had many of these parts available as salvage from other boxes. We had to buy the ASUS X99-E WS motherboard (we got the ASUS X99-E WS/USB variant as the original wasn't available and this one has USB3.1), as well as some new drives, just for this project.
 
We mostly followed the original hardware spec from NVIDIA, updating the capacity of the drives and other minor things, as we had many of these parts available as salvage from other boxes. We had to buy the ASUS X99-E WS motherboard (we got the ASUS X99-E WS/USB variant as the original wasn't available and this one has USB3.1), as well as some new drives, just for this project.
  
We opted to use a Xeon e5-2620v3 processor, rather than the Core i7-5930K. We had both available and both support 40 PCIe lanes, mount in the LGA 2011-v3 socket, have 6 cores, 15MB caches, etc. Although the i7 has a faster clock speed, the Xeon takes registered (buffered), ECC DDR4 RDIMMs, which means we can put 256GB on the board, rather than just 64GB. For the GPUs, we have a TITAN RTX and an older TITAN Xp available to start, and we can add a 1080Ti later, or buy some additional GPUs if needed. We also put the whole thing in a Rosewill RSV-L4000 case.
+
[[File:Front1000.jpg|right|300px]] We opted to use a Xeon e5-2620v3 processor, rather than the Core i7-5930K. We had both available and both support 40 PCIe lanes, mount in the LGA 2011-v3 socket, have 6 cores, 15MB caches, etc. Although the i7 has a faster clock speed, the Xeon takes registered (buffered), ECC DDR4 RDIMMs, which means we can put 256GB on the board, rather than just 64GB. For the GPUs, we have a TITAN RTX and an older TITAN Xp available to start, and we can add a 1080Ti later, or buy some additional GPUs if needed. We also put the whole thing in a Rosewill RSV-L4000 case.
  
 
===Parts List===
 
===Parts List===
Line 71: Line 70:
 
Old notes on a prior look at a [[GPU Build]] are on the wiki too.
 
Old notes on a prior look at a [[GPU Build]] are on the wiki too.
  
There weren't any particularly noteworthy things about the hardware build. The GPUs need to go in slots 1 and 3, which means they sit tight on each other. We put the Titan Xp in slot 1 (and plugged the monitor into its HDMI port), because then the fans for the Titan RTX (which we expect will get heavier use) are in the clear for now. The case fans were set up in a push-and-pull arrangement, and the hot-swap bay was put in the center position to allow as much airflow past the GPUs as possible.
+
[[File:Back1000.jpg|right|300px]] There weren't any particularly noteworthy things about the hardware build. The GPUs need to go in slots 1 and 3, which means they sit tight on each other. We put the Titan Xp in slot 1 (and plugged the monitor into its HDMI port), because then the fans for the Titan RTX (which we expect will get heavier use) are in the clear for now. The case fans were set up in a push-and-pull arrangement, and the hot-swap bay was put in the center position to allow as much airflow past the GPUs as possible.
  
 
===BIOS===
 
===BIOS===
Line 88: Line 87:
 
Notes:
 
Notes:
 
*We will build the RAID 5 array in software, rather than using the X99's RAID through the BIOS

*We will build the RAID 5 array in software, rather than using the X99's RAID through the BIOS
 +
 +
What's really crucial is that all the hardware is visible and that we are NOT using UEFI. With UEFI, there is an issue with the drivers not being properly signed under secure boot.
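
Once the OS is installed, one quick way to confirm that the box really did boot in legacy BIOS mode is to look for the /sys/firmware/efi directory, which the Linux kernel only exposes on a UEFI boot. A minimal sketch of that check follows; the function name boot_mode and its optional root argument are made up for illustration and are not part of the original build notes.

```shell
#!/bin/sh
# boot_mode [ROOT]: print "UEFI" if ROOT/sys/firmware/efi exists, else "BIOS".
# ROOT defaults to empty (the real /sys); it is only a hook for testing and
# is an illustrative assumption, not something the build itself needs.
boot_mode() {
    root="${1:-}"
    if [ -d "$root/sys/firmware/efi" ]; then
        echo "UEFI"
    else
        echo "BIOS"
    fi
}

# Usage on the installed box:
#   boot_mode
```

If this prints UEFI, the BIOS boot settings are worth revisiting before installing the video drivers.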
  
 
==Software==
 
==Software==
Line 93: Line 94:
 
===Main OS Install===
 
===Main OS Install===
  
Install Ubuntu 18.04 (note that the original DIGITS DevBox ran 14.04), '''not the live version''', from a freshly burnt DVD. If you install the HWE version, you don't need to run apt-get install --install-recommends linux-generic-hwe-18.04 at the end.
+
Install [http://cdimage.ubuntu.com/releases/18.04.2/release/?_ga=2.30548799.1041204444.1558044875-2114387110.1558044875 Ubuntu 18.04] (note that the original DIGITS DevBox ran 14.04), '''not the live version''', from a freshly burnt DVD. If you install the HWE version, you don't need to run apt-get install --install-recommends linux-generic-hwe-18.04 at the end.
  
 
====In the installer====
 
====In the installer====
Line 99: Line 100:
 
Choose the first network hardware option and make sure that the second (rightmost) network port is connected to a router that serves DHCP.

Choose the first network hardware option and make sure that the second (rightmost) network port is connected to a router that serves DHCP.
  
Under partitions:
+
Under partitions:  
 +
[[File:Partitions1000.jpg|right|300px]]
 
# Put one large partition, formatted as ext4, mounted as /, bootable on the 850
 
# Put one large partition, formatted as ext4, mounted as /, bootable on the 850
 
# Partition each SATA drive as RAID
 
# Partition each SATA drive as RAID
Line 120: Line 122:
  
 
Give the box a reboot!
 
Give the box a reboot!
 +
 +
===X Windows===
 +
 +
If you install the video driver before installing X Windows, you will need to manually edit the X Windows config files. So, install the X window system now. The easiest way is:
 +
tasksel
 +
  And choose your favorite. We used Ubuntu Desktop.
 +
 +
And reboot again to make sure that everything is working nicely.
  
 
===Video Drivers===
 
===Video Drivers===
  
====Hardware check====
+
The first build of this box was done with an installation of CUDA 10.1, which automatically installed version 418.67 of the NVIDIA driver. We then installed CUDA 10.0 under conda to support Tensorflow 1.13. All went mostly well, and the history of this page contains the instructions. However, at some point, likely because of an OS update, the video driver(s) stopped working. This page now describes the second build (as if it were a build from scratch). [[Addressing Ubuntu NVIDIA Issues]] provides additional information.
  
Check that the hardware is being seen:
+
===Hardware and Drivers===
lspci -vk
 
 
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
 
        Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
 
        Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
 
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
 
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
 
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
 
        I/O ports at d000 [size=128]
 
        Expansion ROM at 000c0000 [disabled] [size=128K]
 
        Capabilities: [60] Power Management version 3
 
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
 
        Capabilities: [78] Express Legacy Endpoint, MSI 00
 
        Capabilities: [100] Virtual Channel
 
        Capabilities: [250] Latency Tolerance Reporting
 
        Capabilities: [128] Power Budgeting <?>
 
        Capabilities: [420] Advanced Error Reporting
 
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
 
        Capabilities: [900] #19
 
        Kernel driver in use: nouveau
 
        Kernel modules: nvidiafb, nouveau
 
 
 
06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
 
        Subsystem: NVIDIA Corporation Device 12a3
 
        Flags: fast devsel, IRQ 24, NUMA node 0
 
        Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
 
        Memory at a0000000 (64-bit, prefetchable) [size=256M]
 
        Memory at b0000000 (64-bit, prefetchable) [size=32M]
 
        I/O ports at c000 [size=128]
 
        Expansion ROM at f9000000 [disabled] [size=512K]
 
        Capabilities: [60] Power Management version 3
 
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
 
        Capabilities: [78] Express Legacy Endpoint, MSI 00
 
        Capabilities: [100] Virtual Channel
 
        Capabilities: [250] Latency Tolerance Reporting
 
        Capabilities: [258] L1 PM Substates
 
        Capabilities: [128] Power Budgeting <?>
 
        Capabilities: [420] Advanced Error Reporting
 
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
 
        Capabilities: [900] #19
 
        Capabilities: [bb0] #15
 
        Kernel modules: nvidiafb, nouveau
 
  
This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
+
Check the hardware is being seen and what driver is being used with:
 +
  lspci -vk
  
 
Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.
 
Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.
Line 200: Line 169:
 
   driver  : xserver-xorg-video-nouveau - distro free builtin
 
   driver  : xserver-xorg-video-nouveau - distro free builtin
  
You could install the driver directly now using, say, apt install nvidia-430. But don't!
+
Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot to a text terminal so that it isn't loaded.  
 
 
====CUDA====
 
 
 
Get CUDA 10.1 and have it install its preferred driver (418.67):
 
*The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
 
*You can download CUDA from here: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
 
 
 
Essentially, first install build-essential, which gets you gcc. Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal, if you have deviated from these instructions and already installed X Windows) so that it isn't loaded.  
 
  
 
  apt-get install build-essential
 
  apt-get install build-essential
 
  gcc --version
 
  gcc --version
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run
 
 
  vi /etc/modprobe.d/blacklist-nouveau.conf
 
  vi /etc/modprobe.d/blacklist-nouveau.conf
 
   blacklist nouveau
 
   blacklist nouveau
Line 218: Line 178:
 
  update-initramfs -u
 
  update-initramfs -u
 
  shutdown -r now
 
  shutdown -r now
 +
  Reboot to a text terminal
 
  lspci -vk
 
  lspci -vk
 
   Shows no kernel driver in use!
 
   Shows no kernel driver in use!
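
As a cross-check on the lspci output, the list of loaded kernel modules can be inspected directly. The sketch below reads /proc/modules, where the first field of each line is a module name; the helper name module_loaded and its optional file argument are illustrative assumptions.

```shell
#!/bin/sh
# module_loaded NAME [FILE]: succeed if kernel module NAME is listed in FILE.
# FILE defaults to /proc/modules; the override exists only so the check can
# be tried against a saved copy.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit found ? 0 : 1 }' "${2:-/proc/modules}"
}

# Usage after the reboot:
#   if module_loaded nouveau; then echo "nouveau is still loaded"; fi
```

If nouveau still shows up after the reboot, the blacklist file or the update-initramfs step probably didn't take.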
  
Then run the installer script.
+
Install the driver!
  sh cuda_10.1.168_418.67_linux.run
+
 
 +
apt install nvidia-driver-430
 +
 
 +
====CUDA====
 +
 
 +
Get CUDA 10.0, rather than 10.1. Although 10.1 is the latest version at the time of writing, it won't work with Tensorflow 1.13, so you'll just end up installing 10.0 under conda anyway.
 +
 
 +
*The installation instructions are here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
 +
*You can download CUDA 10.0 from here: https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal
 +
Essentially, first install build-essential, which gets you gcc.
 +
 
 +
Then run the installer script and DO NOT install the driver (don't worry about the warning, it will work fine!):
 +
  sh cuda_10.0.130_410.48_linux.run
 +
 
 +
Do you accept the previously read EULA?
 +
accept/decline/quit: accept
 +
 +
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
 +
(y)es/(n)o/(q)uit: n
 +
 +
Install the CUDA 10.0 Toolkit?
 +
(y)es/(n)o/(q)uit: y
 +
 +
Enter Toolkit Location
 +
[ default is /usr/local/cuda-10.0 ]:
 +
 +
Do you want to install a symbolic link at /usr/local/cuda?
 +
(y)es/(n)o/(q)uit: y
 +
 +
Install the CUDA 10.0 Samples?
 +
(y)es/(n)o/(q)uit: y
 +
 +
Enter CUDA Samples Location
 +
[ default is /home/ed ]:
 +
 +
Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
 +
Missing recommended library: libGLU.so
 +
Missing recommended library: libX11.so
 +
Missing recommended library: libXi.so
 +
Missing recommended library: libXmu.so
 +
Missing recommended library: libGL.so
 
   
 
   
===========
+
Installing the CUDA Samples in /home/ed ...
= Summary =
+
Copying samples to /home/ed/NVIDIA_CUDA-10.0_Samples now...
===========
+
Finished copying samples.
 
   
 
   
Driver:  Installed
+
===========
Toolkit:  Installed in /usr/local/cuda-10.1/
+
= Summary =
Samples:  Installed in /home/ed/, but missing recommended libraries
+
===========
 
   
 
   
  Please make sure that
+
Driver:  Not Selected
  -  PATH includes /usr/local/cuda-10.1/bin
+
Toolkit: Installed in /usr/local/cuda-10.0
  -  LD_LIBRARY_PATH includes /usr/local/cuda-10.1/lib64, or, add /usr/local/cuda-10.1/lib64 to /etc/ld.so.conf and run ldconfig as root
+
Samples:  Installed in /home/ed, but missing recommended libraries
 
   
 
   
To uninstall the CUDA Toolkit, run cuda-uninstaller in /usr/local/cuda-10.1/bin
+
Please make sure that
To uninstall the NVIDIA Driver, run nvidia-uninstall
+
-   PATH includes /usr/local/cuda-10.0/bin
 +
-  LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
 
   
 
   
  Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.1/doc/pdf for detailed information on setting up CUDA.
+
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
  Logfile is /var/log/cuda-installer.log
+
   
 +
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
 +
   
 +
***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required
 +
for CUDA 10.0 functionality to work.
 +
To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
 +
    sudo <CudaInstaller>.run -silent -driver
 +
 +
Logfile is /tmp/cuda_install_2807.log
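
The warning above only matters if the separately installed driver is older than 384.00. A small sketch for comparing dotted driver versions is shown here; version_ge is a hypothetical helper, and it assumes GNU sort with the -V (version sort) option, which Ubuntu's coreutils provides.

```shell
#!/bin/sh
# version_ge A B: succeed if dotted version A is greater than or equal to B.
# Sorts the two versions with GNU sort -V and checks that B comes first.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage (430.26 is the driver version reported later on this page):
#   version_ge 430.26 384.00 && echo "driver is new enough for CUDA 10.0"
```

On a live system the installed version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed in as the first argument.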
 +
 
 +
Now fix the paths. To do this for a single user do:
 +
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
 +
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
  
Fix the paths:
+
But it is better to fix it for everyone by editing your environment file:
  export PATH=/usr/local/cuda-10.1/bin:/usr/local/cuda-10.1/NsightCompute-2019.1${PATH:+:${PATH}}
+
  vi /etc/environment
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
+
  PATH="/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
 +
  LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"
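
The ${PATH:+:${PATH}} expansion in the single-user exports is worth a second look: it adds the ":" separator and the old value only when the variable is already set and non-empty. The sketch below isolates that idiom; prepend_dir is an illustrative name, not something from the CUDA guide.

```shell
#!/bin/sh
# prepend_dir DIR LIST: print LIST with DIR put in front.
# ${list:+:${list}} expands to ":LIST" only when LIST is set and non-empty,
# so an empty LIST does not leave a trailing ":" (an empty entry, which is
# treated as the current directory).
prepend_dir() {
    dir="$1"
    list="$2"
    echo "${dir}${list:+:${list}}"
}

# Usage:
#   LD_LIBRARY_PATH=$(prepend_dir /usr/local/cuda-10.0/lib64 "$LD_LIBRARY_PATH")
```

/etc/environment is read as plain KEY=value lines rather than by a shell, so this expansion is not available there, which is why the system-wide PATH is written out in full.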
  
Start the persistence daemon:
+
With version cuda 10.0, you don't need to edit rc.local to start the persistence daemon:
 
  /usr/bin/nvidia-persistenced --verbose
 
  /usr/bin/nvidia-persistenced --verbose
  
This should be run at boot, so:
+
Instead, nvidia-persistenced runs as a service.  
vi /etc/rc.local
 
  #!/bin/sh -e
 
  /usr/bin/nvidia-persistenced --verbose
 
  exit 0
 
chmod +x /etc/rc.local
 
 
 
Verify the driver:
 
cat /proc/driver/nvidia/version
 
  
 
====Test the installation====
 
====Test the installation====
  
Make the samples in:
+
Make the samples...
  cd /usr/local/cuda-10.1/samples
+
  cd /usr/local/cuda-10.0/samples
 
  make
 
  make
 +
 +
And change into the sample directory and run the tests:
  
Change into the sample directory and run the tests:
+
  cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release
  cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release
 
 
  ./deviceQuery
 
  ./deviceQuery
 
  ./bandwidthTest  
 
  ./bandwidthTest  
  
And yes, it's a thing of beauty!
+
Everything should be good at this point!
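
Both deviceQuery and bandwidthTest finish with a "Result = PASS" line when they succeed, so the check can be scripted. A minimal sketch, with sample_passed as a hypothetical helper name:

```shell
#!/bin/sh
# sample_passed: read a CUDA sample's output on stdin and succeed only if it
# contains the "Result = PASS" line.
sample_passed() {
    grep -q 'Result = PASS'
}

# Usage, from the release directory:
#   ./deviceQuery | sample_passed && echo "deviceQuery OK"
#   ./bandwidthTest | sample_passed && echo "bandwidthTest OK"
```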
 
 
===X Windows===
 
 
 
Now install the X window system. The easiest way is:
 
tasksel
 
  And choose your favorite. We used Ubuntu Desktop.
 
 
 
And reboot again to make sure that everything is working nicely.
 
  
 
===Bcache===
 
===Bcache===
Line 425: Line 425:
 
This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/
 
This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/
  
Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install docker2, but change the last command to use cuda 10.1
+
Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install docker2, but change the last command to use cuda 10.0
 
  ...
 
  ...
 
  sudo apt-get install -y nvidia-docker2
 
  sudo apt-get install -y nvidia-docker2
 
  sudo pkill -SIGHUP dockerd
 
  sudo pkill -SIGHUP dockerd
 
  # Test nvidia-smi with the latest official CUDA image
 
  # Test nvidia-smi with the latest official CUDA image
  docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
+
  docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi
  
 
Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):
 
Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):
Line 444: Line 444:
 
*https://developer.nvidia.com/digits
 
*https://developer.nvidia.com/digits
  
 +
Note: you can remove stopped docker containers (along with unused networks, dangling images, and build cache) with
 +
docker system prune
 +
 
====cuDNN====
 
====cuDNN====
  
Line 452: Line 455:
 
First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:
 
First, make an installs directory in bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:
 
  cd /bulk/install/
 
  cd /bulk/install/
  dpkg -i libcudnn7_7.5.1.10-1+cuda10.1_amd64.deb
+
  dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb
  dpkg -i libcudnn7-dev_7.5.1.10-1+cuda10.1_amd64.deb
+
  dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb
  dpkg -i libcudnn7-doc_7.5.1.10-1+cuda10.1_amd64.deb
+
  dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb
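
The cuDNN package names encode the CUDA version they were built against (the part after "+cuda"), and that needs to match the installed toolkit. The sketch below pulls both versions out of their names so they can be compared; the function names are illustrative, and it assumes the file and directory naming shown on this page.

```shell
#!/bin/sh
# cudnn_deb_cuda FILE: print the CUDA version embedded in a cuDNN .deb name,
# e.g. libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb -> 10.0
cudnn_deb_cuda() {
    echo "$1" | sed -n 's/.*+cuda\([0-9][0-9.]*\)_.*\.deb$/\1/p'
}

# cuda_toolkit_version DIR: print the version suffix of a toolkit directory,
# e.g. /usr/local/cuda-10.0 -> 10.0
cuda_toolkit_version() {
    basename "$1" | sed -n 's/^cuda-\([0-9][0-9.]*\)$/\1/p'
}

# Usage:
#   [ "$(cudnn_deb_cuda libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb)" = \
#     "$(cuda_toolkit_version /usr/local/cuda-10.0)" ] && echo "versions match"
```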
  
 
And test it:
 
And test it:
Line 493: Line 496:
 
   pip install --upgrade tensorflow-gpu
 
   pip install --upgrade tensorflow-gpu
 
   python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
 
   python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
 
And this doesn't work. It turns out that tensorflow 1.13.1 doesn't work with CUDA 10.1! But there is a work around, which is to install cuda10 in conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the installation of CUDA 10.1 because tensorflow will catch up at some point.
 
 
Still as researcher (and in the venv):
 
conda install cudatoolkit
 
conda install cudnn
 
conda install tensorflow-gpu
 
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
 
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
 
  AND IT WORKS!
 
  
 
Note: to deactivate the virtual environment:
 
Note: to deactivate the virtual environment:
 
  deactivate
 
  deactivate
 +
 +
Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.
  
 
=====PyTorch and SciKit=====
 
=====PyTorch and SciKit=====
Line 529: Line 524:
 
Theano v.1 requires python >=3.4 and <3.6. We are currently running 3.7. If we decide to install theano, we'll need to set up another version of python and another virtual environment. See:
 
Theano v.1 requires python >=3.4 and <3.6. We are currently running 3.7. If we decide to install theano, we'll need to set up another version of python and another virtual environment. See:
 
*http://deeplearning.net/software/theano/install_ubuntu.html
 
*http://deeplearning.net/software/theano/install_ubuntu.html
 +
 +
===VNC===
 +
 +
In order to use the graphical interface for Matlab and other applications, we need a VNC server.
 +
 +
First, install the VNC client remotely. We use the standalone exe from TigerVNC.
 +
 +
Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04
 +
 +
cd /root
 +
apt-get install xfce4 xfce4-goodies
 +
 +
As user
 +
sudo apt-get install tightvncserver
 +
vncserver
 +
  set password for user (ailia)
 +
vncserver -kill :1
 +
mv ~/.vnc/xstartup ~/.vnc/xstartup.bak
 +
vi ~/.vnc/xstartup
 +
  #!/bin/bash
 +
  xrdb $HOME/.Xresources
 +
  startxfce4 &
 +
vncserver
 +
sudo vi /etc/systemd/system/vncserver@.service
 +
  [Unit]
 +
  Description=Start TightVNC server at startup
 +
  After=syslog.target network.target 
 +
 
 +
  [Service]
 +
  Type=forking
 +
  User=uname
 +
  Group=uname
 +
  WorkingDirectory=/home/uname
 +
 
 +
  PIDFile=/home/uname/.vnc/%H:%i.pid
 +
  ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
 +
  ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
 +
  ExecStop=/usr/bin/vncserver -kill :%i
 +
 
 +
  [Install]
 +
  WantedBy=multi-user.target
 +
 +
Note that changing the color depth breaks it!
 +
 +
To make changes (or after the edit)
 +
sudo systemctl daemon-reload
 +
sudo systemctl enable vncserver@2.service
 +
vncserver -kill :2
 +
sudo systemctl start vncserver@2
 +
sudo systemctl status vncserver@2
 +
 +
Stop the server with
 +
sudo systemctl stop vncserver@2
 +
 +
Note that we are using :2 because :1 is running our regular Xwindows GUI.
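
The display number also fixes the TCP port: by convention a VNC server on display :N listens on port 5900 + N, so :2 is port 5902. A tiny sketch of that mapping (vnc_port is an illustrative name):

```shell
#!/bin/sh
# vnc_port DISPLAY: print the TCP port for VNC display DISPLAY (5900 + N).
# Accepts the display number with or without a leading colon.
vnc_port() {
    echo $(( 5900 + ${1#:} ))
}

# Usage:
#   vnc_port :2
```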
 +
 +
Instructions on how to set up an IP tunnel using PuTTY:
 +
https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/
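
For machines with an OpenSSH client rather than PuTTY, the same kind of tunnel can be expressed as a single ssh command. This is only a sketch under assumptions: we set ours up with PuTTY, the helper name tunnel_cmd is made up, and user@devbox is a placeholder for a real account and host.

```shell
#!/bin/sh
# tunnel_cmd LOCAL_PORT REMOTE_PORT TARGET: print an OpenSSH command that
# forwards LOCAL_PORT on this machine to REMOTE_PORT on TARGET (user@host).
# -N: do not run a remote command; -L: local port forwarding.
tunnel_cmd() {
    echo "ssh -N -L ${1}:localhost:${2} ${3}"
}

# Usage (placeholder account and host name):
#   tunnel_cmd 5901 5902 user@devbox
```

With the tunnel up, the VNC client would be pointed at port 5901 on the local machine, which is forwarded to port 5902 on the box.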
 +
 +
====Connection Issues====
 +
 +
Coming back to this, I had issues connecting. I set up the tunnel using the saved profile in puTTY.exe and checked to see which local port was listening (it was 5901) and not firewalled using the listening ports tab under network on resmon.exe (it said allowed, not restricted under firewall status). VNC seemed to be running fine on Bastard, and I tried connecting to localhost::1 (that is 5901 on the localhost, through the tunnel to 5902 on Bastard) using VNC Connect by RealVNC. The connection was refused.
 +
 +
I checked it was listening and there was no firewall:
 +
netstat -tlpn
 +
  tcp        0      0 0.0.0.0:5902            0.0.0.0:*              LISTEN      2025/Xtightvnc
 +
ufw status
 +
  Status: inactive
 +
 +
The localhost port seems to be open and listening just fine:
 +
Test-NetConnection 127.0.0.1 -p 5901
 +
 +
So, presumably, there must be something wrong with the tunnel itself.
 +
 +
'''Ignoring the SSH tunnel worked fine: Connect to 192.168.2.202::5902 using the TightVNC (or RealVNC, etc.) client.'''
 +
 +
====Later Notes====
 +
 +
=====Change the resolution=====
 +
 +
I came back and changed the resolution to make it work on one of my portrait desktop monitors.
 +
See https://www.tightvnc.com/vncserver.1.php
 +
 +
As root:
 +
vi /etc/systemd/system/vncserver@.service
 +
  Change line:
 +
  ExecStart=/usr/bin/vncserver -depth 24 -geometry 1440x2560 :%i
 +
  (Note that the size is 2160x3840 divided by 150%.) Leave the color depth alone, as changing it breaks the server.
 +
systemctl daemon-reload
 +
systemctl enable vncserver@2.service
 +
 +
As Ed:
 +
vncserver -kill :2
 +
sudo systemctl start vncserver@2
 +
sudo systemctl status vncserver@2
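
The geometry arithmetic is easy to redo for other monitors or scaling factors. The sketch below divides both dimensions by the display scaling percentage using integer arithmetic; scale_geometry is an illustrative helper name.

```shell
#!/bin/sh
# scale_geometry WxH PERCENT: divide both dimensions by PERCENT/100 using
# integer arithmetic and print the result as WxH.
scale_geometry() {
    w="${1%x*}"
    h="${1#*x}"
    echo "$(( $w * 100 / $2 ))x$(( $h * 100 / $2 ))"
}

# Usage:
#   scale_geometry 2160x3840 150
```

For the portrait monitor here, 2160x3840 at 150% gives the 1440x2560 used in the ExecStart line.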
 +
 +
Exit full screen with ctrl-alt-shift-f.
 +
 +
=====Cut And Paste=====
 +
 +
Also, try to fix the cut-and-paste issue. See, for example, https://unix.stackexchange.com/questions/35030/how-can-i-copy-paste-data-to-and-from-the-windows-clipboard-to-an-opensuse-clipb
 +
 +
As root:
 +
apt-get install autocutsel
 +
vi ~/.vnc/xstartup
 +
  #!/bin/bash
 +
  xrdb $HOME/.Xresources
 +
  autocutsel -fork 
 +
  startxfce4 &
 +
 +
Though this might have been working fine anyway. Just change the terminal and all will be well.
 +
 +
=====Use XFCE terminal=====
 +
 +
Change Settings: Preferred Applications -> Utilities -> Terminal to XFCE
 +
 +
Note that this seems to fix everything but the instructions for customizing the menu are here: https://wiki.xfce.org/howto/customize-menu
 +
cat /etc/xdg/menus/xfce-applications.menu
 +
 +
===RDP===
 +
 +
I also installed xrdp:
 +
apt install xrdp
 +
adduser xrdp ssl-cert
 +
#Check the status and that it is listening on 3389
 +
systemctl status xrdp
 +
netstat -tln
 +
  #It is listening...
 +
vi /etc/xrdp/xrdp.ini
 +
  #See https://linux.die.net/man/5/xrdp.ini
 +
systemctl restart xrdp
 +
 +
This gave a dead session (a flat light blue screen with nothing on it), which finally yielded a connection log which said "login successful for display 10, start connecting, connection problems, giving up, some problem."
 +
  cat /var/log/xrdp-sesman.log
 +
 +
There could be some conflict between VNC and RDP. systemctl status xrdp shows "xrdp_wm_log_msg: connection problem, giving up".
 +
 +
I tried without success:
 +
gsettings set org.gnome.Vino require-encryption false
 +
  https://askubuntu.com/questions/797973/error-problem-connecting-windows-10-rdp-into-xrdp
 +
vi /etc/X11/Xwrapper.config
 +
  allowed_users = anybody
 +
  This was promising as it was previously set to console.
 +
  https://www.linuxquestions.org/questions/linux-software-2/xrdp-under-debian-9-connection-problem-4175623357/#post5817508
 +
apt-get install xorgxrdp-hwe-18.04
 +
  Couldn't find the package... This lead was promising as it applies to 18.04.02 HWE, which is what I'm running
 +
  https://www.nakivo.com/blog/how-to-use-remote-desktop-connection-ubuntu-linux-walkthrough/
 +
dpkg -l |grep xserver-xorg-core
 +
  ii  xserver-xorg-core                          2:1.19.6-1ubuntu4.3                          amd64        Xorg X server - core server
 +
  Which seems ok, despite having a problem with XRDP and Ubuntu 18.04 HWE documented very clearly here: http://c-nergy.be/blog/?p=13972
 +
 +
There is clearly an issue with Ubuntu 18.04 and XRDP. The solution seems to be to downgrade xserver-xorg-core and some related packages, which can be done with an install script (https://c-nergy.be/blog/?p=13933) or manually. But I don't want to do that, so I removed xrdp and went back to VNC!
 +
apt remove xrdp
 +
 +
===Other Software===
 +
 +
I installed the community edition of PyCharm:
 +
snap install pycharm-community --classic
 +
  #Restart the local terminal so that it has updated paths (after a snap install, etc.)
 +
/snap/pycharm-community/214/bin/pycharm.sh
 +
 +
On launch, you get some config options. I chose to install and enable:
 +
*IdeaVim (a VI editor emulator)
 +
*R
 +
*AWS Toolkit
 +
 +
Make a launcher: In /usr/share/applications:
 +
vi pycharm.desktop
 +
  [Desktop Entry]
 +
  Version=2020.2.3
 +
  Type=Application
 +
  Name=PyCharm
 +
  Icon=/snap/pycharm-community/214/bin/pycharm.png
 +
  Exec="/snap/pycharm-community/214/bin/pycharm.sh" %f
 +
  Comment=The Drive to Develop
 +
  Categories=Development;IDE;
 +
  Terminal=false
 +
  StartupWMClass=jetbrains-pycharm
 +
 +
Also, create a launcher on the desktop with the same info.
 +
 +
Note that when I came back to the box the launcher didn't work...
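
One possible cause (an assumption, as we didn't record why it broke) is that the launcher hard-codes snap revision 214 in its paths, and older revisions get removed as the snap refreshes. snapd keeps a "current" symlink next to the numbered revision directories, so a launcher could point at that instead. The sketch below resolves it; snap_current_dir is an illustrative name and it assumes GNU readlink.

```shell
#!/bin/sh
# snap_current_dir NAME [SNAP_ROOT]: print the directory that the snap's
# "current" symlink resolves to. SNAP_ROOT defaults to /snap; the override
# is only there for testing.
snap_current_dir() {
    readlink -f "${2:-/snap}/$1/current"
}

# Usage:
#   snap_current_dir pycharm-community
```

If the revision printed is no longer 214, the Icon and Exec lines in the launcher could use /snap/pycharm-community/current/bin/ in place of the numbered path.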
 +
 +
==== MATLAB ====
 +
 +
I installed MATLAB R2024a by downloading the zip, running
 +
sudo ./install
 +
 +
and using the defaults of /usr/local/MATLAB/R2024 etc. The license number is 41201644.
 +
 +
===Upgrading the nVIDIA Drivers===
 +
 +
In MATLAB, I ran:
gpuDevice
  Error using gpuDevice (line 26)
  Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.

Some quick checks showed that I was using driver version 430.26 on ubuntu 18.04.02.
nvidia-smi
lsb_release -a

I couldn't quite get MATLAB to tell me what I needed:
* https://www.mathworks.com/help/parallel-computing/gpu-computing-requirements.html
* https://www.mathworks.com/help/parallel-computing/run-mex-functions-containing-cuda-code.html#mw_20acaa78-994d-4695-ab4b-bca1cfc3dbac

For MEX, I have 10.2 and need 12.2 of the CUDA toolkit:
MATLAB Release CUDA Toolkit Version
R2024a 12.2
...
R2020b 10.2

However:
* nVidia said the latest version was https://www.nvidia.com/Download/driverResults.aspx/230357/en-us/
* The repo said the highest version for 18.04 is 545: https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa

As root:
runlevel
  #5
systemctl get-default
  #graphical.target
systemctl set-default multi-user.target
systemctl reboot

As ed:
vncserver -kill :2
Killing Xtightvnc process ID 1844

As root:
#sh ./NVIDIA-Linux-x86_64-550.107.02.run
# The distribution-provided pre-install script failed!
#cat /var/log/nvidia-installer.log

apt-get update
apt install nvidia-driver-545
systemctl set-default graphical.target
systemctl reboot

Run MATLAB
gpuDevice
                      Name: 'NVIDIA TITAN RTX'
                     Index: 1
         ComputeCapability: '7.5'
     GraphicsDriverVersion: '545.29.06'
            ToolkitVersion: 12.2000

gpuDevice(2)
                      Name: 'NVIDIA TITAN Xp'
                     Index: 2
         ComputeCapability: '6.1'
            SupportsDouble: 1
     GraphicsDriverVersion: '545.29.06'
            ToolkitVersion: 12.2000

The messages were:
apt install nvidia-driver-545
The following additional packages will be installed:
  libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545
  libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545
  libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545
  nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545
  xserver-xorg-video-nvidia-545
The following packages will be REMOVED:
  libnvidia-cfg1-430 libnvidia-common-430 libnvidia-compute-430 libnvidia-compute-430:i386 libnvidia-decode-430
  libnvidia-decode-430:i386 libnvidia-encode-430 libnvidia-encode-430:i386 libnvidia-fbc1-430 libnvidia-fbc1-430:i386
  libnvidia-gl-430 libnvidia-gl-430:i386 libnvidia-ifr1-430 libnvidia-ifr1-430:i386 nvidia-compute-utils-430 nvidia-dkms-430
  nvidia-driver-430 nvidia-kernel-common-430 nvidia-kernel-source-430 nvidia-utils-430 xserver-xorg-video-nvidia-430
The following NEW packages will be installed:
  libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545
  libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545
  libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545 nvidia-driver-545
  nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545
  xserver-xorg-video-nvidia-545
0 upgraded, 21 newly installed, 21 to remove and 2 not upgraded.

Latest revision as of 17:07, 9 August 2024

This page details the build of our DIGITS DevBox. There's also a page giving information on Using the DevBox. nVIDIA, famous for their incredibly poor supply-chain and inventory management, have been saying "Please note that we are sold out of our inventory of the DIGITS DevBox, and no new systems are being built" since shortly after the Titan X was the latest and greatest thing (i.e., somewhere around 2016). But it's pretty straightforward to update their spec.

Introduction

Specification

Top1000.jpg

Our DIGITS DevBox, affectionately named after Lois McMaster Bujold's fifth God, has a XEON e5-2620v3 processor, 256GB of DDR4 RAM, two GPUs - one Titan RTX and one Titan Xp - with room for two more, a 500GB SSD hard drive (mounting /), and an 8TB RAID5 array bcached with a 512GB m.2 drive (mounting the /bulk share, which is available over samba). It runs Ubuntu 18.04, CUDA 10.0, cuDNN 7.6.1, Anaconda3-2019.03, python 3.7, tensorflow 1.13, digits 6, and other useful machine learning tools/libraries.

Documentation

The documentation from NVIDIA is here:

However, unfortunately, the form to get help from NVIDIA is closed [1][2][3]. And most of the other specs are limited to just the hardware [4][5][6][7]. The best instructions that I could find were:

The DevBox is currently unavailable from Amazon [8], and at around $15k buying one is prohibitive for most people. Some firms, including Lambda Labs [9] and Bizon-tech [10], are selling variants on them, but their prices are high too and the details on their specs are limited (the MoBo and config details are missing entirely).

But the parts' cost is perhaps $4-5k now for a massive update to the original spec! So this page goes through everything required to put one together and get it up and running.

Hardware

Description

We mostly followed the original hardware spec from NVIDIA, updating the capacity of the drives and other minor things, as we had many of these parts available as salvage from other boxes. We had to buy the ASUS X99-E WS motherboard (we got the ASUS X99-E WS/USB variant as the original wasn't available and this one has USB3.1), as well as some new drives, just for this project.

Front1000.jpg

We opted to use a Xeon e5-2620v3 processor, rather than the Core i7-5930K. We had both available and both support 40 PCIe lanes, mount in the LGA 2011-v3 socket, have 6 cores, 15MB caches, etc. Although the i7 has a faster clock speed, the Xeon takes registered (buffered), ECC DDR4 RDIMMs, which means we can put 256GB on the board, rather than just 64GB. For the GPUs, we have a TITAN RTX and an older TITAN Xp available to start, and we can add a 1080Ti later, or buy some additional GPUs if needed. We also put the whole thing in a Rosewill RSV-L4000 case.

Parts List

Quantity Part
1 ASUS X99-E WS/USB 3.1 LGA 2011-v3 Intel X99 SATA 6Gb/s USB 3.1 USB 3.0 CEB Intel Motherboard
1 Intel Haswell Xeon e5-2620v3, 6 core @ 2.4GHz, 6x256KB level 2 cache, 15MB level 3 cache, socket LGA 2011-v3
8 Crucial DDR4 RDIMM, 2133MHz, Registered (buffered) and ECC, 32GB
1 NVIDIA TITAN RTX DirectX 12 900-1G150-2500-000 SB 24GB 384-Bit GDDR6 HDCP Ready Video Card
1 NVIDIA TITAN Xp Graphics Card (900-1G611-2530-000)
1 SAMSUNG 970 EVO PLUS 500GB Internal Solid State Drive (SSD) MZ-V7S500B/AM
1 Samsung 850 EVO 500GB 2.5-Inch SATA III Internal SSD (MZ-75E500/EU)
3 WD Red 4TB NAS Hard Disk Drive - 5400 RPM Class SATA 6Gb/s 64MB Cache 3.5 Inch - WD40EFRX
1 DVDRW: Asus 24x DVD-RW Serial-ATA Internal OEM Optical Drive DRW-24B1ST
1 EVGA SuperNOVA 1600 T2 220-T2-1600-X1 80+ TITANIUM 1600W Fully Modular EVGA ECO Mode Power Supply
1 Rosewill RSV-L4000 - 4U Rackmount Server Case / Chassis - 8 Internal Bays, 7 Cooling Fans Included
1 Rosewill RSV-SATA-Cage-34 - Hard Disk Drives - Black, 3 x 5.25" to 4 x 3.5" Hot-Swap - SATA III / SAS - Cage
1 Rosewill RDRD-11003 2.5" SSD / HDD Mounting Kit for 3.5" Drive Bay w/ 60mm Fan
3 Corsair ML120 PRO LED CO-9050043-WW 120mm Blue LED 120mm Premium Magnetic Levitation PWM Fan
2 ARCTIC F8 PWM Fluid Dynamic Bearing Case Fan, 80mm PWM Speed Control, 31 CFM at 22dBA

Build notes

Old notes on a prior look at a GPU Build are on the wiki too.

Back1000.jpg

There weren't any particularly noteworthy things about the hardware build. The GPUs need to go in slots 1 and 3, which means they sit tight on each other. We put the Titan Xp in slot 1 (and plugged the monitor into its HDMI port), because then the fans for the Titan RTX (which we expect will get heavier use) are in the clear for now. The case fans were set up in a push-and-pull arrangement, and the hot-swap bay was put in the center position to allow as much airflow past the GPUs as possible.

BIOS

The initial BIOS boot was weird - the machine ran at full power for a short period then powered off multiple times before finally giving a single system beep and loading the BIOS. It may have been memory checking or some such.

We did NOT update the BIOS. It didn't need it. The m.2 drive is visible in the BIOS and will be used as a cache for the RAID 5 array (using bcache). The GPUs are recognized as PCIe devices in the tool section. And all of the SATA drives are being recognized.

We then made the following changes:

  • Set the three hard disks to hot-swap enable
  • Set the fans to PWM, which drastically cuts down the noise, and set the lower thresholds to 200 (not that it seemed to matter, they seem to be idling at around 1k)
  • List the OS as "Other OS" rather than windows, and set enhanced mode to disabled
  • Delete the PK to disable secure boot
  • Change the boot order to be CD first (not as UEFI), and then the Samsung 850

Notes:

  • We will build the RAID 5 array in software, rather than using the X99 RAID through the BIOS

What's really crucial is that all the hardware is visible and that we are NOT using UEFI. With UEFI, there is an issue with the drivers not being properly signed under secure boot.

Software

Main OS Install

Install Ubuntu 18.04 (note that the original DIGITS DevBox ran 14.04), not the live version, from a freshly burnt DVD. If you install the HWE version, you don't need to run apt-get install --install-recommends linux-generic-hwe-18.04 at the end.

In the installer

Choose the first network hardware option and make sure that the second (right most) network port is connected to a DHCP broadcasting router.

Under partitions:

Partitions1000.jpg
  1. Put one large partition, formatted as ext4, mounted as /, bootable on the 850
  2. Partition each SATA drive as RAID
  3. Put one large partition, formatted as ext4, not mounted on the 970 (for later)
  4. Put software RAID5 over the 3 SATA drives, format the RAID as ext4 and mount as /bulk

Install SSH and Samba. When prompted, add the MBR to the front of the 850.

First boot

After a reboot, the screen freezes if you didn't install HWE. Either change the bootloader, adding nomodeset (see https://www.pugetsystems.com/labs/hpc/The-Best-Way-To-Install-Ubuntu-18-04-with-NVIDIA-Drivers-and-any-Desktop-Flavor-1178/#step-4-potential-problem-number-1), or just SSH onto the box and fix that now.

Run as root:

apt-get update
apt-get dist-upgrade
apt-get install --install-recommends linux-generic-hwe-18.04 

Check the release:

lsb_release -a

Give the box a reboot!

X Windows

If you install the video driver before installing Xwindows, you will need to manually edit the Xwindows config files. So, now install the X window system. The easiest way is:

tasksel
 And choose your favorite. We used Ubuntu Desktop.

And reboot again to make sure that everything is working nicely.

Video Drivers

The first build of this box was done with an installation of CUDA 10.1, which automatically installed version 418.67 of the NVIDIA driver. We then installed CUDA 10.0 under conda to support Tensorflow 1.13. All went mostly well, and the history of this page contains the instructions. However, at some point, likely because of an OS update, the video driver(s) stopped working. This page now describes the second build (as if it were a build from scratch). Addressing Ubuntu NVIDIA Issues provides additional information.

Hardware and Drivers

Check the hardware is being seen and what driver is being used with:

 lspci -vk

Currently we are using the nouveau driver for the Xp, and have no driver loaded for the RTX.

You can also list the driver using ubuntu-drivers, which is supposed to tell you which NVIDIA driver is recommended:

apt-get install ubuntu-drivers-common
ubuntu-drivers devices
 == /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0 ==
 modalias : pci:v000010DEd00001B02sv000010DEsd000011DFbc03sc00i00
 vendor   : NVIDIA Corporation
 model    : GP102 [TITAN Xp]
 driver   : nvidia-driver-390 - distro non-free recommended
 driver   : xserver-xorg-video-nouveau - distro free builtin

But the 390 is the only driver available from the main repo. Add the experimental repo for more options:

add-apt-repository ppa:graphics-drivers/ppa
apt update
ubuntu-drivers devices
 == /sys/devices/pci0000:00/0000:00:03.0/0000:03:00.0/0000:04:10.0/0000:05:00.0 ==
 modalias : pci:v000010DEd00001B02sv000010DEsd000011DFbc03sc00i00
 vendor   : NVIDIA Corporation
 model    : GP102 [TITAN Xp]
 driver   : nvidia-driver-418 - third-party free
 driver   : nvidia-driver-415 - third-party free
 driver   : nvidia-driver-430 - third-party free recommended
 driver   : nvidia-driver-396 - third-party free
 driver   : nvidia-driver-390 - distro non-free
 driver   : nvidia-driver-410 - third-party free
 driver   : xserver-xorg-video-nouveau - distro free builtin
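
If you want to script the choice rather than read it off the screen, the recommended line can be pulled out of that listing. This is only a sketch that assumes the output format shown above; the function name recommended_driver is ours and is not part of ubuntu-drivers.

```shell
# Print the package name from the line that ubuntu-drivers marks "recommended".
# Reads the text of `ubuntu-drivers devices` on stdin (format assumed as above).
recommended_driver() {
    awk '$1 == "driver" && /recommended/ { print $3; exit }'
}

# Example with two lines copied from the listing above:
printf '%s\n' \
    ' driver   : nvidia-driver-418 - third-party free' \
    ' driver   : nvidia-driver-430 - third-party free recommended' \
    | recommended_driver
```

On the box itself this would be used as `ubuntu-drivers devices | recommended_driver`. It prints nothing if no line is marked recommended.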

Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot to a text terminal so that it isn't loaded.

apt-get install build-essential
gcc --version
vi /etc/modprobe.d/blacklist-nouveau.conf
 blacklist nouveau
 options nouveau modeset=0
update-initramfs -u
shutdown -r now
 Reboot to a text terminal
lspci -vk
 Shows no kernel driver in use!
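
The vi step can also be done non-interactively, which helps if the build is being repeated. A minimal sketch, assuming the same two lines and the same default path as above (the function name is hypothetical):

```shell
# Write the two nouveau blacklist lines to the given file.
# Defaults to the path used above; writing there requires root.
write_nouveau_blacklist() {
    target=${1:-/etc/modprobe.d/blacklist-nouveau.conf}
    printf '%s\n' 'blacklist nouveau' 'options nouveau modeset=0' > "$target"
}

# On the box (as root), followed by the same steps as above:
#   write_nouveau_blacklist && update-initramfs -u && shutdown -r now
```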

Install the driver!

apt install nvidia-driver-430

CUDA

Get CUDA 10.0, rather than 10.1. Although 10.1 is the latest version at the time of writing, it won't work with Tensorflow 1.13, so you'll just end up installing 10.0 under conda anyway.

Essentially, first install build-essential, which gets you gcc.

Then run the installer script and DO NOT install the driver (don't worry about the warning, it will work fine!):

sh cuda_10.0.130_410.48_linux.run
	Do you accept the previously read EULA?
	accept/decline/quit: accept

	Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 410.48?
	(y)es/(n)o/(q)uit: n

	Install the CUDA 10.0 Toolkit?
	(y)es/(n)o/(q)uit: y

	Enter Toolkit Location
	 [ default is /usr/local/cuda-10.0 ]:

	Do you want to install a symbolic link at /usr/local/cuda?
	(y)es/(n)o/(q)uit: y

	Install the CUDA 10.0 Samples?
	(y)es/(n)o/(q)uit: y

	Enter CUDA Samples Location
	 [ default is /home/ed ]:

	Installing the CUDA Toolkit in /usr/local/cuda-10.0 ...
	Missing recommended library: libGLU.so
	Missing recommended library: libX11.so
	Missing recommended library: libXi.so
	Missing recommended library: libXmu.so
	Missing recommended library: libGL.so

	Installing the CUDA Samples in /home/ed ...
	Copying samples to /home/ed/NVIDIA_CUDA-10.0_Samples now...
	Finished copying samples.

	===========
	= Summary =
	===========

	Driver:   Not Selected
	Toolkit:  Installed in /usr/local/cuda-10.0
	Samples:  Installed in /home/ed, but missing recommended libraries

	Please make sure that
	 -   PATH includes /usr/local/cuda-10.0/bin
	 -   LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root

	To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin

	Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.

	***WARNING: Incomplete installation! This installation did not install the CUDA Driver. A driver of version at least 384.00 is required 
for CUDA 10.0 functionality to work.
	To install the driver using this installer, run the following command, replacing <CudaInstaller> with the name of this run file:
	    sudo <CudaInstaller>.run -silent -driver

	Logfile is /tmp/cuda_install_2807.log

Now fix the paths. To do this for a single user do:

export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
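
The ${VAR:+:${VAR}} form in those exports only adds the colon separator when the variable is already set, which avoids a stray empty entry. The sketch below shows the same idea as a small helper that also skips directories that are already listed; prepend_path is a name we made up for illustration.

```shell
# Print LIST with DIR prepended, unless DIR is already one of its entries.
# Usage: prepend_path DIR LIST
prepend_path() {
    dir=$1
    list=$2
    case ":$list:" in
        *":$dir:"*) printf '%s\n' "$list" ;;                 # already present
        *)          printf '%s\n' "$dir${list:+:$list}" ;;   # colon only if LIST is non-empty
    esac
}

# Example for the current shell:
#   PATH=$(prepend_path /usr/local/cuda-10.0/bin "$PATH"); export PATH
```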

But it is better to fix it for everyone by editing your environment file:

vi /etc/environment
 PATH="/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
 LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"

With version cuda 10.0, you don't need to edit rc.local to start the persistence daemon:

/usr/bin/nvidia-persistenced --verbose

Instead, nvidia-persistenced runs as a service.

Test the installation

Make the samples...

cd /usr/local/cuda-10.0/samples
make

And change into the sample directory and run the tests:

cd /usr/local/cuda-10.0/samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest 

Everything should be good at this point!
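
As far as we know, both of these samples finish by printing a line containing "Result = PASS", so the check can be scripted. This is a sketch under that assumption; sample_passed is our own name.

```shell
# Succeed if the text on stdin contains the samples' "Result = PASS" line.
sample_passed() {
    grep -q 'Result = PASS'
}

# On the box, from the release directory used above:
#   ./deviceQuery   | sample_passed && echo "deviceQuery OK"
#   ./bandwidthTest | sample_passed && echo "bandwidthTest OK"
```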

Bcache

The RAID5 array is set up and mounted as /bulk. We need to add the cache on the m.2 drive. Begin by installing bcache:

apt-get install bcache-tools
It was already installed and the newest version

See what we have:

fdisk -l

This gives us:

  • /dev/nvme0n1p1 m.2
  • /dev/sda RAID disk
  • /dev/sdb RAID disk
  • /dev/sdc RAID disk
  • /dev/md0 RAID array
  • /dev/sdd 850

The m.2 is not mounted. This can be seen by checking lsblk (or mount or df):

lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda           8:0    0   3.7T  0 disk
└─sda1        8:1    0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdb           8:16   0   3.7T  0 disk
└─sdb1        8:17   0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdc           8:32   0   3.7T  0 disk
└─sdc1        8:33   0   3.7T  0 part
  └─md0       9:0    0   7.3T  0 raid5 /bulk
sdd           8:48   0 465.8G  0 disk
└─sdd1        8:49   0 465.8G  0 part  /
sr0          11:0    1  1024M  0 rom
nvme0n1     259:0    0 465.8G  0 disk
└─nvme0n1p1 259:1    0 465.8G  0 part

Check the mdadm.conf file and fstab:

cat /etc/mdadm/mdadm.conf
 ...
 ARRAY /dev/md/0  metadata=1.2 UUID=af515d37:8a0e05a1:59338d18:23f5af21 name=bastard:0

cat /etc/fstab
 UUID=475ad41e-3d64-4c90-8fbc-9289c050acea /               ext4    errors=remount-ro 0 1
 UUID=aa65554a-24d9-450a-b10c-63c5c6a4b48a /bulk           ext4    defaults 0 2
 /swapfile                                 none            swap    sw 0 0

Note that the second UUID refers to /dev/md0, whereas the UUID in the contents of mdadm.conf is the UUID of the 3 RAID5 drives together:

blkid /dev/md0
/dev/md0: UUID="aa65554a-24d9-450a-b10c-63c5c6a4b48a" TYPE="ext4"

Note we have an active RAID5 array:

cat /proc/mdstat
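
Reading mdstat by eye is fine, but the same check can be scripted before doing anything destructive. The sketch below assumes the usual mdstat layout, where the array line ("md0 : active raid5 ...") is followed by a line ending in a member map such as [UUU], with an underscore marking a missing member. The function name and the sample text in the comments are illustrative, not output captured from this box.

```shell
# Succeed if array DEV is active and its member map has no "_" (missing disk).
# Usage: md_healthy DEV [FILE]   (FILE defaults to /proc/mdstat)
md_healthy() {
    awk -v dev="$1" '
        $1 == dev && $3 == "active" { found = 1; next }     # e.g. "md0 : active raid5 ..."
        found && /\[[U_]+\]$/ { ok = ($NF !~ /_/); exit }   # e.g. "... [3/3] [UUU]"
        END { exit !(found && ok) }
    ' "${2:-/proc/mdstat}"
}

# On the box:
#   md_healthy md0 && echo "md0 is active with all members up"
```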

Instructions for taking apart and/or (re-)creating a RAID array are here:

Instructions on building a bcache are here:

Unmount the RAID array:

umount /dev/md0

Wipe both the m.2 and the RAID5 array:

wipefs -a /dev/nvme0n1p1
wipefs -a /dev/md0

Make the bcache, formatting both drives (md0 as backing, m.2 as cache). Note that when you do it in one command, the cache is attached to the backing device automatically.

make-bcache -B /dev/md0 -C /dev/nvme0n1p1

If you screw up, cd to /sys/fs/bcache/whatever and then ls -l cache0. If there is an entry in there echo 1 > stop. This unregisters the cache and should let you start over.

Check the new bcache array is there, format it and mount it:

ls /dev/bcache*
mkfs.ext4 /dev/bcache0
mount /dev/bcache0 /bulk

Now we need to update fstab (see https://help.ubuntu.com/community/Fstab) with the right UUID and spec:

blkid /dev/bcache0
  UUID="4c63f20b-ad35-477d-bfaa-82571beba841" TYPE="ext4"
cp /etc/fstab /etc/fstab.org
vi /etc/fstab
 Comment out old RAID array entry
 Add new entry:
  UUID=4c63f20b-ad35-477d-bfaa-82571beba841 /bulk ext4 rw 0 0
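
The fstab edit can be done without vi. A minimal sketch, assuming the old entry starts with UUID= followed by the old UUID and that the new entry uses the same "ext4 rw 0 0" fields as above; fstab_switch is a hypothetical helper name.

```shell
# Back up FILE to FILE.org, comment out the entry for OLD_UUID,
# and append an entry for NEW_UUID mounted at MOUNTPOINT.
# Usage: fstab_switch FILE OLD_UUID NEW_UUID MOUNTPOINT
fstab_switch() {
    file=$1
    old=$2
    new=$3
    mnt=$4
    cp "$file" "$file.org"
    # "&" re-inserts the matched text, so the old line gains a leading "#"
    sed "s|^UUID=$old|#&|" "$file.org" > "$file"
    printf 'UUID=%s %s ext4 rw 0 0\n' "$new" "$mnt" >> "$file"
}

# On the box (as root), with the UUIDs reported by blkid:
#   fstab_switch /etc/fstab aa65554a-24d9-450a-b10c-63c5c6a4b48a \
#       4c63f20b-ad35-477d-bfaa-82571beba841 /bulk
```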

And update your boot image and give it a reboot to check the new bcache array comes back up ok:

update-initramfs -u
shutdown -r now

Samba

These instructions are taken from the Research_Computing_Configuration#Samba page with only minor modifications. This guide is helpful: https://linuxconfig.org/how-to-configure-samba-server-share-on-ubuntu-18-04-bionic-beaver-linux

Check samba is installed

samba --version

Then fix the conf file:

cp /etc/samba/smb.conf /etc/samba/smb.conf.bak
vi /etc/samba/smb.conf
	workgroup=BASTARDGROUP
 	usershare allow guests = no
	;comment out the [printers] and [print$] sections
    
	[bulk]
	comment = Bulk RAID Array
	path = /bulk
	browseable = yes
	create mask= 0775
	directory mask = 0775
	read only = no
	guest ok = no

Test the parameters, change the permissions and ownership:

testparm /etc/samba/smb.conf
chmod 770 /bulk
groupadd smbusers
chown :smbusers /bulk

Now create the researcher account, and add it to the samba share group

cat /etc/group
groupadd -g 1002 researcher
useradd -g researcher -G smbusers -s /bin/bash -p 1234 -d /home/researcher -m researcher
passwd researcher
	hint: littleamount
smbpasswd -a researcher

Finally restart samba:

systemctl restart smbd
systemctl restart nmbd

Check it works:

smbclient -L localhost
(no root password)

And add users to the samba group (if not already):

usermod -G smbusers researcher #Note that this sets the supplementary groups and will overwrite sudo or other group assignments, so don't do it with your main account. Instead just append the group:
 usermod -aG smbusers ed

Dev Tools

DIGITS

This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/

Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use cuda 10.0

...
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:10.0-base nvidia-smi

Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):

docker pull nvidia/digits

Finally run DIGITS inside a docker container (see https://github.com/NVIDIA/nvidia-docker/wiki/DIGITS for other options):

docker run --runtime=nvidia --name digits -d -p 5000:5000 nvidia/digits

And open a browser to http://localhost:5000/ to see DIGITS.

Documentation:

Note: you can clean up stopped docker containers (along with unused networks and dangling images) with

docker system prune

cuDNN

Documentation on installing cuDNN is here:

First, make an install directory in /bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:

cd /bulk/install/
dpkg -i libcudnn7_7.6.1.34-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-dev_7.6.1.34-1+cuda10.0_amd64.deb
dpkg -i libcudnn7-doc_7.6.1.34-1+cuda10.0_amd64.deb

And test it:

cp -r /usr/src/cudnn_samples_v7/ $HOME
cd  $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN
 Test passed!

Python Based

Now install Anaconda, so that we have python 3, and can pip and conda install things. Instructions for installing Anaconda on Ubuntu 18.04LTS (e.g., https://docs.anaconda.com/anaconda/install/linux/) all recommend using the shell script.

From https://www.anaconda.com/distribution/ the latest Python version is 3.7, so:

cd /bulk/install
curl -O https://repo.anaconda.com/archive/Anaconda3-2019.03-Linux-x86_64.sh
sha256sum Anaconda3-2019.03-Linux-x86_64.sh
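
The sha256sum output still has to be compared against the hash that Anaconda publishes for the installer. That comparison can be scripted as below; this is a sketch, and the expected hash is deliberately left as a parameter because we have not copied the published value here.

```shell
# Succeed if FILE has the SHA-256 digest EXPECTED (64 hex characters).
# Usage: verify_sha256 FILE EXPECTED
verify_sha256() {
    # sha256sum -c reads "HASH  FILENAME" lines; --status suppresses output
    printf '%s  %s\n' "$2" "$1" | sha256sum -c --status -
}

# On the box, with the hash taken from Anaconda's published list:
#   verify_sha256 Anaconda3-2019.03-Linux-x86_64.sh "$published_hash" && echo "checksum OK"
```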

As user researcher, run the installation (this installs python 3.7.3):

bash Anaconda3-2019.03-Linux-x86_64.sh
 accept the install location: /home/researcher/anaconda3
 accept the initialization by running conda init
Flush the local env:
 source ~/.bashrc

Tensorflow

Now install tensorflow using pip (see https://www.tensorflow.org/install/pip):

As root:
 apt install python3-pip
 apt install virtualenv
 pip3 install -U virtualenv
As researcher:
 cd /home/researcher
 virtualenv --system-site-packages -p python3 ./venv
 source ./venv/bin/activate  # sh, bash, ksh, or zsh
 pip install --upgrade tensorflow-gpu
 python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"

Note: to deactivate the virtual environment:

deactivate

Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.

PyTorch and SciKit

Run the following as researcher (in venv):

conda install -c anaconda numpy
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
conda install -c anaconda scikit-learn

Refs:

Other packages

The following are not yet installed:

Theano

Theano v.1 requires python >=3.4 and <3.6. We are currently running 3.7. If we decide to install theano, we'll need to set up another version of python and another virtual environment. See:

VNC

In order to use the graphical interface for Matlab and other applications, we need a VNC server.

First, install the VNC client remotely. We use the standalone exe from TigerVNC.

Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04

cd /root
apt-get install xfce4 xfce4-goodies

As user

sudo apt-get install tightvncserver
vncserver
 set password for user (ailia)
vncserver -kill :1
mv ~/.vnc/xstartup ~/.vnc/xstartup.bak
vi ~/.vnc/xstartup
 #!/bin/bash
 xrdb $HOME/.Xresources
 startxfce4 &
vncserver
sudo vi /etc/systemd/system/vncserver@.service
 [Unit]
 Description=Start TightVNC server at startup
 After=syslog.target network.target  
 
 [Service]
 Type=forking
 User=uname
 Group=uname
 WorkingDirectory=/home/uname
 
 PIDFile=/home/uname/.vnc/%H:%i.pid
 ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
 ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
 ExecStop=/usr/bin/vncserver -kill :%i
 
 [Install]
 WantedBy=multi-user.target

Note that changing the color depth breaks it!

To make changes (or after the edit)

sudo systemctl daemon-reload
sudo systemctl enable vncserver@2.service
vncserver -kill :2
sudo systemctl start vncserver@2
sudo systemctl status vncserver@2

Stop the server with

sudo systemctl stop vncserver@2

Note that we are using :2 because :1 is running our regular Xwindows GUI.

Instructions on how to set up an IP tunnel using PuTTY:

https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/

Connection Issues

Coming back to this, I had issues connecting. I set up the tunnel using the saved profile in puTTY.exe and checked to see which local port was listening (it was 5901) and not firewalled using the listening ports tab under network on resmon.exe (it said allowed, not restricted under firewall status). VNC seemed to be running fine on Bastard, and I tried connecting to localhost::1 (that is 5901 on the localhost, through the tunnel to 5902 on Bastard) using VNC Connect by RealVNC. The connection was refused.

I checked it was listening and there was no firewall:

netstat -tlpn
 tcp        0      0 0.0.0.0:5902            0.0.0.0:*               LISTEN      2025/Xtightvnc
ufw status
 Status: inactive

The localhost port seems to be open and listening just fine:

Test-NetConnection 127.0.0.1 -p 5901

So, presumably, there must be something wrong with the tunnel itself.

Ignoring the SSH tunnel worked fine: Connect to 192.168.2.202::5902 using the TightVNC (or RealVNC, etc.) client.
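
For reference, the port numbers in this section follow the VNC convention that display :N listens on TCP port 5900 + N, which is why display :2 shows up as 5902 in netstat. A small sketch of that arithmetic follows. The ssh command in the comments is the OpenSSH equivalent of the PuTTY tunnel described above; it is an assumption on our part and was not what we actually ran.

```shell
# Print the TCP port that a VNC server on display :N listens on.
# Usage: vnc_port N
vnc_port() {
    echo $(( 5900 + $1 ))
}

# Forward local port 5901 to display :2 on the box (user and IP as used above):
#   ssh -N -L "$(vnc_port 1):localhost:$(vnc_port 2)" ed@192.168.2.202
# and then point the VNC client at localhost:5901.
```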

Later Notes

Change the resolution

I came back and changed the resolution to make it work on one of my portrait desktop monitors. See https://www.tightvnc.com/vncserver.1.php

As root:

vi /etc/systemd/system/vncserver@.service
 Change line:
  ExecStart=/usr/bin/vncserver -depth 24 -geometry 1440x2560 :%i
 (Note that the size is 2160x3840 divided by 150%.) Leave the color depth alone, as changing it breaks the server.
systemctl daemon-reload
systemctl enable vncserver@2.service

As Ed:

vncserver -kill :2
sudo systemctl start vncserver@2
sudo systemctl status vncserver@2

Exit full screen with ctrl-alt-shift-f.
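
The geometry chosen above is just the monitor's native resolution divided by the display scaling. A sketch of that arithmetic, in case the monitor or scaling changes (scale_geometry is our own helper name, and it uses integer division):

```shell
# Print WIDTHxHEIGHT divided by a scaling percentage.
# Usage: scale_geometry WIDTHxHEIGHT PERCENT
scale_geometry() {
    w=${1%x*}   # text before the "x"
    h=${1#*x}   # text after the "x"
    echo "$(( w * 100 / $2 ))x$(( h * 100 / $2 ))"
}

scale_geometry 2160x3840 150   # prints 1440x2560
```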

Cut And Paste

Also, try to fix the cut-and-paste issue. See, for example, https://unix.stackexchange.com/questions/35030/how-can-i-copy-paste-data-to-and-from-the-windows-clipboard-to-an-opensuse-clipb

As root:

apt-get install autocutsel
vi ~/.vnc/xstartup
 #!/bin/bash
 xrdb $HOME/.Xresources
 autocutsel -fork  
 startxfce4 &

Though this might have been working fine anyway. Just change the terminal and all will be well.

Use XFCE terminal

Change Settings: Preferred Applications -> Utilities -> Terminal to XFCE

Note that this seems to fix everything but the instructions for customizing the menu are here: https://wiki.xfce.org/howto/customize-menu

cat /etc/xdg/menus/xfce-applications.menu

RDP

I also installed xrdp:

apt install xrdp
adduser xrdp ssl-cert
#Check the status and that it is listening on 3389
systemctl status xrdp
netstat -tln
 #It is listening... 
vi /etc/xrdp/xrdp.ini
 #See https://linux.die.net/man/5/xrdp.ini
systemctl restart xrdp

This gave a dead session (a flat light blue screen with nothing on it), which finally yielded a connection log which said "login successful for display 10, start connecting, connection problems, giving up, some problem."

 cat /var/log/xrdp-sesman.log

There could be some conflict between VNC and RDP. systemctl status xrdp shows "xrdp_wm_log_msg: connection problem, giving up".

I tried without success:

gsettings set org.gnome.Vino require-encryption false
 https://askubuntu.com/questions/797973/error-problem-connecting-windows-10-rdp-into-xrdp
vi /etc/X11/Xwrapper.config
 allowed_users = anybody
 This was promising as it was previously set to console.
 https://www.linuxquestions.org/questions/linux-software-2/xrdp-under-debian-9-connection-problem-4175623357/#post5817508
apt-get install xorgxrdp-hwe-18.04
 Couldn't find the package... This lead was promising as it applies to 18.04.02 HWE, which is what I'm running
 https://www.nakivo.com/blog/how-to-use-remote-desktop-connection-ubuntu-linux-walkthrough/
dpkg -l |grep xserver-xorg-core
 ii  xserver-xorg-core                          2:1.19.6-1ubuntu4.3                          amd64        Xorg X server - core server
 Which seems ok, despite having a problem with XRDP and Ubuntu 18.04 HWE documented very clearly here: http://c-nergy.be/blog/?p=13972

There is clearly an issue with Ubuntu 18.04 and XRDP. The solution seems to be to downgrade xserver-xorg-core and some related packages, which can be done with an install script (https://c-nergy.be/blog/?p=13933) or manually. But I don't want to do that, so I removed xrdp and went back to VNC!

apt remove xrdp

Other Software

I installed the community edition of PyCharm:

snap install pycharm-community --classic
 #Restart the local terminal so that it has updated paths (after a snap install, etc.)
/snap/pycharm-community/214/bin/pycharm.sh

On launch, you get some config options. I chose to install and enable:

  • IdeaVim (a VI editor emulator)
  • R
  • AWS Toolkit

Make a launcher: In /usr/share/applications:

vi pycharm.desktop
 [Desktop Entry]
 Version=2020.2.3
 Type=Application
 Name=PyCharm
 Icon=/snap/pycharm-community/214/bin/pycharm.png
 Exec="/snap/pycharm-community/214/bin/pycharm.sh" %f
 Comment=The Drive to Develop
 Categories=Development;IDE;
 Terminal=false
 StartupWMClass=jetbrains-pycharm

Also, create a launcher on the desktop with the same info.

Note that when I came back to the box the launcher didn't work...
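
One plausible cause, which we did not confirm, is that the snap was refreshed in the meantime, so the hard-coded revision directory (214) no longer exists. snapd keeps a "current" symlink next to the revision directories, so a launcher that points at /snap/pycharm-community/current should survive refreshes. A sketch of writing such a launcher follows; the keys are copied from the entry above, except that Version is left out and the paths are changed, and the function name is hypothetical.

```shell
# Write a PyCharm .desktop launcher that uses snapd's "current" symlink
# instead of a fixed revision number.
# Usage: write_pycharm_launcher [FILE]   (writing the default path needs root)
write_pycharm_launcher() {
    target=${1:-/usr/share/applications/pycharm.desktop}
    base=/snap/pycharm-community/current/bin
    cat > "$target" <<EOF
[Desktop Entry]
Type=Application
Name=PyCharm
Icon=$base/pycharm.png
Exec="$base/pycharm.sh" %f
Comment=The Drive to Develop
Categories=Development;IDE;
Terminal=false
StartupWMClass=jetbrains-pycharm
EOF
}
```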

MATLAB

I installed MATLAB R2024a by downloading the zip, running

sudo ./install

and using the defaults of /usr/local/MATLAB/R2024a etc. The license number is 41201644.

Upgrading the nVIDIA Drivers

In MATLAB, I ran:

gpuDevice
 Error using gpuDevice (line 26)
 Graphics driver is out of date. Download and install the latest graphics driver for your GPU from NVIDIA.

Some quick checks showed that I was using driver version 430.26 on Ubuntu 18.04.02.

nvidia-smi
lsb_release -a
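
Driver versions are dotted numbers, so comparing them as plain strings or plain integers goes wrong. A sketch of a version comparison using GNU sort -V follows. The minimum driver version that a given MATLAB release needs is not stated on this page, so it is left as a parameter here, and version_ge is our own helper name.

```shell
# Succeed if dotted version A is greater than or equal to dotted version B.
# Usage: version_ge A B
version_ge() {
    # sort -V orders version strings; if B sorts first (or they are equal), A >= B
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# On the box, with the required minimum looked up from the vendor documentation:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$installed" "$required" || echo "driver $installed is too old"
```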

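I couldn't quite get MATLAB to tell me what I needed:

  • https://www.mathworks.com/help/parallel-computing/gpu-computing-requirements.html
  • https://www.mathworks.com/help/parallel-computing/run-mex-functions-containing-cuda-code.html#mw_20acaa78-994d-4695-ab4b-bca1cfc3dbac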

For MEX, I have 10.2 and need 12.2 of the CUDA toolkit:

MATLAB Release	CUDA Toolkit Version
R2024a	12.2
...
R2020b	10.2

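However:

  • nVidia said the latest version was https://www.nvidia.com/Download/driverResults.aspx/230357/en-us/
  • The repo said the highest version for 18.04 is 545: https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa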

As root:

runlevel
 #5
systemctl get-default
 #graphical.target
systemctl set-default multi-user.target
systemctl reboot

As ed:

vncserver -kill :2

Killing Xtightvnc process ID 1844

As root:

#sh ./NVIDIA-Linux-x86_64-550.107.02.run
# The distribution-provided pre-install script failed!
#cat /var/log/nvidia-installer.log
apt-get update
apt install nvidia-driver-545
systemctl set-default graphical.target
systemctl reboot

Run MATLAB

gpuDevice
  Name: 'NVIDIA TITAN RTX'
                    Index: 1
        ComputeCapability: '7.5'
    GraphicsDriverVersion: '545.29.06'
           ToolkitVersion: 12.2000
gpuDevice(2)
                     Name: 'NVIDIA TITAN Xp'
                    Index: 2
        ComputeCapability: '6.1'
           SupportsDouble: 1
    GraphicsDriverVersion: '545.29.06'
           ToolkitVersion: 12.2000

The messages were:

apt install nvidia-driver-545
	The following additional packages will be installed:
	  libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545
	  libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545
	  libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545
	  nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545
	  xserver-xorg-video-nvidia-545
	The following packages will be REMOVED:
	  libnvidia-cfg1-430 libnvidia-common-430 libnvidia-compute-430 libnvidia-compute-430:i386 libnvidia-decode-430
	  libnvidia-decode-430:i386 libnvidia-encode-430 libnvidia-encode-430:i386 libnvidia-fbc1-430 libnvidia-fbc1-430:i386
	  libnvidia-gl-430 libnvidia-gl-430:i386 libnvidia-ifr1-430 libnvidia-ifr1-430:i386 nvidia-compute-utils-430 nvidia-dkms-430
	  nvidia-driver-430 nvidia-kernel-common-430 nvidia-kernel-source-430 nvidia-utils-430 xserver-xorg-video-nvidia-430
	The following NEW packages will be installed:
	  libnvidia-cfg1-545 libnvidia-common-545 libnvidia-compute-545 libnvidia-compute-545:i386 libnvidia-decode-545
	  libnvidia-decode-545:i386 libnvidia-encode-545 libnvidia-encode-545:i386 libnvidia-extra-545 libnvidia-fbc1-545
	  libnvidia-fbc1-545:i386 libnvidia-gl-545 libnvidia-gl-545:i386 nvidia-compute-utils-545 nvidia-dkms-545 nvidia-driver-545
	  nvidia-firmware-545-545.29.06 nvidia-kernel-common-545 nvidia-kernel-source-545 nvidia-utils-545
	  xserver-xorg-video-nvidia-545
	0 upgraded, 21 newly installed, 21 to remove and 2 not upgraded.