Give the box a reboot!
===Video Drivers===
ubuntu-drivers devices shows the distro's free built-in driver:
driver : xserver-xorg-video-nouveau - distro free builtin
Then blacklist the nouveau driver (see https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#runfile-nouveau) and reboot (to a text terminal, if you have deviated from these instructions and already installed X Windows) so that it isn't loaded.
apt-get install build-essential
gcc --version
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.168_418.67_linux.run
vi /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
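The NVIDIA guide linked above also adds a second line to this file, disabling nouveau's kernel mode setting:
options nouveau modeset=0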
update-initramfs -u
shutdown -r now
Reboot to a text terminal
lspci -vk
Shows no kernel driver in use!
You can install the driver directly either now or after installing X Windows. If you do it before installing X Windows, you will need to manually edit the X Windows config files.
apt install nvidia-driver-430
Everything should be good at this point!
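As a quick sanity check (after a reboot, if the module isn't loaded yet), nvidia-smi should list the GPUs and the 430 driver:
nvidia-smi
If you also want the CUDA 10.1 toolkit from the runfile downloaded earlier (later sections use /usr/local/cuda-10.1), run it now and deselect the driver in its menu, since the driver is already installed via apt:
sh cuda_10.1.168_418.67_linux.run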
===X Windows===
Now install the X window system. The easiest way is:
tasksel
And choose your favorite. We used Ubuntu Desktop.
And reboot again to make sure that everything is working nicely.
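If you would rather skip the interactive menu, the same task can be installed directly:
tasksel install ubuntu-desktop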
===DIGITS===
This section follows https://developer.nvidia.com/rdp/digits-download. Install Docker CE first, following https://docs.docker.com/install/linux/docker-ce/ubuntu/
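For reference, the Docker CE steps from that page were roughly the following at the time of writing (check the linked page for the current key and repository details):
apt-get update
apt-get install apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update
apt-get install docker-ce docker-ce-cli containerd.io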
Then follow https://github.com/NVIDIA/nvidia-docker#quick-start to install nvidia-docker2, but change the last command to use CUDA 10.1:
...
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
# Test nvidia-smi with the latest official CUDA image
docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
Then pull DIGITS using docker (https://hub.docker.com/r/nvidia/digits/):
*https://developer.nvidia.com/digits
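The pull and run commands from the DIGITS hub page look roughly like this (the container name and port mapping are examples, not from these notes; DIGITS serves on port 5000):
docker pull nvidia/digits
docker run --runtime=nvidia -d --name digits -p 5000:5000 nvidia/digits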
Note: you can clean up stopped containers, unused networks, and dangling images with
docker system prune
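To stop a specific running container first (prune only removes stopped ones):
docker ps
docker stop <container-id>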
====cuDNN====
First, make an install directory in /bulk and copy the installation files over from the RDP (E:\installs\DIGITS DevBox). Then:
cd /bulk/install/
dpkg -i libcudnn7_7.5.1.10-1+cuda10.1_amd64.deb
dpkg -i libcudnn7-dev_7.5.1.10-1+cuda10.1_amd64.deb
dpkg -i libcudnn7-doc_7.5.1.10-1+cuda10.1_amd64.deb
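Before the TensorFlow test below, the cuDNN samples installed by libcudnn7-doc can verify the install on their own (paths are NVIDIA's defaults; this step is optional):
cp -r /usr/src/cudnn_samples_v7/ $HOME
cd $HOME/cudnn_samples_v7/mnistCUDNN
make clean && make
./mnistCUDNN
# should finish with: Test passed!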
And test it:
pip install --upgrade tensorflow-gpu
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
And this doesn't work. It turns out that TensorFlow 1.13.1 doesn't work with CUDA 10.1! But there is a workaround, which is to install CUDA 10 inside conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the CUDA 10.1 installation in place, because TensorFlow will catch up at some point.
Still as researcher (and in the venv):
conda install cudatoolkit
conda install cudnn
conda install tensorflow-gpu
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
AND IT WORKS!
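The export above only lasts for the current shell; to make it persistent for researcher (an assumption, not something recorded in these notes), append it to ~/.bashrc:
echo 'export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}' >> ~/.bashrc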
Note: to deactivate the virtual environment:
deactivate
Note that adding the anaconda path to /etc/environment makes the virtual environment redundant.
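For reference, that means a PATH line in /etc/environment along these lines (assuming the default Anaconda location used above; the exact line is not recorded here):
PATH="/home/researcher/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"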
=====PyTorch and SciKit=====
*http://deeplearning.net/software/theano/install_ubuntu.html
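The exact commands are not recorded here, but the usual conda route for these (keeping the CUDA 10.0 constraint from the TensorFlow workaround above) would be roughly:
conda install pytorch torchvision cudatoolkit=10.0 -c pytorch
conda install scikit-learn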
==Video Driver Issue==
After logging into the box sometime later, it seems that the video drivers are no longer loading, presumably as a consequence of some update.
===Testing===
nvidia-settings --query FlatpanelNativeResolution
ERROR: NVIDIA driver is not loaded
cd /usr/local/cuda-10.1/samples/bin/x86_64/linux/release
./deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 100 -> no CUDA-capable device is detected
Result = FAIL
./mnistCUDNN
cudnnGetVersion() : 7501 , CUDNN_VERSION from cudnn.h : 7501 (7.5.1)
Host compiler version : GCC 7.4.0
Cuda failure
Error: no CUDA-capable device is detected
error_util.h:93
Aborting...
And as researcher:
cd /home/researcher/
source ./venv/bin/activate
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
... failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
... kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist
lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use. It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too. There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file. We get the following:
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
tail /var/log/syslog
...Jul 9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout
...Jul 9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.
ls /dev/
...reveals no nvidia devices
nvidia-smi
...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
===Uninstall/Reinstall===
Am going to try uninstalling CUDA 10.1 and the current NVIDIA driver, and then reinstalling CUDA 10.0.
/usr/local/cuda-10.1/bin/cuda-uninstaller
nvidia-uninstall
WARNING: Your driver installation has been altered since it was initially installed; this may happen, for example, if you have since installed the NVIDIA driver through a mechanism other than nvidia-installer (such as your distribution's native package management system). nvidia-installer will attempt to uninstall as best it can. Please see the file '/var/log/nvidia-uninstall.log' for details.
WARNING: Failed to delete some directories. See /var/log/nvidia-uninstall.log for details.
Uninstallation of existing driver: NVIDIA Accelerated Graphics Driver for Linux-x86_64 (418.67) is complete.
Then download cuda_10.0.130_410.48_linux.run from https://developer.nvidia.com/cuda-10.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=Ubuntu&target_version=1804&target_type=runfilelocal, as well as the patch cuda_10.0.130.1_linux.run.
sudo su
cd /bulk/install
./cuda_10.0.130_410.48_linux.run
Accept all defaults and install everything (including the 410.48 NVIDIA driver).
===========
Driver: Installed
Toolkit: Installed in /usr/local/cuda-10.0
Samples: Installed in /home/ed, but missing recommended libraries
Please make sure that
- PATH includes /usr/local/cuda-10.0/bin
- LD_LIBRARY_PATH includes /usr/local/cuda-10.0/lib64, or, add /usr/local/cuda-10.0/lib64 to /etc/ld.so.conf and run ldconfig as root
To uninstall the CUDA Toolkit, run the uninstall script in /usr/local/cuda-10.0/bin
To uninstall the NVIDIA Driver, run nvidia-uninstall
Please see CUDA_Installation_Guide_Linux.pdf in /usr/local/cuda-10.0/doc/pdf for detailed information on setting up CUDA.
Logfile is /tmp/cuda_install_8524.log
Fix the paths:
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
Also:
vi /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda-10.0/lib64
ldconfig
Finally, run the patch and accept all defaults:
./cuda_10.0.130.1_linux.run
Unfortunately this didn't work. After a reboot:
nvidia-settings --query FlatpanelNativeResolution
Unable to init server: Could not connect: Connection refused
(message was the same as before on the box; this was over ssh)
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 35 -> CUDA driver version is insufficient for CUDA runtime version
Result = FAIL
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
2019-07-09 15:20:40.085877: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2019-07-09 15:20:40.085978: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:148] kernel driver does not appear to be running on this host (bastard): /proc/driver/nvidia/version does not exist
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
lspci -vk also returned the same as before. This is really frustrating! Did the following:
apt-get install nvidia-prime
prime-select nvidia
Info: the nvidia profile is already set
update-initramfs -u
For next time (as root):
lshw -c video
...shows the configuration without a driver.
modprobe --resolve-alias nvidiafb
modinfo $(modprobe --resolve-alias nvidiafb)
lsof +D /usr/lib/xorg/modules/drivers/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
Xorg 2488 root mem REG 8,49 23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so
Xorg 2488 root mem REG 8,49 90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so
Xorg 2488 root mem REG 8,49 217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so
Xorg 2488 root mem REG 8,49 7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so
cat /var/log/Xorg.0.log
[ 29.047] (II) LoadModule: "nvidia"
[ 29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[ 29.047] (II) Module nvidia: vendor="NVIDIA Corporation"
[ 29.047] compiled for 4.0.2, module version = 1.0.0
[ 29.047] Module class: X.Org Video Driver
[ 29.047] (II) NVIDIA dlloader X Driver 410.48 Thu Sep 6 06:27:34 CDT 2018
[ 29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[ 29.047] (II) Loading sub module "fb"
[ 29.047] (II) LoadModule: "fb"
[ 29.047] (II) Loading /usr/lib/xorg/modules/libfb.so
[ 29.047] (II) Module fb: vendor="X.Org Foundation"
[ 29.047] compiled for 1.19.6, module version = 1.0.0
[ 29.047] ABI class: X.Org ANSI C Emulation, version 0.4
[ 29.047] (II) Loading sub module "wfb"
[ 29.047] (II) LoadModule: "wfb"
[ 29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so
[ 29.048] (II) Module wfb: vendor="X.Org Foundation"
[ 29.048] compiled for 1.19.6, module version = 1.0.0
[ 29.048] ABI class: X.Org ANSI C Emulation, version 0.4
[ 29.048] (II) Loading sub module "ramdac"
[ 29.048] (II) LoadModule: "ramdac"
[ 29.048] (II) Module "ramdac" already built-in
[ 29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[ 29.095] (EE) NVIDIA: system's kernel log for additional error messages and
[ 29.095] (EE) NVIDIA: consult the NVIDIA README for details.
[ 29.095] (EE) No devices detected.
vi /var/log/kern.log
...it looks like we are back to an unsigned module tainting the kernel.
vi /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=2
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"
GRUB_CMDLINE_LINUX=""
update-grub
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.18.0-25-generic
Found initrd image: /boot/initrd.img-4.18.0-25-generic
Found linux image: /boot/vmlinuz-4.18.0-20-generic
Found initrd image: /boot/initrd.img-4.18.0-20-generic
Found linux image: /boot/vmlinuz-4.18.0-18-generic
Found initrd image: /boot/initrd.img-4.18.0-18-generic
Found linux image: /boot/vmlinuz-4.15.0-54-generic
Found initrd image: /boot/initrd.img-4.15.0-54-generic
Found memtest86+ image: /boot/memtest86+.elf
Found memtest86+ image: /boot/memtest86+.bin
device-mapper: reload ioctl on osprober-linux-nvme0n1p1 failed: Device or resource busy
Command failed
done
*https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers
==VNC==
In order to get a graphical interface for Matlab and other applications, we need a VNC server. First, install the VNC client remotely. We use the standalone exe from TigerVNC.
Now install TightVNC, following the instructions: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04
apt-get install xfce4 xfce4-goodies
As user:
sudo apt-get install tightvncserver
vncserver
Set a password for the user.
vncserver -kill :1
mv ~/.vnc/xstartup ~/.vnc/xstartup.bak
vi ~/.vnc/xstartup
#!/bin/bash
xrdb $HOME/.Xresources
startxfce4 &
vncserver
sudo vi /etc/systemd/system/vncserver@.service
[Unit]
Description=Start TightVNC server at startup
After=syslog.target network.target
[Service]
Type=forking
User=uname
Group=uname
WorkingDirectory=/home/uname
PIDFile=/home/ed/.vnc/%H:%i.pid
ExecStartPre=-/usr/bin/vncserver -kill :%i > /dev/null 2>&1
ExecStart=/usr/bin/vncserver -depth 24 -geometry 1280x800 :%i
ExecStop=/usr/bin/vncserver -kill :%i
[Install]
WantedBy=multi-user.target
Note that changing the color depth breaks it!
To make changes (or after an edit):
sudo systemctl daemon-reload
sudo systemctl enable vncserver@2.service
vncserver -kill :2
sudo systemctl start vncserver@2
sudo systemctl status vncserver@2
Stop the server with:
sudo systemctl stop vncserver@2
Note that we are using :2 because :1 is running our regular X Windows GUI.
Instructions on how to set up an IP tunnel using PuTTY: https://helpdeskgeek.com/how-to/tunnel-vnc-over-ssh/
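Without PuTTY, the equivalent OpenSSH tunnel (host and user are placeholders; VNC display :2 listens on port 5902) is:
ssh -L 5902:localhost:5902 <user>@<box-ip>
Then point the TigerVNC client at localhost:5902.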