Addressing Ubuntu NVIDIA Issues
This page provides information on how to address NVIDIA driver issues under Ubuntu 18.04. However, the objective is not to provide a step-by-step set of instructions to address one particular issue. Instead, it lays out the general process and commonly-used command line tools to debug video (and other) driver issues.
The Boot Process
The following things happen at boot:
- BIOS enumerates/checks hardware -- make sure Secure Boot is disabled!
- The MBR (if you have one) contains GRUB, which is called. We shouldn't be using UEFI if we want unsigned NVIDIA drivers to work.
- Edit /etc/default/grub and run update-grub to alter /boot/grub/grub.cfg (note that there is also /etc/grub.d)
- Then it's the Kernel's turn! Kernel images are in /boot.
- Find which one is being used by running uname -r or doing cat /proc/version or dmesg | grep ubuntu.
- The Kernel loads initramfs first (a mini OS), which is also stored in /boot. It can be updated with update-initramfs.
- initramfs decides which kernel modules are going to be loaded. Use modprobe to alter the list then update-initramfs.
- /etc/modules-load.d/modules.conf can provide a list of modules or can be blank
- lsmod lists loaded modules, depmod -n lists module dependencies, and modinfo provides info on a given module. Loaded modules are also listed in /proc/modules (cat /proc/modules).
- Kernel messages are in /var/log/kern.log (as well as in /var/log/syslog).
- View them with dmesg, cat /var/log/kern.log, and journalctl -b (messages since last boot). Note that cat /proc/kmsg shows nothing and /var/log/syslog is a log for rsyslog.
- Init, or rather systemd, then takes over and the machine moves through its targets (the old runlevels), eventually bringing up your X windowing system, if you have one.
- /etc/rc.local is called last. For later versions of NVIDIA drivers you'll need /usr/bin/nvidia-persistenced --verbose here.
The boot process matters because video drivers are loaded in the kernel phase. If you change which modules are loaded manually (dpkg, apt-get and most scripts will do it for you), you'll need to run update-initramfs, as sketched below.
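For example, a minimal sketch of blacklisting nouveau by hand and rebuilding the initramfs (the NVIDIA installer normally writes this file for you; the filename here is just the conventional one):

echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
update-initramfs -u

# confirm what the running kernel actually has loaded
lsmod | grep -E 'nouveau|nvidia'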
Useful Commands
Here's a list of useful commands to diagnose your issues:
vi /etc/default/grub
update-grub
uname -r
cat /proc/version
update-initramfs
cat /var/log/kern.log
dmesg
cat /proc/kmsg
less /var/log/syslog
journalctl -b
lsmod | grep video
modinfo asus_wmi
find /lib/modules/$(uname -r) -type f -name '*.ko'
modprobe nvidiafb
Finding hardware
If your video card is installed in a PCI slot and visible to the OS, it will show up in:
lspci -vk
and in
lshw -c video
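On a box with several cards it can help to filter the PCI listing down to NVIDIA devices only; a quick sketch:

lspci -nnk | grep -iA3 nvidia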
Example Hardware Check
Checking that the hardware is being seen on the first build of the DIGITS DevBox:
lspci -vk
05:00.0 VGA compatible controller: NVIDIA Corporation GP102 [TITAN Xp] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation GP102 [TITAN Xp]
        Flags: bus master, fast devsel, latency 0, IRQ 78, NUMA node 0
        Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
        Memory at c0000000 (64-bit, prefetchable) [size=256M]
        Memory at d0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at d000 [size=128]
        Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau

06:00.0 VGA compatible controller: NVIDIA Corporation Device 1e02 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: NVIDIA Corporation Device 12a3
        Flags: fast devsel, IRQ 24, NUMA node 0
        Memory at f8000000 (32-bit, non-prefetchable) [size=16M]
        Memory at a0000000 (64-bit, prefetchable) [size=256M]
        Memory at b0000000 (64-bit, prefetchable) [size=32M]
        I/O ports at c000 [size=128]
        Expansion ROM at f9000000 [disabled] [size=512K]
        Capabilities: [60] Power Management version 3
        Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
        Capabilities: [78] Express Legacy Endpoint, MSI 00
        Capabilities: [100] Virtual Channel
        Capabilities: [250] Latency Tolerance Reporting
        Capabilities: [258] L1 PM Substates
        Capabilities: [128] Power Budgeting <?>
        Capabilities: [420] Advanced Error Reporting
        Capabilities: [600] Vendor Specific Information: ID=0001 Rev=1 Len=024
        Capabilities: [900] #19
        Capabilities: [bb0] #15
        Kernel modules: nvidiafb, nouveau
This looks good. The second card is the Titan RTX (see https://devicehunt.com/view/type/pci/vendor/10DE/device/1E02).
Altering GRUB
If you didn't use the HWE version of Ubuntu, you might get the dreaded screen freeze or black screen on boot when the video drivers are loaded. Fix that by booting to a terminal and doing:
vi /etc/default/grub

GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=2
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="nvidia-drm.modeset=1"
GRUB_CMDLINE_LINUX=""
update-grub
Sourcing file `/etc/default/grub'
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-4.18.0-25-generic
Found initrd image: /boot/initrd.img-4.18.0-25-generic
Found linux image: /boot/vmlinuz-4.18.0-20-generic
Found initrd image: /boot/initrd.img-4.18.0-20-generic
Found linux image: /boot/vmlinuz-4.18.0-18-generic
Found initrd image: /boot/initrd.img-4.18.0-18-generic
Found linux image: /boot/vmlinuz-4.15.0-54-generic
Found initrd image: /boot/initrd.img-4.15.0-54-generic
Found memtest86+ image: /boot/memtest86+.elf
Found memtest86+ image: /boot/memtest86+.bin
device-mapper: reload ioctl on osprober-linux-nvme0n1p1 failed: Device or resource busy
Command failed
done
See https://askubuntu.com/questions/1048274/ubuntu-18-04-stopped-working-with-nvidia-drivers
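After rebooting, a quick check (not specific to this box) that the new kernel parameter actually took effect:

cat /proc/cmdline
grep nvidia-drm /boot/grub/grub.cfg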
Secure Boot
Don't use UEFI and don't use Secure Boot, unless you can restrict yourself to signed, production-level drivers (which isn't going to happen here). To check that you aren't using Secure Boot, in /boot:
grep CONFIG_MODULE_SIG_ALL config-4.18.0-25-generic
CONFIG_MODULE_SIG_ALL=y

grep CONFIG_MODULE_SIG_FORCE config-4.18.0-25-generic
# CONFIG_MODULE_SIG_FORCE is not set
Check the kernel log to see if an unsigned module is tainting the kernel and whether that is causing a problem: vi /var/log/kern.log
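If mokutil is available, it gives a quicker answer (a sketch; the package may not be on a minimal install):

apt-get install mokutil       # if not already present
mokutil --sb-state            # should report "SecureBoot disabled"
dmesg | grep -i secure        # the kernel also logs the Secure Boot state at boot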
Drivers
Remember that TensorFlow uses CUDA, which in turn uses the video driver, so fix things from the bottom up: driver first, then CUDA, then TensorFlow!
If you want to install the latest driver from the PPA, you might need to update your repo list. Current repos are in
/etc/apt/sources.list.d/
If the Launchpad PPA was already added as a repo, then you can:
cat /etc/apt/sources.list.d/graphics-drivers-ubuntu-ppa-bionic.list
See drivers from all of your repos with:
ubuntu-drivers devices
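If the PPA is not there yet, adding it and installing a driver looks roughly like this (a sketch; nvidia-driver-430 is only an example version, pick one from the ubuntu-drivers output):

add-apt-repository ppa:graphics-drivers/ppa
apt-get update
apt-get install nvidia-driver-430   # example version; or let Ubuntu choose:
ubuntu-drivers autoinstall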
Module Diagnostics
This section contains some notes from the depths of my problems:
lspci -vk shows Kernel modules: nvidiafb, nouveau and no Kernel driver in use.
It looks like nouveau is still blacklisted in /etc/modprobe.d/blacklist-nouveau.conf and /usr/bin/nvidia-persistenced --verbose is still being called in /etc/rc.local. ubuntu-drivers devices returns exactly what it did before we installed CUDA 10.1 too...
There is no /proc/driver/nvidia folder, and therefore no /proc/driver/nvidia/version file. We get the following:
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.

tail /var/log/syslog
...Jul  9 13:35:56 bastard kernel: [ 5314.526960] pcieport 0000:00:02.0: [12] Replay Timer Timeout
...Jul  9 13:35:56 bastard nvidia-persistenced: Failed to query NVIDIA devices. Please ensure that the NVIDIA device files (/dev/nvidia*) exist, and that user 0 has read and write permissions for those files.

ls /dev/
...reveals no nvidia devices

nvidia-smi
...NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
grep nvidia /etc/modprobe.d/* /lib/modprobe.d/*
.../etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb
.../etc/modprobe.d/nvidia-installer-disable-nouveau.conf:# generated by nvidia-installer
modprobe --resolve-alias nvidiafb
modinfo $(modprobe --resolve-alias nvidiafb)
lsof +D /usr/lib/xorg/modules/drivers/
COMMAND  PID USER  FD  TYPE DEVICE SIZE/OFF     NODE NAME
Xorg    2488 root mem   REG   8,49    23624 26346422 /usr/lib/xorg/modules/drivers/fbdev_drv.so
Xorg    2488 root mem   REG   8,49    90360 26347089 /usr/lib/xorg/modules/drivers/modesetting_drv.so
Xorg    2488 root mem   REG   8,49   217104 26346424 /usr/lib/xorg/modules/drivers/nouveau_drv.so
Xorg    2488 root mem   REG   8,49  7813904 26346043 /usr/lib/xorg/modules/drivers/nvidia_drv.so
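When no driver is in use, trying to load the module by hand and watching the kernel log usually points at the culprit; a minimal sketch:

modprobe nvidia && echo "loaded" || echo "failed"
dmesg | tail -n 20      # look for NVRM or signature/taint messages
ls /dev/nvidia*         # device nodes should appear once the module loads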
Determine and Set Driver
You can use the NVIDIA PRIME tool:
apt-get install nvidia-prime
prime-select nvidia
Info: the nvidia profile is already set
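To see which profile is currently selected:

prime-select query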
Paths
You can also use ldconfig to add the LD Library Path:
vi /etc/ld.so.conf.d/cuda.conf
/usr/local/cuda-10.0/lib64

ldconfig
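To confirm the CUDA libraries are now in the linker cache (a quick check, not specific to 10.0):

ldconfig -p | grep libcudart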
Export paths by setting them in the global environment file:
vi /etc/environment
PATH="/home/researcher/anaconda3/bin:/home/researcher/anaconda3/condabin:/usr/local/cuda-10.0/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games"
LD_LIBRARY_PATH="/usr/local/cuda-10.0/lib64"
Now there is no need to activate a custom environment! Alternatively:
export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export PYTHONPATH=/your/tensorflow/path:$PYTHONPATH
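Whichever route you take, a quick sanity check that the paths resolve:

echo $PATH | tr ':' '\n' | grep cuda
nvcc --version      # should report the CUDA 10.0 toolkit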
Persistenced
For CUDA 10.1, we need nvidia-persistenced to be run at boot, so:
vi /etc/rc.local

#!/bin/sh -e
/usr/bin/nvidia-persistenced --verbose
exit 0

chmod +x /etc/rc.local
For CUDA 10.0, it runs as a service and is set up and launched by the script.
/usr/bin/nvidia-persistenced --verbose
nvidia-persistenced failed to initialize. Check syslog for more details.
Jul 11 21:08:20 bastard sshd[3708]: Did not receive identification string from 94.190.53.14 port 3135
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Verbose syslog connection opened
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Directory /var/run/nvidia-persistenced will not be removed on exit
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Failed to lock PID file: Resource temporarily unavailable
Jul 11 21:10:25 bastard nvidia-persistenced[3714]: Shutdown (3714)
To check whether it is already running:
ps aux | grep persistenced
nvidia-+  2183  0.0  0.0  17324  1552 ?      Ss  10:09  0:00 /usr/bin/nvidia-persistenced --user nvidia-persistenced --no-persistence-mode --verbose
root      3539  0.0  0.0  14428  1000 pts/0  S+  10:10  0:00 grep --color=auto persistenced
And to confirm it is running as a service:
systemctl list-units --type service --all | grep nvidia
nvidia-persistenced.service   loaded   active   running   NVIDIA Persistence Daemon
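If that service unit is present, it is arguably cleaner to let systemd manage the daemon than to call it from rc.local; a sketch, assuming the unit shipped with the driver is installed:

systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
systemctl status nvidia-persistenced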
XWindows
To diagnose XWindow issues:
cat /var/log/Xorg.0.log
[    29.047] (II) LoadModule: "nvidia"
[    29.047] (II) Loading /usr/lib/xorg/modules/drivers/nvidia_drv.so
[    29.047] (II) Module nvidia: vendor="NVIDIA Corporation"
[    29.047]    compiled for 4.0.2, module version = 1.0.0
[    29.047]    Module class: X.Org Video Driver
[    29.047] (II) NVIDIA dlloader X Driver  410.48  Thu Sep  6 06:27:34 CDT 2018
[    29.047] (II) NVIDIA Unified Driver for all Supported NVIDIA GPUs
[    29.047] (II) Loading sub module "fb"
[    29.047] (II) LoadModule: "fb"
[    29.047] (II) Loading /usr/lib/xorg/modules/libfb.so
[    29.047] (II) Module fb: vendor="X.Org Foundation"
[    29.047]    compiled for 1.19.6, module version = 1.0.0
[    29.047]    ABI class: X.Org ANSI C Emulation, version 0.4
[    29.047] (II) Loading sub module "wfb"
[    29.047] (II) LoadModule: "wfb"
[    29.047] (II) Loading /usr/lib/xorg/modules/libwfb.so
[    29.048] (II) Module wfb: vendor="X.Org Foundation"
[    29.048]    compiled for 1.19.6, module version = 1.0.0
[    29.048]    ABI class: X.Org ANSI C Emulation, version 0.4
[    29.048] (II) Loading sub module "ramdac"
[    29.048] (II) LoadModule: "ramdac"
[    29.048] (II) Module "ramdac" already built-in
[    29.095] (EE) NVIDIA: Failed to initialize the NVIDIA kernel module. Please see the
[    29.095] (EE) NVIDIA:     system's kernel log for additional error messages and
[    29.095] (EE) NVIDIA:     consult the NVIDIA README for details.
[    29.095] (EE) No devices detected.
To use nvidia-settings, you have to be at the box (it needs an X display). Running it over SSH will get you:
nvidia-settings --query FlatpanelNativeResolution
Unable to init server: Could not connect: Connection refused
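A quick way to pull just the errors and warnings out of the X log, and (a sketch, assuming X is running on display :0; you may also need XAUTHORITY pointed at that user's .Xauthority) to aim nvidia-settings at the local display from an SSH session:

grep -E '\(EE\)|\(WW\)' /var/log/Xorg.0.log
DISPLAY=:0 nvidia-settings --query FlatpanelNativeResolution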
Tensorflow in Conda
It turns out that TensorFlow 1.13.1 doesn't work with CUDA 10.1! There is a workaround, though: install CUDA 10.0 inside conda only (see https://github.com/tensorflow/tensorflow/issues/26182). We are also going to leave the system-wide CUDA 10.1 installation in place, because TensorFlow will catch up at some point.
Still as your user account (and in the venv):
conda install cudatoolkit
conda install cudnn
conda install tensorflow-gpu
export LD_LIBRARY_PATH=/home/researcher/anaconda3/lib/${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"
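To confirm that TensorFlow can actually see the GPU (a quick check using the TF 1.x API):

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"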
Remove Everything
Sometimes the best option is to completely remove everything and start again.
To see what is currently installed (your output may look different):
dpkg -l | grep nvidia
ii  libnvidia-container-tools      1.0.2-1                     amd64  NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64     1.0.2-1                     amd64  NVIDIA container runtime library
ii  nvidia-container-runtime       2.0.0+docker18.09.7-3       amd64  NVIDIA container runtime
ii  nvidia-container-runtime-hook  1.4.0-1                     amd64  NVIDIA container runtime hook
ii  nvidia-docker2                 2.0.3+docker18.09.7-3       all    nvidia-docker CLI wrapper
ii  nvidia-prime                   0.8.8.2                     all    Tools to enable NVIDIA's Prime
ii  nvidia-settings                418.56-0ubuntu0~gpu18.04.1  amd64  Tool for configuring the NVIDIA graphics driver
To completely remove all things nvidia (substitute the correct version numbers; the text after the hash marks just records what was removed, so don't run it):
cd /usr/local/cuda-10.0/bin
./uninstall_cuda_10.0.pl
cd /usr/local
rm -r cuda-10.1/
nvidia-uninstall
apt-get purge nvidia*    # Removed nvidia-prime, nvidia-settings, nvidia-container-runtime, nvidia-container-runtime-hook, nvidia-docker2
apt-get purge *nvidia*   # Removed libnvidia-container-tools (1.0.2-1), libnvidia-container1:amd64
apt autoremove           # Removed libllvm7 libvdpau1 libxnvctrl0 linux-headers-4.18.0-18 linux-headers-4.18.0-18-generic linux-image-4.18.0-18-generic linux-modules-4.18.0-18-generic linux-modules-extra-4.18.0-18-generic mesa-vdpau-drivers pkg-config screen-resolution-extra vdpau-driver-all
cd /home/ed
rm -r NVIDIA_CUDA-10.1_Samples/
rm -r NVIDIA_CUDA-10.0_Samples/
dpkg --remove libcudnn7
dpkg --remove libcudnn7-dev
dpkg --remove libcudnn7-doc
Finally, comment out the NVIDIA lines in /etc/rc.local and reboot:
vi /etc/rc.local
shutdown -r now
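After the reboot, a quick sanity check that nothing NVIDIA-related is left behind (all three should come back empty):

dpkg -l | grep -i nvidia
lsmod | grep nvidia
ls /usr/local | grep cuda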