*The streaming multiprocessor and the CUDA core: https://i.stack.imgur.com/kvu4M.jpg
*CUDA memory hierarchy: https://www.researchgate.net/profile/Marco_Nobile/publication/261069154/figure/fig1/AS:296718735298563@1447754667270/Schematization-of-CUDA-architecture-Schematic-representation-of-CUDA-threads-and-memory.png
*Various slides from the CUDA tutorial by Cyril Zeller (NVIDIA Developer Technology): https://www.slideshare.net/angelamm2012/nvidia-cuda-tutorialnondaapr08
The key things that you need to know are:
* Blocks should contain at least 32 threads, i.e. at least one full warp (defined below)
* A GPU device is a set of SIMT multiprocessors
* At each clock cycle, a multiprocessor executes the same instruction on a warp of threads. The number of threads in a warp is the "warp size", which is usually 32. You can find yours by running the deviceQuery utility provided in the samples folder; see [[DIGITS DevBox#Test the installation]].
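
To illustrate why the block size matters, here is a minimal sketch in plain Python (no GPU required) of how the threads of a block are grouped into warps. It assumes a warp size of 32, which you should confirm with deviceQuery, and the helper names (<code>warps_per_block</code>, <code>lane_utilization</code>, <code>round_up_to_warp</code>) are made up for this example and are not part of any CUDA API.

```python
WARP_SIZE = 32  # assumption: typical value; check deviceQuery for your GPU


def warps_per_block(block_size, warp_size=WARP_SIZE):
    """Number of warps needed to hold block_size threads (ceiling division)."""
    return (block_size + warp_size - 1) // warp_size


def lane_utilization(block_size, warp_size=WARP_SIZE):
    """Fraction of warp lanes that carry a real thread of the block."""
    return block_size / (warps_per_block(block_size, warp_size) * warp_size)


def round_up_to_warp(block_size, warp_size=WARP_SIZE):
    """Smallest multiple of warp_size that is >= block_size."""
    return warps_per_block(block_size, warp_size) * warp_size


if __name__ == "__main__":
    for size in (16, 32, 33, 100, 256):
        print(
            f"block of {size:3d} threads: "
            f"{warps_per_block(size)} warp(s), "
            f"{lane_utilization(size):.0%} of lanes used, "
            f"next multiple of warp size: {round_up_to_warp(size)}"
        )
```

Under these assumptions, a block of 16 threads still occupies a whole warp, so half of its lanes sit idle, and a block of 33 threads needs a second warp that is almost empty. This is why block sizes are usually chosen as multiples of the warp size.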
==CUDA and Python==