===Compiling a Kernel===
In the language of GPU computing, we need to compile a [https://en.wikipedia.org/wiki/Compute_kernel kernel] to run on the GPU. Some packages (discussed later) abstract away how GPUs handle memory and processing, but you should be aware of the fundamentals, as they are often critical to maximizing the code's performance: if you understand the hardware implementation, you can tune for it!
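To make this concrete, here is a minimal, hedged sketch of compiling and launching a kernel from Python with Numba's CUDA support (one option among the packages discussed below); the kernel name add_one, the array size, and the block size are purely illustrative:

<syntaxhighlight lang="python">
# A minimal sketch, assuming the Numba package and a CUDA-capable GPU are available.
# The @cuda.jit decorator compiles add_one() into a kernel that runs on the device.
import numpy as np
from numba import cuda

@cuda.jit
def add_one(arr):
    i = cuda.grid(1)        # this thread's global index across the whole grid
    if i < arr.size:        # guard threads that fall past the end of the array
        arr[i] += 1.0

data = np.zeros(1024, dtype=np.float32)
d_data = cuda.to_device(data)                 # copy input to per-device global memory
threads_per_block = 128
blocks = (data.size + threads_per_block - 1) // threads_per_block
add_one[blocks, threads_per_block](d_data)    # launch the kernel on a grid of blocks
result = d_data.copy_to_host()                # copy the result back to the CPU
</syntaxhighlight>

With Numba the compilation happens lazily, on the first launch, and the [blocks, threads_per_block] launch configuration maps directly onto the grid/block/thread hierarchy described below.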
 
[https://scholars.duke.edu/person/cliburn.chan Chi Wei Cliburn Chan], an associate professor of Biostatistics and Bioinformatics at Duke, teaches [https://people.duke.edu/~ccc14/ lots of great classes] and provides a guide to [https://people.duke.edu/~ccc14/sta-663/CUDAPython.html Massively parallel programming with GPUs] as part of his [https://people.duke.edu/~ccc14/sta-663/ Computational Statistics in Python] class. (Note that the [http://people.duke.edu/~ccc14/sta-663-2018/ 2018 version of STA 663: Computational Statistics and Statistical Computing], under the same course number, has sections on Spark, Tensorflow, Cython, and more!) His guide gives a pretty good walk-through of how a CUDA kernel runs, though it is missing some images. See also:
*The streaming multiprocessor and the CUDA core: https://i.stack.imgur.com/kvu4M.jpg
*CUDA memory hierarchy: https://www.researchgate.net/profile/Marco_Nobile/publication/261069154/figure/fig1/AS:296718735298563@1447754667270/Schematization-of-CUDA-architecture-Schematic-representation-of-CUDA-threads-and-memory.png
*Various slides from Cyril Zeller's (NVIDIA Developer Technology) Tutorial CUDA: https://www.slideshare.net/angelamm2012/nvidia-cuda-tutorialnondaapr08
 
The key things that you need to know are (see the sketch after this list for a concrete illustration):
* One kernel is executed at a time on a device
* Many threads execute each kernel - each thread runs the same code but on different data (based on its threadID)
* Threads are grouped into blocks and a kernel runs on a grid of blocks
* Blocks can't synchronize. They can run concurrently or sequentially.
* Threads have local memory (registers, ~1 clock cycle), blocks have shared memory (~10 clock cycles), and kernels have per-device global memory (~100s–1,000 clock cycles)
* Per-device memory can transfer data to/from the CPU, and includes global, local (for consecutive access by a thread), constant (much faster than the other per-device memories), and some specialized memories for graphics (texture and surface).
* Transfers from global memory to local registers are in 4-, 8-, or 16-byte units (other access patterns can incur a penalty, which slows things down). Threads can also read from constant and texture memory.
* Blocks should have a dimension of at least 32 (the warp size)
* A GPU device is a set of SIMT (single-instruction, multiple-thread) multiprocessors
* At each clock cycle, a multiprocessor executes the same instruction on a warp (the number of threads in a warp is the "warp size").
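To tie these points together, below is another hedged sketch (again Numba CUDA Python; the block size of 128 and the block-level sum reduction are illustrative assumptions, not a prescription). Each block stages its slice of the input in fast shared memory, the threads of that block synchronize with cuda.syncthreads(), and thread 0 of each block writes one partial sum back to global memory:

<syntaxhighlight lang="python">
# Illustrative sketch of the hierarchy above (names and sizes are hypothetical).
import numpy as np
from numba import cuda, float32

THREADS_PER_BLOCK = 128   # a multiple of the warp size (32)

@cuda.jit
def block_sum(x, partial):
    sdata = cuda.shared.array(THREADS_PER_BLOCK, float32)  # per-block shared memory (~10 cycles)
    tid = cuda.threadIdx.x                                  # index of this thread within its block
    i = cuda.grid(1)                                        # global thread index
    if i < x.size:
        sdata[tid] = x[i]
    else:
        sdata[tid] = 0.0
    cuda.syncthreads()        # threads within a block CAN synchronize; blocks cannot
    stride = THREADS_PER_BLOCK // 2
    while stride > 0:         # tree reduction within the block's shared memory
        if tid < stride:
            sdata[tid] += sdata[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        partial[cuda.blockIdx.x] = sdata[0]   # one result per block, written to global memory

x = np.ones(10_000, dtype=np.float32)
blocks = (x.size + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
partial = cuda.device_array(blocks, dtype=np.float32)
block_sum[blocks, THREADS_PER_BLOCK](cuda.to_device(x), partial)
print(partial.copy_to_host().sum())   # final reduction of the per-block sums happens on the CPU
</syntaxhighlight>

Because blocks cannot synchronize with each other, the final reduction of the per-block partial sums is done on the CPU (or with a second kernel launch), exactly as the list above suggests.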
==CUDA and Python==
