The key things that you need to know are:
* One '''kernel''' is executed at a time on a device
* Many '''threads''' execute each kernel - each thread runs the same code but on different data (based on its thread ID)
* Threads are grouped into '''blocks''' and a kernel runs on a '''grid''' of blocks (see the minimal kernel sketch after this list)
* Blocks can't synchronize with one another. They can run concurrently or sequentially.
* Threads have local memory ('''registers''', ~1 clock cycle), blocks have '''shared memory''' (~10 clock cycles), and kernels have '''per-device global memory''' (~100s to 1000 clock cycles); see the shared-memory sketch after this list
* Per-device memory can transfer data to/from the CPU, and includes '''global''', '''local''' (for consecutive access by a thread), '''constant''' (much faster than the other per-device memories), and some specialized memories for graphics ('''texture''' and surface)
* Transfers from global memory to registers happen in 4-, 8-, or 16-byte units; other access patterns incur a penalty, which slows things down. Threads can read directly from constant and texture memory.
* Blocks should have dimension >= 32 (see warps below).
* A GPU device is a set of '''[https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] multiprocessors'''
* The number of threads in a '''warp''' is the "warp size". It's usually 32. You can find yours by running the deviceQuery utility provided in the samples folder (see [[DIGITS DevBox#Test the installation]], and the device-query sketch after this list). Warps are then grouped into blocks.
* At each clock cycle, a multiprocessor executes the same instruction on a warp. Threads within a warp are executed physically in parallel; warps and blocks are executed logically in parallel.
* Kernel launches are asynchronous - the CPU hands off the kernel and moves on. The kernel only executes once all previous CUDA calls have completed.
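To make the kernel/grid/block/thread vocabulary concrete, here is a minimal sketch. The kernel name <code>scale</code>, the array size, and the block size are illustrative choices, not anything prescribed by CUDA; the indexing pattern and the launch syntax are standard.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Each thread handles one element; which element is determined by the
// thread's position in the grid (block index and thread index).
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // guard: the grid may overshoot n
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 256 threads per block (a multiple of the warp size), and enough
    // blocks in the grid to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    // The launch is asynchronous: block the CPU here until it finishes.
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}
</syntaxhighlight>

Compile with, e.g., <code>nvcc scale.cu</code>.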
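The memory tiers show up directly in kernel code. The sketch below (kernel name and block size are again illustrative) keeps each thread's running values in registers, stages data in block-level <code>__shared__</code> memory, and reads/writes per-device global memory. It also shows the one form of synchronization that ''is'' allowed: <code>__syncthreads()</code> between threads of the same block.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // must match the launch configuration

// Sums BLOCK_SIZE consecutive elements per block.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[BLOCK_SIZE];  // shared by the block, ~10-cycle latency

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;  // global -> shared
    __syncthreads();  // threads *within* a block can synchronize

    // Tree reduction in shared memory; the loop counter lives in a register.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];  // shared -> global
}
</syntaxhighlight>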
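If you just want the warp size without building the full deviceQuery sample, the runtime API exposes the same information through <code>cudaGetDeviceProperties</code>. A small stand-alone query program:

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

// Prints the warp size and a few related limits for each device,
// a subset of what the deviceQuery sample reports.
int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s\n", d, prop.name);
        printf("  warp size:             %d\n", prop.warpSize);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("  multiprocessors:       %d\n", prop.multiProcessorCount);
    }
    return 0;
}
</syntaxhighlight>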
==CUDA and Python==