CUDA has a small amount of memory available for its threads called shared memory. As the name already suggests is that this memory is available to all threads within a block simultaneously. We want to use this property to make threads read memory from global memory to shared memory in a block, use the memory together, and afterwards write the result back into global memory to avoid multiple accesses to global memory. Nevertheless, there are some rules one need to respect for high performance.
## A word about warps
During launch time we specify our grid dimensions. During execution the threads of a block are then executed in groups of 32 at a time. Warps are never made beyond block boundaries. Each of those fixed size groups is called a warp and they always have a consecutive threadIdx. Those threads all execute together in SIMD style as long as there is no branching. For example an if statement with different outcome for warps units will lead to serial execution which can significantly slow down our calculation.
When we talk about shared memory access we always need to keep this structure in mind. All those 32 units of a warp should be capable of simultaneously reading from shared memory and avoiding serialization of the memory request. This is what the following description is about.
## Shared memory organization
The 64k byte of shared memory (more on recent GPUs) available to each SM is broken into 4 byte regions called words. A word could for example represent a float, 4 bytes, 2 shorts etc. There are 32 banks and each successive word belongs to a different bank so `word[0]` belongs to `bank0`, `word[1]` to `bank1`, `word[2]` to `bank[2]`, …, and `word[32]` to `bank0` again etc.
Therefore `bank0` contains all the words with indices dividable by 32, and `bankn` has all the indices which fulfill `(idx % 32) == n`. To find out to which bank a word belongs to you can therefore just use `% 32`. Shared memory always reads entire words, no matter if you read 1 byte of a word or 4 bytes of a word. Therefore `float` usually performs quite nicely on a GPU because it fully covers one word of a bank. Requesting a 2 byte datatype will just take as long as requesting a 4 byte type.
Theoretically a thread can access every word of every bank from its warp’s shared memory block. Shared memory nevertheless performs fastest when there is one request from each bank per thread. A bank conflict occurs when threads of a warp request different words from the same bank simultaneously, so for example `threadIdx.x == 0` requests `bank0, word0` and `threadIdx.x == 1` requests `bank0, word1`, and `threadIdx.x == n` requests `bank0, wordn`. Those memory access requests will be then serialized and executed after each other, delaying execution time. At the same time a `threadIdx.x == n` requesting `bankn, word0` pattern can be executed at full speed.
## Broadcast and Multicast
If all threads of a warp all request exactly the same value (for example all threads read `word30` of `bank30`), they will get a broadcast. The value will be read once from shared memory and broadcasted to the threads. This does not lead to any bank conflicts and is a trick to increase memory access speed.
If several threads request the same particular word from a particular bank (for example five threads request `word33` of `bank1`), a multicast occurs. The read from shared memory will also in this case only be read once and then be provided to all the threads requesting this.
## Bank conflict
The moment different words are requested from any single bank leads to a bank conflict. Requests will then be serialized and performance decreases. Bank conflicts can only occur within a block. There is no such thing as an inter-block bank conflict. Only warp level.
## Request patterns
One of the most natural ways to reference shared memory is to have threads read based on their `threadIdx` and each thread uses only its own bank without requesting things from other banks. For example request `arr[threadIdx.x]` but never use `arr[threadIdx.x * 2]` because then `threadIdx.x == 0` and `threadIdx.x == 16` will try to access different words of the same bank at the same time leading to a bank conflict. `arr[threadIdx.x * 3]` again leads to no bank conflicts again because we shift everything one step. Good step sizes: `threadIdx.x * 1`, `threadIdx.x * 3`, `threadIdx.x * 13`.
## Race conditions
If for example each thread reads for example a piece from global memory into shared memory and then all threads will use this piece of information together for example for a vector multiplication, a so called race condition occurs. To avoid this we can use a __syncthread();
which forces all threads to be synchronized before further continuing the execution of the kernel.
## Further reading
* [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html) contains a few sections about shared memory
* [NVIDIA Developer Blog](https://developer.nvidia.com/blog/using-shared-memory-cuda-cc) simple introduction into using shared memory, thread synchronization, and bank conflicts.
* [Creel Youtube channel](https://www.youtube.com/watch?v=upGoZ00MlfI) series of videos introducing into advanced CUDA programming and how to optimize / speed up your code
* [jhuis github](https://jhui.github.io/2017/03/06/CUDA/) general introduction into CUDA programming containing a few examples about shared memory
* [CUDA Shared Memory NVIDIA Cooperation](https://www.olcf.ornl.gov/wp-content/uploads/2019/12/02-CUDA-Shared-Memory.pdf) presentation about shared memory