Efficient Image Transpose in CUDA: From Naive to Bandwidth-Limited
Transpose operations are needed in many different domains ranging from image augmentation during AI training to matrix manipulations in any numerical equations. Besides its use as a standalone kernel, the principle displayed below is frequently used as part of larger kernels where operations are performed along the striding dimension of a multi-dimensional array (matrix, volume, ...). The transpose operation behind it is very simple: a matrix of the size nx by ny pixels get transposed into a shape ny by nx with the elements being swapped:
for (int iy = 0; iy < ny; iy++) {
for (int ix = 0; ix < nx; ix++) {
out[iy + ny * ix] = in[ix + nx * iy];
}
}The challenge on GPUs is not the computation itself, but reorganizing memory access patterns to achieve high bandwidth utilization.
INFO
All performance metrics were performed on a RTX3070. The test image had a fixed size of 2048 by 1024 elements and the block size was fixed to 16 by 16 elements. While the kernels are templated, the current analysis is performed using T = float.
Naive implementation
While the mathematical operation is incredibly simple, this operations offers quite some nice playground for CUDA optimizations. A most simple kernel would look something like this:
template <typename T>
__global__ void TransposeNaive(T* __restrict__ d_out,
const T* __restrict__ d_in,
const int nx,
const int ny) {
const int ix = threadIdx.x + blockDim.x * blockIdx.x;
const int iy = threadIdx.y + blockDim.y * blockIdx.y;
if (ix >= nx || iy >= ny) return;
d_out[iy + ny * ix] = d_in[ix + nx * iy];
}The problem is the strided memory access during write to global memory which slows down the operation since only individual elements are written one by one at distributed memory locations resulting in a lot of overhead.
To transpose a 2048 by 1024 matrix, this kernel takes around 60 microseconds to execute using a 16 x 16 block size.
Shared memory based implementation
Luckily shared memory comes to our rescue here by offering a buffer in the L1 cache of the GPU where the strided memory access hurts much less. We load blocks of memory from the global memory into a local buffer. While writing into that buffer, we already transpose the data, so that when we write out into global memory again, neighboring threads can write to neighboring elements in the output data (non-strided fashion):

In code, it can be defined as below:
template <typename T>
__global__ void TransposeSharedMemNaive(T* __restrict__ d_out,
const T* __restrict__ d_in,
const int nx,
const int ny) {
extern __shared__ T sBuffer[];
// the offset of the (0, 0) element of this block in the whole matrix
const int tileOffsetX = blockIdx.x * blockDim.x;
const int tileOffsetY = blockIdx.y * blockDim.y;
// the global index of this worker
const int ix = threadIdx.x + tileOffsetX;
const int iy = threadIdx.y + tileOffsetY;
const bool inRange = (ix < nx) && (iy < ny);
// load the data into shared memory and transpose (indexing: y, ny * x)
sBuffer[threadIdx.y + blockDim.x * threadIdx.x] =
(inRange) ? d_in[ix + iy * nx] : T{0};
__syncthreads();
// write the data into the output matrix (buffer is already transposed --> x remains x)
if (!inRange) return;
d_out[tileOffsetY + threadIdx.x + ny * (tileOffsetX + threadIdx.y)] =
sBuffer[threadIdx.x + blockSize.x * threadIdx.y];
}
int main() {
// [...]
const dim3 blockSize{ 16, 16, 1}; // should be square
assert(blockSize.x == blockSize.y);
const dim3 gridSize{ (nx + blockSize.x - 1) / blockSize.x,
(ny + blockSize.y - 1) / blockSize.y,
1 };
const int sharedMemBytes = blockSize.x * blockSize.y * sizeof(T);
TransposeSharedMem<<<gridSize, blockSize, sharedMemBytes>>>(...);
// [...]
}Using the same matrix size and the same block size as for the previous implementation, this kernel now runs in around 45 microseconds, allowing a 25% speedup over our previous implementation.
Resolving bank conflicts
The shared memory is divided into 32 memory banks. For a continuous block of e.g. float values, the bank in which an alement resides can be determined by combining the index with the modulo operator:
iBank = idx % 32When multiple threads of the same warp try to access the different elements of the same bank simultaneously, we get a so-called bank conflict. The access operations cannot be performed in parallel for all threads but is serialized with each bank access taking a clock cycle. This hurts performance a lot.
Looking at the above kernel, during the write to shared memory operation (L19), the threads of the same warp (two consecutive block lines along threadIdx.x), write into a shared memory buffer with a 16-element stride (block size along x and y was chosen to be 16). This means that every other thread of a line is trying to write into the same bank creating an 8-fold bank conflict (for a 32 bit type). If the block size would be chosen as 32, all threads would attempt to access the same bank. The read from shared memory (L26) is already well defined since every thread of the warp accesses a different bank.
The problem can be easily resolved by introducing an offset into the shared memory buffer. This is typically demonstrated for a block size of 32 x 32 elements. We just have to pad the shared memory in the x direction by a single element to resolve the conflict. While we could here just also choose a 32 x 32 size, such large blocks might sometimes lead to reduced occupancy, hurting the performance especially if the transpose operation would be merged with some computationally heavy parts. So let's try to solve this for a 16 x 16 patch, where each warps spans over two lines/rows.
The goal is to re-arrange the indexing in a way that all elements of two consecutive lines, but also all elements of two consecutive rows never access the same bank twice. A solution to this problem is to pad the matrix every other row with two elements. Below this scheme is shown for a 8 x 8 matrix with 16 banks. The same principle will apply to a 16 x 16 matrix with 32 elements.

INFO
This padding scheme can indeed be generalized. For the 32 x 32 float matrix, the padding is often formulated as ix + (ny + 1) * iy but if we take our indexing ix + ny * iy + (iy / ls) * ls and define ls as a new variable called "lines to skip" or (nx / WARP_SIZE), the indexing works for both cases since ls == 1 will result in the same padding often mentioned in literature.
In the code, this solution can be implemented as follows:
template <typename T>
__global__ void TransposeSharedMemPadded(T* __restrict__ d_out,
const T* __restrict__ d_in,
const int nx,
const int ny) {
extern __shared__ T sBuffer[];
// the offset of the (0, 0) element of this block in the whole matrix
const int tileOffsetX = blockIdx.x * blockDim.x;
const int tileOffsetY = blockIdx.y * blockDim.y;
// the global index of this worker
const int ix = threadIdx.x + tileOffsetX;
const int iy = threadIdx.y + tileOffsetY;
const bool inRange = (ix < nx) && (iy < ny);
// load the data into shared memory and transpose
sBuffer[threadIdx.y + blockDim.x * threadIdx.x + (threadIdx.x / 2) * 2] =
(inRange) ? d_in[ix + iy * nx] : T{0};
__syncthreads();
// write the data into the output matrix
if (!inRange) return;
d_out[tileOffsetY + threadIdx.x + ny * (tileOffsetX + threadIdx.y)] =
sBuffer[threadIdx.x + blockSize.x * threadIdx.y + (threadIdx.y / 2) * 2];
}
int main() {
// [...]
const dim3 blockSize{ 16, 16, 1}; // should be square
assert(blockSize.x == blockSize.y);
const dim3 gridSize{ (nx + blockSize.x - 1) / blockSize.x,
(ny + blockSize.y - 1) / blockSize.y,
1 };
const int sharedMemBytes = (blockSize.x * blockSize.y + (blockSize.y / 2) * 2) * sizeof(T);
TransposeSharedMem<<<gridSize, blockSize, sharedMemBytes>>>(...);
// [...]
}Nevertheless, in the performance analysis with NVIDIA Compute, there was no performance boost from resolving the bank conflicts. During inspection of the memory workload analysis section, the device memory to L2 rate is peaking at around 90% by now. In the example we used a T = float matrix with a size of 2048 x 1024 elements (8 Mb). With the read and write, we transport 16 Mb from and to global memory. This results in a peak memory throughput of 355 Gb/s. The RTX3070 has a memory bandwidth of around 448 Gb/s. We are therefore operating at 80% of the GPUs memory bandwidth. The bank conflict optimization does not measurably elevate performance for this memory bound workload on the GPU but it will make the threads less busy. If there would be some computation performed within the kernel, it could become critical again.
Full code
Full example with compilable code and test
#include <cassert>
#include <cstdlib>
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
// --- KERNELS ---
__global__ void TransposeNaive(float *__restrict__ d_out,
const float *__restrict__ d_in,
const int nx,
const int ny) {
const int ix = threadIdx.x + blockIdx.x * blockDim.x;
const int iy = threadIdx.y + blockIdx.y * blockDim.y;
if (ix >= nx || iy >= ny) {
return;
}
const int linIdxIn = ix + nx * iy;
const int linIdxOut = iy + ny * ix;
d_out[linIdxOut] = d_in[linIdxIn];
}
__global__ void TransposeSharedMemPadded(float *__restrict__ d_out,
const float *__restrict__ d_in,
const int nx,
const int ny) {
extern __shared__ float sBuffer[];
const int tileOffsetX = blockIdx.x * blockDim.x;
const int tileOffsetY = blockIdx.y * blockDim.y;
const int ix = threadIdx.x + tileOffsetX;
const int iy = threadIdx.y + tileOffsetY;
const bool inRange = (ix < nx) && (iy < ny);
sBuffer[threadIdx.y + (blockDim.x + 1) * threadIdx.x] =
(inRange) ? d_in[ix + iy * nx] : 0.0f;
__syncthreads();
if (!inRange) return;
d_out[tileOffsetY + threadIdx.x + ny * (tileOffsetX + threadIdx.y)] =
sBuffer[threadIdx.x + (blockDim.x + 1) * threadIdx.y];
}
float GetValueForPos(const int ix, const int iy) {
return static_cast<float>(ix) * static_cast<float>(iy) *
static_cast<float>(iy);
}
// --- ERROR HANDLING ---
#define CUDA_SAFE_CALL(func) \
{ \
const cudaError_t err = func; \
if (err != cudaSuccess) { \
std::cout << "an error occured at " << __LINE__ << std::endl; \
std::abort(); \
} \
}
#define CUDA_CHECK_LOG() \
{ \
const cudaError_t err = cudaGetLastError(); \
if (err != cudaSuccess) { \
std::cout << "caught an error in line " << __LINE__ << ": " \
<< cudaGetErrorString(err) << std::endl; \
std::abort(); \
} \
}
// --- TEST SUITE ---
int main() {
float *d_in;
float *d_out;
constexpr int nx = 2048;
constexpr int ny = 1024;
constexpr int nElements = nx * ny;
CUDA_SAFE_CALL(cudaMalloc(&d_in, nElements * sizeof(float)));
CUDA_SAFE_CALL(cudaMalloc(&d_out, nElements * sizeof(float)));
float *h_ptr;
CUDA_SAFE_CALL(cudaMallocHost(&h_ptr, nElements * sizeof(float)));
for (int iy = 0; iy < ny; iy++) {
for (int ix = 0; ix < nx; ix++) {
h_ptr[ix + nx * iy] = GetValueForPos(ix, iy);
}
}
CUDA_SAFE_CALL(cudaMemcpy(d_in, h_ptr, nElements * sizeof(float),
cudaMemcpyHostToDevice));
const dim3 blockSize{16, 16, 1u};
assert(blockSize.x == blockSize.y);
const dim3 gridSize{(nx + blockSize.x - 1) / blockSize.x,
(ny + blockSize.y - 1) / blockSize.y, 1};
const uint sharedMemBytes = (blockSize.x + 1) * blockSize.y * sizeof(float);
TransposeSharedMem<<<gridSize, blockSize, sharedMemBytes>>>(d_out, d_in, nx,
ny);
CUDA_CHECK_LOG();
CUDA_SAFE_CALL(cudaDeviceSynchronize());
// copy data back and check for correctness
CUDA_SAFE_CALL(cudaMemcpy(h_ptr, d_out, nElements * sizeof(float),
cudaMemcpyDeviceToHost));
int nIncorrect = 0;
for (int ix = 0; ix < nx; ix++) {
for (int iy = 0; iy < ny; iy++) {
int idx = iy + ix * ny;
if (h_ptr[idx] != GetValueForPos(ix, iy)) {
nIncorrect++;
}
}
}
if (nIncorrect == 0) {
std::cout << "all elements correct" << std::endl;
} else {
std::cout << "errors occured for " << nIncorrect << " elements"
<< std::endl;
}
CUDA_SAFE_CALL(cudaFree(d_in));
CUDA_SAFE_CALL(cudaFree(d_out));
CUDA_SAFE_CALL(cudaFreeHost(h_ptr));
}To compile and run the code, use
nvcc <filename>.cu -o transpose_matrix && ./transpose_matrix