Skip to content

CUDA Histogram Example Implementation

Histograms are a common way to represent the distribution of data. In this example, we will implement a simple histogram calculation for a 2D image using CUDA to leverage the parallel processing capabilities of GPUs. This implementation will demonstrate how to compute a histogram for a large dataset efficiently. In this example, I show the histogram for 256 bins and a uint8_t datatype, which is typical for greyscale images, nevertheless, the approach can be adapted for different bin sizes.

Naive implementation

The most simple and naive implementation of a histogram calculation in CUDA would be to have each thread read one pixel and update the global histogram directly:

c++
__global__ void SimpleHistogram(
    uint32_t *__restrict__ d_hist, // should be init to 0 before calling
    const uint8_t *__restrict__ d_data,
    const int nx,
    const int ny) {

  const int ix = threadIdx.x + blockIdx.x * blockDim.x;
  const int iy = threadIdx.y + blockDim.y * blockIdx.y;

  if (ix >= nx || iy >= ny)
    return;

  const uint8_t currElem = d_data[ix + nx * iy];
  atomicAdd(&d_hist[currElem], 1);
}

However, this approach has several drawbacks:

  • High contention on global memory: Since many threads may try to update the same histogram bin simultaneously, this can lead to significant performance degradation due to serialization of atomic operations.
  • Poor memory access patterns: The random access pattern to the histogram bins can lead to inefficient memory accesses.

Shared memory implementation

We can combine several techniques to optimize the histogram calculation:

  1. each block accumulates into a histogram in shared memory to reduce global memory contention
  2. we use atomic operations to safely update the global histogram from each block's shared memory histogram
  3. we limit the number of global updates by having each thread process multiple data points.
c++
__global__ void HistogramKernel(uint32_t* __restrict__ d_histogram,
                                const uint8_t* __restrict__ d_data,
                                const uint32_t nx,
                                const uint32_t ny) {
  // shared memory for block-wise histogram
  __shared__ uint32_t s_hist[256];
  const uint32_t nThreads = blockDim.x * blockDim.y;
  const uint32_t threadId = threadIdx.y * blockDim.x + threadIdx.x;
  // initialize shared histogram
  for (uint32_t i = threadId; i < 256; i += nThreads) {
    s_hist[i] = 0;
  }
  __syncthreads();

  // calculate private histogram
  const uint32_t xStart = blockIdx.x * blockDim.x + threadIdx.x;
  const uint32_t yStart = blockIdx.y * blockDim.y + threadIdx.y;
  const uint32_t xStride = gridDim.x * blockDim.x;
  const uint32_t yStride = gridDim.y * blockDim.y;
  for (uint32_t y = yStart; y < ny; y += yStride) {
    for (uint32_t x = xStart; x < nx; x += xStride) {
      const uint8_t pixelValue = d_data[y * nx + x];
      atomicAdd(&s_hist[pixelValue], 1);
    }
  }
  __syncthreads();

  // merge private histogram into global histogram
  for (uint32_t i = threadId; i < 256; i += nThreads) {
    atomicAdd(&d_histogram[i], s_hist[i]);  
  }

}

The advantage of accumulating into the shared memory histogram first is that it reduces the number of atomic operations on global memory, which are more expensive. By using shared memory, we can perform many updates locally before committing the results to global memory. Furthermore, shared memory is less prone to slow downs if the access is not perfectly coalesced, as it is much faster than global memory.

INFO

I have tried to take it a step further and have each warp compute its own histogram in registers/shared memory and then use warp reduction to combine them into the block histogram. However, this approach turned out to be slower than the above version, likely due to the increased complexity and overhead of managing multiple histograms per warp. This also comes with increased shared memory usage, which can limit occupancy. ::: o

Created by Urs Hofmann