Skip to content

Efficient image convolution in CUDA

Convolution is a fundamental operation in image processing and a core building blocks of neural networks. GPUs can shine here due to the large amount of operations required and the highly parallelizable nature of the problem.

In a nutshell, an image convolution is defined as an operation where an image is convolved with a (usually static) kernel. A kernel can be for example a 3x3 or 5x5 sized matrix (usually uneven numbers). The value of each output pixel is now defined as the sum of the kernel values multiplied with the neighboring pixels values, or as code:

c++
int range = (kernelSize - 1) / 2;
float sum = 0.0f;
int kernelIdx = 0;
for (int rely = -range; rely <= range; rely++) {
  const int absy = iy + rely;
  for (int relx = -range; relx <= range; relx++) {
    const int absx = ix + relx;
    sum += imgIn[absx + nx * absy] * kernel[kernelIdx];
    kernelIdx++;
  }
}
return sum;

INFO

Boundary conditions are not handled correctly here, which may lead to out-of-bounds memory accesses.

This operation is then performed for every image pixel.

INFO

All performance analysis below are performed on a RTX3070. The results will differ for other GPUs, kernel sizes and matrix sizes. The block size was kept at 16 x 16 elements which was a good tradeoff but there were subtle optimizations possible at each step by choosing a different block size.

Naive implementation

As a baseline, we now want to define first the naive implementation of such a convolution kernel to implement a baseline performance. In our case, we will be using a 2048 x 1024 big float image and convolve it with a 5 x 5 kernel. The kernel of course does not have to be symmetric, but for us it simplifies the code a bit.

Before we jump into the kernel, we want to define a simple boundary handling: for a pixel located at an edge, we will try to convolve with pixel values that go beyond the boundary of the image. Common strategies here include mirroring and padding. We want to use a simple index mirroring which can be formulated as an inlined device function:

c++

__inline__ __device__ int MirrorIndex(const int requested, const int size) {
  return (requested < 0)
             ? -requested
             : ((requested >= size) ? (size - 1) - (requested - size)
                                    : requested);
}

WARNING

This function relies on the assumption that the requested index will not go out of bounds after mirroring. If we have a size of e.g. 100 elements and would request an element at -300, the mirroring would still send us out of bounds. One could introduce additional checks at the expense of runtime, but here we assume that the range of the kernel is smaller then the image size. Then the function is safe to use within our context.

The kernel itself then can be formulated as follows:

c++
__global__ void ConvolveImageNaive(float* __restrict__ d_imgOut, 
                                   const float* __restrict__ d_imgIn,
                                   const float* __restrict__ d_kernel, 
                                   const int nx,
                                   const int ny,
                                   const int range) {

  const int ix = threadIdx.x + blockIdx.x * blockDim.x;
  const int iy = threadIdx.y + blockIdx.y * blockDim.y;
  if (ix >= nx || iy >= ny) return;

  const int linIdxOut = ix + nx * iy;

  int kernelIdx = 0;
  float sum = 0.0f;
  for (int iny = -range; iny <= range; iny++) {
    int absy = MirrorIndex(iy + iny, ny);
    for (int inx = -range; inx <= range; inx++) {
      int absx = MirrorIndex(ix + inx, nx);
      sum += d_imgIn[absx + nx * absy] * d_kernel[kernelIdx];
      kernelIdx++;
    }
  }

  d_imgOut[linIdxOut] = sum;
}

For our given configuration, this runs in around 185 microseconds.

Move the kernel to constant memory

While there are convolution operations that have a non constant kernel that adapts to the data or to the location on the image, we have a simplistic version where the kernel is constant. While the kernel is already cached in L1/L2 for the naive implementation, using constant memory can give a further performance boost due to broadcasting nature (all threads access same kernel locations) and a reduced pressure on the global memory path. In code it can be formulated as follows:

c++
constexpr int maxKernelSize = 51 * 51;
__constant__ float c_kernel[maxKernelSize];

__global__ void ConvolveImageConstant(float* __restrict__ d_imgOut, 
                                      const float* __restrict__ d_imgIn,
                                      const int nx,
                                      const int ny,
                                      const int range) {

  const int ix = threadIdx.x + blockIdx.x * blockDim.x;
  const int iy = threadIdx.y + blockIdx.y * blockDim.y;
  if (ix >= nx || iy >= ny) return;

  const int linIdxOut = ix + nx * iy;

  int kernelIdx = 0;
  float sum = 0.0f;
  for (int iny = -range; iny <= range; iny++) {
    int absy = MirrorIndex(iy + iny, ny);
    for (int inx = -range; inx <= range; inx++) {
      int absx = MirrorIndex(ix + inx, nx);
      sum += d_imgIn[absx + nx * absy] * c_kernel[kernelIdx];
      kernelIdx++;
    }
  }

  d_imgOut[linIdxOut] = sum;
}

int main() {
  // [...]

  CUDA_SAFE_CALL_LOG(cudaMemcpyToSymbol(c_kernel, h_kernel,
                                        nKernel * nKernel * sizeof(float), 0,
                                        cudaMemcpyHostToDevice));

  // [...]
}

This already allows us to reduce the runtime by 16% to 155 microseconds. During inspection of the memory chart in Nsight Compute, we can see that the number of global memory requests was reduced by 50%. Nevertheless, those requests do not necessarily mean real accesses to global memory since the GPU is caching the data for us in the naive implementation.

Load data into shared memory

We can utilize the fact that neighboring elements are accessing the same data. A thread block will typically use the same neighborhood of elements. This is a perfect opportunity to utilize shared memory. We can take the block region, append it with a padding and first use all threads to load the image data into shared memory. Then the for loop is not performed directly on global memory but rather on the perfetched shared memory. Also the boundary check does not need to be executed for every element anymore since we can do it during load (which is executed once per element):

c++
__global__ void Convolution2DShared(float *__restrict__ d_imgOut,
                                    const float *__restrict__ d_imgIn,
                                    const int nx, const int ny,
                                    const int nKernel) {

  // load a local data patch to shared memory
  const int tid = threadIdx.x + blockDim.x * threadIdx.y;
  const int stride = blockDim.x * blockDim.y;

  // offset of the (0, 0) element of this thread block from image origin
  const int patchXOffset = blockIdx.x * blockDim.x;
  const int patchYOffset = blockIdx.y * blockDim.y;

  // size of the shared memory block to hold data and padding
  const int nxLocData = nKernel + blockDim.x - 1;
  const int nyLocData = nKernel + blockDim.y - 1;
  const int nTotalLoc = nxLocData * nyLocData;

  // one sided range of the kernel
  const int range = (nKernel - 1) / 2;

  // load into shared
  extern __shared__ float sLocalData[];
  for (int id = tid; id < nTotalLoc; id += stride) {
    // which element we are loading relative to block offset
    const int ixloc = id % nxLocData - range;
    const int iyloc = id / nxLocData - range;

    const int absx = MirrorIndex(ixloc + patchXOffset, nx);
    const int absy = MirrorIndex(iyloc + patchYOffset, ny);
    sLocalData[id] = d_imgIn[absx + nx * absy];
  }
  __syncthreads();

  // perform the convolution with the prefetched data
  const int ix = threadIdx.x + blockDim.x * blockIdx.x;
  const int iy = threadIdx.y + blockDim.y * blockIdx.y;
  
  if (ix < nx && iy < ny) {
    int kernelIdx = 0;
    float sum = 0.0f;
    for (int iny = 0; iny < nKernel; iny++) {
      for (int inx = 0; inx < nKernel; inx++) {
        const int sx = threadIdx.x + inx;
        const int sy = threadIdx.y + iny; 
        sum += sLocalData[sx + sy * nxLocData] * c_kernel[kernelIdx];
        kernelIdx++;
      }
    }

    const int linIdxOut = ix + nx * iy;
    d_imgOut[linIdxOut] = sum;
  }
}

This optimization allows us to run the kernel in 116 microseconds, which is a 30% runtime improvement against the previous implementation.

constexpr kernel size

If we want to take it a step further, we can utilize the fact that the kernel size is usually known at compile time or that at least there is a few very common kernel sizes that occur in practice. If we know the kernel size, the compiler can fully unroll loops, improve instruction scheduling, and keep kernel values in registers. In practice, we would define constexpr implementation for the most common patch sizes (e.g. 3, 5, 7, 11, 21, 51) and fall back to a runtime specified size for the not precompiled implementations.

c++
template <int nKernel>
__global__ void Convolution2DConstexpr(float* __restrict__ d_imgOut,
                                       const float* __restrict__ d_imgIn,
                                       const int nx,
                                       const int ny) {
  // rest stays the same
}

This will further push down the execution time to 81 microseconds, yet another 30% runtime improvement. Comparing this against the initial 185 microseconds, we are more than twice as fast in solving the problem.

Looking at the results of Nsight Compute for our final constexpr version, we are now close to hitting the peak performance of the device:

  • the occupancy is at 93.07% with 100% theoretical maximum
  • both memory and compute throughput are well balanced around 75%

Next steps

For our existing problem, we have reached a good optimization already, nevertheless, it is worth mentioning that especially for large kernels, this implementation might still be insufficiently fast. In such cases, one might consider performing the convolution in FFT space instead or looking into the topic of separable convolution. Both of those approaches come with their own limitiations but can allow significant improvements over the naive implementation by modifying the computational complexity of the operation.

Created by Urs Hofmann