Introduction

Welcome back to Image Processing with CUDA. We are now at lesson 2, which means we already have one full image kernel behind us. As we saw in the grayscale lesson, a CUDA program becomes much easier to reason about when we map work clearly and verify the result carefully.

In this lesson, we will build a blur filter. More specifically, we will implement a 3x3 box blur for a single-channel image, where each output pixel becomes the average of its nearby pixels, including itself. By the end, we will understand both the image logic and the CUDA flow needed to run it correctly.

From Pixelwise Work To Neighborhood Work

In the previous lesson, each thread read one pixel and produced one result from that same location. A blur is different: each output pixel depends on a small neighborhood around it. That makes the task more interesting, because every thread must look beyond its own position.

This is an important step in CUDA image processing:

  • Grayscale conversion is a pixelwise operation; one pixel goes in, one value comes out.
  • Blur filtering is a neighborhood operation; one output pixel depends on nearby input pixels.

That shift introduces a new idea we will use often later: safe neighbor access, especially near the image borders.

Understanding The 3x3 Box Blur

A box blur replaces each pixel with the average of the pixels in a square window. Here, the window is 3x3, so the largest possible neighborhood contains 9 samples.

A simple way to describe it is:

\text{output}(x, y) = \frac{1}{N} \sum_{ky=-1}^{1} \sum_{kx=-1}^{1} \text{input}(x + kx, y + ky)

Here, N is the number of samples that actually fall inside the image: 9 for interior pixels, 6 along the edges, and 4 in the corners.

Preparing The Input

We begin with the usual host side setup: headers, a small CUDA error-checking macro, image dimensions, host vectors, and a tiny test image. Unlike the RGB example from the previous lesson, this image is single channel, so each pixel uses just one byte. That makes the memory layout easier to follow.

This test input is very deliberate. We place a single bright pixel in the center of a 5 by 5 image, so the blur effect becomes easy to predict: the brightness should spread into nearby positions. The before.raw export stores the original bytes exactly as they are in memory, which is useful for inspection later. The CUDA_CHECK macro keeps calls honest by stopping the program if an error occurs.

Creating A CPU Reference

Before using the GPU, we build a CPU version of the same blur. This gives us a trusted answer to compare against later. In small teaching examples like this, a CPU reference is one of the best debugging tools we can have.

The structure here mirrors the blur idea directly: for each pixel (x, y), we visit the surrounding 3x3 area, collect only the valid neighbors, and divide by count. Notice that we do not pretend outside pixels are zero. Instead, we shrink the average near the borders by using fewer samples. That same rule will appear again in the CUDA kernel, which helps keep the CPU and GPU logic aligned.
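A minimal sketch of that reference (function and parameter names are illustrative, not necessarily the lesson's exact code) looks like this:

```cpp
#include <vector>

// CPU reference: for each pixel, average only the in-bounds neighbors in the
// 3x3 window. Near borders the divisor shrinks instead of padding with zeros.
void boxBlurCPU(const std::vector<unsigned char>& input,
                std::vector<unsigned char>& output,
                int width, int height) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum = 0, count = 0;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    int nx = x + kx, ny = y + ky;
                    if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                        sum += input[ny * width + nx];
                        ++count;
                    }
                }
            }
            output[y * width + x] = (unsigned char)(sum / count);
        }
    }
}
```

Because the border rule lives here too, this function is the single source of truth the GPU output is checked against later.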

Mapping Threads To Pixels

Now we move to the CUDA kernel. As you may recall from the grayscale lesson, the first step is not the image math, but the thread-to-pixel mapping. Each thread will still compute exactly one output pixel, even though it now needs to read several input pixels.

The expressions for x and y are the standard way to turn CUDA block and thread indices into image coordinates. The bounds check is essential, because our launch shape may contain more threads than actual pixels. Any thread that lands outside the image exits immediately. After that, we prepare sum and count, which will hold the total brightness and the number of valid samples for the current pixel.

Averaging Neighbor Pixels

Once a thread knows which output pixel it owns, it can walk through the surrounding neighborhood. This is the heart of the blur filter. The loops visit offsets from -1 to 1 in both directions, which gives us the full 3x3 window around (x, y).

There are two key ideas here. First, nx and ny are the neighbor coordinates, so each thread reads nearby pixels rather than just its own. Second, the bounds check applies to every neighbor. That is how we handle corners and edges safely. The indexing expression ny * width + nx converts the 2D position back into a linear array index. Finally, we divide by count and write the average into the matching output location.
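Putting the mapping and the neighborhood walk together, a kernel along these lines implements the rule described above (identifier names such as `boxBlurKernel` are illustrative):

```cpp
__global__ void boxBlurKernel(const unsigned char* input, unsigned char* output,
                              int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // this thread owns no pixel

    int sum = 0;    // total brightness of valid neighbors
    int count = 0;  // number of in-bounds samples

    for (int ky = -1; ky <= 1; ++ky) {
        for (int kx = -1; kx <= 1; ++kx) {
            int nx = x + kx;  // neighbor column
            int ny = y + ky;  // neighbor row
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += input[ny * width + nx];
                ++count;
            }
        }
    }
    output[y * width + x] = (unsigned char)(sum / count);
}
```

Note that the per-neighbor bounds check is exactly the same rule as the CPU reference, which is what keeps the two implementations comparable pixel for pixel.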

Launching The Blur Kernel

With both the CPU reference and GPU kernel ready, we can allocate device memory, copy the input image to the GPU, define the launch shape, and run the kernel. This part follows the same host-to-device pattern we used earlier in the course, which is helpful because it shows that the main change is the kernel logic, not the whole CUDA workflow.

Calling boxBlurCPU before the GPU work gives us the expected result for validation. Then we allocate d_input and d_output on the device and copy the host input into d_input. The 16 by 16 block size is larger than this tiny image, but that is fine because the kernel already protects itself with a bounds check. The rounded grid size ensures that every pixel is covered, and the error checks after launch help us catch problems early.
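In outline, the launch sequence can look like this (a sketch assuming `h_input`, `boxBlurKernel`, and a `CUDA_CHECK` macro that aborts on any non-`cudaSuccess` return code are defined as described earlier):

```cpp
unsigned char *d_input, *d_output;
size_t bytes = WIDTH * HEIGHT * sizeof(unsigned char);

CUDA_CHECK(cudaMalloc(&d_input, bytes));
CUDA_CHECK(cudaMalloc(&d_output, bytes));
CUDA_CHECK(cudaMemcpy(d_input, h_input.data(), bytes, cudaMemcpyHostToDevice));

// 16x16 threads per block; round the grid up so every pixel is covered.
dim3 block(16, 16);
dim3 grid((WIDTH + block.x - 1) / block.x, (HEIGHT + block.y - 1) / block.y);

boxBlurKernel<<<grid, block>>>(d_input, d_output, WIDTH, HEIGHT);
CUDA_CHECK(cudaGetLastError());       // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised during the kernel
```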

Copying Back And Verifying

After the kernel finishes, we copy the blurred image back to the host, save it as raw output, compare it against the CPU result, and print the final status. This last step matters a lot: without verification, we would only be hoping the kernel worked.

This section closes the full pipeline:

  • h_output receives the GPU result,
  • after.raw stores the final blurred bytes,
  • the loop checks every pixel i against the h_expected CPU reference,
  • device memory is freed before the program exits.

If everything matches, the program prints a SUCCESS message, confirming that the blur kernel, memory transfers, and validation logic all agree.

Visualizing the Transformation

To better understand the effect of the 3x3 box blur, we can compare the raw input data against our processed results.

The original image contains a single bright pixel at the center, surrounded by darkness.

After the CUDA kernel runs, the single point of light is distributed across its immediate neighbors.

Notice how the center pixel is now dimmer and the surrounding pixels have gained brightness. This visual spread confirms that our kernel correctly averaged the 3×3 neighborhood, effectively "blurring" the sharp point into a soft square.

Conclusion and Next Steps

In this lesson, we extended our CUDA image processing skills from a pixelwise operation to a neighborhood-based one. We prepared a simple input, created a CPU reference, mapped GPU threads to pixels, averaged valid neighbors in a 3x3 window, copied the result back, and verified correctness with a clear SUCCESS message.

This pattern is very useful: first understand the image rule, then express it safely in a kernel, then confirm it with a trusted reference. In the practice ahead, we will turn this understanding into code you can complete with confidence, one blur step at a time.
