Introduction

Welcome back to Image Processing with CUDA. We are now at lesson 3, so we have enough foundation to build something that feels much closer to a real image workflow. As you may recall from the previous lessons, we have already practiced a direct pixel transform and a neighborhood-based filter. Now, we will connect those ideas into one complete flow.

Our goal is an image filter pipeline: first, convert an RGB image to grayscale, then apply a Sobel edge detector to that grayscale result. By the end of this lesson, we will understand the GPU path, the matching CPU reference, and the final validation step that confirms the whole pipeline works correctly.

Why The Pipeline Comes In Two Steps

A Sobel filter looks for changes in brightness, not for color by itself. Because of that, it makes sense to first turn each RGB pixel into a single intensity value and then run edge detection on that simpler image. This keeps the second stage focused and easier to reason about.

The two core formulas are:

gray = 0.299·R + 0.587·G + 0.114·B

magnitude = sqrt(Gx² + Gy²)

Preparing The Program State

We begin with the same careful setup style used in earlier lessons: standard headers (iostream, vector, fstream, cmath, cstdlib, cuda_runtime.h), a CUDA_CHECK macro, image sizes, and host vectors.
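A typical shape for such an error-checking macro is shown below; the exact message format is an illustrative assumption, but the pattern of wrapping every CUDA runtime call and exiting on failure is the standard one:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported with
// file and line information instead of passing silently.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)
```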

To make our kernels more efficient, we also declare our Sobel coefficients in __constant__ memory. Unlike local arrays inside a kernel, __constant__ memory is a specialized read-only cache that is shared by all threads. This is a common real-world CUDA pattern for fixed filter weights.

Optimizing Weights with Constant Memory

In previous kernels, we have declared small arrays directly inside the kernel. However, for fixed data like the Sobel convolution kernels, __constant__ memory is much more efficient. It resides in a special memory space that is cached on the GPU. When all threads in a warp read the same location in constant memory—which happens here as every thread applies the same 3 × 3 weights—the hardware broadcasts that value to all threads in a single cycle.

Because __constant__ variables have a different lifetime and scope than standard device pointers, we cannot use cudaMemcpy to fill them. Instead, we use cudaMemcpyToSymbol. This function tells the CUDA driver to look up the specific symbol name in the compiled device code and copy the host data into that protected memory region.
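A minimal sketch of this pattern follows. The names c_Gx and c_Gy match the lesson's text; the flat nine-element layout and the helper function name are assumptions for illustration:

```cpp
#include <cuda_runtime.h>

// Sobel weights live in the GPU's constant memory space,
// visible to all kernels in this translation unit.
__constant__ int c_Gx[9];
__constant__ int c_Gy[9];

void uploadSobelWeights() {
    const int hGx[9] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
    const int hGy[9] = {-1, -2, -1, 0, 0, 0, 1, 2, 1};
    // cudaMemcpyToSymbol resolves the __constant__ symbol in the
    // compiled device code, which plain cudaMemcpy cannot do.
    cudaMemcpyToSymbol(c_Gx, hGx, sizeof(hGx));
    cudaMemcpyToSymbol(c_Gy, hGy, sizeof(hGy));
}
```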

Creating A Simple Test Image

Before launching kernels, we need input data that makes edges easy to spot. The code builds a bright yellowish square on top of a dark blueish background. That shape is perfect for Sobel because the border of the square creates strong brightness changes, while the flat inside and outside regions stay mostly calm. The program also writes the raw RGB bytes to a file for inspection.

By using different values for R, G, and B, we can see how the grayscale conversion weights each color channel differently. The before.raw file preserves the exact input bytes.
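The image builder might look like the sketch below; the specific colors and the square's placement are illustrative assumptions, not the lesson's exact values:

```cpp
#include <cstddef>
#include <vector>

// Build a w x h RGB image: a bright yellowish square centered
// on a dark blueish background, giving Sobel strong edges.
std::vector<unsigned char> makeTestImage(int w, int h) {
    std::vector<unsigned char> rgb(static_cast<size_t>(w) * h * 3);
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            size_t i = (static_cast<size_t>(y) * w + x) * 3;
            bool inSquare = x >= w / 4 && x < 3 * w / 4 &&
                            y >= h / 4 && y < 3 * h / 4;
            if (inSquare) {            // bright yellowish square
                rgb[i] = 230; rgb[i + 1] = 220; rgb[i + 2] = 40;
            } else {                   // dark blueish background
                rgb[i] = 20;  rgb[i + 1] = 30;  rgb[i + 2] = 90;
            }
        }
    }
    return rgb;
}
```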

Building The Grayscale Reference

As in the earlier lessons, we first create a CPU version of the operation. This provides a trusted answer before we involve the GPU. The grayscaleCPU function is compact because it is a pixelwise transform: each output position depends only on the matching RGB pixel, not on nearby neighbors. That makes it a clean first stage for the pipeline.

The loop runs once per pixel, not once per byte. For each pixel index i, the red value lives at rgb[i * 3], the green at rgb[i * 3 + 1], and the blue at rgb[i * 3 + 2]. The weighted sum follows the standard luminosity rule, then casts the result back to unsigned char. This function produces the grayscale image that the CPU Sobel stage will read next.
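Put together, the reference function is only a few lines; this sketch follows the indexing and weights described above:

```cpp
// CPU grayscale reference: one weighted sum per pixel,
// using the standard luminosity coefficients.
void grayscaleCPU(const unsigned char* rgb, unsigned char* gray,
                  int w, int h) {
    for (int i = 0; i < w * h; ++i) {
        float r = rgb[i * 3];
        float g = rgb[i * 3 + 1];
        float b = rgb[i * 3 + 2];
        gray[i] = static_cast<unsigned char>(
            0.299f * r + 0.587f * g + 0.114f * b);
    }
}
```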

Building The Sobel Reference

Now, the CPU reference moves from a simple pixel formula to a neighborhood operation. For each output pixel, the code reads a 3 x 3 area around it, applies one kernel for horizontal change and another for vertical change, then combines both responses into one edge strength value. This mirrors the logic we will later place on the GPU.

There are three ideas to notice here:

  • Gx reacts to left versus right intensity change
  • Gy reacts to top versus bottom intensity change
  • The final magnitude combines both, then clamps large values to 255

The bounds check inside the neighbor loops protects the image borders, where some surrounding positions do not exist.
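The three ideas above can be sketched as follows; the kernel coefficients are the standard Sobel operator, which matches the lesson's description:

```cpp
#include <algorithm>
#include <cmath>

// CPU Sobel reference: for each pixel, accumulate horizontal (Gx)
// and vertical (Gy) responses over a 3 x 3 neighborhood, then
// combine them into one clamped edge-strength byte.
void sobelCPU(const unsigned char* gray, unsigned char* out,
              int w, int h) {
    const int Gx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    const int Gy[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sx = 0.0f, sy = 0.0f;
            for (int ky = -1; ky <= 1; ++ky) {
                for (int kx = -1; kx <= 1; ++kx) {
                    int nx = x + kx, ny = y + ky;
                    // Bounds check protects the image borders.
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        float v = gray[ny * w + nx];
                        sx += v * Gx[ky + 1][kx + 1];
                        sy += v * Gy[ky + 1][kx + 1];
                    }
                }
            }
            float mag = std::sqrt(sx * sx + sy * sy);
            out[y * w + x] =
                static_cast<unsigned char>(std::min(mag, 255.0f));
        }
    }
}
```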

Converting Pixels On The GPU

The first GPU kernel should feel familiar from lesson 1. Each thread maps to one image location, checks whether that location is inside the image, and then computes one grayscale value. Even though this is the first stage of a larger pipeline, it is still a clean, one-pixel-to-one-output operation, which makes it a good kernel to launch first.

The mapping from blockIdx, blockDim, and threadIdx to (x, y) is the same pattern we have already used. Once the thread knows its pixel coordinates, i = y * w + x converts that 2D location into a linear index for the grayscale image. From there, the thread reads the three RGB bytes, applies the luminosity weights, and writes one byte into gray.
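A sketch of the kernel, following the mapping just described (parameter names are illustrative):

```cpp
__global__ void grayscaleKernel(const unsigned char* rgb,
                                unsigned char* gray, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;      // guard threads outside the image
    int i = y * w + x;                 // 2D position -> linear index
    gray[i] = static_cast<unsigned char>(0.299f * rgb[i * 3] +
                                         0.587f * rgb[i * 3 + 1] +
                                         0.114f * rgb[i * 3 + 2]);
}
```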

Detecting Edges On The GPU

The second kernel is the heart of the lesson. Each thread still owns one output pixel, but now it must read a small neighborhood from the grayscale image. Instead of declaring the coefficients inside the function, it reads them from the global c_Gx and c_Gy __constant__ arrays we defined earlier.

This kernel closely matches the CPU reference, which is exactly what we want for validation. By using __constant__ memory, we ensure that every thread in a warp accesses the same coefficient simultaneously, which is highly optimized by the hardware. Finally, sqrtf computes the gradient magnitude, and the result is clamped so it fits into one byte.
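A sketch of the device-side Sobel stage, assuming c_Gx and c_Gy are the flat nine-element __constant__ arrays set up earlier:

```cpp
__global__ void sobelKernel(const unsigned char* gray,
                            unsigned char* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sx = 0.0f, sy = 0.0f;
    for (int ky = -1; ky <= 1; ++ky) {
        for (int kx = -1; kx <= 1; ++kx) {
            int nx = x + kx, ny = y + ky;
            if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                float v = gray[ny * w + nx];
                int k = (ky + 1) * 3 + (kx + 1);  // constant-array index
                sx += v * c_Gx[k];
                sy += v * c_Gy[k];
            }
        }
    }
    float mag = sqrtf(sx * sx + sy * sy);
    out[y * w + x] = static_cast<unsigned char>(fminf(mag, 255.0f));
}
```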

Running The Full GPU Pipeline

With both CPU functions and both GPU kernels ready, the main program can execute the complete workflow. First, it initializes the __constant__ memory using cudaMemcpyToSymbol. Then, it builds the CPU reference, allocates device memory, and launches the kernels in sequence.

This section shows the value of a pipeline on the GPU: d_gray acts as a device-side bridge between the two kernels, so we do not copy intermediate data back to the CPU. We check cudaGetLastError() after each launch to ensure any configuration or resource errors are caught immediately at the relevant stage. Because both kernels are launched into the same stream (the default stream), the CUDA driver guarantees they execute in the order they were issued, meaning the Sobel kernel reads the grayscale image only after the grayscale kernel has finished writing its results.
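The launch sequence might look like the sketch below. The buffer names d_rgb, d_gray, and d_sobel follow the lesson's text; the block and grid dimensions are illustrative assumptions:

```cpp
// Illustrative launch configuration: one thread per pixel.
dim3 block(16, 16);
dim3 grid((W + block.x - 1) / block.x, (H + block.y - 1) / block.y);

grayscaleKernel<<<grid, block>>>(d_rgb, d_gray, W, H);
CUDA_CHECK(cudaGetLastError());   // catch launch-configuration errors

// Same (default) stream: this launch waits for the previous kernel,
// so d_gray is fully written before the Sobel stage reads it.
sobelKernel<<<grid, block>>>(d_gray, d_sobel, W, H);
CUDA_CHECK(cudaGetLastError());

CUDA_CHECK(cudaMemcpy(h_sobel.data(), d_sobel, W * H,
                      cudaMemcpyDeviceToHost));
```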

Verifying The Final Result

The last step is to save the output, compare it with the CPU reference, print the status, and release device memory. Notice that the comparison allows a difference of 1 instead of demanding exact equality. That small tolerance is useful here because floating-point math and casting can produce tiny rounding differences, even when the overall result is correct.
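The tolerance check can be sketched as a small helper; the function name is an illustrative assumption, but the allowed per-pixel difference of 1 matches the lesson:

```cpp
#include <cstdlib>
#include <vector>

// Compare GPU output against the CPU reference, allowing each
// pixel to differ by at most 1 to absorb rounding differences.
bool matchesWithinOne(const std::vector<unsigned char>& gpu,
                      const std::vector<unsigned char>& ref) {
    if (gpu.size() != ref.size()) return false;
    for (size_t i = 0; i < gpu.size(); ++i) {
        if (std::abs(static_cast<int>(gpu[i]) -
                     static_cast<int>(ref[i])) > 1)
            return false;
    }
    return true;
}
```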

A few final details are worth noticing:

  • after.raw stores the final Sobel image as raw grayscale bytes
  • The loop checks every pixel against h_sobel_ref
  • All device buffers (d_rgb, d_gray, d_sobel) are freed before the program exits

When everything matches, the program prints a success message confirming that the GPU output agrees with the CPU reference.

Visualizing the Transformation

To better understand the effect of the Image Filter Pipeline, we can compare the raw input data against our processed results.

The original image contains a bright yellowish square set against a dark blue background.

After the CUDA kernels run, the grayscale conversion and Sobel operator isolate the boundaries between these regions.

Notice how the flat, solid-colored areas become dark, while the edges of the square are highlighted as bright lines. This visual result confirms that our pipeline correctly transformed the three-channel input into a single-channel map of intensity gradients, identifying exactly where the brightness changes most sharply.

Conclusion and Next Steps

In this lesson, we built a full image filter pipeline on the GPU: RGB input, grayscale conversion, Sobel edge detection, result export, and CPU-based validation. We also introduced __constant__ memory, a powerful way to store read-only filter weights that all threads need to access.

This is an important step forward because we are no longer applying one isolated filter; we are chaining stages together while keeping the intermediate data on the device and optimizing our data access patterns. In the practice section ahead, we will reinforce this flow by implementing the stages ourselves.
