Welcome back to 2D Grids and Matrix Math. We are now at Lesson 3, and that means we already know two key ideas from this course: how to map 2D thread coordinates to matrix positions, and how to use that mapping inside a real matrix kernel. In this lesson, we take the next step and make that pattern scalable.
As you may recall from previous lessons, we often launch enough threads so that each matrix element gets one thread. That is simple and clear, but it is not always feasible for very large data. Here, we will build a 2D grid-stride loop so a smaller grid can still cover a much larger matrix correctly.
A standard 1-to-1 mapping fails if the grid is smaller than the matrix. While we usually aim for one thread per element, launching a smaller grid is often necessary for:
- Hardware Limits: Matrices can exceed the maximum grid dimensions allowed by the GPU.
- Portability: It ensures the kernel works regardless of the specific GPU hardware or the input size.
- Efficiency: Reusing threads can improve cache hits and reduce the overhead of managing massive numbers of blocks.
A 2D grid-stride loop fixes this by letting each thread "step" through the matrix at regular gaps.
Let's start by defining the kernel itself. The first part should look familiar: each thread computes its starting column and row from block and thread indices.
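A minimal sketch of that opening section is below. The kernel name matrixAddStride is an assumption (the lesson does not name it); the parameter and variable names follow the discussion in this lesson.

```cpp
// Element-wise matrix addition kernel (name is illustrative).
__global__ void matrixAddStride(const float* A, const float* B, float* C,
                                int width, int height)
{
    // Each thread computes the first (column, row) position it owns.
    int startCol = blockIdx.x * blockDim.x + threadIdx.x;
    int startRow = blockIdx.y * blockDim.y + threadIdx.y;
    // ... the grid-stride loops follow in the next snippet ...
```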
There are two important details here. First, A and B are marked const because the kernel only reads from them. Second, width and height describe the matrix shape, while startCol and startRow tell us where this thread begins. So far, this is the same coordinate mapping pattern we already know. The new idea will come from what the thread does after computing that starting point.
Here is the heart of the lesson. Instead of touching one element and stopping, each thread moves through the matrix in two dimensions. This is what allows a small launch to process a much larger matrix.
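Continuing the same kernel, a sketch of the stride loops might look like this; the variable names match the bullets below:

```cpp
    // Total threads launched in each dimension: the step each thread
    // takes between consecutive elements it processes.
    int strideX = gridDim.x * blockDim.x;
    int strideY = gridDim.y * blockDim.y;

    // Each thread walks the matrix in steps of strideY rows and
    // strideX columns, stopping at the matrix boundaries.
    for (int r = startRow; r < height; r += strideY) {
        for (int c = startCol; c < width; c += strideX) {
            int index = r * width + c;       // row-major mapping
            C[index] = A[index] + B[index];  // element-wise addition
        }
    }
}
```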
A few ideas are packed into this small block:
- strideX is the total number of threads covering the x direction, and strideY is the same for y.
- The loops stop naturally at the matrix boundaries, so no separate bounds check is needed here.
- index = r * width + c uses the same row-major mapping from earlier lessons.
- Each visited element is just matrix addition: C[index] = A[index] + B[index].
With the kernel ready, we can set up a large input in main(). The program uses a 2048 x 2048 matrix, which is big enough to show why this technique matters.
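A sketch of that host setup is below. The names h_A, h_B, d_C, W, and H come from the lesson; h_C, bytes, and the remaining device pointers are illustrative helpers.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

int main()
{
    const int W = 2048, H = 2048;                       // matrix shape
    const size_t bytes = size_t(W) * H * sizeof(float);

    // Constant inputs make verification trivial: 1.0f + 2.0f = 3.0f.
    std::vector<float> h_A(W * H, 1.0f);
    std::vector<float> h_B(W * H, 2.0f);
    std::vector<float> h_C(W * H);          // receives the result later

    // Three device buffers; only the inputs are copied over.
    float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);
```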
This test is designed to make verification easy. Every value in h_A is 1.0f and every value in h_B is 2.0f, so every output value should become 3.0f. We also allocate three device buffers and copy only the inputs to the GPU. The output buffer d_C does not need an initial host value because the kernel will overwrite every valid position it reaches.
Now comes the interesting part: we intentionally launch a grid that is much smaller than the matrix. This would fail in a one-pass kernel, but it works here because of the 2D grid-stride loops.
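One way to express that undersized launch is sketched below. The 16 x 16 split between blocks and threads is an assumption; any combination whose product is 256 thread positions per dimension matches the lesson's numbers.

```cpp
    // Deliberately small: 16x16 blocks of 16x16 threads = 256 x 256
    // thread positions, far fewer than the 2048 x 2048 elements.
    dim3 block(16, 16);
    dim3 grid(16, 16);

    printf("Matrix elements: %d\n", W * H);
    printf("Thread positions: %u x %u\n",
           grid.x * block.x, grid.y * block.y);

    matrixAddStride<<<grid, block>>>(d_A, d_B, d_C, W, H);

    // Catch launch failures, then block until the GPU has finished.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Launch error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("Execution error: %s\n", cudaGetErrorString(err));
        return 1;
    }
```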
The printed lines help us see the mismatch clearly: the matrix has over four million elements, but the launch covers only 256 x 256 thread positions at a time. That is exactly the scenario where the stride logic earns its place. The error checks after launch are also important; they catch launch failures and then wait until the GPU has finished before we continue.
After the kernel finishes, we copy the result back, check every element, print the status, and free device memory. This completes the full workflow from host setup to GPU execution to verification.
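A sketch of that final stage, reusing the h_C buffer from the setup above (the exact wording of the status messages is illustrative):

```cpp
    // Copy the result back to the host.
    cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    // Check every element, not just a sample.
    bool ok = true;
    for (int i = 0; i < W * H; ++i) {
        if (std::fabs(h_C[i] - 3.0f) > 1e-5f) { ok = false; break; }
    }
    printf("%s\n", ok ? "SUCCESS: all elements equal 3.0"
                      : "FAILURE: at least one element is wrong");

    // Release device memory before exiting.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return ok ? 0 : 1;
}
```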
The verification loop checks all W * H values, not just a sample. Using std::fabs with a tolerance of 1e-5f is a safe floating-point habit, even though this example is very simple. If the kernel handled the whole matrix correctly, the program prints the success message, meaning every output element matched the expected value 3.0f.
In this lesson, we extended the 2D indexing pattern from earlier lessons into something much more practical: a 2D grid-stride loop. We saw how a thread computes a starting row and column, how strideX and strideY let it revisit more work, and how this makes a matrix kernel scale beyond the size of the launched grid. We also walked through the host code that builds a large test, launches a deliberately small grid, verifies every result, and frees GPU memory safely.
This pattern is simple, but it is also powerful. It lets us write kernels that are less tied to one exact launch shape and more ready for real workloads. In the practice section ahead, we will put this idea to work so we can write scalable matrix kernels with real confidence.
