Welcome back! So far, we’ve used a "one-thread-per-element" approach, which assumes our grid is always large enough to cover the data. But what happens if the dataset is massive or the GPU hardware limits our grid size?
To build professional, "battle-ready" CUDA code, we use the Grid Stride Loop. This pattern allows threads to process multiple elements by "stepping" through the array. By the end of this lesson, your kernels will be hardware-agnostic—capable of handling any data size correctly, regardless of how many threads you launch. Let’s upgrade our indexing for maximum robustness!
So far, we have aimed for “one thread per element.” That works well, but it assumes we launch enough threads to cover all items. However, relying on a fixed 1:1 mapping between threads and data points is not always feasible or optimal in production. What if your dataset size exceeds the hardware limits of the GPU, or you want to process millions of elements using only a modest number of threads for better cache reuse?
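Before changing anything, it helps to have the baseline in front of us. Here is a minimal sketch of that one-thread-per-element style; the kernel name `vectorAdd` and the float vector-add body are illustrative choices, matching the verification described later in this lesson:

```cpp
// Baseline style: one thread handles exactly one element, guarded by a bounds check.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n) {                                    // skip threads past the end of the data
        c[i] = a[i] + b[i];
    }
}
```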
A grid stride loop addresses this by changing the mindset: instead of a thread handling exactly one element, each thread handles element i, then jumps ahead by a fixed stride and handles i + stride, i + 2*stride, and so on, until we run out of data.
This pattern is conceptually similar to a modulo operation: each thread with global index `tid` takes responsibility for all data indices `i` such that `i % (gridDim.x * blockDim.x) == tid`.
Now we add the key idea of this lesson. We compute the same starting index, but then we loop forward by a stride of gridDim.x * blockDim.x.
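Here is a sketch of the same kernel rewritten as a grid stride loop (again using illustrative names that match the baseline sketch above):

```cpp
// Grid stride version: each thread starts at its global index,
// then steps forward by the total number of threads in the grid.
__global__ void vectorAddStride(const float *a, const float *b, float *c, int n) {
    int stride = gridDim.x * blockDim.x;                 // total threads launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // same starting index as before
         i < n;                                          // safety check keeps every write in bounds
         i += stride) {
        c[i] = a[i] + b[i];
    }
}
```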
Why this is more robust:
- Even if the grid has fewer threads than `n`, threads "come back around" to handle later elements.
- The loop condition `i < n` becomes our safety check, so every write is still in bounds.
- We can tune `blocksPerGrid` for performance later without breaking correctness.
Choosing the right pattern depends on your goals. The simple one-thread-per-element approach is great for learning, but the grid stride loop is the industry standard because it stays correct even when the grid cannot cover the whole dataset, for example after porting to a GPU with different hardware limits.
Next, the program creates a large problem size and prepares host and device memory. Notice how we compute bytes once, since CUDA allocation and copies use byte counts.
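A sketch of that setup, assuming the 50,000-element vector add described later; the fill values `1.0f` and `2.0f` are illustrative choices that make the expected output `3.0f`, and the snippet belongs inside `main` with `<cstdio>`, `<cstdlib>`, and the CUDA runtime header included:

```cpp
const int N = 50000;               // problem size used in this lesson
size_t bytes = N * sizeof(float);  // compute the byte count once; CUDA calls take bytes

// Host arrays: inputs filled with constants, output initialized to zero
float *h_a = (float *)malloc(bytes);
float *h_b = (float *)malloc(bytes);
float *h_c = (float *)malloc(bytes);
for (int i = 0; i < N; ++i) {
    h_a[i] = 1.0f;  // illustrative constant
    h_b[i] = 2.0f;  // illustrative constant
    h_c[i] = 0.0f;
}

// Device arrays with matching sizes
float *d_a, *d_b, *d_c;
cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
cudaMalloc((void **)&d_c, bytes);
```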
Here we allocate:
- `h_a`, `h_b` as inputs, filled with constants.
- `h_c` as output, initialized to zero.
- `d_a`, `d_b`, `d_c` as device arrays with matching sizes.
We copy inputs to the GPU, then compute launch dimensions using a ceiling division. This ensures full coverage when using the baseline one-index-per-thread style.
The key formula is `blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock`, which rounds up so that a final, partially filled block is still launched.
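Continuing the sketch, with `256` threads per block as an illustrative choice:

```cpp
// Copy the inputs to the device (sizes are byte counts, not element counts)
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

// Ceiling division: round up so a final, partially filled block is still launched
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
printf("Launching %d blocks of %d threads\n", blocksPerGrid, threadsPerBlock);
```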
Now we launch the kernel, check for launch errors, and synchronize. The same launch works whether we use the baseline kernel or the grid stride version.
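In the sketch, that step looks like this (using the `vectorAddStride` kernel defined above):

```cpp
// The same launch configuration works for the baseline or the grid stride kernel
vectorAddStride<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

// cudaGetLastError() reports configuration and launch problems
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
}

// cudaDeviceSynchronize() waits until the kernel has actually finished
cudaDeviceSynchronize();
```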
Two important checks happen here:
- `cudaGetLastError()` catches configuration and launch issues.
- `cudaDeviceSynchronize()` ensures the kernel is finished before we copy results back.
Finally, we copy d_c back, verify every element is 3.0f, print success or failure, and free device memory.
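A sketch of that final stage, assuming the illustrative fill values from earlier:

```cpp
// Copy the result back to the host
cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

// Verify every element: with the assumed 1.0f + 2.0f inputs, each output should be 3.0f
bool ok = true;
for (int i = 0; i < N; ++i) {
    if (h_c[i] != 3.0f) {
        ok = false;
        break;
    }
}
printf("%s\n", ok ? "SUCCESS: all 50000 elements are 3.0" : "FAILURE: mismatch found");

// Free device memory (and the host buffers, since we allocated them with malloc)
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
free(h_a);
free(h_b);
free(h_c);
```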
This verification step performs real work for us: it confirms that our indexing logic and our launch setup actually produced correct results for all 50,000 elements.
When everything is configured correctly, the program prints the launch size and the final verification result.
This tells us two things: we computed a grid large enough to cover N, and the CPU-side check confirmed that every output element matched the expected sum.
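With the illustrative choices above (50,000 elements, 256 threads per block), the output of the sketch would look roughly like:

```
Launching 196 blocks of 256 threads
SUCCESS: all 50000 elements are 3.0
```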
We now have a stronger mental model for 1D CUDA work: global index gives each thread a starting point, and a grid stride loop lets that thread safely handle more elements by stepping forward with a fixed stride. This pattern makes kernels correct across many sizes, and it keeps working even when we change the grid size for tuning.
Next, we will move into practice problems where we apply grid stride loops and launch math ourselves, and you will see how quickly this becomes a dependable habit.
