Welcome back to Shared Memory Optimization. We are now at Lesson 4, which means we already know how to declare shared memory, synchronize threads, and build a boundary-safe tiled matrix multiplication kernel. In this lesson, we take the next practical step: we compare that optimized kernel against a simple baseline and measure the difference carefully.
Our goal is not just to get a smaller timing number. We also want a comparison that is fair, repeatable, and correct. By the end, we will have a small benchmarking pattern that helps us judge whether a shared memory optimization is truly worth using.
Before we look at code, it helps to define what a good CUDA benchmark should do. As you may recall from previous units, GPU work is often asynchronous, so timing can be misleading if we measure carelessly.
A fair comparison should keep four things steady:
- The same input data for both kernels.
- The same matrix size and launch shape.
- Repeated runs instead of one quick measurement.
- Correctness checks for both outputs.
We also keep the matrix width at 530, which is not divisible by TILE_WIDTH = 16. That matters because it tests the edge-handling path, not just the easy middle of the matrix.
To measure improvement, we first need a baseline. The naive kernel computes one output element per thread, but every multiplication step reads directly from global memory.
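As a concrete reference, here is a minimal sketch of such a baseline. The kernel name matMulNaive is an illustrative choice; the structure is what matters: one thread per output element, with every operand read from global memory.

```cpp
__global__ void matMulNaive(const float* A, const float* B, float* C, int w) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;  // output row
    int c = blockIdx.x * blockDim.x + threadIdx.x;  // output column
    if (r < w && c < w) {                           // boundary guard
        float sum = 0.0f;
        for (int k = 0; k < w; ++k) {
            // Every term of the dot product is fetched from global memory.
            sum += A[r * w + k] * B[k * w + c];
        }
        C[r * w + c] = sum;
    }
}
```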
This kernel is simple, which makes it a good reference point. Each thread forms one dot product for C[r, c], so the logic is easy to follow. It is also boundary safe because of the if (r < w && c < w) guard. The main weakness is memory reuse: nearby threads often need the same values from A and B, but this version fetches them again and again from global memory.
Now we place the optimized version beside the baseline. This is the tiled kernel we built in the previous lesson, and now it becomes the second half of our benchmark.
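A sketch of that tiled kernel, consistent with what we built in the previous lesson, might look like this. The name matMulTiled is illustrative; TILE_WIDTH = 16 as before.

```cpp
#define TILE_WIDTH 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int w) {
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int r = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int c = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    // Ceiling division: one extra pass covers the partial tile when w = 530.
    int numTiles = (w + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE_WIDTH + threadIdx.x;
        int bRow = t * TILE_WIDTH + threadIdx.y;
        // Threads that would step past the matrix load 0.0f instead.
        tileA[threadIdx.y][threadIdx.x] = (r < w && aCol < w) ? A[r * w + aCol] : 0.0f;
        tileB[threadIdx.y][threadIdx.x] = (bRow < w && c < w) ? B[bRow * w + c] : 0.0f;
        __syncthreads();  // wait until both tiles are fully loaded

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // finish reading before the next iteration overwrites
    }
    if (r < w && c < w) C[r * w + c] = sum;  // boundary guard on the write
}
```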
The key idea is reuse. Instead of reading every needed value straight from global memory, each block loads a tile of A and a tile of B into shared memory, then reuses those values across many multiplications. The loop uses ceiling division, so even w = 530 works safely. Threads that would step past the matrix load 0.0f, which keeps the edge tiles correct without risking invalid reads.
With both kernels ready, we need a timing method we can trust. A host clock alone is not enough, because a kernel launch returns control to the CPU before the GPU has necessarily finished the work.
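Here is a minimal sketch of such a helper, assuming bench(...) accepts the launch as a callable plus a repeat count; the std::function signature is an assumption, not a requirement.

```cpp
#include <cuda_runtime.h>
#include <functional>

// Returns the average GPU time per launch in milliseconds.
float bench(const std::function<void()>& launch, int repeats) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                  // mark the start on the GPU timeline
    for (int i = 0; i < repeats; ++i) launch();
    cudaEventRecord(stop);                   // mark the end on the GPU timeline
    cudaEventSynchronize(stop);              // block until the GPU reaches 'stop'

    float totalMs = 0.0f;
    cudaEventElapsedTime(&totalMs, start, stop);  // elapsed milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return totalMs / repeats;                // average, not total
}
```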
This helper measures elapsed GPU time with CUDA events and returns the average time per launch, not the total time. In other words, it computes averageMs = totalMs / repeats. A few details matter here: both events are recorded on the GPU's own timeline, cudaEventSynchronize(stop) makes the host wait until the timed work has actually finished, and cudaEventElapsedTime reports the gap in milliseconds.
Next, we build the host side data and allocate memory on the GPU. This part makes the comparison controlled and easy to verify.
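A sketch of that setup, assuming host vectors hA, hB, hC (std::vector from <vector>) and one device output buffer per kernel; the names dCNaive and dCTiled are illustrative, since the lesson only fixes dA, dB, and w.

```cpp
const int w = 530;                              // deliberately not a multiple of 16
const size_t bytes = (size_t)w * w * sizeof(float);

std::vector<float> hA(w * w, 1.0f);             // every entry of A is 1.0f
std::vector<float> hB(w * w, 2.0f);             // every entry of B is 2.0f
std::vector<float> hC(w * w, 0.0f);             // reused to read results back

float *dA, *dB, *dCNaive, *dCTiled;
cudaMalloc(&dA, bytes);
cudaMalloc(&dB, bytes);
cudaMalloc(&dCNaive, bytes);                    // separate outputs so we can
cudaMalloc(&dCTiled, bytes);                    // verify each kernel on its own
cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
```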
There are two smart choices here. First, w = 530 forces boundary-safe behavior, because 530 is not a multiple of 16. Second, the input values are easy to reason about: every entry of A is 1.0f, and every entry of B is 2.0f. That means every correct output element should hold the same value, which makes validation simple and dependable.
For a fair test, both kernels must use the same matrix data and the same block shape. We also warm up the GPU before collecting timing numbers.
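Concretely, using the names from the sketches above:

```cpp
dim3 block(TILE_WIDTH, TILE_WIDTH);             // identical shape for both kernels
dim3 grid((w + TILE_WIDTH - 1) / TILE_WIDTH,    // ceiling division covers edge tiles
          (w + TILE_WIDTH - 1) / TILE_WIDTH);

// Warm-up launches absorb one-time startup costs before we measure anything.
matMulNaive<<<grid, block>>>(dA, dB, dCNaive, w);
matMulTiled<<<grid, block>>>(dA, dB, dCTiled, w);
cudaDeviceSynchronize();                        // timing starts from an idle GPU
```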
Using the same block and grid shapes keeps the comparison honest. The ceiling division in grid matches the matrix size, including the edge tiles. The warm-up launches are important because the first execution can include one-time startup costs that would make the benchmark less representative. After both warm-up launches, cudaDeviceSynchronize() ensures that timing begins only after the GPU has finished.
Now the benchmark helper does the repeated timing for us. We pass each kernel launch into bench(...) through a small lambda.
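Following the same sketch, with an illustrative repeat count:

```cpp
const int repeats = 20;  // illustrative; more repeats means less noise

float tNaive = bench([&] { matMulNaive<<<grid, block>>>(dA, dB, dCNaive, w); }, repeats);
float tTiled = bench([&] { matMulTiled<<<grid, block>>>(dA, dB, dCTiled, w); }, repeats);
```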
The capture list [&] lets each lambda reuse local variables from main(), such as grid, block, dA, dB, and w. This keeps the timing helper clean and generic. The result is two averages: tNaive for the baseline kernel, and tTiled for the shared memory version. If the optimization is helping, we expect tTiled to be smaller, sometimes much smaller, though the exact gap depends on the GPU.
A faster kernel is only useful if it still computes the right answer. That is why the program checks both outputs after the timing step.
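A sketch of that verification step, using the buffers from the setup above; naiveOk and tiledOk are illustrative names, and fabsf comes from <cmath>.

```cpp
float expected = 2.0f * w;  // each dot product: w terms of 1.0f * 2.0f

cudaMemcpy(hC.data(), dCNaive, bytes, cudaMemcpyDeviceToHost);
bool naiveOk = true;
for (int i = 0; i < w * w; ++i)
    if (fabsf(hC[i] - expected) > 1e-3f) { naiveOk = false; break; }

cudaMemcpy(hC.data(), dCTiled, bytes, cudaMemcpyDeviceToHost);
bool tiledOk = true;
for (int i = 0; i < w * w; ++i)
    if (fabsf(hC[i] - expected) > 1e-3f) { tiledOk = false; break; }
```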
The expected value comes from the input pattern. Each dot product has w terms, and every term is 1.0f * 2.0f = 2.0f, so the expected result is 2.0f * w = 1060.0f for every element. The code checks the naive output and the tiled output separately, which is exactly what we want in a comparison lesson. If one fails, we learn which version is wrong instead of treating both results as a single pass or fail.
Finally, we report the timings, print whether each output passed verification, compute the speedup, and release GPU memory.
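Continuing the sketch, with printf from <cstdio>:

```cpp
printf("Naive kernel: %.3f ms per launch (%s)\n", tNaive, naiveOk ? "PASS" : "FAIL");
printf("Tiled kernel: %.3f ms per launch (%s)\n", tTiled, tiledOk ? "PASS" : "FAIL");

// Only report a speedup when both results are trustworthy.
if (naiveOk && tiledOk && tTiled > 0.0f)
    printf("Speedup: %.2fx\n", tNaive / tTiled);

cudaFree(dA);
cudaFree(dB);
cudaFree(dCNaive);
cudaFree(dCTiled);
```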
This last block turns raw measurements into a clear comparison. The most important lines are the two output checks, because they tell us whether the numbers are worth believing. The speedup is only printed when both kernels are correct and tTiled is nonzero, which avoids meaningless output. The console format will look like this, though the timing values depend on the GPU and current system load:
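The numbers below are invented purely to illustrate the shape of that output; your own timings will differ.

```
Naive kernel: 4.813 ms per launch (PASS)
Tiled kernel: 1.241 ms per launch (PASS)
Speedup: 3.88x
```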
In this lesson, we completed the full comparison workflow for a shared memory optimization. We kept the setup fair, timed both kernels with CUDA events, repeated launches to reduce noise, used a matrix size that exercises boundary-safe logic, and checked that both outputs were correct before trusting the speedup.
That is the big lesson here: performance analysis is not only about measuring time; it is about measuring the right thing, under the right conditions, while protecting correctness. In the practice section ahead, we will turn this pattern into hands-on work so we can benchmark CUDA kernels with much more confidence and independence.
