Introduction

Welcome back to Shared Memory Optimization. We are now at Lesson 3, which means we already have two key tools in hand: we know how to place data in shared memory, and we know how to make threads wait safely with __syncthreads(). In this lesson, we will combine those ideas into a full tiled matrix multiplication.

Our goal is practical: we will build a kernel that processes a matrix in multiple tiles, handles edge cases safely, and still produces the correct answer when the matrix width is not divisible by the tile width. By the end, we will have a complete pattern that we can reuse in many CUDA programs.

Why Tiling Needs Boundary Safety

A matrix multiplication entry is computed as:

C[row, col] = \sum_{k=0}^{width-1} A[row, k] \times B[k, col]

Tiling splits this sum into chunks of TILE_WIDTH. When width is not a multiple of TILE_WIDTH, the final chunk in each direction reaches past the edge of the matrix, so every shared memory load and every output write near the boundary needs a guard. That is the safety problem this lesson addresses.

Mapping A Block To One Output Tile

Now we begin the kernel. Each block will compute one TILE_WIDTH x TILE_WIDTH region of the output matrix C. To do that, the block needs one shared tile from A, one shared tile from B, and a private accumulator in each thread.

There are two coordinate systems here, and both matter. tx and ty tell us where a thread lives inside its block; row and col tell us which output element that thread is responsible for in the full matrix. The variable value will collect the dot product for C[row, col] over several tile iterations.
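
The sketch below shows how that setup might look in code. The names s_A, s_B, tx, ty, row, col, and value match the description above; the exact kernel signature, the kernel name tiledMatMul, and a TILE_WIDTH of 16 are assumptions based on the 16 x 16 blocks used later in this lesson.

```cpp
#define TILE_WIDTH 16

// Assumed kernel signature: square matrices of size width x width, row-major layout
__global__ void tiledMatMul(const float *A, const float *B, float *C, int width) {
    // One shared tile from A and one from B, reused by every thread in the block
    __shared__ float s_A[TILE_WIDTH][TILE_WIDTH];
    __shared__ float s_B[TILE_WIDTH][TILE_WIDTH];

    // Where this thread lives inside its block
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Which output element this thread is responsible for in the full matrix
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;

    // Private accumulator for the dot product of C[row, col]
    float value = 0.0f;

    // ... the tile loop follows in the next section ...
```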

Planning The Tile Loop And Guarded Loads

The next step is deciding how many tile pairs we must process. Because the width may not be evenly divisible by TILE_WIDTH, we use ceiling division. Then, in each loop iteration, every thread loads one value from A and one from B into shared memory.

This is the heart of the boundary-safe design. For width = 34, numTiles becomes 3. On the last iteration, some threads ask for valid matrix elements, while others would step outside the matrix. The ternary expressions prevent bad reads by writing 0.0f into those unused shared memory positions. After that, __syncthreads() ensures the whole tile is ready before any thread starts using it.
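
A sketch of the tile loop with guarded loads might look like this. It continues the kernel from the previous snippet; the helper index names aCol and bRow are illustrative.

```cpp
    // Ceiling division: enough tiles to cover the width even when it is not a multiple of TILE_WIDTH
    int numTiles = (width + TILE_WIDTH - 1) / TILE_WIDTH;

    for (int t = 0; t < numTiles; t++) {
        // Column of A and row of B that this thread should load for the current tile
        int aCol = t * TILE_WIDTH + tx;
        int bRow = t * TILE_WIDTH + ty;

        // Guarded loads: out-of-range positions receive 0.0f instead of reading past the matrix
        s_A[ty][tx] = (row < width && aCol < width) ? A[row * width + aCol] : 0.0f;
        s_B[ty][tx] = (bRow < width && col < width) ? B[bRow * width + col] : 0.0f;

        // Wait until both tiles are fully loaded before any thread uses them
        __syncthreads();
```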

Doing The Tile Computation Safely

Once both shared tiles are fully loaded, each thread multiplies across the current tile and adds the result into its running sum. Then we need a second synchronization point before the next iteration starts.

The inner loop always runs for TILE_WIDTH steps, even on the last partial tile. That works because any missing values were replaced with 0.0f, so they contribute nothing to the sum. The second __syncthreads() is just as important as the first one: it prevents one thread from overwriting s_A or s_B for the next tile while another thread is still reading the current tile.
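
Continuing inside the same loop, a sketch of the computation and the second synchronization point could look like this:

```cpp
        // Multiply across the current tile; padded zeros contribute nothing to the sum
        for (int k = 0; k < TILE_WIDTH; k++) {
            value += s_A[ty][k] * s_B[k][tx];
        }

        // Make sure every thread is done reading s_A and s_B before the next tile overwrites them
        __syncthreads();
    }  // end of the tile loop
```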

Writing Only Valid Output Elements

After all tile pairs have been processed, each thread has its final dot product in value. Even then, we still need one last bounds check because some threads in the edge blocks correspond to positions outside the real matrix.

This final guard protects the output write. For a 34 x 34 matrix with 16 x 16 blocks, the grid covers a 48 x 48 region in total. That means many edge threads are useful for cooperation during loading, but they do not own valid output elements. The if statement ensures only real matrix positions are written back to global memory.
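
The guarded write that closes the kernel might look like this:

```cpp
    // Only threads that map to a real element of C write a result
    if (row < width && col < width) {
        C[row * width + col] = value;
    }
}  // end of the kernel
```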

Preparing The Host Side Test

With the kernel complete, we can set up a test in main(). The matrix width is deliberately chosen as 34, so we can prove that the tiled approach works even when the size does not line up neatly with the tile width.

The input values make verification easy. Every entry of A is 1.0f and every entry of B is 2.0f; therefore, each product inside a dot product is 2.0f. We also allocate device memory for all three matrices, then copy only A and B to the GPU, because C will be produced by the kernel.
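
A host-side setup along these lines would match that description. The use of std::vector for the host matrices is an assumption, and error checks on the CUDA calls are omitted to keep the sketch short.

```cpp
// At the top of the file: #include <cstdio>, #include <vector>, #include <cuda_runtime.h>
int main() {
    const int width = 34;                          // deliberately not a multiple of TILE_WIDTH
    const size_t bytes = width * width * sizeof(float);

    // Host matrices: every entry of A is 1.0f and every entry of B is 2.0f
    std::vector<float> h_A(width * width, 1.0f);
    std::vector<float> h_B(width * width, 2.0f);
    std::vector<float> h_C(width * width, 0.0f);

    // Device memory for all three matrices
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // Only A and B are copied to the GPU; C will be produced by the kernel
    cudaMemcpy(d_A, h_A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B.data(), bytes, cudaMemcpyHostToDevice);
```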

Launching Enough Work For The Whole Matrix

Next, we choose the block and grid dimensions. The block is a single 16 x 16 tile, and the grid uses the same ceiling division logic as the kernel so that every output position is covered, including the edge region.

For width = 34, both grid dimensions become 3, so we launch a 3 x 3 set of blocks. That gives us full coverage of the matrix, plus extra edge threads that are safely handled by the guards inside the kernel. As in the previous lesson, cudaGetLastError() checks launch issues, and cudaDeviceSynchronize() waits for the kernel to finish before we copy results back.
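
A sketch of the launch configuration, assuming the kernel name from the earlier snippet:

```cpp
    // One 16 x 16 block per output tile, with ceiling division so the edge region is covered
    dim3 blockDim(TILE_WIDTH, TILE_WIDTH);
    dim3 gridDim((width + TILE_WIDTH - 1) / TILE_WIDTH,
                 (width + TILE_WIDTH - 1) / TILE_WIDTH);   // 3 x 3 blocks when width = 34

    tiledMatMul<<<gridDim, blockDim>>>(d_A, d_B, d_C, width);

    // Report launch problems, then wait for the kernel to finish before copying results back
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();
```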

Verifying The Numbers And Cleaning Up

Now we confirm that the result is correct. Since each output entry is the sum of 34 products, and each product is 1.0f * 2.0f = 2.0f, the expected value is 2.0f * width, which is 68.0f. After checking every value, the program prints the result, frees device memory, and returns a status code.
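
The verification and cleanup could look like the following sketch; the exact wording of the printed messages is an assumption.

```cpp
    // Copy the result back and compare every entry against the expected 2.0f * width = 68.0f
    cudaMemcpy(h_C.data(), d_C, bytes, cudaMemcpyDeviceToHost);

    bool correct = true;
    for (int i = 0; i < width * width; i++) {
        if (h_C[i] != 2.0f * width) {              // 34 products of 1.0f * 2.0f each
            correct = false;
            break;
        }
    }
    printf(correct ? "Tiled matrix multiplication result is correct!\n"
                   : "Result mismatch detected.\n");

    // Free device memory and return a status code
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return correct ? 0 : 1;
}
```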

This final check is small, but it tells us a lot. If even one tile step, one synchronization point, or one bounds guard were wrong, the comparison would fail. When everything is correct, the program prints its success message, confirming that the full tiled computation works, including the difficult edge tiles.

Conclusion and Next Steps

In this lesson, we turned the ideas from the first two lessons into a complete tiled matrix multiplication. We used shared memory for reuse, __syncthreads() for safe cooperation, ceiling division for the tile count, and guarded loads and stores for boundary safety. Most importantly, we verified the result on a matrix size that is not evenly divisible by the tile width.

This is a very common CUDA pattern, and it is worth knowing well. In the upcoming practice section, we will reinforce it step by step, so we can write tiled kernels with much more confidence and independence.
