Welcome back to 2D Grids and Matrix Math. We are now in lesson 2, which means we already have an important tool from the course: mapping a 2D thread layout to matrix rows and columns. In the last lesson, we used that idea to place values into their correct positions. In this lesson, we will use the same mapping for something much more useful: naive matrix multiplication. By the end, we will have a CUDA kernel where each thread computes one element of an output matrix.
Before we look at the code, let's build a mental model. If matrix A has shape rowsA x colsA and matrix B has shape colsA x colsB, then the result matrix C has shape rowsA x colsB.
Each output entry is a dot product:
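Written out with the shapes above, the entry at position (row, col) of C is:

C[row][col] = A[row][0] * B[0][col] + A[row][1] * B[1][col] + ... + A[row][colsA - 1] * B[colsA - 1][col]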
We create a small test case on the CPU to understand the math before scaling up. Matrix A is 2 x 3 and matrix B is 3 x 2, which results in a 2 x 2 matrix C:
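As a concrete sketch, the host-side setup might look like the following. The exact numeric values are an assumption, chosen so that the top-left result works out to the hand-computed 58 referenced at the end of the lesson:

```cpp
// Host-side test matrices, stored in row-major order as flat arrays.
// A is 2 x 3, B is 3 x 2, so C = A * B is 2 x 2.
const int rowsA = 2, colsA = 3, colsB = 2;

float h_A[rowsA * colsA] = { 1.0f, 2.0f, 3.0f,
                             4.0f, 5.0f, 6.0f };
float h_B[colsA * colsB] = {  7.0f,  8.0f,
                              9.0f, 10.0f,
                             11.0f, 12.0f };

// Hand-computed expected result, e.g. C[0][0] = 1*7 + 2*9 + 3*11 = 58.
float h_expected[rowsA * colsB] = {  58.0f,  64.0f,
                                    139.0f, 154.0f };
float h_C[rowsA * colsB] = { 0.0f };
```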
Now, we define the kernel. As you may recall from the previous lesson, 2D indexing provides each thread with a global row and col. The key difference now is what those coordinates mean. They are no longer just labels for memory positions; they tell the thread which element of the output matrix C it owns.
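A minimal sketch of such a kernel is shown below; the exact signature is an assumption, but the names matrixMultiply, rowsA, colsA, and colsB follow the terms used in this lesson:

```cpp
// Each thread computes one element C[row][col] of the output matrix.
__global__ void matrixMultiply(const float* A, const float* B, float* C,
                               int rowsA, int colsA, int colsB) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y-direction -> rows
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x-direction -> columns

    if (row < rowsA && col < colsB) {                 // guard against extra threads
        float sum = 0.0f;
        for (int k = 0; k < colsA; ++k) {
            // Walk across row `row` of A and down column `col` of B.
            sum += A[row * colsA + k] * B[k * colsB + col];
        }
        C[row * colsB + col] = sum;                   // write the finished dot product
    }
}
```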
A few design choices are worth noting. The input pointers are const because the kernel reads from A and B but does not change them. Also, the y-direction maps to rows, and the x-direction maps to columns, which matches the pattern we built in lesson 1.
Once a thread knows its output coordinate, it can compute the dot product for that specific cell. A bounds check is still necessary because the grid may be rounded up, launching a few extra threads. Inside the loop, k walks across one row of A and one column of B. This version is called naive because every thread reads directly from global memory, with no shared memory optimization yet.
The three index formulas are the heart of the kernel:
- A[row * colsA + k]: the element from row row of A;
- B[k * colsB + col]: the element from column col of B;
- C[row * colsB + col]: the output location for the finished sum.
This is exactly where 2D matrix thinking meets the 1D memory layout we covered in the previous lesson.
With the kernel ready, we can prepare the device-side data. We allocate three buffers on the GPU: one for d_A, one for d_B, and one for d_C. Then, we copy only the input matrices from the host to the device. The output matrix does not require an initial copy because the kernel will write all its values.
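A sketch of that step, continuing the example above (per-call error checking omitted for brevity):

```cpp
float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;

size_t bytesA = rowsA * colsA * sizeof(float);
size_t bytesB = colsA * colsB * sizeof(float);
size_t bytesC = rowsA * colsB * sizeof(float);

// Allocate the three device buffers.
cudaMalloc((void**)&d_A, bytesA);
cudaMalloc((void**)&d_B, bytesB);
cudaMalloc((void**)&d_C, bytesC);

// Copy only the inputs; d_C will be fully written by the kernel.
cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);
```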
The launch configuration follows the same 2D structure from the previous lesson, but now the grid is shaped around the output matrix. Since C has rowsA rows and colsB columns, we need enough threads to cover those dimensions. The x-direction spans output columns, and the y-direction spans output rows.
The round-up formula is important:
(size + blockSize - 1) / blockSize ensures we launch enough blocks, even when the matrix size is not a perfect multiple of the block size.
For this tiny 2 x 2 output, one block is enough. Still, this exact formula also works for much larger rectangular matrices.
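One possible launch configuration is sketched below; the 16 x 16 block shape is an assumed choice, not something the lesson mandates:

```cpp
// 16 x 16 threads per block is a common, assumed choice here.
dim3 blockSize(16, 16);

// Round up so the grid covers every output element:
// x spans output columns (colsB), y spans output rows (rowsA).
dim3 gridSize((colsB + blockSize.x - 1) / blockSize.x,
              (rowsA + blockSize.y - 1) / blockSize.y);
// For the tiny 2 x 2 example, this is a single 1 x 1 grid of blocks.
```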
At this point, everything is ready. We launch the matrixMultiply kernel, check for launch errors with cudaGetLastError, wait for the device to finish with cudaDeviceSynchronize, and copy the result back to the host. Then, we compare each output value against h_expected. Notice that the check uses std::fabs from the <cmath> header with a small tolerance of 1e-5f, which is a good habit for floating-point results. Finally, we print a success message, show the top-left value h_C[0], free device memory with cudaFree, and return a status code.
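Putting those steps together, a sketch of the host-side wrap-up could look like this, assuming <cstdio> and <cmath> are included; the exact message text and return values are assumptions:

```cpp
matrixMultiply<<<gridSize, blockSize>>>(d_A, d_B, d_C, rowsA, colsA, colsB);

// Catch configuration/launch errors, then wait for the kernel to finish.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::printf("Launch failed: %s\n", cudaGetErrorString(err));
    return 1;
}
cudaDeviceSynchronize();

// Copy the result back and compare against the hand-computed values.
cudaMemcpy(h_C, d_C, bytesC, cudaMemcpyDeviceToHost);

bool ok = true;
for (int i = 0; i < rowsA * colsB; ++i) {
    if (std::fabs(h_C[i] - h_expected[i]) > 1e-5f) { ok = false; break; }
}

if (ok) {
    std::printf("Matrix multiplication verified. C[0] = %.1f\n", h_C[0]);
}

// Release device memory and report a status code.
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
return ok ? 0 : 1;
```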
The program produces:
This output tells us two useful things at once: the full verification passed and the first element matches the hand-computed value 58. Together, they provide strong evidence that the indexing and dot product logic are correct.
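With the sample values sketched earlier, that hand computation would be C[0] = 1*7 + 2*9 + 3*11 = 58, which is exactly the dot product the kernel's loop performs for row 0, column 0.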
In this lesson, we turned 2D thread indexing into a real matrix operation. We defined matrix shapes for general rectangular multiplication, mapped each thread to one output element, used a loop over k to compute a dot product, launched a 2D grid based on the output matrix, and verified the result on the host.
This kernel is called naive for a reason: it is correct and clear, but not yet optimized. This is a great place to be while learning because we now understand the full flow from math to memory to execution. In the practice section ahead, we will reinforce this pattern so we can write the kernel confidently on our own.
