Welcome back to 2D Grids and Matrix Math. We are now in lesson 2, which means we already have an important tool from the course: mapping a 2D thread layout to matrix rows and columns. In the last lesson, we used that idea to place values into their correct positions. In this lesson, we will use the same mapping for something much more useful: naive matrix multiplication. By the end, we will have a CUDA kernel where each thread computes one element of an output matrix.
Before we look at the code, let's build a mental model. If matrix A has shape rowsA x colsA and matrix B has shape colsA x colsB, then the result matrix C has shape rowsA x colsB.
Each output entry is a dot product:
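Written out with the shapes above, the entry at position (row, col) of C is:

C[row][col] = A[row][0] * B[0][col] + A[row][1] * B[1][col] + ... + A[row][colsA - 1] * B[colsA - 1][col]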
We create a small test case on the CPU to understand the math before scaling up. Matrix A is 2 x 3 and matrix B is 3 x 2, which results in a 2 x 2 matrix C:
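As a concrete sketch, the host-side setup might look like the following. The exact numeric values are an assumption, chosen so that the top-left result works out to the hand-computed 58 referenced at the end of the lesson:

```cpp
// Host-side test matrices, stored in row-major order as flat arrays.
// A is 2 x 3, B is 3 x 2, so C = A * B is 2 x 2.
const int rowsA = 2, colsA = 3, colsB = 2;

float h_A[rowsA * colsA] = { 1.0f, 2.0f, 3.0f,
                             4.0f, 5.0f, 6.0f };
float h_B[colsA * colsB] = {  7.0f,  8.0f,
                              9.0f, 10.0f,
                             11.0f, 12.0f };

// Hand-computed expected result, e.g. C[0][0] = 1*7 + 2*9 + 3*11 = 58.
float h_expected[rowsA * colsB] = {  58.0f,  64.0f,
                                    139.0f, 154.0f };
float h_C[rowsA * colsB] = { 0.0f };
```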
Now, we define the kernel. As you may recall from the previous lesson, 2D indexing provides each thread with a global row and col. The key difference now is what those coordinates mean. They are no longer just labels for memory positions; they tell the thread which element of the output matrix C it owns.
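A minimal sketch of such a kernel is shown below; the exact signature is an assumption, but the names matrixMultiply, rowsA, colsA, and colsB follow the terms used in this lesson:

```cpp
// Each thread computes one element C[row][col] of the output matrix.
__global__ void matrixMultiply(const float* A, const float* B, float* C,
                               int rowsA, int colsA, int colsB) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y-direction -> rows
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x-direction -> columns

    if (row < rowsA && col < colsB) {                 // guard against extra threads
        float sum = 0.0f;
        for (int k = 0; k < colsA; ++k) {
            // Walk across row `row` of A and down column `col` of B.
            sum += A[row * colsA + k] * B[k * colsB + col];
        }
        C[row * colsB + col] = sum;                   // write the finished dot product
    }
}
```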
A few design choices are worth noting. The input pointers are const because the kernel reads from A and B but does not change them. Also, the y-direction maps to rows, and the x-direction maps to columns, which matches the pattern we built in lesson 1.
Once a thread knows its output coordinate, it can compute the dot product for that specific cell. A bounds check is still necessary because the grid may be rounded up, launching a few extra threads. Inside the loop, k walks across one row of A and one column of B. This version is called naive because every thread reads directly from global memory, with no shared memory optimization yet.
The three index formulas are the heart of the kernel:
- A[row * colsA + k]: the element from row row of A;
- B[k * colsB + col]: the element from column col of B;
- C[row * colsB + col]: the output location for the finished sum.
This is exactly where 2D matrix thinking meets the 1D memory layout we covered in the previous lesson.
With the kernel ready, we can prepare the device-side data. We allocate three buffers on the GPU: one for d_A, one for d_B, and one for d_C. Then, we copy only the input matrices from the host to the device. The output matrix does not require an initial copy because the kernel will write all its values.
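A sketch of that step, continuing the example above (per-call error checking omitted for brevity):

```cpp
float *d_A = nullptr, *d_B = nullptr, *d_C = nullptr;

size_t bytesA = rowsA * colsA * sizeof(float);
size_t bytesB = colsA * colsB * sizeof(float);
size_t bytesC = rowsA * colsB * sizeof(float);

// Allocate the three device buffers.
cudaMalloc((void**)&d_A, bytesA);
cudaMalloc((void**)&d_B, bytesB);
cudaMalloc((void**)&d_C, bytesC);

// Copy only the inputs; d_C will be fully written by the kernel.
cudaMemcpy(d_A, h_A, bytesA, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, bytesB, cudaMemcpyHostToDevice);
```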
The launch configuration follows the same 2D structure from the previous lesson, but now the grid is shaped around the output matrix. Since C has rowsA rows and colsB columns, we need enough threads to cover those dimensions. The x-direction spans output columns, and the y-direction spans output rows.
The round-up formula is important:
(size + blockSize - 1) / blockSize ensures we launch enough blocks, even when the matrix size is not a perfect multiple of the block size.
For this tiny 2 x 2 output, one block is enough. Still, this exact formula also works for much larger rectangular matrices.
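One possible launch configuration is sketched below; the 16 x 16 block shape is an assumed choice, not something the lesson mandates:

```cpp
// 16 x 16 threads per block is a common, assumed choice here.
dim3 blockSize(16, 16);

// Round up so the grid covers every output element:
// x spans output columns (colsB), y spans output rows (rowsA).
dim3 gridSize((colsB + blockSize.x - 1) / blockSize.x,
              (rowsA + blockSize.y - 1) / blockSize.y);
// For the tiny 2 x 2 example, this is a single 1 x 1 grid of blocks.
```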
At this point, everything is ready. We launch the matrixMultiply kernel, check for launch errors with cudaGetLastError, wait for the device to finish with cudaDeviceSynchronize, and copy the result back to the host. Then, we compare each output value against h_expected. Notice that the check uses std::fabs from the <cmath> header with a small tolerance of 1e-5f, which is a good habit for floating-point results. Finally, we print a success message, show the top-left value h_C[0], free device memory with cudaFree, and return a status code.
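Putting those steps together, a sketch of the host-side wrap-up could look like this, assuming <cstdio> and <cmath> are included; the exact message text and return values are assumptions:

```cpp
matrixMultiply<<<gridSize, blockSize>>>(d_A, d_B, d_C, rowsA, colsA, colsB);

// Catch configuration/launch errors, then wait for the kernel to finish.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::printf("Launch failed: %s\n", cudaGetErrorString(err));
    return 1;
}
cudaDeviceSynchronize();

// Copy the result back and compare against the hand-computed values.
cudaMemcpy(h_C, d_C, bytesC, cudaMemcpyDeviceToHost);

bool ok = true;
for (int i = 0; i < rowsA * colsB; ++i) {
    if (std::fabs(h_C[i] - h_expected[i]) > 1e-5f) { ok = false; break; }
}

if (ok) {
    std::printf("Matrix multiplication verified. C[0] = %.1f\n", h_C[0]);
}

// Release device memory and report a status code.
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
return ok ? 0 : 1;
```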
The program produces:
This output tells us two useful things at once: the full verification passed and the first element matches the hand-computed value 58. Together, they provide strong evidence that the indexing and dot product logic are correct.
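With the sample values sketched earlier, that hand computation would be C[0] = 1*7 + 2*9 + 3*11 = 58, which is exactly the dot product the kernel's loop performs for row 0, column 0.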
In this lesson, we turned 2D thread indexing into a real matrix operation. We defined matrix shapes for general rectangular multiplication, mapped each thread to one output element, used a loop over k to compute a dot product, launched a 2D grid based on the output matrix, and verified the result on the host.
This kernel is called naive for a reason: it is correct and clear, but not yet optimized. This is a great place to be while learning because we now understand the full flow from math to memory to execution. In the practice section ahead, we will reinforce this pattern so we can write the kernel confidently on our own.
