Introduction

Welcome to Shared Memory Optimization. We are at Lesson 1, so this is the perfect time to build a strong foundation for the rest of the course. In this lesson, we will focus on shared memory declaration, one of the most useful ideas in CUDA. The goal is practical: we will create a small matrix multiplication kernel that copies data from global memory into a shared tile, reuses that tile inside the block, and produces a result we can verify with confidence.

Understanding the Memory Hierarchy

Before we dive into shared memory, we need to distinguish between the different storage areas available in a CUDA program. So far, we have treated the GPU as a single block of "Device Memory," but the reality is a hierarchy with significant trade-offs in speed and scope.

  1. Host Memory (RAM): This is your system memory. It is large (gigabytes) but physically separated from the GPU. Data must travel across the PCIe bus via cudaMemcpy to reach the device. This is the slowest transfer in your application.
  2. Global Memory (VRAM): This is the "Device Memory" we have used previously. It resides on the GPU board but off-chip from the actual processing cores. While it is much faster than the PCIe bus, it still has high latency—it takes many clock cycles for a thread to fetch data from here. It is persistent and visible to all threads in all blocks.
  3. Shared Memory (SRAM): This is the focus of our lesson. It is located on-chip, physically sitting next to the processing cores. It is orders of magnitude faster than global memory because it avoids the off-chip round trip. However, it is tiny (kilobytes) and its scope is limited: it is only visible to threads within the same block and is destroyed once the block finishes.

In short, Global Memory is like a massive warehouse at the edge of town (high capacity, long travel time), while Shared Memory is a small workbench right in front of the worker (tiny capacity, instant access). Optimization in CUDA usually involves moving data from the "warehouse" to the "workbench" to minimize travel time.

Shared Memory vs. Cache

You may notice that Shared Memory sounds similar to a CPU or GPU cache, and that intuition is partially right: both live on-chip for speed, and both use SRAM hardware. The difference is control:

  • Cache (L1/L2): hardware-managed. The GPU automatically decides what to store based on recent usage; you cannot force a specific value to stay in cache.
  • Shared Memory: software-managed. You (the programmer) explicitly move data into it and decide exactly how long it stays.

Because you have total control, you can guarantee that data is available exactly when needed, which is more reliable for high-performance math than relying on the hardware's "best guess" cache logic.

The Tiling Strategy

Because shared memory is fast but very small, we usually cannot place an entire matrix into it. Instead, we divide the data into small fixed-size chunks called tiles.

A tile is just a small sub-region of a matrix, such as a 16 x 16 block of values. The idea is simple:

  1. Load a small chunk from global memory into shared memory.
  2. Let all threads in the block reuse that chunk many times.
  3. Avoid repeated slow global-memory reads during the computation.

For this lesson, we intentionally keep the example small: the matrix width is at most TILE_WIDTH, so the whole problem fits inside a single shared tile. That means we are not yet looping over multiple tiles across a large matrix. Even so, using the word tile now is still helpful, because the exact same pattern will scale later when one block processes just one piece of a much larger matrix.

So before we look at the kernel code, keep this mental model in mind:

  • Global memory holds the full matrices.
  • Shared memory holds only the small working chunk needed by the block.
  • The block’s threads cooperate to fill that shared chunk, then reuse it for the computation.

Declaring Shared Memory In The Kernel

Now that we know we want to work with small shared-memory tiles, we can declare them directly inside the kernel with the __shared__ keyword. These arrays are visible to all threads in the same block and exist only for the lifetime of that block.

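A minimal sketch of what these declarations can look like is shown below. The kernel name matMulShared is an illustrative assumption; the buffer and index names (tileA, tileB, tx, ty, row, col) are the ones discussed in this lesson.

```cpp
#define TILE_WIDTH 16

// Sketch: kernel signature and shared-memory tile declarations.
__global__ void matMulShared(const float *A, const float *B, float *C, int width) {
    // On-chip buffers, visible only to threads in this block,
    // and released when the block finishes.
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    // Local (tile) coordinates and global (matrix) coordinates.
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;  // equals ty with a single block
    int col = blockIdx.x * TILE_WIDTH + tx;  // equals tx with a single block

    // Loading, synchronization, and the compute loop follow in the next sections.
```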
There are several important architectural details in this snippet:

  • Memory Modifiers: A and B are marked const because the kernel only reads them, while C is the writable output in global memory.
  • Shared Buffers: tileA and tileB are our reusable on-chip buffers. Because this example uses a single block where width <= TILE_WIDTH, the thread indices directly map to our coordinates.
  • Local vs. Global Indexing: Note the distinction between tx/ty and row/col. We use tx and ty specifically for indexing the local shared tiles, while row and col represent coordinates in the full matrices stored in global memory. Separating these variables distinguishes the logic for the "workbench" (shared memory) from the "warehouse" (global memory), which is essential once you scale to multiple blocks.

Loading Values Into Shared Tiles

Once the shared arrays exist, each thread loads one value from A and one value from B into the tile. This is the moment when data moves from slower global memory into faster shared memory.

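A sketch of that loading step, continuing the kernel body from the previous section:

```cpp
    // Each thread copies one element of A and one element of B from global
    // memory into the shared tiles. Threads that fall outside the matrix
    // store 0.0f instead (zero-padding).
    if (row < width && col < width) {
        tileA[ty][tx] = A[row * width + col];
        tileB[ty][tx] = B[row * width + col];
    } else {
        tileA[ty][tx] = 0.0f;
        tileB[ty][tx] = 0.0f;
    }
```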
The bounds checks matter because our block has 16 x 16 threads, but the matrix might only be 8 x 8. Threads outside the valid matrix region must not read invalid addresses. Instead, they store 0.0f into shared memory. This zero-padding keeps the tile safe to use, and it mirrors a pattern we will also need later in larger tiled-matrix algorithms.

Synchronizing And Reusing The Tile

After loading data, we must make sure every thread in the block has finished writing its shared values before any thread starts reading them. That is the job of __syncthreads(). Only then is the tile ready for reuse inside the multiplication loop.

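Continuing the kernel sketch, the barrier and the reuse loop can look like this:

```cpp
    // Barrier: wait until every thread in the block has written its tile values.
    __syncthreads();

    // Walk across a row of tileA and down a column of tileB,
    // reading repeatedly from fast shared memory instead of global memory.
    float sum = 0.0f;
    for (int k = 0; k < TILE_WIDTH; ++k) {
        sum += tileA[ty][k] * tileB[k][tx];
    }

    // Only threads that map to a valid matrix element write a result.
    if (row < width && col < width) {
        C[row * width + col] = sum;
    }
}  // end of the kernel sketch
```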
This is where shared memory becomes valuable. Each thread computes one output element by walking across a row of tileA and down a column of tileB. The important part is reuse: values were loaded from global memory once, then read again and again from shared memory inside the k loop. The final bounds check ensures that only valid threads write into C.

Creating A Small Host Test

Next, we build a simple test case on the CPU side. The matrix width is 8, which fits inside our 16 x 16 tile. That is a deliberate choice: we want to focus on shared memory itself, not on managing multiple tiles across many blocks yet.

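A host-side sketch of that setup is below. Only h_A, h_B, h_C, and the width of 8 come from the lesson; the exact initialization loops are an assumption:

```cpp
    const int width = 8;                               // fits inside one 16 x 16 tile
    const size_t bytes = width * width * sizeof(float);

    float h_A[width * width], h_B[width * width], h_C[width * width];
    for (int i = 0; i < width * width; ++i) {
        h_A[i] = (float)i;    // increasing values, so mistakes stand out
        h_B[i] = 0.0f;
        h_C[i] = 0.0f;
    }
    for (int i = 0; i < width; ++i) {
        h_B[i * width + i] = 1.0f;                     // identity matrix
    }
```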
This setup makes verification easy:

  1. h_A contains increasing values, so errors are easy to spot.
  2. h_B is the identity matrix, so mathematically A × I = A.
  3. h_C starts as zeros and will hold the result copied back from the device.

Allocating Device Memory And Launching

With the host data ready, we can prepare GPU memory and launch the kernel. The program allocates space for the input and output matrices on the device, copies the host inputs over, and then runs a single 16 x 16 block.

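A sketch of that allocation and launch sequence (the device pointer names d_A, d_B, d_C and the kernel name matMulShared are assumptions carried over from the earlier sketches):

```cpp
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    dim3 dimGrid(1, 1);                      // exactly one block
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);   // one thread per tile position
    matMulShared<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, width);

    cudaError_t err = cudaGetLastError();    // catches launch issues
    if (err != cudaSuccess) {
        printf("Launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();                 // waits for the kernel, surfaces runtime errors
```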
This launch code reflects the kernel assumptions directly. dimGrid(1, 1) means we use exactly one block, and dimBlock(TILE_WIDTH, TILE_WIDTH) creates one thread per tile position. The two checks after launch are important: cudaGetLastError() catches launch issues, and cudaDeviceSynchronize() waits for the kernel to finish and surfaces runtime errors.

Validating The Result

The last step is to bring the computed matrix back to the host, compare it with the expected answer, and clean up. Since floating-point results can differ by tiny amounts, the comparison uses a small tolerance instead of exact equality.

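A sketch of that final step, assuming <cstdio> and <cmath> are included; the tolerance value and printed messages are illustrative assumptions:

```cpp
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // Since B is the identity matrix, h_C should equal h_A element by element.
    bool ok = true;
    for (int i = 0; i < width * width; ++i) {
        if (fabsf(h_C[i] - h_A[i]) > 1e-5f) {   // small tolerance, not exact equality
            ok = false;
            break;
        }
    }
    if (ok) {
        printf("Result is correct!\n");
    } else {
        printf("Result mismatch!\n");
    }

    // Clean up device memory.
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
```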
Because B is the identity matrix, the correct output should match A element by element. If the shared memory loads, synchronization, and reuse all work correctly, the comparison succeeds and the program reports that the result is correct.

Conclusion and Next Steps

In this lesson, we declared shared memory inside a CUDA kernel, loaded one tile of data from global memory, synchronized the block, and reused those values to compute a correct matrix product. We also connected the code to the bigger idea: shared memory helps reduce repeated global memory reads by giving threads in a block a fast common workspace. In the practice section ahead, we will reinforce this pattern with hands-on tasks so that shared-tile programming starts to feel natural and dependable.
