Introduction

Welcome back to Shared Memory Optimization. In the previous lesson, we already used shared memory and included a __syncthreads() call so the kernel would behave correctly. That means this lesson is not about introducing the barrier from scratch. Instead, we will build on it with a more interesting question:

What useful thing can threads do once they can safely share data?

A natural next step is to use shared memory not just for reuse, but also for reordering data cooperatively. In this lesson, we will write a small matrix transpose kernel for a single tile. Each thread will load one matrix element into shared memory, wait at the barrier, and then write out the transposed value by reading from a different shared-memory coordinate.

By the end of the lesson, we will have a complete kernel that uses one shared tile and one __syncthreads() to perform a correct block-local transpose.

Why Threads Still Need A Meeting Point

Even though all threads in a block are launched together, they do not all move in lockstep. CUDA executes threads in groups of 32 called warps, and different warps may advance at different rates.

That matters because one thread may try to read from shared memory before another thread has finished writing the value it needs.

For a transpose, the dependency is especially clear:

  • Thread (ty, tx) loads A[ty][tx] into tile[ty][tx]
  • Later, thread (ty, tx) wants to read tile[tx][ty]
  • But tile[tx][ty] was written by thread (tx, ty)

So each thread depends on a partner thread at the transposed coordinate.

Without a meeting point, some threads could read from shared memory before their “partner threads” have finished writing. That’s a classic data race: the read might see an old value or an uninitialized value, depending on timing.

__syncthreads() solves this by acting as:

  1. an execution barrier: no thread proceeds past it until all threads in the block arrive,
  2. a visibility guarantee: shared-memory writes performed before the barrier are visible to threads after the barrier.

Shared Memory As A Staging Area

Shared memory isn’t only useful as a fast scratchpad for reuse. It’s also a shared staging buffer that lets threads rearrange data cooperatively.

A transpose is the simplest example of reordering:

  1. Produce phase: each thread loads one element from global memory into shared memory using normal (row, col) coordinates.
  2. Barrier: the block waits until the shared tile is fully populated.
  3. Consume phase: each thread reads from shared memory using swapped coordinates (col, row) and writes the transposed value back out.

This “produce → sync → consume” pattern is what makes shared memory feel genuinely cooperative: after the barrier, threads are no longer just reading what they personally wrote—they’re reading data written by other threads in the block.

Synchronization Rules

Two rules make __syncthreads() safe and predictable:

  1. Block scope only
    __syncthreads() synchronizes threads within a single block. It does not coordinate different blocks anywhere in the grid.

  2. All threads must reach it
    Every thread in the block must execute the barrier. If some threads skip it while others wait there, the block can deadlock.

Placing the barrier inside a conditional branch is unsafe unless the condition is guaranteed to be uniform across the whole block.
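
For illustration, here is a sketch of the divergent shape to avoid; the kernel name and the condition are made up for this example:

```cuda
__global__ void brokenBarrier(const float *in) {
    __shared__ float tile[16][16];
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    if (tx < 8) {                         // only half the threads take this branch
        tile[ty][tx] = in[ty * 16 + tx];
        __syncthreads();                  // the other half never arrives here,
    }                                     // so the block can hang
}
```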

For a shared-memory transpose tile, the clean approach is to place the barrier unconditionally at the boundary between the load phase and the transposed-read phase.

Building A Single-Tile Matrix Transpose Kernel

Let's start writing the kernel. We will use a single 16 x 16 tile and a single block. The matrix width is exactly TILE_WIDTH, so every thread maps to one valid element and no bounds checks are needed.
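
Here is a sketch of the kernel's first half, assuming TILE_WIDTH is defined as 16 and the matrix is a row-major float array; the kernel name transposeTile is illustrative:

```cuda
#define TILE_WIDTH 16

__global__ void transposeTile(const float *in, float *out) {
    // One tile shared by every thread in the block.
    __shared__ float tile[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;  // column within the tile
    int ty = threadIdx.y;  // row within the tile

    // Load phase: each thread copies one element from global memory
    // into the shared tile at its normal (row, col) position.
    tile[ty][tx] = in[ty * TILE_WIDTH + tx];

    // ... barrier and transposed write-back come next ...
}
```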

At this point, each thread has contributed exactly one value to the shared tile. But no thread should yet try to read from a transposed position like tile[tx][ty], because that value may still be in the process of being written by another thread.

Our transpose kernel will avoid both problems: the barrier keeps threads from reading too early, and because every thread executes it unconditionally, the block cannot deadlock.

Finishing The Kernel: Barrier + Transposed Write-Back

Now we add the second half of the kernel: the meeting point and the consume phase.

At the end of the load phase, each thread (ty, tx) has written exactly one element. But for a transpose, thread (ty, tx) must read the element that belongs at its output location, which is sitting in shared memory at tile[tx][ty].

Crucially, tile[tx][ty] was written by a different thread: (tx, ty). So we must wait until all threads have finished populating the tile before any thread starts reading transposed coordinates.

That is exactly what __syncthreads() provides here.
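
Putting the barrier and the consume phase in place, the finished kernel might look like this (same illustrative names and assumptions as above):

```cuda
__global__ void transposeTile(const float *in, float *out) {
    __shared__ float tile[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Produce phase: write my element at its normal (row, col) position.
    tile[ty][tx] = in[ty * TILE_WIDTH + tx];

    // Meeting point: no thread continues until every thread in the block
    // has arrived, and every write above is visible to every read below.
    __syncthreads();

    // Consume phase: the swapped read tile[tx][ty] performs the transpose.
    out[ty * TILE_WIDTH + tx] = tile[tx][ty];
}
```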

A few important details to notice:

  • The barrier is unconditional (every thread reaches it), so the block cannot deadlock.
  • The indices are swapped only in the shared-memory read: tile[tx][ty]. That swap is the entire “transpose” operation.
  • Because this lesson’s setup uses exactly one 16 x 16 block on a 16 x 16 matrix, we don’t need bounds checks: every (ty, tx) is valid.

Creating The Host Side Test

To test the kernel, we create a 16 x 16 matrix filled with increasing values. That makes the transpose easy to verify, because the output at position (row, col) should equal the input at (col, row).

Using distinct values instead of all ones is important here. A matrix of all ones would look the same after transpose, so it would not prove that the coordinates were actually swapped correctly.
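
A minimal host-side setup might look like the following sketch; the variable names are illustrative and error checking is omitted:

```cuda
float h_in[TILE_WIDTH * TILE_WIDTH];
float h_out[TILE_WIDTH * TILE_WIDTH];

// Distinct, increasing values: element (row, col) gets row * 16 + col.
for (int row = 0; row < TILE_WIDTH; ++row)
    for (int col = 0; col < TILE_WIDTH; ++col)
        h_in[row * TILE_WIDTH + col] = (float)(row * TILE_WIDTH + col);

float *d_in, *d_out;
size_t bytes = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc(&d_in, bytes);
cudaMalloc(&d_out, bytes);
cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
```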

Launching The Kernel

Because __syncthreads() only works inside a block, we launch exactly one block with one thread per tile position.

This launch matches the kernel assumptions directly:

  • one block,
  • one shared tile,
  • one input element loaded per thread,
  • one transposed output element written per thread.
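
Under the same illustrative names, the launch might look like:

```cuda
dim3 block(TILE_WIDTH, TILE_WIDTH);  // 16 x 16 = 256 threads, one per element
dim3 grid(1, 1);                     // a single block covers the whole matrix

transposeTile<<<grid, block>>>(d_in, d_out);
cudaDeviceSynchronize();             // wait for the kernel to finish
```
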
Verifying The Transpose

After the kernel finishes, we copy the result back and compare it against the expected transpose.
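
Continuing the sketch (this assumes <cstdio> is included for printf, and the success message is illustrative):

```cuda
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

// The output at (row, col) must equal the input at (col, row).
bool ok = true;
for (int row = 0; row < TILE_WIDTH; ++row)
    for (int col = 0; col < TILE_WIDTH; ++col)
        if (h_out[row * TILE_WIDTH + col] != h_in[col * TILE_WIDTH + row])
            ok = false;

printf(ok ? "Transpose verified!\n" : "Transpose FAILED\n");
```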

If everything works correctly, the comparison finds no mismatches and the program prints its success message.

Conclusion and Next Steps

In this lesson, we moved beyond simply “using __syncthreads() because shared memory needs it.” We used synchronization to support a new matrix operation: a shared-memory tile transpose. Each thread first wrote one element into a common tile, then—after a block-wide barrier—read from a different coordinate that was filled by a different thread.

That is the key idea to carry forward: shared memory is not just for caching values; it is also for reorganizing them cooperatively across threads. In the next lessons and practices, this will become the foundation for more advanced tiled matrix kernels.
