Welcome back to Shared Memory Optimization. In the previous lesson, we already used shared memory and included a __syncthreads() call so the kernel would behave correctly. That means this lesson is not about introducing the barrier from scratch. Instead, we will build on it with a more interesting question:
What useful thing can threads do once they can safely share data?
A natural next step is to use shared memory not just for reuse, but also for reordering data cooperatively. In this lesson, we will write a small matrix transpose kernel for a single tile. Each thread will load one matrix element into shared memory, wait at the barrier, and then write out the transposed value by reading from a different shared-memory coordinate.
By the end of the lesson, we will have a complete kernel that uses one shared tile and one __syncthreads() to perform a correct block-local transpose.
Even though all threads in a block are launched together, they do not all move in lockstep. CUDA executes threads in groups of 32 called warps, and different warps may advance at different times.
That matters because one thread may try to read from shared memory before another thread has finished writing the value it needs.
For a transpose, the dependency is especially clear:
- Thread (ty, tx) loads A[ty][tx] into tile[ty][tx].
- Later, thread (ty, tx) wants to read tile[tx][ty].
- But tile[tx][ty] was written by thread (tx, ty).
So each thread depends on a partner thread at the transposed coordinate.
Without a meeting point, some threads could read from shared memory before their “partner threads” have finished writing. That’s a classic data race: the read might see an old value or an uninitialized value, depending on timing.
__syncthreads() solves this by acting as:
- an execution barrier: no thread proceeds past it until all threads in the block arrive,
- a visibility guarantee: shared-memory writes performed before the barrier are visible to threads after the barrier.
Shared memory isn’t only useful as a fast scratchpad for reuse. It’s also a shared staging buffer that lets threads rearrange data cooperatively.
A transpose is the simplest example of reordering:
- Produce phase: each thread loads one element from global memory into shared memory using normal (row, col) coordinates.
- Barrier: the block waits until the shared tile is fully populated.
- Consume phase: each thread reads from shared memory using swapped coordinates (col, row) and writes the transposed value back out.
This “produce → sync → consume” pattern is what makes shared memory feel genuinely cooperative: after the barrier, threads are no longer just reading what they personally wrote—they’re reading data written by other threads in the block.
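To see the pattern in its smallest form before tackling the transpose, here is an illustrative sketch (the kernel name and the fixed block size of 32 are our own assumptions) in which threads cooperatively reverse a small array:

```cuda
// Illustrative sketch: produce → sync → consume.
// Assumes the kernel is launched with exactly 32 threads in one block.
__global__ void reverse32(const float *in, float *out) {
    __shared__ float buf[32];
    int i = threadIdx.x;

    buf[i] = in[i];        // produce: each thread fills its own slot
    __syncthreads();       // sync: wait until all 32 slots are written
    out[i] = buf[31 - i];  // consume: read a slot written by another thread
}
```

After the barrier, thread 0 reads the value thread 31 wrote, and vice versa: exactly the "reading data written by other threads" idea, in one line per phase.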
Two rules make __syncthreads() safe and predictable:
- Block scope only: __syncthreads() synchronizes threads within a single block. It does not coordinate different blocks anywhere in the grid.
- All threads must reach it: every thread in the block must execute the barrier. If some threads skip it while others wait there, the block can deadlock.
Placing __syncthreads() inside a conditional branch is unsafe unless the condition is guaranteed to be uniform across the whole block:
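As a deliberately broken sketch (the kernel name is made up for illustration), this is what the dangerous pattern looks like:

```cuda
// DO NOT DO THIS: the condition is not uniform across the block.
// Only threads with threadIdx.x < 16 ever reach the barrier; the
// rest never arrive, so the block can hang at the __syncthreads().
__global__ void divergentBarrier(float *data) {
    if (threadIdx.x < 16) {
        data[threadIdx.x] *= 2.0f;
        __syncthreads();   // barrier inside a divergent branch
    }
}
```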
For a shared-memory transpose tile, the clean approach is to place the barrier unconditionally at the boundary between the load phase and the transposed-read phase.
Let's start writing the kernel. We will use a single 16 x 16 tile and a single block. The matrix width is exactly TILE_WIDTH, so every thread maps to one valid element and no bounds checks are needed.
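One way the load phase might look (the kernel and parameter names are our own, and we assume a row-major float matrix of width TILE_WIDTH):

```cuda
#define TILE_WIDTH 16

__global__ void transposeTile(const float *in, float *out) {
    __shared__ float tile[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;  // column index within the tile
    int ty = threadIdx.y;  // row index within the tile

    // Produce phase: each thread copies exactly one element from
    // global memory into the shared tile at its natural (ty, tx) spot.
    tile[ty][tx] = in[ty * TILE_WIDTH + tx];

    // ... barrier and consume phase come next ...
}
```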
At this point, each thread has contributed exactly one value to the shared tile. But no thread should yet try to read from a transposed position like tile[tx][ty], because that value may still be in the process of being written by another thread.
Our transpose kernel avoids that issue because every thread executes the barrier unconditionally.
Now we add the second half of the kernel: the meeting point and the consume phase.
At the end of the load phase, each thread (ty, tx) has written exactly one element.
But for a transpose, thread (ty, tx) must read the element that belongs at its output location, which is sitting in shared memory.
Crucially, tile[tx][ty] was written by a different thread: (tx, ty). So we must wait until all threads have finished populating the tile before any thread starts reading transposed coordinates.
That is exactly what __syncthreads() provides here.
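Putting both phases together, the complete kernel might read like this (again, transposeTile and the parameter names are illustrative; the input is assumed row-major):

```cuda
#define TILE_WIDTH 16

__global__ void transposeTile(const float *in, float *out) {
    __shared__ float tile[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;  // column index within the tile
    int ty = threadIdx.y;  // row index within the tile

    // Produce phase: write my element at (ty, tx).
    tile[ty][tx] = in[ty * TILE_WIDTH + tx];

    // Meeting point: no thread continues until the tile is fully
    // populated, and all writes above are visible to all threads below.
    __syncthreads();

    // Consume phase: read the transposed coordinate (tx, ty).
    // tile[tx][ty] was written by the partner thread (tx, ty).
    out[ty * TILE_WIDTH + tx] = tile[tx][ty];
}
```

Note that the output index (ty, tx) is unchanged; only the shared-memory read is swapped, so output position (ty, tx) receives input element (tx, ty).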
A few important details to notice:
- The barrier is unconditional (every thread reaches it), so the block cannot deadlock.
- The indices are swapped only in the shared-memory read: tile[tx][ty]. That swap is the entire “transpose” operation.
- Because this lesson’s setup uses exactly one 16 x 16 block on a 16 x 16 matrix, we don’t need bounds checks—every (ty, tx) is valid.
To test the kernel, we create a 16 x 16 matrix filled with increasing values. That makes the transpose easy to verify, because the output at position (row, col) should equal the input at (col, row).
Using distinct values instead of all ones is important here. A matrix of all ones would look the same after transpose, so it would not prove that the coordinates were actually swapped correctly.
Because __syncthreads() only works inside a block, we launch exactly one block with one thread per tile position.
This launch matches the kernel assumptions directly:
- one block,
- one shared tile,
- one input element loaded per thread,
- one transposed output element written per thread.
After the kernel finishes, we copy the result back and compare it against the expected transpose.
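A host-side test might look like the following sketch. It assumes the kernel from this lesson is named transposeTile(const float*, float*); that name, and the exact messages printed, are our own choices:

```cuda
#include <cstdio>
#define TILE_WIDTH 16

int main() {
    const int N = TILE_WIDTH * TILE_WIDTH;
    float h_in[N], h_out[N];

    // Fill with distinct increasing values so a transpose is detectable.
    for (int i = 0; i < N; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in, N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, h_in, N * sizeof(float), cudaMemcpyHostToDevice);

    // One block, one thread per tile position: matches the kernel's
    // assumptions (single 16 x 16 tile, no bounds checks).
    dim3 block(TILE_WIDTH, TILE_WIDTH);
    transposeTile<<<1, block>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    // Verify: output (row, col) must equal input (col, row).
    bool ok = true;
    for (int r = 0; r < TILE_WIDTH; ++r)
        for (int c = 0; c < TILE_WIDTH; ++c)
            if (h_out[r * TILE_WIDTH + c] != h_in[c * TILE_WIDTH + r])
                ok = false;

    printf(ok ? "Transpose PASSED\n" : "Transpose FAILED\n");
    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```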
If everything works correctly, the program prints a success message confirming that the output matched the expected transpose.
In this lesson, we moved beyond simply “using __syncthreads() because shared memory needs it.” We used synchronization to support a new matrix operation: a shared-memory tile transpose. Each thread first wrote one element into a common tile, then—after a block-wide barrier—read from a different coordinate that was filled by a different thread.
That is the key idea to carry forward: shared memory is not just for caching values; it is also for reorganizing them cooperatively across threads. In the next lessons and practices, this will become the foundation for more advanced tiled matrix kernels.
