Shared Memory Optimization
In this course, you will reduce the cost of global memory latency by staging data in fast, on-chip shared memory. You will learn to synchronize threads that cooperate through shared memory, implement a boundary-safe tiled matrix multiplication algorithm, and empirically compare it against a naive implementation using validated benchmarks.
CUDA
4 lessons
22 practices
3 hours
GPU Architecture and Memory Hierarchy
Course details
Shared Memory Declaration
Shared Memory Mystery
Shared Memory Teamwork
Shared Tile Debugging
Building Block Cooperation
Indexing Under Pressure
