Embedded Systems & Firmware
Shared Memory Optimization
In this course, you will reduce the cost of global memory latency by using fast, on-chip shared memory. You will learn to synchronize threads that cooperate through shared memory, implement a boundary-safe tiled matrix multiplication algorithm, and empirically compare it against a naive implementation using validated benchmarks.
CUDA
4 lessons
22 practices
3 hours
GPU Architecture and Memory Hierarchy
Course details
Shared Memory Declaration
Shared Memory Mystery
Shared Memory Teamwork
Shared Tile Debugging
Building Block Cooperation
Indexing Under Pressure