Welcome to CUDA Basics and 1D Operations. To build high-performance GPU applications, we must first master the bridge between two separate worlds: the Host (CPU) and the Device (GPU). Unlike standard C++ where memory is often unified in the programmer's mind, CUDA traditionally requires us to manually manage data across distinct hardware boundaries.
In this lesson, we will establish a professional foundation for GPU programming. You will learn how to use cudaMalloc to carve out space in video memory (VRAM), initialize that memory using cudaMemset, and ensure your code is production-ready using robust error-checking macros.
CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It allows developers to use the Graphics Processing Unit (GPU) for general-purpose mathematical tasks, a field known as GPGPU (General-Purpose computing on Graphics Processing Units).
While a CPU is designed with a few complex cores optimized for sequential logic and low-latency execution, a GPU is built for massive parallelism. It contains thousands of smaller, specialized cores capable of running the same instruction on different pieces of data simultaneously. To harness this power, we must embrace a specific architecture.
CUDA programs use a heterogeneous computing model, meaning the application's execution is split between two different processors. In the explicit memory management model we are covering in this course:
- The Host: The CPU and its system memory (RAM). It manages the overall control flow and logic.
- The Device: The GPU and its dedicated video memory (VRAM). It handles the intensive data-parallel computations.
In this explicit model, these two spaces are treated as isolated. The Host cannot directly dereference a Device pointer, and the Device cannot directly access Host memory. While CUDA provides advanced features like Unified Memory (which creates a shared virtual address space), mastering the explicit allocation and initialization of data is the fundamental skill required for high-performance GPU engineering.
CUDA programs are written in .cu files, which blend standard C++ with CUDA-specific keywords. To compile them, we use nvcc (NVIDIA CUDA Compiler).
nvcc acts as a "wrapper" compiler:
- It parses the source file and strips out the GPU-specific code (kernels).
- It sends the standard host code to a general-purpose C++ compiler like `gcc`, `clang`, or MSVC.
- It compiles the GPU code into instructions the hardware understands (PTX or SASS).
- It links everything into a single executable.
In low-level systems programming, we don't have the luxury of high-level exceptions. Most CUDA functions return a status code of type cudaError_t. If you ignore these codes, your program might continue running with invalid pointers or failed allocations, leading to "silent failures" that are incredibly difficult to debug.
To write professional-grade CUDA, we wrap every call in a macro. This ensures that if a call fails, the program immediately halts and prints a human-readable error message via cudaGetErrorString, along with the exact file and line number.
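One common spelling of such a macro looks like the sketch below (the name `CUDA_CHECK` and the exact formatting are conventions, not a fixed API; requires the CUDA toolkit to compile):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Wrap every CUDA runtime call: on failure, print a human-readable
// message plus the file and line, then halt immediately.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

// Usage: CUDA_CHECK(cudaMalloc(&d_ptr, bytes));
```

The `do { ... } while (0)` wrapper makes the macro behave like a single statement, so it composes safely with `if`/`else` blocks.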
With proper error checking in place, we can move on to writing our actual program. First, we'll cover the most important part of any program: memory.
In standard C++, the CPU owns host memory, and you manage it using new or std::malloc. However, the GPU owns device memory, which is a separate address space. Because of this separation, we cannot use std::vector or new for GPU work.
To allocate memory on the GPU, we use cudaMalloc. It functions similarly to std::malloc, but with two key differences:
- The Return Value: It returns a `cudaError_t` status code (which we check with our macro).
- The Pointer: Because the return value is used for the error code, we pass the address of our pointer (`float**`) so `cudaMalloc` can write the new GPU memory address into it.
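A minimal allocation might look like the following sketch (checking the returned `cudaError_t` directly; requires a CUDA-capable GPU and toolkit to run):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t N = 1 << 20;                // one million floats
    float* d_data = nullptr;                 // will receive a *device* address

    // cudaMalloc's return value carries the status code; the pointer
    // comes back through the float** out-parameter, hence &d_data.
    cudaError_t err = cudaMalloc(&d_data, N * sizeof(float));
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_data);                        // every cudaMalloc pairs with cudaFree
    return 0;
}
```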
Once memory is allocated, it contains "garbage" values. To initialize this memory, CUDA provides cudaMemset.
Crucially, cudaMemset is a byte-fill operation, not a typed initialization. It fills a range of memory with a specific 8-bit value. This has important implications for different data types:
- Zeroing Memory: `cudaMemset(ptr, 0, bytes)` is the most common use case. Since a 32-bit `float` or a 64-bit `double` whose bits are all zero is interpreted as `0.0`, this is a safe and fast way to zero-initialize numerical arrays.
- Non-Zero Values: If you try `cudaMemset(ptr, 1, bytes)` on a float array, it sets every single byte to `0x01`. This produces the 32-bit pattern `0x01010101`, which is interpreted as a tiny, meaningless float (roughly 2.37 × 10⁻³⁸), not `1.0f`.
Because it works at the byte level, you must always provide the size in total bytes, not the number of elements.
Every cudaMalloc must be paired with a cudaFree. Failing to do so causes a memory leak in VRAM. This is particularly dangerous because GPU memory is often more limited than system RAM, and leaks can persist until the application closes or the driver is reset.
Even after allocation and initialization, the CPU cannot "see" what is inside the device pointer. To verify our work, we need a mechanism to copy data back from the GPU to the CPU so we can inspect it.
In the example below, we use cudaMemcpy to act as this bridge. We will explore the mechanics of data transfers and cudaMemcpy much more in-depth in the next unit; for now, consider it a tool to pull our GPU data back into a std::vector for validation.
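Putting the whole pattern together, a verification program might look like this sketch (assumes the CUDA toolkit and a GPU; `CUDA_CHECK` is the error-checking macro described earlier, defined inline here so the example is self-contained):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Halt with file/line context if any CUDA call fails.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s (%s:%d)\n",               \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)

int main() {
    const size_t N = 1024;
    const size_t bytes = N * sizeof(float);    // sizes are always in bytes

    float* d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, bytes));    // allocate VRAM
    CUDA_CHECK(cudaMemset(d_data, 0, bytes));  // byte-fill with zeros -> 0.0f

    // Copy the device buffer back so the CPU can inspect it.
    std::vector<float> h_data(N, -1.0f);
    CUDA_CHECK(cudaMemcpy(h_data.data(), d_data, bytes, cudaMemcpyDeviceToHost));

    for (size_t i = 0; i < N; ++i) {
        if (h_data[i] != 0.0f) {
            std::fprintf(stderr, "Validation failed at index %zu\n", i);
            return 1;
        }
    }
    std::printf("All %zu elements are zero-initialized.\n", N);

    CUDA_CHECK(cudaFree(d_data));              // pair every cudaMalloc with cudaFree
    return 0;
}
```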
- Pointer Confusion: Attempting to access `d_data[0]` directly in the `main()` function will cause a segmentation fault or crash. The CPU cannot resolve GPU addresses.
- Elements vs. Bytes: `cudaMalloc`, `cudaMemset`, and `cudaMemcpy` all require sizes in bytes. This is particularly dangerous with `cudaMemset`; if you pass the element count, you will only initialize the first quarter of a float array, leaving the rest as garbage.
- Silent Failures: Without `CUDA_CHECK`, a failed `cudaMalloc` (due to being out of VRAM) would leave `d_data` as `nullptr`. Your program would then crash on the next line, making it much harder to pinpoint the original cause.
- Memset Misconceptions: Remember that `cudaMemset` fills memory byte-by-byte. It is perfect for zeroing out memory, but it cannot be used to initialize a float array to `1.0f` or `3.14f`.
We now have a repeatable, professional approach to GPU memory management. By using cudaMalloc to allocate, cudaMemset to initialize, and cudaFree to clean up—all guarded by CUDA_CHECK—we are ready to start processing data. In the upcoming practice, you will implement this pattern to ensure it becomes second nature before we move on to detailed data transfers and writing our first kernels.
