Welcome back to CUDA Basics and 1D Operations. We are now a few lessons into the course, and we have the two key building blocks from earlier units: allocating device memory and moving data with cudaMemcpy. In this lesson, we will finally do the "GPU work" part by writing a 1D vector addition kernel, launching it, and verifying that every element was added correctly.
Before we dive into the implementation, we need to understand the fundamental unit of GPU execution: the Kernel.
In CUDA, a kernel is a specialized C++ function that is called by the CPU (the host) but executed on the GPU (the device). Unlike a standard CPU function that runs once sequentially, a kernel is designed to be executed many times in parallel by different threads.
Think of a kernel as a template for a single unit of work. By defining this template once, you enable the GPU to replicate that work across its entire array of processors. This approach shifts the focus from "how do I loop through this data?" to "what operation should be performed on a single data point?", letting the hardware handle the massive parallelism automatically.
In a standard CPU program, you might use 4 to 16 threads to handle tasks. In CUDA, we think much bigger. A modern GPU can manage tens of thousands of threads simultaneously.
Each thread is an independent path of execution. However, in the CUDA model, they all execute the same function (the kernel) at the same time. This is known as SIMT (Single Instruction, Multiple Threads). This model allows the GPU to process vast amounts of data by broadcasting a single instruction to thousands of individual threads simultaneously.
To manage this volume of work, the GPU hardware groups threads into units called warps, usually consisting of 32 threads. Threads within a warp execute the same instruction at the same time, which allows the GPU to stay efficient by sharing the instruction-processing hardware across many threads at once. This high-density execution model is what gives GPUs their massive computational throughput.
While we launch thousands of threads, CUDA does not treat them as one giant, flat list. Instead, threads are organized into a hierarchy: Threads are grouped into Blocks, and all blocks together form the Grid.
- Thread: The smallest unit of execution.
- Warp: A hardware-level group of 32 threads that execute instructions simultaneously. Warps are not referenced explicitly in your kernel code, but the hardware always operates in 32-thread increments behind the scenes, so targeting block sizes that are multiples of 32 is a standard optimization.
- Block: A group of threads that run on the same Streaming Multiprocessor (SM).
- Grid: The collection of all blocks launched for a specific kernel.
This grouping is necessary because hardware has limits; for example, a single block cannot contain more than 1024 threads on modern GPUs. By using blocks, we can scale our work to millions of elements by simply launching more blocks across the various SMs available on the GPU.
Vector addition is the "Hello World" of CUDA because it perfectly illustrates data-parallel programming. To understand why we write kernels the way we do, we need to compare the CPU approach to the GPU approach.
In standard C++, you would add two vectors using a for loop:
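As a reference point, here is a minimal sketch of that sequential version (the function and array names are illustrative):

```cpp
// Sequential vector addition: one CPU thread walks the data linearly.
void vector_add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```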
In this model, a single CPU thread moves through the data linearly. In CUDA, we remove the loop entirely. Instead of one thread performing every addition, we launch many threads that each perform a single addition, all at the same time.
When you launch a kernel, every single thread executes the exact same code. This is the power behind simultaneous execution, but it presents a challenge: if our kernel simply said c[0] = a[0] + b[0], then 10,000 threads would all calculate the first element 10,000 times. We obviously don't want that!
To make the threads work together effectively, each thread must identify which piece of data it is responsible for. Just like how a CPU uses a loop iterator i to keep track of which elements it's processing, each GPU thread needs a unique identifier to determine its specific task. This is why we calculate a global thread index.
Think of the GPU as a massive factory with thousands of workers. To ensure the work is divided correctly, every worker is given a unique ID based on their "Floor" (blockIdx) and their "Station" (threadIdx). By calculating this unique ID, each worker knows exactly which box on the assembly line to process.
The formula index = blockIdx.x * blockDim.x + threadIdx.x maps the hierarchical hardware structure (Blocks and Threads) into a flat, linear coordinate system that matches the way arrays are stored in memory.
With the theory covered, it's time to implement the kernel. We define the kernel in the same file as our standard host code, but it behaves differently than a regular function. It is launched by the CPU but executed on the GPU.
In code, a CUDA kernel is a function marked with the __global__ specifier. Here is the implementation of the vector addition kernel:
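A sketch of such a kernel (the name vectorAdd and the parameter order are illustrative choices for this lesson's running example):

```cpp
// Each thread computes exactly one element of the output array.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    // Map the thread's position in the grid to a unique array index.
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Guard against threads whose index falls past the end of the data;
    // the grid is typically rounded up to a multiple of the block size.
    if (index < n) {
        c[index] = a[index] + b[index];
    }
}
```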
To understand how this works, let's look closer at two critical lines:
1. Calculating the Global Index
To find a thread's unique position in the entire grid (the "global index"), we use:
index = blockIdx.x * blockDim.x + threadIdx.x
- blockIdx.x: The ID of the current block.
- blockDim.x: The number of threads in each block.
- threadIdx.x: The local ID of the thread within its block.
Think of it like an apartment building: to find your global "mailbox number," you take your floor number (blockIdx) multiplied by the number of mailboxes per floor (blockDim), plus your specific mailbox number on that floor (threadIdx).
This index is effectively the same as the loop counter used in a standard CPU loop. The difference is that instead of one thread processing indices sequentially, CUDA assigns each index to a specific thread (e.g., thread 0 handles index 0, thread 1 handles index 1, and so on). It is important to note that these threads run concurrently. There is no guaranteed order of execution; a later thread might finish its addition before an earlier one, but because they are all mapped to unique indices, they never interfere with each other.
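2. Guarding with a Bounds Check

Because we round the block count up (more on this below), the grid can contain a few more threads than there are elements. The line if (index < n) ensures those extra threads simply do nothing instead of reading or writing past the end of the arrays.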
With our kernel defined, we next create our input arrays on the host using std::vector, then allocate matching arrays on the device with cudaMalloc. Notice how we compute the byte size once, since CUDA APIs operate in bytes.
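A sketch of that setup, assuming float data and a small element count N chosen for illustration:

```cpp
const int N = 10;
const size_t bytes = N * sizeof(float);  // CUDA APIs take sizes in bytes

// Host arrays: two inputs and a pre-allocated result buffer.
std::vector<float> h_a(N), h_b(N), h_c(N);
for (int i = 0; i < N; ++i) {
    h_a[i] = static_cast<float>(i);       // illustrative input values
    h_b[i] = static_cast<float>(2 * i);
}

// Matching device allocations.
float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
cudaMalloc(&d_a, bytes);
cudaMalloc(&d_b, bytes);
cudaMalloc(&d_c, bytes);
```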
Here, h_a and h_b are the input arrays on the host and h_c is the pre-allocated result array; similarly, d_a, d_b, and d_c are the corresponding memory locations on the device.
As in the previous lesson, we use cudaMemcpy with the correct direction flag cudaMemcpyHostToDevice. We copy only the input arrays because the output d_c will be written by the kernel.
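Continuing the sketch:

```cpp
// Copy only the inputs; d_c will be written by the kernel.
cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);
```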
To launch the kernel, we first need to define a block size (the number of threads per block) and a grid size (the total number of blocks). We typically choose a block size that is a multiple of 32, such as 256, to ensure efficiency on the hardware. We calculate the block count using a ceiling division formula: (element_count + threadsPerBlock - 1) / threadsPerBlock. Since standard integer division in C++ rounds down, this logic ensures that if our data size isn't a perfect multiple of the block size, we launch one additional block to cover the remaining elements.
Once we have those values, we can launch the kernel using the triple chevron notation <<<...>>>. This special syntax, known as the execution configuration, takes our two layout parameters and tells the GPU exactly how many blocks to schedule and how many threads to pack into each block, after which it passes the function arguments to the device.
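Putting both steps together, a sketch of the configuration and launch (using the names from above):

```cpp
const int threadsPerBlock = 256;  // a multiple of the 32-thread warp size
// Ceiling division: round up so every element is covered by some thread.
const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

// Execution configuration: <<<number of blocks, threads per block>>>.
vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);
```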
Kernel launches are asynchronous, meaning the CPU continues executing the next line of code immediately after telling the GPU to start. While a subsequent cudaMemcpy(..., cudaMemcpyDeviceToHost) is a blocking call that will wait for the kernel to finish, we use explicit synchronization and error checking to ensure the kernel launched correctly and to surface any asynchronous execution errors.
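One common pattern for that check looks like the following sketch:

```cpp
// Catch configuration/launch errors (e.g., an invalid grid dimension).
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
    std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << "\n";
    return 1;
}

// Block the CPU until all preceding GPU work has finished,
// surfacing any errors that occurred during kernel execution.
err = cudaDeviceSynchronize();
if (err != cudaSuccess) {
    std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << "\n";
    return 1;
}
```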
What this does:
- cudaGetLastError() catches launch issues, such as an invalid configuration.
- cudaDeviceSynchronize() acts as a checkpoint, forcing the CPU to wait until the GPU finishes all preceding work. This is useful for debugging and clearly separating the execution phase from the data-retrieval phase.
After the kernel finishes, we copy d_c back to the host, then verify every element using a small tolerance.
Finally, we print whether the check passed, and we print the output vector so we can visually confirm the addition.
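A sketch of the copy-back, verification, and reporting steps (the tolerance value is illustrative):

```cpp
// Retrieve the result; this DeviceToHost copy also waits for the kernel.
cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);

// Compare against the CPU-computed sum with a small floating-point tolerance.
bool passed = true;
for (int i = 0; i < N; ++i) {
    if (std::fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5f) {
        passed = false;
        break;
    }
}

std::cout << (passed ? "Verification PASSED" : "Verification FAILED") << "\n";
std::cout << "c = ";
for (int i = 0; i < N; ++i) {
    std::cout << h_c[i] << (i + 1 < N ? " " : "\n");
}
```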
Expected output:
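With the illustrative inputs used in this sketch (h_a[i] = i and h_b[i] = 2i, so each result is 3i), a run would print something like:

```
Verification PASSED
c = 0 3 6 9 12 15 18 21 24 27
```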
Good CUDA programs clean up after themselves. We free each device allocation, then return 0 for success.
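For the running example, that cleanup looks like:

```cpp
// Release every device allocation before exiting.
cudaFree(d_c);
cudaFree(d_b);
cudaFree(d_a);
return 0;
```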
Below is the complete, runnable code for the vector addition example. This combines all the steps we discussed: kernel definition, memory management, kernel execution, and verification.
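Since the lesson's original listing isn't reproduced here, the following is a self-contained sketch assembled from the snippets above (kernel name, N, and sample values are illustrative); it compiles with nvcc:

```cpp
#include <cmath>
#include <iostream>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds exactly one pair of elements.
__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < n) {            // bounds check for the rounded-up grid
        c[index] = a[index] + b[index];
    }
}

int main() {
    const int N = 10;
    const size_t bytes = N * sizeof(float);

    // Host data: two inputs and a result buffer.
    std::vector<float> h_a(N), h_b(N), h_c(N);
    for (int i = 0; i < N; ++i) {
        h_a[i] = static_cast<float>(i);
        h_b[i] = static_cast<float>(2 * i);
    }

    // Device allocations.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Copy inputs to the device; d_c is written by the kernel.
    cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice);

    // Launch configuration: 256 threads per block, rounded-up block count.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

    // Check for launch errors, then wait for the GPU to finish.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::cerr << "Kernel launch failed: " << cudaGetErrorString(err) << "\n";
        return 1;
    }
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::cerr << "Kernel execution failed: " << cudaGetErrorString(err) << "\n";
        return 1;
    }

    // Copy the result back and verify with a small tolerance.
    cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    bool passed = true;
    for (int i = 0; i < N; ++i) {
        if (std::fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5f) {
            passed = false;
            break;
        }
    }
    std::cout << (passed ? "Verification PASSED" : "Verification FAILED") << "\n";
    std::cout << "c = ";
    for (int i = 0; i < N; ++i) {
        std::cout << h_c[i] << (i + 1 < N ? " " : "\n");
    }

    // Clean up device memory.
    cudaFree(d_c);
    cudaFree(d_b);
    cudaFree(d_a);
    return 0;
}
```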
We now have a complete 1D CUDA workflow: allocate device memory, copy inputs, run a __global__ kernel with a correct global index, synchronize, copy results back, and verify correctness. The most important new skill here is the index mapping plus the bounds check.
Next, we will jump into hands-on practice where we will write and launch kernels ourselves.
