Welcome back to CUDA Basics and 1D Operations. We have already learned how to allocate and free GPU memory safely, so we are now ready for the next key skill: moving real data from Host RAM to Device VRAM, and then bringing results back. In this lesson, we will build small wrapper functions around cudaMemcpy, so transfers become reliable, readable, and easy to verify.
A CUDA program lives in two memory worlds: CPU (host) memory for control and setup, and GPU (device) memory for fast parallel work. Since the CPU cannot directly read or write a device pointer, we must explicitly copy bytes across.
A helpful way to think about it is: if the GPU needs N floats, we must send exactly N * sizeof(float) bytes to the device pointer, using the correct cudaMemcpyKind direction flag.
As you may recall from the previous lesson, we should not trust CUDA calls without checking their return codes. We will reuse the same CUDA_CHECK pattern so that a failed transfer stops immediately and reports a clear message.
This gives us two practical benefits: failures are not silent, and we get the exact file (__FILE__) and line (__LINE__) where the problem happened.
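A minimal sketch of such a macro (the exact message formatting is up to you; this is one common shape, not necessarily the course's verbatim version):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wraps any CUDA runtime call, checks its return code, and aborts
// with the file and line of the failing call.
#define CUDA_CHECK(call)                                                   \
    do {                                                                   \
        cudaError_t err_ = (call);                                         \
        if (err_ != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);    \
            std::exit(EXIT_FAILURE);                                       \
        }                                                                  \
    } while (0)
```

The do/while(0) wrapper lets the macro behave like a single statement, so it composes safely inside if/else branches.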
We still need device memory to copy into, so we keep small helpers for allocation and cleanup. This is the same idea as Lesson 1, just applied in our transfer-focused program.
Key points to keep straight:
- cudaMalloc needs bytes, not element counts; we multiply by sizeof(float).
- cudaFree is guarded with a nullptr check so cleanup is safe and simple.
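Put together, the helpers might look like this (the names device_alloc and device_free are illustrative; CUDA_CHECK is the error-checking macro this course reuses from the previous lesson):

```cpp
// Allocate element_count floats of device memory, checked.
float* device_alloc(size_t element_count) {
    float* d_ptr = nullptr;
    // cudaMalloc takes a size in BYTES, so we multiply by sizeof(float).
    CUDA_CHECK(cudaMalloc(&d_ptr, element_count * sizeof(float)));
    return d_ptr;
}

// Free device memory; the nullptr guard makes it safe to call unconditionally.
void device_free(float* d_ptr) {
    if (d_ptr != nullptr) {
        CUDA_CHECK(cudaFree(d_ptr));
    }
}
```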
Now we implement the main feature of this lesson: a checked wrapper for copying CPU data into GPU memory.
The key element here is cudaMemcpyHostToDevice. This is a constant from the cudaMemcpyKind enumeration defined in <cuda_runtime.h>. It serves as a direction flag, telling the CUDA runtime to move data from Host RAM to Device VRAM.
This wrapper makes our intent obvious: dest must be a device pointer; src must be a host pointer; the size is computed from element_count.
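A sketch of such a wrapper, assuming the CUDA_CHECK macro from earlier and a hypothetical name copy_to_device:

```cpp
// Copy element_count floats from host memory (src) into device memory (dest).
void copy_to_device(float* dest, const float* src, size_t element_count) {
    CUDA_CHECK(cudaMemcpy(dest, src,
                          element_count * sizeof(float),   // size in bytes
                          cudaMemcpyHostToDevice));        // direction flag
}
```

Taking an element count and converting to bytes inside the wrapper removes one of the most common call-site mistakes.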
To validate work or retrieve computed results, we also need the reverse transfer. This function mirrors the previous one, but flips the direction flag to cudaMemcpyDeviceToHost.
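The mirrored wrapper might look like this (again with an illustrative name):

```cpp
// Copy element_count floats from device memory (src) back into host memory (dest).
void copy_to_host(float* dest, const float* src, size_t element_count) {
    CUDA_CHECK(cudaMemcpy(dest, src,
                          element_count * sizeof(float),
                          cudaMemcpyDeviceToHost));        // direction flipped
}
```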
With these two wrappers, our program can move arrays back and forth without repeating low-level cudaMemcpy details everywhere.
Next, we connect everything in main: we prepare host data in a std::vector, allocate device memory, copy to the GPU, copy back, then verify the values match. Since this is a direct bit-for-bit copy without any arithmetic performed, we can safely check for exact equality. Notice how the wrappers keep main focused on the story of the program.
What we should expect when it works is: the verification line reports SUCCESS, and the returned data prints the same sequence we started with.
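Under those expectations, main might read roughly like this (a sketch assuming hypothetical helpers device_alloc, device_free, copy_to_device, and copy_to_host with the behavior described in this lesson):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const size_t N = 8;
    std::vector<float> host_in(N);
    for (size_t i = 0; i < N; ++i) host_in[i] = static_cast<float>(i);

    float* d_data = device_alloc(N);              // reserve N floats of VRAM
    copy_to_device(d_data, host_in.data(), N);    // host -> device
    std::vector<float> host_out(N, 0.0f);
    copy_to_host(host_out.data(), d_data, N);     // device -> host
    device_free(d_data);

    // Bit-for-bit round trip with no arithmetic, so exact equality is safe.
    bool ok = (host_in == host_out);
    std::puts(ok ? "SUCCESS" : "MISMATCH");
    return ok ? 0 : 1;
}
```

Notice that main never calls cudaMemcpy directly; every low-level detail lives in the wrappers.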
- Swapping Destination and Source: cudaMemcpy follows the standard C memcpy signature: (destination, source, size, direction). Swapping the first two arguments is a common logic error that leads to data corruption or crashes.
- Wrong Transfer Direction: Using cudaMemcpyHostToDevice when you meant to copy data back to the CPU will cause a runtime error. The direction flag must strictly match the pointer types provided.
- Elements vs. Bytes: Like cudaMalloc, cudaMemcpy expects the size in bytes. If you pass N instead of N * sizeof(float), you will only copy a quarter of your data (on systems where a float is 4 bytes).
- Pointer Confusion: Attempting to access d_data[i] directly within main() (the Host) will cause a segmentation fault. The CPU cannot dereference pointers that belong to the GPU's memory space.
- Silent Failures: Transfers can fail for many reasons (e.g., passing an invalid pointer or exceeding memory limits). Always wrap cudaMemcpy in CUDA_CHECK to ensure you aren't trying to process "garbage" data that was never actually copied.
We can now reliably prepare GPU-ready datasets by allocating VRAM, copying from host to device, and copying back for validation, all with consistent error checking. This is the exact workflow we will reuse before launching kernels, because kernels assume the data is already in the right place.
Up next, the practice tasks will help us build speed and confidence by writing these transfer steps correctly without needing to look them up.
