Introduction

Welcome back to CUDA Basics and 1D Operations. We have already learned how to allocate and free GPU memory safely, so we are now ready for the next key skill: moving real data from Host RAM to Device VRAM, and then bringing results back. In this lesson, we will build small wrapper functions around cudaMemcpy, so transfers become reliable, readable, and easy to verify.

Why Host-to-Device Transfer Matters

A CUDA program lives in two memory worlds: CPU (host) memory for control and setup, and GPU (device) memory for fast parallel work. Since the CPU cannot directly read or write a device pointer, we must explicitly copy bytes across.

A helpful way to think about it is: if the GPU needs N floats, we must send exactly N * sizeof(float) bytes to the device pointer, using the correct direction flag. For example, 1,024 floats means 1,024 * 4 = 4,096 bytes on systems where a float is 4 bytes.

The Error-Checking Foundation We Rely On

As you may recall from the previous lesson, we should not trust CUDA calls without checking their return codes. We will reuse the same CUDA_CHECK pattern so that a failed transfer stops immediately and reports a clear message.

This gives us two practical benefits: failures are never silent, and we get the exact file (__FILE__) and line (__LINE__) where the problem happened.
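
As a reminder, a minimal sketch of that pattern might look like the following (the exact message format is an assumption; your macro from the previous lesson may word things differently):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Evaluate a CUDA runtime call and abort with file/line context on failure.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            std::fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                         __FILE__, __LINE__, cudaGetErrorString(err));    \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)
```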

Device Memory Helpers for Clean Code

We still need device memory to copy into, so we keep small helpers for allocation and cleanup. This is the same idea as Lesson 1, just applied in our transfer-focused program.

Key points to keep straight (see the sketch just after this list):

  • cudaMalloc needs bytes, not element counts; we multiply by sizeof(float).
  • cudaFree is guarded with a nullptr check so cleanup is safe and simple.
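
Putting those points together, the helpers might look like this minimal sketch (the names device_alloc_floats and device_free are hypothetical; use whatever your Lesson 1 versions are called):

```cpp
// Allocate room for element_count floats on the device.
float* device_alloc_floats(size_t element_count) {
    float* d_ptr = nullptr;
    // cudaMalloc takes a size in bytes, so multiply by sizeof(float).
    CUDA_CHECK(cudaMalloc((void**)&d_ptr, element_count * sizeof(float)));
    return d_ptr;
}

// Release a device allocation; guarded so it is safe to call with nullptr.
void device_free(float* d_ptr) {
    if (d_ptr != nullptr) {
        CUDA_CHECK(cudaFree(d_ptr));
    }
}
```
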
Copying from Host to Device Safely

Now we implement the main feature of this lesson: a checked wrapper for copying CPU data into GPU memory.

The key element here is cudaMemcpyHostToDevice. This is a constant from the cudaMemcpyKind enumeration defined in <cuda_runtime.h>. It serves as a direction flag, telling the CUDA runtime to move data from Host RAM to Device VRAM.

This wrapper makes our intent obvious: dest must be a device pointer; src must be a host pointer; the size is computed from element_count.
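
A minimal sketch of such a wrapper, assuming a hypothetical name copy_to_device:

```cpp
// Copy element_count floats from host memory (src) into device memory (dest).
void copy_to_device(float* dest, const float* src, size_t element_count) {
    CUDA_CHECK(cudaMemcpy(dest, src,
                          element_count * sizeof(float),
                          cudaMemcpyHostToDevice));
}
```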

Copying from Device Back to Host

To validate work or retrieve computed results, we also need the reverse transfer. This function mirrors the previous one, but flips the direction flag to cudaMemcpyDeviceToHost.
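
The mirror-image sketch, again with a hypothetical name (copy_to_host):

```cpp
// Copy element_count floats from device memory (src) back into host memory (dest).
void copy_to_host(float* dest, const float* src, size_t element_count) {
    CUDA_CHECK(cudaMemcpy(dest, src,
                          element_count * sizeof(float),
                          cudaMemcpyDeviceToHost));
}
```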

With these two wrappers, our program can move arrays back and forth without repeating low-level cudaMemcpy details everywhere.

End-to-End Transfer Verification in main

Next, we connect everything in main: we prepare host data in a std::vector, allocate device memory, copy to the GPU, copy back, then verify the values match. Since this is a direct bit-for-bit copy without any arithmetic performed, we can safely check for exact equality. Notice how the wrappers keep main focused on the story of the program.
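
A compact sketch of that flow, reusing the hypothetical helpers above (the exact output wording beyond SUCCESS is an assumption):

```cpp
#include <vector>
#include <iostream>

int main() {
    const size_t n = 8;

    // 1. Prepare host data.
    std::vector<float> h_input(n);
    for (size_t i = 0; i < n; ++i) h_input[i] = static_cast<float>(i);

    // 2. Allocate device memory and round-trip the data.
    float* d_data = device_alloc_floats(n);
    copy_to_device(d_data, h_input.data(), n);

    std::vector<float> h_output(n);
    copy_to_host(h_output.data(), d_data, n);

    // 3. Verify: a pure bit-for-bit copy permits exact equality.
    const bool ok = (h_input == h_output);
    std::cout << (ok ? "SUCCESS" : "FAILURE") << std::endl;

    device_free(d_data);
    return ok ? 0 : 1;
}
```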

When it works, we should see the verification line report SUCCESS, and the returned data print the same sequence we started with.

Common Mistakes to Avoid

  1. Swapping Destination and Source: cudaMemcpy follows the argument order of standard C memcpy (destination, source, size), plus a trailing direction flag. Swapping the first two arguments is a common logic error that leads to data corruption or crashes (mistakes 1-3 are illustrated in the snippet after this list).
  2. Wrong Transfer Direction: Using cudaMemcpyHostToDevice when you meant to copy data back to the CPU will cause a runtime error. The direction flag must strictly match the pointer types provided.
  3. Elements vs. Bytes: Like cudaMalloc, cudaMemcpy expects the size in bytes. If you pass N instead of N * sizeof(float), you will only copy 1/4th of your data (on systems where a float is 4 bytes).
  4. Pointer Confusion: Attempting to access d_data[i] directly within main() (the Host) will cause a segmentation fault. The CPU cannot dereference pointers that belong to the GPU's memory space.
  5. Silent Failures: Transfers can fail for many reasons (e.g., passing an invalid pointer or exceeding memory limits). Always wrap cudaMemcpy in CUDA_CHECK to ensure you aren't trying to process "garbage" data that was never actually copied.
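
To make mistakes 1-3 concrete, here is a side-by-side sketch (h_data, d_data, and n are hypothetical names for a host array, a device allocation, and the element count):

```cpp
// WRONG: destination/source swapped, size given in elements rather than
// bytes, and the direction reversed relative to the pointers passed in:
// cudaMemcpy(h_data, d_data, n, cudaMemcpyHostToDevice);

// RIGHT: (destination, source, size in bytes, direction), wrapped in
// CUDA_CHECK so a failure is reported instead of silently ignored.
CUDA_CHECK(cudaMemcpy(d_data, h_data, n * sizeof(float),
                      cudaMemcpyHostToDevice));
```
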
Conclusion and Next Steps

We can now reliably prepare GPU-ready datasets by allocating VRAM, copying from host to device, and copying back for validation, all with consistent error checking. This is the exact workflow we will reuse before launching kernels, because kernels assume the data is already in the right place.

Up next, the practice tasks will help us build speed and confidence by writing these transfer steps correctly without needing to look them up.
