Fast Data Loads in Redshift

Introduction: Loading and Exporting Data Efficiently

Now that you have created your dimension tables and set up your star schema in Amazon Redshift, you are ready for the next step: moving data in and out of your warehouse efficiently. In the previous lesson, you learned how to organize your data for analytics by building tables that are optimized for fast queries and reporting. In this lesson, you will learn how to load large amounts of data from your S3 data lake into Redshift using the COPY command, and how to export curated datasets back to S3 using the UNLOAD command. These are essential skills for any data engineer working with Redshift, as they allow you to quickly populate your warehouse and share results with downstream users or other systems. By the end of this lesson, you will be able to use Python and the boto3 library to move data between S3 and Redshift in a way that is both fast and reliable.

Preparing For COPY And UNLOAD

Before you can load or export data, you need to make sure your Redshift environment has the right permissions and configuration. As a reminder, you will need an IAM role that allows Redshift to access your S3 buckets. This role should have both GetObject and PutObject permissions, since you will be reading from and writing to S3. You should also have your environment variables set up for your AWS region, Redshift workgroup name, database name, the ARN of your AWS Secrets Manager secret, your S3 bucket, and your AWS account ID. If you are working in the CodeSignal environment, these libraries and most environment variables are already set up for you, but it is good practice to know how to configure them on your own machine.

To connect to Redshift and run SQL commands from Python, you will use the boto3 library and the Redshift Data API. Here is a quick reminder of how to set up your client and environment variables:

With this setup, you are ready to start moving data.

Helper Function: Running SQL Statements

To interact with Redshift from Python, you can use a helper function that executes SQL statements and waits for them to finish. This function will also fetch results for queries that return data (like SELECT), and handle errors gracefully.

Loading Parquet Data From S3 With COPY

Let’s walk through how to load data from S3 into your Redshift fact table using the COPY command. The COPY command is designed for high-speed, parallel data loading. It can read data in various formats, but Parquet is especially efficient because it is columnar and schema-aware.

Suppose your fact table is named with your account ID for isolation, like warehouse_{ACCOUNT_ID}.fact_borrow_{ACCOUNT_ID}. Here’s how you can load Parquet data from your S3 bucket into this table:

This command tells Redshift to load all Parquet files from the specified S3 path into your fact table. The IAM_ROLE parameter provides the necessary permissions, and FORMAT AS PARQUET tells Redshift to expect Parquet files. Before running COPY, make sure the incoming Parquet schema is compatible with your Redshift target table: column names, order expectations, and data types should line up closely enough to avoid runtime mapping or coercion errors. In practice, it is worth validating the upstream schema before large loads so you catch mismatches early.

Verifying The Data Load

After loading data, it is important to verify that the expected number of rows has been loaded into your table. You can do this by running a simple SELECT COUNT(*) query:

This code will print the number of rows currently in your fact table, confirming that your data load was successful. A typical run might look like this:

Exporting A Data Mart With UNLOAD

Once your data is loaded and verified, you may want to share a subset of it with other teams or systems. The UNLOAD command allows you to export the results of a query from Redshift back to S3, often in Parquet format for efficiency and compatibility. You can also partition the output files by a column, which makes it easier for downstream users to work with the data.

Here is an example that exports a summary of total fines by book title to S3, partitioned by book_title. The output files are named with your account ID for uniqueness:

This script runs a query that joins the fact and dimension tables, groups by book title, and sums the fines. The results are exported to S3 in Parquet format, with one partition per book title. In real pipelines, you should think carefully about export layout: partition only on columns that improve downstream filtering, prefer compressed columnar output such as Parquet, and avoid producing too many tiny files because they increase storage overhead and slow later reads. Good partitioning and file sizing usually improve both query performance and cost.

Summary And Practice Preview

In this lesson, you learned how to move data efficiently between S3 and Redshift using the COPY and UNLOAD commands. You saw how to load Parquet data into your fact table, verify the load with a row count, and export a curated data mart back to S3 for sharing or further analysis. These skills are essential for building a fast, reliable analytics warehouse on AWS. In the next set of practice exercises, you will get hands-on experience running these commands yourself, so you can confidently move data in and out of Redshift in your own projects.

Previous Lesson

Next Lesson: Query Tuning with Redshift

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal