Querying S3 with Spectrum

Introduction: Connecting Redshift Serverless to Your Data Lake

Welcome to the first lesson of the course, Data Warehousing on AWS with Amazon Redshift. In this lesson, you will learn how to connect Amazon Redshift Serverless to your S3 data lake and run your very first cross-lake query. This is a foundational step in building a modern analytics warehouse on AWS, and it will set you up for more advanced topics in the units ahead.

Amazon Redshift Serverless is a fully managed data warehouse that automatically handles infrastructure, scaling, and maintenance. This means you can focus on analyzing your data rather than managing clusters. In this lesson, you will see how Redshift Serverless can connect directly to data stored in Amazon S3, which is often used as a data lake for raw and processed data.

To make this connection work, you will use a few key AWS services:

Amazon S3: Stores your data files, such as Parquet or CSV.
AWS Glue Data Catalog: Holds metadata about your data, like table definitions and schema.
IAM Roles: Manage permissions so Redshift can securely access your S3 data.

You will use a feature called Redshift Spectrum to query data in S3 without having to load it into Redshift tables first. This is a fast way to get value from your data lake and is often the first step in building a warehouse on AWS.

Quick Reference: IAM Permissions for Redshift Data Lake Operations

Before you dive into querying, it's helpful to know what permissions are required behind the scenes. In real AWS environments, the IAM role attached to your Redshift workgroup must have the right permissions to access S3 and Glue. Here's a quick matrix to clarify what's needed for the most common operations:

Operation	S3 Permissions	Glue Permissions	Notes
Spectrum SELECT	`s3:GetObject`	`glue:GetDatabase`, `glue:GetTable`, `glue:GetPartitions`	For reading external tables in S3
COPY from S3	`s3:GetObject`	(none required)	For loading data into Redshift
UNLOAD to S3	`s3:PutObject`	(none required)	For exporting data from Redshift to S3

In this CodeSignal environment, all required permissions are already set up for you. But in production, you'll need to ensure your Redshift IAM role has these permissions—otherwise, your queries may fail with access errors.

The Library Table: Data Lake Schema Overview

Before you start querying, let's look at the structure of the data you'll be working with. The main table in your S3 data lake is called library. This table is defined in the AWS Glue Data Catalog and points to Parquet files stored in your S3 bucket. Each row represents a record of a book borrowing event at a library.

Here is a JSON overview of the library table schema:

You will use this table in your upcoming practice exercises to run cross-lake queries directly on S3 data.

Setting Up Redshift Serverless (Overview)

Before you can query data in S3, you need a Redshift Serverless workgroup and namespace. Let's understand what these are and how they work together.

A namespace in Redshift Serverless is a logical container that holds your databases, schemas, users, and other database objects. Think of it as your data warehouse environment. A workgroup is the compute layer that processes your queries. It defines the compute capacity and network settings for your queries.

In a real AWS environment, you would typically set up these components through the AWS Console or using Infrastructure as Code tools like CloudFormation. Here's what the setup process would look like:

Create a namespace: Define your database name, admin user credentials, and encryption settings
Create a workgroup: Configure compute capacity, subnet groups, and security groups
Set up IAM roles: Create roles that allow Redshift to access S3 and Glue Data Catalog
Configure the Glue Data Catalog: Define your data lake tables and their schemas

However, in CodeSignal, all of this infrastructure is already set up for you. You don't need to worry about creating workgroups, namespaces, or configuring IAM roles. The environment variables you'll use in your code point to pre-configured resources that are ready to use.

This allows you to focus on the core concepts of querying your data lake rather than spending time on infrastructure setup. In production environments, you would work with your DevOps or cloud engineering team to set up these foundational components.

Key Concepts For Cross-Lake Queries

Now that you understand the infrastructure setup, let's clarify the key concepts you'll use to query your data lake.

An external schema in Redshift is a way to map tables defined in your Glue Data Catalog to your Redshift environment. This allows you to run SQL queries on data stored in S3 as if it were part of your Redshift database. You do not need to move or copy the data; Redshift Spectrum reads it directly from S3.

The Glue Data Catalog acts as a central store for metadata. It keeps track of what tables exist, their columns, data types, and where the data is stored in S3. When you create an external schema, you are telling Redshift to use this catalog to find and understand your S3 data.

IAM roles are essential for security. Redshift needs permission to read data from S3 and to access the Glue Data Catalog. You will attach an IAM role to your Redshift workgroup that grants these permissions. Without the correct IAM role, your queries will not be able to access the data. In CodeSignal, the IAM role setup is already handled for you, so you do not need to configure roles yourself.

With these concepts in mind, you are ready to see how to set up and run your first cross-lake query.

Note on Glue database name in this environment:

The Glue database name is dynamic (for example, data_lake_catalog_).
Always use the GLUE_DATABASE environment variable in code instead of hard-coding a name.

Visual Aid: How Redshift, S3, Glue, and IAM Work Together

To help you visualize how the different AWS services interact in a modern data warehousing workflow, take a look at the diagram below:

Amazon Redshift Serverless: Runs your SQL queries and manages compute resources.
Amazon S3 Data Lake: Stores your raw and processed data files (e.g., Parquet, CSV).
AWS Glue Data Catalog: Holds metadata (table definitions, schema) about your S3 data.
IAM Role: Grants Redshift permission to access both S3 and Glue securely.

When you run a query in Redshift using Spectrum, Redshift consults the Glue Data Catalog for table definitions, uses the IAM Role to securely access S3, and reads the data directly from your data lake—all without moving or copying the data.

This architecture enables you to analyze large datasets efficiently and securely, leveraging the strengths of each AWS service.

Setting Up Your Python Environment

Let's walk through a practical implementation of how to connect Redshift Serverless to your S3 data lake and run a query using Redshift Spectrum. You will use Python and the boto3 library to interact with AWS services. On CodeSignal, these libraries are pre-installed.

First, set up your environment variables and create a boto3 client for the Redshift Data API. This client allows you to send SQL commands to your Redshift Serverless workgroup.

These environment variables contain all the configuration needed to connect to your Redshift Serverless workgroup. The WORKGROUP_NAME identifies your compute resources, DATABASE is the database you'll query, SECRET_ARN provides authentication credentials, IAM_ROLE grants permissions to access S3 and Glue, and GLUE_DB is the name of your Glue Data Catalog database.

Next, you'll need a helper function to execute SQL statements and wait for results. The Redshift Data API is asynchronous, meaning it returns immediately while your query runs in the background.

Creating an External Schema

Now that you have your Python environment set up, you can create an external schema in Redshift. This schema will point to your Glue Data Catalog, which describes the tables stored in S3.

This SQL statement creates an external schema named ext_lake that connects to your Glue Data Catalog. Let's break down what each part does:

CREATE EXTERNAL SCHEMA IF NOT EXISTS ext_lake: Creates a schema called ext_lake in Redshift. The IF NOT EXISTS clause ensures the command won't fail if the schema already exists.
FROM DATA CATALOG: Tells Redshift to use the AWS Glue Data Catalog as the metadata source.
DATABASE '{GLUE_DB}': Specifies which Glue database contains your table definitions, read from the GLUE_DATABASE environment variable.
IAM_ROLE '{IAM_ROLE}': Provides the necessary permissions for Redshift to access both S3 and the Glue Data Catalog.
Because the Glue database is already created during setup, the external schema just maps Redshift to the existing catalog metadata.

Once this schema is created, you can query tables in your S3 data lake as if they were regular Redshift tables, using the ext_lake schema prefix.

Querying the External Table

With the external schema in place, you can now query your S3 data directly. Let's start with a simple count query to verify the connection works.

This is a simple SQL query that counts all rows in the library table within the ext_lake schema. The table library exists in your S3 data lake and is defined in the Glue Data Catalog. When you run this query, Redshift Spectrum will read the data directly from S3 and return the count without loading the data into Redshift first.

Now let's put it all together and execute both the schema creation and the query:

This main execution block runs the two SQL statements in sequence. First, it creates the external schema, which establishes the connection between Redshift and your Glue Data Catalog. Then it runs the count query on the external table. The result from the Redshift Data API comes back in a specific format where numeric values are stored in a longValue field within nested arrays and dictionaries. The result[0][0]['longValue'] extracts the actual count number from this structure.

When you run this script, you will see output similar to:

Summary And What's Next

In this lesson, you learned how to connect Amazon Redshift Serverless to your S3 data lake using an external schema and the Glue Data Catalog. You also reviewed the structure of the library table, which you will use for your queries. You understood the infrastructure components needed (workgroups, namespaces, and IAM roles) and saw how they work together to enable cross-lake queries. You also walked through a complete implementation of creating an external schema and running a query on S3 data using Python and the Redshift Data API.

You are off to a great start — let's move on to the hands-on practice!

Next Lesson: Creating Dimension Tables

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal