Introduction: From Raw Data to Analytics-Ready Models

Now that you have learned how to connect Amazon Redshift Serverless to your S3 data lake and run queries using Redshift Spectrum, you are ready to take the next step in building a modern analytics warehouse. In this lesson, you will move beyond querying raw data and start organizing it for analytics by creating dimension tables inside Redshift. This is a key part of modeling your data for fast, flexible analysis. By the end of this lesson, you will know how to use Python and the boto3 library to create dimension tables in Redshift, setting the foundation for a star schema that supports business intelligence and reporting.

Spectrum vs. Loading Data into Redshift: When to Use Each

Before we dive into modeling, let’s clarify a key design decision: Should you keep querying data in S3 with Spectrum, or should you load it into Redshift tables? Each approach has its strengths and trade-offs. Here’s a quick comparison to help you choose the right tool for the job:

ApproachLatencyPerformanceCostJoinsWhen to Use
Spectrum (S3 external)HigherSlower for complexPay per query/dataLimited (external→internal joins slow)Ad hoc, raw/large data, infrequent
Internal Redshift tableLowerFastest, optimizedStorage + computeFull support, fastFrequent queries, reporting, joins

Guidance:
Use Spectrum for ad hoc or infrequent analysis on large, raw, or rarely accessed data in S3.
Load data into Redshift tables for frequent, complex, or join-heavy analytics, especially when powering dashboards or BI reports.

  • Latency: Spectrum queries have higher startup time since data is read from S3. Internal tables are always ready for fast access.
  • Performance: Internal tables are optimized for analytics (sort keys, distribution, indexes). Spectrum is great for quick exploration, but can be slow for large joins or aggregations.
  • Cost: Spectrum charges per amount of data scanned. Internal tables incur storage and compute costs, but are more cost-effective for repeated queries.
  • Joins: Joining external (S3) and internal (Redshift) tables is possible, but much slower than joining internal tables only.
Why Model Your Data?

Raw data in your data lake is often messy, repetitive, and not optimized for analytics. While you can run queries directly on S3 using Redshift Spectrum, these queries can be slow and difficult to maintain as your data grows. Modeling your data—by creating structured tables inside Redshift—makes your analytics faster, more reliable, and easier to use for business intelligence (BI) and reporting.

The Star Schema: Foundation of Analytics Warehousing

A star schema is a classic data modeling pattern used in analytics databases. It organizes your data into two main types of tables:

  • Fact tables: Store measurable events (e.g., book borrowings, sales transactions).
  • Dimension tables: Store descriptive information about the entities involved in those events (e.g., books, patrons, dates).

The fact table sits at the center, and each dimension table connects to it like points on a star. This structure makes it easy to join, filter, and aggregate data for reporting and analysis.

Visual: Star Schema for Library Analytics

Deep Dive: What Are Dimension Tables?

Dimension tables are the backbone of your analytics model. Each dimension table describes a single business entity in detail. For example:

  • dim_book: Information about each book (ID, title, genre, author, etc.)
  • dim_patron: Information about each library patron (ID, name, contact info)
  • dim_date: Calendar information (date, year, month, day)

Dimension tables are typically small compared to fact tables, but they are rich in descriptive columns. They provide the context needed to interpret the numbers in your fact tables.

Why Are Dimension Tables Important?

  • Clarity: They turn cryptic IDs into meaningful labels (e.g., book_id → book_title).
  • Consistency: Centralize business definitions (e.g., what counts as a “patron”).
  • Flexibility: Make it easy to add new attributes (e.g., book language, patron segment).
  • Performance: Enable fast joins and filtering in analytics queries.
Quick Recap: Why Modeling Matters
  • Performance: Queries run faster when facts and dimensions are designed for analytics (using sort keys, distribution styles, and surrogate keys).
  • Consistency: Centralized dimensions provide a single source of truth for entities (e.g., a patron’s name and contact info).
  • Usability: Analysts work with clean, descriptive columns instead of raw, denormalized fields.
  • Flexibility: New facts and dimensions can be added without disrupting existing analysis.

For this course, the dimension tables stay simple and natural-key based so you can focus on the fundamentals of Redshift modeling. In production warehouses, teams often introduce surrogate keys and slowly changing dimension (SCD) patterns to track history and keep joins stable over time, but those topics are beyond the scope of this unit. The important modeling idea here is that small dimensions are often a good fit for DISTSTYLE ALL, because copying them to every node can make joins to large fact tables very fast. As dimensions grow, though, ALL becomes more expensive to maintain, and patterns like DISTSTYLE KEY or EVEN may be more appropriate depending on query shapes and table size.

Environment Preparation

To follow along with the example, you will need access to your Redshift Serverless workgroup, the database you want to use, and the necessary AWS credentials. On CodeSignal, the boto3 library and other required packages are already installed for you. If you are working on your own machine, you can install boto3 using pip, but you do not need to worry about this step in the CodeSignal environment.

You will also need to set a few environment variables so that your script can connect to Redshift securely. These include your AWS region, workgroup name, database name, and the ARN of your AWS Secrets Manager secret. These variables help keep your credentials secure and your code flexible.

Creating Dimension Tables With Python And boto3

Let’s walk through how to create dimension tables in Redshift using Python and the boto3 library. The following script will create a new schema called warehouse and three dimension tables: dim_book, dim_patron, and dim_date. Each table is designed with a primary key and columns that describe the entity.

1. Set Up Your Environment and Redshift Data API Client
2. Define SQL Statements for Schema and Dimension Tables

At this stage, you are creating one schema to hold your warehouse objects and three small dimensions that provide descriptive context for later analytics. dim_book stores book attributes used for slicing by title or genre, dim_patron stores borrower attributes, and dim_date gives you a reusable calendar dimension for grouping by year, month, or day.

  • DISTSTYLE ALL: Copies the entire dimension table to every node in the Redshift cluster, making joins with fact tables very fast (ideal for small tables).
  • SORTKEY: Organizes data on disk for efficient querying, especially when filtering or joining on the key column.
3. Helper Function to Run SQL Statements
4. Run the Script to Create Schema and Tables

When you run this script, you will see output similar to:

How These Dimension Tables Fit Into Analytics

With your dimension tables in place, you are ready to build the fact table (e.g., fact_borrow) that records borrowing events. The fact table will reference the primary keys from your dimension tables, creating a star schema. This structure allows you to:

  • Analyze borrowing trends by book genre, patron, or date
  • Calculate overdue fines by month or patron segment
  • Power BI dashboards and reports with fast, reliable queries
Summary and What’s Next

In this lesson, you learned why we move from raw data to modeled tables for analytics, what a star schema is and how dimension tables fit into it, and how to create dimension tables in Amazon Redshift using Python and the boto3 library. You also saw a clear comparison between querying external data with Spectrum and loading data into Redshift tables, helping you make informed design choices for your analytics warehouse.

Next, you’ll get hands-on practice creating schemas and tables in Redshift. After that, you’ll load data into your dimension tables and build the central fact table, bringing your analytics warehouse to life!

Sign up
Join the 1M+ learners on CodeSignal
Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal