Understanding the ETL Script

Introduction: The Role of the ETL Script

Welcome back! In the previous lesson, you prepared your AWS data lake by uploading sample JSON data and organizing your S3 folders. Now, you are ready to take the next step: transforming your raw data into a format that is easier to analyze and more efficient to store. This is where the ETL (Extract, Transform, Load) script comes in.

The ETL script is a key part of your data pipeline. It reads the raw JSON data you uploaded to S3, applies important transformations, and writes the cleaned and optimized data back to S3 in a new format. In this lesson, you will learn how the ETL script works, what transformations it performs, and why these steps are important for building a reliable and high-performance data lake. This understanding will prepare you for running and monitoring ETL jobs in AWS Glue in the next part of the course.

Understanding the ETL Process: Extract, Transform, Load

Before diving into the script details, let's understand what ETL means and why it's essential for data analytics:

Extract: This is where we read data from its source. In our case, we're extracting JSON data from your S3 bucket. JSON is great for storing and transmitting data, but it's not optimized for analytics queries.

Transform: This is where we clean, modify, and restructure the data. We handle missing values, add new columns for better organization, and prepare the data for efficient querying. Think of this as organizing a messy room - we're making everything neat and putting things where they belong.

Load: This is where we write the transformed data back to storage, but in a format that's optimized for analytics. Instead of keeping the data as JSON, we convert it to Parquet format, which is much faster and cheaper to query.

Why Parquet? Understanding the Format Change

You might wonder: "Why not just keep the data as JSON?" Here's why we convert to Parquet:

JSON limitations for analytics:

JSON stores data row by row, which means to calculate something like "average fine amount," you have to read every single record
JSON includes field names repeatedly, wasting storage space
JSON doesn't compress well, leading to higher storage costs
Querying JSON is slow because analytics tools have to parse text for every operation

Parquet advantages:

Columnar storage: Parquet stores data column by column. When you want the average fine amount, it only reads that specific column, not entire records
Better compression: Similar data values are stored together, allowing for much better compression (often 75% smaller files)
Faster queries: Analytics tools like Athena can skip entire chunks of data that don't match your query criteria
Schema evolution: Parquet handles changes to your data structure better than JSON

Think of it this way: JSON is like storing books by throwing them randomly in a warehouse, while Parquet is like organizing them by subject on specific shelves - much easier to find what you need!

The ETL Script We'll Build and Use

ETL scripts are incredibly powerful and flexible - they can perform virtually any data transformation you need. You could use them to join data from multiple sources, perform complex calculations, apply machine learning models, clean messy text data, or restructure entire datasets. The possibilities are endless, and in real-world projects, ETL scripts often contain hundreds of lines of sophisticated data processing logic.

However, for this course, we're starting with a focused script that demonstrates the core ETL concepts you need to master. This script performs the most common and essential transformations that you'll use in almost every data pipeline project.

Here's the complete ETL script that you'll be building and using throughout this course:

When you run this script, you'll see output showing the data transformation process:

ETL Script Workflow and Key Transformations

Now let's break down exactly what this script does, step by step, and understand why each transformation matters:

Step 1: Setup and Configuration The script starts by setting up the AWS Glue environment and reading configuration parameters that tell it where to find input data and where to write output data. This makes the script flexible - you can run it against different datasets without changing the code.

Step 2: Extract - Reading the JSON Data with Schema Inference The script connects to your S3 bucket and reads the JSON files using spark.read.json(input_path, multiLine=True). The multiLine=True option matters here because the sample input is stored as a JSON array rather than newline-delimited JSON. Here's where something powerful happens: schema inference. Instead of manually telling the script "this field is text, this field is a number," the script automatically figures this out by examining your data. This makes your pipeline flexible - if you add new fields to your JSON data later, the script will handle them automatically.

Step 3: Transform - Cleaning and Enhancing the Data This is where the magic happens. Our script performs a critical transformation:

Missing Value Handling: Real-world data is messy. Sometimes a library book doesn't have a fine amount recorded, or the overdue status is missing. Our script fills these gaps:

Missing fine_amount becomes 0.0 (no fine)
Missing overdue becomes False (not overdue)

This prevents errors in later analysis and ensures consistent data quality.

Step 4: Load - Writing Parquet Files The script writes the cleaned data back to S3 in Parquet format. All the data is written to the specified output path, without partitioning into subfolders.

Once you master these fundamental transformations, you'll have the foundation to build much more complex ETL processes. You could extend this script to merge customer data with transaction data, calculate rolling averages, detect anomalies, or perform any other data transformation your business requires.

Summary and What's Next

In this lesson, you learned how the ETL script transforms your raw JSON data into an analytics-optimized format. You now understand:

Why ETL is essential for data analytics projects
The advantages of Parquet over JSON for analytical workloads
How schema inference makes your pipeline flexible and maintainable
The specific transformations that clean and enhance your data

You are now ready to practice these concepts yourself. In the next set of exercises, you will get hands-on experience working with the ETL script, exploring its transformations, and preparing your data for the next stages of your data lake project. You'll see firsthand how these transformations make your data faster and cheaper to analyze.

Previous Lesson

Next Lesson: Creating Glue ETL Jobs

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal