Welcome back! In the previous lesson, you learned how to register your processed Parquet data in the AWS Glue Catalog using a Glue Crawler. This made your data discoverable and ready for analysis with tools like Amazon Athena. Now that your data is organized and cataloged, the next step is to prepare it for business analysis by summarizing or aggregating it.
Aggregation is a key part of any data pipeline. It allows you to turn large amounts of raw data into meaningful summaries, such as daily totals or event counts. These summaries are much easier to analyze and are often what business users and analysts need for reporting and decision-making. AWS Glue ETL jobs make it possible to automate this process so you can regularly create up-to-date summary tables from your raw or processed data.
In this lesson, you will learn how to build a Glue ETL job that reads processed data from your Glue Catalog, performs aggregations using PySpark, and writes the results back to S3 in a format that is ready for further analysis.
Before we dive into the example, let’s quickly review the main components you’ll see in a Glue ETL script for aggregation. If you’ve followed along with earlier lessons, some of these will be familiar, but I’ll point out what’s new.
A typical Glue ETL script starts by setting up the job environment. This includes creating a GlueContext, which is the main entry point for working with AWS Glue, and a SparkContext, which is needed for running Spark jobs. You’ll also see the use of getResolvedOptions to read job arguments, such as the source database and table, and the target S3 bucket and prefix where you want to write your results.
One important concept in Glue is the DynamicFrame, which is a flexible data structure designed for semi-structured data. However, for aggregation tasks, it’s common to convert a DynamicFrame to a Spark DataFrame, which provides powerful functions for grouping and summarizing data.
Let’s look at how you can aggregate data using PySpark within a Glue ETL job. Suppose you have a table of user events, and you want to create a daily summary that counts the number of events and sums up the revenue for each day.
Here’s a code example that shows how to do this. This script assumes your processed data is already registered in the Glue Catalog, and you have the necessary job arguments set up.
Let’s break down what’s happening here. The script starts by importing the necessary libraries and reading job arguments. It then creates the Spark and Glue contexts, which are required for running the ETL job. The script loads the source data from the Glue Catalog as a DynamicFrame and immediately converts it to a Spark DataFrame using .toDF(). This is important because Spark DataFrames provide the groupBy and aggregation functions you need.
The aggregation step groups the data by the event_date column. For each date, it counts the number of events and sums the revenue column. The result is a new DataFrame with one row per day, showing the total number of events and the total revenue for that day.
Once you have your aggregated DataFrame, the next step is to save the results back to S3. In this example, the script writes the summary data in Parquet format, which is efficient for both storage and querying. The .write.mode("overwrite").parquet(...) command tells Spark to write the DataFrame to the specified S3 path, replacing any existing data at that location.
The output path is built using the target bucket and a prefix for curated data, followed by daily_summary/. This organization makes it easy to find and manage your summary tables in S3. For example, if your target bucket is library-data-lake-{SUFFIX} and your curated prefix is curated/library/, the output will be written to:
After running the script, you will see Parquet files in this S3 location. Each file contains the daily summary data, ready for further analysis or reporting.
For larger production datasets, you would often partition curated outputs by a field such as date to make downstream querying more efficient. In this course, we keep the curated outputs flat under paths like curated/library/daily_summary/ and curated/library/patron_summary/ so the aggregation logic stays focused on PySpark fundamentals.
Let’s walk through the full script step by step to make sure you understand how all the pieces fit together.
The script begins by importing the required modules and reading job arguments, which allows you to reuse the script for different source tables or output locations. It then sets up the Spark and Glue contexts, which are needed for any Glue ETL job.
Next, the script loads the source data from the Glue Catalog using create_dynamic_frame.from_catalog. This function reads the table you specify and returns a DynamicFrame. Since aggregation is easier with Spark DataFrames, the script converts the DynamicFrame to a DataFrame using .toDF().
The core of the script is the aggregation step. Using the groupBy method, the script groups the data by event_date. It then uses the agg function to count the number of events and sum the revenue for each day. The result is a DataFrame with columns for event_date, total_events, and total_revenue.
Finally, the script writes the aggregated DataFrame to S3 in Parquet format. The mode("overwrite") option ensures that if you rerun the job, the output will be replaced with the latest results. The script ends by calling job.commit(), which signals to Glue that the job has finished successfully.
In this lesson, you learned how to use AWS Glue ETL and PySpark to aggregate processed data into daily summaries. You saw how to load data from the Glue Catalog, perform group-by and aggregation operations with Spark DataFrames, and write the results back to S3 in Parquet format. This process is a key step in preparing your data lake for business analysis and reporting.
You are now ready to practice these skills yourself. In the next set of exercises, you will get hands-on experience building and running a Glue ETL job that performs data aggregation. This will help you reinforce what you’ve learned and prepare you for the next steps in building a robust, automated data pipeline on AWS.
