Creating Glue ETL Jobs

Introduction: Launching Your First Glue ETL Job

Welcome back! In the previous lesson, you prepared your AWS data lake by uploading your sample JSON data and ETL script to Amazon S3. This setup is essential because AWS Glue, the service you will use for data transformation, needs access to both your data and your script in S3.

Now that your files are in place, it is time to take the next step: creating and running your first AWS Glue ETL job. An ETL (Extract, Transform, Load) job in AWS Glue is a managed process that reads data from a source, applies transformations using your script, and writes the results to a target location. Automating this process is a key part of building a scalable and reliable data pipeline.

In this lesson, you will learn how to create a Glue ETL job using Python and the boto3 library, and how to monitor the job’s execution. This will help you move from simply storing data to actually processing and transforming it in the cloud.

What is an AWS Glue Job?

Think of an AWS Glue job as your personal data chef in the cloud. When you have raw ingredients (your data) sitting in S3, and you want to turn them into a delicious, ready-to-serve dish (cleaned, transformed data), you need a recipe (your ETL script) and a chef to follow it. That’s exactly what a Glue job does!

An AWS Glue job takes your script and runs it on fully managed cloud infrastructure. It reads your data from a source (like S3), applies all the transformations you’ve defined (such as cleaning, filtering, or joining data), and then writes the results to a destination of your choice. You don’t have to worry about setting up servers, scaling resources, or handling failures—Glue takes care of all the heavy lifting.

Glue jobs are perfect for automating repetitive data processing tasks. You can schedule them to run at specific times, trigger them when new data arrives, or launch them on demand. This makes Glue jobs a powerful tool for building reliable, scalable data pipelines that keep your data lake fresh and ready for analysis.

What You Need to Create a Glue Job

Before you can create a Glue job, there are a few important pieces you need to have ready. Since you have already uploaded your data and ETL script to S3, you are well prepared. Here is a quick reminder of what is required:

S3 Bucket: This is where your raw data and ETL script are stored. For example, your bucket might be named library-data-lake-{SUFFIX}.
ETL Script Path: This is the S3 location of your ETL script, such as s3://library-data-lake-{SUFFIX}/glue-scripts/sample_etl_script.py.
IAM Role: AWS Glue needs permission to access your S3 data and other AWS resources. This is provided by an IAM role, typically named something like AWSGlueServiceRole.
Job Settings: These include the job name, the type of worker, the number of workers, and other configuration options.

All of these pieces come together when you create a Glue job. The job will use your ETL script to process the data in S3, and the IAM role will ensure it has the necessary permissions. If you are using the CodeSignal environment, you do not need to install boto3 — it is already available for you.
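To make the relationship between these pieces concrete, here is a minimal sketch that gathers them in one place and derives the values a job definition needs. The GlueJobPieces helper, the "demo" suffix, and the account ID 123456789012 are placeholders invented for illustration; the bucket pattern, script path, and role name follow the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlueJobPieces:
    """Illustrative container for the inputs listed above (not an AWS API)."""

    suffix: str       # stands in for the {SUFFIX} part of your bucket name
    account_id: str   # your 12-digit AWS account ID
    role_name: str = "AWSGlueServiceRole"
    script_key: str = "glue-scripts/sample_etl_script.py"

    @property
    def bucket(self) -> str:
        return f"library-data-lake-{self.suffix}"

    @property
    def script_location(self) -> str:
        # Glue expects the script as an s3:// URI
        return f"s3://{self.bucket}/{self.script_key}"

    @property
    def role_arn(self) -> str:
        # IAM is a global service, so the region field of the ARN is empty
        return f"arn:aws:iam::{self.account_id}:role/{self.role_name}"


# Placeholder values; substitute your own suffix and account ID.
pieces = GlueJobPieces(suffix="demo", account_id="123456789012")
print(pieces.script_location)
print(pieces.role_arn)
```

Keeping the derived values in one object is only a convenience; plain string variables work just as well for a single job.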

Example: Creating a Glue ETL Job with Boto3

Let’s walk through how to create a Glue ETL job using Python and the boto3 library. Below is a sample script that brings together all the required components. This script will attempt to create a Glue job named sample-etl-job using your uploaded ETL script and the correct IAM role.
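A minimal sketch of such a script might look like the following. It is an illustration under stated assumptions, not the lesson's exact listing: the function names, the worker settings (G.1X, 2 workers, Glue 4.0, 10-minute timeout), and the wording of the printed messages are all choices made here. Glue documents both AlreadyExistsException and IdempotentParameterMismatchException for create_job; which one you receive for a given name clash is treated as an assumption.

```python
def build_job_definition(role_arn, script_location, job_name="sample-etl-job"):
    """Return keyword arguments for glue.create_job (settings are assumptions)."""
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",              # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
        "Timeout": 10,                      # minutes
    }


def create_glue_job(glue, definition):
    """Try to create the job and return a human-readable status message."""
    name = definition["Name"]
    try:
        response = glue.create_job(**definition)
    except glue.exceptions.ClientError as err:
        # boto3 service errors carry a structured response with an error code
        error = err.response["Error"]
        code = error["Code"]
        if code == "IdempotentParameterMismatchException":
            return f"Job '{name}' already exists with a different configuration."
        if code == "AlreadyExistsException":
            return f"Job '{name}' already exists."
        return f"Could not create job '{name}': {code}: {error.get('Message', '')}"
    return f"Created job '{response['Name']}'."


def main(bucket):
    """Create sample-etl-job using the script stored in the given bucket."""
    import boto3  # imported here so the helpers above work without AWS access

    # Look up the account ID to build the IAM role ARN
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    role_arn = f"arn:aws:iam::{account_id}:role/AWSGlueServiceRole"
    script_location = f"s3://{bucket}/glue-scripts/sample_etl_script.py"

    glue = boto3.client("glue")
    print(create_glue_job(glue, build_job_definition(role_arn, script_location)))
```

To run it, call main with your bucket name from a session that has AWS credentials configured. Passing the client into create_glue_job keeps the error handling easy to exercise with a stand-in client.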

When you run this script, it will connect to AWS Glue and attempt to create a new job. If the job already exists with the same configuration, the script tells you that the job is already in place. If you try to create a job with the same name but a different configuration, the script reports the conflict instead of creating a second job.

Let’s break down what is happening here. The script first creates a Glue client and retrieves your AWS account ID to build the correct IAM role ARN. It then calls create_job with all the necessary parameters, including the job name, IAM role, script location, and job settings. If Glue returns a structured error (which boto3 raises as a ClientError), the script checks the AWS error code and reports whether the job already exists with a different configuration or whether another problem occurred.
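Creating a job does not run it. Once sample-etl-job exists, a run can be launched on demand with the Glue client's start_job_run call. The sketch below is a hedged illustration: the start_run helper and its optional arguments parameter are names chosen here, not part of the lesson's script.

```python
def start_run(glue, job_name="sample-etl-job", arguments=None):
    """Start one on-demand run of the job and return its run ID."""
    kwargs = {"JobName": job_name}
    if arguments:
        # Optional per-run job arguments, e.g. {"--key": "value"}
        kwargs["Arguments"] = arguments
    response = glue.start_job_run(**kwargs)
    return response["JobRunId"]
```

With credentials configured, a call such as start_run(boto3.client("glue")) returns the run ID, which you can then look for when checking the job's recent runs.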

Example: Monitoring Glue Job Runs

After creating your Glue job, it is important to monitor its execution. This helps you understand whether your job is running as expected, and it can help you troubleshoot any issues that arise.

Here is a Python function that uses boto3 to check the status of recent runs for your Glue job:
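A minimal sketch of such a function is shown below. The name check_job_runs, the max_results default, and the exact line format are assumptions made for illustration; the get_job_runs call and the Id, JobRunState, and StartedOn fields come from the Glue API.

```python
def check_job_runs(glue, job_name="sample-etl-job", max_results=5):
    """Print and return one status line per recent run of the job."""
    response = glue.get_job_runs(JobName=job_name, MaxResults=max_results)
    runs = response.get("JobRuns", [])
    if not runs:
        print(f"No runs found for job '{job_name}'.")
        return []

    lines = []
    for run in runs:
        # StartedOn is a datetime; it may be absent if the run has not started
        started = run.get("StartedOn")
        lines.append(
            f"Run ID: {run['Id']} | State: {run['JobRunState']} | Started: {started}"
        )
    for line in lines:
        print(line)
    return lines
```

Call it with a Glue client, for example check_job_runs(boto3.client("glue")), from a session that has AWS credentials configured.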

When you run this code, it prints one line for each recent run of your job. Each line shows the unique run ID, the current state of the job run (such as SUCCEEDED, FAILED, or RUNNING), and the time the run started. This information lets you track the progress of your ETL jobs and spot failed or stalled runs quickly.

The JobRunState tells you whether the job completed successfully, failed, or is still in progress. If you see a state of FAILED, you may want to check the AWS Glue console for more details about the error.
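As a rough illustration of how a state can be turned into a next step, the sketch below groups the JobRunState values and surfaces the ErrorMessage field that Glue attaches to unsuccessful runs. The grouping into "in progress" and "unsuccessful" sets and the summarize_run helper are choices made here for illustration, not an official classification.

```python
# Grouping of Glue JobRunState values (an illustrative choice)
IN_PROGRESS = {"STARTING", "WAITING", "RUNNING", "STOPPING"}
UNSUCCESSFUL = {"FAILED", "ERROR", "TIMEOUT", "STOPPED", "EXPIRED"}


def summarize_run(run):
    """Describe one entry from the JobRuns list in plain language."""
    state = run["JobRunState"]
    if state == "SUCCEEDED":
        return "completed successfully"
    if state in IN_PROGRESS:
        return "still in progress"
    if state in UNSUCCESSFUL:
        # ErrorMessage is only present when Glue recorded a reason
        reason = run.get("ErrorMessage", "no error message reported")
        return f"ended as {state}: {reason}"
    return f"unrecognized state {state}"
```

The error message returned by the API is often enough for a first diagnosis; the AWS Glue console remains the place to look for fuller details.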

Summary and What’s Next

In this lesson, you learned how to create an AWS Glue ETL job using Python and the boto3 library, and how to monitor the status of your job runs. You saw how all the pieces you prepared earlier — your S3 data, ETL script, and IAM role — come together to launch a managed ETL process in AWS.

You are now ready to practice these steps on your own. In the next set of exercises, you will get hands-on experience creating and monitoring Glue jobs, which will help you build confidence in automating your data processing workflows. Take a moment to review the code examples and outputs from this lesson, and get ready to put your new skills into action!
