Cataloging Data with Glue Crawler

Introduction: Why Catalog Your Processed Data?

Welcome back! In the last lesson, you learned how to create and monitor AWS Glue ETL jobs to transform your raw data into a more efficient Parquet format. Now that your data is processed and stored in S3, the next step is to make it easy to find and query. This is where cataloging comes in.

Cataloging your processed data means registering it in a central place — the AWS Glue Data Catalog — so that other AWS services, like Athena, can easily discover and query it. Without this step, your data would just be files in S3, and you would have to manually describe their structure every time you wanted to analyze them. By cataloging, you automate schema discovery and make your data lake much more useful.

To help with this, AWS provides a tool called the Glue Crawler. In this lesson, you will learn how to use a Glue Crawler to scan your processed Parquet data and register its schema in the Glue Catalog. This is a key step in building a data lake that is ready for analytics.

What Is An AWS Glue Crawler?

An AWS Glue Crawler is a component of AWS Glue that scans your data in S3, infers its structure (schema), and records that information in the Glue Data Catalog. Think of it as a smart assistant that looks at your files, understands their format and columns, and then creates or updates tables in your data catalog.

When you run a crawler, it connects to your S3 bucket, reads the files (like your Parquet files), and infers the schema — such as column names, data types, and partitions. It then creates a table in the Glue Catalog, which acts as a map to your data. This table can be used by Athena and other AWS services to run SQL queries without you having to manually define the schema.
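Once a crawler has created a table, the inferred schema can be read back from the catalog. The following is a minimal sketch, assuming a boto3 Glue client is passed in as glue; the helper name summarize_table is hypothetical, and the database and table names are whatever your crawler produced.

```python
def summarize_table(glue, database, table):
    """Return (columns, partition_keys) as lists of (name, type) pairs.

    `glue` is expected to be a boto3 Glue client, for example
    boto3.client("glue"). The helper itself is illustrative only.
    """
    tbl = glue.get_table(DatabaseName=database, Name=table)["Table"]
    # Regular columns live under the storage descriptor.
    columns = [(c["Name"], c["Type"]) for c in tbl["StorageDescriptor"]["Columns"]]
    # Partition columns are listed separately and may be absent.
    partitions = [(k["Name"], k["Type"]) for k in tbl.get("PartitionKeys", [])]
    return columns, partitions
```

Printing the two returned lists is a quick way to confirm that the column names and types the crawler inferred match what your ETL job wrote.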

This process saves you time and reduces errors as your data changes or grows. Crawlers are especially useful in data lakes where new data arrives often and you want to keep your catalog up to date.

What You Need Before Running A Crawler

Before you can run a Glue Crawler, there are a few things you need to have ready. Since you have already processed your data with a Glue ETL job, you should now have Parquet files stored in a specific S3 path. You will also need a Glue database to store your catalog tables and an IAM role that gives Glue permission to access your S3 data and update the catalog.

Here is a quick reminder of what you need for this step:

The S3 path to your processed Parquet data. For example, this might look like s3://library-data-lake-{SUFFIX}/processed/library/.
The name of the Glue database where you want the new table to be created. In this course, we use data_lake_catalog_{SUFFIX}.
An IAM role that allows Glue to read from your S3 bucket and update the Glue Catalog. In this course, this is typically glue_s3_library.
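The three inputs above can be gathered in one place and the database checked before the crawler is created. This is a minimal sketch: build_config and ensure_database are hypothetical helper names, glue is assumed to be a boto3 Glue client, and creating the database when it is missing is an assumption (in the course environment it may already exist).

```python
def build_config(suffix):
    """Derive the course resource names from your personal suffix."""
    return {
        "s3_path": f"s3://library-data-lake-{suffix}/processed/library/",
        "database": f"data_lake_catalog_{suffix}",
        "role": "glue_s3_library",
    }


def ensure_database(glue, name):
    """Create the Glue database if it is missing.

    Returns True if the database was created, False if it already existed.
    `glue` is expected to be a boto3 Glue client.
    """
    try:
        glue.get_database(Name=name)
        return False
    except glue.exceptions.EntityNotFoundException:
        glue.create_database(DatabaseInput={"Name": name})
        return True
```

With a real client, this would be called as ensure_database(boto3.client("glue"), build_config(suffix)["database"]), where suffix is the value that replaces {SUFFIX} in your environment.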

That crawler role needs a few specific permissions to work correctly. It must be able to read the processed Parquet files in s3://library-data-lake-{SUFFIX}/processed/library/, which means S3 read permissions such as s3:GetObject, s3:ListBucket, and s3:GetBucketLocation on the suffixed course bucket. It also needs Glue Catalog permissions to create or update tables in data_lake_catalog_{SUFFIX}, plus CloudWatch Logs write permissions so crawler execution details can be recorded.
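The permissions described above could be expressed as an IAM policy document along these lines. This is a sketch, not the policy the course role actually uses: the exact list of Glue and CloudWatch Logs actions, the wildcard Glue resource, and the /aws-glue/ log group pattern are assumptions you should tighten for your own account. The function name crawler_policy is hypothetical.

```python
import json


def crawler_policy(suffix):
    """Build an illustrative IAM policy for the crawler role."""
    bucket = f"library-data-lake-{suffix}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Bucket-level reads: list keys and look up the bucket region.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
            {   # Object-level reads, limited to the processed prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/processed/library/*",
            },
            {   # Catalog access (assumed action set; resource left broad here).
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase",
                    "glue:GetTable",
                    "glue:CreateTable",
                    "glue:UpdateTable",
                    "glue:GetPartitions",
                    "glue:BatchCreatePartition",
                ],
                "Resource": "*",
            },
            {   # Crawler execution logs (assumed log group pattern).
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "arn:aws:logs:*:*:log-group:/aws-glue/*",
            },
        ],
    }


# "demo" stands in for your own suffix.
print(json.dumps(crawler_policy("demo"), indent=2))
```

Splitting the S3 statement in two reflects that s3:ListBucket and s3:GetBucketLocation apply to the bucket itself, while s3:GetObject applies to the objects inside it.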

If you are working in the CodeSignal environment, you do not need to install the boto3 library — it is already available for you. On your own machine, you would need to install it with pip install boto3.

Creating And Running A Glue Crawler With Boto3

Let’s walk through how to create and run a Glue Crawler using Python and the boto3 library. Below is a script that does exactly that. It first creates a crawler that points to your processed Parquet data in S3, and then starts the crawler and waits for it to finish.
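A minimal sketch of such a script follows, reconstructed from the description in this lesson rather than copied from the course materials. The crawler name library_processed_crawler, the suffix argument, and the main wrapper are assumptions for illustration. The Glue calls used are create_crawler, start_crawler, and get_crawler.

```python
import time


def create_crawler(glue, crawler_name, role_name, database_name, s3_path):
    """Create a crawler that targets s3_path; skip if the name is taken.

    Returns True if the crawler was created, False if it already existed.
    """
    try:
        glue.create_crawler(
            Name=crawler_name,
            Role=role_name,
            DatabaseName=database_name,
            Targets={"S3Targets": [{"Path": s3_path}]},
        )
        print(f"Created crawler {crawler_name}")
        return True
    except glue.exceptions.AlreadyExistsException:
        print(f"Crawler {crawler_name} already exists, moving on")
        return False


def run_crawler_and_wait(glue, crawler_name, poll_seconds=10):
    """Start the crawler, then poll until it returns to the READY state.

    Returns the status of the last crawl as reported by Glue, if present.
    """
    glue.start_crawler(Name=crawler_name)
    while True:
        # Wait before each check so the state has time to change.
        time.sleep(poll_seconds)
        crawler = glue.get_crawler(Name=crawler_name)["Crawler"]
        state = crawler["State"]
        print(f"Crawler state: {state}")
        if state == "READY":
            return crawler.get("LastCrawl", {}).get("Status")


def main(suffix):
    # Imported here so the helpers above can also be exercised with a stub client.
    import boto3

    glue = boto3.client("glue")
    database = f"data_lake_catalog_{suffix}"
    s3_path = f"s3://library-data-lake-{suffix}/processed/library/"
    crawler_name = "library_processed_crawler"  # hypothetical name

    create_crawler(glue, crawler_name, "glue_s3_library", database, s3_path)
    status = run_crawler_and_wait(glue, crawler_name)
    print(f"Last crawl status: {status}")
```

Calling main with your own suffix runs the whole flow against your AWS account. Passing the client into each function keeps the helpers easy to test without touching AWS.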

Let’s break down what this script does. First, it sets up the names and paths for your database, crawler, and S3 location. The create_crawler function connects to AWS Glue and tries to create a new crawler. If a crawler with the same name already exists, it prints a message and moves on. Otherwise, it creates the crawler and points it to your Parquet data in S3.

The run_crawler_and_wait function starts the crawler and then checks its status every 10 seconds, printing a status line on each check while the crawler is running.

Summary And Practice Preview

In this lesson, you learned how to register your processed Parquet data in the AWS Glue Catalog using a Glue Crawler. You saw how a crawler can automatically scan your S3 data, infer its schema, and create a table in the catalog. This makes your data easy to find and ready for analytics.

You are now ready to practice these steps yourself. In the next set of exercises, you will create and run a Glue Crawler to catalog your own processed data. This will prepare you for the next stage, where you will use Athena to query your data lake and unlock valuable insights.
