Welcome back! In the previous lessons, you learned how to set up a robust data lake on Amazon S3 and automate the delivery of streaming data using Kinesis and Firehose. Now that your data is reliably landing in S3, the next step is to make it easy to organize, search, and analyze. This is where AWS Glue comes in.
AWS Glue is a fully managed data catalog and ETL (Extract, Transform, Load) service. In this lesson, you will focus on the Glue Data Catalog, which acts as a central metadata store for your data lake. By cataloging your data, you make it discoverable and queryable by other AWS services, such as Athena and Redshift Spectrum. The first step in using Glue is to create a Glue database, which will serve as a logical container for your data tables and metadata. This lesson will show you how to create and list Glue databases using Python and the boto3 library.
An AWS Glue database is not a database in the traditional sense. Instead, it is a logical grouping or container for tables in the Glue Data Catalog. Think of it as a folder that helps you organize metadata about your datasets stored in S3. Each Glue database can contain multiple tables, and each table describes the schema and location of a dataset in your data lake.
By using Glue databases, you can keep your data catalog organized, especially as your data lake grows to include many different datasets and sources. For example, you might have one database for raw data, another for processed data, and a third for curated analytics datasets. This structure makes it easier to manage permissions, automate data discovery, and run analytics across your data lake.
You’ll use boto3 to interact with AWS services. If you’re on the CodeSignal platform, boto3 is already installed and AWS credentials are pre-configured for you.
For AWS Glue specifically, your IAM user/role needs permissions to manage and read from the Glue Data Catalog. At minimum for this lesson, ensure you have actions like:
glue:CreateDatabase
glue:GetDatabases
If you’re following along on CodeSignal, these permissions are already set up. If you’re working in your own AWS account, confirm your user/role includes the necessary Glue permissions.
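As a rough illustration for readers using their own AWS account, the two actions above could be granted with an identity policy along these lines, built here as a Python dictionary and printed as JSON. The variable name and the broad "Resource": "*" scope are assumptions made for a sandbox; a real account would normally restrict the resource to specific catalog and database ARNs.

```python
import json

# Hypothetical minimal policy covering only the two actions this lesson uses.
glue_lesson_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:CreateDatabase", "glue:GetDatabases"],
            # Assumption: wide-open scope for a sandbox; tighten this in real accounts.
            "Resource": "*",
        }
    ],
}

# Print the JSON document that could be pasted into the IAM policy editor.
print(json.dumps(glue_lesson_policy, indent=2))
```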
Let’s walk through how to create a Glue database using Python and the boto3 library. Below is a complete example that creates a new Glue database called data_lake_catalog with a description. If the database already exists, the code will let you know instead of failing.
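A minimal sketch of such a script is shown here, assuming default credentials and region. The helper name create_database_if_missing, the printed messages, and the description text are illustrative choices, not fixed parts of the Glue API; the calls themselves (create_database and the client's AlreadyExistsException) are standard boto3.

```python
from typing import Any


def create_database_if_missing(glue: Any, name: str, description: str) -> bool:
    """Create a Glue database; return True if created, False if it already existed."""
    try:
        glue.create_database(
            DatabaseInput={"Name": name, "Description": description}
        )
    except glue.exceptions.AlreadyExistsException:
        # The name is already taken in this catalog, so treat the call as a no-op.
        print(f"Database '{name}' already exists.")
        return False
    print(f"Created database '{name}'.")
    return True


def main() -> None:
    import boto3  # imported here so the helper can be exercised without AWS access

    glue = boto3.client("glue")  # uses your default credentials and region
    create_database_if_missing(
        glue,
        "data_lake_catalog",
        "Central catalog for the data lake",  # hypothetical description text
    )
```

Calling main() runs the sketch against your AWS account. Passing the client in as an argument keeps the helper easy to test with a stand-in object.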
When you run this code, it will attempt to create a new Glue database. If the database is created successfully, the script prints a confirmation message. If the database already exists, it prints a notice saying so instead of raising an error.
This approach ensures that your script is safe to run multiple times without causing errors. The boto3 client for Glue is used to interact with the AWS Glue service, and the create_database method is called with the desired database name and description. Handling the AlreadyExistsException allows your code to be idempotent, which is a good practice in data engineering.
After creating a Glue database, you might want to see all the databases that exist in your AWS Glue Data Catalog. This is helpful for verifying that your new database was created and for exploring what other databases are available. Here is a Python example that lists all Glue databases and prints their names, descriptions, and creation times:
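One way to write such a listing is sketched below. The function names and the fallback strings are assumptions for illustration; the response keys (DatabaseList, Name, Description, CreateTime) are the ones returned by the Glue get_databases call. This version reads a single page of results only.

```python
from typing import Any, Dict, List


def list_databases(glue: Any) -> List[Dict[str, Any]]:
    """Return the databases from one get_databases call (first page only)."""
    response = glue.get_databases()
    return response.get("DatabaseList", [])


def print_databases(databases: List[Dict[str, Any]]) -> None:
    """Print the name, description, and creation time of each database."""
    for db in databases:
        print(f"Name: {db['Name']}")
        # Description is optional, so fall back to a placeholder.
        print(f"  Description: {db.get('Description', 'No description')}")
        print(f"  Created: {db.get('CreateTime', 'unknown')}")
```

With a real client, this could be called as print_databases(list_databases(boto3.client("glue"))).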
Note (production):
get_databases() is paginated. In large catalogs, loop on NextToken or use a boto3 paginator (glue.get_paginator('get_databases')) to retrieve all databases.
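The paginator approach from the note can be sketched as a small generator. The function name iter_all_databases is a hypothetical helper; get_paginator and paginate are the standard boto3 calls, and the paginator follows NextToken for you.

```python
from typing import Any, Dict, Iterator


def iter_all_databases(glue: Any) -> Iterator[Dict[str, Any]]:
    """Yield every database in the catalog, following pagination automatically."""
    paginator = glue.get_paginator("get_databases")
    for page in paginator.paginate():
        # Each page carries its own DatabaseList; an empty page yields nothing.
        yield from page.get("DatabaseList", [])
```

With a real client, for db in iter_all_databases(boto3.client("glue")) would visit every database regardless of how many pages the catalog spans.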
When you run the listing code, you will see one entry per database in your catalog, each showing the database’s name, description, and creation time. How many entries appear depends on how many databases you have. This is a great way to confirm that your new database is present and to get familiar with the structure of your Glue Data Catalog.
In this lesson, you learned how to create and list AWS Glue databases using Python and the boto3 library. This is an important step in organizing your data lake, as Glue databases act as logical containers for your data catalog. By cataloging your data, you make it easier to discover, manage, and analyze with other AWS services.
You are now ready to put these skills into practice. In the next set of exercises, you will create your own Glue database and list all databases in your environment. This hands-on experience will prepare you for the next step: using Glue Crawlers to automatically discover and catalog the structure of your S3 data. Keep going — your data lake is becoming more powerful and organized with each lesson!
