Introduction: Querying Your Cataloged Data with Athena

Welcome back! In the last lesson, you learned how to use an AWS Glue Crawler to catalog your processed Parquet data. Now that your data is registered in the Glue Data Catalog, you are ready to unlock its value by running SQL queries for analytics and reporting. This is where Amazon Athena comes in.

Athena is a serverless query service that lets you analyze your data directly in S3 using standard SQL. Because your data is already cataloged, Athena can quickly discover the schema and let you run queries without any extra setup. In this lesson, you will learn how to use Python and the boto3 library to run Athena queries programmatically. This is a powerful skill for automating analytics and integrating data lake queries into your applications or workflows.

What You Need to Run Athena Queries

Before you can run queries with Athena, let’s quickly review what you need. This is mostly a reminder, as you have already completed these steps in previous lessons. First, your processed data should be stored in S3 and cataloged in the Glue Data Catalog. You should also have an Athena database that points to your cataloged tables. In this course, we use the database name data_lake_catalog_{SUFFIX}.

Athena needs a place to store the results of your queries. This is usually an S3 bucket and folder, such as s3://library-data-lake-{SUFFIX}/athena_query_results/. In practice, that output prefix should already exist or be created during setup, and your IAM user or role must be able to write query results there in addition to using Athena and the Glue Catalog. If you are working in the CodeSignal environment, the boto3 library is already installed for you. If you are working on your own machine, you will need to install it with pip install boto3.

Step-By-Step: Running An Athena Query From Python

Let’s walk through how to run a query in Athena using Python and boto3. Below is a script that connects to Athena, runs a SQL query, waits for the query to finish, and then prints the results. This script is designed to be simple and easy to follow.

Here’s how this script works. First, it sets up the Athena client using boto3. The execute_athena_query function takes a SQL query as input. It starts the query execution in Athena and gets a unique query ID. The script then checks the status of the query every second until it finishes. Once the query is done, it fetches the results, prints the header row first, and then prints up to ten result rows so the output stays readable.

For example, if you run the query SELECT COUNT(*) FROM library;, you might see output like:

The first row is the column header, and the second row is the result of the count. This approach keeps query output easier to read while still working for any SQL query you want to run on your cataloged data.

Reading Results And Troubleshooting

When you run a query with Athena using this script, the results are returned as rows, with each value in a list. The first row usually contains the column names, and the following rows contain your data. Printing the header separately and limiting the displayed rows can make larger result sets much easier to scan.

If you run into issues, there are a few common things to check. Make sure your IAM role or user has permission to use Athena, access the Glue Catalog, and write to the S3 output location. If your query fails, double-check the table name and database in your query. Athena is case-sensitive, so be sure to use the exact names as they appear in the Glue Catalog. If you see an error about the output location, confirm that the S3 path exists and that you have write access.

Summary And Practice Preview

In this lesson, you learned how to use Python and boto3 to run SQL queries on your cataloged data in Amazon Athena. You saw how to set up the Athena client, execute a query, wait for it to finish, and print the results. This is a key step in turning your data lake into a source of insights and analytics.

You are now ready to practice running Athena queries yourself. In the next set of exercises, you will use Python to query your own data and explore the results. This will help you build confidence in using Athena for analytics and reporting in your AWS data lake.

Sign up
Join the 1M+ learners on CodeSignal
Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal