Preparing Data for AWS

Introduction: Getting Ready for AWS Data Engineering

Welcome to your first step in mastering data engineering with AWS Glue and Athena. In this course, you will learn how to transform, process, and analyze data using some of the most powerful tools in the AWS ecosystem.

Before you can build robust ETL (Extract, Transform, Load) pipelines or run analytics, you need to prepare your data and scripts. This lesson will guide you through setting up the foundation for your data lake in Amazon S3 and uploading your sample data. By the end of this lesson, you will have your S3 data lake zones (raw, processed, curated) and script storage in place, with your sample JSON data uploaded and ready for use.

Understanding the Sample JSON Data

In data engineering, JSON (JavaScript Object Notation) is a common format for storing and exchanging raw data. JSON files are easy to read and write, and they work well with many data processing tools. In this course, you will work with a sample JSON file that contains library borrow records. Each record in the file represents a single book borrowing event, including details about the patron, the book, the librarian, and the borrowing status.

Here is a small sample of what the JSON data looks like:

Each object in the array contains fields such as borrow_id, borrow_date, patron_name, book_title, and more. This structure makes it easy to process and analyze the data later on. You will use this file as your raw data source throughout the course.
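As a rough illustration of that structure, the sketch below parses a one-record array with Python's standard json module. Only the field names borrow_id, borrow_date, patron_name, and book_title come from the lesson; every value, and the choice of an integer for borrow_id, is a made-up placeholder, and the real file contains more fields.

```python
import json

# Hypothetical record: the field names are from the lesson, the values are placeholders.
sample_json = """
[
  {
    "borrow_id": 1,
    "borrow_date": "2024-01-15",
    "patron_name": "Example Patron",
    "book_title": "Example Title"
  }
]
"""

# The file is a JSON array, so json.loads returns a list of dicts.
records = json.loads(sample_json)

for record in records:
    print(record["borrow_id"], record["borrow_date"], record["book_title"])
```

Because each record is a flat object with named fields, it maps naturally onto a table row once the data is converted to a columnar format later in the course.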

Setting Up Your S3 Data Lake Structure

A well-organized data lake is essential for efficient data processing. In AWS, Amazon S3 stores all your data files and scripts. S3 has no real directories: a "folder" is simply a shared key prefix that the console displays as a folder. It is still good practice to use a separate prefix for each stage of your data pipeline. In this course, you will use the following structure:

raw/library/ for your original JSON data
processed/library/ for data that has been transformed (for example, converted to Parquet format)
curated/library/ for business-ready, aggregated data
glue-scripts/ for your ETL scripts

You can create these folders in S3 using a Python script with the boto3 library. Here is an example script that creates the required folders:
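A minimal sketch of such a script, assuming AWS credentials are already configured and that a bucket with the placeholder name my-data-lake-bucket exists. It writes a zero-byte object for each prefix, which is one common way to make a folder appear in the S3 console. The names create_prefixes, DATA_LAKE_PREFIXES, and main are illustrative, not taken from the lesson.

```python
# Prefixes for the data lake zones and script storage described above.
DATA_LAKE_PREFIXES = [
    "raw/library/",
    "processed/library/",
    "curated/library/",
    "glue-scripts/",
]


def create_prefixes(s3_client, bucket_name, prefixes=DATA_LAKE_PREFIXES):
    """Create one zero-byte object per prefix; keys ending in '/' show up as folders."""
    created = []
    for prefix in prefixes:
        s3_client.put_object(Bucket=bucket_name, Key=prefix)
        print(f"Created s3://{bucket_name}/{prefix}")
        created.append(prefix)
    return created


def main():
    # Imported here so create_prefixes can be tried with a stand-in client
    # on a machine without boto3 or AWS access.
    import boto3

    # Hypothetical bucket name; replace it with your own.
    create_prefixes(boto3.client("s3"), "my-data-lake-bucket")
```

Calling main() prints one "Created s3://..." line per prefix. Passing the client in as an argument keeps the helper easy to test, since any object with a put_object method will do.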

When you run this script, you will see output like:

This confirms that your S3 bucket now has the correct folder structure to support your data pipeline.

Uploading the Sample JSON Data to S3

With your S3 folders in place, the next step is to upload your sample JSON data file. This step matters because every ETL pipeline and query in this course starts from this raw data in S3.

To upload your sample JSON file, you can use the following Python script:
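A minimal sketch of an upload script, assuming AWS credentials are configured. The local file name library_borrows.json and the bucket name my-data-lake-bucket are placeholders, and the function names are illustrative. The object key is built from the raw/library/ prefix plus the file's base name.

```python
import os

RAW_PREFIX = "raw/library/"


def upload_raw_file(s3_client, local_path, bucket_name, prefix=RAW_PREFIX):
    """Upload a local file under the raw-zone prefix and return its S3 URI."""
    key = prefix + os.path.basename(local_path)
    s3_client.upload_file(Filename=local_path, Bucket=bucket_name, Key=key)
    uri = f"s3://{bucket_name}/{key}"
    print(f"Uploaded {local_path} to {uri}")
    return uri


def main():
    # Imported here so upload_raw_file can be tried with a stand-in client
    # on a machine without boto3 or AWS access.
    import boto3

    # Hypothetical file and bucket names; replace them with your own.
    upload_raw_file(boto3.client("s3"), "library_borrows.json", "my-data-lake-bucket")
```

Calling main() uploads the file and prints the destination URI. If the local file does not exist, upload_file raises an error instead of creating an empty object.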

After running this script, you should see:

This confirms that your sample JSON data is now available in the raw zone of your S3 data lake.

These scripts use the boto3 library to interact with Amazon S3. On CodeSignal, boto3 is pre-installed; on your own machine, you may need to install it first by running pip install boto3.
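If you are unsure whether boto3 is available in your environment, a quick check along these lines can help. This is a small sketch using only the standard library; the function name boto3_available is illustrative.

```python
import importlib.util


def boto3_available():
    """Return True if boto3 can be imported in the current environment."""
    return importlib.util.find_spec("boto3") is not None


if not boto3_available():
    print("boto3 not found; install it with: pip install boto3")
```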

Summary and What’s Next

In this lesson, you learned how to prepare your AWS data lake for ETL processing by focusing on the S3 setup. You explored the structure of the sample JSON data, created a clear folder structure in S3, and uploaded your raw data to the correct location. These steps are essential for building a reliable and organized data pipeline.

Next, you will practice these steps yourself in the upcoming exercises. You’ll get hands-on experience creating S3 folders and uploading the sample JSON data, making sure you are comfortable with the process before moving on to building and running your first AWS Glue ETL job.

