Structuring Data Lakes

Introduction: Data Lakes and S3 Structure

Welcome to the first lesson of the course, Designing & Ingesting Data into AWS Data Lakes. In this course, you will learn how to build a robust and scalable data lake on AWS, starting from the very basics. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as is, without having to first structure it, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning.

Amazon S3 (Simple Storage Service) is one of the most popular choices for building data lakes because it is highly durable, scalable, and cost-effective. S3 allows you to organize your data in a way that makes it easy to manage and analyze. In this lesson, you will learn how to design a well-organized data lake structure on S3, create the necessary folders (also called prefixes), and upload sample data in a way that supports efficient analytics later on. By the end of this lesson, you will have a strong foundation for all the data engineering tasks that follow in this course.

What is a Data Lake?

A data lake is a centralized storage repository that allows you to store vast amounts of data in its original, raw format. Unlike traditional databases or data warehouses, which require you to define a schema before storing data, a data lake lets you ingest structured, semi-structured, and unstructured data without upfront transformation. This flexibility makes data lakes ideal for handling diverse data sources such as logs, images, sensor data, and relational data.

Key characteristics of a data lake include:

Scalability: Easily store petabytes of data as your needs grow.
Flexibility: Store any type of data—structured (tables), semi-structured (JSON, CSV), or unstructured (images, videos, text).
Cost-effectiveness: Pay only for the storage you use, making it affordable for large-scale data storage.
Analytics-ready: Data lakes support a wide range of analytics, including big data processing, machine learning, and real-time analytics.

By storing all your data in a single location, a data lake enables you to break down data silos and make data available for various analytics and business intelligence use cases. This approach is especially powerful in cloud environments like AWS, where services such as Amazon S3 provide the durability, scalability, and integration needed for modern data lake architectures.

Designing a Multi-Zone Data Lake Layout

A well-structured data lake is organized into different zones, each serving a specific purpose in the data lifecycle. The most common zones are raw, processed, and curated. The raw zone is where you store data exactly as it arrives, without any transformation. The processed zone contains data that has been cleaned or transformed in some way. The curated zone is for data that is ready for analytics or reporting, often aggregated or enriched.

A typical folder structure for these zones in S3 might look like this:
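
As a minimal sketch, the three zones can be written out as S3 key prefixes. The zone names come from this lesson; the dataset name `events` and the example date are hypothetical and chosen only for illustration.

```python
# Zone names come from the lesson. The dataset name "events" and the
# example date are assumptions made only for this illustration.
ZONES = ["raw", "processed", "curated"]


def example_layout(dataset="events", year=2024, month=6, day=10):
    """Return one example prefix per zone, with date-based subfolders."""
    date_path = f"year={year:04d}/month={month:02d}/day={day:02d}"
    return [f"{zone}/{dataset}/{date_path}/" for zone in ZONES]


# Print the layout, one prefix per line.
for prefix in example_layout():
    print(prefix)
```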

Organizing your data lake in this way helps you manage data as it moves through different stages of processing. In Amazon S3, these "folders" are actually called prefixes—they are part of the object key that defines the path to your data. S3 does not have a true hierarchical file system, but by using prefixes, you can logically separate and organize your data as if you were using folders. This structure also makes it easier to control access (for example, by setting permissions at the prefix level), manage costs, and optimize query performance by allowing analytics tools to scan only the relevant prefixes. Using clear and consistent naming conventions for your prefixes and files is important, especially as your data lake grows. Proper use of prefixes also helps with lifecycle management, data retention policies, and efficient data retrieval.
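
To see how S3 derives "folders" from a flat set of keys, one option is to list the bucket with a `/` delimiter. The sketch below assumes `s3` is a boto3 S3 client created elsewhere; the bucket name in the usage comment is a placeholder.

```python
def list_top_level_prefixes(s3, bucket_name):
    """Return the top-level "folders" that S3 derives from object keys.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    # With Delimiter="/", S3 groups keys by the text up to the first "/"
    # and reports each group under CommonPrefixes.
    response = s3.list_objects_v2(Bucket=bucket_name, Delimiter="/")
    return [entry["Prefix"] for entry in response.get("CommonPrefixes", [])]


# Example usage (requires AWS credentials; bucket name is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   print(list_top_level_prefixes(s3, "my-data-lake-bucket"))
```

No directory is consulted here: the grouping is computed from the key strings at request time.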

Example: Creating an S3 Bucket

Before you can start building your data lake, you need a place to store your data. In AWS, this is done using an S3 bucket. An S3 bucket is a top-level container for storing objects (files and data) in Amazon S3. Each bucket name must be globally unique across all AWS accounts, and it must follow S3 naming rules (for example, 3 to 63 characters; lowercase letters, numbers, and hyphens; no spaces or uppercase letters).

When designing a data lake, it's a good practice to choose a descriptive and unique bucket name. In this course, the environment setup provides a shared SUFFIX automatically, so the examples can build unique resource names without you having to manage those details manually.

Here's how you can create an S3 bucket using boto3:
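
The following is a minimal sketch rather than the course's exact listing. It assumes `s3` is a boto3 S3 client; the base name `data-lake-` and reading SUFFIX from an environment variable are hypothetical choices for illustration.

```python
def build_bucket_name(suffix, base="data-lake"):
    """Combine a base name with the shared suffix.

    The base name "data-lake" is an assumption, not a course requirement.
    """
    return f"{base}-{suffix}".lower()


def create_bucket(s3, bucket_name, region="us-east-1"):
    """Create the bucket, or report that this account already owns it."""
    try:
        if region == "us-east-1":
            # us-east-1 is the default region and rejects an explicit
            # LocationConstraint, so none is sent.
            s3.create_bucket(Bucket=bucket_name)
        else:
            s3.create_bucket(
                Bucket=bucket_name,
                CreateBucketConfiguration={"LocationConstraint": region},
            )
        return "created"
    except s3.exceptions.BucketAlreadyOwnedByYou:
        # Raised outside us-east-1 on a repeated run. In us-east-1, S3
        # returns success for a bucket you already own.
        return "already exists"


# Example usage (requires AWS credentials; SUFFIX lookup is an assumption):
#   import os, boto3
#   s3 = boto3.client("s3", region_name="us-east-1")
#   name = build_bucket_name(os.environ.get("SUFFIX", "demo"))
#   print(f"Bucket {name}: {create_bucket(s3, name)}")
```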

When you run this code, the output should confirm that the bucket was created. On a repeated run, it should instead report that the bucket already exists.

Example: Setting Up the Data Lake Folder Structure (Prefixes)

Once your bucket is created, you need to organize it using a folder structure that matches your data lake zones (raw, processed, curated). In Amazon S3, these "folders" are actually called prefixes. Prefixes are simply the path part of the object key, and they help you logically organize your data.

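To create a folder structure in S3, you can upload empty (zero-byte) objects with keys ending in a / to represent folders. This is a common practice for organizing data in S3, even though S3 itself is a flat object store. These placeholder objects are optional: a prefix exists as soon as any object key contains it, but creating them up front makes the intended layout visible in the console.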

Here's how you can set up the folder (prefix) structure using boto3:
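
A minimal sketch of this step, assuming `s3` is a boto3 S3 client and the bucket already exists. The printed message format is illustrative, not the course's exact output.

```python
ZONES = ["raw", "processed", "curated"]


def create_zone_prefixes(s3, bucket_name, zones=ZONES):
    """Create one zero-byte placeholder object per zone and return the keys."""
    created = []
    for zone in zones:
        key = f"{zone}/"  # the trailing slash marks the object as a "folder"
        s3.put_object(Bucket=bucket_name, Key=key, Body=b"")
        print(f"Created prefix: {key}")
        created.append(key)
    return created


# Example usage (requires AWS credentials; bucket name is a placeholder):
#   import boto3
#   create_zone_prefixes(boto3.client("s3"), "my-data-lake-bucket")
```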

The output will confirm the creation of each folder (prefix).

This sets up the basic structure of your data lake, making it easy to organize and manage your data as it flows through different stages. By using clear and consistent prefixes, you ensure your data is easy to find, secure, and ready for efficient analytics.

Security note (production)

In a real production data lake, you should treat your S3 bucket as sensitive infrastructure:

Block all public access (S3 Block Public Access should be enabled).
Enable encryption at rest (SSE-S3 or, more commonly in enterprises, SSE-KMS).
Use least-privilege access controls (IAM and bucket policies), ideally restricting access at the prefix level (e.g., raw/, processed/, curated/).
Consider enforcing TLS-only requests in the bucket policy.

The examples in this course may use default settings to keep the exercises simple in a controlled training environment, but you should not assume those defaults are appropriate for production.
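
As a rough sketch of how the first, second, and fourth recommendations above could be applied with boto3, the function below blocks public access, sets SSE-S3 as the default encryption, and attaches a TLS-only bucket policy. It assumes `s3` is a boto3 S3 client; the statement ID `DenyInsecureTransport` is an arbitrary label. Note that `put_bucket_policy` replaces any existing bucket policy, so a real deployment would merge statements instead.

```python
import json


def tls_only_policy(bucket_name):
    """Build a bucket policy that denies any request not sent over TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


def harden_bucket(s3, bucket_name):
    """Apply public-access blocking, default encryption, and a TLS-only policy."""
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    s3.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )
    # This call overwrites any policy already attached to the bucket.
    s3.put_bucket_policy(
        Bucket=bucket_name, Policy=json.dumps(tls_only_policy(bucket_name))
    )
```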

Example: Uploading Partitioned Sample Data

Now that you have your bucket and folder structure, let's upload some sample data. In real-world scenarios, partitioning your data by time (such as year, month, and day) is a best practice. This makes it much easier and faster to run queries on specific time periods later.

A common convention for partitioning data in data lakes is Hive-style partitioning. In Hive-style partitioning, each partition is represented as a folder with the format column_name=value, such as year=2024/month=06/day=10/. This approach is widely supported by analytics tools like Amazon Athena, Amazon Redshift Spectrum, and Apache Spark, which can recognize partition columns and values from the folder structure. Using Hive-style partitioning improves query performance by allowing these tools to scan only the relevant partitions, reducing both query time and cost.
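
To make the convention concrete, here is a small sketch of how partition columns can be read back out of an object key. It only illustrates the idea and is not how any particular analytics engine implements it; the example key is hypothetical.

```python
def parse_partitions(key):
    """Extract Hive-style column=value pairs from an S3 object key."""
    partitions = {}
    for segment in key.split("/"):
        if "=" in segment:
            column, value = segment.split("=", 1)
            partitions[column] = value
    return partitions


print(parse_partitions("raw/events/year=2024/month=06/day=10/events.json"))
# prints {'year': '2024', 'month': '06', 'day': '10'}
```

A query filtered on year, month, and day only needs to read keys whose segments match, which is why the layout reduces the amount of data scanned.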

First, you can generate some sample user event data. Here's a function that creates random events for the last 7 days:
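
One possible version of such a generator is sketched below. The field names, event types, and user ID format are assumptions made for illustration; the optional `now` and `seed` parameters exist only to make runs reproducible.

```python
import random
from datetime import datetime, timedelta, timezone

# Hypothetical event types, chosen only for this example.
EVENT_TYPES = ["login", "click", "purchase", "logout"]


def generate_sample_events(num_events=100, days=7, now=None, seed=None):
    """Return random user events with timestamps in the last `days` days."""
    rng = random.Random(seed)
    now = now or datetime.now(timezone.utc)
    window_seconds = days * 24 * 3600
    events = []
    for i in range(num_events):
        # Pick a random moment inside the window that ends at `now`.
        offset = timedelta(seconds=rng.randint(0, window_seconds - 1))
        events.append(
            {
                "event_id": f"evt_{i:05d}",
                "user_id": f"user_{rng.randint(1, 50):03d}",
                "event_type": rng.choice(EVENT_TYPES),
                "timestamp": (now - offset).isoformat(),
            }
        )
    return events
```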

Next, you can upload this data to S3, partitioned by year, month, and day using Hive-style partitioning. This is done by grouping events by date and saving each group to a separate file in the appropriate folder:
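
A minimal sketch of the grouping and upload step follows. It assumes `s3` is a boto3 S3 client and that each event is a dict with an ISO 8601 `timestamp` field. The dataset name `events`, the file name `events.json`, and the newline-delimited JSON format are illustrative choices, not requirements.

```python
import json
from collections import defaultdict
from datetime import datetime


def partition_key(timestamp, zone="raw", dataset="events"):
    """Build a Hive-style object key from an ISO 8601 timestamp string."""
    ts = datetime.fromisoformat(timestamp)
    return (
        f"{zone}/{dataset}/"
        f"year={ts.year:04d}/month={ts.month:02d}/day={ts.day:02d}/events.json"
    )


def upload_partitioned(s3, bucket_name, events):
    """Group events by date and upload one file per day. Returns the keys."""
    groups = defaultdict(list)
    for event in events:
        groups[partition_key(event["timestamp"])].append(event)

    for key, group in sorted(groups.items()):
        # One JSON document per line (newline-delimited JSON).
        body = "\n".join(json.dumps(event) for event in group)
        s3.put_object(Bucket=bucket_name, Key=key, Body=body.encode("utf-8"))
        print(f"Uploaded {len(group)} events to {key}")
    return sorted(groups)
```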

Verifying Your S3 Data Lake Structure

After creating your folder structure and uploading data, it's important to verify what actually exists in your S3 bucket. S3 does not have real folders—everything is stored as objects with keys (paths). To help you see exactly what was created, you can list the objects in your bucket and inspect their keys and sizes.

Here's a helper function to list all objects in your bucket:
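
A sketch of such a helper is shown below. The function name matches the one used in this lesson, but its parameters are an assumption: here it takes a boto3 S3 client, a bucket name, and an optional prefix filter.

```python
def verify_uploaded_files(s3, bucket_name, prefix=""):
    """Print and return (key, size) for every object under `prefix`."""
    # The paginator follows continuation tokens, so buckets with more
    # than 1,000 objects are listed completely.
    paginator = s3.get_paginator("list_objects_v2")
    objects = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get("Contents", []):
            objects.append((obj["Key"], obj["Size"]))
            print(f"{obj['Key']}  ({obj['Size']} bytes)")
    print(f"Total objects: {len(objects)}")
    return objects


# Example usage (requires AWS credentials; bucket name is a placeholder):
#   import boto3
#   verify_uploaded_files(boto3.client("s3"), "my-data-lake-bucket", prefix="raw/")
```

Zero-byte placeholder objects such as `raw/` appear in this listing with a size of 0, which makes them easy to tell apart from data files.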

Call verify_uploaded_files() after each upload or folder-creation step to see:

The actual S3 keys (paths) that were generated
Which "folders" (prefixes) and files exist
The size of each object

Summary and What's Next

In this lesson, you learned how to design and set up a multi-zone data lake structure on Amazon S3. You created a new S3 bucket, organized it into raw, processed, and curated zones, and uploaded sample data partitioned by date. This structure is the foundation for a scalable and efficient data lake, making it easier to manage data as it moves from raw ingestion to analytics-ready formats.

You are now ready to get hands-on practice with these concepts. In the next exercises, you will have the chance to create your own data lake structure, upload data, and explore how partitioning works in practice. Take a moment to review the code and folder structure from this lesson, as it will help you in the upcoming practice tasks and future lessons in this course.

Next Lesson: Streaming Ingestion with Kinesis
