Streaming Ingestion with Kinesis

Introduction: Moving from Batch to Real-Time Ingestion

Welcome back! In the previous lesson, you built a strong foundation for your AWS data lake by designing a well-organized structure on Amazon S3. You learned how to create a multi-zone layout with raw, processed, and curated data, and you uploaded sample data using time-based partitioning. This setup is perfect for batch analytics, where data is collected and processed in chunks.

However, many modern applications need to process data as soon as it arrives. For example, you might want to analyze user activity on your website in real time, detect fraud as transactions happen, or monitor sensor data from IoT devices without delay. This is where real-time data ingestion becomes important.

To enable real-time data flows into your data lake, AWS offers a service called Amazon Kinesis Data Streams. Kinesis allows you to collect, process, and move data continuously, so you can react to new information instantly. In this lesson, you will learn how to create a Kinesis Data Stream using Python. This is the first step in bridging the gap between real-time data sources and your S3-based data lake.

Kinesis Data Streams: Key Concepts

Before you start working with Kinesis, let’s clarify a few important concepts. A Kinesis Data Stream is a managed service that lets you collect and transport data records in real time. Think of it as a pipeline that carries small pieces of data — called records — from producers (like your application or website) to consumers (such as analytics tools or storage systems).

Each stream is made up of one or more shards. A shard is like a lane on a highway: it determines how much data can flow through the stream at once. If you need to handle more data, you can add more shards, just like adding more lanes to a road allows more cars to travel at the same time. For most beginner projects, starting with a single shard is enough.

When you create a Kinesis stream, it goes through different status phases. At first, the stream is in the CREATING state while AWS sets up the necessary resources. Once it is ready, the status changes to ACTIVE, and you can start sending and receiving data. It is important to wait for the stream to become ACTIVE before using it; otherwise, your application may run into errors.

Here is a quick summary of the key terms:

Stream: The main resource that collects and transports your data records.
Record: A single piece of data, such as a JSON object or a string.
Shard: A unit of capacity within the stream. More shards mean more data can be handled at once.
Status: The current state of the stream, such as CREATING or ACTIVE.

Understanding these concepts will help you as you start working with and connect it to your data lake.

Shard limits and cost

Because each shard represents both capacity and cost, it’s important to keep shard count in mind as you design your stream. For small projects and course exercises, 1 shard is usually enough, but production workloads often scale by adding shards as throughput grows.

Shard count affects both throughput and cost, so it’s worth reviewing AWS quotas and pricing as you plan your stream:

Kinesis Data Streams quotas (limits): https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
Kinesis Data Streams pricing: https://aws.amazon.com/kinesis/data-streams/pricing/

Creating a Kinesis Data Stream with Python

Let’s walk through how to create a Kinesis Data Stream using Python and the boto3 library. This example will show you how to set up a stream called user-events-stream with one shard, wait for it to become active, and handle common errors, such as trying to create a stream that already exists.

On CodeSignal, the boto3 library is already installed, so you do not need to worry about installation here. If you are working on your own device, you can install it using pip install boto3.

Here is the complete code example:

Understanding the Partition Key in Kinesis

When you send a record to a Kinesis Data Stream, you must specify a partition key. The partition key is a string that Kinesis uses to determine which shard in the stream will store the record. This is a crucial concept for both performance and data organization.

How the Partition Key Works:
Kinesis uses the partition key to compute a hash value, which maps the record to a specific shard in the stream. All records with the same partition key will always go to the same shard.
Why the Partition Key Matters:
- Load Distribution: If you use the same partition key for all records, all your data will go to a single shard, which can quickly become a bottleneck. To take advantage of multiple shards (and thus higher throughput), you should use a partition key that distributes records evenly across shards.
- Ordering Guarantees: Kinesis guarantees that records with the same partition key are stored in order within a shard. This is important if you need to process events for a particular user or device in the exact order they occurred.
- Scalability: Choosing a good partition key helps you scale your stream. For example, using a user ID, device ID, or order ID as the partition key can help distribute records more evenly, especially if you have many unique users or devices.
Example: Choosing a Partition Key:
In the order event example below, we use the user_id as the partition key. If you have many users placing orders, this will distribute the records across shards based on the user ID. If you used a constant value (like "orders"), all records would go to the same shard, which is not ideal for high-throughput scenarios.
Best Practices:
- High Cardinality: Choose a partition key with high cardinality (many unique values) to maximize parallelism.
- Consistent Usage: Use the same logic for partition keys across your producers to ensure predictable data distribution.
- Ordering Needs: If you need to process all events for a specific entity in order, use that entity’s ID as the partition key.

Sending Data Records to the Kinesis Stream

Now that you have created a Kinesis Data Stream, let’s see how to send data into it. In Kinesis terminology, this is called putting records into the stream. Each record can represent an event, such as a user action or a sensor reading.

Here’s a simple example that demonstrates how to send a single order event (as a JSON object) to the user-events-stream using Python and boto3:

What this code does:

Defines a function put_order_event that takes order details and sends them as a record to the Kinesis stream.
Serializes the order event as JSON and encodes it as bytes (required by Kinesis).
Uses the put_record method to send the data, specifying a PartitionKey (here, the user_id), which determines how records are distributed across shards.

How Kinesis Bridges to Your Data Lake

Now that you know how to create a Kinesis Data Stream and send data to it, let’s talk about how it fits into your overall data lake architecture. In the previous lesson, you set up an S3-based data lake to store and organize your data. Kinesis acts as a bridge between real-time data sources and your data lake.

Imagine you have a website that generates user events, such as clicks or purchases, every second. Instead of waiting to collect these events in batches, you can send each event to your Kinesis stream as soon as it happens. From there, you can build applications or use AWS services to read the data from the stream and write it directly into your S3 data lake. This allows you to analyze fresh data almost instantly, enabling real-time dashboards, alerts, or machine learning models.

For example, you might have a setup where user events are sent to Kinesis, and a consumer application reads from the stream and saves the events into the raw/user-events/ folder in your S3 data lake. This way, your data lake always has the latest information, and you can combine both batch and real-time data for analytics.

Summary and What’s Next

In this lesson, you learned how to move from batch data ingestion to real-time streaming using Amazon Kinesis Data Streams. You reviewed the key concepts behind Kinesis, saw a complete example of how to create a stream with Python, learned how to send order events into the stream, explored the importance of the partition key, and understood how Kinesis acts as a bridge to your S3 data lake. This prepares you to handle real-time data flows and sets the stage for more advanced streaming and processing tasks.

You are now ready to get hands-on practice with these concepts. In the next exercises, you will create your own Kinesis stream and explore how to send and receive real-time data. Take a moment to review the code and ideas from this lesson, as they will help you succeed in the upcoming practice tasks and future lessons in this course.

Previous Lesson

Next Lesson: Delivering Streaming Data to S3

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal