Now that you have learned how to stream real-time data into AWS using Kinesis Data Streams, you might be wondering how to get that data into your S3 data lake in a way that is both reliable and efficient. In the previous lesson, you set up a Kinesis stream and practiced sending data records to it. This is a great start, but in a real-world data lake, you want your streaming data to land in S3 automatically, without having to write and maintain custom consumer code.
This is where Kinesis Data Firehose comes in. Firehose is a fully managed service that can take data from your Kinesis Data Stream and deliver it directly to S3, handling all the heavy lifting for you. It can buffer, compress, and organize your data as it moves from the stream to your data lake. In this lesson, you will learn how to set up a Firehose delivery stream so that your real-time data flows smoothly from Kinesis into S3, ready for analytics and further processing.
Before we dive into the code, let's clarify what a Firehose delivery stream is and how it fits into your data pipeline. This architecture follows the classic Producer/Consumer pattern: your application or service acts as the producer, sending real-time data into the Kinesis Data Stream, while Firehose acts as a managed consumer that reads from the stream and delivers data to your S3 bucket. You could also implement your own custom consumer if you need more control or want to process the data differently before storing it. In this lesson, we'll focus on using Firehose as the consumer to automate and streamline the delivery process.
While Kinesis Data Streams is designed for collecting and transporting real-time data, Kinesis Data Firehose is built for delivering that data to storage destinations like S3, Redshift, or OpenSearch Service (formerly Elasticsearch Service). In this lesson, we will focus on S3 as the destination.
A key detail to remember is that Kinesis Data Streams retains data for 24 hours by default. You can increase this retention period up to 7 days with extended retention, and up to 365 days with long-term retention. This means that if your consumer (such as Firehose) falls behind or is temporarily unavailable, it can still read data from the stream as long as the data is within the retention window. However, for most real-time data lake scenarios, you want to automate delivery to S3 as quickly as possible to avoid any risk of data loss due to retention limits.
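To make the retention window concrete, here is a minimal sketch that reads a stream's current retention and raises it if it is below a target. The helper name `ensure_retention` is hypothetical, and the `kinesis` argument is assumed to be a `boto3.client("kinesis")` with permission to describe and modify the stream.

```python
def ensure_retention(kinesis, stream_name, min_hours):
    """Raise the stream's retention to at least min_hours and return the result.

    `kinesis` is assumed to be a boto3 Kinesis client. Retention is expressed
    in hours: 24 is the default and 8760 (365 days) is the maximum.
    """
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    current = summary["StreamDescriptionSummary"]["RetentionPeriodHours"]

    # Nothing to do if the stream already keeps data long enough.
    if current >= min_hours:
        return current

    kinesis.increase_stream_retention_period(
        StreamName=stream_name,
        RetentionPeriodHours=min_hours,
    )
    return min_hours


# Example usage (requires AWS credentials):
#   import boto3
#   ensure_retention(boto3.client("kinesis"), "user-events-stream", 48)
```

Retention beyond the default 24 hours is billed separately, so a longer window is a safety margin rather than a substitute for timely delivery.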
A Firehose delivery stream acts as a bridge between your Kinesis stream and your S3 bucket. You configure it to read from your existing Kinesis Data Stream, and it will automatically write the incoming records to your S3 bucket. To do this, Firehose needs permission to access both the Kinesis stream and the S3 bucket. This is handled by an IAM role that you specify when creating the delivery stream.
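As an illustration of what such a role might contain, the sketch below builds a trust policy and a permissions policy as plain Python dictionaries. The function names are hypothetical, and the action lists are an assumption based on the permissions Firehose typically needs to read from a stream and write to a bucket; your organization's role may be scoped differently.

```python
import json


def firehose_trust_policy():
    """Trust policy that lets the Firehose service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "firehose.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def firehose_permissions_policy(stream_arn, bucket_arn):
    """Permissions to read one Kinesis stream and write to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "kinesis:DescribeStream",
                    "kinesis:GetShardIterator",
                    "kinesis:GetRecords",
                    "kinesis:ListShards",
                ],
                "Resource": stream_arn,
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:AbortMultipartUpload",
                    "s3:GetBucketLocation",
                    "s3:GetObject",
                    "s3:ListBucket",
                    "s3:ListBucketMultipartUploads",
                    "s3:PutObject",
                ],
                # Bucket-level and object-level actions need both forms of the ARN.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            },
        ],
    }


# The region and account ID below are placeholders, not values from the lesson.
policy = firehose_permissions_policy(
    "arn:aws:kinesis:us-east-1:123456789012:stream/user-events-stream",
    "arn:aws:s3:::my-data-lake-bucket",
)
print(json.dumps(policy, indent=2))
```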
Firehose also gives you control over how data is delivered. You can set buffering options, which determine how much data Firehose collects before writing a batch to S3. For example, you might buffer up to 5 MB or wait up to 60 seconds before writing a file, whichever threshold is reached first. Smaller buffers reduce latency but produce more, smaller files; larger buffers produce fewer, bigger files at the cost of a longer wait. You can also enable compression (such as GZIP) to save storage space and use prefixes to organize your data in S3 by date or event type, making it easier to query later.
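The interplay of the two buffering thresholds can be worked through with a little arithmetic. The function below is a simplified model, not Firehose's actual scheduler: it assumes a steady incoming rate, treats 1 MB as 1024 * 1024 bytes, and the name `estimate_flush` is hypothetical.

```python
def estimate_flush(bytes_per_second, size_mb=5, interval_seconds=60):
    """Estimate when a buffer flushes and which threshold triggers it.

    Returns (seconds_until_flush, trigger) where trigger is "size" or
    "interval". Assumes a constant incoming rate.
    """
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")

    buffer_bytes = size_mb * 1024 * 1024
    seconds_to_fill = buffer_bytes / bytes_per_second

    # Whichever threshold is reached first decides when the batch is written.
    if seconds_to_fill < interval_seconds:
        return seconds_to_fill, "size"
    return interval_seconds, "interval"


# At 50 KB/s a 5 MB buffer would take 102.4 s to fill, so the 60 s timer wins.
print(estimate_flush(50 * 1024))    # (60, 'interval')
# At 1 MB/s the buffer fills in 5 s, well before the timer expires.
print(estimate_flush(1024 * 1024))  # (5.0, 'size')
```

For a low-volume stream the interval is usually what determines latency, which is why shortening it has more effect than shrinking the buffer size.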
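In summary, a Firehose delivery stream:
- Reads data from your Kinesis Data Stream.
- Buffers incoming records by size or time before writing them as a batch.
- Optionally compresses the data (for example, with GZIP).
- Writes the resulting files to your S3 bucket under the prefixes you configure.
- Uses an IAM role to access both the stream and the bucket.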
Let's walk through how to create a Firehose delivery stream using Python and the boto3 library. In this lesson, you will work with already created resources:
- Kinesis Data Stream: user-events-stream
- S3 Bucket: my-data-lake-bucket
- IAM Role: FirehoseDeliveryRole (already created with the necessary permissions)
We'll use boto3 to look up the ARNs for the Kinesis stream and S3 bucket, so you don't have to hardcode them. The IAM role ARN is provided to you in the CodeSignal IDE for this course. However, in a real-world scenario, you would need to obtain this ARN yourself.
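A minimal sketch of this setup is shown here. Several details are assumptions rather than values from the lesson: the delivery stream name `user-events-to-s3`, the exact prefix expressions, the 5 MB / 60 second buffering values, and building the role ARN from the account ID in the standard `aws` partition. The helpers take boto3 clients as arguments so they can be exercised with stub clients.

```python
STREAM_NAME = "user-events-stream"
BUCKET_NAME = "my-data-lake-bucket"
ROLE_NAME = "FirehoseDeliveryRole"
DELIVERY_STREAM_NAME = "user-events-to-s3"  # hypothetical name, not from the lesson


def get_stream_arn(kinesis, stream_name):
    """Look up the ARN of an existing Kinesis Data Stream."""
    summary = kinesis.describe_stream_summary(StreamName=stream_name)
    return summary["StreamDescriptionSummary"]["StreamARN"]


def get_bucket_arn(s3, bucket_name):
    """Confirm the bucket is reachable, then build its ARN."""
    s3.head_bucket(Bucket=bucket_name)  # raises ClientError if missing or forbidden
    return f"arn:aws:s3:::{bucket_name}"


def build_s3_destination(bucket_arn, role_arn):
    """Where and how Firehose writes to S3: prefixes, buffering, compression."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Assumed layout: raw/user-events/YYYY/MM/DD/ based on arrival time (UTC).
        "Prefix": "raw/user-events/!{timestamp:yyyy/MM/dd}/",
        # Failed records land under errors/, grouped by failure type.
        "ErrorOutputPrefix": "errors/!{firehose:error-output-type}/",
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    }


def create_delivery_stream(firehose, name, stream_arn, bucket_arn, role_arn):
    """Create the delivery stream; return its ARN, or None if it already exists."""
    try:
        response = firehose.create_delivery_stream(
            DeliveryStreamName=name,
            DeliveryStreamType="KinesisStreamAsSource",
            KinesisStreamSourceConfiguration={
                "KinesisStreamARN": stream_arn,
                "RoleARN": role_arn,
            },
            S3DestinationConfiguration=build_s3_destination(bucket_arn, role_arn),
        )
        return response["DeliveryStreamARN"]
    except firehose.exceptions.ResourceInUseException:
        # A delivery stream with this name already exists.
        return None


def main():
    import boto3  # imported here so the helpers above can run against stub clients

    # Assumption: the role lives in the caller's account, in the "aws" partition.
    account_id = boto3.client("sts").get_caller_identity()["Account"]
    role_arn = f"arn:aws:iam::{account_id}:role/{ROLE_NAME}"

    stream_arn = get_stream_arn(boto3.client("kinesis"), STREAM_NAME)
    bucket_arn = get_bucket_arn(boto3.client("s3"), BUCKET_NAME)

    arn = create_delivery_stream(
        boto3.client("firehose"), DELIVERY_STREAM_NAME, stream_arn, bucket_arn, role_arn
    )
    print(arn or f"Delivery stream {DELIVERY_STREAM_NAME} already exists")


# Call main() in an environment with AWS credentials to run the full flow.
```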
The S3DestinationConfiguration controls where and how Firehose writes to S3 (prefixes, buffering, compression). In production, this is also where you would ensure your S3 destination is secured, for example by using an encrypted bucket and restricting reads and writes through IAM and bucket policies. Even if you don't explicitly set encryption options in code, you should still enable bucket-level encryption and Block Public Access on the destination bucket.
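The bucket-level safeguards mentioned above could be applied along these lines. This is a sketch under assumptions: the helper name `secure_bucket` is hypothetical, SSE-S3 (`AES256`) is chosen as the default encryption for simplicity, and the `s3` argument is assumed to be a `boto3.client("s3")` with permission to change bucket settings.

```python
def secure_bucket(s3, bucket_name):
    """Enable default encryption and block all public access on a bucket."""
    # Encrypt new objects at rest by default with S3-managed keys (SSE-S3).
    s3.put_bucket_encryption(
        Bucket=bucket_name,
        ServerSideEncryptionConfiguration={
            "Rules": [
                {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
            ]
        },
    )

    # Turn on all four Block Public Access settings.
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )


# Example usage (requires AWS credentials):
#   import boto3
#   secure_bucket(boto3.client("s3"), "my-data-lake-bucket")
```

If your organization uses KMS keys instead, the encryption rule would name `aws:kms` and a key ID, and the Firehose role would also need permission to use that key.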
About the Firehose IAM Role ARN:
- In the CodeSignal IDE: The FIREHOSE_ROLE_ARN is provided for you, so you can use it directly in your code.
- In a real AWS environment: You'll need to obtain the ARN of an IAM role that allows Firehose to read from your Kinesis stream and write to your S3 bucket. This role must already exist and have the correct permissions.
  - If you're working in a team or enterprise environment, your AWS administrator will typically provide this ARN.
  - If you have access to the AWS console, you can find it by navigating to the IAM service, searching for the role named FirehoseDeliveryRole (or whatever name your organization uses), and copying its ARN from the role summary page.
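If you prefer to look the ARN up programmatically, a short sketch follows. The helper name `get_role_arn` is hypothetical, and it assumes your credentials are allowed to call `iam:GetRole` on the role.

```python
def get_role_arn(iam, role_name="FirehoseDeliveryRole"):
    """Return the ARN of an existing IAM role.

    `iam` is assumed to be a boto3 IAM client. The call raises an error
    if the role does not exist or the caller lacks iam:GetRole.
    """
    return iam.get_role(RoleName=role_name)["Role"]["Arn"]


# Example usage (requires AWS credentials):
#   import boto3
#   FIREHOSE_ROLE_ARN = get_role_arn(boto3.client("iam"))
```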
The code above is structured to:
- Dynamically retrieve the ARNs for your Kinesis stream and S3 bucket.
- Construct the Firehose IAM role ARN using your AWS account ID.
- Define the S3 destination configuration, including buffering, compression, and folder structure.
- Attempt to create the Firehose delivery stream, and handle the case where the stream already exists.
After creating your Firehose delivery stream, it is important to verify that data is actually being delivered to your S3 bucket. You can do this by checking your S3 bucket for new files in the folder path you specified with the prefix. For example, you should see files appearing under raw/user-events/YYYY/MM/DD/ (with the current date) as data flows through the pipeline.
You can use the AWS CLI or Python to check for new files. Here’s how you can do it with Python and boto3, using today’s date to construct the prefix automatically:
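A minimal sketch of that check is given below. It assumes the `raw/user-events/YYYY/MM/DD/` layout described above and uses the current UTC date, since Firehose evaluates timestamp prefixes in UTC. The helper names are hypothetical, and `s3` is assumed to be a `boto3.client("s3")`.

```python
from datetime import datetime, timezone


def todays_prefix(base="raw/user-events/", now=None):
    """Build the S3 prefix for the current UTC date, e.g. raw/user-events/2024/03/07/."""
    now = now or datetime.now(timezone.utc)
    return base + now.strftime("%Y/%m/%d") + "/"


def list_delivered_objects(s3, bucket_name, prefix):
    """Return (key, size) pairs for every object under the prefix."""
    paginator = s3.get_paginator("list_objects_v2")
    found = []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            found.append((obj["Key"], obj["Size"]))
    return found


# Example usage (requires AWS credentials):
#   import boto3
#   prefix = todays_prefix()
#   for key, size in list_delivered_objects(boto3.client("s3"), "my-data-lake-bucket", prefix):
#       print(f"{key} ({size} bytes)")
```

An empty result shortly after sending records is not necessarily a failure, because Firehose only writes once a buffering threshold has been reached.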
If you do not see data arriving in S3, there are a few things to check. First, make sure that your IAM role has the correct permissions to read from the Kinesis stream and write to the S3 bucket. If there are permission issues, Firehose will not be able to deliver the data. You should also check the errors/ folder in your S3 bucket, as specified by the ErrorOutputPrefix. Any records that could not be delivered will be stored here, along with information about what went wrong.
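These checks can be scripted as well. The sketch below confirms that the delivery stream is active and lists anything written under the error prefix. The helper names and the delivery stream name `user-events-to-s3` are hypothetical, and the `errors/` prefix is assumed to match your ErrorOutputPrefix.

```python
def delivery_stream_status(firehose, name):
    """Return the delivery stream's status, e.g. CREATING or ACTIVE."""
    description = firehose.describe_delivery_stream(DeliveryStreamName=name)
    return description["DeliveryStreamDescription"]["DeliveryStreamStatus"]


def list_error_objects(s3, bucket_name, error_prefix="errors/"):
    """Return the keys of objects under the error prefix (first page only)."""
    response = s3.list_objects_v2(Bucket=bucket_name, Prefix=error_prefix)
    return [obj["Key"] for obj in response.get("Contents", [])]


# Example usage (requires AWS credentials):
#   import boto3
#   print(delivery_stream_status(boto3.client("firehose"), "user-events-to-s3"))
#   print(list_error_objects(boto3.client("s3"), "my-data-lake-bucket"))
```

A status other than ACTIVE means Firehose is not delivering yet, so it is worth checking before investigating permissions.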
It is also helpful to review the buffering settings. If you set a large buffer size or a long interval, it may take some time before files appear in S3. Adjust these settings if you need data to arrive more quickly.
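One way to adjust buffering on an existing delivery stream is sketched below. It assumes the stream has a single S3 destination that can be updated through `ExtendedS3DestinationUpdate` (the older `S3DestinationUpdate` parameter takes the same buffering fields); the helper name `update_buffering` and the stream name in the usage comment are hypothetical.

```python
def update_buffering(firehose, name, size_mb, interval_seconds):
    """Change the buffering hints of a delivery stream's S3 destination."""
    description = firehose.describe_delivery_stream(DeliveryStreamName=name)[
        "DeliveryStreamDescription"
    ]

    firehose.update_destination(
        DeliveryStreamName=name,
        # The version ID guards against concurrent configuration changes.
        CurrentDeliveryStreamVersionId=description["VersionId"],
        # Assumption: the stream has exactly one destination.
        DestinationId=description["Destinations"][0]["DestinationId"],
        ExtendedS3DestinationUpdate={
            "BufferingHints": {
                "SizeInMBs": size_mb,
                "IntervalInSeconds": interval_seconds,
            }
        },
    )


# Example usage (requires AWS credentials):
#   import boto3
#   update_buffering(boto3.client("firehose"), "user-events-to-s3", 1, 60)
```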
In this lesson, you learned how to automate the delivery of real-time data from your Kinesis Data Stream to your S3 data lake using Kinesis Data Firehose. You saw how to set up a delivery stream with Python, configure buffering and compression, and organize your data in S3 using prefixes. You also learned how to verify that data is arriving as expected and where to look if something goes wrong.
You are now ready to put these concepts into practice. In the next exercises, you will create your own Firehose delivery stream and see how it simplifies streaming ingestion for your data lake. Take a moment to review the code and explanations from this lesson, as they will help you succeed in the upcoming hands-on tasks.
