Welcome back! In the previous lesson, you learned how to make your AWS Lambda function respond automatically when a new file is uploaded to an S3 bucket. You wrote a Lambda function that logs file details and set up the S3 event trigger so this happens instantly. This is a great first step in building automated data pipelines.
Now, let’s take your automation further. Instead of just logging uploads, you will make Lambda start a data processing job using AWS Glue. This means that as soon as a new file lands in your S3 bucket, your pipeline will automatically kick off an ETL (Extract, Transform, Load) job to process the data — no manual steps required. This is a common pattern in real-world data engineering, where you want new data to be processed and made available as soon as it arrives.
By the end of this lesson, you will know how to write a Lambda function that starts a Glue job in response to an S3 upload, passing the file details to Glue so it knows what to process.
Let’s build on what you already know. Previously, your Lambda function simply logged information about the uploaded file. Now, you will update your Lambda so that it starts a Glue job whenever it is triggered by an S3 event.
The main change is that your Lambda function will use the AWS SDK for Python, called boto3, to communicate with the Glue service. This allows your function to start a Glue job programmatically. For this to work, your Lambda function needs permission to call the glue:StartJobRun action. This is usually done by attaching the right IAM policy to your Lambda’s execution role. If you are working in the CodeSignal environment, these permissions are already set up for you, but it’s important to know about them for your own AWS projects.
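As a rough illustration, the permission described above could be expressed as an IAM policy statement like the one built in this sketch. The region, account ID, and the choice to scope the policy to a single job named simple-etl-job are placeholder assumptions; in your own account you would substitute real values and attach the policy to the Lambda execution role.

```python
import json


def build_start_job_policy(region, account_id, job_name):
    """Return an IAM policy document that allows glue:StartJobRun on one job."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "glue:StartJobRun",
                # Glue job ARNs have the form arn:aws:glue:<region>:<account>:job/<name>
                "Resource": f"arn:aws:glue:{region}:{account_id}:job/{job_name}",
            }
        ],
    }


# Placeholder region and account ID, used for illustration only.
policy = build_start_job_policy("us-east-1", "123456789012", "simple-etl-job")
print(json.dumps(policy, indent=2))
```

Scoping the Resource to one job ARN, rather than using a wildcard, keeps the function limited to starting only the job it is meant to start.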
The workflow now looks like this: a file is uploaded to S3, S3 triggers your Lambda function, and Lambda starts a Glue job to process the new file. This is a powerful way to automate your data pipeline.
For production systems, this is also a good place to add retry and backoff around AWS SDK calls. A short-lived service-side error or permission propagation delay can cause start_job_run() to fail even though your overall workflow is correct. A small retry loop often makes event-driven automation much more reliable.
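One way to sketch such a retry loop is shown below. The helper name start_job_with_retry, the attempt count, and the delay values are assumptions chosen for illustration. The sketch retries on any exception to stay short; real code would more likely catch botocore's ClientError and retry only on error codes that are known to be transient.

```python
import random
import time


def start_job_with_retry(glue_client, job_name, arguments,
                         max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call start_job_run, retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            response = glue_client.start_job_run(
                JobName=job_name, Arguments=arguments
            )
            return response["JobRunId"]
        except Exception as exc:  # assumption: narrow this in real code
            if attempt == max_attempts:
                raise  # out of attempts, let Lambda report the failure
            # Delay doubles each attempt: 1s, 2s, 4s, ... plus random jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
```

Inside a Lambda function you would pass boto3.client("glue") as glue_client. Keep the total retry time well below the function timeout, otherwise the retries themselves can cause the invocation to fail.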
Let’s look at a complete example of a Lambda function that starts a Glue job when a new file is uploaded to S3. Here is the code:
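A minimal sketch of such a handler might look like the following. It follows the walkthrough in this lesson, but several details are assumptions rather than a definitive implementation: the argument names --bucket and --key, the log wording, the shape of the return value, and the optional glue_client parameter, which exists only so the function can be exercised locally with a stand-in client.

```python
import json
import urllib.parse

JOB_NAME = "simple-etl-job"


def lambda_handler(event, context, glue_client=None):
    """Start a Glue job run for the file described in an S3 event."""
    # Parse defensively so a malformed event is logged instead of raising KeyError.
    records = event.get("Records") or []
    s3_info = records[0].get("s3", {}) if records else {}
    bucket = s3_info.get("bucket", {}).get("name")
    key = s3_info.get("object", {}).get("key")
    if not bucket or not key:
        print(f"Could not find bucket/key in event: {json.dumps(event)}")
        return {
            "statusCode": 400,
            "body": json.dumps({"message": "No S3 object found in event"}),
        }

    # S3 URL-encodes object keys in event notifications (spaces arrive as '+').
    key = urllib.parse.unquote_plus(key)
    print(f"New file detected: s3://{bucket}/{key}")

    if glue_client is None:
        import boto3  # provided by the Lambda Python runtime
        glue_client = boto3.client("glue")

    # Argument names are assumptions; they must match what the Glue script reads.
    response = glue_client.start_job_run(
        JobName=JOB_NAME,
        Arguments={"--bucket": bucket, "--key": key},
    )
    job_run_id = response["JobRunId"]
    print(f"Started Glue job run: {job_run_id}")

    return {
        "statusCode": 200,
        "body": json.dumps(
            {"message": "Glue job started", "jobRunId": job_run_id}
        ),
    }
```

Lambda itself calls the handler with only event and context, so glue_client stays None and a real boto3 client is created.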
This example keeps the safer parsing style from Unit 2. In a controlled demo environment, direct indexing like event['Records'][0] can work, but production Lambda functions are usually better off handling missing or malformed fields gracefully so they log useful information instead of failing with a KeyError.
Let’s break down what this code does. When the Lambda function is triggered by an S3 event, it first extracts the bucket name and file key from the event payload. This tells the function which file was uploaded. It then creates a Glue client using boto3 and calls start_job_run to start a Glue job named simple-etl-job. The function passes the bucket and key as arguments to the Glue job, so the job knows which file to process. After starting the job, it prints the Glue job run ID and returns a response with the job run ID and a success message.
The Arguments dictionary is how Lambda passes runtime parameters into the Glue job. The keys start with -- because Glue job arguments are exposed on the Glue side as command-line style parameters. In the Glue script, those values are typically read with getResolvedOptions from awsglue.utils, using the same argument names without the leading dashes. This lets one reusable Glue job process different input files depending on which S3 event triggered the Lambda function.
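On the Glue side, reading those arguments could look roughly like this sketch. It assumes the Lambda function passed arguments named --bucket and --key; use whatever names your own function sends. The getResolvedOptions import comes from the awsglue library, which is available inside Glue jobs. The fallback function is only a local stand-in, added here so the sketch can run outside Glue.

```python
import argparse

try:
    # Available inside the AWS Glue job environment.
    from awsglue.utils import getResolvedOptions
except ImportError:
    # Local stand-in (an assumption for running this sketch outside Glue).
    def getResolvedOptions(args, options):
        parser = argparse.ArgumentParser()
        for name in options:
            parser.add_argument(f"--{name}", required=True)
        parsed, _unknown = parser.parse_known_args(args[1:])
        return vars(parsed)


def read_input_location(argv):
    """Resolve the job arguments and return the S3 URI of the input file."""
    # Names are given without the leading dashes.
    args = getResolvedOptions(argv, ["bucket", "key"])
    return f"s3://{args['bucket']}/{args['key']}"


# In a real Glue script you would pass sys.argv; this list mimics it.
example_argv = ["job.py", "--bucket", "my-bucket", "--key", "raw/test_data.json"]
print(read_input_location(example_argv))
```

Because the input location arrives as an argument, the same job script can process any file without being edited.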
To test this workflow, you can upload a new file to your S3 bucket in the folder that triggers your Lambda function. For example, if your trigger is set up for the raw/ folder, upload a file like raw/test_data.json. Once the file is uploaded, Lambda will be triggered automatically.
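If you prefer to upload the test file from code rather than the console, a small sketch like this one could do it. The helper name upload_sample_file and the sample records are made up for illustration, and the bucket name is a placeholder you would replace with your own.

```python
import json

# Made-up sample records, used for illustration only.
SAMPLE_RECORDS = [{"id": 1, "value": "example"}, {"id": 2, "value": "another"}]


def upload_sample_file(s3_client, bucket, key="raw/test_data.json"):
    """Upload a small JSON file under the prefix that triggers the Lambda."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(SAMPLE_RECORDS).encode("utf-8"),
    )
    return f"s3://{bucket}/{key}"
```

With real credentials, you would call it as upload_sample_file(boto3.client("s3"), "your-bucket-name").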
You can check the Lambda logs (in AWS CloudWatch or your CodeSignal environment) to see the output. You should see a message showing the new file detected and the Glue job run ID. You can also go to the AWS Glue console to see that a new job run has started for your ETL job.
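Besides the console, you can check on a job run from code. The sketch below polls the Glue get_job_run call until the run reaches a final state. The helper name wait_for_job_run, the polling interval, and the poll limit are assumptions. This kind of polling is meant for a local script or notebook, not for the Lambda function itself.

```python
import time

# Job run states after which the run will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT", "ERROR"}


def wait_for_job_run(glue_client, job_name, run_id,
                     poll_seconds=15, max_polls=40, sleep=time.sleep):
    """Poll a Glue job run and return its last observed state."""
    state = None
    for _ in range(max_polls):
        response = glue_client.get_job_run(JobName=job_name, RunId=run_id)
        state = response["JobRun"]["JobRunState"]
        print(f"Job run {run_id} is {state}")
        if state in TERMINAL_STATES:
            break
        sleep(poll_seconds)
    return state
```

The run_id argument is the job run ID that your Lambda function printed in its logs.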
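If there is a problem, such as missing permissions or a typo in the job name, you will see an error message in the logs. A missing glue:StartJobRun permission typically shows up as an AccessDeniedException, while a misspelled job name shows up as an EntityNotFoundException. This feedback is helpful for troubleshooting.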
In this lesson, you learned how to extend your event-driven pipeline by having Lambda start a Glue ETL job whenever a new file is uploaded to S3. You saw how to extract file details from the S3 event, use boto3 to start a Glue job, and pass the right arguments so Glue knows what to process. This pattern is a key building block for automated data pipelines on AWS.
Next, you will get hands-on practice writing and testing Lambda functions that trigger Glue jobs. You will also learn how to troubleshoot common issues and make sure your automation works smoothly. This will help you build confidence in creating real-world data workflows that run without manual intervention.
