Introduction & Lesson Overview

Welcome to an exciting new chapter in your machine learning journey with AWS SageMaker! By now, you've mastered the individual components of machine learning workflows — you know how to preprocess datasets, train models in SageMaker, and deploy them for real-world use. These are powerful skills, but there's one more crucial piece to complete your ML toolkit: automation.

In this lesson, you'll learn how to connect all these individual steps into a seamless, automated workflow using SageMaker Pipelines. Think of it as moving from cooking individual dishes to orchestrating an entire restaurant kitchen — each step needs to happen in the right order, with the right inputs, and you want it all to run smoothly without manual intervention.

By the end of this lesson, you'll have built your first SageMaker Pipeline with a data preprocessing step. This foundation will prepare you for the more complex pipelines we'll build in upcoming lessons, where we'll add training, evaluation, and conditional model registration steps. You'll understand not just how to write the code, but why each component matters and how they work together to create robust, production-ready ML workflows.

What are SageMaker Pipelines?

SageMaker Pipelines is AWS's solution for creating and managing machine learning workflows. A pipeline is essentially a series of connected steps that execute in a specific order. Just like an assembly line in manufacturing, each step takes inputs, processes them, and produces outputs that become inputs for the next step.

The real power of pipelines becomes clear when you consider what happens without them. Imagine you're working on a machine learning project where you need to preprocess data, train a model, evaluate its performance, and then deploy it only if it meets certain quality criteria. Without automation, you'd need to manually run each step, wait for it to complete, check the results, and then decide whether to proceed to the next step. This process is time-consuming, error-prone, and doesn't scale well when you need to retrain models regularly or work with multiple datasets.

SageMaker Pipelines solves these challenges by providing several key benefits:

  • Reproducibility — your pipeline will execute the same way every time, eliminating the "it worked on my machine" problem
  • Scalability — you can easily run the same pipeline on different datasets or with different parameters
  • Monitoring and tracking — you can see exactly what happened at each step, making debugging and optimization much easier
  • Collaboration — team members can understand and modify the workflow without needing to decipher scattered scripts and manual processes

These benefits transform machine learning from a collection of manual, error-prone tasks into a reliable, automated system that can scale with your business needs.

Adapting the Preprocessing Script for SageMaker

To build a pipeline, we need to start with individual steps. Our first step will be data preprocessing — the same preprocessing you've done locally in previous courses, but now adapted to run in SageMaker's managed environment.

We'll create a separate file called data_processing.py that contains our preprocessing logic. The preprocessing logic itself is identical to what you've done before — we're still capping outliers, creating new features, and splitting the data. The key changes are in how we handle file paths to work within SageMaker's processing environment:
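
Here is a minimal sketch of what data_processing.py might look like. The SageMaker-specific paths are the important part; the capping and feature-engineering lines (median_house_value, rooms_per_household) are illustrative stand-ins for the preprocessing logic you built in earlier lessons:

    # data_processing.py
    import os
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # SageMaker mounts the input data and collects outputs at these container paths
    INPUT_PATH = "/opt/ml/processing/input/california_housing.csv"
    TRAIN_DIR = "/opt/ml/processing/train"
    TEST_DIR = "/opt/ml/processing/test"

    df = pd.read_csv(INPUT_PATH)

    # Illustrative preprocessing: cap an outlier-prone column and add a derived feature
    df["median_house_value"] = df["median_house_value"].clip(upper=df["median_house_value"].quantile(0.99))
    df["rooms_per_household"] = df["total_rooms"] / df["households"]

    # Split the data and write each part to its SageMaker output directory
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
    os.makedirs(TRAIN_DIR, exist_ok=True)
    os.makedirs(TEST_DIR, exist_ok=True)
    train_df.to_csv(os.path.join(TRAIN_DIR, "train.csv"), index=False)
    test_df.to_csv(os.path.join(TEST_DIR, "test.csv"), index=False)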

The critical changes that make this script SageMaker-compatible are:

  1. Input path: We read from /opt/ml/processing/input/ instead of a local file
  2. Output paths: We write to /opt/ml/processing/train/ and /opt/ml/processing/test/ directories

Understanding SageMaker Sessions

Before building your pipeline, you need to understand that SageMaker Pipelines require two different types of sessions that serve distinct purposes:
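
A minimal sketch of how the two sessions are created (both classes come from the SageMaker Python SDK):

    import sagemaker
    from sagemaker.workflow.pipeline_context import PipelineSession

    # Regular session: talks to AWS immediately (uploads, pipeline management, status checks)
    sagemaker_session = sagemaker.Session()

    # Pipeline session: records processor/estimator calls as step definitions instead of running them
    pipeline_session = PipelineSession()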

The distinction between these sessions is about execution context, not timing:

  • sagemaker.Session() — This is your direct connection to AWS services. When you use this session, you're telling SageMaker "execute this operation using real AWS resources right now." It handles immediate operations like uploading data to S3, creating pipeline definitions, starting executions, and checking status.

  • PipelineSession() — This is a special "recording" session that creates placeholder operations instead of real ones. When you use this session with processors or estimators, instead of immediately creating SageMaker jobs, it returns step objects that become part of your pipeline definition. These placeholders get converted into real operations only when the pipeline executes.

If you created a processor with a regular sagemaker.Session() instead, SageMaker would immediately try to spin up compute instances and start processing your data before you had even finished defining your pipeline. The PipelineSession() lets you define all your pipeline components as a complete workflow first, then execute everything in the proper order when you're ready.

Simple Rule:

  • Use PipelineSession() for any processor, estimator, or transformer that should become a pipeline step
  • Use sagemaker.Session() for immediate actions like managing the pipeline itself

Setting Up AWS Resources

Now that we understand sessions, we need to set up the AWS resources and permissions that our pipeline will use:
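
A sketch of this setup, assuming you are running somewhere (for example, a SageMaker notebook) where get_execution_role() can resolve a role; elsewhere you would supply an IAM role ARN explicitly:

    import sagemaker

    # IAM role that lets SageMaker read your data and create resources on your behalf
    role = sagemaker.get_execution_role()

    # Default S3 bucket for pipeline inputs and outputs (uses the regular session created earlier)
    default_bucket = sagemaker_session.default_bucket()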

These resources should be familiar from your previous SageMaker work — the execution role provides the necessary permissions for SageMaker to access your data and create resources, while the default bucket provides storage for your pipeline inputs and outputs.

Creating the Processing Environment

To run our preprocessing script in SageMaker, we need to define the computing environment where our code will execute. We use an SKLearnProcessor because our preprocessing script uses scikit-learn libraries, and crucially, we use the PipelineSession to ensure the processor becomes a pipeline step rather than executing immediately:
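
A sketch of the processor configuration; the framework_version shown here is an assumption, so pin whichever scikit-learn version your SDK supports and your script was written against:

    from sagemaker.sklearn.processing import SKLearnProcessor

    sklearn_processor = SKLearnProcessor(
        framework_version="1.2-1",           # pinned scikit-learn version for reproducibility
        role=role,
        instance_type="ml.m5.large",         # balanced CPU/memory for most preprocessing
        instance_count=1,                    # a single instance is enough for this dataset
        base_job_name="california-housing-preprocessing",
        sagemaker_session=pipeline_session,  # record as a pipeline step, don't run immediately
    )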

Since our preprocessing script uses pandas and scikit-learn libraries, the SKLearnProcessor provides the perfect pre-configured environment with all necessary dependencies already installed. Each parameter in the processor configuration serves a specific purpose in defining how your preprocessing will execute:

  • sagemaker_session=pipeline_session — Uses the pipeline session instead of the regular session, telling the processor to become part of a pipeline step rather than executing immediately when created
  • framework_version — Ensures we use a specific version of scikit-learn, which is important for reproducibility so your pipeline behaves the same way every time it runs
  • instance_type — Determines the computational resources available; ml.m5.large provides a good balance of CPU and memory for most preprocessing tasks
  • instance_count — Specifies the number of instances to use; we use 1 since our dataset is small enough to process on a single machine, but you could scale this up for larger datasets

Building Our First Pipeline Step

With our processor configured, we can now create the actual processing step using ProcessingStep. This step defines what data goes in, what comes out, and what code runs in between.

Before proceeding, note that we assume you have already uploaded your raw dataset (california_housing.csv) to your S3 default bucket at the path /datasets/california_housing.csv. This is necessary because the pipeline will read the input data directly from S3.
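
A sketch of the step definition. The step name ProcessData matches the name referenced later in this lesson; the output names train and test are choices made here so that downstream steps have concrete names to refer to:

    from sagemaker.processing import ProcessingInput, ProcessingOutput
    from sagemaker.workflow.steps import ProcessingStep

    processing_step = ProcessingStep(
        name="ProcessData",
        processor=sklearn_processor,
        inputs=[
            ProcessingInput(
                source=f"s3://{default_bucket}/datasets/california_housing.csv",
                destination="/opt/ml/processing/input",
            )
        ],
        outputs=[
            ProcessingOutput(output_name="train", source="/opt/ml/processing/train"),
            ProcessingOutput(output_name="test", source="/opt/ml/processing/test"),
        ],
        code="data_processing.py",
    )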

Notice how the paths in our ProcessingInput and ProcessingOutput definitions perfectly match the paths our script expects. The inputs parameter specifies where our raw data comes from (an S3 location) and where it will be mounted inside the processing container (/opt/ml/processing/input). The outputs parameter defines where our processed data will be saved, with separate outputs for training and test data. Each output has a name that we can reference later and a source path where our processing script will write the data. The code parameter points to the Python script that contains our preprocessing logic.

Creating the Pipeline

With our processing step defined, we can now create our first pipeline. A pipeline is simply a collection of steps that execute in order, and right now we have just one step. Note that we use the regular sagemaker_session for pipeline management:
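
A sketch of the pipeline definition, using the pipeline name that appears in the execution ARN later in this lesson:

    from sagemaker.workflow.pipeline import Pipeline

    pipeline = Pipeline(
        name="california-housing-preprocessing-pipeline",
        steps=[processing_step],
        sagemaker_session=sagemaker_session,  # regular session: managing the pipeline is an immediate action
    )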

This creates a pipeline definition with our single processing step. In future lessons, we'll add more steps to this list to create more complex workflows with training, evaluation, and model registration.

At this point, we've only created the pipeline definition in memory — it doesn't exist in AWS yet. To make it available in SageMaker, we need to register it with the service using the upsert method:
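
    # Create the pipeline in SageMaker, or update it if it already exists
    pipeline.upsert(role_arn=role)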

The upsert method is particularly useful because it handles both creation and updates intelligently. If this is the first time you're running this code, it will create a new pipeline in SageMaker. If you run the same code again after making changes to your pipeline definition, it will update the existing pipeline rather than throwing an error. This makes iterative development much smoother — you can modify your pipeline code, run it again, and SageMaker will automatically apply your changes.

Think of the pipeline definition as a blueprint or recipe. Once you've registered this blueprint with SageMaker using upsert, you can execute it multiple times. Each execution is a separate run of the same blueprint, potentially with different data or parameters.

Executing the Pipeline

Now that our pipeline is registered with SageMaker, we can start an execution. This is where the actual work begins:
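
    # Start an asynchronous execution of the registered pipeline
    execution = pipeline.start()
    print(execution.arn)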

When you call pipeline.start(), SageMaker immediately begins executing your pipeline in the background. This means your local Python script doesn't need to wait for the processing to complete — the heavy computational work is happening on AWS infrastructure while your script continues running or even after it finishes.

The execution object provides valuable information about the running pipeline. You'll see output similar to:
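
    arn:aws:sagemaker:us-east-1:123456789012:pipeline/california-housing-preprocessing-pipeline/execution/x1gc33lgj8v5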

This ARN (Amazon Resource Name) is a unique identifier for your specific pipeline execution. Let's break down what each part means:

  • arn:aws:sagemaker — identifies this as an AWS SageMaker resource
  • us-east-1 — the AWS region where your pipeline is running
  • 123456789012 — your AWS account ID
  • pipeline/california-housing-preprocessing-pipeline — the name of your pipeline
  • execution/x1gc33lgj8v5 — the unique execution ID for this specific run

The execution ID (x1gc33lgj8v5 in this example) is particularly important because it distinguishes this run from all other executions of the same pipeline. Every time you call pipeline.start(), SageMaker generates a new execution ID. This allows you to track multiple runs of the same pipeline, compare their results, and debug issues by examining the logs and outputs from specific executions. You'll also see this execution ID in the S3 paths where your pipeline outputs are stored, making it easy to trace which data came from which pipeline run.

Monitoring Your Pipeline

Once your pipeline is running, you can monitor its progress and check its status. The execution object provides several methods for tracking the pipeline's progress:
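
A sketch of how you might check on it, using the execution object's describe() and list_steps() methods:

    # Overall status of this pipeline execution
    status = execution.describe()["PipelineExecutionStatus"]
    print(f"Pipeline execution status: {status}")

    # Status of each individual step
    for step in execution.list_steps():
        print(step["StepName"], "-", step["StepStatus"])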

When you run this code, you'll see output similar to:
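
    Pipeline execution status: Executing
    ProcessData - Executing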

The status will initially show "Executing" and will change to "Succeeded" or "Failed" as the pipeline progresses through its steps. You can also wait for the pipeline to complete:
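
    # Block until the pipeline execution completes
    execution.wait()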

When successful, your processed data will be automatically saved to S3 in your default bucket under a path that includes the pipeline execution ID. For example, if your default bucket is sagemaker-us-east-1-123456789012 and the pipeline execution ID is x1gc33lgj8v5, the training split produced by the ProcessData step will be saved at:
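
    s3://sagemaker-us-east-1-123456789012/california-housing-preprocessing-pipeline/x1gc33lgj8v5/ProcessData/output/train/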

The corresponding test split would be saved at:
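
    s3://sagemaker-us-east-1-123456789012/california-housing-preprocessing-pipeline/x1gc33lgj8v5/ProcessData/output/test/

(These example paths are illustrative; the exact prefix layout depends on your SDK version and the step and output names you chose, but the default bucket and the execution ID will appear in it.)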

This hierarchical structure makes it easy to track which pipeline run produced which outputs and helps organize your data as you iterate on your pipeline design. In upcoming lessons, you'll be able to easily access these S3 output locations as inputs for downstream steps in your pipeline, such as model training and evaluation.

Summary & Next Steps

In this lesson, you learned that SageMaker Pipelines provide a powerful way to automate and orchestrate machine learning workflows, offering benefits like reproducibility, scalability, and better collaboration. You discovered how to adapt existing preprocessing code to work in SageMaker's managed environment by changing file paths to use SageMaker's conventions.

You explored the essential components that every pipeline needs: two types of SageMaker sessions (regular session for immediate operations and pipeline session for creating pipeline steps), an execution role for permissions, an S3 bucket for storage, and a processor that defines the computing environment. You built a complete data preprocessing step using an SKLearnProcessor and ProcessingStep, defining inputs from S3, outputs for processed data, and the processing script that performs the actual data transformation.

Finally, you learned how to create, upsert, and execute your pipeline, along with monitoring its progress. Your pipeline successfully processes California housing data and prepares it for the next stages of the machine learning workflow.

The skills you've developed here — understanding pipeline structure, defining processing steps, managing different session types, and controlling execution flow — are the building blocks for creating sophisticated, production-ready machine learning systems. You're now ready to move on to the hands-on exercises, where you'll practice and reinforce these concepts by building your own preprocessing pipeline from scratch.
