Welcome back! You've made excellent progress in your journey to master SageMaker Pipelines. In the first lesson, you successfully built your first pipeline with a data preprocessing step that transforms raw California housing data into clean, processed datasets ready for machine learning. Then, in the second lesson, you learned the critical skill of monitoring pipeline executions, tracking their progress, and diagnosing any issues that might arise during execution.
Now you're ready to take the next significant step in building automated ML workflows. In this lesson, you'll expand your existing pipeline by adding a model training step that seamlessly connects to your preprocessing output. This integration represents a fundamental concept in MLOps: creating end-to-end workflows where data flows automatically from one stage to the next without manual intervention.
By the end of this lesson, you'll have a complete pipeline that takes raw data, processes it, and trains a machine learning model — all in a single, automated workflow. This expanded pipeline will demonstrate how SageMaker Pipelines can orchestrate complex ML workflows while maintaining clear dependencies between steps. You'll learn how to configure training steps, connect pipeline outputs to inputs, and ensure your workflow runs smoothly from start to finish.
Let's quickly recap what we've already accomplished. Our current pipeline contains a single data preprocessing step that handles the California housing dataset beautifully.
Our preprocessing step uses an SKLearnProcessor to run our data_processing.py script, which loads the raw housing data, performs feature scaling and encoding, splits the data into training and test sets, and saves the processed datasets to designated S3 locations. This step takes raw CSV data as input and produces two outputs: processed training data and processed test data, both stored in S3 with clear reference names.
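For reference, here is a rough sketch of how that step might be wired up. The variable names, local paths, and the output reference names train_data and test_data are assumptions carried through the rest of this lesson, so substitute the names from your own lesson-one code:

```python
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

# Rough recap of the existing preprocessing step; sklearn_processor and
# raw_data_s3_uri stand in for the objects created in the first lesson.
processing_step = ProcessingStep(
    name="DataPreprocessing",
    processor=sklearn_processor,
    code="data_processing.py",
    inputs=[
        ProcessingInput(
            source=raw_data_s3_uri,
            destination="/opt/ml/processing/input",
        ),
    ],
    outputs=[
        ProcessingOutput(output_name="train_data", source="/opt/ml/processing/train"),
        ProcessingOutput(output_name="test_data", source="/opt/ml/processing/test"),
    ],
)
```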
This preprocessing foundation is perfect for adding model training because it produces clean, processed data in exactly the format that a training algorithm expects. The output structure we established — with separate training and test datasets — follows ML best practices and makes it straightforward to connect additional pipeline steps.
Before we add the training step to our pipeline, let's understand what our training script does. Our train.py script contains the actual machine learning code that will execute during the training step. This script follows SageMaker's training script conventions, which means it knows how to read input data from specific locations and save the trained model to the correct output directory.
Here's what our training script looks like:
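The script below is a minimal sketch of that idea rather than a verbatim copy: the file name train.csv and the column name target are assumptions, so adjust them to match what your data_processing.py script actually writes.

```python
# train.py: a minimal sketch of the training script, following SageMaker's
# training script conventions.
import argparse
import os

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    args, _ = parser.parse_known_args()

    # SageMaker exposes each input channel and the model output directory
    # through environment variables inside the training container
    args.train = os.environ.get("SM_CHANNEL_TRAIN")
    args.model_dir = os.environ.get("SM_MODEL_DIR")

    # Load the processed training data produced by the preprocessing step
    # (file and column names are assumptions; match them to your own data)
    train_df = pd.read_csv(os.path.join(args.train, "train.csv"))

    # Separate features from the target variable
    X_train = train_df.drop(columns=["target"])
    y_train = train_df["target"]

    # Train a Linear Regression model and report training performance
    model = LinearRegression()
    model.fit(X_train, y_train)
    print(f"Training R^2: {model.score(X_train, y_train):.4f}")

    # Save the trained model where SageMaker expects to find it
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
```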
The main training logic follows the same machine learning workflow you've used before: loading the processed data, separating features from the target variable, training a Linear Regression model, evaluating its performance, and saving the trained model using joblib. The key difference is that this script is structured to work within SageMaker's training environment, reading from designated input channels and saving to the correct output directory.
Now let's add the training components to our existing pipeline. First, we need to create an estimator that defines how our model training will be executed. Our SKLearn estimator will specify the training environment, including our entry point script, computational resources, and framework version:
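A sketch of this configuration is shown below. It assumes the role and pipeline_session variables created earlier in the course, and the framework version and instance type are illustrative values you may need to adjust:

```python
from sagemaker.sklearn.estimator import SKLearn

# Estimator sketch: framework version and instance type are illustrative;
# role and pipeline_session are assumed to exist from earlier lessons.
sklearn_estimator = SKLearn(
    entry_point="train.py",              # our training script
    framework_version="1.2-1",           # scikit-learn framework version
    instance_type="ml.m5.large",         # compute for the training job
    instance_count=1,
    role=role,                           # IAM role with SageMaker permissions
    sagemaker_session=pipeline_session,  # defer execution to the pipeline
)
```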
This estimator configuration tells SageMaker to use our train.py script in a scikit-learn environment with the specified computational resources. Just as we used the pipeline_session with our processor, we use it here to ensure the training job will be executed as part of our pipeline workflow rather than immediately when the estimator is created. This estimator acts as a blueprint for how the training job should be executed when our pipeline runs.
With our estimator configured, we can now create the TrainingStep that orchestrates model training within our pipeline. A TrainingStep is a pipeline component that combines an estimator (which defines the training environment) with input data sources. The key to building effective pipelines lies in properly connecting outputs from one step to inputs of another using SageMaker's step properties system.
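Here is a sketch of that step. The output name "train_data" is an assumption and should match the name you gave the training output in the preprocessing step:

```python
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep

# Training step sketch: the "train" channel maps to SM_CHANNEL_TRAIN, and the
# s3_data reference points at the preprocessing step's "train_data" output.
training_step = TrainingStep(
    name="ModelTraining",
    estimator=sklearn_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=processing_step.properties.ProcessingOutputConfig
                    .Outputs["train_data"].S3Output.S3Uri,
            content_type="text/csv",
        )
    },
)
```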
The critical connection happens in the inputs parameter. The "train" key in our inputs dictionary corresponds to the SM_CHANNEL_TRAIN environment variable that our training script uses to locate input data. Remember in our training script where we had args.train = os.environ.get('SM_CHANNEL_TRAIN')? That environment variable gets populated with the local path where SageMaker downloads our training data. When the training job runs, SageMaker will automatically download the data from the S3 location we specify and make it available to our script at the path stored in SM_CHANNEL_TRAIN.
Now let's break down how we specify which S3 location to use for the s3_data parameter:

- processing_step - References our preprocessing step that we defined earlier
- .properties - Accesses that step's run-time properties, which SageMaker resolves when the pipeline executes
- .ProcessingOutputConfig.Outputs["train_data"] - Selects the processing output by the reference name we gave it
- .S3Output.S3Uri - Resolves to the S3 URI where that output will be stored
Finally, let's update our pipeline creation to include both our preprocessing and training steps:
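A sketch of the updated definition follows; the pipeline name is illustrative, and role and pipeline_session are the same assumed variables used above:

```python
from sagemaker.workflow.pipeline import Pipeline

# Updated pipeline containing both steps
pipeline = Pipeline(
    name="CaliforniaHousingPipeline",
    steps=[processing_step, training_step],
    sagemaker_session=pipeline_session,
)

# Create or update the pipeline definition, then start an execution
pipeline.upsert(role_arn=role)
execution = pipeline.start()
```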
When we execute this complete pipeline, the preprocessing step runs first, and the training step begins automatically once preprocessing completes successfully.
Now that we have a multi-step pipeline, it's important to understand how SageMaker determines execution order. Pipeline execution order is determined by dependencies, not by the order of steps in your steps list.
When you create a pipeline with steps=[processing_step, training_step], SageMaker doesn't execute steps based on list order. Instead, it analyzes dependencies between steps. In our pipeline, we created a dependency when we configured our training step to use the processing step's output:
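As a reminder, that reference is the s3_data value from the training step's inputs, sketched here with the assumed output name "train_data":

```python
# The property reference that creates the dependency between the two steps
s3_data = (
    processing_step.properties
    .ProcessingOutputConfig.Outputs["train_data"]
    .S3Output.S3Uri
)
```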
This property reference tells SageMaker that the training step cannot begin until the processing step completes successfully. Dependencies are explicit and created through step property references — simply placing steps in a certain order in your list doesn't create dependencies.
If steps have no dependencies between them, SageMaker executes them in parallel to minimize total execution time. This explicit dependency model makes pipelines robust and predictable, as execution flow is clearly defined by data dependencies rather than list ordering.
Congratulations! You've successfully expanded your SageMaker Pipeline from a simple preprocessing workflow to a complete end-to-end ML pipeline that processes data and trains a model automatically. This represents a significant milestone in your journey toward building production-ready ML workflows.
In this lesson, you learned how to integrate model training components into our existing pipeline by configuring SKLearn estimators, creating training scripts that follow SageMaker conventions, and, most importantly, connecting pipeline steps through the step properties system. You now understand how to create dependencies between steps, ensuring that data flows seamlessly from preprocessing to training without manual intervention.
You're now ready to practice these concepts in the hands-on exercises that follow, where you'll work with the complete pipeline code and gain confidence in building multi-step ML workflows. The skills you've developed here form the foundation for even more sophisticated pipelines that might include model evaluation, conditional logic, and automated deployment — topics we'll explore in future lessons.
