Introduction & Lesson Overview

Welcome back to your journey in mastering SageMaker Pipelines! You've made tremendous progress building automated ML workflows. In your first lesson, you created a solid foundation with a data preprocessing pipeline that transforms raw California housing data into clean, machine-learning-ready datasets. Then, you learned the essential skill of monitoring pipeline executions to track progress and diagnose issues. Most recently, you expanded your pipeline by integrating a model training step, creating a complete two-step workflow that automatically processes data and trains a Linear Regression model.

Your current pipeline represents a significant achievement in ML automation. You now have an end-to-end workflow where raw data flows seamlessly through preprocessing and training stages without manual intervention. The preprocessing step produces clean training and test datasets, while the training step uses the processed training data to create a trained model artifact. However, there's one critical piece missing from this workflow: the systematic evaluation of your model's performance.

In this lesson, you'll complete your ML pipeline by adding a dedicated model evaluation step. This evaluation component will assess how well your trained model performs on the test data that was set aside during preprocessing. You'll learn to create evaluation scripts that generate comprehensive performance metrics, configure evaluation steps that connect to both your model artifacts and test data, and integrate everything into a complete three-step pipeline that processes, trains, and evaluates automatically.

By the end of this lesson, you'll have a production-ready ML pipeline that not only trains models but also provides detailed performance reports, giving you the insights needed to make informed decisions about model quality and deployment readiness.

Importance of Model Evaluation in ML Workflows

Model evaluation serves as the quality gate in your ML workflow, providing objective measurements of how well your trained model performs on unseen data. Without proper evaluation, you're essentially flying blind when it comes to understanding whether your model is ready for production use or needs further refinement.

In automated ML pipelines, evaluation becomes even more critical because you need systematic, repeatable ways to assess model performance across different training runs and data variations. Manual evaluation processes don't scale well and introduce opportunities for human error or inconsistency. By embedding evaluation directly into your pipeline, you ensure that every model gets assessed using the same rigorous standards and metrics.

For regression problems like our California housing price prediction, we'll calculate the same familiar metrics you've used before: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared score. By calculating and tracking all these metrics systematically, your pipeline provides the information needed to make informed decisions about model quality and deployment readiness.
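As a quick refresher, the sketch below shows how these metrics can be computed with scikit-learn on a pair of small, hypothetical prediction arrays; the evaluation script later in this lesson applies the same functions to the real test set.

```python
# Illustrative only: computing the four regression metrics with scikit-learn.
# The y_true / y_pred values here are hypothetical, not real pipeline output.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([2.5, 1.8, 3.2, 0.9])   # actual house values
y_pred = np.array([2.3, 2.0, 3.0, 1.2])   # model predictions

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"MSE={mse:.3f}, RMSE={rmse:.3f}, MAE={mae:.3f}, R2={r2:.3f}")
```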

Exploring the Evaluation Script

Let's examine the evaluation script, which we'll call evaluation.py. It performs the same model assessment you've done locally in previous exercises, now adapted to run inside our SageMaker Pipeline. The script follows SageMaker's processing script conventions: it reads the model and test data supplied by earlier pipeline steps and writes its results where future pipeline components can find them.
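Below is a minimal sketch of what such a script can look like. The file names, column names, and report structure are assumptions you should adjust to match your own preprocessing and training scripts; in particular, it assumes the test split was saved as test.csv with the target in a column named "target", and that the training script saved the model as model.joblib.

```python
# evaluation.py -- a minimal sketch; file and column names are assumptions.
import json
import os
import tarfile

import joblib
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

if __name__ == "__main__":
    # The training step's artifact arrives as a compressed model.tar.gz archive.
    model_dir = "/opt/ml/processing/model"
    with tarfile.open(os.path.join(model_dir, "model.tar.gz")) as tar:
        tar.extractall(path=model_dir)
    model = joblib.load(os.path.join(model_dir, "model.joblib"))

    # Test data produced by the preprocessing step (assumed file/column names).
    test_df = pd.read_csv("/opt/ml/processing/test/test.csv")
    y_test = test_df["target"]
    X_test = test_df.drop(columns=["target"])

    predictions = model.predict(X_test)

    mse = mean_squared_error(y_test, predictions)
    report = {
        "regression_metrics": {
            "mse": {"value": mse},
            "rmse": {"value": mse ** 0.5},
            "mae": {"value": mean_absolute_error(y_test, predictions)},
            "r2": {"value": r2_score(y_test, predictions)},
        }
    }

    # Write the report where the step's ProcessingOutput (and property file) expect it.
    output_dir = "/opt/ml/processing/evaluation"
    os.makedirs(output_dir, exist_ok=True)
    with open(os.path.join(output_dir, "evaluation.json"), "w") as f:
        json.dump(report, f)
```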

The key difference from your local workflow is how the script accesses the trained model. When the training step completes, SageMaker automatically packages the trained Linear Regression model (saved as model.joblib by your training script) into a compressed tar.gz archive and stores it in S3. The evaluation script receives this archive as input and must extract it before use: the tarfile.open() operation decompresses the archive, revealing the same model.joblib file your training script created, which can then be loaded with joblib.load() just as you would locally.

Configuring a SKLearnProcessor for Evaluation

Now, let's integrate the evaluation functionality into our existing pipeline. While we could technically reuse the same processor instance from our preprocessing step, creating a separate processor for each step is a best practice that improves pipeline clarity and maintainability: each step gets a dedicated processor that can be independently configured, scaled, or modified without affecting other pipeline components.

This separation allows you to optimize each step independently. For example, you might later decide that evaluation requires more memory or different compute resources than preprocessing, or you might want to experiment with different framework versions for specific steps. Having dedicated processors makes these modifications straightforward without risking unintended effects on other pipeline components.
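Here is a minimal sketch of a dedicated evaluation processor. The framework version, instance type, base job name, and the role and pipeline_session variables are assumptions; use values that match your environment and the processor you configured for preprocessing.

```python
# A sketch of a dedicated SKLearnProcessor for the evaluation step.
from sagemaker.sklearn.processing import SKLearnProcessor

evaluation_processor = SKLearnProcessor(
    framework_version="1.2-1",          # scikit-learn container version (assumed)
    role=role,                          # your SageMaker execution role
    instance_type="ml.m5.large",        # can differ from the preprocessing step
    instance_count=1,
    base_job_name="housing-evaluation",
    sagemaker_session=pipeline_session,
)
```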

Understanding Property Files for Pipeline Integration

The evaluation step introduces a powerful new concept called Property Files, which allow pipeline steps to expose structured data that subsequent steps can reference and use for decision-making.

Property files enable sophisticated pipeline orchestration by making the contents of output files accessible to other pipeline components. In our case, the evaluation metrics (MSE, RMSE, MAE, R²) stored in the JSON file become available for future pipeline steps to reference. This capability becomes crucial when building more advanced pipelines that might include conditional deployment based on performance thresholds, automated model comparison logic, or routing decisions based on evaluation results.
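A property file is declared once and then attached to the step that produces the file it describes. The sketch below shows one way to define it; the variable name evaluation_report and the values "evaluation" and "evaluation.json" are assumptions that simply need to stay consistent with the evaluation step and script shown in this lesson.

```python
# A sketch of the property file describing the evaluation report.
from sagemaker.workflow.properties import PropertyFile

evaluation_report = PropertyFile(
    name="EvaluationReport",       # logical name used by downstream steps
    output_name="evaluation",      # must match the ProcessingOutput's output_name
    path="evaluation.json",        # file the evaluation script writes
)
```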

The output_name parameter must exactly match the output name we'll specify in our evaluation step's ProcessingOutput, while the path parameter tells SageMaker where to find the JSON file within the output directory. This connection ensures that SageMaker can locate and parse the evaluation metrics for use by downstream pipeline components.

Defining the Evaluation ProcessingStep

With our processor and property file configured, we can now define the evaluation step that orchestrates the entire evaluation process:
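A sketch of that step is shown below. The variable names (evaluation_processor, evaluation_report, processing_step, training_step) and the "test" output name from the preprocessing step are assumptions carried over from the earlier snippets and lessons; depending on your SDK version, you may prefer passing step_args from processor.run() with a PipelineSession instead of the processor and code arguments.

```python
# A sketch of the evaluation ProcessingStep wiring together model and test data.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep

evaluation_step = ProcessingStep(
    name="EvaluateModel",
    processor=evaluation_processor,
    inputs=[
        # Trained model artifact produced by the training step
        ProcessingInput(
            source=training_step.properties.ModelArtifacts.S3ModelArtifacts,
            destination="/opt/ml/processing/model",
        ),
        # Test dataset produced by the preprocessing step (output name assumed to be "test")
        ProcessingInput(
            source=processing_step.properties.ProcessingOutputConfig.Outputs["test"].S3Output.S3Uri,
            destination="/opt/ml/processing/test",
        ),
    ],
    outputs=[
        # output_name must match the PropertyFile's output_name
        ProcessingOutput(
            output_name="evaluation",
            source="/opt/ml/processing/evaluation",
        ),
    ],
    code="evaluation.py",
    property_files=[evaluation_report],
)
```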

The evaluation step demonstrates sophisticated pipeline orchestration by connecting to outputs from two different previous steps. The first input references training_step.properties.ModelArtifacts.S3ModelArtifacts, which provides the S3 location where our training step saved the trained model artifact. The second input references the test data output from our preprocessing step using the same property system you learned in the previous lesson.

This dual-input configuration creates implicit dependencies that ensure the evaluation step won't execute until both the training and preprocessing steps complete successfully. SageMaker automatically manages these dependencies, guaranteeing that our evaluation script will have access to both the trained model and the test data when it runs.

The property_files parameter enables downstream pipeline components to access the evaluation metrics programmatically. This capability becomes crucial when building more sophisticated pipelines that might include conditional deployment based on model performance thresholds or automated model comparison logic.
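As a preview of that capability, the sketch below shows how a later condition could read the MSE value out of the property file; the JSON path and the 0.5 threshold are assumptions tied to the report structure used in the evaluation.py sketch above.

```python
# A sketch of reading a metric from the property file for a future condition.
from sagemaker.workflow.functions import JsonGet
from sagemaker.workflow.conditions import ConditionLessThanOrEqualTo

mse_value = JsonGet(
    step_name=evaluation_step.name,
    property_file=evaluation_report,
    json_path="regression_metrics.mse.value",   # path into evaluation.json (assumed)
)
mse_is_acceptable = ConditionLessThanOrEqualTo(left=mse_value, right=0.5)  # threshold is hypothetical
```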

Integrating the Evaluation Step into the Pipeline

With all steps configured, let's update the pipeline definition so it orchestrates the entire ML workflow:
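A minimal sketch of the updated definition is shown below; the pipeline name and step variable names are assumptions consistent with the earlier snippets.

```python
# A sketch of the three-step pipeline: preprocess -> train -> evaluate.
from sagemaker.workflow.pipeline import Pipeline

pipeline = Pipeline(
    name="CaliforniaHousingPipeline",    # assumed name; reuse your existing pipeline name
    steps=[processing_step, training_step, evaluation_step],
    sagemaker_session=pipeline_session,
)
```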

When this pipeline executes, the following sequence will occur automatically:

  1. Processing step runs first - No dependencies, processes raw data into training/test datasets
  2. Training step executes next - Waits for processing to complete, uses processed training data
  3. Evaluation step runs last - Requires both test data (from processing) and trained model (from training)

During the evaluation step execution, SageMaker will:

  • Download the model artifact from the training step's S3 location to /opt/ml/processing/model/
  • Download the test data from the processing step's S3 location to /opt/ml/processing/test/
  • Execute the evaluation script which extracts the model from its tar.gz archive, loads the test CSV file, generates predictions, and calculates metrics
  • Upload the evaluation report (JSON file with MSE, RMSE, MAE, R² scores) to S3 for future pipeline steps to access

This automatic orchestration eliminates the need for manual dependency management while ensuring reliable, repeatable ML workflows.
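To run the updated workflow, you can register and start it the same way you did when monitoring executions in earlier lessons; a brief sketch (assuming the pipeline and role variables from above) follows.

```python
# A sketch of registering and running the updated pipeline.
pipeline.upsert(role_arn=role)      # create or update the pipeline definition
execution = pipeline.start()        # launch a new execution of all three steps
execution.wait()                    # block until the execution finishes
print(execution.list_steps())       # inspect per-step status, as in the monitoring lesson
```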

Summary & Next Steps

In this lesson, you explored the importance of systematic model evaluation in automated workflows and understood how different regression metrics provide complementary insights into model performance. You examined a complete evaluation script that handles model artifact extraction, test data processing, and comprehensive metric calculation. Most importantly, you learned to configure complex pipeline steps that connect to multiple input sources, demonstrating the sophisticated orchestration capabilities of SageMaker Pipelines.

The evaluation step you built showcases advanced pipeline concepts, including property files for exposing structured data, multi-step dependencies that ensure correct execution order, and systematic metric reporting that supports downstream decision-making. These capabilities form the foundation for even more sophisticated workflows that might include conditional deployment logic, automated model comparison, or performance-based routing decisions.

You're now ready to practice these concepts in the hands-on exercises that follow, where you'll work with the complete three-step pipeline and gain confidence in building comprehensive ML workflows. The skills you've developed represent a significant milestone in your journey toward mastering production ML systems. Your pipeline now embodies the core principles of MLOps: automation, reproducibility, and systematic quality assessment, which are essential for deploying machine learning solutions at scale.
