Welcome back! In the previous lesson, you successfully built your first SageMaker Pipeline with a data preprocessing step. By now, you should have a running pipeline that processes California housing data and saves the results to S3. However, building and executing a pipeline is only part of the story—you also need to monitor its progress, check if it completed successfully, and understand what happened at each step.
In this lesson, you'll learn how to retrieve and examine pipeline executions using SageMaker's monitoring capabilities. You'll discover how to check execution status, view individual step details, and understand execution timing. By the end of this lesson, you'll be able to track your pipeline's progress, diagnose issues when they occur, and gain insights into your workflow's performance—skills that will serve as the foundation for building more sophisticated pipelines in upcoming lessons.
Pipeline execution monitoring is the process of tracking and examining your SageMaker Pipeline runs to understand their status, performance, and outcomes. When you execute a pipeline, SageMaker creates an execution record that contains detailed information about the entire workflow, including when it started, its current status, and comprehensive details about each individual step.
Effective monitoring involves several key aspects:
- Track overall execution status: Know whether your pipeline is still running, has completed successfully, or has encountered an error.
- Understand execution timing: Identify performance bottlenecks and optimize your workflows by analyzing how long each execution and step takes.
- Examine individual step details: See what each component of your pipeline accomplished, how long it took, and whether it succeeded or failed.
- Diagnose and resolve issues: Use detailed execution information to quickly identify and fix problems when they occur.
- Optimize costs and resources: Monitor execution times to spot steps that might be over-provisioned or underperforming.
This monitoring capability is crucial for maintaining robust ML workflows. In production environments, pipelines often run on schedules or are triggered by events, so you need to confirm they complete successfully without manual intervention; when they don't, the same detailed execution information is what lets you identify and resolve the problem quickly and spot steps that are consuming more time or resources than they should.
SageMaker provides comprehensive APIs for accessing this execution information, allowing you to programmatically retrieve and analyze pipeline runs. This programmatic access is particularly valuable because it enables you to build automated monitoring systems or integrate pipeline status checks into your broader ML operations workflows.
To monitor your pipeline, you first need to establish a connection to SageMaker and prepare the necessary components for retrieving execution information. This setup creates the foundation for all monitoring operations you'll perform throughout this lesson.
The first step is establishing a connection to SageMaker through the session and client objects. The `sagemaker_session` provides high-level functionality for common SageMaker operations, while the `sagemaker_client` gives us direct access to the low-level SageMaker API methods we need for detailed monitoring. We also define the pipeline name as a constant, which should match exactly the name you used when creating your pipeline in the previous lesson. This setup ensures we can consistently reference the correct pipeline throughout our monitoring operations.
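A minimal setup might look like the following sketch (the pipeline name here is a placeholder; replace it with the exact name you used):

```python
import boto3
import sagemaker

# High-level session for common SageMaker operations
sagemaker_session = sagemaker.Session()

# Low-level client that exposes the monitoring APIs used in this lesson
sagemaker_client = boto3.client("sagemaker")

# Must match the name you gave the pipeline in the previous lesson
PIPELINE_NAME = "california-housing-pipeline"
```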
With our monitoring environment configured, we can now retrieve information about recent pipeline executions. SageMaker's `list_pipeline_executions` API allows us to query for execution records and sort them to find the most recent runs.
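A sketch of the call, using the `sagemaker_client` and `PIPELINE_NAME` defined above (`MaxResults` is optional and simply limits the response size):

```python
# Query the most recent executions of our pipeline, newest first
response = sagemaker_client.list_pipeline_executions(
    PipelineName=PIPELINE_NAME,
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=5,
)

executions = response["PipelineExecutionSummaries"]
print(f"Found {len(executions)} recent execution(s)")
```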
This API call retrieves pipeline executions with several important parameters that control what information we receive. The `SortBy='CreationTime'` and `SortOrder='Descending'` parameters ensure that the newest execution appears first in the results, giving us the most up-to-date information about our pipeline's status.
Since the executions are sorted with the newest first, we can easily access the latest execution by taking the first item from the list (`executions[0]`). This gives us immediate access to the most recent pipeline run without having to search through multiple execution records.
The `latest_execution` object contains summary information about the most recent pipeline run, including basic status and timing information that we can access directly. Let's extract and examine some basic execution details from it.
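One way to pull out those fields (reusing the `executions` list from the previous snippet):

```python
# The newest execution is first because of the descending sort order
latest_execution = executions[0]

execution_arn = latest_execution["PipelineExecutionArn"]
execution_status = latest_execution["PipelineExecutionStatus"]
start_time = latest_execution["StartTime"]

print(f"Execution ARN: {execution_arn}")
print(f"Status: {execution_status}")
print(f"Start time: {start_time}")
```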
When you run this code, you'll see output similar to:
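(The values below are illustrative placeholders; your ARN, region, and timestamps will differ.)

```text
Execution ARN: arn:aws:sagemaker:us-east-1:123456789012:pipeline/california-housing-pipeline/execution/abc123example
Status: Succeeded
Start time: 2024-05-01 14:02:11+00:00
```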
The execution summary provides some key pieces of information that form the foundation of pipeline monitoring. The three most important fields you'll use regularly are:
- PipelineExecutionArn: A unique identifier that allows you to reference this exact execution when retrieving step details, accessing logs, or performing other operations
- PipelineExecutionStatus: The current state of the execution, which can be:
  - "Executing" (still running)
  - "Succeeded" (completed successfully)
  - "Failed" (encountered an error)
  - "Stopped" (manually terminated)
- StartTime: When the pipeline execution began, providing the baseline for calculating execution duration and understanding timing patterns
These core fields give you immediate insight into your pipeline's current state and serve as the starting point for deeper analysis when needed.
While the execution summary provides basic information, comprehensive monitoring often requires more detailed information about the execution. The `describe_pipeline_execution` API provides additional timing information and metadata that isn't available in the summary.
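A sketch of the call, reusing the `execution_arn` captured from the summary:

```python
# Retrieve the full detail record for this specific execution
execution_details = sagemaker_client.describe_pipeline_execution(
    PipelineExecutionArn=execution_arn
)

print(f"Display name:  {execution_details.get('PipelineExecutionDisplayName')}")
print(f"Status:        {execution_details['PipelineExecutionStatus']}")
print(f"Last modified: {execution_details['LastModifiedTime']}")
```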
The `describe_pipeline_execution` API call uses the execution ARN we extracted from the summary to retrieve comprehensive details about that specific pipeline run. This detailed response includes additional information such as:
- LastModifiedTime: When the execution completed (for finished executions)
- PipelineExecutionDisplayName: A human-readable name for the execution
- PipelineExperimentConfig: Experiment tracking configuration details
- CreatedBy/LastModifiedBy: Information about who created and modified the execution
- PipelineVersionId: The specific version of the pipeline that was executed
In each of these calls, the execution ARN serves as the unique identifier that allows SageMaker to locate and return information about this exact execution instance.
For completed executions, you can calculate the total duration to understand pipeline performance and identify trends over time. The detailed execution information uses `LastModifiedTime` to indicate when the execution completed.
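A simple duration calculation might look like this, using the `start_time` from the execution summary and the `LastModifiedTime` from the detailed response:

```python
# Duration is only meaningful once the execution has finished
if execution_details["PipelineExecutionStatus"] == "Succeeded":
    end_time = execution_details["LastModifiedTime"]
    duration = end_time - start_time  # both are timezone-aware datetimes
    print(f"Total execution time: {duration}")
```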
For a completed execution, you'll see additional output like:
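(Illustrative output; your timing will differ.)

```text
Total execution time: 0:02:33
```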
The duration calculation helps you establish performance baselines for your pipeline. In this example, the pipeline completed successfully and took approximately 2 minutes and 33 seconds to run. By tracking these durations over multiple executions, you can identify performance trends, detect when pipelines are taking longer than expected, and make informed decisions about resource allocation and optimization. The `LastModifiedTime` field is updated whenever the execution status changes, so for completed executions, it represents the actual completion time.
While overall execution status provides a high-level view, examining individual steps gives you detailed insights into your pipeline's internal behavior. SageMaker's `list_pipeline_execution_steps` API allows you to retrieve comprehensive information about every step within a specific execution.
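A sketch of the call, again reusing `execution_arn`:

```python
# Fetch every step that ran as part of this execution
steps_response = sagemaker_client.list_pipeline_execution_steps(
    PipelineExecutionArn=execution_arn
)

steps = steps_response["PipelineExecutionSteps"]
print(f"Pipeline contains {len(steps)} step(s)")
```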
This API call uses the execution ARN we retrieved earlier to fetch all steps associated with that specific pipeline run. The response contains detailed information about each step, including its name, status, timing data, resource usage, and any failure information. This step-level granularity is essential for understanding pipeline behavior, identifying bottlenecks, and diagnosing issues within specific components of your workflow. Each step in the response represents one component of your pipeline definition, maintaining the same order and structure you specified when creating the pipeline.
Now we can iterate through each step and examine its details. This step-by-step analysis provides insights into how your pipeline executes and where potential issues or optimizations might exist.
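The loop might look like the following sketch, working over the `steps` list retrieved above:

```python
for step in steps:
    step_name = step["StepName"]
    step_status = step["StepStatus"]
    # Steps that haven't started yet won't include a StartTime field
    step_start = step.get("StartTime", "Not started")

    print(f"Step: {step_name}")
    print(f"  Status: {step_status}")
    print(f"  Started: {step_start}")
```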
For each step, we extract several key pieces of information that help us understand the step's execution. The step name corresponds exactly to the name you gave the step when defining your pipeline, making it easy to correlate monitoring information with your pipeline definition. The step status indicates the current state of this specific step, which can be "Starting," "Executing," "Succeeded," "Failed," or "Stopped." We use the `get` method with a default value for the start time because steps that haven't started yet won't have this field in their response.
For completed steps, we can calculate execution duration and analyze performance characteristics. This timing information is particularly valuable for identifying bottlenecks and understanding resource utilization patterns across your pipeline.
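A small follow-up sketch that computes per-step durations for steps that have both a start and an end time:

```python
# For finished steps, compute how long each one took
for step in steps:
    if step.get("StartTime") and step.get("EndTime"):
        step_duration = step["EndTime"] - step["StartTime"]
        print(f"{step['StepName']}: {step_duration} ({step['StepStatus']})")
```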
This timing analysis helps you identify performance patterns and potential optimization opportunities. If one step consistently takes significantly longer than others, you might want to optimize that step's code, allocate more computational resources, or consider breaking it into smaller parallel steps. For your preprocessing pipeline, you might see output like:
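(Illustrative output.)

```text
ProcessData: 0:02:32 (Succeeded)
```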
This shows that your "ProcessData" step completed successfully and took about 2 minutes and 32 seconds to process the California housing dataset. By analyzing these durations across multiple executions, you can establish performance baselines, identify trends, and quickly detect when steps are performing outside their normal ranges.
When pipeline steps encounter issues, SageMaker provides detailed failure information that helps you diagnose and resolve problems quickly. This failure information is crucial for maintaining reliable ML workflows.
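A sketch of how you might surface that information from the step list:

```python
# Print the failure reason for any step that did not succeed
for step in steps:
    if step["StepStatus"] == "Failed":
        reason = step.get("FailureReason", "No failure reason provided")
        print(f"Step '{step['StepName']}' failed: {reason}")
```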
The failure reason information is particularly valuable for debugging, as it provides specific error messages that help you quickly identify and resolve issues. Common failure scenarios include insufficient IAM permissions for accessing S3 buckets, memory or disk space limitations during processing, code errors in your processing scripts, or missing input data files. By examining these failure reasons, you can systematically address issues and improve the reliability of your pipeline executions.
In this lesson, you mastered the essential skills for monitoring and examining SageMaker Pipeline executions. You learned how to retrieve pipeline executions, extract basic execution information directly from execution summaries, retrieve detailed execution information using the `describe_pipeline_execution` API, analyze individual step performance, calculate execution durations, and troubleshoot failed executions using detailed failure information. These monitoring capabilities are fundamental for maintaining robust ML workflows in both development and production environments.
The monitoring skills you've developed here will become increasingly valuable as we progress to building more complex pipelines in upcoming lessons. You're now ready to practice these monitoring techniques in the hands-on exercises that follow, where you'll work with real pipeline executions and gain confidence in your ability to track and troubleshoot ML workflows.
