Welcome back! In the previous lessons, you've built a solid foundation for working with SageMaker and successfully uploaded your training data to S3. Now you're ready to take the next exciting step: actually training your first machine learning model in the cloud.
A training job is simply the process of running your machine learning code on cloud infrastructure to create a trained model. Instead of running your training script on your local computer, SageMaker executes it on powerful AWS servers, automatically managing all the infrastructure details like spinning up compute resources, downloading your data, and saving your results back to S3.
In this lesson, you'll learn how to use SageMaker estimators to launch training jobs. Since we've been working with scikit-learn for our machine learning code, we'll focus on the SKLearn estimator. Of course, if you're more comfortable with TensorFlow, PyTorch, or other frameworks, SageMaker has dedicated estimators for those too, following the same patterns you'll master here.
By the end of this lesson, you'll have launched your first SageMaker training job, configured compute resources, and monitored the training process from start to finish. This represents a major milestone in your machine learning journey, as you'll be running real training jobs in AWS's cloud infrastructure using the same scikit-learn skills you've already developed.
An estimator in SageMaker is a high-level interface that handles the complexity of launching and managing training jobs in the cloud. Think of an estimator as our control center for training — it knows how to package our code, spin up the right computing resources, run our training script, and save the results back to S3.
SageMaker provides several types of estimators to match different machine learning frameworks and use cases, including framework-specific estimators like:
- SKLearn (for scikit-learn)
- TensorFlow
- PyTorch
- XGBoost
- And many others
Additionally, SageMaker offers generic estimators for custom Docker containers. Each estimator is optimized for its respective framework, providing the right environment and dependencies out of the box.
For this lesson, we'll focus on the SKLearn estimator since our machine learning code was developed using the popular scikit-learn library. However, if you prefer to work with other frameworks like TensorFlow or PyTorch, SageMaker offers dedicated estimators that follow the same patterns we'll learn here.
The SKLearn estimator is specifically designed for training machine learning models using scikit-learn. What makes it particularly powerful is that it allows us to bring our own custom training code while SageMaker handles all the infrastructure management. Instead of being limited to pre-built algorithms, we can write our training logic exactly how we want it and let SageMaker execute it at scale.
Behind the scenes, when we use the SKLearn estimator, SageMaker essentially creates a containerized environment with scikit-learn and Python pre-installed, similar to how we might use a Docker image with our dependencies already configured. However, we don't need to worry about container management, image building, or orchestration — SageMaker handles all of that complexity for us. It automatically sets up the training environment, runs our custom Python script inside that environment, and manages all the cloud resources needed for training.
This gives us the flexibility of custom code with the power and convenience of cloud infrastructure. We get the best of both worlds: complete control over our machine learning logic and the ability to scale our training jobs without managing servers or containers ourselves.
Before configuring SageMaker to run our training job, we need to create the training script that contains our machine learning logic. This script needs to be adapted to work within SageMaker's environment while maintaining the familiar machine learning patterns we already know.
In this case, we'll name our training script `train.py`. This file will contain our custom machine learning logic, and while the core concepts remain familiar, the script needs to be specifically adapted to work within SageMaker's environment:
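Here's a minimal sketch of how `train.py` might start, showing the imports and the `model_fn` loader. The choice of `joblib`, `RandomForestClassifier`, and the `model.joblib` filename are illustrative assumptions; your script only needs `model_fn` to load whatever artifact the training code saves:

```python
# train.py (sketch) -- the imports and the model-loading function SageMaker
# calls when the model is later deployed for inference.
import argparse
import os

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def model_fn(model_dir):
    """Tell SageMaker how to reconstruct the trained model from saved files."""
    return joblib.load(os.path.join(model_dir, "model.joblib"))
```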
The `model_fn` function is a special function that SageMaker looks for when we later want to deploy our model for making predictions. This function tells SageMaker how to load our saved model from the artifacts directory. Even though we're not deploying the model in this lesson, including this function makes our training script deployment-ready for future use. SageMaker will call this function automatically when setting up inference endpoints, so it needs to know exactly how to reconstruct our trained model from the saved files.
The script uses a specific pattern to receive information from SageMaker about where to find data and where to save results:
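Continuing the same `train.py`, a sketch of that pattern might look like this (the argument names are illustrative, but `SM_MODEL_DIR` and `SM_CHANNEL_TRAIN` are the environment variables SageMaker actually populates):

```python
# Still in train.py: parse the locations SageMaker provides at runtime.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Directory where the trained model should be saved (uploaded to S3 afterwards).
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    # Directory where SageMaker has downloaded the training data from S3.
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    args = parser.parse_args()
```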
The `if __name__ == '__main__':` block is essential because SageMaker will import our script as a module when it needs to use the `model_fn` function for inference. Without this guard, all our training code would execute every time SageMaker imports the script, which isn't what we want. The guard ensures that the training logic only runs when the script is executed directly during the training job.
The key difference from standalone machine learning scripts is how it receives information about data locations and where to save results. Instead of hardcoding file paths, the script uses `argparse` to read arguments that SageMaker automatically provides through environment variables:
- `SM_MODEL_DIR` — SageMaker sets this environment variable to point to a directory (like `/opt/ml/model/`) on the training instance where our script should save the trained model. After training completes, SageMaker automatically uploads everything in this directory to S3 as model artifacts.
- `SM_CHANNEL_TRAIN` — SageMaker sets this to point to a directory (like `/opt/ml/input/data/train/`) on the training instance where it has already downloaded our training data from S3. The "TRAIN" part must match the channel name we'll specify later when we launch the training job.
The actual training logic within the `if __name__ == '__main__':` block follows familiar machine learning patterns, but adapted to work with SageMaker's file system:
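A sketch of that logic, continuing inside the `if __name__ == '__main__':` block. The `train.csv` filename, the `target` column, and the `RandomForestClassifier` model are assumptions; swap in your own dataset layout and algorithm:

```python
    # Loading data: read the CSV that SageMaker placed in the train channel.
    train_df = pd.read_csv(os.path.join(args.train, "train.csv"))
    X_train = train_df.drop(columns=["target"])
    y_train = train_df["target"]

    # Training the model: the familiar scikit-learn create-and-fit pattern.
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Evaluating performance: a simple accuracy check on the training data.
    accuracy = accuracy_score(y_train, model.predict(X_train))
    print(f"Training accuracy: {accuracy:.3f}")

    # Saving results: write the model where SageMaker expects to find it.
    joblib.dump(model, os.path.join(args.model_dir, "model.joblib"))
```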
This section contains the familiar machine learning workflow we've seen before, but with key adaptations for the SageMaker environment:
- Loading data — Uses the parsed `args.train` parameter to find where SageMaker downloaded our training data from S3
- Training the model — Follows the standard scikit-learn pattern of creating a model instance and calling `fit()`
- Evaluating performance — Calculates metrics to understand how well our model learned from the training data
- Saving results — Uses `args.model_dir` to save the trained model where SageMaker expects to find it
The main difference from standalone machine learning scripts is using the parsed arguments to determine file paths rather than hardcoded locations. SageMaker automatically downloads our training data from S3 to the local file system and provides the path through the `args.train` parameter. Similarly, `args.model_dir` points to where SageMaker expects us to save our trained model artifacts, which it will then automatically upload back to S3 when training completes.
Now that we have our training script ready, we need to create a separate file (let's call it `main.py`) to configure SageMaker to run our `train.py` script. This separation is important: our `train.py` contains the machine learning logic that will run on SageMaker's training instances, while this new code will run locally to set up and launch the training job.
We need to establish our connection to SageMaker and define the basic resources we'll be working with. Since we've already worked with SageMaker sessions in previous lessons, this should feel familiar:
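A minimal sketch of that setup in `main.py` (the variable names are the ones referenced throughout the rest of this lesson):

```python
# main.py (sketch) -- runs locally to configure and launch the training job.
import boto3
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

# Connect to SageMaker and grab the default S3 bucket for this account and region.
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()
```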
These first few lines establish our connection to SageMaker and retrieve essential information about our AWS environment. The `sagemaker_session` object will handle all communication with the SageMaker service, while the `default_bucket` gives us access to the S3 bucket where our training data is already stored from the previous lesson.
Next, we'll specify where SageMaker can find our training data and where it should save the trained model artifacts:
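For example, assuming the training file was uploaded under a `data/` prefix in the previous lesson (adjust the key to match wherever your data actually lives):

```python
# Where the training data already sits, and where model artifacts should go.
S3_TRAIN_DATA_URI = f"s3://{default_bucket}/data/train.csv"
MODEL_OUTPUT_PATH = f"s3://{default_bucket}/models"
```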
The `S3_TRAIN_DATA_URI` points to the exact location of our training data that we uploaded in the previous lesson. The `MODEL_OUTPUT_PATH` tells SageMaker where to save our trained model after the training job completes. By organizing our models in a dedicated folder structure, we'll be able to easily manage multiple training experiments.
Now we'll define the computing resources that SageMaker will use to run our training job:
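A sketch using the constants referenced below:

```python
# Compute resources for the training job: one general-purpose instance.
INSTANCE_TYPE = "ml.m5.large"
INSTANCE_COUNT = 1
```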
The `INSTANCE_TYPE` parameter specifies what kind of computing power we want for training. The `"ml.m5.large"` instance type provides a good balance of CPU and memory for most machine learning tasks, making it perfect for getting started. We can think of this as choosing the "size" of the virtual machine that will run our training code.
SageMaker offers various instance types optimized for different workloads:
- General purpose instances (like `ml.m5.large`, `ml.m5.xlarge`) provide balanced CPU and memory for typical machine learning tasks
- Compute optimized instances (like `ml.c5.xlarge`, `ml.c5.2xlarge`) offer high-performance processors for CPU-intensive algorithms
- Memory optimized instances (like `ml.r5.large`, `ml.r5.xlarge`) provide more RAM for algorithms that need to load large datasets into memory
- GPU instances (like `ml.p3.2xlarge`, `ml.g4dn.xlarge`) accelerate deep learning training with powerful graphics processors
For this lesson with scikit-learn and a relatively small dataset, the general purpose `ml.m5.large` instance is ideal and cost-effective.
Finally, we need to configure the permissions that allow SageMaker to access our AWS resources. You might wonder: "Why do we need special permissions? We're already logged into AWS!" The answer is that when SageMaker runs our training job, it's not running as "us" — it's running as a separate service that needs its own permissions to access our data.
First, we'll retrieve our AWS account information:
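A sketch using `boto3`'s STS client:

```python
# Ask STS which AWS account we're operating in (returns the 12-digit account ID).
account_id = boto3.client("sts").get_caller_identity()["Account"]
```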
This line uses AWS's Security Token Service (STS) to retrieve our unique 12-digit account ID. Every AWS account has this unique identifier, and SageMaker needs it to construct the proper Amazon Resource Name (ARN) for our execution role.
Now we can set up the SageMaker execution role using our account ID:
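A sketch; the role name below is a placeholder, so substitute the execution role that was created when you set up your SageMaker environment:

```python
# Build the full ARN of the SageMaker execution role from the account ID.
# "SageMakerExecutionRole" is a placeholder role name.
SAGEMAKER_ROLE = f"arn:aws:iam::{account_id}:role/SageMakerExecutionRole"
```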
The `SAGEMAKER_ROLE` is like giving SageMaker a "key" to access our AWS resources. Think of it this way: when we run code on our laptop, it has access to our files because we're the ones running it. But when SageMaker runs our training code in the cloud, it's running on Amazon's computers, not ours. The execution role tells AWS "it's okay for SageMaker to read our training data from S3 and save our model back to S3 on our behalf."
This role was automatically created when we set up our SageMaker environment and includes permissions to read from and write to our S3 buckets. Without this role, SageMaker would be like a helpful assistant who wants to organize our files but doesn't have permission to open our filing cabinet.
Now we're ready to create the SKLearn estimator that will bring together all the configuration parameters we've just defined. The estimator acts as the bridge between our setup (the session, data locations, compute resources, and permissions) and our actual training execution:
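A sketch of the estimator configuration, pulling together the pieces defined above (the parameter values are the ones discussed in this lesson):

```python
# The SKLearn estimator packages train.py and runs it on managed infrastructure.
sklearn_estimator = SKLearn(
    entry_point="train.py",           # our custom training script
    role=SAGEMAKER_ROLE,              # lets SageMaker read/write our S3 data
    instance_type=INSTANCE_TYPE,      # the "size" of the training machine
    instance_count=INSTANCE_COUNT,    # a single instance for this lesson
    framework_version="1.2-1",        # scikit-learn version inside the container
    output_path=MODEL_OUTPUT_PATH,    # where model artifacts are uploaded
    sagemaker_session=sagemaker_session,
)
```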
Let's break down all the key parameters that configure our SKLearn estimator:
- `entry_point` — Specifies the Python file containing our training logic
- `role` — The IAM role that grants SageMaker permission to access our AWS resources like S3 buckets
- `instance_type` — Defines the type of EC2 instance (virtual machine) that will run our training job
- `instance_count` — Sets the number of instances to use for training (we'll start with 1 for learning)
- `framework_version` — Ensures we're using a specific version of scikit-learn, helping maintain consistency across different training runs. (At the time this course was developed, `'1.2-1'` is the latest version available in SageMaker, even though newer versions may exist on PyPI or elsewhere.)
Now that we have our estimator configured, it's time to actually start the training process. When we call the `fit()` method, we need to tell SageMaker where to find our training data. We do this by passing a dictionary that maps data channel names to S3 locations. In our case, we're using the channel name `'train'` and pointing it to our training data URI in S3.
The `fit()` method can run in two different modes:
Synchronous mode (default):
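For example, a call like this blocks and streams the training logs to your console until the job finishes:

```python
# Blocks until training completes, streaming logs as it runs.
sklearn_estimator.fit({"train": S3_TRAIN_DATA_URI})
```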
Asynchronous mode:
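With `wait=False`, the same call returns as soon as the job has been submitted:

```python
# Returns immediately; the job keeps running in the cloud.
sklearn_estimator.fit({"train": S3_TRAIN_DATA_URI}, wait=False)
```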
For this lesson, we'll use the asynchronous approach with `wait=False`. This is important because training jobs can take anywhere from a few minutes to several hours, and we don't want to lock up our terminal or notebook for the entire duration. By running asynchronously, the training continues in the background on AWS infrastructure while we can continue working, monitor progress, or even launch additional jobs in parallel.
The training job is now running in the background, and we can immediately start tracking its progress and gathering information about it.
Once the training job starts, you'll want to know its unique name so you can track it later. Every SageMaker training job gets a unique identifier that includes a timestamp, making it easy to distinguish between different training runs.
To retrieve and display your training job's name, you can access it through the estimator's latest training job object:
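A sketch, assuming the estimator variable from earlier:

```python
# The unique, timestamped name SageMaker assigned to this training job.
training_job_name = sklearn_estimator.latest_training_job.name
print(f"Training job name: {training_job_name}")
```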
This will output something like:
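```text
Training job name: sagemaker-scikit-learn-2025-07-22-08-42-55-894
```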
Notice how SageMaker automatically generated this name following a specific template for estimator-based training jobs:
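```text
sagemaker-{framework}-{year}-{month}-{day}-{hour}-{minute}-{second}-{milliseconds}
```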
Looking at our example `sagemaker-scikit-learn-2025-07-22-08-42-55-894`:
- `sagemaker` — A fixed prefix for all estimator-based training jobs
- `scikit-learn` — The framework specified by your estimator (would be `tensorflow`, `pytorch`, `xgboost`, etc. for other framework estimators)
- `2025-07-22` — The date when the job was created (year-month-day)
- `08-42-55` — The time when the job was created (hour-minute-second in UTC)
- `894` — Additional milliseconds to ensure uniqueness
Now that your training job is running, you'll want to know what's happening with it. Is it still starting up? Is it actively training? Has it finished? You can check the current status at any time:
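One way to check it, sketched with the estimator's latest training job (the `describe()` call returns the same details as the DescribeTrainingJob API):

```python
# Query the current state of the most recent training job.
job_status = sklearn_estimator.latest_training_job.describe()["TrainingJobStatus"]
print(f"Training job status: {job_status}")
```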
This will show you the current state:
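```text
Training job status: InProgress
```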
The most common statuses you'll see are:
- `InProgress` — The job is currently running (either starting up or actively training)
- `Completed` — The job finished successfully
- `Failed` — Something went wrong during training
- `Stopping` — The job is in the process of being stopped
When your training job completes, it needs somewhere to save the results. In machine learning, we call these results artifacts — they're simply the files produced by your training script. The most important artifact is your trained model itself, but there might also be logs, evaluation metrics, or other files your script creates.
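You can print that planned location straight from the estimator, as sketched below:

```python
# The S3 prefix where SageMaker will upload the model artifacts after training.
print(f"Model artifacts will be saved to: {sklearn_estimator.output_path}")
```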
This shows you where SageMaker will store your results:
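```text
Model artifacts will be saved to: s3://<your-default-bucket>/models
```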
It's important to understand that this is showing you the planned destination for your model artifacts — the location we configured when we created our estimator. The trained model isn't saved there yet since our training job is still running. This S3 path represents where SageMaker will automatically upload and store your trained model once the training job successfully completes.
Think of this as reserving a specific folder in S3 for your results. SageMaker will handle all the uploading and organizing automatically when training finishes. Later, when you want to use your trained model for predictions or deploy it as a web service, you'll reference this location to retrieve your completed model.
This approach gives you complete flexibility: you can launch a job, get its tracking information, and then check back later to see if it's finished and retrieve your trained model from the specified S3 location.
Congratulations! You learned how to create a custom training script that works within SageMaker's environment, then configure and use the SKLearn estimator to train models with your own custom code. You discovered how to set up compute resources, specify training scripts, and monitor job execution in a cloud environment. You also saw how SageMaker handles the complex orchestration of spinning up resources, running your code, and storing the results.
The skills you've developed here form the foundation for more advanced SageMaker workflows. You now understand the core pattern of cloud-based machine learning: prepare your training script, prepare your data in S3, configure an estimator, launch the job, and collect the results. This same pattern scales from simple experiments to production-grade machine learning systems.
In the upcoming practice exercises, you'll have the opportunity to apply these concepts hands-on. You'll build confidence in launching training jobs and interpreting results, preparing you for more sophisticated machine learning challenges ahead. The cloud-based training capabilities you've just mastered open up exciting possibilities for scaling your machine learning work beyond what's possible on a single machine.
