Welcome back! In the previous lessons, you learned how to train and evaluate machine learning models in Amazon SageMaker using the classic Estimator pattern. You now know how to upload data to S3, launch a training job, retrieve the trained model, and evaluate its performance. These are essential skills for any machine learning workflow in the cloud.
As you continue your journey, it is important to know that SageMaker is always evolving. For more advanced and production-ready workflows, SageMaker now offers a new, modular approach to training called ModelTrainer. This lesson will introduce you to ModelTrainer and show you how it builds on what you have already learned.
By the end of this lesson, you will be able to set up and launch a training job using ModelTrainer, retrieve information about your training jobs, and understand the key differences between this modern approach and the classic Estimator pattern. This will prepare you for more sophisticated model development and help you take full advantage of SageMaker's advanced features.
Before diving into the implementation, it's important to understand what ModelTrainer is and how it differs from the Estimator pattern you've been using.
Estimators remain a valuable and practical choice for many machine learning projects. They provide a simple, all-in-one approach where you pass all configuration parameters directly to the constructor: framework version, instance type, entry point, and output path. The Estimator automatically retrieves the appropriate Docker container image based on your framework specifications, making it perfect for straightforward training workflows.
ModelTrainer is designed for when you need more control and flexibility. It uses a modular approach with separate configuration objects:

- `SourceCode` for your training script
- `Compute` for compute resources
- `OutputDataConfig` for the model output location
- `InputData` for training data
A key difference is container handling. With Estimators, you specify `framework_version='1.2-1'` and SageMaker automatically finds the right container. With ModelTrainer, you explicitly retrieve the container image URI first using `sagemaker.image_uris.retrieve()`, giving you precise control over which container is used.
Both approaches have their place. Estimators are excellent for getting started and for straightforward training jobs. ModelTrainer shines when you need better organization for complex projects, want to integrate with SageMaker's advanced features, or require fine-grained control over your training environment.
Before you can launch a training job with ModelTrainer, you need to set up your SageMaker environment and define some important configuration values. This setup builds directly on what you learned with Estimators, with one key addition that ModelTrainer requires.
Just like with Estimators, you start by creating a SageMaker session and gathering the essential AWS information:
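A minimal setup sketch, assuming the `sagemaker` SDK is installed and you are running with AWS credentials (`get_execution_role()` works inside SageMaker notebook environments; elsewhere you would supply an IAM role ARN explicitly):

```python
import sagemaker

# Create a SageMaker session, which wraps the underlying boto3 session.
sagemaker_session = sagemaker.Session()

# IAM role that SageMaker assumes to run the training job. Inside a
# SageMaker notebook or Studio this is discovered automatically; in
# other environments, pass your role ARN as a string instead.
role = sagemaker.get_execution_role()

# The key addition for ModelTrainer: capture the region explicitly.
region = sagemaker_session.boto_region_name
```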
The key addition here is capturing the `region`. While Estimators handled region detection automatically, ModelTrainer needs this information explicitly for retrieving the correct training container images and managing resources across different AWS regions.
Next, you define the same configuration values you used with Estimators, plus one additional parameter that ModelTrainer requires you to specify explicitly:
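A sketch of these configuration values; the bucket name and S3 paths below are illustrative placeholders for your own values:

```python
# Same configuration values as the Estimator setup; replace the
# bucket and paths with your own.
S3_BUCKET = "your-sagemaker-bucket"
S3_TRAIN_DATA_URI = f"s3://{S3_BUCKET}/data/train"
MODEL_OUTPUT_PATH = f"s3://{S3_BUCKET}/models"

INSTANCE_TYPE = "ml.m5.large"
INSTANCE_COUNT = 1

# New with ModelTrainer: the storage volume size must be set explicitly.
VOLUME_SIZE_GB = 30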
The main difference from your Estimator setup is the addition of `VOLUME_SIZE_GB`. With Estimators, the storage volume size was handled automatically based on your instance type. ModelTrainer's modular approach gives you more control over the compute configuration, but requires you to specify these details explicitly. This extra control becomes valuable when you need to optimize storage for large datasets or specific training requirements.
With your environment configured, you now need to specify which Docker container SageMaker should use to run your training code. This is where you'll see one of the key differences in how ModelTrainer handles container management compared to Estimators.
Remember that with Estimators, you simply passed `framework_version='1.2-1'` to the SKLearn constructor, and it automatically found the right container. ModelTrainer takes a more explicit approach, requiring you to retrieve the container image URI first:
This explicit approach gives you precise control over which container is used and makes it easier to manage different framework versions across complex projects. The `region` parameter you captured earlier is essential here, as container images are region-specific.
Next, you need to tell ModelTrainer where to find your training code and which script to run. Instead of passing these parameters directly to a constructor like you did with Estimators, ModelTrainer uses a dedicated `SourceCode` configuration object:
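A sketch using the modular configs API (the `sagemaker.modules` import path assumes a recent SDK version; the directory and script names are illustrative):

```python
from sagemaker.modules.configs import SourceCode

# Point ModelTrainer at the local directory containing the training
# code and the script to execute, the same values you would have
# passed to the Estimator constructor.
source_code = SourceCode(
    source_dir="code",
    entry_script="train.py",
)
```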
Compare this to the Estimator approach, where you passed `entry_point='train.py'` directly to the SKLearn constructor. ModelTrainer's modular design separates each concern into its own configuration object, making your setup more organized and easier to maintain as your projects grow in complexity.
Now you define the compute resources for your training job using a `Compute` configuration object. This consolidates several parameters that were previously passed individually to the Estimator constructor:
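A sketch, reusing the constants defined during setup:

```python
from sagemaker.modules.configs import Compute

# Consolidates the instance settings that Estimators took as
# individual constructor arguments, plus the explicit volume size.
compute = Compute(
    instance_type=INSTANCE_TYPE,
    instance_count=INSTANCE_COUNT,
    volume_size_in_gb=VOLUME_SIZE_GB,
)
```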
With Estimators, you passed `instance_type` and `instance_count` directly to the constructor, and the volume size was handled automatically. ModelTrainer requires you to explicitly specify the `volume_size_in_gb` using the constant you defined earlier. This gives you more control over your training environment, which becomes important when working with large datasets or optimizing costs.
Finally, you specify where SageMaker should save your trained model using an `OutputDataConfig` object:
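A minimal sketch:

```python
from sagemaker.modules.configs import OutputDataConfig

# S3 location where SageMaker uploads the trained model artifacts.
output_config = OutputDataConfig(
    s3_output_path=MODEL_OUTPUT_PATH,
)
```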
With Estimators, you simply passed `output_path=MODEL_OUTPUT_PATH` to the constructor. ModelTrainer wraps this in its own configuration object, maintaining the consistent modular approach. All these configuration objects will be passed to the ModelTrainer in the next step, replacing the single constructor call you used with the SKLearn Estimator while providing much more flexibility and organization.
Now that you have all your configurations in place, you are ready to create your ModelTrainer instance. This step brings together all the configuration objects you created earlier:
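A sketch that assembles the objects from the previous steps; exact constructor arguments may vary slightly between SDK releases:

```python
from sagemaker.modules.train import ModelTrainer

# Bring together the container image and configuration objects
# defined in the earlier steps.
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    compute=compute,
    output_data_config=output_config,
    base_job_name="sklearn-modeltrainer",
    role=role,
    sagemaker_session=sagemaker_session,
)
```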
The `base_job_name` parameter defines a prefix for your training job. SageMaker will automatically append a timestamp to create a unique job name (like `sklearn-modeltrainer-20250722113535`). Using a descriptive base name makes it easier to find and track your training jobs later, especially when running multiple experiments.
When you run this code, you may see several warnings and informational messages; this is completely normal and expected.
Next, you need to specify the input data for your training job. ModelTrainer uses an `InputData` configuration object to define where your training data is located and how it should be made available to your training script:
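A sketch, assuming the `S3_TRAIN_DATA_URI` constant from setup:

```python
from sagemaker.modules.configs import InputData

# A named "train" channel; inside the training container, this data
# will appear at /opt/ml/input/data/train.
input_data = [
    InputData(
        channel_name="train",
        data_source=S3_TRAIN_DATA_URI,
    ),
]
```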
The `channel_name` parameter creates a named channel for your data. When SageMaker runs your training job, it downloads the data from S3 and makes it available to your training script at `/opt/ml/input/data/train` (using the channel name as the subdirectory). This is the same mechanism you used with Estimators when you passed `{'train': S3_TRAIN_DATA_URI}` to the `fit()` method.
Notice that `input_data` is a list, which allows you to define multiple input channels if needed. For example, you could add a validation dataset with `channel_name="validation"` that would be available at `/opt/ml/input/data/validation`. This flexibility becomes important when your models require separate training and validation datasets, or when you need to pass additional reference data to your training script.
With your ModelTrainer instance created and input data configured, you can now launch your training job. The `train()` method can start the job either asynchronously or synchronously, depending on the `wait` parameter you provide. If you set `wait=False`, the method returns immediately and the job runs asynchronously in the background. If you set `wait=True` (the default), the method blocks and displays the training logs until the job completes.
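A sketch of launching the job asynchronously, assuming the `model_trainer` and `input_data` objects from the previous steps:

```python
# Launch the training job without blocking; the call returns
# immediately while the job continues running in SageMaker.
model_trainer.train(
    input_data_config=input_data,
    wait=False,
)
```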
When you run this code, you will see additional warnings and informational messages.
Again, these warnings are normal and expected. The `key_prefix` warning appears because you're using S3 data sources rather than local files. The region and config warnings indicate that ModelTrainer is using default AWS settings. The final warning confirms that since you set `wait=False`, the method returns immediately without displaying training logs.
The key message to look for is "Creating training_job resource", which confirms that your training job has been submitted to SageMaker and is now running in the cloud.
After launching your training job, you often want to check its status or retrieve its results. With ModelTrainer, you can access information about the most recent training job directly from the object itself.
Once you call `train()`, ModelTrainer keeps a reference to the latest training job. You can access this using the `_latest_training_job` attribute. For example:
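A sketch; the attribute names on the returned job object are an assumption and may differ between SDK versions:

```python
# The trainer keeps a handle to the most recent job after train().
job = model_trainer._latest_training_job

# Attribute names assumed here; check your SDK version's job resource.
print(job.training_job_name)
print(job.training_job_status)
```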
This will print out the name and current status of your training job.
This is especially useful when you are running multiple jobs or want to automate your workflow. The `_latest_training_job` attribute gives you immediate access to the job details without needing to query SageMaker separately. Note that the leading underscore marks it as an internal attribute, so it may change in future SDK versions.
Once your training job completes, you'll need to retrieve the location of your trained model artifacts for evaluation or deployment. While Estimators provided an `attach()` method to reconnect to completed jobs, ModelTrainer uses the SageMaker session to query job details.
Here's how to find your latest completed training job and get the model artifacts location:
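A sketch using the boto3 SageMaker client; the `NameContains` filter assumes the `sklearn-modeltrainer` base job name used earlier:

```python
import boto3

sm_client = boto3.client("sagemaker")

# Find the most recently created job matching our base job name.
response = sm_client.list_training_jobs(
    NameContains="sklearn-modeltrainer",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)
job_name = response["TrainingJobSummaries"][0]["TrainingJobName"]

# Full job details, including the model artifacts location.
details = sm_client.describe_training_job(TrainingJobName=job_name)
model_s3_uri = details["ModelArtifacts"]["S3ModelArtifacts"]
print(model_s3_uri)
```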
The key difference from the Estimator pattern is that instead of using `SKLearn.attach()` and then accessing `estimator.model_data`, you use the SageMaker client's `list_training_jobs()` to find your job and `describe_training_job()` to get its details. The model artifacts location is stored in the same S3 path you specified in your `OutputDataConfig`, with the job name appended.
This approach gives you full access to all training job information, including metrics, logs, and configuration details. You can then use this `model_s3_uri` to download and evaluate your model locally, just as you did with Estimators in previous lessons.
In this lesson, you learned how to set up and launch a training job using SageMaker’s ModelTrainer. You saw how to configure your environment, define your training script and compute resources, and start an asynchronous training job. You also learned how to retrieve information about your training job and how ModelTrainer differs from the classic Estimator pattern.
These skills will help you build more advanced and flexible machine learning workflows in SageMaker. In the next exercises, you will have the opportunity to practice these techniques and deepen your understanding. Congratulations on reaching this stage — moving from basic to advanced training patterns is a big step toward mastering machine learning in the cloud!
