Welcome to your first lesson on deploying models with SageMaker! You've already built a solid foundation by training machine learning models both locally and in the cloud using SageMaker. Now it's time to take the next crucial step in your machine learning journey: making your models available for real-world predictions through deployment.
In this lesson, you'll learn how to deploy a model that you trained on your local machine to a SageMaker endpoint. This is an excellent starting point because you can use the familiar models you've already created while learning the essential deployment concepts that apply to all SageMaker deployments.
By the end of this lesson, you'll understand how to package your local model artifacts, upload them to Amazon S3, configure a SageMaker deployment environment, and create a working endpoint that can serve predictions. This foundational knowledge will prepare you for more advanced deployment scenarios later in the course.
Before we dive into deployment, let's understand what a SageMaker endpoint actually is. Think of an endpoint as a web service that hosts your trained model and can answer prediction requests. It's like having a smart assistant that you can ask questions about your data, and it responds with predictions based on what your model learned during training.
SageMaker offers two main types of endpoints, and choosing the right one depends on your specific needs:
Real-time endpoints work like having a dedicated server that's always running and ready to serve your model. You choose exactly what type of computer (instance type) and how many computers you want, and AWS keeps them running 24/7 just for your model. This gives you very predictable performance - your model will always respond quickly because the resources are always there waiting. However, since these servers run continuously, you pay for them even when nobody is making prediction requests. Think of it like renting a dedicated office space that you pay for whether you're using it or not.
Serverless endpoints are like having a smart assistant that only appears when you need it. You don't need to worry about what type of computer to use or how many; AWS figures all of that out automatically. When someone makes a prediction request, AWS automatically provisions the computing power needed. When there are no requests, everything scales down to zero and you don't pay anything. It's like having a taxi service where you only pay when you actually take a ride.
For this lesson, we'll use serverless endpoints because they're much simpler to get started with. You don't need to make decisions about server types or worry about costs when you're not using the endpoint. This lets us focus on learning the core concepts of model deployment without getting distracted by infrastructure details. Plus, serverless endpoints are perfect for learning and experimentation since they automatically handle all the technical complexity behind the scenes.
To deploy your locally trained model to SageMaker, you need to package it in a format that SageMaker understands. SageMaker expects model artifacts to be stored in a compressed `tar.gz` file, which is essentially a zipped archive containing your trained model files.
Let's start by creating the model package:
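A minimal sketch, assuming your trained model was saved locally as `trained_model.joblib`:

```python
import tarfile

# Create a gzip-compressed tar archive containing the model artifact
with tarfile.open('model.tar.gz', 'w:gz') as tar:
    tar.add('trained_model.joblib')
```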
The key part here is the `tarfile.open('model.tar.gz', 'w:gz')` line, which creates a new compressed archive file. The `'w:gz'` parameter tells Python to create a writable, gzip-compressed tar file. Then, `tar.add('trained_model.joblib')` adds your trained model file to the archive.
The important thing to understand is that whatever files your model needs to make predictions must be included in this `tar.gz` archive. If your model depends on additional files like preprocessing pipelines, feature encoders, or configuration files, you would add those to the archive as well.
The resulting `model.tar.gz` file is now in the exact format that SageMaker expects. This standardized packaging approach ensures that SageMaker can properly extract and load your model artifacts during deployment, regardless of which machine learning framework you used for training.
Once your model is packaged in the correct format, the next step is to upload it to Amazon S3, where SageMaker can access it during deployment:
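A sketch of the upload step, assuming the SageMaker Python SDK is installed and your AWS credentials are configured (the print statement is just for convenience):

```python
import sagemaker

# Create a SageMaker session and look up the account's default bucket
sagemaker_session = sagemaker.Session()
default_bucket = sagemaker_session.default_bucket()

# Upload the packaged model to S3; the returned value is the full S3 URI
model_s3_uri = sagemaker_session.upload_data(
    path='model.tar.gz',
    bucket=default_bucket,
    key_prefix='models/local-trained'
)
print(f'Model uploaded to: {model_s3_uri}')
```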
The `sagemaker_session.upload_data()` method handles all the complexity of uploading your file to S3. Let's break down the parameters:
- `path='model.tar.gz'` specifies the local file you want to upload
- `bucket=default_bucket` tells SageMaker to use your account's default SageMaker bucket
- `key_prefix='models/local-trained'` creates an organized folder structure in S3 where your model will be stored
When you run this code, you'll see output similar to this:
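The default SageMaker bucket is named after your region and account ID, so the printed URI will look roughly like the line below (the region and account number are placeholders):

```text
Model uploaded to: s3://sagemaker-us-east-1-123456789012/models/local-trained/model.tar.gz
```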
This S3 URI is crucial because you'll need it in the next step to tell SageMaker where to find your model artifacts. The URI follows the format `s3://bucket-name/key-prefix/filename`, and SageMaker will use this exact location to download and extract your model during deployment.
Now that your model artifacts are stored in S3, you need to configure the SageMaker deployment environment. Let's start by setting up the necessary configuration variables:
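One way to define them, sketched below; the role name `SageMakerRole` and the endpoint name are illustrative placeholders, so substitute the execution role and naming convention from your own account:

```python
import boto3

# Look up the current AWS account ID from the active session
account_id = boto3.client('sts').get_caller_identity()['Account']

# IAM role ARN that grants SageMaker permission to read the model from S3
# and create endpoint resources (assumes a role named 'SageMakerRole' exists)
SAGEMAKER_ROLE = f'arn:aws:iam::{account_id}:role/SageMakerRole'

# Name for the endpoint; must be unique within your account and region
ENDPOINT_NAME = 'local-sklearn-model-endpoint'
```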
The `account_id` is retrieved automatically from your AWS session and is used to construct the `SAGEMAKER_ROLE` ARN, which grants SageMaker the necessary permissions for deployment. The `ENDPOINT_NAME` variable specifies the name of your SageMaker endpoint; choose a descriptive name that is unique within your AWS account. If you deploy multiple models, simply change this value to avoid naming conflicts.
Before we can deploy our model, we need to create an entry point script that tells SageMaker how to load and use your model for predictions. This script acts as the bridge between SageMaker's inference infrastructure and your specific model.
Create a file called `entry_point.py` with the following content:
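A minimal version, assuming the archive contains the `trained_model.joblib` file we packaged earlier, would be:

```python
# entry_point.py
import os
import joblib


def model_fn(model_dir):
    """Load the model from the directory where SageMaker extracted model.tar.gz."""
    model_path = os.path.join(model_dir, 'trained_model.joblib')
    return joblib.load(model_path)
```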
The `model_fn()` function is a SageMaker convention that defines how to load your model from the extracted `tar.gz` archive. When SageMaker starts your endpoint's inference container, it automatically extracts your model artifacts into the `model_dir` directory and calls this function to load the model into memory. The function should return the loaded model object that will be used for making predictions.
This entry point script is essential because SageMaker needs to know exactly how to initialize your model. Different machine learning frameworks and model types may require different loading procedures, so this script gives you complete control over the model loading process.
With our entry point script ready, we can now create the SageMaker model object that combines your model artifacts with the inference logic. This object serves as a blueprint that tells SageMaker everything it needs to know about your model:
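A sketch of the model object; the `framework_version` and `py_version` values are illustrative and should match the scikit-learn version you trained with:

```python
from sagemaker.sklearn.model import SKLearnModel

# Combine the S3 artifacts with the inference script into a deployable model
sklearn_model = SKLearnModel(
    model_data=model_s3_uri,          # S3 URI returned by upload_data()
    role=SAGEMAKER_ROLE,              # execution role defined earlier
    entry_point='entry_point.py',     # script containing model_fn()
    framework_version='1.2-1',        # illustrative scikit-learn container version
    py_version='py3',
    sagemaker_session=sagemaker_session,
)
```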
The `SKLearnModel` class is specifically designed for scikit-learn models. The `model_data` parameter points to your S3 model artifacts, while `entry_point='entry_point.py'` specifies the Python script we just created that defines how to load and use your model for predictions. The `framework_version` and `py_version` parameters ensure that SageMaker uses the correct software environment to run your model.
This model object is now ready for deployment, but first we need to configure the serverless inference settings that will determine how much computing power your endpoint will have available.
Now that we have our model object configured, we need to specify the resource requirements for our serverless endpoint. These settings determine how much memory your model will have access to and how many simultaneous requests it can handle:
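A minimal configuration matching the values discussed below:

```python
from sagemaker.serverless import ServerlessInferenceConfig

# 2 GB of memory per invocation, at most 10 requests served concurrently
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=2048,
    max_concurrency=10,
)
```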
The `ServerlessInferenceConfig` allows you to specify resource requirements for your endpoint. The `memory_size_in_mb=2048` setting allocates 2 GB of memory, which is typically sufficient for most scikit-learn models, and `max_concurrency=10` limits the number of simultaneous inference requests to control costs and prevent overwhelming your model.
With your model configured, you're ready to deploy it to a live SageMaker endpoint. The deployment process creates the necessary AWS infrastructure and makes your model available for real-time predictions:
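A sketch of the deployment call, reusing the model object, serverless configuration, and endpoint name from the earlier steps:

```python
# Create the endpoint asynchronously; the call returns immediately
sklearn_model.deploy(
    serverless_inference_config=serverless_config,
    endpoint_name=ENDPOINT_NAME,
    wait=False,
)
```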
The `model.deploy()` method creates your live endpoint infrastructure. We set `wait=False` to deploy asynchronously, which means the method returns immediately instead of blocking your code for several minutes while AWS provisions the resources.
Since the deployment happens in the background, we need to actively check its progress by calling `describe_endpoint()`, which reports the current deployment status.
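One way to poll the status, using the boto3 SageMaker client (the print formatting is just illustrative):

```python
import boto3

# Query the endpoint's current status: Creating, InService, or Failed
sm_client = boto3.client('sagemaker')
response = sm_client.describe_endpoint(EndpointName=ENDPOINT_NAME)
print(f"Endpoint status: {response['EndpointStatus']}")
```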
Your endpoint moves through a predictable lifecycle: it starts as `Creating` while AWS sets up the infrastructure, then transitions to `InService` when ready for predictions, or `Failed` if something goes wrong. The deployment typically takes 2-5 minutes for serverless endpoints. If you prefer to wait automatically, you can set `wait=True` in the `deploy()` method and your code will block until the endpoint is fully deployed. Alternatively, you can periodically run the `describe_endpoint()` code above to manually check the status. Once you see `InService`, your model is live and ready to handle requests. Until then, you'll need to wait for the provisioning process to complete.
Once your endpoint shows an `InService` status, it's ready to handle prediction requests. But what exactly is this endpoint? Your SageMaker endpoint is essentially a REST API: a web service running on AWS servers that accepts HTTP requests and returns HTTP responses. Just like when you visit a website, your code will send HTTP requests to a specific URL, and the endpoint will respond with predictions.
This is where SageMaker's `Predictor` class becomes essential. Rather than forcing you to construct HTTP requests manually, it provides a simple Python interface that handles all the networking complexity behind the scenes.
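Creating the connection is a single constructor call; a minimal sketch using the endpoint name chosen earlier:

```python
from sagemaker.predictor import Predictor

# Attach to the live endpoint by name; AWS credentials are resolved automatically
predictor = Predictor(endpoint_name=ENDPOINT_NAME)
```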
When you create this `Predictor` object, it automatically discovers your endpoint's URL using the endpoint name, manages authentication with your AWS credentials, and prepares to handle the HTTP communication. Think of it as a specialized HTTP client that knows exactly how to talk to SageMaker endpoints: you get all the power of REST API communication without writing a single line of HTTP code.
However, creating the connection is only half the battle. We still need to solve a fundamental data transfer problem.
While the `Predictor` handles the HTTP communication, your Python objects (like pandas DataFrames and numpy arrays) can't travel over the internet in their current form. They exist only in your local memory, but they need to be converted into a format that can be transmitted to AWS servers and understood by your deployed model.
This is where serialization becomes crucial. Think of it like sending a photo via text message: you can't just "send the photo" - your phone converts the image into text characters, transmits those characters, and the recipient's phone converts them back into a viewable image.
The same process happens here: when you call `predictor.predict()`, the `CSVSerializer` automatically converts your pandas DataFrame into CSV text format that fits inside an HTTP request body. Your endpoint receives this CSV text, feeds it to your model, and sends back predictions as CSV text in the HTTP response. The `CSVDeserializer` then converts that response back into Python arrays you can immediately use for analysis.
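Wiring the serializers into the predictor takes only a couple of lines; a minimal sketch (both classes ship with the SageMaker Python SDK):

```python
from sagemaker.serializers import CSVSerializer
from sagemaker.deserializers import CSVDeserializer

# Encode outgoing DataFrames/arrays as CSV and decode CSV responses back to lists
predictor.serializer = CSVSerializer()
predictor.deserializer = CSVDeserializer()
```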
We chose CSV format because it's perfect for tabular data like our housing features: it's simple, widely supported, and easily readable by both humans and machines. With these serializers configured, your `Predictor` now has everything it needs to seamlessly bridge the gap between your local Python environment and your cloud-hosted model.
With our predictor properly configured to handle the data conversion and HTTP communication, we can now send real data to test the deployed model's performance. This evaluation serves a critical purpose: verifying that your model works correctly in the production environment and maintains the same accuracy you observed during local development.
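A sketch of the evaluation, assuming the test split was saved locally as `test_data.csv` with a `price` target column (adjust the file and column names to your own dataset; the exact shape of the deserialized response can vary):

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

# Load the held-out test set and split features from the target
test_data = pd.read_csv('test_data.csv')
X_test = test_data.drop(columns=['price'])
y_test = test_data['price']

# Send the features to the endpoint; CSV (de)serialization happens automatically
raw_predictions = predictor.predict(X_test.values)

# CSVDeserializer returns nested lists of strings, so flatten and convert to floats
y_pred = [float(value) for row in raw_predictions for value in row]

print(f'Mean Squared Error: {mean_squared_error(y_test, y_pred):.4f}')
print(f'R^2 Score: {r2_score(y_test, y_pred):.4f}')
```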
We begin by loading our test dataset and separating the input features from the target values we want to predict. The crucial step happens with `predictor.predict(X_test.values)`: this call triggers the entire serialization process we just described, converting your data to CSV, sending it to the SageMaker endpoint, and receiving the model's predictions back.
The evaluation metrics provide the final validation of your successful deployment; they should come out close to the values you obtained during local testing. When these metrics align with your local results, you can be confident that your deployment process preserved your model's predictive capabilities and that your endpoint is ready for real-world use.
Congratulations! You've successfully completed your first model deployment with SageMaker, transforming a locally trained model into a production-ready endpoint. You learned the complete deployment pipeline: packaging your model into a `tar.gz` archive, uploading artifacts to S3, creating an entry point script to define model loading behavior, configuring a `SKLearnModel` with serverless inference settings, deploying to a live endpoint, and testing with evaluation metrics. The serverless approach offers automatic scaling, no infrastructure management, and cost-effective pricing, making it ideal for getting started with model deployment.
In the upcoming practice exercises, you'll apply these concepts hands-on to solidify your understanding. As you progress through this course, you'll build upon these foundational skills to explore more advanced scenarios like deploying SageMaker-trained models and sophisticated inference configurations. You've successfully navigated one of the most challenging aspects of machine learning and are well on your way to mastering end-to-end ML workflows.
