Introduction & Context

Welcome back! In the previous lesson, you learned what Amazon SageMaker is and how it helps you manage the entire machine learning workflow in the cloud. You also saw how to set up your environment, including AWS credentials and the necessary Python packages.

Now, before you can train a machine learning model with SageMaker, you need to make your data available in the cloud. SageMaker expects your training data to be stored in Amazon S3, AWS's cloud storage service. Uploading your local data to S3 so that SageMaker can access it for training is a key step in any SageMaker workflow. In this lesson, you will learn how to initialize a SageMaker session, find your default S3 bucket, and upload your training data to S3 using Python code. By the end of this lesson, you will have the foundational infrastructure in place to run SageMaker training jobs with your own data.

Initializing a SageMaker Session

Now it's time to write your first code using the SageMaker Python SDK! Before you can upload data or train models, you need to establish a connection to AWS SageMaker services. This is done by creating a SageMaker session, which serves as your main interface for interacting with SageMaker.

Assuming you already have your AWS credentials configured (as covered in the previous lesson), initializing a SageMaker session is straightforward:
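    import sagemaker

    # Create the session; it automatically picks up your configured
    # AWS credentials and region
    sagemaker_session = sagemaker.Session()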

The sagemaker.Session() call creates a sagemaker_session object that will handle all your interactions with SageMaker services. The session automatically uses your configured AWS credentials and connects to the appropriate AWS region. This session object will be your gateway to uploading data, launching training jobs, and managing other SageMaker resources throughout this course.

Understanding the Default S3 Bucket

Before you can train models with SageMaker, you need a place in the cloud to store your data, models, and other artifacts. This is where Amazon S3 (Simple Storage Service) comes in. S3 is AWS's highly scalable object storage service, and SageMaker relies on it to store all files needed for machine learning workflows.

S3 organizes data using:

  • Buckets: Top-level containers for your data. Each bucket name is globally unique across all AWS accounts.
  • Objects: The actual files you store in S3 (like your training datasets, model artifacts, etc.).
  • Keys: The unique path for each object within a bucket. Keys can include slashes (/) to simulate folder structures, but S3 is technically a flat storage system.
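These pieces combine into a full S3 URI. For example, in the hypothetical URI below, my-example-bucket is the bucket name and datasets/california_housing_train.csv is the object's key:

    s3://my-example-bucket/datasets/california_housing_train.csv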

When you use SageMaker, it can automatically create a default S3 bucket for you. This bucket is created in your AWS account and follows a predictable naming convention, making it easy to organize all your SageMaker-related files in one place.

You can get the name of your default SageMaker bucket using the following code:
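    # Get the name of the default bucket (created automatically on first use)
    bucket = sagemaker_session.default_bucket()
    print(bucket)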

The default_bucket() method returns the name of the S3 bucket that SageMaker will use by default. If this is your first time using SageMaker in your AWS account, this method will automatically create a new bucket for you. The bucket name follows the pattern sagemaker-{region}-{account-id}, ensuring it's unique across all AWS accounts.

For example, if your AWS configuration has:

  • Region: us-east-1
  • Account ID: 123456789012

You would see output like:
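    sagemaker-us-east-1-123456789012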

Uploading Training Data to S3

Now that you have a SageMaker session and know your default S3 bucket, you are ready to upload your training data. SageMaker provides a simple method called upload_data() that lets you copy a file from your local environment to your S3 bucket.

Here's how to upload your training data (the local path below is an example; point it at your own file):
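    # Path to the training data on your local machine (example value)
    TRAIN_DATA_PATH = "data/california_housing_train.csv"

    # Prefix (folder-like path) to use inside the S3 bucket
    DATA_PREFIX = "datasets"

    try:
        # Copy the local file to s3://<bucket>/<prefix>/<filename>
        train_s3_uri = sagemaker_session.upload_data(
            path=TRAIN_DATA_PATH,
            bucket=bucket,
            key_prefix=DATA_PREFIX,
        )
        print(f"Training data uploaded to: {train_s3_uri}")
    except Exception as e:
        print(f"Upload failed: {e}")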

Here's what happens step by step:

  1. Define the local file path: TRAIN_DATA_PATH specifies where your training data file is located on your local system.
  2. Set the S3 prefix: DATA_PREFIX acts like a folder name within your S3 bucket to organize your files.
  3. Call upload_data(): This method copies your local file to the specified S3 bucket and prefix.
  4. Get the S3 URI: The method returns the full S3 URI (Uniform Resource Identifier) where your data is now stored.
  5. Handle success/errors: The try-except block ensures you get feedback about whether the upload succeeded or failed.

It's worth noting that the S3 object name (key) is determined by the S3 prefix you specify (such as datasets/) combined with the filename of the file you are uploading (such as california_housing_train.csv). The full local path to your file is not included in the S3 key—only the filename is used, unless you explicitly include additional folder structure in the S3 prefix. This allows you to organize your files in S3 independently of how they are organized on your local machine.

Verifying Your Upload with the AWS CLI

After uploading your data to S3, it's often helpful to verify that the file is present in your bucket. You can do this using the AWS Command Line Interface (CLI), which lets you interact with AWS services directly from your terminal.

To list the contents of your S3 bucket (and check that your file was uploaded), use the following command:
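    aws s3 ls s3://<your-bucket-name>/<prefix>/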

For example, if your default bucket is sagemaker-us-east-1-123456789012 and your prefix is datasets, the command would look like this:
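    aws s3 ls s3://sagemaker-us-east-1-123456789012/datasets/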

You should see output similar to the following (the timestamp and size will reflect your own upload):
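    2024-01-15 10:30:45     123456 california_housing_train.csv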

This confirms that your file is in the correct S3 location and ready to be used by SageMaker. The AWS CLI is a powerful tool for quickly checking, uploading, or downloading files from S3, and can be very useful for troubleshooting or managing your data.

Downloading the File from S3 and Opening It with Python

Another way to verify your upload is to download the file back to your local environment using Python. This not only confirms that the file was uploaded successfully, but also lets you inspect its contents to ensure the data is intact. The SageMaker SDK makes this easy with the download_data() method.

Here's how you can download your file from S3 (the local target directory below is just an example):
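    import os

    # Local directory to download into (example value)
    LOCAL_DATA_DIR = "downloaded_data"
    os.makedirs(LOCAL_DATA_DIR, exist_ok=True)

    # Download s3://<bucket>/datasets/california_housing_train.csv
    sagemaker_session.download_data(
        path=LOCAL_DATA_DIR,
        bucket=bucket,
        key_prefix=f"{DATA_PREFIX}/california_housing_train.csv",
    )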

Once the file is downloaded, you can open it using pandas. This lets you quickly look at the contents of your file and make sure your data looks correct:
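    import pandas as pd

    # Read the downloaded CSV and preview the first few rows
    df = pd.read_csv(f"{LOCAL_DATA_DIR}/california_housing_train.csv")
    print(df.head())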

When you run this code, you should see output similar to the excerpt below, showing the first few rows of your dataset (exact columns and values depend on your file):
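       longitude  latitude  housing_median_age  total_rooms  ...  median_income  median_house_value
    0    -114.31     34.19                15.0       5612.0  ...         1.4936             66900.0
    1    -114.47     34.40                19.0       7650.0  ...         1.8200             80100.0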

This confirms that your data was successfully uploaded to S3 and can be retrieved intact, ready for use in your machine learning workflows.

Summary & Next Steps

Congratulations! You've just completed a crucial milestone in your SageMaker journey. You've established a connection to SageMaker, located your default S3 bucket, and uploaded your first dataset to the cloud.

With your data securely stored in S3 and your SageMaker session ready to go, you're about to unlock the real power of cloud-based machine learning. In the next lessons, you'll transform this raw data into intelligent models that can make predictions and discover patterns.

Get ready for the exciting practice exercises ahead — you'll put these skills to work in real scenarios, building the confidence and expertise needed for sophisticated machine learning challenges!
