
Lesson Overview

Welcome! In today's lesson, we will learn how to split a dataset into training and testing sets. This is a crucial step in preparing your data, because it lets you check whether a machine learning model generalizes to unseen data.

Lesson Goal: By the end of this lesson, you will understand how to split financial datasets, such as Tesla's stock data, into training and testing sets using Python.

Revision of Preprocessing Steps

Before we delve into splitting the dataset, let's briefly review the preprocessing steps we have covered so far. The dataset has been loaded, new features have been engineered, and the features have been scaled.

Here's the code for those steps for a quick revision:
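
As a minimal sketch of those three steps, assuming a small synthetic price table in place of the real Tesla data: the column names, the two engineered features, and the choice of StandardScaler below are illustrative assumptions, not necessarily the ones used earlier in the course.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Tesla price data (assumption: the real
# dataset has Open/High/Low/Close/Volume columns).
rng = np.random.default_rng(0)
n_rows = 100
close = 200 + rng.normal(0, 2, n_rows).cumsum()
open_ = close + rng.normal(0, 1, n_rows)
tesla_df = pd.DataFrame({
    "Open": open_,
    "High": np.maximum(open_, close) + rng.uniform(0, 2, n_rows),
    "Low": np.minimum(open_, close) - rng.uniform(0, 2, n_rows),
    "Close": close,
    "Volume": rng.integers(1_000_000, 5_000_000, n_rows),
})

# Feature engineering (illustrative features with hypothetical names).
tesla_df["High-Low"] = tesla_df["High"] - tesla_df["Low"]
tesla_df["Price-Open"] = tesla_df["Close"] - tesla_df["Open"]

# Define the feature matrix and the target vector.
features = tesla_df[["High-Low", "Price-Open", "Volume"]].values
target = tesla_df["Close"].values

# Scale each feature to zero mean and unit variance.
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

print(features_scaled.shape)
```

The names features_scaled and target are the ones the rest of this lesson refers to.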

Understanding the Importance of Splitting Datasets

A model that overfits learns the training data too well and performs poorly on new, unseen data. To detect this, you need to evaluate your machine learning model on data it has never seen before. This is where splitting datasets into training and testing sets comes into play.

Why Split?

Training Set: Used to train the machine learning model.
Testing Set: Used to evaluate the model's performance and check its ability to generalize to unseen data.

This lets you confirm that your model's performance is not just tailored to the training data but carries over to new inputs.
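
To make the gap between the two sets concrete, here is a small sketch on synthetic data. The DecisionTreeRegressor model and the noisy target are assumptions chosen only for illustration; they are not part of this lesson's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on one feature plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# An unconstrained tree can memorize the training rows, noise included.
model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)

train_score = model.score(X_train, y_train)  # R^2 on data the model has seen
test_score = model.score(X_test, y_test)     # R^2 on held-out data

print(f"train R^2: {train_score:.2f}")
print(f"test  R^2: {test_score:.2f}")
```

The training score alone would suggest a near-perfect model; only the lower score on the held-out set reveals how much of that fit was memorized noise.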

Implementing Dataset Split with 'train_test_split'

The train_test_split function from sklearn.model_selection helps us easily split the data.

Parameters of train_test_split:

test_size: The proportion of the dataset to include in the test split (e.g., 0.25 means 25% of the data will be used for testing).
train_size: The proportion of the dataset to include in the train split. It is optional; when omitted, it is set to the complement of test_size.
random_state: Controls the shuffling applied to the data before the split. Providing a fixed value ensures reproducibility.
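
The effect of these three parameters can be sketched on a toy array of 20 samples. The array contents and variable names here are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # 20 samples, 2 features
y = np.arange(20)

# test_size=0.25 -> 5 of the 20 samples go to the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))  # 15 5

# train_size alone: the test set becomes the complement.
X_train_b, X_test_b, _, _ = train_test_split(
    X, y, train_size=0.5, random_state=0
)
print(len(X_train_b), len(X_test_b))  # 10 10

# Same random_state -> identical shuffle, hence an identical split.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_test, X_test_again))  # True
```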

Let's split our scaled features and targets into training and testing sets. The train_test_split call uses the following arguments:

features_scaled and target are the inputs.
test_size=0.25 means 25% of the data goes to the test set.
random_state=42 makes the shuffle reproducible. Any other fixed integer works equally well.
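
The call described above can be sketched as follows. The random arrays stand in for the real features_scaled and target (an assumption so the snippet runs on its own), and the output names X_train, X_test, y_train, and y_test are a common convention rather than something this lesson prescribes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for the arrays built during preprocessing (assumption:
# 100 rows and 3 scaled features; the real data will differ).
rng = np.random.default_rng(0)
features_scaled = rng.normal(size=(100, 3))
target = rng.normal(loc=200, scale=10, size=100)

# Shuffle the rows, then hold out 25% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, target, test_size=0.25, random_state=42
)
```

Features and targets are passed together so that each row stays aligned with its own target value after shuffling.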

Verifying Shapes and Contents of the Split Data

After splitting the dataset, it's important to verify the shapes and the contents of the resulting sets to ensure the split was done correctly.

Checking Shapes:

Print the shapes of the training and testing sets to confirm the split ratio is as expected.

Inspecting Sample Rows:

Print a few rows of the training and testing sets to visually inspect the data.

Let's check our split data:
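
A minimal sketch of both checks, again using random stand-in arrays of 100 rows in place of the real scaled Tesla features (an assumption, as are the variable names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data (assumption: 100 rows, 3 scaled features).
rng = np.random.default_rng(0)
features_scaled = rng.normal(size=(100, 3))
target = rng.normal(loc=200, scale=10, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    features_scaled, target, test_size=0.25, random_state=42
)

# Checking shapes: row counts should reflect the 75/25 split.
print(f"Training features shape: {X_train.shape}")
print(f"Testing features shape:  {X_test.shape}")
print(f"Training target shape:   {y_train.shape}")
print(f"Testing target shape:    {y_test.shape}")

# Inspecting sample rows: a quick visual sanity check.
print("First 5 rows of training features:\n", X_train[:5])
print("First 5 training targets:\n", y_train[:5])
print("First 5 rows of testing features:\n", X_test[:5])
print("First 5 testing targets:\n", y_test[:5])
```

With 100 rows and test_size=0.25, the feature shapes are (75, 3) and (25, 3), and the target shapes are (75,) and (25,). The actual Tesla data will show different row counts in the same proportion.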

The output should confirm that the dataset has been split as intended: the training set holds about 75% of the rows and the testing set the remaining 25%, the features and targets in each set have matching row counts, and the sample rows give a glimpse of the features and targets after the split. This is an important validation step before moving on to model training and evaluation.

Lesson Summary

Great job! In this lesson, we:

Discussed the importance of splitting datasets to detect overfitting.
Implemented train_test_split to divide the dataset into training and testing sets.
Verified the shapes and inspected sample rows of the resulting splits.

These steps are crucial for ensuring that your machine learning models can generalize well to new data. Up next, you'll have some practice exercises to solidify your understanding and improve your data preparation skills. Keep going!
