Imagine you built a robot to recognize apples and oranges. But how do you know if it's good at this task? You need to test it on some new apples and oranges it hasn’t seen before. In machine learning, we do something similar by splitting our data into training and test sets. This helps us see how well our model performs on new data.
Splitting the data also helps you detect overfitting so that you can address it. Overfitting is when a machine learning model learns the training data too well, including noise and details that don’t apply to new data. This results in excellent performance on the training set but poor performance on the test set, indicating that the model has memorized specifics rather than learned general patterns.
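To make the idea of memorizing concrete, here is a minimal sketch in plain Python. The fruit weights, the labels, and the lookup-table "model" are all made up for illustration; no real model works exactly like this, but the gap between the two scores is the pattern to watch for.

```python
# Hypothetical toy data: (weight in grams, label). The values are invented for illustration.
train = [(150, "apple"), (170, "apple"), (130, "orange"), (140, "orange")]
test = [(155, "apple"), (135, "orange"), (165, "apple")]

# A "model" that only memorizes: it stores every training example verbatim.
memory = {weight: label for weight, label in train}

def predict(weight):
    # Return the memorized label, or a fixed guess for any weight it has never seen.
    return memory.get(weight, "orange")

def accuracy(data):
    # Fraction of examples whose predicted label matches the true label.
    correct = sum(predict(weight) == label for weight, label in data)
    return correct / len(data)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print("train accuracy:", train_accuracy)
print("test accuracy:", test_accuracy)
```

The memorizer is perfect on the examples it stored and much worse on the new ones, which is exactly the symptom of overfitting described above.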
Today, we will learn how to split a dataset into training and test sets using the train_test_split function from scikit-learn. By the end of this lesson, you'll know how to prepare your data properly to evaluate your model.
A train-test split divides the dataset into two parts: one to train the model and one to test it. The training set helps the model learn patterns, and the test set helps us check whether the model is good at predicting new data.
For example, if you have 10 pictures of fruits, you might use 8 to train your robot and 2 to test it. This lets you check whether the robot has merely memorized the training pictures or can recognize new ones too.
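The 8-and-2 idea can be sketched by hand in plain Python before any library is involved. The file names and the random seed below are hypothetical, chosen only so the example is repeatable.

```python
import random

# Hypothetical file names standing in for the 10 fruit pictures.
pictures = [f"fruit_{i}.jpg" for i in range(10)]

# Shuffle a copy so the split is random; the seed is an arbitrary choice for repeatability.
rng = random.Random(0)
shuffled = pictures[:]
rng.shuffle(shuffled)

# The first 8 pictures are for training, the remaining 2 for testing.
train_pictures = shuffled[:8]
test_pictures = shuffled[8:]

print(len(train_pictures), len(test_pictures))  # 8 2
```

Shuffling first matters: if the pictures were stored with all apples before all oranges, taking the last 2 without shuffling would give a test set containing only oranges.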
To split the data, we use the train_test_split function from the scikit-learn library. This function makes it easy to divide your data randomly. Let’s first see how to import what we need:
