Welcome! Today, we’ll learn how to build a full preprocessing pipeline for the Titanic dataset. In real work, you’ll deal with large datasets with many features and rows.
Our goal is to learn how to prepare real data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.
Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.
Let’s start by loading the Titanic dataset using Seaborn, which contains information about passengers such as age, fare, and whether they survived. We'll drop some columns we won’t use.
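A minimal sketch of this step (we'll call the DataFrame `titanic`; the variable name is our choice):

```python
import seaborn as sns

# Load the Titanic dataset bundled with Seaborn
titanic = sns.load_dataset('titanic')

# Drop columns we won't use in this pipeline
titanic = titanic.drop(columns=['deck', 'embarked', 'alive'])

print(titanic.head())
```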
We loaded the dataset and dropped the `deck`, `embarked`, and `alive` columns. `deck` has too many missing values to be useful, while the other two are redundant: `embarked` encodes the same ports as `embark_town`, and `alive` is just a yes/no copy of our target, `survived`, so keeping it would leak the answer into the features.
Next, let's handle missing values using `SimpleImputer` from scikit-learn.
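One way this imputation step might look, assuming the `titanic` DataFrame from the previous step:

```python
from sklearn.impute import SimpleImputer

# Fill missing numerical values with the column mean
num_imputer = SimpleImputer(strategy='mean')
titanic[['age', 'fare']] = num_imputer.fit_transform(titanic[['age', 'fare']])

# Fill missing categorical values with the most frequent value,
# then flatten the 2-D result back into a 1-D column with ravel()
cat_imputer = SimpleImputer(strategy='most_frequent')
titanic['embark_town'] = cat_imputer.fit_transform(titanic[['embark_town']]).ravel()
```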
As a reminder, `ravel()` is a NumPy method that returns a contiguous flattened array. Here it flattens the 2-D column vector returned by `fit_transform()` into a 1-D array, so the imputed `embark_town` values fit back into the DataFrame column correctly.
We filled missing numerical data (`age`, `fare`) using the mean, and missing categorical data (`embark_town`) using the most frequent value. This is like guessing a missing puzzle piece based on the surrounding ones.
Machine learning models need numerical data, so we use `OneHotEncoder` to convert categorical features into numbers.
Next, we drop the original categorical columns and concatenate the new encoded columns with the DataFrame.
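A sketch of this encoding step. The column list here is an assumption (these are the text-like columns left after our drops), and `sparse_output=False` requires scikit-learn 1.2 or newer (older versions use `sparse=False`):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Categorical columns to encode (assumed list for this sketch)
categorical_cols = ['sex', 'class', 'who', 'embark_town']

encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(titanic[categorical_cols])

# Wrap the encoded array in a DataFrame with readable column names
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=titanic.index,
)

# Drop the originals and concatenate the encoded columns
titanic = pd.concat([titanic.drop(columns=categorical_cols), encoded_df], axis=1)
```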
We converted the categorical columns into numerical ones, dropped the originals, and added the new encoded columns. It's like translating words into a secret code for a robot.
Feature scaling ensures all numerical values are on a similar scale. We use `StandardScaler` for this.
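A sketch of the scaling step:

```python
from sklearn.preprocessing import StandardScaler

# Standardize age and fare: subtract the mean, divide by the standard deviation
scaler = StandardScaler()
titanic[['age', 'fare']] = scaler.fit_transform(titanic[['age', 'fare']])
```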
We scaled our numerical data (`age`, `fare`) to have a mean of 0 and a standard deviation of 1. This is like resizing puzzle pieces to fit perfectly.
Next, we separate our features (used for predictions) and the target variable (the outcome we predict).
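In code, this separation might look like:

```python
# Features: every column except the target
X = titanic.drop(columns=['survived'])

# Target: the outcome we want to predict
y = titanic['survived']
```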
Here, `X` contains all features except `survived`, and `y` contains the `survived` column. Keeping them separate makes it explicit what the model learns from and what it predicts.
Finally, we split the dataset into training and test sets using `train_test_split`. This lets us train the model on one part of the data and test it on another.
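A sketch of the split; `random_state=42` is an arbitrary choice here to make the split reproducible:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```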
We split the data so 80% is used for training and 20% for testing. This step is like practicing with some pieces before trying the whole puzzle.
Today, we:
- Loaded and prepared the Titanic dataset.
- Handled missing values.
- Encoded categorical features.
- Scaled numerical features.
- Separated features and the target variable.
- Split the dataset into training and test sets.
Now, you'll get to practice these steps hands-on. Happy learning!
