Lesson Introduction

Welcome! Today, we’ll learn how to build a full preprocessing pipeline for the Titanic dataset. In real work, you will deal with large datasets with many features and rows.

Our goal is to learn how to prepare real data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.

Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.

Load and Prepare the Data

Let’s start by loading the Titanic dataset using Seaborn. The dataset has information about passengers such as age, fare, and whether they survived. We'll drop some columns we won’t use.
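
Here is a minimal sketch of how this step might look (the exact code in your environment may differ slightly):

```python
import seaborn as sns

# Load the Titanic passenger data bundled with Seaborn
df = sns.load_dataset('titanic')

# Drop columns we won't use in this lesson
df = df.drop(columns=['deck', 'embarked', 'alive'])

print(df.head())
```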

We loaded the dataset and dropped the deck, embarked, and alive columns because they have too many missing values or add nothing useful as features. For example, deck is mostly empty, embarked duplicates the information in embark_town, and alive is just the target survived written as yes/no.

Handle Missing Values

Next, let's handle missing values using SimpleImputer from scikit-learn.
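
A sketch of this step, filling the numerical columns with the mean and embark_town with the most frequent value, might look like this:

```python
from sklearn.impute import SimpleImputer

# Fill missing numerical values (age, fare) with the column mean
num_imputer = SimpleImputer(strategy='mean')
df[['age', 'fare']] = num_imputer.fit_transform(df[['age', 'fare']])

# Fill missing categorical values (embark_town) with the most frequent value;
# ravel() flattens the (n, 1) result into a 1-D array for the DataFrame column
cat_imputer = SimpleImputer(strategy='most_frequent')
df['embark_town'] = cat_imputer.fit_transform(df[['embark_town']]).ravel()
```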

As a reminder, ravel() is a NumPy method that returns a contiguous flattened array. In this context, it flattens the two-dimensional column returned by fit_transform() into a one-dimensional array, so the result fits back into the embark_town column of the DataFrame correctly.

We filled missing numerical data (age, fare) using the mean and categorical data (embark_town) using the most frequent value. This is like guessing a missing puzzle piece based on surrounding ones.

Encode Categorical Features: Part 1

Machine learning models need numerical data. So, we use OneHotEncoder to convert categorical features into numbers.
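
One way this encoding step might look; here we assume every remaining text or category column gets encoded, which is one reasonable choice (your lesson code may list the columns explicitly):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Pick the text/category columns that still need encoding
categorical_cols = df.select_dtypes(include=['object', 'category']).columns

# Turn each category into its own 0/1 indicator column
# (use sparse=False instead of sparse_output=False on scikit-learn < 1.2)
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(df[categorical_cols])

# Wrap the result in a DataFrame with readable column names
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(categorical_cols),
    index=df.index,
)
```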

Encode Categorical Features: Part 2

Next, we drop the original categorical columns and concatenate the new encoded columns with the DataFrame.
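
Continuing the sketch above, this replacement step might look like:

```python
# Drop the original categorical columns and attach the encoded ones
df = df.drop(columns=categorical_cols)
df = pd.concat([df, encoded_df], axis=1)
```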

We converted the categorical columns into numerical ones, dropped the originals, and added the new encoded columns. It's like translating words into a secret code for a robot.

Feature Scaling

Feature scaling ensures all numerical values are on a similar scale. We use StandardScaler for this.
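
A sketch of the scaling step for the age and fare columns:

```python
from sklearn.preprocessing import StandardScaler

# Rescale age and fare to mean 0 and standard deviation 1
scaler = StandardScaler()
df[['age', 'fare']] = scaler.fit_transform(df[['age', 'fare']])
```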

We scaled our numerical data (age, fare) to have a mean of 0 and a standard deviation of 1. This is like resizing puzzle pieces to fit perfectly.

Separate Features and Target Variable

Next, we separate our features (used for predictions) and the target variable (the outcome we predict).
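
This split of columns might look like:

```python
# Features: every column except the target
X = df.drop(columns=['survived'])

# Target: the survived column (the outcome we want to predict)
y = df['survived']
```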

Here, X contains all features except survived, and y contains the survived column — the outcome the model will learn to predict from the features.

Train-Test Split

Finally, we split the dataset into training and test sets using train_test_split. This lets us train the model on one part of the data and test it on another.
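
A sketch of the split; the 80/20 ratio comes from the lesson, while the random_state value is an arbitrary choice added here for reproducibility:

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; fixing random_state makes the
# shuffle reproducible (42 is an arbitrary seed, not part of the lesson)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```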

We split the data so 80% is used for training and 20% for testing. This step is like practicing with some pieces before trying the whole puzzle.

Lesson Summary

Today, we:

  1. Loaded and prepared the Titanic dataset.
  2. Handled missing values.
  3. Encoded categorical features.
  4. Scaled numerical features.
  5. Separated features and the target variable.
  6. Split the dataset into training and test sets.

Now, you'll get to practice these steps hands-on. Happy learning!
