Lesson Introduction

Welcome! Today, we’ll learn how to build a full preprocessing pipeline for the Titanic dataset. In real work, you will deal with large datasets with many features and rows.

Our goal is to learn how to prepare real-world data for machine learning models by handling missing values, encoding categorical features, scaling numerical features, and splitting the data into training and test sets.

Imagine you have a messy jigsaw puzzle. You need to organize the pieces, find the edges first, and then start assembling. Data preprocessing is like organizing the pieces before starting the puzzle.
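Before we work through each step on the real dataset, here is a minimal sketch of the four steps from above on a tiny, made-up DataFrame (the rows and column values here are illustrative, not the actual Titanic data), assuming pandas and scikit-learn are available:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the Titanic columns we will use later.
df = pd.DataFrame({
    "age":  [22.0, None, 38.0, 26.0, None, 35.0],
    "fare": [7.25, 71.28, 8.05, 7.92, 8.46, 53.10],
    "sex":  ["male", "female", "female", "male", "male", "female"],
    "survived": [0, 1, 1, 0, 0, 1],
})

# 1. Handle missing values: fill numeric gaps with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Encode the categorical feature as 0/1 indicator columns.
df = pd.get_dummies(df, columns=["sex"], drop_first=True)

# 3. Scale numerical features to zero mean and unit variance.
scaler = StandardScaler()
df[["age", "fare"]] = scaler.fit_transform(df[["age", "fare"]])

# 4. Split into training and test sets.
X = df.drop(columns="survived")
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

We will apply the same four steps, in the same order, to the full Titanic dataset in the rest of this lesson.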

Load and Prepare the Data

Let’s start by loading the Titanic dataset using Seaborn, which has information about passengers like age, fare, and whether they survived. We'll drop some columns we won’t use.
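A loading step along these lines would look as follows; note that `sns.load_dataset` fetches the data from Seaborn's online data repository, and the exact list of columns to drop follows the lesson (only `deck` and `embarked` are shown here):

```python
import seaborn as sns

# Load the Titanic dataset from Seaborn's data repository.
titanic = sns.load_dataset("titanic")
print(titanic.shape)  # 891 passengers, 15 columns

# Drop columns we won't use: `deck` is mostly missing, and
# `embarked` duplicates the information in `embark_town`.
titanic = titanic.drop(columns=["deck", "embarked"])
print(titanic.columns.tolist())
```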


We loaded the dataset and dropped the columns deck, embarked, and because they have too many missing values or are not useful. For example, the column shouldn't affect a passenger's survival rate, so it is questionable as a feature.
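You can verify the "too many missing values" claim yourself by counting the missing entries per column; for example, `deck` is missing for the large majority of the 891 passengers:

```python
import seaborn as sns

titanic = sns.load_dataset("titanic")

# Count missing values per column, most-missing first.
missing = titanic.isnull().sum().sort_values(ascending=False)
print(missing.head())  # `deck` tops the list by a wide margin
```

Columns with only a handful of missing values (like `age`) are worth imputing instead of dropping, which is what the next steps of the pipeline handle.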
