Welcome to a lesson on handling missing data! Today, we're diving into the Titanic dataset, a window into the early 20th century. Our main aim? To wrangle missing data using Python and Pandas. Don't worry if you're unfamiliar with these terms yet; we'll break them down one by one!
- Python: A high-level, interpreted programming language that is easy to learn yet powerful. It has a rich ecosystem of libraries, like Pandas, that make data manipulation a breeze.
- Pandas: A Python library providing high-performance, easy-to-use data structures and data analysis tools.
By the end of this lesson, you'll understand the basics of handling missing data, which is an essential step in preparing your data for machine learning models. So let's get started!
As an analyst or data scientist, it's important to understand why data might be missing, because the cause helps you choose the best strategy to handle it. Missing values, like missing puzzle pieces, can arise for several reasons: the data was never collected, was recorded incorrectly, or was lost over time.
Furthermore, missing data can be categorised as:
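- Missing completely at random (MCAR): The chance that a value is missing is unrelated to any data, whether observed or unobserved.
- Missing at random (MAR): The chance that a value is missing depends on the observed values of other variables, but not on the missing value itself.
- Missing not at random (MNAR): The chance that a value is missing depends on the missing value itself, so the gaps follow a pattern that the observed data cannot fully explain.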
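To make the three categories concrete, here is a small simulation sketch. It does not use the real Titanic data: the `pclass` and `age` columns are randomly generated, and the missing-value probabilities are arbitrary assumptions chosen only to make each mechanism visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical toy data (not the real Titanic data): passenger class and age.
df = pd.DataFrame({
    "pclass": rng.integers(1, 4, size=n),             # values 1, 2, or 3
    "age": rng.normal(30, 12, size=n).clip(1, 80),
})

# MCAR: every age has the same 20% chance of being removed.
mcar = df["age"].mask(rng.random(n) < 0.2)

# MAR: the chance of a missing age depends on an observed column (pclass),
# not on the age itself.
p_mar = df["pclass"].map({1: 0.05, 2: 0.20, 3: 0.40}).to_numpy()
mar = df["age"].mask(rng.random(n) < p_mar)

# MNAR: the chance of a missing age depends on the age itself
# (here, ages above 40 are far more likely to go unrecorded).
p_mnar = np.where(df["age"] > 40, 0.5, 0.05)
mnar = df["age"].mask(rng.random(n) < p_mnar)

# Under MAR, the share of missing ages differs by class.
print(mar.isna().groupby(df["pclass"]).mean())

# Under MNAR, the mean of the remaining ages is pulled below the true mean.
print(df["age"].mean(), mnar.mean())
```

In this sketch, the MAR gaps can be explained by looking at `pclass`, whereas the MNAR gaps bias the observed ages in a way no other column reveals.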
Before we can consider how to handle missing data, let's learn how to identify it. We'll use the isnull() and sum() functions from the Pandas library to find the number of missing values in our Titanic dataset:
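A minimal sketch of that check follows. The rows below are a made-up stand-in for the Titanic data, and the variable names `titanic` and `missing_counts` are assumptions for illustration; with the real dataset you would load it into a DataFrame first and run the same call.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few Titanic rows; the values are made up.
titanic = pd.DataFrame({
    "survived": [0, 1, 1, 0, 1],
    "age": [22.0, np.nan, 26.0, np.nan, 35.0],
    "deck": [None, "C", None, None, "E"],
    "embarked": ["S", "C", "S", None, "S"],
})

# isnull() marks each missing cell as True;
# sum() then counts the True values in each column.
missing_counts = titanic.isnull().sum()
print(missing_counts)
```

For this toy table, the counts are 0 for survived, 2 for age, 3 for deck, and 1 for embarked.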
