Lesson Introduction

Hello there! Today, we're going to talk about handling missing values in datasets for machine learning. Why is this important? Imagine you are building a model to predict house prices, but some houses are missing information about their size or the number of bedrooms. These missing values can affect the performance of your model. In this lesson, you'll learn why data might be missing, different ways to handle it, and how to use Python libraries to do so.

By the end of this lesson, you'll know why handling missing values is crucial, understand different strategies to deal with them, and be able to use Python tools to handle missing data efficiently.

Understanding Missing Values

Why does data go missing? There are many reasons:

  1. Human Error: Sometimes, people forget to fill in all the fields when entering data.
  2. System Error: Occasionally, the system that collects the data might have problems.
  3. Intentional Omission: Data may be deliberately left out, for example for privacy reasons.

There are three common types of missing data:

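  • MCAR (Missing Completely at Random): The chance that a value is missing is unrelated to any data, observed or unobserved. For example, a survey page is lost in the mail.
  • MAR (Missing at Random): The chance that a value is missing depends on other observed data, but not on the missing value itself. For example, older respondents skip the income question more often, regardless of how much they earn.
  • MNAR (Missing Not at Random): The chance that a value is missing depends on the missing value itself. For example, people with very high incomes decline to report their income.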
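
To make the three types concrete, here is a minimal sketch that simulates each one with pandas and NumPy. The columns `age` and `income`, the 20% and 50% rates, and the 80th-percentile cutoff are all made up for illustration; they are assumptions, not part of any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data (illustrative values only).
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=n),
    "income": rng.normal(50_000, 15_000, size=n),
})

# MCAR: every income value has the same 20% chance of being missing.
mcar = df["income"].mask(rng.random(n) < 0.2)

# MAR: missingness depends on an observed column (age), not on income itself.
mar = df["income"].mask((df["age"] > 50) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the value that goes missing (high incomes are withheld).
cutoff = df["income"].quantile(0.8)
mnar = df["income"].mask(df["income"] > cutoff)

# Fraction of missing values under each mechanism.
print(mcar.isna().mean(), mar.isna().mean(), mnar.isna().mean())
```

In the MNAR case the remaining incomes are all at or below the cutoff, so their average understates the true average. That is why the type of missingness matters when you choose a strategy.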

Strategies for Handling Missing Values

Handling missing values can be done in several ways:

  1. Deletion: If only a small percentage of values is missing, you can delete the affected rows or columns. But be careful: if you remove too much data, you might lose important information.

  2. Imputation: You can also replace the missing values with a single substitute value, such as the column's mean, median, or mode. This method is often more suitable because it keeps every row, so no observations are lost.
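
Both strategies can be sketched in a few lines of pandas. The `houses` table below is hypothetical, with column names and values invented for this example; treat the code as one possible approach rather than the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical house data with a few gaps (values invented for illustration).
houses = pd.DataFrame({
    "size_sqft": [1400.0, np.nan, 2000.0, 1750.0],
    "bedrooms": [3.0, 2.0, np.nan, 3.0],
    "price": [240_000, 180_000, 330_000, 290_000],
})

# First, count the missing values in each column.
print(houses.isna().sum())

# Strategy 1: drop every row that has at least one missing value.
dropped = houses.dropna()

# Strategy 2: fill the gaps with a per-column statistic.
filled = houses.copy()
filled["size_sqft"] = filled["size_sqft"].fillna(filled["size_sqft"].median())
filled["bedrooms"] = filled["bedrooms"].fillna(filled["bedrooms"].mode()[0])

print(dropped.shape)  # rows and columns left after deletion
print(filled)         # same shape as the original, with no gaps
```

Here deletion keeps only two of the four houses, while filling keeps all four. On a table this small, dropping rows loses half of the data, which shows why deletion is best reserved for cases where few values are missing.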
