In the previous lesson, you built a powerful diagnostic tool that revealed the data quality issues in our Netflix dataset. You created visualize_missing.py, a script that systematically identified where data was missing and quantified the severity of each problem. You discovered that 29.9% of titles lack director information, 9.4% are missing cast details, and another 9.4% don't have country information. You learned how to use pandas to detect these gaps, how to organize your analysis into reusable functions, and how to present your findings in a clear, readable format.
That exploration was essential, but it was only the first step. Knowing what's wrong with your data doesn't help much if you can't fix it. This is where today's lesson comes in. You're moving from passive observation to active transformation. Instead of just reporting on missing values, you'll learn how to actually fill them in and produce a cleaned dataset that's ready for analysis.
Today, you'll build a new script called clean_data.py that takes the raw Netflix dataset and produces a cleaned version with missing values handled appropriately. This script will read the original data, apply cleaning transformations to fill in missing values, and save the results to a new file. By the end of this lesson, you'll have a complete data cleaning pipeline that transforms problematic raw data into analysis-ready clean data.
Here is a simple overview of what the script will do: read the raw CSV, fill the missing values, and write the cleaned result to a new file.
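That overview can be sketched in a few lines of Python. This is a sketch only, not the final script: the function and file names are placeholders, and argument handling and validation are added later in the lesson.

```python
# High-level sketch of clean_data.py (illustrative; the full script
# adds command-line arguments and validation in later steps).
from pathlib import Path

import pandas as pd

def clean_missing_values(df):
    """Return a copy of df with missing text fields filled with 'Unknown'."""
    cleaned = df.copy()  # leave the original DataFrame untouched
    return cleaned.fillna({"director": "Unknown", "cast": "Unknown",
                           "country": "Unknown", "rating": "Unknown"})

# Read raw data -> clean -> write cleaned data (guarded so the sketch
# is safe to run even when the CSV is not present)
if Path("netflix_titles.csv").exists():
    raw = pd.read_csv("netflix_titles.csv")
    clean_missing_values(raw).to_csv("netflix_titles_clean.csv", index=False)
```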
This script matters because most data analysis and visualization tools struggle with missing values. If you try to count how many titles each director has created but 30% of your records have no director listed, your results will be incomplete and potentially misleading. If you want to visualize the geographic distribution of Netflix content but 9% of titles lack country information, your map will have significant gaps. Cleaning your data first ensures that your subsequent analysis is built on a solid foundation.
The skills you'll learn today extend far beyond this specific Netflix dataset. Every real-world dataset you encounter will have missing values, and you'll need to decide how to handle them. The patterns and techniques you learn in this lesson will serve you throughout your data career, whether you're working with customer records, scientific measurements, financial transactions, or any other type of data.
When you encounter missing data, you have several options for how to handle it, and choosing the right approach depends on your data and your analysis goals. Understanding these options helps you make informed decisions rather than just applying the same solution to every problem.
The most drastic approach is to delete rows that contain missing values. If a title is missing its director, you could simply remove that entire row from your dataset. This approach has the advantage of simplicity, and it ensures that your cleaned dataset contains only complete records. However, it comes with a significant cost. In our Netflix dataset, if you deleted every row with any missing value, you'd lose thousands of titles. You'd be throwing away information about the title, release year, description, and all the other fields that are actually present, just because one field happens to be empty. This is rarely the right choice unless the missing data is so pervasive that the incomplete records are genuinely unusable.
A more nuanced approach is to fill missing values with defaults. This is what you'll implement today. For text fields like director, cast, and country, you can replace missing values with a placeholder like "Unknown." This preserves all your rows while explicitly marking which fields lack information. When you later analyze directors, titles with "Unknown" directors will be grouped together, making it clear that this is a category of titles where director information wasn't available. This approach works well for categorical data where there's no meaningful way to calculate what the missing value should be.
For numeric data, you might fill missing values with calculated statistics rather than arbitrary defaults. If you had a dataset of movie budgets with some missing values, you might fill them with the median budget of similar movies. If you had temperature measurements with gaps, you might interpolate between surrounding values. These approaches try to make educated guesses about what the missing values might have been, which can be appropriate when you have enough context to make reasonable estimates.
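Both numeric strategies are one-liners in pandas. The budgets series below is invented purely for illustration; it is not part of the Netflix dataset.

```python
import pandas as pd

# Hypothetical movie budgets in millions, with gaps (illustrative data)
budgets = pd.Series([10.0, None, 30.0, None, 50.0])

# Option 1: fill gaps with a summary statistic such as the median
median_filled = budgets.fillna(budgets.median())  # median of 10, 30, 50 is 30

# Option 2: interpolate linearly between the surrounding values
interpolated = budgets.interpolate()
```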
Sometimes, the right choice is to leave missing values as they are. If you're building a system that needs to handle incomplete data gracefully, or if the presence of missing values is itself meaningful information, you might choose not to fill them at all. For example, if newer titles are more likely to have complete cast information than older titles, that pattern might be worth preserving rather than obscuring by filling everything with "Unknown."
For our Netflix dataset, filling with "Unknown" is the right choice for director, cast, and country because these are categorical text fields where we can't reasonably guess what the missing values should be. We don't have enough information to infer who directed a movie if that information isn't provided. By using "Unknown" as a placeholder, we preserve all our data while being honest about what we don't know. This allows us to proceed with analysis while maintaining transparency about data quality.
To clean missing values in our dataset, we’ll use a few key tools from the pandas library. The most important of these is the fillna() method. This function allows you to replace missing values (represented as NaN in pandas) with a value of your choice. For example, you can use it to fill all missing directors with the string "Unknown".
You can apply fillna() to a single column, or to multiple columns at once by passing a dictionary that maps column names to their fill values. This is especially useful when you want to use different fill values for different columns. For our Netflix dataset, we’ll be filling missing values in the director, cast, country, and rating columns, each with the placeholder "Unknown".
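Both forms look like this in practice. The tiny DataFrame below is a made-up stand-in for the Netflix data:

```python
import pandas as pd

# A tiny stand-in for the Netflix data (values are invented)
df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "director": ["Jane Doe", None],
    "cast": [None, "Some Actor"],
    "country": [None, "India"],
})

# Fill a single column
df["director"] = df["director"].fillna("Unknown")

# Fill several columns at once with a dictionary of column -> fill value
df = df.fillna({"cast": "Unknown", "country": "Unknown"})
```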
Another important pandas tool is the .copy() method. Before making changes to your DataFrame, it’s good practice to create a copy so that your original data remains unchanged. This way, you can always refer back to the raw data if you need to.
Rather than writing all your cleaning steps directly in your script, it’s best to organize them into a function. This function will take a DataFrame as input and return a cleaned version as output. By encapsulating your cleaning logic in a function, you make your code easier to read, test, and reuse.
Inside this function, you’ll:
- Create a copy of the input DataFrame.
- Define a dictionary that maps each column you want to clean to its fill value.
- Check that all required columns are present in the DataFrame.
- Use fillna() to fill missing values in the specified columns.
This approach keeps your cleaning logic clear and flexible. If you need to change which columns you clean or what values you use, you only need to update the dictionary.
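Putting those steps together, the cleaning function might look like the sketch below. The function name, dictionary name, and error message are illustrative choices, not requirements of the exercise.

```python
import pandas as pd

# Column -> fill value mapping; edit this dictionary to change what gets cleaned
FILL_VALUES = {
    "director": "Unknown",
    "cast": "Unknown",
    "country": "Unknown",
    "rating": "Unknown",
}

def clean_missing_values(df):
    """Return a cleaned copy of df with missing values filled per FILL_VALUES."""
    # Copy first, so the caller's DataFrame is never modified
    cleaned = df.copy()

    # Verify that every column we intend to clean actually exists
    missing_columns = [col for col in FILL_VALUES if col not in cleaned.columns]
    if missing_columns:
        raise ValueError(f"Missing required columns: {missing_columns}")

    # Fill missing values using the column-to-value dictionary
    return cleaned.fillna(FILL_VALUES)
```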
Before you start cleaning, it’s important to validate your data. This means checking that the columns you expect to clean actually exist in the DataFrame. If a required column is missing, your script should raise an error with a clear message. This helps you catch problems early and makes your script more robust.
You should also check that your input file exists before trying to load it. If the file is missing, your script should stop and tell you exactly what’s wrong.
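A small loader that performs the file check before reading might look like this (load_raw_data is an illustrative name, not a required one):

```python
from pathlib import Path

import pandas as pd

def load_raw_data(input_path):
    """Load the raw CSV, failing fast with a clear message if the file is absent."""
    path = Path(input_path)
    if not path.exists():
        raise FileNotFoundError(f"Input file not found: {path}")
    return pd.read_csv(path)
```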
Here’s a summary of the specific tools and functions you’ll be working with:
- pandas.read_csv() to load data from a CSV file.
- DataFrame.copy() to create a copy of your data before cleaning.
- DataFrame.fillna() to fill missing values in specified columns.
- Dictionary mapping for column-specific fill values.
- List comprehensions to check for missing columns.
- Path.exists() from pathlib to validate file paths.
- argparse.ArgumentParser for command-line argument handling.
- DataFrame.to_csv() to save the cleaned data.
By combining these tools, you’ll create a script that:
- Accepts input and output file paths as arguments.
- Loads the raw Netflix dataset.
- Validates that all required columns are present.
- Fills missing values in the specified columns with "Unknown".
- Saves the cleaned data to a new file.
- Leaves the original data untouched.
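Assembled end to end, the script might look like the sketch below. The names and messages are illustrative; the practice exercises walk you through building your own version.

```python
import argparse
from pathlib import Path

import pandas as pd

# Columns to clean and the placeholder used for each
FILL_VALUES = {"director": "Unknown", "cast": "Unknown",
               "country": "Unknown", "rating": "Unknown"}

def clean_missing_values(df):
    """Return a cleaned copy of df; raise if a required column is absent."""
    cleaned = df.copy()
    missing = [col for col in FILL_VALUES if col not in cleaned.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return cleaned.fillna(FILL_VALUES)

def main(argv=None):
    parser = argparse.ArgumentParser(
        description="Fill missing values in the Netflix dataset.")
    parser.add_argument("input", help="path to the raw CSV file")
    parser.add_argument("output", help="path for the cleaned CSV file")
    args = parser.parse_args(argv)

    if not Path(args.input).exists():
        raise FileNotFoundError(f"Input file not found: {args.input}")

    df = pd.read_csv(args.input)
    clean_missing_values(df).to_csv(args.output, index=False)

# In the real script, call main() under an `if __name__ == "__main__":` guard.
# Usage: python clean_data.py netflix_titles.csv netflix_titles_clean.csv
```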
This script will serve as a template for cleaning any dataset with missing values, not just the Netflix data. The techniques you learn here—using fillna(), organizing code into functions, validating inputs, and structuring scripts—are fundamental skills for any data professional.
In the next practice exercises, you’ll put these concepts into action by implementing each part of the cleaning pipeline yourself. You’ll get hands-on experience with the pandas functions and Python patterns introduced here, and you’ll see how they fit together to solve real data cleaning problems.
