
Introduction

Handling missing data is a crucial part of data analysis and cleaning. Inconsistent or absent data can lead to inaccurate analysis and predictions. Python offers robust libraries like pandas to identify, manage, and fill missing data efficiently. In this lesson, we'll explore fundamental techniques for handling missing data using pandas.

Identifying Missing Data

Let's recall from the previous unit's lesson that before treating missing data, it is important to identify it. The pandas library provides several functions to detect null or missing values.

Calling df.isnull() on a DataFrame df generates a DataFrame of the same shape as df, filled with True for missing values and False for non-missing values, which makes missing entries easy to locate.
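A minimal sketch of this check is given here. The sample data and the column names (Name, Age, Salary) are assumptions chosen for illustration, and summing the mask to count missing values per column is a common follow-up step rather than something the lesson prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; both None and np.nan are treated as missing
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, np.nan, 31, 40],
    "Salary": [50000.0, 62000.0, np.nan, 58000.0],
})

# Boolean DataFrame with the same shape as df: True where a value is missing
missing_mask = df.isnull()

# Summing the mask column by column counts the missing values in each column
missing_per_column = missing_mask.sum()

print(missing_mask)
print(missing_per_column)
```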

Dropping Missing Data

One straightforward method to handle missing values is to drop any rows or columns containing them. This method is useful when the missing data is minimal and does not significantly affect the dataset.

By default, the dropna() function eliminates any row where at least one element is missing, thus cleaning up the DataFrame for further analysis or operations. Passing axis=1 drops columns that contain missing values instead. Comparing the DataFrame before and after the call shows exactly how much data the operation removed.
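The sketch below illustrates both directions of dropping under assumed sample data (the values and column names are hypothetical). Note that dropna() returns a new DataFrame here, so df itself keeps all of its rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: Name is complete, Age and Salary each have one gap
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Cara", "Dana"],
    "Age": [25, np.nan, 31, 40],
    "Salary": [50000.0, 62000.0, np.nan, 58000.0],
})

# Default behavior: drop every row that contains at least one missing value
rows_dropped = df.dropna()

# axis=1: drop every column that contains at least one missing value
cols_dropped = df.dropna(axis=1)

print("Before:")
print(df)
print("After dropping rows:")
print(rows_dropped)
print("After dropping columns:")
print(cols_dropped)
```

With this data, half of the rows disappear when dropping by row, which is why dropping is best reserved for cases where only a small share of the data is missing.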

Filling Missing Data

In situations where dropping data may lead to loss of valuable information, filling in missing data is a preferable solution. We can fill missing values with predetermined or calculated values, such as using the median, mean, or a constant value.

For example, in a DataFrame with Age, Salary, and Name columns, missing values in the Age column can be filled with the median age (the middle value of the sorted ages), missing values in the Salary column with the mean salary, and missing values in the Name column with a default value such as 'Unknown'.
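One way to sketch this filling strategy is shown here. The sample values are assumptions made for illustration; only the column names and the choice of median, mean, and 'Unknown' come from the lesson text. Each column is filled by assigning the result of fillna() back to that column on a copy, so the original df is left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing value per column
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None, "Dana"],
    "Age": [25, np.nan, 31, 40],
    "Salary": [50000.0, 62000.0, np.nan, 58000.0],
})

# Work on a copy so the original DataFrame stays available for comparison
filled = df.copy()

# median() and mean() skip missing values by default
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].mean())

# A constant placeholder for the text column
filled["Name"] = filled["Name"].fillna("Unknown")

print(filled)
```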

fillna() supports both in-place updates with inplace=True and assignment of the returned result, offering flexibility in how you manage your DataFrame. In-place updates modify the original DataFrame directly, with no reassignment needed. Assignment is often preferred because it gives better control and avoids unintended data modification: the original data remains unchanged unless you explicitly overwrite it.
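The difference between the two styles can be sketched as follows. The data and the fill_values dictionary are hypothetical, and calling fillna() on the whole DataFrame with a per-column dictionary is a choice made for this sketch; the lesson's original code may have filled columns one at a time.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "Name": ["Alice", None],
    "Age": [25, np.nan],
})

# Hypothetical per-column fill values
fill_values = {"Name": "Unknown", "Age": 0}

# Assignment style: fillna() returns a new DataFrame and leaves df unchanged
filled = df.fillna(fill_values)

# In-place style: the DataFrame it is called on is modified directly
df_inplace = df.copy()
df_inplace.fillna(fill_values, inplace=True)

print(df)          # still contains the missing values
print(filled)      # missing values replaced
print(df_inplace)  # missing values replaced
```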

Concluding Remarks

In this lesson, we explored how to handle missing data using the pandas library in Python and introduced techniques for identifying, dropping, and filling in missing data. These methods are the backbone of ensuring data integrity and accuracy in subsequent analysis. Let's now apply these techniques in the upcoming practice section to solidify your understanding and skills.
