Today, we target duplicates and outliers to clean our data for more accurate analysis.
Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.
The pandas library provides efficient and easy-to-use functions for dealing with duplicates.
The `duplicated()` function flags duplicate rows:
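A minimal sketch, assuming a small made-up DataFrame of students (the names and ages below are illustrative, not the lesson's actual data):

```python
import pandas as pd

# A small made-up dataset of students; the last row repeats the first one
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Carol', 'Alice'],
    'age': [7, 8, 7, 7]
})

# duplicated() returns a boolean Series: True marks a row that repeats an earlier row
print(df.duplicated())
```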
A `True` in the output denotes a row in the DataFrame that repeats an earlier one. Note that one of the repeating rows (the first occurrence) is marked as `False`, so one copy is kept in case we decide to drop all the duplicates.
The `drop_duplicates()` function helps discard these duplicates:
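(Continuing the sketch above.)

```python
# drop_duplicates() keeps the first occurrence of each repeated row by default
df = df.drop_duplicates()
print(df)
```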
There are no more duplicates, cool!
An outlier is a data point significantly different from others. In our dataset of primary school students' ages, we might find an age like 98 — this would be an outlier.
Outliers can be detected visually with tools like box plots and scatter plots, or with statistical methods such as the Z-score or the IQR. Let's consider a data point that's significantly different from the rest; we'll use the IQR method to identify outliers.
As a short reminder, we consider a value an outlier if it is either at least `1.5 * IQR` less than `Q1` (the first quartile) or at least `1.5 * IQR` greater than `Q3` (the third quartile).
Here's how you can use the IQR method with pandas. Let's start by defining a dataset of students' scores:
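A minimal sketch, assuming a made-up `scores` column with one deliberately planted outlier:

```python
import pandas as pd
import numpy as np

# Made-up scores; 250 is a planted outlier
df = pd.DataFrame({'scores': [72, 68, 75, 80, 71, 69, 250, 74, 77, 70]})
```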
Now, compute Q1, Q3, and IQR:
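(Building on the same sketch.)

```python
# Quartiles of the 'scores' column and the interquartile range
Q1 = df['scores'].quantile(0.25)
Q3 = df['scores'].quantile(0.75)
IQR = Q3 - Q1
```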
After that, we can define the lower and upper bounds and find outliers:
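(Using the quartiles computed above.)

```python
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

outliers = df[(df['scores'] < lower_bound) | (df['scores'] > upper_bound)]
print(outliers)  # in the sketch above, only the score of 250 is flagged
```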
There are two common strategies for dealing with outliers: remove them or replace them with the median value.
Removing outliers is the easiest method. However, it is rarely the best option, since you essentially throw away part of your data. To apply it, let's reverse the condition and keep everything except the outliers:
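(A sketch of the removal approach, using the bounds from the sketch above.)

```python
# Keep only the rows whose scores fall within the bounds
df_without_outliers = df[(df['scores'] >= lower_bound) & (df['scores'] <= upper_bound)]
print(df_without_outliers)
```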
The second strategy is to replace outliers with the median value: the median itself is not very susceptible to outliers, so it is a safe replacement.
The easiest way to apply this replacement is to first replace outliers with `np.nan` and then use a fill method. However, this can lead to problems: if the DataFrame already contains missing values, they will be filled as well.
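For illustration only, here is a sketch of that nan-then-fill approach, applied to a copy of the sketch data so it doesn't interfere with the next step:

```python
# Shown on a copy, since this is not the approach we will actually use
tmp = df.copy()
is_outlier = (tmp['scores'] < lower_bound) | (tmp['scores'] > upper_bound)

# Set the outliers to NaN...
tmp['scores'] = tmp['scores'].mask(is_outlier, np.nan)
# ...then fillna() replaces *every* missing value with the median,
# including any NaN values that were already present in the data
tmp['scores'] = tmp['scores'].fillna(tmp['scores'].median())
```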
Instead, we can use the `np.where` function:
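(Continuing the same sketch; `lower_bound`, `upper_bound`, and the `scores` column come from the steps above.)

```python
median = df['scores'].median()

# Where a score is an outlier, take the median; otherwise keep the original value
df['scores'] = np.where(
    (df['scores'] < lower_bound) | (df['scores'] > upper_bound),
    median,
    df['scores']
)
```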
It works by choosing elements from `df['scores']` where the condition is not met (i.e., the value is not an outlier) and the median otherwise. In other words, whenever this function meets an outlier, it ignores the original value and uses the median instead.
We've covered what duplicates and outliers are, their impact on data analysis, and how to manage them. A clean dataset is a prerequisite for accurate data analysis. Now, it's time to apply your skills to real-world data. Let's dive into some practical exercises!
