Today, we target duplicates and outliers to clean our data for more accurate analysis.
Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data, leading to inaccurate statistics.
The pandas library provides efficient, easy-to-use functions for dealing with duplicates.
The duplicated() function flags duplicate rows:
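As a minimal sketch, assume a small made-up student table in which one record was entered twice. The column names and values below are hypothetical and are not taken from the school dataset.

```python
import pandas as pd

# Hypothetical student records; the row for Alice appears twice.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Chen"],
    "age": [10, 11, 10, 9],
    "grade": [5, 6, 5, 4],
})

# duplicated() returns a boolean Series: True for every row that
# repeats an earlier row in all columns.
duplicate_flags = students.duplicated()
print(duplicate_flags)
# Rows 0, 1 and 3 are False; row 2 (the second Alice) is True.
```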
A True in the output denotes a row in the DataFrame that repeats. Note that one of the repeating rows (by default, the first occurrence) is marked as False, so that one copy is kept if we decide to drop all the duplicates.
The drop_duplicates() function helps to discard these duplicates:
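A short sketch of that call, again on a hypothetical student table with made-up values:

```python
import pandas as pd

# Hypothetical student records; the row for Alice appears twice.
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Chen"],
    "age": [10, 11, 10, 9],
    "grade": [5, 6, 5, 4],
})

# drop_duplicates() returns a new DataFrame that keeps the first
# occurrence of each repeated row; the original is left unchanged.
deduplicated = students.drop_duplicates()
print(deduplicated)

# Check that nothing is flagged as a duplicate any more.
print(deduplicated.duplicated().any())  # False
```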
There are no more duplicates, cool!
An outlier is a data point that differs significantly from the rest of the data. In our dataset of primary school students' ages, we might find an age like 98; this would be an outlier.
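To make the idea concrete, here is a rough sketch that flags the age of 98 in a made-up list of ages. It uses the 1.5 × IQR rule, which is one common convention for spotting outliers and is an assumption here, not necessarily the method this lesson goes on to use.

```python
import pandas as pd

# Hypothetical ages of primary school students, with one suspicious entry.
ages = pd.Series([6, 7, 7, 8, 8, 9, 9, 10, 11, 98])

# Interquartile range: the spread of the middle 50% of the values.
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Assumed convention: values more than 1.5 * IQR beyond the quartiles
# are treated as outliers.
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = ages[(ages < lower_bound) | (ages > upper_bound)]
print(outliers)  # only the value 98 is flagged
```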
