Besides missing values, another common data issue is duplicates. Imagine a customer accidentally submitting an order twice. If you don't remove the duplicate, the order is counted twice and your sales figures are inflated. This is why finding and removing duplicates is critical.
Engagement Message
How might duplicate customer orders affect your business decisions?
Pandas gives us a simple method to find these identical rows: .duplicated(). This method scans your entire DataFrame and checks whether each row is an exact copy of a row that appeared earlier in the dataset.
Engagement Message
What do you think the output of this method looks like?
Just like .isnull(), the .duplicated() method returns a boolean Series of True or False values. It marks the first occurrence of a row as False and any subsequent identical rows as True.
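To make the output concrete, here is a minimal sketch. The orders DataFrame and its column names (order_id, customer, amount) are made up for illustration; only .duplicated() itself comes from the lesson.

```python
import pandas as pd

# Hypothetical order data: the third row is an exact copy of the first.
orders = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "customer": ["Ana", "Ben", "Ana", "Cara"],
    "amount": [25.0, 40.0, 25.0, 15.0],
})

# One boolean per row: False for a first occurrence, True for a repeat.
is_duplicate = orders.duplicated()
print(is_duplicate.tolist())  # [False, False, True, False]

# Summing the booleans counts the duplicate rows.
print(is_duplicate.sum())  # 1
```

Because the result is a boolean Series, you can count or filter with it the same way you would with the output of .isnull().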
Engagement Message
Why do you think it's designed to keep the first instance and flag the others?
Once you've identified the duplicates, you need to remove them. For this, Pandas provides another convenient method: .drop_duplicates(). It automatically removes all the rows that .duplicated() would have marked as True.
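As a rough illustration of the effect on a sales total, the sketch below uses a small made-up orders DataFrame (the data and column names are assumptions, not part of the lesson) and compares the total before and after calling .drop_duplicates().

```python
import pandas as pd

# Hypothetical order data: the third row is an exact copy of the first.
orders = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "customer": ["Ana", "Ben", "Ana", "Cara"],
    "amount": [25.0, 40.0, 25.0, 15.0],
})

# drop_duplicates() returns a new DataFrame; orders itself is unchanged.
deduped = orders.drop_duplicates()

print(len(orders), len(deduped))  # 4 3
print(orders["amount"].sum())     # 105.0 (duplicate order counted twice)
print(deduped["amount"].sum())    # 80.0
```

Note that the surviving rows keep their original index labels (0, 1, 3 here), so the index may have gaps after duplicates are dropped.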
