Handling Duplicates and Outliers in Datasets

Topic Overview and Actualization

Today, we target duplicates and outliers to clean our data for more accurate analysis.

Understanding Duplicates in Data

Let's consider a dataset from a school containing students' details. If a student's information appears more than once, that is regarded as a duplicate. Duplicates distort data and lead to inaccurate statistics: counts are inflated, and averages are pulled toward the repeated records.

Python Tools for Handling Duplicates

The pandas library provides efficient and easy-to-use functions for dealing with duplicates.

The duplicated() function flags duplicate rows:
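As a minimal sketch, assume a small hypothetical DataFrame of student records (the names and ages are invented for illustration) in which one row appears twice:

```python
import pandas as pd

# Hypothetical student records; the row for Alice appears twice
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Cara"],
    "age": [10, 11, 10, 9],
})

# Flag every row that repeats an earlier row (default: keep="first")
flags = df.duplicated()
print(flags)
```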

A True in the output denotes a row in the DataFrame that repeats an earlier row. Note that the first occurrence of each repeated row is marked as False, so that one copy is kept if we decide to drop the duplicates.

The drop_duplicates() function helps to discard these duplicates:
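A minimal sketch, again assuming a hypothetical DataFrame of student records with one repeated row (the data is invented for illustration):

```python
import pandas as pd

# Hypothetical student records; the row for Alice appears twice
students = pd.DataFrame({
    "name": ["Alice", "Bob", "Alice", "Cara"],
    "age": [10, 11, 10, 9],
})

# Keep the first occurrence of each row and discard the repeats
deduped = students.drop_duplicates()
print(deduped)

# Check whether any duplicate rows remain
print(deduped.duplicated().any())
```

drop_duplicates() returns a new DataFrame and leaves the original untouched, which is why the result is assigned to a new variable here.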

There are no more duplicates, cool!

Understanding Outliers in Data

An outlier is a data point that differs significantly from the others. In a dataset of primary school students' ages, an age like 98 would be an outlier.
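The text above does not prescribe a detection method. One common convention, used here purely as an assumption, is the interquartile range (IQR) rule, which flags values more than 1.5 × IQR below the first quartile or above the third. A minimal sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages of primary school students, with one suspicious entry
ages = pd.Series([6, 7, 7, 8, 9, 10, 98])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)
```

The 1.5 multiplier is a rule of thumb rather than a fixed standard; other thresholds or methods may suit a given dataset better.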
