Topic Overview and Actualization

In today's lesson, we will focus on identifying and handling duplicates and outliers to clean our dataset for a more precise analysis.

R Tools for Handling Duplicates

Consider a dataset containing students' details from a school. If a student's information is repeated in the dataset, we classify that as a duplicate. Duplicates can distort our data, leading to inaccurate results during the analysis.

R provides efficient functionalities to handle duplicates in a dataset. Here's how you can identify duplicates:
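As a minimal sketch, assume a small hypothetical data frame named `students` (the column names and values are illustrative, not taken from the lesson's dataset) in which one record is repeated. Calling `duplicated()` on it returns a logical vector with TRUE for every row that repeats an earlier one:

```r
# Hypothetical student records; the last row repeats the first one
students <- data.frame(
  name  = c("Alice", "Bob", "Carla", "Alice"),
  grade = c(5, 6, 5, 5)
)

# TRUE marks a row that repeats an earlier row (here: only the fourth row)
dup_flags <- duplicated(students)
print(dup_flags)

# Show only the repeated rows
print(students[dup_flags, ])
```

Note that the first occurrence of a record is not flagged, only its later copies.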

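The duplicated() function in R flags duplicate rows: it returns TRUE for every row that repeats an earlier one. Negating its result lets you remove the duplicate rows while keeping the first occurrence of each: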
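A short sketch of the removal step, again assuming a hypothetical `students` data frame with one repeated record (names and values are illustrative):

```r
# Hypothetical student records; the last row repeats the first one
students <- data.frame(
  name  = c("Alice", "Bob", "Carla", "Alice"),
  grade = c(5, 6, 5, 5)
)

# Keep only the rows that are NOT flagged as duplicates
students_clean <- students[!duplicated(students), ]
print(students_clean)

# Base R's unique() returns the same rows for a data frame
print(unique(students))
```

Both forms keep the first occurrence of each record; `unique()` is simply the shorter spelling.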

After removing the duplicates, your data is clean and ready!

Identifying Outliers

An outlier is a data point that differs markedly from the other data points in the same dataset. For instance, in a dataset of primary school students' ages, an age of 98 would be an outlier.

Outliers can be detected visually with box plots and scatter plots, or statistically with methods such as the Z-score or the interquartile range (IQR). Today, we will use the IQR method to detect outliers.

Here's a brief reminder: the IQR is the difference between the third quartile (Q3) and the first quartile (Q1). A value is considered an outlier if it lies more than 1.5 * IQR below Q1 or more than 1.5 * IQR above Q3.

R Tools for Handling Outliers

Let's use the IQR method in R. First, let's define our dataset:
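As an illustration, assume a hypothetical numeric vector `scores` holding test scores. The values below are made up for this sketch and are not the lesson's original data; they include two deliberately extreme entries:

```r
# Hypothetical test scores (illustrative values); 150 and 2 are the extreme entries
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# Quick look at the distribution
print(summary(scores))
```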

Now, let's compute the IQR, Q1, Q3, and detect outliers:
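One way to sketch this step in base R, assuming the hypothetical `scores` vector below (illustrative values) and R's default quantile method (type 7):

```r
# Hypothetical test scores (illustrative values)
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# Quartiles and interquartile range
Q1  <- quantile(scores, 0.25, names = FALSE)
Q3  <- quantile(scores, 0.75, names = FALSE)
iqr <- IQR(scores)  # equals Q3 - Q1

# Fences: anything outside [lower_bound, upper_bound] counts as an outlier
lower_bound <- Q1 - 1.5 * iqr
upper_bound <- Q3 + 1.5 * iqr

# Logical mask of outliers, then the outlying values themselves
is_outlier <- scores < lower_bound | scores > upper_bound
outliers   <- scores[is_outlier]

cat("Q1:", Q1, "\n")
cat("Q3:", Q3, "\n")
cat("IQR:", iqr, "\n")
cat("Outliers:", outliers, "\n")
```

Keeping the logical mask `is_outlier` separate from the extracted values is a convenience: the same mask can later be negated to drop outliers or used to overwrite them.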

Here is the output:
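Assuming the hypothetical vector `scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)` (illustrative values, not the lesson's original data) and R's default quantile method, printing Q1, Q3, the IQR, and the flagged values would give:

```text
Q1: 47.5
Q3: 55.5
IQR: 8
Outliers: 150 2
```

The fences are therefore 47.5 - 1.5 * 8 = 35.5 and 55.5 + 1.5 * 8 = 67.5, and only 150 and 2 fall outside them.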

Handling Outliers: Removal

There are generally two strategies for dealing with outliers: removing them, or replacing them with a representative value such as the median or the mean.

Removing outliers is the most straightforward method. However, because it discards data points, you might opt for other methods. To apply it, we negate the outlier condition so that everything except the outliers is selected.
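A minimal sketch of removal, assuming a hypothetical `scores` vector (illustrative values) and the IQR fences described earlier:

```r
# Hypothetical test scores (illustrative values)
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# IQR fences
iqr <- IQR(scores)
lower_bound <- quantile(scores, 0.25, names = FALSE) - 1.5 * iqr
upper_bound <- quantile(scores, 0.75, names = FALSE) + 1.5 * iqr

# Logical mask of outliers
is_outlier <- scores < lower_bound | scores > upper_bound

# Negate the mask to keep everything except the outliers
scores_no_outliers <- scores[!is_outlier]
print(scores_no_outliers)
```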

The resulting data contains no outliers!

Handling Outliers: Replacement

Alternatively, outliers can be replaced with median values. The median value is less susceptible to outliers and hence suitable for replacement.

Here, we select the outliers using logical indexing and set them to the median score. The median is 50, so the outlier scores are replaced with 50:
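The replacement can be sketched as follows, assuming a hypothetical `scores` vector (illustrative values, chosen so that its median is 50):

```r
# Hypothetical test scores (illustrative values); their median is 50
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# IQR fences and outlier mask
iqr <- IQR(scores)
lower_bound <- quantile(scores, 0.25, names = FALSE) - 1.5 * iqr
upper_bound <- quantile(scores, 0.75, names = FALSE) + 1.5 * iqr
is_outlier <- scores < lower_bound | scores > upper_bound

# Work on a copy so the original vector stays untouched
scores_median <- scores

# Overwrite the outliers with the median of the original scores
scores_median[is_outlier] <- median(scores)
print(scores_median)
```

Unlike removal, this keeps the vector at its original length, which matters when the scores must stay aligned with other columns.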

Handling Outliers: Replacement with Mean

An alternative to replacing outliers with the median is using the dataset's mean, excluding the outliers. This method ensures that the replacement value reflects the central tendency of the main distribution of data without being skewed by the extreme values.

First, we need to calculate the mean of the data, excluding the outliers:
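A sketch of this calculation, assuming a hypothetical `scores` vector (illustrative values, chosen so that the mean of the non-outlying scores is 51):

```r
# Hypothetical test scores (illustrative values)
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# IQR fences and outlier mask
iqr <- IQR(scores)
lower_bound <- quantile(scores, 0.25, names = FALSE) - 1.5 * iqr
upper_bound <- quantile(scores, 0.75, names = FALSE) + 1.5 * iqr
is_outlier <- scores < lower_bound | scores > upper_bound

# Mean of the scores that are NOT outliers
mean_without_outliers <- mean(scores[!is_outlier])
print(mean_without_outliers)

# For comparison: the mean over all scores, outliers included
print(mean(scores))
```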

Then, replace the outliers with this mean value:
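The replacement step could look like this, assuming a hypothetical `scores` vector (illustrative values) and a mean computed from the non-outlying scores only:

```r
# Hypothetical test scores (illustrative values)
scores <- c(50, 46, 53, 150, 49, 57, 44, 2, 55, 48, 58, 50)

# IQR fences and outlier mask
iqr <- IQR(scores)
lower_bound <- quantile(scores, 0.25, names = FALSE) - 1.5 * iqr
upper_bound <- quantile(scores, 0.75, names = FALSE) + 1.5 * iqr
is_outlier <- scores < lower_bound | scores > upper_bound

# Mean of the non-outlying scores
mean_without_outliers <- mean(scores[!is_outlier])

# Overwrite the outliers in a copy of the data
scores_mean <- scores
scores_mean[is_outlier] <- mean_without_outliers
print(scores_mean)
```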

This approach replaces each outlier with a mean score that is representative of the bulk of the data, so the replaced values no longer pull the dataset toward the extremes.

Note that the mean value 51 (rounded for simplicity) is calculated without the outliers, offering a more accurate depiction of the central value of most data points.

Summary

This lesson discussed what duplicates and outliers are, their implications for data analysis, and how to handle them using R. The key to accurate data analysis is clean data. Now is the best time to apply these concepts to real-world data! Let's dive into some practical exercises!
