In today's lesson, we will focus on identifying and handling duplicates and outliers to clean our dataset for a more precise analysis.
Consider a dataset containing students' details from a school. If a student's information is repeated in the dataset, we classify that as a duplicate. Duplicates can distort our data, leading to inaccurate results during the analysis.
R provides built-in functions for handling duplicates in a dataset. Here's how you can identify duplicates:
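A minimal sketch of the identification step, assuming a small hypothetical data frame (the `students` object, its column names, and its values are made up for illustration):

```r
# Hypothetical student records; Bob's row appears twice
students <- data.frame(
  name  = c("Alice", "Bob", "John", "Bob", "Ann"),
  score = c(56, 11, 50, 11, 98)
)

# duplicated() returns TRUE for every row that repeats an earlier row
dup_flags <- duplicated(students)
print(dup_flags)
# [1] FALSE FALSE FALSE  TRUE FALSE

# Use the flags as a row index to inspect the repeated rows themselves
print(students[dup_flags, ])
```

Only the second occurrence of Bob's row is flagged; the first occurrence counts as the original.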
The duplicated() function in R flags each row that repeats an earlier row, returning TRUE for it. Negating that result and using it as a row index removes the duplicate rows:
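As a sketch of the removal step, again on a hypothetical `students` data frame (names and values are assumptions, not the lesson's data):

```r
# Hypothetical student records; Bob's row appears twice
students <- data.frame(
  name  = c("Alice", "Bob", "John", "Bob", "Ann"),
  score = c(56, 11, 50, 11, 98)
)

# Keep only rows that are NOT flagged as duplicates
students_clean <- students[!duplicated(students), ]
print(students_clean)

# Base R's unique() is a shorthand that yields the same rows
students_unique <- unique(students)
```

Both forms keep the first occurrence of each row and drop later repeats.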
After removing the duplicates, your data is clean and ready!
An outlier is a data point that differs markedly from the other data points in the same dataset. For instance, in a dataset of primary school students' ages, an age of 98 would be an outlier.
Outliers can be detected visually using tools like box plots and scatter plots, or even through statistical methods such as the Z-score or IQR. Today, we will use the IQR method to detect outliers:
Here's a brief reminder: a value is considered an outlier if it lies more than 1.5 * IQR below Q1 (the first quartile) or more than 1.5 * IQR above Q3 (the third quartile), where IQR = Q3 - Q1.
Let's use the IQR method in R. First, let's define our dataset:
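The lesson's original data is not shown here, so the following is an assumed stand-in: a hypothetical data frame named `scores_df` whose values were chosen so that the median score is 50, matching the figure quoted later in the text.

```r
# Hypothetical dataset: five students and their test scores
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

print(scores_df)
```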
Now, let's compute the IQR, Q1, Q3, and detect outliers:
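One way this computation might look, using the assumed `scores_df` data frame (hypothetical names and values) and R's default quantile method:

```r
# Hypothetical dataset (values are assumptions for illustration)
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

# Quartiles; names = FALSE drops the "25%" / "75%" labels
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)  # 47
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)  # 56
iqr <- q3 - q1                                         # 9, same as IQR(scores_df$scores)

# Fences: anything outside [lower, upper] counts as an outlier
lower <- q1 - 1.5 * iqr  # 33.5
upper <- q3 + 1.5 * iqr  # 69.5

# Logical vector marking the outlying rows
is_outlier <- scores_df$scores < lower | scores_df$scores > upper
print(scores_df[is_outlier, ])
# Rows for Bob (11) and Ann (98) are selected
```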
Here is the output:
There are generally two strategies for dealing with outliers: removing them or replacing them with a representative value such as the median.
Removing outliers is the most straightforward method. However, you might opt for other methods, as removing outliers can result in data loss. To apply it, let's negate the outlier condition so that it selects everything except the outliers.
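A sketch of the removal, assuming the same hypothetical `scores_df` data and IQR fences as a starting point:

```r
# Hypothetical dataset (values are assumptions for illustration)
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

# IQR fences
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
iqr <- q3 - q1
is_outlier <- scores_df$scores < q1 - 1.5 * iqr |
  scores_df$scores > q3 + 1.5 * iqr

# Negate the condition with ! to keep everything except the outliers
no_outliers <- scores_df[!is_outlier, ]
print(no_outliers)
# Alice (56), John (50) and Rob (47) remain
```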
The resulting data contains no outliers!
Alternatively, outliers can be replaced with median values. The median value is less susceptible to outliers and hence suitable for replacement.
Here, we select outliers using logical indexing and set them equal to the median score. The median is 50, hence outlier scores are replaced with 50:
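A minimal sketch of the median replacement, on the assumed `scores_df` data (hypothetical values chosen so the median is 50, as the text states):

```r
# Hypothetical dataset (values are assumptions for illustration)
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

# IQR fences
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
iqr <- q3 - q1
is_outlier <- scores_df$scores < q1 - 1.5 * iqr |
  scores_df$scores > q3 + 1.5 * iqr

# Median of all scores, computed before any replacement
median_score <- median(scores_df$scores)  # 50

# Overwrite only the outlying entries
scores_df$scores[is_outlier] <- median_score
print(scores_df)
# Scores become 56, 50, 50, 50, 47
```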
An alternative to replacing outliers with the median is using the dataset's mean, excluding the outliers. This method ensures that the replacement value reflects the central tendency of the main distribution of data without being skewed by the extreme values.
First, we need to calculate the mean of the data, excluding the outliers:
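This step could be sketched as follows, assuming the hypothetical `scores_df` data and the IQR fences described earlier:

```r
# Hypothetical dataset (values are assumptions for illustration)
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

# IQR fences
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
iqr <- q3 - q1
is_outlier <- scores_df$scores < q1 - 1.5 * iqr |
  scores_df$scores > q3 + 1.5 * iqr

# Mean of the non-outlying scores only: (56 + 50 + 47) / 3
mean_score <- mean(scores_df$scores[!is_outlier])
print(mean_score)
# [1] 51
```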
Then, replace the outliers with this mean value:
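A sketch of the replacement itself. The setup is repeated so the snippet runs on its own; `scores_df` and its values remain hypothetical.

```r
# Hypothetical dataset (values are assumptions for illustration)
scores_df <- data.frame(
  students = c("Alice", "Bob", "John", "Ann", "Rob"),
  scores   = c(56, 11, 50, 98, 47)
)

# IQR fences
q1 <- quantile(scores_df$scores, 0.25, names = FALSE)
q3 <- quantile(scores_df$scores, 0.75, names = FALSE)
iqr <- q3 - q1
is_outlier <- scores_df$scores < q1 - 1.5 * iqr |
  scores_df$scores > q3 + 1.5 * iqr

# Mean computed without the outliers, then written over them
mean_score <- mean(scores_df$scores[!is_outlier])
scores_df$scores[is_outlier] <- mean_score
print(scores_df)
# Scores become 56, 51, 50, 51, 47
```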
This approach replaces outliers with a mean score that is representative of the bulk of the data, ensuring a more balanced dataset:
Note that the mean value 51 (rounded for simplicity) is calculated without the outliers, offering a more accurate depiction of the central value of most data points.
This lesson discussed what duplicates and outliers are, their implications on data analysis, and how to handle them using R. The key to accurate data analysis is clean data. Now is the best time to apply these concepts to real-world data! Let's dive into some practical exercises!
