We step into the world of Data Cleaning and Transformation. Real-life data isn't always tidy; it has inconsistencies, missing data points, outliers, and even incorrect data! To extract meaningful insights or build reliable machine learning models, we clean and transform data.
In this session, we handle inconsistencies and outliers and apply various transformations that make a dataset ready for analysis. Let's get started!
Why clean and transform data? Simple: unclean or inconsistent data can skew analysis or predictions. Weather data with missing temperatures, for instance, can lead to misleading climate predictions. The real world is full of such examples of analysis gone awry due to unclean data.
Let's delve into spotting inconsistencies. For instance, XL, X-L, and xl represent the same clothing size but are reported differently. Python's pandas library comes in handy here.
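As a minimal sketch of how such inconsistencies might be spotted, the snippet below lists the distinct values in a column and counts each one. The DataFrame, its `size` column, and the sample values are assumptions made for illustration, not the lesson's original data.

```python
import pandas as pd

# Hypothetical data: the same clothing size recorded in different ways
df = pd.DataFrame({"size": ["XL", "X-L", "xl", "M", "XL", "m"]})

# Listing the distinct values makes inconsistent spellings easy to spot
unique_sizes = df["size"].unique()
print(unique_sizes)

# Counting each variant shows how often each spelling occurs
size_counts = df["size"].value_counts()
print(size_counts)
```

Both `unique()` and `value_counts()` are standard pandas Series methods; on a categorical column with few legitimate values, unexpected extra entries usually point to inconsistent reporting.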
To sort out inconsistencies, replace them with a standard value.
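One way to do this, sketched here on a hypothetical `size` column (the data and the choice of XL as the standard spelling are assumptions), is to pass a mapping of variants to pandas' `replace` method:

```python
import pandas as pd

# Hypothetical data with three spellings of the same size
df = pd.DataFrame({"size": ["XL", "X-L", "xl", "M", "XL"]})

# Map each non-standard spelling to the chosen standard value
df["size"] = df["size"].replace({"X-L": "XL", "xl": "XL"})

print(df["size"].unique())
```

Values that do not appear as keys in the mapping, such as M here, are left unchanged.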
Scanning for outliers, or exceptional values, is the next step. Outliers can distort the analytical outcome. One common method to detect outliers is using the Interquartile Range (IQR).
As a short reminder, the IQR method treats any value below $Q_1 - 1.5 \cdot IQR$ or above $Q_3 + 1.5 \cdot IQR$ as an outlier, where:
- $Q_1$ – The first quartile
- $Q_3$ – The third quartile
- $IQR = Q_3 - Q_1$ – The Interquartile Range
Let's use the IQR method to identify and filter out outliers in a dataset.
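The following is a minimal sketch of that procedure. The dataset and the column name `values` are assumptions, chosen so that a single value (9) sits far from the rest; the lesson's original data may differ.

```python
import pandas as pd

# Hypothetical dataset with one value far from the others
df = pd.DataFrame({"values": [1, 2, 3, 2, 1, 2, 3, 2, 9]})

# Quartiles and the interquartile range
q1 = df["values"].quantile(0.25)
q3 = df["values"].quantile(0.75)
iqr = q3 - q1

# Anything outside these bounds counts as an outlier under the IQR rule
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

is_outlier = (df["values"] < lower_bound) | (df["values"] > upper_bound)
outliers = df[is_outlier]
filtered = df[~is_outlier]

print(outliers)
print(filtered)
```

With this sample, the quartiles are 2 and 3, so the bounds are 0.5 and 4.5, and only 9 falls outside them.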
The value 9 is considered an outlier and is excluded from the filtered dataset.
Now, data transformation is required when data needs adjustment to suit a specific analysis or model-building exercise. The need might be to bring skewed data closer to a normal distribution or to harmonize the differing scales of variables. For this purpose, the scikit-learn library comes in handy. Though this course is not about this library, it is widely used with pandas DataFrames, so we will take a brief look at it.
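Here is a minimal sketch of standard scaling with scikit-learn's `StandardScaler`. The DataFrame and its values are hypothetical; the column names `Feature1`, `Feature2`, and `Feature2_scaled` are chosen to match the names used in this lesson.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two features on different scales
df = pd.DataFrame({
    "Feature1": [1, 2, 3, 4, 5],
    "Feature2": [10, 20, 30, 40, 50],
})

scaler = StandardScaler()

# Double brackets keep the selection two-dimensional, as the scaler expects
scaled = scaler.fit_transform(df[["Feature2"]])

# fit_transform returns a 2D NumPy array; flatten it to store as one column
df["Feature2_scaled"] = scaled.ravel()

print(df)
```

Note that `StandardScaler` divides by the population standard deviation, so checking the result with pandas requires `df["Feature2_scaled"].std(ddof=0)`; the default `.std()` uses the sample formula and reports a value slightly above 1.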
The standard scaler works simply: create the StandardScaler object, then call its fit_transform method on the data. We select the column with double square brackets, [['Feature2']], so that it is passed as a two-dimensional DataFrame (scikit-learn transformers expect 2D input) rather than a one-dimensional Series. In the output, we see a new column, Feature2_scaled, which holds the values of Feature2 scaled so that their mean is 0 and their standard deviation is 1.
Kudos! You've completed the Data Cleaning and Transformation lesson. You learned about handling data inconsistencies and outliers and performing data transformations using Python's pandas and scikit-learn libraries.
Practice exercises are next, focusing on consolidating concepts through application. They will reinforce your understanding and hone your skills. So, ready, set, explore!
