Navigating through Data Anomalies: Outlier Detection and Treatment

Exclusion: A straightforward method in which outliers are simply removed from the dataset. This is akin to discarding the burnt pieces in a batch of cookies to maintain overall quality.
Transformation: This method rescales the data, for example with a logarithm, to compress extreme values and reduce skewness, much like applying a filter to a photo to bring all objects to a common exposure level.
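
The two treatments above can be sketched in a few lines of Python. This is a minimal illustration rather than the lesson's own implementation: the helper names `exclude_outliers` and `log_transform`, the cutoff of 3 standard deviations, the choice of a log transform, and the sample values are all assumptions made for the example. The exclusion rule relies on the z-score described in the detection section.

```python
import math
from statistics import mean, pstdev

def exclude_outliers(values, threshold=3.0):
    """Exclusion: keep only points within `threshold` standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return list(values)  # all values identical, nothing to exclude
    return [x for x in values if abs(x - mu) / sigma <= threshold]

def log_transform(values):
    """Transformation: log(1 + x) compresses large values (assumes x >= 0)."""
    return [math.log1p(x) for x in values]

# Hypothetical house values in units of $100,000 (not taken from the real dataset)
house_values = [1.8, 2.1, 1.9, 2.3, 2.0, 2.2, 1.7, 2.4, 2.1, 1.9, 2.0, 2.2, 9.5]

kept = exclude_outliers(house_values)
transformed = log_transform(house_values)

print(len(house_values), len(kept))
print(max(house_values) - min(house_values))
print(max(transformed) - min(transformed))
```

Exclusion discards information, so it suits cases where the extreme value is likely an error. The log transform keeps every row but narrows the gap between typical and extreme values, which is often preferable when the extreme value is genuine.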

Introduction to Outlier Detection and Treatment

Welcome to our detailed exploration of outlier detection and treatment in predictive modeling. Using real-life scenarios such as uneven pricing in housing markets, we will apply statistical methods to identify outliers. Imagine an apartment costing far less, or a mansion priced far higher, than is typical for its area; such data points can skew the average and distort our predictive analysis. In this session, we will use the California Housing Dataset to identify these data points and apply robust treatment strategies.

Detecting Outliers with Z-Scores

To systematically identify outliers, we start with the z-score method. A z-score is a statistical measure that quantifies how many standard deviations a data point lies from the mean. For a given data point $x$, the z-score $z$ is calculated as:

$$z = \frac{x - \mu}{\sigma}$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the feature being examined.
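
As a rough sketch of how this formula translates into code, the snippet below computes z-scores with Python's standard library and flags points whose absolute z-score exceeds a cutoff. The function names `z_scores` and `find_outliers`, the cutoff of 3 (a common convention, not something this lesson has fixed), the use of the population standard deviation, and the sample prices are assumptions for illustration only.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return z = (x - mu) / sigma for every x in values."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # no spread, so no point deviates
    return [(x - mu) / sigma for x in values]

def find_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [x for x, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical house prices in units of $100,000 (not taken from the real dataset)
prices = [1.8, 2.1, 1.9, 2.3, 2.0, 2.2, 1.7, 2.4, 2.1, 1.9, 2.0, 2.2, 9.5]

scores = z_scores(prices)
outliers = find_outliers(prices)
print(outliers)
```

Note that in very small samples a single point cannot reach a large z-score, because the outlier itself inflates the standard deviation; with the population standard deviation the largest possible absolute z-score among n points is the square root of n - 1. The cutoff should therefore be chosen with the sample size in mind.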
