Setting the Stage: Outlier Detection

Welcome to another informative lesson. Today, we're diving deep into the domain of outliers: how to detect and handle them effectively using Python. As always, we'll use our Titanic dataset to illustrate these concepts.

Why are outliers significant, you might wonder? Outliers are anomalous or unusual values that deviate significantly from other observations. They can adversely impact the performance of our machine-learning models by distorting statistics such as the mean and standard deviation and by biasing what the model learns. Detecting outliers helps us maintain our dataset's integrity by ensuring all data falls within a reasonable range of values.

Consider our Titanic example again: what if some passengers had absurdly high ages, like 800, or an impossible fare of $50,000? We can't just ignore these anomalies. We must deal with them appropriately, ensuring our models learn from accurate, realistic data.

The Z-score Method

A commonly used method to detect outliers in a dataset is the Z-score method. Given a set of values, the Z-score of a value is the signed difference between that value and the dataset's mean, expressed in units of the standard deviation: z = (value - mean) / standard deviation.

A Z-score of 0 indicates that the data point equals the mean. A Z-score of 1.0 indicates a value one standard deviation above the mean, and a Z-score of -1.0 a value one standard deviation below it. Z-scores with larger absolute values denote values farther from the mean (and potential outliers).
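To make the definition concrete, here is a minimal sketch that computes Z-scores using only Python's standard library. The `ages` list is a made-up sample for illustration, not the Titanic data, and the helper name `z_score` is our own. It uses the population standard deviation, which is one of two common conventions.

```python
import statistics

# Hypothetical sample of ages, used only for illustration
ages = [22, 38, 26, 35, 54, 2, 27, 14]

mean = statistics.mean(ages)
std = statistics.pstdev(ages)  # population standard deviation


def z_score(value, mean, std):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std


z_scores = [z_score(age, mean, std) for age in ages]

# Values above the mean get positive Z-scores, values below get negative ones
for age, z in zip(ages, z_scores):
    print(f"age={age:>2}  z={z:+.2f}")
```

Note that the Z-scores of a dataset always sum to zero (up to rounding), since they measure deviations from the dataset's own mean.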

Let's use this method to detect any potential outliers in the age feature of our Titanic dataset. We'll only consider positive Z-scores: ages cannot be negative, so in this feature the implausible values we are looking for lie above the mean.
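One way this could look with pandas is sketched below. Several things here are assumptions rather than part of the lesson: the small DataFrame is a stand-in for the real Titanic data, the column name `age` and the new column `age_zscore` are illustrative, and the cutoff of 3 is a common rule of thumb, not a fixed rule.

```python
import pandas as pd

# Small stand-in for the Titanic data (assumption); in practice, load the real dataset
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, None, 35.0, 54.0, 2.0, 27.0,
            14.0, 800.0, 30.0, 19.0, 41.0, 28.0, 33.0]
})

age_mean = df["age"].mean()  # missing values are skipped by default
age_std = df["age"].std()    # sample standard deviation (ddof=1)

# Z-score of every age; rows with a missing age get a missing Z-score
df["age_zscore"] = (df["age"] - age_mean) / age_std

threshold = 3  # assumed cutoff; adjust to suit the data
outliers = df[df["age_zscore"] > threshold]  # positive Z-scores only

print(outliers)
```

Rows with a missing age are never flagged, because comparing a missing Z-score with the threshold evaluates to False.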
