Applying Mathematical Transformations to Data

Welcome back! In the previous lessons, we explored foundational techniques in feature engineering, focusing on encoding categorical data and converting continuous data into discrete categories through feature binning. Now, we will delve into another vital technique: applying mathematical transformations to data. These transformations are essential for modifying data distribution, handling skewness, and ultimately improving the performance of machine learning models. Today, you'll learn how to apply log, square root, and cube root transformations specifically to the 'fare' column in the Titanic dataset. This lesson is part of our broader objective to shape and transform features effectively, building directly on your prior knowledge.

Why Apply Mathematical Transformations?

Mathematical transformations are a powerful tool in data preprocessing. They help to stabilize variance, normalize distributions, and make patterns more visible, which can enhance model performance. For instance, log transformation is commonly used when data exhibits exponential growth or right skewness, as it compresses the range of variable values, pulling high values closer and magnifying low values. Square root and cube root transformations are useful in reducing skewness of a moderate nature. By applying these transformations, you can make your data more suitable for modeling and ensure that features are on an appropriate scale. Understanding when and why to apply these transformations is key to effective data preparation.

Loading and Exploring the Dataset

Let’s start by loading and exploring the dataset, focusing specifically on the 'fare' column from the Titanic dataset to set the stage for our transformations. This initial exploration helps us understand the data distribution and determine the need for transformations.
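
The loading step itself isn't shown in this extract; below is a minimal sketch of it. The lesson's actual data source isn't specified, so a small hypothetical sample of fares stands in for the full column here (in practice this would be something like `pd.read_csv('titanic.csv')`):

```python
import pandas as pd

# Stand-in for the lesson's loading step; the fares below are a small
# hypothetical sample used only for illustration.
df = pd.DataFrame({'fare': [7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292]})

print(df['fare'].head())              # first few entries of 'fare'
print("Variance:", df['fare'].var())  # spread around the mean
# Note: pandas .var() uses the sample variance (N - 1 denominator) by default.
```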

This code snippet loads the dataset, outputs the first few entries of the 'fare' column, and prints the variance of the original 'fare'. Variance is a measure of how much the values in a dataset differ from the mean. It provides insight into the data's spread; a high variance indicates that the data points are spread out over a wider range of values. The formula for variance is:

\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

Where \sigma^2 is the variance, N is the number of data points, x_i represents each data point, and \mu is the mean of the data. Observing the column's distribution and variance is crucial to choosing the appropriate transformation technique for normalization.

The output reveals a high variance in the original 'fare', highlighting the need for transformation to manage spread and skewness.

Log Transformation

Identifying skewness or wide spread in the 'fare' column prompts us to apply a log transformation. Log transformation is effective in compressing high-range values and stabilizing variance. It works by applying the natural logarithm (ln) to each data point, which can help in managing skewness and making the data more normally distributed. The formula for a log transformation is:

y = \ln(x)

In this context, we use np.log1p() instead of just np.log() because np.log1p() computes the natural logarithm of (x + 1), represented as:

y = \ln(x + 1)

This approach is beneficial for handling small values and avoiding issues when x = 0, as the logarithm of zero is undefined.
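
A sketch of this step, again using a small hypothetical sample in place of the full 'fare' column (a zero fare is included to show why np.log1p() matters):

```python
import numpy as np
import pandas as pd

fare = pd.Series([0.0, 7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

# np.log1p computes ln(x + 1), so a zero fare maps cleanly to 0
# instead of the undefined ln(0).
log_fare = np.log1p(fare)

print("Original variance:", fare.var())
print("Log-transformed variance:", log_fare.var())
```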

Using np.log1p(), we achieve a logarithmic transformation of the 'fare' values. This adjustment mitigates skewness and reduces variance, making features more amenable to modeling.

By significantly reducing variance, the log transformation offers a more normalized dataset, facilitating better pattern visibility.

Square Root Transformation

While the log transformation addresses high variance effectively, the square root transformation offers a more moderate approach, suitable for less skewed data. It is applied using the formula y = \sqrt{x}, which reduces skewness by replacing each data point with its square root.
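
As a sketch, with the same kind of hypothetical stand-in sample as before (fares are non-negative, so the square root is always defined):

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

# y = sqrt(x), applied element-wise
sqrt_fare = np.sqrt(fare)

print("Original variance:", fare.var())
print("Square-root variance:", sqrt_fare.var())
```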

The square root transformation provides another layer of adjustment by softening moderate skewness, without compressing the scale as aggressively as the log.

The variance decreases moderately, confirming the transformation’s suitability for moderately skewed data.

Cube Root Transformation

For a less aggressive option, the cube root transformation offers a gentle normalization that balances preserving the data's characteristics with reducing skewness. It is applied using the formula y = \sqrt[3]{x}, which smooths skewness by replacing each data point with its cube root.
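
A sketch of this step on the same hypothetical stand-in sample:

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

# y = cube root of x, applied element-wise
cbrt_fare = np.cbrt(fare)

print("Original variance:", fare.var())
print("Cube-root variance:", cbrt_fare.var())
```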

Applying np.cbrt(), this transformation retains much of the original data's spread while smoothing skewness, making it versatile for various modeling contexts.

The modest reduction in variance demonstrates the cube root transformation’s ability to subtly normalize data.

Comparing Transformations

Upon applying these transformations to the 'fare' column, let’s compare their impact on variance to evaluate their effectiveness:

  • Original Fare: 2469.44
  • Log Transformed Fare: 0.94 (significantly reduced)
  • Square Root Transformed Fare: 8.68 (moderately reduced)
  • Cube Root Transformed Fare: 1.15 (subtly reduced)

Each transformation offers a unique mechanism to refine data distribution based on specific modeling goals.
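
The comparison can be reproduced in a few lines. This sketch uses a small hypothetical sample rather than the full dataset, so the absolute numbers differ from the lesson's, but the ordering of the reductions (log strongest, then cube root, then square root) is the same:

```python
import numpy as np
import pandas as pd

fare = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, 512.3292])

# Variance of the original column and of each transformed version
variances = {
    'Original': fare.var(),
    'Log (log1p)': np.log1p(fare).var(),
    'Square root': np.sqrt(fare).var(),
    'Cube root': np.cbrt(fare).var(),
}
for name, var in variances.items():
    print(f"{name}: {var:.2f}")
```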

Review and Summary

Congratulations on advancing to this stage of the course! In this lesson, we covered the theory and practice of applying mathematical transformations like log, square root, and cube root to manipulate data distributions. By working through the 'fare' column of the Titanic dataset, you learned how these transformations adjust data for better pattern visibility and model performance. These skills are essential in your data preprocessing toolkit.

As you progress to the practice exercises, you'll have the opportunity to apply these transformations to datasets yourself, solidifying your understanding by transforming data features independently. This hands-on practice will not only reinforce today's concepts but also prepare you for more advanced feature engineering tasks in future studies. Keep up the excellent work, and enjoy the practice!
