Welcome back! In the previous lessons, we explored foundational techniques in feature engineering, focusing on encoding categorical data and converting continuous data into discrete categories through feature binning. Now, we will delve into another vital technique: applying mathematical transformations to data. These transformations are essential for modifying data distributions, handling skewness, and ultimately improving the performance of machine learning models. Today, you'll learn how to apply log, square root, and cube root transformations specifically to the 'fare' column in the Titanic dataset. This lesson is part of our broader objective to shape and transform features effectively, building directly on your prior knowledge.
Mathematical transformations are a powerful tool in data preprocessing. They help to stabilize variance, normalize distributions, and make patterns more visible, which can enhance model performance. For instance, the log transformation is commonly used when data exhibits exponential growth or right skewness, as it compresses the range of variable values, pulling high values closer together while spreading out differences among low values. Square root and cube root transformations are useful for reducing moderate skewness. By applying these transformations, you can make your data more suitable for modeling and ensure that features are on an appropriate scale. Understanding when and why to apply these transformations is key to effective data preparation.
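One common way to make that decision is to measure skewness directly. The sketch below is a minimal, self-contained example using synthetic exponentially distributed data (not the Titanic dataset) and pandas' built-in .skew():

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data as a hypothetical stand-in for a real feature
rng = np.random.default_rng(42)
values = pd.Series(rng.exponential(scale=30.0, size=1000))

# Strong positive skew suggests a compressing transformation may help;
# values near zero after transforming indicate a roughly symmetric result
print("Skewness before:     ", values.skew())
print("Skewness after log1p:", np.log1p(values).skew())
print("Skewness after sqrt: ", np.sqrt(values).skew())
print("Skewness after cbrt: ", np.cbrt(values).skew())
```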
Let’s start by loading the Titanic dataset and exploring the 'fare' column to set the stage for our transformations. This initial exploration helps us understand the data's distribution and determine the need for transformations.
```python
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv("titanic.csv")

# Original fare
print("Original Fare:")
print(df[['fare']].head())
print("Variance of Original Fare:", df['fare'].var())
```
This code snippet loads the dataset, outputs the first few entries of the 'fare' column, and prints the variance of the original 'fare' values. Variance is a measure of how much the values in a dataset differ from the mean. It provides insight into the data's spread; a high variance indicates that the data points are spread out over a wider range of values. The formula for the sample variance is:

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

Where $s^2$ is the variance, $n$ is the number of data points, $x_i$ represents each data point, and $\bar{x}$ is the mean of the data. Observing the distribution and variance of 'fare' is crucial to deciding the appropriate transformation technique for normalization.
```
Original Fare:
      fare
0   7.2500
1  71.2833
2   7.9250
3  53.1000
4   8.0500
Variance of Original Fare: 2469.436845743116
```
The output reveals a high variance in the original 'fare' values, highlighting the need for transformation to manage spread and skewness.
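Before moving on, here is a quick sanity check on what .var() computes: a minimal sketch (using a small hypothetical sample rather than the full dataset) that evaluates the formula above by hand and confirms it matches pandas, which divides by n − 1 by default:

```python
import pandas as pd

# Small hypothetical sample of fares
x = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05])

# Sample variance computed directly from the formula (n - 1 in the denominator)
n = len(x)
mean = x.sum() / n
manual_var = ((x - mean) ** 2).sum() / (n - 1)

print("Manual variance:", manual_var)
print("pandas .var():  ", x.var())  # pandas uses ddof=1 by default, matching the manual result
```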
Identifying skewness or wide spread in the 'fare' column prompts us to apply a log transformation. Log transformation is effective in compressing high-range values and stabilizing variance. It works by applying the natural logarithm (ln) to each data point, which can help in managing skewness and making the data more normally distributed. The formula for a log transformation is:

$$x' = \ln(x)$$

In this context, we use np.log1p() instead of just np.log() because np.log1p() computes the natural logarithm of (x + 1), represented as:

$$x' = \ln(x + 1)$$

This approach is beneficial for handling small values and avoiding issues when x = 0, as the logarithm of zero is undefined.
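The snippet below is a small self-contained sketch of why this matters: np.log() maps zero to -inf (raising a divide warning), while np.log1p() maps zero cleanly to zero:

```python
import numpy as np

fares = np.array([0.0, 7.25, 71.2833])

# np.log(0) evaluates to -inf, which would corrupt downstream statistics
with np.errstate(divide="ignore"):
    print("np.log:  ", np.log(fares))

# np.log1p(0) is exactly 0, so zero-valued fares stay well-defined
print("np.log1p:", np.log1p(fares))
```

With that edge case covered, we can apply the transformation to the full column.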
```python
# Apply log transformation to fare
df['fare_log'] = np.log1p(df['fare'])
print("\nOriginal Fare and Log Transformed Fare:")
print(df[['fare', 'fare_log']].head())
print("Variance of Log Transformed Fare:", df['fare_log'].var())
```
Using np.log1p(), we achieve a logarithmic transformation of the 'fare' values. This adjustment mitigates skewness and reduces variance, making features more amenable to modeling.
```
Original Fare and Log Transformed Fare:
      fare  fare_log
0   7.2500  2.110213
1  71.2833  4.280593
2   7.9250  2.188856
3  53.1000  3.990834
4   8.0500  2.202765
Variance of Log Transformed Fare: 0.9390545498271524
```
By significantly reducing variance, the log transformation offers a more normalized dataset, facilitating better pattern visibility.
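A practical side note: np.log1p() has an exact inverse, np.expm1() (which computes e^x − 1), so values or model predictions on the log scale can be mapped back to the original fare scale. A minimal sketch, continuing from the snippet above:

```python
# np.expm1() is the exact inverse of np.log1p()
fare_recovered = np.expm1(df['fare_log'])

# The recovered values match the original fares up to floating-point precision
print(df['fare'].head())
print(fare_recovered.head())
```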
While log transformation addresses high variance effectively, square root transformation offers a moderate approach, suitable for less skewed data. The square root transformation is applied using the formula $x' = \sqrt{x}$, which helps in reducing skewness by transforming each data point to its square root, providing a balanced transformation impact.
```python
# Apply square root transformation to fare
df['fare_sqrt'] = np.sqrt(df['fare'])
print("\nOriginal Fare and Square Root Transformed Fare:")
print(df[['fare', 'fare_sqrt']].head())
print("Variance of Square Root Transformed Fare:", df['fare_sqrt'].var())
```
The square root transformation provides another layer of adjustment, softening moderate skewness while compressing large values less aggressively than the log transformation.
```
Original Fare and Square Root Transformed Fare:
      fare  fare_sqrt
0   7.2500   2.692582
1  71.2833   8.442944
2   7.9250   2.815138
3  53.1000   7.286975
4   8.0500   2.837252
Variance of Square Root Transformed Fare: 8.679618578881051
```
The variance decreases moderately, confirming the transformation’s suitability for moderately skewed data.
To examine a less aggressive transformation, the cube root transformation offers a gentle normalization that balances preserving the data's characteristics with achieving normality. The cube root transformation is applied using the formula $x' = \sqrt[3]{x}$, which smooths skewness by transforming each data point to its cube root.
```python
# Apply cube root transformation to fare
df['fare_cbrt'] = np.cbrt(df['fare'])
print("\nOriginal Fare and Cube Root Transformed Fare:")
print(df[['fare', 'fare_cbrt']].head())
print("Variance of Cube Root Transformed Fare:", df['fare_cbrt'].var())
```
Applied with np.cbrt(), this transformation retains much of the original data's spread while smoothing skewness, making it versatile for various modeling contexts.
```
Original Fare and Cube Root Transformed Fare:
      fare  fare_cbrt
0   7.2500   1.935438
1  71.2833   4.146318
2   7.9250   1.993730
3  53.1000   3.758647
4   8.0500   2.004158
Variance of Cube Root Transformed Fare: 1.1502270097208227
```
The modest reduction in variance demonstrates the cube root transformation’s ability to subtly normalize data.
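One more property worth noting, though it does not matter for the non-negative 'fare' column: unlike the square root (undefined for negatives) or the logarithm, the cube root is defined for negative inputs, so np.cbrt() can be applied to features that take values below zero. A minimal illustration:

```python
import numpy as np

values = np.array([-27.0, -1.0, 0.0, 8.0, 64.0])

# np.cbrt() handles negative inputs; np.sqrt() would return nan for them
print(np.cbrt(values))  # [-3. -1.  0.  2.  4.]
```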
Having applied these transformations to the 'fare' column, let’s compare their impact on variance to evaluate their effectiveness:
- Original Fare: 2469.44
- Log Transformed Fare: 0.94 (significantly reduced)
- Square Root Transformed Fare: 8.68 (moderately reduced)
- Cube Root Transformed Fare: 1.15 (subtly reduced)
Each transformation offers a unique mechanism to refine data distribution based on specific modeling goals.
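To produce this comparison programmatically, here is a short sketch, assuming the df built up in the snippets above is still in scope, that loops over the original and transformed columns and reports each one's variance alongside its skewness:

```python
# Compare spread and skewness across the original and transformed fare columns
for col in ['fare', 'fare_log', 'fare_sqrt', 'fare_cbrt']:
    print(f"{col:>10} | variance: {df[col].var():10.2f} | skewness: {df[col].skew():.2f}")
```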
Congratulations on advancing to this stage of the course! In this lesson, we covered the theory and practice of applying mathematical transformations like log, square root, and cube root to manipulate data distributions. By working through the 'fare' column of the Titanic dataset, you learned how these transformations adjust data for better pattern visibility and model performance. These skills are essential in your data preprocessing toolkit.
As you progress to the practice exercises, you'll have the opportunity to apply these transformations to datasets yourself, solidifying your understanding by transforming data features independently. This hands-on practice will not only reinforce today's concepts but also prepare you for more advanced feature engineering tasks in future studies. Keep up the excellent work, and enjoy the practice!