Welcome! This lesson covers creating derived features, a key step in getting more value out of a dataset for data analysis and machine learning. In previous lessons, you learned to encode categorical data and apply mathematical transformations. Here, we build new features from existing columns, which adds another tool to your feature engineering toolkit. Derived features can expose patterns that the raw columns hide and may improve your model's predictive power.
In this lesson, we continue working with the Titanic dataset. We will create new features by focusing on the following columns:
sibsp: Number of siblings or spouses aboard
parch: Number of parents or children aboard
fare: Fare paid for the journey
age: Age of the passenger
By transforming these features, you'll gain new insights that can enhance survival prediction models.
Let's start with the creation of the family_size feature, representing the total number of family members aboard, including the passenger. This feature is obtained by summing the sibsp and parch columns and adding one to include the passenger themselves. Here's how you can create this feature:
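A minimal sketch of this step, assuming pandas and a DataFrame named df. The rows below are hand-typed stand-ins rather than actual Titanic records; only the column names sibsp, parch, fare, and age come from the lesson.

```python
import pandas as pd

# Hand-typed stand-in rows (an assumption for illustration),
# not actual Titanic records.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0],
    "parch": [0, 0, 0, 1, 2],
    "fare": [7.25, 71.28, 7.93, 21.00, 30.00],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0],
})

# Siblings/spouses + parents/children + 1 for the passenger themselves.
df["family_size"] = df["sibsp"] + df["parch"] + 1

print(df[["sibsp", "parch", "family_size"]])
```

Because the addition is vectorized, the same line works unchanged on the full dataset. A passenger travelling alone ends up with a family_size of 1, never 0.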
The output below shows the new family_size column, allowing you to understand passengers' family dynamics onboard:
This feature is valuable for survival analysis since families may have had different survival outcomes compared to individuals.
Next, let's create the fare_per_person feature by dividing the fare paid (fare) by the newly created family_size. This feature provides a per-individual economic perspective:
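The division can be sketched as follows, again assuming pandas and a DataFrame named df. The sample rows are hand-typed for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hand-typed stand-in rows (an assumption for illustration),
# not actual Titanic records.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0],
    "parch": [0, 0, 0, 1, 2],
    "fare": [7.25, 71.28, 7.93, 21.00, 30.00],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1

# family_size is at least 1 whenever sibsp and parch are non-negative,
# so this division cannot hit a zero denominator.
df["fare_per_person"] = df["fare"] / df["family_size"]

print(df[["fare", "family_size", "fare_per_person"]])
```

For a passenger travelling alone, fare_per_person equals fare, since family_size is 1.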
Here's what you observe upon examining the data:
Analyzing fare per person can provide insights into passengers' socio-economic backgrounds, potentially aiding predictions of survival.
Now that we have created new features from our existing data, it's important to take a closer look at these features to understand their structure and values. By inspecting the initial rows, we can ensure the transformations were applied correctly. Here’s a peek into the first few rows of our modified dataset with the new features included:
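One way to take this peek, assuming pandas, is DataFrame.head(). The sketch below rebuilds both features on a few hand-typed stand-in rows (not real Titanic records) so that it runs on its own.

```python
import pandas as pd

# Hand-typed stand-in rows (an assumption for illustration),
# not actual Titanic records.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0],
    "parch": [0, 0, 0, 1, 2],
    "fare": [7.25, 71.28, 7.93, 21.00, 30.00],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / df["family_size"]

# head() returns the first five rows by default; pass n for a different count.
preview = df.head()
print(preview)
```

New columns are appended on the right, so family_size and fare_per_person appear after the original columns in the preview.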
This output demonstrates how the new features like family_size and fare_per_person integrate with the existing ones:
After confirming the new feature creation, let's analyze their statistical properties to grasp the numeric ranges and distributions present. Viewing these summary statistics allows us to understand the nature and variability of the new features. Here we can see key measures such as the average, minimum, maximum, and other statistics that can guide our understanding:
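A common way to produce such a summary in pandas is DataFrame.describe(). The sketch below applies it to the two new columns, using hand-typed stand-in rows, so its numbers are not those of the real dataset.

```python
import pandas as pd

# Hand-typed stand-in rows (an assumption for illustration),
# not actual Titanic records.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0],
    "parch": [0, 0, 0, 1, 2],
    "fare": [7.25, 71.28, 7.93, 21.00, 30.00],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / df["family_size"]

# describe() reports count, mean, std, min, quartiles, and max
# for each numeric column.
summary = df[["family_size", "fare_per_person"]].describe()
print(summary)
```

Individual statistics can be read with label-based indexing, for example summary.loc["mean", "family_size"].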
This statistical overview provides critical insight into the dataset’s new dimensions:
From this statistical summary, we observe that the average family size aboard the Titanic was approximately 1.9, with a maximum family size of 11 members. Similarly, the fare per person spans a broad range, reflecting significant socio-economic diversity onboard. These insights can help you interpret patterns within the dataset and improve predictive models.
Congratulations on creating new features! You've added depth to our analysis by crafting the family_size and fare_per_person features from existing dataset elements. These features offer fresh insights and can significantly impact model outcomes by better representing the relationships within the data. As you move to the practice exercises, apply these concepts to solidify your understanding.
