Lesson 4
Creating New Features from Existing Data
Introduction to Creating New Features

Welcome! In this lesson, we look at creating derived features, a key technique for getting more value out of a dataset in data analysis and machine learning. In previous lessons, you learned to encode categorical data and apply mathematical transformations. Now we add another tool to your feature engineering toolkit: building new features from existing columns. Derived features can expose relationships that the raw columns hide, and they may improve your model's predictive power.

Utilizing the Titanic Dataset for Feature Creation

In this lesson, we continue working with the Titanic dataset. We will create new features by focusing on the following columns:

  • sibsp: Number of siblings or spouses aboard
  • parch: Number of parents or children aboard
  • fare: Fare paid for the journey
  • age: Age of the passenger

By transforming these features, you'll gain new insights that can enhance survival prediction models.
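Before deriving anything, it can help to confirm that the source columns are present and to see how many values are missing, since missing inputs carry over into any feature computed from them. The sketch below is a minimal illustration on a small hand-made DataFrame; the rows are made up for the example and are not taken from the real file.

```python
import pandas as pd

# Toy stand-in for the Titanic data; values are illustrative, not real rows.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 2],
    "fare": [7.25, 71.2833, 7.925, 53.1, None],
    "age": [22.0, 38.0, None, 35.0, 4.0],
})

source_columns = ["sibsp", "parch", "fare", "age"]

# Confirm every column we plan to use is actually present.
missing_columns = [c for c in source_columns if c not in df.columns]

# Count missing values per column; NaN inputs propagate into derived features.
null_counts = df[source_columns].isna().sum()

print("Missing columns:", missing_columns)
print(null_counts)
```

With the real dataset, you would run the same two checks on the DataFrame returned by pd.read_csv.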

Crafting Total Family Size Feature

Let's start with the creation of the family_size feature, representing the total number of family members aboard, including the passenger. This feature is obtained by summing the sibsp and parch columns and adding one to include the passenger themselves. Here's how you can create this feature:

Python
import pandas as pd

# Load the dataset
df = pd.read_csv("titanic.csv")

# Create total family size feature (including the passenger)
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Display the first few rows to verify the new feature
print(df[['sibsp', 'parch', 'family_size']].head())

The output below shows the new family_size column next to the two columns it was derived from:

Plain text
   sibsp  parch  family_size
0      1      0            2
1      1      0            2
2      0      0            1
3      1      0            2
4      0      0            1

This feature is valuable for survival analysis since families may have had different survival outcomes compared to individuals.
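One way to explore that idea is to compare survival rates between passengers travelling alone and those with family. The sketch below is only an illustration: it uses a tiny made-up DataFrame, assumes a survived column coded as 1 for survivors, and introduces a hypothetical is_alone helper feature that is not part of the lesson's code.

```python
import pandas as pd

# Toy data for illustration only; the `survived` column (1 = survived) is assumed.
df = pd.DataFrame({
    "sibsp":    [1, 0, 0, 3, 0, 1],
    "parch":    [0, 0, 0, 2, 0, 1],
    "survived": [1, 0, 1, 0, 0, 1],
})

df["family_size"] = df["sibsp"] + df["parch"] + 1

# Hypothetical helper feature: 1 when the passenger travelled alone, else 0.
df["is_alone"] = (df["family_size"] == 1).astype(int)

# The mean of a 0/1 column is the share of survivors in each group.
survival_by_group = df.groupby("is_alone")["survived"].mean()
print(survival_by_group)
```

On the real dataset, the same groupby would show whether the two groups actually differ; the toy numbers here say nothing about the Titanic itself.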

Calculating Fare Per Person

Next, let's create the fare_per_person feature by dividing the fare paid (fare) by the newly created family_size. This treats the recorded fare as if it covered the whole family, which gives an approximate per-person cost:

Python
# Create fare per family member
df['fare_per_person'] = df['fare'] / df['family_size']

# Display the first few rows to verify the new feature
print(df[['fare', 'family_size', 'fare_per_person']].head())

The first few rows look like this:

Plain text
      fare  family_size  fare_per_person
0   7.2500            2          3.62500
1  71.2833            2         35.64165
2   7.9250            1          7.92500
3  53.1000            2         26.55000
4   8.0500            1          8.05000

Analyzing fare per person can provide insights into passengers' socio-economic backgrounds, potentially aiding predictions of survival.
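Because fare_per_person depends on family_size, the two derivations have to run in that order. One possible way to keep them together is a small helper function; add_family_features is a hypothetical name used only for this sketch, and the sample rows reuse the first five values shown in the outputs above.

```python
import pandas as pd


def add_family_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with family_size and fare_per_person added."""
    out = df.copy()  # leave the caller's DataFrame untouched
    # Siblings/spouses plus parents/children plus the passenger themselves.
    out["family_size"] = out["sibsp"] + out["parch"] + 1
    # family_size is always at least 1, so this never divides by zero.
    out["fare_per_person"] = out["fare"] / out["family_size"]
    return out


sample = pd.DataFrame({
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

featured = add_family_features(sample)
print(featured)
```

Returning a copy is a design choice rather than a requirement: it lets you compare the original and the enriched data side by side.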

Viewing and Understanding Derived Features

Now that we have created new features from our existing data, it's important to take a closer look at these features to understand their structure and values. By inspecting the initial rows, we can ensure the transformations were applied correctly. Here’s a peek into the first few rows of our modified dataset with the new features included:

Python
# Show new derived features
print("First few rows with new features:")
print(df[['sibsp', 'parch', 'family_size', 'fare_per_person']].head())

The output shows family_size and fare_per_person alongside the original sibsp and parch columns:

Plain text
First few rows with new features:
   sibsp  parch  family_size  fare_per_person
0      1      0            2          3.62500
1      1      0            2         35.64165
2      0      0            1          7.92500
3      1      0            2         26.55000
4      0      0            1          8.05000

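Looking at the first rows only covers a handful of passengers. As a complement, a few programmatic checks can test properties that should hold for every row. The sketch below is one possible set of checks, run on the five sample rows from this lesson; the variable names are illustrative.

```python
import pandas as pd

# The five sample rows shown in this lesson.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 1, 0],
    "parch": [0, 0, 0, 0, 0],
    "fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / df["family_size"]

# Every passenger counts themselves, so family_size can never drop below 1.
size_ok = bool((df["family_size"] >= 1).all())

# Dividing a non-negative fare by a value >= 1 can never increase it.
fare_ok = bool((df["fare_per_person"] <= df["fare"]).all())

# Multiplying back should recover the original fare (up to float rounding).
difference = (df["fare_per_person"] * df["family_size"] - df["fare"]).abs()
roundtrip_ok = bool((difference < 1e-9).all())

print(size_ok, fare_ok, roundtrip_ok)
```

If the real data contains missing fares, the comparisons involving those rows evaluate to False, so you may want to drop or fill them before running checks like these.
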
Summary Statistics of New Features

After confirming that the new features were created correctly, let's look at their summary statistics to understand their ranges and distributions. The describe() method reports the count, mean, standard deviation, minimum, quartiles, and maximum for each column:

Python
# Display summary statistics of new numerical features
print("Summary statistics of new features:")
print(df[['family_size', 'fare_per_person']].describe())

The summary for both features looks like this:

Plain text
Summary statistics of new features:
       family_size  fare_per_person
count   891.000000       891.000000
mean      1.904602        19.916375
std       1.613459        35.841257
min       1.000000         0.000000
25%       1.000000         7.250000
50%       1.000000         8.300000
75%       2.000000        23.666667
max      11.000000       512.329200

From this summary, we observe that the mean family_size is approximately 1.9, while the median is 1, so at least half of the passengers travelled alone. The largest family had 11 members. The fare per person ranges from 0 to about 512.33, with a median of 8.30 and a mean of about 19.92. A mean well above the median indicates a right-skewed distribution: most passengers paid little, and a few paid far more. These observations help when interpreting patterns in the dataset and deciding how to use the features in a predictive model.
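Two quick follow-up checks on a feature like fare_per_person are comparing the mean with the median and counting zero values. The sketch below uses only the minimum, quartiles, and maximum from the summary above as stand-in values, so its results are illustrative and do not reproduce the real dataset's statistics.

```python
import pandas as pd

# Stand-in values: min, quartiles, and max from the summary, not real passengers.
fare_per_person = pd.Series(
    [0.0, 7.25, 8.3, 23.666667, 512.3292], name="fare_per_person"
)

mean_value = fare_per_person.mean()
median_value = fare_per_person.median()

# A mean far above the median suggests a long right tail.
is_right_skewed = bool(mean_value > median_value)

# Zero fares may deserve a closer look before modeling.
zero_fare_count = int((fare_per_person == 0).sum())

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print("right-skewed:", is_right_skewed, "| zero fares:", zero_fare_count)
```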

Conclusion and Summary

Congratulations on creating new features! You've added depth to the analysis by deriving family_size and fare_per_person from existing columns. These features describe relationships that the raw columns only imply, and they may improve model performance. As you move to the practice exercises, apply these concepts to solidify your understanding.
