Introduction to Creating New Features

Welcome! In this lesson, we explore derived features, which can make a dataset more useful for data analysis and machine learning. Previously, you learned to encode categorical data and apply mathematical transformations. Now we focus on creating new features from existing columns, adding another technique to your feature engineering toolkit. Derived features can surface patterns that the raw columns hide and may improve your model's predictive power.

Utilizing the Titanic Dataset for Feature Creation

In this lesson, we continue working with the Titanic dataset. We will create new features by focusing on the following columns:

sibsp: Number of siblings or spouses aboard
parch: Number of parents or children aboard
fare: Fare paid for the journey
age: Age of the passenger

By transforming these features, you'll gain new insights that can enhance survival prediction models.

Crafting Total Family Size Feature

Let's start with the creation of the family_size feature, representing the total number of family members aboard, including the passenger. This feature is obtained by summing the sibsp and parch columns and adding one to include the passenger themselves. Here's how you can create this feature:
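A minimal sketch of this step, assuming pandas and a DataFrame named df. The name df and the six sample rows are illustrative assumptions standing in for the full Titanic dataset:

```python
import pandas as pd

# Illustrative sample only: made-up rows that reuse the lesson's column names.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0, 0],
    "parch": [0, 0, 0, 1, 2, 0],
    "fare": [7.25, 71.28, 7.92, 21.08, 11.13, 8.46],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0, 54.0],
})

# Siblings/spouses + parents/children + 1 for the passenger themselves.
df["family_size"] = df["sibsp"] + df["parch"] + 1

print(df[["sibsp", "parch", "family_size"]])
```

Because the passenger is always counted, family_size is never smaller than 1; a value of 1 means the passenger traveled alone.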

The new family_size column summarizes each passenger's family situation on board. This feature is valuable for survival analysis because passengers traveling with family may have had different survival outcomes than those traveling alone.

Calculating Fare Per Person

Next, let's create the fare_per_person feature by dividing the fare paid (fare) by the newly created family_size. This feature approximates what each individual paid and so gives a per-person economic perspective:
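The division can be sketched as follows, again assuming pandas and a hypothetical DataFrame named df filled with made-up sample rows rather than the real data:

```python
import pandas as pd

# Illustrative sample only: made-up rows that reuse the lesson's column names.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0, 0],
    "parch": [0, 0, 0, 1, 2, 0],
    "fare": [7.25, 71.28, 7.92, 21.08, 11.13, 8.46],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0, 54.0],
})

# family_size must exist before it can be used as the divisor.
df["family_size"] = df["sibsp"] + df["parch"] + 1

# Element-wise division; family_size is at least 1, so there is no division by zero.
df["fare_per_person"] = df["fare"] / df["family_size"]

print(df[["fare", "family_size", "fare_per_person"]])
```

For a passenger traveling alone, fare_per_person equals fare, since the divisor is 1.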

Fare per person can hint at passengers' socio-economic backgrounds, which may help when predicting survival.

Viewing and Understanding Derived Features

Now that we have created new features from our existing data, it's important to take a closer look at these features to understand their structure and values. By inspecting the initial rows, we can ensure the transformations were applied correctly. Here’s a peek into the first few rows of our modified dataset with the new features included:
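One way to take that look, sketched with pandas on a hypothetical DataFrame named df (the sample rows are made up for illustration):

```python
import pandas as pd

# Illustrative sample only: made-up rows that reuse the lesson's column names.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0, 0],
    "parch": [0, 0, 0, 1, 2, 0],
    "fare": [7.25, 71.28, 7.92, 21.08, 11.13, 8.46],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0, 54.0],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / df["family_size"]

# head() returns the first five rows by default.
first_rows = df.head()
print(first_rows)
```

Checking a row or two by hand (for example, that family_size equals sibsp + parch + 1) is a quick way to confirm the transformations behaved as intended.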

The new features, family_size and fare_per_person, appear as additional columns alongside the existing ones.

Summary Statistics of New Features

After confirming that the new features were created correctly, let's analyze their statistical properties to understand their ranges and distributions. Summary statistics report key measures such as the mean, minimum, maximum, and quartiles:
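A short sketch of this step using pandas' describe(), assuming a hypothetical DataFrame named df built from made-up sample rows:

```python
import pandas as pd

# Illustrative sample only: made-up rows that reuse the lesson's column names.
df = pd.DataFrame({
    "sibsp": [1, 1, 0, 3, 0, 0],
    "parch": [0, 0, 0, 1, 2, 0],
    "fare": [7.25, 71.28, 7.92, 21.08, 11.13, 8.46],
    "age": [22.0, 38.0, 26.0, 2.0, 27.0, 54.0],
})
df["family_size"] = df["sibsp"] + df["parch"] + 1
df["fare_per_person"] = df["fare"] / df["family_size"]

# describe() reports count, mean, std, min, quartiles, and max per column.
summary = df[["family_size", "fare_per_person"]].describe()
print(summary)
```

The numbers printed for this tiny sample will not match those of the full dataset; the sketch only shows the shape of the call and of its result.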

From the summary statistics of the full dataset, we observe that the average family size is approximately 1.9, with a maximum of 11 members. Fare per person also spans a broad range, reflecting significant socio-economic diversity on board. These insights can help you interpret patterns within the dataset and improve predictive models.

Conclusion and Summary

Congratulations on creating new features! You've added depth to the analysis by deriving the family_size and fare_per_person features from existing columns. These features offer fresh insights and may improve model performance by better representing the relationships within the data. As you move to the practice exercises, apply these concepts to solidify your understanding.
