Welcome to the first lesson in our course on Feature Selection, Reduction and Streamlining. In this unit, we're setting the stage for your journey into feature selection by ensuring your dataset is primed and ready. We'll revisit familiar data preparation techniques, this time with a sharper focus on feature selection. Using the Titanic dataset as our example, remember that well-prepared data is the linchpin of effective feature selection and successful analysis.
You're already skilled at managing missing data, and it's time to apply this to the Titanic dataset. Ensuring data integrity begins with filling in gaps using statistical measures, such as the median for the `age` column and the mode for categorical columns like `embarked` and `embark_town`. Refresh your technique with these steps:
```python
import pandas as pd

# Load the dataset
df = pd.read_csv("titanic.csv")

# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])
```
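To confirm the gaps are actually filled, a quick check works well here (a minimal sketch, assuming the `df` from above):

```python
# Verify that the imputed columns no longer contain missing values
print(df[['age', 'embarked', 'embark_town']].isnull().sum())
# Expect three zeros if the imputation succeeded
```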
When a column in your dataset has an extensive amount of missing values, it can introduce more challenges than benefits. The `deck` column in the Titanic dataset is such an example. In cases where the missing data is too significant to handle through imputation, it's often more practical to remove the column entirely. This decision helps maintain the dataset's integrity for analysis.
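Before dropping a column, it helps to quantify how bad the gap actually is. Here is a small sketch (the 0.5 threshold is an illustrative rule of thumb, not a fixed standard):

```python
# Fraction of missing values per column, worst offenders first
missing_ratio = df.isnull().mean().sort_values(ascending=False)
print(missing_ratio.head())

# Flag columns missing more than half their values as drop candidates
drop_candidates = missing_ratio[missing_ratio > 0.5].index.tolist()
print(drop_candidates)  # 'deck' should appear here for the Titanic data
```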
You can accomplish this using the `drop` method in Pandas. Here is how you can do it:
```python
# Drop 'deck' column due to excessive missing values
df.drop(columns=['deck'], inplace=True)
```
The `drop` method allows us to specify which columns to remove. By setting the `inplace=True` parameter, the method updates the original DataFrame, meaning the `deck` column is removed permanently without needing to create a new DataFrame. This approach keeps your code concise and the data management straightforward.
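If you prefer not to mutate the DataFrame in place, the same result can be achieved by assigning the returned copy back. A brief sketch (the `errors='ignore'` argument is optional and simply makes the line safe to re-run):

```python
# Alternative to inplace=True: drop returns a new DataFrame
df = df.drop(columns=['deck'], errors='ignore')
```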
For the Titanic dataset, we're choosing Label Encoding to handle categorical data. This method is suitable when you want to keep the dataset compact, since One-Hot Encoding adds a new column for every category. Label Encoding instead transforms each category into a unique integer, keeping the data small, though be aware that those integers imply an ordering the original categories may not have.
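To see the size trade-off concretely, here is a quick sketch of what One-Hot Encoding would do to just the `embarked` column, using pandas' `get_dummies` (the exact column names depend on the categories present):

```python
# One-Hot Encoding expands a single column into one column per category
one_hot = pd.get_dummies(df['embarked'], prefix='embarked')
print(one_hot.columns.tolist())  # e.g. ['embarked_C', 'embarked_Q', 'embarked_S']
```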
To apply Label Encoding, we use a `for` loop to go through each categorical column and replace the text categories with numbers:
```python
from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
label_encoder = LabelEncoder()

# Identify categorical features and apply label encoding
label_columns = df.select_dtypes(include=['object']).columns
for column in label_columns:
    df[column] = label_encoder.fit_transform(df[column])
```
By working through each column with a loop, we efficiently replace the categories with numerical values, making the dataset ready for machine learning models without increasing its size unnecessarily.
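One caveat: the loop above refits a single shared encoder, so only the last column's category-to-integer mapping survives. If you expect to decode the integers later, a common variation (a sketch, meant as a replacement for the loop above rather than an addition to it) keeps one fitted encoder per column:

```python
# Keep one fitted encoder per column so each mapping can be inverted later
encoders = {}
for column in label_columns:
    encoder = LabelEncoder()
    df[column] = encoder.fit_transform(df[column])
    encoders[column] = encoder

# Example: inspect the integer-to-category mapping for 'sex'
print(dict(enumerate(encoders['sex'].classes_)))  # e.g. {0: 'female', 1: 'male'}
```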
Now that you've polished your dataset, it's crucial to save your work, a step that's new for us. By storing the processed data, you eliminate the need to repeat preparations each time you access the dataset. Here's the simple way to do it with Pandas:
```python
# Save the updated dataset
df.to_csv("titanic_updated.csv", index=False)
```
By executing this command, your hard work is preserved in a CSV file named "titanic_updated.csv", ready for ongoing feature selection efforts. The `index=False` argument keeps Pandas from writing the row index as an extra column.
After saving your polished dataset, it's important to verify the saved file to ensure that all the changes have been successfully recorded. You can open the saved CSV file and display the first few rows to confirm:
```python
# Read the updated dataset
df_updated = pd.read_csv("titanic_updated.csv")

# Display the first few rows
print(df_updated.head())
```
This should produce the following output, showing the first few entries of your updated dataset:
```
   survived  pclass  sex   age  ...  adult_male  embark_town  alive  alone
0         0       3    1  22.0  ...        True            2      0  False
1         1       1    0  38.0  ...       False            0      1  False
2         1       3    0  26.0  ...       False            2      1   True
3         1       1    0  35.0  ...       False            2      1  False
4         0       3    1  35.0  ...        True            2      0   True
```
By running this code, you'll load the saved dataset and display the first few entries, giving you a quick visual confirmation that your data preparation steps were correctly applied and saved.
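If you want a check that goes beyond eyeballing the output, a short sketch like the following compares the round-tripped file against the in-memory DataFrame (assuming `df` and `df_updated` from the steps above):

```python
# Programmatic sanity checks on the saved file
assert df_updated.shape == df.shape, "row/column count changed during save/load"
assert df_updated['age'].isnull().sum() == 0  # imputation survived the round trip
print("Round trip verified:", df_updated.shape)
```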
You've now reviewed the essential steps in preparing your dataset, setting a solid groundwork for effective feature selection. Your adeptness at handling missing values, intelligently dropping columns, encoding categorical features, and saving your datasets forms the foundation for what's to come. As we advance, we'll delve into selecting and streamlining the most impactful features. Upcoming exercises will reinforce these techniques, propelling you toward mastering feature selection.