Exploratory Data Analysis and Visualization with Matplotlib and Seaborn

Introduction to Exploratory Data Analysis (EDA) Visualization

After your initial inspection of the podcast dataset using methods like .info() and .describe(), it's time to use data visualization to uncover deeper patterns and relationships. While summary statistics are useful, visualizations can quickly reveal trends, outliers, and feature relationships that numbers alone might miss.

In this lesson, you'll use two key Python libraries:

- Matplotlib: The foundational plotting library in Python.
- Seaborn: A higher-level library for more attractive and informative plots.

You'll learn to create three essential EDA visualizations:

- Histograms: To see the distribution of individual numerical features.
- KDE plots: For a smoother view of feature distributions.
- Correlation heatmaps: To visualize relationships between numerical features.

In this lesson, we'll use the full dataset for visual exploration so you can focus on understanding patterns in the variables themselves. Later, when we move into preprocessing and modeling, we'll introduce train/test splits to evaluate models fairly on unseen data. Keeping those phases separate helps build a clean workflow: first understand the data, then prepare it, then model it.

Why Identify and Visualize Features?

Understanding which features are numerical or categorical is crucial for effective feature engineering, the process of selecting, transforming, or creating new features to improve model performance. Visualizing distributions and relationships helps you:

- Detect outliers and skewed features that may need transformation.
- Identify redundant or highly correlated features to avoid multicollinearity.
- Spot patterns that can inspire new, more predictive features.

By mastering these visualization techniques, you'll be better equipped to clean, preprocess, and engineer features, setting a strong foundation for building effective machine learning models.

Identifying Numerical Features

Numerical features are columns that contain quantitative values: numbers you can perform mathematical operations on, such as addition or averaging. In the context of the podcast dataset, these might include things like episode length, popularity percentages, or the number of ads.

To systematically identify numerical features, use pandas' select_dtypes method with np.number:

    import numpy as np

    # Keep only columns with a numeric dtype (integers and floats)
    numerical_features = data.select_dtypes(include=np.number).columns.tolist()
    print(f"Number of numerical features: {len(numerical_features)}")
    print("Numerical features:", numerical_features)

Example output from the podcast dataset:

    Number of numerical features: 6
    Numerical features: ['id', 'Episode_Length_minutes', 'Host_Popularity_percentage', 'Guest_Popularity_percentage', 'Number_of_Ads', 'Listening_Time_minutes']

These features are suitable for visualizations like histograms, KDE plots, and correlation heatmaps, which help you understand their distributions and relationships.

Identifying Categorical Features

Categorical features represent qualitative values: labels or categories that describe qualities or groupings, such as podcast genre or publication day. These are typically stored as strings or objects in your dataset.

To identify categorical features, use select_dtypes with 'object':

    # Keep only columns stored with the generic object dtype (typically strings)
    categorical_features = data.select_dtypes(include='object').columns.tolist()
    print(f"Number of categorical features: {len(categorical_features)}")
    print("Categorical features:", categorical_features)

Example output from the podcast dataset:

    Number of categorical features: 6
    Categorical features: ['Podcast_Name', 'Episode_Title', 'Genre', 'Publication_Day', 'Publication_Time', 'Episode_Sentiment']

Categorical features are best explored with bar charts or count plots, which help you see the frequency of each category and spot any imbalances or rare values.

By clearly separating numerical and categorical features, you can choose the most effective visualization and analysis techniques for each type, making your EDA more insightful and targeted.
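
As a minimal sketch of the count plot idea, the example below counts and plots one categorical column. It assumes a tiny synthetic stand-in for the podcast data (the rows are made up for illustration); with the real dataset you would skip the DataFrame construction and use your own data variable.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the podcast data (made-up rows, for illustration only).
data = pd.DataFrame({
    "Genre": ["Comedy", "News", "Comedy", "Tech", "Comedy", "News", "Tech", "Comedy"],
    "Publication_Day": ["Mon", "Tue", "Mon", "Fri", "Sat", "Mon", "Tue", "Mon"],
})

# Frequency of each category, most common first.
genre_counts = data["Genre"].value_counts()
print(genre_counts)

# Count plot: one bar per category, ordered by frequency.
fig, ax = plt.subplots(figsize=(6, 4))
sns.countplot(data=data, x="Genre", order=genre_counts.index.tolist(), ax=ax)
ax.set_title("Episodes per Genre")
ax.set_ylabel("Count")
plt.tight_layout()
plt.show()
```

Ordering the bars by frequency makes rare categories easy to spot at the right edge of the plot.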

Visualizing Numerical Feature Distributions

Once you've identified your numerical features, the next step is to visualize their distributions. This helps you:

- Spot outliers
- See if features are skewed or normally distributed
- Decide if preprocessing (like normalization or transformation) is needed
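
Plots are the main tool here, but a quick numeric check can back up what you see. The sketch below is one possible approach, not part of the lesson's own code: it uses pandas' skew() and the common 1.5 × IQR rule of thumb for flagging outliers, applied to synthetic stand-in columns (made-up values that only borrow the dataset's column names).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns (made-up values, for illustration only).
rng = np.random.default_rng(seed=0)
data = pd.DataFrame({
    "Episode_Length_minutes": rng.normal(loc=60, scale=15, size=500),  # roughly symmetric
    "Number_of_Ads": rng.exponential(scale=1.5, size=500),             # right-skewed on purpose
})

# Skewness: near 0 is roughly symmetric, above 0 means a longer right tail.
skewness = data.skew()
print(skewness)

# IQR rule of thumb: flag values more than 1.5 * IQR outside the quartiles.
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
is_outlier = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

A column with high skewness and many flagged values is a natural candidate for a closer look in a histogram before you decide on any transformation.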

Histograms
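
As a minimal sketch of a plain histogram, the example below uses pandas' built-in DataFrame.hist, which draws one panel per numerical column. The data is a synthetic stand-in (made-up values under two of the dataset's column names), and bins=20 is an arbitrary starting point rather than a recommended setting.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for two numerical columns (made-up values).
rng = np.random.default_rng(seed=42)
data = pd.DataFrame({
    "Episode_Length_minutes": rng.normal(60, 15, 1000),
    "Host_Popularity_percentage": rng.uniform(20, 100, 1000),
})
numerical_features = data.select_dtypes(include=np.number).columns.tolist()

# One histogram per numerical column; each bar counts the values in one bin.
axes = data[numerical_features].hist(bins=20, figsize=(10, 4), edgecolor="black")
plt.suptitle("Histograms of Numerical Features")
plt.tight_layout()
plt.show()
```

Try a few different values for bins: too few can hide structure, while too many can make the plot look noisy.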

Combined Histogram and KDE Plots

To get a more detailed view, you can combine histograms and KDE (Kernel Density Estimation) plots. This allows you to see both the frequency (histogram) and the smoothed distribution (KDE) for each feature.

A KDE plot shows a smooth curve that represents the distribution of your data. Unlike a histogram, which uses bars and bins, a KDE plot draws a continuous line to show where values are concentrated. KDE works by placing a small, smooth curve at each data point and adding them up to make one overall smooth line. The "bandwidth" controls how smooth the line is: a smaller bandwidth follows the data closely (more wiggly), while a larger one makes the curve smoother.

A common approach is to use subplots to show several features at once:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # One panel per numerical feature (a 2 x 3 grid fits the six features found earlier)
    fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(15, 10))
    axes = axes.flatten()

    for i, col in enumerate(numerical_features):
        if i < len(axes):  # ignore any features beyond the available panels
            hist_color = f'C{i}'
            # density=True puts the histogram on the same scale as the KDE curve
            data[col].hist(ax=axes[i], bins=20, color=hist_color, alpha=0.5, density=True)
            sns.kdeplot(data=data, x=col, ax=axes[i], color=hist_color, alpha=0.8, linewidth=2)
            axes[i].set_title(f'Distribution of {col}')
            axes[i].set_xlabel(col)
            axes[i].set_ylabel('Density')

    plt.tight_layout()
    plt.suptitle('Numerical Feature Distributions with KDE Curves', y=1.02, fontsize=16)
    plt.show()

What to look for:

- Skewness: Is the distribution symmetric or skewed?
- Outliers: Are there values far from the main cluster?
- Modality: Is there one peak or multiple peaks?

KDE curves are especially helpful when comparing the overall shape of a distribution across features, because they smooth out the jagged look of a histogram, whose appearance can change with the bin size. It is still important to remember that KDE is an estimate: the exact shape can vary depending on the amount of data and the smoothing bandwidth.
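
To make the "small curve at each data point" idea concrete, here is a hand-rolled sketch of a Gaussian KDE in NumPy. It is an illustration of the technique, not Seaborn's implementation; the function name gaussian_kde_curve, the synthetic two-peak sample, and the two bandwidth values are all assumptions chosen for the example.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at each grid point."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance from each grid point to each sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Synthetic sample with two peaks (made-up values, for illustration only).
rng = np.random.default_rng(seed=1)
samples = np.concatenate([rng.normal(30, 5, 200), rng.normal(70, 8, 200)])
grid = np.linspace(-40, 160, 1001)

narrow = gaussian_kde_curve(samples, grid, bandwidth=1.0)   # follows the data closely
wide = gaussian_kde_curve(samples, grid, bandwidth=10.0)    # much smoother

# Both curves are density estimates, so the area under each is close to 1.
step = grid[1] - grid[0]
area_narrow = narrow.sum() * step
area_wide = wide.sum() * step
print(area_narrow, area_wide)

# Total up-and-down movement of each curve: a rough measure of "wiggliness".
wiggle_narrow = np.abs(np.diff(narrow)).sum()
wiggle_wide = np.abs(np.diff(wide)).sum()
print(wiggle_narrow, wiggle_wide)
```

In Seaborn you usually do not set the bandwidth directly; kdeplot's bw_adjust parameter scales the automatically chosen bandwidth up or down, which has the same smoothing effect as the bandwidth argument in this sketch.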

Analyzing Feature Relationships with Correlation Heatmaps
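
As a hedged sketch of a correlation heatmap with triangle masking, the example below computes pairwise correlations with DataFrame.corr and hides the redundant upper triangle so only the lower triangle is drawn. The data is a synthetic stand-in: the values are made up, and the relationship between episode length and listening time is built in on purpose so the heatmap has something to show.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (made-up values; the length/listening link is deliberate).
rng = np.random.default_rng(seed=7)
n = 300
length = rng.normal(60, 20, n)
data = pd.DataFrame({
    "Episode_Length_minutes": length,
    "Number_of_Ads": rng.integers(0, 4, n),
    "Listening_Time_minutes": 0.7 * length + rng.normal(0, 5, n),
})

# Pairwise Pearson correlations between the numerical columns.
corr = data.corr()

# True marks cells to hide: the upper triangle and the diagonal.
# The matrix is symmetric, so the lower triangle carries all the information.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(6, 5))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation Heatmap (Lower Triangle)")
plt.tight_layout()
plt.show()
```

Fixing vmin and vmax at -1 and 1 keeps the color scale centered on zero, so positive and negative correlations of the same strength get equally intense colors.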

Summary and Practice Preview

In this lesson, you learned how to:

- Identify numerical and categorical features using pandas
- Visualize distributions with histograms and KDE plots (including combined plots)
- Analyze feature relationships with correlation heatmaps (including lower triangle masking)

In the upcoming practice, you'll apply these skills to the podcast dataset by creating and customizing visualizations, identifying skewed features and outliers, and using correlation analysis to find strong predictors of listening time. These techniques will help you quickly uncover and communicate key data patterns, setting you up for effective data cleaning and modeling.