Introduction

Welcome to today's lesson on Hierarchical Clustering. We will evaluate how well it performs using the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis. Along the way, we will use the Python libraries scikit-learn and pandas to build practical skills for evaluating clustering models.

Hierarchical Clustering and Scikit-learn Introduction

Scikit-learn is a widely used Python library for machine learning. In this lesson, we will use two of its built-in metrics, silhouette_score and davies_bouldin_score. We will also apply scikit-learn's Hierarchical Clustering to some data.
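The lesson's original dataset is not shown here, so the following is only a minimal sketch. It assumes synthetic two-dimensional data generated with make_blobs and an assumed choice of three clusters with Ward linkage; the variable names X, y_true, and labels are illustrative.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-in data with 3 groups (not the lesson's dataset)
X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

# Agglomerative (bottom-up) hierarchical clustering; n_clusters=3 is an assumed choice
clustering = AgglomerativeClustering(n_clusters=3, linkage="ward")
clustering.fit(X)

# One integer cluster label per sample
labels = clustering.labels_
print(labels[:10])
```

Ward linkage merges, at each step, the pair of clusters that increases the total within-cluster variance the least. Other linkage options such as "complete", "average", and "single" can be passed in the same way.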

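Scikit-learn provides Hierarchical Clustering through its AgglomerativeClustering estimator. Once the estimator has been fitted to our dataset, the cluster labels can be accessed via its labels_ attribute (for an estimator named clustering, that is clustering.labels_).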

Silhouette Score

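The Silhouette Score measures how well each point fits its assigned cluster. It compares how close a point is to the other points in its own cluster with how close it is to the points in the nearest neighboring cluster. Scores range from -1 to 1: values near 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest that points may have been assigned to the wrong cluster.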

We will apply the silhouette_score function from sklearn.metrics to our data and its cluster labels.
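As a rough sketch of this step, the snippet below computes the score on assumed synthetic data from make_blobs, clustered into an assumed three groups. The data and variable names are illustrative and are not the lesson's own.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic stand-in data (not the lesson's dataset)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

# fit_predict fits the estimator and returns the cluster labels in one call
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Mean silhouette coefficient over all samples, in the range [-1, 1]
score = silhouette_score(X, labels)
print(f"Silhouette Score: {score:.3f}")
```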

The output is a single number, the mean silhouette coefficient over all samples, which summarizes the effectiveness of our clustering.

Davies-Bouldin Index

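The Davies-Bouldin Index measures the average similarity between each cluster and the cluster most similar to it, where similarity is the ratio of within-cluster scatter to between-cluster separation. The index is non-negative, and lower is better: a lower value indicates more compact, better-separated clusters.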

We will use the davies_bouldin_score function from sklearn.metrics, passing it the data and the cluster labels.
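A minimal sketch of this call, again on assumed synthetic data from make_blobs with an assumed three clusters, might look like this:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Assumption: synthetic stand-in data (not the lesson's dataset)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Non-negative index; values closer to 0 indicate better-separated clusters
db_index = davies_bouldin_score(X, labels)
print(f"Davies-Bouldin Index: {db_index:.3f}")
```

Because the Silhouette Score rewards higher values and the Davies-Bouldin Index rewards lower ones, it helps to keep the direction of each metric in mind when comparing models.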

The Davies-Bouldin Index thus obtained serves as another measure of our clustering effectiveness.

Visualizing and Assessing Clustered Data

Visualizing the clustered data points provides an intuitive understanding of our clusters. For this, we will use a matplotlib scatter plot, together with Cross-Tabulation Analysis using the pandas crosstab function.

Cross-Tabulation Analysis provides an overview of how the known labels of the data points are distributed across the clusters.
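As an illustration, the sketch below cross-tabulates known labels against cluster assignments. It assumes that reference labels are available (here, the y_true array returned by make_blobs on synthetic data); the row and column names are illustrative.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumption: synthetic stand-in data whose generating group (y_true) is known
X, y_true = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Rows: known labels; columns: assigned clusters; cells: number of points
table = pd.crosstab(y_true, labels, rownames=["true label"], colnames=["cluster"])
print(table)
```

Cluster numbers are arbitrary, so a good clustering does not need to fall on the diagonal of the table. What matters is that each row concentrates its counts in a single column.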

The resulting table shows the distribution of data points across our clusters, while the matplotlib scatter plot colors each data point according to its cluster.
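One way to draw such a scatter plot is sketched below. It assumes two-dimensional synthetic data from make_blobs, and the axis labels, title, and color map are illustrative choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script also runs without a display
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Assumption: synthetic 2-D stand-in data (not the lesson's dataset)
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)

labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

fig, ax = plt.subplots(figsize=(6, 4))
# c=labels maps each cluster label to a color from the chosen color map
scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=30)
ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("Hierarchical Clustering (3 clusters)")
ax.legend(*scatter.legend_elements(), title="Cluster")
fig.tight_layout()
# In an interactive session, remove the matplotlib.use("Agg") line and call plt.show() here.
```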

Taken together, these representations provide a clear and direct view of the various clusters formed based on our data:

[Figure: scatter plot of the data points, colored by cluster]

Summary and Practice

You are now equipped to apply the Silhouette Score, the Davies-Bouldin Index, and Cross-Tabulation Analysis when assessing Hierarchical Clustering results. These tools help you interpret and evaluate clustering models with confidence, and they apply well beyond Hierarchical Clustering. Let's continue refining these skills through practice. Keep learning!
