Introduction

Welcome to our lesson on distance metrics in hierarchical clustering! Today, we will delve into the definition and importance of distance metrics, particularly in hierarchical clustering. You will learn about various distance metrics, such as Euclidean, Manhattan, and Cosine distance, and how to implement them in Python. After that, we will examine the impact of these distance measures on the resulting hierarchical clustering.

Introduction to Distance Metrics

Distance metrics are mathematical measures used to calculate the 'distance' between two points. In the context of clustering, we're interested in the distance between data points in our dataset, or the distance between clusters of points. Common choices include Euclidean distance, Manhattan distance, and Cosine distance, each with its own characteristics and application scenarios.

Implementing Distance Metrics in Python: Euclidean Distance

The Euclidean distance, often referred to as the straight-line distance between two points in a Euclidean plane, is one of the most commonly used distance metrics in machine learning. Below are the formula and a Python implementation.

Distance = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}

Python code for calculating Euclidean distance:
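A minimal NumPy implementation might look like this (the function name and the example call are our own):

```python
import numpy as np

def euclidean_distance(p, q):
    """Straight-line distance between points p and q."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([1, 2], [4, 6]))  # 5.0
```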

Implementing Distance Metrics in Python: Manhattan Distance

The Manhattan distance gets its name from the block-like geographical layout of the Manhattan borough of New York City. The Manhattan distance between two points is the sum of the absolute differences of their coordinates. Here's the formula and Python code for calculating Manhattan distance:

Distance = \sum_{i=1}^{n} |p_i - q_i|

Python code for calculating Manhattan distance:
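Following the same pattern, one possible implementation (the names are again our own):

```python
import numpy as np

def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between p and q."""
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(np.abs(p - q))

print(manhattan_distance([1, 2], [4, 6]))  # 7
```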

Implementing Distance Metrics in Python: Cosine Distance

The third distance metric that we will examine today is Cosine distance, but first let's understand Cosine similarity. Unlike the other two, Cosine similarity measures the cosine of the angle between two vectors, which can be useful in certain multi-dimensional and text classification problems. From it, we can calculate the Cosine distance as 1 - Cosine Similarity. Here's the formula and a Python function for calculating Cosine distance:

Cosine Similarity:

Similarity = \frac{A \cdot B}{||A|| \cdot ||B||}

Cosine Distance:

Distance = 1 - Similarity = 1 - \frac{A \cdot B}{||A|| \cdot ||B||}

Python code for calculating Cosine Distance:
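A sketch built directly from the formula above (it assumes neither vector is all zeros):

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus the cosine of the angle between vectors a and b."""
    a, b = np.asarray(a), np.asarray(b)
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1 - similarity

print(cosine_distance([1, 0], [0, 1]))  # 1.0 -- orthogonal vectors
```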

Implementing Hierarchical Clustering

Next, we'll see how hierarchical clustering separates the dataset into clusters. The distance metric plays a key role in this process, determining the dissimilarity between data points. Let's adapt the agglomerative hierarchical clustering algorithm to accept different distance metrics as a parameter.

For that purpose, we'll modify the distance matrix calculation function to accept a distance metric as an argument. Here's the Python code for the distance matrix calculation:
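One way to sketch that function (the name distance_matrix is our own; it expects one of the metric functions defined above):

```python
import numpy as np

def distance_matrix(data, metric):
    """Build a symmetric pairwise distance matrix using the given metric."""
    n = len(data)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = metric(data[i], data[j])
    return matrix
```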

Similarly, we can tweak the agglomerative clustering function to accept a distance metric as an argument. Here's the Python code for the agglomerative clustering algorithm:
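A minimal sketch follows; it uses single linkage (the distance between two clusters is the distance between their closest members), which is one common choice and may differ from the lesson's original version:

```python
def agglomerative_clustering(data, n_clusters, metric):
    """Merge the two closest clusters until n_clusters remain.
    Returns a list of clusters, each a list of point indices."""
    clusters = [[i] for i in range(len(data))]
    dist = distance_matrix(data, metric)
    while len(clusters) > n_clusters:
        best, best_dist = (0, 1), float("inf")
        # Find the pair of clusters with the smallest single-linkage distance
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i, j] for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_dist, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])  # merge the closest pair
        del clusters[b]
    return clusters
```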

Here, we've written a Python function, agglomerative_clustering, which implements agglomerative hierarchical clustering on a given dataset.

Studying the Impact of Distance Metrics

Let's first define the dataset that we will use for the clustering:
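As a stand-in for the lesson's data, here is a small synthetic 2-D dataset with two loose groups (the values are placeholders):

```python
import numpy as np

np.random.seed(42)
data = np.vstack([
    np.random.randn(10, 2),            # group around (0, 0)
    np.random.randn(10, 2) + [5, 5],   # group around (5, 5)
])
```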

Next, we can perform clustering with different distance methods:
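For example, using the functions sketched above:

```python
results = {
    name: agglomerative_clustering(data, n_clusters=2, metric=fn)
    for name, fn in [
        ("Euclidean", euclidean_distance),
        ("Manhattan", manhattan_distance),
        ("Cosine", cosine_distance),
    ]
}
```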

Lastly, let's see how different distance measures affect the result of hierarchical clustering by visualizing the clustering results:
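One way to plot the three results side by side with matplotlib:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, clusters) in zip(axes, results.items()):
    for cluster in clusters:
        points = data[cluster]           # points belonging to this cluster
        ax.scatter(points[:, 0], points[:, 1])
    ax.set_title(f"{name} distance")
plt.show()
```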

Comparing the plots shows how different distance measures change the clustering outcomes.

[Plot: clustering results of the custom implementation under each distance metric]

Configuring Distance Metrics with Sklearn

Similarly, we can set different distance metrics when using Sklearn's AgglomerativeClustering model. Let's try it out.
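A sketch of how that might look (in recent scikit-learn versions the parameter is called metric; older releases used affinity, and non-Euclidean metrics require a linkage other than 'ward'):

```python
from sklearn.cluster import AgglomerativeClustering

sk_results = {}
for metric in ["euclidean", "manhattan", "cosine"]:
    model = AgglomerativeClustering(n_clusters=2, metric=metric,
                                    linkage="average")
    sk_results[metric] = model.fit_predict(data)  # cluster label per point
```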

If we plot the result the same way as in the custom implementation, we'll have the following result:
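For instance:

```python
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (metric, labels) in zip(axes, sk_results.items()):
    ax.scatter(data[:, 0], data[:, 1], c=labels)  # color points by cluster label
    ax.set_title(f"{metric} (sklearn)")
plt.show()
```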

[Plot: Sklearn clustering results under each distance metric]

Lesson Summary and Practice

Excellent work! You've just mastered distance metrics and their importance in hierarchical clustering. You've implemented these metrics in Python and applied them in the agglomerative clustering algorithm. Finally, you studied the impact of these distance metrics on the clustering results. Next, get ready to solidify this knowledge through related exercises!
