Evaluating K-means Clustering

Introduction and Overview

Welcome back! In this lesson, we're seeking a more in-depth understanding of the K-means clustering algorithm by using a straightforward 2D dataset. We'll explore its implementation and evaluate its performance using a well-known measure of clustering accuracy: the Rand Index.

Understanding the Rand Index

As we progress, we will delve into the Rand Index, an external cluster validation measure that determines the similarity between two clustering structures. The Rand Index accounts for all pairs of samples and counts pairs that are assigned to the same or different clusters in the predicted and true clustering.

$RI = \frac{TP + TN} {TP+FP+FN + TN}$

Rand Index vs Adjusted Rand Index

Now, let's discuss an important distinction: the difference between the Rand Index and the Adjusted Rand Index. While the Rand Index gives an absolute measure of the similarity between two data samples, it doesn't take into account the chance groupings that might occur. In other words, the Rand Index may yield a high value due to randomness in the dataset, which is certainly not how we want to evaluate the performance of our algorithm.

The Adjusted Rand Index corrects the Rand Index by taking into account the expected similarity of two random data samples. The Adjusted Rand Index is given by:

$ARI = \frac {RI - Expected\_RI} {Max\_RI - Expected\_RI}$

Evaluating K-means with the Adjusted Rand Index in R

Now that we have learned about the Rand Index and Adjusted Rand Index, it's also beneficial to familiarize ourselves with some of the R packages that provide similar functionality. The mclust package in R offers the function adjustedRandIndex, which computes the Adjusted Rand Index for comparing two clusterings.

The adjustedRandIndex function takes two vectors as input: the true labels and the predicted cluster labels. It returns a numeric value representing the Adjusted Rand Index, which ranges from -1 (no agreement) to 1 (perfect agreement), with 0 indicating random labeling.

Here is how you can use the adjustedRandIndex function in R:

In the above snippet, we load the mclust package and use adjustedRandIndex to compute the Adjusted Rand Index. The inputs to the function are the true labels and the labels predicted by K-means. The function returns a numeric value representing the Adjusted Rand Index of the predicted clusters.

Just like in the Rand Index calculation, a higher Adjusted Rand Index means that our K-means algorithm has done a great job clustering.

Full Implementation: K-means and Evaluating with the Adjusted Rand Index in R

With all the pieces at hand, let's put everything together. We'll perform K-means clustering on our toy dataset using R's kmeans function, then evaluate the results using the Adjusted Rand Index from the mclust package. We'll also visualize the clusters and their centers using ggplot2 for a more modern and flexible plotting approach.

First, we initialize the data and perform clustering:

Checking Results

Now let's calculate the Adjusted Rand Index using mclust and print the results:

Example output:

Visualizing the clusters

Visualizing the clusters and centroids using ggplot2:

Output:

Here, we have effectively encapsulated our prior discussions on implementing K-means clustering, applying the Adjusted Rand Index, and bringing the insights to life through visual representations. R's kmeans function simplifies the K-means process into merely defining the model, fitting it to the data, and performing evaluations. By reflecting this streamlined process, the code highlights the importance of understanding essential concepts, navigating packages, and connecting functions to their origins.

Lesson Summary and Practice

The exploration of the K-means algorithm and the proper use of the Rand Index and Adjusted Rand Index has provided us with significant insights in the realm of unsupervised learning. The next phase will involve practical applications, cementing your understanding of these crucial concepts. Your understanding of these concepts, like the K-means algorithm that we discussed, will improve through multiple iterations. Happy practicing!

Previous Lesson

Next Lesson: Choosing Clusters and Centroids

Join the 1M+ learners on CodeSignal

Be a part of our community of 1M+ users who develop and demonstrate their skills on CodeSignal