Introduction

Greetings! Our journey into K-means clustering deepens as we explore two crucial elements: the selection of the number of clusters and the initialization of centroids. Our aim is to comprehend these aspects and put them into action using Python. Let's move forward!

Choosing Clusters and Initializing Centroids in K-means

The K in K-means signifies the number of clusters. Centroids, the centers of each cluster, are equally significant. Their initial placement in K-means is crucial. Poorly initialized centroids can lead to sub-optimal clustering — a reason why multiple runs with different initial placements are essential. This highlights the importance of choosing both the number of clusters and their initial centroids.

Revisiting the K-means Algorithm

Sklearn's KMeans not only allows us to specify the number of clusters and maximum iterations but also provides an important parameter, init, where we can set the initial centroids to be used.

Let's import the KMeans class from the scikit-learn library and see how we can initialize our centroids there.
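A minimal sketch of this setup is shown below; the specific centroid coordinates are illustrative assumptions, not values from the original lesson:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
import numpy as np

# Load the Iris dataset; we keep only the first two features
# so the clusters are easy to plot in 2-D
data = load_iris().data[:, :2]

# Manually chosen starting positions for the three centroids
# (illustrative values; any sensible points in the feature space work)
initial_centroids = np.array([[4.5, 3.0], [5.8, 2.7], [7.0, 3.2]])

kmeans = KMeans(n_clusters=3, init=initial_centroids, n_init=1, max_iter=100)
```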

In the code above, n_clusters sets the total number of clusters, and init is an optional parameter that accepts the initial centroid positions. By setting n_init=1, we disable sklearn's built-in multiple runs with different centroid seeds, because we want to use our manually initialized centroids.

After defining the KMeans object with our specified parameters, we fit the model to our Iris dataset. kmeans.fit(data) computes K-means clustering using our data and initial centroid positions:
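Continuing the sketch above, where data is the Iris feature matrix:

```python
# Run the clustering using our manually supplied starting centroids
kmeans.fit(data)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final coordinates of each centroid
```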

Here, kmeans.labels_ gives us the labels of each point, and kmeans.cluster_centers_ provides the coordinates of cluster centers. Like before, we represent the data points in different clusters by colors, and the centroids are marked in red. See how easy it is to use sklearn's KMeans once we understand the underlying theory! Using Python libraries like this enhances our efficiency and saves time, especially when working on more complex projects.
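The lesson's plotting code isn't shown in this excerpt; one plausible way to produce such a plot with matplotlib (the styling choices are assumptions) is:

```python
import matplotlib.pyplot as plt

# Color each point by its assigned cluster
plt.scatter(data[:, 0], data[:, 1], c=kmeans.labels_, cmap='viridis')

# Mark the final centroids in red
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            color='red', marker='X', s=150)

plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.show()
```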

Understand the Implications

As we've seen, varied initial centroids and different choices for the number of clusters can lead to different results. Sklearn's KMeans with init='k-means++' (the default) helps prevent the poor initializations that lead to inferior clustering: it chooses initial centroids that are (generally) distant from each other, producing better results on average.
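For reference, a minimal sketch of relying on this default (the init argument is written out only for clarity):

```python
from sklearn.cluster import KMeans

# init='k-means++' is already the default; n_init=10 runs the algorithm
# ten times with different centroid seeds and keeps the best result
kmeans_pp = KMeans(n_clusters=3, init='k-means++', n_init=10)
```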

Selection of the Number of Clusters

Let's delve deeper into the topic of selecting the number of clusters. The decision of how many clusters to use can drastically influence the outcomes of your K-means clustering algorithm.

Consider two scenarios where we use the dataset introduced earlier but choose two clusters (k=2) in one case and three clusters (k=3) in the other.

Let's run the algorithm for k=2:
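The dataset from earlier in the lesson isn't reproduced in this excerpt, so the nine 2-D points below are assumed stand-ins, arranged in three loose groups of three:

```python
import numpy as np
from sklearn.cluster import KMeans

# Nine points forming three natural groups of three
# (stand-in values; the lesson's original dataset is not shown here)
points = np.array([
    [1.0, 1.0], [1.5, 2.0], [2.0, 1.5],        # group 1
    [5.0, 5.0], [5.5, 6.0], [6.0, 5.5],        # group 2
    [10.0, 10.0], [10.5, 11.0], [11.0, 10.5],  # group 3
])

kmeans_2 = KMeans(n_clusters=2, n_init=10).fit(points)
print(kmeans_2.labels_)
```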

In this case, the algorithm will group the first six points into one cluster and the last three into another.

Now let's see how the clusters change when we run the algorithm for k=3:
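Using the same stand-in points:

```python
kmeans_3 = KMeans(n_clusters=3, n_init=10).fit(points)
print(kmeans_3.labels_)
```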

In this case, the algorithm will group the first three points into one cluster, the next three into a second cluster, and the last three into a third cluster.

Now let's illustrate the two clusterings side by side:
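A plausible way to draw the comparison (the plotting details are assumptions):

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, model, title in zip(axes, [kmeans_2, kmeans_3], ['k=2', 'k=3']):
    ax.scatter(points[:, 0], points[:, 1], c=model.labels_, cmap='viridis')
    ax.scatter(model.cluster_centers_[:, 0], model.cluster_centers_[:, 1],
               color='red', marker='X', s=150)
    ax.set_title(title)
plt.show()
```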

We will see the following plot:

[Image: the two clusterings plotted side by side, k=2 on the left and k=3 on the right]

These examples illustrate the significant role the number of clusters plays in forming the final clusters. We must choose this number carefully so that it accurately represents the underlying structure of our data. An incorrect number of clusters can lead to overfitting or underfitting, both of which misrepresent the data.

Centroid Initialization: Potential Pitfalls and Solutions

You may wonder, "Why can the initial centroid placement result in different clustering results?" Well, the K-means algorithm is an iterative procedure that minimizes the within-cluster sum of squares. However, it only guarantees finding a local minimum, not a global one. This implies that different starting positions can lead to distinct clustering outcomes.

To visualize this, imagine you're blindfolded in a hilly region and tasked with finding the lowest point. By feeling the slope of the ground, you move downwards. But when there are many valleys (local minima), your starting position influences which valley you'll end up in, and not all valleys are equally deep. Initial centroids in K-means are akin to starting positions.

Fortunately, real-world applications rarely suffer from K-means' infamous local-minima issue, and Python libraries such as scikit-learn handle these concerns proficiently. In particular, scikit-learn's KMeans uses an intelligent initialization technique called "k-means++" by default. This approach systematically finds a good set of initial centroids, reducing the likelihood of poor clustering due to unlucky centroid initialization. It's worth mentioning that crafting an example that demonstrates sensitivity to initial centroid placement is not straightforward, precisely because of this clever initialization.

Nonetheless, it's still good to be aware of the importance of initial centroids: in more intricate clustering methods, centroid initialization may significantly impact results. We can illustrate this with a custom implementation of K-means, which is deliberately basic and therefore very sensitive to the choice of initial centroids.

To do that, let's first prepare the data for our illustration:
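The original preparation code isn't shown here; a plausible sketch uses sklearn's make_blobs to generate 2-D data with a clear three-cluster structure:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three well-separated blobs
# (an assumed stand-in for the lesson's data)
data, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.8, random_state=42)
```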

Now, let's revisit our custom implementation from the beginning of this lesson:
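That implementation isn't reproduced in this excerpt, so the sketch below is a reconstruction of a basic K-means loop matching the description that follows; in particular, it accepts the initial centroids as a parameter:

```python
import numpy as np

def kmeans_clustering(data, initial_centroids, max_iterations=100):
    centroids = np.array(initial_centroids, dtype=float)
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid
        distances = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = np.argmin(distances, axis=1)

        # Update step: move each centroid to the mean of its points
        # (a centroid with no assigned points keeps its current position)
        new_centroids = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])

        # Stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```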

The code above is the K-means clustering implementation we used in previous units, with one small difference: we now pass the initial centroids as a parameter to the kmeans_clustering function. Now, let's perform clustering with different initial centroids:
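The starting centroids below are illustrative choices, one spread-out set and one clumped set, picked to show how the basic algorithm can converge to different results:

```python
import numpy as np
import matplotlib.pyplot as plt

# Two different sets of starting centroids (illustrative values)
spread_start = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
clumped_start = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])

labels_a, centroids_a = kmeans_clustering(data, spread_start)
labels_b, centroids_b = kmeans_clustering(data, clumped_start)

# Plot the two outcomes side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, labels, centroids, title in [
    (axes[0], labels_a, centroids_a, 'Spread-out initial centroids'),
    (axes[1], labels_b, centroids_b, 'Clumped initial centroids'),
]:
    ax.scatter(data[:, 0], data[:, 1], c=labels, cmap='viridis')
    ax.scatter(centroids[:, 0], centroids[:, 1], color='red', marker='X', s=150)
    ax.set_title(title)
plt.show()
```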

The visualization below illustrates the significance of initial centroids:

[Image: clustering results from two different centroid initializations, shown side by side]

Lesson Summary and Practice

We have journeyed through the principles of choosing clusters and initializing centroids in K-means while traversing the Python universe. Without a doubt, your understanding has deepened. Armed with this new knowledge, you're ready to tackle the exciting exercises that lie ahead. Practice will help you consolidate and master what you've learned in this lesson. Onwards, to more learning!
