Welcome to the Cluster Performance Unveiled lesson! Here, we use Silhouette Scores, the Davies-Bouldin Index, and Cross-Tabulation Analysis to assess DBSCAN, a widely used density-based clustering algorithm. Exciting, right?
DBSCAN is especially useful when the number of clusters is unknown and density is a key factor in cluster formation. In R, the dbscan package provides a straightforward way to apply the DBSCAN algorithm.
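A minimal sketch of such a fit follows. The built-in iris data and the eps and minPts values are assumptions chosen for illustration, not settings prescribed by this lesson; in practice both parameters need tuning for your data.

```r
# Sketch: fit DBSCAN with the dbscan package.
# Assumptions (illustrative only): iris as example data, eps = 0.5, minPts = 5.
library(dbscan)

x <- as.matrix(iris[, 1:4])   # numeric feature columns only

# dbscan::dbscan() avoids a name clash if another package also defines dbscan()
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)

print(db)          # summary: number of clusters and noise points
head(db$cluster)   # integer labels, one per row; 0 marks noise
```

The fitted object stores one label per observation in db$cluster, which is what the evaluation steps later in the lesson work with.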
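In the dbscan() function, eps specifies the radius of the neighborhood around each point, and minPts sets the minimum number of points (including the point itself) that must fall within that radius for the point to count as a core point of a dense region.
After fitting the algorithm, we need a quantitative measure of clustering quality. The Silhouette Score is a widely used indicator that reflects how similar each point is to its own cluster compared to other clusters. It ranges from -1 to 1: values near 1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.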
To calculate the Silhouette Score in R, we use the cluster package:
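A minimal sketch of that calculation, again assuming the iris data and illustrative eps and minPts values (neither comes from this lesson). Noise points (label 0) are dropped first, and the score is only computed when at least two clusters remain, because the silhouette is undefined otherwise.

```r
# Sketch: average Silhouette Score for a DBSCAN result, noise excluded.
# Assumptions (illustrative only): iris data, eps = 0.5, minPts = 5.
library(dbscan)
library(cluster)

x  <- as.matrix(iris[, 1:4])
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)

keep   <- db$cluster != 0          # TRUE for points assigned to a cluster
labels <- db$cluster[keep]

if (length(unique(labels)) >= 2) {
  # silhouette() takes integer labels and a dissimilarity object
  sil     <- silhouette(labels, dist(x[keep, , drop = FALSE]))
  avg_sil <- mean(sil[, "sil_width"])
} else {
  avg_sil <- NA_real_              # fewer than two clusters: score undefined
}

print(avg_sil)
```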
Why exclude noise points (cluster 0)?
In DBSCAN, points labeled as cluster 0 are considered "noise": they do not belong to any cluster. If these points were left in, the Silhouette Score calculation would treat label 0 as one more cluster, even though its members are scattered and may be far from each other and from every real cluster. This would typically lower the average Silhouette Score and would not reflect the quality of the actual clusters. By excluding noise points, the Silhouette Score measures only the cohesion and separation of the clusters DBSCAN actually found.
A higher Silhouette Score indicates that clusters are dense and well separated.
The Davies-Bouldin Index is another important metric for evaluating clustering quality. It measures the average similarity between each cluster and its most similar cluster, with lower values indicating better clustering. In R, you can compute this index using the clusterSim or clusterCrit packages.
Here is an example using the clusterSim package:
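A minimal sketch using index.DB() from clusterSim, under the same assumptions as before (iris data and illustrative eps and minPts values, none of which come from this lesson). Noise points are removed before the index is computed, and the index is skipped when fewer than two clusters remain.

```r
# Sketch: Davies-Bouldin Index for a DBSCAN result, noise excluded.
# Assumptions (illustrative only): iris data, eps = 0.5, minPts = 5.
library(dbscan)
library(clusterSim)

x  <- as.matrix(iris[, 1:4])
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)

keep   <- db$cluster != 0          # drop noise points (label 0)
labels <- db$cluster[keep]

if (length(unique(labels)) >= 2) {
  # index.DB() returns a list; the index itself is in the $DB element
  db_index <- index.DB(x[keep, , drop = FALSE], labels,
                       centrotypes = "centroids")$DB
} else {
  db_index <- NA_real_             # index needs at least two clusters
}

print(db_index)
```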
Why exclude noise points (cluster 0)?
Noise points are not part of any cluster, so they have no place in the within-cluster scatter or the between-cluster distances that the Davies-Bouldin Index compares. If label 0 were passed along, the index would treat the scattered noise points as one more, very diffuse cluster, which would typically inflate the value and make it misleading. Excluding noise points ensures that the Davies-Bouldin Index reflects only the quality of the clusters that DBSCAN has identified.
A lower Davies-Bouldin Index suggests better separation between clusters.
To further evaluate the clustering, we can perform a Cross-Tabulation Analysis to compare the clustering results with actual labels (if available):
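A minimal sketch with base R's table(), assuming the iris data (whose Species column stands in for the actual labels) and illustrative eps and minPts values, none of which come from this lesson. Noise points are kept here so that they appear in their own row.

```r
# Sketch: cross-tabulate DBSCAN labels against known labels.
# Assumptions (illustrative only): iris data, eps = 0.5, minPts = 5.
library(dbscan)

x  <- as.matrix(iris[, 1:4])
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)

# Rows: DBSCAN labels (0 = noise); columns: true species
cross_tab <- table(cluster = db$cluster, actual = iris$Species)

print(cross_tab)
```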
Cross-Tabulation Analysis produces a matrix that compares the clustering assignments to the actual labels, helping assess the clustering performance. Here, the row for cluster 0 shows how many noise points correspond to each true label, which can be useful for understanding what kind of points DBSCAN considers as noise.
When interpreting these metrics, remember that noise points (cluster 0) are excluded from the Silhouette Score and Davies-Bouldin Index calculations so that both metrics reflect the quality of the clusters themselves. A high Silhouette Score indicates cohesive, well-separated clusters, while a lower Davies-Bouldin Index suggests better cluster separation. In Cross-Tabulation, cluster numbers are arbitrary, so the diagonal is not automatically the set of correct assignments; instead, look for rows dominated by a single true label, which indicate clusters that match that label well. The row for cluster 0 shows the distribution of noise points across true labels.
