DBCVI - Density-Based Clustering Validation Index
The Density-Based Clustering Validation Index (DBCVI) index is an internal evaluation metric specifically designed for density-based clustering algorithms (such as DBSCAN or OPTICS). Unlike traditional metrics (e.g., Silhouette, Davies-Bouldin) which assume spherical clusters and rely on centroids, DBCVI evaluates arbitrary-shaped clusters by analyzing the density connectedness of the data points using Minimum Spanning Trees (MST).
Intuitively, DBCVI answers the question: “Are the clusters truly dense continuous regions separated by areas of lower density, regardless of their geometric shape?”
Mathematical Formulation
DBCVI operates entirely without cluster centroids. Instead, it relies on the concept of Mutual Reachability Distance (MRD) to construct a shape-aware density graph.
1. All-Points Core Distance To estimate the local density of an object \(o\) within its cluster \(C_i\), DBCVI calculates the all-points core distance:
Where \(d\) is the dimensionality of the feature space, and the summation runs over all other points in the same cluster.
2. Mutual Reachability Distance (MRD) The distance between two objects considering their density properties is defined as:
3. Density Sparseness and Separation A Minimum Spanning Tree (\(\text{MST}_{MRD}\)) is constructed for each cluster using the MRD.
Density Sparseness (DSC): The maximum edge weight of the internal edges in the \(\text{MST}_{MRD}\) of cluster \(C_i\). It represents the lowest density (sparsest) region inside the cluster.
Density Separation (DSPC): The minimum MRD between the internal nodes of cluster \(C_i\) and cluster \(C_j\). It represents the highest density region between the clusters.
4. Cluster Validity
The validity of a single cluster \(C_i\) is bounded in [-1, 1]:
5. Global DBCVI Index The final score is the weighted average of all cluster validities, heavily penalizing unclustered noise objects (\(|O|\) includes noise):
Handling Edge Cases & API
Noise Objects: Standard density algorithms often label outliers as noise (typically
-1). DBCVI naturally handles noise by excluding noise points from the MST construction, but penalizing the final score by scaling it down by the ratio of clustered points to total points.force_finite (bool): If
True, catches mathematical edge cases (e.g., fewer than 2 valid clusters) and returns a safe fallback value. Default isTrue.finite_value (float): The fallback value returned when calculation fails. Default is
0.0.- return_type (str): Controls the output format of the evaluation.
"global"(Default): Returns a singlefloatrepresenting the weighted overall DBCV score."per-cluster": Returns adictmapping each valid cluster label to its individual validity score. Useful for debugging which specific cluster is underperforming."both": Returns a tuple containing(global_score, per_cluster_dict).
Properties
Best possible score:
1.0(Indicates perfectly dense clusters perfectly separated by empty space).Worst possible score:
-1.0(Indicates that the density between clusters is higher than the density within clusters).Shape Invariance: Effectively evaluates non-convex, elongated, or arbitrarily shaped clusters.
Example Usage
from permetrics.clustering import ClusteringMetric
import numpy as np
# ==============================================================================
# SCENARIO: Density-based evaluation with noise handling
# ==============================================================================
# Assume a dataset where DBSCAN grouped 90 points into 2 clusters,
# and labeled 10 outlier points as noise (-1)
X = np.random.rand(100, 2)
y_pred = np.array([0]*45 + [1]*45 + [-1]*10)
cm = ClusteringMetric(X=X, y_pred=y_pred)
# 1. Global Score (Default)
dbcv_global = cm.DBCV()
print(f"Global DBCV: {dbcv_global:.4f}")
# 2. Per-cluster Validity Scores
dbcv_clusters = cm.DBCV(return_type="per-cluster")
print(f"Per-cluster Scores: {dbcv_clusters}")
# Output example: {0: 0.8123, 1: 0.7501}
# 3. Extracting Both Simultaneously
gb_score, cluster_dict = cm.DBCV(return_type="both")