DBCVI - Density-Based Clustering Validation Index ================================================= .. toctree:: :maxdepth: 3 .. contents:: Table of Contents :local: :depth: 2 The **Density-Based Clustering Validation Index (DBCVI)** index is an internal evaluation metric specifically designed for density-based clustering algorithms (such as DBSCAN or OPTICS). Unlike traditional metrics (e.g., Silhouette, Davies-Bouldin) which assume spherical clusters and rely on centroids, DBCVI evaluates arbitrary-shaped clusters by analyzing the density connectedness of the data points using Minimum Spanning Trees (MST). Intuitively, DBCVI answers the question: "Are the clusters truly dense continuous regions separated by areas of lower density, regardless of their geometric shape?" ------------------------------------------------------------------------------- Mathematical Formulation ------------------------ DBCVI operates entirely without cluster centroids. Instead, it relies on the concept of **Mutual Reachability Distance (MRD)** to construct a shape-aware density graph. **1. All-Points Core Distance** To estimate the local density of an object :math:`o` within its cluster :math:`C_i`, DBCVI calculates the all-points core distance: .. math:: a_{pts\_coredist}(o) = \left( \frac{\sum_{k=2}^{n_i} \left( KNN(o, k) \right)^d}{n_i - 1} \right)^{1/d} Where :math:`d` is the dimensionality of the feature space, and the summation runs over all other points in the same cluster. **2. Mutual Reachability Distance (MRD)** The distance between two objects considering their density properties is defined as: .. math:: d_{mreach}(o_i, o_j) = \max \left( a_{pts\_coredist}(o_i), a_{pts\_coredist}(o_j), d(o_i, o_j) \right) **3. Density Sparseness and Separation** A Minimum Spanning Tree (:math:`\text{MST}_{MRD}`) is constructed for each cluster using the MRD. * **Density Sparseness (DSC):** The maximum edge weight of the internal edges in the :math:`\text{MST}_{MRD}` of cluster :math:`C_i`. It represents the lowest density (sparsest) region *inside* the cluster. * **Density Separation (DSPC):** The minimum MRD between the internal nodes of cluster :math:`C_i` and cluster :math:`C_j`. It represents the highest density region *between* the clusters. **4. Cluster Validity** The validity of a single cluster :math:`C_i` is bounded in ``[-1, 1]``: .. math:: V_C(C_i) = \frac{\min_{j \neq i} (DSPC(C_i, C_j)) - DSC(C_i)}{\max \left( \min_{j \neq i} (DSPC(C_i, C_j)), DSC(C_i) \right)} **5. Global DBCVI Index** The final score is the weighted average of all cluster validities, heavily penalizing unclustered noise objects (:math:`|O|` includes noise): .. math:: \text{DBCVI}(C) = \sum_{i=1}^{l} \frac{|C_i|}{|O|} V_C(C_i) ------------------------------------------------------------------------------- Handling Edge Cases & API ------------------------- * **Noise Objects:** Standard density algorithms often label outliers as noise (typically ``-1``). DBCVI naturally handles noise by excluding noise points from the MST construction, but penalizing the final score by scaling it down by the ratio of clustered points to total points. * **force_finite (bool):** If ``True``, catches mathematical edge cases (e.g., fewer than 2 valid clusters) and returns a safe fallback value. Default is ``True``. * **finite_value (float):** The fallback value returned when calculation fails. Default is ``0.0``. * **return_type (str):** Controls the output format of the evaluation. * ``"global"`` (Default): Returns a single ``float`` representing the weighted overall DBCV score. * ``"per-cluster"``: Returns a ``dict`` mapping each valid cluster label to its individual validity score. Useful for debugging which specific cluster is underperforming. * ``"both"``: Returns a tuple containing ``(global_score, per_cluster_dict)``. ------------------------------------------------------------------------------- Properties ---------- * **Best possible score:** ``1.0`` (Indicates perfectly dense clusters perfectly separated by empty space). * **Worst possible score:** ``-1.0`` (Indicates that the density between clusters is higher than the density within clusters). * **Shape Invariance:** Effectively evaluates non-convex, elongated, or arbitrarily shaped clusters. * **References:** `Moulavi, D., Jaskowiak, P. A., Campello, R. J., Zimek, A., & Sander, J. (2014, April). Density-based clustering validation. In Proceedings of the 2014 SIAM international conference on data mining (pp. 839-847). Society for Industrial and Applied Mathematics. `_ ------------------------------------------------------------------------------- Example Usage ------------- .. code-block:: python :emphasize-lines: 12,15,19,24 from permetrics.clustering import ClusteringMetric import numpy as np # ============================================================================== # SCENARIO: Density-based evaluation with noise handling # ============================================================================== # Assume a dataset where DBSCAN grouped 90 points into 2 clusters, # and labeled 10 outlier points as noise (-1) X = np.random.rand(100, 2) y_pred = np.array([0]*45 + [1]*45 + [-1]*10) cm = ClusteringMetric(X=X, y_pred=y_pred) # 1. Global Score (Default) dbcv_global = cm.DBCV() print(f"Global DBCV: {dbcv_global:.4f}") # 2. Per-cluster Validity Scores dbcv_clusters = cm.DBCV(return_type="per-cluster") print(f"Per-cluster Scores: {dbcv_clusters}") # Output example: {0: 0.8123, 1: 0.7501} # 3. Extracting Both Simultaneously gb_score, cluster_dict = cm.DBCV(return_type="both")