SI - Silhouette Index
=====================

.. toctree::
   :maxdepth: 3

.. contents:: Table of Contents
   :local:
   :depth: 2

The **Silhouette Index (SI)** :cite:`rousseeuw1987silhouettes` is a widely used internal clustering evaluation metric. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).

Intuitively, SI answers the question: *"How well does each data point fit into its assigned cluster compared to the next best alternative cluster?"* A high silhouette value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

.. math::

   s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}

Where:

* :math:`a(i)` is the mean distance between data point :math:`i` and all other points in the same cluster (intra-cluster distance).
* :math:`b(i)` is the smallest mean distance from data point :math:`i` to the points of any cluster that :math:`i` does not belong to (nearest-cluster distance).
* The global Silhouette Index is the mean of the silhouette widths :math:`s(i)` over all data points.

-------------------------------------------------------------------------------

Algorithmic Variations (Memory Optimization)
--------------------------------------------

Computing the exact Silhouette Index naively requires materializing the full pairwise distance matrix, which takes :math:`O(N^2)` memory. To avoid Out-Of-Memory (OOM) errors on large datasets (e.g., :math:`N > 100{,}000`), this implementation uses a **chunk-based processing** strategy.

* **chunk_size (int):** Processes the pairwise distances in bounded batches (default: ``5000``). This caps peak memory use, and the result is mathematically identical to that of the standard full-matrix approach.

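
The chunking idea can be illustrated with a short NumPy sketch. It is not the library's actual implementation: the function name ``silhouette_samples_chunked`` is hypothetical, Euclidean distance is assumed, and singleton clusters are given a width of ``0`` by convention. Only a ``(chunk_size, N)`` block of distances exists in memory at any time.

.. code-block:: python

   import numpy as np


   def silhouette_samples_chunked(X, labels, chunk_size=5000):
       """Per-sample silhouette widths (Euclidean), computed block by block."""
       X = np.asarray(X, dtype=float)
       labels = np.asarray(labels)
       n = X.shape[0]
       uniq, inv = np.unique(labels, return_inverse=True)
       k = uniq.size
       if k < 2:
           raise ValueError("the silhouette needs at least two clusters")

       counts = np.bincount(inv, minlength=k).astype(float)
       onehot = np.eye(k)[inv]              # (n, k) cluster membership
       sq_norms = (X ** 2).sum(axis=1)
       scores = np.zeros(n)

       for start in range(0, n, chunk_size):
           stop = min(start + chunk_size, n)
           rows = np.arange(stop - start)
           block = X[start:stop]

           # Distances from this block of rows to all n points: (chunk, n).
           sq = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * block @ X.T
           dist = np.sqrt(np.maximum(sq, 0.0))
           dist[rows, start + rows] = 0.0   # remove rounding noise on self-distances

           sums = dist @ onehot             # summed distance to each cluster
           own = inv[start:stop]
           own_count = counts[own]

           # a(i): mean distance to the other members of the point's own cluster.
           a = sums[rows, own] / np.maximum(own_count - 1.0, 1.0)

           # b(i): smallest mean distance to any other cluster.
           means = sums / counts
           means[rows, own] = np.inf
           b = means.min(axis=1)

           denom = np.maximum(a, b)
           s = np.divide(b - a, denom, out=np.zeros_like(denom), where=denom > 0)
           s[own_count == 1] = 0.0          # singleton clusters
           scores[start:stop] = s

       return scores

Because each row's width depends only on that row's distances, the per-sample values do not depend on ``chunk_size``; their mean gives the global index.
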
-------------------------------------------------------------------------------

Handling Edge Cases (Finite Values)
-----------------------------------

The Silhouette Index is mathematically undefined when there is only one cluster (:math:`K = 1`). In this scenario there is no other cluster, so the separation term :math:`b(i)` cannot be computed.

* **force_finite (bool):** If ``True``, the undefined case is caught and a fallback value is returned instead. Default is ``True``.
* **finite_value (float):** The fallback value returned when ``force_finite=True``. Since the worst possible silhouette score is -1, the default fallback is a penalty value of ``-1.0``.
* **multi_output (bool):** If ``True``, the function returns an array of silhouette scores, one per data point, instead of a single global mean.

-------------------------------------------------------------------------------

Properties
----------

* **Best possible score:** ``1.0`` (Points are perfectly clustered and far away from neighboring clusters).
* **Worst possible score:** ``-1.0`` (Points are consistently assigned to the wrong clusters).
* **Values near 0:** Indicate overlapping clusters where points are situated on the decision boundary between two groups.
* **Range:** ``[-1.0, 1.0]``
* **References:** Scikit-Learn Silhouette Score (``sklearn.metrics.silhouette_score``)

-------------------------------------------------------------------------------

Example Usage
-------------

.. code-block:: python
   :emphasize-lines: 12,14,23,32,35

   from permetrics.clustering import ClusteringMetric
   import numpy as np

   # ==============================================================================
   # SCENARIO 1: Normal Evaluation (Global Mean Silhouette)
   # ==============================================================================
   print("--- 1. BASIC SILHOUETTE INDEX EXAMPLE ---")

   X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
   y_pred_labels = np.array([0, 0, 0, 1, 1, 1])

   cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
   # Calculates the global mean silhouette score
   si_score = cm.SI()
   print(f"Silhouette Index: {si_score}")

   # ==============================================================================
   # SCENARIO 2: Multi-output (Per-sample Silhouette Scores)
   # ==============================================================================
   print("\n--- 2. PER-SAMPLE SILHOUETTE SCORES ---")

   # Returns an array containing the silhouette score for each data point
   si_samples = cm.SI(multi_output=True)
   print(f"Silhouette Scores per sample:\n{si_samples}")

   # ==============================================================================
   # SCENARIO 3: Edge Case with 1 Cluster
   # ==============================================================================
   print("\n--- 3. EDGE CASE (1 CLUSTER) EXAMPLE ---")
   y_pred_single = np.array([0, 0, 0, 0, 0, 0])

   cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)

   # Returns the penalty finite_value (-1.0) instead of crashing
   si_safe = cm_single.SI(force_finite=True, finite_value=-1.0)
   print(f"SI with 1 cluster (Safe Mode): {si_safe}")
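
As a sanity check on the definition, the silhouette width of the first point :math:`(1, 2)` in ``X_data`` can be worked out by hand. The calculation below assumes Euclidean distance, which is an assumption of this illustration and not a statement about the library's default metric. Its own cluster contains :math:`(1, 4)` and :math:`(1, 0)`; the other cluster contains :math:`(10, 2)`, :math:`(10, 4)` and :math:`(10, 0)`.

.. math::

   a(1) &= \frac{2 + 2}{2} = 2 \\
   b(1) &= \frac{9 + \sqrt{85} + \sqrt{85}}{3} \approx 9.1464 \\
   s(1) &= \frac{b(1) - a(1)}{\max(a(1), b(1))} \approx \frac{9.1464 - 2}{9.1464} \approx 0.7813

The value is close to 1, which matches the intuition that this point sits well inside its own cluster and far from the other one.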