SI - Silhouette Index
The Silhouette Index (SI) [41] is a widely used internal clustering evaluation metric. It measures how similar an object is to its own cluster (cohesion) compared to other clusters (separation).
Intuitively, SI answers the question: “How well does each data point fit into its assigned cluster compared to the next best alternative cluster?” A high silhouette value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters.
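For each data point \(i\), the silhouette width is defined as:
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \]
Where: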
\(a(i)\) is the mean distance between data point \(i\) and all other points in the same cluster (intra-cluster distance).
\(b(i)\) is the smallest mean distance between data point \(i\) and all points in any other cluster, of which \(i\) is not a member (nearest-cluster distance).
The global Silhouette Index is the mean of the silhouette widths \(s(i)\) for all data points.
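As a minimal sketch of these definitions, the per-point widths can be computed directly from a full pairwise distance matrix with NumPy. The function name `silhouette_widths` is hypothetical and is not part of permetrics; the sketch assumes Euclidean distance and follows the common convention that a point in a singleton cluster gets \(s(i) = 0\).

```python
import numpy as np

def silhouette_widths(X, labels):
    """Per-point silhouette widths s(i), using a full N x N distance matrix."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    if clusters.size < 2:
        raise ValueError("the silhouette needs at least two clusters")
    # Full Euclidean distance matrix: only practical for small N.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a(i): mean distance to the other members of the same cluster
        # (D[i, i] = 0, so dividing by n_own - 1 excludes the point itself).
        a = D[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        # s(i) = (b(i) - a(i)) / max(a(i), b(i))
        s[i] = (b - a) / denom if denom > 0 else 0.0
    return s

X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1])
widths = silhouette_widths(X, labels)
print(widths)         # one s(i) per data point
print(widths.mean())  # global Silhouette Index
```

The loop is written for readability rather than speed; it mirrors the definitions of \(a(i)\) and \(b(i)\) one point at a time.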
Algorithmic Variations (Memory Optimization)
Calculating the exact Silhouette Score normally requires instantiating a full distance matrix, which has a space complexity of \(O(N^2)\). To prevent Out-Of-Memory (OOM) errors on large datasets (e.g., \(N > 100,000\)), this implementation processes the pairwise distances in chunks.
chunk_size (int): Number of rows of the pairwise distance matrix processed per batch (default: 5000). Only one batch of distances is held in memory at a time, which bounds RAM usage while producing the same result as the full-matrix approach.
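The chunking idea can be sketched as follows. This is an illustration under stated assumptions, not the permetrics source: the function name `silhouette_chunked` is hypothetical, Euclidean distance is assumed, and distances are built from squared norms so that only a `chunk_size` x N block exists at any time.

```python
import numpy as np

def silhouette_chunked(X, labels, chunk_size=5000):
    """Per-point silhouette widths using (chunk_size x N) distance blocks."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    clusters, inv = np.unique(labels, return_inverse=True)
    if clusters.size < 2:
        raise ValueError("the silhouette needs at least two clusters")
    n, k = len(X), len(clusters)
    counts = np.bincount(inv, minlength=k)
    onehot = np.zeros((n, k))
    onehot[np.arange(n), inv] = 1.0
    sq_norms = (X ** 2).sum(axis=1)
    s = np.zeros(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        # Squared distances for this block only, clipped against round-off.
        sq = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * X[start:stop] @ X.T
        D = np.sqrt(np.maximum(sq, 0.0))
        D[rows, np.arange(start, stop)] = 0.0  # exact zero self-distance
        sums = D @ onehot                      # summed distance to each cluster
        own = inv[start:stop]
        n_own = counts[own]
        a = sums[rows, own] / np.maximum(n_own - 1, 1)
        means = sums / counts
        means[rows, own] = np.inf              # exclude the point's own cluster
        b = means.min(axis=1)
        denom = np.maximum(a, b)
        width = np.where(denom > 0, (b - a) / np.where(denom > 0, denom, 1.0), 0.0)
        width[n_own == 1] = 0.0                # singleton clusters score 0
        s[start:stop] = width
    return s

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
labels_demo = rng.integers(0, 4, size=200)
small_chunks = silhouette_chunked(X_demo, labels_demo, chunk_size=7)
one_chunk = silhouette_chunked(X_demo, labels_demo, chunk_size=1000)
print(np.allclose(small_chunks, one_chunk, atol=1e-6))
```

Each row's \(a(i)\) and \(b(i)\) depend only on that row of the distance matrix, which is why splitting the rows into batches changes peak memory but not the scores (up to floating-point round-off).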
Handling Edge Cases (Finite Values)
The Silhouette Index is mathematically undefined when there is only one cluster (\(K = 1\)). In this scenario there is no other cluster to compare against, so the nearest-cluster distance \(b(i)\), and with it the separation term, cannot be computed.
force_finite (bool): If True, catches the undefined operation and returns a safe fallback. Default is True.
finite_value (float): The specific fallback value returned when force_finite=True. Since the worst possible silhouette score is -1, the default fallback is a penalty value of -1.0.
multi_output (bool): If True, the function returns an array of silhouette scores for each individual data point instead of a single global mean.
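The fallback behavior described above can be illustrated with a small guard function. This is a hypothetical sketch, not the library's code: `guarded_score` and `score_fn` are names invented for illustration, and raising a `ValueError` when `force_finite=False` is an assumption about what a strict mode might do.

```python
import numpy as np

def guarded_score(labels, score_fn, force_finite=True, finite_value=-1.0):
    """Hypothetical guard: evaluate score_fn only when the score is defined."""
    if np.unique(labels).size < 2:
        if force_finite:
            return finite_value  # penalty value instead of an error
        # Assumed strict behavior; the real library may act differently.
        raise ValueError("Silhouette Index is undefined for a single cluster")
    return score_fn()

labels_single = np.zeros(6, dtype=int)
fallback = guarded_score(labels_single, score_fn=lambda: 0.5)
print(fallback)  # -1.0
```

Returning the worst possible score keeps the metric usable as an objective in automated searches, where a single-cluster solution should be penalized rather than abort the run.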
Properties
Best possible score: 1.0 (Points are perfectly clustered and far away from neighboring clusters).
Worst possible score: -1.0 (Points are consistently assigned to the wrong clusters).
Values near 0: Indicate overlapping clusters where points are situated on the decision boundary between two groups.
Range: [-1.0, 1.0]
References: Scikit-Learn Silhouette Score
Example Usage
from permetrics.clustering import ClusteringMetric
import numpy as np
# ==============================================================================
# SCENARIO 1: Normal Evaluation (Global Mean Silhouette)
# ==============================================================================
print("--- 1. BASIC SILHOUETTE INDEX EXAMPLE ---")
X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y_pred_labels = np.array([0, 0, 0, 1, 1, 1])
cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
# Calculates the global mean silhouette score
si_score = cm.SI()
print(f"Silhouette Index: {si_score}")
# ==============================================================================
# SCENARIO 2: Multi-output (Per-sample Silhouette Scores)
# ==============================================================================
print("\n--- 2. PER-SAMPLE SILHOUETTE SCORES ---")
# Returns an array containing the silhouette score for each data point
si_samples = cm.SI(multi_output=True)
print(f"Silhouette Scores per sample:\n{si_samples}")
# ==============================================================================
# SCENARIO 3: Edge Case with 1 Cluster
# ==============================================================================
print("\n--- 3. EDGE CASE (1 CLUSTER) EXAMPLE ---")
y_pred_single = np.array([0, 0, 0, 0, 0, 0])
cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)
# Returns the penalty finite_value (-1.0) instead of crashing
si_safe = cm_single.SI(force_finite=True, finite_value=-1.0)
print(f"SI with 1 cluster (Safe Mode): {si_safe}")