SI - Silhouette Index

The Silhouette Index (SI) [41] is a widely used internal clustering evaluation metric. For each object, it compares how close the object is to the members of its own cluster (cohesion) with how close it is to the members of other clusters (separation).

Intuitively, SI answers the question: “How well does each data point fit into its assigned cluster compared to the next best alternative cluster?” A high silhouette value indicates that the object is well-matched to its own cluster and poorly matched to neighboring clusters.

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]

Where:

\(a(i)\) is the mean distance between data point \(i\) and all other points in the same cluster (intra-cluster distance).
\(b(i)\) is the smallest mean distance between data point \(i\) and all points of any cluster that \(i\) does not belong to (nearest-cluster distance).
The global Silhouette Index is the mean of the silhouette widths \(s(i)\) for all data points.
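
To make the definitions of \(a(i)\), \(b(i)\), and \(s(i)\) concrete, here is a minimal NumPy sketch that evaluates them directly from a full pairwise distance matrix. It is an illustration only, not the library's implementation: the function name `silhouette_samples_naive` is hypothetical, Euclidean distance is assumed, singleton clusters are given \(s(i) = 0\) by the usual convention, and at least two clusters are required.

```python
import numpy as np

def silhouette_samples_naive(X, labels):
    """Per-point silhouette widths s(i) from the full pairwise distance matrix.

    Hypothetical helper for illustration; assumes Euclidean distance
    and at least two distinct cluster labels.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).ravel()
    # Full N x N Euclidean distance matrix (only sensible for small N).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: s(i) = 0 by convention
        # a(i): mean distance to the other members of the same cluster.
        # The self-distance is 0, so only the divisor excludes point i.
        a = dist[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])

s = silhouette_samples_naive(X_data, labels)
print("s(i) per point:", s)
print("global index  :", s.mean())
```

For the first point, \(a(i) = 2\) (the mean of its distances to the two other members of cluster 0) and \(b(i)\) is the mean of its three distances to cluster 1, so \(s(i)\) is well above 0, as expected for two clearly separated groups.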

Algorithmic Variations (Memory Optimization)

Computing the exact Silhouette Index normally requires materializing the full pairwise distance matrix, which takes \(O(N^2)\) memory. To avoid Out-Of-Memory (OOM) errors on large datasets (e.g., \(N > 100{,}000\)), this implementation processes the pairwise distances in chunks.

chunk_size (int): Size of the batches in which the pairwise distances are processed (default: 5000). This bounds peak memory usage while producing the same result as the full-matrix approach.
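
The chunking idea can be sketched as follows. This is an assumption about how such a strategy might look, not the library's actual code: the function name `silhouette_samples_chunked` is hypothetical, Euclidean distance is assumed, and `chunk_size` is interpreted here as the number of rows of the distance matrix held in memory at once. Each chunk only needs a `chunk_size x N` block of distances, which is reduced immediately to per-cluster sums.

```python
import numpy as np

def silhouette_samples_chunked(X, labels, chunk_size=5000):
    """Per-point silhouette widths without building the N x N matrix.

    Hypothetical sketch: Euclidean distance, at least two clusters.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels).ravel()
    _, codes = np.unique(labels, return_inverse=True)
    n, k = len(X), codes.max() + 1
    if k < 2:
        raise ValueError("silhouette is undefined for a single cluster")
    counts = np.bincount(codes, minlength=k)   # cluster sizes
    onehot = np.eye(k)[codes]                  # (n, k) membership matrix
    sq_norms = (X ** 2).sum(axis=1)
    s = np.zeros(n)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        rows = np.arange(stop - start)
        # (chunk, n) block of distances via ||x||^2 + ||y||^2 - 2 x.y
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * X[start:stop] @ X.T
        d = np.sqrt(np.maximum(d2, 0.0))
        d[rows, np.arange(start, stop)] = 0.0  # exact zero self-distance
        sums = d @ onehot                      # summed distance to each cluster
        own = codes[start:stop]
        own_count = counts[own]
        # a(i): mean distance to the other members of the own cluster.
        a = sums[rows, own] / np.maximum(own_count - 1, 1)
        # b(i): smallest mean distance to any other cluster.
        means = sums / counts
        means[rows, own] = np.inf
        b = means.min(axis=1)
        denom = np.maximum(a, b)
        valid = (own_count > 1) & (denom > 0)  # singletons keep s(i) = 0
        width = np.zeros(stop - start)
        width[valid] = (b[valid] - a[valid]) / denom[valid]
        s[start:stop] = width
    return s

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
labels = np.array([0, 0, 0, 1, 1, 1])

s_small_chunks = silhouette_samples_chunked(X_data, labels, chunk_size=2)
s_one_chunk = silhouette_samples_chunked(X_data, labels, chunk_size=6)
print(np.allclose(s_small_chunks, s_one_chunk))
```

Because every row's \(a(i)\) and \(b(i)\) depend only on that row's distances, splitting the rows into chunks changes the memory footprint but not the per-point values.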

Handling Edge Cases (Finite Values)

The Silhouette Index is mathematically undefined when there is only one cluster (\(K = 1\)). In this scenario there is no other cluster to compare against, so the separation term \(b(i)\) cannot be computed.

force_finite (bool): If True, returns finite_value instead of failing when the index is undefined. Default is True.
finite_value (float): The specific fallback value returned when force_finite=True. Since the worst possible silhouette score is -1, the default fallback is a penalty value of -1.0.
multi_output (bool): If True, the function returns an array of silhouette scores for each individual data point instead of a single global mean.
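
The fallback logic described by force_finite and finite_value can be pictured with a small wrapper. This is a hedged sketch rather than the library's code: `with_finite_fallback` and `score_fn` are hypothetical names, and raising a `ValueError` when force_finite is False is an assumption made for illustration.

```python
import numpy as np

def with_finite_fallback(score_fn, X, labels, force_finite=True, finite_value=-1.0):
    """Call score_fn(X, labels) unless the clustering has fewer than two clusters.

    Hypothetical wrapper; the behaviour for force_finite=False is assumed.
    """
    n_clusters = len(np.unique(labels))
    if n_clusters < 2:
        if force_finite:
            return finite_value  # penalty value instead of an error
        raise ValueError("Silhouette Index is undefined for a single cluster")
    return score_fn(X, labels)

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
single_cluster = np.zeros(6, dtype=int)

# Placeholder scorer, used only to show that it is never called for K = 1.
dummy_score = lambda X, labels: 0.5

fallback = with_finite_fallback(dummy_score, X_data, single_cluster)
print(fallback)
```

Returning the worst possible score as the fallback means a degenerate single-cluster solution never ranks above a valid clustering when scores are compared, for example inside an optimization loop.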

Properties 

Best possible score: 1.0 (Points are perfectly clustered and far away from neighboring clusters).
Worst possible score: -1.0 (Points are consistently assigned to the wrong clusters).
Values near 0: Indicate overlapping clusters where points are situated on the decision boundary between two groups.
Range: [-1.0, 1.0]
References: Scikit-Learn Silhouette Score

Example Usage 

from permetrics.clustering import ClusteringMetric
import numpy as np

# ==============================================================================
# SCENARIO 1: Normal Evaluation (Global Mean Silhouette)
# ==============================================================================
print("--- 1. BASIC SILHOUETTE INDEX EXAMPLE ---")

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y_pred_labels = np.array([0, 0, 0, 1, 1, 1])

cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
# Calculates the global mean silhouette score
si_score = cm.SI()
print(f"Silhouette Index: {si_score}")

# ==============================================================================
# SCENARIO 2: Multi-output (Per-sample Silhouette Scores)
# ==============================================================================
print("\n--- 2. PER-SAMPLE SILHOUETTE SCORES ---")

# Returns an array containing the silhouette score for each data point
si_samples = cm.SI(multi_output=True)
print(f"Silhouette Scores per sample:\n{si_samples}")

# ==============================================================================
# SCENARIO 3: Edge Case with 1 Cluster
# ==============================================================================
print("\n--- 3. EDGE CASE (1 CLUSTER) EXAMPLE ---")

y_pred_single = np.array([0, 0, 0, 0, 0, 0])
cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)

# Returns the penalty finite_value (-1.0) instead of crashing
si_safe = cm_single.SI(force_finite=True, finite_value=-1.0)
print(f"SI with 1 cluster (Safe Mode): {si_safe}")
