RSI - R-Squared Index

The R-Squared Index (RSI) (also known as the Coefficient of Determination for Clustering) is an internal evaluation metric. It measures the proportion of the total variance in the dataset that is explained by the clustering partition.

Intuitively, RSI acts similarly to the R-squared score in linear regression models. It answers the question: “How much of the total dispersion of the data is captured by grouping the points into these clusters?” A higher RSI value indicates a more optimal partition, implying that clusters are tightly cohesive and capture the vast majority of the dataset’s variation.

\[\text{RSI} = \frac{\text{TSS} - \text{WGSS}}{\text{TSS}} = \frac{\text{BGSS}}{\text{TSS}}\]

Where:

  • \(\text{TSS}\) is the Total Sum of Squares (total dispersion of all data points around the global dataset centroid).

  • \(\text{WGSS}\) is the Within-Group Sum of Squares (total dispersion of data points around their respective cluster centroids).

  • \(\text{BGSS}\) is the Between-Group Sum of Squares (\(\text{TSS} - \text{WGSS}\)).


Properties

  • Best possible score: 1.0 (Higher value is better, indicating that 100% of the data variance is explained by the cluster centers).

  • Worst possible score: 0.0 (Indicates the clustering captures zero variance, performing no better than a single global mean).

  • Range: [0.0, 1.0]


Example Usage

from permetrics.clustering import ClusteringMetric
import numpy as np

# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC R-SQUARED INDEX EXAMPLE ---")

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y_pred_labels = np.array([0, 0, 0, 1, 1, 1])

cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
rsi_score = cm.RSI()
print(f"R-Squared Index: {rsi_score}")

# ==============================================================================
# SCENARIO 2: Single Cluster Evaluation (Zero variance explained)
# ==============================================================================
print("\n--- 2. SINGLE CLUSTER EXAMPLE ---")

y_pred_single = np.array([0, 0, 0, 0, 0, 0])
cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)
rsi_single = cm_single.RSI()
print(f"RSI with 1 cluster: {rsi_single}")