CHI - Calinski-Harabasz Index

The Calinski-Harabasz Index (CHI), also known as the Variance Ratio Criterion, is an internal clustering evaluation metric [34]. It is the ratio of the between-cluster dispersion to the within-cluster dispersion, each summed over all clusters and scaled by its degrees of freedom.

Intuitively, CHI evaluates the validity of a clustering based on the average between- and within-cluster sum of squares. It answers the question: “How well-separated are the clusters relative to how compact they are?” A higher score implies that clusters are dense (low within-cluster variance) and well-separated (high between-cluster variance).

\[\text{CHI} = \frac{\text{Tr}(B_K)}{\text{Tr}(W_K)} \times \frac{N - K}{K - 1}\]

Where:

  • \(N\) is the total number of data points (samples).

  • \(K\) is the number of clusters.

  • \(\text{Tr}(B_K)\) is the trace of the between-cluster dispersion matrix (the between-cluster sum of squares).

  • \(\text{Tr}(W_K)\) is the trace of the within-cluster dispersion matrix (the within-cluster sum of squares).
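The formula can be checked by hand with a few lines of NumPy. The sketch below assumes the standard definitions \(\text{Tr}(B_K) = \sum_k n_k \lVert c_k - c \rVert^2\) and \(\text{Tr}(W_K) = \sum_k \sum_{x \in C_k} \lVert x - c_k \rVert^2\), where \(c_k\) is the centroid of cluster \(k\), \(n_k\) its size, and \(c\) the centroid of the whole dataset. The function name calinski_harabasz_sketch is hypothetical and is not part of permetrics; the value returned by cm.CHI() may differ if the library defines the dispersions differently.

```python
import numpy as np


def calinski_harabasz_sketch(X, labels):
    """Illustrative CHI computed from the formula above (not the permetrics code)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples = X.shape[0]
    cluster_ids = np.unique(labels)
    n_clusters = cluster_ids.size
    if n_clusters < 2:
        # (K - 1) would be zero, so the ratio is undefined.
        raise ValueError("CHI is undefined for fewer than 2 clusters")

    global_centroid = X.mean(axis=0)
    between = 0.0  # Tr(B_K): size-weighted squared distance of centroids to the global centroid
    within = 0.0   # Tr(W_K): squared distance of points to their own centroid
    for k in cluster_ids:
        members = X[labels == k]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - global_centroid) ** 2)
        within += np.sum((members - centroid) ** 2)

    return (between / within) * (n_samples - n_clusters) / (n_clusters - 1)


X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Labels that follow the two visible groups (x = 1 versus x = 10).
compact = calinski_harabasz_sketch(X_data, [0, 0, 0, 1, 1, 1])
# Labels that mix the two groups together.
mixed = calinski_harabasz_sketch(X_data, [0, 1, 0, 1, 0, 1])

print(compact)  # 30.375 under the definitions assumed here
print(mixed)    # much smaller: the clusters overlap and are spread out
```

For the first labeling, \(\text{Tr}(B_K) = 121.5\) and \(\text{Tr}(W_K) = 16\), so the score is \((121.5 / 16) \times (6 - 2) / (2 - 1) = 30.375\). The mixed labeling scores far lower on the same data, which matches the "higher is better" reading.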


Handling Edge Cases (Finite Values)

The Calinski-Harabasz index is mathematically undefined when there is only one cluster (\(K = 1\)), as the denominator \((K - 1)\) becomes zero. The function provides parameters to safely handle this scenario:

  • force_finite (bool): If True, the function will catch the undefined mathematical operation and return a safe, finite number instead of raising an exception. Default is True.

  • finite_value (float): The specific fallback value returned when force_finite=True and the clustering has only 1 cluster. Default is 0.0.
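The guard described by these two parameters can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the library's code: the helper name chi_or_fallback is hypothetical, and raising a ValueError when force_finite=False is an assumption, since the text above only says that an exception is raised.

```python
import numpy as np


def chi_or_fallback(X, labels, force_finite=True, finite_value=0.0):
    """Hypothetical guard around the CHI formula for the single-cluster case."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    cluster_ids = np.unique(labels)
    n_samples, n_clusters = X.shape[0], cluster_ids.size

    if n_clusters < 2:
        if force_finite:
            # Skip the division entirely and return the caller's fallback.
            return float(finite_value)
        # Assumed exception type; the real library may raise something else.
        raise ValueError("CHI is undefined when there is only one cluster")

    global_centroid = X.mean(axis=0)
    between = 0.0
    within = 0.0
    for k in cluster_ids:
        members = X[labels == k]
        centroid = members.mean(axis=0)
        between += members.shape[0] * np.sum((centroid - global_centroid) ** 2)
        within += np.sum((members - centroid) ** 2)
    return (between / within) * (n_samples - n_clusters) / (n_clusters - 1)


X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
single = np.zeros(6, dtype=int)  # every point assigned to cluster 0

default_fallback = chi_or_fallback(X_data, single)
custom_fallback = chi_or_fallback(X_data, single, finite_value=-1.0)
print(default_fallback, custom_fallback)
```

A fallback of 0.0 coincides with the worst possible score, so a degenerate single-cluster result ranks last when candidate clusterings are compared by CHI.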


Properties

  • Best possible score: No strict upper bound (Higher value is better).

  • Worst possible score: 0.0

  • Range: [0.0, +inf)

  • Notes: The scikit-learn implementation of this metric computes the intra_disp variable (the within-group sum of squares, WGSS) incorrectly; see Scikit-Learn Calinski-Harabasz.


Example Usage

from permetrics.clustering import ClusteringMetric
import numpy as np

# ==============================================================================
# SCENARIO 1: Normal Clustering Evaluation
# ==============================================================================
print("--- 1. BASIC CALINSKI-HARABASZ INDEX EXAMPLE ---")

X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y_pred_labels = np.array([0, 0, 0, 1, 1, 1])

cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
chi_score = cm.CHI()
print(f"Calinski-Harabasz Index: {chi_score}")

# ==============================================================================
# SCENARIO 2: Edge Case with 1 Cluster (Demonstrating force_finite)
# ==============================================================================
print("\n--- 2. EDGE CASE (1 CLUSTER) EXAMPLE ---")

# All data points are predicted to be in the same single cluster (label 0)
y_pred_single = np.array([0, 0, 0, 0, 0, 0])
cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)

# Returns the finite_value (0.0) instead of crashing
chi_safe = cm_single.CHI(force_finite=True, finite_value=0.0)
print(f"CHI with 1 cluster (Safe Mode): {chi_safe}")