HGS - Hubert Gamma Score
The Hubert Gamma Score (HGS) is an external clustering evaluation metric. It measures the similarity between two clustering partitions by computing the correlation between two binary indicator variables representing whether pairs of samples are co-clustered or separated.
Intuitively, HGS evaluates clustering agreement by treating partition comparison as a quadratic assignment problem. It evaluates the net excess of concordant sample pairs over discordant pairs, answering the question: “Are two points that are grouped together in the ground truth also consistently placed together by the model?”
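One form consistent with the description above is sketched here. It assumes the raw score is the covariance of the two pair indicators and that the normalized variant \(\hat{\Gamma}\) divides it by their standard deviations, written \(\sigma_{X_1}\) and \(\sigma_{X_2}\) (symbols introduced only for this sketch). It is a reconstruction from the surrounding text, not a formula quoted from the library:

\[
\Gamma = \frac{1}{N_T} \sum_{i<j} \left(X_1(i,j) - \mu_{X_1}\right)\left(X_2(i,j) - \mu_{X_2}\right),
\qquad
\hat{\Gamma} = \frac{\Gamma}{\sigma_{X_1}\,\sigma_{X_2}}
\]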
Where across all \(N_T = \binom{N}{2}\) possible pairs of distinct data points:
\(X_1\) and \(X_2\) are binary indicator variables for the ground truth (\(y_{true}\)) and predicted partition (\(y_{pred}\)), respectively. Their value is 1 if points \(i\) and \(j\) are in the same cluster, and 0 otherwise.
\(\mu_{X_1}\) and \(\mu_{X_2}\) are the means of these indicator variables across all pairs.
Expressed directly via the pair counts derived from the contingency matrix:
Note: In clusterCrit literature, the normalized variant \(\hat{\Gamma}\) divides this raw value by the standard deviations of the indicator variables, bounding the score between -1 and 1.
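As a definition-level check, the normalized score can be computed by brute force as the Pearson correlation of the two pair-indicator vectors. The sketch below is only an illustration of that definition: the function names are hypothetical, it is not the library's implementation, and it deliberately visits every pair.

```python
from itertools import combinations
from math import sqrt


def pair_indicators(labels):
    """Return one 0/1 value per unordered pair (i < j): 1 if co-clustered."""
    return [int(a == b) for a, b in combinations(labels, 2)]


def hubert_gamma_bruteforce(y_true, y_pred):
    """Normalized Hubert Gamma as the correlation of the pair indicators.

    Hypothetical helper for illustration; runs in O(N^2) time.
    """
    x1 = pair_indicators(y_true)
    x2 = pair_indicators(y_pred)
    n_t = len(x1)  # number of distinct pairs, N choose 2
    mu1 = sum(x1) / n_t
    mu2 = sum(x2) / n_t
    cov = sum((a - mu1) * (b - mu2) for a, b in zip(x1, x2)) / n_t
    var1 = sum((a - mu1) ** 2 for a in x1) / n_t
    var2 = sum((b - mu2) ** 2 for b in x2) / n_t
    # Raises ZeroDivisionError when either indicator has zero variance.
    return cov / sqrt(var1 * var2)


if __name__ == "__main__":
    print(hubert_gamma_bruteforce([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))
```

For the labels used in the usage example of this page, this sketch gives a value of about 0.452.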
Algorithmic Optimizations (Performance Note)
Iterating over all sample pairs to build the indicator variables explicitly costs \(O(N^2)\) time.
This implementation avoids building the pairwise indicators. It derives the pair counts (\(yy\), \(yn\), and \(ny\)) directly from dot products over the contingency matrix and its marginals, so the exact Hubert Gamma statistic is computed in \(O(N)\) time. This keeps evaluation fast on large datasets.
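The idea can be sketched as follows. The sketch assumes the clusterCrit-style convention in which \(yy\) counts pairs grouped together in both partitions, \(yn\) pairs together only in the ground truth, \(ny\) pairs together only in the prediction, and \(nn\) pairs separated in both, and it assumes the normalized form \(\hat{\Gamma} = \frac{N_T \cdot yy - (yy+yn)(yy+ny)}{\sqrt{(yy+yn)(yy+ny)(nn+yn)(nn+ny)}}\). The function names are hypothetical, and binomial sums over the contingency counts stand in for the dot products mentioned above.

```python
from collections import Counter
from math import comb, sqrt


def pair_counts(y_true, y_pred):
    """Derive (yy, yn, ny, nn) from contingency counts in one pass over the data."""
    n_t = comb(len(y_true), 2)
    cells = Counter(zip(y_true, y_pred))  # contingency matrix entries
    rows = Counter(y_true)                # ground-truth marginals
    cols = Counter(y_pred)                # prediction marginals
    yy = sum(comb(c, 2) for c in cells.values())
    same_true = sum(comb(c, 2) for c in rows.values())  # equals yy + yn
    same_pred = sum(comb(c, 2) for c in cols.values())  # equals yy + ny
    yn = same_true - yy
    ny = same_pred - yy
    nn = n_t - yy - yn - ny
    return yy, yn, ny, nn


def hubert_gamma_from_counts(yy, yn, ny, nn):
    """Normalized Hubert Gamma from pair counts (assumed clusterCrit-style form)."""
    n_t = yy + yn + ny + nn
    numerator = n_t * yy - (yy + yn) * (yy + ny)
    denominator = sqrt((yy + yn) * (yy + ny) * (nn + yn) * (nn + ny))
    return numerator / denominator  # undefined when the denominator is 0


if __name__ == "__main__":
    counts = pair_counts([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])
    print(counts, hubert_gamma_from_counts(*counts))
```

No pair is ever enumerated: the work is one pass to build the counters plus a sum over the non-empty contingency cells, of which there are at most \(N\).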
Handling Edge Cases (Finite Values)
The normalized Hubert Gamma Score involves division by the standard deviations of the indicator variables. If either partition consists exclusively of a single universal cluster or strictly of isolated singletons, the variance of the partition’s indicator variable evaluates to zero, triggering an undefined mathematical division.
force_finite (bool): If True, catches the zero-division error and returns a safe fallback value instead of raising a ZeroDivisionError. Default is True.
finite_value (float): The fallback value returned when force_finite=True and the calculation fails. Since the worst possible normalized score is -1.0, the default fallback is 0.0.
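The behaviour of these two parameters can be illustrated with a small sketch. The function name normalized_gamma and its pair-count signature are hypothetical and chosen for illustration; only the parameter names force_finite and finite_value and their defaults come from the description above.

```python
from math import sqrt


def normalized_gamma(yy, yn, ny, nn, force_finite=True, finite_value=0.0):
    """Hypothetical helper showing the zero-variance fallback.

    yy, yn, ny, nn are pair counts (assumed clusterCrit-style convention).
    """
    n_t = yy + yn + ny + nn
    numerator = n_t * yy - (yy + yn) * (yy + ny)
    denominator = sqrt((yy + yn) * (yy + ny) * (nn + yn) * (nn + ny))
    if denominator == 0:
        # One partition is a single cluster or all singletons,
        # so its indicator variable has zero variance.
        if force_finite:
            return finite_value
        raise ZeroDivisionError("Hubert Gamma is undefined for a zero-variance partition")
    return numerator / denominator


if __name__ == "__main__":
    # Four samples, both partitions a single cluster: all 6 pairs are "yy".
    print(normalized_gamma(6, 0, 0, 0))
```

With force_finite=False the same degenerate input raises ZeroDivisionError instead of returning the fallback.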
Properties
Best possible score: Depends on dataset size for unnormalized HGS; strictly 1.0 for normalized \(\hat{\Gamma}\) (indicating absolute agreement).
Worst possible score: Strictly -1.0 for normalized \(\hat{\Gamma}\) (indicating severe disagreement).
Permutation Invariance: Completely invariant to permutations of cluster labels.
Symmetry: The metric is strictly symmetric: \(\text{HGS}(y_{true}, y_{pred}) = \text{HGS}(y_{pred}, y_{true})\).
Range: [-1.0, 1.0] (for normalized \(\hat{\Gamma}\)).
Example Usage
import numpy as np
from permetrics.clustering import ClusteringMetric
# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC HUBERT GAMMA SCORE EXAMPLE ---")
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
hgs_score = cm.HGS()
print(f"Hubert Gamma Score: {hgs_score}")
# ==============================================================================
# SCENARIO 2: Symmetry Verification
# ==============================================================================
print("\n--- 2. SYMMETRY EXAMPLE ---")
hgs_reverse = cm.HGS(y_true=y_pred, y_pred=y_true)
print(f"Is HGS exactly symmetric? {np.isclose(hgs_score, hgs_reverse)}")