TauS - Tau Score
The Tau Score (TauS) (adapted from Kendall’s Rank Correlation Coefficient \(\tau\) for partition comparison) is an external clustering evaluation metric designed for complex and mixed-type data. It evaluates clustering quality by measuring the normalized net difference between concordant and discordant sample pairs.
Intuitively, TauS answers the question: “When treating cluster assignment as a pairwise ranking problem across all data points, what is the exact rank correlation between the ground truth groupings and the model’s predicted groupings?” A score of 1.0 indicates absolute concordance (identical partitions).
Where across all \(N_d = \binom{N}{2}\) possible distinct sample pairs:
\(S_+ = a + d\) (Concordant pairs): Pairs placed in the same cluster in both partitions (\(a\)), plus pairs placed in different clusters in both partitions (\(d\)).
\(S_- = b + c\) (Discordant pairs): Pairs co-clustered in ground truth but split in prediction (\(b\)), plus pairs split in ground truth but co-clustered in prediction (\(c\)).
\(t = a\): The number of mutual co-clustered tie pairs shared by both reference partitions.
Algorithmic Optimizations (Overflow Protection & Speed)
Standard textbook formulas evaluate Kendall’s Tau by nested pairwise comparisons, scaling at an unfeasible \(O(N^2)\) time complexity. Furthermore, on large datasets (e.g., \(N > 100,000\)), the total pair count \(N_d\) approaches 5 billion, causing standard 32-bit integer arrays to silently overflow into negative values.
This implementation bypasses explicit pair iteration entirely. By executing combinatorial reductions over the Contingency Matrix marginals and explicitly casting intermediate totals to high-capacity 64-bit integers, it evaluates the exact Tau correlation in \(O(N)\) time complexity with guaranteed numerical stability.
Handling Edge Cases (Finite Values)
The calculation involves division by \(\sqrt{N_d \cdot (N_d - t)}\). This denominator evaluates to zero under two specific conditions:
Identical Partitions: When the predicted clustering perfectly matches the ground truth, all concordant positive pairs are ties (\(t = N_d \rightarrow N_d - t = 0\)).
Trivial Datasets: When the input contains fewer than 2 samples (\(N < 2 \rightarrow N_d = 0\)).
force_finite (bool): If
True, catches the zero-division error and returns a safe fallback value instead of raising aZeroDivisionError. Default isTrue.finite_value (float): The fallback value returned when calculation fails. Since identical partitions represent peak theoretical rank correlation, the default fallback is strictly
1.0.
Properties
Best possible score:
1.0(Indicates absolute concordance / identical clustering structures).Worst possible score:
-1.0(Indicates absolute discordance / perfect inverse grouping).Permutation Invariance: Strictly invariant to permutations of cluster labels.
Symmetry: Strictly symmetric: \(\text{TauS}(y_{true}, y_{pred}) = \text{TauS}(y_{pred}, y_{true})\).
Range:
[-1.0, 1.0]References:
Example Usage
from permetrics.clustering import ClusteringMetric
# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC TAU SCORE EXAMPLE ---")
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
taus_score = cm.TauS()
print(f"Tau Score: {taus_score:.4f}")
# ==============================================================================
# SCENARIO 2: Verifying Peak Correlation on Identical Inputs
# ==============================================================================
print("\n--- 2. IDENTICAL PARTITION CHECK ---")
cm_perfect = ClusteringMetric(y_true=[0, 1, 2, 3], y_pred=[10, 20, 30, 40])
print(f"Perfect Match Tau Score: {cm_perfect.TauS()}")