TauS - Tau Score
================

.. toctree::
   :maxdepth: 3

.. contents:: Table of Contents
   :local:
   :depth: 2


The **Tau Score (TauS)**, adapted from Kendall's rank correlation coefficient :math:`\tau` for partition comparison, is an external clustering evaluation metric designed for complex and mixed-type data. It evaluates clustering quality by measuring the normalized net difference between concordant and discordant sample pairs.

Intuitively, TauS answers the question: *"When cluster assignment is treated as a pairwise ranking problem across all data points, what is the rank correlation between the ground-truth grouping and the model's predicted grouping?"* A score of ``1.0`` indicates complete concordance (identical partitions).

.. math::

   \text{TauS} = \frac{S_+ - S_-}{\sqrt{N_d \cdot (N_d - t)}}

where, across all :math:`N_d = \binom{N}{2}` distinct sample pairs:

* :math:`S_+ = a + d` (concordant pairs): pairs placed in the **same** cluster in both partitions (:math:`a`), plus pairs placed in **different** clusters in both partitions (:math:`d`).
* :math:`S_- = b + c` (discordant pairs): pairs co-clustered in the ground truth but split in the prediction (:math:`b`), plus pairs split in the ground truth but co-clustered in the prediction (:math:`c`).
* :math:`t = a` (tied pairs): the number of pairs that are co-clustered in both partitions.

-------------------------------------------------------------------------------

Algorithmic Optimizations (Overflow Protection & Speed)
-------------------------------------------------------

Textbook formulations evaluate Kendall's Tau by nested pairwise comparison, which takes :math:`O(N^2)` time and becomes impractical for large inputs. Furthermore, on large datasets (e.g., :math:`N = 100{,}000`), the total pair count :math:`N_d` is about 5 billion, which exceeds the signed 32-bit integer limit (about 2.1 billion) and can silently overflow into negative values.

This implementation bypasses explicit pair iteration entirely.
By performing **combinatorial reductions over the contingency matrix** and its marginals, and by explicitly casting intermediate totals to 64-bit integers, it evaluates the exact Tau correlation in :math:`O(N)` time while avoiding 32-bit overflow.

-------------------------------------------------------------------------------

Handling Edge Cases (Finite Values)
-----------------------------------

The calculation divides by :math:`\sqrt{N_d \cdot (N_d - t)}`. This denominator is zero under two conditions:

1. **Identical single-cluster partitions:** When both the ground truth and the prediction place every sample in one cluster, every pair is a tie (:math:`t = a = N_d \rightarrow N_d - t = 0`). The two partitions are then necessarily identical.
2. **Trivial datasets:** When the input contains fewer than 2 samples (:math:`N < 2 \rightarrow N_d = 0`).

Two parameters control how this case is handled:

* **force_finite (bool):** If ``True``, catches the zero-division error and returns a fallback value instead of raising a ``ZeroDivisionError``. Default is ``True``.
* **finite_value (float):** The fallback value returned when the calculation fails. Since identical partitions represent the highest possible rank correlation, the default fallback is ``1.0``.

-------------------------------------------------------------------------------

Properties
----------

* **Best possible score:** ``1.0`` (complete concordance, identical clustering structures).
* **Worst possible score:** ``-1.0`` (complete discordance, perfect inverse grouping).
* **Permutation invariance:** Invariant to permutations of cluster labels.
* **Symmetry:** Symmetric in its arguments: :math:`\text{TauS}(y_{true}, y_{pred}) = \text{TauS}(y_{pred}, y_{true})`.
* **Range:** ``[-1.0, 1.0]``
* **References:**

  * Kendall, Maurice G. "A new measure of rank correlation." Biometrika 30.1-2 (1938): 81-93.
  * Ahmad, Amir, and Lipika Dey. "A k-mean clustering algorithm for mixed numeric and categorical data." Data & Knowledge Engineering 63.2 (2007): 503-527.

-------------------------------------------------------------------------------

Example Usage
-------------

.. code-block:: python
   :emphasize-lines: 11,12,20,21

   from permetrics.clustering import ClusteringMetric

   # ==============================================================================
   # SCENARIO 1: Basic Evaluation
   # ==============================================================================
   print("--- 1. BASIC TAU SCORE EXAMPLE ---")
   y_true = [0, 0, 1, 1, 2, 2]
   y_pred = [0, 0, 1, 1, 1, 2]

   cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
   taus_score = cm.TauS()
   print(f"Tau Score: {taus_score:.4f}")

   # ==============================================================================
   # SCENARIO 2: Verifying Peak Correlation on Identical Inputs
   # ==============================================================================
   print("\n--- 2. IDENTICAL PARTITION CHECK ---")

   # The label values differ, but both partitions put each sample in its own cluster.
   cm_perfect = ClusteringMetric(y_true=[0, 1, 2, 3], y_pred=[10, 20, 30, 40])
   print(f"Perfect Match Tau Score: {cm_perfect.TauS()}")
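
The contingency-matrix reduction described under *Algorithmic Optimizations* can be illustrated with a short standalone sketch. It is not the permetrics implementation: it is a literal transcription of the formula and pair definitions on this page, and the helper names ``pair_counts`` and ``tau_score_sketch`` are hypothetical. It assumes hashable labels of equal length and relies on Python's arbitrary-precision integers, so no explicit 64-bit cast is needed here.

.. code-block:: python

   from collections import Counter
   from math import comb, sqrt


   def pair_counts(y_true, y_pred):
       """Return (a, b, c, d, N_d) from contingency counts, without enumerating pairs."""
       if len(y_true) != len(y_pred):
           raise ValueError("y_true and y_pred must have the same length")
       n = len(y_true)
       cells = Counter(zip(y_true, y_pred))  # contingency-matrix cells n_ij
       rows = Counter(y_true)                # row marginals (ground-truth cluster sizes)
       cols = Counter(y_pred)                # column marginals (predicted cluster sizes)

       n_pairs = comb(n, 2)                                 # N_d
       a = sum(comb(v, 2) for v in cells.values())          # co-clustered in both
       same_true = sum(comb(v, 2) for v in rows.values())   # a + b
       same_pred = sum(comb(v, 2) for v in cols.values())   # a + c
       b = same_true - a                                    # together in truth, split in prediction
       c = same_pred - a                                    # split in truth, together in prediction
       d = n_pairs - a - b - c                              # split in both
       return a, b, c, d, n_pairs


   def tau_score_sketch(y_true, y_pred, force_finite=True, finite_value=1.0):
       """Hypothetical helper: evaluate the TauS formula as written on this page."""
       a, b, c, d, n_pairs = pair_counts(y_true, y_pred)
       t = a
       denom_sq = n_pairs * (n_pairs - t)
       if denom_sq == 0:
           # Zero denominator: fewer than 2 samples, or every pair tied.
           if force_finite:
               return finite_value
           raise ZeroDivisionError("Tau Score denominator is zero")
       return ((a + d) - (b + c)) / sqrt(denom_sq)


   y_true = [0, 0, 1, 1, 2, 2]
   y_pred = [0, 0, 1, 1, 1, 2]
   print(pair_counts(y_true, y_pred))  # (2, 1, 2, 10, 15)
   print(f"{tau_score_sketch(y_true, y_pred):.4f}")

Counting pairs through :math:`\binom{n_{ij}}{2}` sums means the work after building the contingency counts depends on the number of non-empty cells rather than on :math:`N_d`. In a fixed-width array library, the same sums would need a 64-bit integer dtype, for the overflow reason given above.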