PrS - Precision Score
=====================

.. toctree::
   :maxdepth: 3

.. contents:: Table of Contents
   :local:
   :depth: 2

The **Precision Score (PrS)** is an external clustering evaluation metric. Derived from information retrieval and classification theory, it evaluates the quality of a clustering partition by measuring the proportion of sample pairs placed in the same cluster that are truly supposed to be grouped together according to the ground truth.

Intuitively, PrS answers the question: *"Of all the pairs of points that my model decided to put into the same cluster, what percentage actually belong to the same true class?"* A score of ``1.0`` indicates that every single cluster created by the model is completely pure.

.. math::

   \text{PrS} = \frac{yy}{yy + ny}

Where across all pairs of distinct data points:

* :math:`yy` (True Positives): Number of pairs placed in the **same** cluster in both the ground truth (:math:`y_{true}`) and the prediction (:math:`y_{pred}`).
* :math:`ny` (False Positives): Number of pairs placed in **different** classes in :math:`y_{true}`, but incorrectly grouped into the **same** cluster in :math:`y_{pred}`.
* The denominator :math:`yy + ny` represents the total number of intra-cluster pairs generated by the prediction.

Expressed in conditional probability notation (as formulated in clusterCrit):

.. math::

   \text{PrS} = P(gp_1 | gp_2)

Where :math:`gp_1` and :math:`gp_2` represent the events that two points are grouped together in the ground truth and the predicted partition, respectively.

-------------------------------------------------------------------------------

Algorithmic Optimizations (Performance Note)
--------------------------------------------

Standard pairwise evaluation requires checking all :math:`\binom{N}{2}` combinations, resulting in an :math:`O(N^2)` computational bottleneck. This implementation bypasses explicit pair enumeration.
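
As a point of reference for the pairwise definition, the counts :math:`yy` and :math:`ny` can also be obtained by explicit enumeration. The following is a minimal sketch of that :math:`O(N^2)` baseline, not the library's implementation; the helper name ``pairwise_precision`` is hypothetical and exists only for illustration.

.. code-block:: python

   from itertools import combinations


   def pairwise_precision(y_true, y_pred):
       """Brute-force PrS = yy / (yy + ny) over all pairs of distinct points.

       Hypothetical helper for illustration; not part of permetrics.
       """
       yy = 0  # same cluster in y_pred and same class in y_true
       ny = 0  # same cluster in y_pred but different class in y_true
       for i, j in combinations(range(len(y_true)), 2):
           if y_pred[i] == y_pred[j]:
               if y_true[i] == y_true[j]:
                   yy += 1
               else:
                   ny += 1
       if yy + ny == 0:
           raise ZeroDivisionError("y_pred contains no intra-cluster pairs")
       return yy / (yy + ny)


   # Clusters {0,1}, {2,3,4}, {5}: 4 intra-cluster pairs, 2 of them correct.
   print(pairwise_precision([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))  # 0.5

A loop like this is mainly useful for validating a faster implementation on small inputs.
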
Using dot products of the **contingency matrix** and its marginal sums, the implementation obtains the exact pair totals (:math:`yy` and :math:`ny`) in :math:`O(N)` time. This keeps evaluation fast even on large datasets.

-------------------------------------------------------------------------------

Handling Edge Cases (Finite Values)
-----------------------------------

The Precision Score divides by the total number of predicted intra-cluster pairs (:math:`yy + ny`). If the predicted partition consists exclusively of singletons (every cluster contains exactly one sample), the model groups no pairs together, so the denominator is zero and the ratio is undefined.

* **force_finite (bool):** If ``True``, the function catches the zero-division error and returns a safe fallback value instead of raising a ``ZeroDivisionError``. Default is ``True``.
* **finite_value (float):** The fallback value returned when ``force_finite=True`` and the model predicts no intra-cluster pairs. Since predicting only singletons yields no false-positive groupings, the default fallback is ``1.0``.

-------------------------------------------------------------------------------

Properties
----------

* **Best possible score:** ``1.0`` (higher is better; it indicates zero false-positive co-clusterings).
* **Worst possible score:** ``0.0`` (none of the co-clustered pairs belong to the same ground-truth class).
* **Permutation invariance:** The score is invariant to permutations of cluster labels.
* **Asymmetric:** In general, :math:`\text{PrS}(y_{true}, y_{pred}) \neq \text{PrS}(y_{pred}, y_{true})`. Switching the reference partition yields the **Recall Score (ReS)**.
* **Range:** ``[0.0, 1.0]``
* **References:** Desgraupes, Bernard. "Clustering indices." University of Paris Ouest-Lab Modal’X 1.1 (2013): 34.

-------------------------------------------------------------------------------

Example Usage
-------------

.. code-block:: python
   :emphasize-lines: 11,12,21,22

   from permetrics.clustering import ClusteringMetric

   # ==============================================================================
   # SCENARIO 1: Basic Evaluation
   # ==============================================================================
   print("--- 1. BASIC PRECISION SCORE EXAMPLE ---")
   y_true = [0, 0, 1, 1, 2, 2]
   y_pred = [0, 0, 1, 1, 1, 2]

   cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
   prs_score = cm.PrS()
   print(f"Precision Score: {prs_score}")

   # ==============================================================================
   # SCENARIO 2: Precision vs Recall Asymmetry
   # ==============================================================================
   print("\n--- 2. ASYMMETRY EXAMPLE ---")

   # Putting everything into one single cluster gives low precision but 100% recall.
   # Here yy = 2 and ny = 4, so the pairwise definition gives PrS = 2 / 6.
   cm_single = ClusteringMetric(y_true=[0, 0, 1, 1], y_pred=[0, 0, 0, 0])
   print(f"Single Cluster Precision: {cm_single.PrS()}")
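
The contingency-based shortcut described under *Algorithmic Optimizations* and the fallback described under *Handling Edge Cases* can be sketched together in plain Python. This is an illustration under stated assumptions, not the library's code: the helper name ``precision_from_contingency`` is hypothetical, the contingency table is stored sparsely in a ``collections.Counter`` rather than as a dense matrix, and the ``force_finite`` / ``finite_value`` arguments simply mirror the documented parameters.

.. code-block:: python

   from collections import Counter


   def precision_from_contingency(y_true, y_pred, force_finite=True, finite_value=1.0):
       """Pair-counting precision without enumerating pairs (illustrative only)."""
       # Sparse contingency table: (true class, predicted cluster) -> sample count.
       cells = Counter(zip(y_true, y_pred))
       # Column marginals: predicted cluster -> sample count.
       cluster_sizes = Counter(y_pred)

       # yy: pairs sharing both class and cluster, n * (n - 1) / 2 per cell.
       yy = sum(n * (n - 1) // 2 for n in cells.values())
       # yy + ny: all pairs sharing a predicted cluster.
       same_pred = sum(n * (n - 1) // 2 for n in cluster_sizes.values())

       if same_pred == 0:  # only singleton clusters were predicted
           if force_finite:
               return finite_value
           raise ZeroDivisionError("y_pred contains no intra-cluster pairs")
       return yy / same_pred


   print(precision_from_contingency([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))  # 0.5
   print(precision_from_contingency([0, 0, 1, 1], [0, 1, 2, 3]))  # 1.0 (fallback)

Both counters are filled in a single pass over the labels, so the cost grows linearly with the number of samples rather than with the number of pairs.
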