PuS - Purity Score ================== .. toctree:: :maxdepth: 3 .. contents:: Table of Contents :local: :depth: 2 The **Purity Score (PuS)** is a simple, transparent external clustering evaluation metric. It quantifies the extent to which each predicted cluster contains data points belonging primarily to a single ground truth class. Intuitively, PuS answers the question: *"If we assign each predicted cluster to the ground truth class that appears most frequently within it, what would be our overall classification accuracy across the entire dataset?"* .. math:: \text{PuS} = \frac{1}{N} \sum_{j=1}^{|P|} \max_{i} \left( C_{i, j} \right) Where: * :math:`N` is the total number of data points. * :math:`C` is the :math:`|Y| \times |P|` **Contingency Matrix**, where :math:`C_{i, j}` represents the number of samples belonging to true class :math:`i` that were assigned to predicted cluster :math:`j`. * :math:`\max_{i} (C_{i, j})` extracts the sample count of the majority true class inside cluster :math:`j`. ------------------------------------------------------------------------------- Algorithmic Optimizations (Performance Note) -------------------------------------------- Standard textbook implementations compute Purity by iterating over every unique class and cluster label, utilizing boolean array masking. This incurs an inefficient runtime complexity of :math:`O(N \cdot |P| \cdot |Y|)`. This implementation vectorizes the entire process. By constructing the Contingency Matrix and executing a single column-wise maximum reduction (``np.amax(C, axis=0)``), it evaluates the exact score in **:math:`O(N)` time complexity**. Furthermore, it is strictly safe for arbitrary label formats (strings, UUIDs, or non-consecutive integers). ------------------------------------------------------------------------------- Theoretical Limitation (The "Singleton Illusion") ------------------------------------------------- While PuS is easy to interpret, **it cannot be used to trade off the quality of clustering against the number of clusters**. If a model assigns every single sample into its own individual cluster (:math:`|P| = N`), each cluster contains exactly 1 sample. The majority class count for every cluster trivially equals 1, resulting in an artificial :math:`\text{PuS} = 1.0`. Therefore, Purity should always be evaluated alongside a complexity-penalizing metric like the **Adjusted Rand Score (ARS)** or **Normalized Mutual Information (NMIS)**. ------------------------------------------------------------------------------- Properties ---------- * **Best possible score:** ``1.0`` (Perfectly pure clusters). * **Worst possible score:** Bounded below by :math:`\frac{\max_i(|Y_i|)}{N}`. (If the model groups all data into 1 single cluster, the purity simply equals the proportion of the largest true class in the dataset). * **Permutation Invariance:** Strictly invariant to permutations of cluster labels. * **Not Symmetric:** In general, :math:`\text{PuS}(y_{true}, y_{pred}) \neq \text{PuS}(y_{pred}, y_{true})`. * **Range:** ``[0.0, 1.0]`` * **References:** `Manning, Christopher D. Introduction to information retrieval. Syngress Publishing,, 2008. `_ ------------------------------------------------------------------------------- Example Usage ------------- .. code-block:: python :emphasize-lines: 11,12,25,26 from permetrics.clustering import ClusteringMetric # ============================================================================== # SCENARIO 1: Basic Evaluation # ============================================================================== print("--- 1. BASIC PURITY SCORE EXAMPLE ---") y_true = [0, 0, 0, 1, 1, 1] y_pred = [0, 0, 1, 1, 2, 2] cm = ClusteringMetric(y_true=y_true, y_pred=y_pred) pus_score = cm.PuS() print(f"Purity Score: {pus_score:.4f}") # Returns 0.8333 (5 out of 6 pure) # ============================================================================== # SCENARIO 2: Demonstrating the "Singleton Illusion" # ============================================================================== print("\n--- 2. THE SINGLETON ILLUSION ---") # 100 completely random true classes y_true_random = [0, 0, 1, 1, 2, 2] # Model cheats by putting every point into its own separate cluster y_pred_cheating = [10, 11, 12, 13, 14, 15] cm_cheat = ClusteringMetric(y_true=y_true_random, y_pred=y_pred_cheating) print(f"Cheating Model Purity: {cm_cheat.PuS()}") # Returns 1.0 (Beware!)