PuS - Purity Score

The Purity Score (PuS) is a simple, transparent external clustering evaluation metric. It quantifies the extent to which each predicted cluster contains data points belonging primarily to a single ground truth class.

Intuitively, PuS answers the question: “If we assign each predicted cluster to the ground truth class that appears most frequently within it, what would be our overall classification accuracy across the entire dataset?”

\[\text{PuS} = \frac{1}{N} \sum_{j=1}^{|P|} \max_{i} \left( C_{i, j} \right)\]

Where:

\(N\) is the total number of data points.
\(C\) is the \(|Y| \times |P|\) Contingency Matrix, where \(C_{i, j}\) represents the number of samples belonging to true class \(i\) that were assigned to predicted cluster \(j\).
\(\max_{i} (C_{i, j})\) extracts the sample count of the majority true class inside cluster \(j\).

Algorithmic Optimizations (Performance Note)

Standard textbook implementations compute Purity by iterating over every unique class and cluster label, utilizing boolean array masking. This incurs an inefficient runtime complexity of \(O(N \cdot |P| \cdot |Y|)\).

This implementation vectorizes the entire process. By constructing the Contingency Matrix and executing a single column-wise maximum reduction (np.amax(C, axis=0)), it evaluates the exact score in :math:`O(N)` time complexity. Furthermore, it is strictly safe for arbitrary label formats (strings, UUIDs, or non-consecutive integers).

Theoretical Limitation (The “Singleton Illusion”)

While PuS is easy to interpret, it cannot be used to trade off the quality of clustering against the number of clusters.

If a model assigns every single sample into its own individual cluster (\(|P| = N\)), each cluster contains exactly 1 sample. The majority class count for every cluster trivially equals 1, resulting in an artificial \(\text{PuS} = 1.0\). Therefore, Purity should always be evaluated alongside a complexity-penalizing metric like the Adjusted Rand Score (ARS) or Normalized Mutual Information (NMIS).

Properties 

Best possible score: 1.0 (Perfectly pure clusters).
Worst possible score: Bounded below by \(\frac{\max_i(|Y_i|)}{N}\). (If the model groups all data into 1 single cluster, the purity simply equals the proportion of the largest true class in the dataset).
Permutation Invariance: Strictly invariant to permutations of cluster labels.
Not Symmetric: In general, \(\text{PuS}(y_{true}, y_{pred}) \neq \text{PuS}(y_{pred}, y_{true})\).
Range: [0.0, 1.0]
References: Manning, Christopher D. Introduction to information retrieval. Syngress Publishing,, 2008.

Example Usage 

from permetrics.clustering import ClusteringMetric

# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC PURITY SCORE EXAMPLE ---")

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 2, 2]

cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
pus_score = cm.PuS()
print(f"Purity Score: {pus_score:.4f}")  # Returns 0.8333 (5 out of 6 pure)

# ==============================================================================
# SCENARIO 2: Demonstrating the "Singleton Illusion"
# ==============================================================================
print("\n--- 2. THE SINGLETON ILLUSION ---")

# 100 completely random true classes
y_true_random = [0, 0, 1, 1, 2, 2]
# Model cheats by putting every point into its own separate cluster
y_pred_cheating = [10, 11, 12, 13, 14, 15]

cm_cheat = ClusteringMetric(y_true=y_true_random, y_pred=y_pred_cheating)
print(f"Cheating Model Purity: {cm_cheat.PuS()}")  # Returns 1.0 (Beware!)

PuS - Purity Score

Algorithmic Optimizations (Performance Note)

Theoretical Limitation (The “Singleton Illusion”)

Properties

Example Usage

Properties 

Example Usage 