EnS - Entropy Score =================== .. toctree:: :maxdepth: 3 .. contents:: Table of Contents :local: :depth: 2 The **Entropy Score (EnS)** is an information-theoretic external clustering evaluation metric. Derived from Shannon Entropy, it evaluates the quality of a clustering partition by measuring the degree of class disorder (uncertainty) within each predicted cluster, weighted by the cluster's relative size. Intuitively, EnS answers the question: "If I randomly pick a data point from predicted cluster :math:`j`, how uncertain am I about its true ground truth class?" A score of ``0.0`` indicates absolute perfection (every cluster contains exclusively samples from a single class), while higher scores indicate chaotic, impure groupings. .. math:: \text{EnS} = \sum_{j=1}^{|P|} \frac{|P_j|}{N} \cdot E(P_j) Where the entropy :math:`E(P_j)` of an individual predicted cluster :math:`j` is defined as: .. math:: E(P_j) = - \sum_{i=1}^{|Y|} p_{i, j} \log_2 \left( p_{i, j} \right) And: * :math:`N` is the total number of data points. * :math:`|P_j|` is the number of samples assigned to predicted cluster :math:`j`. * :math:`p_{i, j} = \frac{C_{i, j}}{|P_j|}` is the empirical probability that a member of predicted cluster :math:`j` belongs to ground truth class :math:`i` (derived directly from the **Contingency Matrix** :math:`C`). ------------------------------------------------------------------------------- Algorithmic Optimizations (Performance Note) -------------------------------------------- Naive implementations evaluate cluster entropy by nested iterations over all predicted clusters and true classes, independently slicing and filtering arrays. This incurs an expensive runtime complexity of :math:`O(N \cdot |P| \cdot |Y|)`. This implementation fully vectorizes the entropy computation. By constructing the global Contingency Matrix, broadcasting the cluster marginal sums, and evaluating the vectorized Shannon entropy formula across non-zero probability elements, it calculates the exact global score in **:math:`O(N)` time complexity**. Furthermore, it safely neutralizes empty clusters and mathematically eliminates undefined :math:`\log_2(0)` operations. ------------------------------------------------------------------------------- Handling Edge Cases (Finite Values) ----------------------------------- * **Empty Clusters:** If a predicted cluster contains zero samples (:math:`|P_j| = 0`), its empirical probability distribution is undefined. This implementation explicitly filters out empty partitions prior to matrix division. * **Empty Datasets:** If the input arrays contain zero elements (:math:`N = 0`), the metric returns ``0.0`` by definition. ------------------------------------------------------------------------------- Properties ---------- * **Best possible score:** ``0.0`` (Lower is better; indicates zero class uncertainty inside any cluster). * **Worst possible score:** :math:`\log_2(|Y|)` (Occurs when every single predicted cluster contains a perfectly uniform, equal distribution of all ground truth classes). * **Permutation Invariance:** Strictly invariant to permutations of cluster labels. * **Not Symmetric:** In general, :math:`\text{EnS}(y_{true}, y_{pred}) \neq \text{EnS}(y_{pred}, y_{true})`. * **Duality with Purity:** While Purity evaluates the *mode* (maximum frequency) of a cluster's class distribution, Entropy evaluates the *entire distribution shape*. * **References:** * `Shannon, Claude Elwood. "A mathematical theory of communication." The Bell system technical journal 27.3 (1948): 379-423. `_ * `Meilă, Marina. "Comparing clusterings—an information based distance." Journal of multivariate analysis 98.5 (2007): 873-895. `_ ------------------------------------------------------------------------------- Example Usage ------------- .. code-block:: python :emphasize-lines: 11,12,16,17,25,26 from permetrics.clustering import ClusteringMetric # ============================================================================== # SCENARIO 1: Basic Evaluation (Perfect vs Impure) # ============================================================================== print("--- 1. BASIC ENTROPY SCORE EXAMPLE ---") y_true = [0, 0, 0, 1, 1, 1] y_pred = [0, 0, 0, 1, 1, 1] cm = ClusteringMetric(y_true=y_true, y_pred=y_pred) print(f"Perfect Clustering Entropy: {cm.EnS():.4f}") # Returns 0.0000 # Introduce chaos (cluster 1 now has mixed classes) y_pred_impure = [0, 0, 1, 1, 1, 0] cm_impure = ClusteringMetric(y_true=y_true, y_pred=y_pred_impure) print(f"Impure Clustering Entropy: {cm_impure.EnS():.4f}") # Returns higher value # ============================================================================== # SCENARIO 2: Maximum Uncertainty Benchmark # ============================================================================== print("\n--- 2. MAXIMUM UNCERTAINTY EXAMPLE ---") # Every predicted cluster is a 50/50 coin flip of class 0 and class 1 cm_chaos = ClusteringMetric(y_true=[0, 1, 0, 1], y_pred=[0, 0, 1, 1]) print(f"Maximum Entropy Score: {cm_chaos.EnS():.4f}") # Returns 2.0