EnS - Entropy Score

The Entropy Score (EnS) is an information-theoretic external clustering evaluation metric. Derived from Shannon Entropy, it evaluates the quality of a clustering partition by measuring the degree of class disorder (uncertainty) within each predicted cluster, weighted by the cluster’s relative size.

Intuitively, EnS answers the question: “If I randomly pick a data point from predicted cluster \(j\), how uncertain am I about its true ground truth class?” A score of 0.0 indicates absolute perfection (every cluster contains exclusively samples from a single class), while higher scores indicate chaotic, impure groupings.

\[\text{EnS} = \sum_{j=1}^{|P|} \frac{|P_j|}{N} \cdot E(P_j)\]

Where the entropy \(E(P_j)\) of an individual predicted cluster \(j\) is defined as:

\[E(P_j) = - \sum_{i=1}^{|Y|} p_{i, j} \log_2 \left( p_{i, j} \right)\]

And:

\(N\) is the total number of data points.
\(|P_j|\) is the number of samples assigned to predicted cluster \(j\).
\(p_{i, j} = \frac{C_{i, j}}{|P_j|}\) is the empirical probability that a member of predicted cluster \(j\) belongs to ground truth class \(i\) (derived directly from the Contingency Matrix \(C\)).

Algorithmic Optimizations (Performance Note)

Naive implementations evaluate cluster entropy by nested iterations over all predicted clusters and true classes, independently slicing and filtering arrays. This incurs an expensive runtime complexity of \(O(N \cdot |P| \cdot |Y|)\).

This implementation fully vectorizes the entropy computation. By constructing the global Contingency Matrix, broadcasting the cluster marginal sums, and evaluating the vectorized Shannon entropy formula across non-zero probability elements, it calculates the exact global score in :math:`O(N)` time complexity. Furthermore, it safely neutralizes empty clusters and mathematically eliminates undefined \(\log_2(0)\) operations.

Handling Edge Cases (Finite Values)

Empty Clusters: If a predicted cluster contains zero samples (\(|P_j| = 0\)), its empirical probability distribution is undefined. This implementation explicitly filters out empty partitions prior to matrix division.
Empty Datasets: If the input arrays contain zero elements (\(N = 0\)), the metric returns 0.0 by definition.

Properties 

Best possible score: 0.0 (Lower is better; indicates zero class uncertainty inside any cluster).
Worst possible score: \(\log_2(|Y|)\) (Occurs when every single predicted cluster contains a perfectly uniform, equal distribution of all ground truth classes).
Permutation Invariance: Strictly invariant to permutations of cluster labels.
Not Symmetric: In general, \(\text{EnS}(y_{true}, y_{pred}) \neq \text{EnS}(y_{pred}, y_{true})\).
Duality with Purity: While Purity evaluates the mode (maximum frequency) of a cluster’s class distribution, Entropy evaluates the entire distribution shape.
References:
- Shannon, Claude Elwood. “A mathematical theory of communication.” The Bell system technical journal 27.3 (1948): 379-423.
- Meilă, Marina. “Comparing clusterings—an information based distance.” Journal of multivariate analysis 98.5 (2007): 873-895.

Example Usage 

from permetrics.clustering import ClusteringMetric

# ==============================================================================
# SCENARIO 1: Basic Evaluation (Perfect vs Impure)
# ==============================================================================
print("--- 1. BASIC ENTROPY SCORE EXAMPLE ---")

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]

cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
print(f"Perfect Clustering Entropy: {cm.EnS():.4f}")  # Returns 0.0000

# Introduce chaos (cluster 1 now has mixed classes)
y_pred_impure = [0, 0, 1, 1, 1, 0]
cm_impure = ClusteringMetric(y_true=y_true, y_pred=y_pred_impure)
print(f"Impure Clustering Entropy:  {cm_impure.EnS():.4f}")  # Returns higher value

# ==============================================================================
# SCENARIO 2: Maximum Uncertainty Benchmark
# ==============================================================================
print("\n--- 2. MAXIMUM UNCERTAINTY EXAMPLE ---")

# Every predicted cluster is a 50/50 coin flip of class 0 and class 1
cm_chaos = ClusteringMetric(y_true=[0, 1, 0, 1], y_pred=[0, 0, 1, 1])
print(f"Maximum Entropy Score: {cm_chaos.EnS():.4f}")  # Returns 2.0

EnS - Entropy Score

Algorithmic Optimizations (Performance Note)

Handling Edge Cases (Finite Values)

Properties

Example Usage

Properties 

Example Usage 