CS - Completeness Score

The Completeness Score (CS) is an external clustering evaluation metric based on conditional entropy. A clustering partition satisfies completeness if all data points that are members of a given ground truth class are assigned to the exact same predicted cluster.

Intuitively, CS answers the question: “Are all samples of class X put into the same cluster?” A score of 1.0 indicates a perfectly complete clustering, while 0.0 indicates that knowing a sample's true class tells you nothing about which cluster it was assigned to.

\[\text{CS} = 1 - \frac{\text{H}(P | Y)}{\text{H}(P)}\]

Where:

  • \(\text{H}(P | Y)\) is the conditional entropy of the predicted clusters \(P\) given the ground truth classes \(Y\). It quantifies the remaining uncertainty about which cluster a sample belongs to, given knowledge of its true class.

  • \(\text{H}(P)\) is the entropy of the predicted clusters.
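The two entropies above can be computed directly from label counts. The following is a minimal standard-library sketch, not the permetrics implementation; the helper names `entropy`, `conditional_entropy`, and `completeness` are hypothetical, and the sketch assumes \(\text{H}(P) > 0\).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(pred, true):
    """H(P | Y): uncertainty about the cluster that remains once the class is known."""
    n = len(true)
    joint = Counter(zip(true, pred))  # n(y, p): samples of class y in cluster p
    class_counts = Counter(true)      # n(y): samples of class y
    return -sum(
        (c / n) * math.log(c / class_counts[y]) for (y, _p), c in joint.items()
    )


def completeness(true, pred):
    """CS = 1 - H(P | Y) / H(P). Illustrative only: assumes H(P) > 0."""
    return 1.0 - conditional_entropy(pred, true) / entropy(pred)


y_true = [0, 0, 1, 1, 2, 2]
y_split = [0, 1, 2, 2, 3, 3]  # class 0 is split across clusters 0 and 1

cs_perfect = completeness(y_true, y_true)  # every class stays in one cluster
cs_split = completeness(y_true, y_split)   # roughly 0.826: class 0 is broken up
print(cs_perfect, cs_split)
```

Splitting a single class across two clusters is enough to pull the score below 1.0, because \(\text{H}(P | Y)\) is no longer zero for that class.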

Expressed directly via the Mutual Information Score (\(\text{MIS}\)):

\[\text{CS} = \frac{\text{MIS}(Y, P)}{\text{H}(P)}\]
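
The two forms agree because mutual information is the entropy of \(P\) minus what remains of it once \(Y\) is known:

\[\text{MIS}(Y, P) = \text{H}(P) - \text{H}(P | Y) \quad\Longrightarrow\quad \frac{\text{MIS}(Y, P)}{\text{H}(P)} = 1 - \frac{\text{H}(P | Y)}{\text{H}(P)} = \text{CS}\]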

Handling Edge Cases (Finite Values)

The calculation of CS divides by the entropy of the predicted clusters (\(\text{H}(P)\)). If the model assigns every sample to a single cluster (\(|P| = 1\)), \(\text{H}(P)\) is zero and the ratio is undefined. Two parameters control what is returned in that case:

  • force_finite (bool): If True, the function catches the zero-division error when \(\text{H}(P) = 0\) and returns a safe fallback value instead of raising a ValueError or ZeroDivisionError. Default is True.

  • finite_value (float): The fallback value returned when force_finite=True and the prediction contains only 1 cluster. Because placing all samples into a single cluster trivially guarantees that all members of any true class end up in the same cluster, the default fallback is 1.0.
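
The fallback logic described by these two parameters can be sketched as follows. This is an illustrative standard-library version under stated assumptions, not the library's code: `completeness_sketch` and its helpers are hypothetical names, and raising `ZeroDivisionError` when `force_finite=False` is a choice made for this sketch only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a sequence of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(true, pred):
    """MI(Y, P) computed from joint and marginal label counts."""
    n = len(true)
    joint = Counter(zip(true, pred))
    n_y, n_p = Counter(true), Counter(pred)
    return sum(
        (c / n) * math.log(c * n / (n_y[y] * n_p[p])) for (y, p), c in joint.items()
    )


def completeness_sketch(true, pred, force_finite=True, finite_value=1.0):
    """CS = MI(Y, P) / H(P), with a fallback when H(P) is zero (hypothetical helper)."""
    h_p = entropy(pred)
    if h_p == 0.0:  # every sample sits in one cluster, so the ratio is undefined
        if force_finite:
            return finite_value
        # Assumption of this sketch: signal the undefined ratio with an exception.
        raise ZeroDivisionError("H(P) is zero: completeness is undefined")
    return mutual_information(true, pred) / h_p


single = completeness_sketch([0, 1, 2, 3], [0, 0, 0, 0])                     # fallback 1.0
custom = completeness_sketch([0, 1, 2, 3], [0, 0, 0, 0], finite_value=0.0)   # fallback 0.0
print(single, custom)
```

Checking \(\text{H}(P)\) before dividing keeps the common path free of exception handling and makes the fallback explicit at the point where the ratio becomes undefined.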


Properties


Example Usage

from permetrics.clustering import ClusteringMetric

# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC COMPLETENESS SCORE EXAMPLE ---")

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 2, 2]

cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
cs_score = cm.CS()
print(f"Completeness Score: {cs_score}")

# ==============================================================================
# SCENARIO 2: Completeness vs Homogeneity Distinction
# ==============================================================================
print("\n--- 2. SINGLE CLUSTER (UNDER-SPLITTING) EXAMPLE ---")

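# Assigning every true class to one cluster is trivially complete: H(P) = 0, so CS
# falls back to finite_value (1.0 by default). HS, the Homogeneity Score, is printed for contrast.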
cm_single = ClusteringMetric(y_true=[0, 1, 2, 3], y_pred=[0, 0, 0, 0])
print(f"Single Cluster CS: {cm_single.CS()}")
print(f"Single Cluster HS: {cm_single.HS()}")