CDS - Czekanowski-Dice Score

The Czekanowski-Dice Score (CDS), widely known as the Sørensen–Dice coefficient and mathematically identical to the balanced F1-score computed over pairs of points, is an external clustering evaluation metric. It measures the similarity between two clustering partitions by computing the harmonic mean of the pairwise Precision and Recall.

Intuitively, CDS compares the pairs that both partitions co-cluster against all the co-clustering decisions made by either partition, counting each partition's decisions separately. It answers the question: “Of all the times either model decided to group two points together, what proportion of those decisions were mutual?” A score of 1.0 indicates identical clustering structures, and a score of 0.0 means the two partitions share no co-clustered pair.

\[\text{CDS} = \frac{2yy}{2yy + yn + ny}\]

Where across all pairs of distinct data points:

  • \(yy\) (True Positives): Number of pairs assigned to the same cluster in both the ground truth (\(y_{true}\)) and the prediction (\(y_{pred}\)).

  • \(yn\) (False Negatives): Pairs co-clustered in \(y_{true}\), but split in \(y_{pred}\).

  • \(ny\) (False Positives): Pairs co-clustered in \(y_{pred}\), but split in \(y_{true}\).

Expressed directly via the pairwise Precision (\(\text{PrS}\)) and Recall (\(\text{ReS}\)):

\[\text{CDS} = \frac{2 \times \text{PrS} \times \text{ReS}}{\text{PrS} + \text{ReS}}\]
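To make the two formulas above concrete, here is a minimal brute-force sketch that counts \(yy\), \(yn\), and \(ny\) by visiting every pair of points and then checks that both expressions agree. The helper name pair_counts_bruteforce is hypothetical and is not part of permetrics; it is meant only as an illustration of the definitions.

```python
from itertools import combinations

def pair_counts_bruteforce(y_true, y_pred):
    """Count (yy, yn, ny) by visiting every pair of distinct points.

    Illustrative helper only (not the library's code); runs in O(N^2).
    """
    yy = yn = ny = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_true and same_pred:
            yy += 1          # co-clustered in both partitions
        elif same_true:
            yn += 1          # co-clustered in y_true only
        elif same_pred:
            ny += 1          # co-clustered in y_pred only
    return yy, yn, ny

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
yy, yn, ny = pair_counts_bruteforce(y_true, y_pred)

# Direct form of the score
cds = 2 * yy / (2 * yy + yn + ny)

# Harmonic mean of pairwise precision and recall
precision = yy / (yy + ny)
recall = yy / (yy + yn)
f1 = 2 * precision * recall / (precision + recall)

print(yy, yn, ny)  # 2 1 2
print(cds, f1)     # both are 4/7, up to floating-point rounding
```

For these labels the shared pairs are (0, 1) and (2, 3), so \(yy = 2\), \(yn = 1\), \(ny = 2\), and both expressions evaluate to \(4/7 \approx 0.571\).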

Algorithmic Optimizations (Performance Note)

Brute-force iteration over all possible sample pairs scales quadratically at \(O(N^2)\), which causes severe bottlenecks on larger datasets.

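This implementation derives the exact pair totals (\(yy\), \(yn\), and \(ny\)) directly from the Contingency Matrix and its marginals. The number of pairs co-clustered in both partitions is the sum of \(\binom{n_{ij}}{2}\) over the matrix cells, and applying the same sum to the row totals and column totals gives the number of pairs co-clustered in \(y_{true}\) and in \(y_{pred}\) respectively. This avoids enumerating pairs and reduces the runtime complexity to \(O(N)\).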
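The contingency-based counting described above can be sketched as follows. This is an assumption-laden illustration using only the standard library, not the actual permetrics source; the function name pair_counts_contingency is hypothetical.

```python
from collections import Counter
from math import comb

def pair_counts_contingency(y_true, y_pred):
    """Derive (yy, yn, ny) from contingency counts in one pass over the labels.

    Hypothetical helper for illustration; not the library's implementation.
    """
    cells = Counter(zip(y_true, y_pred))  # n_ij: size of each (true, pred) cell
    rows = Counter(y_true)                # row marginals: cluster sizes in y_true
    cols = Counter(y_pred)                # column marginals: cluster sizes in y_pred

    # A group of n points contains comb(n, 2) pairs.
    yy = sum(comb(n, 2) for n in cells.values())
    pairs_true = sum(comb(n, 2) for n in rows.values())  # equals yy + yn
    pairs_pred = sum(comb(n, 2) for n in cols.values())  # equals yy + ny
    return yy, pairs_true - yy, pairs_pred - yy

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(pair_counts_contingency(y_true, y_pred))  # (2, 1, 2)
```

The counters are built in a single pass over the labels, and the remaining work is proportional to the number of non-empty cells and clusters, which never exceeds \(N\).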


Handling Edge Cases (Finite Values)

The calculation involves division by \(2yy + yn + ny\). If both partitions consist entirely of singletons (every cluster has exactly 1 data point), neither model groups any pairs together. The denominator evaluates to zero, triggering an undefined mathematical operation.

  • force_finite (bool): If True, catches the zero-division error and returns a safe fallback value instead of raising a ZeroDivisionError. Default is True.

  • finite_value (float): The fallback value returned when force_finite=True and the calculation fails. Since the worst possible valid score is 0.0, the default fallback is 0.0.
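The fallback behaviour described above might look roughly like the sketch below. The function cds_from_counts is a hypothetical stand-in written for illustration; only the parameter names force_finite and finite_value and their defaults come from the text, and the library's internal handling may differ.

```python
def cds_from_counts(yy, yn, ny, force_finite=True, finite_value=0.0):
    """Compute CDS from pair counts, guarding against a zero denominator.

    Hypothetical helper mirroring the documented parameters; not the
    library's actual code.
    """
    denominator = 2 * yy + yn + ny
    if denominator == 0:
        # Neither partition co-clusters any pair (e.g. all singletons).
        if force_finite:
            return finite_value
        raise ZeroDivisionError("CDS is undefined: no co-clustered pairs in either partition")
    return 2 * yy / denominator

# Both partitions made of singletons: yy = yn = ny = 0
print(cds_from_counts(0, 0, 0))                    # 0.0
print(cds_from_counts(0, 0, 0, finite_value=1.0))  # 1.0
```

Whether 0.0 is the right fallback depends on the use case: two all-singleton partitions are in fact identical, so a caller who wants that case scored as a perfect match could pass a different finite_value.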


Properties


Example Usage

import numpy as np  # needed for np.isclose in Scenario 2
from permetrics.clustering import ClusteringMetric

# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC CZEKANOWSKI-DICE SCORE EXAMPLE ---")

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]

cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
cds_score = cm.CDS()
print(f"Czekanowski-Dice Score: {cds_score}")

# ==============================================================================
# SCENARIO 2: Demonstrating Identity with F1-Score
# ==============================================================================
print("\n--- 2. IDENTITY CHECK EXAMPLE ---")

f1_score = cm.FmS(beta=1.0)
print(f"Are CDS and FmS(beta=1) exactly equal? {np.isclose(cds_score, f1_score)}")