JS - Jaccard Score
The Jaccard Score (JS) (also known as the Jaccard Index or Tanimoto Coefficient) is an external clustering evaluation metric. It quantifies the similarity between two clustering partitions by measuring the ratio of truly co-clustered sample pairs to the total number of pairs that were grouped together by at least one of the partitions.
Intuitively, JS answers the question: “Of all the pairs of points that were grouped together in either the ground truth or the model’s prediction, what proportion of them were grouped together in both?” Unlike the Rand Score, the Jaccard Score completely ignores True Negatives (\(nn\)). This makes it exceptionally useful when analyzing datasets with a large number of clusters, where the vast majority of sample pairs belong to different clusters and would otherwise artificially inflate the similarity score.
\[\text{JS} = \frac{yy}{yy + yn + ny}\]
Where, across all pairs of distinct data points:
\(yy\) (True Positives): Number of pairs placed in the same cluster in both the ground truth (\(y_{true}\)) and the prediction (\(y_{pred}\)).
\(yn\) (False Negatives): Pairs co-clustered in \(y_{true}\), but separated in \(y_{pred}\).
\(ny\) (False Positives): Pairs co-clustered in \(y_{pred}\), but separated in \(y_{true}\).
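The definitions above can be checked by hand with a brute-force count over every pair. The following is a minimal sketch in plain Python, not the library's code; the helper name `pair_counts` is hypothetical and exists only for illustration.

```python
from itertools import combinations

def pair_counts(y_true, y_pred):
    """Count co-clustering agreements over all pairs of distinct points."""
    yy = yn = ny = nn = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_true = y_true[i] == y_true[j]
        same_pred = y_pred[i] == y_pred[j]
        if same_true and same_pred:
            yy += 1   # together in both partitions
        elif same_true:
            yn += 1   # together in y_true only
        elif same_pred:
            ny += 1   # together in y_pred only
        else:
            nn += 1   # separated in both (ignored by JS)
    return yy, yn, ny, nn

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
yy, yn, ny, nn = pair_counts(y_true, y_pred)
print(yy, yn, ny, nn)         # 2 1 2 10
print(yy / (yy + yn + ny))    # 0.4
```

Here 10 of the 15 pairs are True Negatives, yet they play no part in the final ratio of 2/5.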
Algorithmic Optimizations (Performance Note)
Iterating through all possible pair combinations to evaluate \(yy\), \(yn\), and \(ny\) scales quadratically at \(O(N^2)\).
This implementation derives the exact pair totals directly from the algebraic dot products of the Contingency Matrix marginals. This reduces the computational complexity to \(O(N)\) time, which keeps evaluation fast even on very large datasets.
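One way to obtain the same totals without enumerating pairs is sketched below. It is an assumption about the general technique, not a copy of the library's implementation: it builds the contingency counts with `collections.Counter` and uses the identity that a group of \(n\) points contains \(n(n-1)/2\) pairs. The helper names `pairs` and `pair_counts_fast` are hypothetical.

```python
from collections import Counter

def pairs(n):
    """Number of unordered pairs within a group of n points."""
    return n * (n - 1) // 2

def pair_counts_fast(y_true, y_pred):
    """Derive yy, yn, ny from contingency counts in a single pass over the labels."""
    cells = Counter(zip(y_true, y_pred))   # contingency matrix entries
    rows = Counter(y_true)                 # true cluster sizes (row sums)
    cols = Counter(y_pred)                 # predicted cluster sizes (column sums)
    yy = sum(pairs(n) for n in cells.values())
    yn = sum(pairs(n) for n in rows.values()) - yy   # co-clustered in y_true only
    ny = sum(pairs(n) for n in cols.values()) - yy   # co-clustered in y_pred only
    return yy, yn, ny

print(pair_counts_fast([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2]))   # (2, 1, 2)
```

The sums over the true cluster sizes and the predicted cluster sizes give \(yy + yn\) and \(yy + ny\) respectively, so subtracting \(yy\) isolates the two disagreement counts.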
Handling Edge Cases (Finite Values)
The Jaccard Score divides by \(yy + yn + ny\). If both partitions consist entirely of singletons (every cluster contains exactly one data point), neither partition groups any points together, so \(yy = yn = ny = 0\). The denominator is then zero and the score takes the undefined form \(0/0\).
force_finite (bool): If True, catches the zero-division error and returns a safe fallback value instead of raising a ZeroDivisionError. Default is True.
finite_value (float): The fallback value returned when force_finite=True and the calculation fails. Since the worst possible valid score is 0.0, the default fallback is 0.0.
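The behaviour described by these two parameters can be illustrated with a small standalone function. This is a sketch under stated assumptions rather than the library's source: the function name `jaccard_from_counts` is hypothetical, and only the parameter names and defaults are taken from the description above.

```python
def jaccard_from_counts(yy, yn, ny, force_finite=True, finite_value=0.0):
    """Compute yy / (yy + yn + ny), with an optional fallback for the 0/0 case."""
    denominator = yy + yn + ny
    if denominator == 0:
        if force_finite:
            return finite_value   # safe fallback instead of an exception
        raise ZeroDivisionError("no pair is co-clustered in either partition")
    return yy / denominator

print(jaccard_from_counts(0, 0, 0))                     # 0.0
print(jaccard_from_counts(0, 0, 0, finite_value=1.0))   # 1.0
print(jaccard_from_counts(2, 1, 2))                     # 0.4
```

The fallback only applies when the denominator is zero; any other input is scored normally regardless of `force_finite`.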
Properties
Best possible score: 1.0 (indicates identical clustering partitions).
Worst possible score: 0.0 (the two partitions share zero co-clustered pairs).
Permutation Invariance: The metric is completely invariant to permutations of cluster labels.
Symmetry: Strictly symmetric: \(\text{JS}(y_{true}, y_{pred}) = \text{JS}(y_{pred}, y_{true})\).
Range: [0.0, 1.0]
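These properties follow from viewing the score as a set-level Jaccard index over co-clustered pairs: intersection size divided by union size. The sketch below uses that equivalent formulation to check symmetry and label-permutation invariance on a small example. The helper names `co_clustered_pairs` and `jaccard` are hypothetical and are not part of the library.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """Set of index pairs (i, j), i < j, that share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard(y_a, y_b):
    a, b = co_clustered_pairs(y_a), co_clustered_pairs(y_b)
    # Assumes at least one co-clustered pair exists, so the union is non-empty.
    return len(a & b) / len(a | b)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
relabeled = ["c", "c", "a", "a", "a", "b"]   # same grouping as y_pred, new names

print(jaccard(y_true, y_pred) == jaccard(y_pred, y_true))     # True (symmetry)
print(jaccard(y_true, y_pred) == jaccard(y_true, relabeled))  # True (permutation invariance)
print(jaccard(y_true, y_true))                                # 1.0 (identical partitions)
```

Only the grouping of indices enters the computation, which is why renaming clusters or swapping the two arguments leaves the result unchanged.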
Example Usage
from permetrics.clustering import ClusteringMetric
# ==============================================================================
# SCENARIO 1: Basic Evaluation
# ==============================================================================
print("--- 1. BASIC JACCARD SCORE EXAMPLE ---")
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2]
cm = ClusteringMetric(y_true=y_true, y_pred=y_pred)
js_score = cm.JS()
print(f"Jaccard Score: {js_score}")
# ==============================================================================
# SCENARIO 2: Jaccard vs Rand Score on Highly Dispersed Clusters
# ==============================================================================
print("\n--- 2. TRUE NEGATIVE IGNORANCE EXAMPLE ---")
# When mostly singletons exist, Rand Score stays high due to 'nn', but JS drops
cm_sparse = ClusteringMetric(y_true=[0, 1, 2, 3, 4], y_pred=[0, 0, 1, 2, 3])
print(f"Rand Score (Inflated): {cm_sparse.RaS()}")
print(f"Jaccard Score (Strict): {cm_sparse.JS()}")