DHI - Duda-Hart Index
The Duda-Hart Index (DHI) (adapted formulation) is an internal clustering validation metric. It assesses the quality of a clustering partition by evaluating the ratio of the average intra-cluster pairwise distances to the average inter-cluster pairwise distances.
Intuitively, DHI answers the question: “How close are points within the same cluster compared to their distance to points in all other clusters?” A smaller DHI score implies an optimal clustering structure, where clusters are highly cohesive (small numerator) and well-isolated from one another (large denominator).
Where:
\(K\) is the total number of clusters.
\(D_{\text{intra}}(C_k)\) is the mean Euclidean distance between all pairs of points within cluster \(C_k\).
\(D_{\text{inter}}(C_k)\) is the mean Euclidean distance between points in cluster \(C_k\) and points in all other distinct clusters.
Algorithmic Variations (Memory Optimization)
Calculating pairwise distances for the Duda-Hart index typically requires instantiating an \(N \times N\) distance matrix, which causes severe Out-Of-Memory (OOM) errors on large datasets (e.g., \(N > 40,000\) consuming 10GB+ RAM).
This implementation employs a highly optimized chunk-based matrix multiplication strategy:
chunk_size (int): Computes distances in batches (default:
2000). This algorithm tightly caps RAM consumption to a safe minimum (~1.5GB for 100K samples) while mathematically guaranteeing exact parity with the standard \(O(N^2)\) memory approach.
Handling Edge Cases (Finite Values)
The calculation of DHI requires comparing distances between at least two distinct clusters. If all data points are assigned to a single cluster (\(K = 1\)), the inter-cluster distance (denominator) is zero, rendering the metric mathematically undefined.
force_finite (bool): If
True, the function catches the undefined operation and returns a safe, finite number instead of raising aValueErrororZeroDivisionError. Default isTrue.finite_value (float): The fallback value returned when
force_finite=Trueand the clustering fails edge-case checks. Since a smaller score is better for DHI, the default fallback is a massive penalty value (1e10).
Properties
Best possible score:
0.0(Smaller value is better. A score approaching 0 means clusters are perfectly dense dots separated by vast distances).Worst possible score:
+inf(or the defined penaltyfinite_value).Range:
[0.0, +inf)References: Duda, R. O., & Hart, P. E. (1973). Pattern classification and scene analysis. John Wiley & Sons.
Example Usage
from permetrics.clustering import ClusteringMetric
import numpy as np
# ==============================================================================
# SCENARIO 1: Normal Evaluation (Memory-Optimized)
# ==============================================================================
print("--- 1. BASIC DUDA-HART INDEX EXAMPLE ---")
X_data = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])
y_pred_labels = np.array([0, 0, 0, 1, 1, 1])
cm = ClusteringMetric(X=X_data, y_pred=y_pred_labels)
# Calculates DHI (implicitly uses safe chunk_size=2000)
dhi_score = cm.DHI()
print(f"Duda-Hart Index: {dhi_score}")
# ==============================================================================
# SCENARIO 2: Adjusting Chunk Size for Hardware Restraints
# ==============================================================================
print("\n--- 2. HARDWARE OPTIMIZATION EXAMPLE ---")
# Lower the chunk_size if running on extremely constrained RAM environments
dhi_constrained = cm.DHI(chunk_size=1000)
print(f"Duda-Hart Index (Lower RAM cap): {dhi_constrained}")
# ==============================================================================
# SCENARIO 3: Edge Case with 1 Cluster
# ==============================================================================
print("\n--- 3. EDGE CASE (1 CLUSTER) EXAMPLE ---")
y_pred_single = np.array([0, 0, 0, 0, 0, 0])
cm_single = ClusteringMetric(X=X_data, y_pred=y_pred_single)
# Returns the penalty finite_value (1e10) instead of crashing
dhi_safe = cm_single.DHI(force_finite=True, finite_value=1e10)
print(f"DHI with 1 cluster (Safe Mode): {dhi_safe}")