CKS - Cohen's Kappa Score
=========================

.. toctree::
   :maxdepth: 3

.. contents:: Table of Contents
   :local:
   :depth: 2

**Cohen's Kappa Score (CKS)** :cite:`cohen1960coefficient` is a statistical measure of inter-rater agreement for categorical items. In classification benchmarking, it measures the agreement between the *Predicted Labels* and the *Ground Truth* while **correcting for the agreement that could occur purely by chance**.

.. image:: /_static/images/class_score_1.png
   :align: center
   :alt: Cohen's Kappa Agreement Illustration

.. math::

   \kappa = \frac{p_o - p_e}{1 - p_e}

Where:

* :math:`p_o` is the relative observed agreement among raters (identical to accuracy; in the binary case, :math:`\frac{TP + TN}{N}`).
* :math:`p_e` is the hypothetical probability of chance agreement, calculated from the marginal probabilities of each class.

-------------------------------------------------------------------------------

Engineering Insight: The "Lucky Guess" Filter
---------------------------------------------

Accuracy is misleading on imbalanced datasets because it counts agreement that a trivial majority-class predictor would also achieve as skill. Imagine an automated fraud detection engine evaluating 100 transactions (95 legitimate, 5 fraudulent). A broken model simply outputs ``"legitimate"`` 100% of the time:

* **Accuracy** (:math:`p_o`): ``0.95``
* **Expected Chance Agreement** (:math:`p_e`): The ground truth is 95% legitimate and the model predicts 100% legitimate, so the probability of both matching on ``"legitimate"`` by chance is :math:`0.95 \times 1.0 = 0.95`. The model never predicts ``"fraud"``, so that class contributes :math:`0.05 \times 0.0 = 0`, giving :math:`p_e = 0.95`.
* **Cohen's Kappa:** :math:`\frac{0.95 - 0.95}{1 - 0.95} = \mathbf{0.0}`

While accuracy awards the broken model 95%, Cohen's Kappa returns ``0.0``, showing that the model has no predictive skill beyond chance agreement.

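
The arithmetic above can be checked with a few lines of plain Python. This is a minimal sketch, not part of ``permetrics``: the helper ``cohen_kappa`` is a hypothetical name introduced here for illustration, and returning ``nan`` when :math:`p_e = 1` is an assumption of this sketch.

.. code-block:: python

   from collections import Counter


   def cohen_kappa(y_true, y_pred):
       """Cohen's Kappa for two equal-length label sequences (illustrative helper)."""
       n = len(y_true)
       # Observed agreement p_o: fraction of positions where both labels match.
       p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
       # Chance agreement p_e: sum over classes of the product of marginal rates.
       true_counts = Counter(y_true)
       pred_counts = Counter(y_pred)
       p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
       if p_e == 1.0:
           # Kappa is undefined when chance agreement is already total (assumption: return nan).
           return float("nan")
       return (p_o - p_e) / (1.0 - p_e)


   # The fraud scenario: 95 legitimate, 5 fraudulent, model always says "legitimate".
   y_true = ["legitimate"] * 95 + ["fraud"] * 5
   y_pred = ["legitimate"] * 100

   accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
   kappa = cohen_kappa(y_true, y_pred)

   print(f"accuracy = {accuracy}")
   print(f"kappa    = {kappa}")

Both the observed and the chance agreement come out at 0.95 here, so the numerator of :math:`\kappa` vanishes regardless of how high the accuracy looks.
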

-------------------------------------------------------------------------------

Architectural Design: One-vs-Rest Decomposition
-----------------------------------------------

Standard statistical literature defines multiclass Kappa over a single global :math:`K \times K` matrix. ``permetrics`` takes a different approach: it decomposes a multiclass problem into independent **One-vs-Rest (OvR)** :math:`2 \times 2` confusion matrices, one per class, calculates a class-specific Kappa score for each, and aggregates the scores according to the ``average`` parameter:

* **None:** Returns a dictionary/array of independent chance-corrected agreement scores, one for each target class.
* **macro:** Computes the unweighted mean of the One-vs-Rest Kappa scores. This highlights models that maintain genuine predictive skill across rare minority classes.
* **micro:** Computes a single Kappa globally from the aggregate matrix.
* **weighted:** Computes the mean of the OvR Kappa scores weighted by true class support.

-------------------------------------------------------------------------------

Benchmark Interpretation Scale
------------------------------

According to the widely cited guidelines of Landis & Koch (1977), Kappa values are categorized as follows:

=========== ==================================
Kappa Score Strength of Agreement
=========== ==================================
< 0.00      Poor (Systematic Disagreement)
0.00 - 0.20 Slight Agreement
0.21 - 0.40 Fair Agreement
0.41 - 0.60 Moderate Agreement
0.61 - 0.80 Substantial Agreement
0.81 - 1.00 Almost Perfect Agreement
=========== ==================================

-------------------------------------------------------------------------------

Properties
----------

* **Best possible score:** ``1.0`` (Perfect agreement between predictions and ground truth).
* **Baseline score:** ``0.0`` (Agreement is exactly what would be expected by random chance).
* **Worst possible score:** ``-1.0`` (Systematic inverse agreement; predictions match the ground truth less often than random chance would).
* **Range:** ``[-1.0, 1.0]``

-------------------------------------------------------------------------------

Example Usage
-------------

.. code-block:: python
   :emphasize-lines: 11,14,18,22,32,34-37,40-43,52,54-57

   from permetrics.classification import ClassificationMetric

   # ==============================================================================
   # SCENARIO 1: Binary Classification
   # The default 'binary' mode requires a specific positive class (pos_label)
   # ==============================================================================
   print("--- 1. BINARY CLASSIFICATION EXAMPLES ---")
   y_true_bin = [0, 1, 0, 0, 1, 0]
   y_pred_bin = [0, 1, 0, 0, 0, 1]

   cm_bin = ClassificationMetric(y_true_bin, y_pred_bin)

   # 1. Default configuration: average="binary", pos_label=1
   cks_bin_default = cm_bin.CKS()
   print(f"Default (average='binary', pos_label=1): {cks_bin_default}")

   # 2. Change pos_label to 0
   cks_bin_pos0 = cm_bin.CKS(average="binary", pos_label=0)
   print(f"Binary with pos_label=0                : {cks_bin_pos0}")

   # 3. Independent chance-adjusted scores per class
   cks_bin_none = cm_bin.CKS(average=None)
   print(f"Binary with average=None               : {cks_bin_none}")


   # ==============================================================================
   # SCENARIO 2: Multiclass Classification with Integer Labels
   # ==============================================================================
   print("\n--- 2. MULTICLASS (INTEGER LABELS) EXAMPLES ---")
   y_true_multi_int = [0, 1, 2, 0, 1, 2, 0, 2]
   y_pred_multi_int = [0, 2, 1, 0, 1, 1, 0, 2]
   cm_multi_int = ClassificationMetric(y_true_multi_int, y_pred_multi_int)

   print(f"average=None       : {cm_multi_int.CKS(average=None)}")
   print(f"average='macro'    : {cm_multi_int.CKS(average='macro')}")
   print(f"average='micro'    : {cm_multi_int.CKS(average='micro')}")
   print(f"average='weighted' : {cm_multi_int.CKS(average='weighted')}")

   # Filter specific classes
   print(f"Filter classes [1, 2] (average=None)      : {cm_multi_int.CKS(labels=[1, 2], average=None)}")
   print(f"Filter classes [1, 2] (average='macro')   : {cm_multi_int.CKS(labels=[1, 2], average='macro')}")
   print(f"Filter classes [1, 2] (average='micro')   : {cm_multi_int.CKS(labels=[1, 2], average='micro')}")
   print(f"Filter classes [1, 2] (average='weighted'): {cm_multi_int.CKS(labels=[1, 2], average='weighted')}")


   # ==============================================================================
   # SCENARIO 3: Multiclass Classification with Categorical/String Labels
   # ==============================================================================
   print("\n--- 3. MULTICLASS (CATEGORICAL/STRING LABELS) EXAMPLES ---")
   y_true_str = ["cat", "ant", "cat", "cat", "ant", "bird", "bird", "bird"]
   y_pred_str = ["ant", "ant", "cat", "cat", "ant", "cat", "bird", "ant"]
   cm_str = ClassificationMetric(y_true_str, y_pred_str)

   print(f"average=None (Class dict) : {cm_str.CKS(average=None)}")
   print(f"average='macro'           : {cm_str.CKS(average='macro')}")
   print(f"average='micro'           : {cm_str.CKS(average='micro')}")
   print(f"average='weighted'        : {cm_str.CKS(average='weighted')}")
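
To make the One-vs-Rest decomposition described earlier more concrete, the following plain-Python sketch binarizes each class, computes a :math:`2 \times 2` Kappa per class, and then applies the ``macro`` and ``weighted`` averages. It is an illustration under stated assumptions, not the internal code of ``permetrics``: the helpers ``binary_kappa`` and ``ovr_kappa`` are hypothetical names, the ``micro`` mode is deliberately left out, and the values it prints are not guaranteed to match the library's output.

.. code-block:: python

   from collections import Counter


   def binary_kappa(true_flags, pred_flags):
       """Cohen's Kappa for two equal-length sequences of booleans (illustrative helper)."""
       n = len(true_flags)
       # Observed agreement on the "is this class / is not this class" question.
       p_o = sum(t == p for t, p in zip(true_flags, pred_flags)) / n
       true_pos = sum(true_flags) / n
       pred_pos = sum(pred_flags) / n
       # Chance agreement from the marginals of the 2 x 2 matrix.
       p_e = true_pos * pred_pos + (1 - true_pos) * (1 - pred_pos)
       if p_e == 1.0:
           # Undefined when chance agreement is total (assumption: return nan).
           return float("nan")
       return (p_o - p_e) / (1 - p_e)


   def ovr_kappa(y_true, y_pred, average=None):
       """One-vs-Rest Kappa per class, optionally macro- or support-weighted."""
       classes = sorted(set(y_true) | set(y_pred))
       support = Counter(y_true)  # true class counts, used by 'weighted'
       per_class = {
           c: binary_kappa([t == c for t in y_true], [p == c for p in y_pred])
           for c in classes
       }
       if average is None:
           return per_class
       if average == "macro":
           return sum(per_class.values()) / len(per_class)
       if average == "weighted":
           total = sum(support[c] for c in classes)
           return sum(per_class[c] * support[c] for c in classes) / total
       raise ValueError(f"unsupported average in this sketch: {average!r}")


   # Same integer-label data as Scenario 2 above.
   y_true = [0, 1, 2, 0, 1, 2, 0, 2]
   y_pred = [0, 2, 1, 0, 1, 1, 0, 2]

   per_class = ovr_kappa(y_true, y_pred)
   macro = ovr_kappa(y_true, y_pred, average="macro")
   weighted = ovr_kappa(y_true, y_pred, average="weighted")

   print(f"per class: {per_class}")
   print(f"macro    : {macro}")
   print(f"weighted : {weighted}")

Keeping the per-class scores in a dictionary keyed by label mirrors the ``average=None`` behaviour described above and makes it easy to see which class drags the macro average down.
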