WI - Willmott Index of Agreement

The Willmott Index [14], widely known in scientific literature as the Index of Agreement (d), was developed by Cort J. Willmott (1981) to overcome the insensitivity of correlation-based measures to differences in the observed and predicted means and variances.

It is defined as one minus the ratio of the mean square error to the “potential error,” giving a standardized measure of model prediction error in which higher values indicate better agreement.

\[\text{WI}(y, \hat{y}) = 1 - \frac{ \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 }{ \sum_{i=1}^{N} \left( |\hat{y}_i - \bar{y}| + |y_i - \bar{y}| \right)^2 }\]

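Note: \(\bar{y}\) is the mean of the observed values. By the triangle inequality, each \(|y_i - \hat{y}_i|\) is at most \(|\hat{y}_i - \bar{y}| + |y_i - \bar{y}|\), so the denominator (the “potential error”) is an upper bound on the sum of squared errors. This is what keeps WI between 0 and 1.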
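As a minimal sketch of the formula, the index can be computed directly with NumPy. The helper name `willmott_index` is hypothetical and is not part of permetrics; it assumes 1-D inputs and a nonzero denominator.

```python
import numpy as np


def willmott_index(y_true, y_pred):
    """Hypothetical helper (not a permetrics function): WI for 1-D inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_mean = y_true.mean()  # mean of the observed values
    # Numerator: sum of squared errors
    sse = np.sum((y_true - y_pred) ** 2)
    # Denominator: the "potential error" (assumed nonzero here)
    potential = np.sum((np.abs(y_pred - y_mean) + np.abs(y_true - y_mean)) ** 2)
    return 1.0 - sse / potential


wi = willmott_index([3, -0.5, 2, 7], [2.5, 0.0, 2, 8])
print(f"WI: {wi:.4f}")  # 1 - 1.5 / 127.9375, roughly 0.9883
```

For these four points the sum of squared errors is 1.5 and the potential error is 127.9375, so the index is close to 1.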


Description

Key Insight: WI vs. Pearson Correlation (R). The Pearson correlation can be misleadingly high even when a model’s predictions are systematically biased (e.g., if the model always predicts exactly double the true value, R is still 1.0). Willmott’s Index of Agreement addresses this by penalizing additive and proportional differences between the observed and simulated means and variances. It measures agreement in absolute terms, not just linear correlation.
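The doubling example can be checked numerically. This is an illustrative sketch only: `willmott_index` is a hypothetical helper written from the formula above, not a permetrics function, and the data are made up.

```python
import numpy as np


def willmott_index(y_true, y_pred):
    """Hypothetical helper (not a permetrics function): WI for 1-D inputs."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    potential = np.sum((np.abs(y_pred - y_mean) + np.abs(y_true - y_mean)) ** 2)
    return 1.0 - sse / potential


y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = 2.0 * y_true  # model always predicts double the true value

r = np.corrcoef(y_true, y_pred)[0, 1]  # Pearson correlation
wi = willmott_index(y_true, y_pred)

print(f"R:  {r:.4f}")   # perfectly linear relationship, so R is 1
print(f"WI: {wi:.4f}")  # 1 - 55 / 139, roughly 0.6043
```

The correlation is perfect because the relationship is exactly linear, while WI drops to about 0.60 because the predictions disagree with the observations in magnitude.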

Advantages:
  • Strict Bounding: Unlike NSE or R2, which can approach negative infinity, WI is strictly bounded between 0.0 and 1.0. This makes it extremely stable for cross-model comparisons and multi-site averaging without the risk of a single catastrophic model skewing the mean.

  • Hydrological Standard: It is widely reported as an evaluation metric in climate, evapotranspiration, and hydrology studies.
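The bounding behavior can be illustrated by comparing WI with NSE on a deliberately catastrophic prediction. Both helpers below are hypothetical sketches written from the standard formulas, not permetrics functions, and the data are made up.

```python
import numpy as np


def willmott_index(y_true, y_pred):
    """Hypothetical helper: WI for 1-D inputs."""
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    potential = np.sum((np.abs(y_pred - y_mean) + np.abs(y_true - y_mean)) ** 2)
    return 1.0 - sse / potential


def nash_sutcliffe(y_true, y_pred):
    """Hypothetical helper: NSE = 1 - SSE / sum of squared deviations from the mean."""
    sse = np.sum((y_true - y_pred) ** 2)
    return 1.0 - sse / np.sum((y_true - y_true.mean()) ** 2)


y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_bad = y_true + 1000.0  # catastrophic model: every prediction is off by 1000

nse_bad = nash_sutcliffe(y_true, y_bad)
wi_bad = willmott_index(y_true, y_bad)

print(f"NSE: {nse_bad:.1f}")  # large negative value (1 - 5e6 / 10)
print(f"WI:  {wi_bad:.4f}")   # small, but still inside [0, 1]
```

Averaging NSE across sites would be dominated by the single bad site, whereas the WI of the same site stays within the unit interval.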

Disadvantages:
  • Outlier Sensitivity: Because both the numerator and denominator square the errors, the standard WI is highly sensitive to extreme outliers. (Willmott later proposed a “modified index of agreement” using absolute values to address this, but the squared version remains the most widely cited).

  • High-Value Bias: WI tends to yield relatively high values (e.g., > 0.6) even for poor models, so the threshold for a “good” score must be calibrated accordingly (scores above 0.85 are often required for a model to be considered acceptable).
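The outlier sensitivity can be seen by corrupting a single prediction in an otherwise accurate model. This is a sketch with made-up data, and `willmott_index` is a hypothetical helper rather than a permetrics function.

```python
import numpy as np


def willmott_index(y_true, y_pred):
    """Hypothetical helper: WI for 1-D inputs."""
    y_mean = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    potential = np.sum((np.abs(y_pred - y_mean) + np.abs(y_true - y_mean)) ** 2)
    return 1.0 - sse / potential


y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_clean = np.array([1.1, 1.9, 3.2, 3.8, 5.1])     # small errors everywhere
y_outlier = np.array([1.1, 1.9, 3.2, 3.8, 15.0])  # same, but one extreme miss

wi_clean = willmott_index(y_true, y_clean)
wi_outlier = willmott_index(y_true, y_outlier)

print(f"WI (clean):   {wi_clean:.4f}")    # roughly 0.9972
print(f"WI (outlier): {wi_outlier:.4f}")  # roughly 0.5427
```

One bad point out of five contributes almost all of the squared error and pulls the score from about 0.997 down to about 0.54.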


Properties

  • Range: [0, 1]. Higher is better, and a value of 1.0 indicates perfect agreement.


Example Usage

from numpy import array
from permetrics.regression import RegressionMetric

## 1. For 1-D array (Single-output)
y_true = array([3, -0.5, 2, 7])
y_pred = array([2.5, 0.0, 2, 8])

evaluator = RegressionMetric(y_true, y_pred)
# Calculate Willmott Index of Agreement
print("WI: ", evaluator.WI())

## 2. For > 1-D array (Multi-output)
y_true = array([[0.5, 1], [-1, 1], [7, -6]])
y_pred = array([[0, 2], [-1, 2], [8, -5]])

evaluator = RegressionMetric(y_true, y_pred)
# Return an array of scores for each column
print("WI (Multi-output): ", evaluator.WI(multi_output="raw_values"))