COR - Correlation Coefficient

Correlation (COR) [22] (specifically Pearson’s Correlation Coefficient) measures the strength and direction of the linear relationship between the actual values (\(y\)) and predicted values (\(\hat{y}\)).

It is calculated by dividing the covariance of the variables by the product of their standard deviations. This normalization scales the output to a specific range, making it dimensionless and entirely independent of the units of measurement.

\[\text{COR}(y, \hat{y}) = \frac{\text{cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}} = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}\]

Description

Advantages:
  • Scale-invariant: Because it is dimensionless, you can use COR to compare the predictive behavior of models across entirely different datasets and domains.

  • Trend evaluation: Highly effective at identifying whether the model correctly captures the directional trend of the data (e.g., when the actual value goes up, the prediction also goes up).

Disadvantages:
  • Ignores magnitude (Crucial limitation): COR measures linear tracking, not absolute accuracy. If actual values are [1, 2, 3] and predictions are [1000, 2000, 3000], the COR will be a perfect 1.0, even though the absolute errors are massive. It should always be paired with an error metric like RMSE or MAE.

  • Linearity assumption: It only evaluates linear relationships. A model might capture a perfect non-linear relationship (e.g., quadratic) but still score a low or zero Pearson correlation.

  • Outlier sensitivity: A single massive outlier can heavily distort the covariance, drastically shifting the correlation score (as demonstrated in Anscombe’s quartet).


Properties

  • Best possible score: 1.0 (Perfect positive correlation). A score of -1.0 indicates perfect negative correlation, and 0.0 indicates no linear correlation.

  • Range: [-1.0, 1.0]

  • Mathematical Reference: Corporate Finance Institute


Example Usage

from numpy import array
from permetrics.regression import RegressionMetric

## 1. For 1-D array (Single-output)
y_true = array([3, -0.5, 2, 7])
y_pred = array([2.5, 0.0, 2, 8])

evaluator = RegressionMetric(y_true, y_pred)
# Calculate Correlation Coefficient
print("COR: ", evaluator.COR())

## 2. For > 1-D array (Multi-output)
y_true = array([[0.5, 1], [-1, 1], [7, -6]])
y_pred = array([[0, 2], [-1, 2], [8, -5]])

evaluator = RegressionMetric(y_true, y_pred)
# Return an array of scores for each column
print("COR (Multi-output): ", evaluator.COR(multi_output="raw_values"))