R - Pearson’s Correlation Coefficient

Pearson’s Correlation Coefficient (often denoted as R or PCC) [15] quantifies the strength and direction of the linear relationship between two variables, here the actual values \(y\) and the predicted values \(\hat{y}\).

\[\text{R}(y, \hat{y}) = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2} \sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}}\]

Note: \(\bar{y}\) and \(\bar{\hat{y}}\) are the means of the actual and predicted values, respectively, and \(N\) is the number of samples.
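The formula can be checked term by term with plain NumPy. The sketch below is illustrative only: `pearson_r` is a hypothetical helper name, not part of permetrics, and its result is compared against NumPy's `np.corrcoef`.

```python
import numpy as np

def pearson_r(y_true, y_pred):
    """Pearson's R computed directly from the formula (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Deviations from the means of the actual and predicted values
    dy = y_true - y_true.mean()
    dp = y_pred - y_pred.mean()
    # Numerator: sum of cross-deviations
    # Denominator: product of the square roots of the summed squared deviations
    return np.sum(dy * dp) / (np.sqrt(np.sum(dy ** 2)) * np.sqrt(np.sum(dp ** 2)))

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

r_manual = pearson_r(y_true, y_pred)
r_numpy = np.corrcoef(y_true, y_pred)[0, 1]
print("manual:", r_manual, "numpy:", r_numpy)
```

If either series is constant, the denominator is zero and R is undefined, so this sketch would need an explicit guard before being used on such inputs.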


Description

Advantages:
  • Trend identification: R shows whether the model's predictions follow the direction of the actual values. A value close to +1 indicates a strong positive correlation (predictions rise and fall with the actual values), while a value close to -1 indicates a strong negative correlation (predictions move in the opposite direction).

  • Scale-invariant: Because it normalizes the covariance by the standard deviations, R is dimensionless and is unchanged when either variable is shifted or multiplied by a positive constant. R values are therefore comparable across datasets with different units or scales.
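The scale-invariance property can be illustrated with a short NumPy sketch. The scale factors and offsets below are arbitrary values chosen for illustration.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

r_original = np.corrcoef(y_true, y_pred)[0, 1]

# Shift and rescale both series with positive factors (e.g. a unit change):
# R stays the same
r_rescaled = np.corrcoef(1000.0 * y_true + 5.0, 0.01 * y_pred - 2.0)[0, 1]

# A negative factor on one series flips the sign of R but keeps its magnitude
r_flipped = np.corrcoef(y_true, -y_pred)[0, 1]

print(r_original, r_rescaled, r_flipped)
```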

Disadvantages:
  • Linear constraint: The most critical limitation is that R only measures linear association between the actual and predicted values. If the two are related through a non-linear function (e.g., sinusoidal or quadratic), R can be near 0.0 even though one is fully determined by the other.

  • Ignores absolute error: R measures linear tracking, not absolute accuracy. If actual values are [1, 2, 3] and predictions are [10, 20, 30], the R score will be a perfect 1.0, completely ignoring the massive magnitude gap. It must be paired with error metrics like RMSE or MAE.
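Both limitations can be reproduced with plain NumPy. In this minimal sketch, the first case reuses the [1, 2, 3] versus [10, 20, 30] example and adds an RMSE for contrast; the second case uses illustrative data in which one series is an exact quadratic function of the other.

```python
import numpy as np

# Case 1: predictions are 10x too large, yet they track the actual values linearly
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([10.0, 20.0, 30.0])
r_scaled = np.corrcoef(y_true, y_pred)[0, 1]
rmse_scaled = np.sqrt(np.mean((y_true - y_pred) ** 2))
print("R:", r_scaled, "RMSE:", rmse_scaled)

# Case 2: b is an exact quadratic function of a, symmetric around zero,
# so the linear correlation between them vanishes
a = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
b = a ** 2
r_quadratic = np.corrcoef(a, b)[0, 1]
print("R (quadratic):", r_quadratic)
```

The first case gives a correlation of 1 alongside a large RMSE, which is why R is best reported together with an error metric.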


Properties

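  • Best possible score: 1.0 (perfect positive linear relationship between actual and predicted values). A value of -1.0 indicates a perfect negative linear relationship, meaning the predictions move exactly opposite to the actual values. A value of 0.0 indicates no linear relationship.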

  • Range: [-1.0, 1.0]


Example Usage

from numpy import array
from permetrics.regression import RegressionMetric

## 1. For 1-D array (Single-output)
y_true = array([3, -0.5, 2, 7])
y_pred = array([2.5, 0.0, 2, 8])

evaluator = RegressionMetric(y_true, y_pred)
# Calculate Pearson's Correlation Coefficient
print("R: ", evaluator.pearson_correlation_coefficient())

## 2. For > 1-D array (Multi-output)
y_true = array([[0.5, 1], [-1, 1], [7, -6]])
y_pred = array([[0, 2], [-1, 2], [8, -5]])

evaluator = RegressionMetric(y_true, y_pred)
# Return an array of scores for each column
print("R (Multi-output): ", evaluator.R(multi_output="raw_values"))