Understanding and Implementing Key Regression Evaluation Metrics

Introduction

Machine learning (ML) problems are commonly grouped by whether the dataset has a target feature (supervised or unsupervised) and, for supervised problems, whether that target is a continuous numerical value or a categorical variable (regression or classification).

Regression analysis is a branch of supervised ML that aims to predict a continuous numeric target variable. Typical regression tasks include:

  • Healthcare cost estimation

  • Student test performance

  • Daily electricity consumption

The goal is to build a robust model that generalizes well. You can only confirm that by evaluating the trained model on a held-out test dataset to see how well it performs. For regression models, there is no one-size-fits-all evaluation metric. The right choice depends largely on the dataset and use case, because each metric has distinct properties that give it particular strengths and weaknesses.


Residuals

Before discussing the various regression evaluation metrics, let’s consider residuals, the quantity on which all of these metrics are built.

A residual is the difference between an actual value and the corresponding predicted value.

The mathematical formula for calculating Residuals:

$$e_i = y_i - \hat{y}_i$$

Realistically, a regression model will rarely predict the exact value; each prediction will be somewhat higher or lower than the actual value. The aim is to build a model with small residuals, i.e., small differences between the actual and predicted values. The closer the residuals are to zero, the better the model’s predictions.
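As a minimal sketch of the residual formula, the snippet below computes residuals in plain Python. The values in y_true and y_pred are hypothetical numbers chosen only for illustration; they do not come from a real model.

```python
# Hypothetical actual and predicted values, chosen only for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

# Residual for each observation: e_i = y_i - y_hat_i
residuals = [actual - predicted for actual, predicted in zip(y_true, y_pred)]

print(residuals)  # [0.5, 0.0, -1.5, -1.0]
```

A positive residual means the model under-predicted, a negative one means it over-predicted, and zero means the prediction was exact.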

Evaluation Metrics

Mean Absolute Error (MAE)

Mean Absolute Error (MAE): MAE computes the average absolute difference between the predicted and actual values of the target variable. Taking the absolute value turns every residual, which may be negative, into a non-negative number, so positive and negative errors cannot cancel each other out.

The MAE can be calculated using the mathematical formula below:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |$$

A low MAE signifies a good model, whereas a high MAE indicates that the model performs poorly in making predictions.

Characteristics

  • It is easier to interpret, as it is expressed in the scale and unit of the target value.

  • Less likely to be affected by extreme values, i.e., outliers.

  • It may not be the best choice when large errors are especially costly. Because every error is weighted linearly, it does not penalize large errors more heavily than small ones.

  • It disregards the direction of the error, so it treats both negative and positive errors as the same.

Implementation

from sklearn.metrics import mean_absolute_error 
mae = mean_absolute_error(y_test, y_pred)
  • y_test is the actual values while y_pred is the predicted values.

  • The mean_absolute_error() function from the scikit-learn library takes y_test and y_pred and returns the MAE.
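To make the formula concrete, here is a from-scratch sketch of MAE in plain Python. The helper name mean_absolute_error_manual and the sample values are assumptions made for this illustration; they are not part of scikit-learn.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of |y_i - y_hat_i| over all observations (illustrative helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mae = mean_absolute_error_manual(y_true, y_pred)
print(mae)  # 0.75
```

The absolute errors are 0.5, 0.0, 1.5, and 1.0, so their average is 3.0 / 4 = 0.75, expressed in the same unit as the target.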

Mean Squared Error (MSE)

Mean Squared Error (MSE): MSE calculates the average of the squared differences (errors) between the actual and predicted values across the dataset. It is one of the most widely used evaluation metrics.

The MSE can be calculated using the mathematical formula:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

The lower the MSE score, the better the model predicts the target value. A good model will have an MSE score closer to zero.

Characteristics

  • Sensitive to extreme values i.e., outliers. It penalizes large errors.

  • It is harder to interpret, as it is measured in the squares of the original units and scale.

  • The MSE value is never negative; it ranges from zero to infinity.

  • MSE grows as the residuals grow, and it grows quadratically, so doubling every residual quadruples the MSE.

Implementation

from sklearn.metrics import mean_squared_error 
mse = mean_squared_error(y_test, y_pred)
  • y_test is the actual values while y_pred is the predicted values.

  • The mean_squared_error() function from the scikit-learn library takes y_test and y_pred and returns the MSE.
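The same calculation can be sketched by hand to show what the formula does. The helper name mean_squared_error_manual and the sample values below are assumptions made for illustration only.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of (y_i - y_hat_i)^2 over all observations (illustrative helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

mse = mean_squared_error_manual(y_true, y_pred)
print(mse)  # 0.875
```

The squared errors are 0.25, 0.0, 2.25, and 1.0. Note how the residual of -1.5 contributes 2.25, more than the other three errors combined, which is the squaring effect behind MSE's sensitivity to large errors.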

Root Mean Squared Error (RMSE)

Root Mean Squared Error (RMSE): RMSE is the square root of the mean squared error. Taking the square root brings the MSE back to the target variable's original units and scale.

RMSE can be calculated with the following mathematical formula:

$$\text{RMSE} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 }$$

A good model has a low RMSE value relative to the scale and unit of the target variable. The closer it is to zero, the better the model predictions.

Characteristics

  • It is easier to interpret as it is in the same unit as the target variable.

  • The result is never negative.

  • Sensitive to extreme values, i.e., outliers.

  • RMSE increases with an increase in residuals.

Implementation

import numpy as np 
from sklearn.metrics import mean_squared_error 

# square root of MSE 
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
  • Taking the NumPy square root of the MSE, as shown above, works in any scikit-learn version. Older releases have no dedicated RMSE function, while scikit-learn 1.4 and later also provide root_mean_squared_error() in sklearn.metrics.

  • Computing RMSE this way requires:

  • y_test - actual values

  • y_pred - predicted values

  • np.sqrt() - NumPy square root function
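For readers who want to see the arithmetic without any library, the sketch below computes RMSE with only the standard library. The helper name rmse_manual and the sample values are assumptions made for illustration.

```python
import math


def rmse_manual(y_true, y_pred):
    """Square root of the mean squared error (illustrative helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

rmse = rmse_manual(y_true, y_pred)
print(round(rmse, 4))  # 0.9354
```

Here the MSE is 0.875 in squared units, and its square root, about 0.9354, is back in the unit of the target variable.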

R Squared (R2)

R Squared (R2): R2, also called the coefficient of determination, is a statistical metric that measures how much of the variation in the target variable is explained by the predictors (independent variables). In other words, it is the proportion of variance in the target variable captured by the model. It is one of the most widely used metrics for evaluating regression models.

The formula for the coefficient of determination (R2) is:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Where:

  • SSR = Sum of Squared Residuals (the numerator of the fraction)

  • SST = Total Sum of Squares (the denominator of the fraction)

  • $y_i$ are the actual values

  • $\hat{y}_i$ are the predicted values

  • $\bar{y}$ is the mean of the actual values

The closer the R2 score is to 1, the better the model fits the data.

Characteristics

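  • R2 is a relative metric: it compares the model's errors with those of a baseline that always predicts the mean of the actual values.

  • R2 is at most One (1). On the data the model was fitted to, it typically ranges from Zero (0) to One (1), but on a test set it can be negative when the model predicts worse than simply using the mean.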

  • R2 is independent of the target variable scale or unit

  • R2 is heavily sensitive to extreme values.

  • R2 on the training data generally increases as more predictors are added, even irrelevant ones, which can hide overfitting.

  • R2 allows the goodness-of-fit comparison of different models applied to the same dataset.

Implementation

from sklearn.metrics import r2_score 
r2 = r2_score(y_test, y_pred)
  • y_test is the actual values while y_pred is the predicted values.

  • The r2_score() function from the scikit-learn library takes y_test and y_pred and returns the R2 score.
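The two sums in the R2 formula can be computed by hand as a quick check. This is a minimal sketch; the helper name r2_manual and the sample values are assumptions made for illustration.

```python
def r2_manual(y_true, y_pred):
    """R2 = 1 - SSR / SST, computed from scratch (illustrative helper)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_true) / len(y_true)
    # SSR: squared differences between actual and predicted values.
    ssr = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    # SST: squared differences between actual values and their mean.
    sst = sum((a - mean_y) ** 2 for a in y_true)
    if sst == 0:
        raise ValueError("R2 is undefined when all actual values are identical")
    return 1 - ssr / sst


# Hypothetical values, chosen only for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]

r2 = r2_manual(y_true, y_pred)
print(round(r2, 4))  # 0.7241
```

With these numbers, SSR is 3.5 and SST is 12.6875, so the model explains roughly 72% of the variation in the actual values.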

Adjusted R Squared (Adjusted R2)

Adjusted R Squared (Adjusted R2): Adjusted R2 was created to address a fundamental problem with R2. When new predictors are added to a model, R2 generally tends to increase, which can create the false impression that more predictors always mean better performance. This can result in models that do not perform well on new, unseen data.

Adjusted R2 addresses this by penalizing the addition of irrelevant predictors. When a new predictor does not improve the model enough to justify the added complexity, the Adjusted R2 value decreases, which gives a more honest measure of model performance.

The formula for the Adjusted R2 is:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

Where:

  • R2 is the coefficient of determination

  • n is the number of observations

  • k is the number of predictors (independent variables)

Characteristics

  • Useful for comparing models with different numbers of predictors

  • Less prone to rewarding overfitting. It discourages unnecessary model complexity.

  • Increases only if a new predictor improves the model by more than the penalty for adding it.

Implementation

# import r2_score 
from sklearn.metrics import r2_score 
r2 = r2_score(y_test, y_pred)

# Number of observations and predictors
n = X_test.shape[0]
k = X_test.shape[1]

# Adjusted R² calculation
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
  • adjusted_r2 is calculated from the number of observations (n), the number of predictors (k), and the r2_score result.
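The calculation can also be wrapped in a small reusable function. This is a sketch under stated assumptions: the function name adjusted_r2_score and the example numbers (an R2 of 0.90 on 50 observations) are hypothetical, and the guard simply reflects that the formula divides by n - k - 1.

```python
def adjusted_r2_score(r2, n, k):
    """Adjusted R2 from R2, n observations, and k predictors (illustrative helper)."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same hypothetical R2 and sample size, different numbers of predictors.
few_predictors = adjusted_r2_score(0.90, n=50, k=5)
many_predictors = adjusted_r2_score(0.90, n=50, k=10)

print(round(few_predictors, 4))   # 0.8886
print(round(many_predictors, 4))  # 0.8744
```

For an identical R2, the model that needed ten predictors receives a lower Adjusted R2 than the one that needed five, which is the penalty for added complexity.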

Summary: Notable Points from These Evaluation Metrics

  1. A regression model with lower MAE, MSE, and RMSE values is more accurate. On the other hand, a regression model with a higher R2 value is considered desirable.

  2. The RMSE is always greater than or equal to the MAE. If RMSE = MAE, then all the residuals have the same magnitude.

  3. The MAE and RMSE can vary from 0 to infinity. They are unaffected by the direction of the residuals.

  4. RMSE is sensitive to extreme values and penalizes them more than MAE does. MAE is more robust to outliers.

  5. R2 and Adjusted R2 measure the regression model’s independent variables' capacity to explain the variability in the dependent variable.

  6. Adjusted R2 addresses the flaw in R2 by incorporating the number of predictors in its calculation. Even when R2 increases, adding independent variables that contribute little to the model can cause the Adjusted R2 to decrease.

  7. Try different evaluation metrics depending on your use case
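Points 2 and 4 above can be checked with a few lines of Python. The two residual lists below are hypothetical and chosen only to illustrate the behavior; the helper names mae_of and rmse_of are likewise assumptions for this sketch.

```python
import math


def mae_of(residuals):
    """Mean absolute error computed directly from residuals."""
    return sum(abs(e) for e in residuals) / len(residuals)


def rmse_of(residuals):
    """Root mean squared error computed directly from residuals."""
    return math.sqrt(sum(e * e for e in residuals) / len(residuals))


# All residuals share the same magnitude: RMSE equals MAE.
equal_magnitude = [2.0, -2.0, 2.0, -2.0]
print(mae_of(equal_magnitude), rmse_of(equal_magnitude))  # 2.0 2.0

# One large residual: RMSE rises well above MAE.
with_outlier = [1.0, -1.0, 1.0, -10.0]
print(mae_of(with_outlier))             # 3.25
print(round(rmse_of(with_outlier), 2))  # 5.07
```

A large gap between RMSE and MAE on the same predictions is therefore a hint that a few large errors dominate the total.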

Conclusion

Regression models are evaluated with metrics such as MAE, MSE, RMSE, R2, and Adjusted R2 to assess their performance and check how well they generalize to unseen data.

Each metric has unique characteristics that account for its strengths and weaknesses, so no single metric suits every dataset and scenario. Understanding these metrics and how to implement them is vital for choosing the right one for a given use case.

For the same reason, it is advisable to use a combination of these metrics when testing a model's quality and reliability.