Forecast Evaluation Methods

June 28, 2025 | by Bloom Code Studio

Learning Outcomes

By the end of this section, you should be able to:

  • 5.4.1 Explain the nature of error in forecasting a time series.
  • 5.4.2 Compute common error measures for time series models.
  • 5.4.3 Produce prediction intervals in a forecasting example.

A time series forecast essentially predicts middle-of-the-road values for future terms of the series. For the purposes of prediction, the model considers only the trend and any cyclical or seasonal variation that can be detected within the known data. Although noise and random variation certainly influence future values of the time series, their precise effects cannot be predicted (by their very nature). Thus, a forecasted value is the model's best guess in the absence of an error term. On the other hand, error does affect how certain we can be of the model's forecasts. In this section, we focus on quantifying the error in forecasting using statistical tools such as prediction intervals.

Forecasting Error

Suppose that you have a model $(\hat{x}_n)$ for a given time series $(x_n)$. Recall from Components of Time Series Analysis that the residuals quantify the error between the time series and the model and serve as an estimate of the noise or random variation.

$\varepsilon_n = x_n - \hat{x}_n$

The better the fit of the model to the observed data, the smaller the values of $\varepsilon_n$ will be. However, even very good models can turn out to be poor predictors of future values of the time series. Ironically, the better a model does at fitting the known data, the worse it may be at predicting future data; this is the danger of overfitting. (This term is defined in detail in Decision-Making Using Machine Learning Basics, but for now we do not need to go deeply into the topic.) A common technique to avoid overfitting is to build the model using only a portion of the known data, and then test the model's accuracy on the remaining data that was held out.
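The holdout idea can be sketched in a few lines. The series below is made-up, and the 80/20 split ratio is just one common choice; the essential point is that a time series split must preserve temporal order.

```python
import numpy as np

# Hypothetical monthly series (illustrative values only)
series = np.arange(100, 124, dtype=float)  # 24 observations

# Hold out the last 20% of observations for testing.
# For time series, the split must preserve time order -- never shuffle.
split = int(len(series) * 0.8)
train, test = series[:split], series[split:]

print(len(train), len(test))  # 19 5
```

A model would then be fit on `train` only, and its error measures computed against `test`.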

To discuss the accuracy of a model, we should ask the complementary question: How far away are the predictions from the true values? In other words, we should try to quantify the total error. There are several measures of error, or measures of fit: metrics used to assess how well a model's predictions align with the observed data. We will discuss only a handful of them here, the ones most useful for time series. In each formula, $x_i$ refers to the $i$th term of the time series, and $\varepsilon_i$ is the error between the actual $i$th term and the predicted $i$th term of the series.

  1. Mean absolute error (MAE): $\frac{1}{n}\sum_{i=1}^{n}|\varepsilon_i|$. A measure of the average magnitude of the errors.
  2. Root mean squared error (RMSE): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^2}$. A measure of the standard deviation of the errors, penalizing larger errors more heavily than MAE does.
  3. Mean absolute percentage error (MAPE): $\frac{1}{n}\sum_{i=1}^{n}\left|\frac{\varepsilon_i}{x_i}\right|$. A measure of the average relative error, that is, the percentage difference between the predicted and actual values (on average).
  4. Symmetric mean absolute percentage error (sMAPE): $\frac{1}{n}\sum_{i=1}^{n}\frac{2|\varepsilon_i|}{|x_i|+|\hat{x}_i|}$. Similar to MAPE, a measure of the average relative error, but scaled so that errors are measured in relation to both the actual values and the predicted values.
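For concreteness, the four measures can be coded directly. This is a sketch using NumPy; the `actual` and `predicted` arrays are invented toy values, not data from the chapter.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error: average magnitude of the residuals."""
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    """Root mean squared error: penalizes large residuals more than MAE."""
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    """Mean absolute percentage error: average relative error."""
    return np.mean(np.abs((actual - predicted) / actual))

def smape(actual, predicted):
    """Symmetric MAPE: error relative to both actual and predicted values."""
    return np.mean(2 * np.abs(actual - predicted)
                   / (np.abs(actual) + np.abs(predicted)))

# Toy values for illustration
actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([98.0, 115.0, 118.0])

print(round(mae(actual, predicted), 2))  # 3.0
```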

Of these error measures, MAE and RMSE are scale-dependent, meaning that the error is in direct proportion to the scale of the data itself. In other words, if all terms of the time series and the model were scaled by a factor of $k$, then the MAE and RMSE would both be multiplied by $k$ as well. On the other hand, MAPE and sMAPE are not scale-dependent. These measures of error are often expressed as percentages (by multiplying the result of the formula by 100%). However, neither MAPE nor sMAPE should be used for data measured on a scale that includes 0 and negative numbers. For example, it would not be wise to use MAPE or sMAPE as a measure of error for a time series model of Celsius temperature readings. For all of these measures of error, lower values indicate less error and hence a more accurate model, which is useful when comparing two or more models.
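The scale-dependence claim is easy to verify numerically. In this sketch (toy values again), scaling both series by an arbitrary factor `k` scales MAE by `k` but leaves MAPE essentially unchanged:

```python
import numpy as np

actual = np.array([100.0, 110.0, 120.0])
predicted = np.array([98.0, 115.0, 118.0])
k = 2.0  # arbitrary scaling factor

mae = lambda a, p: np.mean(np.abs(a - p))
mape = lambda a, p: np.mean(np.abs((a - p) / a))

# Scaling the data by k scales MAE by k...
print(mae(actual, predicted), mae(k * actual, k * predicted))  # 3.0 6.0

# ...but leaves MAPE unchanged (up to floating-point tolerance).
print(np.isclose(mape(actual, predicted), mape(k * actual, k * predicted)))  # True
```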

Example 5.9

Problem

Compute the MAE, RMSE, MAPE, and sMAPE for the EMA smoothing model for the S&P Index time series, shown in Table 5.3.

Solution

The first step is to find the residuals, $\varepsilon_n$, and take their absolute values. For RMSE, we will need the squared residuals, included in Table 5.7. We also include the terms $\left|\frac{\varepsilon_n}{x_n}\right|$ and $\frac{2|\varepsilon_n|}{|x_n|+|\hat{x}_n|}$, which are used for computing MAPE and sMAPE, respectively. Notice that all of the formulas are averages of some kind, but RMSE requires the additional step of taking a square root.

| Year | S&P Index at Year-End | EMA Estimate | Abs. Residual $\lvert\varepsilon_n\rvert = \lvert x_n - \hat{x}_n\rvert$ | $(\varepsilon_n)^2$ | $\lvert\varepsilon_n / x_n\rvert$ | $2\lvert\varepsilon_n\rvert / (\lvert x_n\rvert + \lvert\hat{x}_n\rvert)$ |
|------|----------------------|--------------|--------------------------|---------------------|-----------------|------------------------------|
| 2013 | 1848.36 | 1848.36 | 0 | 0 | 0 | 0 |
| 2014 | 2058.9  | 1848.36 | 210.54 | 44327.09 | 0.102 | 0.108 |
| 2015 | 2043.94 | 2006.27 | 37.67 | 1419.03 | 0.018 | 0.019 |
| 2016 | 2238.83 | 2034.52 | 204.31 | 41742.58 | 0.091 | 0.096 |
| 2017 | 2673.61 | 2187.75 | 485.86 | 236059.94 | 0.182 | 0.2 |
| 2018 | 2506.85 | 2552.15 | 45.3 | 2052.09 | 0.018 | 0.018 |
| 2019 | 3230.78 | 2518.17 | 712.61 | 507813.01 | 0.221 | 0.248 |
| 2020 | 3756.07 | 3052.63 | 703.44 | 494827.83 | 0.187 | 0.207 |
| 2021 | 4766.18 | 3580.21 | 1185.97 | 1406524.84 | 0.249 | 0.284 |
| 2022 | 3839.5  | 4469.69 | 630.19 | 397139.44 | 0.164 | 0.152 |
| 2023 | 4769.83 | 3997.05 | 772.78 | 597188.93 | 0.162 | 0.176 |
| N/A | N/A | Average: | 453.52 | 339008.62 | 0.127 | 0.137 |

Table 5.7 Data Needed to Compute the MAE, RMSE, MAPE, and sMAPE for the EMA Smoothing Model

MAE = 453.52. (Predicted values are, on average, about 450 units away from the true values.)

RMSE = $\sqrt{339008.62} \approx 582.24$. (The errors between predicted and actual values have a standard deviation of about 580 around the ideal error of 0.)

MAPE = 0.127, or 12.7%. (Predicted values are, on average, within about 13% of their true values.)

sMAPE = 0.137, or 13.7%. (When comparing predicted values and true values, their differences are, on average, about 14% of one another.)

These results may be used to compare the EMA estimate to some other estimation/prediction method to find out which one is more accurate. The lower the errors, the more accurate the method.
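As a check, the calculations of Example 5.9 can be reproduced in a few lines of NumPy, using the year-end values and EMA estimates from Table 5.7:

```python
import numpy as np

# S&P Index at year-end, 2013-2023 (Table 5.7)
actual = np.array([1848.36, 2058.90, 2043.94, 2238.83, 2673.61, 2506.85,
                   3230.78, 3756.07, 4766.18, 3839.50, 4769.83])
# EMA estimates from the same table
predicted = np.array([1848.36, 1848.36, 2006.27, 2034.52, 2187.75, 2552.15,
                      2518.17, 3052.63, 3580.21, 4469.69, 3997.05])

resid = actual - predicted
mae = np.mean(np.abs(resid))
rmse = np.sqrt(np.mean(resid ** 2))
mape = np.mean(np.abs(resid / actual))
smape = np.mean(2 * np.abs(resid) / (np.abs(actual) + np.abs(predicted)))

print(round(mae, 2), round(rmse, 2), round(mape, 3), round(smape, 3))
# 453.52 582.24 0.127 0.137
```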

Prediction Intervals

A forecast is often accompanied by a prediction interval giving a range of values the variable could take with some level of probability. For example, an 80% prediction interval contains a range of values that should include the actual future value with probability 0.8. Prediction intervals were introduced in Analysis of Variance (ANOVA) in the context of linear regression, so we won't get into the details of the mathematics here. The key point is that we want a margin of error, $E_n$ (depending on $n$), such that future observations of the data will fall within the interval $\hat{x}_n \pm E_n$ with probability $1 - \alpha$, where $\alpha$ is the chosen significance level (for example, $\alpha = 0.2$ for an 80% prediction interval).
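Before turning to statsmodels, here is a back-of-the-envelope version of the same idea. Assuming forecast errors are roughly normal with standard deviation estimated by the RMSE, an 80% one-step-ahead interval is approximately $\hat{x} \pm z \cdot \mathrm{RMSE}$, where $z$ is the 90th-percentile normal quantile. The point forecast below is a made-up value for illustration; the RMSE is the one from Example 5.9.

```python
from statistics import NormalDist

forecast = 4100.0  # hypothetical point forecast (illustrative, not from the data)
rmse = 582.24      # RMSE of the EMA model from Example 5.9

# 80% interval -> alpha = 0.2 -> z is the 90th-percentile standard normal quantile
z = NormalDist().inv_cdf(0.90)  # about 1.2816

lower = forecast - z * rmse
upper = forecast + z * rmse
print(f"80% prediction interval: ({lower:.2f}, {upper:.2f})")
# 80% prediction interval: (3353.83, 4846.17)
```

This hand-rolled interval ignores parameter uncertainty and the widening of intervals at longer horizons, which is exactly what the statsmodels approach handles for us.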

Here, we will demonstrate how to use Python to obtain prediction intervals.

The Python module statsmodels.tsa.arima.model provides functionality for finding confidence intervals as well. Note that the method get_forecast() is used here rather than forecast(): the former returns a results object that exposes the confidence intervals via conf_int(), while the latter returns only the point forecasts.

Python Code

    ### Please run all code from previous section before running this ###
    
    # An 80% interval corresponds to a significance level of alpha = 0.2
    forecast_steps = 24
    forecast_results = results.get_forecast(steps=forecast_steps)
    
    # Extract forecast values and confidence intervals (alpha=0.2 -> 80%)
    forecast_values = forecast_results.predicted_mean
    confidence_intervals = forecast_results.conf_int(alpha=0.2)
    
    # Plot the results
    plt.figure(figsize=(10, 6))
    
    # Plot original time series
    plt.plot(df['Value'], label='Original Time Series')
    
    # Plot fitted values
    plt.plot(results.fittedvalues, color='red', label='Fitted Values')
    
    # Plot forecasted values with confidence intervals
    plt.plot(forecast_values, color='red', linestyle='dashed', label='Forecasted Values')
    plt.fill_between(
      range(len(df), len(df) + forecast_steps),
      confidence_intervals.iloc[:, 0],
      confidence_intervals.iloc[:, 1],
      color='red', alpha=0.2,
      label='80% Confidence Interval' 
    )
    
    # Set labels and legend
    plt.xlabel('Months')
    plt.title('Monthly Consumption of Coal for Electricity Generation in the United States from 2016 to 2022')
    plt.legend()
    
    # Apply the formatter to the Y-axis
    plt.gca().yaxis.set_major_formatter(FuncFormatter(y_format))
    plt.show()
    

The resulting output will look like this:

Figure description: Time series plot titled "Monthly Consumption of Coal for Electricity Generation in the United States from 2016 to 2022." The y-axis ranges from -50,000 to 125,000 and the x-axis from 0 to 100 (months). The blue line shows the actual coal consumption, which fluctuates seasonally; the solid red line shows the fitted values, which smooth out the seasonal fluctuations; and the dashed red line shows the forecasted values. The shaded area is the 80% confidence interval for the forecasts. Coal consumption shows a general downward trend, from around 125,000 tons per month in 2016 to around 0 tons per month in 2022.

The forecast data (dashed curve) is now surrounded by a shaded region. With 80% probability, all future observations should fit into the shaded region. Of course, the further into the future we try to go, the more uncertain our forecasts will be, which is indicated by the wider and wider confidence interval region.
