Electricity demand forecasting is a crucial task for grid operators. Indeed, production must balance consumption at all times, as storage capacities are still negligible compared to the load. Time series methods have been applied to address that problem, relying on calendar information and lags of the electricity consumption. Statistical and machine learning models have been designed to use exogenous information such as meteorological forecasts (the load usually depends on the temperature, for instance, due to electric heating and cooling).
The field has been thoroughly studied over the past decades and we will not attempt an exhaustive bibliographic study here. Instead, we focus on recent results from the forecasting challenges related to the field. The Global Energy Forecasting Competitions (GEFCOM) ([hong2014global], [HONG2016896] and [hong2019global]) provide a large benchmark of popular and efficient load forecasting methods. Black-box machine learning models such as gradient boosting machines [lloyd2014gefcom2012] and neural networks [ryu2017deep, dimoulkas2019neural] rank among the first, as well as statistical models like Generalized Additive Models (GAM) [nedellec2014gefcom2012, dordonnat2016gefcom2014] or parametric regression models [CHARLTON2014364, ZIEL20191400]. Ensemble methods and expert aggregation are also common practice among competitors [gaillard2016additive, smyl2019machine].
The behaviour of the consumption changed abruptly during the coronavirus crisis, especially during the lockdowns imposed by many governments. These changes in consumption patterns were challenging for electricity grid operators, as historical forecasting procedures performed poorly. Note that purely time series methods such as autoregressives did not drift, as they are very adaptive in essence, but they fail to capture the dependence of the load on exogenous variables such as meteorology. Designing new forecasting strategies that take this evolution into account is therefore important to reduce the cost of forecasting errors and to ensure the stability of the network in the future.
We claim that state-space models offer the best of both worlds. First, machine learning models trained on historical data are used to design new feature representations. Second, a state-space representation yields a methodology to adapt these complex forecasting models over time.
Our work extends a previous study on the French electricity load [obst2021adaptive] where a state-space approach to adapt generalized additive models was presented. The novelty of this article lies both in the method and in the application. First, besides generalized additive models, we apply our procedure to other widely used machine learning models including neural networks. Second, after applying a standard Kalman filter, we apply VIKING (Variational Bayesian Variance Tracking), another state-space approach that estimates the state and the variances jointly [tsw]. Third, our procedure yielded the winning strategy in a competition on post-covid day-ahead electricity demand forecasting (https://ieee-dataport.org/competitions/day-ahead-electricity-demand-forecasting-post-covid-paradigm), demonstrating the efficiency of the proposed approach.
Section II introduces the competition and details how we handled the provided data. In Section III we discuss the meteorological variables and present standard forecasting methods. The core of our strategy is Section IV, where we propose a generic state-space framework to adapt these methods. We discuss the numerical performance of the various models in Section V and combine them through aggregation of experts to leverage each model's advantages.
II Data Presentation and Pre-Processing
The objective of the competition was to predict the electricity load of an undisclosed location of average consumption 1.1 GW, corresponding to a population of the order of one million in western countries. The break in the electricity demand in March 2020 is clear in Figure 1. The goal was to design new day-ahead forecasting strategies robust to this unstable period.
II-A Time segmentation
The competition's setting was to forecast the hourly load 16 to 40 hours ahead in an online manner. Precisely, we had to predict the consumption of each hour of day d with data up to 8AM on day d-1. After our prediction was submitted, a new batch of data up to 8AM on day d was released, and we then had to predict day d+1, and so on.
The evaluation was based on the Mean Absolute Error (MAE) over the period ranging from January 18th to February 16th 2021. To build a forecasting model, the historical load starting from March 18th 2017 was provided, along with meteorological forecasts and realizations over the same period.
II-B Meteorological Forecasts
Aside from calendar variables, the most important exogenous factor explaining the electricity demand is usually meteorology. The dependence of the load on the temperature, for example, is due to electric heating and cooling. Moreover, this dependence is increased by the development of decentralized renewables. Indeed, small renewable production is often consumed by its owner, yielding a net consumption that highly depends on wind or solar radiation. Therefore, the error of an electricity demand forecasting model crucially depends on the quality of the meteorological forecasts.
The competition data includes forecasts and realizations of the temperature, the cloud cover, the pressure, and the wind direction and speed. These forecasts are assumed to be known 48 hours in advance and are not updated afterwards, thus they can be used to forecast the load at the 16-to-40-hour horizon.
However, from the statistical properties of the meteorological forecasting residuals (c.f. Figure 2), we conjecture that the forecasts come from physical models that were not statistically corrected. Indeed, since the forecasts are available 48 hours in advance, if a statistical correction had been applied then the auto-correlations of the residuals beyond 48 hours would be negligible.
We thus use correction models close to autoregressives on the residuals.
Formally, let $w_t$ be any of the meteorological variables, $\hat w_t$ the forecast given in the data set, and $r_t = w_t - \hat w_t$ the forecasting residual. Then we use the model

$\hat w_t^{corr} = \hat w_t + \hat r_t\,,\qquad \hat r_t = \hat\alpha_{h(t)} + \sum_{i=1}^{p} \hat\beta_i\, r_{t-24i} + \sum_{j=0}^{q} \hat\gamma_j\, r_{t_1(t)-j} + \hat\delta\, w_{t-24}\,,$

where $h(t)$ is the hour of the day of time $t$ and $t_1(t)$ is 7AM of the previous day. In other words, we forecast the residual of the variable of interest with a linear model on

the last available daily lags of the residual,

the last available lags of the residual (up to 7AM of the previous day),

the last daily lag of the variable of interest.

We optimize the coefficients separately for each hour of the day for the temperature, whereas we use the same coefficients at every hour of the day for the cloud cover, pressure and wind speed (except for the intercept term). We don't correct the wind direction. The numbers of lags $p$ and $q$ are selected based on BIC. We display in Table I the error of the initial forecast, compared to simply using the last daily lag of the variable of interest, and of our corrected forecast.
                    Initial   Last daily lag   Corrected
Cloud cover (%)      17.28        18.74          14.99
Wind Speed (km/h)     4.53         3.49           2.53
The first column is the forecast given in the data set. The second one consists of using the variable of interest with a 24 or 48-hour delay. The last one is our corrected forecast. We evaluate the mean absolute error during 2020, while the corrections are trained on data prior to 2020.
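As an illustration, the per-hour residual correction can be sketched as follows. The function names and the fixed number of daily lags are ours, chosen for the example; in practice the numbers of lags are selected by BIC, and the competition code also includes the hourly lags up to 7AM of the previous day.

```python
import numpy as np

def fit_residual_correction(resid, var, p=2):
    """Fit a linear model predicting the residual r_d from its p last daily
    lags and the daily lag of the variable itself (intercept included).
    `resid` and `var` are 1-D series for a single hour of the day,
    i.e. one value per day."""
    n = len(resid)
    # build the design matrix from daily lags: r_{d-1}, ..., r_{d-p}, w_{d-1}
    X = np.column_stack(
        [resid[p - i - 1 : n - i - 1] for i in range(p)] + [var[p - 1 : n - 1]]
    )
    X = np.column_stack([np.ones(len(X)), X])  # intercept
    y = resid[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def apply_correction(coef, resid, var, forecast, p=2):
    """Corrected forecast for the next day given the last p daily residuals."""
    x = np.concatenate([[1.0], resid[-p:][::-1], [var[-1]]])
    return forecast + x @ coef
```

Fitting one such model per hour of the day (for the temperature) reproduces the per-hour intercepts and slopes described above.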
III Time-invariant Experts
We summarize the explanatory variables used in our forecasting models:
calendar variables: the hour of the day, the day of the week, the time of year, growing linearly from 0 on January 1st to 1 on December 31st, and a variable growing linearly with time to account for a trend,

meteorological forecasts after statistical correction: the temperature along with two exponentially smoothed variants of parameters 0.95 and 0.99, the cloud cover, the pressure, the wind direction and speed,

lags of the electricity load: the load a week ago and the last available load (one day ago for forecasts before 8AM and two days ago after 8AM, this constraint coming from the availability of the online data during the competition).
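The exponentially smoothed temperature covariates can be computed by simple exponential smoothing; the function name below is ours:

```python
def exp_smooth(series, alpha):
    """Exponential smoothing: s_t = alpha * s_{t-1} + (1 - alpha) * x_t,
    initialized at the first observation.  alpha close to 1 yields a
    heavily smoothed series (we use alpha = 0.95 and 0.99)."""
    smoothed = []
    s = series[0]
    for x in series:
        s = alpha * s + (1 - alpha) * x
        smoothed.append(s)
    return smoothed
```

The two smoothing parameters capture the thermal inertia of buildings at different time scales.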
III-A Statistical and Machine Learning Methods
Based on these covariates we experiment with a few classical predictive models. We define independent models for the different hours of the day, as is usual in electricity load forecasting. For each model we use the same structure across the different hours, but we learn the model parameters independently for each time of day based on the training data of that particular time of day. In what follows we denote by $y_t$ the load at time $t$.
Autoregressive (AR). We consider a seasonal autoregressive model based on the daily and weekly lags of the load:

$\hat y_t = \hat a\, y_{t-D(t)} + \hat b\, y_{t-168}\,, \qquad (1)$

where $y_{t-D(t)}$ is the last available daily lag of the load ($D(t) = 24$ or $48$ depending on the forecast time, c.f. Section III).
Linear regression (Lin). We use a linear model with the following variables: temperature, cloud cover, pressure, wind direction and speed, day type (7 booleans), time of year, linear trend variable, and the two load lags described above (the last available load and the load a week ago).
Generalized additive model (GAM). We propose a Gaussian generalized additive model [wood2017generalized]:

$y_t = \sum_{j} f_j(x_{t,j}) + \varepsilon_t\,,$

where each effect $f_j$ is obtained by penalized regression, on cubic cyclic splines for the periodic covariates and on cubic regression splines for the others.
Random forest correction of the GAM (RF_GAM). We also correct the GAM using a random forest on the GAM residuals, with the same covariates as in the RF, to which we add the GAM effects as well as lags (one week, one or two days) of the GAM residuals.
III-B Intraday correction
Although using different models at the different hours of the day performs better, it should be noted that the correlation between hours is important. To capture intraday information, we fit on the residuals of each model an autoregressive model incorporating lags over the 24 last available hours, optimized for each forecast horizon. This follows from the intuition that to predict the load at 8AM, instead of using data with a 48-hour delay as the last available input, we can use a 25-hour delay.
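A minimal version of this intraday correction for a single forecast horizon might look as follows; the delay and the number of lags are illustrative, and the function returns in-sample corrections for simplicity:

```python
import numpy as np

def fit_intraday_model(resid, delay=25, n_lags=24):
    """Linear model predicting the residual at time t from the n_lags hourly
    residuals ending `delay` hours earlier (the last available data for
    this horizon).  Returns coefficients and in-sample residual forecasts."""
    n = len(resid)
    start = delay + n_lags - 1
    # column i holds resid[t - delay - i] for targets t = start, ..., n-1
    X = np.column_stack(
        [resid[start - delay - i : n - delay - i] for i in range(n_lags)]
    )
    X = np.column_stack([np.ones(len(X)), X])  # intercept
    y = resid[start:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, X @ coef
```

The fitted residual forecast is then added to the base model's prediction, one such model being estimated per forecast horizon.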
IV Adaptation using State-Space Models
To adapt the models through time, we rely on linear Gaussian state-space models, summarized as

$\theta_t = \theta_{t-1} + \eta_t\,,\quad \eta_t \sim \mathcal N(0, Q_t)\,,\qquad y_t = \theta_t^\top x_t + \varepsilon_t\,,\quad \varepsilon_t \sim \mathcal N(0, \sigma^2)\,,$

where $\theta_t$ is the latent state, $Q_t$ the process noise covariance matrix and $\sigma^2$ is the observation variance.
IV-A Definition of the Feature Vector
This state-space representation is natural for linear regression, for which the feature vector $x_t$ contains the explanatory variables detailed in Section III-A. Autoregressive models also fit directly in this framework, as they are in fact linear models on lags of the load, see Equation (1). To adapt the GAM and the MLP we linearize the models, and $x_t$ is just another feature representation. We freeze the non-linear effects of the GAM as in [obst2021adaptive], and $x_t$ contains the different effects, linear and non-linear. We apply a similar approach to the MLP, for which we freeze the deepest layers and learn only the last one, that is, $x_t$ is the final hidden state, see Figure 5.
The state-space approach is not applied to the random forest. For the latter, we compare with an incremental offline random forest, consisting in re-training the random forest each day with all the data available at that time.
IV-B Kalman Filter
Bayesian estimation of the state in linear Gaussian state-space models is well understood when the variances are known. The best estimator is given by the well-known Kalman filter [kalman1961new]. It yields exact recursive estimation of the mean and covariance matrix of the state given the past observations, denoted respectively by $\hat\theta_t$ and $P_t$. However, there is no consensus in the literature as to how to tune the hyper-parameters, see for instance [brockwell1991time, durbin2012time, fahrmeir1992posterior]. The widely used Expectation-Maximization algorithm is an iterative procedure that guarantees convergence to a local maximum of the likelihood. However, there is no global guarantee, and in our case it performs poorly. We propose instead the following settings, building on [obst2021adaptive]:
Static. We consider the degenerate setting where $Q_t = 0$, that is, the state is assumed constant.
Static break. We consider a break at March 1st 2020 by setting $Q_t = 0$ except at $t_0$, March 1st 2020, where $Q_{t_0}$ is nonzero, allowing a jump of the state.
Dynamic. We approximate the maximum likelihood for constant variances $Q_t = Q$. We fix the initial state and its covariance, and we observe that for a given $Q$ we have a closed-form solution for $\sigma$. Then we restrict ourselves to diagonal matrices $Q$ whose nonzero coefficients lie on a finite grid, and we apply a greedy procedure: starting from $Q = 0$, we change at each step the coefficient improving the likelihood the most. That procedure is designed to optimize the likelihood on the training data (up to January 1st 2020).
Dynamic break. We use the same $Q_t$ as in the dynamic setting, except at $t_0$, March 1st 2020, where the process noise is much larger.
Dynamic big. We simply use a larger process noise: a matrix $Q$ proportional to the identity, whose scale is defined based on the 2020 data.
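For reference, one predict/update step of the Kalman filter underlying all these settings can be sketched as follows (a minimal implementation for the random-walk state model; names are ours):

```python
import numpy as np

def kalman_step(theta, P, x, y, Q, sigma2):
    """One predict/update step of the Kalman filter for the model
    theta_t = theta_{t-1} + noise(Q),  y_t = theta_t . x_t + noise(sigma2)."""
    # prediction step: random-walk state
    P = P + Q
    # update step
    err = y - x @ theta          # innovation
    s = x @ P @ x + sigma2       # innovation variance
    K = P @ x / s                # Kalman gain
    theta = theta + K * err
    P = P - np.outer(K, x @ P)
    return theta, P
```

With $Q = 0$ this reduces to recursive least squares (the static setting), while a nonzero $Q$ lets the coefficients drift.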
Note that we estimate a Gaussian posterior distribution on the state, therefore we have a probabilistic forecast for the load. Precisely, our estimate is $\theta_t \mid \mathcal F_{t-1} \sim \mathcal N(\hat\theta_t, P_t)$, where $\mathcal F_{t-1}$ denotes the past observations, thus we have $y_t \mid \mathcal F_{t-1} \sim \mathcal N(\hat\theta_t^\top x_t,\, x_t^\top P_t x_t + \sigma^2)$. The likelihood that is optimized to obtain the dynamic setting is built on that probabilistic forecast of $y_t$ given the past observations. In the competition, we added quantiles of these Gaussian distributions as forecasters in an expert aggregation.
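The greedy search of the dynamic setting can be sketched as follows; the grid and the predictive log-likelihood helper are illustrative choices, not the exact competition code:

```python
import numpy as np

def kalman_loglik(X, y, q_diag, sigma2=1.0):
    """Gaussian predictive log-likelihood of the data under a random-walk
    state with process covariance diag(q_diag)."""
    d = X.shape[1]
    theta, P = np.zeros(d), np.eye(d)
    Q = np.diag(q_diag)
    ll = 0.0
    for x, obs in zip(X, y):
        P = P + Q
        s = x @ P @ x + sigma2               # innovation variance
        err = obs - x @ theta
        ll += -0.5 * (np.log(2 * np.pi * s) + err ** 2 / s)
        K = P @ x / s
        theta = theta + K * err
        P = P - np.outer(K, x @ P)
    return ll

def greedy_q_search(X, y, grid=(0.0, 1e-4, 1e-3, 1e-2), n_steps=10):
    """Start from Q = 0 and greedily change one diagonal coefficient at a
    time, keeping the move improving the likelihood the most."""
    d = X.shape[1]
    q = np.zeros(d)
    best = kalman_loglik(X, y, q)
    for _ in range(n_steps):
        move, move_ll = None, best
        for j in range(d):
            for v in grid:
                if v == q[j]:
                    continue
                cand = q.copy()
                cand[j] = v
                ll = kalman_loglik(X, y, cand)
                if ll > move_ll:
                    move, move_ll = cand, ll
        if move is None:  # no improving move: stop
            break
        q, best = move, move_ll
    return q
```

The coordinate-wise search keeps the number of likelihood evaluations linear in the state dimension at each step.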
IV-C Dynamical Variances
The idea behind the break settings introduced in the previous paragraph is that we would like the model to adapt faster during an evolving period, such as a lockdown, than before. However, this amounts to modelling a break in the data, a sudden change of state resulting from a noise of much larger variance at a specific time specified a priori. A way to extend the approach would be to define a time-varying covariance matrix depending, for instance, on a lockdown stringency index such as the one defined by [hale2021global]. However, the competition policy forbade the use of external data and the location was undisclosed.
In a longer-term perspective, it is to be hoped that the evolution of the electricity load won't be driven by lockdowns. It is therefore more generic to learn the variances of the state-space model in an adaptive fashion. We thus apply a novel approach for time series forecasting introduced in [tsw], named Variational Bayesian Variance Tracking, alias VIKING, and we briefly recall how it works. This method was designed in parallel with the competition and was improved afterwards; we present only the latest version.
We treat the variances as latent variables and we augment the state-space model with random-walk dynamics on the log-variances $a_t$ and $b_t$:

$\sigma_t^2 = \exp(a_t)\,,\quad a_t = a_{t-1} + \nu_t\,,\qquad Q_t = \exp(b_t)\, I\,,\quad b_t = b_{t-1} + \zeta_t\,.$

Instead of estimating the state with variances fixed a priori, we estimate both the state and the variances represented by $(a_t, b_t)$. Although we have removed $(Q_t, \sigma_t^2)$ as hyper-parameters, we now have to set priors on $(\theta_1, a_1, b_1)$ along with the parameters controlling the smoothness of the dynamics on the variances.
We apply a Bayesian approach. At each step, we start from the prior obtained at the previous iteration, conditional on the filtration $\mathcal F_{t-1}$ generated by the past observations. Then the dynamical equations yield a prediction step, giving the distribution of the latent variables at time $t$ given $\mathcal F_{t-1}$. Finally, Bayes' rule yields the posterior distribution given $\mathcal F_t$.
However, the posterior distribution is analytically intractable, therefore the principle of VIKING is to apply the classical Variational Bayesian approach [vsmidl2006variational]: the posterior distribution is recursively approximated by a factorized distribution. In our setting we look for the best product of three distributions on $\theta_t$, $a_t$ and $b_t$ approximating the posterior. The criterion minimized is the Kullback-Leibler (KL) divergence between the factorized distribution and the posterior. At each step this yields a coupled optimization problem in the three Gaussian distributions. The classical iterative method (see for instance [tzikas2008variational]) consists in computing alternately the expected log-joint density, where the expected value is taken with respect to two of the three latent variables, and identifying the desired first two moments with respect to the other latent variable. However, the resulting expression doesn't match a Gaussian distribution in $a_t$, and similarly for $b_t$. We therefore use the first two moments of the Gaussian distribution to derive an upper-bound of the KL divergence for which we have an analytical solution. We refer to [tsw] for the detailed derivation of the algorithm.
V Numerical Results
We display the performance of the introduced methods, which we call experts. Then we use aggregation of experts to leverage the specificities of each forecaster. The end of the section is devoted to a discussion of our day-to-day strategy during the competition. Finally, we refer to the implementation for more details (https://gitlab.com/JosephdeVilmarest/state-space-post-covid-forecasting).
V-A Individual Experts
We first display in Table II the results of the statistical and machine learning methods of Section III-A, with and without the intraday correction of Section III-B. To present the improvement brought by the intraday correction, we give the performance during a stable period (after the training of the model, but before the covid crisis). We observe that the only model for which the intraday correction doesn't improve the performance (RF_GAM) is the one already including a residual correction. The improvement during the evaluation period (2021) is much bigger (a 57% decrease of the MAE for the MLP, for instance), which is natural as the intraday correction is an autoregressive, that is, an adaptive model.
Models are trained up to Jan. 1st 2020 and tested during the next two months before the break of March.
Then we focus on adaptive models to show the improvements due to each setting, see Table III.
The performances are displayed for each model after intraday correction. As a comparison, re-training the random forest every day yields an online RF of MAE 15.0 MW, and an online RF_GAM of MAE 18.1 MW.
We have 4 different models (autoregressive, linear, GAM and MLP). For each one, we try the various adaptation settings (no adaptation, Kalman filters and VIKING). Kalman filters with a constant covariance matrix proportional to the identity obtain the best results. That was not the case on the data prior to the competition: the best setting depends on the intrinsic evolution of the data.
We illustrate the different settings in Figure 6 where we display the evolution of the state coefficient for the GAM adaptation strategies.
V-B Aggregation of Experts
Online robust aggregation of experts [Cesa-Bianchi:2006] is a powerful model-agnostic approach for time series forecasting, already applied to load forecasting during the lockdown in [obst2021adaptive]. We use the ML-Poly algorithm proposed in [gaillard2014second] and implemented in the R package opera [gaillard2016opera] to compute the online weights.
The aggregation weights are estimated independently for each hour of the day. We summarize different variants in Table IV.
First, for each family of models we compute the aggregation of all the adaptation settings (7 for each). Then we aggregate all of them (28 models). An example of the weights obtained at 3PM is displayed in Figure 7.
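Since opera is an R package, the following Python sketch of the ML-Poly update with the absolute loss may clarify the mechanism (the array layout is an assumption of the example):

```python
import numpy as np

def ml_poly(expert_preds, y):
    """ML-Poly aggregation (Gaillard et al., 2014) with the absolute loss.
    expert_preds: array (T, K) of expert forecasts; y: array (T,) of targets.
    Returns the aggregated forecasts, shape (T,)."""
    T, K = expert_preds.shape
    R = np.zeros(K)   # cumulated regrets
    sq = np.zeros(K)  # cumulated squared instantaneous regrets
    out = np.empty(T)
    for t in range(T):
        eta = 1.0 / (1.0 + sq)                 # per-expert learning rates
        w = eta * np.maximum(R, 0.0)           # polynomial potential weights
        w = w / w.sum() if w.sum() > 0 else np.full(K, 1.0 / K)
        out[t] = w @ expert_preds[t]
        # instantaneous regrets: loss of the mixture minus loss of each expert
        inst = np.abs(out[t] - y[t]) - np.abs(expert_preds[t] - y[t])
        R += inst
        sq += inst ** 2
    return out
```

As in the paper, one such aggregation is run independently for each hour of the day.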
The aggregation presented in this paper achieves a performance close to that of our winning strategy in the competition (a degradation of about 0.05 MW).
V-C Day-to-day Forecasts
During the competition, our predictions were not exactly those of the aggregation method presented in the previous subsection, for mainly two reasons.
First, we considered a bigger set of forecasting methods (we had 72 experts). It seemed reasonable to prune the strategy for the sake of clarity of the paper at the cost of a very small change in error, but it is interesting to also present the predictions used during the evaluation. We found a trade-off in the selection of experts: indeed, too many experts in the aggregation yields poor performance. We applied a greedy procedure to select the experts kept in the aggregation: we begin with an empty set, and at each step we add the expert improving the performance the most. That performance was evaluated with the MAE on the last month of the training data set. We provide in Figure 8 a graphical representation of how we defined the different time periods, and we refer to Figure 9 for the evolution of the validation MAE as the selection grows. We observe a sharp decrease of the error as experts of high diversity are added, then a slow increase of the loss as the set of experts becomes too large.
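The greedy forward selection can be sketched as follows; for simplicity the selected experts are combined by a uniform average here, whereas the competition strategy evaluated the ML-Poly aggregation:

```python
import numpy as np

def mae(pred, y):
    return np.mean(np.abs(pred - y))

def greedy_selection(experts, y_val):
    """Forward selection of experts: at each step, add the expert that most
    reduces the validation MAE of the uniform average of selected experts.
    `experts` maps names to forecast arrays.  Returns the selection
    achieving the lowest MAE along the path, and the MAE path itself."""
    remaining = dict(experts)
    selected, path = [], []
    current_sum = np.zeros_like(y_val, dtype=float)
    while remaining:
        # pick the expert minimizing the MAE of the new average
        name = min(
            remaining,
            key=lambda n: mae((current_sum + remaining[n]) / (len(selected) + 1), y_val),
        )
        current_sum += remaining.pop(name)
        selected.append(name)
        path.append(mae(current_sum / len(selected), y_val))
    k = int(np.argmin(path)) + 1  # selection size with the lowest validation MAE
    return selected[:k], path
```

The MAE path typically reproduces the behaviour described above: a sharp initial decrease, then a slow degradation as weak experts are added.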
Second, we were constantly experimenting with different strategies. We used a variational Bayesian method that was a prior version of the one presented in this paper, and we changed the aggregation procedure several times.
We refer to the appendix for a detailed presentation of our daily strategy. Overall, these day-by-day changes degraded the performance: had we kept the first strategy with no change at all, our MAE would have been 10.51 MW instead of 10.84 MW. The critical issue in such unstable periods is to find the right validation period to select the prediction procedure. The month before the evaluation period seems, a posteriori, a good compromise. During the competition we changed the procedure "manually" based on performances over shorter ranges, for instance considering an expert performing well over the last few weeks for a specific day type. We should have trusted the aggregation's robustness.
VI Conclusion
In this paper we presented the procedure that won a competition on electricity load forecasting during an unstable period. Our approach relies heavily on state-space models, and the competition was the first data set on which a recent approach to adapt the variances of a state-space model was applied. Some perspectives were raised during the competition, such as the interpretability of the global approach and a better understanding of the error propagation along the different adaptation steps (intraday correction, Kalman filtering, variance tracking and aggregation).
Finally, similar state-space methods were applied to obtain the first place in another competition, in which the objective was to forecast the electricity consumption of a building (http://www.gecad.isep.ipp.pt/smartgridcompetitions/).
VI-A Description of the Experts
The experts AR, Lin, GAM, RF, RF_GAM, MLP are the ones presented in that same order in Section III-A.
Names of the form model_setting refer to the expert obtained by state-space adaptation of the model model with the setting setting. For instance, Lin_dynamic refers to a linear model adapted with the Kalman filter in the dynamic setting, c.f. Section IV-B.
We consider quantile variants of the RF, denoted by RFq where q is the quantile order in percent (e.g. RF40 is the quantile random forest of quantile value 0.4). We also consider quantile variants of the dynamic MLP denoted by similar names (MLP_dynamic60 is the quantile 0.6 of the MLP in the dynamic setting).
Furthermore, motivated by [obst2021adaptive], we introduce an expert named GAM_SAT forecasting each day with the GAM as if it were a Saturday.
Finally, each expert x yields another expert x_corr after intraday correction.
VI-B Day-to-day Evolution of the Forecasting Strategy
As explained in Section V-C, our strategy evolved in time and we recall here every change.
From January 18th to January 24th: we used the following set of experts obtained by the greedy selection described in Section V-C: RF, RF_corr, RF50_corr, RF60_corr, Lin_dynamic_corr, Lin_viking_corr, GAM, GAM_corr, GAM_staticbreak_corr, GAM_dynamic_corr, GAM_viking_corr, GAM_dynamicbig_corr, RF_GAM, RF_GAM_corr, GAM_SAT_corr, MLP_dynamic60, MLP_dynamic90, MLP_dynamic99. We aggregated with ML-Poly, with weights estimated independently for each hour, using the absolute loss. We found afterwards a bug in RF50_corr and RF60_corr: the quantile RF predictions were set to 0 on the test set, so that these two experts were simply intraday autoregressives trained in an unintended manner.
From January 25th to January 31st: we removed the experts RF50_corr, RF60_corr and we replaced them with AR_corr.
February 1st and 2nd: we used the uniform average of three forecasts. First, the previous aggregation. Second, another aggregation procedure called RF-stacking, consisting of a quantile random forest minimizing the MAE and taking as input the 72 experts as well as the day type and the hour of the day. Third, a benchmark close to the one given by the competition organizers: we predict each time with the last available load of the same hour and the same day group (week days, Saturdays and Sundays).
From February 3rd to 7th: we removed the benchmark, which damaged the performance. For Feb. 5th we corrected the ML-Poly prediction using a special day correction, after observing that Feb. 5th had had a special behaviour in the last three years. Precisely, the relative error of the model was significantly negative over the last three years, a behaviour that may come from a bank holiday, for instance. Therefore we fitted a smoothed function of the time of day on the relative error and applied it to our forecast. We truncated the correction so that there is none during the night. See the shape of the correction in Figure 10.
February 8th: we used the single expert Lin_dynamicbig_corr, as we observed that it was by far our best expert over the last week, and it seemed to perform especially well on Mondays.
February 9th: we came back to the average between ML-Poly and the RF-stacking, but we added the expert Lin_dynamicbig_corr to the ML-Poly aggregation, and we replaced the expert AR_corr with another expert AR_intra incorporating the intraday correction directly in the autoregressive, instead of correcting an autoregressive based only on daily lags.
February 10th and 11th: we removed the RF-stacking, which had degraded our performance since its introduction, and kept only the ML-Poly aggregation.
February 12th and 13th: we corrected a posteriori the electricity load of February 5th with the special day correction. It was important to do so on that day, as the weekly lag is an important covariate in the models.
February 14th: we used once again the average between the ML-Poly aggregation and the RF-stacking, as we observed that the RF-stacking was especially good on Sundays.
February 15th and 16th: we used only the ML-poly aggregation.