State-Space Models Win the IEEE DataPort Competition on Post-covid Day-ahead Electricity Load Forecasting

10/01/2021 ∙ by Joseph De Vilmarest, et al. ∙ EDF Sorbonne Université 0

We present the winning strategy of an electricity demand forecasting competition. This competition was organized to design new forecasting methods for unstable periods such as the one starting in Spring 2020. We rely on state-space models to adapt standard statistical and machine learning models. We claim that it achieves the right compromise between two extremes. On the one hand, purely time-series models such as autoregressives are adaptive in essence but fail to capture dependence to exogenous variables. On the other hand, machine learning methods allow to learn complex dependence to explanatory variables on a historical data set but fail to forecast non-stationary data accurately. The evaluation period of the competition was the occasion of trial and error and we put the focus on the final forecasting procedure. In particular, it was at the same time that a recent algorithm was designed to adapt the variances of a state-space model and we present the results of the final version only. We discuss day-today predictions nonetheless.



There are no comments yet.


page 1

page 7

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Electricity demand forecasting is a crucial task for grid operators. Indeed the production must balance the consumption as storage capacities are still negligible compared to the load. Time series methods have been applied to address that problem, relying on calendar information and lags of the electricity consumption. Statistical and machine learning models have been designed to use exogenous information such as meteorological forecasts (the load usually depends on the temperature for instance, due to electric heating and cooling).

The field has been thoroughly studied the past decades and we will not propose here an exhaustive bibliographic study. Instead, we choose to focus on recent results in the different forecasting challenges related to this the field. The Global Energy Forecasting Competitions (GEFCOM) ([hong2014global], [HONG2016896] and [hong2019global]

) provide a large benchmark of popular and efficient load forecasting methods. Black box machine learning models such as gradient boosting machines


and neural networks

[ryu2017deep, dimoulkas2019neural] rank among the first as well as statistical models like Generalized Additive Models (GAM) [nedellec2014gefcom2012, dordonnat2016gefcom2014] or parametric regression models [CHARLTON2014364, ZIEL20191400]. Ensemble methods or expert aggregation are also a common practice for competitors [gaillard2016additive, smyl2019machine].

The behaviour of the consumption changed abruptly during the coronavirus crisis, especially during lockdowns imposed by many governments. These changes of consumption mode have been challenging for electricity grid operators as historical forecasting procedures performed poorly. It is to be noted that purely time series methods like autoregressives didn’t drift as they are very adaptive in essence, but they fail to capture the dependence of the load to, for instance, meteorological variables. Therefore designing new forecasting strategies to take that evolution into account is important to reduce the cost of forecasting errors and to ensure the stability of the network in the future.

We claim that state-space models allow the best of both worlds. First, machine learning models trained on historical data are used to design new feature representations. Second, a state-space representation yields a methodology to adapt these complex forecasting models.

Our work extends a previous study on the French electricity load [obst2021adaptive]

where a state-space approach to adapt generalized additive models was presented. The novelty of this article lies both in the method and in the application. First, besides generalized additive models we apply our procedure to other widely used machine learning models including neural networks. Second, after applying a standard Kalman filter we apply VIKING (Variational Bayesian Variance Tracking), another state-space approach allowing to estimate jointly the state and the variances

[tsw]. Third, our procedure resulted in the winning strategy in a competition on post-covid day-ahead electricity demand forecasting111, motivating the efficiency of the proposed approach.

Section II is an introduction to the competition and we detail how we handled the data provided. In Section III we discuss the meteorological variables and we present standard forecasting methods. The core of our strategy is Section IV where we propose a generic state-space framework to adapt these methods. We discuss the numerical performances of the various models in Section V and we combine them through aggregation of experts to leverage each model’s advantages.

Ii Data Presentation and Pre-Processing

The objective of the competition was to predict the electricity load of an undisclosed location of average consumption 1.1 GW, that is of the order of one million people in western countries. The break in the electricity demand in March 2020 is clear in Figure 1. The objective of the competition was to design new strategies for day-ahead forecasting in order to be robust to this unstable period.

Fig. 1: On top: electricity load from March 18th 2017 to February 16th 2021. On the bottom: daily profiles of the electricity load in March-April 2019 compared to March-April 2020.

Ii-a Time segmentation

The competition’s setting was to forecast the hourly load 16 to 40 hours ahead in an online manner. Precisely we had to predict the consumption of each hour of day d with data up to 8AM day d-1. After our prediction was sent a new batch of data up to 8AM day d was released so that we had to predict day d+1 …

The evaluation was based on the Mean Average Error on the period ranging from January 18th to February 16th 2021. To build a forecasting model, the historical load starting from March 18th 2017 was provided, as well as meteorological forecasts and realizations during the same period.

Ii-B Meteorological Forecasts

Aside from calendar variables, is is usual that the most important exogenous factor explaining the electricity demand is meteorology. The dependence of the load to the temperature for example is due to electric heating and cooling. Moreover, the dependence of the electricity demand to meteorology is augmented by the development of decentralized renewables. Indeed, small renewable production is often used by its owner, yielding a net consumption that highly depends on wind or solar radiation. Therefore the error of a forecasting model for the electricity demand crucially depends on the performance of the meteorological forecasts.

The data of the competition includes forecasts and realization of the temperature, the cloud cover, the pressure, the wind direction and speed. These forecasts are assumed to be known 48 hours in advance and invariant after, thus they can be used to forecast the load at the 16 to 40 hours horizon.

However from the statistical properties of the meteorological forecasting residuals (c.f. Figure 2), we conjecture that the forecasts come from physical models that need to be statistically corrected. Indeed, as the forecasts are available 48 hours in advance, if a statistical correction had been applied then auto-correlations of the residuals over 48 hours would be negligible.

Fig. 2: On top: auto-correlation plots of the temperature (left) and cloud cover (right) forecasting residuals, with lags in hours. On the bottom: auto-correlation plots of the temperature residuals focused on a specific hour of the day (midnight on the left, 3AM on the right), with lags in days.

We thus use correction models close to autoregressives on the residuals.

Formally, let be any of the meteorological variable and the forecast given in the data set. Then we use the model

where is the hour of the day of time and

In other words, we forecast the residual of the variable of interest with a linear model on

  • the last available daily lags of the residual,

  • the last available lags of the residual (up to 7AM of the previous day),

  • the forecast,

  • the last daily lag of the variable of interest.

We optimize the coefficients separately for each hour of the day for the temperature, whereas we use the same coefficients at each hour of the day for the cloud cover, pressure and wind speed (except the intercept term). We don’t correct the wind direction. The parameters and are selected based on BIC. We display in Table I the error of the initial forecast, compared to simply using the last daily lag of the variable of interest, and our corrected forecast.

Initial Last daily lag Corrected
Temperature (C) 3.00 2.11 1.69
Cloud cover (%) 17.28 18.74 14.99
Pressure (kPa) 0.506 1.30 0.423
Wind Speed (km/h) 4.53 3.49 2.53

The first column is the forecast given in the data set. The second one consists in using the variable of interest with a 24 or 48-hour delay. The last is our corrected forecast. We evaluate through the mean average error during 2020 while we train the corrections on the data before 2020.

TABLE I: Mean Average Error of different meteorological forecasts.

Iii Time-invariant Experts

We summarize the explanatory variables used in our forecasting models:

  • calendar variables: the hour of the day, the day of the week, the time of year () growing linearly from 0 on January 1st to 1 on December 31st, and a variable growing linearly with time to account for a trend,

  • meteorological forecasts after statistical correction: the temperature along with exponential smoothing variants of parameters 0.95 and 0.99 (respectively and ), the cloud cover, the pressure, the wind direction and speed,

  • lags of the electricity load: the load a week ago and the last load available (a day ago for the forecast before 8AM and two days ago after 8AM, this constraint coming from the availability of the online data during the competition).

The dependence to the hour of the day and the day of the week is well observed in Figure 1. We display in Figure 3 the dependence of the load to a few of the aforementioned covariates.

Fig. 3: Dependence of the load at 3PM to different covariates on the data up to January 1st 2020.

Iii-a Statistical and Machine Learning Methods

Based on the covariates we experiment a few classical predictive models. We define independent models for the different hours of the day as is usual in electricity load forecasting. For each model we use the same structure for the different hours but we learn the model parameters independently for each time of day based on the training data of that particular time of day. In what follows we denote by the load at time .

  • Autoregressive

    . We consider a seasonal autoregressive model based on the daily and weekly lags of the load:

  • Linear regression. We use a linear model with the following variables: temperature, cloud cover, pressure, wind direction and speed, day type (7 booleans), time of year, linear trend variable, and the two lags and .

  • Generalized additive model (GAM). We propose a Gaussian generalized additive model [wood2017generalized]:

    where is obtained by penalized regression on cubic cyclic splines and on cubic regression splines.

  • Random Forest (RF)

    . We build a random forest


    with the following covariates: linear trend variable, time of year, day type, the two lags and the two exponential smoothing variables of the temperature. Quantile variants were also computed.

  • Random Forest (RF_GAM). We also correct the GAM using a random forest on the GAM residuals, with the same covariates as in RF to which we add the GAM effects , , as well as lags (one week, one or two days) of the GAM residuals.

  • Multi-Layer Perceptron (MLP)

    . Finally we test a multi-layer perceptron of 2 hidden layers of 15 and 10 neurons using hyperbolic tangent activation. We take as input: linear trend variable, time of year, day type, the exponential smoothing variable

    and the two lags.

Iii-B Intraday correction

Although it performs better to use different models at the different hours of the day, let it be noted that the correlation between different hours is important. To capture intraday information we fit on the residuals of each model an autoregressive model incorporating lags of the 24 last available hours and optimized for each forecast horizon. This follows from the intuition that to predict the load at 8AM, instead of using as the last available data a delay of 48 hours, we can use a 25-hour delay.

We apply this correction to the models presented in Section III-A as well as to the ones resulting from the adaptation framework of Section IV, see the numerical improvements in Section V-A.

Iv Adaptation using State-Space Models

Due to the lockdowns the consumer’s behaviours changed abruptly and therefore the models presented in Section III-A perform poorly during Spring 2020 and afterwards, see Figure 4.

Fig. 4: Evolution of the forecasting error for the different models introduced in Section III-A trained on the data up to January 1st 2020 and with intraday correction (Section III-B).

To adapt the models in time, we rely on linear gaussian state-space models, summarized as

where is the latent state, the process noise covariance matrix and is the observation variance.

Iv-a Definition of

This state-space representation is natural for linear regression for which

is the vector containing the explanatory variables detailed in Section

III-A. Autoregressive models also fit directly in that framework, as they are in fact linear models on lags of the load, see Equation (1). To adapt GAM and MLP we linearize the models and is just another feature representation. We freeze the non-linear effects in the GAM as in [obst2021adaptive], and contains the different effects, linear and non-linear. We apply a similar approach for the MLP, for which we freeze the deepest layers and we learn the last one, that is is the final hidden state, see Figure 5.

Fig. 5: Diagram of the definition of the features to adapt the MLP. The network has two hidden layers of 15 and 10 neurons, we freeze all the weights except the last ones.

The state-space approach is not applied to the random forest. For the latter we compare with incremental offline random forests, consisting in re-training the random forest each day with all the data available at the time.

Iv-B Kalman Filter

Bayesian estimation of the state in linear gaussian state-space models is well understood under known variances . The best estimator is obtained by the well known Kalman Filter [kalman1961new]. It yields recursive exact estimation of the mean and covariance matrix of the state given the past observations, denoted respectively by and . However there is no consensus in the literature as to how to tune the hyper-parameters, see for instance [brockwell1991time, durbin2012time, fahrmeir1992posterior]

. The widely used Expectation-Maximization algorithm is an iterative algorithm that guarantees convergence to a local maximum of the likelihood. However there is no global guarantee and in our case it performs poorly. We propose instead the following settings, building on


  • Static. We consider the degenerate setting where and .

  • Static break. We consider a break at March 1st 2020 by setting except where is March 1st 2020.

  • Dynamic. We approximate the maximum-likelihood for constant . We set and we observe that for a given we have closed-form solutions for . Then we restrict ourselves to diagonal matrices whose nonzero coefficients are in and we apply a greedy procedure: starting from we change at each step the coefficient improving the most the likelihood. That procedure is designed to optimize on the training data (up to January 1st 2020).

  • Dynamic break. We use similar as in the dynamic setting except where is March 1st 2020.

  • Dynamic big. We simply use and a matrix proportional to defined based on the 2020 data.

Also, it is important that we estimate a gaussian posterior distribution, therefore we have a probabilistic forecast for the load. Precisely, our estimate is , thus we have . The likelihood that is optimized to obtain the dynamic setting is built on that probabilistic forecast of

given the past observations. In the competition we added quantiles of these gaussian distributions as forecasters in an expert aggregation.

Iv-C Dynamical Variances

The idea behind the break settings introduced in the previous paragraph is that we would like the model to adapt faster during an evolving period such as a lockdown than before. However it consists in modelling a break in the data, a sudden change of state resulting from a noise of much bigger variance at a specific time specified a priori. A way to extend the approach would be to define a time-varying covariance matrix depending for instance of a lockdown stringency index such as defined by [hale2021global]. However the competition policy forbade the use of external data and the location was undisclosed.

In a more long-term perspective let it be hoped that the evolution of the electricity load won’t be driven by lockdowns. It is therefore more generic to learn the variances of the state-space model in an adaptive fashion. We therefore apply a novel approach for time-series forecasting introduced in [tsw], and named Variational Bayesian Variance Tracking, alias VIKING. We briefly recall how the method works. This method was designed in parallel of the competition and was improved afterwards. We present the last version only.

We treat the variances as latent variables and we augment the state-space model:

Instead of estimating the state with variances fixed a priori, we estimate both the state and the variances represented by . Although we have removed as hyper-parameters, we now have to set priors on along with the parameters controlling the smoothness of the dynamics on the variances.

We apply a bayesian approach. At each step, we start from a prior obtained at the last iteration, where we introduce the filtration of the past observations . Then we obtain a prediction step thanks to the dynamical equations yielding . Finally Bayes’ rule yields the posterior distribution .

However the posterior distribution is analytically intractable, therefore the principle of VIKING is to apply the classical Variational Bayesian approach [vsmidl2006variational]. The posterior distribution is recursively approximated with a factorized distribution. In our setting we look for the best product approximating . The criterion minimized is the Kullback-Leibler (KL) divergence

where . At each step it yields a coupled optimization problem in the three gaussian distributions. The classical iterative method (see for instance [tzikas2008variational]) consists in computing alternately

where the expected value is taken with respect to two of the three latent variables, and identifying the desired first two moments with respect to the other latent variable. However the expression

doesn’t match a gaussian distribution in , and similarly for . We therefore use the first two moments of the gaussian distribution to derive an upper-bound of the KL divergence for which we have an analytical solution. We refer to [tsw] for the detailed derivation of the algorithm.

V Experiments

We display the performance of the introduced methods that we call experts. Then we use aggregation of experts to leverage specificities of each forecaster. The end of the section is devoted to a discussion of our day-to-day strategy during the competition. Finally, we refer to the implementation for more details222

V-a Individual Experts

We first display in Table II the results of the statistical and machine learning methods of Section III-A, with or without the intraday correction of Section III-B. To present the improvement brought by the intraday correction we give the performance during a stable period (after the training of the model, but before the covid crisis). We observe that the only model for which the intraday correction doesn’t improve the performance (RF_GAM) is the one including already a residual correction. The improvement during the evaluation period (2021) is much bigger (57% decrease of the MAE for the MLP for instance), and it is natural as the intraday correction is an autoregressive, that is, an adaptive model.

Adaptation AR Linear GAM RF RF_GAM MLP
Offline 29.3 20.8 20.7 24.6 23.0 21.2
Offline intraday 27.0 19.9 19.3 24.4 23.7 20.6

Models are trained up to Jan. 1st 2020 and tested during the next two months before the break of March.

TABLE II: Mean Average Error (in MW) of each method of Section II during normal test period.

Then we focus on adaptive models to show the improvements due to each setting, see Table III.

Adaptation AR Linear GAM MLP
Offline 14.6 22.8 22.7 16.7
Static 20.5 15.7 17.0 22.9
Static break 27.9 14.4 28.4 35.4
Dynamic 14.4 14.9 15.3 13.0
Dynamic break 16.2 13.6 14.3 12.3
Dynamic big 14.3 11.2 12.4 12.4
VIKING 14.4 11.5 12.7 12.5

The performances are displayed for each model after intraday correction. As a comparison, re-training the random forest every day yields an online RF of MAE 15.0 MW, and an online GAM_RF of MAE 18.1 MW.

TABLE III: Mean Average Error (in MW) of each method during the competition evaluation set (2021-01-18 to 2021-02-16).

We have 4 different models (autoregressive, linear, GAM and MLP). For each one, we try the various adaptation settings (no adaptation, Kalman filters and VIKING). Kalman filters with constant covariance matrix proportional to the identity obtain the best results. That is not the case on the data previous to the competition and it depends on the intrinsic evolution of the data.

We illustrate the different settings in Figure 6 where we display the evolution of the state coefficient for the GAM adaptation strategies.

Fig. 6: Evolution of the state coefficients for various adaptations of the GAM, see Section IV. On the left, we use the Kalman Filter in the static setting (degenerate covariance matrix ). On the middle, the dynamic setting where the variances are constant, and we provide the ratio : we observe that the coefficient corresponding to the biggest coefficient of (the effect of ) evolves much faster. On the right, the VIKING setting where we estimate the variances adaptively.

V-B Aggregation

Online robust aggregation of experts [Cesa-Bianchi:2006] is a powerful model agnostic approach for time series forecasting, already applied to load forecasting during the lockdown in [obst2021adaptive]. We use the ML-Poly algorithm proposed in [gaillard2014second] and implemented in the R package opera [gaillard2016opera] to compute these online weights.

The aggregation weights are estimated independently for each hour of the day. We summarize different variants in Table IV.

Adaptation AR Linear GAM MLP All
Best expert 14.3 11.2 12.7 12.3 11.2
Aggregation 14.4 11.4 11.7 11.9 10.9
TABLE IV: Mean average error of aggregation strategies (in MW) during the competition evaluation set (2021-01-18 to 2021-02-16).

First, for each family of models we compute the aggregation of all the adaptation settings (7 for each). Then we aggregate all of them (28 models). An example of the weights obtained at 3PM is displayed in Figure 7.

Fig. 7: Evolution of the aggregation weights at 3PM from July 1st 2020 to February 16th 2021.

The aggregation presented in this paper obtains a performance close to our strategy winning the competition (degradation of about 0.05 MW).

V-C Day-to-day Forecasts

During the competition our predictions were not exactly the ones of the aggregation method presented in the previous subsection. There are mainly two reasons for that.

First, we considered a bigger set of forecasting methods (we had experts). It seemed reasonable to prune the strategy for the sake of clarity of paper at the cost of a very small change of error, but it is interesting to present also the predictions used during the evaluation. We found a trade-off in the selection of experts. Indeed too much experts in the aggregation yields poor performances. We applied a greedy procedure to select the experts we keep in the aggregation: we begin with an empty set, and at each step we add the one improving the most the performance. That performance was evaluated with the MAE on the last month of the training data set. We provide in Figure 8 a graphical representation of how we defined different time periods. We refer to Figure 9 for the evolution of the validation MAE as the selection grows. We observe a sharp decrease of the error as experts are added with high diversity and then a slow increase of the loss as the set of experts becomes too large.

Fig. 8: Segmentation of the data set. Meteorological corrections as well as time-invariant forecasts were trained on the train period (up to January 1st 2020). Adaptive forecasting methods were evolving on the whole period. Then the aggregation weights were trained by MLpol from July 1st 2020, and the expert selection was determined with respect to the validation set.
Fig. 9: Evolution of the validation MAE as the expert selection grows from 1 to 30 experts. The nomenclature is provided in the Appendix.

Second, we were constantly experimenting the different strategies. We used a variational bayesian method that was a prior version of the one of this paper. We also changed a lot the aggregation procedure.

We refer to the appendix for a detailed presentation of our daily strategy. Overall these day-by-day changes degraded the performance, if we had stayed on the first strategy with no change at all, our MAE would have been 10.51 MW instead of 10.84 MW. The critical issue in such unstable periods is to find the right validation period to select the prediction procedure. The month before the evaluation period seems a posteriori a good compromise. During the competition we changed ”manually” based on the performances in a shorter range, for instance considering an expert performing well on the last few weeks for a specific day type … We should have trusted the aggregation’s robustness.

Vi Conclusion

In this paper we presented our procedure to win a competition on electricity load forecasting during an unstable period. Our approach relies heavily on state-space models and the competition was the first data set on which was applied a recent approach to adapt the variances of a state-space model. Some perspectives have been raised during the competition such as interpretability of the global approach and a better understanding of the error propagation along the different adaptations (intraday correction, Kalman filtering, variance tracking and aggregation).

Finally, similar state-space methods have been applied to obtain the first place in another competition in which the objective was to forecast the electricity consumption of a building333


Vi-a Nomenclature

The experts AR, Lin, GAM, RF, RF_GAM, MLP are the ones presented in that same order in Section III-A.

Names of the form model_setting refer to the expert obtained by state-space adaptation of the model model with the setting setting. For instance, Lin_dynamic refers to a linear model adapted with the Kalman filter in the dynamic setting, c.f. Section IV-B.

We consider quantile variants of RF, denoted by RFq where q is the quantile order in percent (e.g. RF40 is the quantile random forest of quantile value 0.4). We also consider a quantile variant of the dynamic MLP denoted by similar names (MLP_dynamic60 is the quantile 0.6 of the MLP in the dynamic setting).

Furthermore, we introduce an expert named GAM_SAT forecasting each day with the GAM as if it were a Saturday motivated by [obst2021adaptive].

Finally, each expert x yields another expert x_corr after intraday correction.

Vi-B Day-to-day Evolution of the Forecasting Strategy

As explained in Section V-C, our strategy evolved in time and we recall here every change.

  • From January 18th to January 24th: we used the following set of experts obtained by the greedy selection described in Section V-C: RF, RF_corr, RF50_corr, RF60_corr, Lin_dynamic_corr, Lin_viking_corr, GAM, GAM_corr, GAM_staticbreak_corr, GAM_dynamic_corr, GAM_viking_corr, GAM_dynamicbig_corr, RF_GAM, RF_GAM_corr, GAM_SAT_corr, MLP_dynamic60, MLP_dynamic90, MLP_dynamic99. We aggregated with ML-poly with an aggregation estimated independently for each hour, with the absolute loss. We found afterwards a bug in RF50_corr, RF60_corr: the quantile RF were set to 0 on the test set so that these two experts were simple intraday autoregressive trained in an unwanted manner.

  • From January 25th to January 31st: we removed the experts RF50_corr, RF60_corr and we replaced them with AR_corr.

  • February 1st and 2nd: we used the uniform average between three forecasts. First, the previous aggregation. Second, another aggregation procedure called RF-stacking, consisting in a quantile random forest minimizing the MAE and taking as input the 72 experts as well as the day type and hour of the day. Third, a benchmark close to the one given by the competition organizers: we predict each time with the last available load of the same hour and the same day group (week days, saturdays and sundays).

  • From February 3rd to 7th: we removed the benchmark which damaged the performances. For Feb. 5th we corrected the ML-poly prediction using a special day correction, once we observe that Feb. 5th had a special behaviour in the last three years. Precisely, we observed that the relative error of the model is significantly negative on the last three years, a behaviour that may come from a bank holiday for instance. Therefore we fit a smoothed function of the time of day on the relative error and we applied it to our forecast. We truncated so that there is no correction during night. See the shape of the correction in Figure 10.

    Fig. 10: Special day correction applied on February 5th. It is a multiplicative correction, e.g. at midday we reduce our forecast by about 3.8%.
  • February 8th: we used the single expert Lin_dynamicbig_corr as we observed that it was by far our best expert on the last week, and it seemed to perform especially well on Mondays.

  • February 9th: we came back to the average between ML-poly and the RF-stacking but we added to the aggregation ML-poly the expert Lin_dynamicbig_corr, and we replaced the expert AR_corr with another expert AR_intra incorporating directly the intraday correction in the autoregressive, instead of correcting an autoregressive based only on daily lags.

  • February 10th and 11th: we removed the RF-stacking which degraded our performances since its introduction and we kept only the aggregation ML-poly.

  • February 12th and 13th: we corrected a posteriori the electricity load for February 5th with the special day correction. It was important to do it on that day as the weekly lags is important in the models.

  • February 14th: we used once again the average between the ML-poly aggregation and the RF-stacking, as we observed that the RF-stacking is especially good on Sunday.

  • February 15th and 16th: we used only the ML-poly aggregation.