I Introduction
In countries with liberalized electricity markets, prices play a central role in the efficient short-term coordination of supply and demand.
Operators of generation units, flexible loads, and storage facilities adapt their bidding and scheduling to anticipated market prices.
Hence, price forecasts are a crucial ingredient for flexible power systems.
Probabilistic forecasts extend classic point forecasts by also reporting the forecast uncertainty instead of only the expected value.
This allows an agent that bases its actions on the forecast to take risk-sensitive decisions by using tools from stochastic programming.
For example, [1] models the operation of a multi-reservoir hydro power plant in the Nordic market under uncertain water inflow and market prices.
In [2], a combined heat and power plant which operates in the German market under uncertain day-ahead and balancing prices is considered.
In these settings, electricity prices are a main source of uncertainty.
The optimization problem is usually solved using the sample average approximation of the objective [3] which requires samples from the multivariate joint predictive distribution.
Nevertheless, the probabilistic electricity price forecasting (EPF) literature is largely restricted to estimating marginal distributions of the individual hourly prices [4]. However, sampling independently from the marginals would yield suboptimal solutions since this ignores the dependency structure of the individual dimensions. A common approach in other forecasting domains is to first estimate the marginal predictive distributions for the individual output dimensions and then generate samples from the joint distribution using copulas [5].
In this paper, we introduce implicit generative ensemble postprocessing (IGEP), which bypasses these two steps by combining ideas from ensemble learning and implicit generative models (IGMs) [6]. Our method works on top of an ensemble of deterministic expert models and generates multivariate scenarios which represent the joint predictive distribution implicitly. We use a multivariate linear model to transform samples from a set of univariate latent random variables, together with the deterministic mean forecasts, into vectors of the target distribution. The model is trained by minimizing the energy score (ES) [7] using stochastic gradient descent. Our model uses two types of stochastic latent variables, adaptive and non-adaptive. The parameters of the adaptive latent variable distributions are set according to the ensemble dispersion and hence reflect the uncertainty of the mean prediction. The parameters of the non-adaptive distributions are fixed to reflect the second type of uncertainty, the unexplainable randomness.
We demonstrate our approach on a publicly available data set of the German-Austrian day-ahead electricity market [8], where hourly prices are set via 24 simultaneous blind auctions under uniform pricing. Our method outperforms various other approaches in terms of the continuous ranked probability score (CRPS) and the ES, including quantile regression averaging (QRA) [9] and non-homogeneous Gaussian regression (NGR) [10, 11] in combination with a Gaussian copula.
The remainder of the paper is structured as follows. We introduce the main concepts regarding probabilistic forecasting, proper scoring rules, and IGMs in Section II. We then describe our approach theoretically in Section III and apply it to the problem of probabilistic EPF in Section IV. We conclude in Section V.
II Prior Work & Concepts
II-A Multivariate Probabilistic Forecasting
The literature on EPF has become extensive over the last two decades [12].
In recent years the focus has shifted from point forecasts to probabilistic forecasts [4].
The goal of probabilistic forecasting is to correctly quantify prediction uncertainties.
Hence, probabilistic forecasts usually come in the form of a predictive conditional distribution over possible future quantities or events [13].
However, in order to use probabilistic forecasts for scenario-based stochastic optimization, the forecasts have to take the form of samples from the joint predictive distribution. For univariate problems, or if the output dimensions are independent, this is straightforward. In a multivariate setting, e.g. spatio-temporal modeling, sampling independently from the marginals would not lead to a coherent dependency structure. In this case, the common approach is to first estimate a set of univariate marginal predictive cumulative distribution functions (CDFs) F_1, …, F_D over the individual output dimensions and then use copula functions to generate samples. Sklar’s theorem [14] shows that every multivariate CDF F with marginals F_1, …, F_D can be represented as F(y_1, …, y_D) = C(F_1(y_1), …, F_D(y_D)) for y ∈ R^D. The copula function C is a multivariate CDF with standard uniform marginals and allows decoupling the estimation of the marginal CDFs from determining the joint dependency structure. A popular choice in many applications is the Gaussian copula. Here, samples can be generated by first sampling a realization z of a D-dimensional normal distribution with zero mean and covariance matrix Σ. Then a predictive scenario ŷ is obtained by setting ŷ_k = F_k^(−1)(Φ(z_k)), where Φ denotes the CDF of a univariate standard normal distribution and F_k^(−1) is the inverse of the predictive CDF for dimension k.
This approach is regularly applied in spatio-temporal problem settings like geostatistics, meteorological forecasting, and renewable energy forecasting [5]. To the best of our knowledge, only two EPF papers consider the problem. Toubeau et al. [15] use empirical copulas to generate coherent scenarios of load, renewable generation, and prices for the Belgian market. Chai et al. [16] employ a Gaussian copula approach for the Nord Pool day-ahead market.
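As a concrete illustration of this two-step procedure, the following sketch couples empirical marginal distributions with a Gaussian copula. The function name, its arguments, and the use of sample columns as empirical marginals are assumptions for illustration, not code from any cited work.

```python
import math
import numpy as np

def gaussian_copula_scenarios(marginal_samples, corr, n_scenarios, seed=None):
    """Generate joint scenarios from empirical marginals and a Gaussian copula.

    marginal_samples: (n, D) array whose columns represent the D marginal
    predictive distributions; corr: (D, D) copula correlation matrix.
    """
    rng = np.random.default_rng(seed)
    D = corr.shape[0]
    # 1) Draw z ~ N(0, corr) carrying the desired dependency structure.
    z = rng.multivariate_normal(np.zeros(D), corr, size=n_scenarios)
    # 2) Map to uniforms via the standard normal CDF Phi(z) = (1 + erf(z/sqrt(2)))/2.
    u = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    # 3) Push each uniform through the empirical inverse CDF (quantile function).
    return np.column_stack(
        [np.quantile(marginal_samples[:, k], u[:, k]) for k in range(D)]
    )
```

The resulting scenarios inherit their marginal distributions from `marginal_samples` and their dependency structure from `corr`.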
II-B Forecast Combination
It is well known that combining forecasts from different models often improves accuracy. The combination of probabilistic forecasting and ensemble methods is especially popular and well researched in the field of meteorological forecasting under the name of ensemble model output statistics (EMOS) [17]. In this domain, the task of model combination naturally arises from the need to statistically postprocess the output of numerical weather forecasting ensembles to obtain calibrated probabilistic forecasts. However, ensemble postprocessing methodologies and forecast averaging have also become popular in the probabilistic EPF literature, mainly as variants of QRA [9]. In QRA, forecasts from an ensemble of models are used as inputs for a quantile regression model to approximate the predictive distribution using a dense grid of quantiles. However, to the best of our knowledge, there is no EPF paper that systematically benchmarks other forecast combination schemes.
II-C Proper Scoring Rules
In probabilistic forecasting we aim for predictive distributions that maximize sharpness subject to calibration.
Calibration refers to the consistency between the forecast and the observations, e.g. in expectation 20% of the observed values should fall below the forecasted 0.2 quantiles, while sharpness refers to the concentration of the predictive distribution [7].
Scoring rules formalize the intuition of sharpness and calibration by assigning a numerical score to the predictive distribution depending on the realization of the random event or quantity of interest.
More formally, if y is a realization of a random quantity with true distribution G and we issue a predictive distribution F, we receive the score S(F, y), where S is a scoring rule.
A negatively oriented scoring rule is called strictly proper if its expected value E_{y∼G}[S(F, y)] is uniquely minimized for F = G.
Thus, strictly proper scoring rules describe a principled framework for comparing probabilistic forecasts.
The CRPS is a strictly proper scoring rule for real-valued quantities that does not rely on a predefined likelihood.
It is defined as

  CRPS(F, y) = ∫_{−∞}^{∞} (F(x) − 1{x ≥ y})² dx,    (1)

where F is the CDF of the probabilistic forecast [18]. Instead of computing the integral in (1), we can also compute the CRPS as

  CRPS(F, y) = E_F |X − y| − ½ E_F |X − X′|,    (2)

where X and X′ are independent samples of a random variable distributed according to F [7].
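Because (2) involves only expectations of absolute differences, the CRPS can be estimated directly from samples of the predictive distribution; a minimal sketch (function name assumed):

```python
import numpy as np

def crps_from_samples(samples, y):
    """Sample-based CRPS estimate: E|X - y| - 0.5 * E|X - X'|,
    with both expectations replaced by averages over the samples."""
    x = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(x - y))
    # All pairwise absolute differences between samples via broadcasting.
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2
```

For a point forecast (all samples identical), the second term vanishes and the CRPS reduces to the absolute error, consistent with (2).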
Gneiting and Raftery [7] introduce a generalization of the CRPS to the multivariate case.
The ES is defined as

  ES(F, y) = E_F ‖X − y‖^β − ½ E_F ‖X − X′‖^β    (3)

and is strictly proper for β ∈ (0, 2). Since the ES makes no assumption about the distributional form of F and can be evaluated based on samples from the predictive distribution, it provides an attractive loss function for multivariate probabilistic forecasting tasks.
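Analogously, the ES with β = 1 can be estimated from a finite set of multivariate scenarios; a minimal sketch (names assumed):

```python
import numpy as np

def energy_score(scenarios, y):
    """Sample-based energy score for beta = 1:
    mean_s ||x_s - y|| - 0.5 * mean_{s,s'} ||x_s - x_s'||."""
    X = np.asarray(scenarios, dtype=float)   # shape (S, D)
    term1 = np.mean(np.linalg.norm(X - np.asarray(y, dtype=float), axis=1))
    # Pairwise Euclidean distances between all scenarios.
    pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return term1 - 0.5 * np.mean(pairwise)
```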
II-D Implicit Generative Models
Unlike prescribed generative models, which provide an explicit parametric description of the underlying probability distribution, IGMs are likelihood-free models that only define a stochastic procedure to generate samples [6]. Since we are primarily interested in sampling from the predictive distribution, IGMs form an attractive class of models for multivariate probabilistic forecasting.
A prominent example of IGMs are generative adversarial networks (GANs) [19], which are based on deep neural networks and replace the likelihood function with a classification model called the discriminator. The only input for the generator model are vectors of unconditional random noise from a set of simple, univariate latent variable distributions, e.g. independent uniform distributions. GANs have already been applied to probabilistic wind power forecasting in [20], where they are used to generate unconditional error scenarios for multi-step-ahead forecasting. While GANs are potentially very powerful models and achieve impressive results on a variety of tasks, they are known for a complex and unstable training procedure and require large training data sets [21]. MMD-GANs [22, 23] replace the discriminator model with the maximum mean discrepancy (MMD) two-sample test [24], which significantly simplifies the training procedure. Interestingly, the ES is a special case of the MMD [25].
III Implicit Generative Ensemble Postprocessing
III-A Problem Setting
Consider a data set comprised of N examples (x_i^(1), …, x_i^(M), y_i), where y_i ∈ R^D is the output and x_i^(1), …, x_i^(M) ∈ R^D are previously determined point predictions for y_i from a set of M different models, with i = 1, …, N. We denote the mean prediction of the ensemble members by x̄_i, i.e. x̄_i = (1/M) ∑_{m=1}^{M} x_i^(m). Our goal is to generate a set of S scenarios {ŷ_i^(1), …, ŷ_i^(S)} that represent the joint predictive distribution of y_i. In order to do this, we parametrize a generator function g_θ(x̄_i, z) with model parameters θ, where z denotes a sample from a set of independent univariate latent distributions.
III-B Latent Variable Distributions
The latent variables in IGMs are usually treated as independent random noise and serve as an external source of randomness to generate samples from the unconditional implicit distribution. We propose to give meaning to some of the latent distributions by adapting their parameters depending on the ensemble predictions. We construct D uniform adaptive latent distributions

  z_k^(a) ∼ U(−δ_k, δ_k),    k = 1, …, D,    (4)

where δ_k measures the dispersion of the ensemble predictions for dimension k, e.g. their standard deviation. This will reduce the variance of the samples for dimension k if the ensemble predictions are similar and increase the variance in case the ensemble is more dispersed. We denote a sample from the adaptive latent distributions by z^(a).
The uncertainty in the final prediction is only partly due to the uncertainty in the mean predictions. It is also caused by effects we are not able to model and hence treat as random noise. We therefore specify L additional latent distributions that are independent of the point predictions, i.e. they have a constant variance. We denote a sample from the non-adaptive latent distributions by z^(n). To construct a sample from the latent space, we draw from all latent distributions and concatenate the values to the vector z = [z^(a); z^(n)].
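A minimal sketch of drawing such latent vectors, assuming the ensemble standard deviation as the dispersion measure δ_k and zero-mean uniform non-adaptive latents (both assumptions for illustration):

```python
import numpy as np

def sample_latent(ensemble_preds, n_nonadaptive, n_samples,
                  noise_half_width=1.0, seed=None):
    """Draw latent vectors z = [z_adaptive, z_nonadaptive] (illustrative sketch).

    ensemble_preds: (M, D) point predictions of the M ensemble members.
    Adaptive latents: uniform on [-delta_k, delta_k] with delta_k the ensemble
    standard deviation of dimension k (assumed dispersion measure).
    Non-adaptive latents: zero-mean uniform with a fixed half-width.
    """
    rng = np.random.default_rng(seed)
    delta = ensemble_preds.std(axis=0)      # ensemble dispersion per dimension
    z_a = rng.uniform(-delta, delta, size=(n_samples, delta.size))
    z_n = rng.uniform(-noise_half_width, noise_half_width,
                      size=(n_samples, n_nonadaptive))
    return np.concatenate([z_a, z_n], axis=1)
```

Dimensions where the ensemble members agree exactly receive adaptive latents that are identically zero, so all residual randomness there comes from the non-adaptive latents.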
III-C Generator Function
To be able to model a coherent dependency structure, the architecture of the generator function must ensure that all elements of ŷ can depend on the whole latent random vector z. In general, g_θ could be any differentiable function, e.g. a deep neural network. However, as the training data in many forecasting tasks is relatively small, we propose a linear model of the form

  ŷ = g_θ(x̄, z) = w ⊙ x̄ + b + A z^(a) + B z^(n),    (5)

with w, b ∈ R^D, A ∈ R^(D×D), and B ∈ R^(D×L), where ⊙ denotes the element-wise product. Thus, the model has 2D + D² + DL parameters. The first two terms allow for a bias correction of the deterministic ensemble mean. The third term models the dependence on the adaptive latent variables while the fourth term models the dependence on the non-adaptive latent variables. Hence, for output dimension k, the prediction is a linear combination of the respective ensemble mean and all elements of z^(a) and z^(n),

  ŷ_k = w_k x̄_k + b_k + ∑_{j=1}^{D} A_{kj} z_j^(a) + ∑_{l=1}^{L} B_{kl} z_l^(n).    (6)

Note that by construction E[z^(a)] = 0 and E[z^(n)] = 0 and therefore E[ŷ] = w ⊙ x̄ + b, as can be seen from (5) and (6), i.e. in expectation the model predicts the bias-corrected ensemble mean.
Fig. 1 shows the process of generating a set of S scenarios for a single example. First, the parameters of the adaptive latent distributions are set according to the ensemble dispersion. We then draw S samples from the univariate latent distributions.
Together with the mean prediction x̄, each of these samples is then mapped to the multivariate output space via the generator g_θ.
We thereby obtain a set of S scenarios which represent the predictive distribution.
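A vectorized sketch of a linear generator of this form (the element-wise parametrization follows the description above and is illustrative):

```python
import numpy as np

def generate_scenarios(x_mean, z_a, z_n, w, b, A, B):
    """Map latent draws to scenarios with the linear generator sketch:
    y_hat = w * x_mean + b + A z_a + B z_n, vectorized over S draws.

    x_mean: (D,); z_a: (S, D); z_n: (S, L); w, b: (D,); A: (D, D); B: (D, L).
    Returns an (S, D) array of scenarios.
    """
    return w * x_mean + b + z_a @ A.T + z_n @ B.T
```

At the initialization suggested below (w = 1, b = 0, A = I, B = 0), the generator predicts the ensemble mean perturbed only by the adaptive latents.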
III-D Training
We train g_θ to minimize the expected ES. For a single training example, the sample-based loss is given by

  L(θ) = (1/S) ∑_{s=1}^{S} ‖ŷ^(s) − y‖ − (1/(2S²)) ∑_{s=1}^{S} ∑_{s′=1}^{S} ‖ŷ^(s) − ŷ^(s′)‖ + λ (‖A‖_F² + ‖B‖_F²),    (7)

with ŷ^(s) = g_θ(x̄, z^(s)) and ‖·‖_F the Frobenius norm. Intuitively, this objective function represents two orthogonal goals. The first term in (7) decreases if the generated scenarios are close to the true value, while the second term increases when the distances between the scenarios are large and hence rewards scenarios that are diverse. The last term is a regularization term that is controlled by the parameter λ. A sensible initialization for the model parameters is w = 1, b = 0, A = I, and B = 0. This corresponds to a model that in expectation predicts the ensemble mean and whose uncertainty in dimension k only depends on the ensemble spread for dimension k. During training, only a small number of scenarios S is generated per training example. We train the model using stochastic gradient descent on mini-batches and use automatic differentiation to compute the gradients. Algorithm 1 formalizes the training procedure.
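To make the training loop concrete, the following didactic sketch fits the linear generator on a single example by gradient descent on the sample ES. It replaces automatic differentiation with finite-difference gradients so that it is self-contained; all names, step sizes, and the single-example setup are illustrative simplifications of the batched procedure described above.

```python
import numpy as np

def es_loss(scenarios, y):
    """Sample energy score: mean_s ||s - y|| - 0.5 * mean_{s,s'} ||s - s'||."""
    t1 = np.mean(np.linalg.norm(scenarios - y, axis=1))
    t2 = 0.5 * np.mean(np.linalg.norm(
        scenarios[:, None, :] - scenarios[None, :, :], axis=2))
    return t1 - t2

def train_igep(x_mean, y, z, D, L, steps=100, lr=0.05, lam=0.0, eps=1e-5):
    """Fit the linear generator on one example by gradient descent on the
    sample ES (finite differences stand in for autodiff here).
    z: (S, D + L) latent draws, first D columns adaptive."""
    # Initialization from the text: w = 1, b = 0, A = I, B = 0.
    theta = np.concatenate(
        [np.ones(D), np.zeros(D), np.eye(D).ravel(), np.zeros(D * L)])

    def loss(th):
        w, b = th[:D], th[D:2 * D]
        A = th[2 * D:2 * D + D * D].reshape(D, D)
        B = th[2 * D + D * D:].reshape(D, L)
        scen = w * x_mean + b + z[:, :D] @ A.T + z[:, D:] @ B.T
        return es_loss(scen, y) + lam * (np.sum(A ** 2) + np.sum(B ** 2))

    for _ in range(steps):
        base = loss(theta)
        grad = np.empty_like(theta)
        for i in range(theta.size):          # forward finite differences
            tp = theta.copy()
            tp[i] += eps
            grad[i] = (loss(tp) - base) / eps
        theta = theta - lr * grad
    return theta, loss(theta)
```

Since the sample ES is bounded below by zero (by the triangle inequality, the pairwise term never exceeds the accuracy term), plain gradient descent steadily reduces the loss from its initialization.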
IV Case Study: Probabilistic Electricity Price Forecasting
IV-A Forecasting Study
In the following, we demonstrate how to apply our approach to the task of probabilistic electricity price forecasting using a publicly available data set of the German-Austrian day-ahead market from January 2015 to December 2017 [8]. As in most European countries, this market is operated as a daily blind auction under uniform pricing for 24 one-hour blocks of electrical energy that is to be consumed or delivered during the respective hour of the following day. The data set contains the hourly prices along with the forecasted load, wind power generation, and solar power generation. The forecasting task is to issue a probabilistic forecast for the 24-dimensional price vector of the following day in the form of a set of scenarios. To apply our model, we first have to construct a set of point forecasting models that form the ensemble. The out-of-sample forecasts of the ensemble are then used as inputs for the probabilistic models. We use the data of 2015 as the initial training set for the ensemble models. The ensemble forecasts for 2016 then form the training set for the probabilistic models. We use the full year of 2017 as the test set. For all models, we apply a rolling window scheme, i.e. after forecasting the values for the next day, the training set is shifted by one day and all models are re-estimated.
IV-B Point Forecasting Ensemble
We train a set of five expert point forecasting models and model the prices as a function of the residual load, i.e. the share of the demand that is not covered by generation from wind and solar. The point forecasts of model m are denoted by x^(m).
ARX-M
The first model is an ARX-type linear regression model with lagged prices and the residual load as regressors. The model works on transformed prices [26] to account for the nonlinear effect of the residual load. We fit one model per hour of the day, i.e. we fit 24 separate models.
ARX-U
The second model is of a similar type, but we fit only a single model for all hours of the day. In addition, the model includes a one-hot encoded vector of hour dummies as regressors.
Poly-LR
This model is a polynomial linear regression of the price on the residual load. It does not use any lagged predictors. To still account for autocorrelation, we estimate the model parameters using weighted least squares, i.e. each training sample receives an individual weight.
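Weighted least squares reduces to ordinary least squares after rescaling each sample by the square root of its weight. The sketch below uses exponentially decaying weights as a hypothetical choice; the paper's exact weighting scheme is not reproduced here.

```python
import numpy as np

def weighted_least_squares(X, y, alpha=0.99):
    """Solve argmin_beta sum_t w_t (y_t - x_t @ beta)^2 with hypothetical
    weights w_t = alpha**(T - 1 - t), i.e. the newest sample has weight 1."""
    T = X.shape[0]
    w = alpha ** (T - 1 - np.arange(T))
    sw = np.sqrt(w)
    # Rescale rows by sqrt(w_t), then solve the ordinary least-squares problem.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```

On noiseless linear data, any positive weighting recovers the same coefficients; the weights only matter once the relationship drifts over time.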
LW-LR
This model is a locally weighted linear regression model, where the weight of each training sample depends on its distance to the query point.
GB
The last model uses gradient boosted decision trees and models the price as a function of the residual load and the hour dummies. It is also estimated using a weighted training set. The model was estimated using Scikit-learn 0.20.1 with all hyperparameters kept at their default values.
We report the mean absolute error (MAE) and the root mean squared error (RMSE) of the models’ forecasts and of the simple average of all forecasts (AVG) for the entire out-of-sample period in Table I. Taking the average of all forecasts reduces the MAE by 5% and the RMSE by 2% in comparison to the best performing individual model.
        ARX-U  ARX-M  Poly-LR  LW-LR  GB    AVG
MAE     3.65   3.64   3.46     3.64   3.58  3.29
RMSE    6.08   6.08   5.53     5.86   6.17  5.42
IV-C Model Benchmarking
We set up an IGEP model as described in Section III; in total, the model has 864 parameters. Before training, we standardize all inputs and outputs using the mean and standard deviation of the prices in the training set. We train the model for 100 epochs using the Adam optimizer [27] at default values in Keras 2.2.4 [28]. Training the model takes about one minute on a standard laptop with an Intel i7-7500U CPU. We generate 1000 scenarios for each of the 365 days in the test set and evaluate our model by computing the average ES and CRPS. The whole training and test procedure was repeated 10 times for all models to average out random effects during training and evaluation. We compare the results of our model against the five approaches described below.
Raw Ensemble
The simplest approach is to treat the ensemble members as samples from the predictive distribution, i.e. we use the point forecasts x^(1), …, x^(M) directly as scenarios.
Multivariate Gaussian Errors (MGE)
We can easily generate scenarios by adding draws from a zero-mean multivariate normal distribution to the ensemble mean, where the covariance matrix is estimated from the residuals of the ensemble mean predictions on the training set. This method can generate samples with a coherent dependency structure if the homoscedastic multivariate Gaussian error assumption is appropriate.
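This benchmark is straightforward to implement; a sketch with assumed names:

```python
import numpy as np

def mge_scenarios(mean_pred, residuals, n_scenarios, seed=None):
    """Multivariate-Gaussian-error benchmark: perturb the ensemble mean with
    draws from N(0, Sigma), Sigma estimated from past mean-forecast residuals.

    mean_pred: (D,) ensemble mean; residuals: (n, D) historical residuals."""
    rng = np.random.default_rng(seed)
    sigma = np.cov(residuals, rowvar=False)      # homoscedastic error covariance
    eps = rng.multivariate_normal(np.zeros(mean_pred.size), sigma,
                                  size=n_scenarios)
    return mean_pred + eps
```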
IGEP with independent latent variables (IGEP_{ind})
Here we use the IGEP model with the same hyperparameters but independent latent variables, i.e. the variances of all latent distributions are fixed across examples. Thus, the variance of the latent distributions is not adapted to the ensemble spread and the model does not account for uncertainty in the ensemble mean prediction. However, the model should be able to generate realistic price scenarios if the model’s capacity is sufficient.
Quantile Regression Averaging & Gaussian copula (QRA+C)
We use quantile regression averaging (QRA) [9] to approximate the marginal predictive distribution for each hourly price by a dense grid of quantiles. QRA applies a linear quantile regression [29] model to the ensemble of point forecasts, one model per quantile level. We then use the Gaussian copula approach described in [30] with the standardized error covariance matrix to generate scenarios based on the approximated univariate marginal distributions. QRA is a well-established and strong benchmark for probabilistic electricity price forecasting [4]. Also note that the pinball loss function minimized in QRA is an approximation of the CRPS.
Non-homogeneous Gaussian Regression & Gaussian copula (NGR+C)
NGR is an approach originally developed for postprocessing numerical weather forecast ensembles. The marginal probabilistic forecast takes the form of a univariate Gaussian distribution. The mean and variance parameters are estimated using two different linear models: the mean is modeled as a linear function of the ensemble mean, and the variance as a linear function of the squared standard deviation of the ensemble predictions. We estimate the NGR model parameters using maximum likelihood (ML) as well as minimum CRPS estimation using the R package crch [31]. We then again use the Gaussian copula to generate multivariate samples.
IV-D Results
        IGEP    Raw Ens.  MGE     IGEP_{ind}  QRA+C   NGR_{ML}+C  NGR_{CRPS}+C
ES      16.294  18.591    17.261  16.995      16.806  16.427      16.410
CRPS    2.693   3.052     2.866   2.799       2.783   2.716       2.704
RMSE    6.014   5.994     5.994   6.000       5.956   6.022       6.030
Fig. 2 shows scenario forecasts generated by the IGEP model for two different market situations together with the true price and the ensemble predictions.
As can be seen, the model generates samples that resemble the real price vectors.
Furthermore, it generates more diverse scenarios at times where the ensemble spread is increased.
Table II shows the test set results for the average ES and CRPS values as well as the RMSE for the predictive mean.
Recall that, unlike the ES, the CRPS does not consider the joint dependency structure but only accounts for the marginal predictive distributions.
Hence, the CRPS values of the QRA and NGR models are only dependent on the models for the marginals while the ES also depends on the employed copula.
All tested models improve over the ES and CRPS values of the raw ensemble with the MGE approach showing the smallest improvement.
This indicates that the assumption of homoscedastic Gaussian errors is not appropriate.
The IGEP model performs best in terms of ES and CRPS.
It shows a 4% improvement over the IGEP_{ind} model in terms of ES.
This improved performance must result from the adaptive latent variable distributions as the models are otherwise identical.
Hence, the IGEP model is capable of meaningfully transforming the random samples from the univariate adaptive latent distributions to samples of the multivariate predictive distribution.
The second best model is the NGR model with minimum CRPS estimation which slightly improves over the NGR model with maximum likelihood estimation.
Interestingly, QRA, which is quite popular in the EPF literature, shows the worst performance of the more advanced methods.
While the differences in the scores seem small, note that the differences must largely result from a better assessment of forecasting uncertainty since the RMSE values for all models are very similar.
V Conclusion
In this paper, we proposed IGEP, a novel forecast combination method for scenario-based multivariate probabilistic forecasting that combines IGMs with an ensemble of point prediction models.
Using a given set of point forecasts as inputs, our method generates multivariate samples from the joint predictive distribution without making parametric assumptions about the underlying probability distribution.
We demonstrated our approach for the task of probabilistic electricity price forecasting.
Our method outperformed two well-established benchmarks, QRA and NGR in combination with a Gaussian copula, in terms of ES and CRPS.
Since our method works on top of an ensemble of domain-specific expert models, it can be applied to a wide variety of forecasting tasks.
There are several avenues for future work.
We did not systematically investigate the effect of changing the model’s hyperparameters, e.g. the number and type of latent variable distributions.
Introducing a nonlinear model structure is also worth investigating.
Furthermore, we plan to test and compare the proposed framework on more data sets.
In general, we see a lot of potential in combining IGMs and expert models for probabilistic forecasting, especially if one wants to avoid making parametric assumptions about the predictive distribution.
References
 [1] E. K. Aasgard, G. S. Andersen, S.-E. Fleten, and D. Haugstvedt, “Evaluating a stochastic-programming-based bidding model for a multireservoir system,” IEEE Transactions on Power Systems, vol. 29, no. 4, pp. 1748–1757, 2014.
 [2] N. Kumbartzky, M. Schacht, K. Schulz, and B. Werners, “Optimal operation of a CHP plant participating in the German electricity balancing and day-ahead spot market,” European Journal of Operational Research, vol. 261, no. 1, pp. 390–404, 2017.
 [3] A. Shapiro, D. Dentcheva, and A. Ruszczyński, Lectures on stochastic programming: modeling and theory. SIAM, 2009.
 [4] J. Nowotarski and R. Weron, “Recent advances in electricity price forecasting: A review of probabilistic forecasting,” Renewable and Sustainable Energy Reviews, vol. 81, pp. 1548–1568, 2018.
 [5] R. Schefzik and A. Moeller, “Ensemble postprocessing methods incorporating dependence structures,” in Statistical Postprocessing of Ensemble Forecasts. Elsevier, 2018, pp. 91–125.
 [6] S. Mohamed and B. Lakshminarayanan, “Learning in implicit generative models,” arXiv preprint arXiv:1610.03483, 2016.
 [7] T. Gneiting and A. E. Raftery, “Strictly proper scoring rules, prediction, and estimation,” Journal of the American Statistical Association, vol. 102, no. 477, pp. 359–378, 2007.
 [8] ENTSO-E Transparency Platform, transparency.entsoe.eu, 2019.
 [9] J. Nowotarski and R. Weron, “Computing electricity spot price prediction intervals using quantile regression and forecast averaging,” Computational Statistics, vol. 30, no. 3, pp. 791–803, Sep 2015.
 [10] T. Gneiting, A. E. Raftery, A. H. Westveld, and T. Goldman, “Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation,” Monthly Weather Review, vol. 133, no. 5, pp. 1098–1118, 2005.

 [11] S. Jewson, A. Brix, and C. Ziehmann, “A new parametric model for the assessment and calibration of medium-range ensemble temperature forecasts,” Atmospheric Science Letters, vol. 5, no. 5, pp. 96–102, 2004.
 [12] R. Weron, “Electricity price forecasting: A review of the state-of-the-art with a look into the future,” International Journal of Forecasting, vol. 30, no. 4, pp. 1030–1081, 2014.
 [13] T. Gneiting and M. Katzfuss, “Probabilistic forecasting,” Annual Review of Statistics and Its Application, vol. 1, no. 1, pp. 125–151, 2014.
 [14] A. Sklar, “Fonctions de répartition a n dimensions et leur marges,” Publ. Inst. Stat. Paris, vol. 8, pp. 131–229, 1959.

 [15] J.-F. Toubeau, J. Bottieau, F. Vallee, and Z. de Greve, “Deep learning-based multivariate probabilistic forecasting for short-term scheduling in power markets,” IEEE Transactions on Power Systems, p. 1, 2018.
 [16] S. Chai, Z. Xu, and Y. Jia, “Conditional density forecast of electricity price based on ensemble ELM and logistic EMOS,” IEEE Transactions on Smart Grid, vol. 10, no. 3, pp. 3031–3043, 2018.
 [17] D. S. Wilks, “Univariate ensemble postprocessing,” in Statistical postprocessing of ensemble forecasts. Elsevier, 2018, pp. 49–89.
 [18] J. E. Matheson and R. L. Winkler, “Scoring rules for continuous probability distributions,” Management science, vol. 22, no. 10, pp. 1087–1096, 1976.
 [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
 [20] Y. Chen, X. Wang, and B. Zhang, “An unsupervised deep learning approach for scenario forecasts,” in 2018 Power Systems Computation Conference (PSCC), June 2018, pp. 1–7.
 [21] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” arXiv preprint arXiv:1701.00160, 2016.

 [22] G. Dziugaite, D. Roy, and Z. Ghahramani, “Training generative neural networks via maximum mean discrepancy optimization,” in Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI), 2015, pp. 258–267.
 [23] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in International Conference on Machine Learning, 2015, pp. 1718–1727.
 [24] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” Journal of Machine Learning Research, vol. 13, pp. 723–773, 2012.
 [25] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of distance-based and RKHS-based statistics in hypothesis testing,” The Annals of Statistics, vol. 41, no. 5, pp. 2263–2291, 2013.
 [26] B. Uniejewski, R. Weron, and F. Ziel, “Variance stabilizing transformations for electricity spot price forecasting,” IEEE Transactions on Power Systems, vol. 33, no. 2, pp. 2219–2229, 2017.
 [27] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
 [28] F. Chollet et al., “Keras,” https://keras.io, 2015.
 [29] R. Koenker and G. Bassett Jr, “Regression quantiles,” Econometrica: journal of the Econometric Society, pp. 33–50, 1978.
 [30] P. Pinson, H. Madsen, H. A. Nielsen, G. Papaefthymiou, and B. Klöckl, “From probabilistic forecasts to statistical scenarios of shortterm wind power production,” Wind Energy, vol. 12, no. 1, pp. 51–62, 2009.

 [31] J. W. Messner, G. J. Mayr, and A. Zeileis, “Heteroscedastic censored and truncated regression with crch,” The R Journal, vol. 8, no. 1, pp. 173–181, 2016.