The importance of accurately quantifying forecast uncertainty has motivated much recent research on probabilistic forecasting. In particular, a variety of deep learning approaches has been proposed, with forecast distributions obtained as output of neural networks. These neural network-based methods are often used in the form of an ensemble based on multiple model runs from different random initializations, resulting in a collection of forecast distributions that need to be aggregated into a final probabilistic prediction. With the aim of consolidating findings from the machine learning literature on ensemble methods and the statistical literature on forecast combination, we address the question of how to aggregate distribution forecasts based on such deep ensembles. Using theoretical arguments, simulation experiments and a case study on wind gust forecasting, we systematically compare probability- and quantile-based aggregation methods for three neural network-based approaches with different forecast distribution types as output. Our results show that combining forecast distributions can substantially improve the predictive performance. We propose a general quantile aggregation framework for deep ensembles that shows superior performance compared to a linear combination of the forecast densities. Finally, we investigate the effects of the ensemble size and derive recommendations of aggregating distribution forecasts from deep ensembles in practice.READ FULL TEXT VIEW PDF
Probabilistic forecasts in the form of predictive probability distributions over future quantities or events aim to quantify the uncertainty in the predictions and are essential to optimal decision making in applications(Gneiting and Katzfuss, 2014; Kendall and Gal, 2017). Motivated by their superior performance on a wide variety of machine learning tasks, much recent research interest has focused on the use of deep neural networks (NNs) for probabilistic forecasting. Different approaches for obtaining a forecast distribution as the output of a NN have been proposed over the past years, including parametric methods where the NN outputs parameters of a parametric probability distribution (Lakshminarayanan et al., 2017; D’Isanto and Polsterer, 2018; Rasp and Lerch, 2018), semi-parametric approximations of the quantile function of the forecast distribution (Bremnes, 2020) and nonparametric methods where the forecast density is modeled as a histogram (Gasthaus et al., 2019; Li et al., 2021)
. To account for the randomness of the training process based on stochastic gradient descent methods, NNs are often run several times from different random initializations.Lakshminarayanan et al. (2017) refer to this simple to implement and readily parallelizable approach as deep ensembles. Deep ensembles of NN models for probabilistic forecasting thus yield an ensemble of predictive probability distributions. To provide a final probabilistic forecast, the ensemble of predictive distributions needs to be aggregated to obtain a single forecast distribution.
The problem of combining predictive distributions has been studied extensively in the statistical literature, see Gneiting and Ranjan (2013) and Petropoulos et al. (2022, Section 2.6) for overviews. Combining probabilistic forecasts from different sources has been successfully used in a wide variety applications including economics (Aastveit et al., 2019), epidemiology (Cramer et al., 2021; Taylor and Taylor, 2021), finance (Berkowitz, 2001), signal processing (Koliander et al., 2022) and weather forecasting (Baran and Lerch, 2016, 2018), and constitutes one of the typical components of winning submissions to forecasting competitions (Bojer and Meldgaard, 2021; Januschowski et al., 2021). On the other hand, forecast combination also forms the theoretical framework of some of the most prominent techniques in machine learning such as boosting (Freund and Schapire, 1996), bagging (Breiman, 1996)2001)
, which are based on the idea of building ensembles of learners and combining the associated predictions. Generally, the individual component models (or ensemble members) can be based on entirely distinct modeling approaches, or on a common modeling framework where the model training is subject to different input datasets of other sources of stochasticity. The latter is the case for deep ensembles where the main sources of uncertainty in the estimation are the random initialization of the network parameters and the stochastic gradient descent algorithm for the optimization. For general reviews on ensemble methods in machine learning, we refer toDietterich (2000), Zhou et al. (2002) and Ren et al. (2016).
While the arithmetic mean is a powerful and widely accepted method for aggregating single-valued point forecasts, the question how probabilistic forecasts should be combined is more involved and has been a focus of research interest in the literature on statistical forecasting (Gneiting and Ranjan, 2013; Lichtendahl et al., 2013; Petropoulos et al., 2022). We will focus on readily applicable aggregation methods for the combination of probabilistic forecasts from deep ensembles. A widely used approach is the linear aggregation of the forecast distributions, an approach that is often referred to as linear (opinion) pool (LP). In the context of deep ensembles, Lakshminarayanan et al. (2017) apply the LP and linearly combine density forecasts. An alternative is given by aggregating the forecast distributions on the scale of quantiles by linearly combining the corresponding quantile functions, an approach that is commonly referred to as Vincentization (VI; see, for example, Genest, 1992).
The main aim of our work is to consolidate findings from the statistical and machine learning literature on forecast combination and ensembling for probabilistic forecasting. Using theoretical arguments, simulation experiments and a case study on probabilistic wind gust forecasting, we systematically investigate and compare aggregation methods for probabilistic forecasts based on deep ensembles, with different ways to characterize the corresponding forecast distributions. This study is motivated by and based on our work in Schulz and Lerch (2022), where we use ensembles of NNs to statistically postprocess probabilistic forecasts for the speed of wind gusts and propose a common framework of NN-based probabilistic forecasting methods with different types of forecast distributions. In the following, we apply a two-step procedure by first generating an ensemble of probabilistic forecasts and then aggregating them into a single final forecast, which matches the typical workflow of forecast combination from a forecasting perspective. Alternatively, it is also possible to incorporate the aggregation procedure directly into the model estimation (Kim et al., 2021).
The remainder of the paper is organized as follows. Section 2 introduces relevant metrics for evaluating probabilistic forecasts and the forecast aggregation methods. Three NN-based methods for probabilistic forecasting are presented in Section 3 along with a discussion of how the different aggregation methods can be used to combine the corresponding predictive distributions of an ensemble of such forecasts. In Section 4, we conduct a comprehensive simulation study that is followed up by a case study on probabilistic weather prediction in Section 5. Section 6 concludes with a discussion. R (R Core Team, 2020) code with implementations of all methods is available online (https://github.com/benediktschulz/agg_distr_deep_ens).
Probabilistic forecasts given in the form of predictive probability distributions for future quantities or events aim to quantify the uncertainty inherent to the prediction. In the following, we first summarize how such distribution forecasts can be evaluated, and then formally introduce the LP and VI methods for aggregating probabilistic forecasts.
In our evaluation of predictive performance, we will follow the principle of Gneiting et al. (2007) that a probabilistic forecast should aim to maximize sharpness subject to calibration. Calibration refers to the statistical consistency between the forecast distribution and the observation, whereas sharpness is a property of the forecast alone and refers to the degree of forecast uncertainty. A forecast is said to be sharper, the smaller the associated uncertainty.
Quantitatively, calibration and sharpness can be assessed simultaneously using proper scoring rules (Gneiting and Raftery, 2007). A scoring rule assigns a penalty to a pair of a probabilistic forecast and corresponding observation and is called proper if the underlying true distribution scores lowest in expectation, that is, for all if . Our forecast evaluation in the following will mainly focus on the widely used continuous ranked probability score (CRPS; Matheson and Winkler, 1976)
is a forecast distribution with finite first moment andis the indicator function. Proper scoring rules such as the CRPS are not only used for forecast evaluation but also provide valuable tools for estimating model parameters. In the case of the CRPS, the estimation typically relies on closed-form analytical expressions of the integral in (2.1) (see, for example, Jordan et al., 2019) and is referred to as optimum score estimation (Gneiting and Raftery, 2007).
To compare competing forecasting methods based on a proper scoring rule with respect to a benchmark and an optimal forecast, we calculate the associated skill score, for example, the continuous ranked probability skill score (CRPSS). Let denote the mean score of the forecasting method of interest over a given dataset, the corresponding mean score of the benchmark forecast, and that of the (typically hypothetical) optimal forecast. The associated skill score is then calculated via
and simplifies to
if . In contrast to proper scoring rules, skill scores are positively oriented with 1 indicating optimal predictive performance, 0 no improvement over the benchmark and a negative skill a decrease in performance.
Further, we assess calibration qualitatively via histograms of the probability integral transform111Technically, we here use the unified PIT, a generalization proposed in Vogel et al. (2018), due to the format of some of the aggregated forecast distributions. (PIT),
. A probabilistic forecast is (well-)calibrated, if the PIT is uniformly distributed, resulting in a flat histogram. An U-shaped PIT histogram indicates underdispersion (or overconfidence), that is, a lack of spread in the forecast distribution, whereas a hump-shaped histogram indicates overdispersion (or underconfidence), that is, too much spread. In addition, we will generate quantile-based prediction intervals (PIs) to assess the calibration of the forecast distributions via the empirical coverage, and the sharpness via the length of the PIs. If a forecast is well-calibrated, the empirical coverage should resemble the nominal coverage, and a forecast is the sharper, the smaller the length of the PI. The nominal level of the PIs is a tuning parameter for evaluation, we here choose the specific level offrom the application in Schulz and Lerch (2022), which forms the basis of our case study in Section 5. Finally, we measure accuracy based on the mean forecast error of the median derived from the predictive distribution, , , which is positive in case of overforecasting and negative for underforecasting. For further background and details on the assessment of probabilistic forecasts, we refer to Schulz and Lerch (2022, Appendix A) and the references therein.
In the following, given
individual probabilistic forecasts we aim to aggregate, we will denote their cumulative distribution functions (CDFs) byand their quantile functions by . The aggregation methods introduced below will typically assign weights , …, to the individual forecast distributions.
The most widely used approach for forecast combination is the LP, which is the arithmetic mean of the individual forecasts (Stone, 1961)
. For probabilistic forecasts, the LP is calculated as the (in our case equally) weighted average of the predictive CDFs and results in a mixture distribution. Equivalently, the LP can be calculated by averaging the probability density functions (PDFs). We define the predictive CDF of the LP via
where for with . Note that the weights need to sum up to one to ensure that yields a valid CDF.
The LP has some appealing theoretical properties222For example, Lichtendahl et al. (2013) and Abe et al. (2022) show that the score of the LP forecast is at least as good as the average score of the individual components in terms of different proper scoring rules. and has been the prevalent forecast aggregation method over the last decades. For example, Lakshminarayanan et al. (2017) use the LP to combine density forecasts of multiple NNs introducing the term deep ensembles.
However, there are disadvantages to the use of the LP that is known to have suboptimal properties when aggregating probabilities, since a linear combination of probability forecasts results in less sharp and more underconfident forecasts (Ranjan and Gneiting, 2010). Gneiting and Ranjan (2013) extend this result to the general case of predictive distributions by showing that in case of distribution forecasts sharpness decreases and dispersion increases. In particular, a (non-trivial) combination of calibrated forecasts is not calibrated anymore. In the context of deep ensembles, these downsides have also been observed in recent studies (Rahaman and Thiery, 2020; Wu and Gales, 2021).
In our simulation and case study conducted in the following, we apply the aggregation methods to forecasts produced by the same data-generating mechanism based on an ensemble of NNs, which differ only in the random initialization. Therefore, we do not expect systematic differences between the individual forecasts and only consider equally weighted averages. In the following, we will refer to the LP as the equally weighted average given by for in (2.3). Figure 1 illustrates the effect of forecast combination via the LP.
While the LP aggregates the forecasts on a probability scale, VI performs a quantile-based linear aggregation (Vincent, 1912; Ratcliff, 1979; Genest, 1992). We extend the standard VI framework333To the best of our knowledge, VI is usually only applied with standardized weights with , and without the intercept . Exceptions include Wolffram (2021) and related, unpublished simulation experiments by Anja Mühlemann (University of Bern, 2020, personal communication). by defining the VI quantile function via
where and for . In contrast to the LP, the weights do not need to sum to one and only their non-negativity is required to ensure the monotonicity of the resulting quantile function .444Note that in general is not the quantile function corresponding to the CDF of the LP, even for and equal weights. Further, a real-valued intercept is added to the aggregated quantile functions to correct for systematic biases.
As for the LP, we only consider equally weighted averages for VI, that is, for . Given equal weights, we consider four different variants of VI. First, with weights that sum up to 1 and no intercept, that is, and , which is referred to by V. Similar to the LP, V does not require the estimation of any parameters. Further, we consider VI variants where we estimate the parameters and both independently (while the other is fixed at the values of V) and also simultaneously, resulting in the three variants V (where and is estimated), V (where and is estimated) and V (where both and
are estimated). The parameters are estimated minimizing the CRPS following the optimum scoring principle. The standard procedure for training machine learning models where the available data is split into training, validation and test datasets offers a natural choice for estimating the combination parameters. Given NN models estimated based on the training set (where the validation set is used to determine hyperparameters), we estimate the coefficients of the VI approaches separately in a second step based on the validation set, which can be seen as a post-hoc calibration step(Guo et al., 2017). During this second step, the component models with quantile functions are considered fixed and we only vary the combination parameters in (2.4). In the following, we will restrict our attention to fixed training and validation sets, but an extension of the approach described here to a cross-validation setting is straightforward. Table 1 provides an overview of the abbreviations and important characteristics of the different forecast aggregation methods we will consider below.
VI (in the form of V) has recently received more research interest in the machine learning literature and has for example been used by Kirkwood et al. (2021) and Kim et al. (2021) to aggregate probabilistic predictions. Related work in the statistical literature includes comparisons to the LP which demonstrate that VI tends to perform better than the LP (Lichtendahl et al., 2013; Busetti, 2017).
Regarding the different NN-based methods for probabilistic forecasting that will be introduced in Section 3, we now consider the special case of VI for location-scale families. Given a CDF , a distribution is said to be an element of a location-scale family if its CDF satisfies
where denotes the location and the scale parameter. Popular examples of location-scale families include the normal and logistic distributions. Unlike the LP, which results in a wide-spread, multi-modal distribution, VI is shape-preserving for location-scale families (Thomas and Ross, 1980). Shape-preserving here means that if the individual forecasts are elements of the same location-scale family, the aggregated forecast is as well. Further, the parameters of the aggregated forecast and are given by the weighted averages of the individual parameters and , , together with the intercept in case of the location parameter, that is,
Here, we will only consider the case of for . Lichtendahl et al. (2013), who compare the theoretical properties of the LP and V
, note that the aggregated predictive distributions both yield the same mean but the VI forecasts are sharper, that is, the VI predictive distribution has a variance smaller or equal to that of the LP.
To highlight the effects of the individual VI parameters, we note that the intercept only has an effect on the location of the resulting aggregated distribution, while the weight has an effect on both the location and the spread. If it is larger than 1, the spread increases compared to the average spread of the individual forecasts, and it decreases for values smaller than 1. However, a weight not equal to 1 also shifts the location of the distribution. Figure 1 illustrates this in the exemplary case of two normal distributions.
In the context of probabilistic wind gust prediction, Schulz and Lerch (2022) propose a framework for NN-based probabilistic forecasting that encompasses different approaches to obtain distribution forecasts as the output of a NN. The general framework is illustrated in Figure 2 and forms the basis of our work here. In the following, we briefly introduce three network variants and refer to Schulz and Lerch (2022) for details.
While the three variants differ in their characterization of the forecast distribution and the loss function employed in the NN, their use in practice shares a common methodological feature that constitutes the main motivation for our work here. As discussed in the introduction, extant practice in NN-based forecasting often relies on an ensemble of NN models trained based on randomly initialized weights and batches to account for the randomness of the stochastic gradient descent methods applied in the training process. This raises the question of how the distribution forecast from the three network variants can be combined using the aggregation methods described in Section2.2, which we will discuss below.
In the distributional regression network (DRN) approach, the forecasts are issued in the form of a parametric distribution. Under the parametric assumption
, the predictive distribution is characterized by the distribution parameter (vector), where is the parameter space. Different variants of the DRN approach have been proposed over the past years and can be traced back to at least Bishop (1994). Lakshminarayanan et al. (2017) and Rasp and Lerch (2018) use a normal distribution with , Schulz and Lerch (2022) use a zero-truncated logistic distribution with , where for both distributions is the location and the scale parameter, and Bishop (1994) and D’Isanto and Polsterer (2018) use a mixture of normal distributions. To estimate the parameters of the NN, proper scoring rules such as the CRPS (Rasp and Lerch, 2018; D’Isanto and Polsterer, 2018; Schulz and Lerch, 2022) or the negative log-likelihood (Lakshminarayanan et al., 2017) serve as custom loss functions. Extensions of the DRN approach to other parametric families are generally straightforward provided that analytical closed-form expressions of the selected loss function are available (for example, Ghazvinian et al., 2021; Chapman et al., 2022).
Both Lakshminarayanan et al. (2017) and Rasp and Lerch (2018) generate an ensemble of networks based on random initilalization. While Lakshminarayanan et al. (2017) propose to use the LP to aggregate the forecast distributions, Rasp and Lerch (2018) instead combine the forecasts by averaging the distribution parameters. Since the normal distribution (which we will also employ in the simulation study below) is a location-scale family, parameter averaging is equivalent to V. Although the logistic distribution also forms a location-scale family, the truncated variant used in Schulz and Lerch (2022) does not, and parameter averaging is not equivalent to V. However, in the context of the case study we found the differences between parameter averaging and V to be negligibly small and, in this particular case, therefore approximated the VI approaches by the corresponding parameter averages. To evaluate the LP forecasts, we draw a random sample of size 1 000 from the mixture distribution by first randomly choosing an ensemble member and then generating a random draw from the corresponding distribution.
Bremnes (2020) proposes a semi-parametric extension of the DRN approach we refer to as Bernstein quantile network (BQN). The probabilistic forecast is given in form of the quantile function , which is modeled as a linear combination of Bernstein polynomials, that is,
where and is the -th basis Bernstein polynomial of degree , . The basis coefficients , which define the predictive distribution, are obtained as output of the NN. The parameters of the NN are estimated by minimizing the quantile loss evaluated at pre-defined quantile levels. Note that the support of the forecast distribution is given by .
To aggregate ensembles of BQN forecasts, Bremnes (2020) and Schulz and Lerch (2022) average the individual basis coefficient values across ensemble members. This is equivalent to V, which is obvious from the quantile function of the general case of VI for BQN forecasts,
where is the coefficient of the -th basis polynomial of the -th ensemble member, , .
Since a closed form of the CDF or density of a BQN forecast is not readily available, the LP cannot be expressed in a similar fashion. Analogous to DRN, the evaluation of the LP forecasts will therefore be based on a random sample of size 1 000 drawn from the aggregated distribution. Here, the inversion method allows to sample from the individual BQN forecasts. Further, the VI forecasts are evaluated based on a sample of 100 equidistant quantiles.555The numbers of samples and quantiles were chosen based on simulation experiments and theoretical considerations. Compared to random samples from the forecast distributions, a smaller number of equidistant quantiles is required to achieve approximations of the same accuracy, see Krüger et al. (2021) and references therein for a discussion of sample-based estimation of the CRPS.
The last method considered here is the histogram estimation network (HEN) which divides the support of the target variable in bins and assigns each bin the probability of the observation falling in that bin. Variants of this approach have been proposed in a variety of applications (for example, Gasthaus et al., 2019; Li et al., 2021). Mathematically, the HEN forecast is given by a piecewise uniform distribution. Let denote the edges of the bins with probabilities , , where it holds that . The CDF of a HEN forecast is then given by the piecewise linear function
We here follow Schulz and Lerch (2022) by considering fixed bins and estimate only the corresponding probabilities as output of the NN. In the simulations study, the edges are given by 50 equidistant empirical quantiles of the training data (unique to the second digit), and for the case study, we use a semi-automated procedure specific to the application that is described in detail in Schulz and Lerch (2022). As for DRN, the parameters can be estimated via CRPS minimization or maximum likelihood. Here, we use the latter, which corresponds to minimizing the categorical cross-entropy, a standard approach for classification tasks in machine learning.
Regarding the aggregation of an ensemble of HEN forecasts in case of fixed bins, the LP is equivalent to averaging the bin probabilities since
where and is the probability of the -th bin for the -th ensemble member, , . An exemplary application of the LP for an approach akin to HEN forecasts in a stacked NN can be found in Clare et al. (2021).
By contrast to the LP, the VI approach exhibits a particular advantage for HEN forecasts in that it results in a finer binning than the individual HEN models. To illustrate this effect, we note that the quantile function of the HEN forecast is a piecewise linear function with edges depending on the accumulated bin probabilities, that is, , . In mathematical terms, the quantile function is given for by
Therefore, the resulting VI quantile function is a piecewise linear function with one edge for each accumulated probability of the individual forecasts. As the forecast probabilities differ for each member of the deep ensemble, the associated quantile functions are subject to a different binning. Since the set of edges of the aggregated VI forecast is given by the union of all individual edges, this leads to a smoothed final forecast distribution with a finer binning than the individual model runs that differs for every forecast case, and eliminates the potential downside of too coarse fixed bin edges. Figure 3 illustrates the effects of the LP and V for two exemplary HEN forecasts.
We compare the performance of the five aggregation methods for each of the three network variants in a simulation study. The simulation setting is adopted from Li et al. (2021), who investigate a variant of the HEN approach. From the six models they propose for the data-generating process, we consider two here and report results for the remaining four in the supplemental material.
We do not tune the specific architectures and hyperparameters of the individual NN models in each of the scenarios of the simulation study, but instead use the configurations that have proven to work well in Schulz and Lerch (2022). This is done intentionally in order to also generate forecasts that are not well-calibrated or subject to systematic biases, which allows us to assess the performance of the aggregation methods in situations when the forecasts are not optimal, see, in particular, the results for Scenario 2 reported below.
For each scenario, we generate training sets of size 6,000 and test sets of size 10,000. The simulations are repeated 50 times. We generate a series of 40 individual network ensemble members for each of the three network variants and consider aggregation of the first members, where . As benchmark, we will consider an optimal probabilistic forecast based on the inherent uncertainty of the data-generating process denoted by the noise term .666Note that the simulations are based on a finite sample, so even the optimal forecast might result in a small bias or an empirical coverage not equal to the nominal value. In the following, the CRPSS in (2.2) will be calculated using the average CRPS of the individual networks for and the CRPS of the optimal forecast for .777 Note that this does not correspond to the mean improvement over the individual forecasts. However, averaging the median skill scores of the individual ensemble member predictions over the repetitions of the simulations yields qualitatively analogous results (not shown).
As our first simulation scenario, we consider a linear model with normally distributed errors. Based on a random vector of predictors , which serves as the input of the networks, and the random coefficient vectors , which are fixed for each run of the simulation and unknown to the forecaster, the target variable is calculated via
where , , and . The optimal forecast is then given by
The key results for this simulation scenario are summarized in Figure 4, which shows different evaluation metrics averaged over the 50 repetitions of the simulation study. We start by comparing the aggregation methods for DRN, where the CRPSS indicates that aggregation via the VI approaches improves the network average by up to 12.5%, while the LP improves the forecast performance by at most 2.5%. Here, the best VI approach is to fix the intercept and weights instead of estimating them from the training data. In Figure 4, the relative weight difference of the estimated weight and a standardized weight for an ensemble of size , given by illustrates that the estimated weights are not equal to standardized weights. The flat PIT histograms in Figure 5 indicate that the individual component forecasts are already well-calibrated and corrections via coefficient estimation are not necessary. The average PI length of the network forecasts, which is identical to that of V and V, is smaller than that of the optimal forecast. Note that having sharper forecasts than comes at the cost of a lack in calibration. Comparing the aggregation methods, we find that the LP increases the PI length as expected due to its theoretical properties. V and V here increase the PI length because their estimated weights are larger than standardized ones. All aggregation methods increase the empirical coverage, which improves the predictive performance, because the coverage of the network average is smaller than the nominal value. In terms of accuracy, all methods are unbiased since they are approximately as accurate as the optimal forecast.
For BQN, the results are qualitatively similar, however, since the BQN forecasts are not as well-calibrated as those of DRN, there are some differences which we highlight in the following. The estimated weights of V and V are larger than standardized ones and result in a smaller CRPSS difference to V, which still performs best. The V and V forecasts are therefore less sharp than the network average and as sharp as the LP. The empirical coverage of the individual BQN forecasts is larger than the nominal value, thus that of the forecasts aggregated via VI approaches is as well. Interestingly, the LP decreases the coverage and is closest to the nominal value. Further, the VI forecasts are positively biased, the LP is instead close to being unbiased. Although the LP performs favorable in terms of the empirical coverage and accuracy, it performs worse than the VI approaches, even though the difference is smaller than in the case of DRN.
In contrast to DRN and BQN, the HEN forecasts are not well-calibrated but instead overdispersed, as indicated by the bulk-shaped histograms in Figure 5. In addition to the lack of calibration, the forecasts are also not sharp since the PIs are more than twice as large as those of the optimal forecast. These deficiencies result in a substantially worse CRPS compared to DRN and BQN. While the LP, V and V are unable to correct the systematic miscalibration, V and V result in well-calibrated forecasts, which is indicated by the flat PIT histograms in Figure 5. The estimated weights are smaller than standardized ones for all ensemble sizes, therefore the forecasts become sharper. The PI coverage of the overdispersed forecasts is, as expected, 2.5% larger than the nominal value. The corrections of V and V result in coverages closer to and even smaller than the nominal value. Further, note that V estimates smaller weights than V, but also a positive intercept larger than that of V in order to balance the effect of the weights on the location of the aggregated distribution. However, the positive intercepts estimated by V and V result in a larger bias. The correction of the overdispersion is also reflected in the CRPSS, where V and then V outperform the other approaches by a wide margin. Note that the LP improves the predictive performance and performs equally well as V and V. However, although all aggregation methods correct systematic errors and improve predictive performance, the aggregated forecasts are still not competitive to those of DRN and BQN in terms of the CRPS.
Finally, we investigate the effect of the ensemble size on the performance of the aggregation methods, in particular on the CRPSS, considering the variability over the 50 runs (Figure 6). For all network variants and aggregation methods, most improvement is obtained up to ensembles of size 10. The median CRPSS increases up to a size of 20 after which only minor further improvements can be observed. Interestingly, the variability over the runs does not decrease for larger ensemble sizes. Comparing the aggregation methods, more outliers are observed for methods that estimate parameters. For DRN, we see that parameter estimation may result in forecasts worse than the network average in a few cases, on the other hand the same results are obtained for V forecasts in case of HEN. Regarding systematic differences between the network variants, we find that the variation across simulation runs is notably lower for HEN. This can partly be explained by the fact that the DRN and BQN forecasts are much closer to the optimal forecasts and thus small absolute deviations in the CRPS result in larger differences in the skill.
In the second scenario, we consider a skewed distribution with a nonlinear mean function. The target variableis defined by
where , , and . The optimal forecast is given by
PIT histograms of the individual and aggregated forecasts for the different network variants are shown in Figure 7. In contrast to the first scenario, none of the network variants produces calibrated forecasts and their PIT histograms indicate systematic deviations from uniformity. As to be expected due to the wrong distributional assumption, DRN based on a normal distribution is not able to yield calibrated forecasts for an underlying skewed normal distribution, but also the semi-parametric BQN and HEN forecasts fail to provide calibrated forecasts. The HEN forecasts are strongly overdispersed and again result in the worst CRPS among the network variants, see Figure 8. None of the aggregation methods is able to correct the systematic lack of calibration for all of the network variants. That said, aggregation still improves the overall predictive performance in terms of the CRPS.
For DRN, we find that the LP outperforms the VI approaches. Interestingly, this is the case even though the LP forecasts are the least sharp and have a higher PI coverage that is farther away from the nominal value. In contrast to the first scenario, even the aggregated DRN forecasts perform notably worse than BQN. The VI approaches perform equally well and increase the coverage of the forecasts such that they are closer to the nominal value. While V and V estimate coefficients close to the nominal values, V estimates larger weights, and therefore yields larger PIs, and a negative intercept in order to balance the shift in the location. Still, V does not outperform the other VI approaches.
The results are again qualitatively similar for BQN. The main difference is that the LP does not outperform the VI approaches, as all aggregation methods result in an improvement in terms of the mean CRPS of up to 16%. Further, the LP again yields the least sharp forecasts and all methods increase the PI coverage.
Next, we consider the HEN forecasts. In contrast to Scenario 1, weight estimation via V and V is not able correct the systematic errors and outperform the other approaches. All VI approaches perform equally well and outperform the LP, which still improves the network average. The LP yields the least sharp forecasts, followed by V that estimates weights larger than standardized ones together with a negative intercept, as for DRN and BQN. The negative intercepts of V and V improve the accuracy, as they decrease the forecast bias.
Regarding the effect of the ensemble size, the largest improvements of the aggregation methods are again obtained for up to 10 ensemble members and only minor improvements can be observed for sizes larger than 20, see Figure 9. In contrast to Scenario 1, it can be noted that the variability over the runs decreases as the ensemble size increases, and that the degree of variability is similar for all aggregation methods within one network variant. A direct comparison of the network variants indicates that the variability generally increases with the overall skill of the aggregated forecast.
Modern weather forecasts are usually based on ensemble simulations from numerical weather prediction (NWP) models, consisting of a set of deterministic forecasts that differ in their initial conditions (Bauer et al., 2015). These ensemble weather predictions continue to be subject to systematic errors and require corrections via distributional regression models, a process which is referred to as postprocessing. Over the past years, much research interest has been focused on modern machine learning approaches for postprocessing, where NN models enable the incorporation of arbitrary input predictors and yield flexible forecast distributions. Exemplary NN-based postprocessing methods include the approaches introduced in Section 3, see Vannitsem et al. (2021) for an overview of recent developments.
Our case study focuses on the application of the aggregation methods to probabilistic wind gust forecasting over Germany using forecast distributions obtained as the output of NN methods for ensemble postprocessing, and is based on Schulz and Lerch (2022). We use the same dataset and consider 4 of the 22 available forecast lead times (0, 6, 12, 18h). In the following, we will typically evaluate the predictive performance aggregated over those lead times and note that while there are minor differences across lead times, the results are qualitatively similar and all the main conclusions are valid for all the considered lead times.
In our implementation of DRN, BQN and HEN for probabilistic forecasting via ensemble postprocessing, we follow Schulz and Lerch (2022) and refer to the descriptions there for implementation details including minor technical adjustments to the general descriptions in Section 3. For each of the lead times, we generate an ensemble of 100 models for each of the network variants based on random initialization, which form the basis for our study of the different aggregation methods. We randomly draw a subset of these 100 models for each of the considered ensemble sizes and repeat this procedure 20 times to account for uncertainties. Therefore, 20 aggregated forecasts based on a pool of 100 network ensemble members are generated for each model variant and each ensemble size.
Note that the underlying distribution of the target variable is of course unknown in the case study, and following common practice the observed value is used as the (hypothetical) optimal forecast resulting in a CRPS of 0 to calculate the CRPSS in (2.2). The magnitude of the CRPSS values of the simulation and case study is thus not directly comparable.
Figure 10 summarizes the key results of the case study. Applying the aggregation methods to the DRN and BQN forecasts leads to similar results as in the simulation study. Although the LP improves predictive performance with a skill of up to 1.6%, the VI approaches are superior to the LP. Among the considered VI approaches, coefficient estimation leads to better predictions with V and V performing best, followed by V and V.
As expected, the LP yields less sharp forecasts than the network average indicated by larger PIs, which are the least sharp for DRN. V also issues less sharp forecasts than the network average, which is identical to that of V and V, and V produces the sharpest forecasts. This is due to the fact that V estimates weights smaller and V larger than the nominal value of V. As in the simulation study, V estimates a more extreme intercept than V, which balances the effect of the weight estimation. The PI coverage increases for all aggregation methods and both network variants. For both network variants, the V forecasts have the smallest coverage closest to the nominal value, whereas V results in a coverage larger than the other VI approaches. Only for DRN, the LP has a larger PI coverage.
The results of the aggregated HEN forecasts are again qualitatively different from those of DRN and BQN. While still outperforming benchmark methods from statistics and machine learning, the HEN method does not perform as well as DRN and BQN in the comparison in Schulz and Lerch (2022) and is subject to more systematic errors. Although the ranking of the aggregation methods is identical to that of DRN and BQN, the magnitude of the differences in the CRPSS for the superior method is notably larger, and since V is able to improve some of the systematic errors, it clearly outperforms the other approaches. The most significant difference to the other aggregation methods is that V estimates more extreme coefficients. As for BQN, this results in the largest PIs and largest coverage, in both cases followed by the LP.
Regarding the accuracy of the forecasts produced by the different aggregation methods, the results are qualitatively similar for all three network variants. The two methods that estimate an intercept have the largest absolute biases. This is a somewhat counter-intuitive result, since it can be expected that including an intercept should enable the correction of systematic biases. As noted in Schulz and Lerch (2022), there are minor structural differences in the distribution of the observed values in the test and validation datasets. Due to the average observed values in the test data being somewhat smaller than those in the validation dataset, the data the coefficients are estimated on is not fully representative of the test data and overfitting may occur.
To assess the effect of the ensemble size on the predictive performance in Figure 11, we pick one specific lead time, namely 18h, to avoid distortions in the boxplots caused by the minor variations in the magnitude of the improvements over lead times. The results coincide with the corresponding main conclusions of the simulation study in that we observe most improvement up to ensembles of size 10 and only minor for ensembles of size larger than 20. In the case study, the improvement up to size 10 is more pronounced than in the simulation study and strongly suggests that a network ensemble should include at least 10 members. Finally, we note that the variability of the CRPSS decreases for larger ensemble sizes.
We have conducted a systematic comparison of aggregation methods for the combination of distribution forecasts from ensembles of neural networks based on different random initializations, so-called deep ensembles. In doing so, our work aims to reconcile and consolidate findings from the statistical literature on forecast combination and the machine learning literature on ensemble methods. Specifically, we propose a general Vincentization framework where quantile functions of the forecast distributions can be flexibly combined, and compare to the results of the widely used linear pool, where the probabilistic forecasts are linearly combined on the scale of probabilities. For deep ensembles of three variants of NN-based models for probabilistic forecasting that differ in the characterization of the output distribution, aggregation with both the LP and VI improves the predictive performance. The VI approaches show superior performance compared to the LP. For example, given ensemble members that are already calibrated, V preserves the calibration and improves the predictive accuracy while the LP decreases sharpness with more dispersed forecasts. If the individual forecast distributions are subject to systematic errors such as biases and dispersion errors, coefficient estimation via V, V and V is able to correct these errors and improve the predictive performance considerably, otherwise V should be preferred. While these combination approaches require the estimation of additional combination coefficients, the computational costs are negligible compared to the generation of the NN-based probabilistic forecasts and can be performed on the validation data without restricting the estimation of the NNs.
Even though forecast combination generally improves the predictive performance, Scenario 2 of the simulation study demonstrates that for example the lack of calibration of the severely misspecified individual forecast distributions cannot be corrected by the aggregation methods considered here. In the context of NNs and deep ensembles, the calibration of (ensemble) predictions and re-calibration procedures have been a focus of much recent research interest (Guo et al., 2017; Ovadia et al., 2019). For example, in line with the results of Gneiting and Ranjan (2013), deep ensemble predictions based on the LP were found to be miscalibrated and should be re-calibrated after the aggregation step (Rahaman and Thiery, 2020; Wu and Gales, 2021). A wide range of re-calibration methods, which simultaneously aggregate and calibrate the ensemble predictions (such as the V, V and V approaches presented in Section 2.2.2 for VI), have been proposed in order to correct the systematic errors introduced by the LP in the context of probability forecasting for binary events (Allard et al., 2012)
. For example, the beta-transformed LP composites the CDF of a Beta distribution with the LP(Ranjan and Gneiting, 2010), and Satopää et al. (2014)
propose to aggregate probabilities on a log-odds scale. Some of these approaches can be readily extended to the case of forecast distributions considered here(Gneiting and Ranjan, 2013). For VI, more sophisticated approaches that allow the weights to depend on the quantile levels might improve the predictive performance (Kim et al., 2021). Further, moving from a linear combination function towards more complex transformations allowing for non-linearity might help to correct more involved calibration errors.
We have restricted our attention to ensembles of NN-based probabilistic forecasts generated based on random initializations. While such deep ensembles have been demonstrated to work well in many settings (Lee et al., 2015; Fort et al., 2019; Ovadia et al., 2019), a variety of alternative approaches for uncertainty estimation in NNs has been proposed including Bayesian NNs (Neal, 2012) or generative models (Mohamed and Lakshminarayanan, 2016). A particularly prominent approach to deal with the uncertainty in the estimation of NNs is dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016). Dropout can not only be used as a regularization method during estimation but also for prediction, which results in an ensemble of forecasts and is readily applicable for the different variants of NN methods considered here. Compared to deep ensembles based on random initialization, a potential advantage of dropout-based ensembles is that the lower computational costs make the generation of larger ensembles more feasible. The aggregation methods we investigated are agnostic to the generation of the ensembles of distribution forecasts provided that they can be considered as realizations of the same basic type of model, and are thus readily applicable to dropout-based ensembles.888 In experiments with dropout ensembles in the context of the case study (not shown), we found that aggregating forecast distributions improves the predictions, but the overall performance of both the individual and combined dropout-based forecasts is substantially worse compared to the ensembles considered here. Therefore, an interesting avenue for future work is to investigate the performance of the combination methods for different approaches to generate NN-based probabilistic forecasts, for example within the framework of comprehensive simulation testbeds (Osband et al., 2021).
Finally, we summarize three key recommendations for aggregating distribution forecasts from deep ensembles based on our results. First, to optimize the final predictive performance of the aggregated forecast, the individual component forecasts should be optimized as much as possible.999 Abe et al. (2022) find that deep ensembles do not offer benefits compared to single larger (that is, more complex) NNs. Our results do not contradict their findings since we address a conceptually different question and argue that given the generation of a deep ensemble, the individual members’ forecasts should be optimized as much as possible. In this situation, a single NN will generally not be able to match the predictive performance of the associated deep ensemble. While forecast combination improves predictive performance, it generally did not effect the ranking of the different NN-variants for generating probabilistic forecasts, and can be unable to fix substantial systematic errors. Second, generating an ensemble with a size of a least 10 appears to be a sensible choice, with only minor improvements being observed for more than 20 members. This corresponds to the results in Fort et al. (2019) and ensemble sizes typically chosen in the literature (Lakshminarayanan et al., 2017; Rasp and Lerch, 2018), but the benefits of generating more ensemble members need to be balanced against the computational costs, and sometimes smaller ensembles have been suggested (Ovadia et al., 2019; Abe et al., 2022). Third, aggregating forecast distributions via VI is often superior to the LP. Thereby, the choice of the specific variant within the general framework depends on potential misspecifications of the individual component distributions, as discussed above. Note that these conclusions, in particular the superiority of the quantile aggregation approaches, refer to the specific situation of deep ensembles considered here. The property of shape-preservation justifies the use of VI from a theoretical perspective in a setting where the ensemble members are based on the same model and data. If the ensemble members differ in terms of the model used to generate the forecast distribution or the input data they are based on, shape-preservation might not be desired. Instead, a model selection approach based on the LP, which allows for obtaining a multi-modal forecast distribution, might better represent the possible scenarios that may materialize.
The research leading to these results has been done within the project C5 “Dynamical feature-based ensemble postprocessing of wind gusts within European winter storms” of the Transregional Collaborative Research Center SFB/TRR 165 “Waves to Weather” funded by the German Research Foundation (DFG). Sebastian Lerch gratefully acknowledges support by the Vector Stiftung through the Young Investigator Group “Artificial Intelligence for Probabilistic Weather Forecasting”. We thank Daniel Wolffram, Eva-Maria Walz, Anja Mühlemann, Alexander Jordan and Tilmann Gneiting for helpful comments and discussions.
What uncertainties do we need in Bayesian deep learning for computer vision?In Proceedings of the 31st International Conference on Neural Information Processing Systems. 5580–5590.
Predictive inference based on Markov chain Monte Carlo output.International Statistical Review, 89, 274–301.
Combining multiple probability predictions using a simple logit model.International Journal of Forecasting, 30, 344–356.
Here, we present additional results for the remaining scenarios proposed in Li et al. (2021). Their models 1 and 4 correspond to our scenarios 1 and 2 considered in the main text. The results for their models 5 and 6 are almost identical to those of their model 2, and are thus not included here.
This scenario is based on model 2 of Li et al. (2021) and uses a mixture distribution with a nonlinear mean function,
where , , , , . The optimal forecast is given by the conditional distribution of , that is,
This scenario is based on model 3 of Li et al. (2021) and also uses a mixture distribution with a nonlinear mean function
where , , , , . The optimal forecast is given by the conditional distribution of , that is,
Analogous to the presentation of the results for the first two simulation scenarios, results for Scenario 3 are shown in Figures S1–S3, and results for Scenario 4 in Figures S4–S6. The overall results and main conclusions are qualitatively similar to those discussed for the first two simulation settings in the main text, we thus limit the discussion here to some noteworthy differences. In both additional settings, the magnitude of improvements by forecast aggregation is smaller, in particular for Scenario 4. Scenario 4 further provides the only case where the HEN forecasts are better calibrated than those of DRN and BQN, and also substantially larger improvements by forecast combination are observed for HEN.