# Statistical post-processing of hydrological forecasts using Bayesian model averaging

Accurate and reliable probabilistic forecasts of hydrological quantities like runoff or water level are beneficial to various areas of society. Probabilistic state-of-the-art hydrological ensemble prediction models are usually driven with meteorological ensemble forecasts. Hence, biases and dispersion errors of the meteorological forecasts cascade down to the hydrological predictions and add to the errors of the hydrological models. The systematic parts of these errors can be reduced by applying statistical post-processing. For a sound estimation of predictive uncertainty and an optimal correction of systematic errors, statistical post-processing methods should be tailored to the particular forecast variable at hand. Former studies have shown that it can make sense to treat hydrological quantities as bounded variables. In this paper, a doubly truncated Bayesian model averaging (BMA) method, which allows for flexible post-processing of (multi-model) ensemble forecasts of water level, is introduced. A case study based on water level for a gauge of river Rhine, reveals a good predictive skill of doubly truncated BMA compared both to the raw ensemble and the reference ensemble model output statistics approach.

## Authors

• 9 publications
• 3 publications
• 3 publications
04/30/2021

### Calibration of wind speed ensemble forecasts for power generation

In the last decades wind power became the second largest energy source i...
09/11/2018

### Statistical post-processing of ensemble forecasts of temperature in Santiago de Chile

Currently all major meteorological centres generate ensemble forecasts u...
05/23/2018

### Neural networks for post-processing ensemble weather forecasts

Ensemble weather predictions require statistical post-processing of syst...
12/16/2011

08/25/2019

### Probabilistic Forecasting of the Arctic Sea Ice Edge with Contour Modeling

Sea ice, or frozen ocean water, annually freezes and melts in the Arctic...
08/31/2019

### Quantification of predictive uncertainty in hydrological modelling by harnessing the wisdom of the crowd: Methodology development and investigation using toy models

We introduce an ensemble learning post-processing methodology for probab...
08/31/2019

### Quantification of predictive uncertainty in hydrological modelling by harnessing the wisdom of the crowd: A large-sample experiment at monthly timescale

Predictive hydrological uncertainty can be quantified by using ensemble ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Hydrological forecasts are important for a heterogeneous group of users such as, for instance, the operators of hydrological power plants, flood prevention authorities, or shipping companies. For rational decision making based on cost-benefit analyses an estimate of the predictive uncertainty (Krzysztofowicz, 1999; Todini, 2008) needs to be provided with any forecast. The state-of-the-art approach of using an ensemble of hydrological model runs driven by numerical weather prediction (NWP) models (Cloke and Pappenberger, 2009) gives a first estimate of the meteorological input uncertainty. However, as NWP ensembles are usually biased and underdispersed (Buizza et al., 2005; Park et al., 2008; Bougeault et al., 2010) and other sources of uncertainty like hydrological model formulation, boundary and initial condition uncertainty as well as measurement uncertainties are typically neglected, statistical post-processing is important in order to reduce systematic errors and to obtain an appropriate estimate of the predictive uncertainty (Buizza, 2018).

In the last decade various methods of statistical calibration of ensemble forecasts for different weather variables have been developed (see e.g. Schmeits and Kok, 2010; Ruiz and Saulo, 2012; Williams et al., 2014) and some of them such as ensemble model output statistics (EMOS; Gneiting et al., 2005) or Bayesian model averaging (BMA; Raftery et al., 2005)

provide full predictive distributions. The EMOS predictive distribution is given by a single parametric probability law with parameters depending on the ensemble, whereas the BMA predictive probability density function (PDF) is a weighted mixture of PDFs corresponding to the individual ensemble members. EMOS and BMA models for various weather quantities differ in the applied parametric distribution family and once the predictive distribution is given, its functionals (e.g. median or mean) can be considered as classical point forecasts.

Besides the successful application e.g. to ensemble forecasts for temperature (Gneiting et al., 2005), wind speed (Thorarinsdottir and Gneiting, 2010; Lerch and Thorarinsdottir, 2013; Baran and Lerch, 2015) or precipitation (Scheuerer, 2014; Scheuerer and Hamill, 2015; Baran and Nemoda, 2016), EMOS based statistical post-processing turned out to improve the predictive performance of hydrological ensemble forecasts for different gauges along river Rhine (Hemri et al., 2015; Hemri and Klein, 2017)

. As EMOS is a quite parsimonious post-processing method that basically links a parametric forecast distribution to ensemble statistics like the ensemble mean and the ensemble variance, its performance is limited by i) how well the true process can be represented by a parametric distribution family and ii) to what extent the complete information from the ensemble can be summarized in a limited set of ensemble statistics. For instance, a typical EMOS approach based on a Gaussian or a Gamma distribution family is not able to model bimodal forecast distributions. However, BMA, which has also been applied to hydrological ensemble forecasts

(see e.g. Duan et al., 2007; Hemri et al., 2013), is much more flexible in that it converts a (multi-) model raw ensemble to a mixture distribution. Accordingly, we hypothesize that BMA may be able to outperform EMOS. However, the settings at river Rhine ask for double truncation in order to avoid physically unrealistic forecasts (Hemri and Klein, 2017). To our best knowledge, up to now, there is no study that has applied a doubly truncated normal BMA approach. In this study, the work by Baran (2014), which introduces a one-sided truncated normal BMA method, is extended to a two-sided truncated normal BMA approach. Its performance and its suitability for hydrological ensemble forecasts is assessed at the example of multi-model ensemble forecasts of water level at gauge Kaub at river Rhine.

Doubly truncated BMA is introduced in Section 2 on calibration methods and forecast evaluation, followed by a brief description of the data in Section 3. The results are presented in Section 4 and conclusions are drawn in Section 5.

## 2 Calibration methods and forecast evaluation

### 2.1 Bayesian model averaging

As mentioned in the Introduction, the BMA predictive distribution of a future weather quantity is a weighted mixture of probability laws corresponding to the individual ensemble members. In general, if    denote the ensemble forecast of a given weather or hydrological quantity    for a given location, time and lead time, the BMA predictive PDF (Raftery et al., 2005) of    equals

 p(x|f1,…,fK;θ1,…,θK):=K∑k=1ωkg(x|fk,θk), (2.1)

where    is the component PDF from a parametric family corresponding to the th ensemble member

with parameter (vector)

to be estimated and

is the corresponding weight determined by the relative performance of this particular member during the training period. Note that the weights should form a probability distribution, that is

and  .

Recently most operational ensemble predictions systems (EPSs) incorporate ensembles where at least some members can be considered as statistically indistinguishable and in this way exchangeable, as these forecasts are generated using perturbed initial conditions. This is the case with the 52 member operational ECMWF ensemble (Molteni et al., 1996; Leutbecher and Palmer, 2008) or one can mention multi-model EPSs such as the GLAMEPS ensemble (Iversen et al., 2011) or the THORPEX Interactive Grand Global Ensemble (Swinbank et al., 2016). Obviously, using exchangeable ensemble weather forecasts as inputs of a hydrological model results in hydrological ensemble forecasts with exchangeable members, which is the case with the water level data at hand described in Section 3. To account for the existence of exchangeable ensemble groups Fraley et al. (2010) suggest to use the same weights and parameters within a given group. Thus, if we have    ensemble members divided into    exchangeable groups, where the  th  group contains    ensemble members  ()  and    denotes the th member of the th group, model (2.1) is replaced by

 p(x|f1,1,…,f1,M1,…,fK,1,…,fK,MK;θ1,…,θK):=K∑k=1Mk∑ℓ=1ωkg(x|fk,ℓ,θk). (2.2)

For the sake of simplicity in the remaining part of this section we provide results and formulae only for model (2.1) as their extension to model (2.2) is rather straightforward. Further, as in the case study of Section 4 the different lead times are treated separately, reference to the lead time is also omitted.

### 2.2 Truncated normal BMA model

For weather variables such as temperature or pressure, BMA models with Gaussian components provide a reasonable fit (Raftery et al., 2005; Fraley et al., 2010)

, whereas wind speed calls for non-negative and skewed distributions such as gamma

(Sloughter et al., 2010) or truncated normal with truncation from below at zero (Baran, 2014). However, water levels are typically non-Gaussian (see e.g. Duan et al., 2007), moreover, they are bounded both from below and from above, which should also be taken into account during model formulation. A general procedure is to normalize the forecasts and observations using Box-Cox transformation

 hλ(x):={(xλ−1)/λ,λ≠0,log(x),λ=0 (2.3)

with some coefficient  ,  perform post-processing, and then transform the results back using the inverse Box-Cox transformation (Duan et al., 2007; Hemri et al., 2014, 2015). Following the ideas of Hemri and Klein (2017), for modelling Box-Cox transformed water levels we use a doubly truncated normal distribution  ,  with PDF

 ga,b(x;μ,σ):=1σφ(x−μσ)Φ(b−μσ)−Φ(a−μσ),x∈[a,b], (2.4)

and  ,  otherwise, where    and    are the lower and upper bounds and    and

denote the PDF and the cumulative distribution function (CDF) of the standard normal distribution, respectively. Note that the mean and variance of

are

 κ =μ+σφ(a−μσ)−φ(b−μσ)Φ(b−μσ)−Φ(a−μσ)and (2.5) ϱ2 =σ2⎛⎝1+a−μσφ(a−μσ)−b−μσφ(b−μσ)Φ(b−μσ)−Φ(a−μσ)−(φ(a−μσ)−φ(b−μσ)Φ(b−μσ)−Φ(a−μσ))2⎞⎠,

respectively. The proposed BMA predictive PDF is

 p(x∣f1,…,fK;α1,…,αK;β1,…,βK;σ)=K∑k=1ωkga,b(x∣αk+βkfk,σ), (2.6)

where we assume that the location parameter of the th mixture component is an affine function of the corresponding ensemble member and scale parameters are assumed to be equal for all component PDFs. The latter assumption is for the sake of simplicity and is common in BMA modelling (see e.g. Raftery et al., 2005), whereas the form of the location parameter is in line with the truncated normal BMA model of Baran (2014). Further, note that the EMOS model of Hemri and Klein (2017) also links location and scale of the truncated normal distribution to the ensemble members and not to the corresponding mean and variance.

### 2.3 Parameter estimation

Location parameters  ,  weights  ,  and scale parameter    can be estimated from training data, which consists of ensemble members and validating observations from the preceding    days. In the BMA approach, estimates of location parameters are typically obtained by regressing the validating observations on the ensemble members, whereas weights and scale parameter(s) are obtained via maximum likelihood (ML) estimation (see e.g. Raftery et al., 2005; Sloughter et al., 2007, 2010), where the likelihood function of the training data is maximized using the EM algorithm for mixture distributions (Dempster et al., 1977; McLachlan and Krishnan, 1997). However, the regression approach assumes the location parameters to be simple functions of the mean, which is obviously not the case for the truncated normal distribution. Hence, we propose a pure ML method, which ideas have already been considered e.g. by Sloughter et al. (2010) or Baran (2014).

In what follows, for a given location    and time    let    denote the th ensemble member, and denote by    the corresponding validating observation. By assuming the conditional independence of forecast errors with respect to the ensemble members in space and time, the log-likelihood function for model (2.6) corresponding to all forecast cases    in the training set equals

 ℓ(ω1,…,ωK,α1,…,αK,β1,…,βK,σ)=∑s,tlog[K∑k=1ωkga,b(xs,t|αk+βkfk,s,t,σ)]. (2.7)

To obtain the ML estimates we apply EM algorithm for truncated Gaussian mixtures proposed by Lee and Scott (2012) with a mean correction. In line with the classical EM algorithm for mixtures (McLachlan and Krishnan, 1997), first we introduce latent binary indicator variables    identifying the mixture component where the observation    comes from. Using these indicator variables one can provide the complete data log-likelihood corresponding to (2.7) in the form

 ℓC(ω1,…,ωK,α1,…,αK, β1,…,βK,σ) (2.8)

with  .  After specifying the initial values of the parameters the EM algorithm alternates between an expectation (E) and a maximization (M) step until convergence. As first guesses    and

,  for the location parameters one can use the coefficients of the linear regression of

on  ,  so  ,  initial scale

can be the standard deviation of the observations in the training data set, whereas the initial weights might be chosen uniformly, that is

.  Then in the E step the latent variables are estimated using the conditional expectation of the complete log-likelihood on the observed data, while in the M step the parameter estimates are updated by maximizing    given the actual values of the latent variables.

For the doubly truncated normal model specified by (2.4) and (2.6) the E step of the st iteration is

 z(j+1)k,s,t:=ω(j)kga,b(xs,t|μ(j)k,s,t,σ(j))∑Ki=1ω(j)iga,b(xs,t|μ(j)i,s,t,σ(j)). (2.9)

Once the estimates of the indicator variables (which are not necessary or any more) are given, the first part of the M step updating the weights is obviously

 ω(j+1)k:=1N∑s,tz(j+1)k,s,t, (2.10)

where    is the total number of forecast cases in the training set.

Further, non-linear equations    and  ,  lead us to update formulae

 α(j+1)k (2.11) β(j+1)k :=[∑s,tz(j+1)k,s,tf2k,s,t]−1∑s,tz(j+1)k,s,tfk,s,t⎧⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪⎩(xk,s,t−α(j)k,s,t)+σ(j)φ(b−μ(j)k,s,tσ(j))−φ(a−μ(j)k,s,tσ(j))Φ(b−μ(j)k,s,tσ(j))−Φ(a−μ(j)k,s,tσ(j))⎫⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪⎭,

respectively. However, using then simply    as the update of the location parameter in our case study results in an unstable parameter estimation process, so similar to Baran (2014) we introduce a mean correction of form

 (2.12)

which reflects to the difference between the location and mean of a truncated normal distributions, see (2.5). Finally, from    we obtain the last update formula

 σ2(j+1):=1N∑s,tK∑k=1z(j+1)k,s,t{ (xs,t−μ(j+1)k,s,t)2 (2.13) +σ(j)(b−μ(j+1)k,s,t)φ(b−μ(j+1)k,s,tσ(j))−(a−μ(j+1)k,s,t)φ(a−μ(j+1)k,s,tσ(j))Φ(b−μ(j+1)k,s,tσ(j))−Φ(a−μ(j+1)k,s,tσ(j))⎫⎪ ⎪ ⎪ ⎪⎬⎪ ⎪ ⎪ ⎪⎭.

Note that without truncation  ()  the terms of (2.11) and (2.13) depending on    disappear, so location (mean) and scale (standard deviation) are updated separately, no mean correction is required, and we get back the classical EM algorithm for normal mixtures.

As a more simple alternative approach one can omit the update step (2.11) for    and  ,  modify the mean correction step (2.12) to

 μ(j+1)k,s,t:=μ(0)k,s,t−σ(j)φ(a−μ(j)k,s,tσ(j))−φ(b−μ(j)k,s,tσ(j))Φ(b−μ(j)k,s,tσ(j))−Φ(a−μ(j)k,s,tσ(j)), (2.14)

and after the EM algorithm stops, estimate location parameters    and    from a linear regression of the final value of    on  .

Finally, one can also try the classical naive approach, where location parameters    and    are not updated at all, that is  .

In the case study of Section 4 the latter two approaches do not show significantly different forecast skills in terms of the verification scores defined in Section 2.4, so only the results for the naive and pure ML approaches are reported.

### 2.4 Verification scores

In probabilistic forecasting the principal aim is to access the maximal sharpness of the predictive distribution subject to calibration (Gneiting et al., 2007), where the former means a statistical consistency between the predictive distributions and the validating observations, whereas the latter refers to the concentration of the predictive distribution. One of the simplest tools for getting a first impression about the calibration of forecast distributions is the probability integral transform (PIT) histogram. By definition, the PIT is the value of predictive CDF at the validating observation (Raftery et al., 2005)

, which in case of proper calibration should follow a uniform distribution on the

interval. In this way the PIT histogram is the continuous counterpart of the verification rank histogram for the raw ensemble defined as histogram of ranks of validating observations with respect to the corresponding ensemble forecasts (see e.g. Wilks, 2011, Section 7.7.2). Again, for a properly calibrated ensemble the ranks should be uniformly distributed.

Predictive performance can be quantified with the help of scoring rules, which are loss functions assigning numerical values to pairs of forecasts and observations. In hydrology and atmospheric sciences one of the most popular scoring rules is the continuous ranked probability score

(CRPS; Gneiting and Raftery, 2007; Wilks, 2011), as it assesses calibration and sharpness simultaneously. For a (predictive) CDF    and real value (observation)    the CRPS is defined as

 \rm CRPS(F,x):=∫∞−∞(F(y)−1{y≥x})2dy =∫x−∞F2(y)dy+∫∞x(1−F(y))2dy (2.15) =E|X−x|−12E|X−X′|,

where    denotes the indicator of a set  ,  whereas    and

are independent random variables with CDF

and finite first moment. CRPS is a negatively oriented proper scoring rule

(Gneiting and Raftery, 2007), that is the smaller the better, and the right-hand side of (2.15) shows that it can be expressed in the same unit as the observation. For truncated normal distribution the CRPS has a simple closed form (see e.g. the R package scoringRules; Jordan et al., 2017), whereas for the truncated normal mixture (2.6) the second integral expression in the definition (2.15) should be evaluated numerically. Moreover, in our case study each calibration approach provides a predictive CDF    for the Box-Cox transformed water level  .  Thus, the CRPS corresponding to the predictive CDF    of the original water level    and a real value    equals

 (2.16)

which integral should again be approximated numerically. Further, in order to get the CRPS of the raw ensemble, the predictive CDF should be replaced by the empirical one.

One can quantify the improvement in CRPS with respect to a reference predictive distribution    with the help of the continuous ranked probability skill score (CRPSS; Murphy, 1973; Gneiting and Raftery, 2007), defined as

 \rm CRPSS(F,x;Fref):=1−\rm CRPS% (F,x)\rm CRPS(Fref,x).

In contrast to the CRPS the CRPSS is positively oriented, that is the larger the better, and in our case study we consider the raw ensemble as a reference.

Calibration and sharpness of a predictive distribution can also be investigated using the coverage and average width of the    central prediction interval, respectively. As coverage we consider the proportion of validating observations located between the lower and upper

quantiles of the predictive CDF, and level

should be chosen to match the nominal coverage of the raw ensemble, that is  ,  where    is the ensemble size. As the coverage of a calibrated predictive distribution should be around  ,  such a choice of    allows direct comparison with the raw ensemble.

Further, as point forecasts we consider the medians of the predictive distributions and the raw ensemble, that are evaluated with the help of mean absolute errors (MAEs).

Finally, as suggested by Gneiting and Ranjan (2011), statistical significance of the differences between the verification scores is assessed by utilizing the Diebold-Mariano (DM; Diebold and Mariano, 1995) test, which allows accounting for the temporal dependencies in the forecast errors.

### 2.5 Truncated normal EMOS model

As a reference post-processing method for calibration of Box-Cox transformed ensemble forecasts for water levels we consider the truncated normal EMOS model of Hemri and Klein (2017). In this approach the predictive distribution is a single doubly truncated normal distribution    defined by (2.4), and the ensemble members are just linked to the location    and scale    via equations

 μ=a0+a1f1+⋯+aKfKandσ2=b0+b1S2, (2.17)

where    denotes the variance of the transformed ensemble. In case of existence of groups of exchangeable ensemble members the equation for the location in (2.17) is replaced by

 μ=a0+a1¯¯¯f1+⋯+aK¯¯¯fK, (2.18)

where    denotes the mean value of the th group. According to the optimum score estimation principle of Gneiting and Raftery (2007), location parameters    and scale parameters    are estimated from the training data by optimizing a proper verification score, which is usually the CRPS defined by (2.15).

## 3 Data

BMA and EMOS calibration approaches are tested on ensemble forecasts for water level (cm) at gauge Kaub of river Rhine (546 km) for the eight year period 1 January 2008 – 31 December 2015 with lead times from 1 h to 120 h with a time step of 1 h and the corresponding validating observations. The minimum and maximum recorded water levels at this particular gauge are 35 cm and 825 cm, respectively. Our 79 member multimodel water level ensemble is obtained by plugging the ECMWF high resolution (HRES) forecasts, the 51 member ECMWF forecasts (EPS) (Molteni et al., 1996; Leutbecher and Palmer, 2008), the 16 member COSMO LEPS forecasts of the limited-area ensemble prediction system of the consortium for small-scale modeling (Montani et al., 2011) and the 11 member NCEP GEFS forecasts of the reforecast version 2 of the global ensemble forecast system of the National Center for Environmental Prediction (Hamill et al., 2013) for there relevant weather variables into the hydrological model HBV-96 (Lindström et al., 1997), which is run at the German Federal Institute of Hydrology (BfG) for operational runoff forecasting. The runoff forecasts are then converted into water level forecasts for the navigation-relevant gauges, including gauge Kaub, using a hydrodynamic model. All ensemble forecast are initialized at 6 UTC. We remark that the data set at hand is part of the data studied in Hemri and Klein (2017), where we refer to for further details on the data.

## 4 Results

As mentioned in Sections 2.2 and 2.5, BMA and EMOS post-processing is applied for modelling Box-Cox transformed water levels. Each lead time has an individual Box-Cox parameter    (see Figure 1) providing the best fit to a normal distribution (Hemri et al., 2015; Hemri and Klein, 2017) and obviously, for a given lead time the same coefficient is applied both for the forecasts and observations.

Similar to Hemri and Klein (2017) we assume that water levels are in the interval spanned by half of the minimum and double of the maximum recorded water level, i.e. they are between 17.5 cm and 1650 cm, so the Box-Cox transforms of these values serve as lower and upper bounds for the truncated normal distribution used both in BMA and EMOS modelling.

The generation of the hydrological ensemble forecast described in Section 3 induces a natural grouping of the ensemble members. One contains just the forecast based on the ECMWF HRES, the other 51 member group corresponds to the ECMWF EPS, whereas forecasts based on COSMO LEPS and NCEP GEFS ensemble weather forecasts form two other groups of sizes 16 and 11, respectively. Hence, Box-Cox transformed water level forecasts are calibrated using the truncated normal BMA model for exchangeable ensemble members specified by (2.2) and (2.4) and truncated normal EMOS given by (2.17) and (2.18) with    and  .  This means that for BMA modelling 12, whereas for finding the EMOS predictive distribution 7 free parameters have to be estimated. To ensure a reasonably stable parameter estimation we use a rolling training period of length 100 days and consider one day ahead calibration for all lead times. Thus, BMA and EMOS models are verified on the period 10 April 2008 – 31 December 2015 (2822 calendar days).

While BMA and EMOS models are fit to Box-Cox transformed values, to ensure comparability we provide verification scores for the original forecasts and observations. This means that for quantile based scores (MAE, coverage, average width) before evaluating the score the inverse Box-Cox transformation is applied to the appropriate quantiles of the predictive distribution, whereas CRPS is calculated with the help of (2.16).

In Figure 2a the mean CRPS values of the different post-processing approaches and the raw ensemble are plotted as functions of the lead time. Note that compared to the raw ensemble all calibration approaches reduce the mean CRPS and the gap increases together with the lead time. The differences between the forecast skills are more pronounced in Figure 2b showing the CRPSS values with respect to the raw ensemble forecast. Note that all three presented methods have their maximal skill score at hour 9, for shorter lead times the increase is very fast and naive BMA shows the best predictive performance, whereas for longer lead times the pure ML BMA starts dominating. Obviously, larger lead times also result in larger forecast uncertainty which should be taken into account when one compares predictive performance. According to the results of DM tests for equal predictive performance naive BMA significantly outperforms the raw ensemble for all lead times and the same holds for the pure ML BMA except hour 1. In general, in terms of the mean CRPS the two BMA approaches differ significantly mainly for very short and long lead times, as can be observed on the graph of -values displayed in Figure 2c. EMOS also significantly outperforms the raw ensemble for all lead times, and except for the first couple of hours underperforms the BMA approaches, as depicted in Figure 2d.

There is much less variety in the performance of BMA and EMOS calibrated medians in terms of the MAE. According to Figure 3a showing the difference in MAE with respect to the raw ensemble the pure ML BMA has the best forecast skill, however, even this approach underperforms the raw ensemble till hour 70. Note that DM tests for equality of MAE values indicate that all differences plotted in Figure 3a are significant, which will definitely not be the case if we compare the performance of the three post-processing methods, see the -values of Figure 3b.

The positive effect of post-processing on calibration can be clearly observed on Figure 4a showing the coverages of nominal    central prediction intervals as functions of the lead time. All post-processing approaches for all lead times result in almost perfect coverage, whereas the coverage of the raw ensemble is much lover and strongly depends on the lead time. The coverage values of the two BMA approaches are almost identical and after hour 4 they are closer to the nominal value than those of the EMOS. Finally, as depicted in Figure 4b, the raw ensemble produces the sharpest forecasts for all lead times, however, at the cost of being uncalibrated. This is fully in line with the verification rank histograms of the raw ensemble and PIT histograms of post-processed forecasts for lead times 24, 72 and 120 hours plotted in Figure 5. All verification rank histograms are strongly U-shaped (and the same holds for other lead times, not reported), indicating that the raw ensemble is strongly underdispersive and requires post-processing. BMA and EMOS approaches significantly improve the statistical calibration of the forecast and result in more uniform PIT histograms, although for hour 120 naive BMA and EMOS still show a slight underdispersion. The Kolmogorov-Smirnov test accepts at a level the uniformity of the PIT values of pure ML BMA, naive BMA and EMOS for only 9 (5, 6, 7, 14, 17, 72, 75, 77, 79 h), 6 (4, 5, 6, 7, 14, 17 h) and 4 (5, 6, 7, 9 h) different lead times, respectively, however, this might be a consequence of the large sample size resulting in numerical problems (see e.g. Baran and Lerch, 2015). Thus, in order to get a better quantification of the differences in calibration, the mean -values of 1000 random samples of PITs of sizes 1000 each, are calculated and plotted in Figure 6. The corresponding points at lead times 24, 72 and 120 hours nicely reflect the shapes of the PIT histograms of Figure 5 and clearly show the advantage of pure ML BMA post-processing.

## 5 Conclusions

We introduce a new BMA model for calibrating Box-Cox transformed hydrological ensemble forecasts for water level, providing a predictive distribution which is a weighted mixture of doubly truncated normal distributions. The model with three different parameter estimation approaches is tested on the 79 member ensemble forecast of BfG for water level at gauge Kaub of river Rhine for 120 different lead times. Using the CRPS of probabilistic and MAE of median forecasts and coverage and average width of nominal central prediction intervals as verification measures, the forecast skill of the BMA model is compared to that of the state of the art EMOS model of Hemri and Klein (2017) and the raw ensemble.

Based on the results of the presented case study one can conclude that compared with the raw ensemble, post-processing always improves the calibration of probabilistic and accuracy of point forecasts. Further, BMA model utilizing pure ML for parameter estimation has the best predictive performance and, except very short lead times, the BMA approach significantly outperforms the EMOS calibration. A direct comparison of the CRPSS values obtained in this case study with those shown in Figure 4a of Hemri and Klein (2017) reveals that – at least in the case of EMOS – seasonal and analog based training periods considerably outperform the rolling window training periods used in this study. Hence, though out-of-scope of this study, a thorough comparison of the gains in forecast skill and the (computational) costs of using a more complex post-processing method, e.g. BMA instead of EMOS with the gains and costs of using a more complex selection of the training periods, e.g. seasonal and analog based training periods for forecasts of water level, would be beneficial.

Further, following the ideas of Hemri et al. (2015) one can combine the BMA calibrated forecasts corresponding to different lead times into temporally coherent multivariate predictions with the help of state of the art techniques such as e.g. the ensemble copula coupling (Schefzik et al., 2013) or the Gaussian copula approach (Pinson and Girard, 2012), however, these studies are also beyond the scope of the present paper.

Acknowledgments.  The authors are grateful to the German Federal Office of Hydrology (BfG), and in particular Bastian Klein, for providing the data. The authors also thank Tilmann Gneiting for valuable comments and suggestions. Essential part of this work was made during the visit of Sándor Baran at the Heidelberg Institute of Theoretical Studies in the framework of the visiting scientist program. Sándor Baran was also supported by the János Bolyai Research Scholarship of the Hungarian Academy of Sciences and the National Research, Development and Innovation Office under Grant No. NN125679. He is grateful to Sebastian Lerch for his useful suggestions, remarks and helps with the EMOS code. Sándor Baran and Stephan Hemri also acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) Grant No. MO 3394/1-1 “Statistical postprocessing of ensemble forecasts for various weather quantities”.

## References

• Baran (2014) Baran, S. (2014) Probabilistic wind speed forecasting using Bayesian model averaging with truncated normal components. Comput. Stat. Data. Anal. 75, 227–238.
• Baran and Lerch (2015)

Baran, S. and Lerch, S. (2015) Log-normal distribution based EMOS models for probabilistic wind speed forecasting.

Q. J. R. Meteorol. Soc. 141, 2289–2299.
• Baran and Nemoda (2016) Baran, S. and Nemoda, D. (2016) Censored and shifted gamma distribution based EMOS model for probabilistic quantitative precipitation forecasting. Environmetrics 27, 280–292.
• Bougeault et al. (2010) Bougeault, P., Toth, Z., Bishop, C., Brown, B., Burridge, D., Chen, D. H., Ebert, B., Fuentes, M., Hamill, T. M., Mylne, K., Nicolau, J., Paccagnella, T., Park, Y.-Y., Parsons, D., Raoult, B., Schuster, D., Dias, P. S., Swinbank, R., Takeuchi, Y., Tennant, W., Wilson, L. and Worley, S. (2010) The THORPEX interactive grand global ensemble. B. Am. Meteorol. Soc. 91, 1059–1072.
• Buizza (2018) Buizza, R. (2018) Ensemble forecasting and the need for calibration. In Wannitsem, S., Wilks, D. S., Messner, J. W. (eds.), Statistical Postprocessing of Ensemble Forecasts, Elsevier, 2018, pp. 15–48.
• Buizza et al. (2005) Buizza, R., Houtekamer, P. L., Toth, Z., Pellerin, G., Wei, M. and Zhu, Y. (2005) A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Mon. Wea. Rev. 133, 1076–1097.
• Cloke and Pappenberger (2009) Cloke, H. L. and Pappenberger, F. (2009) Ensemble flood forecasting: A review. Journ. of Hydrol. 375, 613–626.
• Dempster et al. (1977) Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B 39, 1–39.
• Diebold and Mariano (1995) Diebold, F. X. and Mariano, R. S. (1995) Comparing predictive accuracy. J. Bus. Econ. Stat. 13, 253–263.
• Duan et al. (2007) Duan, Q., Ajami, N. K., Gao, X. and Sorooshian, S. (2007) Multi-model ensemble hydrologic prediction using Bayesian model averaging. Adv. Water Resour. 30, 1371–1386.
• Fraley et al. (2010) Fraley, C., Raftery, A. E. and Gneiting, T. (2010) Calibrating multimodel forecast ensembles with exchangeable and missing members using Bayesian model averaging. Mon. Wea. Rev. 138, 190–202.
• Gneiting et al. (2007) Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007) Probabilistic forecasts, calibration and sharpness. J. Roy. Stat. Soc. B. 69, 243–268.
• Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007) Strictly proper scoring rules, prediction and estimation. J. Amer. Statist. Assoc. 102, 359–378.
• Gneiting et al. (2005) Gneiting, T., Raftery, A. E., Westveld, A. H. and Goldman, T. (2005) Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Mon. Weather Rev. 133, 1098–1118.
• Gneiting and Ranjan (2011) Gneiting, T. and Ranjan, R. (2011) Comparing density forecasts using threshold- and quantile-weighted scoring rules. J. Bus. Econ. Stat. 29, 411–422.
• Hamill et al. (2013) Hamill, T. M., Bates, G. T., Whitaker, J. S., Murray, D. R., Fiorino, M., Galarneau, T. J., Zhu, Y. and Lapenta, W. (2013) NOAA’s second-generation global medium-range ensemble reforecast dataset. Bull. Am. Meteorol. Soc. 94, 1553–1565.
• Hemri et al. (2013) Hemri, S., Fundel, M. and Zappa, M. (2013) Simultaneous calibration of ensemble river flow predictions over an entire range of lead times. Water Resour. Res. 49, 6744–6755.
• Hemri et al. (2014) Hemri, S., Lisniak, D. and Klein, B. (2014) Ermittlung probabilistischer Abflussvorhersagen unter Berücksichtigung zensierter Daten. HyWa 58, 84–94.
• Hemri et al. (2015) Hemri, S., Lisniak, D. and Klein, B. (2015) Multivariate postprocessing techniques for probabilistic hydrological forecasting. Water Resour. Res. 51, 7436–7451.
• Hemri and Klein (2017) Hemri, S. and Klein, B. (2017) Analog based post-processing of navigation-related hydrological ensemble forecasts. Water Resour. Res. 53, 9059–9077.
• Iversen et al. (2011) Iversen, T., Deckmin, A., Santos, C., Sattler, K., Bremnes, J. B., Feddersen, H. and Frogner, I.-L. (2011) Evaluation of ’GLAMEPS’ – a proposed multimodel EPS for short range forecasting. Tellus A 63, 513–530.
• Jordan et al. (2017) Jordan, A., Krüger, F. and Lerch, S. (2017) Evaluating probabilistic forecasts with the R package scoringRules. arXiv 1709.04743. Available from https://arxiv.org/abs/1709.04743 [Accessed on 19 July 2018].
• Krzysztofowicz (1999) Krzysztofowicz, R. (1999) Bayesian theory of probabilistic forecasting via deterministic hydrologic model. Water Resour. Res. 35, 2739–2750.
• Lindström et al. (1997) Lindström, G., Johansson, B., Persson, M., Gardelin, M. and Bergström, S. (1997) Development and test of the distributed HBV-96 hydrological model. J. Hydrol. 201, 272–288.
• Lee and Scott (2012)

Lee, G and Scott, C. (2012) EM algorithms for multivariate Gaussian mixture models with truncated and censored data.

Comput. Statist. Data Anal. 56, 2816–2829.
• Lerch and Thorarinsdottir (2013) Lerch, S. and Thorarinsdottir, T. L. (2013) Comparison of non-homogeneous regression models for probabilistic wind speed forecasting. Tellus A 65, 21206.
• Leutbecher and Palmer (2008) Leutbecher, M. and Palmer, T. N. (2008) Ensemble forecasting. J. Comp. Phys. 227, 3515–3539.
• McLachlan and Krishnan (1997) McLachlan, G. J. and Krishnan, T. (1997) The EM Algorithm and Extensions. Wiley, New York.
• Molteni et al. (1996) Molteni, F., Buizza, R. and Palmer, T. N. (1996) The ECMWF Ensemble Prediction System: Methodology and validation. Q. J. R. Meteorol. Soc. 122, 73–119.
• Montani et al. (2011) Montani, A., Cesari, D., Marsigli, C. and Paccagnella, T. (2011). Seven years of activity in the field of mesoscale ensemble forecasting by the COSMO-LEPS system: Main achievements and open challenges. Tellus A 63, 605–624.
• Murphy (1973) Murphy, A. H. (1973) Hedging and skill scores for probability forecasts. J. Appl. Meteorol. 12, 215–223.
• Park et al. (2008) Park, Y.-Y., Buizza, R. and Leutbecher, M. (2008) TIGGE: Preliminary results on comparing and combining ensembles. Q. J. R. Meteorol. Soc. 134, 2029–2050.
• Pinson and Girard (2012) Pinson, P. and Girard, R. (2012) Evaluating the quality of scenarios of short-term wind power generation. Appl. Energy 96, 12–20.
• Raftery et al. (2005) Raftery, A. E., Gneiting, T., Balabdaoui, F. and Polakowski, M. (2005) Using Bayesian model averaging to calibrate forecast ensembles. Mon. Weather Rev. 133, 1155–1174.
• Ruiz and Saulo (2012) Ruiz, J. J. and Saulo, C. (2012) How sensitive are probabilistic precipitation forecasts to the choice of calibration algorithms and the ensemble generation method? Part I: Sensitivity to calibration methods. Meteorol. Appl. 19, 302–313.
• Schefzik et al. (2013) Schefzik, R., Thorarinsdottir, T. L. and Gneiting, T. (2013). Uncertainty quantification in complex simulation models using ensemble copula coupling. Stat. Sci. 28, 616–640.
• Scheuerer (2014) Scheuerer, M. (2014) Probabilistic quantitative precipitation forecasting using ensemble model output statistics. Q. J. R. Meteorol. Soc. 140, 1086–1096.
• Scheuerer and Hamill (2015) Scheuerer, M. and Hamill, T. M. (2015) Statistical post-processing of ensemble precipitation forecasts by fitting censored, shifted gamma distributions. Mon. Wea. Rev. 143, 4578–4596.
• Schmeits and Kok (2010)

Schmeits, M. J. and Kok, K. J. (2010) A comparison between raw ensemble output, (modified) Bayesian model averaging and extended logistic regression using ECMWF ensemble precipitation reforecasts.

Mon. Weather Rev. 138, 4199–4211.
• Sloughter et al. (2007) Sloughter, J. M., Raftery, A. E., Gneiting, T. and Fraley, C. (2007) Probabilistic quantitative precipitation forecasting using Bayesian model averaging. Mon. Wea. Rev. 135, 3209–3220.
• Sloughter et al. (2010) Sloughter, J. M., Gneiting, T. and Raftery, A. E. (2010) Probabilistic wind speed forecasting using ensembles and Bayesian model averaging. J. Amer. Stat. Assoc. 105, 25–37.
• Swinbank et al. (2016) Swinbank, R., Kyouda, M., Buchanan, P., Froude, L., Hamill, T. M., Hewson, T. D., Keller, J. H., Matsueda, M., Methven, J., Pappenberger, F., Scheuerer, M., Titley, H. A., Wilson, L. and Yamaguchi, M. (2016) The TIGGE project and its achievements. B. Am. Meteorol. Soc. 97, 49–67.
• Thorarinsdottir and Gneiting (2010)

Thorarinsdottir, T. L. and Gneiting, T. (2010) Probabilistic forecasts of wind speed: Ensemble model output statistics by using heteroscedastic censored regression.

J. Roy. Statist. Soc. Ser. A 173, 371–388.
• Todini (2008) Todini, E. (2008) A model conditional processor to assess predictive uncertainty in flood forecasting. Int. Jour. of River Basin Manag. 6, 123–137.
• Wilks (2011) Wilks, D. S. (2011) Statistical Methods in the Atmospheric Sciences. 3rd ed. Elsevier, Amsterdam.
• Williams et al. (2014) Williams, R. M., Ferro, C. A. T. and Kwasniok, F. (2014) A comparison of ensemble post-processing methods for extreme events. Q. J. R. Meteorol. Soc. 140, 1112–1120.