## 1 Introduction

As a chaotic dynamical system, the atmosphere has an evolution that is intrinsically uncertain (Malardel, 2005; Holton and Hakim, 2012)

and should be described in a probabilistic form. In the field of numerical weather prediction (NWP), this probabilistic form is not a probability distribution but a set of deterministic forecasts whose aim is to assess the forecast uncertainty

(Leutbecher and Palmer, 2008). Such a set of deterministic forecasts is called an ensemble forecast and each individual deterministic forecast is called a member. The members are usually obtained by running the same NWP model with different initial conditions and different parametrizations of the model physics (Descamps et al., 2011). Forecast uncertainty can then be derived from the members as a probability distribution with statistical estimation techniques and considering the members are a random sample from an unknown multivariate probability distribution .

Being often biased and under-dispersed for surface parameters (Hamill and Colucci, 1998; Buizza et al., 2005), the ensemble forecast systems may be post-processed with statistical methods, called ensemble model output statistics (EMOS) to get more skillful forecast distributions (Wilson et al., 2007; Thorarinsdottir and Gneiting, 2010; Möller and Scheuerer, 2013; Zamo et al., 2014; Baran and Lerch, 2015; Taillardat et al., 2016).

Nowadays, several ensemble forecast systems are available routinely (Bougeault et al., 2010; Descamps et al., 2014). Combining, or “aggregating”, several forecasts may improve the predictive performance compared to the most skillful post-processed ensemble (Allard et al., 2012; Gneiting et al., 2013; Baudin, 2015; Baran and Lerch, 2016; Möller and Groß, 2016; Bogner et al., 2017). The theory of prediction with expert advice (Cesa-Bianchi et al., 2006; Stoltz, 2010) shows how to efficiently aggregate in real-time several forecasts based on their respective past performances. This theory studies the mathematical properties of aggregation algorithms of several forecasting systems (called “experts” in this framework), and has been applied mostly to point forecasts.

The first goal of the present work is to apply the
theory of prediction with expert advice to probabilistic forecasts
represented as step-wise cumulative distribution functions (CDF)
with any number of steps. Two previous studies
(Baudin, 2015; Thorey, 2017) used this
theory for specific cases of probabilistic forecasts. In
Baudin (2015), the experts are the *ordered*
individual members of pooled ensembles. Each expert’s forecast is a
stepwise cumulative distribution function with one step. In this case the experts are not
identifiable over time although it is required by the theory. For instance, at
different times, the lowest forecast value comes from a different
member of a different ensemble. Thorey (2017)

applies the theory of prediction with expert advice to forecasts of photovoltaic electricity production. The experts are ensemble forecasts or built from ensemble or deterministic forecasts thanks to statistical regression methods. Each expert is treated as a different deterministic forecast, even though some of them are members of the same ensemble. Other experts are built thanks to quantile regression methods. As such, the set of forecast quantiles could be considered as a specific probabilistic expert, but each forecast quantile is again considered as a separate deterministic expert. In this study, contrary to

Baudin (2015) and Thorey (2017), each expert is a whole (raw or post-processed) ensemble, and is thus actually identifiable over time. This work extends the work of Baudin (2015) in the sense that the formulae established in this article reduce to the ones in Baudin (2015) if considering step-wise CDFs with a single step. In a nutshell, the aggregated forecast is a linear combination of the CDFs of the experts. Since the aggregated forecast must be a CDF, only convex aggregation strategies are investigated: the weights are constrained to be positive and to sum up to one.The second goal of this work is to compare two model selection approaches when dealing with probabilistic forecasting systems (Collet and Richard, 2017). The first approach is based on “reliability” (or “calibration”) and “sharpness” (Gneiting et al., 2007; Jolliffe and Stephenson, 2011)

. A forecasting system is reliable if the conditional probability distribution of the observation given the forecast distribution is equal to the forecast distribution. A forecasting system is sharper when, on average, it predicts a lower dispersion of the observation. According to the

*sharpness-calibration paradigm*of Gneiting et al. (2007) a forecasting system should aim at providing reliable probabilistic forecasts that are the sharpest (i.e. less dispersed) possible. A practical motivation of this sharpness-calibration paradigm is that decisions based on such a forecasting system would be optimal due to the reliability of the forecast and less uncertain due to the forecast’s low dispersion, which improves its value for economical decisions (Richardson, 2001; Zhu et al., 2002; Mylne, 2002). Among several reliable forecasting systems, one should thus select the sharpest. The second approach to model selection among probabilistic forecasting systems is based on a scoring rule, such as the Continuous Ranked Probabilistic Score (CRPS, Matheson and Winkler 1976): the selected model is the one that has the best value of the scoring rule (highest or lowest value depending on the scoring rule). The two approaches to model selection do not yield equivalent forecasts, as previously mentioned in different studies (Collet and Richard, 2017; Wilks, 2018). For instance, minimizing the CRPS may lead to forecasts that are not reliable. To solve this problem, Wilks (2018) proposes to minimize the CRPS penalized with a term quantifying the unreliability of the forecast. Collet and Richard (2017) introduces a post-processing method that, under quite strong assumptions, yields reliable forecasts without degrading too much the CRPS compared to the CRPS-minimizing approach. We do not use any of these solutions here but compare the two model selection approaches and their properties on a case study. Both approaches are used to select the best forecasting system among several experts and aggregated forecasts. In the first approach, reliability is imposed by using the Jolliffe-Primo flatness test (JP test, Jolliffe and Primo (2008)). As for the second approach, forecast performance is measured with the CRPS.

In Section 2, the theory of prediction with expert advice is presented, along with notations. It is shown how this theory can be straightforwardly applied to step-wise CDFs. The CRPS and JP test are also introduced with more details and their use is further motivated. Section 3 presents the different aggregation methods investigated. Some are empirical, while others exhibit interesting theoretical properties. Section 4 describes the use-case of this study and the data it uses: four ensemble forecasts, two EMOS methods used to post-process the ensembles, and the wind speed observation. The results of the comparison of the aggregation methods are presented in Section 5. These results motivate a more theoretical comparison of the two approaches to model selection among probabilistic forecasts, in Section 6. Finally, Section 7 concludes with a summary of the results and perspectives.

## 2 Theoretical Framework and Performance Assessment Tools

The desired properties of forecast aggregation are two-fold. The first one is to yield an aggregated forecast that performs better than any of the forecasts that are used in the aggregation. According to the theory of prediction with expert advice (Cesa-Bianchi et al., 2006; Stoltz, 2010), some algorithms used to sequentially aggregate forecasts exhibits theoretical guarantees of performance. These guarantees state that the aggregated forecast will not perform much worse than some skillful reference forecast, called the oracle. In practice, the aggregated forecast may even outperform the oracle. This motivates using the theory of prediction with expert advice. The second desired property is to dynamically tackle changes in the forecasts’ generating process (such as modification in NWP model’s code). These changes may strongly affect the performance of the raw or post-processed forecasts. A good aggregation method should quickly detect changes in the performance of the individual forecasts and adapt the aggregation weights to discard the bad ones and favor the good ones. Being a sequential aggregation framework based on the recent performance of the experts, the theory of prediction with expert advice may help in reaching this second goal.

### 2.1 Sequential Aggregation of Step-Wise CDFs

The situation tackled by the theory of prediction with expert advice is the following: a forecaster has to forecast some parameter of interest by using only past observations of the parameter and past and current forecasts of the parameter steming from several sources. These sources whose forecasts are to be aggregated are called “experts” in this theory. In this very general framework no assumption is made on the generating process of the observations and the experts.

More formally, at time , before the observation is revealed, let us suppose available the forecast of the experts and of the past observations (noted ). An expert is any means, in a very general sense (NWP model, human expertise, …), to produce a forecast of at each time , before the observation is known. The forecast of expert at time is noted , with the value set of the forecasts. Although the prediction with expert advice has been mostly used with experts issuing point forecasts (), let us stress that the experts yield forecasts of any type, not necessarily point forecasts. The theory straightforwardly adapts to probabilistic forecasts as will be shown below. Someone or something, called the “forecaster” in the theory, produces an aggregated forecast as a linear combination of the experts’ current forecasts

(1) |

where is the aggregation weight of expert at time . The aggregation weights are computed using only information available at the present time, namely the past and present experts’ forecasts with and , and the past observations , with . The aggregation is usually initialized with equal weights, that is, . The weights can be computed with many algorithms (called aggregation methods), some of which are presented in Section 3.

When the observation is revealed, the forecaster suffers a loss, quantified with a function . The most general goal of the forecaster would be to build the best possible forecast, that is, to minimize its cumulative loss over a period of time . This minimization is not possible for all sequences , since it is always theoretically possible to build a sequence of observations that makes the forecaster’s cumulative loss arbitrarily high. Therefore, a more realistic goal is to build the best possible forecast relatively to the best element from some class of reference forecasts. Let us note such a class of reference forecasts, whose elements are written

(2) |

In this case the weights may be computed by using the present and future information (that is on the whole period ). Hence the aggregated forecast delivered by the forecaster may be different from the best aggregated forecast in class (called the oracle). Two examples of class are the set of the available experts, or the set of linear combination of the experts’ forecasts with constant weighting computed with some chosen aggregation method. The computation of the oracle uses all the information for the whole period. Thus, the oracle cannot be used for real-time applications, but may be used as a reference in order to evaluate aggregation methods. The regret of the forecaster relatively to the class is the cumulative additional loss suffered by the forecaster who used its own aggregated forecast instead of the oracle of class :

(3) |

According to the theory of prediction with expert advice there exists aggregation methods available to the forecaster such that the regret relative to class is sub-linear in

(4) |

where the supremum is taken over all possible sequences of
the observation and experts. Let’s point out that the regret is
not lower bounded, and so may be negative for some datasets. Since
the upper bound on the regret holds for *any* sequence of
observation and forecasts, this sublinearity property ensures to the
forecaster good forecast performances. In most cases,
this property only requires that the chosen loss be convex in
its first argument, and no assumption is required about the observation or the experts
(Cesa-Bianchi et al., 2006). In specific cases where some information
is known about the experts, such as a correlation structure between
the experts’ forecasts, this information may be used to design
better aggregation methods (see supplement in
Swinbank et al. (2016)). Adjakossa et al. (2020) uses assumptions about the experts’ generating process to improve aggregation methods for point forecasts. Here only the general case is treated.
Mallet et al. (2007) and Gerchinovitz et al. (2008) review
many aggregation methods of expert advice, with numerical algorithms
thereof. Both articles describe the case of experts issuing point
forecasts () and real
observations (), along with
theoretical bounds for the -loss
(), when they exist.

Applying this theory for probabilistic forecasts expressed as step-wise CDF is straightforward. The observation is supposed real-valued (). The forecast space is the set of step-wise CDF, that is the set of piece-wise constant, non-decreasing functions taking their values in . Each expert forecast is a step function with jumps of heights (called weights) at values . For instance the values may come from an ensemble forecast with members. The weights are such that , for , and . Then each expert’s forecast is the step-wise CDF

(5) |

where is the Heaviside function, and . Without loss of generality, the are supposed sorted in ascending order for each expert and at each time , so that may be considered as the quantile of order

of a random variable

.The aggregated forecast CDF, , is a step function at the pooled values such that the jump of height is associated to the value , thus

(6) |

In other words, may be considered as the quantile of order of some random variable , whose computation is illustrated in Figure 1. To produce a valid step-wise CDF, the aggregation method must produce aggregation weights that are all non negative and sum up to 1 at fixed . Thus, the aggregated forecast is a convex combination of the expert forecasts.

### 2.2 Performance Assessment Tools

Among the several attributes of performance of probabilistic forecast systems, reliability and resolution are most useful (Murphy, 1993; Winkler et al., 1996). A probabilistic forecast system is reliable if the conditional distribution of the (random) observation given the forecast distribution , noted , is equal to the forecast distribution: . Resolution refers to the ability of the forecast system to issue forecast distributions very different from the marginal distribution of the observations. If a forecast is reliable, decisions can then be made by considering the observation will be drawn from the forecast distribution. For a reliable forecasting system, the higher is its resolution, the more useful it is useful for economical decision taking (Richardson, 2001).

In order to compare the two approaches of probabilistic forecast selection, as stated in the Introduction, two tools are used: the Continuous Ranked Probability Score which is a global measure of performance, and the rank histogram which is linked to reliability.

The Continuous Ranked Probability Score (CRPS, Matheson and Winkler 1976) is a scoring rule that quantifies the forecast performance of a probabilistic forecasts expressed as a CDF with a scalar observation

(7) |

The CRPS being convex in its first argument, the existence of theoretical bounds for the regret is ensured for some aggregation methods by the theory of prediction with expert advice. Since the experts are step-wise CDFs with steps at , the information about the underlying forecast CDF is incomplete, which makes some estimators of the CRPS biased, as investigated in Zamo and Naveau (2018). This last article shows that, in order to get an accurate estimate of the CRPS, one has to choose an estimator of the CRPS according to the nature of the (a sample from or a set of quantiles from ). This recommendation was followed in this study. For a forecast CDF built as the empirical CDF of an -random sample (such as a raw ensemble), the CRPS is estimated with

(8) |

and for a forecast CDF defined from a set of quantiles with regularly spaced orders (such as a post-processed ensemble), the following estimator is used

(9) |

The expectation of the CRPS can be decomposed into three terms that quantify different properties of a forecasting system or the observation distribution: the reliability (RELI), the resolution (RES) and the uncertainty (UNC) terms:

(10) |

Let us note the marginal distribution of the observation and the conditional distribution of the observation given the forecast distribution . Then the three terms are defined as in Bröcker (2009) by

(11) | |||||

(12) | |||||

(13) |

The reliability term is the average difference of CRPS between forecasts and when forecasting observations distributed according to . It is negatively oriented with a minimum of 0 for a perfectly reliable forecast system (). The resolution term is the average difference of CRPS between forecasts and when forecasting observations distributed according to . It is positively oriented and the more is different form the higher it is. For a reliable forecast system, it is essentially equivalent to the sharpness. The uncertainty term is linked to the dispersion of the marginal distribution of the observation and does not depend on the forecast. Hersbach (2000) gives a method to estimate these terms from an ensemble forecast. Based on this decomposition, good CRPS can be obtained with a bad reliability (high RELI) provided that the resolution is good enough (high RES). By minimizing the CRPS as a forecast selection criterion, one may thus select a forecast system that has a low CRPS but is not reliable (Wilks, 2018). The solution proposed by Wilks (2018) is to minimize the CRPS modified with a penalty proportional to the reliability term. Although this may indeed decrease the reliability term, there is no way to check if it is low enough.

Collet and Richard (2017) proposed a post-processing
method that enforces the reliability of the forecast. To have a more
general forecast selection method, it is proposed here to check that
a forecasting system is reliable by testing the hypothesis that its
rank histogram is flat. This procedure may be used as long as one
can build a rank histogram from the forecast, which is generally the case. The
rank histogram of an ensemble forecast, simultaneously introduced by
Anderson (1996), Hamill and Colucci (1996) and
Talagrand et al. (1997), is the histogram of the rank of the
observation when it is pooled with its corresponding forecast
members. For a reliable ensemble, the observation and the members must
have the same statistical properties, resulting in a flat rank
histogram. The type of deviation from flatness gives indications about the
flaws of an ensemble. For instance, an L-shape histogram means the
forecasts are consistently too high, while a J-shape histogram
indicates consistently too low forecasts. A U-shape histogram reveals
the forecast distribution is under-dispersed or conditionally
biased. Hamill (2001) showed on synthetic data that a
flat rank histogram can be obtained for an unreliable ensemble. Producing a
flat rank histogram is thus a necessary but not sufficient condition
for a forecasting system to be reliable. Although the rank histogram was initially designed for ensemble forecasts, it can be used for forecasts in the form of quantiles. If the quantiles’ orders are regularly spaced between 0 and 1 (excluded), then the histogram should also be flat if the forecasting system is reliable. The flatness of a rank histogram
can be statistically tested thanks to the Jolliffe-Primo tests of
flatness described in Jolliffe and Primo (2008) and summarized in
Appendix A. These tests assess the existence of some
specific deviations from flatness in a rank histogram, such as a
slope, a convexity and a wave shape^{1}^{1}1The p-values of the Jolliffe-Primo test
for slope and convexity have been computed based on a modified
version of the function TestRankhist in the R package SpecsVerification (Siegert, 2015). The function has
been modified to compute also the p-value for the Jolliffe-Primo
test for a wave shape, introduced in this
study.. In this study we apply the three flatness tests to each rank histogram at several locations where forecasts and observations are available, which leads to a multiple testing procedure. In order to take into account this multiple testing, the false discovery rate is controlled thanks to the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001) with an alpha of 0.01. Bröcker (2018) showed that when the observation ranks are serially dependent, the JP tests
should be adapted to take into account this temporal dependency. In this
study, the lag-1 autocorrelation of the rank has a median of 0.2
over all the studied locations and lead times and is lower than 0.4
for most of the forecasting systems (raw, post-processed and aggregated
alike). This correlation was judged low enough so the modified
procedure proposed by Bröcker (2018) was not used.

## 3 Aggregation Methods

Five aggregation methods are introduced, from simple empirical ones to more sophisticated ones derived from the theory of prediction with expert advice.

### 3.1 Inverse CRPS Weighting

The *inverse CRPS weighting method* (INV) gives to each
expert’s forecast a weight inversely proportional to its average
CRPS over the last days:

(14) |

where is the average CRPS of expert during the days before time .

### 3.2 Sharpness-Calibration Paradigm

The *sharpness-calibration paradigm* (SHARP) of Gneiting et al. (2007)
can be used as an aggregation method to select at each time one
expert. The aggregated forecast is the forecast of the expert whose
range of the central interval , averaged over the last
days is the lowest, among the experts whose reliability term, as
computed in Hersbach (2000), is lower than a chosen
threshold over the last days. The aggregated forecast
gives an aggregation weight of 1 to the sharpest reliable expert’s forecast and
of 0 to the other experts’ forecast:

(15) |

where is the reliability term of expert over the days before time , and is the average range of the interval between quantiles of orders 0.95 and 0.05, forecasted by the expert over the days before time . If no expert has a reliability term lower than the reliability threshold over the last days, the aggregated forecast is just the expert with the lowest mean CRPS over the last days.

The three following methods are derived from the theory of prediction with expert advice, and bounds for the regret may be computed.

### 3.3 Minimum CRPS

The *minimum CRPS method* (MIN) chooses the best recent expert in
terms of CRPS, that is, the aggregation weight is 1 for the expert
with the lowest average CRPS over the last days, and 0 for all the
other experts:

(16) |

where is the index of the expert with the minimum average CRPS during the last days. The reference class is the set of the available experts, so that the oracle for this method is the expert with the lowest CRPS averaged over the period , . This aggregation method is called “follow-the-best-expert” in Cesa-Bianchi et al. (2006), who prove that, under several assumptions on the loss, the regret of the aggregated forecast relatively to the oracle is .

### 3.4 Exponential Weighting

The *exponentially weighted average forecaster* (EWA) computes the aggregation weights as

(17) |

where is called the learning rate and is the cumulative CRPS of expert over the last days. The highest the learning rate, the lowest the weight for a bad expert. For very large learning rates , EWA is practically equivalent to the MIN method. The reference class is the set of the available experts. If spans the whole period before , that is, if days, EWA competes well with the best expert, in terms of average CRPS over the whole period , with a boudned regret relatively to this oracle

(18) |

where

is the upper bound of the loss function (see proof in Appendix

B). The demonstration requires only the convexity of the loss in its first argument. In practice, for an unbounded loss such as the CRPS, is the maximum of observations and expert forecasts over the whole period . Although the theoretical bounds exist only for , one can use a shorter sliding window in order to make the aggregation weights change quickly over time. This may improve the forecast of an unstationary variable of interest.### 3.5 Exponentiated Gradient

The *exponentiated gradient forecaster* (GRAD) weights the experts with

(19) |

where is the cumulative CRPS of the GRAD aggregated forecast over the last days, and . Using Appendix C leads to

(20) |

which generalizes Equation (5.13) of Baudin (2015) to the case of the aggregation of step-wise CDFs with any number of steps. The rationale for using the gradient is that, if the gradient is positive for an expert over the past forecasts, increasing the weight would have increased the CRPS (and decreased the performance) of the aggregated forecast. So for incoming forecasts, giving this expert a low weight should improve the forecast performance. The reverse is true for a negative gradient.

For this aggregation method, the reference class is the set of convex combinations with constant weights over the whole period , that is, such that and . The oracle is the best, in terms of cumulative CRPS, constant convex combination of experts. This is usually a better oracle than the best expert. If days, the following bound immediately follows from Mallet et al. (2007) or Baudin (2015)

(21) |

where . Although the oracle for GRAD is better than for EWA, the bounds may be larger, so that EWA may actually perform better than GRAD. In practice, one has to try both methods to know which one performs best.

## 4 The Experts and the Observation

In this work, aggregation is applied to ensemble forecasts of the 10 m wind speed over France. Forecasting surface wind is quite difficult due to complex interactions between phenomena at different spatio-temporal scales. For instance, wind speed is influenced by large scale atmospheric structures such as cyclones and anticyclones, but also by local orography and surface friction. Local atmospheric effects such as downward drifts under convective clouds also contribute to the direction and speed of wind. National weather services need to have skillful wind speed forecasts to issue early warnings to the population and civil security services. Also, wind speed influences many economic activities, such as sailing, windpower generation, construction,… that need good forecasts for decision making.

This section presents the 28 experts used in this study and the wind speed observation used to assess the forecasts.

### 4.1 The Four Experts based on TIGGE

The International Grand Global Ensemble, formerly the THORPEX
Interactive Grand Global Ensemble (TIGGE), was an international
project aiming, among other things, to provide ensemble prediction
data from leading operational forecast centers
(Bougeault et al., 2010; Swinbank et al., 2016). Although the TIGGE
data set^{2}^{2}2The TIGGE data set can be retrieved from the ECMWF at
http://apps.ecmwf.int/datasets or from the Chinese Meteorological
Administration at http://wisportal.cma.gov.cn/wis/ includes 10
ensemble NWP models, only the four ensemble models available on a
daily basis at Météo-France have been retained (see
Table 1) as experts, so that the aggregation methods may later
be used in operations at Météo-France.

The TIGGE ensembles are available on a grid size is 0.5. Over France this amounts to a total of 267 grid points. The study period goes from the January, 2011 to the December, 2014 (so in the notations of Section 2). The lead times go from 6 h to 54 h depending on the ensemble, with a timestep of 6 h. The forecast are done at 1800 UTC for lead times from 6 h to 48 h, with a time step of 6 h. This implies that for experts based on CMC and ECMWF, whose runtime is 1200 UTC, the actual lead time is (see Table 1).

Each ensemble is an expert whose forecast CDF is the empirical CDF of the members associated with the same weight , where is the number of members in the ensemble.

Weather service | Members | Hour of the run used (UTC) | Lead times |
---|---|---|---|

Canadian Meteorological Center (CMC) | 21 | 1200 | 12h to 54h |

European Center for Medium-Range Weather Forecasts (ECMWF) | 51 | 1200 | 12h to 54h |

Météo-France (MF) | 35 | 1800 | 6h to 48h |

US National Centers for Environmental Prediction (NCEP) | 21 | 1800 | 6h to 48h |

### 4.2 The Twenty Four Experts built with EMOS Methods

Each ensemble is post-processed with two kinds of EMOS: non-homogeneous regression (NR, Gneiting et al. 2005, Hemri et al. 2014

) and quantile random forest (QRF,

Meinshausen 2006, Zamo et al. 2014, Taillardat et al. 2016).The forecast CDF produced with NR is parametric. Following Hemri et al. (2014), the square root of the forecast wind speed

is supposed to follow a normal distribution truncated at 0

(22) |

where and

are the mean and standard deviation of the square-root of the associated ensemble values, forecasted at time

. The four paremeters ( and ) are optimized by maximizing the log-likelihood^{3}

^{3}3With the function optim in R (R Core Team, 2015). over the last forecast days. In order to build several NR-post-processed ensembles with different reaction time to the raw ensemble’s performance changes, five sizes of the training window are used, as summarized in Table 2. These parametric CDFs can not be used as such in the framework of step-wise CDF aggregation. A step-wise is built from the squared quantiles of orders of the parametric CDF produced by NR. The last order is not in order to avoid infinite values.

QRF is a non parametric machine learning method that produces a set of quantiles of chosen order. A random forest is a set of regression trees built on a bootstrapped version of the training data

(Zamo et al., 2016). While building each tree, the observations are split in two most homogeneous groups (in terms of variance of the observation), according to a splitting criterion over an explanatory variable. At each split (or node) only a random subset of the complete set of explanatory variables are tested, until a stopping criterion is reached (such as a maximum number of final nodes, called leaves). When a forecast is required, a step-wise CDF is produced by going down the forest with the vector of explanatory variables, computing the step-wise CDF of the observations associated to each leaf and averaging those CDFs. In practice, one requests a set of quantile orders and gets the corresponding quantiles

^{4}

^{4}4In the R packages quantregForest or ranger.

. The requested quantile orders are the same as for NR, except the last order which is 1, because QRF cannot produce infinite values. Since the forecast CDF is a step-wise function, the obtained quantiles may contain many ties, that are suppressed by linearily interpolating between the points defining the CDF’s steps, as explained in

Zamo and Naveau (2018).The post-processing and aggregation methods are trained separately for each of the eight lead times and each grid point. Since NR is a parametric method with only four parameters, it can be trained with few data, thus a sliding-window training period is used. QRF, being a non parametric method, requires more training data to define the many splits in each tree. It is thus trained with 4-fold cross-validation: each year is successively used as a test sample, the reminaing three being used as the training sample.

From each of the four ensembles seven experts are built: the raw ensemble, the QRF-post-processed ensemble and five NR-post-processed versions of the ensemble (each with a different size of the training window, Table 2). Altogether, 28 experts are aggregated.

Quantile Regression Forest (QRF) | |
---|---|

Forecast distribution | Non-parametric (set of quantiles). |

Explanatory variables | Control member, ensemble mean, ensemble 0.1 and 0.9 quantiles, month. |

Training method | 4-fold cross-validation (3 training years, 1 test year). |

Orders of the forecast quantiles | |

Non-homogeneous Regression (NR) | |

Forecast distribution | Parametric (truncated normal distribution for the square-root of wind speed). |

Explanatory variables | Mean and standard deviation of the raw ensemble. |

Training method | Likelihood maximization on a sliding window over the previous days. Five windows are used: days |

Orders of the forecast quantiles | . |

### 4.3 The Observation

## 5 Results

The five aggregation methods presented in Section 3 have been investigated, with different values of the tuning parameters and . The values of the tuning parameters tested in the study are listed in Table 3. When two parameters are listed, all combinations have been tried.

Aggregation method | Parameters’ values |
---|---|

Minimum CRPS or Inverse CRPS | days |

Sharpness-calibration | days, m/s |

Exponentiated weighting or Exponentiated gradient weighting | days, |

Hereafter the best expert and the best aggregation method is selected following the two approaches mentioned above: the minimization of the average CRPS or the maximization of the proportion of grid points with a rank histogram deemed flat by the three flatness tests. The best forecasting system in terms of minimum CRPS is called the most skillful. The forecasting system which gets the maximum proportion of grid points with a flat rank histogram is called the most reliable.

### 5.1 CRPS and Reliability

From Table 4, the most skillful expert is the QRF-post-processed ECMWF ensemble (it is an oracle). The time series of the regret of each most skillful or reliable aggregation method of each type relatively to the QRF-post-processed ECMWF ensemble is drawn in Figure 2, for lead time 24 h. Whereas the most skillful setting of the SHARP aggregation method gets a consistently higher CRPS than the most skillful expert, the other aggregation methods manage to outperform the latter at least for some part of the four years. The most skillful settings of EWA and GRAD even get a negative regret relatively to the most skillful expert. The regret exhibits a trend and a diurnal cycle with the lead times (not shown), and so does the averaged CRPS (see Table 4): the forecast performance decreases with increasing lead times, particularly during the late afternoon when the wind strengthens. The main point is that, in terms of CRPS, post-processing improves performance, and aggregation further improves performance. According to the minimization of the CRPS, the chosen forecast method would be the most skillful GRAD setting, that is and days.

Method | Parameters | Lead time (h) | |||||||||

all | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | |||

RAW ECMWF | 0.76 | 0.79 | 0.79 | 0.73 | 0.73 | 0.78 | 0.78 | 0.73 | 0.74 | ||

Selection: most skillful forecasting system. | |||||||||||

expert (QRF ECMWF) | 0.49 | 0.47 | 0.46 | 0.48 | 0.50 | 0.49 | 0.49 | 0.52 | 0.53 | ||

SHARP | 1095 | 0.55 | 0.52 | 0.52 | 0.55 | 0.56 | 0.55 | 0.55 | 0.59 | 0.60 | |

GRAD | -1 | t-1 | 0.47 | 0.44 | 0.44 | 0.46 | 0.47 | 0.47 | 0.47 | 0.51 | 0.51 |

EWA | -1 | 365 | 0.48 | 0.44 | 0.44 | 0.47 | 0.48 | 0.47 | 0.48 | 0.52 | 0.52 |

INV | 7 | 0.49 | 0.46 | 0.46 | 0.47 | 0.48 | 0.49 | 0.49 | 0.52 | 0.52 | |

MIN | 365 | 0.51 | 0.47 | 0.47 | 0.51 | 0.52 | 0.50 | 0.51 | 0.56 | 0.56 | |

Selection: most reliable forecasting system. | |||||||||||

expert (QRF MF) | 0.50 | 0.46 | 0.47 | 0.51 | 0.50 | 0.49 | 0.50 | 0.55 | 0.53 | ||

SHARP | 1095 | 0.55 | 0.52 | 0.52 | 0.55 | 0.56 | 0.55 | 0.55 | 0.59 | 0.60 | |

GRAD | 0.5 | t-1 | 0.50 | 0.46 | 0.46 | 0.50 | 0.50 | 0.49 | 0.50 | 0.54 | 0.54 |

EWA | 0.5 | 30 | 0.50 | 0.46 | 0.46 | 0.50 | 0.50 | 0.50 | 0.50 | 0.55 | 0.54 |

INV | 7 | 0.49 | 0.46 | 0.46 | 0.47 | 0.48 | 0.49 | 0.49 | 0.52 | 0.52 | |

MIN | 365 | 0.51 | 0.47 | 0.47 | 0.51 | 0.52 | 0.50 | 0.51 | 0.56 | 0.56 |

If one uses the flatness of rank histograms as the selection criterion, the best raw ensemble is CMC, the best expert is QRF-post-processed MF and the best aggregated forecast is EWA with and days (see Table 5). Raw CMC is not a very reliable ensemble, as shown in Figure 3 LABEL:sub@fig:aggr_rkhistmaps_raw for lead time

h: the ensemble is consistently biased with too strong forecast wind speeds in the north-west of France and too weak forecast wind speeds over the Alps and the Pyrenees. Elsewhere, although the ensemble is much less biased, the rank histogram is not deemed flat due to an obvious U-shape. The rank histograms at the other lead times and for the other raw ensembles exhibit similar features (not shown). For the most reliable expert and aggregated forecasts, the rank histograms are computed with the nine forecast deciles. As illustrated in Figure

3 LABEL:sub@fig:aggr_rkhistmaps_nrinf, the QRF-post-processed version of the MF ensemble yields a higher number of flat rank histograms than the raw CMC ensemble. Finally, Figure 3 LABEL:sub@fig:aggr_rkhistmaps_aggr shows that, the JP tests do not reject the flatness hypothesis at many more grid-points for the most reliable EWA forecast. Table 5 confirms quantitatively that the most reliable EWA outperforms the most reliable expert in terms of flatness of the rank histogram, at each lead time.The most reliable setting of the other aggregation methods produce fewer flat rank histograms than the most reliable expert, for all lead times, as shown in Table 5. In other words, post-processing improves reliability over raw ensemble, and aggregation may further improve reliability over the most reliable expert if the right setting is chosen.Method | Parameters | Lead time (h) | |||||||||

all | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | |||

Selection: most skillful forecasting system. | |||||||||||

expert (QRF ECMWF) | 0.60 | 0.26 | 0.17 | 0.83 | 0.89 | 0.41 | 0.43 | 0.95 | 0.86 | ||

SHARP | 1095 | 0.39 | 0.28 | 0.22 | 0.39 | 0.36 | 0.53 | 0.45 | 0.45 | 0.41 | |

GRAD | -1 | t-1 | 0.05 | 0.06 | 0.06 | 0.03 | 0.04 | 0.06 | 0.05 | 0.03 | 0.04 |

EWA | -1 | 365 | 0.04 | 0.04 | 0.03 | 0.03 | 0.04 | 0.06 | 0.06 | 0.05 | 0.04 |

INV | 7 | 0.01 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.00 | 0.01 | |

MIN | 365 | 0.82 | 0.83 | 0.79 | 0.84 | 0.69 | 0.88 | 0.87 | 0.92 | 0.75 | |

Selection: most reliable forecasting system. | |||||||||||

expert (QRF MF) | 0.92 | 0.76 | 0.89 | 0.99 | 0.99 | 0.87 | 0.87 | 1.00 | 0.99 | ||

SHARP | 1095 | 0.39 | 0.28 | 0.22 | 0.39 | 0.36 | 0.53 | 0.45 | 0.45 | 0.41 | |

GRAD | 0.5 | t-1 | 0.67 | 0.46 | 0.43 | 0.91 | 0.79 | 0.46 | 0.51 | 0.98 | 0.79 |

EWA | 0.5 | 30 | 0.97 | 0.96 | 0.90 | 0.98 | 0.99 | 0.97 | 0.97 | 0.98 | 0.98 |

INV | 7 | 0.01 | 0.03 | 0.02 | 0.00 | 0.01 | 0.02 | 0.01 | 0.00 | 0.01 | |

MIN | 365 | 0.82 | 0.83 | 0.79 | 0.84 | 0.69 | 0.88 | 0.87 | 0.92 | 0.75 |

The two selection approaches choose a different “best” forecast. The GRAD aggregation method with and days is the most skillful forecast, whereas EWA with and days is the most reliable forecast. In terms of averaged CRPS over all lead times, the most skillfull forecast is about 6 % more performant than the most reliable forecast, as computed from Table 4 (0.47 m/s versus 0.50 m/s). Table 5 shows that the most skillful forecast passes the three flatness tests for only of grid points, whereas the most reliable forecast produces about of flat rank histograms. In other words, selecting the most reliable forecast as best forecast, instead of the most skillful, leads to a slightly worse CRPS but increases dramatically the number of grid points with a flat rank histogram. This discrepancy between the two criteria for choosing the best forecasts is discussed more deeply in Section 6. For decision making, it is important and easier to have a reliable forecast, since this allows us to take the forecast as the true distribution of the observation and make optimal decision in terms of economic returns. Therefore, in the following the best retained forecast is EWA with and days.

### 5.2 Temporal Variation of the Weights

The most reliable EWA aggregation is able to quickly redistribute the aggregation weights between the experts, as illustrated in Figure 4. For instance, during the middle of 2011, nearly all the aggregation weight shifts from the QRF-post-processed MF ensemble to the QRF-post-processed ECMWF ensemble, in a few days. The aggregation weights can also remain stable for long periods of time, such as around mid-2014, when the QRF-post-processed MF ensemble keeps a high weight for about two months. Moreover, although the raw ensembles do not perform well on their own, the EWA method may find periods where the raw ensembles can significantly contribute to the aggregated forecasts, such as in late 2012 for the raw CMC ensemble. Last, the time series of the aggregation weights is very different from one lead time to another (not shown). These features prove that this aggregation method is very adaptive. This may be very useful for operations when an ensemble undergoes important changes: the aggregation method will quickly detect a modification in performances and adjust its weighting accordingly.

### 5.3 Aggregation of individual sorted experts

Baudin (2015) and Thorey (2017) defines experts as the sorted values of the pooled raw ensembles, and aggregate step-wise CDFs with one step. To compare our approach to their’s, we aggregated the sorted values of the raw ensembles. This amounts to 128 experts. We also aggregated the sorted values of the raw and post-processed ensembles. This amounts to 2,552 experts (the 128 values forecasted by the raw ensembes plus 101 quantiles for each of the 24 post-processed forecasts). A comprehensive comparison with the approach in Baudin (2015) being out of the scope of this article, only the most reliable aggregation method (EWA, with and days) has been used.

The resulting CRPSs are shown in Table 6. The forecast skill is improved compared to the raw ensembles (compare with Table 4). Even if no post-processed ensemble is used, the CRPS is improved by the aggregation. But adding more experts (among which are post-processed ensembles) further increases the performance improvement. However, the CRPS stays higher than the CRPS of the EWA aggregation of multi-step CDFs (0.52 m/s versus 0.50 m/s with our most reliable EWA). Although the differences in CRPS is low, the difference between the two approaches proves much more important in terms of proportion of flat rank histograms. Whereas our most reliable EWA gets about 97% of flat rank histograms (see Table 5), the aggregation with EWA of one-step CDFs gets no flat rank histogram whatever lead time is considered. All the p-values of the Jolliffe-Primo tests are lower than (with a lag-1 autocorrelation of the rank around 0.12). Further investigations would be required to check if this holds for other settings of EWA and for other aggregation methods. A possible explanation of this lack of reliability would be the fact that, for 1-member ensembles, the CRPS reduces to the mean absolute error (MAE). Since the MAE is minimized by the conditional median of the observation given the forecast, each expert is weighted according to its ability to predict well the conditional median, which is not what is looked for.

Aggregated experts | Lead time (h) | ||||||||
---|---|---|---|---|---|---|---|---|---|

all | 6 | 12 | 18 | 24 | 30 | 36 | 42 | 48 | |

Sorted raw members () | 0.58 | 0.57 | 0.57 | 0.57 | 0.57 | 0.58 | 0.58 | 0.61 | 0.60 |

Sorted raw and post-processed members () | 0.52 | 0.48 | 0.49 | 0.50 | 0.51 | 0.52 | 0.53 | 0.56 | 0.56 |

## 6 Discussion about Probabilistic Forecast Selection

The discrepancy between the choice of the best forecast according to the CRPS and according to the flatness of rank histograms is now more fully investigated and discussed.

Although the CRPS is a natural measure of performance for forecast CDF, minimizing it does not ensure to get the highest number of grid-points with a flat histogram, as stated earlier and as confirmed in Figure 5 (left). In this figure, for each lead time, EWA and GRAD reach a minimum average CRPS for very low proportions of flat rank histograms. Actually, the point corresponding to this optimal CRPS is indicated by the graph of INV in Figure 5 (left). The graph for INV seems to reduce to a point because the associated CRPS and proportion of flat rank histograms barely vary. For post-processed ensembles, the same behavior can be observed (see Appendix D): the most skillful expert may not be the most reliable one.

This discrepancy can be explained with the decomposition of the average CRPS as a sum of a reliability term (that must be minimized) minus a resolution term (that must be maximized) and an uncertainty term (that does not depend on the forecasting system). The reliability and the resolution terms in the average CRPS after post-processing or aggregation have very different ranges of variation from one setting to another, as deduced from Figure 5 for the aggregated forecasts. For instance, at a fixed lead time, the reliability term of the MIN methods varies by less than 0.01 m/s while the CRPS varies by 0.05 m/s. Since the uncertainty term depends only on the observation and is fixed for a fixed lead-time, the CRPS varies mainly because of the variations in the resolution term. In other words, in this case, the average CRPS is mostly a measure of resolution and its minimization as a selection criterion leads to maximize the resolution. But the distributions may not be reliable since a small change in the reliability term will not change a lot the average CRPS while being compatible with unreliable flat rank histograms according to the JP-tests (see right side of Figure 5). This comment is also true for the other aggregation methods. It also holds for post-processed ensembles, whose reliability term may change by about 0.01 m/s whereas the average CRPS varies by 0.2 m/s with very different proportions of reliable grid points (see Appendix D).

Another part of the origin of the discrepancy between the two selection approaches is that the CRPS quantifies the forecast performance over the whole distribution, while the JP tests assess only forecast performance for the tested shapes of the rank histogram. Firstly, the rank histogram takes only into account the ranking of the observation and the members (or quantiles) while the relability term of the CRPS takes into account the distance between the observation and the members (or quantiles). Secondly, limiting the flatness criterion to the absence of slope, convexity and wave shapes may lead to miss other important deviations to flatness in the rank histogram. One could use more flatness tests to add constraints on the forecasting system selected on a reliability criterion. It would be interesting investigate how it changes the solution.

Finally, the sampling noise may add further differences between the solutions selected by the two approaches. The CRPS and the flatness tests may not react in the same way to this sampling. The JP-test being hypothesis testing it has the same limit as every tests. Rejection of the flatness hypothesis may be due to other features than a lack of reliability (Wasserstein et al., 2019).

In conclusion, the selection of the best forecast should not be made only by minimizing the CRPS, but also by taking care of the actual reliability of the forecasts, as tested with the JP tests. Both performance critera should be used. Being only focused on the CRPS may lead to choosing a forecasting system that is not optimal in terms of reliabilty as shown here and earlier studies (Collet and Richard, 2017; Wilks, 2018). But relying only on the JP flatness tests may also be misleading. Indeed, always forecasting the climatology leads to a very good reliability but a very low resolution. Consequently, a forecaster may be tempted to forecast the climatology in order to get a good forecast, instead of issuing a forecast he might consider more likely but too different from the climatology, and too risky to issue. Choosing a forecast based on the JP tests only may lead to such hedging strategies. In this study, hedging is avoided by using the CRPS to tune the experts and the aggregation. But the reliability of the chosen forecast is ensured by using the JP tests.

## 7 Conclusion and Perspectives

The first goal of the present study was to adapt the theory of prediction with expert advice to the case of experts issuing probabilistic forecasts as step-wise CDFs with any number of steps. Contrary to the work of Baudin (2015) who aggregated unidentifiable experts built by pooling and sorting members of several ensembles, each expert used in the present work is identifiable over time as required by the theoretical framework of prediction with expert advice: the aggregation weights for the members or quantiles of the same expert are constrained to be equal. Some formulae of Baudin (2015) valid for step-wise CDFs with one step have been generalized to the case of step-wise CDFs with any number of steps.

Several aggregation methods to combine step-wise forecast CDFs have been presented and compared in terms of reliability and CRPS. The reliability has been assessed by using the Jolliffe-Primo tests, which detect the presence in the rank histogram of typical deviations from flatness. The systematic use of the Jolliffe-Primo flatness test highlights that the minimization of the CRPS as the main criterion to calibrate or aggregate may not produce the maximum number of flat rank histograms. It is also shown that choosing the best forecast by maximizing the proportion of rank histograms ensures reliable forecasts, without significantly increasing the CRPS.

On a real wind speed data set, the best aggregation method, in terms of proportion of flat rank histograms, is the exponentially weighted average forecaster, with a learning rate and an aggregation window days. This aggregated forecast has a similar CRPS as the most skillful expert in terms of CRPS, and produces many more flat rank histograms than the most reliable expert. The method can produce weights with very different temporal patterns: rapidly evolving weighting of the experts, long period of constant weighting, short period with large weights for the raw ensembles. With the use of experts fitted over sliding windows of different size, this flexibility may help to solve a recurrent problem in post-processing: important changes in the NWP models that may make the post-processing equation inadequate for the new version of the NWP model. Although EWA is the selected aggregation method in this study, this must not be taken as a result valid for other observations and/or geographical domains. To the best of our knowledge, no theoretical results allow us to tell among the many available aggregation methods which one will be the best on a given dataset.

As for the perspectives, it is planned to study the same and other aggregation methods by pooling data in blocks of nearby grid-points. This may improve the fit by enlarging the training sample or, at the very least, speed up operations on finer grids with thousands of points, as was demonstrated for deterministic forecasts (Zamo et al., 2016). Since post-processing of other meteorological parameters, such as temperature and rainfall, has been already tested internally at Météo-France, aggregation methods will be tried on these variables too. A more comprehensive study of the discrepancy between the CRPS and the proportion of flat rank histograms as a performance criterion and its implication on post-processing and aggregation constitutes a more theoretical perspective. A further line of research would be to expand on the proposed approach to post-processing and aggregation. Indeed, the retained criterion in this study (maximizing the number of grid points with a flat rank histogram) does not allow us to choose a different forecast for each grid point, which may be desirable to further improve forecast performance.

## Appendix

## Appendix A Statistical Tests of Flatness of a Rank Histogram

Consider the vector of normalized deviation from flatness in each rank of some rank histogram at hand,

where is the number of possible ranks, is the count of rank and is the theoretical count in each rank for a flat histogram.

Under the null hypothesis

that the rank histogram is compatible with a flat histogram up to sampling noise, the squared norm follows a distribution with degrees of liberty. The flatness of the rank histogram can be tested with a chi-square test.The chi-square test statistic

is insensitive to the shape of the deviations to a flat histogram, as shown in Figure 6. To build this figure, as in Elmore (2005), 60 integer values from 1 to 16 have been drawn from a uniform distribution. Four histograms are shown: the histogram computed with the raw sample (top left), with the same counts sorted in ascending order (top right), with the counts reassigned to have a peak-shaped histogram (bottom left), and with the counts reassigned in a wave shape (bottom right). The p-value of the chi-square test and three other flatness tests presented below is reproduced under the histograms. Although the counts of each rank are reorganized, the p-value of the chi-square test of the four histograms is the same, because reordering the counts is equivalent to reordering the components of

, which does not change its norm. Because of this, in our study, the flatness of each rank histogram is assessed with the decomposition of the chi-square test statistic, as detailed in Jolliffe and Primo (2008). Under the null hypothesis , any projection of onto an orthonormal basis of has components whose squares are asymptotically independentrandom variables, each with 1 degree of freedom. If the basis vectors are chosen to describe a sloped histogram, a convex histogram, or any other shape of interest, the existence of the shape in the rank histogram can be tested. The existence of a shape is not rejected if the projection of

onto the corresponding basis vector has a component statistically different from 0.Jolliffe and Primo (2008) give formulae to compute the basis vectors for deviations from flatness commonly encountered on real data. As an example, if , the basis vector for the slope (resp. convexity) test is proportional to (resp. ). In our study, three tests are used, the slope and convexity tests, and the “wave” test not described in Jolliffe and Primo (2008). This last test assesses the presence of a deviation from flatness in the shape of a tilde, that was frequently observed in the literature (Scheuerer et al., 2015; Baran and Lerch, 2016; Taillardat et al., 2016) and in internal studies at Météo-France. The corresponding basis vector is built thanks to the Grahm-Schmidt process as follows: the vector is made orthogonal to the slope basis vector, and the resulting vector is normalized to get the basis vector for testing the presence of a wave shape. In Figure 6, the p-values for the test of existence of a slope, a convexity or a wave are in agreement with the shape of the histograms. For instance, the low p-value of the slope test (top right) rejects flatness against slope as expected.

## Appendix B Proof of the Bounds for the Regret of the Exponentially Weighted Average Forecaster

The proof closely follows the proof of theorem 2.2 in Cesa-Bianchi et al. (2006).

Let be a real-valued, bounded loss function. is supposed convex in its first argument.

The EWA weights at time are computed as

with the cumulative loss of expert at time , and with the convention so that .

Let us define and . At all times , and using the convention that a sum over 0 elements is 0 (for , such that ),

(23) |

The proof now needs Hoeffding’s inequality (Hoeffding, 1963). Let with . Let be a bounded random variable with values in , then, , Hoeffding’s inequality states that

Using Equation (23) and Hoeffding’s inequality for the random variable taking the values with discrete probability , taking and summing over leads to

after using the convexity of the loss function in its first argument, and the definition of the EWA forecast.

Noting that the following relationship also holds

and combining it with the previous relationship leads to

Finally, dividing by results in the following bound of the regret of the aggregated forecast relatively to the best expert

(24) |

Noting that this bound for the regret holds for any bounded loss function convex in its first argument, which is a propery of the CRPS, concludes the demonstration.

## Appendix C Formula for the Gradient of the CRPS

Baudin (2015) considers the aggregation of step-wise CDFs with one single step (). We generalize equations (5.10) and (5.13) of Baudin (2015) for, respectively, the CRPS and gradient thereof, to an aggregation of step-wise CDFs with any number of steps.

Dropping the time index in the notations, the aggregated CDF at time is

with the notation .

Therefore, the CRPS of the aggregated CDF at time is

where and .

By developing the square inside the integral,

Noting that , and , then

because , and .

Since ,

Comments

There are no comments yet.