# Extreme events evaluation using CRPS distributions

Verification of ensemble forecasts for extreme events remains a challenging question. The general public as well as the media naturally pay particular attention to extreme events and judge the overall predictive performance of ensembles on them, although ensembles are often unskillful precisely when they are most needed. Pushing classical verification tools to focus on such events can lead to unexpected behaviors. To counteract these effects, thresholded and weighted scoring rules have been developed. Most of them use derivations of the Continuous Ranked Probability Score (CRPS). However, some properties of the CRPS for extreme events generate undesirable effects on the quality of verification. Using theoretical arguments and simulation examples, we illustrate some pitfalls of conventional verification tools and propose a different direction to assess ensemble forecasts using extreme value theory, considering proper scores as random variables.


## 1 Introduction

In a pioneering paper on forecast verification, Murphy (1993) distinguished three types of forecast “goodness”: the quality, which quantifies the adequacy to what actually happened; the consistency, based on the fidelity to expert knowledge; and the value, which describes how the forecast helps the decision maker to proceed efficiently. The quality of a forecast is often summarized by one scalar. For example, to identify the best forecast, one classically takes the mean of proper scoring rules over a validation period (see, e.g., Matheson and Winkler, 1976; Gneiting and Raftery, 2007; Schervish et al., 2009; Tsyplakov, 2013). Proper scoring rules can be decomposed in terms of reliability, uncertainty and resolution. Several examples of such decompositions can be found in Hersbach (2000) and Candille and Talagrand (2005). Bröcker (2015) showed that resolution is strongly linked with discrimination. Resolution and reliability can also be merged into the term calibration, and Gneiting et al. (2007) suggested maximizing the sharpness subject to calibration. Note that the sharpness is the spread of the forecast, and it is a property of the forecast only. In ensemble forecast verification, the most popular scoring rule is the Continuous Ranked Probability Score (CRPS) (see, e.g., Epstein, 1969; Hersbach, 2000; Bröcker, 2012) and it can be defined as

$$\begin{aligned}
\mathrm{CRPS}(F,y) &= \int_{-\infty}^{\infty}\left(F(x)-\mathbb{1}\{x\ge y\}\right)^2\,\mathrm{d}x, \qquad (1)\\
&= \mathbb{E}_F|X-y| - \tfrac{1}{2}\,\mathbb{E}_F|X-X'|,\\
&= y + 2\,\bar{F}(y)\,\mathbb{E}_F\!\left(X-y \mid X>y\right) - 2\,\mathbb{E}_F\!\left(X F(X)\right),
\end{aligned}$$

where $\bar F = 1-F$, and $X$ and $X'$ are two independent random copies coming from a given continuous cumulative distribution function (cdf) $F$. Hence, the CRPS is a proper¹ score that makes the link between the observed value $y$ and the forecast distribution $F$. The second line in Equality (1) highlights the two terms of calibration and sharpness.

¹A proper score like the CRPS satisfies $\mathbb{E}_G[\mathrm{CRPS}(G,Y)] \le \mathbb{E}_G[\mathrm{CRPS}(F,Y)]$ for any cdfs $F$ and $G$, where $Y$ has cdf $G$.
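The second line of (1) gives a simple Monte Carlo estimator of the CRPS from an i.i.d. sample of the forecast distribution. A minimal sketch in Python (the helper name `crps_ensemble` and the simulation settings are ours, not the paper's):

```python
import numpy as np

def crps_ensemble(sample, y):
    """Monte Carlo estimate of CRPS(F, y) via the energy form
    E_F|X - y| - 0.5 * E_F|X - X'| (second line of (1))."""
    x = np.asarray(sample, dtype=float)
    term1 = np.mean(np.abs(x - y))
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))
    return term1 - term2

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=2000)
# Closed form for a standard normal forecast at y = 0 is
# (sqrt(2) - 1) / sqrt(pi) ~ 0.2337; the estimate should be close.
print(crps_ensemble(sample, 0.0))
```

The pairwise term is a U-statistic, so the estimator converges at the usual Monte Carlo rate; for large ensembles one would replace the full pairwise matrix by the sorted-sample formula.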

Regarding extremes verification, it is important to counteract some cognitive biases bound to discredit skillful forecasters (examples of cognitive biases can be found in Kahneman and Tversky (1979); Morel (2014)). That is what Lerch et al. (2017) call the “Forecaster’s dilemma”. Citing these authors, “In the public, forecast evaluation often only takes place once an extreme event has been observed, in particular, if forecasters have failed to predict an event with high economic or societal impact”. Indeed, the only remedy is to consider all available cases when evaluating predictive performance. Proper weighted scoring rules (Gneiting and Ranjan, 2011; Diks et al., 2011) attempt to emphasize predefined regions of interest. Following the notation of Equation (1), the weighted CRPS can be defined as

$$\begin{aligned}
\mathrm{wCRPS}(F,y) &= \int_{-\infty}^{\infty}\left(F(x)-\mathbb{1}\{x\ge y\}\right)^2 w(x)\,\mathrm{d}x, \qquad (2)\\
&= \mathbb{E}_F|W(X)-W(y)| - \tfrac{1}{2}\,\mathbb{E}_F|W(X)-W(X')|,
\end{aligned}$$

where $w$ is a non-negative weight function and $W$ is an antiderivative of $w$, i.e. $W'(x)=w(x)$. Alternatively, the weighted CRPS can also be expressed as follows:

$$\mathrm{wCRPS}(F,y) = W(y) + 2\,\bar F(y)\,\mathbb{E}_F\!\left(W(X)-W(y)\mid X>y\right) - 2\,\mathbb{E}_F\!\left(W(X)F(X)\right), \qquad (3)$$

as soon as the weight function is continuous (see proof in Appendix 6.1). Who among different users (e.g., forecast users and forecasters) should choose this weight function remains a complex issue (see, e.g., Ehm et al., 2016; Gneiting and Ranjan, 2011; Patton, 2014). Even in the case where $w$ can be objectively chosen with respect to the application at hand, one can wonder whether the corresponding weighted CRPS captures well the extreme behavior of the observational records, i.e., whether it discriminates between two competitive forecasts with respect to extreme events. This leads to the question of how to model accurately the distributional features of the forecast and observational vectors.

In this work, we move away from looking at averages like the properness of a score. Instead we propose a different framework that considers the observational vector, not as a realization, but as a random variable with a specific extreme value behavior. This change of view with regards to scoring rules brings us to study the distribution of the CRPS itself. It naturally suggests to bridge the random variable CRPS with the field of Extreme Value Theory (EVT).

The foundations of this theory were laid in De Haan (1970); see, e.g., the books of Embrechts et al. (1997); Beirlant et al. (2004); De Haan and Ferreira (2007). EVT provides probabilistic models to represent the distributional behavior of large values, i.e., excesses above a large threshold. Roughly speaking, it is based on the survival function of the so-called Generalized Pareto (GP) distribution with shape parameter $\gamma$, defined by

$$\bar H_\gamma(x/\sigma) = \left(1 + \gamma\,\frac{x}{\sigma}\right)^{-1/\gamma},$$

for each $x$ such that $1+\gamma x/\sigma > 0$, where $\sigma > 0$ represents a scale parameter. For $\gamma = 0$, interpreted as the limit $\gamma\to 0$, this can be viewed as the classical exponential tail $\exp(-x/\sigma)$. The fundamental property of this GP family is its stability with respect to thresholding (see, e.g., Embrechts et al., 1997, Theorem 3.4.13 (c)). As pinpointed by Friederichs and Thorarinsdottir (2012), it is possible to express the CRPS explicitly whenever $F$ is a GP distribution and the real $y$ is fixed. Starting from this link between CRPS and extreme values, it would be of interest to know the behavior of the CRPS when the observational vector becomes a random variable that takes very large values.
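The threshold-stability property can be checked directly from the survival function above: excesses over a level $u$ are again GP, with the same shape $\gamma$ and rescaled scale $\sigma + \gamma u$. A small numerical sketch (helper name and parameter values are ours):

```python
import numpy as np

def gp_sf(x, gamma, sigma):
    """Survival function of the GP distribution; exponential tail when gamma = 0."""
    if gamma == 0.0:
        return np.exp(-x / sigma)
    return (1.0 + gamma * x / sigma) ** (-1.0 / gamma)

# Stability with respect to thresholding: P(X - u > z | X > u) is again a GP
# survival function with the same shape gamma and scale sigma + gamma * u.
gamma, sigma, u, z = 0.25, 1.0, 3.0, 2.0
conditional = gp_sf(u + z, gamma, sigma) / gp_sf(u, gamma, sigma)
print(conditional, gp_sf(z, gamma, sigma + gamma * u))
```

The two printed values agree to machine precision, which is exactly the stability invoked throughout the paper.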

This work is organized as follows. In Section 2 we point out some undesirable properties of the CRPS and its weighted counterpart. We derive the non-tail equivalence of the CRPS and highlight potential difficulties of using the CRPS for extreme weather evaluation. Section 3 sets the theoretical framework in which the temporal aspects of scoring rules are stressed. Section 4 links the observational tail behavior with the CRPS tail behavior. An index which captures some quality with respect to extreme forecasts is introduced. The work closes with a discussion in Section 5.

## 2 Tail equivalence, weighted CRPS and choice of a weight function

### 2.1 Tail equivalence and (weighted) CRPS

Focusing on the upper tail behavior, it is convenient to recall the definition of tail equivalence between two cdfs $F$ and $G$ (see, e.g., Embrechts et al., 1997, Section 3.3). They are said to be tail equivalent if they share the same upper endpoint $x^*$ and if their survival functions $\bar F$ and $\bar G$ satisfy $\lim_{x\to x^*} \bar F(x)/\bar G(x) = c$ for some constant $c\in(0,\infty)$. Another useful EVT concept is the notion of domain of attraction: a distribution $F$ is said to belong to the domain of attraction of the GP distribution $H_\gamma$, denoted $F\in\mathcal{D}(H_\gamma)$, if for some shape parameter $\gamma$ and some positive auxiliary function $b(\cdot)$, the rescaled survival function converges, as $u$ tends to the upper endpoint, to a GP survival function with shape parameter $\gamma$, i.e.

$$\frac{\bar F\!\left(u+z\,b(u)\right)}{\bar F(u)} \longrightarrow \bar H_\gamma(z), \qquad (4)$$

for each $z$ such that $1+\gamma z > 0$. Even if the weighted CRPS is proper, there is no guarantee of tail equivalence².

²This may explain the weakness in estimating the shape parameter of a GP distribution via CRPS-based inference; see Friederichs and Thorarinsdottir (2012).

More precisely, for any positive $\epsilon$, it is always possible to construct a cdf $F$ that is not tail equivalent to $G$ such that

$$\left|\mathbb{E}_G\!\left[\mathrm{wCRPS}(G,Y)\right] - \mathbb{E}_G\!\left[\mathrm{wCRPS}(F,Y)\right]\right| \le \epsilon. \qquad (5)$$

This result is proven in Appendix 6.2. The implication of (5) is that a forecast $F$ could have a misspecified tail (one which is not tail equivalent to the “true” forecast $G$) and still score almost as well as the true forecast. To illustrate this inability to distinguish among different tail behaviors, the upcoming section investigates a particular choice of weight function. This choice will also be considered later in Section 3 from a forecast comparison perspective.

### 2.2 The quantile weight function

Since our interest concerns jointly the CRPS and the upper tail behavior, the quantile weight function appears as a natural candidate to highlight upper tail features. It is simply defined as $w(x) = \mathbb{1}\{x \ge q\}$ for any real $q$. The following lemma, proven in Appendix 6.3, links the CRPS and its weighted version.

###### Lemma 1

The weighted quantile CRPS defined by

$$\mathrm{wCRPS}(F,y;q) = \int_{-\infty}^{\infty}\left(F(x)-\mathbb{1}\{x\ge y\}\right)^2 \mathbb{1}\{x\ge q\}\,\mathrm{d}x,$$

where $q$ represents any real number, can be written as

$$\mathrm{wCRPS}(F,y;q) = \int_q^{\infty}\bar F^2(x)\,\mathrm{d}x +
\begin{cases}
0, & \text{if } q > y,\\
\mathrm{CRPS}(F,y)-\mathrm{CRPS}(F,q), & \text{if } y \ge q.
\end{cases}$$

The following conditional equality in distribution holds for any random variable $Y$:

$$\left[\mathrm{CRPS}(F,Y)\mid Y\ge q\right] \stackrel{d}{=} \left[\mathrm{wCRPS}(F,Y;q) - c_F(q)\mid Y\ge q\right], \qquad (6)$$

where the constant

$$c_F(q) = \mathrm{CRPS}(F,q) - \int_q^{\infty}\bar F^2(x)\,\mathrm{d}x$$

only depends on the forecast distribution $F$ and the constant $q$.

In terms of extremes, Equation (6) tells us that the distributional behavior of this weighted CRPS above the threshold $q$ is identical to that of the classical CRPS conditional on $Y \ge q$, up to the constant $c_F(q)$.
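Lemma 1 can be checked by brute-force numerical integration. A sketch with an Exp(1) forecast as the test case (the helper names `num_int`, `crps`, `wcrps` and the integration bounds are ours), comparing both sides of the identity for $y \ge q$:

```python
import numpy as np

def num_int(f, a, b, n=200_001):
    """Trapezoidal rule on a uniform grid."""
    x = np.linspace(a, b, n)
    fx = f(x)
    return float(np.sum((fx[1:] + fx[:-1]) * 0.5 * np.diff(x)))

lam, y, q, top = 1.0, 2.5, 1.0, 60.0   # Exp(1) forecast, with y >= q
F = lambda x: np.where(x < 0, 0.0, 1.0 - np.exp(-lam * x))

def crps(y_):       # integral definition (1)
    return num_int(lambda x: (F(x) - (x >= y_)) ** 2, -1.0, top)

def wcrps(y_, q_):  # quantile weight w(x) = 1{x >= q}
    return num_int(lambda x: (F(x) - (x >= y_)) ** 2 * (x >= q_), -1.0, top)

# Lemma 1, case y >= q:
# wCRPS(F, y; q) = int_q^inf Fbar^2(x) dx + CRPS(F, y) - CRPS(F, q)
lhs = wcrps(y, q)
rhs = num_int(lambda x: np.exp(-2 * lam * x), q, top) + crps(y) - crps(q)
print(lhs, rhs)
```

Up to discretization error at the indicator's jump, the two sides coincide; analytically the identity is exact, since both sides reduce to $\int_q^y F^2 + \int_y^\infty \bar F^2$.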

### 2.3 The CRPS for the Pareto case

When both the forecast (with cdf $F$) and the observations (with cdf $G$) are Pareto distributed, explicit computations of the CRPS can be made; see the following lemma and its proof in Appendix 6.4.

###### Lemma 2

Assume that $Y \sim G$, and that $F$ and $G$ are GP distributions with respective shape parameters $\xi \ge 0$ and $\gamma \ge 0$ and scale parameters $\beta > 0$ and $\sigma > 0$, i.e., with respective survival functions $\bar F(y) = (1+\xi y/\beta)^{-1/\xi}$ and $\bar G(y) = (1+\gamma y/\sigma)^{-1/\gamma}$ for any $y > 0$. If $\gamma < 1$ and $\xi < 2$, then

$$\mathbb{E}_G\!\left[\mathrm{CRPS}(F,Y)\right] = \frac{\sigma}{1-\gamma} + 2\beta\left[\frac{1}{2(2-\xi)} - \frac{\gamma}{\gamma+\xi-\gamma\xi}\right].$$

This gives the minimum CRPS value for $F = G$, i.e. $\xi = \gamma$ and $\beta = \sigma$:

$$\mathbb{E}_G\!\left[\mathrm{CRPS}(G,Y)\right] = \frac{\sigma}{(2-\gamma)(1-\gamma)}.$$

In particular, this lemma tells us that if the GP forecast parameters are proportional to the GP ideal parameters, i.e. $\xi = a\gamma$ and $\beta = a\sigma$ for some $a > 0$, then we can study the effect of changing the forecast tail behavior captured by $\xi$ and the forecast spread encapsulated in $\beta$. In this case, the CRPS can be written as a function of $a$, say

$$\mathbb{E}_G\!\left[\mathrm{CRPS}(F,Y)\right] = \phi_\gamma(a) = \frac{\sigma}{1-\gamma} + 2a\sigma\left[\frac{1}{2(2-a\gamma)} - \frac{1}{1+a-a\gamma}\right]. \qquad (7)$$

Figure 1 shows how this CRPS varies as a function of $a$ ($x$-axis) for fixed $\sigma$ and two different values of $\gamma$ (left and right panels).

The important point is that no meaningful conclusion about the upper tail predictions of two competing forecasters can be made within the blueish “cup” region. Inside this ambiguous region, the same value of $\phi_\gamma(a)$ ($y$-axis) can be produced by two different values of $a$: one greater than one, corresponding to a forecaster prone to a risky strategy, and one smaller than one, corresponding to a risk-averse forecaster. The spread of this region over $a$ is not small. The left panel in Figure 1 implies that two hypothetical forecasters, one inflating the true extreme quantiles and the other deflating them, cannot be differentiated, as both obtain nearly the same expected CRPS. In other words, the difference between two inaccurate forecasters with opposite risk views has to be very pronounced for them to be correctly ranked by the CRPS. To numerically assess this drawback, we can compute the area of this ambiguous region. The area delimited by the blueish cup can be written as

$$\begin{aligned}
A(\gamma) &= \phi_\gamma(0)\,a_0 - \int_0^{a_0}\phi_\gamma(a)\,\mathrm{d}a,
\quad\text{with } a_0 = \frac{3}{1+\gamma} \text{ and } \phi_\gamma(a_0) = \phi_\gamma(0),\\
&= 2\sigma\left[\int_0^{a_0}\frac{a}{1+a-a\gamma}\,\mathrm{d}a - \int_0^{a_0}\frac{a}{2(2-a\gamma)}\,\mathrm{d}a\right],\\
&= 2\sigma\left[\frac{1}{1-\gamma}\int_0^{a_0}\left(1-\frac{1}{1+(1-\gamma)a}\right)\mathrm{d}a
 + \frac{1}{2\gamma}\int_0^{a_0}\left(1-\frac{2}{2-a\gamma}\right)\mathrm{d}a\right],\\
&= 2\sigma\left[\frac{a_0(1+\gamma)}{2\gamma(1-\gamma)} - \frac{1}{(1-\gamma)^2}\log\!\left(1+(1-\gamma)a_0\right) + \frac{1}{\gamma^2}\log\!\left(1-\frac{a_0\gamma}{2}\right)\right].
\end{aligned}$$

By plotting this function (graph not shown here but available upon request), one can notice that this area varies little over a wide range of $\gamma$. So, although the blueish “cup” region appears smaller in the right panel than in the left one, this impression is only due to the $y$-axis scale. The area does not vary much, and the aforementioned problem remains over a wide range of $\gamma$.
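The tie at the boundary of the cup can be checked directly from (7): with $a_0 = 3/(1+\gamma)$, an over-dispersed forecast ($a=a_0>1$) and the degenerate $a=0$ limit receive the same expected CRPS. A sketch, assuming the reconstructed form of (7) (the function name `phi` is ours):

```python
def phi(a, gamma, sigma=1.0):
    """Expected CRPS (7) of a GP(a*gamma, a*sigma) forecast
    against GP(gamma, sigma) observations."""
    return sigma / (1 - gamma) + 2 * a * sigma * (
        1.0 / (2 * (2 - a * gamma)) - 1.0 / (1 + a - a * gamma))

gamma = 0.2
a0 = 3.0 / (1 + gamma)
print(phi(0.0, gamma), phi(a0, gamma))  # identical: ends of the ambiguous cup
print(phi(1.0, gamma))                  # ideal forecast (a = 1) scores lower
```

Scanning `phi` over a grid of `a` values in $(0, a_0)$ reproduces the cup shape of Figure 1 and, by numerical quadrature, the area $A(\gamma)$.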

## 3 Framework, examples and forecast comparison

### 3.1 Theoretical framework and examples

Our theoretical framework is the now classical prediction space already introduced by Murphy and Winkler (1987); Gneiting et al. (2013); Ehm et al. (2016). A probabilistic forecast for a real-valued outcome $Y$ is identified with its cdf $F$. The elements of the associated sample space can be identified with tuples of the form

$$(F_1,\dots,F_t,Y),$$

where the predictive cdfs $F_t$ use respective information sets, each a general set with nice theoretical properties³. See, e.g., Ehm et al. (2016, Section 3.1) for details. As an illustration, if $Y_t$ is a weather variable of interest with (unconditional) distribution function $G$, the information set can, for example, depict the underlying atmospheric state $\Delta_t$ at time $t$; see, e.g., tables 1 and 2 for two examples of conditioning. Following Gneiting et al. (2007), the outcome $y_t$ can be viewed as drawn from a true data-generating process $G_t = [G\mid\Delta_t]$ for each time step $t$. The associated sequence of probabilistic forecasts is $(F_t)$, and a forecast is ideal relative to its information set if $F_t = G_t$.

³More precisely, each information set is a sigma-field on the underlying sample space.

The asymptotic compatibility between the data-generating process and the predictive distributions can be divided into the three following modes of calibration, where the arrow denotes almost sure convergence as $T$ tends to infinity:

$$(\mathrm{P}):\quad \frac{1}{T}\sum_{t=1}^{T} G_t\circ F_t^{-1}(p) \to p \quad\text{for all } p\in(0,1).$$
$$(\mathrm{E}):\quad \frac{1}{T}\sum_{t=1}^{T} G_t^{-1}\circ F_t(x) \to x \quad\text{for all } x\in\mathbb{R}.$$

The third type of calibration, (M), can be summed up by the existence of the limit distributions $\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} G_t$ and $\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T} F_t$ such that these two limits coincide. The existence of $G$ is a natural assumption in meteorology and corresponds to the existence of a stable climate. Hence, (M) can be interpreted in terms of the equality of observed and forecast climatology.⁴

⁴In Gneiting et al. (2007), the sequence $(F_t)$ is then respectively said to be probabilistically calibrated (in short, calibrated), exceedance calibrated, and marginally calibrated. Besides, the sequence is said to be auto-calibrated when it is jointly probabilistically and marginally calibrated.

For the sake of illustration, let us now consider two different designs of experiments. The first one is the design introduced by Gneiting et al. (2007), also considered by Straehl and Ziegel (2017), and described here in Table 1. At times $t=1,\dots,T$, the cdf of the truth (observation) is modeled by a normal distribution with mean $\mu_t$ and unit variance, where $\mu_t$ is a random draw from the standard normal distribution. This model is denoted in the following by Model NN. The ideal forecaster provides a forecast of $\mathcal{N}(\mu_t,1)$. The climatological forecaster corresponds to the unconditional distribution, i.e. a centered normal distribution with variance equal to 2. The unfocused forecaster adds a Rademacher-type bias to his/her forecast. Finally, the extremist forecaster adds an additive bias to his/her forecast. The Bayesian reader can notice that this design of experiments relies on the conjugate prior of the normal distribution. For extreme event concerns, this example can be limiting, since it involves light tails and exhibits a slow convergence of the large values. As a consequence, we introduce a second design of experiments in Table 2, based on Gamma-exponential mixtures and denoted by Model GE. More precisely, suppose that $[Y_t\mid\Delta_t]$ follows an exponential distribution with mean $1/\Delta_t$, and $\Delta_t$ follows a Gamma pdf with shape 4 and rate 4. As the Gamma Laplace transform can be written as $\mathbb{E}[e^{-s\Delta_t}] = (1+s/4)^{-4}$ whenever $s > -4$, the unconditional $Y$ has a GP tail with parameter $\gamma = 1/4$. Hence, the climatological forecaster corresponds to the unconditional distribution, i.e. a Pareto distribution, while the ideal forecaster follows an exponential distribution with mean $1/\Delta_t$. The unfocused forecaster adds a uniform-type bias to his/her forecast, whereas the extremist forecaster adds a multiplicative bias to his/her forecast.
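The GE mixture can be simulated directly. In the sketch below (seed and sample size are ours; the Gamma shape/rate pair 4/4 is the choice consistent with the coefficients of Eq. (9)), the empirical unconditional survival of $Y$ is compared with the GP survival $(1+y/4)^{-4}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
delta = rng.gamma(shape=4.0, scale=1.0 / 4.0, size=n)  # Gamma with rate 4
y = rng.exponential(scale=1.0 / delta)                 # [Y|Delta] ~ Exp(Delta)

for q in (1.0, 2.0, 4.0):
    emp = float(np.mean(y > q))
    gp = (1.0 + q / 4.0) ** (-4.0)  # GP survival with gamma = 1/4, sigma = 1
    print(q, emp, gp)
```

The empirical exceedance probabilities match the closed-form GP survival to within Monte Carlo noise, illustrating how an exponential conditional model produces a heavy-tailed climatology.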

### 3.2 Comparing forecasts with expected weighted CRPS

This subsection compares, with the quantile weight function, different forecasts based on the two designs presented in tables 1 and 2 (see also Diebold et al., 1997; Dawid, 1984). In Figure 2, the expected weighted CRPS (on log scale) for each model (model GE: top, model NN: bottom) is plotted for each forecast as a function of the quantile threshold $q$ (left), together with the corresponding ranking between forecasts (right). Properness of the CRPS ensures that the ideal forecaster has the lowest score expectation. The dotted horizontal lines correspond to the unweighted CRPS expectation. Concerning the latter, the CRPS for the ideal forecaster can be expressed explicitly (see, e.g., Friederichs and Thorarinsdottir, 2012, and Equation (20) in Appendix 6.4) as

$$\mathrm{CRPS}([G\mid\Delta_t],y) = y + \frac{2}{\Delta_t}\exp(-\Delta_t y) - \frac{3}{2\Delta_t},
\quad\text{where } [G\mid\Delta_t] \stackrel{d}{=} \mathrm{Exp}(\Delta_t), \qquad (8)$$

and for the climatological forecaster, as

$$\mathrm{CRPS}(G,y) = y - \frac{8}{3}\left[1-\left(\bar H_{0.25}(y)\right)^{0.75}\right] + \frac{4}{7}. \qquad (9)$$

From Figure 2, two important features can be identified: (1) the loss of information for large values, well known by Brier score users and clearly pointed out by Stephenson et al. (2008), and (2) the fact that forecast rankings can switch order. A change in the forecast ranking is observed for the model GE in Figure 2. This can artificially favor one forecaster over another. For example, the climatological forecaster could be tempted to only show his/her results for a weight function with a sufficiently large threshold in order to outperform the extremist forecaster.

One can argue that the weight function should not be chosen by forecast makers but rather by forecast users (see, e.g., Ehm et al., 2016). This opinion was also expressed by Gneiting and Ranjan (2011), who wrote that “the scoring function [must] be specified ex ante”, and relayed by Patton (2014): “forecast consumers or survey designers should specify the single specific loss function that will be used to evaluate forecasts”. This is a well-known fact for decision makers in economics. In contrast, users of weather forecasts may be less aware of the potential negative impacts of poor forecasts. A recent survey (Hagedorn, 2017) concludes that 81% of probabilistic information users rely on heuristics/experience rather than on cost/loss models in order to make decisions. Last but not least, given the proximity between the CRPS values, it seems highly valuable to compute tests of equal predictive performance. For example, Table 3 presents the two-sided pairwise Diebold-Mariano statistic values on the CRPS among forecasters of the model GE for two different quantiles, located before and after the ranking switch between two forecasters.

Table 3 clearly shows that the ranking of Figure 2 can be challenged. From this section and Section 2.3, one can conclude that weighted scoring rules have to be handled with care, especially by forecast makers (Gilleland et al., 2018; Lerch et al., 2017).

## 4 A CRPS-based tool using extreme value theory

### 4.1 CRPS behavior and calibration-information diagram

Ferro (2007) proposed to link EVT with forecast verification of extreme events by giving a theoretical framework to characterize the joint distribution of forecasts and rare events. This concerned deterministic forecasts, and other tools are needed for probabilistic ones; see the quote by Stephenson et al. (2008): “development of verification methods for probability forecasts of extreme events is an important area that clearly requires attention”. A difficulty is to summarize forecast knowledge in an informative way in order to make meaningful comparisons for extreme observations.

In sections 2 and 3, we saw that a single number such as the mean of the CRPS (or of the weighted CRPS) has many drawbacks for comparing forecasts with extreme observations. Instead of focusing on one number, we propose in this section to study the distribution generated by these CRPS values viewed as random variables. To fix ideas, consider the model GE. From equations (8) and (9), it is possible to view the ideal and climatological scores as random variables whenever $Y$ is itself considered as a random variable. The properness of the CRPS ensures here that, for each $\Delta_t$,

$$\mathbb{E}_{[G\mid\Delta_t]}\!\left[\mathrm{CRPS}([G\mid\Delta_t],Y_t)\right] \le \mathbb{E}_{[G\mid\Delta_t]}\!\left[\mathrm{CRPS}(G,Y_t)\right].$$

At this stage, it is important to recall what can be considered as observed in applications. In practice, $\Delta_t$ is not given to the forecaster at time $t$; we just have access to one realization $y_t$ of $Y_t$, and we can compute $\mathrm{CRPS}(F_t,y_t)$ for each $t$. Under the assumption of the existence of $G$, the sample $(y_t)$ is a mixture over the information sets, and so $\{\mathrm{CRPS}(G,y_t)\}$ is a sample of $\mathrm{CRPS}(G,Y)$. The underlying distributions $[G\mid\Delta_t]$ from which the $y_t$ are drawn are unavailable in practice.

For any forecast described in tables 1 and 2, two types of samples of size $T$ can be defined and then simulated:

$$S_1 = \{\mathrm{CRPS}(F_t,y_t)\}_{t=1,\dots,T} \quad\text{and}\quad S_2 = \{\mathrm{CRPS}(F_t,y_{\pi(t)})\}_{t=1,\dots,T}, \qquad (10)$$

where the latter is obtained by shuffling the observational vector $(y_1,\dots,y_T)$ into $(y_{\pi(1)},\dots,y_{\pi(T)})$ with the random shuffling operator $\pi$. This shuffling breaks the conditional dependence between $F_t$ and $y_t$ produced by the hidden variable $\Delta_t$. Let us then consider two random variables, respectively denoted by $\mathrm{CRPS}(F,Y)$ and $\mathrm{CRPS}^\nabla(F,Y)$, with respective cdfs equal to the empirical cdfs associated with $S_1$ and $S_2$.
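For the ideal forecaster in model GE, both samples in (10) can be generated directly from the closed form (8). A sketch (seed, sample size and helper name are ours); by properness, the matched sample $S_1$ must have a smaller mean than the shuffled sample $S_2$, since shuffling destroys the forecaster's conditioning advantage:

```python
import numpy as np

def crps_exp(lam, y):
    """Closed-form CRPS (8) of an Exp(lam) forecast at observation y."""
    return y + (2.0 / lam) * np.exp(-lam * y) - 3.0 / (2.0 * lam)

rng = np.random.default_rng(2)
n = 50_000
delta = rng.gamma(4.0, 1.0 / 4.0, size=n)   # hidden states Delta_t
y = rng.exponential(1.0 / delta)            # matched observations
s1 = crps_exp(delta, y)                     # S1: matched pairs
s2 = crps_exp(delta, rng.permutation(y))    # S2: shuffled observations
print(s1.mean(), s2.mean())
```

Comparing the two empirical distributions (e.g. by a qq-plot of sorted `s1` against sorted `s2`) reproduces the diagnostics of Figure 3.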

By sorting separately the values of each sample $S_1$ and $S_2$, a qq-plot can be obtained to compare the empirical distributions of these two samples. The left panels of Figure 3 (top panel for the model GE and bottom panel for the model NN) indicate that the climatological forecast (blue color) provides the same distributional features for the two samples $S_1$ and $S_2$. The same type of plot, see the right panels, can be made on the uniform scale, see the pp-plots. The important message from Figure 3 is that the ideal, unfocused and extremist forecasts (black, red and green, respectively) move away from the diagonal. This behavior is linked to the notion of auto-calibration⁵. Besides the climatological forecaster, the samples $S_1$ and $S_2$ capture relevant information about the discrepancy generated by the hidden conditioning, i.e. the unobservable random variable $\Delta_t$.

⁵The auto-calibration condition ensures a good interpretation of the discrepancy between distributions as a means to evaluate the skill of the forecast. This is in accordance with the recommendations on the extremal dependence indices (EDI) of Ferro and Stephenson (2011), who note that the forecasts should be calibrated in order to get meaningful EDIs. In the same vein, we corroborate the Gneiting et al. (2007) paradigm of maximizing the sharpness subject to calibration by maximizing the information subject to auto-calibration.

One way to understand the difference in information between $S_1$ and $S_2$ is to study the integrated difference between the associated cdfs. Indeed, assuming auto-calibration, evaluating the general amount of information brought by a forecast boils down to measuring its expected sharpness. This is also equivalent to evaluating its expected score, see Tsyplakov (2013, Section 2.3), where the “expected score” has to be understood, with our notation, as the expectation of the score random variable $\mathrm{CRPS}^\nabla(F,Y)$.

The lemma presented below shows that the amount of information brought by an auto-calibrated forecast is also related to the discrepancy $d(F)$, defined as the $L^1$-distance between the cdfs of $\mathrm{CRPS}^\nabla(F,Y)$ and $\mathrm{CRPS}^\nabla(G,Y)$:

$$d(F) := \int_0^{\infty}\left\{F_{\mathrm{CRPS}^\nabla(F,Y)}(t) - F_{\mathrm{CRPS}^\nabla(G,Y)}(t)\right\}\mathrm{d}t.$$

Note that this quantity is non-negative. This follows from the fact that the climatological forecast has the lowest sharpness, so that $F_{\mathrm{CRPS}^\nabla(F,Y)} \ge F_{\mathrm{CRPS}^\nabla(G,Y)}$ pointwise.

###### Lemma 3

Assume that $(F_t)$ are auto-calibrated forecasts, and that the observation $Y$ has cdf $G$; then the following statements hold:

• $\mathrm{CRPS}^\nabla(G,Y) \stackrel{d}{=} \mathrm{CRPS}(G,Y)$ in distribution;

• $d(F) = \mathbb{E}\!\left[\mathrm{CRPS}^\nabla(G,Y)\right] - \mathbb{E}\!\left[\mathrm{CRPS}^\nabla(F,Y)\right]$.

The proof of this lemma is relegated to Appendix 6.5.

To summarize, see also Table 4, the available distributional objects are: the empirical distribution function associated with the unconditional distribution of the observations; the forecasts’ distributions; and the empirical distribution made by the scores $\mathrm{CRPS}(F_t,y_t)$. This leaves practitioners with three possible empirical score distributions: the one obtained from $S_1$, the one from $S_2$, and the one from $\{\mathrm{CRPS}(G,y_t)\}$. The remaining question is to determine whether the empirical distributions associated with these samples can bring relevant information concerning the extremal behavior.

Before addressing this question, we close this section with Figure 4, which summarizes a calibration-information diagram about the trade-off between calibration and information. Subject to auto-calibration, sharpness and information represent the same attribute of a forecast. This diagram can be seen as a natural extension of the idea of Bentzien and Friederichs (2014).

### 4.2 CRPS behavior for extreme events

#### 4.2.1 Pareto approximation of the CRPS

So far, our examples were based on very specific parametric forms, see tables 1 and 2. In this section, we will see how the study of the upper tail behavior of the CRPS can be moved from these specific examples towards less stringent conditions based on EVT (see Appendix 6.6 for the proofs).

Let $Y_t$ and $X_t$ be two random variables with absolutely continuous cdfs $[G\mid\Delta_t]$ and $F_t$, respectively. If these two cdfs have identical upper bounds, $[G\mid\Delta_t]$ belongs to $\mathcal{D}(H_{\gamma_t})$ and $c_{F_t}$ is finite, then, conditionally on $\Delta_t$, one has for $s$ such that $1+\gamma_t s > 0$, as $u$ tends to the upper endpoint,

$$P\!\left(\frac{\mathrm{CRPS}(F_t,Y_t)+c_{F_t}-u}{b(u)} > s \,\Big|\, Y_t>u,\,\Delta_t\right) \longrightarrow (1+\gamma_t s)^{-1/\gamma_t}. \qquad (11)$$

Equation (11) tells us that, given $\Delta_t$ and for large observational values, the CRPS upper tail behavior is equivalent to the one provided by the observation. This generalizes the ideas seen throughout the specific case of (8). It is also interesting to point out that $\mathrm{CRPS}^\nabla(G,Y)$ has a different tail behavior. More precisely, if $G$ belongs to $\mathcal{D}(H_\gamma)$ with $\gamma > 0$, then, for $s$ such that $1+\gamma s > 0$, as $u$ tends to the upper endpoint:

$$P\!\left(\frac{\mathrm{CRPS}^\nabla(G,Y)-u}{b(u)} > s \,\Big|\, Y>u\right) \longrightarrow (1+\gamma s)^{-1/\gamma}. \qquad (12)$$

The constant term in (11) vanishes in (12) due to the linear behavior of the auxiliary function $b(u)$ for $\gamma > 0$ (see, e.g., Von Mises, 1936; Embrechts et al., 1997). The GE model in Table 2 illustrates this: working with a dependent couple $(F_t,Y_t)$ or with an independent one has an enormous impact on the tail behavior of the respective CRPS. For the GE model, the limit in (11) exhibits an exponential tail behavior, whereas the Pareto limit in (12) shows a heavy-tailed behavior. This difference in tail behaviors could be beneficial for practitioners.

#### 4.2.2 Practical use of the CRPS Pareto tail behavior

To take advantage of the limiting behavior difference between (11) and (12), one needs access to samples from these two distributions. At this stage, we can use the convergence provided by (12) to write the following approximations (as $u$ gets large):

$$\mathrm{CRPS}^\nabla(G,Y)\mid Y>u \;\sim\; Y\mid Y>u \;\sim\; H_{\gamma,\sigma_u}, \qquad (13)$$

where $H_{\gamma,\sigma_u}$ represents the Generalized Pareto distribution with scale parameter $\sigma_u$ and shape parameter $\gamma$. The two approximations in (13) lead us to propose comparing, for extreme observations above $u$ with $u$ large, the empirical cdf generated by the shuffled scores with the theoretical Pareto distribution associated with the observations. To make this comparison, we rely on the Cramér-von Mises criterion (Cramér, 1928; Von Mises, 1928):

$$\omega_u^2 = \int_{-\infty}^{+\infty}\left[K_{u,T}(v) - H_{\gamma,\sigma_u}(v)\right]^2\,\mathrm{d}H_{\gamma,\sigma_u}(v),$$

where $K_{u,T}$ corresponds to the empirical distribution based on the sample of scores for observations above $u$. To compute this criterion, we order the values (in increasing order) and call them $v_1,\dots,v_m$. The integer $m$ represents the number of observations above $u$. Then, the Cramér-von Mises statistic (a description of the algorithm calculating $T_u$ is provided in Table 5) becomes

$$T_u = m\,\hat\omega_u^2 = \frac{1}{12m} + \sum_{i=1}^{m}\left[\frac{2i-1}{2m} - H_{\gamma,\sigma_u}(v_i)\right]^2.$$

Given an auto-calibrated forecast, a large value of $T_u$ indicates an added value of this forecast for extremes.
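The statistic $T_u$ is straightforward to compute from the ordered excess scores. A sketch (the helper names `gp_cdf` and `cvm_stat` are ours), evaluated on a sample drawn from the reference GP itself, for which $T_u$ should be small under the null:

```python
import numpy as np

def gp_cdf(x, gamma, sigma):
    """Cdf of the Generalized Pareto distribution H_{gamma, sigma}."""
    return 1.0 - (1.0 + gamma * x / sigma) ** (-1.0 / gamma)

def cvm_stat(scores, gamma, sigma):
    """Cramer-von Mises statistic T_u = m * omega_u^2 against H_{gamma, sigma}."""
    v = np.sort(np.asarray(scores, dtype=float))
    m = v.size
    i = np.arange(1, m + 1)
    return 1.0 / (12 * m) + float(
        np.sum(((2 * i - 1) / (2.0 * m) - gp_cdf(v, gamma, sigma)) ** 2))

rng = np.random.default_rng(3)
gamma, sigma = 0.25, 1.0
# Inverse-transform sampling from the GP reference distribution.
u_sample = sigma / gamma * ((1.0 - rng.uniform(size=500)) ** (-gamma) - 1.0)
t_null = cvm_stat(u_sample, gamma, sigma)
print(t_null)
```

A sample that departs from the reference (for instance, the same values rescaled) yields a markedly larger statistic.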

#### 4.2.3 An index for extremes skill subject to calibration

For large $T$ and large $u$, under the null hypothesis, the statistic $T_u$ should approximately follow a Cramér-von Mises distribution. Hence, under this distributional assumption, it is possible to compute the quantile (p-value) associated with $T_u$. This number in $(0,1)$ is denoted $p^F_u$ here for a forecast $F$, and $p^{\mathrm{clim}}_u$ for the climatological forecast. Note that the auto-calibration condition is fundamental to avoid large type II errors, as Ferro and Stephenson (2011) already recommended for EDIs. In order to get an asymmetric index, we propose to compute the following ratio:

$$1 - \frac{p^F_u}{p^{\mathrm{clim}}_u}. \qquad (14)$$

This index ranges from zero to one⁷, and equals zero for the climatological forecast. The higher, the better.

⁷It can happen that the computation makes this quantity non-positive (i.e. $p^F_u \ge p^{\mathrm{clim}}_u$); this pathological outcome is the result of an uncalibrated forecast and is therefore a meaningless quantity.
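The index (14) then reduces to a one-liner. In the sketch below the p-values are hypothetical, the quantile is read as an upper-tail p-value (small when $T_u$ is large), and the floor at zero reflects footnote 7:

```python
def extreme_index(p_forecast, p_clim):
    """Index (14): 1 - p_F / p_clim, floored at zero since negative values
    only arise for uncalibrated forecasts and are meaningless (footnote 7)."""
    return max(0.0, 1.0 - p_forecast / p_clim)

# Hypothetical Cramer-von Mises p-values:
print(extreme_index(p_forecast=0.08, p_clim=0.60))  # skillful forecast
print(extreme_index(p_forecast=0.60, p_clim=0.60))  # climatology itself -> 0
```

A forecast whose excess scores depart from the climatological GP reference more decisively than climatology's own scores gets an index close to one.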

To see how this index behaves in practice, we revisit the experimental design described in Table 2. For the GE model, CRPS values were computed for each type of forecast. In Figure 5, the index defined by (14) is plotted against increasing quantiles of the threshold $u$. The extremist forecast (green curve) appears superior to the others but, not being calibrated, it has to be discarded. Subject to auto-calibration, the perfect forecast (black curve), exponential conditionally on the unobserved $\Delta_t$, is the most rewarded. The climatological forecast, although heavy tailed, has the lowest index.

As expected, we can see for the GE model in Figure 5 that, excluding the extremist forecaster which is not auto-calibrated, the index rewards the perfect forecast.

Concerning the threshold choice, this is not an issue for the GE example because both the exponential and Pareto distributions are threshold invariant. This setup, by removing the threshold choice issue, is ideal for understanding our approach, but it is not realistic in practice. A threshold choice has to be made, and many approaches can be used to handle this non-trivial decision (see, e.g., Naveau et al., 2016; Beirlant et al., 2004). Such a decision clearly depends on the application at hand. For example, the recent study by Taillardat et al. (2019) proposed and compared different post-processing approaches to improve precipitation forecasts in France. As extreme rainfall can be heavy-tail distributed, we expect that the forecasting approaches coupled with EVT, referred to as QRF EGP tail, EMOS GEV and EMOS EGP in Figure 6, should perform better than non-EVT based approaches like Analogs, QF and QRF; see Taillardat et al. (2019) for a detailed description of each post-processing technique. The behavior of our index ($y$-axis in Figure 6) defined by (14), independently of the threshold choice, clearly indicates the superiority in representing extremes of the forecasting methods that include a specific EVT treatment of heavy rainfall. This confirms the conclusions of Taillardat et al. (2019), especially with regard to their Figure 3.

## 5 Discussion

According to our analysis, the mean of the CRPS and its weighted derivations seem unable to discriminate forecasts with different tail behaviors, whatever the weighting scheme.

Coming back to the three types of “goodness” introduced by Murphy (1993), the forecast value seems to be the most important for extreme events. For example, severe weather warnings are still issued by forecasters and, despite possibly inaccurate predictions in quantitative terms, the forecaster has to make a decision according to a threshold of interest. This approach is directly linked with the economic value of the forecast, or equivalently with the ability to stand out from the climatology. For deterministic forecasts, such tools are well known, see e.g. Richardson (2000); Zhu et al. (2002). Other widely used scores based on the dependence between forecasts and observed events have been considered in Stephenson et al. (2008); Ferro and Stephenson (2011). Recently, Ehm et al. (2016) introduced the so-called “Murphy diagrams” for deterministic forecasts. This original approach makes it possible to assess dominance among different forecasts and to anticipate their skill area. In the same vein, we show that, subject to auto-calibration, relevant information about extremes can be represented by the discrepancy between the unconditional and conditional distributions of the score. An open question remains: subject to auto-calibration, can the score’s distribution locally beat the perfect forecast, or be beaten by the climatological forecast? We illustrate this conjecture in Figure 7.

Inspired by Friederichs (2010), our work consists of applying extreme value theory to common verification measures themselves. We therefore consider the score as a random variable. Relying on some properties of the CRPS for large observed events, we set up a theoretical framework for the score’s behavior for extremes. As a result, we obtain a bounded index to assess the nexus between forecasts and observations. One can view this contribution as an additional step in bridging the gap between ensemble verification and extreme events (see, e.g., Ferro, 2007; Friederichs and Thorarinsdottir, 2012; Ferro and Stephenson, 2011). The ensemble forecast information is kept by the use of the CRPS. The index introduced in Section 4.2.3 can be considered as the probabilistic alternative to the deterministic scores introduced by Ferro (2007) and Ferro and Stephenson (2011). We would say that the paradigm of maximizing the sharpness subject to calibration can be associated with the paradigm of maximizing the information for extreme events subject to auto-calibration. It would be worthwhile to study the specific properties of this CRPS-based tool and its potential paths and pitfalls. Another potentially interesting investigation could be to extend this procedure to other scores such as the mean absolute difference, the ignorance score (Diks et al., 2011), which the Neyman-Pearson lemma described in Lerch et al. (2017) suggests is a natural candidate, or the Dawid-Sebastiani score (Dawid and Sebastiani, 1999).
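To make the idea of treating the score as a random variable more tangible, the following sketch estimates the tail index of a sample of scores with the classical Hill estimator. The synthetic "scores" with an exact Pareto tail, the index value 0.5, and the choice of the number of order statistics `k` are purely illustrative assumptions, not the paper's procedure.

```python
import math
import random

def hill_estimator(sample, k):
    """Hill estimate of the tail index from the k largest order statistics."""
    xs = sorted(sample, reverse=True)
    threshold = xs[k]  # the (k+1)-th largest value serves as a random threshold
    return sum(math.log(x / threshold) for x in xs[:k]) / k

# Illustrative only: synthetic scores with an exact Pareto tail of index xi = 0.5,
# sampled by inverse-cdf transform (survival function x^(-1/xi) for x >= 1).
rng = random.Random(42)
xi = 0.5
scores = [(1.0 - rng.random()) ** (-xi) for _ in range(50000)]

xi_hat = hill_estimator(scores, k=2000)  # should be close to xi = 0.5
```

In practice the sample would consist of observed CRPS values, and the stability of `xi_hat` across `k` would itself need checking, as with any peaks-over-threshold analysis.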

## 6 Appendix

### 6.1 Proof of the inequality (3)

Assume that the weight function $w$ is continuous. By integrating by parts and using $W'=w$, the weighted CRPS defined by (1) can be rewritten as

$$\mathrm{wCRPS}(F,y)=\mathbb{E}_F|W(X)-W(y)|-\frac{1}{2}\,\mathbb{E}_F|W(X)-W(X')|.$$

The equality $|a-b|=2\max(a,b)-a-b$ gives

$$\begin{aligned}
\mathbb{E}_F|W(X)-W(y)| &= 2\,\mathbb{E}_F\max(W(X),W(y))-\mathbb{E}_F W(X)-W(y)\\
&= W(y)-\mathbb{E}_F W(X)+2\,\mathbb{E}_F\big[(W(X)-W(y))\,\mathbf{1}_{\{W(X)>W(y)\}}\big],
\end{aligned}$$

and

$$\begin{aligned}
\mathbb{E}_F|W(X)-W(X')| &= 2\,\mathbb{E}_F\max(W(X),W(X'))-2\,\mathbb{E}_F W(X)\\
&= 4\,\mathbb{E}\big(W(X)\,F_{W(X)}(W(X))\big)-2\,\mathbb{E}_F W(X)\\
&= 4\,\mathbb{E}\big(W(X)\,F(X)\big)-2\,\mathbb{E}_F W(X),
\end{aligned}$$

where the last line follows from the fact that $F_{W(X)}(W(X))$ and $F(X)$ have the same distribution, which is uniform on $[0,1]$. As $W$ is non-decreasing, one has $F_{W(X)}(W(X))=F(X)$ almost surely, and it follows that

$$\begin{aligned}
\mathrm{wCRPS}(F,y) &= W(y)-\mathbb{E}_F W(X)+2\,\mathbb{E}_F\big[(W(X)-W(y))\,\mathbf{1}_{\{W(X)>W(y)\}}\big]\\
&\quad-2\,\mathbb{E}_F\big(W(X)F(X)\big)+\mathbb{E}_F W(X)\\
&= W(y)+2\,\bar{F}(y)\,\mathbb{E}_F\big(W(X)-W(y)\,\big|\,X>y\big)-2\,\mathbb{E}_F\big(W(X)F(X)\big),
\end{aligned}$$

as announced in (3).
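For the unweighted case $w\equiv 1$ (so that $W(x)=x$), the representation above reduces to the classical kernel form $\mathrm{CRPS}(F,y)=\mathbb{E}_F|X-y|-\frac{1}{2}\mathbb{E}_F|X-X'|$. As a numerical sanity check, the following sketch compares a direct quadrature of the standard integral definition of the CRPS with a Monte Carlo estimate of the kernel form; the choice $F=\mathrm{Exp}(1)$ and the resulting closed form are our illustrative assumptions, not taken from the paper.

```python
import math
import random

def crps_integral_exp(y, upper=40.0, n=200000):
    """CRPS(F, y) via the integral of (F(x) - 1{x >= y})^2, with F = Exp(1)."""
    h = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h  # midpoint rule on [0, upper]
        cdf = 1.0 - math.exp(-x)
        ind = 1.0 if x >= y else 0.0
        total += (cdf - ind) ** 2 * h
    return total

def crps_kernel_exp(y, n=200000, seed=0):
    """Monte Carlo estimate of E|X - y| - 0.5 E|X - X'| for X, X' iid Exp(1)."""
    rng = random.Random(seed)
    e1 = sum(abs(rng.expovariate(1.0) - y) for _ in range(n)) / n
    e2 = sum(abs(rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(n)) / n
    return e1 - 0.5 * e2

def crps_closed_exp(y):
    """Closed form for F = Exp(1): CRPS(F, y) = y - 3/2 + 2 exp(-y)."""
    return y - 1.5 + 2.0 * math.exp(-y)
```

Both estimates agree with the closed form up to quadrature and Monte Carlo error, illustrating the equivalence of the two CRPS representations.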

### 6.2 Proof of the inequality (5)

Let $u$ be a positive real number and let $Z$ be a non-negative random variable with finite mean and cdf $H$. We introduce the new random variable

$$Y=X\,\mathbf{1}_{\{X\le u\}}+(Z+u)\,\mathbf{1}_{\{X>u\}} \tag{15}$$

with survival function $\bar{G}$ defined by

$$\bar{G}(x)=\begin{cases}\bar{F}(x), & \text{if } x\le u,\\[2pt] \bar{H}(x-u)\,\bar{F}(u), & \text{otherwise},\end{cases} \tag{16}$$

where $H$ has the same end point as $F$, and

$$\bar{H}(x-u)\le \bar{F}(x)/\bar{F}(u),\quad\text{for any } x\ge u. \tag{17}$$

This latter condition implies that

$$\bar{G}(x)\le \bar{F}(x),\quad\text{for all } x. \tag{18}$$
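To make the construction (15)-(16) concrete, here is a minimal simulation. The distributional choices $F=\mathrm{Exp}(1)$, $u=1$ and $H=\mathrm{Exp}(2)$ are ours, purely for illustration; with these choices $\bar{H}(x-u)=e^{-2(x-u)}\le e^{-(x-u)}=\bar{F}(x)/\bar{F}(u)$, so condition (17) holds and (18) should follow.

```python
import math
import random

rng = random.Random(0)
u = 1.0
n = 100000

def sample_Y():
    """One draw of Y from (15): keep X below u, replace the tail by u + Z."""
    x = rng.expovariate(1.0)          # X ~ F = Exp(1)
    if x <= u:
        return x
    return u + rng.expovariate(2.0)   # Z ~ H = Exp(2)

ys = [sample_Y() for _ in range(n)]

def surv_emp(t):
    """Empirical survival function of the simulated Y."""
    return sum(1 for y in ys if y > t) / n

def surv_G(t):
    """Theoretical survival function of Y from (16)."""
    if t <= u:
        return math.exp(-t)
    return math.exp(-2.0 * (t - u)) * math.exp(-u)

def surv_F(t):
    return math.exp(-t)
```

The empirical survival function of `ys` matches `surv_G` up to sampling error, and `surv_G(t) <= surv_F(t)` everywhere, as stated in (18).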

Because $W$ is strictly increasing, equation (16) allows us to write, almost surely,

$$\mathbb{E}\big(W(X)\,\mathbf{1}_{\{X\le u\}}\big)=\mathbb{E}\big(W(Y)\,\mathbf{1}_{\{Y\le u\}}\big). \tag{19}$$

Equality (19), combined with the expression of the CRPS, implies that

$$\begin{aligned}
\frac{1}{2}\big[\mathrm{wCRPS}(G,x)-\mathrm{wCRPS}(F,x)\big] &= \mathbb{E}_Y\big[(W(Y)-W(x))\,\mathbf{1}_{\{Y>x\}}\big]-\mathbb{E}_X\big[(W(X)-W(x))\,\mathbf{1}_{\{X>x\}}\big]\\
&\quad+\mathbb{E}_X\big(W(X)F(X)\big)-\mathbb{E}_Y\big(W(Y)G(Y)\big)\\
&= \mathbb{E}_Y\big(W(Y)\bar{G}(Y)\big)-\mathbb{E}_X\big(W(X)\bar{F}(X)\big)\\
&\quad-\mathbb{E}_Y\big[(W(Y)-W(x))\,\mathbf{1}_{\{Y\le x\}}\big]+\mathbb{E}_X\big[(W(X)-W(x))\,\mathbf{1}_{\{X\le x\}}\big]\\
&= \mathbb{E}_Y\big(W(Y)\bar{G}(Y)\big)-\mathbb{E}_X\big(W(X)\bar{F}(X)\big)+\Delta(x),
\end{aligned}$$

where

$$\Delta(x)=\mathbb{E}_X\big[(W(X)-W(x))\,\mathbf{1}_{\{X\le x\}}\big]-\mathbb{E}_Y\big[(W(Y)-W(x))\,\mathbf{1}_{\{Y\le x\}}\big].$$

Together with the stochastic ordering (18) between $X$ and $Y$, this leads to

$$\frac{1}{2}\Big|\mathbb{E}_X\big[\mathrm{wCRPS}(G,X)\big]-\mathbb{E}_X\big[\mathrm{wCRPS}(F,X)\big]\Big|\le \int_u^{x_F}\Delta(x)\,dF(x).$$

For $u<x\le x_F$, we can write

$$\Delta(x)=\mathbb{E}_X\big[(W(X)-W(x))\,\mathbf{1}_{\{u<X\le x\}}\big]-\mathbb{E}_Y\big[(W(Y)-W(x))\,\mathbf{1}_{\{u<Y\le x\}}\big],$$
since $X$ and $Y$ have the same distribution below $u$.