Mathematical Properties of Continuous Ranked Probability Score Forecasting

05/09/2022
by   Romain Pic, et al.
Université de Franche-Comté

The theoretical advances on the properties of scoring rules over the past decades have broadened the use of scoring rules in probabilistic forecasting. In meteorological forecasting, statistical postprocessing techniques are essential to improve the forecasts made by deterministic physical models. Numerous state-of-the-art statistical postprocessing techniques are based on distributional regression evaluated with the Continuous Ranked Probability Score (CRPS). However, theoretical properties of such CRPS minimization have mostly been studied in the unconditional framework (i.e. without covariates) and for infinite sample sizes. We circumvent these limitations and study the rate of convergence, in terms of CRPS, of distributional regression methods. We find the optimal minimax rate of convergence for a given class of distributions. Moreover, we show that the k-nearest neighbor method and the kernel method for distributional regression reach the optimal rate of convergence in dimension d≥2 and in any dimension, respectively.


1 Introduction

In meteorology, ensemble forecasts are based on a given number of deterministic models whose parameters vary slightly in order to take into account observation errors and the incomplete physical representation of the atmosphere. This leads to an ensemble of different forecasts that, taken together, also assesses the uncertainty of the forecast. Ensemble forecasts suffer from bias and underdispersion (hamill_1997; baran_2018) and need to be statistically postprocessed in order to be improved. Different postprocessing methods have been proposed, such as Ensemble Model Output Statistics (gneiting_raftery_2005), Quantile Regression Forests (taillardat_2019) or Neural Networks (schulz_2021). These references, among others, also discuss the stakes of statistical postprocessing of weather forecasts.

Postprocessing methods rely on probabilistic forecasts and distributional regression (gneitingkatz), where the aim is to predict the conditional distribution of the quantity of interest (e.g. temperature, wind speed, or precipitation) given a set of covariates (e.g. ensemble model output statistics). Algorithms are often based on the minimization of a proper scoring rule that compares actual observations with the predictive distribution. Scoring rules can be seen as the counterpart of a loss function in classical regression. A detailed review of scoring rules is given by gneitingraft. The Continuous Ranked Probability Score (CRPS; matheson_1976), defined in Equation (4), is one of the most popular scores in meteorological forecasting. The CRPS is also minimized to infer the parameters of statistical models used in postprocessing (e.g. gneiting_raftery_2005; naveau_2016; rasp_2018; taillardat_2019).

To the best of our knowledge, most convergence statements about the CRPS are not only derived within an unconditional framework, i.e. without taking the covariates into account, but are also based on infinite sample sizes. In this work, our goal is to bypass these two limitations. Before going further, we set the stage by introducing some notation.

In this article, we consider the distributional regression framework. In forecast assessment, we make the distinction between the construction of the estimator, which relies on the training sample, and its evaluation with respect to new data. Statistically, the goal of distributional regression is to estimate the conditional distribution of the response given the covariates. Given the training sample, the forecaster constructs a predictor that estimates this conditional distribution. In this context, it is crucial to assess whether the predictor is close to the true conditional distribution over the range of possible covariate values. We also make use of the marginal distribution of the covariates. The main goal of this work is to study the following positive quantity:

(1)

where the expectation is taken both with respect to the training sample and with respect to a new covariate value drawn from its marginal distribution. This quantity measures an averaged distance between the predicted distribution and the true conditional distribution. We focus on this specific distance because it corresponds to the excess of risk associated with the CRPS, i.e. the difference between the expected CRPS for the predicted distribution and the expected CRPS for the ideal prediction, see Section 2.1 for more details.
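The display (1) is not reproduced above. A plausible reconstruction, with notation entirely of our own choosing (predictor F̂ₙ, true conditional CDF F*ₓ, covariate marginal μ, training sample Dₙ), is:

```latex
% Hedged reconstruction of (1); all symbols below are our own choice of notation.
\mathbb{E}\left[ \int_{\mathbb{R}} \bigl( \hat F_n(z \mid X) - F^*_X(z) \bigr)^2 \, dz \right]
  = \mathbb{E}_{D_n}\!\left[ \int_{\mathbb{R}^d} \int_{\mathbb{R}}
      \bigl( \hat F_n(z \mid x) - F^*_x(z) \bigr)^2 \, dz \; \mu(\mathrm{d}x) \right].
```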

In order to study the rate of convergence of (1) as the sample size grows, we adapt the notion of optimal minimax rate of convergence, which quantifies the best error that an estimator can achieve uniformly over a given family of distributions when the size of the training set gets large. stone_1982 provided minimax rates of convergence within a point regression framework and the minimax theory for nonparametric regression is well developed, see e.g. gyorfi or Tsybakov2009. To the best of our knowledge, this paper states the first such results for distributional regression.

Many predictors can be studied and achieve the optimal minimax rate of convergence. To go further, we focus on two cases: the k-nearest neighbor and kernel estimators.

The k-nearest neighbor (k-NN) method is well known in the classical framework of regression and classification (see, e.g., biau_2015). In distributional regression, the k-NN method can be suitably adapted to estimate the conditional distribution and the corresponding estimator is written as

(2)

where the empirical distribution is built from the observations at the k nearest neighbors of the query point. As usual, possible ties are broken at random to define the nearest neighbors.
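As a concrete illustration of (2), here is a minimal sketch (ours, not the authors' code; all names are hypothetical): the predicted CDF at a query point is the empirical CDF of the responses attached to its k nearest training covariates.

```python
import numpy as np

def knn_conditional_cdf(X_train, y_train, x, k):
    """Empirical CDF of the responses at the k nearest neighbors of x.

    Returns a function z -> F_hat(z | x). Ties in distance are broken
    arbitrarily by argsort, which is enough for an illustration.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(x, dtype=float), axis=1)
    neighbor_idx = np.argsort(dists)[:k]
    y_neighbors = np.sort(y_train[neighbor_idx])

    def cdf(z):
        # Fraction of neighbor responses below or equal to z.
        return np.searchsorted(y_neighbors, z, side="right") / k

    return cdf

# Toy usage: 200 training points in dimension d = 2.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = X[:, 0] + 0.1 * rng.normal(size=200)
F_hat = knn_conditional_cdf(X, y, x=[0.5, 0.5], k=20)
print(F_hat(0.5))  # estimated P(Y <= 0.5 | X = (0.5, 0.5))
```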
The kernel estimate in distributional regression (see, e.g., Chapter 5 of gyorfi) can be expressed as

(3)

if the denominator is nonzero. When the denominator is zero, a default convention is used. Here the bandwidth depends on the sample size, and the function K is called the kernel.
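Similarly, a minimal sketch of the kernel estimator (3), assuming a Gaussian kernel for illustration; the fallback used when all weights vanish is our own choice, since the paper's convention is not reproduced above.

```python
import numpy as np

def kernel_conditional_cdf(X_train, y_train, x, h):
    """Kernel-weighted estimate of P(Y <= z | X = x) with bandwidth h.

    Uses a Gaussian kernel; any kernel K could be plugged in instead.
    """
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    sq_dists = np.sum((X_train - np.asarray(x, dtype=float)) ** 2, axis=1)
    weights = np.exp(-0.5 * sq_dists / h**2)
    total = weights.sum()

    def cdf(z):
        if total == 0.0:
            # Fallback convention (ours): marginal empirical CDF of the responses.
            return np.mean(y_train <= z)
        return np.sum(weights * (y_train <= z)) / total

    return cdf

# Usage mirrors the k-NN sketch above, e.g. kernel_conditional_cdf(X, y, x=[0.5, 0.5], h=0.1)(0.5).
```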

Minimax rates of convergence of the k-NN and kernel models in point regression are well studied and it is known that, for suitable choices of the number of neighbors and of the bandwidth respectively, these methods are minimax rate optimal on classes of distributions with Lipschitz or, more generally, Hölder continuous regression functions (see e.g. Theorem 14.5 in biau_2015 and Theorem 5.2 in gyorfi). For classes of distributions where the conditional distribution satisfies some regularity requirements with respect to the covariates (see Definition 1), we are able to extend these results to distributional regression. We obtain non-asymptotic bounds for the minimax rate of convergence for both the k-NN and kernel models.

To summarize, this paper is organized as follows. Our main results are presented in Section 2. Section 2.1 provides the theoretical background on distributional regression and its evaluation using the CRPS. In Section 2.2, we study the k-NN estimators (2) and derive a non-asymptotic upper bound for the excess of risk (1) uniformly on the class of Definition 1. Section 2.3 provides similar results for the kernel method (3). In Section 2.4, we find a lower minimax rate of convergence by reducing the problem to standard point regression solved by gyorfi. We deduce that the k-NN method for distributional regression reaches the optimal rate of convergence in dimension d≥2, while the kernel method reaches the optimal rate of convergence in any dimension. The proofs of all the results presented in Section 2 are detailed in the Appendix.

2 Main results

2.1 CRPS and distributional regression

The Continuous Ranked Probability Score (CRPS; matheson_1976, matheson_1976) compares a predictive distribution and a real-valued observation by computing the following integral

(4)
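The display (4) is not reproduced above; recall that the CRPS of a forecast CDF F at an observation y is the integral over z of (F(z) − 1{y ≤ z})². The following sketch (ours) approximates this integral numerically and checks it against the known closed form for a Gaussian forecast.

```python
import numpy as np
from scipy.stats import norm

def crps_numeric(cdf, y, lower=-30.0, upper=30.0, n_grid=20001):
    """Approximate CRPS(F, y) = integral of (F(z) - 1{y <= z})^2 dz on a finite grid.

    The grid bounds are an assumption; they must cover the bulk of F and the observation y.
    """
    z = np.linspace(lower, upper, n_grid)
    integrand = (cdf(z) - (z >= y)) ** 2
    return float(np.sum(integrand) * (z[1] - z[0]))

# Check against the classical closed form for a standard Gaussian forecast:
# CRPS(N(0,1), y) = y * (2 * Phi(y) - 1) + 2 * phi(y) - 1 / sqrt(pi).
y = 0.3
print(crps_numeric(norm.cdf, y))
print(y * (2 * norm.cdf(y) - 1) + 2 * norm.pdf(y) - 1 / np.sqrt(np.pi))
```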

The expected CRPS of a predictive distribution, when the observations are distributed according to a given distribution, is defined as

(5)

where we consider the set of all distribution functions on the real line. This quantity is finite when both distributions have a finite moment of order 1. Then, the difference between the expected CRPS of the forecast and the expected CRPS of the ideal forecast can be written as

(6)

This implies that the only optimal prediction, in the sense that it minimizes the expected CRPS, is the true distribution . A score with this property is said to be strictly proper. This property is essential for distributional regression as it justifies the minimization of the expected score in order to construct or evaluate a prediction.
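The right-hand side of identity (6) is missing above; a standard computation from definition (4), sketched here on our own initiative with generic CDFs F and G, gives the classical identity behind this strict propriety:

```latex
% Standard expected-CRPS decomposition; F is the forecast CDF, G the true CDF of Y.
\mathbb{E}_{Y \sim G}\!\left[\mathrm{CRPS}(F, Y)\right]
  - \mathbb{E}_{Y \sim G}\!\left[\mathrm{CRPS}(G, Y)\right]
  = \int_{\mathbb{R}} \bigl( F(z) - G(z) \bigr)^2 \, dz \;\ge\; 0,
```

with equality if and only if F = G almost everywhere, which is exactly the strict propriety stated above.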

Example 1.

(CRPS for Generalized Pareto distributions)
Explicit parametric formulas of the CRPS exist for most classical distribution families: e.g. Gaussian, logistic, censored logistic, Generalized Extreme Value, Generalized Pareto (see gneiting_raftery_2005; taillardat_2016; friederichs_2012). We focus here on the Generalized Pareto Distribution (GPD) family and we denote by the GP distribution with shape parameter and scale parameter . Recall that it is defined, for a nonzero shape parameter, by

with the usual notation. For a zero shape parameter, the standard limit by continuity is used. For a shape parameter smaller than one, the GPD has a finite first moment and the associated CRPS is given by (friederichs_2012)

(7)

When , the expected CRPS is (taillardat_2022)

(8)

with

In particular,


In distributional regression, a predictor is evaluated through its expected risk

This quantity is important as many distributional regression methods try to minimize it in order to improve predictions. When it is integrable, Equation (6) implies

(9)

We recall that the Bayes risk is the minimal theoretical risk over all possible predictors and that a Bayes predictor is a predictor achieving the Bayes risk. Thus, Equation (9) implies that the risk of the ideal prediction is the Bayes risk and that a predictor achieves the Bayes risk if and only if it coincides with the ideal prediction almost everywhere. An introduction to the notions of theoretical risk, Bayes risk and excess of risk can be found in Section 2.4 of hastie_2001.

Example 2.

(GPD regression)
We illustrate the above statement in the case of a Generalized Pareto regression model where the response given the covariates follows a GPD with covariate-dependent shape and scale parameters. Then, it is possible to show that the Bayes risk is equal to

when the shape parameter is small enough (so that the expected CRPS is finite) for all covariate values. For a forecast in the GPD class, i.e. a GPD with its own shape and scale parameters, the risk is equal to the Bayes risk if and only if these parameters coincide with the true ones almost everywhere.
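To make this concrete, the following sketch (entirely ours; since the closed forms (7)-(8) are not reproduced, the CRPS is computed numerically with scipy's genpareto) checks by Monte Carlo that the true GPD attains a smaller expected CRPS than a misspecified one.

```python
import numpy as np
from scipy.stats import genpareto

def mean_crps_gpd(xi, sigma, sample, upper=200.0, n_grid=4001):
    """Monte Carlo estimate of the expected CRPS of a GPD(sigma, xi) forecast.

    The CRPS integral is truncated to [0, upper], an assumption that is harmless
    here because the GPD mass beyond `upper` is negligible for these parameters.
    """
    z = np.linspace(0.0, upper, n_grid)
    dz = z[1] - z[0]
    F_z = genpareto.cdf(z, c=xi, scale=sigma)   # forecast CDF on the grid
    scores = [np.sum((F_z - (z >= y)) ** 2) * dz for y in sample]
    return float(np.mean(scores))

rng = np.random.default_rng(1)
xi_true, sigma_true = 0.2, 1.0
sample = genpareto.rvs(c=xi_true, scale=sigma_true, size=5000, random_state=rng)

# The ideal forecast (true parameters) should achieve the smallest expected CRPS.
print(mean_crps_gpd(xi_true, sigma_true, sample))
print(mean_crps_gpd(0.4, 1.5, sample))  # misspecified forecast: larger expected CRPS
```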


Finally, we consider the case of a predictor built on a training sample, as presented in the introduction, to estimate the conditional distribution of the response given the covariates. A new independent observation is then used to evaluate the performance of the predictor. The predictor has an expected CRPS

with the expectation taken both with respect to the training sample and the test observation. Once again, when the score is integrable, the expected risk has a unique minimum given by the ideal prediction. The excess of risk becomes

(10)

For large sample sizes, one expects that the predictor correctly estimates the conditional distribution and that the excess of risk tends to zero. A natural question is to investigate the rate of convergence of the excess of risk to zero as the sample size grows. The risk depends on the distribution of the observations and we want the model to perform well on large classes of distributions. Hence, we consider the standard minimax approach, see for instance gyorfi for the standard cases of regression and classification.

We consider the following classes of distributions.

Definition 1.

For , and , let be the class of distributions such that the conditional distribution satisfies:

  1. -a.s.;

  2. For all , ;

  3. for all .

Remark 1.

In the first condition, the assumed support of the covariates could be replaced by any compact set. The second condition requires a uniform bound, which is a condition on the dispersion of the distribution since it implies that the mean absolute error (MAE) remains uniformly bounded. The third condition is a regularity statement on the conditional distribution as a function of the covariates. The conditions in Definition 1 are very similar to the conditions considered in the point regression framework, see Theorem 5.2 in gyorfi.

Example 3.

In the GPD regression framework, the boundedness condition is equivalent to a uniform bound involving the scale and shape parameters. The regularity condition holds with suitable constants as soon as the shape and scale parameters are both Hölder continuous functions of the covariates. For example, the popular case where the shape and scale parameters are assumed to depend linearly on the covariates belongs to a class of distributions of Definition 1.

In the following, we study the convergence rate of the excess of risk in order to obtain the optimal minimax convergence rate. The reasoning is divided into three steps:

  1. We provide in Section 2.2 an explicit and non-asymptotic upper bound for the excess of risk of the k-nearest neighbor model uniformly on the class; the upper bound is then optimized with a suitable choice of the number of neighbors.

  2. In Section 2.3, we obtain similar results for the kernel model.

  3. We show in Section 2.4 that is a lower minimax rate of convergence; the main argument is that it is enough to consider a binary model where both the observation and the prediction take values in ; we deduce that, in this case, the CRPS coincides with the mean squared error, so that we can appeal to standard results on the lower minimax rate of convergence for regression.

Combining these three steps, we finally obtain Theorem 1 providing the optimal minimax rate of convergence of the excess of risk on the class . All the proofs are postponed to the Appendix.

2.2 Upper bound for the k-nearest neighbor model

The k-NN method for distributional regression is defined in Equation (2). Here, we use not only the mean of the nearest-neighbor sample but its entire empirical distribution. Interestingly, the tools developed to analyze the k-NN method in point regression can be used in our distributional regression framework.
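Before stating the bound, here is a small self-contained simulation sketch (entirely ours, with an assumed Gaussian-location data-generating model and an assumed tuning of k borrowed from Lipschitz point regression) that estimates the excess of risk (1) of the k-NN predictor by Monte Carlo for a few sample sizes; its decay can be compared with the rates discussed in this section.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
d = 2  # covariate dimension (assumed)

def simulate(n):
    """Assumed toy model: X uniform on [0,1]^d and Y | X = x ~ N(sum(x), 1)."""
    X = rng.uniform(size=(n, d))
    y = X.sum(axis=1) + rng.normal(size=n)
    return X, y

def knn_excess_risk(n, k, n_test=200):
    """Monte Carlo estimate of E[ integral of (F_hat(z|X) - F*_X(z))^2 dz ] for k-NN."""
    X, y = simulate(n)
    X_test = rng.uniform(size=(n_test, d))
    z = np.linspace(-5.0, 8.0, 801)
    dz = z[1] - z[0]
    errors = []
    for x in X_test:
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        F_hat = (y[idx][:, None] <= z[None, :]).mean(axis=0)  # empirical CDF of neighbors
        F_star = norm.cdf(z, loc=x.sum(), scale=1.0)          # true conditional CDF
        errors.append(np.sum((F_hat - F_star) ** 2) * dz)
    return float(np.mean(errors))

for n in (200, 800, 3200):
    # Assumed tuning k ~ n^(2/(2+d)), by analogy with Lipschitz point regression.
    k = max(1, round(n ** (2 / (2 + d))))
    print(n, k, knn_excess_risk(n, k))
```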

Proposition 1.

Assume and let be the -nearest neighbor model defined by Equation (2). Then,

where and is the volume of the unit ball in .

Let us stress that the upper bound is non-asymptotic and holds for all fixed and . Optimizing the upper bound in yields the following corollary.

Corollary 1.

Assume and consider the -NN model (2).

  • For , the optimal choice yields

    with constant .

  • For , the optimal choice yields

    with constant .

2.3 Upper bound for the kernel model

Kernel methods adapted to distributional regression are defined in Equation (3). For convenience and simplicity of notation, we develop our result for the simple uniform kernel. However, it should be stressed that all the results can be extended to boxed kernels (gyorfi, Figure 5.7 p. 73) at the price of some extra multiplicative constants. For the uniform kernel, the estimator can be written as

(11)

when the denominator is nonzero, and with a default convention otherwise.

Proposition 2.

Assume and let be the kernel model defined by Equation (11). Then,

where only depends on .

Once again, the upper bound is non-asymptotic and holds for all fixed and . Optimizing the upper bound in yields the following corollary.

Corollary 2.

Assume and consider the kernel model (11). For any , the optimal choice

yields

with

2.4 Optimal minimax rates of convergence

We finally compare the rates of convergence obtained in Corollaries 1 and 2 with a lower minimax rate of convergence in order to see whether the optimal rate of convergence is achieved. We first recall these different notions of minimax rates of convergence.

Definition 2.

A sequence of positive numbers is called an optimal minimax rate of convergence on the class if

(12)

and

(13)

where the infimum is taken over all distributional regression models trained on . If the sequence satisfies only the lower bound (12), it is called a lower minimax rate of convergence.
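Displays (12) and (13) are not reproduced above; in standard minimax terminology (e.g. gyorfi), and with our own notation for the excess of risk, the class and the rate sequence, they plausibly read:

```latex
% Hedged reconstruction of (12)-(13); the excess risk E, class D and rate a_n are our notation.
\liminf_{n \to \infty} \, \inf_{\hat F_n} \, \sup_{P \in \mathcal{D}}
   a_n^{-1} \, \mathbb{E}\bigl[ \mathcal{E}(\hat F_n) \bigr] > 0
   \qquad \text{(lower bound, cf. (12))}
\\[4pt]
\limsup_{n \to \infty} \, \inf_{\hat F_n} \, \sup_{P \in \mathcal{D}}
   a_n^{-1} \, \mathbb{E}\bigl[ \mathcal{E}(\hat F_n) \bigr] < \infty
   \qquad \text{(upper bound, cf. (13))}
```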

To prove a lower bound on a class , it is always possible to consider a smaller class . Indeed, if , we clearly have

so that any lower minimax rate of convergence on is also a lower minimax rate of convergence on .

To establish the lower minimax rate of convergence, we focus on the following classes of binary responses.

Definition 3.

Let be the class of distributions of such that :

  1. and

    is uniformly distributed on

    ;

  2. for all .

Since a binary outcome is bounded, the dispersion condition in Definition 1 holds. The class of binary models is therefore included in the general class, and the following lower bound established on the smaller class also holds on the larger class.
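The reduction announced in step 3 of the outline above rests on the fact that, for a binary outcome, the CRPS collapses to the squared error; spelling this out (our own computation from definition (4)):

```latex
% For Y in {0,1} and a forecast F that puts mass p on 1 (so F(z) = 1 - p on [0,1)):
\mathrm{CRPS}(F, y) = \int_0^1 \bigl( (1-p) - \mathbb{1}_{\{y \le z\}} \bigr)^2 \, dz
                    = (p - y)^2 , \qquad y \in \{0, 1\},
```

which is the Brier score, i.e. the squared error between the predicted probability and the binary outcome.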

Proposition 3.

The sequence is a lower minimax rate of convergence on the class . More precisely,

(14)

for some constant independent of .

Combining Corollaries 1 and 2 and Proposition 3, we can deduce that, in dimension d≥2, the k-NN model reaches the minimax lower rate of convergence for the class and that the kernel model reaches the minimax lower rate of convergence in any dimension. This shows that the lower rate of convergence is in fact the optimal rate of convergence and proves the following theorem.

Theorem 1.

The sequence is the optimal minimax rate of convergence on the class .

It should be stressed that the rate of convergence is the same as in point regression with squared error, see Theorems 3.2 and 5.2 in gyorfi for the lower bound and upper bound, respectively.

3 Conclusion and Discussion

We found that the optimal rate of convergence for distributional regression on the class considered is of the same order as the optimal rate of convergence for point regression. Thus, with regard to the sample size, distributional regression evaluated with the CRPS converges at the same rate as point regression, even though the distributional estimate carries more information on the prediction of the underlying process.

We have also shown that the k-NN method and the kernel method reach this optimal rate of convergence, respectively in dimension d≥2 and in any dimension. However, these methods are not used in practice because of the limitations of their predictive power in moderate or high dimension due to the curse of dimensionality. An extension of this work could be to study whether state-of-the-art techniques reach the optimal rate of convergence obtained in this article. Random Forest (Breiman_2001) methods, such as Quantile Regression Forests (meinshausen_2006) and Distributional Random Forests (cevid_2020), appear to be natural candidates as they are based on a generalized notion of neighborhood and have seen recent developments in the statistical postprocessing of weather forecasts (see, e.g., taillardat_2016).

The results of this article were obtained for the CRPS, which is widely used in practice, but they can easily be extended to the weighted CRPS in its standard uses. The weighted CRPS is defined as

with a chosen weight function. The weighted CRPS is used to focus the score on specific regions of the outcome space (gneiting_ranjan). It is notably used in the study of extreme events by giving more weight to the extreme behavior of the distribution.
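The display defining the weighted CRPS is missing above; following gneiting_ranjan, the threshold-weighted version multiplies the integrand of (4) by the chosen weight function w(z). A short numerical sketch (ours, with hypothetical names):

```python
import numpy as np
from scipy.stats import norm

def threshold_weighted_crps(cdf, y, weight, lower=-30.0, upper=30.0, n_grid=20001):
    """Approximate twCRPS(F, y) = integral of w(z) * (F(z) - 1{y <= z})^2 dz on a grid."""
    z = np.linspace(lower, upper, n_grid)
    integrand = weight(z) * (cdf(z) - (z >= y)) ** 2
    return float(np.sum(integrand) * (z[1] - z[0]))

# Weight focusing on the upper tail beyond a threshold t = 1 (indicator weight).
w_tail = lambda z: (z >= 1.0).astype(float)
print(threshold_weighted_crps(norm.cdf, y=0.3, weight=w_tail))
# With w identically equal to 1, the score reduces to the ordinary CRPS.
print(threshold_weighted_crps(norm.cdf, y=0.3, weight=lambda z: np.ones_like(z)))
```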

Moreover, an interesting development would be to obtain similar results on rates of convergence with respect to other strictly proper scoring rules or metrics, for instance the energy score or Wasserstein distances.

Acknowledgments: The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-20-CE40-0025-01 (T-REX project) and of the Energy oriented Centre of Excellence-II (EoCoE-II), Grant Agreement 824158, funded within the Horizon2020 framework of the European Union. Part of this work was also supported by the ExtremesLearning grant from 80 PRIME CNRS-INSU and the ANR project Melody (ANR-19-CE46-0011).

References

Appendix A Proof of Proposition 1

For simplicity of notation, we do not always indicate with respect to which variables the expectation is taken. The context makes it clear enough to avoid confusion.

Proof.

Recall that for the CRPS, the excess of risk is equal to

(15)

We first estimate the error for a fixed query point and threshold. Denote the nearest neighbors of the query point and the associated values of the response variable. Conditionally on the covariates, these response values are independent, each distributed according to the conditional distribution at the corresponding neighbor. This implies that, conditionally, the k-NN estimate is the average of independent indicator random variables, each with a Bernoulli distribution whose parameter is the conditional distribution function evaluated at the corresponding neighbor. Therefore, the conditional bias and variance are given by

Adding up the squared conditional bias and variance and integrating with respect to , , we obtain the mean squared error

Using Jensen’s inequality and integrating with respect to , we deduce that the excess of risk (15) satisfies

Using conditions and in the definition of the class to bound from above the first and second term respectively, we get

where the last inequality uses the fact that, by definition of nearest neighbors, the distances , , are non-increasing.
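For readability, here is our reconstruction of the conditional bias and variance used above; it follows directly from the Bernoulli observation made at the start of the proof, and the notation (k neighbors X_{(i)}(x), true conditional CDF F*_x) is ours.

```latex
% Conditionally on the covariates, 1{Y_{(i)}(x) <= z} ~ Bernoulli(F^*_{X_{(i)}(x)}(z)), independently in i.
\mathrm{Bias} = \mathbb{E}\bigl[ \hat F_n(z \mid x) \mid X_1, \dots, X_n \bigr] - F^*_x(z)
             = \frac{1}{k} \sum_{i=1}^{k} \bigl( F^*_{X_{(i)}(x)}(z) - F^*_x(z) \bigr),
\\[4pt]
\mathrm{Var}\bigl( \hat F_n(z \mid x) \mid X_1, \dots, X_n \bigr)
             = \frac{1}{k^2} \sum_{i=1}^{k} F^*_{X_{(i)}(x)}(z) \bigl( 1 - F^*_{X_{(i)}(x)}(z) \bigr).
```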

The last step of the proof is to use Theorem 2.4 from biau_2015 stating that

Together with the concavity inequality (as )

we deduce

concluding the proof of Proposition 1. ∎

Appendix B Proof of Proposition 2

Proof.

Equation (11) can be rewritten as

with the closed ball centered at the query point of radius equal to the bandwidth, and the empirical measure corresponding to the covariate sample. Recall that we use the default estimator when the ball contains no observation.

Similarly to the proof of Proposition 1, a bias/variance decomposition of the squared error yields

The excess of risk at is thus decomposed into three terms

that we analyze successively.

The first term (bias) is bounded from above using Jensen’s inequality and property of :