Forecast evaluation with imperfect observations and imperfect models

06/10/2018 ∙ by Philippe Naveau, et al. ∙ 0

The field of statistics has become one of the mathematical foundations in forecast evaluations studies, especially in regard to computing scoring rules. The classical paradigm of proper scoring rules is to discriminate between two different forecasts by comparing them with observations. The probability density function of the observed record is assumed to be perfect as a verification benchmark. In practice, however, observations are almost always tainted by errors. These may be due to homogenization problems, instrumental deficiencies, the need for indirect reconstructions from other sources (e.g., radar data), model errors in gridded products like reanalysis, or any other data-recording issues. If the yardstick used to compare forecasts is imprecise, one can wonder whether such types of errors may or may not have a strong influence on decisions based on classical scoring rules. Building on the recent work of Ferro (2017), we propose a new scoring rule scheme in the context of models that incorporate errors of the verification data, we compare it to existing methods, and applied it to various setups, mainly a Gaussian additive noise model and a gamma multiplicative noise model. In addition, we frame the problem of error verification in datasets as scoring a model that jointly couples forecasts and observation distributions. This is strongly connected to the so-called error-in-variables models in statistics.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In verification and recalibration of ensemble forecasts, an essential verification step is to find data that precisely identify the forecast events of interest, the so-called verification dataset (see, e.g., Jolliffe04). Verification data are of different types, for example reanalysis data; and, in practice, the true process of interest is rarely directly observed. Jolliffe04, in Section 1.3 of their book, discussed the uncertainty associated with verification data (such as sampling uncertainty, direct measurement uncertainty, or changes in locations in the verification dataset). In the field of data assimilation, uncertainty in verification data is routinely taken into account in assimilation schemes (see, e.g., Daley93; Waller14; Janjic17). This issue is also known within the verification forecast community; but, to our knowledge, it has not strongly been put forward until recently (Ferro17). A distribution-based verification approach was proposed by Murphy87

, where joint distributions for forecast and observation account for the information and interaction of both datasets. Various approaches have been proposed in the literature. Some methods attempt to correct the verification data and use regular scoring metrics to evaluate forecasts, such as perturbed ensemble methods

(Anderson96; Hamill01) or (see also Bowler08; Gorgas12)

for other approaches. Other methods consider observations as random variables

(see, e.g., Candille08; Pappenberger09; Pinson12) and express scoring metrics in that context. Weijs11, following the decomposition of a score into its reliability, resolution, and uncertainty components, accounted for the uncertainty relative to the truth or to the climatology. Siegert16

proposed a Bayesian framework to jointly model the prediction and observation in a signal-plus-noise model that embedded the variability between the two datasets. These studies highlight the benefit of accounting for observation errors; however, many of them focus on categorical variables

Bowler06 or are computationally challenging for scalar variables (see, e.g., Candille08; Pappenberger09; Pinson12).

Particularly relevant to our work is the recent work of Ferro17, who framed in precise mathematical terms the problem of error in verification data. His setup relies on modeling the probabilistic distribution of the verification data given the underlying physical process that is not observed. In terms of assessment, this framework is rooted in the concept of proper scoring rules that is now commonly used in weather forecasting centers to compare different forecasts (see, e.g., Gneiting07). Ferro17

also proposed a solution to take into account verification error when computing a scoring rule. Building on this solid foundation, we extend and improve Ferro’s work. By taking full advantage of the information assumed in the probability model at hand, we show that a new correction of proper scoring rules provides a smaller variance than does the one proposed by Ferro, while keeping the same mean with respect to the true but unobserved reference; see Section

2 and Proposition 1. Another improvement is the range of examples treated; see Sections 3-5. In addition to revisiting the univariate Gaussian error additive model, we treat other models, in particular the gamma error model; and we explore multivariate setups. Similarly, in terms of scoring rules, we not only treat the logarithmic score but also investigate the continuous ranked probability score (CRPS), a well-known score in the forecast community. Another added value of our work is that, in Section 5, we view the issue of error in verification datasets within the framework of jointly modeling forecasts and observations distributions (see, e.g., Ehm16). This coupling can be described, in a statistical sense, as a multivariate error-in-variable (EIV) model (see, e.g., Fuller87) that generalizes the classical Gaussian additive error model. One key element of our work is that, in contrast to most past studies that focused on the mean (with respect to observations) of a given score, we view all scores as random variables (driven by observational draws). In some cases, we show that one can derive the explicit expressions of the corrected score probability density function (pdf); see Sections 3 and 4. We conclude in Section 6 by highlighting the main ideas of our work and discussing a few possible directions to go further in the analysis of imperfect observations.

2 Proper scoring rules under uncertainty

In this section, we briefly describe the framework of proper scoring rules and discuss the class of scores proposed by Ferro17 to account for error in verification data and the framework that we propose. Verification data are assumed to be a realization of an unobserved process tainted with error. In this work, the error will be modeled in two cases, one additive and one multiplicative. In the following, we denote the verification data by and its generating random process by , and the underlying true random process by and its realization by . Distinguishing between the unobserved truth () and the observed but incorrect verification data () is fundamental to understanding the impact of imperfect observations on forecast scoring.

2.1 Proper scores

A probabilistic scoring rule, say , is a function with two inputs, the forecast distribution and a realization , which stands as a verification datapoint. The score provides one scalar output indicating about how far the forecast distribution is from the verification data : the smaller , the better. Complex mathematical properties are required for consistent ranking of forecast distributions. For instance, the statistical property of properness, defined by Gneiting07, prevents a scoring rule from favoring any probabilistic distribution over the distribution of the verification data. Properness is expressed as

where corresponds to the pdf of and represents any forecast pdf. This mathematical property means that, on average, a forecast can never be better than the observational distribution .

An incorrect selection of the best forecast can occur if the forecasts are ranked differently depending on the reference used, i.e. the verification data or the true underlying process. Ferro17 showed that forecasts can be misleading, in terms of score values, in the presence of noise in the verification data. Frameworks that embed the observational error are proposed and compared in the following subsections.

2.2 Corrected proper scores

Ferro17 proposed a mathematical scheme to correct a score when error is present in the verification data. This modified score, denoted in this work, is derived from a classical proper score, say . With these notations, and were linked by using the conditional expectation of given , that is,


This ensures that . In other words, the score computed from the ’s provides the same value on average that the proper score computed from the unobserved ’s. In terms of assumptions, we note that the conditional law of given needs to be known in order to compute from , e.g. see Definitions 2 and 3 in Ferro17. We also note that represents one way to correct the given proper score , but it may not be unique. One can find a different strategy by walking into Ferro’s footsteps, but backward. By using the conditional pdf of the random variable given , Equation (1) allows one to build from a given . We propose to follow the opposite path: Given , we construct a new score, say , from the conditional pdf of given and such that . This leads us to define


This strategy appears practical in the sense that we start with any classical score and transform it to take into account the observational error. The drawback is that we need to compute the pdf of the conditional variable from the pdfs of the two random variables and

. Still, this is a just an application of Bayes theorem, and it is easily feasible in many practical cases. Additionally,

is automatically a proper score with respect to the hidden vector

, in the sense that

At this stage, we have two possible ways to build a corrected score, and it is natural to compare their properties. By construction, the random variables and have the same mean

and therefore another criterion is necessary in order to compare these two scoring rules. It seems natural to investigate second-order features, and the following proposition enables us to rank these two scores with respect to their variances.

Proposition 1

Let be the pdf of any given forecast. Let be a proper score scoring rule relative to the unobserved truth represented by the variable . The scoring rules defined above, and , verify the following variance inequality:


All proofs can be found in the Appendix.

Proposition 1 indicates that if one is ready to view ’s as a random variable and prefers scores with smaller variances, then should be favored over . Still, Proposition 1 is general, and treating specific cases will be useful in order to illustrate our strategy.

3 Additive Gaussian observation error

In this section,

denotes a Gaussian distribution with mean

and variance . Ferro’s simplest example is the following:

where all Gaussian variables are assumed to be independent; is observed, but is not, being the hidden truth.

3.1 Corrected log scores

For a Gaussian predictive pdf with mean and variance , the logarithm score, defined by


is well suited and has been widely used in the literature. For the additive error Model (A), Ferro’s paper clearly showed that the classical logarithm score, is misleading and has to be modified before being computed from the noisy observations . Applying (1) brought the correct adapted score,


that satisfies the important property


Since is a proper score with respect to , the new score can provide the same expectation as if the vector was observed. Note that the score never takes advantage of the knowledge of present in Model (A). Only the noise level is used in Equation (5). The reason is that was built from .

Applying basics properties of Gaussian conditioning (see the Appendix for details), our score defined by (2) can be written as


where Note that and now appear in (7). This new corrected score integrates the knowledge contained in the conditional pdf of . As shown in Proposition 1, this added information reduces the variance of this score w.r.t. to .

Besides their variances, one may wonder how the distributions of and are different. The following proposition summarizes the distributional features of both scores.

Proposition 2

Under the Gaussian additive model (A), the random variables associated with the corrected log-scores defined by (5) and (7) can be written as

where means equality in distribution and and

represent noncentral chi-squared random variables with one degree of freedom and respective non-centrality parameters

The explicit expressions of the constants , , and can be found in the Appendix. The score also admits an explicit density; see the Appendix.

Since the variance of a noncentral chi-squared random variable with one degree of freedom and noncentrality parameter equals , one can check that

As already known from Proposition 1, the variance of is always greater than that of . This explicit variance formula highlights the key role of in comparing the two corrected scores. For cases with a large noise error , should be preferred over in terms of variance reduction.

Having explicit expressions of the score distributions in terms of chi-squared densities, one can visually compare their densities. In Figure 1, we arbitrarily choose , for the forecast pdf , and for the hidden truth , and different values of for the noise variance of . The blue, red, and green curves represent the density of the three variables , , and , respectively. The vertical orange line corresponds to the common mean of the three plotted densities

see Equation (6). As expected from Proposition 1, our proposed score (red pdf) is more centered on the orange vertical line than is Ferro’s score (blue pdf). We also note that properness is a concept based on averaging scores; but the asymmetry of these three noncentral chi-squared densities may challenge the ordering with respect to the mean. The average may not be the best feature to characterize noncentral chi-squared pdfs.

Figure 1: Probability densities of the scores defined by (4), (5), and (7) under the Gaussian framework of Model (A). These three densities have the same mean value defined by (6) and represented by the orange vertical line. The left panel corresponds to a perfect forecast ( and ), and the right panel is the forecast with and .

3.2 Corrected CRPS

Besides the logarithmic score, the CRPS is another classical proper scoring rule used in weather forecast centers. It is defined as


where and are iid copies random variables with continuous pdf . The CRPS can be rewritten as

where represents the positive part of and

corresponds to the survival function associated to the cumulative distribution function (cdf)

. For example, the CRPS for a Gaussian forecast with parameters and is equal to


where and

are the pdf and cdf of a standard normal distribution

Gneiting05; Taillardat16.

Proposition 3

Under the Gaussian additive model (A), the random variable associated with the corrected log scores defined by (2) can be written as


where and the random variable follows a Gaussian pdf with mean and variance .

In Figure 2, the distribution of the CRPS is plotted under three cases. One corresponds to the ideal case where the underlying true process is used as a verification data. The second case is the realistic one with the observation process used as verifying data. The third case corresponds to the score corrected under the conditional distribution from Equation (10

). Density estimates are shown for different values of

. The CRPS evaluated on is less centered than are the two others. One can notice the benefit of the proposed correction in terms of centering on the mean value and also in reducing variance of the score. Indeed, the corrected CRPS shows the narrowest distribution around its mean.

Figure 2:

Shown in blue is the estimated probability distribution of the CRPS defined by Equation (

8) if the underlying process were observed; shown in green is the CRPS (8) evaluated at the observed process ; and shown in red is the corrected score from (10). The left panel corresponds to a perfect forecast ( and ), and the right panel is the forecast with and .

4 Multiplicative gamma distributed observation error

The Gaussian assumption is appropriate when dealing with averages, for example, mean temperatures; however, the normal hypothesis cannot be justified for positive and skewed variables such as precipitation intensities. An often-used alternative in such cases is to use a gamma distribution, which works fairly well in practice to represent the bulk of rainfall intensities. Hence, we assume in this section that the true but unobserved

now follows a gamma distribution with parameters and

In the Gaussian case, an additive model was used to link the hidden truth and the observational vector. For positive random variables such as precipitation, Gaussian additive models cannot be used to introduce noise. Instead, we prefer to use a multiplicative model of the type


where is a positive random variable independent of . To make feasible computations, we model the error as an inverse gamma pdf with parameters and :

The basic conjugate prior properties of such gamma and inverse gamma distributions allows us to easily derive the pdf


4.1 Corrected log score

To compute a log score, we need to model the forecast pdf. In this section we consider a gamma distribution with parameters and for the prediction. With obvious notations, the logarithmic score for this forecast becomes

Proposition 4

Under the gamma multiplicative model (B), the random variable associated with the corrected log scores defined by (2) and written as (12) can be expressed as


where represents the digamma function defined as the logarithmic derivative of the gamma function, namely, .

In the multiplicative case, the variance of the verification data can decrease and hence affects the performance of the associated score. Figure 3 shows the distributions of the three log-scores presented in this section; in this case the variances of and of decrease. One can see the issue with the distribution of that is not centered around and has a decreasing variance. Similar conclusions to those in the previous sections can be drawn in terms of the the benefit of the correction of the score.

Figure 3: Estimated probability distribution of the log score under the multiplicative gamma model; shown in blue is the distribution of ; shown in green is the distribution of ; and shown in red is the distribution of from Equation (13).

4.2 Corrected CRPS

For a gamma forecast with parameters and given by (12), the corresponding CRPS (see, e.g., Taillardat16; Scheuerer15) is equal to

Proposition 5

Under the gamma multiplicative model (B), the random variable associated with the corrected CRPS defined by (2) can be expressed from (8) as

Details of the computations are found in the Appendix.

5 Joint distribution of errors in forecasts and observations

Model (A) was useful for understanding the role of observational errors, but its simplicity limits its application in practice. In particular, it does not incorporate the fundamental idea that the forecast and observational vectors are both driven by some hidden indirectly observed process, for example, the state of the atmosphere. To illustrate this joint distribution of forecasts and observations, we recall Table 2 (p. 516) of Ehm16, who studied the following type of Gaussian additive models:

where all Gaussian variables are assumed to be independent; and are observed, but is not, being the hidden truth. In a study by Ehm16, and were set equal to one. If this hidden variable (one can think of “the state of atmosphere"), , was perfectly known, then would contains (stochastically) the same information as does when . The term ideal forecaster was introduced to describe in this case. The climatological forecaster corresponds to the unconditional distribution, that is, a centered normal distribution with variance equal to . The main differences between Model (A) and Model (C) are that the forecast is now linked to the hidden truth and the error term is added to .

Model (C) introduces the fact that and both can be imperfect, that is, with error terms, and both depend on the hidden . In such a context, one can wonder whether scores can be modified to handle this joint information, not only for Model (C) but in the general context where the conditional distributions of and are assumed to be known. Our strategy is to extend Definition (2) by incorporating the information about contained in . This leads us to define


By construction, the random variable satisfies

At this stage, we highlight two points.

First, our goal is to forecast , the hidden truth, and consequently the forecast pdf used in is ideal when it is equal to the pdf of . In practice, whether the distributional forecast produced by a weather prediction center aims at forecasting or

may not be clear. In data assimilation, however, this distinction is clear in the sense that the object of interest based on Kalman filtering techniques is the hidden state


Second, the definition in (16) depends on the forecast random variable. If two weather prediction centers produce two different forecast variables, say and , then three corrected scores could be computed:

These three scores, as well as , will be proper with respect to the hidden variable . This leads to the question of how to compare them. The following proposition gives some suggestions.

Proposition 6

Let be the pdf of any given forecast. Let be a proper score scoring rule relative to the unobserved truth represented by the variable . The random variables defined from the scoring rules and verify the following variance inequality:


Basically, this proposition tells us that whenever the variable contains some information about , the conditional score “squeezes out" such a knowledge about and consequently reduces the variance of the corrected score. If does not bring information about , namely, , then . If two different forecast variables, say and , are available, then, using the same type of argument, one can show that the corrected score based on the concatenation of and has a smaller variance

To illustrate this proposition, we connect Model (C) and the EIV models (see, e.g., Fuller87), and we derive the corrected log score in a multivariate Gaussian additive context.

5.1 Error-in-variable context

In this section, we combine and extend Model (A) and Model (C) into a multivariate context:


where and represent Gaussian model errors and these error terms are independent of and between them. Basically, this system of equations tells us that the forecast aims at mimicking the truth described by ; but the forecast is imperfect, and the forecaster error has to be added. The same interpretation can be made for . Besides the multivariate aspect, the main difference between Model (C) and Model (18) is that the forecast may not be ideal because the mean and variance of and can be different. This implies that a corrected score based on and has to take into into account these discrepancies. Model (18) can be viewed as a member of the EIV class that encompasses a wide large of models (see, e.g., Fuller87). Our specific model is close to the models studied in the context of climate detection and attribution (see, e.g., Hannart14; Ribes16).

Given the observational vector of and the forecast , one can compute the conditional distribution of given (this can also be interpreted in a Bayesian framework). With respect to the EIV system defined by (18), the solution for such given is


where the mean is simply a weighted average that takes into account the information on and :

Proposition 7

In the context of the EIV model (18), the corrected version of the generalization of (7) is

where denotes the trace operator for any forecast density .

Similar to previous sections, one could investigate the analytical distribution of the score , which corresponds to the trace of a noncentral Wishart distribution. However the general expression of this distribution is hardly tractable Mathai82

but can be expressed as a generalized chi-squared distribution in particular conditions


6 Discussion and conclusion

Building on Ferro17’s elegant representation of imperfect observations, we have quantified, in terms of variances and even distributions, the need to account for the error associated with the verification observation when evaluation probabilistic forecasts with scores. An additive framework and a multiplicative framework have been studied in detail to account for the error associated with verification data; and an additional setup is proposed to account for both observational and forecast errors. Both setups involve a probabilistic model for the accounted errors and a probabilistic description of the underling non-observed physical process. Although we look only at idealized cases where the parameters of the involved distributions are assumed to be known, this approach enables us to understand the importance of accounting for the error associated with the verification data. Moreover, the study raises the important point of investigating the distribution of scores when the verification data is considered to be a random variable. Indeed, investigating the means of scores may not provide sufficient information to compare between score discriminative capabilities. One could choose to take into account the uncertainty associated with the inference of distribution parameters. In this case, a Bayesian setup could elegantly integrate the different hierarchies of knowledge and a priori information.


We thank Aurélien Ribes for his helpful comments and discussions. We also thank the Office of Science and Technology of the Embassy of France in the United States, Washington, DC, for supporting our collaboration through the initiative Make Our Planet Great Again. This material is based in part on work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357.

Conflict of interest

You may be asked to provide a conflict of interest statement during the submission process. Please check the journal’s author guidelines for details on what to include in this section. Please ensure you liaise with all co-authors to confirm agreement with the final statement.

Appendix A Appendix

Proof of Proposition 1:

For any random variable, say , its mean can be written conditionally to the random variable in the following way:

In our case, the variable and . This gives . To show inequality (3), we use the classical variance decomposition

With our notations, we have

This leads to


To finish our proof, we use the same variance decomposition but with the following form:

Taking gives

Coupling this with inequality (21) provides

Inequality (3) follows.

Proof of Equation (7):

To express the score proposed in (2), one needs to derive the conditional distribution from Model (A). More precisely, the Gaussian conditional distribution of given is equal to

where is a weighted sum that updates the prior information about with the observation ,

Combining this information with Equations (2) and (4) leads to

By construction, we have

This means that, to obtain the right score value, we can first compute as the best estimator of the unobserved and then use it into in the corrected score .

Proof of Proposition 2:

For Model (A), both random variables and are normally distributed with the same mean but different variances, and , respectively. Since a chi-square distribution can be defined as the square of a Gaussian random variable, it follows from (5) and (7) that

where means equality in distribution and


and and represent noncentral chi-squared random variables with one degree of freedom and respective noncentrality parameters

Proof of Proposition 3:

To compute the corrected CRPS, one needs to calculate the conditional expectation of under the distribution of . We first compute the expectation and then substitute by and its distribution with mean

and standard deviation

. From Equation (9) we obtain

If follows a normal distribution with mean and variance , that is, with a standard random variable, then we can define the continuous function with Then, we apply Stein’s lemma (Stein81), which states because is a standard random variable. It follows with the notations and that

where has a standard normal distribution





is a noncentered chi-square distribution with one degree of freedom and a noncentral parameter

with known moment generating function

It follows that