Properization: Constructing Proper Scoring Rules via Bayes Acts

06/19/2018 · by Jonas Brehmer, et al.

Scoring rules serve to quantify predictive performance. A scoring rule is proper if truth telling is an optimal strategy in expectation. Subject to customary regularity conditions, every scoring rule can be made proper, by applying a special case of the Bayes act construction studied by Grünwald and Dawid (2004) and Dawid (2007), to which we refer as properization. We discuss examples from the recent literature and apply the construction to create new types, and reinterpret existing forms, of proper scoring rules and consistent scoring functions. In an abstract setting, we formulate sufficient conditions under which Bayes acts exist and scoring rules can be made proper.






1 Introduction

Let $\mathcal{A}$ be a $\sigma$-algebra of subsets of a general sample space $\Omega$, and let $\mathcal{P}$ be a convex class of probability measures on $(\Omega, \mathcal{A})$. A scoring rule is any extended real-valued function $\mathrm{S}$ on $\mathcal{P} \times \Omega$ such that the expected score $\mathrm{S}(P, Q) = \int \mathrm{S}(P, \omega) \, \mathrm{d}Q(\omega)$ is well-defined for $P, Q \in \mathcal{P}$. The scoring rule $\mathrm{S}$ is proper relative to $\mathcal{P}$ if

$$\mathrm{S}(Q, Q) \le \mathrm{S}(P, Q) \quad \text{for all } P, Q \in \mathcal{P}. \tag{1}$$

In words, we take scoring rules to be negatively oriented penalties that a forecaster wishes to minimize. If she believes that a future quantity or event has distribution $Q$, and the penalty for quoting the predictive distribution $P$ when $\omega$ realizes is $\mathrm{S}(P, \omega)$, then (1) implies that quoting $P = Q$ is an optimal strategy in expectation. The scoring rule $\mathrm{S}$ is strictly proper if (1) holds with equality only if $P = Q$. For recent reviews of the theory and application of proper scoring rules see Dawid (2007), Gneiting and Raftery (2007), Dawid and Musio (2014), and Gneiting and Katzfuss (2014).
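As a concrete numerical illustration of the propriety condition (1), the following minimal sketch (our own example; the grid and the choice $q = 0.3$ are arbitrary) verifies that under the Brier score the expected penalty for a binary event is minimized by quoting the true success probability.

```python
# Numerical illustration of propriety: under the Brier score
# S(p, y) = (y - p)^2 for a binary outcome y in {0, 1}, the expected
# penalty E_q[S(p, .)] = q*(1-p)^2 + (1-q)*p^2 is minimized at p = q.

def brier(p, y):
    return (y - p) ** 2

def expected_score(p, q):
    """Expected score E_q[S(p, .)] when the true success probability is q."""
    return q * brier(p, 1) + (1 - q) * brier(p, 0)

q = 0.3                                   # arbitrary "true" probability
grid = [i / 1000 for i in range(1001)]    # candidate quotes
best = min(grid, key=lambda p: expected_score(p, q))
print(best)  # 0.3 -- truth telling is optimal in expectation
```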

The intent of this note is to draw attention to the simple fact that, subject to customary regularity conditions, any scoring rule can be properized, in the sense that it can be modified in a straightforward way to yield a proper scoring rule, so that truth telling becomes an optimal strategy. Implicitly, this construction has recently been used by various authors in a range of applications; see, e.g., Diks et al. (2011), Christensen et al. (2014), and Holzmann and Klar (2017).

Theorem 1 (properization).

Let $\mathrm{S}$ be a scoring rule. Suppose that for every $P \in \mathcal{P}$ there is a probability distribution $P^* \in \mathcal{P}$ such that

$$\mathrm{S}(P^*, P) \le \mathrm{S}(Q, P) \quad \text{for all } Q \in \mathcal{P}. \tag{2}$$

Then the function

$$\mathrm{S}^*(P, \omega) = \mathrm{S}(P^*, \omega) \tag{3}$$

is a proper scoring rule.

Here and in what follows we denote the real line by $\mathbb{R}$ and the extended real line by $\overline{\mathbb{R}} = [-\infty, \infty]$. Any probability measure $P^*$ with the property (2) is commonly called a Bayes act; for the existence of Bayes acts, see Section 3. Propriety of $\mathrm{S}^*$ is immediate from (2), since $\mathrm{S}^*(Q, Q) = \mathrm{S}(Q^*, Q) \le \mathrm{S}(P^*, Q) = \mathrm{S}^*(P, Q)$ for $P, Q \in \mathcal{P}$. In case there are multiple minimizers of the expected score $Q \mapsto \mathrm{S}(Q, P)$, the function $\mathrm{S}^*$ is well-defined by using a mapping $P \mapsto P^*$ that chooses a single element out of the set of minimizers. If $\mathrm{S}$ is proper and $P \in \mathcal{P}$, then $P^* = P$, so the proper scoring rules are fixed points under the properization operator.
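The construction in Theorem 1 can be carried out numerically. The following sketch (our own illustration; the function names and the grid resolution are arbitrary choices) properizes the improper absolute score for Bernoulli forecasts by locating a Bayes act through grid search; the resulting properized score is the familiar zero-one rule.

```python
# Sketch of Theorem 1 for Bernoulli forecasts: properize the improper
# absolute score S(p, y) = |y - p| by grid search over Bayes acts.

def score(p, y):
    return abs(y - p)

def expected(p_quote, p_true):
    return p_true * score(p_quote, 1) + (1 - p_true) * score(p_quote, 0)

GRID = [i / 200 for i in range(201)]  # candidate Bayes acts (arbitrary grid)

def bayes_act(p):
    """A minimizer of q |-> E_p[S(q, .)]; ties are broken by grid order."""
    return min(GRID, key=lambda q: expected(q, p))

def properized(p, y):
    return score(bayes_act(p), y)

# E_p|y - q| = p(1 - q) + (1 - p)q is linear in q, so the Bayes act is a
# point mass on the more likely outcome, and S* is the zero-one rule.
print(bayes_act(0.8), bayes_act(0.2))   # 1.0 0.0
print(properized(0.8, 1), properized(0.8, 0))  # 0.0 1.0
```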

Importantly, Theorem 1 is a special case of a general and powerful construction studied in detail by Grünwald and Dawid (2004) and Dawid (2007). Specifically, given some action space $\mathbb{A}$ and a loss function $\mathrm{L} : \mathbb{A} \times \Omega \to \overline{\mathbb{R}}$, suppose that for each $P \in \mathcal{P}$ there is a Bayes act $a_P \in \mathbb{A}$ such that

$$\mathbb{E}_P[\mathrm{L}(a_P, \cdot)] \le \mathbb{E}_P[\mathrm{L}(a, \cdot)] \quad \text{for all } a \in \mathbb{A}.$$

Then the function $\mathrm{S}(P, \omega) = \mathrm{L}(a_P, \omega)$ is a proper scoring rule. Note the natural connection to decision- and utility-based scoring approaches, where the quality of a forecast is judged by the monetary utility of the induced acts and decisions (Granger and Pesaran, 2000; Granger and Machina, 2006; Ehm et al., 2016).

In the remainder of the paper we focus on the above special case in which the action space $\mathbb{A}$ is the class $\mathcal{P}$. In Section 2 we identify scattered results in the literature as prominent special cases of properization (Examples 1–4), and we use Theorem 1 to construct new proper scoring rules from improper ones (Examples 5–7). Section 3 gives sufficient conditions for the existence of Bayes acts, and Section 4 contains a brief discussion. All proofs and technical details are deferred to the Appendix.

2 Examples

This section starts with an example in which we review the ubiquitous misclassification error from the perspective of properization. We go on to demonstrate how Theorem 1 has been used implicitly to construct proper scoring rules in econometric, meteorological, and statistical strands of literature. The notion of properization simplifies and shortens the respective proofs of propriety, makes them much more transparent, and puts the scattered examples into a unifying and principled joint framework. Further examples show other facets of properization: The scoring rules constructed in Example 5 are original, and the discussion in Example 6 illustrates a connection to the practical problem of the treatment of observational uncertainty in forecast evaluation. Finally, Example 7 includes an instance of a situation in which properization fails.

Example 1.

Consider probability forecasts of a binary event, where $\Omega = \{0, 1\}$ and $\mathcal{P}$ is the class of the Bernoulli measures. We identify any $P \in \mathcal{P}$ with the probability $p = P(\{1\})$ and consider two scoring rules, $\mathrm{S}_1$ and $\mathrm{S}_2$. The scoring rule $\mathrm{S}_1$ corresponds to the mean probability rate (MPR) in machine learning (Ferri et al., 2009, p. 30). The scoring rule $\mathrm{S}_2$ was first considered by Dawid (1986); it agrees with the special case in Section 4.2 of Parry (2016) and corresponds to the mean absolute error (MAE) as discussed by Ferri et al. (2009, p. 30). As noted by Parry (2016), the improper score $\mathrm{S}_2$ shares its (concave) expected score function with the proper Brier score. This illustrates the importance of the second condition in Theorem 1 of Gneiting and Raftery (2007): for a scoring rule $\mathrm{S}$, the (strict) concavity of the expected score function is equivalent to the (strict) propriety of $\mathrm{S}$ only if, furthermore, $\mathrm{S}(p, \cdot)$ is a subtangent of the expected score function at $p$. Both $\mathrm{S}_1$ and $\mathrm{S}_2$ are improper, with common Bayes act given by the point mass on the more probable outcome, $p^* = \mathbf{1}\{p > 1/2\}$ (with an arbitrary tie-breaking convention at $p = 1/2$), and with the same properized score given by the zero-one rule

$$\mathrm{S}^*(p, y) = \mathbf{1}\{\mathbf{1}\{p > 1/2\} \ne y\}.$$

A case-averaged zero-one score is typically referred to as misclassification rate or misclassification error; undoubtedly, this is the most popular and most frequently used performance measure in binary classification. While the scoring rule $\mathrm{S}^*$ is proper, it fails to be strictly proper (Gneiting and Raftery, 2007, Example 4; Parry, 2016, Section 4.3). Consequently, misclassification error has serious limitations as a performance measure, as persuasively argued by Harrell (2015, p. 258), among others. Nevertheless, the scoring rule $\mathrm{S}^*$ is proper, contrary to recent claims of impropriety in the blogosphere.

For the remainder of the section, let $\Omega = \mathbb{R}$ and let $\mathcal{A}$ be the Borel $\sigma$-algebra. We let $\mathcal{L}$ be the class of the Borel probability measures with a Lebesgue density. Furthermore, we write $\mathcal{P}_k$ for the Borel probability measures with finite $k$-th moment, and $\mathcal{P}_k^{\circ}$ for the respective subclasses when Dirac measures are excluded. Whenever it simplifies notation, we identify a probability measure $P$ with its cumulative distribution function $F$.
Example 2.

Let $\mathrm{S}$ be a proper scoring rule on some subclass $\mathcal{F}$ of $\mathcal{L}$ and let $w$ be a nonnegative weight function such that $\bar{w}(P) = \int w(z) f_P(z) \, \mathrm{d}z$ is positive for $P \in \mathcal{F}$, where $f_P$ denotes the density of $P$. Let

$$\mathrm{S}_w(P, y) = w(y) \, \mathrm{S}(P, y);$$

this score is improper unless the weight function $w$ is constant. Indeed, by Theorem 1 of Gneiting and Ranjan (2011), the Bayes act $P^*$ under $\mathrm{S}_w$ has density

$$f_{P^*}(y) = \frac{w(y) f_P(y)}{\bar{w}(P)}.$$

From this we see that the key statement in Theorem 1 of Holzmann and Klar (2017) constitutes a special case of Theorem 1. In the further special case in which $\mathrm{S}$ is the logarithmic score, the properized score (3) recovers the conditional likelihood score of Diks et al. (2011) up to equivalence, as noted in Example 1 of Holzmann and Klar (2017). For analogous results for consistent scoring functions see Theorem 5 of Gneiting (2011) and Example 2 of Holzmann and Klar (2017).
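Example 2 can be checked in a discrete toy setting. The sketch below assumes a discrete analogue of the weighted logarithmic score, $\mathrm{S}_w(q, y) = -w(y) \log q(y)$ (our own choice of analogue), and verifies numerically that the Bayes act is the $w$-tilted distribution; the particular $p$ and $w$ are arbitrary.

```python
import math
import random

# Discrete sketch of Example 2: for the weighted log score
# S_w(q, y) = -w(y) * log q(y), the Bayes act under p has
# probabilities proportional to w(y) * p(y).

random.seed(1)
p = [0.2, 0.5, 0.3]   # true distribution (arbitrary)
w = [1.0, 0.2, 2.0]   # nonnegative weight function (arbitrary)

def expected(q, p):
    """Expected weighted log score E_p[S_w(q, .)]."""
    return -sum(pi * wi * math.log(qi) for pi, wi, qi in zip(p, w, q))

z = sum(pi * wi for pi, wi in zip(p, w))
q_star = [pi * wi / z for pi, wi in zip(p, w)]   # tilted Bayes act

# The tilted distribution beats 1000 random alternatives:
for _ in range(1000):
    r = [random.random() + 1e-9 for _ in range(3)]
    q = [ri / sum(r) for ri in r]
    assert expected(q_star, p) <= expected(q, p) + 1e-12
print("Bayes act:", [round(x, 3) for x in q_star])
```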

Example 3.

For a probability measure $P$, let $\mu_P$, $\sigma_P^2$, and $\gamma_P$ denote its mean, variance, and centered third moment, and let $\mathrm{S}$ be the 'trial score' in equation (16) of Christensen et al. (2014). As Christensen et al. (2014) show in their Appendix A, any Bayes act under $\mathrm{S}$ has mean and variance that depend on $P$ only through these first three moments, so properization yields the spread-error score, which is proper relative to a class of probability measures with sufficiently many finite moments. Hence the construction of the spread-error score in Christensen et al. (2014) constitutes another special case of Theorem 1.

Example 4.

The predictive model choice criterion of Laud and Ibrahim (1995) and Gelfand and Ghosh (1998) uses the scoring rule $\mathrm{S}(P, y) = (y - \mu_P)^2 + \sigma_P^2$, where $\mu_P$ and $\sigma_P^2$ denote the mean and the variance of the predictive distribution $P$, respectively. As pointed out by Gneiting and Raftery (2007), this score fails to be proper. Specifically, any Bayes act under $\mathrm{S}$ has mean $\mu_P$ and vanishing variance, so properization yields the ubiquitous squared error, $\mathrm{S}^*(P, y) = (y - \mu_P)^2$.
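Taking the criterion's scoring rule to be squared error plus predictive variance, as discussed by Gneiting and Raftery (2007), the following sketch recovers the Bayes act of Example 4 (predictive mean, variance zero) by grid search; the moments and grids are arbitrary choices of ours.

```python
# Sketch of Example 4: the predictive model choice criterion scores a
# forecast only through its mean m and variance v,
#   S((m, v), y) = (y - m)^2 + v,
# so E_P[S] = sigma2_P + (mu_P - m)^2 + v is minimized at m = mu_P, v = 0.
# The Bayes act is the point mass at the predictive mean, and the
# properized score is the squared error (y - mu_P)^2.

mu_P, var_P = 1.5, 4.0   # moments of an arbitrary predictive distribution P

def expected_pmcc(m, v):
    # E_P[(Y - m)^2] = var_P + (mu_P - m)^2, plus the quoted variance v
    return var_P + (mu_P - m) ** 2 + v

candidates = [(m / 10, v / 10) for m in range(-50, 51) for v in range(0, 51)]
best = min(candidates, key=lambda a: expected_pmcc(*a))
print(best)  # (1.5, 0.0)
```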

The original scoring rules of Examples 3 and 4 can be interpreted as loss functions in the Bayes act setting of Grünwald and Dawid (2004) and Dawid (2007), where the action space consists of mean–variance pairs. Hence, the properization method can also be interpreted as an application of Theorem 3 of Gneiting (2011) on consistent scoring functions for elicitable two-dimensional functionals, as discussed by Fissler and Ziegel (2016).

Detailed arguments and calculations for the subsequent examples are deferred to the Appendix.

Example 5.

For $\alpha > 0$ consider the scoring rule

$$\mathrm{S}_\alpha(P, y) = \int_{\mathbb{R}} \left| F(x) - \mathbf{1}\{y \le x\} \right|^{\alpha} \mathrm{d}x,$$

where $P$ is identified with its cumulative distribution function (CDF) $F$. For $\alpha = 2$ this is the well known proper continuous ranked probability score (CRPS), as reviewed in Section 4.2 of Gneiting and Raftery (2007). For $\alpha = 1$ the score was proposed by Müller et al. (2005), and Zamo and Naveau (2018) show in their Appendix A that for discrete distributions every Dirac measure in a median of $P$ is a Bayes act. The same holds true for general distributions and for all $\alpha \in (0, 1]$. If $\alpha > 1$, the Bayes act $P^*$ under $\mathrm{S}_\alpha$ is given by the CDF

$$F^*(x) = \frac{F(x)^{1/(\alpha - 1)}}{F(x)^{1/(\alpha - 1)} + (1 - F(x))^{1/(\alpha - 1)}}, \tag{4}$$

and all in all we see that properization of $\mathrm{S}_\alpha$ works for any $\alpha > 0$.

Moreover, in the case $\alpha > 1$ the mapping $F \mapsto F^*$ is even injective. Consequently, if the class $\mathcal{P}$ is such that $P^* \in \mathcal{P}$ and $\mathrm{S}_\alpha(P^*, P)$ is finite for $P \in \mathcal{P}$, the properized score (3) is even strictly proper relative to $\mathcal{P}$. If $\alpha \in (1, 2]$, this can be ensured by restricting attention to the class $\mathcal{P}_1$. For $\alpha > 2$ the class of the Borel measures with compact support is a suitable choice.
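The pointwise minimization underlying Example 5 can be verified numerically. The sketch below assumes that the inner integrand takes the form $g(z) = F(1 - z)^{\alpha} + (1 - F) z^{\alpha}$ for $\alpha > 1$ and checks a closed-form minimizer against a grid search; the formula coded as `z_star` is derived from this integrand here, not quoted from the original analysis.

```python
# Numerical check of the pointwise Bayes-act formula for alpha > 1:
# g(z) = F * (1 - z)**a + (1 - F) * z**a is minimized at
# z* = F**b / (F**b + (1 - F)**b) with b = 1 / (a - 1).

def g(z, F, a):
    return F * (1 - z) ** a + (1 - F) * z ** a

def z_star(F, a):
    b = 1 / (a - 1)
    return F ** b / (F ** b + (1 - F) ** b)

grid = [i / 10000 for i in range(10001)]
for a in (1.5, 2.0, 3.0):          # arbitrary exponents alpha > 1
    for F in (0.1, 0.4, 0.7):      # arbitrary values of the CDF F(x)
        zg = min(grid, key=lambda z: g(z, F, a))
        assert abs(zg - z_star(F, a)) < 1e-3
print("pointwise minimizers match")
```

For $\alpha = 2$ the formula reduces to $z^* = F$, in line with the propriety of the CRPS.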

Example 6.

Friederichs and Thorarinsdottir (2012, p. 58) propose a modification of the CRPS that aims to account for observational error in forecast evaluation. Specifically, they consider the scoring rule

$$\mathrm{S}(P, y) = \int \mathrm{CRPS}(P, y + e) \, \mathrm{d}\nu(e),$$

where the probability measure $\nu$ represents additive observation error. This scoring rule fails to be proper, as for probability measures $P, Q \in \mathcal{P}_1$ we have

$$\mathrm{S}(Q, P) = \mathrm{CRPS}(Q, P * \nu), \tag{5}$$

where $*$ denotes the convolution operator. Due to the strict propriety of the CRPS relative to the class $\mathcal{P}_1$, the unique Bayes act under $\mathrm{S}$ is given by $P^* = P * \nu$. Theorem 1 now gives the scoring rule $\mathrm{S}^*(P, y) = \mathrm{S}(P * \nu, y)$, which is proper relative to $\mathcal{P}_1$.

In order to account for noisy observational data in forecast evaluation, equation (5) suggests using the scoring rule $\mathrm{CRPS}(P * \nu, y)$ if the noise is independent, additive, and has distribution $\nu$. This corresponds to predicting hypothetical true values, to which noise is added before they are compared to observations. The drawbacks of this approach and alternative techniques are discussed by Ferro (2017). The associated issues in forecast evaluation remain challenges to the scientific community at large; see, e.g., Ebert et al. (2013) and Ferro (2017).

Example 7.

Let $\mathrm{S}_0$ be a scoring rule, and let $\nu$ be a probability distribution with Lebesgue density $f_\nu$. Suppose $\mathcal{F}$ is a class of distributions such that $P * \nu \in \mathcal{F}$ for $P \in \mathcal{F}$. For $P \in \mathcal{F}$ define

$$\mathrm{S}(P, y) = \int \mathrm{S}_0(P, y + e) \, \mathrm{d}\nu(e),$$

which is again a scoring rule. If $\mathrm{S}_0$ is proper, a Bayes act under $\mathrm{S}$ is given by $P^* = P * \nu$, since $\mathrm{S}(Q, P) = \mathrm{S}_0(Q, P * \nu)$ for $P, Q \in \mathcal{F}$, and if $\mathrm{S}_0$ is strictly proper, the Bayes act is unique. Properization now gives the proper scoring rule $\mathrm{S}^*(P, y) = \mathrm{S}(P * \nu, y)$. An interesting special case emerges when substituting the CRPS for $\mathrm{S}_0$, which leads to

$$\mathrm{S}(P, y) = \int \mathrm{CRPS}(P, y + e) \, \mathrm{d}\nu(e), \tag{6}$$

the scoring rule of the previous example. For another special case, let $\mathrm{S}_0$ be the linear score and let $\nu$ be uniform on an interval centered at zero, to yield, up to equivalence, the probability score of Wilson et al. (1999). We have that $\mathrm{PS}(P, y) = \mathrm{LS}(P * \nu, y)$ up to equivalence, where $\mathrm{LS}$ is the improper linear score and $\nu$ has a uniform density on an interval centered at zero. Properization is not feasible relative to sufficiently rich classes $\mathcal{F}$, as Bayes acts fail to exist under both the linear score and the probability score. For details, see the Appendix.

3 Existence of Bayes acts

In Example 7 we presented a scoring rule that cannot be properized, due to the non-existence of Bayes acts. This section addresses the question under which conditions on the scoring rule $\mathrm{S}$ and the class $\mathcal{P}$ a minimum of the expected score function $Q \mapsto \mathrm{S}(Q, P)$ exists. To illustrate the ideas, we start with a further example.

Example 8.

Using the notation of Example 3, consider the normalized squared error,

$$\mathrm{S}(P, y) = \frac{(y - \mu_P)^2}{\sigma_P^2},$$

as a scoring rule on the class of the Borel measures with positive variance at most $c$, and on the class of the Borel measures with positive variance, respectively. Relative to the former class, any Bayes act under $P$ has mean $\mu_P$ and variance $c$, so properization yields the (non-normalized) squared error up to equivalence. Relative to the latter class, however, there is no Bayes act, since increasing the variance of the quoted distribution always leads to a smaller expected score.
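The dichotomy in Example 8 is easy to reproduce numerically. In the sketch below (moments, cap, and grids are arbitrary choices of ours), the expected normalized squared error attains its minimum once the quoted variance is capped, while without a cap it decreases indefinitely in the quoted variance.

```python
# Sketch of Example 8: under the normalized squared error
# S((m, s2), y) = (y - m)**2 / s2, the expected score is
#   E_P[S] = (sigma2_P + (mu_P - m)**2) / s2,
# minimized over m at the predictive mean, but strictly decreasing in
# the quoted variance s2 -- so no Bayes act exists unless s2 is bounded.

mu_P, sigma2_P = 0.0, 1.0   # arbitrary moments of the true distribution P

def expected(m, s2):
    return (sigma2_P + (mu_P - m) ** 2) / s2

# With the quoted variance capped at c, the minimum is attained at (mu_P, c):
c = 4.0
grid = [(m / 10, s2 / 10) for m in range(-30, 31) for s2 in range(1, 41)]
best = min(grid, key=lambda a: expected(*a))
print(best)  # (0.0, 4.0)

# Without a cap, the infimum 0 is approached but never attained:
print(expected(0.0, 1e6))  # keeps shrinking as s2 grows
```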

We now turn to a general perspective and discuss sufficient conditions for the existence of Bayes acts. At first, consider a finite probability space, so that $\Omega = \{\omega_1, \dots, \omega_n\}$. In this situation, geometrical arguments yield sufficient conditions. In particular, a Bayes act under $P$ exists if the risk set

$$\{ (\mathrm{L}(a, \omega_1), \dots, \mathrm{L}(a, \omega_n)) : a \in \mathbb{A} \} \subseteq \mathbb{R}^n$$

is closed from below and bounded from below; see Theorem 1 in Chapter 2.5 of Ferguson (1967). Extending this result to a general sample space is non-trivial, since in this case the risk set can be a subset of an infinite-dimensional vector space. In the following we employ well-known concepts of functional analysis in order to discuss two possible extensions. All proofs are deferred to the Appendix.

Let $\mathcal{P}$ be a set of probability measures on a general probability space and let $\mathbb{A}$ be a topological space. We return to the setting of Section 1 and consider functions $\mathrm{L} : \mathbb{A} \times \Omega \to \overline{\mathbb{R}}$. This makes the results more general and easier to apply in situations where the scoring rule depends on the forecast only via some finite number of parameters. Concerning the latter point, note that the normalized squared error of Example 8 can be written as a composition of the mapping $P \mapsto (\mu_P, \sigma_P^2)$ and the function $\mathrm{L}((m, s^2), y) = (y - m)^2 / s^2$, with $\mathrm{L}$ being defined on $\mathbb{A} \times \mathbb{R}$, where $\mathbb{A} = \mathbb{R} \times (0, \infty)$. Consequently, the expected normalized squared error attains its minimum if the expected loss $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ attains its minimum. Note that such a decomposition of the scoring rule is possible for Examples 3 and 4 as well, as alluded to in the comments that succeed these examples.

We impose the following integrability assumption on .

Definition 1.

The mapping $\mathrm{L} : \mathbb{A} \times \Omega \to \overline{\mathbb{R}}$ is uniformly bounded from below if there exists a function $h : \Omega \to \mathbb{R}$ which is integrable with respect to any $P \in \mathcal{P}$ and such that $\mathrm{L}(a, \omega) \ge h(\omega)$ holds for all $a \in \mathbb{A}$ and $\omega \in \Omega$.

Our first result is similar to Theorem 2 in Chapter 2.9 of Ferguson (1967), which proves the existence of minimax decision rules.

Theorem 2.

Suppose that $\mathrm{L}$ is lower semicontinuous in its first component and uniformly bounded from below. If $\mathbb{A}$ is compact, then the function $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ attains its minimum for any $P \in \mathcal{P}$.

This theorem can be used to prove the existence of a Bayes act for a given scoring rule. However, it is not applicable to Example 8. To see this, recall the decomposition of the normalized squared error mentioned above and note that restricting the class of forecasts to measures with variance at most $c$ corresponds to restricting $\mathbb{A}$ to $\mathbb{R} \times (0, c]$. The latter set is not a compact space, and neither is its closure. Consequently, we aim to dispense with the compactness assumption used in Theorem 2.

To do so, we need additional concepts from functional analysis. Let $X$ be a real normed vector space. Recall that a function $f : M \to \mathbb{R}$ defined on a subset $M \subseteq X$ is called coercive if for any sequence $(x_n)_{n \in \mathbb{N}} \subseteq M$ the implication

$$\|x_n\| \to \infty \implies f(x_n) \to \infty$$

holds true; see, e.g., Definition III.5.7 in Werner (2018). By the weak topology on $X$, we mean the weakest topology such that all continuous real-valued linear mappings on $X$ remain continuous; see, e.g., Chapters 2.13 and 6.5 in Aliprantis and Border (2006). The space $X$ is called a reflexive Banach space if it is complete and the canonical embedding of $X$ into its bidual space is surjective; see, e.g., Chapter III.3 in Werner (2018) or Chapter 6.3 in Aliprantis and Border (2006). Combining these concepts, we obtain a complement to Theorem 2.

Theorem 3.

Let $\mathbb{A}$ be a weakly closed subset of a reflexive Banach space. Moreover, suppose that $\mathrm{L}$ is weakly lower semicontinuous in its first component and uniformly bounded from below. If the function $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ is coercive, then it attains its minimum.

This result yields the existence of Bayes acts as long as the integrated scoring rule $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ is coercive for any $P \in \mathcal{P}$, where $\mathbb{A}$ is a weakly closed subset of a reflexive Banach space. To conclude this section, we connect Theorem 3 to Example 8: The function $\mathrm{L}$ from the decomposition of the normalized squared error mentioned above is defined on $\mathbb{R} \times (0, \infty)$, which is a subset of the reflexive Banach space $\mathbb{R}^2$. Moreover, $\mathrm{L}$ is bounded from below by zero and continuous in its first component. As mentioned above, restricting the class of forecasts to measures with variance at most $c$ corresponds to restricting the domain of $\mathrm{L}$ to $\mathbb{R} \times (0, c]$, and in this situation, integrating $\mathrm{L}$ with respect to $P$ gives a coercive function. Consequently, Theorem 3 can be used to show that the normalized squared error can be properized if restricted to this class.

4 Discussion

In this article we have introduced the concept of properization, which is rooted in the Bayes act construction of Grünwald and Dawid (2004) and Dawid (2007), and we have drawn attention to its widespread implicit use in the transdisciplinary literature on proper scoring rules, where our unified approach yields simplified, shorter, and considerably more instructive and transparent proofs than extant methods. Moreover, using new examples, we have demonstrated the power of the properization approach in the creation of new proper scoring rules from existing improper ones.

Since the central element in the construction of a properized score is a Bayes act, we have discussed conditions on the scoring rule and the class that guarantee its existence. Undoubtedly, there are alternative paths to existence results in the spirit of Theorems 2 and 3, and the derivation of sufficient conditions in alternative situations is an interesting open problem. Furthermore, we have not explored necessary conditions for the existence of Bayes acts in this work. Their derivation and the refinement of sufficient conditions on and remain challenges that we leave for future work.

Appendix: Proofs

Here we present detailed arguments for the technical claims in Examples 5, 6, and 7 as well as the proofs of Theorems 2 and 3.

Details for Example 5

We fix a distribution $P \in \mathcal{P}$ with CDF $F$ and start with the case $\alpha > 1$. An application of Fubini's theorem gives

$$\mathrm{S}_\alpha(G, P) = \int_{\mathbb{R}} \left( F(x) (1 - G(x))^{\alpha} + (1 - F(x)) G(x)^{\alpha} \right) \mathrm{d}x. \tag{7}$$

Given $x \in \mathbb{R}$, we seek the value $z = G(x)$ that minimizes the integrand in (7). If $x$ is such that $F(x) \in \{0, 1\}$, then $z = F(x)$ is the unique minimizer, and for a minimizing CDF $G$ the equality $G(x) = F(x)$ has to hold for Lebesgue-almost all such $x$. If $x$ satisfies $F(x) \in (0, 1)$, define the function

$$g(z) = F(x) (1 - z)^{\alpha} + (1 - F(x)) z^{\alpha},$$

which is strictly convex in $z \in [0, 1]$ with derivative

$$g'(z) = -\alpha F(x) (1 - z)^{\alpha - 1} + \alpha (1 - F(x)) z^{\alpha - 1}$$

and a unique minimum at the value given by the right-hand side of (4). As a consequence, the minimizing value is $F^*(x)$ as defined in (4). The function $x \mapsto F^*(x)$ defined by the pointwise minimizers is a minimizer of (7), and if $\mathrm{S}_\alpha(F^*, P)$ is finite, it is unique Lebesgue-almost surely. Since the function $F^*$ has the properties of a distribution function, the measure $P^*$ defined by (4) is a Bayes act for $P$. Moreover, equation (4) shows that the relation between $F$ and $F^*$ is one-to-one.

It remains to be checked under which conditions the properization of $\mathrm{S}_\alpha$ is not only proper but strictly proper. The representation (4) along with two Taylor expansions implies that $F^*$ behaves like $F^{1/(\alpha - 1)}$ in the tails. This has two consequences. At first, the above arguments show that for $\mathrm{S}_\alpha(F^*, P)$ to be finite, the pointwise minimum of $g$ has to be integrable with respect to Lebesgue measure. Hence, the tail behavior of $F^*$ and the inequality $|u|^{\alpha} \le |u|$ for $u \in [-1, 1]$ and $\alpha > 1$ show that $\mathrm{S}_\alpha(F^*, P)$ is finite for $P \in \mathcal{P}_1$. Second, $F^*$ has a lighter tail than $F$ for $\alpha < 2$ and a heavier tail for $\alpha > 2$. In the latter case $P \in \mathcal{P}$ does not necessarily imply $P^* \in \mathcal{P}$. Hence, without additional assumptions, strict propriety of the properized score (3) can only be ensured relative to $\mathcal{P}_1$ for $\alpha \in (1, 2]$ and relative to the class of the Borel measures with compact support for $\alpha > 2$.

We now turn to $\alpha < 1$. In this case, the function $g$ is strictly concave, and its unique minimum is at $z = 0$ for $F(x) < 1/2$ and at $z = 1$ for $F(x) > 1/2$. If $F(x) = 1/2$, then both $z = 0$ and $z = 1$ are minima. Arguing as above, every Bayes act is a Dirac measure in a median of $P$.

Finally, $\alpha = 1$ implies that $g$ is linear; thus, as for $\alpha < 1$, every Dirac measure in a median of $P$ is a Bayes act. The only difference to the case $\alpha < 1$ is that if there is more than one median, there are Bayes acts other than Dirac measures, since $g$ is constant for $x$ satisfying $F(x) = 1/2$.
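The role of the median for $\alpha \le 1$ can be checked numerically as well. The sketch below takes $P$ uniform on $[0, 1]$ with $\alpha = 1/2$ (an arbitrary test case of ours), discretizes the expected score, and confirms that among Dirac forecasts the median is optimal and that truth telling incurs a larger expected score.

```python
# Numerical check for alpha < 1: the expected score is minimized by a
# point mass at the median, and truth telling loses. Here P is uniform
# on [0, 1], so F(x) = x and the median is 1/2 (arbitrary test case).

a = 0.5
xs = [i / 1000 for i in range(1001)]
dx = 1 / 1000

def expected(G):
    """Riemann-sum approximation of E_P[S_a(G, .)] for a CDF G on [0, 1]."""
    return sum((x * (1 - G(x)) ** a + (1 - x) * G(x) ** a) * dx for x in xs)

def dirac(t):
    """CDF of the point mass at t."""
    return lambda x: 1.0 if x >= t else 0.0

ts = [i / 100 for i in range(101)]
best_t = min(ts, key=lambda t: expected(dirac(t)))
print(best_t)                                        # 0.5, the median
print(expected(dirac(0.5)) < expected(lambda x: x))  # True: truth loses
```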

Details for Example 6

Let $F$ and $G$ be distribution functions and let $\nu$ be the error distribution. By the definition of the convolution operator,

$$(F * \nu)(x) = \int F(x - e) \, \mathrm{d}\nu(e)$$

holds for $x \in \mathbb{R}$. Using this identity and Fubini's theorem leads to

$$\mathrm{S}(G, P) = \int \int \mathrm{CRPS}(G, y + e) \, \mathrm{d}\nu(e) \, \mathrm{d}P(y) = \int \mathrm{CRPS}(G, z) \, \mathrm{d}(P * \nu)(z) = \mathrm{CRPS}(G, P * \nu),$$

which verifies equality in (5). Moreover, the strict propriety of the CRPS relative to the class $\mathcal{P}_1$ gives $\mathrm{CRPS}(P * \nu, P * \nu) < \mathrm{CRPS}(G, P * \nu)$ for $G \ne P * \nu$, thereby demonstrating that the Bayes act is unique in this situation.

Details for Example 7

For distributions $P, Q \in \mathcal{F}$, the Fubini–Tonelli theorem and the definition of the convolution operator give

$$\mathrm{S}(Q, P) = \int \int \mathrm{S}_0(Q, y + e) \, \mathrm{d}\nu(e) \, \mathrm{d}P(y) = \mathrm{S}_0(Q, P * \nu),$$

so the stated (unique) Bayes act under $\mathrm{S}$ follows from the (strict) propriety of $\mathrm{S}_0$. Proceeding as in the details for Example 6, we verify identity (6).

For the probability score, the same calculations as above show that $\mathrm{PS}(P, y) = \mathrm{LS}(P * \nu, y)$ up to equivalence, where $\mathrm{LS}$ is the linear score and $\nu$ is the uniform error distribution. Consequently, to demonstrate that Theorem 1 is neither applicable to $\mathrm{LS}$ nor to $\mathrm{PS}$, it suffices to show that there is a distribution $P$ such that $Q \mapsto \mathrm{LS}(Q, P)$ does not have a minimizer. We use an argument that generalizes the construction in Section 4.1 of Gneiting and Raftery (2007), who show that $\mathrm{LS}$ is improper. Let $f$ be a density, symmetric around zero and strictly increasing on $(-\infty, 0]$, and let $P$ be the corresponding measure. Define the interval $I_k = [k - 1/2, k + 1/2)$ for $k \in \mathbb{Z}$, and suppose $g$ is a density with positive mass on some interval $I_k$ for $k \ne 0$. Due to the properties of $f$, the expected score can be reduced by substituting for $g$ the density $\tilde{g}$ obtained by shifting all probability mass from $I_k$ to the modal interval $I_0$. Repeating this argument for any $k \ne 0$ shows that no density can be a minimizer of the expected score $Q \mapsto \mathrm{LS}(Q, P)$. Note that the assumptions on $f$ are stronger than necessary in order to facilitate the argument. They can be relaxed at the cost of a more elaborate proof.

Proof of Theorem 2

Let $(a_n)_{n \in \mathbb{N}} \subseteq \mathbb{A}$ be a sequence with $a_n \to a$. Since $\mathrm{L}$ is lower semicontinuous in its first component and uniformly bounded from below by $h$, Fatou's lemma gives

$$\liminf_{n \to \infty} \mathbb{E}_P[\mathrm{L}(a_n, \cdot)] \ge \mathbb{E}_P\Big[\liminf_{n \to \infty} \mathrm{L}(a_n, \cdot)\Big] \ge \mathbb{E}_P[\mathrm{L}(a, \cdot)]$$

for any $P \in \mathcal{P}$. Hence, $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ is a lower semicontinuous function for any $P \in \mathcal{P}$, and due to the assumed compactness of $\mathbb{A}$, the result now follows from Theorem 2.43 in Aliprantis and Border (2006).

Proof of Theorem 3

The same arguments as in the proof of Theorem 2 show that $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ is a weakly lower semicontinuous function for any $P \in \mathcal{P}$. If $P$ is such that this function is also coercive, then proceeding as in the proof of Satz III.5.8 in Werner (2018) gives a weakly convergent sequence $(a_n)_{n \in \mathbb{N}} \subseteq \mathbb{A}$ along which the expected loss tends to its infimum. Since $\mathbb{A}$ is weakly closed by assumption, it contains the weak limit of the sequence, and hence weak lower semicontinuity implies that $a \mapsto \mathbb{E}_P[\mathrm{L}(a, \cdot)]$ attains its minimum at this limit.


Acknowledgements

Tilmann Gneiting is grateful for funding by the Klaus Tschira Foundation and by the European Union Seventh Framework Programme under grant agreement 290976. Part of his research leading to these results has been done within subproject C7 "Statistical postprocessing and stochastic physics for ensemble predictions" of the Transregional Collaborative Research Center SFB/TRR 165 "Waves to Weather" funded by the German Research Foundation (DFG). Jonas Brehmer gratefully acknowledges support by the DFG through Research Training Group RTG 1953. We thank Tobias Fissler and Matthew Parry for instructive discussions.


References

  • Aliprantis and Border (2006) Aliprantis, C. D. and Border, K. C. (2006). Infinite Dimensional Analysis. Springer, Berlin, third edition.
  • Christensen et al. (2014) Christensen, H. M., Moroz, I. M., and Palmer, T. N. (2014). Evaluation of ensemble forecast uncertainty using a new proper score: Application to medium-range and seasonal forecasts. Quarterly Journal of the Royal Meteorological Society, 141:538–549.
  • Dawid (1986) Dawid, A. P. (1986). Probability forecasting. In Kotz, S., Johnson, N. L., and Read, C. B., editors, Encyclopedia of Statistical Sciences, volume 7, pages 210–218. John Wiley & Sons, Inc., New York.
  • Dawid (2007) Dawid, A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59:77–93.
  • Dawid and Musio (2014) Dawid, A. P. and Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72:169–183.
  • Diks et al. (2011) Diks, C., Panchenko, V., and van Dijk, D. (2011). Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163:215–230.
  • Ebert et al. (2013) Ebert, E., Wilson, L., Weigel, A., Mittermaier, M., Nurmi, P., Gill, P., Göber, M., Joslyn, S., Brown, B., Fowler, T., and Watkins, A. (2013). Progress and challenges in forecast verification. Meteorological Applications, 20:130–139.
  • Ehm et al. (2016) Ehm, W., Gneiting, T., Jordan, A., and Krüger, F. (2016). Of quantiles and expectiles: Consistent scoring functions, Choquet representations and forecast rankings. Journal of the Royal Statistical Society. Series B. Statistical Methodology, 78:505–562.
  • Ferguson (1967) Ferguson, T. S. (1967). Mathematical Statistics: A Decision Theoretic Approach. Probability and Mathematical Statistics, Vol. 1. Academic Press, New York-London.
  • Ferri et al. (2009) Ferri, C., Hernández-Orallo, J., and Modroiu, R. (2009). An experimental comparison of performance measures for classification. Pattern Recognition Letters, 30:27–38.
  • Ferro (2017) Ferro, C. A. T. (2017). Measuring forecast performance in the presence of observation error. Quarterly Journal of the Royal Meteorological Society, 143:2665–2676.
  • Fissler and Ziegel (2016) Fissler, T. and Ziegel, J. F. (2016). Higher order elicitability and Osband’s principle. The Annals of Statistics, 44:1680–1707.
  • Friederichs and Thorarinsdottir (2012) Friederichs, P. and Thorarinsdottir, T. L. (2012). Forecast verification for extreme value distributions with an application to probabilistic peak wind prediction. Environmetrics, 23:579–594.
  • Gelfand and Ghosh (1998) Gelfand, A. E. and Ghosh, S. K. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika, 85:1–11.
  • Gneiting (2011) Gneiting, T. (2011). Making and evaluating point forecasts. Journal of the American Statistical Association, 106:746–762.
  • Gneiting and Katzfuss (2014) Gneiting, T. and Katzfuss, M. (2014). Probabilistic forecasting. Annual Review of Statistics and Its Application, 1:125–151.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102:359–378.
  • Gneiting and Ranjan (2011) Gneiting, T. and Ranjan, R. (2011). Comparing density forecasts using threshold- and quantile-weighted scoring rules. Journal of Business & Economic Statistics, 29:411–422.
  • Granger and Machina (2006) Granger, C. W. and Machina, M. J. (2006). Forecasting and Decision Theory. In Elliott, G., Granger, C., and Timmermann, A., editors, Handbook of Economic Forecasting, volume 1, pages 81–98. Elsevier.
  • Granger and Pesaran (2000) Granger, C. W. J. and Pesaran, M. H. (2000). Economic and statistical measures of forecast accuracy. Journal of Forecasting, 19:537–560.
  • Grünwald and Dawid (2004) Grünwald, P. D. and Dawid, A. P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. The Annals of Statistics, 32:1367–1433.
  • Harrell (2015) Harrell, Jr., F. E. (2015). Regression Modeling Strategies. Springer Series in Statistics. Springer International Publishing, 2nd edition.
  • Holzmann and Klar (2017) Holzmann, H. and Klar, B. (2017). Focusing on regions of interest in forecast evaluation. The Annals of Applied Statistics, 11:2404–2431.
  • Laud and Ibrahim (1995) Laud, P. W. and Ibrahim, J. G. (1995). Predictive model selection. Journal of the Royal Statistical Society. Series B. Methodological, 57:247–262.
  • Müller et al. (2005) Müller, W. A., Appenzeller, C., Doblas-Reyes, F. J., and Liniger, M. A. (2005). A debiased ranked probability skill score to evaluate probabilistic ensemble forecasts with small ensemble sizes. Journal of Climate, 18:1513–1523.
  • Parry (2016) Parry, M. (2016). Linear scoring rules for probabilistic binary classification. Electronic Journal of Statistics, 10:1596–1607.
  • Werner (2018) Werner, D. (2018). Funktionalanalysis. Springer, Berlin, eighth edition.
  • Wilson et al. (1999) Wilson, L. J., Burrows, W. R., and Lanzinger, A. (1999). A strategy for verification of weather element forecasts from an ensemble prediction system. Monthly Weather Review, 127:956–970.
  • Zamo and Naveau (2018) Zamo, M. and Naveau, P. (2018). Estimation of the continuous ranked probability score with limited information and applications to ensemble weather forecasts. Mathematical Geosciences, 50:209–234.