 # Probabilistic coherence and proper scoring rules

We provide a self-contained proof of a theorem relating probabilistic coherence of forecasts to their non-domination by rival forecasts with respect to any proper scoring rule. The theorem appears to be new but is closely related to results achieved by other investigators.

## 1 Introduction

Scoring rules measure the quality of a probability-estimate for a given event, with lower scores signifying probabilities that are closer to the event’s status ($1$ if it occurs, $0$ otherwise). The sum of the scores for estimates $\mathbf{f}$ of a vector $\mathcal{E}$ of events is called the “penalty” for $\mathbf{f}$. Consider two potential defects in $\mathbf{f}$.

• There may be rival estimates $\mathbf{g}$ for $\mathcal{E}$ whose penalty is guaranteed to be lower than the one for $\mathbf{f}$, regardless of which events come to pass.

• The events in $\mathcal{E}$ may be related by inclusion or partition, and $\mathbf{f}$ might violate constraints imposed by the probability calculus (for example, that the estimate for an event not exceed the estimate for any event that includes it).

Building on the work of earlier investigators (see below), we show that for a broad class of scoring rules known as “proper” the two defects are equivalent. An exact statement appears as Theorem 1. To reach it, we first explain key concepts intuitively (the next section), then formally (Section 3). The proof of the theorem proceeds via three propositions of independent interest (Section 4). We conclude with generalizations of our results and an open question.

## 2 Intuitive account of concepts

Imagine that you attribute probabilities and to events and , respectively, where . It subsequently turns out that comes to pass but not . How shall we assess the perspicacity of your two estimates, which may jointly be called a probabilistic forecast? According to one method (due to Brier, 1950) truth and falsity are coded by and , and your estimate of the chance of is assigned a score of since did not come true (so your estimate should ideally have been zero). Your estimate for is likewise assigned since it should have been one. The sum of these numbers serves as overall penalty.

Let us calculate your expected penalty for (prior to discovering the facts). With probability you expected a score of , and with the remaining probability you expected a score of , hence your overall expectation was . Now suppose that you attempted to improve (lower) this expectation by insincerely announcing as the chance of , even though your real estimate is . Then your expected penalty would be , worse than before. Differential calculus reveals the general fact:

Suppose your probability for an event $E$ is $p$, that your announced probability is $x$, and that your penalty is assessed according to the rule: $(1-x)^2$ if $E$ comes out true; $x^2$ otherwise. Then your expected penalty is uniquely minimized by choosing $x=p$.

Our scoring rule thus encourages sincerity since your interest lies in announcing probabilities that conform to your beliefs. Rules like this are called proper. (We add a continuity condition in our formal treatment, below.) For an example of an improper rule, substitute absolute deviation for squared deviation in the original scheme. According to the new rule, your expected penalty for is whereas it drops to if you fib as before.
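The sincerity property can be checked numerically. The sketch below is only an illustration: the belief `p = 0.3` and the helper names are assumptions of this example, not values or notation from the text. It scans announcements on a grid and reports which one minimizes the expected penalty under the quadratic rule and under the absolute-deviation rule.

```python
def brier(outcome, x):
    """Quadratic (Brier) score: squared distance between announcement and outcome."""
    return (outcome - x) ** 2


def absolute(outcome, x):
    """Absolute-deviation score, the text's example of an improper rule."""
    return abs(outcome - x)


def expected_penalty(rule, p, x):
    """Expected score when the genuine belief is p and the announcement is x."""
    return p * rule(1, x) + (1 - p) * rule(0, x)


def best_announcement(rule, p, steps=1000):
    """Grid search for the announcement that minimizes the expected penalty."""
    grid = [k / steps for k in range(steps + 1)]
    return min(grid, key=lambda x: expected_penalty(rule, p, x))


p = 0.3  # illustrative belief (an assumption of this sketch)
print("quadratic rule, best announcement:", best_announcement(brier, p))
print("absolute rule, best announcement:", best_announcement(absolute, p))
```

Under the quadratic rule the minimizer coincides with the belief itself, whereas under absolute deviation the expected penalty is linear in the announcement, so the minimizer is pushed to an endpoint of the unit interval.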

Consider next the rival forecast of for and for . Because , this forecast is inconsistent with the probability calculus (or incoherent). Table 1 shows that the original forecast dominates the rival inasmuch as its penalty is lower however the facts play out. This association of incoherence and domination is not an accident. No matter what proper scoring rule is in force, any incoherent forecast can be replaced by a coherent one whose penalty is lower in every possible circumstance; there is no such replacement for a coherent forecast. This fact is formulated as Theorem 1 in the next section. It can be seen as partial vindication of probability as an expression of chance. (The other classic vindication involves sure-loss contracts; see Skyrms (2000).)

These ideas have been discussed before, first by de Finetti (1974), who began the investigation of dominated forecasts and probabilistic consistency (called coherence). His work relied on the quadratic scoring rule, introduced above. (For analysis of de Finetti’s work, see Joyce, 1998. Note that some authors use the term inadmissible to qualify dominated forecasts.) Lindley (1982) generalized de Finetti’s theorem to a broad class of scoring rules. Specifically, he proved that for every sufficiently regular generalization $s$ of the quadratic score, there is a transformation $T$ such that a forecast $\mathbf{f}$ is not dominated by any other forecast with respect to $s$ if and only if the transformation of $\mathbf{f}$ by $T$ is probabilistically coherent. The reliance on the transformation $T$, however, clouds the interpretation of Lindley’s theorem.

Fresh insight into proper scoring rules comes from relating them to a generalization of metric distance known as Bregman divergence (Bregman, 1967). This relationship was studied by Savage (1971), albeit implicitly, and more recently by Banerjee et al. (2005) and Gneiting and Raftery (2007). So far as we know, their results have yet to be connected to the issue of dominance.

To pull together the threads of earlier discussions, the present work offers a self-contained account of the relations among (i) coherent forecasts, (ii) Bregman divergences, and (iii) domination with respect to proper scoring rules. Only elementary analysis is presupposed. We begin by formalizing the concepts introduced above. (For application of scoring rules to the assessment of opinion, see Gneiting and Raftery (2007) along with Bernardo and Smith (1994, §2.7.2) and references cited there.)

## 3 Framework and Main Result

Let $\Omega$ be a nonempty sample space. Subsets of $\Omega$ are called events. Let $\mathcal{E}=(E_1,\dots,E_n)$ be a vector of $n$ events over $\Omega$. We assume that $\Omega$ and $\mathcal{E}$ have been chosen and are now fixed for the remainder of the discussion. We require $\mathcal{E}$ to have finite dimension $n$ but otherwise our results hold for any choice of sample space and events. In particular, $\Omega$ can be infinite. We rely on the usual notation $[0,1]$, $(0,1)$, $\{0,1\}$ to denote, respectively, the closed interval $\{x : 0\le x\le 1\}$, the open interval $\{x : 0<x<1\}$, and the two-point set containing $0$ and $1$.

###### Definition 1.

Any element $\mathbf{f}=(f_1,\dots,f_n)$ of $[0,1]^n$ is called a (probability) forecast (for $\mathcal{E}$). A forecast $\mathbf{f}$ is coherent just in case there is a probability measure $\mu$ over $\Omega$ such that for all $i\le n$, $f_i=\mu(E_i)$.

A forecast is thus a list of $n$ numbers drawn from the unit interval. They are interpreted as claims about the chances of the corresponding events in $\mathcal{E}$. The first event in $\mathcal{E}$ is assigned the probability given by the first number ($f_1$) in $\mathbf{f}$, and so forth. A forecast is coherent if it is consistent with some probability measure over $\Omega$.

This brings us to scoring rules. In what follows, the numbers $0$ and $1$ are used to represent falsity and truth, respectively.

###### Definition 2.

A function $s:\{0,1\}\times[0,1]\to[0,\infty]$ is said to be a proper scoring rule in case

(a) $p\,s(1,x)+(1-p)\,s(0,x)$ is uniquely minimized at $x=p$ for all $p\in[0,1]$.

(b) $s$ is continuous, meaning that for $i\in\{0,1\}$, $\lim_{m\to\infty}s(i,x_m)=s(i,x)$ for any sequence $x_m\in[0,1]$ converging to $x$.

For Definition 2(a), think of $p$ as the probability you have in mind, and $x$ as the one you announce. Then $p\,s(1,x)+(1-p)\,s(0,x)$ is your expected score. Fixing $p$ (your genuine belief), the latter expression is a function of the announcement $x$. Proper scoring rules encourage candor by minimizing the expected score exactly when you announce $p$. (Some authors call such rules strictly proper.)

The continuity condition is consistent with $s$ assuming the value $+\infty$. This can only occur for the arguments $(0,1)$ or $(1,0)$, representing categorically mistaken judgment. For if $s(0,x)=+\infty$ for some $x<1$, then with $p=x$ the expected score is infinite at the announcement $x$, so it cannot have a unique minimum there; similarly for $s(1,x)$ with $x>0$. A typical example of an unbounded proper scoring rule is the logarithmic rule $s(i,x)=-\ln|1-i-x|$ (Good, 1952). A comparison of alternative rules is offered in Selten (1998).
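As a worked check, assume the unbounded rule in question is the logarithmic score, $s(1,x)=-\ln x$ and $s(0,x)=-\ln(1-x)$ (this explicit form is an assumption of the present note). For $p\in(0,1)$ the expected score and its derivative in the announcement $x$ are

$$
p\,s(1,x)+(1-p)\,s(0,x)=-p\ln x-(1-p)\ln(1-x),
\qquad
\frac{d}{dx}\bigl[-p\ln x-(1-p)\ln(1-x)\bigr]=-\frac{p}{x}+\frac{1-p}{1-x}=\frac{x-p}{x(1-x)}.
$$

The derivative is negative for $x<p$ and positive for $x>p$, so the expected score has its unique minimum at $x=p$. For $p=0$ the expected score reduces to $-\ln(1-x)$, minimized only at $x=0$, and symmetrically for $p=1$. The rule takes the value $+\infty$ only at $s(1,0)$ and $s(0,1)$, the two categorically mistaken announcements.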

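For an event $E$, we let $C_E:\Omega\to\{0,1\}$ be the characteristic function of $E$; that is, for all $\omega\in\Omega$, $C_E(\omega)=1$ if $\omega\in E$ and $C_E(\omega)=0$ otherwise. Intuitively, $C_E(\omega)$ reports whether $E$ is true or false if Nature chooses $\omega$.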

###### Definition 3.

Given proper scoring rule $s$, the penalty $P_s$ based on $s$ for forecast $\mathbf{f}$ and $\omega\in\Omega$ is given by:

$$P_s(\omega,\mathbf{f})=\sum_{i\le n}s(C_{E_i}(\omega),f_i). \tag{1}$$

Thus, $P_s$ sums the scores (conceived as penalties) for all the events under consideration. Henceforth, the proper scoring rule $s$ is regarded as given and fixed. The theorem below holds for any choice we make.

###### Definition 4.

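Let a forecast $\mathbf{f}$ be given.

(a) $\mathbf{f}$ is weakly dominated by a forecast $\mathbf{g}$ in case $P_s(\omega,\mathbf{g})\le P_s(\omega,\mathbf{f})$ for all $\omega\in\Omega$.

(b) $\mathbf{f}$ is strongly dominated by a forecast $\mathbf{g}$ in case $P_s(\omega,\mathbf{g})<P_s(\omega,\mathbf{f})$ for all $\omega\in\Omega$.

Strong domination by a rival, coherent forecast $\mathbf{g}$ is the price to be paid for an incoherent forecast $\mathbf{f}$. Indeed, we shall prove: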

###### Theorem 1.

Let a forecast $\mathbf{f}$ be given.

(a) If $\mathbf{f}$ is coherent then it is not weakly dominated by any forecast $\mathbf{g}\neq\mathbf{f}$.

(b) If $\mathbf{f}$ is incoherent then it is strongly dominated by some coherent forecast $\mathbf{g}$.

Thus, if $\mathbf{f}$ and $\mathbf{g}$ are coherent and $\mathbf{f}\neq\mathbf{g}$ then neither weakly dominates the other. The theorem follows from three propositions of independent interest, stated in the next section. We close the present section with a corollary.

###### Corollary 1.

A forecast $\mathbf{f}$ is weakly dominated by a forecast $\mathbf{g}\neq\mathbf{f}$ if and only if $\mathbf{f}$ is strongly dominated by a coherent forecast.

###### Proof of Corollary 1.

The right-to-left direction is immediate from Definition 4. For the left-to-right direction, suppose forecast $\mathbf{f}$ is weakly dominated by some $\mathbf{g}\neq\mathbf{f}$. Then by Theorem 1(a), $\mathbf{f}$ is not coherent. So by Theorem 1(b), $\mathbf{f}$ is strongly dominated by some coherent forecast. ∎
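Strong domination of an incoherent forecast can be illustrated with a small numerical sketch under the quadratic rule. The setting (two events with the first included in the second), the two forecasts, and the function names are illustrative assumptions made here, not examples taken from the text.

```python
def brier_penalty(outcomes, forecast):
    """Penalty under the quadratic rule: sum of squared deviations from the truth values."""
    return sum((c - f) ** 2 for c, f in zip(outcomes, forecast))


# Truth-value vectors (C_E, C_F) that can actually occur when E is a subset of F.
possible_worlds = [(0, 0), (0, 1), (1, 1)]

incoherent = (0.7, 0.4)  # illustrative: gives E a higher probability than F
rival = (0.55, 0.55)     # illustrative coherent rival

for world in possible_worlds:
    print(world, brier_penalty(world, incoherent), brier_penalty(world, rival))

# Strong domination: the rival's penalty is strictly lower in every possible world.
strongly_dominated = all(
    brier_penalty(w, rival) < brier_penalty(w, incoherent) for w in possible_worlds
)
print("strongly dominated:", strongly_dominated)
```

The world `(1, 0)` is excluded because it cannot occur when the first event is contained in the second; domination only needs to hold over the truth-value vectors that are actually possible.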

## 4 Three Propositions

The first proposition is a characterization of coherence. It is due to de Finetti (1974).

###### Definition 5.

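Let $V=\{\mathbf{v}_\omega : \omega\in\Omega\}\subseteq\{0,1\}^n$, where $\mathbf{v}_\omega=(C_{E_1}(\omega),\dots,C_{E_n}(\omega))$. Let the cardinality of $V$ be $k$. Let $\operatorname{conv}(V)$ be the convex hull of $V$, i.e., $\operatorname{conv}(V)$ consists of all vectors of form $a_1\mathbf{v}_1+\dots+a_k\mathbf{v}_k$, where $\mathbf{v}_j\in V$, $a_j\ge 0$, and $\sum_{j\le k}a_j=1$.

The $E_i$ may be related in various ways, so $k<2^n$ is possible (indeed, this is the case of interest).

###### Proposition 1.

A forecast $\mathbf{f}$ is coherent if and only if $\mathbf{f}$ lies in the convex hull of $V$.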
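One direction of this characterization can be traced on a small finite example: a forecast read off from a probability measure is a convex combination of the truth-value vectors of the sample points. The sample space, events, and measure below are hypothetical, chosen only for illustration, and exact rational arithmetic is used so that the comparison is exact.

```python
from fractions import Fraction

# Hypothetical finite sample space, events, and probability measure.
omega = ["a", "b", "c", "d"]
events = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 8), "d": Fraction(1, 8)}


def truth_vector(w):
    """Vector of characteristic-function values of the events at sample point w."""
    return tuple(1 if w in e else 0 for e in events)


# Coherent forecast induced by mu: f_i = mu(E_i).
forecast = tuple(sum(mu[w] for w in e) for e in events)

# Weight attached to each distinct truth vector: total mass of the points producing it.
weights = {}
for w in omega:
    v = truth_vector(w)
    weights[v] = weights.get(v, Fraction(0)) + mu[w]

# Convex combination of the truth vectors with those weights.
mixture = tuple(
    sum(a * v[i] for v, a in weights.items()) for i in range(len(events))
)

print(forecast == mixture, sum(weights.values()))
```

Here only four of the eight conceivable 0/1 vectors occur, because the third event contains the other two.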

The next proposition characterizes scoring rules in terms of convex functions. Recall that a convex function $\varphi$ on a convex subset of $\mathbb{R}^n$ satisfies $\varphi(\alpha x+(1-\alpha)y)\le\alpha\varphi(x)+(1-\alpha)\varphi(y)$ for all $0<\alpha<1$ and all $x$, $y$ in the subset. Strict convexity means that the inequality is strict unless $x=y$. Variants of the following fact are proved in Savage (1971), Banerjee et al. (2005), and Gneiting and Raftery (2007).

###### Proposition 2.

Let $s$ be a proper scoring rule. Then the function $\varphi:[0,1]\to\mathbb{R}$ defined by $\varphi(p)=-p\,s(1,p)-(1-p)\,s(0,p)$ is a bounded, continuous and strictly convex function, differentiable for $p\in(0,1)$. Moreover,

$$s(i,x)=-\varphi(x)-\varphi'(x)(i-x)\quad\forall x\in(0,1). \tag{2}$$

Conversely, if a function $s$ satisfies (2), with $\varphi$ bounded, strictly convex and differentiable on $(0,1)$, and $s$ is continuous on $[0,1]$, then $s$ is a proper scoring rule.

We note that the right side of (2), which is only defined for $x\in(0,1)$, can be continuously extended to $x=0,1$. This is the content of Lemma 1 in Section 6. If the extended $s$ satisfies (2) then:

$$s(0,0)=-\varphi(0)\quad\text{and}\quad s(1,1)=-\varphi(1). \tag{3}$$
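As a worked instance of this representation, take the quadratic rule $s(1,x)=(1-x)^2$, $s(0,x)=x^2$ (used here purely as an illustration). Then

$$
\varphi(x)=-x(1-x)^2-(1-x)x^2=-x(1-x),\qquad \varphi'(x)=2x-1,
$$

and substituting into the right side of (2) recovers the rule:

$$
-\varphi(x)-\varphi'(x)(0-x)=x(1-x)+x(2x-1)=x^2,
\qquad
-\varphi(x)-\varphi'(x)(1-x)=(1-x)\bigl(x-(2x-1)\bigr)=(1-x)^2 .
$$

Here $\varphi$ is bounded and strictly convex on $[0,1]$, and $s(0,0)=0=-\varphi(0)$, $s(1,1)=0=-\varphi(1)$, in agreement with (3).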

Finally, our third proposition concerns a well-known property of Bregman divergences (see, e.g., Censor and Zenios, 1997). When we apply the proposition to the proof of Theorem 1, $C$ will be the unit cube $[0,1]^n$ in $\mathbb{R}^n$.

###### Definition 6.

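Let $C$ be a convex subset of $\mathbb{R}^n$ with non-empty interior. Let $\Phi:C\to\mathbb{R}$ be a strictly convex function, differentiable in the interior of $C$, whose gradient $\nabla\Phi$ extends to a bounded, continuous function on $C$. For $\mathbf{x},\mathbf{y}\in C$, the Bregman divergence $d_\Phi:C\times C\to\mathbb{R}$ corresponding to $\Phi$ is given by

$$d_\Phi(\mathbf{y},\mathbf{x})=\Phi(\mathbf{y})-\Phi(\mathbf{x})-\nabla\Phi(\mathbf{x})\cdot(\mathbf{y}-\mathbf{x}).$$

Because of the strict convexity of $\Phi$, $d_\Phi(\mathbf{y},\mathbf{x})\ge 0$ with equality if and only if $\mathbf{y}=\mathbf{x}$.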

###### Proposition 3.

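Let $d_\Phi:C\times C\to\mathbb{R}$ be a Bregman divergence, and let $Z\subseteq C$ be a closed convex subset of $\mathbb{R}^n$. For $\mathbf{x}\in C\setminus Z$, there exists a unique $\boldsymbol{\pi}_{\mathbf{x}}\in Z$, called the projection of $\mathbf{x}$ onto $Z$, such that

$$d_\Phi(\boldsymbol{\pi}_{\mathbf{x}},\mathbf{x})\le d_\Phi(\mathbf{y},\mathbf{x})\quad\forall\,\mathbf{y}\in Z.$$

Moreover,

$$d_\Phi(\mathbf{y},\boldsymbol{\pi}_{\mathbf{x}})\le d_\Phi(\mathbf{y},\mathbf{x})-d_\Phi(\boldsymbol{\pi}_{\mathbf{x}},\mathbf{x})\quad\forall\,\mathbf{y}\in Z,\ \mathbf{x}\in C\setminus Z. \tag{4}$$

It is worth observing that Proposition 3 also holds if $\mathbf{x}\in Z$, in which case $\boldsymbol{\pi}_{\mathbf{x}}=\mathbf{x}$ and (4) is trivially satisfied.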
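Inequality (4) can be checked numerically in the simplest case, where the divergence is the squared Euclidean distance (the Bregman divergence of a sum of squares). The sketch below makes several assumptions of its own: the closed convex set is taken to be the triangle of points `(a, b)` in the unit square with `a <= b`, the point `x` is arbitrary, and the function names are hypothetical.

```python
import random


def sq_dist(y, x):
    """Squared Euclidean distance, i.e. the Bregman divergence of Phi(x) = sum of x_i^2."""
    return sum((a - b) ** 2 for a, b in zip(y, x))


def project_onto_triangle(x):
    """Euclidean projection of a point of the unit square onto {(a, b): a <= b}."""
    a, b = x
    if a <= b:
        return (a, b)
    m = (a + b) / 2  # nearest point on the diagonal a = b, still inside the unit square
    return (m, m)


random.seed(0)
x = (0.7, 0.4)  # illustrative point outside the triangle
pi_x = project_onto_triangle(x)

holds = True
for _ in range(10000):
    # Sorting two uniform draws gives a random point y with y[0] <= y[1].
    y = tuple(sorted((random.random(), random.random())))
    # Small tolerance guards against floating-point rounding when (4) is tight.
    if sq_dist(y, pi_x) > sq_dist(y, x) - sq_dist(pi_x, x) + 1e-12:
        holds = False

print(pi_x, holds)
```

For other convex functions the projection generally has no closed form and would have to be computed by numerical optimization; the Euclidean case is used here only because the projection is explicit.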

## 5 Proof of Theorem 1

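The main idea of the proof is more apparent when $s$ is bounded. So we consider this case on its own before allowing $s$ to reach $+\infty$.

Bounded Case.

Suppose $s$ is bounded. In this case, the derivative of the corresponding $\varphi$ from Eq. (2) in Proposition 2 is continuous and bounded all the way up to the boundary of $[0,1]$.

Let $\mathbf{f}$ be a forecast and, for $\omega\in\Omega$, let $\mathbf{v}_\omega\in\{0,1\}^n$ be the vector with components $C_{E_i}(\omega)$. Let $\Phi(\mathbf{x})=\sum_{i=1}^{n}\varphi(x_i)$. Then

$$
\begin{aligned}
P_s(\omega,\mathbf{f}) &= \sum_{i=1}^{n}s(C_{E_i}(\omega),f_i) && \text{[Definition 3]}\\
&= \sum_{i=1}^{n}\bigl[-\varphi(f_i)-\varphi'(f_i)(C_{E_i}(\omega)-f_i)\bigr] && \text{[Proposition 2]}\\
&= d_\Phi(\mathbf{v}_\omega,\mathbf{f})-\sum_{i=1}^{n}\varphi(C_{E_i}(\omega)) && \text{[Definition 6]}\\
&= d_\Phi(\mathbf{v}_\omega,\mathbf{f})+\sum_{i=1}^{n}s(C_{E_i}(\omega),C_{E_i}(\omega)) && \text{[Equation (3)]}.
\end{aligned}
\tag{5}
$$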
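The identity between the penalty and the Bregman divergence can be verified numerically. The sketch below does so for the logarithmic score at an interior forecast; the choice of rule, the forecast, the truth-value vector, and the function names are all assumptions made for illustration (the logarithmic rule is unbounded, but both sides stay finite at interior forecasts).

```python
import math


def log_score(i, x):
    """Logarithmic score: -ln(x) if the event is true (i = 1), -ln(1 - x) otherwise."""
    return -math.log(x) if i == 1 else -math.log(1 - x)


def phi(x):
    """phi(x) = -x*s(1,x) - (1-x)*s(0,x), with the convention 0*ln(0) = 0."""
    total = 0.0
    if x > 0:
        total += x * math.log(x)
    if x < 1:
        total += (1 - x) * math.log(1 - x)
    return total


def dphi(x):
    """Derivative of phi on the open unit interval."""
    return math.log(x) - math.log(1 - x)


def bregman(v, f):
    """Bregman divergence of Phi(x) = sum_i phi(x_i), evaluated at (v, f)."""
    return sum(phi(a) - phi(b) - dphi(b) * (a - b) for a, b in zip(v, f))


f = (0.2, 0.5, 0.9)  # illustrative interior forecast
v = (0, 1, 1)        # illustrative truth-value vector

penalty = sum(log_score(c, x) for c, x in zip(v, f))
rhs = bregman(v, f) + sum(log_score(c, c) for c in v)
print(penalty, rhs)
```

For this rule the correction term vanishes, since a categorical announcement that turns out right scores zero, so the penalty equals the divergence from the forecast to the realized truth-value vector.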

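Now assume that $\mathbf{f}$ is incoherent which, by Proposition 1, means that $\mathbf{f}$ lies outside the convex hull of $V$. According to Eq. (4) of Proposition 3, there exists a $\mathbf{g}$ in the convex hull of $V$, namely the projection of $\mathbf{f}$ onto that hull, such that $d_\Phi(\mathbf{y},\mathbf{g})\le d_\Phi(\mathbf{y},\mathbf{f})-d_\Phi(\mathbf{g},\mathbf{f})$ for all $\mathbf{y}$ in the hull and hence, in particular, for $\mathbf{y}=\mathbf{v}_\omega$. Since $d_\Phi(\mathbf{g},\mathbf{f})>0$, Eq. (5) then gives $P_s(\omega,\mathbf{g})<P_s(\omega,\mathbf{f})$ for all $\omega\in\Omega$, and this proves part (b) of Theorem 1.

To prove part (a), first note that weak dominance of $\mathbf{f}$ by $\mathbf{g}$ means that $d_\Phi(\mathbf{v}_\omega,\mathbf{g})\le d_\Phi(\mathbf{v}_\omega,\mathbf{f})$ for all $\omega\in\Omega$, by Eq. (5). In this case, $d_\Phi(\mathbf{y},\mathbf{g})\le d_\Phi(\mathbf{y},\mathbf{f})$ for all $\mathbf{y}$ in the convex hull of $V$, since $d_\Phi(\mathbf{y},\mathbf{g})-d_\Phi(\mathbf{y},\mathbf{f})$ depends linearly on $\mathbf{y}$. If $\mathbf{f}$ is coherent, $\mathbf{f}$ lies in the convex hull of $V$ by Proposition 1, and hence $d_\Phi(\mathbf{f},\mathbf{g})\le d_\Phi(\mathbf{f},\mathbf{f})=0$. This implies that $\mathbf{g}=\mathbf{f}$.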

Unbounded Case.

Next, consider the case when $s$ is unbounded. In this case, the derivative of the corresponding $\varphi$ from Proposition 2 diverges either at $0$ or $1$, or at both values, and hence we cannot directly apply Proposition 3. Eq. (5) is still valid, with both sides of the equation possibly being $+\infty$. However, if $\mathbf{f}$ lies either in the interior of $[0,1]^n$, or on a point on the boundary where the derivative of $\Phi$ does not diverge, an examination of the proof of Proposition 3 shows that the result still applies, as we show now.

If is finite, the minimum of over is uniquely attained at some . Moreover, is necessarily finite. Repeating the argument in the proof of Proposition 3 shows that for any , which is the desired inequality needed in the proof of Theorem 1. We are thus left with the case in which lies on an dimensional face of where the normal derivative diverges. Consider first the case . Then either , in which case is coherent, or or , in which case it is clear that the unique coherent vector strongly dominates .

We now proceed by induction on the dimension of the forecast . In the dimensional hypercube, either lies inside or on a point of the boundary where the normal derivative of is finite, in which case we have just argued that there exists a that is coherent and satisfies for all such that lies in the dimensional face. In the other case, the induction hypothesis implies that we can find such a . Note that for all the other , . Now simply pick an and choose , where the denote all the elements of outside the -dimensional hypercube. Then for all and also, using Lemma 1, . Hence we can choose small enough to conclude that for all . This finishes the proof of part in the general case of unbounded .

To prove part in the general case, we note that if for and , then necessarily . That is, any coherent is a convex combination of such that . This follows from the fact that a component of can be only if this component is for all the ’s. The same is true for the value . But the can be infinite only if some component of is and the corresponding one for is , or vice versa.

Since for the in question, also by Eq. (5) and the assumption that is weakly dominated by . Moreover, . But , hence . ∎

## 6 Proofs of Propositions 1–3

###### Proof of Proposition 1.

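Recall that $n$ is the dimension of $\mathcal{E}$, and that $k$ is the number of elements in $V$. Let $\mathcal{X}$ be the collection of all nonempty sets of form $\bigcap_{i\le n}E_i^*$, where $E_i^*$ is either $E_i$ or its complement. ($\mathcal{X}$ corresponds to the minimal non-empty regions appearing in the Venn diagram of $\mathcal{E}$.) It is easy to see that:

(a) $\mathcal{X}$ partitions $\Omega$.

It is also clear that there is a one-to-one correspondence between $\mathcal{X}$ and $V$ with the property that $e\in\mathcal{X}$ is mapped to $\mathbf{v}\in V$ such that for all $i\le n$, $e\subseteq E_i$ iff $v_i=1$. (Here, $v_i$ denotes the $i$th component of $\mathbf{v}$.) Thus, there are $k$ elements in $\mathcal{X}$. We enumerate them as $e_1,\dots,e_k$, and the corresponding $\mathbf{v}$ by $\mathbf{v}(e_1),\dots,\mathbf{v}(e_k)$. Plainly, for all $i\le n$, $E_i$ is the disjoint union of $\{e_j : v(e_j)_i=1\}$, and hence:

(b) For any measure $\mu$, $\mu(E_i)=\sum_{j\le k}\mu(e_j)\,v(e_j)_i$ for all $i\le n$.

For the left-to-right direction of the proposition, suppose that forecast $\mathbf{f}$ is coherent via probability measure $\mu$. Then $f_i=\mu(E_i)$ for all $i\le n$ and hence by (b), $\mathbf{f}=\sum_{j\le k}\mu(e_j)\,\mathbf{v}(e_j)$. But the $\mu(e_j)$ are non-negative and sum to one by (a), which shows that $\mathbf{f}$ lies in the convex hull of $V$.

For the converse, suppose that $\mathbf{f}$ lies in the convex hull of $V$, which means that there are non-negative $a_j$’s, with $\sum_{j\le k}a_j=1$, such that $\mathbf{f}=\sum_{j\le k}a_j\,\mathbf{v}(e_j)$. Let $\mu$ be some probability measure such that $\mu(e_j)=a_j$ for all $j\le k$. By (a) and the assumption about the $a_j$, it is clear that such a measure $\mu$ exists. For all $i\le n$, $f_i=\sum_{j\le k}a_j\,v(e_j)_i=\mu(E_i)$ by (b), thereby exhibiting $\mathbf{f}$ as coherent. ∎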

Before giving the proof of Proposition 2, we state and prove the following technical Lemma.

###### Lemma 1.

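Let $\varphi$ be bounded, convex and differentiable on $(0,1)$. Then the limits $\lim_{p\to 0,1}\varphi(p)$ and $\lim_{p\to 0,1}\varphi'(p)$ exist, the latter possibly being equal to $-\infty$ at $0$ or $+\infty$ at $1$. Moreover,

$$\lim_{p\to 0}p\,\varphi'(p)=\lim_{p\to 1}\varphi'(p)(1-p)=0. \tag{6}$$

###### Proof of Lemma 1.

Since $\varphi$ is convex, the limits $\lim_{p\to 0,1}\varphi(p)$ exist, and they are finite since $\varphi$ is bounded. Moreover, $\varphi'$ is a monotone increasing function, and hence $\lim_{p\to 0,1}\varphi'(p)$ also exists (but possibly equals $-\infty$ at $0$ or $+\infty$ at $1$). Finally, Eq. (6) follows again from monotonicity of $\varphi'$ and boundedness of $\varphi$, using that $2\bigl(\varphi(p)-\varphi(p/2)\bigr)\le p\,\varphi'(p)\le\varphi(2p)-\varphi(p)$ for $0<p<1/2$, where both bounds tend to zero as $p\to 0$, and likewise at $1$. ∎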
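A concrete instance may help. For the convex function associated with the logarithmic score (an example chosen here for illustration), $\varphi(p)=p\ln p+(1-p)\ln(1-p)$ on $(0,1)$, the derivative diverges at both endpoints while the products in (6) still vanish:

$$
\varphi'(p)=\ln p-\ln(1-p),\qquad \lim_{p\to 0}\varphi'(p)=-\infty,\qquad \lim_{p\to 1}\varphi'(p)=+\infty,
$$

$$
\lim_{p\to 0}p\,\varphi'(p)=\lim_{p\to 0}\bigl[p\ln p-p\ln(1-p)\bigr]=0,
\qquad
\lim_{p\to 1}\varphi'(p)(1-p)=\lim_{p\to 1}\bigl[(1-p)\ln p-(1-p)\ln(1-p)\bigr]=0 .
$$

Both limits use only $t\ln t\to 0$ as $t\to 0$, and $\varphi$ itself extends continuously to $[0,1]$ with $\varphi(0)=\varphi(1)=0$.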

###### Proof of Proposition 2.

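Let $s$ be a proper scoring rule. For $p\in[0,1]$, let

$$\varphi(p)=-\min_{x}\{p\,s(1,x)+(1-p)\,s(0,x)\}. \tag{7}$$

By Definition 2(a), the minimum in (7) is achieved at $x=p$, hence $\varphi(p)=-p\,s(1,p)-(1-p)\,s(0,p)$.

As a minimum over linear functions, $-\varphi$ is concave; hence $\varphi$ is convex. Clearly, $\varphi$ is bounded (because $s\ge 0$ implies, from (7), that $\varphi\le 0$, but a convex function can become unbounded only by going to $+\infty$).

The fact that the minimum is achieved uniquely (Definition 2(a)) implies that $\varphi$ is strictly convex for the following reason. We take $p_1\neq p_2$ and $\lambda\in(0,1)$ and set $p=\lambda p_1+(1-\lambda)p_2$. Then $p_1 s(1,p)+(1-p_1)s(0,p)>-\varphi(p_1)$ by uniqueness of the minimizer at $x=p_1$. Similarly, $p_2 s(1,p)+(1-p_2)s(0,p)>-\varphi(p_2)$. By adding $\lambda$ times the first inequality to $1-\lambda$ times the second we obtain $-\varphi(p)>-\lambda\varphi(p_1)-(1-\lambda)\varphi(p_2)$, which is precisely the statement of strict convexity.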

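Let $\psi(p)=s(0,p)-s(1,p)$. If $\varphi$ is differentiable and $\varphi'(p)=\psi(p)$ for all $p\in(0,1)$, then (2) is satisfied, as simple algebra shows.

We shall now show that $\varphi$ is, in fact, differentiable and $\varphi'=\psi$. For any $p\in(0,1)$ and small enough $\epsilon>0$, we have

$$\frac{1}{\epsilon}\bigl(\varphi(p+\epsilon)-\varphi(p)\bigr)=\psi(p)-\frac{1}{\epsilon}\Bigl[(p+\epsilon)\bigl(s(1,p+\epsilon)-s(1,p)\bigr)+(1-p-\epsilon)\bigl(s(0,p+\epsilon)-s(0,p)\bigr)\Bigr].$$

Since $(p+\epsilon)\,s(1,x)+(1-p-\epsilon)\,s(0,x)$ is minimized at $x=p+\epsilon$ by Definition 2(a), the last term in square brackets is negative. Hence

$$\lim_{\epsilon\to 0}\frac{1}{\epsilon}\bigl(\varphi(p+\epsilon)-\varphi(p)\bigr)\ge\psi(p),$$

and similarly one shows

$$\lim_{\epsilon\to 0}\frac{1}{\epsilon}\bigl(\varphi(p)-\varphi(p-\epsilon)\bigr)\le\psi(p).$$

Since $\psi$ is continuous by Definition 2(b), this shows that $\varphi$ is differentiable, and hence $\varphi'=\psi$. This proves Eq. (2). Continuity of $\varphi$ up to the boundary of $[0,1]$ follows from continuity of $s$ and Lemma 1.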

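To prove the converse, first note that if $\varphi$ is bounded and convex on $(0,1)$, it can be extended to a continuous function on $[0,1]$, as shown in Lemma 1. Because of strict convexity of $\varphi$ we have, for $p\in[0,1]$ and $x\in(0,1)$,

$$p\,s(1,x)+(1-p)\,s(0,x)=-\varphi(x)-\varphi'(x)(p-x)\ge-\varphi(p), \tag{8}$$

with equality if and only if $x=p$.

It remains to show that the same is true for $x\in\{0,1\}$. Consider first the case $x=0$. We have to show that $p\,s(1,0)+(1-p)\,s(0,0)>-\varphi(p)$ for $p>0$. By continuity of $s$, Eq. (2) and Lemma 1, we have $s(0,0)=-\varphi(0)$, while $s(1,0)=-\varphi(0)-\varphi'(0)$. If $\varphi'(0)=-\infty$, the result is immediate. If $\varphi'(0)$ is finite, we have $-\varphi(0)-p\,\varphi'(0)>-\varphi(p)$, again by strict convexity of $\varphi$.

Likewise, one shows that $p\,s(1,1)+(1-p)\,s(0,1)>-\varphi(p)$ for $p<1$. This finishes the proof that $s$ is a proper scoring rule. ∎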

###### Proof of Proposition 3.

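For fixed $\mathbf{x}\in C$, the function $\mathbf{y}\mapsto d_\Phi(\mathbf{y},\mathbf{x})$ is strictly convex, and hence achieves a unique minimum at a point $\boldsymbol{\pi}_{\mathbf{x}}$ in the convex, closed set $Z$.

Let $\mathbf{y}\in Z$. For $0\le\epsilon\le 1$, $(1-\epsilon)\boldsymbol{\pi}_{\mathbf{x}}+\epsilon\mathbf{y}\in Z$, and hence $d_\Phi((1-\epsilon)\boldsymbol{\pi}_{\mathbf{x}}+\epsilon\mathbf{y},\mathbf{x})\ge d_\Phi(\boldsymbol{\pi}_{\mathbf{x}},\mathbf{x})$ by the definition of $\boldsymbol{\pi}_{\mathbf{x}}$. Since $d_\Phi$ is differentiable in the first argument, we can divide by $\epsilon$ and let $\epsilon\to 0$ to obtain

$$0\le\lim_{\epsilon\to 0}\frac{1}{\epsilon}\Bigl(d_\Phi\bigl((1-\epsilon)\boldsymbol{\pi}_{\mathbf{x}}+\epsilon\mathbf{y},\mathbf{x}\bigr)-d_\Phi(\boldsymbol{\pi}_{\mathbf{x}},\mathbf{x})\Bigr)=\bigl(\nabla\Phi(\boldsymbol{\pi}_{\mathbf{x}})-\nabla\Phi(\mathbf{x})\bigr)\cdot(\mathbf{y}-\boldsymbol{\pi}_{\mathbf{x}}).$$

The fact that

$$d_\Phi(\mathbf{y},\mathbf{x})-d_\Phi(\boldsymbol{\pi}_{\mathbf{x}},\mathbf{x})-d_\Phi(\mathbf{y},\boldsymbol{\pi}_{\mathbf{x}})=\bigl(\nabla\Phi(\boldsymbol{\pi}_{\mathbf{x}})-\nabla\Phi(\mathbf{x})\bigr)\cdot(\mathbf{y}-\boldsymbol{\pi}_{\mathbf{x}})$$

proves the claim. ∎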

## 7 Generalizations

### 7.1 Penalty functions

Theorem 1 holds for a larger class of penalty functions. In fact, one can use different proper scoring rules for every event, and replace (1) by

$$P_s(\omega,\mathbf{f})=\sum_{i\le n}s_i(C_{E_i}(\omega),f_i),$$

where the $s_i$ are possibly distinct proper scoring rules. In this way, forecasts for some events can be penalized differently than others. The relevant Bregman divergence in this case is the one corresponding to $\Phi(\mathbf{x})=\sum_{i=1}^{n}\varphi_i(x_i)$, where $\varphi_i$ is determined by $s_i$ via (2). Proof of this generalization closely follows the argument given above, so it is omitted. Additionally, by considering more general convex functions $\Phi$ our argument generalizes to certain non-additive penalties.

### 7.2 Generalized scoring rules

#### 7.2.1 Non-uniqueness

If one relaxes the condition of unique minimization in Definition 2(a), a weaker form of Theorem 1 still holds. Namely, for any incoherent forecast there exists a coherent forecast that weakly dominates . Strong dominance will not hold in general, as the example of shows.

Proposition 2 also holds in this generalized case, but the function $\varphi$ need not be strictly convex. Likewise, Proposition 3 can be generalized to merely convex (not necessarily strictly convex) $\Phi$, but in this case the projection $\boldsymbol{\pi}_{\mathbf{x}}$ need not be unique. Eq. (4) remains valid.

#### 7.2.2 Discontinuity

A generalization that is more interesting mathematically is to discontinuous scoring rules. Proposition 2 can be generalized to scoring rules that satisfy neither the continuity condition in Definition 2 nor unique minimization. (This is also shown in Gneiting and Raftery, 2007).

###### Proposition 4.

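Let $s$ satisfy

$$p\,s(1,x)+(1-p)\,s(0,x)\ge p\,s(1,p)+(1-p)\,s(0,p)\quad\forall x,p\in[0,1]. \tag{9}$$

Then the function $\varphi:[0,1]\to\mathbb{R}$ defined by $\varphi(p)=-p\,s(1,p)-(1-p)\,s(0,p)$ is bounded and convex. Moreover, there exists a monotone non-decreasing function $\psi$ on $[0,1]$, with the property that

$$\psi(x)\ge\lim_{\epsilon\to 0}\frac{1}{\epsilon}\bigl(\varphi(x)-\varphi(x-\epsilon)\bigr)\quad\forall x\in(0,1], \tag{10}$$

$$\psi(x)\le\lim_{\epsilon\to 0}\frac{1}{\epsilon}\bigl(\varphi(x+\epsilon)-\varphi(x)\bigr)\quad\forall x\in[0,1), \tag{11}$$

such that

$$s(i,x)=-\varphi(x)-\psi(x)(i-x)\quad\forall x\in(0,1). \tag{12}$$

Function $\varphi$ is strictly convex if and only if the inequality (9) is strict for $x\neq p$.

Conversely, if $s$ is of the form (12), with $\varphi$ bounded and convex and $\psi$ satisfying (10)–(11), then $s$ satisfies (9).

It is a fact (Hardy et al., 1934) that every convex function $\varphi$ on $[0,1]$ is continuous on $(0,1)$ and has a right and left derivative, $\varphi'_+$ and $\varphi'_-$ (defined by the right sides of (11) and (10), respectively) at every point (except the endpoints, where it has only a right or left derivative, respectively). Both $\varphi'_+$ and $\varphi'_-$ are non-decreasing functions, and $\varphi'_-(x)\le\varphi'_+(x)$ for all $x\in(0,1)$. Except for countably many points, $\varphi'_-(x)=\varphi'_+(x)$, i.e., $\varphi$ is differentiable. Eqs. (10)–(11) say that $\varphi'_-(x)\le\psi(x)\le\varphi'_+(x)$.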

Note that although and may be discontinuous, the combination is continuous. Hence, if jumps up at a point , has to jump down by an amount proportional to .

The proof of Proposition 4 is virtually the same as the proof of Proposition 2, so we omit it.

### 7.3 Open question

Whether Theorem 1 holds for this generalized notion of a discontinuous scoring rule remains open. The proof of Theorem 1 given here does not extend to the discontinuous case, since for inequality (4) to hold, differentiability of $\Phi$ is necessary, in general.

## References

• Banerjee et al. (2005) A. Banerjee, X. Guo, and H. Wang. On the optimality of conditional expectation as a Bregman predictor. IEEE Transactions on Information Theory, 51(7):2664–2669, 2005.
• Bernardo and Smith (1994) J. M. Bernardo and A. F. M. Smith. Bayesian Theory. John Wiley & Sons, West Sussex, England, 1994.
• Bregman (1967) L. M. Bregman. The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. U. S. S. R. Computational Mathematics and Mathematical Physics, 78(384):200–217, 1967.
• Brier (1950) G. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3, 1950.
• Censor and Zenios (1997) Y. Censor and S. A. Zenios. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press, 1997.
• de Finetti (1974) B. de Finetti. Theory of Probability, volume 1. John Wiley and Sons, New York, NY, 1974.
• Gneiting and Raftery (2007) T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, March 2007.
• Good (1952) I. J. Good. Rational decisions. Journal of the Royal Statistical Society, 14:107–114, 1952.
• Hardy et al. (1934) G. H. Hardy, J. E. Littlewood, and G. Pólya. Inequalities. Cambridge University Press, 1934.
• Joyce (1998) J. M. Joyce. A nonpragmatic vindication of probabilism. Philosophy of Science, 65:575–603, 1998.
• Lindley (1982) D. V. Lindley. Scoring rules and the inevitability of probability. International Statistical Review, 50:1–26, 1982.
• Savage (1971) L. J. Savage. Elicitation of personal probabilities and expectations. Journal of the American Statistical Association, 66(336):783–801, 1971.
• Selten (1998) R. Selten. Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1:43–62, 1998.
• Skyrms (2000) B. Skyrms. Choice & Chance: An Introduction to Inductive Logic. Wadsworth, Belmont CA, 2000.