 # Safe Probability

We formalize the idea of probability distributions that lead to reliable predictions about some, but not all aspects of a domain. The resulting notion of ‘safety’ provides a fresh perspective on foundational issues in statistics, providing a middle ground between imprecise probability and multiple-prior models on the one hand and strictly Bayesian approaches on the other. It also allows us to formalize fiducial distributions in terms of the set of random variables that they can safely predict, thus taking some of the sting out of the fiducial idea. By restricting probabilistic inference to safe uses, one also automatically avoids paradoxes such as the Monty Hall problem. Safety comes in a variety of degrees, such as "validity" (the strongest notion), "calibration", "confidence safety" and "unbiasedness" (almost the weakest notion).

## Authors


## 1 Introduction

We formalize the idea of probability distributions that lead to reliable predictions about some, but not all aspects of a domain. Very broadly speaking, we call a distribution $\tilde{P}$ safe for predicting random variable $U$ given random variable $V$ if predictions concerning $U$ based on $\tilde{P}$ tend to be as good as one would expect them to be if $\tilde{P}$ were an accurate description of one’s uncertainty, even if $\tilde{P}$ may not represent one’s actual beliefs, let alone the truth. Our formalization of this notion of ‘safety’ has repercussions for the foundations of statistics, providing a joint perspective on issues hitherto viewed as distinct:

#### 1. All models are wrong…

Footnote 1: …yet some are useful, as famously remarked by Box (1979).

Some statistical models are evidently entirely wrong yet very useful. For example, in some highly successful applications of Bayesian statistics, such as latent Dirichlet allocation for topic modeling (Blei et al., 2003), one assumes that natural language text is i.i.d., which is fine for the task at hand (topic modeling), yet no one would want to use these models for predicting the next word of a text given the past. Still, one can use a Bayesian posterior to make such predictions anyway: Bayesian inference has no mechanism to distinguish between ‘safe’ and ‘unsafe’ inferences. Safe probability allows us to impose such a distinction.

#### 2. The Eternal Discussion

Footnote 2: When the single- vs. multiple-prior issue came up in a discussion on the decision-theory forum mailing list, the well-known economist I. Gilboa referred to it as ‘the eternal discussion’.

More generally, representing uncertainty by a single distribution, as is standard in Bayesian inference, implies a willingness to make definite predictions about random variables that, some claim, one really knows nothing about. Disagreement on this issue goes back at least to Keynes (1921) and Ramsey (1931), has led many economists to sympathize with multiple-prior models (Gilboa and Schmeidler, 1989) and some statisticians to embrace the related imprecise probability (Walley, 1991, Augustin et al., 2014) in which so-called ‘Knightian’ uncertainty is modeled by a set of distributions. But imprecise probability is not without problems of its own, an important one being dilation (Example 1 below). Safe probability can be understood as starting from a set $\mathcal{P}^*$ of distributions, but then mapping this set to a single distribution, where the mapping invoked may depend on the prediction task at hand, thus avoiding both dilation and overly precise predictions. The use of such mappings has been advocated before, under the name pignistic transformation (Smets, 1989, Hampel, 2001), but a general theory for constructing and evaluating them has been lacking (see also Section 5).

#### 3. Fisher’s Biggest Blunder

Footnote 3: While Fisher is generally regarded as (one of) the greatest statisticians of all time, fiducial inference is often considered to be his ‘big blunder’; see Hampel (2006) and Efron (1996), who writes: ‘Maybe Fisher’s biggest blunder will become a big hit in the 21st century!’

Fisher (1930) introduced fiducial inference, a method to come up with a ‘posterior’ on a model’s parameter space based on data, but without anything like a ‘prior’, in an approach to statistics that was neither Bayesian nor frequentist. The approach turned out to be problematic, however, and, despite progress on related structural inference (Fraser, 1968, 1979), was largely abandoned. Recently, however, fiducial distributions have made a comeback (Hannig, 2009, Taraldsen and Lindqvist, 2013, Martin and Liu, 2013, Veronese and Melilli, 2015), in some instances with a more modest, frequentist interpretation as confidence distributions (Schweder and Hjort, 2002, 2016). As noted by Xie and Singh (2013), these ‘contain a wealth of information for inference’, e.g. to determine valid confidence intervals and unbiased estimation of the median, but their interpretation remains difficult, viz. the insistence by Hampel (2006), Xie and Singh (2013) and many others that, although a confidence distribution is defined as a distribution on the parameter space, the parameter itself is not random. Safe probability offers an alternative perspective, where the insistence that ‘$\theta$ is not random’ is replaced by the weaker (and perhaps liberating) statement that ‘we can treat $\theta$ as random as long as we restrict ourselves to safe inferences about it’. In Section 3.1 we determine precisely what these safe inferences are and how they fit into a general hierarchy:

#### 4. The Hierarchy

Pursuing the idea that some distributions are reliable for a smaller subset of random variables/prediction tasks than others leads to a natural hierarchy of safeties, a first taste of which is given in Figure 1, with notations explained later. At the top are distributions that are fully reliable for whatever task one has in mind; at the bottom those that are reliable only for a single task in a weak, average sense. In between there is a natural place for distributions that are calibrated (Example 2 below), that are confidence-safe (i.e. valid confidence distributions) and that are optimal for squared-error prediction.

#### 5. “The concept of a conditional probability with regard to an isolated hypothesis…”

Footnote 4: “… whose probability equals 0 is inadmissible,” as remarked by Kolmogorov (1933). As will be seen, safe probability suggests an even more radical statement related to the Monty Hall sanity check.

Upon first hearing of the Monty Hall (quiz master, three doors) problem (vos Savant, 1990, Gill, 2011), most people naively think that the probability of winning the car is the same whether one switches doors or not. Most can eventually, after much arguing, be convinced that this is wrong, but wouldn’t it be nice to have a simple sanity check that immediately tells you that the naive answer must be wrong, without even pondering the ‘right’ way to approach the problem? Safe probability provides such a check: one can immediately tell that the naive answer is not safe, and thus cannot be right. Such a check is applicable more generally, whenever conditioning on events rather than on random variables (Example 4 and Section 4).

#### 6. “Could Neyman, Jeffreys and Fisher have agreed on testing?”

Footnote 5: As asked by Jim Berger (2003).

Ryabko and Monarev (2005) show that sequences of 0s and 1s produced by standard random number generators can be substantially compressed by standard data compression algorithms such as rar or zip. While this is clear evidence that such sequences are not random, this method is neither a valid Neyman-Pearson hypothesis test nor a valid Bayesian test (in the tradition of Jeffreys). The reason is that both these standard paradigms require the existence of an alternative statistical model, and start out from the assumption that, if the null model (i.i.d. Bernoulli (1/2)) is incorrect, then the alternative must be correct. However, there is no clear sense in which zip could be ‘correct’ (see Section 5). There is a third testing paradigm, due to Fisher, which does view testing as accumulating evidence against $H_0$, and not necessarily as confirming some precisely specified $H_1$. Yet Fisher’s paradigm is not without serious problems either (see Section 5).

Berger et al. (1994) started a line of work culminating in Berger (2003), who presents tests that have interpretations in all three paradigms and that avoid some of the problems of their original implementations. However, it is essentially an objective Bayes approach and thus, inevitably, strong evidence against $H_0$ implies a high posterior probability that $H_1$ is true. If one is really doing Fisherian testing, this is unwanted. Using the idea of safety, we can extend Berger’s paradigm by stipulating the inferences for which we think it is safe: roughly speaking, if we are in a Fisherian set-up, then we declare all inferences conditional on $H_1$ to be unsafe, and inferences conditional on $H_0$ to be safe; if we really believe that $H_1$ may represent the state of the world, we can declare inferences conditional on $H_1$ to be safe. But much more is possible using safe probability: a DM can decide, on a case-by-case basis, what inferences based on her tests would be safe, and under what situations the test results themselves are safe. For example, some tests remain safe under optional stopping, whereas others (even Bayesian ones!) do not. While we will report on this application of safety (which comprises a long paper in itself) elsewhere, we will briefly return to it in the conclusion.

#### 7. Further Applications: Objective Bayes, Epistemic Probability

Apart from the applications above, the results in this paper suggest that safe probability be used to formalize the status of default priors in objective Bayesian inferences, and to enable an alternative look at epistemic probability. But this remains a topic for future work, to which we briefly return at the end of the paper.

#### The Dream

Imagine a world in which one would require any statistical analysis, whether it be testing, prediction, regression, density estimation or anything else, to be accompanied by a safety statement. Such a statement should list what inferences, the analysts think, can be safely made based on the conclusion of the analysis, and in what formal ‘safety’ sense. Is the alternative $H_1$ really true even though $H_0$ is found to be false? Is the suggested predictive distribution valid or merely calibrated? Is the posterior really just good for making predictions via the predictive distribution, or is it confidence-safe, or is it generally safe? Does the inferred regression function only work well on covariates drawn randomly from the same distribution, or also under covariate shift (an application of safety we did not address here but which we can easily incorporate)? The present, initial formulation of safe probability is too complicated to have any realistic hopes for a practice like this to emerge, but I can’t help hoping that the ideas can be simplified substantially, and a safer practice of statistics might emerge.

Starting with Grünwald (1999), my own work — often in collaboration with J. Halpern — has regularly used the idea of ‘safety’, for example in the context of Maximum Entropy inference (Grünwald, 2000), and also dilation (Grünwald and Halpern, 2004), calibration (Grünwald and Halpern, 2011), and probability puzzles like Monty Hall (Grünwald and Halpern, 2003, Grünwald, 2013). However, the insights of earlier papers were very partial and scattered, and the present paper presents for the first time a general formalism, definitions and a hierarchy. It is also the first one to make a connection to confidence distributions and pivots.

### 1.1 Informal Overview

Below we explain the basic ideas using three recurring examples. We assume that we are given a set $\mathcal{P}^*$ of distributions on some space of outcomes $\mathcal{Z}$. Under a frequentist interpretation, $\mathcal{P}^*$ is the set of distributions that we regard as ‘potentially true’; under a subjectivist interpretation, it is the credal set that describes our uncertainty or ‘beliefs’; all developments below work under both interpretations.

All probability distributions mentioned below are either an element of $\mathcal{P}^*$, or they are a pragmatic distribution $\tilde{P}$, which some decision-maker (DM) uses to predict the outcomes of some variable $U$ given the value of some other variable $V$, where both $U$ and $V$ are random quantities defined on $\mathcal{Z}$. $\tilde{P}$ is also used to estimate the quality of such predictions. $\tilde{P}$ (which may be, but is not always, in $\mathcal{P}^*$) is ‘pragmatic’ because we assume from the outset that some element of $\mathcal{P}^*$ might actually lead to better predictions; we just do not know which one.

Figure 1: A Hierarchy of Relations for $\tilde{P}$. The concepts on the right correspond (broadly) to existing notions, whose name is given on the left (with the exception of $U \mid \langle V \rangle$, for which no regular name seems to exist). $A \to B$ means that safety of $\tilde{P}$ for $A$ implies safety for $B$, at least under some conditions: for all solid arrows, this is proven under the assumption of $V$ with countable range (see underneath Proposition 1). For the dashed arrows, this is proven under additional conditions (see Theorem 2 and subsequent remark). On the right are shown transformations on $U$ under which safety is preserved, e.g. if $\tilde{P}$ is calibrated for $U \mid V$ then it is also calibrated for $U' \mid V$ for every $U'$ with $U \rightsquigarrow U'$ (see remark underneath Theorem 2). Weakening the conditions for the proofs and providing more detailed interrelations is a major goal for future work, as well as investigating whether the hierarchy has a natural place for causal notions, such as $\tilde{P}(U \mid \text{do}(v))$ as in Pearl’s (2009) do-calculus.

###### Example 1

[Dilation] A DM has to make a prediction or decision about random variable $U$ given the value of $V$. She knows that the marginal probability $P(U=1)=0.9$; she suspects that $U$ may depend on $V$, but has no idea whether $U$ and $V$ are positively or negatively correlated or how strong the correlation is. She may thus model her uncertainty as the set $\mathcal{P}^*$ of all distributions on $\mathcal{Z}$ that satisfy

$$P(U=1)=\sum_{v\in\mathcal{V}}P(U=1,V=v)=0.9. \qquad (1)$$

Given that , what should she predict for ? A standard answer in imprecise probability (Walley, 1991) is to pointwise condition the set , leading one to adopt the probabilities . But this set contains every distribution on , including (the latter would obtain for the with ). It therefore seems that, after observing , the DM has lost rather than gained information. By symmetry, the same happens after observing , so whatever DM observes, she loses information — a phenomenon known as dilation (Seidenfeld and Wasserman, 1993). This is intuitively disturbing, and it may perhaps be better to simply ignore and predict using the distribution that acts as if and has

$$\tilde{P}(U=1\mid V=v)=P(U=1)\quad \text{for all } v\in\mathcal{V}, \qquad (2)$$

i.e. $\tilde{P}(U=1\mid V=v)=0.9$. While from a purely subjective Bayesian standpoint information is never useless and this seems silly, it is certainly what humans often do in practice, and usually they get away with it (Dempster, 1968); for concrete examples see Grünwald and Halpern (2004). Here is where Safe Probability comes in: it tells us that $\tilde{P}$ is safe to use in the following simple sense. For any function $g$ of $U$, we have:

$$\text{for all } P\in\mathcal{P}^*,\ \text{all } v\in\mathcal{V}:\quad \mathbf{E}_{U\sim P}[g(U)]=\mathbf{E}_{U\sim\tilde{P}}[g(U)\mid V=v]. \qquad (3)$$

In particular, if we have a loss function mapping outcomes and actions to associated losses, then, for any action, we can take the loss incurred by that action, viewed as a function of $U$, as the function $g$ above, and then we find that (assuming $\mathcal{P}^*$ contains the truth) DM’s predictions are guaranteed to be exactly as good, in expectation, as she would expect them to be if $\tilde{P}$ were actually ‘true’, even if $\tilde{P}$ is not true at all.
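To make the dilation example concrete, the following minimal sketch checks (1) and (3) by exact arithmetic on a handful of hand-picked members of the uncertainty set. It assumes, for illustration only, that $U$ and $V$ are both binary; the helper names (`make_joint`, `P_STAR_SAMPLE`, and so on) are hypothetical and not from the paper, and checking three distributions and three test functions is of course not a proof for the whole set.

```python
from fractions import Fraction as F

def make_joint(p_v1, p_u1_given_v0, p_u1_given_v1):
    """Joint P(U=u, V=v) on {0,1} x {0,1}, keyed by (u, v)."""
    p_v = {0: 1 - p_v1, 1: p_v1}
    p_u1 = {0: p_u1_given_v0, 1: p_u1_given_v1}
    return {(u, v): p_v[v] * (p_u1[v] if u == 1 else 1 - p_u1[v])
            for u in (0, 1) for v in (0, 1)}

# Three illustrative members of the set defined by (1): all have P(U=1) = 0.9.
P_STAR_SAMPLE = [
    make_joint(F(1, 2), F(9, 10), F(9, 10)),  # U independent of V
    make_joint(F(9, 10), F(0), F(1)),         # U = V
    make_joint(F(1, 10), F(1), F(0)),         # U = 1 - V
]

def marginal_u1(p):
    return p[(1, 0)] + p[(1, 1)]

def cond_u1_given_v(p, v):
    return p[(1, v)] / (p[(0, v)] + p[(1, v)])

def expect_g_true(p, g):
    """E_{U ~ P}[g(U)] under a 'true' joint P."""
    return sum(prob * g(u) for (u, _v), prob in p.items())

P_TILDE_U1 = F(9, 10)  # pragmatic conditional of (2): the same for every v

def expect_g_pragmatic(g, v):
    """E_{U ~ P-tilde}[g(U) | V=v]; v is ignored by construction."""
    return P_TILDE_U1 * g(1) + (1 - P_TILDE_U1) * g(0)

# Dilation: pointwise conditioning on V=1 spreads over the whole unit interval.
dilated = [cond_u1_given_v(p, 1) for p in P_STAR_SAMPLE]

# Safety in the sense of (3), checked for a few test functions g.
test_functions = [lambda u: u, lambda u: 1 - u, lambda u: 3 * u - 7]
safe = all(expect_g_true(p, g) == expect_g_pragmatic(g, v)
           for p in P_STAR_SAMPLE for g in test_functions for v in (0, 1))

print(dilated)
print(safe)
```

The conditionals in `dilated` differ wildly across the sample, while the expectations computed from the pragmatic distribution agree with the true ones for every sampled distribution.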

We immediately add though that if we had a loss function which would itself depend on (e.g. if DM is offered a different bet on than if ) then the based on ignoring is not safe any more — (3) may not hold any more, and the actual expectation may be different from DM’s. In terms of the formalism we develop below (Definition 1, 2 and 3), this will be expressed as ‘ is safe for predicting with loss function but not loss function ’, or, in formal notation, is safe for but not for . The intuitive meaning is that DM can safely use to make predictions against (her predictions will be as good as she expects) but not against . These statements will be immediate consequences of the more general statements ‘ is safe for but not safe for ’.

In some cases, we will not be able to come up with a $\tilde{P}$ satisfying (3), and we have to settle for a $\tilde{P}$ that satisfies a weaker notion of safety, such as, for all $P\in\mathcal{P}^*$, all functions $g$,

$$\mathbf{E}_{V\sim P}\big[\mathbf{E}_{U\sim\tilde{P}}[g(U)\mid V]\big]=\mathbf{E}_{U\sim P}[g(U)], \qquad (4)$$

which says that DM predicts as well on average as DM would expect to predict on average if were true, even though may not be true. This will be denoted as ‘ is safe for ’; and if (4) only holds for the identity (which makes no difference if , but in general it does) we have the even weaker safety for (Figure 1). In Section 2.2 we thus obtain five basic notions of safety, varying from weak safety, in an average sense, to very strong safety, safety for , which essentially means that must be the correct conditional distribution.
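As a hedged illustration of the gap between (3) and (4), the sketch below uses a toy setting that is an assumption of this example rather than something taken from the paper: binary $U$ and $V$, an uncertainty set in which $P(V=1)=1/2$ and $P(U=1)=0.9$ are known but the dependence is not, and a hypothetical pragmatic conditional with $\tilde{P}(U=1\mid V=0)=0.8$ and $\tilde{P}(U=1\mid V=1)=1$. On the sampled distributions, (4) holds while (3) fails.

```python
from fractions import Fraction as F

# Hypothetical pragmatic conditional: P-tilde(U=1 | V=v).
P_TILDE_U1_GIVEN_V = {0: F(4, 5), 1: F(1)}

def make_joint(p_u1_given_v0, p_u1_given_v1):
    """A 'true' joint on (u, v) with P(V=1) = 1/2 and the given conditionals."""
    half = F(1, 2)
    cond = {0: p_u1_given_v0, 1: p_u1_given_v1}
    return {(u, v): half * (cond[v] if u == 1 else 1 - cond[v])
            for u in (0, 1) for v in (0, 1)}

# Sampled members of the assumed set: P(V=1) = 1/2 and P(U=1) = 9/10.
P_STAR_SAMPLE = [
    make_joint(F(4, 5), F(1)),
    make_joint(F(1), F(4, 5)),
    make_joint(F(9, 10), F(9, 10)),
]

def true_expectation(p, g):
    return sum(pr * g(u) for (u, _v), pr in p.items())

def pragmatic_expectation(g, v):
    q = P_TILDE_U1_GIVEN_V[v]
    return q * g(1) + (1 - q) * g(0)

def averaged_pragmatic_expectation(p, g):
    # Left-hand side of (4): V is drawn from the true P, U from P-tilde given V.
    return sum(pr * pragmatic_expectation(g, v) for (_u, v), pr in p.items())

test_functions = [lambda u: u, lambda u: 1 - u, lambda u: 3 * u - 7]

eq4_holds = all(averaged_pragmatic_expectation(p, g) == true_expectation(p, g)
                for p in P_STAR_SAMPLE for g in test_functions)
eq3_holds = all(pragmatic_expectation(g, v) == true_expectation(p, g)
                for p in P_STAR_SAMPLE for g in test_functions for v in (0, 1))

print(eq4_holds, eq3_holds)
```

Here the pragmatic conditional is wrong for most members of the sample, yet its predictions are right on average over $V$, which is all that (4) asks for.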

In this example we used frequentist terminology, such as ‘correct’ and ‘true’, and we continue to do so in this paper. Still, a subjective interpretation remains valid in this and future examples as well: if the DM’s real beliefs are given by the full set $\mathcal{P}^*$, she can safely act as if her belief is represented by the singleton $\{\tilde{P}\}$ as long as she also believes that her loss does not depend on $V$.

###### Example 2

[Calibration] Consider the weather forecaster on your local television station. Every night the forecaster makes a prediction about whether or not it will rain the next day in the area where you live. She does this by asserting that the probability of rain is , where . How should we interpret these probabilities? The usual interpretation is that, in the long run, on those days on which the weather forecaster predicts probability , it will rain approximately of the time. Thus, for example, among all days for which she predicted , the fraction of days with rain was close to . A weather forecaster (DM) with this property is said to be calibrated (Dawid, 1982, Foster and Vohra, 1998). Like safety itself, calibration is a minimal requirement: for example, a weather forecaster who predicts, each day of the year, that the probability of rain tomorrow is will be approximately calibrated in the Netherlands, but her predictions are not very useful. It is easily seen that, when using a proper scoring rule, optimal forecasts are calibrated, but calibrated forecasts can be far from optimal. On the other hand, in practice we often see calibrated weather forecasters that predict well, but do not predict with anything close to the ‘truth’: their predictions depend on high-dimensional covariates consisting of measurements of air pressure, temperature etc. at numerous locations in the world, and it seems quite unlikely (and, for practical purposes, unnecessary!) that, given any specific values of these covariates, they issue the correct conditional distribution.

While calibration is usually defined relative to empirical data, a re-definition in terms of an underlying set of distributions is straightforward (Vovk et al., 2005, Grünwald and Halpern, 2011), and in Section 2.3 we show that the probabilistic definition of calibration has a natural expression in terms of the safety notions introduced above: is calibrated for if it is safe for , for some with (all notation to be explained) — which implies that (3) is itself an instance of calibration.
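The empirical reading of calibration described in this example can be sketched in a few lines. The data below are made up for illustration, the function names are hypothetical, and the check uses exact equality on a tiny sample where the text only asks for approximate agreement in the long run. The Brier (squared-error) score is used as one example of a proper scoring rule.

```python
from collections import defaultdict
from fractions import Fraction as F

def calibration_table(forecasts, outcomes):
    """Map each issued forecast p to the observed frequency of rain on days with forecast p."""
    counts = defaultdict(lambda: [0, 0])  # forecast -> [rainy days, total days]
    for p, rained in zip(forecasts, outcomes):
        counts[p][0] += rained
        counts[p][1] += 1
    return {p: F(rainy, total) for p, (rainy, total) in counts.items()}

def is_calibrated(forecasts, outcomes):
    """Exact calibration on this sample: forecast equals observed frequency."""
    return all(p == freq for p, freq in calibration_table(forecasts, outcomes).items())

def mean_brier_score(forecasts, outcomes):
    """Average squared error between forecast and outcome (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(outcomes)

# Made-up record of ten days: 1 = rain, 0 = no rain.
outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

# A forecaster who issues the base rate every day, and a sharper one.
constant_forecasts = [F(1, 2)] * 10
sharp_forecasts = [F(4, 5)] * 5 + [F(1, 5)] * 5

print(is_calibrated(constant_forecasts, outcomes), mean_brier_score(constant_forecasts, outcomes))
print(is_calibrated(sharp_forecasts, outcomes), mean_brier_score(sharp_forecasts, outcomes))
```

Both forecasters are calibrated on this record, but the sharper one obtains a lower score, in line with the remark that calibrated forecasts can be far from optimal.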

###### Example 3

[Bayesian, Fiducial and Confidence Distributions] We are given a parametric probability model where for some , each defines a probability density or mass function on data of sample size , each outcome taking a value in some space . The goal is to make inferences about , based on the data or some statistic thereof. In the common case with fixed and inference based on the full data, , we can transfer this statistical scenario to our setup by defining as a set of distributions on . RVs and are then defined as, for each , and . DM employs a set of prior distributions on , where each

induces a joint distribution

on with marginal on determined by and, given , density of given by , so that if has density , we get the joint density . We set to be the set of all such joint distributions. In the special case in which DM really is a subjective Bayesian who believes that a single prior captures all uncertainty, we have that contains just a single joint parameter-data distribution, and we are in the standard Bayesian scenario. Then DM can set , the standard posterior, and any type of inference about is safe relative to . Here we focus on another special case, in which contains exactly one density for each , namely the degenerate distribution putting all its mass on . We denote this distribution by and notice that then , with , and for any measurable set , determined by density , satisfying

$$p_{\theta}(x^n)=p_{\theta}(x^n\mid\Theta=\theta)=q_{\theta}(x^n).$$

Still, any choice of pragmatic distribution $\tilde{P}$ can be interpreted as a distribution on given the data , analogous to a Bayesian posterior. In Section 3 we investigate how one can construct distributions of this kind that are safe for inference about confidence intervals. For simplicity we restrict ourselves to the 1-dimensional case, for which we find that the construction we provide leads to $\tilde{P}$ that are confidence-safe, written in our notation as ‘safe for ’, with being the CDF (cumulative distribution function) of . Confidence safety is roughly the same as coverage (Sweeting, 2001): it means that the ‘true’ probability that is contained in a particular type of -credible sets (sets with ‘posterior’ probability given the data ), is equal to .

The $\tilde{P}$ we construct are essentially equivalent to the confidence distributions of Schweder and Hjort (2002), which were designed with the explicit goal of having good confidence properties; they also often coincide with Fisher’s 1930 fiducial distributions, which in later work (Fisher, 1935) he started treating as ordinary probability distributions that could be used without any restrictions. This cannot be right (see e.g. Hampel, 2006, page 514), but the question has always remained how a probability calculus for fiducial distributions could be derived that incorporates the right restrictions. Our work provides a step in this direction, in that we show how such distributions fit snugly into our general framework: confidence safety is a strictly weaker property than calibration, and again has a natural representation in terms of the notation mentioned above. Moreover, it is a special case of pivotal safety, which also has repercussions in quite different contexts (see Example 4).

The example illustrates two important points:

1. In some cases the literature suggests some method for constructing a pragmatic . An example is the latent Dirichlet allocation model (Blei et al., 2003) mentioned above, in which data are text corpora, , not explicitly given, is a complicated set of realistic distributions over under which data are non-i.i.d., and the literature suggests to take as the Bayesian posterior for a cleverly designed i.i.d. model.

2. In other cases, DM may want to construct a herself. In Example 1, the safe was obtained by replacing an (unknown) conditional distribution with a (known) marginal — a special case of what was called -conditioning by Grünwald and Halpern (2011). Marginal distributions and distributions that ignore aspects of play a more central role in this construction process: they also do in the confidence construction mentioned above, where one sets equal to a distribution such that , where is some auxiliary random variable (a pivot), becomes independent of . For the original RV though, in the dilation example, DM acts as if and are independent even though they may not be; in the confidence distribution example, DM acts in a ‘dual’ manner, namely as if and are dependent, even though under they are not — which is fine, as long as her conclusions are safe.

###### Example 4

[Event-Based Conditioning and Pivotal Safety via Monty Hall] More generally, we may look at safety for pragmatic distributions that condition on events rather than random variables. To illustrate, consider the Monty Hall Problem (vos Savant, 1990, Gill, 2011): suppose that you’re on a game show and given a choice of three doors. Behind one is a car; behind the others are goats. You pick door 1. Before opening door 1, Monty Hall, the host, opens one of the other two doors, say, door 3, which has a goat. He then asks you if you still want to take what’s behind door 1, or to take what’s behind door 2 instead. Should you switch? You may assume that initially, the car was equally likely to be behind each of the doors and that, after you go to door 1, Monty will always open a door with a goat behind it. Basically you observe one of two events, depending on which of the two remaining doors Monty opens. You can then calculate your optimal decision according to some distribution conditioned on the event you observed. Naive conditioning suggests a choice under which the two doors that remain closed are equally likely to hide the car, and it takes a long time to convince most people that this is wrong. But if DMs would adhere to safe probability, then no convincing and explanation would be needed: translation of the example into our ‘safety’ setting immediately shows, without any further thinking about the problem, that this naive choice is unsafe, under all notions of safety we consider! (Section 4).

Another aspect of the Monty Hall problem is that, in most analyses that are usually viewed as ‘correct’, one implicitly assumes that the quiz master flips a fair coin to decide whether to open door 2 or door 3 in the case where you choose door 1 and he has a choice. There have been heated discussions (e.g. on Wikipedia talk pages) about whether this assumption is justified. In Example 11 we show that the $\tilde{P}$ which assumes a fair coin flip by Monty is an instance of a pivotally safe pragmatic distribution. These have the property that for many loss functions (including 0/1-loss as in Monty Hall), they lead one to make optimal decisions. Thus, while assuming a fair coin flip may be wrong, it is still harmless to base one’s decisions upon it.
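The two claims of this example can be checked by exact enumeration. The sketch below assumes the contestant always picks door 1 and parametrizes Monty's behaviour by a hypothetical bias `q`, the probability that he opens door 2 when the car is behind door 1; all function names are illustrative and not from the paper. Reading the naive answer as "the car is behind door 1 with probability 1/2, whichever door is opened" is also an assumption of this sketch.

```python
from fractions import Fraction as F

def joint(q):
    """Exact joint over (car, opened) when the contestant picks door 1.

    q = P(Monty opens door 2 | car behind door 1); q = 1/2 is the fair coin.
    """
    third = F(1, 3)
    return {
        (1, 2): third * q,
        (1, 3): third * (1 - q),
        (2, 3): third,  # car behind door 2: Monty must open door 3
        (3, 2): third,  # car behind door 3: Monty must open door 2
    }

def p_win_by_switching(q):
    # Switching wins exactly when the car is not behind the door first picked.
    return sum(pr for (car, _opened), pr in joint(q).items() if car != 1)

def p_car_behind_1_given_opened(q, opened):
    p = joint(q)
    total = sum(pr for (_car, o), pr in p.items() if o == opened)
    return p[(1, opened)] / total

biases = [F(0), F(1, 4), F(1, 2), F(3, 4), F(1)]

# The naive pragmatic answer: 1/2 for door 1 after either door is opened.
naive_stay = F(1, 2)
true_stay = [1 - p_win_by_switching(q) for q in biases]
naive_matches_truth = any(naive_stay == t for t in true_stay)

# Under 0/1-loss, switching is optimal whenever P(car behind 1 | opened) <= 1/2.
switch_always_optimal = all(
    p_car_behind_1_given_opened(q, opened) <= F(1, 2)
    for q in biases for opened in (2, 3)
)

print([p_win_by_switching(q) for q in biases])
print(naive_matches_truth, switch_always_optimal)
```

Since the naive conditional expectation equals 1/2 for both possible observations while the true probability of the car being behind door 1 is 1/3 for every bias, the true value does not even lie between the smallest and largest naive conditional values. The second check illustrates why deciding as if Monty flips a fair coin is harmless under 0/1-loss: the decision it recommends, switching, is optimal for every bias tried.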

#### Overview of the Paper

In Section 2, we treat the case of countable space $\mathcal{Z}$, defining the basic notions of safety in Section 2.2 (where we return to dilation), and showing how calibration can be cleanly expressed using our notions in Section 2.3. In Section 3 we extend the setting to general $\mathcal{Z}$, which is needed to handle the case of confidence safety (Section 3.1), pivots (Section 3.2) and squared error optimality, where we observe continuous-valued random variables. Section 4 briefly discusses non-numerical observations as well as probability updates that cannot be viewed as conditional distributions. We end with a discussion of further potential applications of safety as well as open problems. Proofs and further technical details are relegated to the appendix.

## 2 Basic Definitions for Discrete Random Variables

For simplicity, we introduce our basic notions only considering countable $\mathcal{Z}$, which allows us to sidestep measurability issues altogether. Thus below, $\mathcal{Z}$ is countable; we treat the general case in Section 3.

### 2.1 Concepts and notations regarding distributions on Z

We define a random variable (abbreviated to RV) to be any function for some

. Thus RVs can be multidimensional (i.e. what is usually called ‘random vector’). By an ‘

-valued RV’ or simply ‘generalized RV’ we mean any function mapping to an arbitrary set . For two RVs where and are 1-dimensional random variables, we define to be the RV with components .

For any generalized RVs and on and function we write if for all , . We write (“ determines ”, or equivalently “ is a coarsening of ”) if there is a function such that . We write if and . For two GRVs and we write if they define the same function on , and for a distribution we write if . We write if , and if there exists some for which this holds. Clearly implies that for all distributions on , , but not vice versa. Let be a function on . The range of , denoted , the support of under a distribution , and the range of given that another function on takes value , are denoted as

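$$\begin{aligned}
\text{range}(S) &:= \{s\in\mathcal{S}: s=S(z)\ \text{for some } z\in\mathcal{Z}\}\ ;\quad \text{supp}_P(S) := \{s\in\mathcal{S}: P(S=s)>0\},\\
\text{range}(S\mid T=t) &:= \{s\in\mathcal{S}: s=S(z)\ \text{for some } z\in\mathcal{Z} \text{ with } t=T(z)\}
\end{aligned} \qquad (5)$$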

where we note that $\text{supp}_P(S)\subseteq\text{range}(S)$, with equality if $P$ has full support.
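For a finite outcome space, the three sets in (5) can be computed directly. The following sketch uses a made-up four-point outcome space and hypothetical function names; distributions are represented as dictionaries from outcomes to probabilities.

```python
from fractions import Fraction as F

def range_of(S, Z):
    """range(S): all values S takes somewhere on Z."""
    return {S(z) for z in Z}

def support_of(S, P):
    """supp_P(S): values s with P(S = s) > 0, for P given as {z: probability}."""
    mass = {}
    for z, pr in P.items():
        mass[S(z)] = mass.get(S(z), 0) + pr
    return {s for s, m in mass.items() if m > 0}

def range_given(S, T, t, Z):
    """range(S | T = t): values S takes on outcomes z with T(z) = t."""
    return {S(z) for z in Z if T(z) == t}

# Made-up outcome space of pairs z = (u, v), with a P that puts no mass on v = 1.
Z = [(0, 0), (1, 0), (1, 1), (2, 1)]
P = {(0, 0): F(1, 2), (1, 0): F(1, 2), (1, 1): F(0), (2, 1): F(0)}

def S(z):
    return z[0]

def T(z):
    return z[1]

print(range_of(S, Z), support_of(S, P), range_given(S, T, 1, Z))
```

In this toy case the support is a strict subset of the range because `P` does not have full support.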

For a distribution on , and -valued RV , we write as short-hand to denote the distribution of under (i.e. is a probability measure).

We generally omit double brackets, i.e. if we write for RVs and , we really mean where is the RV ,

Any generalized RV that maps all to the same constant is called trivial, in particular the RV which maps all to . For an event , we define the indicator random variable to be if holds and otherwise.

#### Conditional Distributions as Generalized RVs

For given distribution on and generalized RVs and , we denote, for all , as the conditional distribution on given , in the standard manner. We further define to be the set of distributions on that can be arrived at from by conditioning on , for all supported by some .

We further denote, for all , as the conditional distribution of given , defined as the distribution on given by (whereas is defined as a distribution on , is a distribution on the more restricted space ).

Suppose DM is interested in predicting RV given RV and does this using some conditional distribution (usually this will be the ‘pragmatic’ , but the definition that follows holds generally). Adopting the standard convention for conditional expectation, we call any function from to the set of distributions on that coincides with for all a version of the conditional distribution . If we make a statement of the form ‘ satisfies …’, we really mean ‘every version of satisfies…’. We thus treat as a -valued random variable where , where, for all with , , and set to an arbitrary value otherwise.

#### Unique and Well-Definedness

Recall that DM starts with a set $\mathcal{P}^*$ of distributions on $\mathcal{Z}$ that she considers the right description of her uncertainty. She will predict some RV $U$ given some generalized RV $V$ using a pragmatic distribution $\tilde{P}$.

For RV and generalized RV , we say that, for given distribution on , is essentially uniquely defined (relative to ) if for all , (so that -almost surely takes value with ). We use this definition both for and for ; note that we always evaluate whether is uniquely defined under distributions in the ‘true’ though.

We say that is well-defined if, writing , and, , , we have, for , either with -probability 1, or with -probability 1. This is a very weak requirement that ensures that calculating expectations never involves the operation , making all expectations well-defined.

#### The Pragmatic Distribution ~P

We assume that DM makes her predictions based on a probability distribution on which we generally refer to as the pragmatic distribution. In practice, DM will usually be presented with a decision problem in which she has to predict some fixed RV based on some fixed RV , and then she is only interested in the conditional distribution , and for some other RVs and , may be left undefined. In other cases she only may want to predict the expectation of given — in that case she only needs to specify as a function of , and all other details of may be left unspecified. In Appendix A.1 we explain how to deal with such partially specified . In the main text though, for simplicity we assume that is a fully-specified distribution on ; DM can fill up irrelevant details any way she likes. The very goal of our paper being to restrict to making ‘safe’ predictions however, DM may come up with to predict given and there may be many RVs and definable on the domain such that has no bearing to and would lead to terrible predictions; as long as we make sure that is not used for such and — which we will — this will not harm the DM.

### 2.2 The Basic Notions of Safety

All our subsequent notions of ‘safety’ will be constructed in terms of the following first, simple definitions.

###### Definition 1

Let be an outcome space and be a set of distributions on , let be an RV and be a generalized RV on , and let be a distribution on . We say that is safe for (pronounced as ‘ is safe for predicting given ’), if

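$$\text{for all } P \in \mathcal{P}^*:\quad \inf_{v \in \operatorname{supp}_{\tilde{P}}(V)} \mathrm{E}_{\tilde{P}}[U \mid V = v] \;\le\; \mathrm{E}_{P}[U] \;\le\; \sup_{v \in \operatorname{supp}_{\tilde{P}}(V)} \mathrm{E}_{\tilde{P}}[U \mid V = v]. \tag{6}$$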

We say that is safe for , if

$$\text{for all } P \in \mathcal{P}^*:\quad \mathrm{E}_{P}[U] = \mathrm{E}_{P}\big[\mathrm{E}_{\tilde{P}}[U \mid V]\big]. \tag{7}$$

We say that is safe for , if (6) holds with both inequalities replaced by an equality, i.e. for all ,

$$\text{for all } P \in \mathcal{P}^*:\quad \mathrm{E}_{P}[U] = \mathrm{E}_{\tilde{P}}[U \mid V = v]. \tag{8}$$

In this definition, as in all definitions and results to come, whenever we write ‘ statement ’ we really mean ‘all conditional probabilities in the following statement are essentially uniquely defined, all expectations are well-defined, and statement ’. Hence, (7) really means ‘for all , is essentially uniquely defined, , , and are well-defined, and the latter two are equal to each other’. Also, when we wrote is safe for , we really meant that it is safe for relative to the given ; we will in general leave out the phrase ‘relative to ’, whenever this cannot cause confusion.

To be fully clear about notation, note that in double expectations like in (7), we consider the right random variable to be bound by the outer expectation; thus it can be rewritten in any of the following ways:

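$$
\begin{aligned}
\mathrm{E}_{U \sim P}[U] &= \mathrm{E}_{V \sim P}\, \mathrm{E}_{U \sim \tilde{P} \mid V}[U] \\
\mathrm{E}_{V \sim P}\, \mathrm{E}_{U \sim P \mid V}[U] &= \mathrm{E}_{V \sim P}\, \mathrm{E}_{U \sim \tilde{P} \mid V}[U] \\
\sum_{u \in \operatorname{range}(U)} P(U = u) \cdot u &= \sum_{v \in \operatorname{range}(V)} P(V = v) \cdot \sum_{u \in \operatorname{range}(U)} \tilde{P}(U = u \mid V = v) \cdot u,
\end{aligned}
$$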

where the second equality follows from the tower property of conditional expectation.
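On a finite outcome space, conditions (6), (7) and (8) can be checked by direct enumeration. The following is a minimal sketch, not part of the original development: the function names, the representation of a joint distribution as a dictionary, and all concrete numbers are assumptions made for illustration. A finite list stands in for the set $\mathcal{P}^*$, and a pragmatic distribution whose conditional expectation is undefined at some value of $V$ that a $P \in \mathcal{P}^*$ can produce is treated as failing (7).

```python
# Toy check of conditions (6), (7), (8) on a finite outcome space.
# A joint distribution is a dict mapping (u, v) -> probability.
# All names and numbers here are illustrative assumptions.

def marginal_v(joint):
    """Marginal distribution of V as a dict v -> probability."""
    out = {}
    for (_, v), p in joint.items():
        out[v] = out.get(v, 0.0) + p
    return out

def support_v(joint):
    """Values of V that have positive probability under `joint`."""
    return [v for v, p in marginal_v(joint).items() if p > 0]

def expectation_u(joint):
    """E[U] under `joint`."""
    return sum(p * u for (u, _), p in joint.items())

def cond_exp_u(joint, v):
    """E[U | V = v] under `joint`; v must lie in the support of V."""
    pv = marginal_v(joint)[v]
    return sum(p * u for (u, w), p in joint.items() if w == v) / pv

def satisfies_6(p_star, p_tilde, tol=1e-9):
    """Condition (6): E_P[U] lies between the smallest and largest prediction."""
    preds = [cond_exp_u(p_tilde, v) for v in support_v(p_tilde)]
    return all(min(preds) - tol <= expectation_u(P) <= max(preds) + tol
               for P in p_star)

def satisfies_7(p_star, p_tilde, tol=1e-9):
    """Condition (7): outer expectation over V ~ P, inner over U ~ p_tilde | V."""
    supp = set(support_v(p_tilde))
    for P in p_star:
        pv = marginal_v(P)
        # The inner conditional expectation must be defined wherever P puts mass.
        if any(p > 0 and v not in supp for v, p in pv.items()):
            return False
        rhs = sum(p * cond_exp_u(p_tilde, v) for v, p in pv.items() if p > 0)
        if abs(expectation_u(P) - rhs) > tol:
            return False
    return True

def satisfies_8(p_star, p_tilde, tol=1e-9):
    """Condition (8): every prediction equals E_P[U], for every P."""
    preds = [cond_exp_u(p_tilde, v) for v in support_v(p_tilde)]
    return all(abs(expectation_u(P) - m) <= tol for P in p_star for m in preds)

# Binary U and V; every P below has E_P[U] = 0.5 but a different dependence on V.
P_STAR = [
    {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25},  # independent
    {(0, 0): 0.5, (1, 1): 0.5},                                # U = V
    {(0, 1): 0.5, (1, 0): 0.5},                                # U = 1 - V
    {(0, 0): 0.4, (1, 0): 0.4, (0, 1): 0.1, (1, 1): 0.1},      # skewed V
]
P_TILDE_IGNORE = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
P_TILDE_COPY = {(0, 0): 0.5, (1, 1): 0.5}  # predicts U = V with certainty

if __name__ == "__main__":
    for name, pt in [("ignore", P_TILDE_IGNORE), ("copy", P_TILDE_COPY)]:
        print(name, satisfies_6(P_STAR, pt), satisfies_7(P_STAR, pt),
              satisfies_8(P_STAR, pt))
```

With these toy numbers, the pragmatic distribution that ignores $V$ passes all three checks, whereas the one that copies $V$ passes only (6): its predictions 0 and 1 bracket the true expectation 0.5, but they average to 0.2 under the distribution with skewed $V$.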

#### Towards a Hierarchy

It is immediately seen that, if is safe for , then it is also safe for , and if it is safe for , then it is also safe for . Safety for is thus the weakest notion — it allows a DM to give valid upper- and lower-bounds on the actual expectation of , by quoting and , respectively, but nothing more. It will hardly be used here, except for a remark below Theorem 2; it plays an important role though in applications of safety to hypothesis testing, on which we will report in future work.

Safety for evidently bears relations to unbiased estimation: if is safe for , i.e. (7) holds, then we can think of as an unbiased estimate, based on observing , of the random quantity (see also Example 8 later on). Safety for implies that all distributions in agree on the expectation of and that is the same for (essentially) all values of , and is thus a much stronger notion.

###### Example 5

[Dilation: Example 1, Cont.] The first application of definition (7) was already given in Example 1, where we used a that ignored and was safe for and , as we see from (4) with the identity. Let us extend the example, replacing in that example by , with again defined as the set of all distributions satisfying (1) and defined by, for , , . Then would still be safe for , but not for : contains a distribution whose marginal distribution , and (7) would not hold for that distribution.

Comparing the ‘safety condition’ (4) in Example 1 to (7) in Definition 1 we see that Definition 1 only imposes a requirement on expectations of whereas (4) imposed a requirement also on RVs equal to functions of . For with more than two elements as in Example 5 above, such a requirement is strictly stronger. We now proceed to define this stronger notion formally.

###### Definition 2

Let , and be as above. We say that is safe for if for all RVs with , is safe for .
Similarly, is safe for if for all RVs with , is safe for , and is safe for if for all RVs with , is safe for .

We see that safety of for implies that is the same for all values of in the support of , and all functions of . This can only be the case if ignores , i.e. , for all supported . We must then also have, for all , that , which means that all distributions in agree on the marginal distribution of , and is equal to this marginal distribution. Thus, is safe for iff it is marginally valid. A prime example of such a that ignores and is marginally correct is the we encountered in Example 1.
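The gap between a requirement on the expectation of $U$ alone and a requirement on all functions of $U$ can be made concrete when $U$ takes three values. The sketch below is a hypothetical illustration (names and numbers are assumptions, not taken from the examples in the text). It checks the (8)-type condition for a given function $f$ of $U$. Because both sides of (8) are linear in $f$, on a finite range it suffices to test the indicator functions of the individual values of $U$.

```python
# Toy illustration: agreement on E[U] does not imply agreement on E[f(U)].
# A joint distribution is a dict mapping (u, v) -> probability.
# All names and numbers here are illustrative assumptions.

def expect(joint, f):
    """E[f(U)] under `joint`."""
    return sum(p * f(u) for (u, _), p in joint.items())

def cond_expect(joint, f, v):
    """E[f(U) | V = v] under `joint`; v must have positive probability."""
    pv = sum(p for (_, w), p in joint.items() if w == v)
    return sum(p * f(u) for (u, w), p in joint.items() if w == v) / pv

def safe_for_function(p_star, p_tilde, f, tol=1e-9):
    """(8)-type check for U' = f(U): E_P[f(U)] = E_ptilde[f(U) | V = v] for all P, v."""
    vs = {v for (_, v), p in p_tilde.items() if p > 0}
    return all(abs(expect(P, f) - cond_expect(p_tilde, f, v)) <= tol
               for P in p_star for v in vs)

def safe_for_all_functions(p_star, p_tilde, u_values, tol=1e-9):
    """Checks every indicator 1{U = a}; by linearity this covers all f on a finite range."""
    return all(
        safe_for_function(p_star, p_tilde, lambda u, a=a: float(u == a), tol)
        for a in u_values
    )

U_VALUES = (0, 1, 2)
# Pragmatic distribution: U uniform on {0, 1, 2}, independent of a fair binary V.
P_TILDE = {(u, v): 1.0 / 6.0 for u in U_VALUES for v in (0, 1)}
# A candidate 'true' distribution with the same mean of U but a different marginal.
P_SAME_MEAN = {(0, 0): 0.5, (2, 0): 0.5}
P_STAR = [P_TILDE, P_SAME_MEAN]

if __name__ == "__main__":
    print(safe_for_function(P_STAR, P_TILDE, lambda u: u))
    print(safe_for_all_functions(P_STAR, P_TILDE, U_VALUES))
```

Here both distributions in the toy set give $U$ expectation 1, so the check passes for the identity function, but it fails for the indicator of $U = 1$, which has expectation 0 under the second distribution and 1/3 under the pragmatic one. This matches the observation above that the stronger notion forces agreement on the whole marginal distribution of $U$.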

To get everything in place, we need a final definition.

###### Definition 3

Let , and be as above, and let be another generalized RV.

1. We say that is safe for if for all , is safe for relative to . We say that is safe for if for all RVs with , is safe for .

2. The same definitions apply with replaced by and .

3. We say that is safe for if it is safe for ; it is safe for if it is safe for .

These definitions simply say that safety for ‘’ means that the space can be partitioned according to the value taken by , and that for each element of the partition (indexed by ) one has ‘local’ safety given that one is in that element of the partition.

Proposition 1 gives reinterpretations of some of the notions above. The first one, (9), will mostly be useful for the proof of other results; the other three serve to make the original definitions more transparent:

###### Proposition 1

[Basic Interpretations of Safety] Consider the setting above. We have:

1. is safe for iff for all , there exists a distribution on with for all , , that satisfies

   $$P'(U) = P(U). \tag{9}$$
2. is safe for iff for all ,

   $$\mathrm{E}_{P}[U \mid V] =_{P} \mathrm{E}_{\tilde{P}}[U \mid V]. \tag{10}$$
3. is safe for iff for all ,

   $$P(U \mid V) =_{P} \tilde{P}(U \mid V). \tag{11}$$
4. is safe for iff for all ,

   $$P(U \mid W) =_{P} \tilde{P}(U \mid V, W). \tag{12}$$

Together with the preceding definitions, this proposition establishes the arrows in Figure 1 from to , from to and from to . The remaining arrows will be established by Theorems 1 and 2.

Note that (12) says that is safe for if ignores given , i.e. according to , is conditionally independent of given . Thus, can be safe for and still may depend on ; the definition only requires that is ignored once is given.

(11) effectively expresses that is valid (a frequentist might say ‘true’) for predicting based on observing , where as always we assume that itself correctly describes our beliefs or potential truths (in particular, if is a singleton, then any which coincides a.s. with is automatically valid). Thus, ‘validity for ’, to be interpreted as ‘ is a valid distribution to use when predicting given observations of ’, is a natural name for safety for . We also have a natural name for safety for : for 1-dimensional , (10) simply expresses that all distributions in agree on the conditional expectation of , and that is a version of it, which implies (see e.g. Williams (1991)) that, with the function ,

$$\mathrm{E}_{(U,V) \sim P}\big[(U - g(V))^2\big] = \min_{f} \mathrm{E}_{(U,V) \sim P}\big[(U - f(V))^2\big], \tag{13}$$

the minimum being taken over all functions from to . This means that encodes the optimal regression function for given and hence suggests the name squared-error optimality. Summarizing the names we encountered (see Figure 1):

###### Definition 4

[(Potential) Validity, Squared Error-Optimality, Unbiasedness, Marginal Validity] If is safe for , i.e. (11) holds for all , then we also call valid for (again, pronounce as ‘valid for predicting given ’). If (11) holds for some , we call potentially valid for . If is safe for , we call squared error-optimal for . If is safe for , we call unbiased for . If is safe for , we say that it is marginally valid for .
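The squared-error property (13) behind the name ‘squared error-optimal’ is easy to verify numerically on a finite space. The following sketch uses a made-up joint distribution $P$ and hypothetical function names. It computes the regression function $v \mapsto \mathrm{E}_P[U \mid V = v]$ and compares its mean squared error with that of perturbed predictors. Under (10), the conditional expectation supplied by the pragmatic distribution coincides with this regression function $P$-almost surely, so it attains the same minimum.

```python
# Toy check of (13): the conditional expectation minimizes squared error.
# A joint distribution is a dict mapping (u, v) -> probability.
# All names and numbers here are illustrative assumptions.

def regression_function(joint):
    """Return g with g[v] = E[U | V = v] for every v of positive probability."""
    num, den = {}, {}
    for (u, v), p in joint.items():
        num[v] = num.get(v, 0.0) + p * u
        den[v] = den.get(v, 0.0) + p
    return {v: num[v] / den[v] for v in den if den[v] > 0}

def mean_squared_error(joint, f):
    """E[(U - f(V))^2] under `joint`, for a predictor f given as a dict v -> value."""
    return sum(p * (u - f[v]) ** 2 for (u, v), p in joint.items())

def perturbed(f, v, delta):
    """Copy of predictor f with its value at v shifted by delta."""
    g = dict(f)
    g[v] = g[v] + delta
    return g

P = {(0, 0): 0.3, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.4}
G = regression_function(P)
BEST = mean_squared_error(P, G)

if __name__ == "__main__":
    print("regression function:", G)
    print("minimal squared error:", BEST)
    for v in G:
        for delta in (-0.2, -0.05, 0.05, 0.2):
            worse = mean_squared_error(P, perturbed(G, v, delta))
            print(v, delta, worse > BEST)
```

Shifting the predictor at a single value $v$ by $\delta$ raises the squared error by exactly $P(V = v)\,\delta^2$, which is why every perturbation in the loop is strictly worse.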

It turns out that there also is a natural name for safety for whenever . The next example reiterates its importance, and the next section will provide the name: calibration.

###### Example 6

Suppose is safe for . From Proposition 1, (12) we see that this means that for all , all , that

$$\mathrm{E}_{P}[U' \mid V_2 = v_2] = \mathrm{E}_{\tilde{P}}[U' \mid V_1 = v_1, V_2 = v_2]. \tag{14}$$

The special case with has already been encountered in Example 1, (3). As discussed in that example, for , (14) expresses our basic interpretation of safety that predictions based on will always be as good, in expectation, as the DM who uses expects them to be. Clearly this continues to be the case if (14) holds for some nontrivial .

### 2.3 Calibration Safety

In this section, we show that calibration, as informally defined in Example 2, has a natural formulation in terms of our safety notions. We first define calibration formally, and then, in our first main result, Theorem 1, show how being calibrated for predicting based on observing is essentially equivalent to being safe for for some types of that need not be equal to itself, including . Thus, we now effectively unify the ideas underlying Example 1 (dilation) and Example 2 (calibration).

Following Grünwald and Halpern (2011) we define calibration directly in terms of distributions rather than empirical data, in the following way:

###### Definition 5

[Calibration]  Let , , , and be as above. We say that is calibrated (or calibration–safe) for if for all , all ,

$$\mathrm{E}_{P}\big[U \mid \mathrm{E}_{\tilde{P}}[U \mid V] = \mu\big] = \mu. \tag{15}$$

We say that is calibrated for if for all , all ,

$$P\big(U \mid \tilde{P}(U \mid V) = p\big) = p. \tag{16}$$

Hence, calibration (for ) means that given that a DM who uses predicts a specific distribution for , the actual distribution is indeed equal to the predicted distribution. Note that here we once again treat as a generalized RV.

In practice we would want to weaken Definition 5 to allow some slack, requiring the (viz. ) inside the conditioning to be only within some of the (viz. ) outside, but the present idealized definition is sufficient for our purposes here. Note also that the definition refers to a simple form of calibration, which does not involve selection rules based on past data such as used by, e.g., Dawid (1982).
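For finite ranges, the idealized condition (16) can be checked by grouping the values of $V$ according to the distribution that the pragmatic distribution predicts for $U$, and then comparing each predicted distribution with the actual conditional distribution of $U$ on that group. The sketch below is a toy illustration under assumed names and numbers. It also includes a direct check of the pointwise condition (11) for comparison, and it assumes that every $(u, v)$ pair appears as a key of the joint distribution.

```python
# Toy check of calibration (16) versus validity (11) on a finite space.
# A joint distribution P is a dict (u, v) -> probability containing every pair.
# p_tilde_cond[v] is the predicted distribution of U given V = v, as a tuple
# aligned with U_VALUES. All names and numbers are illustrative assumptions.
from collections import defaultdict

U_VALUES = (0, 1)

def is_calibrated(p_star, p_tilde_cond, tol=1e-9):
    """(16): given that distribution p is predicted, U is distributed as p."""
    for P in p_star:
        # mass[p][u] = P(U = u and the prediction equals p)
        mass = defaultdict(lambda: defaultdict(float))
        for (u, v), prob in P.items():
            key = tuple(round(q, 12) for q in p_tilde_cond[v])
            mass[key][u] += prob
        for key, by_u in mass.items():
            total = sum(by_u.values())
            if total <= 0:
                continue  # this prediction is never issued under P
            for i, u in enumerate(U_VALUES):
                if abs(by_u[u] / total - key[i]) > tol:
                    return False
    return True

def is_valid(p_star, p_tilde_cond, tol=1e-9):
    """(11): P(U | V = v) equals the prediction for every v with positive mass."""
    for P in p_star:
        pv = defaultdict(float)
        for (_, v), prob in P.items():
            pv[v] += prob
        for (u, v), prob in P.items():
            predicted = p_tilde_cond[v][U_VALUES.index(u)]
            if pv[v] > 0 and abs(prob / pv[v] - predicted) > tol:
                return False
    return True

# 'True' distribution: P(V) = (0.25, 0.25, 0.5) and
# P(U = 1 | V = 0) = 0.1, P(U = 1 | V = 1) = 0.3, P(U = 1 | V = 2) = 0.8.
P = {
    (0, 0): 0.225, (1, 0): 0.025,
    (0, 1): 0.175, (1, 1): 0.075,
    (0, 2): 0.1, (1, 2): 0.4,
}
# Predictions lump V = 0 and V = 1 together: P~(U = 1 | V) = 0.2, 0.2, 0.8.
P_TILDE_COND = {0: (0.8, 0.2), 1: (0.8, 0.2), 2: (0.2, 0.8)}
# A second candidate that mispredicts on V = 2.
P_TILDE_OFF = {0: (0.8, 0.2), 1: (0.8, 0.2), 2: (0.5, 0.5)}

if __name__ == "__main__":
    print(is_calibrated([P], P_TILDE_COND), is_valid([P], P_TILDE_COND))
    print(is_calibrated([P], P_TILDE_OFF), is_valid([P], P_TILDE_OFF))
```

In this toy setting the first candidate is calibrated without being valid: whenever it predicts probability 0.2 for $U = 1$, the actual frequency is 0.2 on average over $V \in \{0, 1\}$, although the actual conditional probabilities at $V = 0$ and $V = 1$ are 0.1 and 0.3. The grouping key is rounded to 12 decimals purely as a practical device for comparing floating-point tuples.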

We now express calibration in terms of our safety notions. We will only do this for the ‘full distribution’–version (16); a similar result can be established for the average-version.

###### Theorem 1

Let and be as above. The following three statements are equivalent:

1. is calibrated for ;

2. There exists an RV on with such that is safe for ;

3. is safe for where is the generalized RV given by .

Note that, since safety for implies safety for for , (2.) ⇒ (1.) shows that safety for implies calibration for . By mere definition chasing (details omitted) one also finds that (2.) implies that is safe for and, again by definition chasing, that is safe for . Thus, this result establishes two more arrows of the hierarchy of Figure 1. Its proof is based on the following simple result, interesting in its own right:

###### Proposition 2

Let and be generalized RVs such that for some function . The following statements are equivalent:

1. ignores , i.e. .

2. For all , for all with : .

3. and ignores , where .

Moreover, if is safe for and ignores , then is safe for .

## 3 Continuous-Valued U and V; Confidence and Pivotal Safety

Our definitions of safety were given for countable , making all random variables involved have countable range as well. Now we allow general and hence continuous-valued and general uncountable as well, but we consider a version of safety in which we do not have safety for itself, but for for some with such that the range of is still countable. To make this work we have to equip with an appropriate σ-algebra and have to add to the definition of an RV that it must be measurable (footnote 6: formally we assume that is equipped with some σ-algebra that contains all singleton subsets of ; we associate the co-domain of any function with the standard Borel σ-algebra on , and we call such an RV whenever the σ-algebra on is such that the function is measurable), and we have to modify the definition of support to the standard measure-theoretic definition (which specializes to our definition (2.1) whenever there exists a countable such that ). Yet nothing else changes and all previous definitions and propositions can still be used. Footnote 7: If we were to consider safety of the form for uncountable