We formalize the idea of probability distributions that lead to reliable predictions about some, but not all, aspects of a domain. Very broadly speaking, we call a distribution $\tilde{P}$ safe for predicting random variable $U$ given random variable $V$ if predictions concerning $U$ based on $\tilde{P}(U \mid V)$ tend to be as good as one would expect them to be if $\tilde{P}$ were an accurate description of one’s uncertainty, even if $\tilde{P}$ may not represent one’s actual beliefs, let alone the truth. Our formalization of this notion of ‘safety’ has repercussions for the foundations of statistics, providing a joint perspective on issues hitherto viewed as distinct:
1. All models are wrong… (Footnote 1: …yet some are useful, as famously remarked by Box (1979).)
Some statistical models are evidently both entirely wrong and very useful. For example, in some highly successful applications of Bayesian statistics, such as latent Dirichlet allocation for topic modeling (Blei et al., 2003), one assumes that natural language text is i.i.d., which is fine for the task at hand (topic modeling), yet no-one would want to use these models for predicting the next word of a text given the past. Still, one can use a Bayesian posterior to make such predictions anyway: Bayesian inference has no mechanism to distinguish between ‘safe’ and ‘unsafe’ inferences. Safe probability allows us to impose such a distinction.
2. The Eternal Discussion (Footnote 2: When the single-vs. multiple-prior issue came up in a discussion on the decision-theory forum mailing list, the well-known economist I. Gilboa referred to it as ‘the eternal discussion’.)
More generally, representing uncertainty by a single distribution, as is standard in Bayesian inference, implies a willingness to make definite predictions about random variables that, some claim, one really knows nothing about. Disagreement on this issue goes back at least to Keynes (1921) and Ramsey (1931), and has led many economists to sympathize with multiple-prior models (Gilboa and Schmeidler, 1989) and some statisticians to embrace the related imprecise probability (Walley, 1991, Augustin et al., 2014), in which so-called ‘Knightian’ uncertainty is modeled by a set of distributions. But imprecise probability is not without problems of its own, an important one being dilation (Example 1 below). Safe probability can be understood as starting from a set $\mathcal{P}^*$, but then mapping the set of distributions to a single distribution, where the mapping invoked may depend on the prediction task at hand, thus avoiding both dilation and overly precise predictions. The use of such mappings has been advocated before, under the name pignistic transformation (Smets, 1989, Hampel, 2001), but a general theory for constructing and evaluating them has been lacking (see also Section 5).
3. Fisher’s Biggest Blunder (Footnote 3: While Fisher is generally regarded as (one of) the greatest statisticians of all time, fiducial inference is often considered to be his ‘big blunder’; see Hampel (2006) and Efron (1996), who writes ‘Maybe Fisher’s biggest blunder will become a big hit in the 21st century!’)
Fisher (1930) introduced fiducial inference, a method to come up with a ‘posterior’ on a model’s parameter space based on data, but without anything like a ‘prior’, in an approach to statistics that was neither Bayesian nor frequentist. The approach turned out to be problematic, however, and, despite progress on related structural inference (Fraser, 1968, 1979), was largely abandoned. Recently, however, fiducial distributions have made a comeback (Hannig, 2009, Taraldsen and Lindqvist, 2013, Martin and Liu, 2013, Veronese and Melilli, 2015), in some instances with a more modest, frequentist interpretation as confidence distributions (Schweder and Hjort, 2002, 2016). As noted by Xie and Singh (2013), these ‘contain a wealth of information for inference’, e.g. for determining valid confidence intervals and for unbiased estimation of the median, but their interpretation remains difficult, viz. the insistence by Hampel (2006), Xie and Singh (2013) and many others that, although a confidence distribution is defined as a distribution on the parameter space, the parameter itself is not random. Safe probability offers an alternative perspective, where the insistence that ‘the parameter is not random’ is replaced by the weaker (and perhaps liberating) statement that ‘we can treat the parameter as random as long as we restrict ourselves to safe inferences about it’. In Section 3.1 we determine precisely what these safe inferences are and how they fit into a general hierarchy:
4. The Hierarchy
Pursuing the idea that some distributions are reliable for a smaller subset of random variables/prediction tasks than others leads to a natural hierarchy of safeties, a first taste of which is given in Figure 1 on page 1, with notations explained later. At the top are distributions that are fully reliable for whatever task one has in mind; at the bottom those that are reliable only for a single task in a weak, average sense. In between there is a natural place for distributions that are calibrated (Example 2 below), that are confidence-safe (i.e. valid confidence distributions) and that are optimal for squared-error prediction.
5. “The concept of a conditional probability with regard to an isolated hypothesis…” (Footnote 4: “…whose probability equals 0 is inadmissible,” as remarked by Kolmogorov (1933). As will be seen, safe probability suggests an even more radical statement related to the Monty Hall sanity check.)
Upon first hearing of the Monty Hall (quiz master, three doors) problem (vos Savant, 1990, Gill, 2011), most people naively think that the probability of winning the car is the same whether one switches doors or not. Most can eventually, after much arguing, be convinced that this is wrong, but wouldn’t it be nice to have a simple sanity check that immediately tells you that the naive answer must be wrong, without even pondering the ‘right’ way to approach the problem? Safe probability provides such a check: one can immediately tell that the naive answer is not safe, and thus cannot be right. Such a check is applicable more generally, whenever conditioning on events rather than on random variables (Example 4 and Section 4).
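The claim that the naive answer is wrong can be checked empirically. The following is a minimal simulation sketch, not the safety check described above: it assumes the standard version of the puzzle, in which the host always opens a door hiding no car and chooses uniformly at random when two such doors are available. The function names `play` and `win_rate` are illustrative only.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car
    # (assumption: uniform choice when two such doors exist).
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the single remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, rounds=100_000, seed=0):
    """Fraction of simulated rounds in which the contestant wins."""
    rng = random.Random(seed)
    return sum(play(switch, rng) for _ in range(rounds)) / rounds

if __name__ == "__main__":
    print("stay:  ", win_rate(switch=False))
    print("switch:", win_rate(switch=True))
```

Under these assumptions the two empirical win rates should come out close to 1/3 and 2/3 respectively, rather than being equal.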
6. “Could Neyman, Jeffreys and Fisher have agreed on testing?” (Footnote 5: …as asked by Jim Berger (2003).)
Ryabko and Monarev (2005) show that sequences of 0s and 1s produced by standard random number generators can be substantially compressed by standard data compression algorithms such as rar or zip. While this is clear evidence that such sequences are not random, this method is neither a valid Neyman-Pearson hypothesis test nor a valid Bayesian test (in the tradition of Jeffreys). The reason is that both these standard paradigms require the existence of an alternative statistical model, and start out from the assumption that, if the null model (i.i.d. Bernoulli(1/2)) is incorrect, then the alternative must be correct. However, there is no clear sense in which zip could be ‘correct’ (see Section 5). There is a third testing paradigm, due to Fisher, which does view testing as accumulating evidence against the null $H_0$, and not necessarily as confirming some precisely specified alternative $H_1$. Yet Fisher’s paradigm is not without serious problems either (see Section 5).
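To make the compression idea concrete, here is a small sketch using Python's standard `zlib` module. It is not the procedure of Ryabko and Monarev (2005); the deliberately flawed generator `weak_generator_bits` (the lowest bit of a power-of-two-modulus linear congruential generator) and all function names are hypothetical choices made for illustration.

```python
import random
import zlib

def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, 8 bits per byte."""
    out = bytearray()
    for i in range(0, len(bits) - len(bits) % 8, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)

def compression_ratio(bits):
    """Compressed size divided by packed size (smaller means more regular)."""
    raw = pack_bits(bits)
    return len(zlib.compress(raw, 9)) / len(raw)

def weak_generator_bits(n, seed=1):
    """Hypothetical flawed generator: lowest bit of an LCG with modulus 2**31."""
    x, bits = seed, []
    for _ in range(n):
        x = (1103515245 * x + 12345) % 2**31
        bits.append(x & 1)  # with odd multiplier and increment this bit alternates
    return bits

def reference_bits(n, seed=0):
    """Bits from Python's default generator, used here as a comparison."""
    rng = random.Random(seed)
    return [rng.getrandbits(1) for _ in range(n)]

if __name__ == "__main__":
    print("weak:     ", compression_ratio(weak_generator_bits(80_000)))
    print("reference:", compression_ratio(reference_bits(80_000)))
```

A ratio far below 1 is evidence against the i.i.d. Bernoulli(1/2) null, yet nothing in the computation specifies an alternative model that the compressor would have to be ‘correct’ about.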
Berger et al. (1994) started a line of work culminating in Berger (2003), who presents tests that have interpretations in all three paradigms and that avoid some of the problems of their original implementations. However, it is essentially an objective Bayes approach and thus, inevitably, strong evidence against $H_0$ implies a high posterior probability that $H_1$ is true. If one is really doing Fisherian testing, this is unwanted. Using the idea of safety, we can extend Berger’s paradigm by stipulating the inferences for which we think it is safe: roughly speaking, if we are in a Fisherian set-up, then we declare all inferences conditional on $H_1$ to be unsafe, and inferences conditional on $H_0$ to be safe; if we really believe that $H_1$ may represent the state of the world, we can declare inferences conditional on $H_1$ to be safe. But much more is possible using safe probability: a decision-maker (DM) can decide, on a case by case basis, what inferences based on her tests would be safe, and under what situations the test results themselves are safe. For example, some tests remain safe under optional stopping, whereas others (even Bayesian ones!) do not. While we will report on this application of safety (which comprises a long paper in itself) elsewhere, we will briefly return to it in the conclusion.
7. Further Applications: Objective Bayes, Epistemic Probability
Apart from the applications above, the results in this paper suggest that safe probability be used to formalize the status of default priors in objective Bayesian inferences, and to enable an alternative look at epistemic probability. But this remains a topic for future work, to which we briefly return at the end of the paper.
Imagine a world in which one would require any statistical analysis, whether it be testing, prediction, regression, density estimation or anything else, to be accompanied by a safety statement. Such a statement should list what inferences, the analysts think, can be safely made based on the conclusion of the analysis, and in what formal ‘safety’ sense. Is the alternative $H_1$ really true even though $H_0$ is found to be false? Is the suggested predictive distribution valid or merely calibrated? Is the posterior really just good for making predictions via the predictive distribution, or is it confidence-safe, or is it generally safe? Does the inferred regression function only work well on covariates drawn randomly from the same distribution, or also under covariate shift (an application of safety we did not address here but which we can easily incorporate)? The present, initial formulation of safe probability is too complicated to have any realistic hopes for a practice like this to emerge, but I can’t help hoping that the ideas can be simplified substantially, and a safer practice of statistics might emerge.
Starting with Grünwald (1999), my own work — often in collaboration with J. Halpern — has regularly used the idea of ‘safety’, for example in the context of Maximum Entropy inference (Grünwald, 2000), and also dilation (Grünwald and Halpern, 2004), calibration (Grünwald and Halpern, 2011), and probability puzzles like Monty Hall (Grünwald and Halpern, 2003, Grünwald, 2013). However, the insights of earlier papers were very partial and scattered, and the present paper presents for the first time a general formalism, definitions and a hierarchy. It is also the first one to make a connection to confidence distributions and pivots.
1.1 Informal Overview
Below we explain the basic ideas using three recurring examples. We assume that we are given a set of distributions $\mathcal{P}^*$ on some space of outcomes $\mathcal{Z}$. Under a frequentist interpretation, $\mathcal{P}^*$ is the set of distributions that we regard as ‘potentially true’; under a subjectivist interpretation, it is the credal set that describes our uncertainty or ‘beliefs’; all developments below work under both interpretations.
All probability distributions mentioned below are either an element of $\mathcal{P}^*$, or they are a pragmatic distribution $\tilde{P}$, which some decision-maker (DM) uses to predict the outcomes of some variable $U$ given the value of some other variable $V$, where both $U$ and $V$ are random quantities defined on $\mathcal{Z}$. $\tilde{P}$ is also used to estimate the quality of such predictions. $\tilde{P}$ (which may be, but is not always, in $\mathcal{P}^*$) is ‘pragmatic’ because we assume from the outset that some element of $\mathcal{P}^*$ might actually lead to better predictions; we just do not know which one.
In this example we used frequentist terminology, such as ‘correct’ and ‘true’, and we continue to do so in this paper. Still, a subjective interpretation remains valid in this and future examples as well: if the DM’s real beliefs are given by the full set , she can safely act as if her belief is represented by the singleton as long as she also believes that her loss does not depend on .
The example illustrates two important points:
In some cases the literature suggests some method for constructing a pragmatic $\tilde{P}$. An example is the latent Dirichlet allocation model (Blei et al., 2003) mentioned above, in which the data are text corpora, $\mathcal{P}^*$, not explicitly given, is a complicated set of realistic distributions over $\mathcal{Z}$ under which data are non-i.i.d., and the literature suggests taking $\tilde{P}$ as the Bayesian posterior for a cleverly designed i.i.d. model.
In other cases, DM may want to construct a herself. In Example 1, the safe was obtained by replacing an (unknown) conditional distribution with a (known) marginal — a special case of what was called -conditioning by Grünwald and Halpern (2011). Marginal distributions and distributions that ignore aspects of play a more central role in this construction process: they also do in the confidence construction mentioned above, where one sets equal to a distribution such that , where is some auxiliary random variable (a pivot), becomes independent of . For the original RV though, in the dilation example, DM acts as if and are independent even though they may not be; in the confidence distribution example, DM acts in a ‘dual’ manner, namely as if and are dependent, even though under they are not — which is fine, as long as her conclusions are safe.
Overview of the Paper
In Section 2, we treat the case of countable space $\mathcal{Z}$, defining the basic notions of safety in Section 2.2 (where we return to dilation), and showing how calibration can be cleanly expressed using our notions in Section 2.3. In Section 3 we extend the setting to general $\mathcal{Z}$, which is needed to handle the case of confidence safety (Section 3.1), pivots (Section 3.2) and squared error optimality, where we observe continuous-valued random variables. Section 4 briefly discusses non-numerical observations as well as probability updates that cannot be viewed as conditional distributions. We end with a discussion of further potential applications of safety as well as open problems. Proofs and further technical details are deferred to the appendix.
2 Basic Definitions for Discrete Random Variables
For simplicity, we introduce our basic notions only considering countable $\mathcal{Z}$, which allows us to sidestep measurability issues altogether. Thus below, $\mathcal{Z}$ is countable; we treat the general case in Section 3.
2.1 Concepts and notations regarding distributions on $\mathcal{Z}$
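We define a random variable (abbreviated to RV) to be any function $X: \mathcal{Z} \to \mathbb{R}^k$ for some $k > 0$. Thus RVs can be multidimensional (i.e. what is usually called a ‘random vector’). By an ‘$\mathcal{S}$-valued RV’ or simply ‘generalized RV’ we mean any function mapping $\mathcal{Z}$ to an arbitrary set $\mathcal{S}$. For two RVs $U = (U_1, \ldots, U_{k_1})$ and $V = (V_1, \ldots, V_{k_2})$, where $U_j$ and $V_j$ are 1-dimensional random variables, we define $(U, V)$ to be the RV with components $(U_1, \ldots, U_{k_1}, V_1, \ldots, V_{k_2})$.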
For any generalized RVs and on and function we write if for all , . We write (“ determines ”, or equivalently “ is a coarsening of ”) if there is a function such that . We write if and . For two GRVs and we write if they define the same function on , and for a distribution we write if . We write if , and if there exists some for which this holds. Clearly implies that for all distributions on , , but not vice versa. Let be a function on . The range of , denoted , the support of under a distribution , and the range of given that another function on takes value , are denoted as
where we note that , with equality if has full support.
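On a finite outcome space, the ‘determines’ (coarsening) relation can be checked mechanically. The sketch below assumes random variables are given as Python callables on a finite set of outcomes; the function name `determines` and the die example are illustrative and not taken from the paper.

```python
def determines(U, V, outcomes):
    """Return True if V is a function of U on the given finite outcome set.

    U and V are callables on outcomes. V is a coarsening of U exactly when
    any two outcomes with the same U-value also share the same V-value.
    """
    seen = {}
    for z in outcomes:
        u, v = U(z), V(z)
        # Remember the first V-value seen for this U-value; any later
        # disagreement means no function f with V = f(U) can exist.
        if seen.setdefault(u, v) != v:
            return False
    return True

# Illustrative example: a six-sided die.
outcomes = range(1, 7)
face = lambda z: z
parity = lambda z: z % 2
is_high = lambda z: z >= 4

if __name__ == "__main__":
    print(determines(face, parity, outcomes))     # face value fixes parity
    print(determines(parity, face, outcomes))     # parity does not fix the face
    print(determines(parity, is_high, outcomes))  # nor whether the roll is high
```

Two RVs that determine each other carry the same information about the outcome, which corresponds to the two-sided relation defined in the text.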
For a distribution on , and -valued RV , we write as short-hand to denote the distribution of under (i.e. is a probability measure).
We generally omit double brackets, i.e. if we write for RVs and , we really mean where is the RV ,
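Any generalized RV that maps all $z \in \mathcal{Z}$ to the same constant is called trivial, in particular the RV $\mathbf{0}$ which maps all $z$ to $0$. For an event $\mathcal{E} \subseteq \mathcal{Z}$, we define the indicator random variable $\mathbf{1}_{\mathcal{E}}$ to be $1$ if $\mathcal{E}$ holds and $0$ otherwise.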
Conditional Distributions as Generalized RVs
For given distribution on and generalized RVs and , we denote, for all , as the conditional distribution on given , in the standard manner. We further define to be the set of distributions on that can be arrived at from by conditioning on , for all supported by some .
We further denote, for all , as the conditional distribution of given , defined as the distribution on given by (whereas is defined as a distribution on , is a distribution on the more restricted space ).
Suppose DM is interested in predicting RV given RV and does this using some conditional distribution (usually this will be the ‘pragmatic’ , but the definition that follows holds generally). Adopting the standard convention for conditional expectation, we call any function from to the set of distributions on that coincides with for all a version of the conditional distribution . If we make a statement of the form ‘ satisfies …’, we really mean ‘every version of satisfies…’. We thus treat as a -valued random variable where , where, for all with , , and set to an arbitrary value otherwise.
Unique and Well-Definedness
Recall that DM starts with a set $\mathcal{P}^*$ of distributions on $\mathcal{Z}$ that she considers the right description of her uncertainty. She will predict some RV $U$ given some generalized RV $V$ using a pragmatic distribution $\tilde{P}$.
For RV and generalized RV , we say that, for given distribution on , is essentially uniquely defined (relative to ) if for all , (so that -almost surely takes value with ). We use this definition both for and for ; note that we always evaluate whether is uniquely defined under distributions in the ‘true’ though.
We say that is well-defined if, writing , and, , , we have, for , either with -probability 1, or with -probability 1. This is a very weak requirement that ensures that calculating expectations never involves the operation , making all expectations well-defined.
The Pragmatic Distribution
We assume that DM makes her predictions based on a probability distribution $\tilde{P}$ on $\mathcal{Z}$, which we generally refer to as the pragmatic distribution. In practice, DM will usually be presented with a decision problem in which she has to predict some fixed RV $U$ based on some fixed RV $V$, and then she is only interested in the conditional distribution $\tilde{P}(U \mid V)$, and for some other RVs $U'$ and $V'$, $\tilde{P}(U' \mid V')$ may be left undefined. In other cases she may only want to predict the expectation of $U$ given $V$; in that case she only needs to specify $E_{\tilde{P}}[U \mid V = v]$ as a function of $v$, and all other details of $\tilde{P}$ may be left unspecified. In Appendix A.1 we explain how to deal with such partially specified $\tilde{P}$. In the main text though, for simplicity we assume that $\tilde{P}$ is a fully-specified distribution on $\mathcal{Z}$; DM can fill in irrelevant details any way she likes. The very goal of our paper being to restrict $\tilde{P}$ to making ‘safe’ predictions however, DM may come up with $\tilde{P}$ to predict $U$ given $V$ and there may be many RVs $U'$ and $V'$ definable on the domain such that $\tilde{P}(U' \mid V')$ has no bearing on $\mathcal{P}^*$ and would lead to terrible predictions; as long as we make sure that $\tilde{P}$ is not used for such $U'$ and $V'$ (which we will), this will not harm the DM.
2.2 The Basic Notions of Safety
All our subsequent notions of ‘safety’ will be constructed in terms of the following first, simple definitions.
Let be an outcome space and be a set of distributions on , let be an RV and be a generalized RV on , and let be a distribution on . We say that is safe for (pronounced as ‘ is safe for predicting given ’), if
We say that is safe for , if
We say that is safe for , if (6) holds with both inequalities replaced by an equality, i.e. for all ,
In this definition, as in all definitions and results to come, whenever we write ‘ statement ’ we really mean ‘all conditional probabilities in the following statement are essentially uniquely defined, all expectations are well-defined, and statement ’. Hence, (7) really means ‘for all , is essentially uniquely defined, , , and are well-defined, and the latter two are equal to each other’. Also, when we wrote is safe for , we really meant that it is safe for relative to the given ; we will in general leave out the phrase ‘relative to ’, whenever this cannot cause confusion.
To be fully clear about notation, note that in double expectations like in (7), we consider the right random variable to be bound by the outer expectation; thus it can be rewritten in any of the following ways:
where the second equality follows from the tower property of conditional expectation.
Towards a Hierarchy
It is immediately seen that, if is safe for , then it is also safe for , and if it is safe for , then it is also safe for . Safety for is thus the weakest notion — it allows a DM to give valid upper- and lower-bounds on the actual expectation of , by quoting and , respectively, but nothing more. It will hardly be used here, except for a remark below Theorem 2; it plays an important role though in applications of safety to hypothesis testing, on which we will report in future work.
Safety for evidently bears relations to unbiased estimation: if is safe for , i.e. (7) holds, then we can think of as an unbiased estimate, based on observing , of the random quantity (see also Example 8 later on). Safety for implies that all distributions in agree on the expectation of and that is the same for (essentially) all values of , and is thus a much stronger notion.
Comparing the ‘safety condition’ (4) in Example 1 to (7) in Definition 1 we see that Definition 1 only imposes a requirement on expectations of whereas (4) imposed a requirement also on RVs equal to functions of . For with more than two elements as in Example 5 above, such a requirement is strictly stronger. We now proceed to define this stronger notion formally.
Let , and be as above. We say that
is safe for if for all RVs
with , is safe for .
Similarly, is safe for if for all RVs with , is safe for , and is safe for if for all RVs with , is safe for .
We see that safety of for implies that is the same for all values of in the support of , and all functions of . This can only be the case if ignores , i.e. , for all supported . We must then also have that, for all , that , which means that all distributions in agree on the marginal distribution of , and is equal to this marginal distribution. Thus, is safe for iff it is marginally valid. A prime example of such a that ignores and is marginally correct is the we encountered in Example 1.
To get everything in place, we need a final definition.
Let , and be as above, and let be another generalized RV.
We say that is safe for if for all , is safe for relative to . We say that is safe for if for all RVs with , is safe for .
The same definitions apply with replaced by and .
We say that is safe for if it is safe for ; it is safe for if it is safe for .
These definitions simply say that safety for ‘’ means that the space can be partitioned according to the value taken by , and that for each element of the partition (indexed by ) one has ‘local’ safety given that one is in that element of the partition.
Proposition 1 gives reinterpretations of some of the notions above. The first one, (9) will mostly be useful for the proof of other results; the other three serve to make the original definitions more transparent:
[Basic Interpretations of Safety] Consider the setting above. We have:
is safe for iff for all , there exists a distribution on with for all , , that satisfies
is safe for iff for all ,
is safe for iff for all ,
is safe for iff for all ,
Note that (12) says that is safe for if ignores given , i.e. according to , is conditionally independent of given . Thus, can be safe for and still may depend on ; the definition only requires that is ignored once is given.
(11) effectively expresses that is valid (a frequentist might say ‘true’) for predicting based on observing , where as always we assume that itself correctly describes our beliefs or potential truths (in particular, if is a singleton, then any which coincides a.s. with is automatically valid). Thus, ‘validity for ’, to be interpreted as is a valid distribution to use when predicting given observations of is a natural name for safety for . We also have a natural name for safety for : for 1-dimensional , (10) simply expresses that all distributions in agree on the conditional expectation of , and that is a version of it. which implies (see e.g. Williams (1991)) that, with the function ,
the minimum being taken over all functions from to . This means that encodes the optimal regression function for given and hence suggests the name squared-error optimality. Summarizing the names we encountered (see Figure 1):
[(Potential) Validity, Squared Error-Optimality, Unbiasedness, Marginal Validity] If is safe for , i.e. (11) holds for all , then we also call valid for (again, pronounce as ‘valid for predicting given ’). If (11) holds for some , we call potentially valid for . If is safe for , we call squared error-optimal for . If is safe for , we call unbiased for . If is safe for , we say that it is marginally valid for .
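The squared-error optimality property referred to above rests on a standard fact: the conditional expectation minimizes expected squared error among all functions of the conditioning variable. The following sketch checks this numerically for a single, hypothetical joint distribution (the dictionary `joint` is made up for illustration); it works relative to one distribution only, whereas the definitions in the text quantify over all distributions in the set under consideration.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical joint distribution P(U = u, V = v) on a finite outcome space.
joint = {
    (0, "a"): F(1, 4), (1, "a"): F(1, 4),
    (0, "b"): F(1, 10), (2, "b"): F(2, 5),
}

def conditional_expectation(joint):
    """Return the map v -> E[U | V = v]."""
    mass, total = {}, {}
    for (u, v), p in joint.items():
        mass[v] = mass.get(v, 0) + p
        total[v] = total.get(v, 0) + p * u
    return {v: total[v] / mass[v] for v in mass}

def squared_error(joint, f):
    """Expected squared error E[(U - f(V))^2] of the predictor f (a dict)."""
    return sum(p * (u - f[v]) ** 2 for (u, v), p in joint.items())

best = conditional_expectation(joint)

if __name__ == "__main__":
    grid = [F(k, 10) for k in range(21)]
    # No candidate predictor on the grid beats the conditional expectation.
    beaten = any(
        squared_error(joint, {"a": fa, "b": fb}) < squared_error(joint, best)
        for fa, fb in product(grid, repeat=2)
    )
    print(best, squared_error(joint, best), beaten)
```

Exact rational arithmetic via `fractions.Fraction` is used so that the comparison is not affected by floating-point rounding.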
It turns out that there also is a natural name for safety for whenever . The next example reiterates its importance, and the next section will provide the name: calibration.
2.3 Calibration Safety
In this section, we show that calibration, as informally defined in Example 2, has a natural formulation in terms of our safety notions. We first define calibration formally, and then, in our first main result, Theorem 1, show how being calibrated for predicting based on observing is essentially equivalent to being safe for for some types of that need not be equal to itself, including . Thus, we now effectively unify the ideas underlying Example 1 (dilation) and Example 2 (calibration).
Following Grünwald and Halpern (2011) we define calibration directly in terms of distributions rather than empirical data, in the following way:
[Calibration] Let , , , and be as above. We say that is calibrated (or calibration–safe) for if for all , all ,
We say that is calibrated for if for all , all ,
Hence, calibration (for ) means that given that a DM who uses predicts a specific distribution for , the actual distribution is indeed equal to the predicted distribution. Note that here we once again treat as a generalized RV.
In practice we would want to weaken Definition 5 to allow some slack, requiring the (viz. ) inside the conditioning to be only within some of the (viz. ) outside, but the present idealized definition is sufficient for our purposes here. Note also that the definition refers to a simple form of calibration, which does not involve selection rules based on past data such as used by, e.g., Dawid (1982).
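The idealized notion of calibration can be illustrated on a toy example. The sketch below assumes a binary variable to be predicted, so that a predicted distribution is summarized by one number, and it checks calibration relative to a single hypothetical ‘true’ distribution rather than a whole set; all numbers and names (`p_v`, `p_u1_given_v`, `is_calibrated`) are made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical 'true' distribution: P(V = v) and P(U = 1 | V = v), with U binary.
p_v = {0: F(1, 2), 1: F(1, 4), 2: F(1, 4)}
p_u1_given_v = {0: F(1, 5), 1: F(2, 5), 2: F(4, 5)}

def is_calibrated(pred, p_v, p_u1_given_v):
    """Check that among all v receiving the same prediction q,
    the true probability of U = 1 equals q."""
    for q in set(pred.values()):
        group = [v for v in pred if pred[v] == q]
        mass = sum(p_v[v] for v in group)
        actual = sum(p_v[v] * p_u1_given_v[v] for v in group) / mass
        if actual != q:
            return False
    return True

pooled = F(4, 15)  # P(U = 1 | V in {0, 1}) under the distribution above
coarse = {0: pooled, 1: pooled, 2: F(4, 5)}    # ignores the 0-versus-1 distinction
skewed = {0: F(1, 2), 1: F(1, 2), 2: F(4, 5)}  # same grouping, wrong number

if __name__ == "__main__":
    print(is_calibrated(p_u1_given_v, p_v, p_u1_given_v))
    print(is_calibrated(coarse, p_v, p_u1_given_v))
    print(is_calibrated(skewed, p_v, p_u1_given_v))
```

The predictor `coarse` differs from the true conditional distribution and yet passes the check, which illustrates that calibration is weaker than validity.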
We now express calibration in terms of our safety notions. We will only do this for the ‘full distribution’–version (16); a similar result can be established for the average-version.
Let and be as above. The following three statements are equivalent:
is calibrated for ;
There exists a RV on with such that is safe for
is safe for where is the generalized RV given by .
Note that, since safety for implies safety for for , (2.) (1.) shows that safety for implies calibration for . By mere definition chasing (details omitted) one also finds that (2.) implies that is safe for and, again by definition chasing, that is safe for . Thus, this result establishes two more arrows of the hierarchy of Figure 1. Its proof is based on the following simple result, interesting in its own right:
Let and be generalized RVs such that for some function . The following statements are equivalent:
ignores , i.e. .
For all , for all with : .
and ignores , where .
Moreover, if is safe for and ignores , then is safe for .
3 Continuous-Valued $U$ and $V$; Confidence and Pivotal Safety
Our definitions of safety were given for countable $\mathcal{Z}$, making all random variables involved have countable range as well. Now we allow general $\mathcal{Z}$ and hence continuous-valued $U$ and general uncountable $V$ as well, but we consider a version of safety in which we do not have safety for $U \mid V$ itself, but for $U' \mid V$ for some $U'$ with $U \rightsquigarrow U'$ such that the range of $U'$ is still countable. To make this work we have to equip $\mathcal{Z}$ with an appropriate $\sigma$-algebra and have to add to the definition of a RV that it must be measurable (Footnote 6: Formally we assume that $\mathcal{Z}$ is equipped with some $\sigma$-algebra that contains all singleton subsets of $\mathcal{Z}$. We associate the co-domain of any function $X: \mathcal{Z} \to \mathbb{R}^k$ with the standard Borel $\sigma$-algebra on $\mathbb{R}^k$, and we call such $X$ an RV whenever the $\sigma$-algebra on $\mathcal{Z}$ is such that the function is measurable.), and we have to modify the definition of support to the standard measure-theoretic definition (which specializes to our definition (2.1) whenever there exists a countable such that ). Yet nothing else changes and all previous definitions and propositions can still be used. (Footnote 7: If we were to consider safety of the form for uncountable