 # Statistics with improper posteriors

In 1933 Kolmogorov constructed a general theory that defines the modern concept of conditional probability. In 1955 Rényi formulated a new axiomatic theory for probability motivated by the need to include unbounded measures. We introduce a general concept of conditional probability in Rényi spaces. In this theory improper priors are allowed, and the resulting posteriors can also be improper.

## 1 Introduction

An often voiced criticism of the use of improper priors in Bayesian inference is that such priors sometimes don't lead to a proper posterior distribution. This can happen when the marginal law of the data is not $\sigma$-finite (Taraldsen and Lindqvist, 2010), as sometimes encountered in applied settings with sparse data (Bord et al., 2018; Tufto et al., 2012).

The dangers of improper posteriors in Markov Chain Monte Carlo based methods of inference are well recognized (e.g. Hobert and Casella, 1996). Within the theory to be presented here, however, improper posteriors as such are well defined, and in practical applied statistics it will be of interest to develop numerical methods for computing such posterior densities. One possible method is indicated by Tufto et al. (2012, Appendix S4) for data on tropical butterflies, and is illustrated in Figure 1.

Figure 1: An estimate of the log of an improper posterior density. It is obtained by alignment of kernel density estimates based on separate MCMC runs, each run restricted to a different subinterval.

The key idea is to consider the family of posteriors obtained from restriction to intervals, and then glue the resulting posteriors together in a postprocessing step. This simple idea is also the key for the general definition of the posterior we introduce in Section 3. The definition is based on the family of conditional probabilities appearing in the axioms of a conditional probability space as introduced by Renyi (1955).
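The gluing step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the procedure of Tufto et al. (2012): MCMC and kernel density estimation are replaced by exact, separately normalised log densities on three overlapping intervals, the improper density $e^{-\lambda}/\lambda$ is used as a stand-in target, and neighbouring pieces are aligned at a single point of their overlap. The names `target`, `restricted_log_density`, `glue`, and `glued_log_density` are hypothetical.

```python
import math

def target(lam):
    """Unnormalised improper density used as a stand-in target (assumption)."""
    return math.exp(-lam) / lam

def integrate(f, a, b, steps=20000):
    """Midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def restricted_log_density(a, b):
    """Log of the proper density obtained by restricting `target` to [a, b]."""
    log_norm = math.log(integrate(target, a, b))
    return lambda lam: math.log(target(lam)) - log_norm

def glue(intervals):
    """Return (interval, piece, offset) triples aligned on the overlaps."""
    pieces = [restricted_log_density(a, b) for a, b in intervals]
    offsets = [0.0]
    for i in range(1, len(intervals)):
        # Midpoint of the overlap between interval i-1 and interval i.
        m = 0.5 * (intervals[i][0] + intervals[i - 1][1])
        offsets.append(offsets[-1] + pieces[i - 1](m) - pieces[i](m))
    return list(zip(intervals, pieces, offsets))

def glued_log_density(glued, lam):
    """Evaluate the glued log density using the first piece that covers lam."""
    for (a, b), piece, offset in glued:
        if a <= lam <= b:
            return piece(lam) + offset
    raise ValueError("lam is outside every interval")

intervals = [(0.01, 0.1), (0.05, 1.0), (0.5, 10.0)]
glued = glue(intervals)
# The glued curve should differ from log(target) by one common constant.
shifts = [glued_log_density(glued, lam) - math.log(target(lam))
          for lam in (0.02, 0.08, 0.3, 0.9, 5.0)]
```

Each restricted piece is a proper density known only up to its own normalising constant. The offsets remove these constants relative to the first piece, so the glued curve represents the improper density up to a single arbitrary factor.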

As a simpler motivating example, suppose you observe a homogeneous Poisson process with a scale invariant prior density (Jeffreys, 1939, p.122)

$$\pi(\lambda) = \frac{c}{\lambda} \tag{1}$$

on the Poisson intensity $\lambda$. The constant $c$ is arbitrary, carries no information, and $c = 1$ is used below. Similar arbitrary constants will, however, play an important role in the theory in later parts of this paper. The marginal law of the number of events $X$ in an interval of length $t$ is then not $\sigma$-finite since

$$P(X = 0) = \int_0^\infty P(X = 0 \mid \lambda)\,\pi(\lambda)\,d\lambda = \int_0^\infty \frac{(\lambda t)^0}{0!}\, e^{-\lambda t}\,\frac{d\lambda}{\lambda} = \infty \tag{2}$$

If you observe $X = 0$ and formally multiply the prior by the likelihood you obtain an improper posterior

$$\pi(\lambda \mid X = 0) = \frac{e^{-\lambda t}}{\lambda} \tag{3}$$

This posterior law for $\lambda$ is different from the initial prior law, and we claim that this is a correct way of incorporating the information given by $X = 0$. High values for $\lambda$ are less probable given the observation $X = 0$. Further updating can be done with this posterior as a prior, and this is consistent with a single update based on the initial prior.
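A small numerical sketch may make the claim concrete. It assumes $c = 1$ and $t = 1$, works with the unnormalised posterior of equation (3), and uses a midpoint rule on a logarithmic grid; the helper names are hypothetical. It illustrates that the posterior mass grows without bound as the lower integration limit shrinks, while conditional probabilities given a bounded interval away from zero are ordinary probabilities that change when the prior is updated.

```python
import math

def prior(lam):
    """Scale invariant prior 1/lam (the constant c is set to 1)."""
    return 1.0 / lam

def posterior(lam, t=1.0):
    """Unnormalised improper posterior exp(-lam*t)/lam after observing X = 0."""
    return math.exp(-lam * t) / lam

def integrate(f, a, b, steps=20000):
    """Midpoint rule on a logarithmic grid, suited to integrands like 1/lam."""
    la, lb = math.log(a), math.log(b)
    h = (lb - la) / steps
    total = 0.0
    for i in range(steps):
        lam = math.exp(la + (i + 0.5) * h)
        total += f(lam) * lam * h  # d(lam) = lam * d(log lam)
    return total

def conditional_probability(density, a_set, b_set):
    """P(lam in A | lam in B) for intervals A inside B under `density`."""
    return integrate(density, *a_set) / integrate(density, *b_set)

# Mass near zero grows without bound: the posterior is improper.
masses = [integrate(posterior, eps, 1.0) for eps in (1e-2, 1e-4, 1e-8)]

# Conditional probabilities given lam in [0.1, 10] are well defined.
B = (0.1, 10.0)
A = (1.0, 10.0)
p_prior = conditional_probability(prior, A, B)
p_post = conditional_probability(posterior, A, B)
```

Under these assumptions `p_post` is smaller than `p_prior`, in line with the statement that high intensities become less probable after observing no events.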

A related example is the Beta posterior density for the success probability $p$ given by

$$\pi(p \mid x) = p^{x-1}(1-p)^{n-x-1} \tag{4}$$

for a Bernoulli sequence with $x$ successes out of $n$ trials. This corresponds to the improper Haldane (1932) prior (Jeffreys, 1939, p.123)

$$\pi(p) = p^{-1}(1-p)^{-1} \tag{5}$$

The posterior is improper if $x$ is zero, as in the previous example, and likewise if $x = n$. In all cases, however, the observation of the number of successes results in a corresponding updating of the uncertainty associated with $p$: the posterior in equation (4) contains the information given by the observation and the prior in equation (5).

The Haldane prior is the prior that corresponds to the formal Bayes estimator $\hat{p} = x/n$ (Robert, 2007, p.29). This is the optimal frequentist estimator in the sense of being the unique uniformly minimum variance unbiased estimator. Similar optimality phenomena motivate the use of improper priors more generally (Berger, 1985, p.409), and are also linked to fiducial inference (Taraldsen and Lindqvist, 2013).
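For $0 < x < n$ the posterior in equation (4) is a proper Beta density with mean $x/n$, and this can be checked numerically. The following sketch is only an illustration: the helper names are hypothetical, the integration is a plain midpoint rule, and the values $x = 3$ and $n = 10$ are assumptions chosen for the example.

```python
def haldane_posterior(p, x, n):
    """Unnormalised posterior p^(x-1) (1-p)^(n-x-1) from equation (4)."""
    return p ** (x - 1) * (1.0 - p) ** (n - x - 1)

def integrate(f, a=0.0, b=1.0, steps=100000):
    """Midpoint rule; the midpoints never touch the endpoints 0 and 1."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def posterior_mean(x, n):
    """Posterior mean of p; requires 0 < x < n so that the posterior is proper."""
    if not 0 < x < n:
        raise ValueError("the posterior is improper for x = 0 and x = n")
    mass = integrate(lambda p: haldane_posterior(p, x, n))
    first_moment = integrate(lambda p: p * haldane_posterior(p, x, n))
    return first_moment / mass

mean = posterior_mean(3, 10)  # should be close to x / n = 0.3
```

The explicit error for $x = 0$ and $x = n$ reflects that a posterior mean is not available there, even though the improper posterior itself remains meaningful in the theory developed below.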

Unfortunately, even people who accept the use of improper priors reject the above form of inference, on the grounds that the posterior is not a probability distribution and that a mathematical theory for it is lacking (Robert et al., 2009). This is understandable, and we agree initially with this point of view. We will demonstrate, however, that the above forms of, so far, formal inference can be made consistent with the axiomatic system of Rényi which allows improper laws. In Section 3 we develop his mathematical theory further to include general conditioning on a $\sigma$-field. This gives a rigorous mathematical foundation for inference with unbounded laws, including the previous three examples. The aim of this paper is to present key elements in a mathematical and philosophical theory of statistics that allows improper laws both as priors and posteriors based on the concept of a Rényi space as defined in the next section.

## 2 Statistics in Rényi space

The mathematical theory of statistics is thoroughly presented by Schervish (1995). We will next present the initial ingredients in this theory. The purpose is to have a platform for a generalization that effectively replaces the probability space of Kolmogorov (1933) with the conditional probability space of Renyi (1970). The reader may feel that we include too many elementary standard definitions, and we apologise for this. The reason is that there are small differences in most books on foundations, and the easiest way to be precise is to be explicit. For measure theory we follow mostly the conventions in the elegant treatment by Rudin (1987). A particular interpretation is indicated together with the mathematical theory, but the reader should recognise that many other interpretations are possible. We do not claim that there is one single “correct” interpretation, but we do claim that the indicated philosophical interpretation is useful in many applied concrete problems.

The initial ingredient is an abstract underlying space $\Omega$. This space is a non-empty set equipped with a law $P$ which assigns a weight $P(A)$ to all measurable sets $A$. The family $\mathcal{E}$ of measurable sets is assumed to be a $\sigma$-field: (i) $\Omega \in \mathcal{E}$, (ii) $A \in \mathcal{E}$ implies $A^c \in \mathcal{E}$, and (iii) $A_1, A_2, \ldots \in \mathcal{E}$ implies $\bigcup_i A_i \in \mathcal{E}$. The set $\Omega$ equipped with the family $\mathcal{E}$ of measurable sets is then a measurable space (Rudin, 1987, p.8). A measurable set is also referred to as an event with a corresponding philosophical interpretation (Renyi, 1970, p.1-37).

The law $P$ is a positive measure defined on $\mathcal{E}$: (i) $P(\emptyset) = 0$, (ii) $A \in \mathcal{E}$ implies $P(A) \ge 0$, (iii) If $A_1, A_2, \ldots$ are disjoint, then $P(\bigcup_i A_i) = \sum_i P(A_i)$. Property (iii) is referred to as countable additivity. It is the distinctive feature that separates the theory here from the alternative approach of including improper laws by allowing finitely additive measures (Heath and Sudderth, 1989; Schervish, 1995, p.21). Additionally, in the theory of Kolmogorov, $P(\Omega) = 1$ is assumed. The set $\Omega$ equipped with $\mathcal{E}$ and $P$ is then a probability space. This is assumed in the following paragraphs until the concept of a Rényi space is introduced.

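A random quantity $Z$ is a measurable function $Z: \Omega \to \Omega_Z$ (Schervish, 1995, p.583, p.606). A function is measurable if $(Z \in A) = \{\omega \mid Z(\omega) \in A\}$ is measurable for any measurable $A$. Using this notation, the law $P_Z$ of $Z$ is well defined by

$$P_Z(A) = P(Z \in A) \tag{6}$$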

Another random quantity is defined by when is measurable. This implies so the law of is determined by the law of . The general change-of-variables theorem is also a consequence (Schervish, 1995, Thm B.12). The notation is here used for the expectation of . These observations explain partly why the abstract space can be left unspecified in applications.

The previous paragraph defines the law $P_T$ of a random quantity $T$. The notation $P^t(A) = P(A \mid T = t)$ is here used for the conditional law. It is defined by the equation

$$P((T \in C) A) = \int_C P^t(A)\, P_T(dt) \tag{7}$$

The left-hand side defines for each event $A$ a measure on $\Omega_T$ which is absolutely continuous with respect to $P_T$, and $P^t(A)$ is the unique density obtained from the Radon-Nikodym theorem (Rudin, 1987, p.121). The conditional law of $Z$ given $T = t$ is defined by

$$P^t_Z(A) = P^t(Z \in A) = P(Z \in A \mid T = t) \tag{8}$$

The general change-of-variables theorem implies that so the conditional law is determined by the joint law .

The previous gives some basic ingredients from probability theory needed in a mathematical theory of statistics.

For the theory of statistics Schervish (1995, p.82) assumes, as we also do, that there is a single space $\Omega$ underlying also the statistical analysis for a particular model with observed data $x$. The data $X$ and the model $\Theta$ are random quantities. This means that $X: \Omega \to \Omega_X$ and $\Theta: \Omega \to \Omega_\Theta$ are measurable functions. The data space $\Omega_X$ is the set corresponding to the possible observations $x$. It is commonly referred to as the sample space. The model space $\Omega_\Theta$ is the set of possible model parameters $\theta$. It is sometimes referred to as the model parameter space. A parameter is by definition a function of the model $\Theta$. It is hence also consistent to refer to $\Theta$ as the model parameter. A statistic is by definition a function of the data $X$. The function is sometimes referred to as an action, and the set of its values is then the action space. The parameter is sometimes referred to as the focus parameter, and the set of its values is then the focus space. These concepts are explained in much more detail by Schervish (1995), but with some differences in notation and naming conventions. The involved concepts are illustrated in the commutative diagram in Figure 2.

A statistical model for observed data is conventionally specified by a family of probability measures indexed by the unknown model parameter (Lehmann and Casella, 1998, p.1). We assume additionally, as does Schervish (1995, p.83), that the statistical model is given by the conditional law

$$P^\theta_X(A) = P(X \in A \mid \Theta = \theta) \tag{9}$$

of the data $X$ given $\Theta = \theta$. This requires also a specification of the data space $\Omega_X$ and the model space $\Omega_\Theta$. The task for the statistician is to infer something about a chosen focus parameter from the observed data $x$. This is done by reporting a statistic. The problem is then to choose or characterise a suitable action, and to implement and perform associated calculations.

In Bayesian inference the prior $P_\Theta$ is also specified, and together with the conditional law $P^\theta_X$ this determines the joint law of $X$ and $\Theta$. The joint law of the data and the model determines the posterior law $P^x_\Theta$. The simplicity and generality of this transformation of prior knowledge into the posterior knowledge given the data is one major argument in favour of the Bayesian paradigm. Additionally, it can be observed that the Bayes posterior expectation exemplifies that the Bayes posterior can be used to define many possible actions in addition to the distribution estimator given by the posterior law itself.

The previous paragraphs give a condensed presentation of some of the initial ingredients in the well established mathematical theory of statistics as presented in considerably more detail by Schervish (1995). We now turn to the more general case where $\Omega$ is a Rényi space as will be defined in the next few paragraphs. Assume first that $(\Omega, \mathcal{E}, P)$ is a $\sigma$-finite measure space. Let $\mathcal{B}$ denote the family of elementary conditions $B$ defined by the requirement $0 < P(B) < \infty$. A family of conditional probability measures is then defined by

$$P(A \mid B) = \frac{P(AB)}{P(B)}, \quad \forall A \in \mathcal{E},\ \forall B \in \mathcal{B} \tag{10}$$

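It can be verified that $\mathcal{B}$ is a bunch (Renyi, 1970, Def.2.2.1): (i) $\emptyset \notin \mathcal{B}$, (ii) $B_1, B_2 \in \mathcal{B}$ implies $B_1 \cup B_2 \in \mathcal{B}$, (iii) There exists a sequence $B_1, B_2, \ldots \in \mathcal{B}$ with $\bigcup_i B_i = \Omega$. Condition (iii) follows since $P$ is $\sigma$-finite. Furthermore, $B_1 \subset B_2$ and $B_1, B_2 \in \mathcal{B}$ imply $P(B_1 \mid B_2) > 0$, and imply also the consistency requirement

$$P(A \mid B_1) = \frac{P(A B_1 \mid B_2)}{P(B_1 \mid B_2)} \tag{11}$$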

This shows that a $\sigma$-finite measure generates a conditional probability space. A conditional probability space is a measurable space equipped with a consistent family of conditional probabilities indexed by a bunch (Renyi, 1970, Def.2.2.2). The Rényi structure theorem shows that every conditional probability space is generated by a corresponding $\sigma$-finite measure (Renyi, 1970; Taraldsen and Lindqvist, 2016). It should be noted, however, that the above construction given by equation (10) gives a maximal bunch and then a maximal family of conditional probabilities. Consequently, every conditional probability space can be extended to a maximal conditional probability space.

It can be noted that the family of conditional probabilities and the family of elementary conditions are unchanged if $P$ is replaced by $cP$ where $c$ is a positive constant. The Rényi state defined by $P$ is the equivalence class $[P] = \{cP \mid c > 0\}$. The measures $P$ and $cP$ are equivalent when interpreted as Rényi states, and the conditional probabilities give the philosophical interpretation in statistical models. A Rényi space is here defined to be a measurable space equipped with a Rényi state. It corresponds to a conditional probability space where the bunch is maximal. Our definition here of a Rényi space is equivalent with the definition of a full conditional probability space as used by Renyi (1970, p.43). We will follow conventional abuse of notation and use the same symbol for the equivalence class, a representative $\sigma$-finite measure, and the family of conditional measures.
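The construction in equations (10) and (11) can be illustrated with a toy example. The sketch below assumes, purely for illustration, that the underlying measure is a constant multiple of counting measure on the integers and that the elementary conditions are non-empty finite sets of integers; the function names are hypothetical.

```python
from fractions import Fraction

def measure(event, scale=1):
    """Counting measure on the integers times a positive constant `scale`."""
    return scale * len(event)

def conditional(a, b, scale=1):
    """P(A | B) = P(AB) / P(B) as in equation (10); needs 0 < P(B) < infinity."""
    if len(b) == 0:
        raise ValueError("B must have positive measure")
    return Fraction(measure(a & b, scale), measure(b, scale))

A = set(range(0, 50, 2))      # even numbers below 50
B1 = set(range(0, 10))        # an elementary condition
B2 = set(range(-20, 20))      # a larger elementary condition containing B1

lhs = conditional(A, B1)
rhs = conditional(A & B1, B2) / conditional(B1, B2)   # equation (11)
rescaled = conditional(A, B1, scale=7)                # same Rényi state
```

Counting measure on the integers has infinite total mass, so it cannot be normalised, yet every conditional probability given a finite non-empty set is an ordinary probability. Rescaling the measure leaves these conditional probabilities unchanged, which is the sense in which only the equivalence class matters.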

Consider now again the commutative diagram in Figure 2 corresponding to a general statistical inference problem. It can be interpreted as before also when $\Omega$ is assumed to be a Rényi space. A random quantity $Z$ is a measurable function. It is said to be $\sigma$-finite if the law $P_Z$ is $\sigma$-finite. The $\sigma$-finite functions define the natural arrows in the category of Rényi spaces. In the case our definition of being $\sigma$-finite is equivalent with being a regular random variable as defined by Renyi (1970, p.73).

The prior $P_\Theta$ defines a Rényi state if $\Theta$ is $\sigma$-finite. The interpretation is in terms of the conditional probabilities for . If the variable and the data are $\sigma$-finite, then the posterior is well defined with . This is discussed and exemplified by Taraldsen and Lindqvist (2010) and Lindqvist and Taraldsen (2018). In the next section this theory will be generalised so that the posterior is also allowed to be a conditional Rényi state as needed for the butterfly, Poisson process, and Bernoulli examples in Section 1.

## 3 Improper posteriors as conditional Rényi states

Taraldsen and Lindqvist (2010, 2016) define the posterior law for the case where the data $X$ is $\sigma$-finite. The aim now is to prove existence and uniqueness of a posterior law without assuming that $X$ is $\sigma$-finite. The simple idea in the following is to define it from a family indexed by the elementary conditions $B \in \mathcal{B}$. The latter is defined by the family of conditional probabilities defined by the Rényi state $P$. It is assumed throughout this section that $\Omega$ is a Rényi space, and that all random quantities are defined on this space. The bunch $\mathcal{B}$ associated to $P$ is the family of events $B$ defined by the requirement $0 < P(B) < \infty$.

Assume that $T$ is a random quantity. If $B \in \mathcal{B}$, then

$$P((T \in C) A \mid B) = \int_C P^t(A \mid B)\, P_T(dt \mid B) \tag{12}$$

defines $P^t(A \mid B)$ similarly to how $P^t(A)$ was defined by equation (7). The left-hand side defines for each event $A$ a measure on $\Omega_T$ which is absolutely continuous with respect to $P_T(\cdot \mid B)$, and $P^t(A \mid B)$ is the unique density obtained from the Radon-Nikodym theorem (Rudin, 1987, p.121).

If is a random quantity, then the previous defines a family of posterior laws indexed by . This is the necessary ingredient for the interpretation of a posterior law. This family is taken as the definition of the posterior law . The construction holds also more generally for a conditional probability space with an arbitrary bunch. In the following we restrict attention to Rényi spaces. The posterior law defines then a conditional Rényi state.

The next aim is to prove existence of a posterior law directly, and show that

$$P^x(AB) = P^x(A \mid B)\, P^x(B), \quad \forall A \in \mathcal{E},\ \forall B \in \mathcal{B},\ \forall' x \in \Omega_X \tag{13}$$

This generalization of the structure theorem of Rényi is the main result given below in Theorem 1. Its precise statement requires some more definitions.

A $\sigma$-finite measure $Q_T$ is by definition a pseudo-law of a random quantity $T$ if . If is another pseudo-law, then the Radon-Nikodym theorem gives existence of a unique in such that . Existence of a pseudo-law follows by defining where is a probability measure such that with (Rudin, 1987, 6.9 Lemma). Given a pseudo-law $Q_T$, or more generally a law $Q_T$ that dominates $P_T$, we define the conditional law $P^t$ by the relation

$$P((T \in C) A) = \int_C P^t(A)\, Q_T(dt) \tag{14}$$

The left-hand-side defines for each event a measure on which is absolutely continuous with respect to , and is the unique density obtained from the Radon-Nikodym theorem (Rudin, 1987, p.121, p.123). If , then the previous shows that is the conditional law corresponding to the pseudo-law . This defines an equivalence between conditional laws, and defines the unique conditional Rényi state as an equivalence class. The main mathematical result can now be stated.

###### Theorem 1.

A random quantity determines a unique conditional Rényi state , and a unique family of conditional Rényi states for such that

$$P^t(AB) = P^t(A \mid B)\, P^t(B), \quad \forall' t \in \Omega_T \tag{15}$$

###### Proof.

All that remains to prove is equation (15). Observe first that and give

$$P_T(dt \mid B) = \frac{P^t(B)}{P(B)}\, Q_T(dt)$$

Using this gives
, so

$$\int_C P^t(AB)\, Q_T(dt) = \int_C P^t(A \mid B)\, P^t(B)\, Q_T(dt)$$

and equation (15) is proved. ∎

All of the previous can be repeated with a replacement of the measurable set with a positive measurable function and . Conditional expectation of complex valued functions can be defined by decomposition in positive and negative parts and then in real and complex parts. Consideration of the dual space gives conditional expectation of a separable Banach space valued . The conditional expectation is in particular well defined when takes values in a separable Hilbert space. Separability is assumed to ensure almost everywhere definition on .

Conditional expectation with respect to a $\sigma$-field is defined by where and . It can be noted that we define directly following Kolmogorov instead of more indirectly by first defining as is more common. This has the advantage of allowing a completely general measurable space , whereas the common approach requires separability properties according to Schervish (1995, p.616, Prop.B.24).

## 4 Examples

### 4.1 The uniform Rényi state on R

The most familiar example of an improper prior is given by Lebesgue measure on the real line equipped with the family of Borel sets . The corresponding Rényi state is the equivalence class . The Rényi state is equivalently given by the bunch , and the family of conditional probabilities for and . This defines a full conditional probability space in the sense of Renyi (1970, p.43), or equivalently, in our terminology, a Rényi space.

In the context of statistical modeling it is furthermore assumed, as in Figure 2, that is a random variable defined on the underlying Rényi space with . The latter can also be written as or as where, as always, . These equations are interpreted as given for representative measures in the equivalence classes.

The family does not contain the empty set, it is closed under finite unions, and there is a sequence with . The family is therefore a bunch. The family of probability measures defined by for defines a conditional probability space in the sense of Renyi (1970, Definition 2.2.2, p.38). The Rényi structure theorem ensures that this space is generated by a $\sigma$-finite measure, or equivalently, that this conditional probability space can be extended to a unique Rényi space. This Rényi space is given by the Rényi state described in the previous paragraphs.

In applications the uniform law on the real line is often described as the limit of the probability measures as . The previous paragraph identifies the uniform law not as a limit, but as given by the collection of probability measures itself. The uniform Rényi state can, however, also be obtained as . Each and are interpreted as Rényi states. The limit can be defined as in the convergence of conditional probability spaces defined by (Renyi, 1970, p.57), but also in the sense of convergence of Rényi states given by equivalence classes (Taraldsen and Lindqvist, 2016, p.5015)(Bioche and Druilhet, 2016).
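The convergence can be illustrated numerically at the level of conditional probabilities. The sketch below is an illustration under assumptions made here: it compares conditional probabilities given a fixed bounded interval under the uniform law on $[-n, n]$ and under the normal laws $N(0, s^2)$, taken as one possible approximating sequence, with the corresponding ratio of Lebesgue measures. The function names and the chosen intervals are hypothetical.

```python
import math

def normal_cdf(x, sd):
    """Cumulative distribution function of N(0, sd^2)."""
    return 0.5 * (1.0 + math.erf(x / (sd * math.sqrt(2.0))))

def uniform_conditional(a_set, b_set, n):
    """P(A | B) under the uniform law on [-n, n], for intervals A inside B."""
    def mass(lo, hi):
        return max(0.0, min(hi, n) - max(lo, -n)) / (2.0 * n)
    return mass(*a_set) / mass(*b_set)

def normal_conditional(a_set, b_set, sd):
    """P(A | B) under N(0, sd^2), for intervals A inside B."""
    def mass(lo, hi):
        return normal_cdf(hi, sd) - normal_cdf(lo, sd)
    return mass(*a_set) / mass(*b_set)

B = (0.0, 4.0)
A = (0.0, 1.0)
lebesgue = (A[1] - A[0]) / (B[1] - B[0])   # ratio of lengths under Lebesgue measure

uniform_values = [uniform_conditional(A, B, n) for n in (2, 5, 100)]
normal_values = [normal_conditional(A, B, sd) for sd in (1.0, 10.0, 1000.0)]
```

Once $[-n, n]$ contains the conditioning interval, the uniform conditional probability equals the Lebesgue ratio exactly, while the normal conditional probabilities only approach it as the standard deviation grows.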

The interpretation of comes from the definition of a conditional probability space as discussed in more detail by Renyi (1970, p.34-38). Given that the law is the probability distribution concentrated on . The interpretation of all these conditional probabilities can be, depending on the situation at hand, in a frequentist sense or in a subjective Bayesian sense. This generalizes to other unbounded laws including the priors and posteriors for the butterfly, Poisson process, and Bernoulli examples in the Introduction. It is most important since it gives the needed interpretation of the mathematical theory in the context of statistical inference. The same interpretation is in particular used for both the prior and the posterior. They are on an equal footing, and this is how uncertainty is represented in the statistical model.

### 4.2 Conditional Rényi state densities

Assume that for $\sigma$-finite measures and . It follows that by choosing . This can be verified directly by the defining equation (14). It follows in particular that this is consistent with the definition of an improper posterior as used by Bioche and Druilhet (2016, p.1716). The previous can also be reformulated simply as

$$f(\theta \mid x) = f(x, \theta) \tag{16}$$

There is no need for a normalization constant since two proportional densities are equivalent when considered as conditional Rényi states. The symbol is used here, and in the following, as a generic symbol for a density and also for conditional densities. The arguments and give the interpretation as different functions.

Let be a function with . In the context here this statement is interpreted as stating that is measurable and that . Similar context dependent interpretations are also used elsewhere, but then without further explanation. It follows then that , and so

$$f(\cdot \mid \theta) = c(\theta)\, f(\cdot \mid \theta) \tag{17}$$

when interpreted as conditional Rényi densities. We will then also write with this interpretation. The resulting equivalence class of conditional densities is the conditional Rényi state density. These observations are special cases of the discussion before Theorem 1 leading to the definition of a conditional Rényi state as an equivalence class.

A formal prior density gives the joint density , and this shows that the interpretation of as prior information is dubious in this case. It is only when the density is normalized that the common procedure of combining a prior density with the model density into a resulting joint density and a posterior density is well defined. In all cases, however, the posterior density is well defined as a conditional Rényi state density from the joint density as in equation (16). In general, the problem with the prior arises when the statistical model itself is allowed to be a conditional Rényi state. The likelihood is not well defined in this case.

A concrete example with an undefined likelihood is discussed by Lavine and Hodges (2012, p.43) and Lindqvist and Taraldsen (2018, p.102). They consider a Gaussian density

$$f(x \mid \theta) = c(\theta) \exp\left(-\frac{\theta\, x^T Q x}{2}\right) \tag{18}$$

with a known precision matrix $Q$. This is an improper density if $Q$ has at least one eigenvalue equal to zero, and then the likelihood is undefined due to the ambiguity introduced by $c(\theta)$. The normalization constant is undefined. A seemingly natural candidate, motivated by the proper model case , is given by , and this was used initially in the computer software WinBUGS (Lindqvist and Taraldsen, 2018, p.102). This choice in WinBUGS was later changed into . It is clear that, in this situation, prior information in the form of a prior density cannot be combined with the given improper model to give a well defined posterior density.

A possible solution is given by restricting to the orthogonal complement of the null space of $Q$. The model density is then proper, and is the correct normalization when is the dimension of the null space of $Q$. In the case considered by Lindqvist and Taraldsen (2018, p.103) this corresponds to a change from a uniform to a point mass distribution at for . More generally, the model in equation (18) can be further specified as a Gaussian distribution for with point masses at components. Anyhow, a well defined posterior requires a well defined joint density $f(x, \theta)$, or more generally as just exemplified, a well defined joint distribution of the data $X$ and the model $\Theta$. Lindqvist and Taraldsen (2018, p.103) obtain a unique normalized posterior only in the case where the data is $\sigma$-finite. Theorem 1 ensures, however, that a unique posterior Rényi state is defined also without requiring a $\sigma$-finite $X$.

A more transparent example is given by letting correspond to Lebesgue measure in the plane. The law of given and the posterior law of given correspond then both to Lebesgue measure on the line. The factorization with is completely arbitrary. This can be interpreted according to Hartigan (1983, p.26) as saying that the marginal law is not determined by the joint law. The choice of a pseudo-law plays a role similar to the role of choosing a marginal law in the theory of Hartigan. The interpretation of Hartigan is discussed in more detail by Taraldsen and Lindqvist (2010), but it differs from the interpretation here. We insist that the marginal law of is uniquely determined from the joint law of and . In the case here it is given by the measure which is not $\sigma$-finite. It follows in particular that the decomposition fails in this case. However, regardless of the choice of a pseudo-law , the decomposition defines uniquely as a conditional Rényi state.

### 4.3 Elementary conditional Rényi states

Let be Lebesgue measure in the plane, and consider the indicator function of the upper half plane: . It follows that so is not $\sigma$-finite. The conditional law is, however, a well defined unique conditional Rényi state. It corresponds to the dominating measure . The conditional law is Lebesgue measure restricted to the upper half-plane and is Lebesgue measure restricted to the lower half plane. This demonstrates directly that the conditional law is also defined when is not $\sigma$-finite.

Consider more generally a random natural number . A dominating measure for is the counting measure on . This gives . Let . The previous gives then the elementary definition of the law

$$P(A \mid B) = P(AB), \quad P(B) > 0 \tag{19}$$

The conditional is not defined from this argument when $P(B) = 0$ since can be arbitrarily specified in this case. The previous is consistent with the familiar $P(A \mid B) = P(AB)/P(B)$ for the case where $0 < P(B) < \infty$. A Rényi state is arbitrary up to multiplication by a positive constant. It is an equivalence class of $\sigma$-finite measures. Theorem 1 gives the existence of conditional expectations in full generality, including this elementary case. The restriction $0 < P(B) < \infty$ for defining $P(A \mid B)$ has here been relaxed to the condition $P(B) > 0$ by Theorem 1.

Stone and Dawid (1972, p.370) consider inference for the ratio of two exponential means. They assume that $X$ and $Y$ are independent exponentially distributed with hazard rates $\theta\phi$ and $\phi$ respectively, so $Z = Y/X$ will have a distribution that only depends on $\theta$. In fact, $Z = \theta F$, where $F$ has a Fisher distribution with $2$ and $2$ degrees of freedom since a standard exponential variable is distributed like a $\chi^2_2/2$ variable. Stone and Dawid (1972) conclude that the density is

$$f(z \mid \theta) = \theta^{-1}(1 + z/\theta)^{-2} = \theta(\theta + z)^{-2} \tag{20}$$

and that the posterior density corresponding to a prior density $\pi(\theta)$ is

$$\pi(\theta \mid z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^2} \tag{21}$$

A second argument considers a joint density for $(x, y, \theta, \phi)$ from a joint prior $\pi(\theta)\,d\theta\,d\phi$. This gives $f(x, y, \theta, \phi) = \pi(\theta)\,\theta\phi^2 e^{-\phi(\theta x + y)}$, and the posterior density of $\theta$ follows by integration over $\phi$ to be

$$\pi(\theta \mid x, y) \propto \frac{\theta\,\pi(\theta)}{(\theta x + y)^3} \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3} \tag{22}$$

Equation (22) gives a posterior given the data that differs from the posterior found in equation (21). This constitutes the argument and paradox presented originally by Stone and Dawid (1972).

We will next reconsider the above example in view of the theory presented in the previous section. This has already been indicated by Taraldsen and Lindqvist (2010), and is discussed in more detail by Lindqvist and Taraldsen (2018). Lindqvist and Taraldsen (2018) rely on a theory where it is only allowed to condition on $\sigma$-finite statistics. We extend this argument now with reference to Theorem 1 which allows conditioning on any statistic.

The initial assumptions are equivalent with a joint distribution given by the density:

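$$f(x, z, \theta, \phi) = \pi(\theta)\, f(x, z \mid \theta, \phi) = \pi(\theta)\,\theta\phi^2 x\, e^{-\phi x(\theta + z)} \tag{23}$$

Integration over $\phi$ gives

$$f(x, z, \theta) = \pi(\theta)\,\theta\, x^{-2}(\theta + z)^{-3} \tag{24}$$

which implies

$$\pi(\theta \mid x, z) = \pi(\theta)\,\theta\, x^{-2}(\theta + z)^{-3} = \pi(\theta)\,\theta\,(\theta + z)^{-3} \tag{25}$$

The second equality holds since it is equality in the sense given by an equivalence class as in Theorem 1. The right hand side can be multiplied by an arbitrary positive function of the conditioning variables without changing the equality sign; this is also why the constant factor $2$ produced by the integration over $\phi$ is dropped in equation (24). Equation (25) is equivalent with equation (22) since there is a one-one correspondence between $(x, y)$ and $(x, z)$.

An alternative is to integrate equation (23) over $x$ to obtain

$$f(z, \theta, \phi) = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{26}$$

which implies

$$\pi(\theta \mid z, \phi) = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{27}$$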

This is similar to equation (21), but the conditioning differs.
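The two marginalisations of equation (23) can be checked numerically. The sketch below is a plain midpoint-rule check with hypothetical helper names; it assumes a constant placeholder prior $\pi(\theta) = 1$, arbitrary test values for $x$, $z$, $\theta$ and $\phi$, and a truncation of the infinite integration range.

```python
import math

def joint(x, z, theta, phi, prior=lambda theta: 1.0):
    """Joint density of equation (23); the prior on theta is a placeholder."""
    return prior(theta) * theta * phi ** 2 * x * math.exp(-phi * x * (theta + z))

def integrate(f, upper, steps=200000):
    """Midpoint rule on [0, upper]; `upper` truncates the infinite range."""
    h = upper / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))

x, z, theta, phi = 1.3, 0.7, 2.0, 0.9

# Integrate out phi: compare with equation (24), up to the constant factor 2.
over_phi = integrate(lambda p: joint(x, z, theta, p), upper=60.0)
eq24 = theta * x ** -2 * (theta + z) ** -3

# Integrate out x: compare with equation (26).
over_x = integrate(lambda u: joint(u, z, theta, phi), upper=60.0)
eq26 = theta * (theta + z) ** -2
```

Integrating out $\phi$ reproduces equation (24) up to the factor $2$, which is immaterial for a Rényi state, and integrating out $x$ reproduces equation (26). The exponents of $(\theta + z)$ differ because the two results condition on different variables.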

Reconsider now the argument leading to equation (21). The first observation was that $Z$ has a distribution that only depends on $\theta$. This is true, but it is still conditionally given both $\theta$ and $\phi$ as assumed initially in the model. Equation (21) and equation (20) are wrong as stated, interpreted as conditional Rényi states, but can be corrected by a replacement of $f(z \mid \theta)$ by $f(z \mid \theta, \phi)$ and $\pi(\theta \mid z)$ by $\pi(\theta \mid z, \phi)$. The error in the original argument, as interpreted in the theory presented here, is that it cannot be concluded that $f(z \mid \theta) = f(z \mid \theta, \phi)$ even though the latter does not depend on $\phi$. Equation (27) is not in conflict with equation (25) for the same reason.

More generally, it can be noted that even if a conditional law does not depend on it can not be concluded that it equals . This is demonstrated by equation (25) and equation (27). The rule holds for probability distributions, and also more generally if and are $\sigma$-finite given that does not depend on . Stone and Dawid (1972) calculated formally as if the rule were generally valid. This resulted in two conflicting results. This example, and the other examples constructed by Stone and Dawid (1972), are most important since they illustrate important differences between the theories of Kolmogorov and Rényi. Stone and Dawid (1972) pointed out that purely formal manipulations with improper distributions, treated as if they obeyed all the rules of proper distributions, could lead to paradoxical inconsistencies, which, by reductio ad absurdum, is an argument against doing such formal computations.

Observations can give rejection of a simple hypothesis at a conventional significance level, while a Bayesian analysis gives the same hypothesis a high posterior probability. Lindley (1957) discussed this seemingly paradoxical phenomenon with reference to previous work by Jeffreys (1939). Both Berger (1985, p.148-156) and Robert (2007, p.230-236) give thorough discussions of the problem of testing a point null hypothesis, and explain that the use of improper priors is a delicate issue in this case. This has also been emphasized in several discussion papers (Shafer, 1982; Berger and Sellke, 1987; Berger and Delampady, 1987; Robert et al., 2009; Robert, 2014). A full discussion of this problem in the context of the theory of Rényi will not be given here, but we will indicate some consequences and observations.

The most important is to note that any prior, improper or not, contains information. We agree with Robert (2007, p.29) that it is a mistake to think of improper priors as representing ignorance. This is particularly important when testing a point null hypothesis, which in most situations implies a non-symmetric treatment of the hypothesis and the alternative hypothesis. The relevance of the information is specific to each particular case with its own interpretation. Rényi explained that improper laws can be interpreted in terms of the associated family of conditional probabilities. This holds for both prior and posterior laws, and also so in a hypothesis testing problem.

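Assume that $x \sim N(\theta, \sigma^2)$ with unknown mean $\theta$ and known variance $\sigma^2$ so $f(x \mid \theta) = (2\pi\sigma^2)^{-1/2} \exp\left(-(x-\theta)^2/(2\sigma^2)\right)$. Consider the hypothesis $\theta = 0$ versus the alternative $\theta \neq 0$. This basic hypothesis testing problem is often the first example of hypothesis testing presented to statistics students using the notation $H_0: \theta = 0$ versus $H_1: \theta \neq 0$. Our notation identifies the hypothesis and its alternative more explicitly with a partition of the model parameter space.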

The uniformly most powerful unbiased level $\alpha$ test rejects $\theta = 0$ if (Casella and Berger, 1990, p.374)

$$t = \phi(x) = 2\Phi(-|x/\sigma|) \le \alpha \tag{28}$$

where $\Phi(z) = \Pr(Z \le z)$ with $Z \sim N(0, 1)$. The test statistic $t$ is the p-value. It is a probability, but it must not be confused with the posterior probability of $\theta = 0$ given the data. The posterior probability is undetermined in this classical analysis.
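The test in equation (28) can be sketched numerically. The function names `p_value` and `rejects` are illustrative choices, not notation from the paper; the sketch uses the identity $2\Phi(-|z|) = \operatorname{erfc}(|z|/\sqrt{2})$.

```python
import math


def p_value(x: float, sigma: float) -> float:
    """Two-sided p-value t = 2 * Phi(-|x / sigma|) from equation (28)."""
    # For the standard normal CDF, 2 * Phi(-|z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(x / sigma) / math.sqrt(2.0))


def rejects(x: float, sigma: float, alpha: float) -> bool:
    """Classical decision rule: reject theta = 0 when the p-value is at most alpha."""
    return p_value(x, sigma) <= alpha


if __name__ == "__main__":
    # Illustrative observations; the values are not taken from the paper.
    for x in (0.5, 1.9, 2.0, 3.0):
        print(x, p_value(x, 1.0), rejects(x, 1.0, 0.05))
```

The p-value depends on the data only through the ratio $x/\sigma$, so it is unchanged by a change of measurement unit.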

Consider next a Bayesian analysis with a prior density $\pi$ with respect to the measure $\delta_0(d\theta) + d\theta$. The Dirac measure $\delta_0$ is dimensionless, and it is hence assumed that $x$, $\theta$, and $\sigma$ are dimensionless in the following. A Bayesian test with minimal posterior risk rejects $\theta = 0$ if the posterior probability of $\theta = 0$ given the data is small (Berger, 1985, p.164)

$$s = \phi_\pi(x) = \pi(0 \mid x) = \left[1 + \frac{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}{f(x \mid 0)\,\pi(0)}\right]^{-1} \le \frac{L_{II}}{L_{II} + L_I} \tag{29}$$

Here $L_I$ is the loss corresponding to a type I error, $L_{II}$ is the loss corresponding to a type II error, and the loss is zero otherwise. The classical and Bayesian tests are similar in form, but the Bayesian test statistic $s$ depends on the prior density $\pi$. We have here restricted attention to the case where the posterior is proper, and the above integral is then finite.
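The threshold in equation (29) can be recovered by comparing posterior expected losses. The following is a short sketch of that step, assuming a proper posterior, with $s$ the posterior probability of the hypothesis and $L_I$, $L_{II}$ the losses of a type I and a type II error:

```latex
\begin{aligned}
\mathrm{E}[\text{loss} \mid x,\ \text{reject}] &= L_{I}\, s, \\
\mathrm{E}[\text{loss} \mid x,\ \text{accept}] &= L_{II}\,(1 - s), \\
L_{I}\, s \le L_{II}\,(1 - s)
  &\iff s \le \frac{L_{II}}{L_{II} + L_{I}} .
\end{aligned}
```

Rejection therefore minimizes the posterior risk exactly when $s$ is at most the stated ratio of losses.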

Consider first the constant prior . This gives

$$\phi_\infty(x) = \left[1 + f(x \mid 0)^{-1}\right]^{-1} = \left[1 + \sqrt{2\pi}\,\sigma \exp\left(\frac{x^2}{2\sigma^2}\right)\right]^{-1} \tag{30}$$

Figure 3: Two test statistics for testing $H_0: \theta = 0$ versus $H_1: \theta \neq 0$ based on $x \sim N(\theta, \sigma^2)$.

Figure 3 shows the remarkable similarity between the p-value and the posterior probability for the case $\sigma = 1$. Robert (2007, p.234) also notes this similarity in his Table 5.2.5, and discusses this phenomenon. This similarity, and more generally the close resemblance in practice between Bayesian and classical methods for many common statistical problems, is a common theme in the early fundamental texts on Bayesian statistics (Jeffreys, 1939; Lindley, 1965; Savage, 1954).
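The comparison between the p-value of equation (28) and the posterior probability of equation (30) can be checked numerically. The sketch below assumes $\sigma = 1$ and evaluates both statistics at a few conventional critical values, which are chosen here only for illustration. The function names are not from the paper.

```python
import math


def p_value(x: float, sigma: float) -> float:
    """Two-sided p-value 2 * Phi(-|x / sigma|), equation (28)."""
    return math.erfc(abs(x / sigma) / math.sqrt(2.0))


def posterior_flat(x: float, sigma: float) -> float:
    """Posterior probability of theta = 0 under the constant prior, equation (30)."""
    inv_f0 = math.sqrt(2.0 * math.pi) * sigma * math.exp(x**2 / (2.0 * sigma**2))
    return 1.0 / (1.0 + inv_f0)


if __name__ == "__main__":
    sigma = 1.0  # assumption: the case where the two statistics are compared
    for x in (1.645, 1.960, 2.576):
        print(f"x={x:.3f}  p-value={p_value(x, sigma):.4f}  "
              f"posterior={posterior_flat(x, sigma):.4f}")
```

At these three points the two statistics differ by less than $0.01$ in absolute value.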

A common way to justify usage of improper priors is to consider limits of proper priors. Consider the sequence of proper prior densities defined by and for . Let and for . Define also and for . If the variance , then intuitively as densities since a normally distributed variable with infinite variance should correspond to a variable with a constant density. This convergence is in fact true if interpreted as densities with respect to in the sense of $q$-vague convergence (Bioche and Druilhet, 2016). It seems hence reasonable to take the sequence of proper densities as an approximation of the improper density . The point mass at is then fixed, and the densities for approximate the flat density. Equation (29) gives, however, that , since . This is in conflict with . The source of the problem is that the generalized density with respect to does not converge to the density as intuition would suggest, but instead as explained by Bioche and Druilhet (2016).

The previous discussion can be used to illustrate the Jeffreys-Lindley paradox. Assume that $x$ is statistically significantly different from $0$ at the level of significance $\alpha$ in the sense of $t \le \alpha$ as defined by equation (28). The convergence ensures, however, that the posterior probability for a prior with a sufficiently large . This can, of course, only be considered to be paradoxical in a situation where the prior is reasonable. Consider instead a symmetric proper prior of the form and for where is non-increasing in . The critical values , and for correspond to the p-values , and as also indicated in Figure 3. The corresponding posterior values are, however, bounded from below by , , and for any prior of the given form (Berger, 1985, p.154, Table 4.4). The conclusion is that a large class of reasonable symmetric proper priors gives a posterior probability much larger than the classical p-value. This is an even more striking illustration of the Jeffreys-Lindley paradox.
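The mechanism behind the paradox can be illustrated with a standard textbook variant. The sketch below does not reproduce the sequence of priors used in the text. It assumes instead a prior that puts mass `p0 = 1/2` on $\theta = 0$ and spreads the remaining mass as $N(0, \tau^2)$ on the alternative; `p0`, `tau`, and the function names are hypothetical choices made for this illustration only.

```python
import math


def normal_pdf(x: float, var: float) -> float:
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def posterior_null(x: float, sigma: float, tau: float, p0: float = 0.5) -> float:
    """P(theta = 0 | x) when x ~ N(theta, sigma^2).

    Assumed prior: mass p0 at theta = 0 and (1 - p0) * N(0, tau^2) elsewhere.
    """
    f0 = normal_pdf(x, sigma**2)            # likelihood under theta = 0
    m1 = normal_pdf(x, sigma**2 + tau**2)   # marginal density under the alternative
    return p0 * f0 / (p0 * f0 + (1.0 - p0) * m1)


if __name__ == "__main__":
    x, sigma = 1.96, 1.0  # an observation with two-sided p-value close to 0.05
    for tau in (1.0, 10.0, 100.0, 1000.0):
        print(f"tau={tau:7.1f}  P(theta=0 | x)={posterior_null(x, sigma, tau):.4f}")
```

With the point mass held fixed, the posterior probability of the hypothesis approaches one as the prior variance on the alternative grows, even though the observation is significant at roughly the $5\%$ level.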

Consider again the improper prior density with respect to . The value of the constant is of no concern, as we know from the general theory, and also explicitly from equation (30). Equation (30) has, however, a dependency on $\sigma$ that is a concern. It follows that the prior corresponds effectively to two different priors if a concrete problem is formulated first in terms of one measurement scale and then alternatively in terms of a different measurement scale. The p-value does not share this defect, since it only depends on the scaled variable $x/\sigma$ as given in equation (28). An alternative prior is obtained by reformulating the original problem in scaled variables and using the prior for this problem. Transforming back gives the improper density and for . The result is a posterior probability that only depends on $x/\sigma$, and it is remarkably close to the p-value as shown in Figure 3 for all values of $\sigma$.
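The scale dependence can be made concrete with a small numerical sketch. It assumes that a change of measurement unit multiplies both $x$ and $\sigma$ by a factor `k`, and that the alternative prior amounts to applying equation (30) to the scaled problem with unit variance. The function names and the factor `k` are illustrative.

```python
import math


def p_value(x: float, sigma: float) -> float:
    """Two-sided p-value 2 * Phi(-|x / sigma|), equation (28)."""
    return math.erfc(abs(x / sigma) / math.sqrt(2.0))


def posterior_flat(x: float, sigma: float) -> float:
    """Equation (30): constant prior in the original measurement scale."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.pi) * sigma
                  * math.exp(x**2 / (2.0 * sigma**2)))


def posterior_scaled(x: float, sigma: float) -> float:
    """Equation (30) applied to the scaled observation x / sigma with unit variance."""
    return posterior_flat(x / sigma, 1.0)


if __name__ == "__main__":
    x, sigma = 1.96, 1.0
    for k in (1.0, 10.0, 100.0):  # k models a change of measurement unit
        print(f"k={k:6.1f}  p={p_value(k * x, k * sigma):.4f}  "
              f"flat={posterior_flat(k * x, k * sigma):.5f}  "
              f"scaled={posterior_scaled(k * x, k * sigma):.4f}")
```

The p-value and the scaled posterior are unchanged by the factor `k`, while the posterior from equation (30) shrinks as `k` grows.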

It was seen above that a seemingly reasonable approximation by proper priors failed. An alternative sequence of proper priors is obtained by using the interpretation Rényi gives for the prior corresponding to the density . Let for . The conditional probability is then given by the proper density defined by and for . In this case , and this gives in particular a proper prior with a posterior that approximates the p-value as shown in Figure 3. The appropriateness of a prior of this form cannot be decided in general, but must be decided in each concrete case.

Consider finally a concrete problem where it is assumed that the prior density gives a reasonable prior for . Assume that a measurement is done and is observed. In this case, the classical and Bayesian procedures are very similar if . Assume next that the experimenter chooses to repeat the measurement more times. The prior information is, of course, not changed by this decision, so the prior is still given by . A sufficient statistic is given by the empirical mean . The classical p-value and the posterior probability are in this case very different if the number of measurements is large. It follows in particular that as for all fixed , and the Jeffreys-Lindley paradox reappears. We see this, in fact, not as a paradox, but as a most important and striking demonstration of an important difference between Bayesian and classical inference.
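The effect of repeated measurements can be sketched numerically. The prior used in the text is not recoverable here, so the sketch assumes the constant prior of equation (30) and applies that equation to the empirical mean, which has standard deviation $\sigma/\sqrt{n}$. The symbols `n` and `z` and the function name are assumptions made for this illustration.

```python
import math


def posterior_flat_mean(z: float, sigma: float, n: int) -> float:
    """Equation (30) applied to the mean of n measurements.

    z is the standardized mean sqrt(n) * xbar / sigma, so the two-sided
    p-value 2 * Phi(-|z|) stays fixed while n varies.
    """
    se = sigma / math.sqrt(n)  # standard deviation of the empirical mean
    return 1.0 / (1.0 + math.sqrt(2.0 * math.pi) * se * math.exp(z**2 / 2.0))


if __name__ == "__main__":
    z, sigma = 1.96, 1.0  # fixed p-value close to 0.05
    for n in (1, 100, 10**4, 10**6):
        print(f"n={n:8d}  P(theta=0 | data)={posterior_flat_mean(z, sigma, n):.4f}")
```

Under these assumptions the posterior probability of the hypothesis tends to one as `n` grows while the p-value is held fixed.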

### 4.6 Hypothesis testing with improper posteriors

The possibility of improper posteriors was not considered in the previous discussion of the Jeffreys-Lindley paradox. It was, in fact, demonstrated that there exists a proper prior so that the classical and the Bayesian decision rules essentially coincide as shown in Figure 3. This proper prior appears naturally from the Rényi interpretation of a corresponding improper prior in terms of a family of conditional probabilities. It was also noted that this improper prior can be approximated arbitrarily well by a sequence of proper priors in a natural topology for Rényi states given by q-vague convergence (Bioche and Druilhet, 2016).

Another observation is that a classical matching prior is typically improper. DeGroot (1973) demonstrates, by an elegant argument, how a matching prior can be determined for a different problem. A classical matching prior is here defined to be a prior such that the posterior coincides with the p-value. The prior is only approximately matching as shown in Figure 3. A matching prior, if it exists, is determined by the integral equation that follows by equating $t$ and $s$ in equation (28) and equation (29). We will not discuss this further here, but observe that a solution is given explicitly by an inverse Fourier transformation.
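Written out in the notation of equations (28) and (29), the matching condition takes the following form. The rearrangement is a sketch added for illustration and is not taken from the paper.

```latex
2\Phi(-|x/\sigma|)
  = \left[ 1 + \frac{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}
                    {f(x \mid 0)\,\pi(0)} \right]^{-1}
\iff
\int f(x \mid \theta)\,\pi(\theta)\,d\theta
  = \pi(0)\, f(x \mid 0) \left[ \frac{1}{2\Phi(-|x/\sigma|)} - 1 \right]
```

In the normal location model $f(x \mid \theta)$ depends only on $x - \theta$, so the left-hand side is a convolution of the prior density with the normal density, which is consistent with a solution by Fourier inversion.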

The butterfly, Poisson process, and Bernoulli examples in the Introduction can be used to exemplify a hypothesis testing problem with an improper posterior. Consider, instead, testing of versus based on observing with unknown . This problem, but with known variance , is considered by Berger (1985, p.147-148). He notes that a constant prior corresponds to an infinite mass for both hypotheses, but argues that this can be tackled by consideration of increasingly larger intervals. The essence of the following argument is that the same approach should in principle be equally possible for the posterior. This is given by the general interpretation of any Rényi state by its corresponding family of conditional probabilities.

Assume that the prior density is with respect to . The posterior is then improper, and given by the density