 # Improper posteriors are not improper

In 1933 Kolmogorov constructed a general theory that defines the modern concept of conditional expectation. In 1955 Rényi formulated a new axiomatic theory for probability motivated by the need to include unbounded measures. We introduce a general concept of conditional expectation in Rényi spaces. In this theory improper priors are allowed, and the resulting posterior can also be improper. In 1965 Lindley published his classic text on Bayesian statistics using the theory of Rényi, but retracted this idea in 1973 due to the appearance of marginalization paradoxes presented by Dawid, Stone, and Zidek. The paradoxes are investigated, and the seemingly conflicting results are explained. The theory of Rényi can hence be used as an axiomatic basis for statistics that allows use of unbounded priors.

Keywords: Haldane's prior; Poisson intensity; marginalization paradox; measure theory; conditional probability space; axioms for statistics; conditioning on a sigma field; improper prior

## 1 Introduction

An often voiced criticism of the use of improper priors in Bayesian inference is that such priors sometimes don’t lead to a proper posterior distribution. This can happen when the marginal prior distribution of the data is not $\sigma$-finite (Taraldsen and Lindqvist, 2010), as sometimes encountered in applied settings with sparse data (e.g. Druilhet et al., 2016; Tufto et al., 2012, Appendix S4).

As a simple motivating example, suppose that we observe a homogeneous Poisson process, and that we start with a non-informative scale prior $\pi(\lambda) \propto 1/\lambda$ on the Poisson intensity $\lambda$. The distribution of the number of events $X$ in the interval $(0, t]$ is then not $\sigma$-finite since the marginal mass $\int_0^\infty \lambda^{-1} e^{-\lambda t}\, d\lambda$ of the outcome $X = 0$ is infinite. If we observe $X = 0$ and formally multiply the prior by the likelihood we obtain an improper posterior $\pi(\lambda \mid X = 0) \propto \lambda^{-1} e^{-\lambda t}$. This distribution for $\lambda$ is different from the initial prior, and we claim that this is a correct way of incorporating the information given by $X = 0$.

A related example is the Beta posterior density for the success probability $p$ given by $\pi(p \mid x) \propto p^{x-1}(1-p)^{n-x-1}$ for a Bernoulli sequence with $x$ successes out of $n$ trials. This corresponds to the Haldane (1932) improper prior $\pi(p) \propto p^{-1}(1-p)^{-1}$, and the posterior is improper if $x$ or $n - x$ is zero. In all cases, however, the observation of the number of successes results in a corresponding updating of the uncertainty associated with $p$. This is given by the possibly improper posterior.
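The propriety condition in the Bernoulli example can be sketched in a few lines. This is only an illustration under stated assumptions: the Haldane prior is taken as the kernel `p**(-1) * (1 - p)**(-1)`, and the function name `haldane_posterior` and its arguments are hypothetical names introduced here, not notation from the paper.

```python
def haldane_posterior(successes, trials):
    """Return Beta-kernel exponents and a propriety flag under the Haldane prior.

    Assumed posterior kernel: p**(a - 1) * (1 - p)**(b - 1) with
    a = successes and b = trials - successes. The kernel has finite total
    mass on (0, 1) exactly when both a > 0 and b > 0.
    """
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    a = successes
    b = trials - successes
    return a, b, (a > 0 and b > 0)


# Hypothetical data: the posterior is improper only at the two boundary cases.
for s, n in [(0, 5), (2, 5), (5, 5)]:
    a, b, proper = haldane_posterior(s, n)
    print(f"{s} successes in {n} trials: a={a}, b={b}, proper={proper}")
```

In each case the exponents change with the data, so the observation updates the uncertainty even when the resulting kernel cannot be normalized.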

Unfortunately, even people accepting the use of improper priors reject this form of inference, on the ground that the posterior is not a probability distribution, and a mathematical theory is lacking for this (Robert et al., 2009). This is understandable, and we agree initially with this point of view. We will demonstrate, however, that the above forms of, so far, formal inference can be made consistent with the axiomatic system of Rényi which allows improper laws. We propose and claim that the mathematical theory developed in the following gives a rigorous foundation for inference with unbounded laws.

The most familiar example of an unbounded law is the uniform law on the real line . Following Renyi (1970) and Taraldsen and Lindqvist (2016) the uniform law is identified with the countable collection of uniform laws on each interval . This gives then also the interpretation of : Given that the law is the uniform probability distribution on . The family is a bunch and the family defines a Rényi space. The improper laws for the intensity and the success probability in the initial examples are interpreted similarly. The concept of a Rényi space and other elements from measure theory are summarized in Appendix A.

The aim of this paper is to present a theory of statistics that allows improper laws both as priors and posteriors. This extends the results of Taraldsen and Lindqvist (2010, 2016), and provides stronger links to the results on improper laws presented by Hartigan (1983) and Bioche and Druilhet (2016). The main mathematical result is Theorem 1 which proves existence and uniqueness of conditional expectation on Rényi spaces. The existence and uniqueness proof relies on the Radon-Nikodym theorem and a generalization of the Rényi structure theorem. The theory of conditional expectation has been most important for the development of measure theory, probability and statistics based on Kolmogorov’s concept of a probability space. The generalization of this to the setting of Rényi spaces can hence be expected to be important for future developments in mathematics and related fields.

Within this framework we reach the view that improper posteriors, just as improper priors, are not ‘improper’ but may reflect complete or partial ignorance about a parameter after conditioning on the data. Returning to the above Poisson-process example, at time , we have clearly learned something about in that our belief in large values of the Poisson intensity has decreased while our relative degree of belief in small values of has remained approximately unchanged. That the posterior is improper does not imply that our prior was wrong, but only that more data perhaps needs to be collected if possible. Proceeding by using the improper posterior at time as prior in subsequent inference, say based on the number of occurrences observed in a sufficiently long subsequent interval , we indeed eventually reach the same proper final posterior as the one reached by combining the initial scale prior and the likelihood for the data on . We hope that the reader can appreciate that this argument indicates also the potential philosophical importance of unbounded laws more generally.
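The sequential-updating argument for the Poisson example can be sketched as follows. The sketch assumes that the scale prior is the kernel `1/lam`, and that a count observed over an interval contributes the Poisson likelihood factor `lam**count * exp(-duration * lam)`. The class `GammaKernel`, its fields, and the numbers used are hypothetical illustrations, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GammaKernel:
    """Unnormalized density lam**(shape - 1) * exp(-rate * lam) on lam > 0."""

    shape: float
    rate: float

    def update(self, count, duration):
        """Multiply by the Poisson likelihood lam**count * exp(-duration * lam)."""
        return GammaKernel(self.shape + count, self.rate + duration)

    @property
    def is_proper(self):
        # Finite total mass needs integrability at 0 (shape > 0)
        # and at infinity (rate > 0).
        return self.shape > 0 and self.rate > 0


scale_prior = GammaKernel(shape=0, rate=0)                  # kernel 1/lam
after_first = scale_prior.update(count=0, duration=1.0)     # no events yet
after_second = after_first.update(count=3, duration=4.0)    # later interval
batch = scale_prior.update(count=3, duration=5.0)           # all data at once

print(after_first, after_first.is_proper)
print(after_second, after_second.is_proper)
print(after_second == batch)
```

Under these assumptions the intermediate kernel is improper but still differs from the prior, and using it as the prior for the later interval gives the same final kernel as processing all the data in one step.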

The most influential initial work on Bayesian inference is given by the book of Jeffreys (1939). Parts of his arguments were mainly intuitive, and there is a lack of mathematical rigor as also observed by Robert et al. (2009). The needed mathematical theory for a rigorous reformulation of the original arguments of Jeffreys (1939) is presented next.

## 2 Existence and uniqueness of conditional expectation

Taraldsen and Lindqvist (2010, 2016) define the posterior law for the case where the data is $\sigma$-finite. The aim now is to prove existence and uniqueness of a posterior law without assuming that the data is $\sigma$-finite. It will be convenient to do this as an extension of conventional measure theory, and then get the result for the posterior law as a special case. The reader is advised to consult Appendix A for the definition of a Rényi space and other elements from measure theory if needed.

Let be a measurable space and let be a measure space (Rudin, 1987). Let a measurable function be given for each . We define to be a strong random measure on with respect to the law if the following holds for all disjoint measurable

1. for almost all .

2. for almost all .

The notation denotes the union of two disjoint and measurable sets. Equality for almost all means that equality holds for all in a set where . All functions of here and in the following are assumed only to be defined almost everywhere, and the set corresponding to depends on .

If there exists with and , then is said to be $\sigma$-finite. If, additionally, is a measure on for all , then is a random measure. In this paper we define the space X to be regular if every $\sigma$-finite strong random measure can be represented by a random measure. The Borel $\sigma$-algebra of a complete separable metric space X gives a regular space.

The concept of a strong random measure is here introduced similarly to how Skorohod (1984, p.1-2) introduces the concept of a strong random operator. He also defines the notion of a weak random operator by duality, but this is equivalent to strong when the image space is the complex numbers. It should be noted that the naming convention here is counterintuitive in the sense that a strong random operator is a weaker concept than a random operator, but there are good reasons for adopting the conventions of Skorohod.

Let be a measure on X and let be measurable. We define a strong random measure to be the conditional law of given if

$$\mu(A\,[T \in C] \mid B) = \int_C \mu^t(A \mid B)\, \mu_T(dt \mid B) \tag{1}$$

for all measurable and all with , where , , and (no normalization here!). The notation is similar to the notation used by Doob (1953, p.1). For a conditional law we use the notation . The double use of the symbol in the above is justified by:

###### Theorem 1.

A unique $\sigma$-finite conditional law exists if is a $\sigma$-finite measure on X and is measurable.

###### Proof.

If is measurable, then shows that uniqueness up to multiplication by a positive function is the best possible uniqueness. The measure is dominated by the $\sigma$-finite measure , so a unique normalized strongly random follows from the Radon-Nikodym theorem. It remains to prove that for a $\sigma$-finite strong random measure .

Let with and , and define . It follows that and . For put and define for a general .

It must be proved that the above construction is well defined. Let . It must be proved that (*) . Observe first that gives . The (*) claim follows from

$$\mu(A\,(T \in D)) = \int_D \mu^t(A \mid B_n)\, \mu^t(B_n \mid B_m)\, \mu_T(dt \mid B_m) = \int_D \mu^t(A \mid B_m)\, \mu_T(dt \mid B_m) \tag{2}$$

since follows from the case in equation (2). This defines a unique . The remaining claims follow by verification and are left to the reader. ∎

All of the previous can be repeated with a replacement of the measurable set with a positive measurable function . Conditional expectation of complex valued functions can then be defined by decomposition in positive and negative parts and then in real and imaginary parts. Consideration of the dual space gives conditional expectation of a separable Banach space valued . The conditional expectation is in particular well defined when is a separable Hilbert space valued. Separability is assumed to ensure almost everywhere definition on T.

An alternative approach, as noted by Kolmogorov (1933, p.54, eq.10), is to define the conditional expectation by integration with respect to the conditional probability. For B equal to the set of real numbers or the set of complex numbers the alternative approach gives the same strong random linear functional, but for more general B there are many alternative routes with different results.

Conditional expectation with respect to a $\sigma$-field is defined by and . It can be noted that we define directly instead of more indirectly by first defining as is more common. This has the advantage of allowing a completely general measurable space T, whereas the common approach requires separability properties of X according to Schervish (1995, p.616, Prop.B.24).

It can finally be observed that the proof of Theorem 1 contains a proof of a structure theorem for strong random Rényi spaces defined by a consistent family of strong random conditional probabilities for for a fixed bunch : The family is generated by a strong random measure such that . The consistency requirement is that implies and

$$\mu^t(A \mid B_1) = \frac{\mu^t(A B_1 \mid B_2)}{\mu^t(B_1 \mid B_2)} \tag{3}$$

## 3 Examples

### 3.1 Mathematical statistics

A statistical model is given by the structure

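$$(\Omega, \mathcal{E}, \mathrm{P}): \quad \Omega \xrightarrow{\ \Theta\ } \Omega_\Theta \xrightarrow{\ \psi\ } \Omega_\Gamma, \qquad \Omega \xrightarrow{\ X\ } \Omega_X \xrightarrow{\ \phi\ } \Omega_Y, \qquad \Gamma = \psi \circ \Theta, \quad Y = \phi \circ X \tag{4}$$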

In conventional theory (Schervish, 1995) the space is a probability space. In the more general setting of a Rényi space considered here the underlying law is a conditional probability law with a corresponding bunch . The law of the data given the parameter is defined by . The law of the model parameter given the data is defined by . The posterior law of a parameter and the law of a statistic are determined by this and equation (4).

In previous work by Taraldsen and Lindqvist (2010, 2016) and Lindqvist and Taraldsen (2017) it was required that the data is $\sigma$-finite, but Theorem 1 shows that this requirement is not needed. There is, however, a price: The posterior must in general be interpreted as an improper law. The initial examples demonstrate, however, that even if the data is not $\sigma$-finite, it may happen that the posterior is a proper distribution for some values of . Assuming that is $\sigma$-finite ensures that the posterior is always proper.

A similar comment holds for the law of the data . The factorization holds uniquely if and only if is $\sigma$-finite. In this case, a $\sigma$-finite is required if the most common Bayesian recipe is to be used:

1. Specify a statistical model law

2. Specify a prior

3. Compute the posterior

If is not $\sigma$-finite, the first two steps must be replaced by a direct specification of a joint $\sigma$-finite law , and Theorem 1 ensures then that the posterior is uniquely defined.

### 3.2 Densities

Assume that . It follows that , and that . This can be verified directly by the defining equation (1). It follows in particular that this is consistent with the definition of an improper posterior used by Bioche and Druilhet (2016, p.1716). The previous can also be reformulated as : There is no need for a normalization constant since two proportional densities are equivalent!

Let be an otherwise arbitrary measurable function. It follows then that when interpreted as a strong random conditional law. A formal prior density gives the joint density , and this shows that the interpretation of as prior information is dubious in this case as pointed out by Lavine and Hodges (2012) and Lindqvist and Taraldsen (2017) using a model with intrinsic conditional auto-regressions. It is the resulting joint distribution and the conditional laws that can be interpreted. The usual decomposition in a prior and a model can only be interpreted uniquely if the prior for the model parameter is $\sigma$-finite. Conversely, Lindqvist and Taraldsen (2017) obtain a posterior only in the case where the data is $\sigma$-finite, but Theorem 1 ensures that a posterior is defined also without requiring a $\sigma$-finite .

A concrete simple example is given by letting correspond to Lebesgue measure in the plane. The law of given and the posterior law of given correspond then both to Lebesgue measure on the line. The factorization with is completely arbitrary. This can be interpreted according to Hartigan (1983, p.26) as saying that the marginal distribution is not determined by the joint distribution. This interpretation is discussed in more detail by Taraldsen and Lindqvist (2010), but it differs from the interpretation here. The marginal law of is unique, but given by the non-$\sigma$-finite measure . It follows in particular that the decomposition fails in this case.

Let be Lebesgue measure in the plane, and consider the indicator function of the upper half plane: . It follows that so is not $\sigma$-finite. The conditional law is, however, well defined. The conditional law is Lebesgue measure restricted to the upper half-plane and is Lebesgue measure restricted to the lower half plane. This demonstrates directly that the conditional law is also defined when is not $\sigma$-finite.

Consider more generally a function . This gives and then also the elementary definition of the law for any with . This is consistent with the familiar for the case where . A law is arbitrary up to multiplication by a positive constant: It is an equivalence class of $\sigma$-finite measures. Theorem 1 gives the existence of conditional expectations in full generality, including this elementary case.

Stone and Dawid (1972, p.370) consider inference for the ratio of two exponential means. They assume that $x$ and $y$ are independent exponentially distributed with hazard rates $\theta\phi$ and $\phi$ respectively. It is then clear that $z = y/x$ will have a distribution that only depends on $\theta$. In fact, $z = \theta F$, where $F$ has a Fisher distribution with $2$ and $2$ degrees of freedom since a standard exponential variable is distributed like a $\chi^2_2/2$ variable. It follows then that the density is

$$f(z \mid \theta) = \theta^{-1}(1 + z/\theta)^{-2} = \theta(\theta + z)^{-2} \tag{5}$$

The posterior density corresponding to a prior density $\pi(\theta)$ is then

$$\pi(\theta \mid z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^2} \tag{6}$$

A different argument goes as follows. The joint density with a joint prior gives The marginal posterior of follows by integration over which gives

$$\pi(\theta \mid x, y) \propto \frac{\theta\,\pi(\theta)}{(\theta x + y)^3} \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3} \tag{7}$$

Equation (7) gives a posterior given the data that differs from the posterior found in equation (6). This constitutes the argument and paradox presented originally by Stone and Dawid (1972, p.370).

A range of similar paradoxes were presented later by Dawid et al. (1973) with discussion of links to fiducial inference. They claim that the Fraser (1968) theory of structural inference is intrinsically paradoxical under marginalization. Furthermore, Lindley, in his discussion of the paper (Dawid et al., 1973, p.218) writes:

The paradoxes displayed here are too serious to be ignored and impropriety must go. Let me personally retract the ideas contained in my own book.

This is of particular relevance here since in 1964, in his book, Lindley (1980, p.xi) wrote:

The axiomatic structure used here is not the usual one associated with the name of Kolmogorov. Instead one based on the ideas of Rényi has been used.

We argue here and in the following that Lindley’s initial intuition was correct: The theory of Rényi gives a mathematical foundation for statistics that allows unbounded measures.

We disagree with the criticism of Fraser’s structural inference, but more importantly we will next explain that there is no paradox related to the above problem when it is treated within the theory of Rényi. This has already been indicated by Taraldsen and Lindqvist (2010), and is discussed in more detail by Lindqvist and Taraldsen (2017). Lindqvist and Taraldsen (2017) rely on a theory where it is only allowed to condition on $\sigma$-finite statistics. We extend this argument now with reference to Theorem 1 which allows conditioning on any statistic.

The initial assumptions are interpreted to imply a joint distribution given by the density:

$$f(x, z, \theta, \phi) = \pi(\theta)\,\theta\,\phi^2 x\, e^{-\phi x(\theta + z)} \tag{8}$$

Taraldsen and Lindqvist (2010) explain that any marginal is determined by a joint density and integration over $\phi$ gives then

$$f(x, z, \theta) = \pi(\theta)\,\theta\,x^{-2}(\theta + z)^{-3} \tag{9}$$

which implies

$$f(\theta \mid x, z) = \pi(\theta)\,\theta\,x^{-2}(\theta + z)^{-3} = \pi(\theta)\,\theta\,(\theta + z)^{-3} \tag{10}$$

The second equality holds since it is equality in the sense given by an equivalence class as in Theorem 1. The right hand side can be multiplied by an arbitrary positive function without changing the equality sign. Equation (10) is equivalent to equation (7) since there is a one-one correspondence between $(x, y)$ and $(x, z)$.

An alternative is to integrate over $x$ to obtain

$$f(z, \theta, \phi) = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{11}$$

which implies

$$f(\theta \mid z, \phi) = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{12}$$

This is similar to equation (6), but it is different since equation (6) only conditions on $z$. Equation (12) is not in conflict with equation (10) for the same reason.
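The two marginalizations leading to equations (9) and (11) can be checked numerically. The following is a minimal sketch, assuming the joint density of equation (8) with the common factor $\pi(\theta)$ dropped. The helper names and the parameter values are hypothetical, and the infinite integration range is truncated where the exponential factor is negligible.

```python
import math


def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def joint(x, z, theta, phi):
    """Equation (8) without the prior factor pi(theta)."""
    return theta * phi**2 * x * math.exp(-phi * x * (theta + z))


def integrate_out_phi(x, z, theta):
    # Truncate at 60 decay lengths of exp(-phi * x * (theta + z)).
    scale = x * (theta + z)
    return simpson(lambda phi: joint(x, z, theta, phi), 0.0, 60.0 / scale)


def integrate_out_x(z, theta, phi):
    # Truncate at 60 decay lengths of exp(-x * phi * (theta + z)).
    scale = phi * (theta + z)
    return simpson(lambda x: joint(x, z, theta, phi), 0.0, 60.0 / scale)


x, z, theta, phi = 1.5, 0.5, 2.0, 0.7   # arbitrary illustrative values
print(integrate_out_phi(x, z, theta), 2 * theta * x**-2 * (theta + z)**-3)
print(integrate_out_x(z, theta, phi), theta * (theta + z)**-2)
```

The first integral reproduces the kernel of equation (9) up to the constant factor 2, which is immaterial for an equivalence class of densities. The second reproduces the kernel of equation (11), and it does not depend on $\phi$.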

Starting with either equation (9) or equation (11) gives

$$f(z, \theta) = \infty \cdot \pi(\theta) \tag{13}$$

which shows that neither nor can be represented by a $\sigma$-finite measure. This implies that the argument in equations (5-6) is wrong given equation (8). Equation (7), or equivalently equation (10), gives the correct posterior distribution for .

If, instead, the prior $\pi(\theta)\,\phi^{-1}$ is used, then the result will be

$$f(\theta \mid z, \phi) = \phi^{-1}\pi(\theta)\,\theta\,(\theta + z)^{-2} = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{14}$$

and

$$f(\theta \mid x, z) = \pi(\theta)\,\theta\,x\,(x(\theta + z))^{-2} = \pi(\theta)\,\theta\,(\theta + z)^{-2} \tag{15}$$

The conditionals coincide, but it is still true that neither equals the law of since the law of still fails to be $\sigma$-finite.

If, however, equation (5) together with a prior is taken as the initial $\sigma$-finite law for , then equation (6) is the correct posterior. The paradox is then removed since the conflicting conclusions are consequences of different initial assumptions.

It can finally be noted that even if a conditional law does not depend on it cannot be concluded that it equals . This is demonstrated by equation (10) and equation (12). This rule holds for probability distributions, and also more generally if and are $\sigma$-finite. Stone and Dawid (1972) calculated formally as explained above as if the rule were generally valid. This error resulted in two conflicting results.

## 4 Final remarks

It follows from the previous quotations of Lindley that he initially supported the use of conditional probability spaces as introduced by Rényi. We have argued that this initial suggestion is indeed a natural approach to Bayesian statistics including commonly used objective priors.

As explained, the marginalization paradoxes seem to have been the main reason for Lindley’s change in opinion on this. Tony O’Hagan interviewed Lindley for the Royal Statistical Society’s Bayes 250 Conference held in June 2013. Lindley explains very nicely that all probabilities are conditional probabilities, but also recalls his reaction to the marginalization paradoxes:

Good heavens, the world is collapsing about me.

Lindley continues to argue that Bayesian statistics without improper priors is a sound theory, and that the focus should be on how to quantify the prior uncertainty of the unknown parameters. The parameters should be viewed as real physical quantities regardless of which experiment is later used for decreasing their uncertainty. This clearly disqualifies the choice of data dependent priors, and even the choice of priors depending on the particular statistical model used. We wholeheartedly agree with Lindley on this, but we claim that this can be done also within the more general theory introduced by Rényi and continued here.

An unbounded law can according to Rényi be interpreted by the corresponding family of conditional probabilities given by conditioning on the events in the bunch. These elementary conditional probabilities are probabilities in the sense of Kolmogorov, and the interpretation depends on the application. They can, as Lindley advocates convincingly, be interpreted as personal probabilities corresponding to a range of real life events. They can also, as needed in for instance quantum physics, be interpreted as objectively true probabilities representing a law for how a system behaves when observed repeatedly under idealized conditions.

Assume now that we accept a theory where the prior uncertainty is given by a possibly unbounded law. It is then natural to accept that a resulting posterior uncertainty can also be represented by a possibly unbounded law. Both the prior and the posterior represent uncertainty of the same kind. Hopefully, many can agree on this on an intuitive level. The main result presented here is Theorem 1 which provides a mathematical theory in which this can be done consistently without paradoxical results.

## Appendix A Appendix on measure theory

### A.1 Measurable space and measure

A measurable space is a set X equipped with a $\sigma$-field of subsets of X. A $\sigma$-field is a collection of subsets of a fixed set that contains the empty set and is closed under complements and countable unions. A set is measurable if . A measure is a function with that is countably additive: . A probability measure is a measure on a measurable space X with . A measure space is a measurable space X equipped with a measure (Rudin, 1987, p.16). A probability space is a measurable space X equipped with a probability measure.

A sigma-field of a measure space is sigma-finite if there exist measurable sets with and (Taraldsen and Lindqvist, 2016, p.5010). A measure space is sigma-finite if is sigma-finite, and is then also said to be sigma-finite (Rudin, 1987, p.121).

### A.2 Conditional probability space

A bunch in a measurable space is a family of measurable sets closed under finite unions that does not contain the empty set, but contains a countable family of sets whose union is the whole set. A bunch is ordered if implies or .

A Rényi space (Taraldsen and Lindqvist, 2016, p.5013) is a measurable space X equipped with a family of probability measures indexed by a bunch which fulfill and , and the identity

$$\nu(A \mid B_1) = \frac{\nu(A \cap B_1 \mid B_2)}{\nu(B_1 \mid B_2)} \tag{16}$$
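The identity in equation (16) can be illustrated for the uniform law on the real line, where conditioning on a bounded interval gives the uniform probability distribution on that interval. This is a minimal sketch under assumptions made only for the example: events and conditioning sets are bounded intervals represented as hypothetical `(a, b)` tuples, and the measure is interval length.

```python
from fractions import Fraction


def length(interval):
    """Lebesgue measure of an interval (a, b); empty intervals have length 0."""
    a, b = interval
    return max(Fraction(0), Fraction(b) - Fraction(a))


def intersect(first, second):
    """Intersection of two intervals, returned as a (possibly empty) interval."""
    return (max(first[0], second[0]), min(first[1], second[1]))


def cond_prob(event, given):
    """Elementary conditional probability nu(event | given) under interval length."""
    mass = length(given)
    if mass == 0:
        raise ValueError("conditioning set must have positive, finite length")
    return length(intersect(event, given)) / mass


# B1 is contained in B2, as required for the consistency identity.
A, B1, B2 = (0, 1), (-1, 1), (-2, 2)
lhs = cond_prob(A, B1)
rhs = cond_prob(intersect(A, B1), B2) / cond_prob(B1, B2)
print(lhs, rhs)
```

Exact rational arithmetic is used so that both sides of the identity can be compared with `==` rather than with a floating point tolerance.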

A sigma-finite measure on a measurable space X generates a probability law with corresponding conditional probabilities for . The Rényi space generated by the probability law is given by the family . A conditional probability space is a set X equipped with a probability law.

### A.3 Statistical model

A statistical model is a triple where the space is a conditional probability space, the data is a measurable function , and the model parameter is a measurable function . These definitions, and the ones that follow, are as given by Schervish (1995) except for choice of symbols and the generalization given by assuming that is a conditional probability space. There is one probability law defined on the sigma-algebra of events in - and all other concepts are defined from the basic space (Taraldsen and Lindqvist, 2016, p.5011).

A statistic is a measurable function of the data and a parameter is a measurable function of the model parameter as illustrated in equation (17)

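$$(\Omega, \mathcal{E}, \mathrm{P}): \quad \Omega \xrightarrow{\ \Theta\ } \Omega_\Theta \xrightarrow{\ \psi\ } \Omega_\Gamma, \qquad \Omega \xrightarrow{\ X\ } \Omega_X \xrightarrow{\ \phi\ } \Omega_Y, \qquad \Gamma = \psi \circ \Theta, \quad Y = \phi \circ X \tag{17}$$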

A random quantity is a measurable function , and its law is defined by where . We abuse notation here and interpret as one fixed representative of the equivalence class that defines as a conditional measure. A random quantity is sigma-finite if its law is sigma-finite. If the model parameter is sigma-finite, then the conditional probabilities define a family of probability measures on the sample space indexed by the model parameter in the model parameter space . Likewise, if the data is sigma-finite, then the posterior is a probability measure on the model parameter space . The mappings and are measurable for all events , but existence of families of probability measures as claimed above requires regularity assumptions: It is sufficient to assume that the sample space and the model parameter space are Borel spaces (Schervish, 1995, p.619; Taraldsen and Lindqvist, 2016, p.5011).

## References

• Bioche and Druilhet (2016) Bioche, C. and P. Druilhet (2016). Approximation of improper priors. Bernoulli 22(3), 1709–1728.
• Dawid et al. (1973) Dawid, A. P., M. Stone, and J. V. Zidek (1973). Marginalization Paradoxes in Bayesian and Structural Inference. Journal of the Royal Statistical Society Series B-Statistical Methodology 35(2), 189–233.
• Doob (1953) Doob, J. L. (1953). Stochastic Processes. Wiley Classics Library Edition (1990). Wiley.
• Druilhet et al. (2016) Druilhet, P., C. Bioche, and P. Druilhet (2016). A cautionary note on Bayesian estimation of population size by removal sampling with diffuse priors. Biometrical Journal 00(0000).
• Fraser (1968) Fraser, D. A. S. (1968). The structure of inference. John Wiley.
• Haldane (1932) Haldane, J. B. S. (1932). A note on inverse probability. Mathematical Proceedings of the Cambridge Philosophical Society 28, 55–61.
• Hartigan (1983) Hartigan, J. (1983). Bayes theory. New York: Springer.
• Jeffreys (1939) Jeffreys, H. (1939). Theory of probability (1966 ed) (Third ed.). New York: Oxford.
• Kolmogorov (1933) Kolmogorov, A. (1933). Foundations of the theory of probability (Second ed.). Chelsea edition (1956).
• Lavine and Hodges (2012) Lavine, M. L. and J. S. Hodges (2012). On Rigorous Specification of ICAR Models. The American Statistician 66(1), 42–49.
• Lindley (1980) Lindley, D. V. (1980, March). Introduction to Probability and Statistics from a Bayesian Viewpoint, Part 1, Probability (Pt. 1). Cambridge University Press.
• Lindqvist and Taraldsen (2017) Lindqvist, B. and G. Taraldsen (2017, in press). On the proper treatment of improper distributions. J. Statist. Plann. Inference.
• Renyi (1970) Renyi, A. (1970). Foundations of Probability. Holden-Day.
• Robert et al. (2009) Robert, C. P., N. Chopin, and J. Rousseau (2009). Harold Jeffreys’s Theory of Probability Revisited. Statistical Science 24(2), 141–172.
• Rudin (1987) Rudin, W. (1987). Real and Complex Analysis. McGraw-Hill.
• Schervish (1995) Schervish, M. J. (1995). Theory of Statistics. Springer.
• Skorohod (1984) Skorohod, A. V. (1984, November). Random Linear Operators (1 ed.). Springer.
• Stone and Dawid (1972) Stone, M. and A. P. Dawid (1972). Un-Bayesian Implications of Improper Bayes Inference in Routine Statistical Problems. Biometrika 59(2), 369–375.
• Taraldsen and Lindqvist (2010) Taraldsen, G. and B. H. Lindqvist (2010). Improper Priors Are Not Improper. The American Statistician 64(2), 154–158.
• Taraldsen and Lindqvist (2016) Taraldsen, G. and B. H. Lindqvist (2016). Conditional probability and improper priors. Communications in Statistics 45(17), 5007–5016.
• Tufto et al. (2012) Tufto, J., R. Lande, T.-H. Ringsby, S. Engen, B.-E. Sæther, T. R. Walla, and P. J. DeVries (2012). Estimating Brownian motion dispersal rate, longevity and population density from spatially explicit mark-recapture data on tropical butterflies. Journal of Animal Ecology 81, 756–769.