 # On the proper treatment of improper distributions

The axiomatic foundation of probability theory presented by Kolmogorov has been the basis of modern theory for probability and statistics. In certain applications it is, however, necessary or convenient to allow improper (unbounded) distributions, which is often done without a theoretical foundation. The paper reviews a recent theory which includes improper distributions, and which is related to Renyi's theory of conditional probability spaces. It is in particular demonstrated how the theory leads to simple explanations of apparent paradoxes known from the Bayesian literature. Several examples from statistical practice with improper distributions are discussed in light of the given theoretical results, which also include a recent theory of convergence of proper distributions to improper ones.


## 1 Introduction

Bayes’ formula forms the basis of Bayesian statistics. Suppose a parameter $\theta$ is of interest, and that we have data $x$ which are supposed to give information about $\theta$. The idea of Bayesian inference is to first express one’s prior knowledge (some would call it uncertainty) of $\theta$ in the form of a prior distribution, commonly given in the form of a density function $\pi(\theta)$, and then combine this knowledge with the new knowledge provided by the data $x$. The influence of $\theta$ on the data is modeled by a statistical model, represented by the conditional density of the data given the parameter, $f(x \mid \theta)$. Note that $f(x \mid \theta)$ will sometimes be interpreted as the “likelihood” of $\theta$ for a given observation $x$, in which case the function $\theta \mapsto f(x \mid \theta)$ will be the likelihood function.

Bayes’ formula is used to express the updated information about $\theta$ obtained after $x$ is observed, given in the form of the posterior distribution,

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{f(x)}. \tag{1}$$

Here $f(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta$ is the marginal density of $x$. Equation (1) is one version of Bayes’ theorem.

The algorithm for calculation of posterior distributions given by (1) is well defined as long as $f(x) < \infty$, and in this case it will always lead to a proper probability distribution in the sense that $\pi(\theta \mid x)$ is a non-negative function which integrates to 1.

The usual proof of Bayes’ formula is restricted to the case where $\pi(\theta)$ is a probability density, where basic rules of probability are used in the derivation. As pointed out above, we may however get a proper distribution as the output of the formula even if the prior is not proper. This fact is the obvious excuse for using Bayes’ formula with improper priors.

A natural question is now, why should one want to use improper distributions for $\theta$? In practice, improper distributions often result from the search for so-called non-informative priors. The most prominent example of such priors is Jeffreys’ prior, which is proportional to the square root of the determinant of the Fisher information matrix and has the key property of being invariant under reparameterizations.

Intuitively, a non-informative prior should be one that does not favor any parameter values above others, suggesting “flat” priors. In practice, this often means using proper standard probability models such as normal, gamma or uniform distributions with (very) large variances. Taking the limit as the variance tends to $\infty$ is then most often the excuse for using an improper prior density which is constant over the complete parameter space. Such distributions may, however, not have the invariance properties required by Jeffreys’ priors. It is in fact well-known that flat priors may be very informative on non-location parameters. We refer to Irony and Singpurwalla (1997) for an interesting discussion on non-informative and improper priors.

As already indicated, posterior distributions computed by Bayes’ formula are proper probability distributions only under the condition $f(x) < \infty$. Standard Bayesian calculations are, however, made by just recognizing the proportionality

$$\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta),$$

which can be used without loss of information in the case that Bayes’ formula gives a proper distribution, but gives an improper posterior distribution in case $f(x)$ is not finite. The latter case, if ignored, may lead to misleading inferences, as discussed later in this paper.

Improper distributions also appear naturally in certain non-Bayesian analyses. Lindqvist and Taraldsen (2005) considered conditional sampling of data given a sufficient statistic for the unknown parameter $\theta$, which has numerous applications in statistical inference (Casella and Berger, 2002). The key is that this conditional distribution does not depend on the value of $\theta$. The general idea of the conditional sampling method of Lindqvist and Taraldsen (2005) was to use this fact, but instead of fixing the value of $\theta$, to let it be a random quantity with some suitable distribution. Under certain mild restrictions, this distribution can be freely chosen, often with improper distributions giving the most efficient methods. Improper distributions appear likewise as useful ingredients in fiducial statistics, see for example Taraldsen and Lindqvist (2013) and Taraldsen and Lindqvist (2015).

The purpose of the present paper is to review and discuss some important aspects of the use of improper priors in statistical practice. Some would say that no new theory is needed, since improper priors are just approximations to proper ones. As will be discussed in the paper, this attitude is too simple. The literature on Bayesian statistics includes many paradoxes and misleading conclusions due to improper priors and posteriors. There are, however, few theoretical treatments of proper versus improper distributions in the literature. Some exceptions are Hartigan (1983), Chang and Pollard (1997) and the more recent paper by McCullagh et al. (2011).

Our point of departure will be the paper by Taraldsen and Lindqvist (2010), which has a slightly different view than the above references. The idea is here simply to allow infinite probabilities in Kolmogorov’s axioms. While this implies that all random variables have infinite mass, all conditional distributions will be proper probability distributions under a certain crucial condition which turns out to be equivalent to the above mentioned condition of finite $f(x)$. Formally, this condition is the $\sigma$-finiteness of the random quantity that is conditioned on, here the data $X$. Details will be given in Section 2, which reviews the theoretical results of Taraldsen and Lindqvist (2010).

The above idea is not new, however. We quote from Renyi (1962), motivating the introduction of improper distributions:

One can indeed give an axiomatic theory of probability which matches the above-mentioned requirements. This theory contains the theory of Kolmogorov as a special case. The fundamental concept of the theory is that of conditional probability; it contains cases where ordinary probabilities are not defined at all.

The idea of such a theory is due to Kolmogorov himself; he, however, did not publish anything about it.

The theory presented in Taraldsen and Lindqvist (2010) is in fact closely related to Renyi’s theory of conditional probability spaces (Renyi, 1970). This connection is studied in more detail in Taraldsen and Lindqvist (2016).

Having introduced the basic elements of the theory of Taraldsen and Lindqvist (2010) in Section 2, we proceed to Section 3 which discusses some consequences of the theory when applied to Bayesian statistics. In particular we investigate in some detail a so-called marginalization paradox presented by Stone and Dawid (1972). Section 4 is devoted to Gibbs sampling, where a possible pitfall is the fact that posteriors may be improper even if all full conditionals are proper. A recent theoretical paper on approximation of improper priors, Bioche and Druilhet (2016), is briefly reviewed in Section 5. This is an important paper giving precise conditions for convergence of proper priors to improper ones and for convergence of the corresponding posterior distributions. Section 6 discusses a class of improper models which is popular in spatial statistics. Some concluding remarks are finally given in Section 7.

## 2 The theoretical framework

### 2.1 The modified Kolmogorov axioms

As in Kolmogorov’s axioms we consider an abstract space $\Omega$ of outcomes, where events are represented by subsets of $\Omega$ and where the family $\mathcal{E}$ of events is assumed to be a $\sigma$-algebra. We next let the measurable space $(\Omega, \mathcal{E})$ be equipped with a fixed law $\Pr$ with

• $\Pr(A) \ge 0$ for all $A \in \mathcal{E}$.

• $\Pr(\cup_i A_i) = \sum_i \Pr(A_i)$ whenever $A_1, A_2, \ldots$ are pairwise disjoint events.

However, where Kolmogorov adds the axiom $\Pr(\Omega) = 1$, we assume only

• $\Pr(\emptyset) = 0$,

and hence allow the case $\Pr(\Omega) = \infty$. Note that the above axioms are exactly the axioms of a positive measure from standard measure theory (Royden, 1968).

### 2.2 Random quantities

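A random quantity $X$ with values in a measurable space $(\Omega_X, \mathcal{E}_X)$ is identified with a measurable function $X: \Omega \to \Omega_X$, i.e., such that $(X \in A) = \{\omega : X(\omega) \in A\}$ is an event in $\mathcal{E}$ for any event $A$ in $\mathcal{E}_X$. The law $\Pr$ on $(\Omega, \mathcal{E})$ now induces the law $\Pr_X$ of a random quantity $X$ by defining

$$\Pr_X(A) = \Pr(X \in A) \quad \text{for } A \in \mathcal{E}_X.$$

Hence the joint law of a pair $(X, Y)$ is determined by

$$\Pr_{X,Y}(A \times B) = \Pr((X,Y) \in A \times B) \quad \text{for } A \in \mathcal{E}_X,\ B \in \mathcal{E}_Y,$$

while marginal laws are found from

$$\Pr_X(A) = \Pr_{X,Y}(A \times \Omega_Y) \quad \text{for } A \in \mathcal{E}_X.$$

The random quantity $Y$ is called $\sigma$-finite if the law $\Pr_Y$ is $\sigma$-finite, i.e., if there exist events $E_1, E_2, \ldots$ with:

$$\Omega_Y = \cup_i E_i \quad \text{and} \quad \Pr_Y(E_i) < \infty \ \text{ for } i = 1, 2, \ldots$$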

### 2.3 Conditional distributions

A key feature of our approach is that if $Y$ is $\sigma$-finite, then we can define a unique proper conditional probability

$$\Pr^y(A) \equiv \Pr(A \mid Y = y)$$

as a function of $y$. The following approach equals the standard approach for definition of conditional probabilities and expectation in ordinary probability theory.

For a given event $A$ in $\mathcal{E}$, conditional probabilities should satisfy, for all $B \in \mathcal{E}_Y$:

$$\Pr(A \cap (Y \in B)) = \int_B \Pr(A \mid Y = y)\,\Pr_Y(dy) = \int_B \Pr^y(A)\,\Pr_Y(dy). \tag{2}$$

By the assumed $\sigma$-finiteness of the measure $\Pr_Y$, the Radon-Nikodym theorem (Royden, 1968) states exactly that the function $y \mapsto \Pr^y(A)$ exists and is uniquely (a.e.) defined by the above.

Since the measure $\Pr_Y$ must satisfy $\Pr_Y(B) = \Pr(Y \in B)$ for all $B \in \mathcal{E}_Y$, it is seen by letting $A = \Omega$ in (2) and using uniqueness of $\Pr^y$, that we have

$$\Pr^y(\Omega) = 1.$$

Under regularity conditions, which we will not pursue here, we may from this conclude that conditional laws can always be represented as proper probability distributions, as long as $Y$ is $\sigma$-finite. If, on the other hand, $Y$ is not $\sigma$-finite, then $\Pr^y$ is not defined due to the requirement of $\sigma$-finiteness in the Radon-Nikodym theorem.

Having defined the conditional law $\Pr^y$ on $(\Omega, \mathcal{E})$, we now define the conditional distribution of a random quantity $X$ given $Y = y$ for $A \in \mathcal{E}_X$ by

$$\Pr^y_X(A) = \Pr^y(X \in A).$$

### 2.4 A Bayesian statistical model

A Bayesian statistical model involves an observation, represented by a random quantity $X$, and a random parameter $\theta$, represented as a $\sigma$-finite random quantity $\Theta$. The law $\Pr_\Theta$ of $\Theta$ is then the prior distribution.

The conditional distribution of $X$ given $\Theta = \theta$, i.e., $\Pr^\theta_X$, defines in a consistent way a statistical model. This follows directly from the above approach since $\Theta$ is assumed to be $\sigma$-finite and since conditional distributions are always proper.

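### 2.5 Implications of improper prior $\Pr_\Theta$

So far we have not specified the value of $\Pr(\Omega)$. Suppose that $\Theta$ is $\sigma$-finite with $\Pr_\Theta(\Omega_\Theta) = \infty$. We claim that this implies that $\Pr$ is $\sigma$-finite and $\Pr(\Omega) = \infty$. To see this, suppose $A_1, A_2, \ldots$ are such that

$$\Omega_\Theta = \cup_i A_i \quad \text{and} \quad \Pr_\Theta(A_i) < \infty \ \text{ for } i = 1, 2, \ldots$$

Then

$$\Omega = (\Theta \in \cup_i A_i) = \cup_i (\Theta \in A_i),$$

which implies that $\Pr$ is also by necessity improper and $\sigma$-finite. This follows since $\Pr(\Theta \in A_i) = \Pr_\Theta(A_i) < \infty$ and $\Pr(\Omega) = \Pr_\Theta(\Omega_\Theta) = \infty$.

Assume now that $\Pr(\Omega) = \infty$. Then every random quantity $X$ has an improper law, since

$$\Pr_X(\Omega_X) = \Pr\{\omega : X(\omega) \in \Omega_X\} = \Pr(\Omega) = \infty.$$

On the other hand, a random quantity is not necessarily $\sigma$-finite, even if $\Pr$ is. Namely, let $X$ take the values $0$ and $1$. Then

$$\infty = \Pr(X \in \Omega_X) = \Pr(X = 0) + \Pr(X = 1)$$

and at least one of these is necessarily equal to $\infty$. Hence $X$ is not $\sigma$-finite.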

### 2.6 Bayesian posteriors

Recall that a Bayesian model is given by a $\sigma$-finite law $\Pr_\Theta$, the prior distribution, and an observation $X$ with distribution $\Pr^\theta_X$. For an observation $x$, Bayesian inference considers the posterior law, i.e., the conditional law of $\Theta$ given $X = x$, which in our notation is $\Pr^x_\Theta$. This conditional distribution is well defined if $X$ is $\sigma$-finite, in which case it is a proper probability distribution. On the other hand, if $X$ is not $\sigma$-finite, then the posterior is not defined. Hence, in the current theory there is no such thing as an improper posterior!

## 3 Bayesian statistics and marginalization paradoxes

### 3.1 The absolutely continuous case

Random quantities are said to be absolutely continuous if they can be defined by densities with respect to Lebesgue measure, for example the pair $(X, \Theta)$ with joint density $f(x, \theta)$. The marginal density of $X$ is then given by the density $f(x) = \int f(x, \theta)\,d\theta$, where $f(x) = \infty$ is a permitted value.

It is seen that $X$ with density $f(x)$ is $\sigma$-finite according to the definition of Section 2.2 if and only if $f(x) < \infty$ (a.e.), and the approach of the previous section can be shown to lead to Bayes’ formula (1), which corresponds to $\Pr^x_\Theta$ in the notation of Section 2.3.

### 3.2 What may “go wrong” with improper distributions?

A prior model for $(\theta, \phi)$ in Bayesian statistics is commonly given in the form of a joint density $\pi(\theta, \phi) = g(\theta)h(\phi)$ of the pair $(\theta, \phi)$, where $g$, $h$ are two non-negative, finite-valued functions. One would then say that the parameters are given “independent priors, with marginal priors $g$ and $h$”. In practice, one might have chosen one or both of the “marginal” priors $g$ and $h$ as improper ones. To be concrete, suppose $g$ is a proper probability density, while $h$ is improper, i.e., integrates to $\infty$. As indicated above, it would be tempting to call $g$ and $h$ the marginal densities of $\theta$ and $\phi$, respectively. But are they?

By the definition given in Section 2.2, the marginal density of $\theta$ is $\int g(\theta)h(\phi)\,d\phi$, which however equals $g(\theta) \cdot \infty$ since $h$ is improper. Since this equals $\infty$ whenever $g(\theta) > 0$, it follows that $g$ is not the marginal density of $\theta$! However, integrating instead with respect to $\theta$ and recalling that $g$ was assumed to be a probability density, we find that the marginal density of $\phi$ is $h(\phi)$, showing that $h$ is indeed the marginal density of $\phi$.

So which interpretation can we give of $g$? Since the marginal density of $\phi$, $h(\phi)$, is finite (although not proper), we conclude that $\phi$ is $\sigma$-finite. Hence it has meaning to condition on it, and it can be seen that $g(\theta)$ is the conditional density of $\theta$ given $\phi$, using the approach of Section 2.3. In particular this shows that even if $\theta$ is not $\sigma$-finite, and has an infinite marginal density, it has a well defined proper conditional distribution given the $\sigma$-finite random quantity $\phi$.

### 3.3 A marginalization paradox (Stone and Dawid, 1972)

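For given parameters $(\theta, \phi)$, let $X$ and $Y$ be independent and exponentially distributed with hazard rates, respectively, $\theta\phi$ and $\phi$. Suppose the interest is in the ratio $\theta$ of the hazard rates, which suggests consideration of the ratio $Z = Y/X$.

Let the joint prior distribution of $(\theta, \phi)$ be given by $\pi(\theta, \phi) = \pi(\theta)$, where $\pi(\theta)$ is proper. (Note that by the previous subsection, $\pi(\theta)$ is not the marginal density of $\theta$.) The joint density of $(X, Z, \theta, \phi)$ is readily obtained to be

$$f(x, z, \theta, \phi) = \theta\phi^2 x\, e^{-\phi x(\theta + z)}\,\pi(\theta). \tag{3}$$

Integration with respect to $(\theta, \phi)$ shows that $(X, Z)$ is $\sigma$-finite, and we readily get the marginal conditional distribution of $\theta$ given $(X, Z) = (x, z)$ to be

$$f(\theta \mid x, z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3}. \tag{4}$$

Since this does not depend on $x$, it is tempting to conclude that the right hand side of (4) is also the conditional distribution of $\theta$ given $Z = z$, i.e.,

$$f(\theta \mid z) = f(\theta \mid x, z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3}. \tag{5}$$

Starting differently, by integrating out $x$ in (3) and conditioning with respect to $(\theta, \phi)$ (which is obviously $\sigma$-finite), we get

$$f(z \mid \theta, \phi) = \int_0^\infty \theta\phi^2 x\, e^{-\phi x(\theta + z)}\,dx = \frac{\theta}{(\theta + z)^2}.$$

Since this depends only on $\theta$, one might suggest that $f(z \mid \theta) = f(z \mid \theta, \phi)$ and from this obtain

$$f(\theta \mid z) \propto f(z \mid \theta)\,\pi(\theta) = \frac{\theta\,\pi(\theta)}{(\theta + z)^2}. \tag{6}$$

But (5) and (6) contradict each other! It is therefore not clear how to proceed if one wants to do inference on $\theta$ based on $z$ alone. This is an example of a marginalization paradox.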
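To see numerically that the right hand sides of (5) and (6) really describe different distributions, the following minimal sketch normalizes both on a grid and compares their means. The choices $\pi(\theta) = e^{-\theta}$ and $z = 1$ are borrowed from the caption of Figure 1; the helper names, the integration range and the grid size are illustrative assumptions and not part of the original analysis.

```python
import math

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule on n equal subintervals of [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def posterior_mean(kernel, upper=40.0, n=20000):
    """Mean of the density proportional to `kernel` on [0, upper].

    The upper limit is an assumption: with an exp(-theta) prior the
    mass beyond 40 is negligible.
    """
    norm = trapezoid(kernel, 0.0, upper, n)
    first_moment = trapezoid(lambda t: t * kernel(t), 0.0, upper, n)
    return first_moment / norm

z = 1.0                                # observed ratio (assumed, as in Figure 1)
prior = lambda t: math.exp(-t)         # proper prior pi(theta) (assumed)

kernel_5 = lambda t: t * prior(t) / (t + z) ** 3   # right hand side of (5)
kernel_6 = lambda t: t * prior(t) / (t + z) ** 2   # right hand side of (6)

mean_5 = posterior_mean(kernel_5)
mean_6 = posterior_mean(kernel_6)
print(f"mean under (5): {mean_5:.3f}, mean under (6): {mean_6:.3f}")
```

The two candidate "posteriors" for $\theta$ given $z$ have clearly different means, which is the contradiction in concrete form.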

So what is the problem? Considering the approach of Section 2, the problem is that the marginal distribution of $Z$ is not $\sigma$-finite, so one is not allowed to condition on it. The conclusion in (5) is hence not correct. In fact, neither is the distribution of $\theta$ $\sigma$-finite, making the conclusion (6) incorrect as well.

The clue is that we have above, in fact twice, used the generally invalid result that

$$f(\theta \mid x, z) \text{ does not depend on } x \ \Rightarrow\ f(\theta \mid z) = f(\theta \mid x, z).$$

This is well-known to hold for probability distributions, but holds for improper distributions only provided $Z$ and $(X, Z)$ both have $\sigma$-finite distributions.

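In order to understand better the mechanisms of the previous example, let us redefine the problem and let the prior of $(\theta, \phi)$ be given by

$$\pi(\theta, \phi) = \pi(\theta)h(\phi),$$

where $h$ may be proper or improper, while $\pi(\theta)$ is proper as before, unless otherwise stated below. Multiplying (3) by $h(\phi)$ and integrating with respect to $\phi$ (substituting $u = \phi x$) we get

$$f(x, z, \theta) = \frac{\theta\,\pi(\theta)}{x^2} \int_0^\infty u^2 e^{-u(\theta + z)}\, h\!\left(\frac{u}{x}\right) du. \tag{7}$$

The joint marginal distribution of $(Z, \theta)$ is obtained by integrating with respect to $x$, which gives

$$f(z, \theta) = \theta\,\pi(\theta) \int_0^\infty u^2 e^{-u(\theta + z)} \left[\int_0^\infty \frac{1}{x^2}\, h\!\left(\frac{u}{x}\right) dx\right] du = \frac{\theta\,\pi(\theta)}{(\theta + z)^2} \int_0^\infty h(w)\,dw. \tag{8}$$

Hence, if $h$ is proper, then for inference about $\theta$ we may base ourselves on $z$ only and use the relation (6). This case corresponds to the typical frequentist approach for this example, where one concludes that the distribution of $Z$ depends on the parameters only via $\theta$. In the case where $h$ is improper, we get however $f(z, \theta) = \infty$ in (8), and neither $Z$ nor $\theta$ is $\sigma$-finite.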

Recalling that we want to make inference about $\theta$, let us go back to (7). It follows that

$$f(\theta \mid x, z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3} \int_0^\infty w^2 e^{-w}\, h\!\left(\frac{w}{x(\theta + z)}\right) dw. \tag{9}$$

Setting $h \equiv 1$ leads to the right hand side of (5). Strangely enough, choosing $h$ to be the improper density $h(\phi) = 1/\phi$, it follows that

$$f(\theta \mid x, z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^2} \tag{10}$$

in which case we would apparently not have a marginalization paradox. Still, however, $Z$ is not $\sigma$-finite, so we cannot conclude that (10) equals $f(\theta \mid z)$. It is notable that Stone and Dawid (1972) explain this apparent absence of a marginalization paradox by the fact that we now use the prior $1/\phi$ for $\phi$, which is the common prior for a scale parameter. In our opinion this seems to be more like a coincidence, since we here use an improper $h$ under which $Z$ is not $\sigma$-finite.

As another comment on (10), note that the pair $(X, Z)$ is $\sigma$-finite even if we let $\pi(\theta)$ be the improper density $\pi(\theta) = 1/\theta$, while we keep $h(\phi) = 1/\phi$. This is seen by integrating (7) with respect to $\theta$. Hence (10) is meaningful and leads after normalization to the posterior density for $\theta$ given by

$$f(\theta \mid x, z) = \frac{z}{(\theta + z)^2}. \tag{11}$$

It is interesting to note that the density (11) also appears as the optimal invariant confidence distribution for $\theta$ in a frequentist approach involving the observations $x$ and $y$, where $\theta$ is the parameter of interest. The argument follows Schweder and Hjort (2016), Chapter 5.

We close the present section by returning to the original assumption where $h \equiv 1$ and $\pi(\theta)$ is unspecified, but proper. Consider a proper density $h_M$ which can be seen as an approximation to the improper $h$, e.g.,

$$h_M(\phi) = \frac{1}{M}\, I(0 < \phi \le M), \tag{12}$$

where $I(\cdot)$ is the indicator function and $M$ is considered to be large. From (9) we get

$$f(\theta \mid x, z) \propto \frac{\theta\,\pi(\theta)}{(\theta + z)^3} \int_0^{xM(\theta + z)} w^2 e^{-w}\,dw = \frac{\theta\,\pi(\theta)}{(\theta + z)^3} \left[2 - e^{-A}(A^2 + 2A + 2)\right], \tag{13}$$

where $A = xM(\theta + z)$.

It is seen that the limit as $M$ tends to infinity in (13) is consistent with (4). This is in fact a consequence of a general result in Bioche and Druilhet (2016) (see Section 5), since $h_M \to h$ in their approach, with $h \equiv 1$. As seen from (13), the convergence of $f(\theta \mid x, z)$ as $M$ tends to infinity is not uniform in the observations $x$. This point was made by Akaike (1980) in his discussion of certain marginalization paradoxes. More precisely, Akaike questioned the common interpretation of an improper prior distribution as a limit of proper prior distributions, and he argued that an improper prior can more adequately be described as the limit of certain data adaptive proper prior distributions. He concluded that a prior distribution without data adaptability may produce poor inference due to a gross misspecification of the prior.

We illustrate this point in Figure 1. It is seen that, even if we set $M$ as a large number (here 500) in (12), the posterior distribution for $\theta$ in (13) depends rather distinctly on the value of $x$, at least for small $x$. This is despite the non-appearance of $x$ in the right hand side of (4). For higher values of $x$, like $x = 1$, the posterior is however indistinguishable from the one we get when $M$ equals infinity. The figure also includes a corresponding plot of (6) which illustrates the difference between (4) and (6).

Figure 1: Marginalization paradox example. Let $\pi(\theta) = \exp(-\theta)$ and suppose $z = 1$ is observed. Solid line: the (normalized) density (13) with $x = 1$, $M = 500$ (which is indistinguishable from (4)). Dashed line: the (normalized) density (13) with $x = 0.001$, $M = 500$. Dotted line: the (normalized) density (10).
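The dependence of (13) on $x$ can be checked with a few lines of code. The sketch below is a minimal illustration using the settings stated in the caption of Figure 1 ($\pi(\theta) = e^{-\theta}$, $z = 1$, $M = 500$); the function names, the integration range and the grid size are assumptions made for the example. It compares posterior means rather than reproducing the plot.

```python
import math

def trapezoid(f, lo, hi, n):
    """Composite trapezoidal rule on n equal subintervals of [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * h)
    return total * h

def posterior_mean(kernel, upper=40.0, n=20000):
    """Mean of the density proportional to `kernel` on [0, upper] (range assumed)."""
    norm = trapezoid(kernel, 0.0, upper, n)
    return trapezoid(lambda t: t * kernel(t), 0.0, upper, n) / norm

def kernel_4(theta, z):
    """Right hand side of (4) with the assumed prior pi(theta) = exp(-theta)."""
    return theta * math.exp(-theta) / (theta + z) ** 3

def kernel_13(theta, x, z, M):
    """Right hand side of (13): kernel (4) times the truncated gamma integral."""
    A = x * M * (theta + z)
    bracket = 2.0 - math.exp(-A) * (A * A + 2.0 * A + 2.0)
    return kernel_4(theta, z) * bracket

z, M = 1.0, 500.0
mean_limit = posterior_mean(lambda t: kernel_4(t, z))             # M = infinity
mean_x1 = posterior_mean(lambda t: kernel_13(t, 1.0, z, M))       # x = 1
mean_small = posterior_mean(lambda t: kernel_13(t, 0.001, z, M))  # x = 0.001

print(f"limit (4): {mean_limit:.3f}")
print(f"(13), x=1:     {mean_x1:.3f}")
print(f"(13), x=0.001: {mean_small:.3f}")
```

For $x = 1$ the bracket in (13) is numerically equal to 2 over the whole grid, so the result agrees with the limit (4), whereas for $x = 0.001$ the posterior mean is shifted noticeably upwards, in line with the non-uniform convergence discussed above.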

## 4 An example from Gibbs-sampling

### 4.1 Gibbs-sampling from improper posterior distribution

Hobert and Casella (1996) gave an example showing that the output from Gibbs sampling corresponding to an improper posterior distribution may still appear perfectly reasonable. The authors’ advice is thus that before implementing a Gibbs chain one should check that the posterior is proper. For this it is important to note that propriety of the conditionals of a Gibbs chain does not imply that the full posterior is proper (see example below).

Gelfand and Sahu (1999) consider similar problems with Gibbs sampling, focusing on parameter identifiability and posterior propriety. In particular, they provide rather general results for propriety of posteriors in the case of GLMs. As a simple illustration they consider in an earlier technical report (Gelfand and Sahu, 1996) the following example.

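Let $Y \mid \theta_1, \theta_2 \sim N(\theta_1 + \theta_2, 1)$ with improper prior $\pi(\theta_1, \theta_2) = 1$. Then the joint distribution of $(Y, \Theta_1, \Theta_2)$ is

$$f(y, \theta_1, \theta_2) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y - \theta_1 - \theta_2)^2},$$

leading to the marginal density of $Y$ given as

$$\int\!\!\int f(y, \theta_1, \theta_2)\,d\theta_1\,d\theta_2 = \infty.$$

Hence $Y$ is not $\sigma$-finite, so the posterior does not exist (or, is improper).

On the other hand, the pairs $(Y, \Theta_1)$ and $(Y, \Theta_2)$ are both $\sigma$-finite, so the following conditional distributions exist and are proper:

$$\Theta_1 \mid \theta_2, y \sim N(y - \theta_2, 1), \tag{14}$$

$$\Theta_2 \mid \theta_1, y \sim N(y - \theta_1, 1). \tag{15}$$

Thus Gibbs-sampling of pairs $(\theta_1, \theta_2)$ for given $y$ is possible. The question is, however, how the pairs will behave, knowing that the posterior does not exist. Figure 2 shows a simulation from (14)-(15). The large fluctuations seen in the plots are due to the impropriety of the joint posterior given $y$.

Figure 2: Gibbs chains for $\theta_1$ (left) and $\theta_2$ (right), drawn from (14) and (15), respectively.

Figure 3: Simulated values of $\delta = \theta_1 + \theta_2$ using (14) and (15).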
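The behavior described here is easy to reproduce. The following is a minimal sketch of a Gibbs sampler alternating between (14) and (15); the observed value $y = 1$, the chain length, the starting point and the seed are all assumptions chosen for illustration, and the code is not the simulation behind Figures 2 and 3.

```python
import random

def gibbs_chain(y, n_iter, seed=0):
    """Alternate draws from the full conditionals (14) and (15)."""
    rng = random.Random(seed)
    theta1, theta2 = 0.0, 0.0          # arbitrary starting point (assumed)
    draws = []
    for _ in range(n_iter):
        theta1 = rng.gauss(y - theta2, 1.0)   # Theta1 | theta2, y ~ N(y - theta2, 1)
        theta2 = rng.gauss(y - theta1, 1.0)   # Theta2 | theta1, y ~ N(y - theta1, 1)
        draws.append((theta1, theta2))
    return draws

y = 1.0
draws = gibbs_chain(y, n_iter=5000)

theta1_values = [t1 for t1, _ in draws]
delta_values = [t1 + t2 for t1, t2 in draws]

theta1_range = max(theta1_values) - min(theta1_values)
delta_mean = sum(delta_values) / len(delta_values)

print(f"range of theta1 over the chain: {theta1_range:.1f}")
print(f"average of delta = theta1 + theta2: {delta_mean:.3f}")
```

In this scheme each new $\theta_1$ equals the previous one plus the difference of two independent standard normal variables, so the $\theta_1$ chain is a random walk and drifts without settling, while the sum $\theta_1 + \theta_2$ equals $y$ plus fresh standard normal noise at every sweep.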

### 4.2 The proper embedded posterior (Gelfand and Sahu, 1999)

Gelfand and Sahu observed, however, that if one makes a 1-1 transformation

$$(\theta_1, \theta_2) \to (\delta, \rho), \quad \text{where } \delta = \theta_1 + \theta_2,$$

then the distribution $N(y, 1)$ of $\delta$ given $y$ can be recovered in the Gibbs-sampling. Indeed, the plot of $\delta = \theta_1 + \theta_2$ from (14) and (15) (Figure 3) is apparently well-behaved. Gelfand and Sahu (1999) call $f(\delta \mid y)$ the unique proper embedded posterior, regarding it as embedded within the improper posterior for $(\theta_1, \theta_2)$.

A critical remark is of course appropriate here in view of the previous theory. Since $Y$ is not $\sigma$-finite it has apparently no meaning to consider $f(\delta \mid y)$. On the other hand, we clearly have from (15) that $\Theta_1 + \Theta_2 \mid \theta_1, y \sim N(y, 1)$, i.e., with $\delta = \theta_1 + \theta_2$ we have $\delta \mid \theta_1, y \sim N(y, 1)$. Thus Gelfand and Sahu’s conclusion is similar to the one that we deemed to be incorrect in connection with the marginalization paradox. Namely, when $Y$ is not $\sigma$-finite, then even if the density of $\delta$ given $(\theta_1, y)$ does not depend on $\theta_1$, this is not the conditional density of $\delta$ given $y$.

The nice behavior of $\delta$ in the simulation can be explained theoretically as follows. Suppose that the prior distribution of $(\theta_1, \theta_2)$ is such that the transformed pair $(\rho, \delta)$ has prior density $g(\rho)$, where $g$ is a proper density. Then under the transformation $(\theta_1, \theta_2) \to (\delta, \rho)$, we have

$$f(y, \rho, \delta) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y - \delta)^2} \cdot g(\rho). \tag{16}$$

In this model $Y$ is clearly $\sigma$-finite (in fact, the marginal density of $Y$ is the constant 1). Thus the posterior exists and is given by (16). The marginal posterior of $\delta$ is hence

$$\delta \mid y \sim N(y, 1)$$

whatever the density $g$, as long as it is proper. Gelfand and Sahu (1999) let $g$ run through a sequence of proper densities approaching an improper limit, and concluded that also in the limit one will have $\delta \mid y \sim N(y, 1)$. This is, however, not a valid conclusion since $Y$ is now not $\sigma$-finite.

### 4.3 Using proper priors for both θ1 and θ2

A proper posterior for $(\theta_1, \theta_2)$ can of course be achieved by giving $(\theta_1, \theta_2)$ a proper prior. Assume for example that $\theta_1$ and $\theta_2$ are independent with $\theta_1 \sim N(0, \tau^2)$, $\theta_2 \sim N(0, \kappa^2)$. Then, as shown in Gelfand and Sahu (1996),

$$\Theta_1 \mid \theta_2, y \sim N\!\left(\frac{\tau^2}{1 + \tau^2}(y - \theta_2),\ \frac{\tau^2}{1 + \tau^2}\right),$$

$$\Theta_2 \mid \theta_1, y \sim N\!\left(\frac{\kappa^2}{1 + \kappa^2}(y - \theta_1),\ \frac{\kappa^2}{1 + \kappa^2}\right).$$

If $\tau^2$ and $\kappa^2$ are large, then the trajectories of the Gibbs chains for $\theta_1$ and $\theta_2$, respectively, will still tend to drift in a way similar to the behavior in Figure 2. Thus, if we use proper but diffuse priors for $\theta_1$ and $\theta_2$, the posteriors will be proper but will in practice be indistinguishable from those obtained under the corresponding limiting improper prior. As concluded by Gelfand and Sahu (1996), an implicit byproduct of this observation is the infeasibility of numerical sampling based diagnostics for propriety of posteriors. A similar conclusion is expressed by Hobert and Casella (1996).

## 5 Convergence of priors and posteriors (Bioche and Druilhet)

### 5.1 q-vague convergence of measures

Bioche and Druilhet (2016) propose a convergence mode for measures allowing a sequence of probability measures to have an improper limiting measure. They also study convergence of corresponding posterior distributions.

Technically the authors study the set of positive Radon measures on the state space $\Theta$, i.e., the set of positive measures which are finite on compact subsets of $\Theta$. Noting that the output of Bayes’ formula (1) is unchanged if $\pi$ is multiplied by a constant, they define the equivalence relation $\Pi \sim \Pi'$ to mean that there is an $\alpha > 0$ such that $\Pi = \alpha\Pi'$. Their basic space of measures is then the corresponding quotient space, equipped with the quotient topology resulting from vague convergence of positive Radon measures. Convergence in this topology has been denoted as q-vague convergence. A similar quotient topology is introduced by Taraldsen and Lindqvist (2016).

A useful way of expressing the definition of q-vague convergence is the following: A sequence $\{\Pi_n\}$ of positive Radon measures converges q-vaguely to $\Pi$ if there exists a sequence $\{a_n\}$ of positive numbers such that

$$a_n \Pi_n \to \Pi \quad \text{(vaguely)},$$

(see, e.g., Billingsley (2008) for the definition of vague convergence).

From this definition it is not difficult to prove that for any improper distribution $\Pi$ there is a sequence $\{\Pi_n\}$ of proper distributions such that $\Pi_n \to \Pi$ (q-vaguely). In this case, the $a_n$ given in the above definition tend to $\infty$ as $n$ increases.

As an example, consider the proper distribution with density $h_M$ given by (12). We claim that $h_M \to h$ (q-vaguely) as $M \to \infty$, where $h \equiv 1$. To see this we need to find constants $a_M$ such that $a_M h_M \to h$ (vaguely) as $M \to \infty$, i.e., such that

$$\int a_M h_M(\phi) f(\phi)\,d\phi \to \int f(\phi)\,d\phi$$

for each continuous function $f$ with compact support. But this is clear by the dominated convergence theorem by letting $a_M = M$ for all $M$.

Bioche and Druilhet (2016) also consider convergence of posterior densities. If $f(x \mid \theta)$ is the likelihood of the data $x$ and $\pi(\theta)$ is the prior density, then they define the posterior distribution as the distribution with density in the equivalence class corresponding to $f(x \mid \theta)\pi(\theta)$, thus allowing also improper posterior distributions.

Their main proposition on convergence of posteriors states that if for the priors we have $\Pi_n \to \Pi$ (q-vaguely), and if $\theta \mapsto f(x \mid \theta)$ is continuous, then the posteriors converge in the sense that $\Pi_n(\cdot \mid x) \to \Pi(\cdot \mid x)$ (q-vaguely). We have already seen an example in Section 3.3, where the posterior distribution (13) converges to (4) as $M$ tends to infinity. Note, however, that while the question of uniform convergence in $x$ was made a point in our example, this issue is not considered by Bioche and Druilhet (2016).

At first glance it seems that the above cited result on convergence of posteriors justifies the common excuse for using improper priors, namely that they are limits of proper priors and hence that the posteriors are limits of posteriors based on proper priors. However, we have already seen problems connected to such a view. Next we shall see another type of misinterpretation of improper limits of proper distributions, which in turn may give completely misleading results regarding posterior distributions.

### 5.2 The Jeffreys-Lindley paradox (Bioche and Druilhet)

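Let $X \sim N(\theta, 1)$, and consider testing of the null hypothesis $H_0: \theta = 0$. Suppose we have a prior distribution for $\theta$ given by

$$\pi(\theta) = \tfrac{1}{2}\delta_0 + \tfrac{1}{2} I(\theta \ne 0),$$

where $\delta_0$ is a point mass at $0$ and $I(\cdot)$ is the indicator function. This means that we have a prior belief of 1/2 in $H_0$, while the remaining probability 1/2 is distributed according to Lebesgue measure on $\{\theta \ne 0\}$. A straightforward calculation gives

$$\pi(0 \mid x) = \left(1 + \sqrt{2\pi}\, e^{x^2/2}\right)^{-1}$$

implying $\pi(0 \mid x) \le (1 + \sqrt{2\pi})^{-1} \approx 0.285$ whatever the data $x$.

Using instead the proper prior measure

$$\pi_n(\theta) = \tfrac{1}{2}\delta_0 + \tfrac{1}{2} N(0, n^2)$$

we get

$$\pi_n(0 \mid x) = \left(1 + \sqrt{\frac{1}{1 + n^2}}\; e^{\frac{n^2 x^2}{2(1 + n^2)}}\right)^{-1}.$$

But this converges to 1 as $n \to \infty$, in conflict with the above calculation which was based on an apparently equivalent argument using the limiting prior. The result has therefore been considered as a paradox.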

The clue, as presented by Bioche and Druilhet (2016), is that while $N(0, n^2)$ converges q-vaguely to Lebesgue measure on the real line, the measure $\pi_n$ converges to $\delta_0$ and not to the measure $\pi$ which one might believe. This explains the paradox, noting that by the convergence result for posteriors, the limiting posterior is a point mass at 0 as well.
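The two posterior probabilities of $H_0$ in this subsection can be evaluated directly. The sketch below simply codes the two displayed formulas; the function names and the test value $x = 2$ are illustrative assumptions.

```python
import math

def post_null_flat(x):
    """pi(0 | x) under the prior (1/2) delta_0 + (1/2) Lebesgue measure."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.pi) * math.exp(x * x / 2.0))

def post_null_normal(x, n):
    """pi_n(0 | x) under the proper prior (1/2) delta_0 + (1/2) N(0, n^2)."""
    v = 1.0 + n * n
    return 1.0 / (1.0 + math.exp(n * n * x * x / (2.0 * v)) / math.sqrt(v))

x = 2.0  # hypothetical observation
print(f"flat limit prior: {post_null_flat(x):.4f}")
for n in (10.0, 100.0, 1e6):
    print(f"n = {n:g}: {post_null_normal(x, n):.6f}")
```

Under the flat component the posterior probability of $H_0$ never exceeds its value at $x = 0$, whereas under the proper priors it increases towards 1 as $n$ grows, for the same observation.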

## 6 Intrinsic Gaussian Markov random fields (IGMRF)

Intrinsic conditional autoregressions (ICAR) are widely used in spatial statistics and dynamic linear models (Besag et al., 1991). These models are improper versions of conditional autoregressive models (CAR), introduced as spatial models by Besag (1974). Important special cases of CAR and ICAR models are Gaussian Markov Random Fields (GMRF) and the intrinsic (improper) versions denoted IGMRF; see Rue and Held (2005) for a thorough treatment including applications.

As discussed by Lavine and Hodges (2012), the fact that the intrinsic models correspond to improper distributions implies that care should be taken in their use and interpretation.

### 6.1 The first order random walk

Following Rue and Held (2005) we use this simple special case of an IGMRF to illustrate some of the main issues regarding IGMRF models.

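Let $x = (x_1, \ldots, x_n)$ be the successive observations of a random walk, assuming independent increments

$$\Delta x_i = x_{i+1} - x_i \overset{\text{iid}}{\sim} N(0, \kappa^{-1}), \quad i = 1, 2, \ldots, n - 1.$$

The IGMRF model specifies the density of $x$ to be the density obtained from these increments (only), giving

$$f(x \mid \kappa) \propto \kappa^{(n-1)/2} \exp\!\left(-\frac{\kappa}{2} \sum_{i=1}^{n-1} (\Delta x_i)^2\right) = \kappa^{(n-1)/2} \exp\!\left(-\frac{\kappa}{2}\, x^T Q x\right). \tag{17}$$

Here the structure matrix $Q$ (displayed in Rue and Held (2005), p. 96) is positive semi-definite, with exactly one eigenvalue equal to 0, implying that $f(x \mid \kappa)$ is an improper density.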

Statistical inference in models involving IGMRFs may involve making inference about the precision parameter $\kappa$. In a Bayesian analysis, one must typically assign to $\kappa$ a hyperprior and work with (17) as a likelihood function. In this connection, Lavine and Hodges (2012) question the use of the constant $\kappa^{(n-1)/2}$ appearing in (17). In the following discussion let us replace (17) by

$$f(x \mid \kappa) \propto c(\kappa) \exp\!\left(-\frac{\kappa}{2}\, x^T Q x\right), \tag{18}$$

thus making the appropriate choice of $c(\kappa)$ the main issue. As reported by Lavine and Hodges (2012), this choice has been discussed in several papers during the last two decades. Besag et al. (1991) in fact used $c(\kappa) = \kappa^{n/2}$, which was used by WinBUGS (Lunn et al., 2000) until it was changed to $c(\kappa) = \kappa^{(n-1)/2}$ following derivations appearing in, e.g., Knorr-Held (2003) and Hodges et al. (2003).

Rue and Held (2005) justify the density (17) as follows: Consider first the 1-1 transformation $x \to (\Delta x_1, \ldots, \Delta x_{n-1}, \bar{x})$ where $\bar{x}$ is the average of the $x_i$. Assuming that $x$ is multivariate normal, $(\Delta x_1, \ldots, \Delta x_{n-1})$ and $\bar{x}$ are stochastically independent. We may hence write down the following proper density for $x$, indexed by $k$ for the purpose of later taking limits,

$$\tilde{f}_k(x \mid \kappa) = f(x \mid \kappa) \cdot \breve{f}_k(\bar{x}). \tag{19}$$

Here $f(x \mid \kappa)$ is the density (17) while $\breve{f}_k$ is normal with zero expectation and a precision that tends to 0 as $k \to \infty$. We may then invoke Proposition 2.15 of Bioche and Druilhet (2016) to show that

$$\tilde{f}_k(x \mid \kappa) \to f(x \mid \kappa) \quad \text{as } k \to \infty. \tag{20}$$

The interpretation of this is that the improper density (17) is the limit of a sequence of proper densities for $x$. This derivation can also be interpreted as adding to the model for the $\Delta x_i$ a prior specification for $\bar{x}$ given in the form of a constant prior.

Lavine and Hodges (2012) point, however, to a problem with this conclusion, having to do with the non-uniqueness of marginals in cases involving improper distributions and related to our discussion in Section 2. To illustrate, essentially following Lavine and Hodges, we consider the modified 1-1 transformation of given as

$$\left(\Delta x_1,\ldots,\Delta x_{n-1},\ \bar{x}/\Delta x_1\right).$$

It follows by the ordinary transformation formula (involving a Jacobian determinant), starting from (19), that we have

$$\tilde{f}_k(x \mid \kappa) = f(x \mid \kappa)\cdot \breve{f}_k(\bar{x}/\Delta x_1)\cdot \frac{1}{|\Delta x_1|}.$$

Again, letting $k \to \infty$, we get

$$\tilde{f}_k(x \mid \kappa) \to f(x \mid \kappa)\cdot \frac{1}{|\Delta x_1|} \quad \text{as } k \to \infty,$$

thus giving a limit different from (20).
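The two limits can be compared numerically. The sketch below assumes, for illustration only, that $\breve{f}_k$ is the normal density with mean 0 and precision $\tau_k = 1/k$ (the name `tau_k` and this particular sequence are not taken from the text). Both factors are rescaled by $\sqrt{2\pi/\tau_k}$, a constant that does not depend on $x$, since convergence to an improper density is understood only up to such constants.

```python
import math

def normal_pdf(u, precision):
    """Density of N(0, 1/precision) evaluated at u."""
    return math.sqrt(precision / (2.0 * math.pi)) * math.exp(-0.5 * precision * u * u)

def rescaled_factor_mean(xbar, precision):
    """Factor multiplying f(x|kappa) in (19), rescaled by a constant free of x."""
    scale = math.sqrt(2.0 * math.pi / precision)
    return scale * normal_pdf(xbar, precision)

def rescaled_factor_ratio(xbar, dx1, precision):
    """Same factor when the normal law is placed on xbar / dx1 instead of xbar."""
    scale = math.sqrt(2.0 * math.pi / precision)
    return scale * normal_pdf(xbar / dx1, precision) / abs(dx1)

xbar, dx1 = 1.5, 0.25  # arbitrary illustrative values

for k in (1, 100, 10_000, 1_000_000):
    tau_k = 1.0 / k  # assumed precision sequence, tending to 0
    print(k, rescaled_factor_mean(xbar, tau_k), rescaled_factor_ratio(xbar, dx1, tau_k))

# Values at a very small precision, standing in for the limit.
limit_mean = rescaled_factor_mean(xbar, 1e-12)
limit_ratio = rescaled_factor_ratio(xbar, dx1, 1e-12)
```

As the precision decreases, the first factor approaches 1, which corresponds to the limit (20), while the second approaches $1/|\Delta x_1|$. The same sequence of normal laws thus yields two different improper limits, depending on which coordinate it is placed on.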

Lavine and Hodges (2012) conclude that essentially all the arguments given in the literature for the value of the constant $c(\kappa)$ are flawed in some way. Their conclusion is therefore that any value of this constant may do. This is of course also in accordance with the previous section, where the quotient topology for distributions was used and where improper (as well as proper) distributions were identified with equivalence classes only.

Having said this, there seem to be good reasons to use the form (17). It follows from Rue and Held (2005), p. 90-91, who considered a more general case, that (17), when restricted to $x$ such that $\bar{x}$ equals a fixed value, is the conditional density of $x$ given that value of $\bar{x}$. The fixed value can be any real number, but it seems that 0 is commonly used. Furthermore, fixing $\bar{x}$ enables one to simulate from the distribution (17) (see Rue and Held (2005), p. 92).

### 6.2 Bayesian analysis with IGMRFs

In a Bayesian inference with $\kappa$ as a parameter we consider $f(x \mid \kappa)$ in (18) as a likelihood function. It should then be noted that $f(x \mid \kappa)$ is improper as a density for $x$ and hence not proportional to a proper distribution, in contrast to commonly considered likelihood functions.

Let $\pi(\kappa)$ be the prior density of $\kappa$, possibly improper. The natural definition of the joint distribution of $(x,\kappa)$ is then

$$f(x,\kappa) = f(x \mid \kappa)\,\pi(\kappa). \tag{21}$$

Thus the marginal density of $\kappa$ is

$$\int f(x,\kappa)\,dx = \int f(x \mid \kappa)\,\pi(\kappa)\,dx = \infty,$$

so $\pi(\kappa)$ is in fact not the marginal distribution of $\kappa$. Still, by the theory of Section 2, the posterior density $\pi(\kappa \mid x)$ is well defined provided $x$ is $\sigma$-finite. This holds if the integral over $\kappa$ of (21) is finite for (almost) all $x$, i.e., if

$$\int \pi(\kappa)\,c(\kappa)\exp\left(-\frac{\kappa}{2}\,x^{T}Qx\right)d\kappa < \infty.$$

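A sufficient condition for this is clearly that $\int \pi(\kappa)\,c(\kappa)\,d\kappa < \infty$, since the exponential factor is at most 1 for $\kappa > 0$. The conclusion of the above is that Bayesian inference for $\kappa$ is well-behaved under reasonable restrictions, as soon as the constant $c(\kappa)$ has been determined.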
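A small numerical check may make the finiteness condition concrete. The sketch below rests on two assumptions that are not part of the text: a Gamma prior for $\kappa$ with shape `a` and rate `b`, and the choice $c(\kappa)=\kappa^{(n-1)/2}$. Under these assumptions the integrand is proportional to $\kappa^{a+(n-1)/2-1}\exp\{-\kappa(b+\tfrac{1}{2}x^{T}Qx)\}$, so the posterior should be a Gamma distribution with shape $a+(n-1)/2$ and rate $b+\tfrac{1}{2}x^{T}Qx$. The value used for $x^{T}Qx$ is hypothetical.

```python
import math

def unnormalized_posterior(kappa, s, exponent, a, b):
    """pi(kappa) * c(kappa) * exp(-kappa/2 * s), with a Gamma(a, b) prior kernel
    and c(kappa) = kappa**exponent (both are illustrative assumptions)."""
    prior = kappa ** (a - 1.0) * math.exp(-b * kappa)
    return prior * kappa ** exponent * math.exp(-0.5 * kappa * s)

def midpoint_integral(func, lower, upper, steps):
    """Midpoint-rule approximation of the integral of func over [lower, upper]."""
    h = (upper - lower) / steps
    return h * sum(func(lower + (i + 0.5) * h) for i in range(steps))

n = 5                      # dimension of x
s = 8.36                   # hypothetical value of x^T Q x
exponent = (n - 1) / 2.0   # assumed choice of c(kappa)
a, b = 2.0, 1.0            # assumed Gamma prior: shape a, rate b

def density(kappa):
    return unnormalized_posterior(kappa, s, exponent, a, b)

# The integrand is negligible beyond kappa = 40 for these values.
norm_const = midpoint_integral(density, 0.0, 40.0, 100_000)
post_mean = midpoint_integral(lambda k: k * density(k), 0.0, 40.0, 100_000) / norm_const

# Closed form under the assumptions: Gamma(a + exponent, b + s / 2).
closed_form_const = math.gamma(a + exponent) / (b + 0.5 * s) ** (a + exponent)
closed_form_mean = (a + exponent) / (b + 0.5 * s)
```

The normalizing integral is finite for every value of $x^{T}Qx \ge 0$ here, so the posterior for $\kappa$ is proper even though (18) is improper as a density for $x$. A different exponent in $c(\kappa)$ would change the shape parameter of the posterior, which is one way to see why the choice of the constant matters in practice.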

## 7 Concluding remarks

In this paper we have presented, and discussed in view of several examples, a simple theoretical approach which enables the inclusion of improper priors in Bayesian analyses. A special feature of the approach is that both parameters and observations are represented as random quantities defined on a common underlying space $\Omega$. The key has been to allow the probability Pr in Kolmogorov’s axioms to be a $\sigma$-finite law with $\Pr(\Omega)=\infty$. In fact it was shown in Section 2 that $\Pr(\Omega)=\infty$ is necessary if improper priors are to be included.

What makes this a sensible theory is the fact that all conditional distributions, given $\sigma$-finite random quantities, are proper distributions. In particular, this property leads to a consistent treatment of statistical models and a theoretically based condition for posterior propriety.

The relation to Renyi’s theory of conditional probability spaces has been mentioned earlier. In this connection we would also like to quote from Lindley (1965). In the Preface to his classical text on probability he writes:

The axiomatic structure used here is not the usual one associated with the name of Kolmogorov. Instead one based on the ideas of Renyi has been used. The essential difference between the two approaches is that Renyi’s is stated in terms of conditional probabilities, whereas Kolmogorov’s is in terms of absolute probabilities, and conditional probabilities are defined in terms of them. Our treatment always refers to the probability of A, given B, and not simply to the probability of A. In my experience students benefit from having to think of probability as a function of two arguments, A and B, right from the beginning. The conditioning event, B, is then not easily forgotten and misunderstandings are avoided. These ideas are particularly important in Bayesian inference where one’s views are influenced by the changes in the conditioning event.

## References

• Akaike (1980) Akaike, H., 1980. The interpretation of improper prior distributions as limits of data dependent proper prior distributions. Journal of the Royal Statistical Society. Series B (Methodological), 46–52.
• Besag (1974) Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society. Series B (Methodological), 192–236.
• Besag et al. (1991) Besag, J., York, J., Mollié, A., 1991. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics 43 (1), 1–20.
• Billingsley (2008) Billingsley, P., 2008. Probability and Measure. John Wiley & Sons, Hoboken, New Jersey.
• Bioche and Druilhet (2016) Bioche, C., Druilhet, P., 2016. Approximation of improper priors. Bernoulli 22 (3), 1709–1728.
• Casella and Berger (2002) Casella, G., Berger, R. L., 2002. Statistical Inference, 2nd Ed. Duxbury, Pacific Grove, CA.
• Chang and Pollard (1997) Chang, J. T., Pollard, D., 1997. Conditioning as disintegration. Statistica Neerlandica 51 (3), 287–317.
• Gelfand and Sahu (1996) Gelfand, A. E., Sahu, S. K., 1996. Identifiability, propriety, and parametrization with regard to simulation-based fitting of generalized linear mixed models. Tech. rep., 96-36, Department of Statistics, University of Connecticut.

• Gelfand and Sahu (1999) Gelfand, A. E., Sahu, S. K., 1999. Identifiability, improper priors, and Gibbs sampling for generalized linear models. Journal of the American Statistical Association 94 (445), 247–253.
• Hartigan (1983) Hartigan, J. A., 1983. Bayes Theory. Springer Science, New York.
• Hobert and Casella (1996) Hobert, J. P., Casella, G., 1996. The effect of improper priors on Gibbs sampling in hierarchical linear mixed models. Journal of the American Statistical Association 91 (436), 1461–1473.
• Hodges et al. (2003) Hodges, J. S., Carlin, B. P., Fan, Q., 2003. On the precision of the conditionally autoregressive prior in spatial models. Biometrics 59 (2), 317–322.
• Irony and Singpurwalla (1997) Irony, T. Z., Singpurwalla, N. D., 1997. Non-informative priors do not exist. A dialogue with José M. Bernardo. Journal of Statistical Planning and Inference 65 (1), 159–177.
• Knorr-Held (2003) Knorr-Held, L., 2003. Some remarks on Gaussian Markov random field models for disease mapping. In: Green, P., Hjort, N., Richardson, S. (Eds.), Highly Structured Stochastic Systems. Oxford University Press, Oxford.
• Lavine and Hodges (2012) Lavine, M. L., Hodges, J. S., 2012. On rigorous specification of ICAR models. The American Statistician 66 (1), 42–49.
• Lindley (1965) Lindley, D. V., 1965. Introduction to Probability and Statistics from a Bayesian Viewpoint. Vol. 1-2. Cambridge University Press, Cambridge.
• Lindqvist and Taraldsen (2005) Lindqvist, B. H., Taraldsen, G., 2005. Monte Carlo conditioning on a sufficient statistic. Biometrika 92 (2), 451–464.
• Lunn et al. (2000) Lunn, D. J., Thomas, A., Best, N., Spiegelhalter, D., 2000. WinBUGS - a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing 10 (4), 325–337.
• McCullagh et al. (2011) McCullagh, P., Han, H., et al., 2011. On Bayes’ theorem for improper mixtures. The Annals of Statistics 39 (4), 2007–2020.
• Renyi (1962) Renyi, A., 1962. Probability Theory. North-Holland, Amsterdam.
• Renyi (1970) Renyi, A., 1970. Foundations of Probability. North-Holland, Amsterdam.
• Royden (1968) Royden, H., 1968. Real Analysis: 2nd Ed. Macmillan, London.
• Rue and Held (2005) Rue, H., Held, L., 2005. Gaussian Markov Random Fields: Theory and Applications. CRC Press, London.
• Stone and Dawid (1972) Stone, M., Dawid, A., 1972. Un-Bayesian implications of improper Bayes inference in routine statistical problems. Biometrika 59 (2), 369–375.
• Taraldsen and Lindqvist (2010) Taraldsen, G., Lindqvist, B. H., 2010. Improper priors are not improper. The American Statistician 64 (2), 154–158.
• Taraldsen and Lindqvist (2013) Taraldsen, G., Lindqvist, B. H., 2013. Fiducial theory and optimal inference. The Annals of Statistics 41 (1), 323–341.
• Taraldsen and Lindqvist (2015) Taraldsen, G., Lindqvist, B. H., 2015. Fiducial and posterior sampling. Communications in Statistics–Theory and Methods 44 (17), 3754–3767.
• Taraldsen and Lindqvist (2016) Taraldsen, G., Lindqvist, B. H., 2016. Conditional probability and improper priors. Communications in Statistics–Theory and Methods 45 (17), 5007–5016.