 # Asymptotically Independent Markov Sampling: a new MCMC scheme for Bayesian Inference

In Bayesian statistics, many problems can be expressed as the evaluation of the expectation of a quantity of interest with respect to the posterior distribution. The standard Monte Carlo method is often not applicable because the encountered posterior distributions cannot be sampled directly. In this case, the most popular strategies are the importance sampling method, Markov chain Monte Carlo, and annealing. In this paper, we introduce a new scheme for Bayesian inference, called Asymptotically Independent Markov Sampling (AIMS), which is based on the above methods. We derive important ergodic properties of AIMS. In particular, it is shown that, under certain conditions, the AIMS algorithm produces a uniformly ergodic Markov chain. The choice of the free parameters of the algorithm is discussed, and recommendations for this choice, based both on theory and heuristics, are provided. The efficiency of AIMS is demonstrated with three numerical examples, which include both multi-modal and higher-dimensional target posterior distributions.


## 1 Three cornerstones of computational Bayesian inference

In Bayesian statistics, many problems can be expressed as the evaluation of the expectation of a quantity of interest with respect to the posterior distribution. Standard Monte Carlo simulation [MU49], where expectations are estimated by sample averages based on samples drawn independently from the posterior, is often not applicable because the encountered posterior distributions are multi-dimensional non-Gaussian distributions that cannot be explicitly normalized. In this case, the most popular strategies are importance sampling and Markov chain Monte Carlo (MCMC) methods. We briefly review these two methods first because they play an important role in the new MCMC method introduced in this paper.

Importance sampling: This is nearly as old as the Monte Carlo method (see, for instance, [KM53]), and works as follows. Suppose we want to evaluate $\mathbb{E}_\pi[h]$, the expectation of a function of interest $h$ under a distribution $\pi$ defined on a parameter space $\Theta$:

(Unless otherwise stated, all probability distributions are assumed to have densities with respect to Lebesgue measure. For simplicity, the same symbol will be used to denote both a distribution and its density, and we write $\theta\sim\pi$ to denote that $\theta$ is distributed according to $\pi$.)

$$\mathbb{E}_\pi[h]=\int_\Theta h(\theta)\pi(\theta)\,d\theta. \tag{1}$$

Suppose also that we are not able to sample directly from $\pi$, although we can compute $\pi(\theta)$ for any $\theta\in\Theta$ to within a proportionality constant. Instead, we sample from some other distribution $q$ on $\Theta$ which is readily computable for any $\theta\in\Theta$. Let $\theta^{(1)},\ldots,\theta^{(N)}$ be i.i.d. samples from $q$, and let $w^{(i)}=\pi(\theta^{(i)})/q(\theta^{(i)})$ denote the importance weight of the $i$-th sample; then we can estimate $\mathbb{E}_\pi[h]$ by

$$\hat{h}_N=\frac{\sum_{i=1}^N w^{(i)}h(\theta^{(i)})}{\sum_{i=1}^N w^{(i)}}. \tag{2}$$

The estimator $\hat{h}_N$ converges almost surely to $\mathbb{E}_\pi[h]$ as $N\to\infty$ by the Strong Law of Large Numbers for any choice of distribution $q$, provided $\operatorname{supp}(\pi)\subseteq\operatorname{supp}(q)$. Note that the latter condition automatically holds in Bayesian updating using data $\mathcal{D}$ when $q(\theta)$ is the prior density and $\pi(\theta)\propto q(\theta)\mathcal{L}(\theta)$ is the posterior, where $\mathcal{L}(\theta)$ stands for the likelihood function $p(\mathcal{D}|\theta)$.
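As a concrete illustration (not from the paper), the self-normalized estimator (2) can be sketched in a few lines of NumPy. The target $\pi$, the ISD $q$, and the function of interest $h$ below are toy choices for which the true answer $\mathbb{E}_\pi[\theta^2]=1$ is known; note that the target is used only up to its normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(0)

def pi_unnorm(theta):
    # Unnormalized target density, pi = N(0,1); its constant is never needed.
    return np.exp(-0.5 * theta**2)

# Importance sampling distribution q = N(0, 2^2), wider than the target.
sigma_q = 2.0
N = 100_000
theta = rng.normal(0.0, sigma_q, size=N)                      # theta^(i) ~ q
q_pdf = np.exp(-0.5 * (theta / sigma_q)**2) / (sigma_q * np.sqrt(2.0 * np.pi))

w = pi_unnorm(theta) / q_pdf                # importance weights w^(i)
h_hat = np.sum(w * theta**2) / np.sum(w)    # self-normalized estimator (2)
```

Because the weights appear in both numerator and denominator, any constant factor in `pi_unnorm` cancels, which is exactly the advantage of (2) over (3) noted below.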

The estimator in (2) generally has a smaller mean square error than the more straightforward unbiased importance sampling estimator:

$$\hat{h}'_N=\frac{1}{N}\sum_{i=1}^N w^{(i)}h(\theta^{(i)}). \tag{3}$$

This is especially clear when $h$ is nearly a constant: if $h(\theta)\approx c$, then $\hat{h}_N\approx c$, while $\hat{h}'_N$ has a larger variation. Although $\hat{h}_N$ is biased for any finite $N$, the bias can be made small by taking $N$ sufficiently large, and the improvement in variance makes it a preferred alternative to $\hat{h}'_N$ [Li01, RC04]. Another major advantage of using $\hat{h}_N$ instead of $\hat{h}'_N$, which is especially important for Bayesian applications, is that the former requires knowledge of $\pi$ only up to a multiplicative normalizing constant, whereas in the latter this constant must be known exactly.

The accuracy of $\hat{h}_N$ depends critically on the choice of the importance sampling distribution (ISD) $q$, which is also called the instrumental or trial distribution. If $q$ is chosen carelessly, so that the importance weights $w^{(i)}$ have a large variation, then $\hat{h}_N$ is essentially based on only the few samples with the largest weights, yielding a generally very poor estimate. Hence, for importance sampling to work efficiently, $q$ must be a good approximation of $\pi$ — “the importance sampling density should mimic the posterior density” [Ge89] — so that the variance of the weights is not large. Since the prior and posterior are usually quite different, it is therefore highly inefficient to use the prior as the importance sampling distribution. When $\theta$ is high-dimensional and $\pi$ is complex, finding a good importance sampling distribution can be very challenging, limiting the applicability of the method [AB03].

For the estimator $\hat{h}'_N$ in (3), it is not difficult to show that the optimal importance sampling density, i.e., the one that minimizes the variance of $\hat{h}'_N$, is $q^*(\theta)\propto|h(\theta)|\pi(\theta)$. This result is sometimes attributed to Rubinstein [Ru81], although it was proved earlier by Kahn and Marshall [KM53]. It is not true, however, that $q^*$ is optimal for the estimator $\hat{h}_N$. Note also that this optimality result is not useful in practice, since when $h(\theta)>0$, the required normalizing constant of $q^*$ is $\mathbb{E}_\pi[h]$, the integral of interest.

MCMC Sampling: Instead of generating independent samples from an ISD, we could generate dependent samples by simulating a Markov chain whose stationary distribution is the posterior distribution. Markov chain Monte Carlo (MCMC) sampling originated in statistical physics, and is now widely used in solving statistical problems [Ne93, GRS96, Li01, RC04].

The Metropolis-Hastings algorithm [MRT53, Ha70], the most popular MCMC technique, works as follows. Let $q(\cdot|\theta)$ be a distribution on $\Theta$, which may or may not depend on $\theta\in\Theta$. Assume that $q(\cdot|\theta)$ is easy to sample from and is either computable (up to a multiplicative constant) or symmetric, i.e. $q(\xi|\theta)=q(\theta|\xi)$. The sampling distribution $q$ is called the proposal distribution. Starting from essentially any $\theta^{(1)}\in\operatorname{supp}(\pi)$, the Metropolis-Hastings algorithm proceeds by iterating the following two steps. First, generate a candidate state $\xi$ from the proposal density $q(\cdot|\theta^{(i)})$. Second, either accept $\xi$ as the next state of the Markov chain, $\theta^{(i+1)}=\xi$, with probability $\alpha(\xi|\theta^{(i)})=\min\left\{1,\frac{\pi(\xi)q(\theta^{(i)}|\xi)}{\pi(\theta^{(i)})q(\xi|\theta^{(i)})}\right\}$; or reject $\xi$ and set $\theta^{(i+1)}=\theta^{(i)}$ with the remaining probability $1-\alpha(\xi|\theta^{(i)})$. It can be shown (see, for example, [RC04]) that, under fairly weak conditions, $\pi$ is the stationary distribution of the Markov chain $\theta^{(1)},\theta^{(2)},\ldots$ and

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^N h(\theta^{(i)})=\int_\Theta h(\theta)\pi(\theta)\,d\theta. \tag{4}$$

Since the chain needs some time (the so-called “burn-in” period) to converge to stationarity, in practice an initial portion of, say, $N_0$ states is usually discarded and

$$\tilde{h}_N=\frac{1}{N-N_0}\sum_{i=N_0+1}^N h(\theta^{(i)}) \tag{5}$$

is used as an estimator for $\mathbb{E}_\pi[h]$.
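The recipe above, specialized to a symmetric random-walk proposal so that the $q$-ratio cancels from the acceptance probability, can be sketched as follows. The Gaussian target, step size, and burn-in length are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Unnormalized log-density of an illustrative target, N(3, 1).
    return -0.5 * (theta - 3.0)**2

N, N0 = 50_000, 5_000        # chain length and burn-in period
sigma = 1.0                  # random-walk step size
chain = np.empty(N)
theta = 0.0                  # essentially arbitrary initial state
for i in range(N):
    cand = theta + rng.normal(0.0, sigma)      # symmetric local proposal
    # Accept with probability min{1, pi(cand)/pi(theta)}, done in log space.
    if np.log(rng.uniform()) < log_target(cand) - log_target(theta):
        theta = cand
    chain[i] = theta

estimate = chain[N0:].mean()   # estimator (5) with burn-in discarded
```

Working with log-densities avoids underflow, which matters once the posterior involves many data points.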

The two main special cases of the Metropolis-Hastings algorithm are Independent Metropolis-Hastings (IMH), where the proposal distribution $q(\xi|\theta)=q(\xi)$ is independent of $\theta$ (so $q$ is a global proposal), and Random Walk Metropolis-Hastings (RWMH), where the proposal distribution is of the form $q(\xi|\theta)=q(\xi-\theta)$, i.e. a candidate state is proposed as $\xi=\theta^{(i)}+\epsilon$, where $\epsilon\sim q$ is a random perturbation (so $q$ is a local proposal). In both cases, the choice of the proposal distribution strongly affects the efficiency of the algorithms. For IMH to work well, as with importance sampling, the proposal distribution $q$ must be a good approximation of the target distribution $\pi$; otherwise a large fraction of the candidate samples will be rejected and the Markov chain will be too slow in covering the important regions of $\pi$. When, however, it is possible to find a proposal $q$ such that $q\approx\pi$, IMH should always be preferred to RWMH because of its better efficiency, i.e. better approximations of $\mathbb{E}_\pi[h]$ for a given number of samples $N$. Unfortunately, such a proposal is difficult to construct in the context of Bayesian inference, where the posterior $\pi$ is often complex and high-dimensional. This limits the applicability of IMH.

Since the random walk proposal is local, it is less sensitive to the target distribution. That is why, in practice, RWMH is more robust and used more frequently than IMH. Nonetheless, there are settings where RWMH also does not work well because of the complexity of the posterior distribution. Although (4) is true in theory, a potential problem with RWMH (and, in fact, with any MCMC algorithm) is that the generated samples are often highly correlated. As a result, the estimator in (5) tends to have a large variance for a modest number of samples. This is especially true when the posterior distribution contains several widely separated modes: the chain will move between modes only rarely, and it will take a long time before it reaches stationarity. If this is the case, an estimate produced by $\tilde{h}_N$ will be very inaccurate. At first glance, it seems natural to generate several independent Markov chains, starting from different random seeds, and hope that different chains will get trapped by different modes. However, multiple runs will not in general generate a sample in which each mode is correctly represented, since the probability of a chain reaching a mode depends more on the mode’s “basin of attraction” than on the probability concentrated in the mode [Ne96].

Annealing: The concept of annealing (or tempering), which involves moving from an easy-to-sample distribution to the target distribution via a sequence of intermediate distributions, is one of the most effective methods of handling multiple isolated modes. Together with importance sampling and MCMC, annealing constitutes the third cornerstone of computational Bayesian inference.

The idea of using the RWMH algorithm in conjunction with annealing was introduced independently in [KGV83] and [Če85] for solving difficult optimization problems. The resulting algorithm, called Simulated Annealing, works as follows. Suppose we want to find the global minimum of a function of interest $h(\theta)$. This is equivalent to finding the global maximum of $f_T(\theta)=\exp(-h(\theta)/T)$ for any given $T>0$. By analogy with the Gibbs distribution in statistical mechanics, $T$ is called the temperature parameter. Let $T_0>T_1>\ldots$ be a sequence of monotonically decreasing temperatures, in which $T_0$ is large enough so that the probability distribution $\pi_0(\theta)\propto f_{T_0}(\theta)$ is close to uniform, and $\lim_{m\to\infty}T_m=0$. At each temperature $T_m$, the Simulated Annealing method generates a Markov chain with $\pi_m(\theta)\propto f_{T_m}(\theta)$ as its stationary distribution. The final state of the Markov chain at simulation level $m$ is used as the initial state for the chain at level $m+1$. The key observation is that, for any function $h$ such that $\int_\Theta f_T(\theta)\,d\theta<\infty$ for all $T>0$, the distribution $\pi_m$, as $m$ increases, puts more and more of its probability mass (converging to 1) into a neighborhood of the global minimum of $h$. Therefore, a sample drawn from $\pi_m$ would almost surely be in a vicinity of the global minimum of $h$ when $T_m$ is close to zero.

The success of Simulated Annealing in finding the global minimum crucially depends on the schedule of temperatures used in the simulation. It was proved in [GG84] that if a logarithmic schedule $T_m=T_0/\log(m+1)$ is used, then, under certain conditions, there exists a value of $T_0$ such that use of this schedule guarantees that the global minimum of $h$ will be reached almost surely. In practice, however, such a slow annealing schedule is not computationally efficient. It is more common to use either a geometric schedule, $T_{m+1}=\gamma T_m$ with $0<\gamma<1$, or some adaptive schedule, which defines the temperature for the next annealing level based on characteristics of the samples observed at earlier levels. For examples of adaptive annealing schedules, see, for instance, [Ne93].

In Bayesian inference problems, the idea of annealing is typically employed in the following way. First, we construct (in advance or adaptively) a sequence of distributions $\pi_0,\pi_1,\ldots,\pi_m$ interpolating between the prior distribution $\pi_0$ and the posterior distribution $\pi_m=\pi$. Next, we generate i.i.d. samples from the prior, which is assumed to be readily sampled. Then, at each annealing level $j$, using some MCMC algorithm and the samples from the previous level $j-1$, we generate samples which are approximately distributed according to $\pi_j$. We proceed sequentially in this way until the posterior distribution has been sampled. The rationale behind this strategy is that sampling from the multi-modal and, perhaps, high-dimensional posterior in such a way is likely to be more efficient than straightforward MCMC sampling of the posterior.

The problem of sampling a complex distribution is encountered in statistical mechanics, computational Bayesian inference, scientific computing, machine learning, and other fields. As a result, many different efficient algorithms have been developed, e.g. the method of Simulated Tempering [MP92, GT95], the Tempered Transition method [Ne96], Annealed Importance Sampling [Ne01], the Adaptive Metropolis-Hastings algorithm [BA02], and the Transitional Markov Chain Monte Carlo method [CC07], to name a few.

In this paper we introduce a new MCMC scheme for Bayesian inference, called Asymptotically Independent Markov Sampling (AIMS), which combines the three approaches described above — importance sampling, MCMC, and annealing — in the following way. Importance sampling with $\pi_{j-1}$ as the ISD is used to construct an approximation $\hat{\pi}_j^{N_{j-1}}$ of $\pi_j$, based on samples $\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}\sim\pi_{j-1}$. This approximation is then employed as the independent (global) proposal distribution for sampling from $\pi_j$ by the IMH algorithm. Intermediate distributions interpolating between prior and posterior are constructed adaptively, using the effective sample size (ESS) to measure how much $\pi_{j-1}$ differs from $\pi_j$. When the number of samples $N_{j-1}\to\infty$, the approximation $\hat{\pi}_j^{N_{j-1}}$ converges to $\pi_j$, providing the optimal proposal distribution. In other words, when $N_{j-1}\to\infty$, the corresponding MCMC sampler produces independent samples, hence the name of the algorithm.

###### Remark 1.

The term “Markov sampling” has several different meanings. In this paper, it is used as a synonym for “MCMC sampling”.

In this introductory section, we have described all the main ingredients that we will need in the subsequent sections. The rest of the paper is organized as follows. In Section 2, the AIMS algorithm is described. The ergodic properties of AIMS are derived in Section 3. The efficiency of AIMS is illustrated in Section 4 with three numerical examples that include both multi-modal and high-dimensional posterior distributions. Concluding remarks are made in Section 5.

## 2 Asymptotically Independent Markov Sampling

Let $\pi_0$ and $\pi$ be the prior and the posterior distributions defined on a parameter space $\Theta$, respectively, so that, according to Bayes’ Theorem, $\pi(\theta)\propto\pi_0(\theta)\mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ denotes the likelihood function for data $\mathcal{D}$. Our ultimate goal is to draw samples that are distributed according to $\pi$.

In Asymptotically Independent Markov Sampling (AIMS), we sequentially generate samples from intermediate distributions $\pi_0,\pi_1,\ldots,\pi_m=\pi$ interpolating between the prior $\pi_0$ and the posterior $\pi$. The sequence of distributions could be specially constructed for a given problem, but the following scheme [Ne01, CC07] generally yields good efficiency:

$$\pi_j(\theta)\propto\pi_0(\theta)\mathcal{L}(\theta)^{\beta_j}, \tag{6}$$

where $0=\beta_0<\beta_1<\ldots<\beta_m=1$. We will refer to $j$ and $\beta_j$ as the annealing level and the annealing parameter at level $j$, respectively. In the next subsection, we assume that $\beta_j$ is given and therefore the intermediate distribution $\pi_j$ is also known. In Subsection 2.2, we describe how to choose the annealing parameters adaptively.

### 2.1 AIMS at annealing level j

Our first goal is to describe how AIMS generates a sample $\theta^{(1)}_j,\ldots,\theta^{(N_j)}_j$ from $\pi_j$ based on the sample $\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}\sim\pi_{j-1}$ obtained at the previous annealing level. We start with an informal motivating discussion that leads to the simulation algorithm. In Section 3, we rigorously prove that the corresponding algorithm indeed generates samples which are asymptotically distributed according to $\pi_j$ as the sample size $N_j\to\infty$. Moreover, the larger $N_{j-1}$, the less correlated the generated samples are — a very desirable, yet rarely affordable, property for any MCMC algorithm.

Let $K_j(\cdot|\cdot)$ be any transition kernel such that $\pi_j$ is a stationary distribution with respect to $K_j$. By definition, this means that

$$\pi_j(d\theta)=\int_\Theta K_j(d\theta|\xi)\pi_j(\xi)\,d\xi. \tag{7}$$

Applying importance sampling with the sampling density $\pi_{j-1}$ to the integral in (7), we have:

$$\pi_j(d\theta)=\int_\Theta K_j(d\theta|\xi)\frac{\pi_j(\xi)}{\pi_{j-1}(\xi)}\pi_{j-1}(\xi)\,d\xi\approx\sum_{i=1}^{N_{j-1}}K_j(d\theta|\theta^{(i)}_{j-1})\bar{w}^{(i)}_{j-1}\overset{\text{def}}{=}\hat{\pi}_j^{N_{j-1}}(d\theta), \tag{8}$$

where $\hat{\pi}_j^{N_{j-1}}$ will be used as the global proposal distribution in the Independent Metropolis-Hastings algorithm, and

$$w^{(i)}_{j-1}=\frac{\pi_j(\theta^{(i)}_{j-1})}{\pi_{j-1}(\theta^{(i)}_{j-1})}\propto\mathcal{L}(\theta^{(i)}_{j-1})^{\beta_j-\beta_{j-1}}\quad\text{and}\quad\bar{w}^{(i)}_{j-1}=\frac{w^{(i)}_{j-1}}{\sum_{k=1}^{N_{j-1}}w^{(k)}_{j-1}} \tag{9}$$

are the importance weights and normalized importance weights, respectively. Note that to calculate $\bar{w}^{(i)}_{j-1}$, we do not need to know the normalizing constants of $\pi_{j-1}$ and $\pi_j$. If the adjacent intermediate distributions $\pi_{j-1}$ and $\pi_j$ are sufficiently close (in other words, if $\beta_j-\beta_{j-1}$ is small enough), then the importance weights (9) will not vary wildly, and, therefore, we can expect that, for reasonably large $N_{j-1}$, the approximation (8) is accurate.

###### Remark 2.

In [CB10], the stationary condition (7) was used for an analytical approximation of the target PDF to evaluate the evidence (marginal likelihood) for a model.

###### Remark 3.

Note that for any finite $N_{j-1}$, the distribution $\hat{\pi}_j^{N_{j-1}}$ will usually have both continuous and discrete parts. This follows from the fact that the transition kernel in Markov chain simulation usually has the form $K(d\theta|\xi)=k(\theta|\xi)d\theta+r(\xi)\delta_\xi(d\theta)$, where $k(\cdot|\xi)$ describes the continuous part of the transition kernel, $\delta_\xi$ denotes the Dirac mass at $\xi$, and $r(\xi)=1-\int_\Theta k(\theta|\xi)\,d\theta$. This is the form, for example, for the Metropolis-Hastings algorithm. Therefore, (8) must be understood as an approximate equality of distributions, not densities. In other words, (8) means that $\mathbb{E}_{\hat{\pi}_j^{N_{j-1}}}[h]\approx\mathbb{E}_{\pi_j}[h]$, and $\mathbb{E}_{\hat{\pi}_j^{N_{j-1}}}[h]\to\mathbb{E}_{\pi_j}[h]$ when $N_{j-1}\to\infty$, for all integrable functions $h$. See also Example 2.1 below.

From now on, we consider the special case where $K_j$ is the random walk Metropolis-Hastings (RWMH) transition kernel. In this case, it can be written as follows:

$$K_j(d\theta|\xi)=q_j(\theta|\xi)\min\left\{1,\frac{\pi_j(\theta)}{\pi_j(\xi)}\right\}d\theta+(1-a_j(\xi))\delta_\xi(d\theta), \tag{10}$$

where $q_j(\cdot|\xi)$ is a symmetric local proposal density, and $a_j(\xi)$ is the probability of having a proper transition from $\xi$:

$$a_j(\xi)=\int_\Theta q_j(\theta|\xi)\min\left\{1,\frac{\pi_j(\theta)}{\pi_j(\xi)}\right\}d\theta. \tag{11}$$

Example 2.1. As a simple illustration of (8), consider a one-dimensional example in which $\pi_{j-1}$ and $\pi_j$ are Gaussian densities and $q_j$ is a symmetric Gaussian proposal (here $N(\theta|\mu,\sigma^2)$ denotes the Gaussian density with mean $\mu$ and variance $\sigma^2$). The approximation $\hat{\pi}_j^N$ based on samples $\theta^{(1)},\ldots,\theta^{(N)}\sim\pi_{j-1}$ is shown in the top panels of Figure 1 for two values of $N$. For two functions of interest $h_1$ and $h_2$, the convergence of $\mathbb{E}_{\hat{\pi}_j^N}[h_1]$ and $\mathbb{E}_{\hat{\pi}_j^N}[h_2]$ to the corresponding expectations under $\pi_j$ is shown in the bottom panel of Figure 1.

For sampling from $\pi_j$, we will use the Independent Metropolis-Hastings algorithm (IMH) with the global proposal distribution $\hat{\pi}_j^{N_{j-1}}$. To accomplish this, we have to be able to calculate the ratio $\hat{\pi}_j^{N_{j-1}}(\theta)/\hat{\pi}_j^{N_{j-1}}(\xi)$ for any $\theta,\xi\in\Theta$ as a part of the expression for the acceptance probability. However, as already mentioned, the distribution $\hat{\pi}_j^{N_{j-1}}$ does not have a density, since it has both continuous and discrete components, and, therefore, this ratio makes no sense. To overcome this “lack-of-continuity problem”, taking into account (8) and (10), let us formally define the global proposal distribution over $\Theta$ as:

$$\hat{\pi}_j^{N_{j-1}}(\theta)\overset{\text{def}}{=}\sum_{i=1}^{N_{j-1}}\bar{w}^{(i)}_{j-1}q_j(\theta|\theta^{(i)}_{j-1})\min\left\{1,\frac{\pi_j(\theta)}{\pi_j(\theta^{(i)}_{j-1})}\right\} \tag{12}$$

if $\theta\notin\mathcal{S}_{j-1}\overset{\text{def}}{=}\{\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}\}$, and

$$\hat{\pi}_j^{N_{j-1}}(\theta^{(k)}_{j-1})\overset{\text{def}}{=}\infty,\quad k=1,\ldots,N_{j-1}. \tag{13}$$

Note that $\hat{\pi}_j^{N_{j-1}}$ is a distribution on $\Theta$, but it does not have a density. However, $\hat{\pi}_j^{N_{j-1}}$ induces another distribution on $\Theta\setminus\mathcal{S}_{j-1}$ which does have a density, given by the r.h.s. of (12). This motivates definition (12).

Now, using (12) and (13), we can calculate the ratio $\hat{\pi}_j^{N_{j-1}}(\theta)/\hat{\pi}_j^{N_{j-1}}(\xi)$ as follows:

1. If $\theta,\xi\notin\mathcal{S}_{j-1}$, then

$$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)}=\frac{\sum_{i=1}^{N_{j-1}}\bar{w}^{(i)}_{j-1}q_j(\theta|\theta^{(i)}_{j-1})\min\{1,\pi_j(\theta)/\pi_j(\theta^{(i)}_{j-1})\}}{\sum_{i=1}^{N_{j-1}}\bar{w}^{(i)}_{j-1}q_j(\xi|\theta^{(i)}_{j-1})\min\{1,\pi_j(\xi)/\pi_j(\theta^{(i)}_{j-1})\}}. \tag{14}$$

2. If $\theta\notin\mathcal{S}_{j-1}$ and $\xi\in\mathcal{S}_{j-1}$, then

$$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)}=0\quad\text{and}\quad\alpha_j(\xi|\theta)=0. \tag{15}$$

3. If $\theta\in\mathcal{S}_{j-1}$ and $\xi\notin\mathcal{S}_{j-1}$, then

$$\frac{\hat{\pi}_j^{N_{j-1}}(\theta)}{\hat{\pi}_j^{N_{j-1}}(\xi)}=\infty\quad\text{and}\quad\alpha_j(\xi|\theta)=1. \tag{16}$$

4. If $\theta,\xi\in\mathcal{S}_{j-1}$, then the ratio $\hat{\pi}_j^{N_{j-1}}(\theta)/\hat{\pi}_j^{N_{j-1}}(\xi)$ is not defined.

Notice that in the first three cases the ratio is readily computable, while in Case IV it is not even defined. Therefore, it is very desirable to avoid Case IV. The key observation that allows us to do this is the following: suppose that the initial state $\theta^{(1)}_j$ of the Markov chain is such that $\theta^{(1)}_j\notin\mathcal{S}_{j-1}$; then $\theta^{(i)}_j\notin\mathcal{S}_{j-1}$ for all $i\geq1$. Indeed, the only way for the chain to enter the set $\mathcal{S}_{j-1}$ is to generate a candidate state $\xi\in\mathcal{S}_{j-1}$; however, according to Case II, such a candidate will always be rejected. Thus, by replacing the state space $\Theta$ by $\Theta\setminus\mathcal{S}_{j-1}$ and using (14) and (15) for evaluation of the ratio, we are able to calculate the acceptance probability $\alpha_j(\xi|\theta)$ involved in the IMH algorithm. It is clear that the replacement of $\Theta$ by $\Theta\setminus\mathcal{S}_{j-1}$ is harmless for the ergodic properties of the Markov chain, since $\pi_j(\mathcal{S}_{j-1})=0$.

###### Remark 4.

One may wonder why we do not just use the continuous part of $\hat{\pi}_j^{N_{j-1}}$ as the global proposal density within the IMH algorithm, i.e. the density that is proportional to the function defined by (12). Indeed, in this case we would not have any difficulties with calculating the acceptance ratio. The problem is that it is not clear how to sample from this density, while sampling from $\hat{\pi}_j^{N_{j-1}}$ is straightforward.

The above discussion leads to the following algorithm for sampling from the distribution :

AIMS at annealing level $j$

Input:

$\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}\sim\pi_{j-1}$, samples generated at annealing level $j-1$;

$\theta^{(1)}_j\in\Theta\setminus\mathcal{S}_{j-1}$, initial state of a Markov chain;

$q_j(\cdot|\cdot)$, symmetric proposal density associated with the RWMH kernel;

$N_j$, total number of Markov chain states to be generated.

Algorithm:

for $i=1,\ldots,N_j-1$ do

1) Generate a global candidate state $\xi_g\sim\hat{\pi}_j^{N_{j-1}}$ as follows:

a. Select $k\in\{1,\ldots,N_{j-1}\}$ with probabilities $\bar{w}^{(k)}_{j-1}$ given by (9).

b. Generate a local candidate $\xi_l\sim q_j(\cdot|\theta^{(k)}_{j-1})$.

c. Accept or reject $\xi_l$ by setting

$$\xi_g=\begin{cases}\xi_l,&\text{with probability }\min\left\{1,\dfrac{\pi_j(\xi_l)}{\pi_j(\theta^{(k)}_{j-1})}\right\};\\[4pt]\theta^{(k)}_{j-1},&\text{with the remaining probability.}\end{cases} \tag{17}$$

2) Update $\theta^{(i)}_j\to\theta^{(i+1)}_j$ by accepting or rejecting $\xi_g$ as follows:

if $\xi_g=\theta^{(k)}_{j-1}$

Set $\theta^{(i+1)}_j=\theta^{(i)}_j$

else

Set

$$\theta^{(i+1)}_j=\begin{cases}\xi_g,&\text{with probability }\min\left\{1,\dfrac{\pi_j(\xi_g)\,\hat{\pi}_j^{N_{j-1}}(\theta^{(i)}_j)}{\pi_j(\theta^{(i)}_j)\,\hat{\pi}_j^{N_{j-1}}(\xi_g)}\right\};\\[4pt]\theta^{(i)}_j,&\text{with the remaining probability.}\end{cases} \tag{18}$$

end if

end for

Output:

$\theta^{(1)}_j,\ldots,\theta^{(N_j)}_j$, states of a Markov chain with stationary distribution $\pi_j$.

Schematically, the AIMS algorithm at annealing level $j$ is shown in Figure 2. The proof that $\pi_j$ is indeed a stationary distribution for the Markov chain generated by AIMS is given in Section 3.
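The level-$j$ algorithm above can be sketched in a few dozen lines. The following one-dimensional implementation is an illustrative sketch, not the authors' code: `pi_j` is an unnormalized (vectorized) intermediate density, the local proposal $q_j$ is Gaussian with a user-chosen step size, and candidates coinciding with a previous-level sample are always rejected, as Case II requires:

```python
import numpy as np

rng = np.random.default_rng(2)

def aims_level(pi_j, theta_prev, w_bar, theta0, sigma, N):
    """One AIMS annealing level (1-D sketch, Gaussian RWMH proposal q_j).

    pi_j       : unnormalized target density at level j (vectorized)
    theta_prev : samples from level j-1
    w_bar      : their normalized importance weights, eq. (9)
    theta0     : initial chain state, not among theta_prev
    """
    theta_prev = np.asarray(theta_prev)
    w_bar = np.asarray(w_bar)

    def q(theta, xi):
        # Symmetric Gaussian local proposal density q_j(theta | xi).
        return np.exp(-0.5 * ((theta - xi) / sigma)**2) / (sigma * np.sqrt(2 * np.pi))

    def pi_hat(theta):
        # Continuous part of the global proposal, eq. (12).
        return np.sum(w_bar * q(theta, theta_prev)
                      * np.minimum(1.0, pi_j(theta) / pi_j(theta_prev)))

    chain = np.empty(N)
    theta = theta0
    for i in range(N):
        # Step 1: draw a global candidate xi_g from pi_hat.
        k = rng.choice(len(theta_prev), p=w_bar)
        xi_l = theta_prev[k] + sigma * rng.normal()            # local candidate
        if rng.uniform() < min(1.0, pi_j(xi_l) / pi_j(theta_prev[k])):
            xi_g = xi_l                                        # eq. (17)
        else:
            xi_g = theta_prev[k]
        # Step 2: IMH accept/reject, eq. (18); a candidate equal to a
        # previous-level sample is always rejected (Case II).
        if not np.any(xi_g == theta_prev):
            ratio = (pi_j(xi_g) * pi_hat(theta)) / (pi_j(theta) * pi_hat(xi_g))
            if rng.uniform() < min(1.0, ratio):
                theta = xi_g
        chain[i] = theta
    return chain
```

Because the chain is a valid independent Metropolis-Hastings sampler with proposal $\hat{\pi}_j^{N_{j-1}}$, its stationary distribution is exactly $\pi_j$ even when the approximation (8) is crude; the quality of the approximation affects only the correlation of the samples.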

###### Remark 5.

As is usual for MCMC algorithms, the fact of convergence of a Markov chain to its stationary distribution does not depend on the initial state; however, the speed of convergence does. One reasonable way to choose the initial state $\theta^{(1)}_j$ in practical applications is the following: generate $\theta^{(1)}_j\sim q_j(\cdot|\theta^{(k)}_{j-1})$, where $k=\arg\max_i\bar{w}^{(i)}_{j-1}$, i.e. $\theta^{(k)}_{j-1}$ has the largest normalized importance weight.

### 2.2 The full AIMS procedure

At the zeroth annealing level, $j=0$, we generate prior samples $\theta^{(1)}_0,\ldots,\theta^{(N_0)}_0\sim\pi_0$, which usually can be readily drawn directly by a suitable choice of the prior distribution $\pi_0$. Then, using the algorithm described in the previous subsection, we generate samples $\theta^{(1)}_1,\ldots,\theta^{(N_1)}_1$, which are approximately distributed according to the intermediate distribution $\pi_1$. We proceed like this until the posterior distribution $\pi_m=\pi$ ($j=m$) has been sampled. To make the description of AIMS complete, we have to explain how to choose the annealing parameters $\beta_j$, for $j=1,\ldots,m-1$.

It is clear that the choice of the annealing parameters is very important, since, for instance, it affects the accuracy of the importance sampling approximation (8) and, therefore, the efficiency of the whole AIMS procedure. At the same time, it is difficult to make a rational choice of the $\beta_j$-values in advance, since this requires some prior knowledge about the posterior distribution, which is often not available. For this reason, we propose an adaptive way of choosing the annealing scheme.

In importance sampling, a useful measure of degeneracy of the method is the effective sample size (ESS) introduced in [KLW94] and [Li96]. The ESS measures how similar the importance sampling distribution $\pi_{j-1}$ is to the target distribution $\pi_j$. Suppose $N_{j-1}$ independent samples $\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}$ are generated from $\pi_{j-1}$; then the ESS of these samples is defined as

$$N^{\mathrm{eff}}_{j-1}=\frac{N_{j-1}}{1+\operatorname{var}_{\pi_{j-1}}[w]}=\frac{N_{j-1}}{\mathbb{E}_{\pi_{j-1}}[w^2]}, \tag{19}$$

where $w(\theta)=\pi_j(\theta)/\pi_{j-1}(\theta)$. The ESS can be interpreted as implying that $N_{j-1}$ weighted samples are worth $N^{\mathrm{eff}}_{j-1}$ i.i.d. samples drawn from the target distribution $\pi_j$. One cannot evaluate the ESS exactly, but an estimate of $N^{\mathrm{eff}}_{j-1}$ is given by

$$\hat{N}^{\mathrm{eff}}_{j-1}(\bar{w}_{j-1})=\frac{1}{\sum_{i=1}^{N_{j-1}}(\bar{w}^{(i)}_{j-1})^2}, \tag{20}$$

where $\bar{w}_{j-1}=(\bar{w}^{(1)}_{j-1},\ldots,\bar{w}^{(N_{j-1})}_{j-1})$ and $\bar{w}^{(i)}_{j-1}$ is the normalized importance weight of $\theta^{(i)}_{j-1}$.
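The estimate (20) is one line of code; the helper name below is illustrative:

```python
import numpy as np

def ess_estimate(w):
    """Estimate (20): effective sample size from (unnormalized) weights."""
    w = np.asarray(w, dtype=float)
    w_bar = w / w.sum()              # normalized importance weights
    return 1.0 / np.sum(w_bar**2)
```

Two limiting cases make the interpretation concrete: equal weights give an ESS equal to the sample size, while a single dominant weight collapses the ESS toward 1.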

At annealing level $j$, when $\beta_{j-1}$ is already known, the problem is to define $\beta_j$. Let $\gamma=\hat{N}^{\mathrm{eff}}_{j-1}/N_{j-1}\in(0,1)$ be a prescribed threshold that characterizes the “quality” of the weighted sample (the larger $\gamma$ is, the “better” the weighted sample is). Then we obtain the following equation:

$$\sum_{i=1}^{N_{j-1}}(\bar{w}^{(i)}_{j-1})^2=\frac{1}{\gamma N_{j-1}}. \tag{21}$$

Observe that this equation can be expressed as an equation for $\beta_j$ by using (9):

$$\frac{\sum_{i=1}^{N_{j-1}}\mathcal{L}(\theta^{(i)}_{j-1})^{2(\beta_j-\beta_{j-1})}}{\left(\sum_{i=1}^{N_{j-1}}\mathcal{L}(\theta^{(i)}_{j-1})^{\beta_j-\beta_{j-1}}\right)^2}=\frac{1}{\gamma N_{j-1}}. \tag{22}$$

Solving this equation for $\beta_j$ gives the value of the annealing parameter at level $j$.
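Since the ESS fraction on the left of (21) typically decreases monotonically in $\Delta\beta=\beta_j-\beta_{j-1}$ (the weights become more uneven as the step grows), equation (22) can be solved numerically, e.g. by bisection on $\Delta\beta$. The sketch below works with log-likelihoods for numerical stability; the function name and the bisection-based solver are illustrative choices, not prescribed by the paper:

```python
import numpy as np

def next_beta(log_L, beta_prev, gamma, iters=100):
    """Solve (22) for beta_j by bisection on d = beta_j - beta_prev.

    log_L : log-likelihood values at the level-(j-1) samples
    gamma : prescribed ESS threshold as a fraction of N, 0 < gamma < 1
    """
    log_L = np.asarray(log_L, dtype=float)
    N = len(log_L)

    def ess_fraction(d):
        # ESS/N for weights w_i = L_i^d, computed stably in log space.
        a = d * log_L
        a -= a.max()
        w = np.exp(a)
        w_bar = w / w.sum()
        return 1.0 / (N * np.sum(w_bar**2))

    lo, hi = 0.0, 1.0 - beta_prev
    if ess_fraction(hi) >= gamma:
        return 1.0                    # posterior reached: last annealing level
    for _ in range(iters):            # ESS fraction decreases as d grows
        mid = 0.5 * (lo + hi)
        if ess_fraction(mid) > gamma:
            lo = mid
        else:
            hi = mid
    return beta_prev + 0.5 * (lo + hi)
```

The early-return branch mirrors the termination test of the full procedure below: once the weights to the posterior itself keep the ESS above the threshold, the next level is the last one.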

###### Remark 6.

Note that when $j\geq2$, the samples $\theta^{(1)}_{j-1},\ldots,\theta^{(N_{j-1})}_{j-1}$ are generated by the Markov chain sampler described in the previous subsection and therefore are not independent. This means that, because of the autocorrelations produced by the Markov chain, the “true” ESS of this sample is, in fact, smaller than the one given by (19). This is useful to remember when choosing $\gamma$. Also, this is another reason to select the prior distribution $\pi_0$ so that samples can be generated independently at the start of each AIMS run.

Combining the AIMS algorithm at a given annealing level with the described adaptive annealing scheme gives rise to the following procedure.

The AIMS procedure

Input:

$\gamma$, threshold for the effective sample size (ESS);

$N_0,N_1,\ldots$, where $N_j$ is the total number of Markov chain states to be generated at annealing level $j$;

$q_1(\cdot|\cdot),q_2(\cdot|\cdot),\ldots$, where $q_j$ is the symmetric proposal density associated with the RWMH kernel at annealing level $j$.

Algorithm:

Set $j=0$, current annealing level.

Set $\beta_0=0$, current annealing parameter.

Sample $\theta^{(1)}_0,\ldots,\theta^{(N_0)}_0\sim\pi_0$.

Calculate $\bar{W}^{(i)}_0\propto\mathcal{L}(\theta^{(i)}_0)^{1-\beta_0}$, $i=1,\ldots,N_0$.

Calculate the ESS $\hat{N}^{\mathrm{eff}}_0(\bar{W}_0)$ using (20), which measures how similar the prior distribution $\pi_0$ is to the target posterior distribution $\pi$.

while $\hat{N}^{\mathrm{eff}}_j(\bar{W}_j)<\gamma N_j$ do

Find $\beta_{j+1}$ from equation (22).

Calculate the normalized importance weights $\bar{w}^{(i)}_j$, $i=1,\ldots,N_j$, using (9).

Generate a Markov chain $\theta^{(1)}_{j+1},\ldots,\theta^{(N_{j+1})}_{j+1}$ with the stationary distribution $\pi_{j+1}$ using the AIMS algorithm at annealing level $j+1$.

Calculate $\bar{W}^{(i)}_{j+1}\propto\mathcal{L}(\theta^{(i)}_{j+1})^{1-\beta_{j+1}}$, $i=1,\ldots,N_{j+1}$.

Calculate the ESS $\hat{N}^{\mathrm{eff}}_{j+1}(\bar{W}_{j+1})$ using (20), which measures how similar the intermediate distribution $\pi_{j+1}$ is to the posterior $\pi$.

Increment $j$ to $j+1$.

end while

Set $\beta_{j+1}=1$, current annealing parameter.

Set $m=j+1$, the total number of distributions in the annealing scheme.

Set $\bar{w}^{(i)}_j=\bar{W}^{(i)}_j$, $i=1,\ldots,N_j$.

Generate a Markov chain $\theta^{(1)}_m,\ldots,\theta^{(N_m)}_m$ with the stationary distribution $\pi_m=\pi$ using the AIMS algorithm at annealing level $m$.

Output:

$\theta^{(1)}_m,\ldots,\theta^{(N_m)}_m$, samples that are approximately distributed according to the posterior distribution.

### 2.3 Implementation issues

As follows from the description, the AIMS procedure has the following parameters: $\gamma$, the threshold for the effective sample size; $N_j$, the length of the Markov chain generated at annealing level $j$; and $q_j(\cdot|\cdot)$, the symmetric proposal density associated with the RWMH kernel at level