A Multi-armed Bandit MCMC, with applications in sampling from doubly intractable posterior

03/13/2019 · Wang Guanyang, et al.

Markov chain Monte Carlo (MCMC) algorithms are widely used to sample from complicated distributions, especially from posterior distributions in Bayesian inference. However, MCMC is not directly applicable to doubly intractable problems. In this paper, we discuss and compare two existing solutions, pseudo-marginal Monte Carlo and the exchange algorithm. We also propose a novel algorithm, Multi-armed Bandit MCMC (MABMC), which chooses between two (or more) randomized acceptance ratios at each step. MABMC can be applied directly to combine pseudo-marginal Monte Carlo and the exchange algorithm, yielding a higher average acceptance probability.


1 Introduction and Problem Formulation

Sampling from the posterior is a central task in Bayesian inference. Suppose there is a family of densities $\{f(\cdot\mid\theta)\}_{\theta\in\Theta}$ on the sample space $\mathcal X$ and a prior $p(\theta)$ on the parameter space $\Theta$. Given data $x\in\mathcal X$, it is of our interest to sample from the posterior $\pi(\theta\mid x)\propto p(\theta)\,f(x\mid\theta)$. Assuming the prior and likelihood can be evaluated at every point, Markov chain Monte Carlo (MCMC) is the natural choice. The standard Metropolis-Hastings (M-H) algorithm (Algorithm 1) constructs a Markov chain on $\Theta$ with stationary distribution $\pi(\theta\mid x)$.

Input: initial state $\theta_0$, number of iterations $T$

1: for $t = 0, 1, \ldots, T-1$ do
2:     Propose $\theta' \sim q(\cdot\mid\theta_t)$
3:     Compute $r = \dfrac{\pi(\theta'\mid x)\,q(\theta_t\mid\theta')}{\pi(\theta_t\mid x)\,q(\theta'\mid\theta_t)}$
4:     Draw $U \sim \mathrm{Uniform}(0,1)$
5:     If $U \le \min(1, r)$ then set $\theta_{t+1} = \theta'$; otherwise set $\theta_{t+1} = \theta_t$
6: end for
Algorithm 1 Metropolis-Hastings Algorithm

Here $r$ is often called the ‘acceptance ratio’, and $\min(1, r)$ is often called the ‘acceptance probability’.
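As a concrete illustration (ours, not part of the original paper), the following minimal Python sketch implements Algorithm 1 for a tractable target; the standard-normal target and random-walk proposal are assumptions made for the example.

```python
import numpy as np

def metropolis_hastings(log_target, theta0, n_iters, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings targeting exp(log_target)."""
    rng = np.random.default_rng(seed)
    theta, chain = theta0, [theta0]
    for _ in range(n_iters):
        prop = theta + step * rng.normal()        # symmetric proposal, q-ratio = 1
        log_r = log_target(prop) - log_target(theta)
        if np.log(rng.uniform()) <= min(0.0, log_r):
            theta = prop                          # accept; otherwise keep current state
        chain.append(theta)
    return np.array(chain)

# Example: sample from N(0, 1), whose density is fully known.
chain = metropolis_hastings(lambda t: -0.5 * t**2, theta0=0.0, n_iters=5000)
```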

In real situations, however, the likelihood is often intractable or computationally expensive. In the scenario considered here, the likelihood function is known up to a normalizing constant, that is,

$$f(x\mid\theta) = \frac{\tilde f(x\mid\theta)}{Z(\theta)},$$

where $\tilde f(x\mid\theta)$ can be evaluated at every $(x,\theta)$, but $Z(\theta) = \int_{\mathcal X} \tilde f(x\mid\theta)\,dx$ is unknown. These intractable constants arise in many statistical problems and interesting models, such as image analysis [Bes86], Gaussian graphical models [Rov02], and Ising models [PFR03].

Suppose one is still interested in sampling from the posterior $\pi(\theta\mid x)$. The standard Metropolis-Hastings algorithm (Algorithm 1) gives acceptance ratio

$$r(\theta,\theta') = \frac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta\mid\theta')}{p(\theta)\,\tilde f(x\mid\theta)\,q(\theta'\mid\theta)}\cdot\frac{Z(\theta)}{Z(\theta')},$$

which cannot be calculated due to the unknown ratio $Z(\theta)/Z(\theta')$.

This problem is known as the ‘doubly intractable problem’, as both the normalizing constant of the posterior and the likelihood’s normalizing constant $Z(\theta)$ are unknown. Based on the idea of estimating the normalizing constant or the likelihood function, a wide range of techniques has been proposed, such as the maximum pseudo-likelihood estimator [Bes74], ratio importance sampling [CS97], bridge sampling [MW96], and path sampling [GM98]. These methods use different approaches to estimate $Z(\theta)$ and $Z(\theta')$ and plug the estimates into the expression for $r$. However, plugging in an estimate breaks detailed balance, so the Markov chain is no longer guaranteed to converge to the correct stationary distribution, which may cause problems.

Pseudo-marginal Monte Carlo was first introduced in [Bea03]. Møller et al. [MPRB06] proposed a ‘single auxiliary variable method’, which is a special case of the pseudo-marginal approach. This algorithm is asymptotically exact, i.e., the Markov chain converges to the correct stationary distribution. Convergence rates of pseudo-marginal Markov chains can be found in [AR09] and [AV15].

In 2006, Murray et al. [MGM12] proposed the ‘exchange algorithm’, which generalizes the method of Møller et al. [MPRB06]. The exchange algorithm also introduces auxiliary variables at each iteration and is asymptotically exact. However, it requires perfect sampling from $f(\cdot\mid\theta)$, which is impractical in many cases; usually sampling from $f(\cdot\mid\theta)$ itself requires MCMC and would be very slow. Several variants of the exchange algorithm have therefore been proposed to tackle this problem. For example, Liang et al. proposed the ‘double Metropolis-Hastings sampler’ [Lia10] and the ‘adaptive exchange algorithm’ [LJSL16], and Alquier et al. proposed ‘Noisy Monte Carlo’ [AFEB16]. All these algorithms run Markov chains with approximate transition kernels, so the stationary distribution is no longer the exact posterior distribution and may not even exist.

The modified pseudo-marginal Monte Carlo algorithm, which we introduce later, and the exchange algorithm both construct Markov chains with randomized acceptance probabilities by introducing auxiliary variables at each iteration; both will be referred to as ‘Randomized MCMC’ (RMCMC) hereafter.

This paper provides a comparison of the two algorithms, together with a new MCMC algorithm that chooses between them adaptively at each iteration to obtain better acceptance probabilities. In Section 2, we review the two RMCMC algorithms and explain a statistical point of view on them, which provides the main motivation of this paper. In Section 3, two examples are introduced to compare the algorithms: in the first, the exchange algorithm performs better than pseudo-marginal Monte Carlo; in the second, vice versa. In Section 4, we propose a new algorithm, Multi-armed Bandit MCMC (MABMC), which combines the two RMCMC algorithms to obtain a higher acceptance probability.

2 Review of the two RMCMC algorithms

2.1 Pseudo-marginal Monte Carlo (PMC)

To tackle the problem of the unknown ratio $Z(\theta)/Z(\theta')$, Møller et al. [MPRB06] introduced an auxiliary variable $y$ which also takes values in the state space $\mathcal X$. The joint distribution is designed to be

$$\pi(\theta, y\mid x) \propto \tilde q(y\mid\theta, x)\,p(\theta)\,f(x\mid\theta),$$

where $\tilde q$ is a user-chosen auxiliary density, so the marginal posterior of $\theta$ is unaffected. Therefore, if we could sample from $\pi(\theta, y\mid x)$, the marginal distribution of $\theta$ would be our target $\pi(\theta\mid x)$. The pseudo-marginal Monte Carlo algorithm is designed as follows:

Input: initial state $(\theta_0, y_0)$, number of iterations $T$

1: for $t = 0, 1, \ldots, T-1$ do
2:     Generate $\theta' \sim q(\cdot\mid\theta_t)$
3:     Generate $y' \sim f(\cdot\mid\theta')$
4:     Compute $\hat r = \dfrac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta_t\mid\theta')}{p(\theta_t)\,\tilde f(x\mid\theta_t)\,q(\theta'\mid\theta_t)}\cdot\dfrac{\tilde q(y'\mid\theta', x)\,\tilde f(y_t\mid\theta_t)}{\tilde q(y_t\mid\theta_t, x)\,\tilde f(y'\mid\theta')}$
5:     Draw $U \sim \mathrm{Uniform}(0,1)$
6:     If $U \le \min(1, \hat r)$ then set $(\theta_{t+1}, y_{t+1}) = (\theta', y')$; otherwise set $(\theta_{t+1}, y_{t+1}) = (\theta_t, y_t)$.
7: end for
Algorithm 2 Pseudo-marginal Monte Carlo Algorithm

The auxiliary density $\tilde q$ can be an essentially arbitrary distribution. The only requirement is that, for every $(\theta, x)$, the support of $\tilde q(\cdot\mid\theta, x)$ contains the support of $f(\cdot\mid\theta)$. For example, potential choices for $\tilde q(\cdot\mid\theta, x)$ include a uniform distribution when $\mathcal X$ is bounded, a normal distribution with suitably chosen mean and variance when $\mathcal X = \mathbb R^n$, or $f(\cdot\mid\hat\theta)$ where $\hat\theta$ is an estimator of $\theta$. The choice of $\tilde q$ does not affect the correctness of the pseudo-marginal algorithm, but it definitely has a strong impact on the efficiency of the Markov chain. Detailed discussions of and suggestions for choosing a proper auxiliary density can be found in Møller et al. [MPRB06].

PMC can be viewed as an M-H algorithm on the product space $\Theta\times\mathcal X$ with proposal kernel $q\big((\theta, y), (\theta', y')\big) = q(\theta'\mid\theta)\,f(y'\mid\theta')$. With this choice of kernel, the acceptance ratio of the M-H algorithm becomes

$$\hat r = \frac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta\mid\theta')}{p(\theta)\,\tilde f(x\mid\theta)\,q(\theta'\mid\theta)}\cdot\frac{\tilde q(y'\mid\theta', x)\,\tilde f(y\mid\theta)}{\tilde q(y\mid\theta, x)\,\tilde f(y'\mid\theta')}:$$

every normalizing constant cancels, so the acceptance ratio does not depend on the unknown term $Z(\theta)/Z(\theta')$.

One of the most important assumptions made here is exact sampling from $f(\cdot\mid\theta')$ (Step 3 in Algorithm 2). As $Z(\theta')$ is unknown, this step is not easy and often not doable. Surprisingly, perfect sampling without knowing the normalizing constant is sometimes still possible using the ‘coupling from the past’ method; see [PW96] for details. In most cases, however, one runs another Markov chain with stationary distribution $f(\cdot\mid\theta')$ as an approximation in practice.
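To make the whole loop concrete, here is a small Python sketch of one PMC step (our illustration, not the paper’s code) on an assumed two-point sample space where exact sampling from $f(\cdot\mid\theta)$ is possible by enumeration; the unnormalized likelihood, uniform prior, and uniform auxiliary density are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: X = {0, 1}, Theta = {0.2, 0.8},
# f_tilde(x|theta) = exp(theta * x); Z(theta) = 1 + exp(theta) is treated as unknown.
thetas = [0.2, 0.8]
f_tilde = lambda x, th: np.exp(th * x)
prior = lambda th: 0.5                       # uniform prior on Theta
q_aux = lambda y, th: 0.5                    # auxiliary density: uniform on {0, 1}
x_obs = 1

def sample_f(th):
    """Exact sampling from f(.|theta); feasible here by enumerating X."""
    p1 = f_tilde(1, th) / (f_tilde(0, th) + f_tilde(1, th))
    return int(rng.uniform() < p1)

def pmc_step(th, y):
    th_p = thetas[1 - thetas.index(th)]      # symmetric proposal: move to the other point
    y_p = sample_f(th_p)                     # y' ~ f(.|theta')
    r_hat = (prior(th_p) * f_tilde(x_obs, th_p) / (prior(th) * f_tilde(x_obs, th))
             * q_aux(y_p, th_p) * f_tilde(y, th)
             / (q_aux(y, th) * f_tilde(y_p, th_p)))
    return (th_p, y_p) if rng.uniform() <= min(1.0, r_hat) else (th, y)
```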

2.2 Modified Pseudo-marginal Monte Carlo (MPMC)

The state space of PMC is $\Theta\times\mathcal X$, while the state space of the exchange algorithm (described in Section 2.3) is $\Theta$. We therefore provide a modified version of PMC, essentially the same as PMC but with state space $\Theta$, which makes it possible to compare the two algorithms and to combine them.

The modified Pseudo-marginal Monte Carlo is designed as follows:

Input: initial state $\theta_0$, number of iterations $T$

1: for $t = 0, 1, \ldots, T-1$ do
2:     Propose $\theta' \sim q(\cdot\mid\theta_t)$
3:     Propose $y \sim \tilde q(\cdot\mid\theta_t, x)$ and $y' \sim f(\cdot\mid\theta')$
4:     Compute $\hat r_M = \dfrac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta_t\mid\theta')}{p(\theta_t)\,\tilde f(x\mid\theta_t)\,q(\theta'\mid\theta_t)}\cdot\dfrac{\tilde f(y\mid\theta_t)\,\tilde q(y'\mid\theta', x)}{\tilde q(y\mid\theta_t, x)\,\tilde f(y'\mid\theta')}$
5:     Draw $U \sim \mathrm{Uniform}(0,1)$
6:     If $U \le \min(1, \hat r_M)$ then set $\theta_{t+1} = \theta'$; otherwise set $\theta_{t+1} = \theta_t$.
7: end for
Algorithm 3 Modified Pseudo-marginal Monte Carlo Algorithm

Before proving that MPMC is a valid Monte Carlo algorithm, we first discuss the difference between PMC and MPMC. For PMC, each step attempts a move from $(\theta, y)$ to $(\theta', y')$ according to the kernel $q(\theta'\mid\theta)\,f(y'\mid\theta')$, and the acceptance ratio is calculated according to the M-H algorithm; randomness therefore occurs only in the process of generating the new proposal $(\theta', y')$.

For MPMC, however, each step attempts a move from $\theta$ to $\theta'$ according to $q(\theta'\mid\theta)$. Then two auxiliary variables $y \sim \tilde q(\cdot\mid\theta, x)$ and $y' \sim f(\cdot\mid\theta')$ are generated, and the corresponding acceptance ratio depends on the values of the random variables $(y, y')$. Randomness comes not only from the proposal step, but also from the procedure of calculating the acceptance ratio, which distinguishes MPMC from PMC and other standard M-H type algorithms. M-H algorithms with randomized acceptance ratios will be referred to as randomized Markov chain Monte Carlo (RMCMC); MPMC and the exchange algorithm are two typical examples.
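Continuing the toy setup from the PMC sketch above (again our illustration, reusing `thetas`, `f_tilde`, `prior`, `q_aux`, `sample_f`, `rng`, and `x_obs`), a single MPMC step can be written as:

```python
def mpmc_step(th):
    th_p = thetas[1 - thetas.index(th)]      # propose theta'
    y = int(rng.uniform() < 0.5)             # y ~ q_aux(.|theta, x): uniform on {0, 1}
    y_p = sample_f(th_p)                     # y' ~ f(.|theta')
    # The second factor is the randomized, unbiased estimator of Z(theta)/Z(theta').
    r_hat = (prior(th_p) * f_tilde(x_obs, th_p) / (prior(th) * f_tilde(x_obs, th))
             * (f_tilde(y, th) / q_aux(y, th))
             * (q_aux(y_p, th_p) / f_tilde(y_p, th_p)))
    return th_p if rng.uniform() <= min(1.0, r_hat) else th
```

Note that the chain state is just $\theta$; the auxiliary pair $(y, y')$ is drawn afresh at every step.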

Now we prove the basic property of MPMC:

Lemma 1.

The modified pseudo-marginal Monte Carlo algorithm satisfies detailed balance, i.e., its transition kernel $P_M$ satisfies $\pi(\theta\mid x)\,P_M(\theta, \theta') = \pi(\theta'\mid x)\,P_M(\theta', \theta)$ for all $\theta, \theta' \in \Theta$.

Proof.

First we calculate the transition probability $P_M(\theta, \theta')$ for $\theta' \neq \theta$. Notice that to move from $\theta$ to $\theta'$, one has to first propose $\theta'$ according to the density $q(\theta'\mid\theta)$ and then accept; the acceptance probability depends on $(y, y')$, so we take the expectation with respect to $y \sim \tilde q(\cdot\mid\theta, x)$ and $y' \sim f(\cdot\mid\theta')$. Therefore,

$$P_M(\theta, \theta') = q(\theta'\mid\theta)\iint \tilde q(y\mid\theta, x)\,f(y'\mid\theta')\,\min\{1, \hat r_M(\theta, \theta', y, y')\}\,dy\,dy'.$$

So we have

$$\pi(\theta\mid x)\,P_M(\theta, \theta') = \iint \min\big\{\pi(\theta\mid x)\,q(\theta'\mid\theta)\,\tilde q(y\mid\theta, x)\,f(y'\mid\theta'),\ \pi(\theta'\mid x)\,q(\theta\mid\theta')\,\tilde q(y'\mid\theta', x)\,f(y\mid\theta)\big\}\,dy\,dy' = \pi(\theta'\mid x)\,P_M(\theta', \theta).$$

The first equality comes from the identity $a\min\{1, r\} = \min\{a, ar\}$ for $a > 0$ together with the definition of $\hat r_M$; the last from the symmetry of the integrand under $(\theta, y) \leftrightarrow (\theta', y')$. ∎

Lemma 1 implies that MPMC constructs a reversible Markov chain with stationary distribution $\pi(\theta\mid x)$.

It is not hard to show (essentially a one-step Jensen’s inequality) that, compared with the original M-H chain (assuming the normalizing constant were known), the MPMC chain is less statistically efficient: for all $\theta \neq \theta'$, $P_M(\theta, \theta') \le P_{MH}(\theta, \theta')$. This puts the MPMC chain below the M-H chain in the ordering of Peskun [Pes73], although the M-H chain is not achievable here in our case.

The convergence properties of MPMC require more careful analysis. Nicholls et al. [NFW12] give useful results on the convergence of randomized MCMC. In our case, briefly speaking, if the original M-H chain is uniformly ergodic or geometrically ergodic and the ratio $\hat r_M / r$ is bounded above and below by positive constants, then so is the MPMC chain.

2.3 Single Variable Exchange Algorithm (SVE)

The exchange algorithm is another RMCMC algorithm, similar to MPMC; however, its acceptance ratio is calculated (estimated) in a different way.

Input: initial state $\theta_0$, number of iterations $T$

1: for $t = 0, 1, \ldots, T-1$ do
2:     Generate $\theta' \sim q(\cdot\mid\theta_t)$
3:     Generate an auxiliary variable $w \sim f(\cdot\mid\theta')$
4:     Compute $\hat r_E = \dfrac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta_t\mid\theta')\,\tilde f(w\mid\theta_t)}{p(\theta_t)\,\tilde f(x\mid\theta_t)\,q(\theta'\mid\theta_t)\,\tilde f(w\mid\theta')}$
5:     Draw $U \sim \mathrm{Uniform}(0,1)$
6:     If $U \le \min(1, \hat r_E)$ then set $\theta_{t+1} = \theta'$; otherwise set $\theta_{t+1} = \theta_t$
7: end for
Algorithm 4 Exchange Algorithm

For SVE, each step attempts a move from $\theta$ to $\theta'$ according to the same transition kernel $q(\theta'\mid\theta)$; however, SVE generates only one auxiliary variable $w \sim f(\cdot\mid\theta')$, and the acceptance ratio depends on $w$. SVE also preserves detailed balance.
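In the same toy setup as the earlier sketches (our illustration), one SVE step requires only a single auxiliary draw:

```python
def sve_step(th):
    th_p = thetas[1 - thetas.index(th)]      # propose theta'
    w = sample_f(th_p)                       # single auxiliary variable w ~ f(.|theta')
    # f_tilde(w|theta) / f_tilde(w|theta') estimates Z(theta)/Z(theta') unbiasedly.
    r_hat = (prior(th_p) * f_tilde(x_obs, th_p) * f_tilde(w, th)
             / (prior(th) * f_tilde(x_obs, th) * f_tilde(w, th_p)))
    return th_p if rng.uniform() <= min(1.0, r_hat) else th
```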

Lemma 2.

The single variable exchange algorithm satisfies detailed balance, i.e., its transition kernel $P_E$ satisfies $\pi(\theta\mid x)\,P_E(\theta, \theta') = \pi(\theta'\mid x)\,P_E(\theta', \theta)$ for all $\theta, \theta' \in \Theta$.

The proof is very similar to that of Lemma 1 and can be found in [DW18].

Similar results on Peskun’s ordering and convergence rates can be established for SVE, but they are omitted here as they are not the main focus of this paper.

2.4 MPMC versus SVE, a statistical point of view

In the abstract of Murray’s exchange algorithm paper [MGM12], the authors claim that SVE achieves a better acceptance probability than PMC, and the claim is justified by numerical simulation. This motivates us to raise the following question:

  • Is it possible to compare PMC and SVE theoretically?

As the state spaces of PMC and SVE are different, it makes more sense to compare SVE with MPMC. In this part, we provide a statistical point of view on SVE and MPMC, which also provides intuition for later sections.

Recall that in the standard M-H algorithm, given the stationary distribution $\pi(\cdot\mid x)$ and transition kernel $q$, the acceptance ratio is

$$r(\theta, \theta') = \frac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta\mid\theta')}{p(\theta)\,\tilde f(x\mid\theta)\,q(\theta'\mid\theta)}\cdot\frac{Z(\theta)}{Z(\theta')}.$$

All the terms can be computed except the ratio of unknown normalizing constants $R := Z(\theta)/Z(\theta')$. The obvious idea is to find an estimator of $R$ and plug it into the expression for the acceptance ratio. This is widely used in practice; however, as mentioned in Section 1, such estimators without further constraints break detailed balance, and the corresponding Markov chain is not guaranteed to converge to the desired stationary distribution. Heuristically speaking, the idea of the two RMCMC algorithms (MPMC and SVE) is to find a ‘good’ estimator of $R$. The word ‘good’ here means the estimator should preserve detailed balance of the resulting chain. It will soon be clear that the only difference between MPMC and SVE is that they use different estimators (denoted by $\hat R_M$ and $\hat R_E$, respectively) of the ratio $R$ appearing in the acceptance ratio $r$.

To be specific, in MPMC the ratio $R = Z(\theta)/Z(\theta')$ is estimated by

$$\hat R_M = \frac{\tilde f(y\mid\theta)}{\tilde q(y\mid\theta, x)}\cdot\frac{\tilde q(y'\mid\theta', x)}{\tilde f(y'\mid\theta')},\qquad y \sim \tilde q(\cdot\mid\theta, x),\quad y' \sim f(\cdot\mid\theta').$$

Therefore the resulting randomized acceptance ratio is given by

$$\hat r_M = \frac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta\mid\theta')}{p(\theta)\,\tilde f(x\mid\theta)\,q(\theta'\mid\theta)}\cdot\hat R_M.$$

$\hat R_M$ is unbiased since, by the independence of $y$ and $y'$,

$$E[\hat R_M] = \left(\int \frac{\tilde f(y\mid\theta)}{\tilde q(y\mid\theta, x)}\,\tilde q(y\mid\theta, x)\,dy\right)\left(\int \frac{\tilde q(y'\mid\theta', x)}{\tilde f(y'\mid\theta')}\cdot\frac{\tilde f(y'\mid\theta')}{Z(\theta')}\,dy'\right) = \frac{Z(\theta)}{Z(\theta')}.$$

As we proved in Lemma 1, this preserves the detailed balance of MPMC, which guarantees the asymptotic exactness of the MPMC algorithm.

Similarly, for SVE the ratio is estimated by

$$\hat R_E = \frac{\tilde f(w\mid\theta)}{\tilde f(w\mid\theta')},\qquad w \sim f(\cdot\mid\theta').$$

Therefore the resulting randomized acceptance ratio is given by

$$\hat r_E = \frac{p(\theta')\,\tilde f(x\mid\theta')\,q(\theta\mid\theta')}{p(\theta)\,\tilde f(x\mid\theta)\,q(\theta'\mid\theta)}\cdot\hat R_E.$$

$\hat R_E$ is clearly unbiased since

$$E[\hat R_E] = \int \frac{\tilde f(w\mid\theta)}{\tilde f(w\mid\theta')}\cdot\frac{\tilde f(w\mid\theta')}{Z(\theta')}\,dw = \frac{Z(\theta)}{Z(\theta')},$$

and again, unbiasedness guarantees detailed balance.
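Unbiasedness of both estimators is easy to check numerically. The snippet below (our check under an assumed Gaussian model, where $Z$ is actually known and the true ratio is $R = 1$) estimates $E[\hat R_E]$ and $E[\hat R_M]$ by simulation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, th, th_p = 10**6, 0.0, 1.5

# Assumed model: f_tilde(y|theta) = exp(-(y - theta)^2 / 2), so Z(theta) = sqrt(2*pi)
# for every theta and the true ratio R = Z(th)/Z(th_p) = 1.
f_tilde = lambda y, t: np.exp(-0.5 * (y - t) ** 2)
q_aux = lambda y: np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)   # N(0, 1) auxiliary density

w = rng.normal(th_p, 1.0, n)                  # w ~ f(.|theta')
R_E = f_tilde(w, th) / f_tilde(w, th_p)       # SVE estimator

y = rng.normal(0.0, 1.0, n)                   # y ~ q_aux
y_p = rng.normal(th_p, 1.0, n)                # y' ~ f(.|theta')
R_M = (f_tilde(y, th) / q_aux(y)) * (q_aux(y_p) / f_tilde(y_p, th_p))  # MPMC estimator

print(R_E.mean(), R_M.mean())                 # both should be close to R = 1
```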

Remark 1.

This does not necessarily imply that unbiasedness is a sufficient and necessary condition for designing a valid estimator of $R$.

To summarize, given an attempted move from $\theta$ to $\theta'$, the ratio $R$ appearing in the acceptance ratio $r$ is estimated by two unbiased estimators, $\hat R_M$ and $\hat R_E$. The acceptance probability $\min(1, r)$ is then estimated by $\min(1, \hat r_M)$ and $\min(1, \hat r_E)$, respectively.

Therefore, comparing the two algorithms is equivalent to comparing the performance of the two estimators. As both $\hat R_M$ and $\hat R_E$ are unbiased, their performance depends only on their variances. The following theorem characterizes the relative mean square error of both estimators in terms of Pearson chi-square distances.

Theorem 1.

Let

$$\mathrm{RMSE}(\hat R) := \frac{E\big[(\hat R - R)^2\big]}{R^2}$$

be the relative mean square error of an estimator $\hat R$ of $R$. Then we have

$$\mathrm{RMSE}(\hat R_E) = d_{\chi^2}\big(f(\cdot\mid\theta),\,f(\cdot\mid\theta')\big)$$

and

$$\mathrm{RMSE}(\hat R_M) = \Big(1 + d_{\chi^2}\big(f(\cdot\mid\theta),\,\tilde q(\cdot\mid\theta, x)\big)\Big)\Big(1 + d_{\chi^2}\big(\tilde q(\cdot\mid\theta', x),\,f(\cdot\mid\theta')\big)\Big) - 1,$$

where

$$d_{\chi^2}(p, q) := \int \frac{\big(p(y) - q(y)\big)^2}{q(y)}\,dy$$

is the Pearson chi-square distance.

Proof.

For SVE, with $w \sim f(\cdot\mid\theta')$ and $\hat R_E / R = f(w\mid\theta)/f(w\mid\theta')$, we have

$$E\big[(\hat R_E / R)^2\big] = \int \frac{f(w\mid\theta)^2}{f(w\mid\theta')}\,dw = 1 + d_{\chi^2}\big(f(\cdot\mid\theta), f(\cdot\mid\theta')\big),$$

so $\mathrm{RMSE}(\hat R_E) = E\big[(\hat R_E / R)^2\big] - 1 = d_{\chi^2}\big(f(\cdot\mid\theta), f(\cdot\mid\theta')\big)$.

Similarly, for MPMC, independence of $y$ and $y'$ gives

$$E\big[(\hat R_M / R)^2\big] = \left(\int \frac{f(y\mid\theta)^2}{\tilde q(y\mid\theta, x)}\,dy\right)\left(\int \frac{\tilde q(y'\mid\theta', x)^2}{f(y'\mid\theta')}\,dy'\right) = \Big(1 + d_{\chi^2}\big(f(\cdot\mid\theta), \tilde q(\cdot\mid\theta, x)\big)\Big)\Big(1 + d_{\chi^2}\big(\tilde q(\cdot\mid\theta', x), f(\cdot\mid\theta')\big)\Big),$$

as desired. ∎

Theorem 1 reveals a significant difference between MPMC and SVE. For MPMC, the choice of $\tilde q$ influences the corresponding Pearson chi-square distances, and thus has a strong impact on the efficiency of the Markov chain. The optimal choice of $\tilde q(\cdot\mid\theta, x)$ is clearly $f(\cdot\mid\theta)$ itself: such an estimator has variance $0$ and makes Algorithms 1 and 3 agree, but it is impractical in our case. In practice, this suggests choosing $\tilde q(\cdot\mid\theta, x)$ as close to $f(\cdot\mid\theta)$ as possible.

For SVE, the variance is controlled by the Pearson chi-square distance between $f(\cdot\mid\theta)$ and $f(\cdot\mid\theta')$. Roughly speaking, when $\theta'$ is close to $\theta$ (which often means $f(\cdot\mid\theta')$ is close to $f(\cdot\mid\theta)$), the SVE estimator tends to perform well. However, when $f(\cdot\mid\theta')$ is far away from $f(\cdot\mid\theta)$, the SVE estimator may perform poorly due to its large variance.

Therefore, with a properly chosen $\tilde q$, it is reasonable to adopt the following heuristics:

  • When the proposed $\theta'$ is close to $\theta$, the SVE estimator has better performance, resulting in a higher acceptance probability.

  • When $\theta'$ is far away from $\theta$, the MPMC estimator outperforms SVE, and one should choose MPMC if possible.

The above heuristics suggest that it is not possible to conclude that one algorithm dominates the other in all cases. In the next section, two concrete examples are provided to justify this intuition. Meanwhile, since each step of the M-H algorithm proposes a new $\theta'$ based on the previous $\theta$, this motivates us to find a method for choosing between MPMC and SVE adaptively, which is described in detail in Section 4.
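Theorem 1 makes these heuristics quantitative, and for Gaussian models the chi-square divergences have a closed form, so the comparison can be evaluated exactly. The sketch below (our illustration; the unit-variance Gaussian family and the choice $\tilde q(\cdot\mid\theta, x) = f(\cdot\mid\hat\theta)$ for a fixed pilot estimate $\hat\theta$ are assumptions) contrasts a local move with a global one.

```python
import math

def chi2_gauss(mu_p, mu_q, s=1.0):
    """Pearson chi-square divergence between N(mu_p, s^2) and N(mu_q, s^2);
    for equal variances it equals exp((mu_p - mu_q)^2 / s^2) - 1."""
    return math.exp((mu_p - mu_q) ** 2 / s**2) - 1.0

def rmse_sve(th, th_p):
    # Theorem 1: RMSE(R_E) = d(f_theta, f_theta')
    return chi2_gauss(th, th_p)

def rmse_mpmc(th, th_p, th_hat):
    # Theorem 1 with q_aux(.|theta, x) = f(.|th_hat) for a fixed estimate th_hat
    return (1 + chi2_gauss(th, th_hat)) * (1 + chi2_gauss(th_hat, th_p)) - 1.0

print(rmse_sve(0.0, 0.2), rmse_mpmc(0.0, 0.2, th_hat=1.0))  # local move: SVE wins
print(rmse_sve(0.0, 3.0), rmse_mpmc(0.0, 3.0, th_hat=1.0))  # global move: MPMC wins
```

In this toy calculation, SVE’s error vanishes as $\theta' \to \theta$ but explodes for distant proposals, while MPMC’s error is set by the quality of $\hat\theta$ and stays bounded.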

Remark 2.

In this paper, our main focus is choosing between MPMC and SVE. However, the methodology proposed in Section 4 applies generally to RMCMC algorithms sharing the same transition kernel. It would be very interesting to construct other RMCMC algorithms that preserve detailed balance.

3 Two concrete examples

In this section we give two concrete examples and argue that it is impossible to claim that one algorithm always works better than the other.

3.1 The first example

Let $\mathcal X$ be the space with two points, i.e., $\mathcal X = \{0, 1\}$, so that the probability measures on $\mathcal X$ are Bernoulli distributions. Let the parameter space $\Theta = \{\theta_1, \theta_2\}$ also consist of two points, where $\theta_1$ corresponds to one Bernoulli distribution $f(\cdot\mid\theta_1)$ on $\mathcal X$ and $\theta_2$ corresponds to another, $f(\cdot\mid\theta_2)$. The prior distribution is chosen to be the uniform distribution over $\Theta$, and a single data point $x \in \mathcal X$ is observed. The true posterior density $\pi(\cdot\mid x)$ then follows from a simple calculation.

For both algorithms, the proposal $q$ is the uniform distribution over $\Theta$. To make calculations easier, we choose the uniform distribution over $\mathcal X$ as the auxiliary density $\tilde q(\cdot\mid\theta, x)$, independent of the data and the parameter.

Now we are ready to calculate the transition probabilities of both algorithms; we use $P_E$ to denote the transition probability of the exchange algorithm and $P_M$ to denote that of modified pseudo-marginal Monte Carlo. Since all the spaces are finite, both can be computed exactly by enumerating the auxiliary outcomes, and the calculation yields $P_E(\theta, \theta') \ge P_M(\theta, \theta')$ for every pair $(\theta, \theta')$ with $\theta \neq \theta'$. Thus the exchange algorithm has higher transition (acceptance) probabilities, which means the exchange algorithm will converge faster.
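The enumeration behind this conclusion is mechanical. The sketch below (with assumed Bernoulli parameters, since the example’s exact numbers are not reproduced here) computes the acceptance probability of each move exactly by summing over all auxiliary outcomes; with these assumed numbers the exchange algorithm again dominates.

```python
import itertools

# Assumed toy numbers: X = {0, 1}, Theta = {0, 1}, f(1|theta) given by f1,
# uniform prior and proposal (these cancel), uniform auxiliary density, x = 0.
f1 = {0: 0.9, 1: 0.6}
f = lambda y, th: f1[th] if y == 1 else 1.0 - f1[th]
x, u = 0, 0.5

def P_E(th, th_p):
    """SVE acceptance probability for the move th -> th_p, averaged over w ~ f(.|th')."""
    return sum(f(w, th_p) * min(1.0, (f(x, th_p) / f(x, th)) * (f(w, th) / f(w, th_p)))
               for w in (0, 1))

def P_M(th, th_p):
    """MPMC acceptance probability, averaged over y ~ uniform and y' ~ f(.|th')."""
    total = 0.0
    for y, yp in itertools.product((0, 1), repeat=2):
        r_hat = (f(x, th_p) / f(x, th)) * (f(y, th) / u) * (u / f(yp, th_p))
        total += u * f(yp, th_p) * min(1.0, r_hat)
    return total

for th, th_p in [(0, 1), (1, 0)]:
    print(f"{th}->{th_p}: P_E = {P_E(th, th_p):.4f}, P_M = {P_M(th, th_p):.4f}")
```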

3.2 The second example

The second example is designed to show that pseudo-marginal Monte Carlo may perform better than the exchange algorithm. In this example we take $\mathcal X$ to be the space with three points, i.e., $\mathcal X = \{0, 1, 2\}$, while the parameter space again consists of two points, $\Theta = \{\theta_1, \theta_2\}$, where $\theta_1$ and $\theta_2$ correspond to two probability distributions $f(\cdot\mid\theta_1)$ and $f(\cdot\mid\theta_2)$ on $\mathcal X$. The prior distribution is chosen to be the uniform distribution over $\Theta$, and a single data point $x \in \mathcal X$ is observed; the true posterior density again follows from a simple calculation.

As in the previous example, the proposal $q$ is the uniform distribution over $\Theta$. The auxiliary density $\tilde q(\cdot\mid\theta, x)$ is designed to be the uniform distribution over $\mathcal X$, independent of the data and the parameter.

The last thing we need to specify is the initialization of the auxiliary variable $y_0$ in the pseudo-marginal Monte Carlo algorithm; we simply draw $y_0$ uniformly from $\mathcal X$.

In this setting, we are ready to calculate the transition probabilities of both algorithms; we still use $P_E$ to denote the transition probability of the exchange algorithm, and use $P_M$ to denote that of pseudo-marginal Monte Carlo. Carrying out the enumeration (using symmetry for the reverse moves), one finds $P_M(\theta, \theta') \ge P_E(\theta, \theta')$ for every pair $(\theta, \theta')$ with $\theta \neq \theta'$. Thus the pseudo-marginal algorithm has higher transition (acceptance) probabilities, which means it will converge faster.

3.3 Intuition and discussion

In this part we briefly discuss the intuition behind the two examples above. Compared with the ideal Metropolis-Hastings algorithm, the difficulty comes from the unknown ratio of normalizing constants $Z(\theta)/Z(\theta')$. The idea of both algorithms is to generate auxiliary variables and use them to construct an estimator of this ratio, so the accuracy of the estimator determines the performance of the algorithm. The main difference between pseudo-marginal Monte Carlo and the exchange algorithm is that the exchange algorithm uses the single-draw estimator $\tilde f(w\mid\theta)/\tilde f(w\mid\theta')$ to estimate the ratio directly, whereas pseudo-marginal Monte Carlo uses one auxiliary variable to estimate $Z(\theta)$ and another to estimate $Z(\theta')$, and then uses the quotient of the two as an estimator of $Z(\theta)/Z(\theta')$. Therefore, when the probability measures $f(\cdot\mid\theta)$ and $f(\cdot\mid\theta')$ differ a lot (the case in the second example), the exchange algorithm may perform poorly; but when the two measures are close, the exchange algorithm may perform better than pseudo-marginal Monte Carlo.

In the two examples above, one algorithm dominates the other for all pairs $(\theta, \theta')$. In real situations, however, the parameter space may be much larger than in our designed examples, so there usually exist two regions $\Theta_1, \Theta_2$: in $\Theta_1$, PMC performs better than SVE, while in $\Theta_2$, vice versa. Therefore, given a proposed move from $\theta$ to $\theta'$, it is natural to ask the following question:

  • Is it possible to find a ‘reasonable’ way of choosing between PMC and SVE that improves the acceptance probability?

This question will be answered in the next section.

4 MABMC: A Multi-armed Bandit MCMC Algorithm

4.1 How to choose between algorithms ‘legally’?

Now we are ready to combine the two algorithms. To make things more general and concrete, the problem can be formulated as follows:

Given an attempted move from $\theta$ to $\theta'$ and two valid randomized acceptance ratios $\hat r_1, \hat r_2$ for estimating $r$, can one find a decision rule $d(\theta, \theta') \in \{1, 2\}$ such that the new Markov chain, which accepts the move according to $\hat r_{d(\theta, \theta')}$, still preserves detailed balance and has a higher acceptance probability than either algorithm alone?

It is worth mentioning that the seemingly obvious choice, picking at each step whichever ratio currently yields the higher acceptance probability, is not valid, as this would break detailed balance and thus the algorithm may not converge to the desired stationary distribution. It turns out that, to preserve detailed balance, the decision rule has to be ‘symmetric’, i.e., $d(\theta, \theta') = d(\theta', \theta)$.
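A simple way to satisfy the symmetry requirement (our illustration of the constraint, not the paper’s MABMC policy) is to base the decision only on a symmetric statistic of the pair $(\theta, \theta')$, for instance their distance, following the heuristics of Section 2.4:

```python
def choose_ratio(theta, theta_p, threshold=1.0):
    """Symmetric decision rule: d(theta, theta') == d(theta', theta).

    The rule depends only on |theta - theta'|, so swapping the arguments
    cannot change the chosen estimator and detailed balance is preserved.
    The threshold is a tuning parameter of this hypothetical rule."""
    return "SVE" if abs(theta - theta_p) <= threshold else "MPMC"
```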

This problem is very similar to the multi-armed bandit problem in probability. In each iteration of the MCMC algorithm, an agent faces the choice between $\hat r_1$ and $\hat r_2$, with the goal of increasing the acceptance probability. It is of interest to design a reasonable policy (or decision rule) that performs better than random guessing.

Definition 1 (Valid ratio).

Let $\pi$ be a probability density on the parameter space $\Theta$, which may be of the form $\pi(\theta) \propto p(\theta)\,\tilde f(x\mid\theta)/Z(\theta)$ where $Z(\theta)$ is intractable. Let $q$ be a transition kernel in the M-H algorithm. A (randomized) acceptance ratio $\hat r$ is called valid if it preserves detailed balance with respect to the stationary distribution $\pi$.