On the utility of Metropolis-Hastings with asymmetric acceptance ratio

03/26/2018 ∙ by Christophe Andrieu, et al. ∙ 0

The Metropolis-Hastings algorithm allows one to sample asymptotically from any probability distribution π. There has been recently much work devoted to the development of variants of the MH update which can handle scenarios where such an evaluation is impossible, and yet are guaranteed to sample from π asymptotically. The most popular approach to have emerged is arguably the pseudo-marginal MH algorithm which substitutes an unbiased estimate of an unnormalised version of π for π. Alternative pseudo-marginal algorithms relying instead on unbiased estimates of the MH acceptance ratio have also been proposed. These algorithms can have better properties than standard PM algorithms. Convergence properties of both classes of algorithms are known to depend on the variability of the estimators involved and reduced variability is guaranteed to decrease the asymptotic variance of ergodic averages and will shorten the burn-in period, or convergence to equilibrium, in most scenarios of interest. A simple approach to reduce variability, amenable to parallel computations, consists of averaging independent estimators. However, while averaging estimators of π in a pseudo-marginal algorithm retains the guarantee of sampling from π asymptotically, naive averaging of acceptance ratio estimates breaks detailed balance, leading to incorrect results. We propose an original methodology which allows for a correct implementation of this idea. We establish theoretical properties which parallel those available for standard PM algorithms and discussed above. We demonstrate the interest of the approach on various inference problems. In particular we show that convergence to equilibrium can be significantly shortened, therefore offering the possibility to reduce a user's waiting time in a generic fashion when a parallel computing architecture is available.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Suppose we are interested in sampling from a given probability distribution on some measurable space . When it is impossible or too difficult to generate perfect samples from , one practical resource is to use a Markov chain Monte Carlo (MCMC) algorithm which generates an ergodic Markov chain whose invariant distribution is . Among MCMC methods, the Metropolis–Hastings (MH) algorithm plays a central rôle. The MH update proceeds as follows: given and a Markov transition kernel on , we propose and set with probability , where

(1)

for (see Appendix A for a definition of ) is a well defined Radon–Nikodym derivative, and otherwise. When the proposed value is rejected, we set . We will refer to as the acceptance ratio. The transition kernel of the Markov chain generated with the MH algorithm with proposal kernel is

(2)

where is the probability of rejecting a proposed sample when ,

and is the Dirac measure centred at . Expectations of functions, say , with respect to can be estimated with for , which is consistent under mild assumptions.

Being able to evaluate the acceptance ratio is obviously central to implementing the MH algorithm in practice. Recently, there has been much interest in expanding the scope of the MH algorithm to situations where this acceptance ratio is intractable, that is, impossible or very expensive to compute. A canonical example of intractability is when can be written as the marginal of a given joint probability distribution for and some latent variable

. A classical way of addressing this problem consists of running an MCMC targeting the joint distribution, which may however become very inefficient in situations where the size of the latent variable is high–this is for example the case for general state-space models. In what follows, we will briefly review some more effective ways of tackling this problem. To that purpose we will use the following simple running example to illustrate various methods. This example has the advantage that its setup is relatively simple and of clear practical relevance. We postpone developments for much more complicated setups to Sections

4 and 5.

Example 1 (Inference with doubly intractable models).

In this scenario the likelihood function of the unknown parameter for the dataset , , is only known up to a normalising constant, that is

where is unknown, while can be evaluated pointwise for any value of . In a Bayesian framework, for a prior density , we are interested in the posterior density , with respect to some measure, given by

With in (1), the resulting acceptance ratio of the MH algorithm associated to a proposal density is

(3)

which cannot be calculated because of the unknown ratio . While the likelihood function may be intractable, sampling artificial datasets may be possible for any , and sometimes computationally cheap. We will describe two known approaches which exploit and expand this property in order to design Markov kernels preserving as invariant density.

1.1 Estimating the target density

Assume for simplicity that has a probability density with respect to some -finite measure. We will abuse notation slightly by using for both the probability distribution and its density. A powerful, yet simple, method to tackle intractability which has recently attracted substantial interest consists of replacing the value of with a non-negative random estimator whenever it is required in the implementation of the MH algorithm above. If for all and a constant , a property we refer somewhat abusively as unbiasedness, this strategy turns out to lead to exact algorithms, that is sampling from is guaranteed at equilibrium under very mild assumptions on . This approach leads to so called pseudo-marginal algorithms (Andrieu and Roberts, 2009). However, for reasons which will become clearer later, we refer from now on to these techniques as Pseudo-Marginal Target (PMT) algorithms.

Example 2 (Example 1, ctd).

Let be an integrable non-negative function of integral equal to . For a given , an unbiased estimate of can be obtained via importance sampling whenever the support of includes that of :

(4)

since the normalised sum is an unbiased estimator of . The auxiliary variable method of Møller et al. (2006) corresponds to . An interesting feature of this approach is that is a free parameter of the algorithm which reduces the variability of this estimator. It is shown in Andrieu and Vihola (2014) that increasing in a PMT algorithm always reduces the asymptotic variance of averages using this chain. This is particularly interesting in a parallel computing environment, but also serial for some models. We illustrate this numerically on a simple Ising model (see details in Section 3.3) for a lattice and , where is an approximation of the maximum likelihood estimator of for the data . In Figure 1 we report the estimated integrated auto-covariance (IAC) for the identity, that is for the function , as a function of and values of . The results are highly dependent on the value of , but adjusting allows one to compensate for a wrong choice of this parameter. This is important in practice since for more complicated scenarios obtaining a good approximation of the maximum likelihood estimator of may be difficult.

Figure 1: IAC of the algorithm as a function of .

1.2 Estimating the acceptance ratio

One can in fact push the idea of replacing algebraic expressions with estimators further. Instead of approximating the numerator and denominator of the acceptance ratio independently, it is indeed possible to use directly estimators of the acceptance ratio and still obtain algorithms guaranteed to sample from at equilibrium. We will refer to these algorithms as Pseudo-Marginal Ratio (PMR) algorithms. A general framework is described in Andrieu and Vihola (2014) as well as in Section 2.1, but this idea has appeared earlier in various forms in the literature, see e.g. Nicholls et al. (2012) and the references therein. An interesting feature of PMR algorithms is that we estimate the ratio afresh whenever it is required. On the contrary, in a PMT framework, if the estimate of the current state significantly overestimates , this results in poor performance as the algorithm will typically reject many transitions away from as the same estimate of is used until a proposal is accepted. In the following continuation of Example 1, we present a particular case of this PMR idea proposed by Murray et al. (2006).

Example 3 (Example 1, ctd).

The exchange algorithm of Murray et al. (2006) is motivated by the realisation that while for and any , is an unbiased estimator of , the particular choice leads to an unbiased estimator of . We can expect this estimator to have a reasonable variance when and are close if satisfies some form of continuity. This suggests the following algorithm. Given , sample , then and use the acceptance ratio

(5)

which is an unbiased estimator of the acceptance ratio in (3). The remarkable property of this algorithm is that it admits as an invariant distribution and hence, under mild exploration related assumptions, it is guaranteed to produce samples asymptotically distributed according to .

1.3 Contribution

As for PMT algorithms, it is natural to ask whether it is possible to further improve the performance of PMR algorithms by reducing the variability of the acceptance ratio estimator by averaging a number of such estimators, while preserving the target distribution invariant. We shall see that, unfortunately, such naïve averaging approach does not work for the PMR methods currently available as it breaks the reversibility of the kernels with respect to .

A contribution of the present paper is the introduction of a novel class of PMR algorithms which can exploit acceptance ratio estimators obtained through averaging and sample from at equilibrium. These algorithms described in Section 2 naturally lend themselves to parallel computations as independent ratio estimators can be computed in parallel at each iteration. In this respect, our methodology contributes to the emerging literature on the use of parallel processing units, such as Graphics Processing Units (GPUs) or multicore chips for scientific computation Lee et al. (2010); Suchard et al. (2010). We show that this generic procedure is guaranteed to decrease the asymptotic variance of ergodic averages as the number of independent ratios one averages increases and that the burn-in period will be reduced in most scenarios. The latter is particularly relevant since exact and generic methods to achieve this are scarce (Sohn, 1995), in contrast with variance reduction techniques for which better embarrassingly parallel solutions (Sherlock et al., 2017; Bornn et al., 2017) and/or post-processing methods are available (Delmas and Jourdain, 2009; Dellaportas and Kontoyiannis, 2012). We demonstrate experimentally its performance gain for the exchange algorithm.

This new class of PMR algorithms can be understood as being a particular instance of a more general principle which we exploit further in this paper, beyond the above example. Let be a pair of kernels such that the following Radon-Nikodym derivative

is well defined for on some symmetric set and set otherwise. This can be thought of as an asymmetric version of the standard MH acceptance ratio (1) and naturally leads to two questions.

  1. Assuming that sampling from and for any is feasible and that is tractable, can one design a correct MCMC algorithm for that involves simulating from and and evaluating ?

  2. Assuming the answer to the above is positive, can this additional degree of freedom be beneficial in order to design correct MCMC algorithms with practically appealing features e.g. accelerated convergence?

The answer to the first question is unsurprisingly yes, and we will refer to the corresponding class of algorithms as MH with Asymmetric Acceptance Ratio (MHAAR). MHAAR has already been exploited in some specific contexts (Tjelmeland and Eidsvik, 2004; Andrieu and Thoms, 2008), but its best known application certainly remains the reversible jump MCMC methodology of Green (1995). However the way we take advantage of this additional flexibility seems completely novel. We also note, as detailed in our discussion in Section 6, that such asymmetric acceptance ratios are also at the heart of non-reversible MCMC algorithms which have recently attracted renewed interest in the Physics and Statistical communities (Gustafson, 1998; Turitsyn et al., 2011). In Appendix A, we describe and justify a slightly more general framework to the above which ensures reversibility with respect to . The answer to the second question is the object of this paper, and averaging acceptance ratios as suggested earlier is one such application.

In Section 3 we further investigate the doubly intractable scenario by incorporating the Annealed Importance Sampling (AIS) mechanism (Neal, 2001; Murray et al., 2006) in MHAAR, and explore numerically the performance of MHAAR with AIS on an Ising model.

In Section 4 we expand the class of problems our methodology can address by considering latent variable models. This leads to important extensions of the original AIS within MH algorithm proposed in Neal (2004). We demonstrate the efficiency of our MHAAR-based approach by recasting the popular reversible jump MCMC (RJ-MCMC) methodology as a particular case of our framework and illustrate the computational benefits of our novel algorithm in this context on the Poisson change-point model in Green (1995).

In Section 5, we show how MHAAR can be advantageous in the context of inference in state-space models when it is utilised with sequential Monte Carlo (SMC) algorithms. In particular, we expand the scope of particle MCMC algorithms (Andrieu et al., 2010) and show novel ways of using multiple or all possible paths from backward sampling of conditional SMC (cSMC) to estimate the marginal acceptance ratio.

In Section 6, we provide some discussion and two interesting extensions of MHAAR. Specifically, in Section 6.1 we discuss an SMC-based generalisation of our algorithms involving AIS. Furthermore, in Section 6.2 we provide a new insight to non-reversible versions of MH algorithms that is relevant to our setting. We briefly demonstrate how non-reversible versions of our algorithms can be obtained with a small modification so that one can benefit both from non-reversibility and the ability to average acceptance ratio estimators.

Some of the proofs of the validity of our algorithms as well as additional discussion on the generalisation of the methods can found in the Appendices.

2 PMR algorithms using averaged acceptance ratio estimators

2.1 PMR algorithms

We introduce here generic PMR algorithms, that is MH algorithms relying on an estimator of the acceptance ratio. We then show that in their standard form these algorithms cannot use an estimator of this ratio obtained through averaging independent estimators. A slightly more general framework is provided in Nicholls et al. (2012), while a more abstract description is provided in Andrieu and Vihola (2014). To that purpose we introduce a -valued auxiliary variable

(we use small letters for random variables and realisations throughout) and let

be a measurable involution, that is . Then we introduce a pair of families of proposal distributions , on , where

(6)

with denoting the conditional distribution of given , and

(7)

where, for any we have

(8)

which means that in order to sample , one can sample and set . PMR algorithms are defined by the following transition kernel

(9)

where the acceptance ratio is equal, for and defined similarly to (47), to

(10)

and to otherwise. It is clear from (10) that the acceptance ratio is an unbiased estimator of the standard MH acceptance , i.e.

(11)

Due to the particular form of symmetry between and imposed by (8), is reversible with respect to by considering detailed balance for fixed ; see Theorem 5 in Appendix A.

As long as PMR algorithms are concerned, we call the proposal kernel of PMR and its complementary kernel, owing to the way (9) is constructed. Motivation for this enumeration will be clear in Section 2.2, in particular by Remark 3.

Remark 1.

A cautionary remark is in order. When we substitute a non-negative unbiased estimator of for in the MH algorithm, the resulting PMT algorithm is invariant. However, if we substitute a positive unbiased estimator of for in the MH algorithm then the resulting transition kernel is not necessarily invariant. To establish that is invariant, we require our estimator to have the specific structure given in (10).

A particular instance of this algorithm was given earlier in the set-up of Example 1, where , the random variable corresponds to a fictitious dataset used to estimate the ratio of normalising constants, and . The need to consider more general transformations will become apparent in Section 3.

This type of algorithms is motivated by the fact that while in some situations cannot be computed, the introduction of the auxiliary variable makes the computation of possible. However, this computational tractability comes at a price. Applying Jensen’s inequality to (11) shows that

Peskun’s result (Tierney, 1998) thus implies that the MCMC algorithm relying on is always inferior to that using for various performance measures (see Theorem 1 for details). As pointed out in Andrieu and Vihola (2014) reducing the variability of , for example in the sense of the convex order, for all will reduce the gap in the inequality above, resulting in improved performance. From the rightmost expression in (10) a possibility to reduce variability might be to change (and possibly ) in such a way that for all , but this is impossible in most practical scenarios. In contrast a natural idea consists of averaging ratios ’s for, say, realisations and use the acceptance ratio

(12)

where –we drop the dependence on in order to alleviate notation whenever no ambiguity is possible. While this reduces the variance of the estimator of , this naïve modification of the acceptance rule of breaks detailed balance with respect to . Indeed one can check that with , a bounded measurable function and using Fubini’s result,

in general. This is best seen in a scenario where and are finite and for some , and it can be shown that is not left invariant by the corresponding Markov transition probability.

2.2 MHAAR for averaging PMR estimators

We show here how MHAAR updates can be used in order to use the acceptance ratio in (12), while preserving reversibility. Our novel scheme is described in Algorithm 1. For and we let denote the probability distribution of the random variable on such that .

Input: Current sample
Output: New sample
1 Sample and .
2 if  then
3       Sample and .
4       Set with probability , otherwise set .
5else
6       Sample , and
7       Set with probability , otherwise set .
Algorithm 1 MHAAR for averaging PMR estimators

The unusual step in this update is the random choice between two sampling mechanisms for the auxiliary variables and the fact that depending on this choice either or is used. Apart from the reversible jump MCMC context (Green, 1995) and specific uses Tjelmeland and Eidsvik (2004); Andrieu and Thoms (2008), this type of asymmetric updates has rarely been used –see Appendix A.2 for an extensive discussion and from Section 4 on for other applications. The probability distributions corresponding to the two proposal mechanisms in Algorithm 1 are given by

and the corresponding Markov transition kernel by

(13)

where and are the rejection probabilities for each sampling mechanism. We establish the reversibility of in Theorem 1.

Remark 2.

It is necessary to include the variable in and to obtain tractable acceptance ratios validating the algorithm but, practically, its value is clearly redundant in Algorithm 1 and sampling is therefore not required.

Remark 3.

and reduce to and in (6) and (7) when in which case becomes redundant. This implies generality over PMR algorithms even for (although probably not a useful one), in the sense that in MHAAR one can also propose from .

Example 4 (Example 2, ctd).

As noticed in Nicholls et al. (2012), the exchange algorithm (Murray et al., 2006) can be recast as a PMR algorithm of the form given in (9) where , and corresponds to . Hence an extension of this algorithm using an averaged acceptance ratio estimator is given by Algorithm 1. Taking into account Remark 2, this takes the following form. Sample , then with probability sample and compute

or (i.e. with probability 1/2) sample and , and compute . This algorithm was implemented for an Ising model (see details in Section 3.3) and numerical simulations are presented in Figure 3 where the IAC of is reported as a function of (red/grey colour). As anticipated, increasing improves performance.

2.2.1 Theoretical results on validity and performance of MHAAR

The following theorem justifies the theoretical usefulness of Algorithm 1. The result follows from Hilbert space techniques and the recent realisation that the convex order plays an important rôle in the characterisation of MH updates based on estimated acceptance ratios (Andrieu and Vihola, 2014). We consider standard performance measures associated to a Markov transition probability of invariant distribution defined on some measurable space . Let and . For any the asymptotic variance is defined as

which is guaranteed to exist for reversible Markov chains (although it may be infinite) and for a reversible kernel its right spectral gap

where for any is the so-called Dirichlet form. The right spectral gap is particularly useful in the situation where is a positive operator, in which case is related to the geometric rate of convergence of the Markov chain.

Theorem 1.

With and as defined in (2) and(13), respectively,

  1. For any is reversible,

  2. For all , and is non decreasing,

  3. For any ,

    1. (or equivalently first order auto-covariance coefficient) is non decreasing (non increasing),

    2. is non increasing,

    3. for all , .

Proof.

The reversibility follows from the fact that this Markov transition kernel fits in the framework of asymmetric MH updates described in Theorem 4 in Appendix A after checking that for any ,

(14)

For the other statements we first start by noticing that the expression for the Dirichlet form associated with can be rewritten in either of the following simplified forms

This follows from the identities established in (13) and (14). The expression on the first line turns out to be particularly convenient. A well known result from the convex order literature states that for any exchangeable random variables and any convex function we have whenever the expectations exist (Müller and Stoyan, 2002, Corollary 1.5.24). The two sums are said to be convex ordered. Now since is convex we deduce that for any , ,

(15)

where , and consequently for any and

All the monotonicity properties follow from Tierney (1998) since and are reversible. The comparisons to follow from the application of Jensen’s inequality to , which leads for any to

and again using the results of Tierney (1998). ∎

This result motivates the practical usefulness of the algorithm, in particular in a parallel computing environment. Indeed, one crucial property of Algorithm 1 is that in both moves and , sampling of and computation of can be performed in a parallel fashion and offers the possibility to reduce the variance of estimators, but more importantly the burn-in period of algorithms. Indeed one could object that running independent chains in parallel with and combining their averages, instead of using the output from a single chain with would achieve variance reduction. However our point is that the former does not speed up convergence to equilibrium, while the latter will, in general. Unfortunately, while estimating the asymptotic variance from simulations is achievable, estimating time to convergence to equilibrium is far from standard in general. The following toy example is an exception and illustrates our point.

Example 5.

Here we let

be the uniform distribution on

, for , , and . In other words can be reparametrized in terms of and with the choice for we obtain

Note that there is no need to be more specific than say for as then a proposed “stay” is always accepted. This suggests that we are in fact drawing the acceptance ratio, and corresponds to (Example 8 in Andrieu and Vihola, 2014) of their abstract parametrisation of PMR algorithms. Now for and we have

where

is the probability mass function of the binomial distribution of parameters

and and

The second largest eigenvalue of the corresponding Markov transition matrix is

from which we find the relaxation time , and bounds on the mixing time , that is the number of iterations required for the Markov chain to have marginal distribution within of , in the total variation distance, Levin and Peres (2017, Theorem 12.3 and Theorem 12.4)

We define the time reduction, , which is independent of and captures the benefit of MHAAR in terms of convergence to equilibrium. In Fig. 2 we present the evolution of for and as a function of . As expected the worse the algorithm corresponding to is, the more beneficial averaging is: for we observe running time reductions of approximately , and respectively. This suggests that computationally cheap, but possibly highly variable, estimators of the acceptance ratio may be preferable to reduce burn-in when a parallel machine is available and communication costs are negli-geable.

Figure 2: Top left: , Top right: , Bottom left: , Bottom right: evolution of as a function of

2.2.2 Introducing dependence

The following discussion on possible extensions can be omitted on a first reading. There are numerous possible variations around the basic algorithm presented above. A practically important extension related to the order in which variables are drawn is discussed in Section 4 in the general context of latent variable models. There is another possible extension worth mentioning here. Close inspection of the proof of reversibility of in Theorem 1 suggests that conditional independence of is not a requirement. Define .

Theorem 2.

Let and for any let be a probability distribution on such that all its marginals are identical and equal to . Assume that

Then with acceptance ratio as in (12) is reversible. Further, if are exchangeable with respect to then all the comparison results in Theorem 1 still hold.

Proof.

One can check that

which remains the same as in (12). The exchangeability assumption ensures that (15) holds. ∎

Example 6.

The following is a short discussion of a scenario which may be relevant in practice. Assume that it is possible to sample from but that this is computationally expensive, as is the case for sampling exactly from Markov random fields such as the Ising model. One could suggest sampling the remaining samples as defined in using a reversible Markov transition probability (and similarly for in using ), which will in general be far less expensive. Here corresponds to sampling

In order to describe sampling in