The Metropolis–Hastings (MH) algorithm is a classical method for sampling approximately from a distribution of interest relying only on point-wise evaluations of an unnormalized density. However, when even this unnormalized density depends on unknown integrals and cannot easily be evaluated, then this approach is not feasible. A possible solution is to replace the required density evaluations in the MH acceptance ratio with suitable approximations. This idea is implemented in Monte Carlo within Metropolis (MCwM) algorithms which substitute the unnormalized density evaluations by Monte Carlo estimates for the intractable integrals.
Yet in general, replacing the exact MH acceptance ratio by an approximation leads to inexact algorithms in the sense that the resulting Markov chain does not have the distribution of interest as stationary one and might not converge to a stationary distribution at all. Nonetheless, these approximate or noisy methods, see e.g. [AFEB16, JM17b, JMMD15], have recently gained increased attention due to their applicability in certain intractable sampling problems. In this work we attempt to answer the following questions about the MCwM algorithm:
Can one quantify the quality of MCwM algorithms?
When might the MCwM algorithm fail and what can one do in such situations?
Regarding the first question, by using bounds on the difference of the -th step distributions of MH and MCwM algorithm we give a positive answer which improves upon earlier results in the literature, see below. For the second question, we suggest a modification for stabilizing the MCwM approach by restricting the Markov chain to a suitably chosen set that contains the “essential part”, or the “center”, of the state space. We provide examples where this restricted version of MCwM converges towards the distribution of interest while the unrestricted version does not. Note also that in practical implementations of Markov chain Monte Carlo on a computer, simulated chains are effectively restricted to compact state spaces due to memory limitations. Our results on restricted approximations can also be read in this spirit.
Our overall approach is based on perturbation theory for Markov chains. Let be a Markov chain with transition kernel and be a Markov chain with transition kernel on a common Polish space . We think of and as “close” to each other in a suitable sense and consider as a perturbation of . In order to quantify the difference of the distributions of and , denoted by and respectively, we work with
where denotes the total variation distance. The Markov chain can be interpreted as unavailable, unperturbed, ideal, while is a perturbation that is available for simulation. We focus on the case where the ideal Markov chain is geometrically ergodic, more precisely -uniformly ergodic, and satisfies a Lyapunov condition of the form
for some function and numbers .
To obtain estimates of (1) we need two assumptions which can be informally explained as follows:
Explicit bounds on (1) are provided in Section 3 in Proposition 7, Theorem 8 and Theorem 10. More precisely, in Theorem 7 a Lyapunov condition on is imposed whereas in Theorem 8 a restricted approximation is considered. We now briefly comment on related literature to such perturbation estimates for -uniformly ergodic Markov chains.
In contrast to the -uniform ergodicity assumption we impose on the ideal Markov chain, the results in [AFEB16, JMMD15, Mit05] only cover perturbations of uniformly ergodic Markov chains. Nonetheless, perturbation theoretical questions for geometrically ergodic Markov chains have been studied before, see e.g. [BRR01, FHL13, MLR16, NR17, RRS98, RS18, SS00] and the references therein. A crucial aspect where those papers differ from each other is how one measures the closeness of the transitions of the unperturbed and perturbed Markov chains to have applicable estimates, see the discussion about that in [SS00, FHL13, RS18]. Our Theorem 7 and Theorem 8 refine and extend the results of [RS18, Theorem 3.2]. In particular, in Theorem 8 we take a restriction to the center of the state space into account. Let us also mention here that [PS14, RS18] contain related results under Wasserstein ergodicity assumptions. More recently, [JM17a] studies approximate chains using notions of maximal couplings, [NR17] extends the uniformly ergodic setting from [JMMD15] to using norms instead of total variation, and [JM17b] explores bounds on the approximation error of time averages.
In Section 4 we apply our perturbation bounds in the context of approximate sampling via MCwM. The goal is to (approximately) sample from a target distribution on , which is determined by an unnormalized density function w.r.t a reference measure , that is,
Classically the method of choice is to construct a Markov chain based on the MH algorithm for approximate sampling of . This algorithm crucially relies on knowing (at least) the ratio for arbitrary , e.g., because and can readily be computed. However, in some scenarios, only approximations of and are available. Replacing the true unnormalized density in the MH algorithm by an approximation yields a perturbed, “inexact” Markov chain . If the approximation is based on a Monte Carlo method, the perturbed chain is called MCwM chain.
Two particular settings where approximations of may rely on Monte Carlo estimates are doubly-intractable distributions and latent variables. Examples of the former occur in Markov or Gibbs random fields, where the function values of the unnormalized density itself are only known up to a factor . This means that
where only values of can easily be computed while the computational problem lies in evaluating
where denotes an auxiliary variable space, and is a
-finite measure. We study the MCwM algorithm, which in every transition uses an iid sequence of random variables, with , to approximate by (and by , respectively).
Let us emphasize that the doubly-intractable case can also be approached algorithmically from various other perspectives. For instance, instead of estimating the normalizing constant in (2), one could estimate unbiasedly whenever exact simulation from the Markov or Gibbs random field is possible. In this case, turns into a Monte Carlo estimate which can formally be analyzed with exactly the same techniques as the latent variable scenario described below. Yet another algorithmic possibility is explored in the noisy exchange algorithm of [AFEB16], where ratios of the form are approximated by a single Monte Carlo estimate. Their algorithm is motivated by the exchange algorithm [MGM06] which, perhaps surprisingly, can avoid the need for evaluating the ratio and targets the distribution exactly, see [EJREH17] for a brief review of these and related methods. However, in some cases the exchange algorithm performs poorly, see [AFEB16]. Then approximate sampling methods for distributions of the form (2) might prove useful as long as the introduced bias is not too large. As a final remark in this direction, the recent work [ADYC18] considers a correction of the noisy exchange algorithm which produces a Markov chain with stationary distribution .
The second setting we study arises from latent variables. Here, cannot be evaluated since it takes the form
where is a distribution on a measurable space of latent variables , and is a non-negative density function. In general, no explicit computable expression of the above integral is at hand and the MCwM idea is to substitute in the MH algorithm by a Monte Carlo estimate based on iid sequences of random variables and with , . The resulting MCwM algorithm has been studied before in [AR09, MLR16]. This MCwM approach should not be confused with the pseudo-marginal method, see [AR09]. The pseudo-marginal method constructs a Markov chain on the extended space that targets a distribution with as its marginal on .
In both intractability settings, the corresponding MCwM Markov chains depend on the parameter which denotes the number of samples used within the Monte Carlo estimates. As a consequence, any bound on (1) is -dependent, which allows us to control the dissimilarity to the ideal MH based Markov chain. In Corollary 16 and the application of Corollary 17 to the examples considered in Section 4 we provide informative rates of convergence as . Note that with those estimates we relax the requirement of uniform bounds on the approximation error introduced by the estimator for , which is essentially imposed in [MLR16, AFEB16]
. In contrast to this requirement, we use (if available) the Lyapunov function as a counterweight for a second as well as inverse second moment and can therefore handle situations where uniform bounds on the approximation error are not available. If we do not have access to a Lyapunov function for the MCwM transition kernel we suggest to restrict it to a subset of the state space, i.e., use restricted approximations. This subset is determined byand usually corresponds to a ball with some radius that increases as the approximation quality improves, that is, as .
Our analysis of the MCwM algorithm is guided by some stylized facts we observe in simple illustrations, in particular, we consider a log-normal example discussed in Section 4.1. In this example, we encounter a situation where the mean squared error of the Monte Carlo approximation grows exponentially in the tail of the target distribution. We observe empirically that (unrestricted) MCwM works well whenever the growth behavior is dominated by the decay of the (Gaussian) target density in the tail. The application of Corollary 17 to the log-normal example shows that the restricted approximation converges towards the true target density in the number of samples at least like independent of any growth of the error. However, the convergence is better, at least like , if the growth is dominated by the decay of the target density.
Note that our perturbation bounds of Corollary 16 and Corollary 17 hold generally in the doubly-intractable as well as in the latent variable case. Corollary 17 handles the situation of not uniformly bounded approximation error and Corollary 16 the situation of uniformly bounded error. In Section 4.2 we consider the latent variable case and provide an illustrative normal-normal example where the restricted approximation proves to be useful.
Let be a Polish space, where denotes its Borel -algebra. Assume that is a transition kernel with stationary distribution on . For a signed measure on and a measurable function we define
For a distribution on we use the notation For a measurable function and two probability measures on define
For the constant function this is the total variation distance, i.e.,
For a -irreducible and aperiodic transition kernel with stationary distribution defined on the following statements are equivalent:
The transition kernel is geometrically ergodic, that is, there exists a number and a measurable function such that for -a.e. we have
There is a -a.e. finite measurable function with finite moments with respect to and there are constants and such that
In particular, the function can be chosen such that a Lyapunov condition of the form
for some and , is satisfied.
We call a transition kernel -uniformly ergodic if it satisfies (5) and note that this condition can be be rewritten as
3 Quantitative perturbation bounds
Assume that is a Markov chain with transition kernel and initial distribution on . We define , i.e., is the distribution of . The distribution is approximated by using another Markov chain with transition kernel and initial distribution . We define , i.e., is the distribution of . The idea throughout the paper is to interpret as some ideal, unperturbed chain and as an approximating, perturbed Markov chain.
In the spirit of the doubly-intractable distribution and latent variable case considered in Section 4 we think of the unperturbed Markov chain as “nice”, where convergence properties are readily available. Unfortunately since we cannot simulate the “nice” chain we try to approximate it with a perturbed Markov chain, which is, because of the perturbation, difficult to analyze directly. With this in mind, we make the following standing assumption on the unperturbed Markov chain.
Let be a measurable function and assume that is -uniformly ergodic, i.e.,
for some numbers and .
We start with an auxiliary estimate of which is interesting on its own and proved in Appendix A.1.
Let Assumption 3 be satisfied and for a measurable function define
Then, for any ,
If the initial distribution of the perturbed and unperturbed Markov chain differ, then, this appears also in the estimate. Yet, for sufficiently large the influence of the difference of the initial distributions vanishes exponentially quickly.
The quantities and measure the difference between and . Note that we can interpret them as operator norms
It is also easily seen that which implies that is a stronger criterion for measuring the perturbation. In (8) an additional parameter appears which can be used to tune the estimate. Namely, if one is not able to bound sufficiently well but has a good estimate of one can optimize over . On the other hand, if there is a satisfying estimate of one can just set .
In the previous lemma we proved an upper bound of which still contains an unknown quantity given by
which measures, in a sense, stability of the perturbed chain through a weighted sum of expectations of the Lyapunov function under . To control this term, we impose additional assumptions on the perturbed chain. In the following, we consider two assumptions of this type, a Lyapunov condition and a bounded support assumption.
3.1 Lyapunov condition
We start with a simple version of our main estimate which illustrates already some key aspects of the approach via the Lyapunov condition. Here the intuition is as follows: By Theorem 1 we know that the function of Assumption 3 can be chosen such that a Lyapunov condition for is satisfied. Since we think of as being close to , it might be possible to show also a Lyapunov condition with of . If this is the case, the following proposition is applicable.
Let Assumption 3 be satisfied. Additionally, let and such that
Assume that and define as well as (for simplicity)
Then, for any ,
Now we state a more general theorem. In particular, in this estimate the dependence on the initial distribution can be weakened. In the perturbation bound of the previous estimate, the initial distribution is only forgotten if . Yet, intuitively, for long-term stability results should not matter at all. This intuition is confirmed by the theorem.
We consider an illustrating example where Theorem 8 leads to a considerably sharper bound than Proposition 7. This improvement is due to the combination of two novel properties of the bound of Theorem 8:
In the Lyapunov condition (13) the function can be chosen differently from .
The fact that is bounded from above by . Thus converges almost exponentially fast to zero in . This implies that for sufficiently large the dependence of vanishes. Nevertheless, the leading factor can capture situations in which the perturbation error is increasing in for small .
Illustrating example. Let and assume . Here state “” can be interpreted as the “tail” while state “” is the “center”. Define
Thus, the unperturbed Markov chain moves from “” to “” right away, while the perturbed one takes longer. Both transition matrices have the same stationary distribution . Obviously, and for holds
The unperturbed Markov chain is uniformly ergodic, such that we can choose and (5) is satisfied with and . In particular, in this setting and from Proposition 7 coincide, we have . Thus, the estimate of Proposition 7 gives
This bound is optimal in the sense that it is best possible for . But for increasing it is getting worse. Notice also that a different choice of cannot really remedy this situation: The chains differ most strongly at and the bound of Proposition 7 is constant over time. Now choose the function for some . The transition matrix satisfies the Lyapunov condition
i.e., . Moreover, we have and . Thus, in the bound from Theorem 8 we can set and such that
Since can be chosen arbitrarily large, it follows that
which is best possible for all .
The previous example can be seen as a toy model of a situation where the transition probabilities of a perturbed and unperturbed Markov chain are very similar in the “center” of the state space but differ considerably in the “tail”. When the chains start both in the same point in the “tail”, considerable differences between distributions can build up along the initial transient and then vanish again. Earlier perturbation bounds as for example in [Mit05, PS14, RS18] take only an initial error and a remaining error into account. Thus, those are worse for situations where this transient error captured by dominates. A very similar term also appears in the very recent error bounds due to [JM17b]. Here we also see that a function different from is advantageous.
3.2 Restricted approximation
In the previous section, we have seen that a Lyapunov condition of the perturbation helps to control the long-term stability of approximating a -uniformly ergodic Markov chain. In this section we assume that the perturbed chain is restricted to a “large” subset of the state space. In this setting a sufficiently good approximation of the unperturbed Markov chain on this subset leads to a perturbation estimate.
For the unperturbed Markov chain we assume that transition kernel is -uniformly ergodic. Then, for define the “large subset” of the state space as
If is chosen as a monotonic transformation of a norm on , is simply a ball around . The restriction of to the set , given as , is defined as
In other words, whenever would make a transition from to , remains in . Otherwise, is the same as . We obtain the following perturbation bound for approximations whose stability is guaranteed through a restriction to the set .
Under the -uniform ergodicity of Assumption 3 let and be chosen in such a way that
For the perturbed transition kernel assume that it is restricted to , i.e., for all , and that with
Then, with and
we have for that
The proof of the result is stated in Appendix A.1. Notice that while the perturbed chain is restricted to the set , we do not place a similar restriction on the unperturbed chain. The estimate (15) compares the restricted, perturbed chain to the unrestricted, unperturbed one.
In the special case where for we have . For example
with satisfies this condition. The resulting perturbed Markov chain is simply a restriction of the unperturbed Markov chain to and Theorem 10 provides a quantitative bound on the difference of the distributions.
4 Monte Carlo within Metropolis
In Bayesian statistics it is of interest to sample with respect to a distributionon . We assume that admits a possibly unnormalized density with respect to a reference measure , for example the counting, Lebesgue or some Gaussian measure. The Metropolis-Hastings (MH) algorithm is often the method of choice to draw approximate samples according to :
For a proposal transition kernel a transition from to of the MH algorithm works as follows.
Draw and a proposal independently, call the result and , respectively.
is the acceptance ratio, that is, the density of the measure w.r.t. , see [Tie98].
If , then accept the proposal, and return , otherwise reject the proposal and return .
The transition kernel of the MH algorithm with proposal , stationary distribution and acceptance probability is given by
For the MH algorithm in the computation of one uses , which might be known from having access to function evaluations of the unnormalized density . However, when it is expensive or even impossible to compute function values of , then it may not be feasible to sample from using the MH algorithm. Here are two typical examples of such scenarios:
Doubly-intractable distribution: For models such as Markov or Gibbs random fields, the unnormalized density itself is typically only known up to a factor , that is,
where functions values of can be computed, but function values of cannot. For instance, might be given in the form
where denotes an auxiliary variable space, and is a -finite measure.
Latent variables: Here cannot be evaluated, since it takes the form
with a distribution on a measurable space of latent variables and a non-negative function .
In the next sections, we study in both of these settings the perturbation error of an approximating MH algorithm. A fair assumption in both scenarios is that the infeasible, unperturbed MH algorithm is -uniformly ergodic:
For some function let the transition kernel of the MH algorithm be -uniformly ergodic, that is,
with and , and additionally, assume that the Lyapunov condition
for some and is satisfied.
We have the following standard proposition (see e.g. [RS18, Lemma 4.1] or [AFEB16, BDH14, JM17b, MLR18, PS14]) which leads to upper bounds on , and (see Lemma 4 and Theorem 10) for two MH type algorithms and with acceptance probability functions .
Let and let be such that for a constant . Assume that there are functions and a set such that either for or holds
Then we have
and, for any ,
The proposition provides a tool for controlling the distance between the transition kernels of two MH type algorithms with identical proposal and different acceptance probabilities. The specific functional form for the dependence of the upper bound in (20) on and is motivated by the applications below. The set indicates the “essential” part of where the difference of the acceptance probabilities matter. The parameter is used to shift weight between the two components and of the approximation error. For the proof of the proposition, we refer to Appendix A.2.
4.1 Doubly-intractable distributions
In the case where takes the form (18), we can approximate by a Monte Carlo estimate
under the assumption that we have access to an iid sequence of random variables where each is distributed according to . Then, the idea is to substitute the unknown quantity by the approximation within the acceptance ratio. Define a function by , then the acceptance probability given , modifies to
where the random variables , are assumed to be independent from each other. This leads to a Monte Carlo within Metropolis (MCwM) algorithm:
For a given proposal transition kernel , a transition from to of the MCwM algorithm works as follows.
Draw and a proposal independently, call the result and , respectively.
Based on independent samples compute , and then calculate .
If , then accept the proposal, and return , otherwise reject the proposal and return .
Given the current state and a proposed state the overall acceptance probability is
which leads to the corresponding transition kernel of the form , see (17).
The quality of the MCwM algorithm depends on the error of the approximation of . The root mean squared error of this approximation can be quantified by the use of , that is,
is determined by the second moment of . In addition, due to the appearance of the estimator in the denominator of , we need some control of its distribution near zero. To this end, we define, for and , the inverse moment function
With this notation we obtain the following estimate, which is proved in Appendix A.2.
Assume that there exists such that and are finite for all . Then, for all and we have
One can replace the boundedness of the second inverse moment for any by boundedness of a lower moment for with suitably adjusted , see Lemma 23 in the appendix.
4.1.1 Inheritance of the Lyapunov condition
If the second and inverse second moment are uniformly bounded, as well as , one can show that the Lyapunov condition of the MH transition kernel is inherited by the the MCwM algorithm. In the following corollary, we prove this inheritance and state the resulting error bound for MCwM.
For a distribution on let and be the respective distributions of the MH and MCwM algorithms after steps. Let Assumption 12 be satisfied and for some let
Further, define and Then, for any
we have and
Observe that the estimate is bounded in so that the
difference of the distributions converges uniformly in to zero for .
The constant decreases for increasing , so that larger values of improve the bound.
Log-normal example I. Let and the target measure
be the standard normal distribution. We choose a Gaussian proposal kernelfor some , where denotes the normal distribution with mean
and variance. It is well known, see [JH00, Theorem 4.1, Theorem 4.3 and Theorem 4.6], that the MH transition kernel satisfies Assumption 12 for some numbers , , and with .
be the density of the log-normal distribution with parametersand , i.e., is the density of for a random variable . Then, by the fact that for all functions , we can write the (unnormalized) standard normal density as
Hence takes the form (18) with , , and being a log-normal distribution with parameters and . Independent draws from this log-normal distribution are used in the MCwM algorithm to approximate the integral. We have for all and, accordingly,