Models with unknown normalizing constants arise frequently in many different areas. Examples include Ising models [Isi25] in statistical physics, autologistic models [Bes72] [Bes74] in spatial statistics, exponential random graph models [RPKL07] in sociology, disease transmission models [PHLJH12] in epidemiology, and so on. The corresponding statistical inference problem can be formulated as follows.
Suppose we are given data $x$ sampled from a family of probability densities (or probability mass functions) of the form:
$$f(x\mid\theta) = \frac{g(x,\theta)}{Z(\theta)}.$$
We assume $g(x,\theta)$ can be easily evaluated, but the normalizing function $Z(\theta)$ is computationally intractable. Examples include:
Example 1 (Ising Model).
Consider a graph $G$ with $n$ nodes, where each vertex $i$ is assigned a spin $\sigma_i$, which is either $+1$ or $-1$. A spin configuration $\sigma$ is an assignment of spins to all the graph vertices. An Ising model on $G$ is defined by the following Boltzmann distribution over all possible configurations:
$$\pi_\beta(\sigma) = \frac{e^{-\beta H(\sigma)}}{Z(\beta)},$$
where $H(\sigma)$ is often referred to as the 'Hamiltonian function' and $Z(\beta) = \sum_\sigma e^{-\beta H(\sigma)}$ as the 'partition function'. As there are $2^n$ different possible spin configurations, the normalizing constant is usually computationally intractable for moderately large $n$.
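To make the intractability concrete, the following minimal sketch (variable names are illustrative) computes the partition function by brute-force enumeration for a tiny graph; the $2^n$-term sum is exactly what becomes infeasible for large $n$.

```python
import math
from itertools import product

def ising_partition_function(edges, n, beta):
    """Brute-force partition function: Z(beta) = sum over all 2^n spin
    configurations of exp(-beta * H(sigma)), with Hamiltonian
    H(sigma) = -sum_{(i, j) in E} sigma_i * sigma_j."""
    Z = 0.0
    for sigma in product((-1, 1), repeat=n):
        H = -sum(sigma[i] * sigma[j] for i, j in edges)
        Z += math.exp(-beta * H)
    return Z

# 3-node path graph: only 2^3 = 8 configurations, so enumeration is feasible;
# for n = 100 the same sum would have 2^100 terms.
Z = ising_partition_function([(0, 1), (1, 2)], 3, beta=0.5)
```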
Example 2 (Exponential Random Graph Model).
Exponential random graph models are a family of probability distributions on graphs. Let $\mathcal{G}_n$ be the set of all simple, undirected graphs (without loops or multiple edges) on $n$ vertices. Consider the following distribution on $\mathcal{G}_n$:
$$\pi_\theta(G) = \frac{\exp\!\big(\theta^{T} s(G)\big)}{Z(\theta)},$$
where $s(G)$ is a sufficient statistic defined on $\mathcal{G}_n$. It may be chosen as the degrees of the vertices, the number of edges, the number of triangles, or other subgraph counts, and $Z(\theta)$ is the normalizing constant. As there are up to $2^{\binom{n}{2}}$ possible graphs, $Z(\theta)$ is also computationally intractable for moderately large $n$.
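As a small illustration, the edge-count special case of the model can be enumerated exactly for tiny $n$; the sketch below (illustrative only, not the general model) sums over all $2^{\binom{n}{2}}$ graphs.

```python
import math
from itertools import combinations, product

def ergm_partition_function(n, theta):
    """Brute-force normalizing constant for the edge-only ERGM:
    Z(theta) = sum over all simple graphs G on n vertices of
    exp(theta * #edges(G)).  A graph is encoded as a 0/1 vector
    indexed by the n*(n-1)/2 vertex pairs."""
    pairs = list(combinations(range(n), 2))
    Z = 0.0
    for mask in product((0, 1), repeat=len(pairs)):
        Z += math.exp(theta * sum(mask))
    return Z

# n = 4 vertices: 2^6 = 64 graphs, still enumerable.
Z = ergm_partition_function(4, theta=0.3)
```

For the edge-only statistic the sum factorizes over vertex pairs, giving $Z(\theta) = (1+e^\theta)^{\binom{n}{2}}$, which is a handy sanity check; for richer statistics (triangles, degree counts) no such closed form exists and the enumeration cost is the point.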
It is of natural interest to do inference on the parameter $\theta$. It turns out that the classical route for statistical inference (the maximum likelihood approach) cannot be applied due to the intractability of $Z(\theta)$. Current frequentist solutions are mainly based on approximation methods such as pseudo-likelihood approximation [Bes74], MCMC-MLE [Gey91], and stochastic approximation [You88]. Frequentist approaches are usually computationally efficient but lack theoretical guarantees; in fact, it is known that there are cases where these approximation methods perform poorly, see [CMRT09] for discussion.
From a Bayesian perspective, suppose a prior $p(\theta)$ is adopted. The posterior can be formally calculated by
$$\pi(\theta\mid x) \propto \frac{p(\theta)\, g(x,\theta)}{Z(\theta)}.$$
A central part of Bayesian inference is then to understand the posterior distribution. For example, if one is able to (asymptotically) draw samples from the posterior (usually by Markov chain Monte Carlo algorithms), then the expectation of any function $h$
of interest can be estimated by:
$$\widehat{h} = \frac{1}{N}\sum_{i=1}^{N} h(\theta_i),$$
where $\theta_1, \ldots, \theta_N$ are samples drawn from $\pi(\theta\mid x)$.
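The plug-in estimator above can be sketched as follows; for illustration the "posterior draws" are i.i.d. standard-normal samples standing in for MCMC output.

```python
import random

def mcmc_estimate(h, samples):
    """Plug-in (ergodic-average) estimator: (1/N) * sum_i h(theta_i)."""
    return sum(h(s) for s in samples) / len(samples)

# Illustration: i.i.d. standard-normal draws stand in for posterior samples,
# so the estimator should recover E[theta^2] = 1.
rng = random.Random(0)
draws = [rng.gauss(0.0, 1.0) for _ in range(100000)]
second_moment = mcmc_estimate(lambda t: t * t, draws)
```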
However, the unknown normalizing function $Z(\theta)$ makes MCMC sampling quite challenging. Consider a standard Metropolis-Hastings (M-H) algorithm with proposal density $q(\theta'\mid\theta)$; in each iteration the acceptance ratio is of the form:
$$a(\theta, \theta') = \min\left\{1,\ \frac{p(\theta')\,g(x,\theta')\,q(\theta\mid\theta')}{p(\theta)\,g(x,\theta)\,q(\theta'\mid\theta)} \cdot \frac{Z(\theta)}{Z(\theta')}\right\}.$$
This cannot be directly computed as the ratio $Z(\theta)/Z(\theta')$ is unknown. The posterior distribution is often referred to as a doubly-intractable distribution: the Metropolis-Hastings algorithm is exact only after infinitely many steps, and, on top of that, each iteration involves an infeasible calculation [MGM06].
One of the most popular methods to resolve this issue is the exchange algorithm [MGM06] proposed by Murray et al. Roughly speaking, the exchange algorithm is a new MCMC algorithm which uses an auxiliary variable at each step to estimate the unknown ratio (see Algorithm 2 for details). The algorithm is easy to implement and is asymptotically exact.
The exchange algorithm is widely used for sampling from doubly-intractable distributions. However, there are very limited studies of its theoretical properties. One fundamental question about any MCMC algorithm is its convergence rate. On the one hand, an a-priori bound on how long the chain should run to converge within a given accuracy helps guide practical use. On the other hand, existing theory shows deep connections between the convergence rate and the Markov chain Central Limit Theorem: a chain with a sub-geometric convergence rate may fail to admit a Central Limit Theorem, and estimators derived from its samples may even have infinite variance.
This motivates us to study the theoretical properties of the exchange algorithm. Our main contributions include:
We prove several comparison-type results between the exchange algorithm and the original Metropolis-Hastings algorithm. Our results compare the exchange algorithm and the Metropolis-Hastings algorithm in terms of asymptotic variance and convergence rate. We establish both necessary and sufficient conditions for the exchange chain to have ‘good’ convergence properties in the sense of geometric ergodicity.
We apply our theoretical results to a variety of practical examples, such as location models, Ising models, and exponential random graph models, which include many of the practical applications of the exchange algorithm. Our results justify the theoretical usefulness of the exchange algorithm in practical situations. To the best of our knowledge, this is the first result establishing geometric ergodicity for the exchange algorithm on a non-compact parameter space.
We prove a Central Limit Theorem for the exchange algorithm given that it is geometrically ergodic. We also provide lower and upper bounds for the asymptotic variance of the exchange algorithm.
We compare the exchange algorithm with Pseudo-marginal Monte Carlo, another popular method to tackle the doubly-intractable distribution. We design a hybrid algorithm called Multi-armed Bandit Monte Carlo which interpolates between the two algorithms and achieves higher acceptance probability.
Our results are heuristically summarized in Section 3.2, with proofs provided in Section 3.3. Section 3.4 shows the theorems proved in Section 3.3 can be applied to many practical models, including location models, Poisson models, and a large subset of exponential-family models which contains ERGMs and Ising models. Section 4 studies a Beta-Binomial example carefully; quantitative bounds for both the Metropolis-Hastings chain and the exchange chain are provided. Section 5 concludes this paper and provides further possible directions.
2.1 The Exchange Algorithm
Let $\Theta$ be the parameter space and let $\pi$ be the target density on $\Theta$. The standard Metropolis-Hastings algorithm (MHMC) is described in Algorithm 1.
Input: initial setting $\theta_0$, number of iterations $N$, Markov transition kernel $q$
However, in our setting the posterior density has the following expression:
$$\pi(\theta\mid x) \propto \frac{p(\theta)\, g(x,\theta)}{Z(\theta)},$$
where $Z(\theta)$ is an unknown function of $\theta$.
Therefore, at each step the acceptance ratio
contains an intractable term $Z(\theta)/Z(\theta')$.
The exchange algorithm, described below in Algorithm 2, is a clever extension of MHMC which uses an auxiliary variable at each step to estimate the unknown ratio of normalizing constants. Throughout this paper, we will use $\{X_n\}$ to denote the Markov chain of the MHMC algorithm, and $\{\widetilde{X}_n\}$ to denote the Markov chain of the exchange algorithm.
Input: initial setting , number of iterations
If we compare the exchange algorithm with the Metropolis-Hastings algorithm (Algorithm 1), it turns out the only difference is that the uncomputable ratio $Z(\theta)/Z(\theta')$
appearing in Algorithm 1 is replaced by $g(w, \theta)/g(w, \theta')$
in Algorithm 2, where $w$ is the auxiliary variable generated in each step. Roughly speaking, the exchange algorithm uses the importance-sampling-type estimator
$$\frac{g(w, \theta)}{g(w, \theta')} \approx \frac{Z(\theta)}{Z(\theta')}, \qquad w \sim f(\cdot \mid \theta'),$$
and plugs it into the acceptance ratio of the Metropolis-Hastings algorithm. The exchange algorithm is easy to implement and is simple in the sense that it differs from the original Metropolis-Hastings algorithm by only an extra auxiliary variable in each step. Meanwhile, the estimator is cleverly designed so that the correct stationary distribution is still preserved. The strong assumption is the ability to draw exact samples from $f(\cdot\mid\theta')$ in Step 3, which is highly nontrivial and often impractical.
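A single iteration of this scheme can be sketched as follows, assuming a symmetric proposal; the function names (`log_g`, `sample_exact`, `propose`) are illustrative, not taken from the paper.

```python
import math
import random

def exchange_step(theta, x, log_g, sample_exact, propose, log_prior, rng):
    """One iteration of the exchange algorithm with a symmetric proposal.
    log_g(x, theta): log of the unnormalized likelihood g(x, theta).
    sample_exact(theta, rng): an exact draw w from f(. | theta).
    The intractable log Z(theta) - log Z(theta') is replaced by the
    auxiliary-variable term log g(w, theta) - log g(w, theta')."""
    theta_new = propose(theta, rng)
    w = sample_exact(theta_new, rng)  # auxiliary variable
    log_ratio = (log_g(x, theta_new) - log_g(x, theta)
                 + log_prior(theta_new) - log_prior(theta)
                 + log_g(w, theta) - log_g(w, theta_new))
    if math.log(rng.random()) < min(0.0, log_ratio):
        return theta_new  # accept the 'exchange'
    return theta          # reject, stay at the current state
```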
In the original paper, the use of the exchange algorithm is illustrated by a stylized likelihood example, as well as an example with the Ising model. Practitioners also use the exchange algorithm for exponential random graph models (ERGMs) [CF11], spatial autoregressive (SAR) models [HL16], Bayesian hypothesis testing [DW18], and so on. However, theoretical studies of doubly-intractable distributions and the exchange algorithm are still very limited. Murray et al. proved the detailed-balance equation holds for the exchange algorithm in their original paper [MGM06]. Nicholls et al. [NFW12] gave a sufficient condition for a minorization condition of the exchange chain. Habeck et al. [HRS20] provided stability properties of doubly-intractable distributions. Medina-Aguayo et al. [MARS20] provided guarantees for the Monte Carlo within Metropolis algorithm for approximate sampling of doubly-intractable distributions. However, it seems the only existing result concerning the convergence rate of the exchange algorithm is in [NFW12], and the conditions proposed there seem strong and are generally not satisfied on unbounded parameter spaces, which are of practitioners' main interest. For example, geometric ergodicity is the usual notion of a chain having a 'good' convergence rate, but there is no result showing whether the exchange algorithm is geometrically ergodic or not. This motivates us to study the theoretical properties of the exchange algorithm.
It is worth mentioning that there are numerous other sampling algorithms besides the exchange algorithm for tackling 'doubly-intractable distributions'. Those algorithms are broadly divided into two classes: 'exact algorithms' and 'noisy algorithms'. As their names suggest, 'exact algorithms' have the correct stationary distribution, while 'noisy algorithms' only follow the target distribution approximately, even after infinitely many steps. Numerous algorithms, such as pseudo-marginal MCMC [MPRB06], the exchange algorithm [MGM06], noisy MCMC [AFEB16], and the double Metropolis-Hastings sampler [Lia10], have been proposed in each category. Usually, exact algorithms have the virtue of sampling from the correct target distribution in the asymptotic sense, but require strict assumptions such as perfect sampling. The comparison between these algorithms is beyond the scope of this paper; we refer the readers to the excellent survey [PH18].
In particular, pseudo-marginal Monte Carlo (PMC) is another popular approach to tackling doubly-intractable distributions and shares many similarities with the exchange algorithm. PMC was first proposed in the seminal paper [Bea03] by Beaumont in 2003. In 2006, Møller et al. first used the PMC approach to sample from doubly-intractable posterior distributions. PMC can be viewed as a generalization of the popular data-augmentation schemes, sampling jointly from the parameter space and the sample space. PMC is similar to the exchange algorithm in that both are 'exact algorithms' and both use an 'importance-sampling' type estimator for the unknown ratio of normalizing constants. A comparison between the estimators used in PMC and the exchange algorithm is discussed in [Wan19]. Meanwhile, much important progress has been made on the theoretical properties of PMC; see [AR09] and [AV15] for discussions. But theoretical results for the exchange algorithm are still very limited. This is another motivation of this paper.
2.2 Markov Chain Convergence
Let $\{X_n\}$ be a Markov chain on state space $\Theta$ with transition kernel $P(x, \cdot)$. It is further assumed that $P(x, \cdot)$ is absolutely continuous with respect to the Lebesgue measure, except perhaps for an atom at the current point $x$.
If $\pi$ is a probability measure such that
$$\pi(A) = \int_\Theta P(x, A)\, \pi(dx) \quad \text{for all measurable } A,$$
then $\pi$ is called the stationary distribution of the chain. Furthermore, the chain is called reversible if it satisfies a detailed-balance equation, that is:
$$\pi(dx)\, P(x, dy) = \pi(dy)\, P(y, dx).$$
By design, Metropolis-Hastings algorithms are reversible Markov chains with the target distribution as their stationary distribution.
Let $P^n(x, \cdot)$ be the probability measure after running the chain $n$ steps from starting point $x$. The distance between $P^n(x, \cdot)$ and $\pi$ is measured by the total variation distance, defined as follows:
Let $\mu$, $\nu$ be two probability measures on a sigma-algebra $\mathcal{F}$ of subsets of a sample space $\Omega$; the total variation distance between $\mu$ and $\nu$ is defined as:
$$\|\mu - \nu\|_{TV} = \sup_{A \in \mathcal{F}} |\mu(A) - \nu(A)|.$$
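For discrete distributions the definition reduces to half the $\ell_1$ distance between the probability vectors; a minimal sketch:

```python
def tv_distance(mu, nu):
    """Total variation distance between two discrete distributions given
    as dicts mapping outcome -> probability:
    sup_A |mu(A) - nu(A)| = (1/2) * sum_x |mu(x) - nu(x)|."""
    support = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in support)
```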
We now note two simple properties which turn out to be necessary for analyzing Markov chain convergence and are satisfied by most chains of our interest:
Definition 2 ($\varphi$-irreducible).
A Markov chain is $\varphi$-irreducible if there exists a non-zero $\sigma$-finite measure $\varphi$ on $\Theta$ such that for every measurable set $A$ with $\varphi(A) > 0$ and every $x \in \Theta$, there exists a positive integer $n = n(x, A)$ such that $P^n(x, A) > 0$.
Roughly speaking, $\varphi$-irreducibility ensures every subset with positive measure can be reached from anywhere in the state space.
Definition 3 (Aperiodic).
A Markov chain is aperiodic if there do not exist $d \ge 2$ and disjoint subsets $\Theta_1, \ldots, \Theta_d$ with $P(x, \Theta_{i+1}) = 1$ for all $x \in \Theta_i$ ($1 \le i \le d-1$) and $P(x, \Theta_1) = 1$ for all $x \in \Theta_d$.
Aperiodicity ensures the Markov chain does not cycle through parts of the state space periodically.
The following theorem establishes convergence for general state space Markov chains.
Theorem 1 ([MT12], Chapter 13).
If a Markov chain on a state space with countably generated $\sigma$-algebra is $\varphi$-irreducible and aperiodic, and has a stationary distribution $\pi$, then
$$\|P^n(x, \cdot) - \pi\|_{TV} \to 0$$
as $n \to \infty$, for $\pi$-a.e. $x$.
Theorem 1 is widely applied to MCMC algorithms, as it is often very easy to check that an MCMC chain is irreducible (with respect to $\varphi$ or the Lebesgue measure) and aperiodic.
On the other hand, however, Theorem 1 does not give us any quantitative result on the convergence rate. Specifically, a Markov chain is said to be uniformly ergodic if
$$\|P^n(x, \cdot) - \pi\|_{TV} \le C\rho^{n}$$
for some constant $C < \infty$ and $\rho < 1$ not depending on $x$.
A Markov chain is said to be geometrically ergodic if
$$\|P^n(x, \cdot) - \pi\|_{TV} \le C(x)\rho^{n}$$
for some $\rho < 1$, where the constant $C(x)$ may depend on the starting point $x$.
Geometric ergodicity is extremely important for analyzing Markov chains because:
If one can give sharp bounds on the right-hand side of (2.3), practical users can calculate, before running the algorithm, how many iterations are sufficient to bring the total variation distance within any prefixed accuracy.
As discussed in Theorems 1-6 of [RR08], geometric ergodicity is closely related to the existence of a Markov chain Central Limit Theorem (CLT). Roughly speaking, for a reversible Markov chain (which is the case for Metropolis-Hastings chains), geometric ergodicity is essentially a necessary and sufficient condition for a Markov chain CLT to hold for all square-integrable functions. Therefore, a geometrically ergodic Markov chain is often considered a 'good' chain.
There are numerous studies concerning the convergence rates of Markov chains. For finite state space Markov chains, many techniques, including geometric inequalities, Fourier analysis, and multicommodity flows, have been used to bound the right-hand side of (2.2); see [Sin92], [DS91], [DS81] for examples.
For general state space Markov chains, geometric ergodicity is usually established by minorization and drift conditions, see [Ros95] [MT96] [RT96b] [MT12]. It is also worth mentioning that the spectral characterization of geometric ergodicity for reversible Markov chains is established in [RR97].
Geometric ergodicity of the M-H chain can often be established via [MT96], [RR97]. However, as we discussed above, in the doubly-intractable setting implementing the Metropolis-Hastings chain is impractical, so we are more interested in studying the convergence of the exchange algorithm.
To establish conditions that ensure uniform or geometric ergodicity, we need the concept of a small set.
A set $C$ is called small if there exist an integer $n_0 \ge 1$, $\epsilon > 0$, and a probability measure $\nu$ such that, for any $x \in C$,
$$P^{n_0}(x, \cdot) \ge \epsilon\, \nu(\cdot).$$
The inequality (2.4) is also called a minorization condition.
The following theorem about uniform ergodicity is taken from [MT12], Chapter 16.
Theorem 2 (Uniform Ergodicity).
For a Markov chain, the following are equivalent:
A minorization condition holds for the whole space $\Theta$, i.e., there exist an integer $n_0$, $\epsilon > 0$, and a probability measure $\nu$ such that $P^{n_0}(x, \cdot) \ge \epsilon\, \nu(\cdot)$ for any $x \in \Theta$.
The chain is uniformly ergodic, and $\|P^n(x, \cdot) - \pi\|_{TV} \le (1-\epsilon)^{\lfloor n/n_0 \rfloor}$.
When the state space is finite, all aperiodic, $\varphi$-irreducible Markov chains are uniformly ergodic. However, when the state space is infinite, most chains are not uniformly ergodic. The following theorem, which characterizes geometric ergodicity, is taken from Meyn and Tweedie [MT12], Chapters 15 and 16, and Roberts and Rosenthal [RR97], Proposition 1 and Theorem 2.
Theorem 3 (Geometric Ergodicity).
For a reversible Markov chain, the following are equivalent:
The chain is geometrically ergodic.
There exists a function $V \ge 1$, finite for at least one point, and a small set $C$, such that for some $\lambda < 1$, $b < \infty$:
$$PV(x) \le \lambda V(x) + b\,\mathbf{1}_C(x).$$
There exists $\rho < 1$ such that the spectral radius of $P$ on $L_0^2(\pi)$ is at most $\rho$. Here $L_0^2(\pi) = \{f : \int f\, d\pi = 0,\ \int f^2\, d\pi < \infty\}$, and $P$ is viewed as an operator on $L_0^2(\pi)$, where $\pi$ is the stationary distribution.
Furthermore, it is shown by Mengersen and Tweedie [MT96] that a symmetric random-walk Metropolis algorithm is geometrically ergodic essentially if and only if the target density
has finite exponential moments. Therefore, if the posterior distribution has exponentially decreasing tails, the symmetric random-walk Metropolis algorithm is geometrically ergodic. Since the conditions in Mengersen and Tweedie [MT96] are often easier to check and are satisfied by a large family of practical Metropolis-Hastings algorithms, [MT96] is frequently used to conclude that an MCMC algorithm is geometrically ergodic.
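As a sketch of the setting in which the Mengersen-Tweedie conditions apply, the following random-walk Metropolis sampler targets a standard normal (exponentially light tails); tuning and diagnostics are omitted, and the step size is an illustrative choice.

```python
import math
import random

def random_walk_metropolis(log_density, theta0, n_steps, step_sd, rng):
    """Symmetric random-walk Metropolis with Gaussian increments.
    For targets with exponentially light (log-concave) tails, such as the
    standard normal used below, geometric ergodicity holds."""
    theta, chain = theta0, []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        # symmetric proposal: acceptance ratio is just the density ratio
        if math.log(rng.random()) < log_density(proposal) - log_density(theta):
            theta = proposal
        chain.append(theta)
    return chain

rng = random.Random(0)
chain = random_walk_metropolis(lambda t: -0.5 * t * t, 3.0, 50000, 1.0, rng)
kept = chain[1000:]  # discard a short burn-in
```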
3 Theoretical Results
3.1 Asymptotic variance and Peskun’s ordering
We start by proving the following simple but useful lemma, indicating that the exchange chain is always less statistically efficient than the original M-H chain:
This lemma shows that the exchange algorithm is less likely to accept a move than the original M-H algorithm.
By Jensen's inequality (the function $t \mapsto \min\{1, t\}$ is concave), the acceptance probability of the exchange chain is bounded above by that of the M-H chain,
as desired. ∎
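The Jensen step can also be illustrated numerically: for any nonnegative unbiased estimator $R$ of the normalizing-constant ratio, $\mathbb{E}[\min\{1, cR\}] \le \min\{1, c\,\mathbb{E}[R]\}$. The sketch below uses a lognormal $R$ with mean one; the distribution and constant $c$ are illustrative, with $c$ playing the role of the computable part of the acceptance ratio.

```python
import math
import random

def acceptance_comparison(c, n_draws=200000, seed=0):
    """Monte Carlo illustration of the Jensen step: for a nonnegative
    unbiased estimator R with E[R] = 1 (lognormal here), the exchange-style
    acceptance E[min(1, c*R)] never exceeds the M-H-style acceptance
    min(1, c*E[R]), because t -> min(1, t) is concave."""
    rng = random.Random(seed)
    draws = [math.exp(rng.gauss(-0.5, 1.0)) for _ in range(n_draws)]
    exchange_acc = sum(min(1.0, c * r) for r in draws) / n_draws
    mh_acc = min(1.0, c * sum(draws) / n_draws)
    return exchange_acc, mh_acc
```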
Let $L_0^2(\pi)$ be the set of all square-$\pi$-integrable functions with mean $0$. Define
$$v(h, P) = \lim_{N \to \infty} \frac{1}{N}\, \mathrm{Var}\Big(\sum_{i=1}^{N} h(X_i)\Big),$$
where $\{X_i\}$ is a Markov chain with initial distribution $\pi$ and transition kernel $P$. Then
$$v(h, P_{\mathrm{MH}}) \le v(h, P_{\mathrm{EX}})$$
for all $h \in L_0^2(\pi)$.
The quantity $v(h, P)$ is also called the 'asymptotic variance'. Theorem 4 proves the original M-H chain has smaller asymptotic variance and is thus statistically more efficient than the exchange chain.
Theorem 4 is not very surprising: in each iteration of the exchange algorithm, the ratio $g(w, \theta)/g(w, \theta')$
can be viewed as a one-sample estimator of the unknown quantity $Z(\theta)/Z(\theta')$.
The standard M-H chain, on the other hand, uses $Z(\theta)/Z(\theta')$
directly, which can be viewed as an estimator with variance $0$. Therefore it is not surprising that the original M-H chain has a smaller asymptotic variance.
However, asymptotic variance is only one way of evaluating a Markov chain. Another natural criterion is the speed of convergence to the stationary distribution. Even though the exchange chain is dominated by the M-H chain in Peskun's order, the following simple example shows it is possible for the exchange chain to converge to the stationary distribution uniformly faster than the M-H chain.
Example 3 (Two point example).
Let , where the parameter space only contains two points:
Suppose the observed data consists of a single point. Suppose the prior measure on the parameter space is defined by
It is not hard to compute the posterior measure:
which is a uniform measure on the two parameter values. We further assume the proposal transition matrix equals
It is clear that all the moves of the M-H chain will be accepted, hence the M-H chain has transition matrix:
On the other hand, the transition probabilities of the exchange chain can be computed explicitly:
so its transition matrix would be
Therefore, with any initialization, the exchange chain converges to the stationary distribution after one step. However, the M-H chain never converges, as it jumps back and forth between the two parameter values.
In Example 3 above, the M-H chain never converges because it is a periodic chain, i.e., its smallest eigenvalue equals $-1$. But even if we assume the M-H chain is aperiodic, it is still possible that it converges more slowly than the exchange chain, as we could tilt the transition matrix above a little bit; for example, let the transition matrix for the M-H chain be
then the exchange chain still converges faster.
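These convergence claims can be checked numerically. The matrices below are illustrative choices consistent with the discussion: a periodic flip chain for the M-H algorithm, a one-step-mixing chain for the exchange algorithm, and a tilted (aperiodic) variant.

```python
def step(dist, P):
    """One step of a two-state chain: returns dist * P (row vector)."""
    return [dist[0] * P[0][0] + dist[1] * P[1][0],
            dist[0] * P[0][1] + dist[1] * P[1][1]]

def tv_to_uniform(dist):
    """Total variation distance to the stationary distribution (1/2, 1/2)."""
    return 0.5 * (abs(dist[0] - 0.5) + abs(dist[1] - 0.5))

P_mh = [[0.0, 1.0], [1.0, 0.0]]    # every flip accepted: periodic, never mixes
P_ex = [[0.5, 0.5], [0.5, 0.5]]    # exchange chain: exact after one step
P_tilt = [[0.1, 0.9], [0.9, 0.1]]  # tilted aperiodic chain: contraction 0.8

start = [1.0, 0.0]                 # point mass at the first parameter value
```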
In terms of asymptotic variance, the M-H chain is at least as good as the exchange chain.
In terms of distributional convergence, it is not possible to derive a general ordering between the M-H chain and the exchange chain.
The above results tell us the exchange chain might converge faster or more slowly than the original M-H chain. In the rest of this section, we will further investigate the convergence speed of the exchange chain.
3.2 Main Results
In this part, we study the convergence properties of the exchange algorithm. As the exchange algorithm can be viewed as a variant of the standard Metropolis-Hastings algorithm, it is natural to ask 'comparison-type' questions. In usual settings, people implement the exchange algorithm because the Metropolis-Hastings algorithm cannot be applied in the doubly-intractable case, and intuition suggests that the M-H algorithm converges faster. Therefore, if one knows the convergence speed of one algorithm, one might ask whether the other algorithm has a similar convergence speed. For example, one can ask questions like:
Question 1: Suppose the exchange chain is geometrically ergodic; is the original M-H chain also geometrically ergodic?
Or the reverse:
Question 2: Suppose the M-H chain is geometrically ergodic; is the exchange chain geometrically ergodic as well? If not, can we find sufficient conditions to ensure the exchange chain 'inherits' the convergence rate of the M-H chain?
We will answer both questions in the rest of this section. The second question is practically more interesting. In real settings, we can usually study the convergence rate of the M-H chain even though it is not practically implementable; theoretical guarantees for the exchange chain would then justify the usefulness of the exchange algorithm.
Before everything is rigorously stated, we summarize our results heuristically here.
(Question 1) If the exchange chain is geometrically ergodic, there is no guarantee that the M-H chain is also geometrically ergodic. In fact, Example 3 gives such a counterexample.
(Question 1) If the exchange chain is geometrically ergodic, then any 'lazy' version (defined later) of the M-H chain must also be geometrically ergodic. This indicates the exchange chain essentially cannot converge faster than the (slightly modified) original chain.
(Question 2) If the M-H chain is geometrically ergodic, we have an example showing the exchange algorithm is not necessarily geometrically ergodic.
(Question 2) If the M-H chain is geometrically ergodic, we have established sufficient conditions to ensure the geometric ergodicity of the exchange algorithm.
3.3 Exchange chain convergence
Now we are ready to study the convergence properties of the exchange chain. Though Example 3 gives a negative example, indicating the exchange chain may converge much faster than the original chain, the next theorem shows that, after making the original chain 'lazy', the original chain is no worse than the exchange chain, which answers Question 1 completely.
Suppose the exchange chain is uniformly/geometrically ergodic. Then for any $\alpha \in (0, 1)$, the lazy chain is also uniformly/geometrically ergodic. Here the lazy version of the M-H chain with kernel $P$ is defined by $P_\alpha = \alpha I + (1 - \alpha) P$; that is, at each step the chain stays at its current state with probability $\alpha$ and otherwise moves according to $P$.
First suppose the exchange chain is uniformly ergodic. Then Theorem 2 shows there exist an integer $n_0$, $\epsilon > 0$, and a probability measure $\nu$ such that, for any $x$,
On the other hand, for any measurable set and any point , if , then we have
where . The same result holds if . Therefore,
for all $x$. Thus the lazy chain is uniformly ergodic by Theorem 2.
Now suppose is geometrically ergodic. By Theorem 3,
for some .
Let and . In particular let and . Then we have
Meanwhile, as the exchange chain is dominated by the original M-H chain in Peskun's ordering, it is shown by Tierney [T98], Lemma 3, that the supremum of the spectrum of the exchange kernel is no less than the supremum of the spectrum of the M-H kernel.
, there exists such that
. The lazy version of the M-H algorithm is thus geometrically ergodic, as desired. ∎
The next result follows immediately from Theorem 5.
Suppose the M-H chain has rejection probability uniformly bounded away from $0$, and suppose the exchange chain is uniformly/geometrically ergodic. Then the original M-H chain is also uniformly/geometrically ergodic.
Theorem 5 and Corollary 1 essentially show that the exchange chain will not converge faster than the (slightly modified) original chain. On the other hand, this result does not provide much insight into the convergence speed of the exchange algorithm itself. It is therefore of practical interest to establish geometric ergodicity of the exchange algorithm directly, which provides users with "peace of mind" regarding its convergence speed. As shown in Theorem 5 and Corollary 1, geometric ergodicity of the M-H chain is essentially a necessary condition for geometric ergodicity of the exchange chain. However, the next example shows this condition is not sufficient.
Example 4 (Exponential likelihood with Gamma prior).
Suppose the likelihood is exponential, that is,
and the prior distribution is a Gamma distribution. Under this setting, it is easy to compute the posterior distribution:
which is again a Gamma distribution. Consider an independence Metropolis-Hastings sampler whose proposal is precisely this posterior, that is,
As the proposal equals the posterior, this independence Metropolis-Hastings chain converges perfectly after one step, and is therefore obviously geometrically ergodic. On the other hand, the Markov kernel of the exchange chain is:
is the only solution for equation:
for any fixed .
We can also compute the ‘rejection probability’ at point :
Notice that for each fixed $x$, as $\theta$ goes to infinity, we have
Therefore the first term of the integral goes to
as $\theta \to \infty$, by Lebesgue's dominated convergence theorem (with a suitable dominating function). The second term goes to $0$ as $\theta \to \infty$.
Therefore, we have:
It is proved by Roberts and Tweedie [RT96a] (Theorem 5.1) that if an M-H chain is geometrically ergodic, then its rejection probability is necessarily bounded away from unity. Therefore, the exchange algorithm in this example is not geometrically ergodic.
3.3.1 Geometric Ergodicity of the Exchange Algorithm: Compact Parameter Space
Example 4 answers half of Question 2: if the M-H chain is geometrically ergodic, the exchange algorithm is not necessarily geometrically ergodic. Now we focus on establishing sufficient conditions under which the exchange chain will 'inherit' the convergence rate of the M-H chain. We start with the following lemma.
Let $P_1$, $P_2$ be two reversible Markov transition kernels with the same stationary distribution on the same state space equipped with a $\sigma$-algebra. If there exists $\delta > 0$ such that $P_2(x, A) \ge \delta\, P_1(x, A)$ for any $x$ and any measurable set $A$, and $P_1$ is geometrically ergodic, then $P_2$ is geometrically ergodic.
Since $P_2$ is reversible, the theorem in [RR97] says $P_2$ is geometrically ergodic if and only if its spectral radius on $L_0^2(\pi)$ is strictly less than $1$.
We can write $P_2$ as
$$P_2 = \delta P_1 + (1 - \delta) Q,$$
where $Q$ is a valid Markov kernel.
Let and . Then we have
This proves the chain is geometrically ergodic, as desired. ∎
Lemma 2 can be used to establish sufficient conditions for 'convergence rate inheritance' of the exchange chain. Consider the set of auxiliary states
such that the likelihood ratio has a positive lower bound. If the probability of this set is bounded away from $0$ for all parameter values, then the exchange chain can 'inherit' the convergence rate of the M-H chain. Nicholls et al. proved a similar result in [NFW12], but only covered the uniformly ergodic case.
Suppose there exists such that
for any pair $(\theta, \theta')$. Suppose the M-H chain is uniformly/geometrically ergodic. Then the exchange chain is also uniformly/geometrically ergodic, respectively.
For any , the exchange chain has transition density
Lemma 1 proves for all , therefore we have for any .
If the M-H chain is uniformly ergodic, Theorem 2 shows there exist $n_0$, $\epsilon > 0$, and a probability measure $\nu$ such that the minorization condition holds for any $x$. On the other hand, we have:
so the exchange chain also satisfies the minorization condition for the whole space. Thus the exchange chain is uniformly ergodic.
If the M-H chain is geometrically ergodic, then Lemma 2 implies the exchange chain is geometrically ergodic. ∎
Theorem 6 can be applied to almost all cases where the parameter space is bounded or compact. We provide one example here:
Example 5 (Beta-Binomial model).
Consider the following Beta-Binomial example. Let the likelihood
be the binomial distribution with parameters $(n, p)$, and let
the $\mathrm{Beta}(\alpha, \beta)$ distribution
be the prior on $p$, where $\alpha, \beta$ are prefixed positive numbers. Given data $x$, we would like to sample from the posterior distribution with density
which is a $\mathrm{Beta}(x + \alpha,\ n - x + \beta)$-distributed random variable.
Consider an independence Metropolis-Hastings sampler with a proposal distribution $q$; then the M-H chain has transition kernel:
Let $p^\star$ be the maximizer of the ratio of target density to proposal density; then for all $p$ we have:
and the M-H chain is therefore uniformly ergodic by Theorem 2. Furthermore, given any pair $(p, p')$, the set in Theorem 6 has positive probability, and since all quantities take values in a compact set, there exists a constant $\delta > 0$ such that the condition of Theorem 6 holds
uniformly. Hence the exchange chain is uniformly ergodic by Theorem 6.
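A runnable sketch of the exchange algorithm for this Beta-Binomial model is given below. Parameter values and helper names are illustrative; exact auxiliary draws come from the binomial likelihood, and the binomial coefficients cancel in the acceptance ratio, so only the unnormalized terms appear.

```python
import math
import random

def exchange_beta_binomial(x, n, a, b, n_iter, seed=0):
    """Exchange algorithm for a Binomial(n, p) likelihood with a Beta(a, b)
    prior and an independent Uniform(0, 1) proposal.  The binomial
    coefficients cancel in the acceptance ratio, so only the unnormalized
    terms p^x (1 - p)^(n - x) appear; auxiliary data w are exact draws
    from Binomial(n, p')."""
    rng = random.Random(seed)

    def log_g(x_, p):
        return x_ * math.log(p) + (n - x_) * math.log(1.0 - p)

    def log_prior(p):
        return (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p)

    def draw_binomial(p):
        return sum(rng.random() < p for _ in range(n))

    p, chain = 0.5, []
    for _ in range(n_iter):
        p_new = rng.uniform(1e-12, 1.0 - 1e-12)   # independence proposal
        w = draw_binomial(p_new)                  # exact auxiliary draw
        log_ratio = (log_g(x, p_new) - log_g(x, p)
                     + log_prior(p_new) - log_prior(p)
                     + log_g(w, p) - log_g(w, p_new))
        if math.log(rng.random()) < log_ratio:
            p = p_new
        chain.append(p)
    return chain
```

Because the algorithm is exact, the chain's long-run average should match the known posterior mean $(x + \alpha)/(n + \alpha + \beta)$, which gives a direct correctness check.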
Example 5 is only for illustration and the bound is obviously crude. We will return to this example and calculate exact rates of convergence in later sections.
However, the uniform probability condition in Theorem 6 is usually too strong for general state space Markov chains. Note that
Therefore the condition directly implies a uniform bound over the whole parameter space. However, when the parameter space is unbounded (say $\mathbb{R}$), most practical models will have the two likelihoods far away from each other, i.e.,
Thus Theorem 6 cannot be directly applied in these cases. The next part gives more practical sufficient conditions which can be applied on unbounded parameter spaces.
3.3.2 Geometric Ergodicity of the Exchange Algorithm: Non-compact Parameter Space
To establish sufficient conditions on general parameter spaces, we first need the following assumptions on the M-H chain. The conditions are similar to those in [MT96][RR97] and are reasonable in practical settings. To fix ideas, we will only state the conditions for Markov chains with target distribution on $\mathbb{R}$, though all our results can be carried through to higher dimensions after appropriate modifications; see [RR97] for discussion.
We will say that a Metropolis-Hastings chain satisfies Assumption (A) if it:
has a random-walk transition kernel, that is, $q(\theta'\mid\theta) = q(|\theta' - \theta|)$ where $q$ is a continuous density,
has a target density $\pi$ which is log-concave in the tails, that is, there exist constants $\alpha > 0$ and $x_0 > 0$ such that, for all $y \ge x \ge x_0$:
$$\frac{\pi(y)}{\pi(x)} \le e^{-\alpha(y - x)},$$
and for all $y \le x \le -x_0$:
$$\frac{\pi(y)}{\pi(x)} \le e^{\alpha(y - x)}.$$
The first assumption requires a random-walk proposal kernel, which is very common in practical use; for example, in practice people often use a uniform distribution as the proposal kernel, which satisfies this assumption.
The second assumption requires the tails of the target density to be uniformly exponential or lighter. This covers many posterior distributions, such as the Gaussian, Gamma, and exponential distributions arising from typically-used statistical models. As we will see later, the second assumption is also satisfied by Ising models and exponential random graph models with Gaussian priors. Indeed, it is proved in [MT96] that this condition is essentially necessary for a random-walk Metropolis-Hastings algorithm to be geometrically ergodic.
The following theorem gives a sufficient condition for the geometric ergodicity of the exchange algorithm.
Let the M-H chain be a Metropolis-Hastings chain on $\mathbb{R}$ with the posterior distribution as stationary distribution. Suppose the chain satisfies the assumptions above. Furthermore, assume there exists a non-constant continuous function such that