 # Estimating Convergence of Markov chains with L-Lag Couplings

Markov chain Monte Carlo (MCMC) methods generate samples that are asymptotically distributed from a target distribution of interest as the number of iterations goes to infinity. Various theoretical results provide upper bounds on the distance between the target and the marginal distribution of the chain after a fixed number of iterations. These upper bounds are derived on a case-by-case basis and typically involve intractable quantities, which limits their use for practitioners. We introduce L-lag couplings to generate computable, non-asymptotic upper bound estimates for the total variation or the Wasserstein distance of general Markov chains. We apply L-lag couplings to three tasks: (i) determining MCMC burn-in with guarantees; (ii) comparing different MCMC algorithms targeting the same distribution; (iii) comparing exact and approximate MCMC algorithms, such as Metropolis-adjusted Langevin and unadjusted Langevin.


## 1 Introduction

Markov chain Monte Carlo (MCMC) algorithms generate Markov chains that are invariant with respect to probability distributions that we wish to approximate. A large body of work helps in understanding the convergence of these Markov chains to their invariant distributions. Such results are typically phrased as follows. Denote by πt the marginal distribution of the chain at time t and by π the invariant distribution. The discrepancy between πt and π can be measured with any distance between probability distributions, typically the total variation distance or the Wasserstein distance in the MCMC literature. Various results provide upper bounds on this distance, of the form d(πt, π) ≤ C f(t), where C depends on π0 but not on t, and where f(t) decreases to zero as t goes to infinity, typically geometrically; see Section 3 in Roberts and Rosenthal (2004) for a gentle survey, and Durmus et al. (2016); Dalalyan (2017); Dwivedi et al. (2018) for recent examples. These results help in understanding and comparing algorithms, in particular by informing on the sensitivity of convergence rates to the dimension of the state space or to various features of the target distribution. Unfortunately these results do not provide computable bounds on the distance between πt and π in practice, as they typically feature unknown constants.

Practitioners resort to a variety of other results to assess the quality of MCMC estimates. Numerous convergence diagnostics rely on statistical hypothesis tests of the null hypothesis that the chains are stationary (e.g. Gelman and Rubin (1992b); Geweke (1998); Gelman and Brooks (1998); Kass et al. (1998); Chapter 8 of Robert and Casella (2013)). It is also customary to compute and report effective sample sizes, which are based on consistent estimates of the asymptotic variance of certain functions of the chains, e.g. Vats et al. (2019). Fewer tools provide computable bounds on the distance between πt and π for a fixed t, although some are mentioned in Brooks and Roberts (1998) for the case of Gibbs samplers with tractable transition kernels. A notable exception is the technique proposed in Johnson (1996) (see also Johnson (1998)), which relies on coupled Markov chains. The method presented here is based on similar ideas, and we discuss similarities and differences in Section 2.

We propose to use L-lag couplings to estimate the distance between πt and π for a fixed time t. We build upon 1-lag couplings used to obtain unbiased estimators in Jacob et al. (2017); Glynn and Rhee (2014). The discussion of Jacob et al. (2017) mentions that upper bounds on the total variation distance between πt and π can be obtained with such couplings. We generalize the idea with L-lag couplings, which are as simple to generate and provide sharper bounds, particularly for small values of t. The proposed technique also provides upper bounds on a class of integral probability metrics (Sriperumbudur et al., 2012) that includes the 1-Wasserstein distance. We demonstrate numerically that the proposed bounds provide a practical assessment of convergence for various popular MCMC algorithms, on discrete or continuous and possibly high-dimensional spaces. The proposed bounds can be used to 1) determine burn-in periods for MCMC estimates, 2) compare different MCMC algorithms targeting the same distribution, and 3) compare exact and approximate MCMC algorithms, such as unadjusted and Metropolis-adjusted Langevin algorithms, providing a computational companion to studies such as Dwivedi et al. (2018).

In Section 2 we introduce L-lag couplings to estimate metrics between marginal distributions of a Markov chain and its invariant distribution. We illustrate the method on simple examples, discuss the choice of the lag L and compare with the approach of Johnson (1996). In Section 3 we consider applications including Gibbs samplers on the Ising model and gradient-based MCMC algorithms on log-concave targets. Scripts in R are available online.

## 2 L-lag couplings

Consider two Markov chains (Xt)t≥0 and (Yt)t≥0, each with the same initial distribution π0 and Markov kernel P on a space X, such that P is π-invariant. Introduce a joint Markov transition kernel P̄ on X × X such that, for all (x, y) and all measurable sets A, P̄((x, y), A × X) = P(x, A) and P̄((x, y), X × A) = P(y, A). Choose some integer L ≥ 1 as the lag parameter. We generate the two chains as follows: first sample X0 ∼ π0 and Xt ∼ P(Xt−1, ·) for t = 1, …, L, and sample Y0 ∼ π0. Then generate (Xt+L, Yt) given (Xt+L−1, Yt−1), for t ≥ 1, by sampling from P̄. The construction ensures that Xt and Yt have the same marginal distribution at all times t. Furthermore, we will choose the joint kernel P̄ such that the pair of chains can meet after a random number of steps, i.e. the meeting time τ(L) := inf{t ≥ L : Xt = Yt−L} is almost surely finite. Finally we assume that the chains remain faithful after meeting, i.e. Xt = Yt−L for all t ≥ τ(L). This construction is illustrated in Figure 1.
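The generation of the lagged pair up to its meeting time can be sketched as follows. This is an illustration, not the paper's reference implementation: `sample_pi0`, `kernel` and `coupled_kernel` are placeholders for a user-supplied initial distribution π0, marginal kernel P and joint kernel P̄, and the toy reset chain at the end is a made-up example where the coupled kernel uses common random numbers.

```python
import numpy as np

def sample_meeting_time(sample_pi0, kernel, coupled_kernel, L, rng, max_iter=10**6):
    """Run an L-lag coupled pair of chains until X_t = Y_{t-L}; return tau^(L)."""
    x = sample_pi0(rng)
    for _ in range(L):                      # advance X alone for L steps
        x = kernel(x, rng)
    y = sample_pi0(rng)                     # Y starts L steps behind X
    t = L
    while not np.allclose(x, y):
        x, y = coupled_kernel(x, y, rng)    # one draw from the joint kernel P-bar
        t += 1
        if t > max_iter:
            raise RuntimeError("chains did not meet within max_iter steps")
    return t                                # the meeting time tau^(L)

# Toy illustration: a chain that resets to 0 with probability 1/2; the coupled
# kernel shares the uniform draw, so the pair meets at the first common reset.
rng = np.random.default_rng(0)
tau = sample_meeting_time(
    sample_pi0=lambda r: float(r.integers(1, 10)),
    kernel=lambda x, r: 0.0 if r.random() < 0.5 else x,
    coupled_kernel=lambda x, y, r: (0.0, 0.0) if r.random() < 0.5 else (x, y),
    L=5, rng=rng)
```

By construction τ(L) ≥ L, and faithfulness holds here because once both chains sit at 0 the shared randomness keeps them equal.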

Various constructions for P̄ have been derived in the literature: coupled Metropolis–Hastings and Gibbs kernels in Johnson (1996); Jacob et al. (2017), coupled Hamiltonian Monte Carlo kernels in Mangoubi and Smith (2017); Bou-Rabee et al. (2018); Heng and Jacob (2019), and coupled particle Gibbs samplers in Jacob et al. (2019); Middleton et al. (2019). Given π0, P̄, and a lag L, Algorithm 1 summarizes how to sample the process in Figure 1 up to the meeting time τ(L).

The main contribution of this article is to show how Algorithm 1 can be used to upper bound various distances between πt and π. We first introduce integral probability metrics (IPMs, e.g. Sriperumbudur et al. (2012)).

###### Definition 2.1.

(Integral Probability Metric). Let H be a class of bounded functions on a measurable space X. Then the corresponding IPM dH is defined as:

 dH(P,Q) := sup_{h∈H} | E_{X∼P}[h(X)] − E_{X∼Q}[h(X)] |. (1)

for all probability measures P and Q on X. With H the class of functions bounded by 1 and a multiplicative factor of 1/2, dH is referred to as the total variation distance and denoted by dTV. With H the class of functions satisfying |h(x) − h(y)| ≤ ‖x − y‖1 for all x, y, dH is referred to as the 1-Wasserstein distance and denoted by dW. Here ‖·‖1 refers to the L1 norm on ℝ^d.

Our proposed method applies to IPMs such that we can compute a function MH on X × X satisfying h(x) − h(y) ≤ MH(x, y) for all h ∈ H and all (x, y). For the total variation distance we have MTV(x, y) = 1(x ≠ y), and for the 1-Wasserstein distance MW(x, y) = ‖x − y‖1. In the following, H refers to a class of functions for which we can compute such an MH.
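For intuition, both metrics admit closed forms for small discrete distributions on the real line; the following sketch uses a hypothetical three-point example (the support points and probabilities are ours) together with the standard expressions for dTV and dW.

```python
import numpy as np

# Hypothetical distributions p and q on the support points 0, 1, 2.
support = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

# Total variation: with H the functions bounded by 1 and a factor 1/2,
# d_TV(p, q) = (1/2) * sum_i |p_i - q_i|.
d_tv = 0.5 * np.abs(p - q).sum()           # ≈ 0.3

# 1-Wasserstein on the real line: with H the 1-Lipschitz functions,
# d_W(p, q) equals the integral of |F_p - F_q| between support points.
cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
d_w = np.sum(cdf_gap * np.diff(support))   # ≈ 0.6
```

Note that d_tv ≤ d_w here only because the support points are at least distance 1 apart; in general the two metrics are not ordered.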

With a similar motivation for the assessment of sample approximations, and not restricted to the MCMC setting, Gorham and Mackey (2015) consider choices of function sets that facilitate the computation of integral probability metrics. Here we focus on the MCMC setting and directly aim at upper bounds on the total variation and 1-Wasserstein distances.

### 2.1 Main results

We make the following three assumptions, similar to those of Jacob et al. (2017).

###### Assumption 2.2.

(Marginal convergence and moments). For all h ∈ H, E[h(Xt)] → E_{X∼π}[h(X)] as t → ∞. Also, there exist η > 0 and D < ∞ such that E[|h(Xt)|^{2+η}] ≤ D for all h ∈ H and all t ≥ 0.

The above assumption concerns the marginal convergence of the MCMC algorithm and the moments of the associated chains. The next assumptions concern the coupling operated by the joint kernel P̄.

###### Assumption 2.3.

(Sub-exponential tails of meeting times). The chains are such that the meeting time τ(L) satisfies P(τ(L) > t) ≤ Cδ^t for all t, for some constants C < ∞ and δ ∈ (0, 1).

The above assumption can be relaxed to allow for polynomial tails as in Middleton et al. (2018). The final assumption on faithfulness is typically satisfied by design.

###### Assumption 2.4.

(Faithfulness). The chains stay together after meeting: Xt = Yt−L for all t ≥ τ(L).

We assume that the three assumptions above hold in the rest of the article. Theorem 2.5 is our main result, describing a computable upper bound on dH(πt, π).

###### Theorem 2.5.

(Upper bounds). For an IPM dH with function set H and upper bound MH, with the Markov chains satisfying the above assumptions, for any lag L ≥ 1 and any t ≥ 0,

 dH(πt,π) ≤ E[ ∑_{j=1}^{⌈(τ(L)−L−t)/L⌉} MH(X_{t+jL}, Y_{t+(j−1)L}) ]. (2)

The notation 0 ∨ ⌈x⌉ refers to the maximum between 0 and the smallest integer above x, for x ∈ ℝ. When t ≥ τ(L) − L, the sum in inequality (2) is set to zero by convention. From this theorem we obtain the following simplifications for dTV and dW:

 dTV(πt,π) ≤ E[ 0 ∨ ⌈(τ(L)−L−t)/L⌉ ], (3)
 dW(πt,π) ≤ E[ ∑_{j=1}^{⌈(τ(L)−L−t)/L⌉} ‖X_{t+jL} − Y_{t+(j−1)L}‖1 ]. (4)

For the total variation result, the boundedness part of Assumption 2.2 is directly satisfied. For the Wasserstein distance result, the boundedness part of Assumption 2.2 is equivalent to a uniform bound on moments of order 2 + η of the marginal distributions, for some η > 0.

We emphasise that the proposed bounds can be estimated directly, by running Algorithm 1 independently a number of times and estimating the expectations in (3) and (4) by empirical averages.
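For the total variation bound (3), this estimation step is a one-line computation once meeting times have been collected; a minimal sketch, assuming the meeting times below came from independent runs of Algorithm 1 (the numbers are hypothetical):

```python
import numpy as np

def tv_upper_bound(meeting_times, L, t):
    """Empirical version of bound (3): average of max(0, ceil((tau - L - t) / L))
    over independent realizations of the L-lag meeting time tau^(L)."""
    taus = np.asarray(meeting_times, dtype=float)
    return np.maximum(0.0, np.ceil((taus - L - t) / L)).mean()

# Hypothetical meeting times from four runs of Algorithm 1 with lag L = 10.
taus = [25, 33, 41, 12]
print(tv_upper_bound(taus, L=10, t=0))   # mean of [2, 3, 4, 1] = 2.5
print(tv_upper_bound(taus, L=10, t=50))  # 0.0: every chain met before t = 50
```

The bound exceeds 1 for small t, which is expected: it is an upper bound on a quantity in [0, 1], and only becomes informative once it drops below 1.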

### 2.2 Stylized examples

#### 2.2.1 A univariate Normal

We first consider a toy example, where we can compute total variation and 1-Wasserstein distances exactly. The target is N(0,1) and the MCMC kernel is that of a Normal random-walk Metropolis–Hastings (MH) algorithm with step-size σMH. We set the initial distribution π0 to be a point mass at 10. The joint kernel relies on a maximal coupling of the proposal distributions and common random numbers for the uniform variable used in the acceptance step; all details of the couplings implemented in the article are given in the supplementary material. Figure 2 shows the evolution of the marginal distribution of the chain, and the total variation and 1-Wasserstein upper bounds, computed for different lags L up to L = 150. For each t, N independent runs of Algorithm 1 were performed to approximate the bounds in Theorem 2.5 by empirical averages. Exact distances are shown for comparison. Tighter bounds are obtained with larger values of L. Indeed a larger L results in the use of fewer triangle inequalities in the derivation of the bounds; see the proof of Theorem 2.5. The choice of L is discussed again in Section 2.3.

Figure 2: Upper bounds on the total variation and the 1-Wasserstein distance between πt and π, for a Metropolis–Hastings algorithm targeting N(0,1) and starting from a Dirac mass at 10. With L=150 the estimated upper bounds for the 1-Wasserstein distances are very close to the exact distances.

#### 2.2.2 A bimodal target

We consider a bimodal target to illustrate the limitations of the proposed technique. The target is a bimodal mixture of Normals, as in Section 5.1 of Jacob et al. (2017). The MCMC algorithm is again random-walk MH, with π0 ∼ N(10,1) and σMH = 1. With these settings, chains struggle to jump between the modes, as seen in Figure 3 (Left), which shows a histogram of the 500th marginal distribution from 1000 independent chains. Figure 3 (Right) shows the TV upper bound estimates for lags L = 1 and L = 18000 (considered a very large value here), obtained with N ∈ {1000, 5000, 10000} independent runs of Algorithm 1.

With L = 18000, we do not see a difference between the upper bounds obtained for the different values of N, which suggests that the variance of the estimators is small. In contrast, the dashed-line bounds corresponding to lag L = 1 are very different. This is because, over N = 1000 experiments, the 1-lag meetings always occurred quickly in the mode nearest to the initial distribution. However, over N = 5000 and N = 10000 experiments, there were instances where one of the two chains jumped to the other mode before meeting, resulting in a much longer meeting time. Thus the results obtained with N = 1000 repeats can be misleading. This is a manifestation of the estimation error associated with empirical averages, which are not guaranteed to be accurate after any fixed number of repeats. The shape of the bounds obtained with L = 18000, with a plateau, highlights how the iterates of the chains first visit one of the modes, and then both.

Figure 3: MH algorithm with π0∼N(10,1), σMH=1 on a bimodal target. Left: Histogram of the 500th marginal distribution from 1000 independent chains, and target density in full line; Right: Total variation bounds obtained with lags L∈{1,18000} and N∈{1000,5000,10000} independent runs of Algorithm 1.

### 2.3 Choice of lag L

Section 2.2.2 highlights the importance of the choice of lag in the upper bound estimates. The upper bounds get tighter with larger lag L, but the computation time of the coupling algorithm (Algorithm 1) increases with L. Furthermore, the benefits of increasing L diminish: when L is very large, we can consider XL to be at stationarity, while Y0 follows π0. Then the distribution of the meeting times still depends on the joint kernel P̄, which is typically not a maximal coupling of the two chains (Thorisson, 1986). Thus the upper bounds do not converge to the exact TV distance as L increases.

To determine a large enough lag L, we can start from the intuition that we want the total variation upper bound estimates to take values in [0, 1], including the bound at t = 0. This motivates the general strategy of starting with a small L, plotting the bounds as in Figure 2, and increasing L until the estimated upper bound at t = 0 is close to 1.

### 2.4 Comparison with Johnson’s diagnostics

The proposed approach is similar to that of Johnson (1996), which we now recall. A number N of chains start from π0 and evolve jointly (without time lags), in such a way that they all coincide exactly after a random number of steps T. Marginally each chain evolves according to the kernel P, just as in the proposed construction. If we assume that any draw from π0 would be accepted as a draw from π in a rejection sampler with a certain probability, then the main result of Johnson (1996) provides an upper bound on dTV(πt, π) in terms of the distribution of T, the number of chains N and that probability. As N increases, for any t the upper bound approaches P(T > t), which itself is small if t is a large quantile of the meeting time T. A limitation of this otherwise remarkable result is its reliance on the rejection-sampling probability, which might be unknown or very close to one in some settings. Another difference in our approach is that we rely on pairs of chains, rather than on N chains coupled together. If N were chosen to be large, e.g. if the rejection-sampling probability is known to be close to one, then the memory and computing cost of simulating N jointly coupled chains might become problematic.

## 3 Experiments and applications

### 3.1 Ising model

We consider an Ising model, where the target probability distribution is defined on a large discrete state space, namely a square lattice with 32 × 32 sites (each site has 4 neighbors) and periodic boundaries. For a state x, we define the target probability π(x) ∝ exp(β ∑_{i∼j} x_i x_j), where the sum is over all pairs of neighboring sites and β ≥ 0 is an inverse temperature. As β increases, the correlation between nearby sites increases and single-site Gibbs samplers are known to perform worse (Mossel and Sly, 2013). Difficulties in the assessment of the convergence of these samplers are partly due to the discrete nature of the state space, which limits the possibilities of visual diagnostics. Users might observe trace plots of one-dimensional statistics of the chains, such as ∑_{i∼j} x_i x_j, and declare convergence when the statistic seems to stabilize; see Titsias and Yau (2017); Zanella (2019), where trace plots of summary statistics are used to monitor MCMC algorithms in discrete state spaces. Such procedures can however provide a "false sense of security", as described in Gelman and Rubin (1992a), because chains can be stuck in local modes.

Here we compute the proposed upper bounds for the TV distance between πt and π. We do so for two algorithms: a single-site Gibbs sampler (SSG) and a parallel tempering (PT) algorithm, in which different chains target π for different inverse temperatures with SSG updates, and regularly attempt to swap their states across chains (Geyer, 1991; Syed et al., 2019). The initial distribution assigns −1 and +1 with equal probability to each site independently. The full algorithmic descriptions are available in the supplementary material. For β = 0.46, we obtain TV bounds for SSG using a lag L and N independent repeats. For PT we use 12 chains, each targeting π_β for β on an equispaced grid of inverse temperatures, with a fixed frequency of swap moves and its own lag. The results are in Figure 4, where we see a plateau for the TV bounds on the SSG algorithm, and where the faster convergence of PT is apparent. Note that the targets are different for the two algorithms, as PT operates on an extended space. We could also normalize the horizontal axis by the per-iteration costs of each algorithm, or use units of actual time instead of iterations.

Figure 4: Single-site Gibbs (SSG) versus Parallel Tempering (PT) for an Ising model; bounds on the total variation distance between πt and π, for the inverse temperature β=0.46.

### 3.2 Logistic regression

We next consider a target on a continuous state space, defined as the posterior distribution in a Bayesian logistic regression. Consider the German Credit data from Lichman (2013). There are binary responses indicating whether individuals are creditworthy or not creditworthy, and covariates for each individual, so that the posterior distribution of the regression parameter is 49-dimensional. The logistic regression model states that the probability of creditworthiness is the logistic function of a linear combination of the covariates, with a Normal prior on the regression parameter. We can sample from the posterior using Hamiltonian Monte Carlo (HMC, Neal (1993)) and the Pólya-Gamma (PG) sampler (Polson et al., 2013). Recall that HMC involves tuning parameters corresponding to a step size and a number of steps in the leapfrog scheme that approximates the solution of Hamilton's equations at every iteration. We can use the proposed non-asymptotic bounds to compare the convergence to stationarity associated with different tuning choices, and to compare them with the PG sampler. The algorithmic details on the couplings implemented here are in the supplementary material. We emphasize that the tuning parameters associated with the fastest convergence to stationarity might not necessarily be optimal in terms of asymptotic variance of ergodic averages of functions of interest; see related discussions in Heng and Jacob (2019). Figure 5 shows the total variation bounds for HMC with ϵ = 0.025 and 4 to 7 leapfrog steps, and the corresponding bound for the parameter-free PG sampler. In this example, we find that the upper bounds are smaller for the PG sampler than for any of the HMC samplers under consideration. We can expect that in higher-dimensional settings the TV bounds might be smaller for HMC. Our main message is that the proposed bounds can provide a non-asymptotic comparison of MCMC algorithms.

Figure 5: Proposed upper bounds on dTV(πt,π) for a Pólya-Gamma Gibbs sampler and for Hamiltonian Monte Carlo on a 49-dimensional posterior distribution in a logistic regression model. For HMC LHMC=ϵ×L with ϵ=0.025 and L=4,5,6,7.

### 3.3 Comparison of exact and approximate MCMC algorithms

In various settings, approximate MCMC methods trade off asymptotic unbiasedness for gains in computational speed. They have gained popularity in machine learning, especially in the presence of large datasets (e.g. Welling and Teh (2011); Johndrow et al. (2017); Rudolf and Schweizer (2018); Dalalyan and Karagulyan (2019)). We compare an approximate MCMC method (the unadjusted Langevin algorithm, ULA) with its exact counterpart (the Metropolis-adjusted Langevin algorithm, MALA) in various dimensions. Our target is a multivariate Normal:

 π = N(0, Σ), where [Σ]_{i,j} = 0.5^{|i−j|} for 1 ≤ i, j ≤ d.

Both MALA and ULA chains start from the same initial distribution. Step-sizes are linked to an optimal scaling result of Roberts and Rosenthal (2001), and an additional 0.1 multiplicative factor for ULA ensures that the stationary distribution of ULA is close to π (see Dalalyan (2017)). We can use L-lag couplings to study the mixing times of the two algorithms, where tmix(ε) := min{t : dTV(πt, π) < ε}. Figure 6 highlights how the dimension impacts the estimated upper bounds on the mixing time tmix(0.25), calculated as the first t for which the empirical average of the bound (3) falls below 0.25. This can be related to theoretical analyses such as Dwivedi et al. (2018). For a strongly log-concave target such as π, Table 2 of Dwivedi et al. (2018) indicates mixing time upper bounds that grow polynomially with the dimension for ULA and MALA (with a non-warm start centred at the unique mode of the target distribution). This is consistent with Figure 6. In comparison with the theoretical studies in Dalalyan (2017); Dwivedi et al. (2018), our bounds can be directly estimated and require no extra knowledge of the target or the algorithm, other than the capacity to simulate coupled chains. On the other hand, from a qualitative perspective the bounds in Dalalyan (2017); Dwivedi et al. (2018) are much more explicit about the impact of different aspects of the problem: dimension, step-size and features of the target.

Figure 6: Mixing time bounds for ULA and MALA targeting a multivariate Normal distribution, as a function of the dimension. Here the mixing time tmix(0.25) denotes the first iteration t for which the estimated TV between πt and π is less than 0.25.
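Extracting a mixing time estimate from a curve of estimated TV upper bounds reduces to a first-crossing computation; a minimal sketch, with hypothetical bound values:

```python
import numpy as np

def estimated_mixing_time(tv_bound_estimates, eps=0.25):
    """First time t at which the estimated TV upper bound drops below eps.

    `tv_bound_estimates[t]` is assumed to hold the empirical bound (3) at time t,
    e.g. obtained by averaging over independent runs of Algorithm 1."""
    below = np.asarray(tv_bound_estimates) < eps
    if not below.any():
        raise ValueError("bound never drops below eps; increase the time horizon")
    return int(np.argmax(below))   # index of the first True entry

# Hypothetical decreasing bound estimates over times t = 0, 1, ..., 6.
bounds = [1.0, 0.9, 0.6, 0.4, 0.2, 0.1, 0.05]
print(estimated_mixing_time(bounds))  # 4, the first t with bound < 0.25
```

Because the bound is an upper bound on dTV(πt, π), the resulting value is itself an upper bound estimate of the true mixing time, up to the Monte Carlo error of the empirical averages.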

## 4 Discussion

We propose to use L-lag couplings to compute non-asymptotic upper bounds on the total variation and the 1-Wasserstein distance between the marginal distribution πt of a Markov chain at time t and its stationary distribution π. Our method can be used to obtain guidance on the choice of burn-in, to compare different MCMC algorithms targeting the same distribution, and to compare mixing times of approximate and exact MCMC methods. The main requirement for the application of the method is the ability to generate coupled Markov chains that meet exactly after a random but almost surely finite number of iterations.

The bounds are not tight, in part because the couplings are not maximal (Thorisson, 1986), but experiments suggest that the bounds are practical. In particular the proposed bounds always go to zero as t increases, making them informative at least for large enough t. The method of Johnson (1996) is closely related but involves a quantity that is often intractable. The combination of the use of time lags with more than two chains as in Johnson (1996) could lead to new diagnostics.

##### Acknowledgments.

The authors are grateful to Espen Bernton, John O’Leary and James Scott for helpful comments. The second author gratefully acknowledges support by the National Science Foundation through grant DMS-1712872. The figures were created with packages made by Wilke (2017); Wickham (2016) in R Core Team (2013).

## References

• Bou-Rabee et al.  N. Bou-Rabee, A. Eberle, and R. Zimmer. Coupling and convergence for Hamiltonian Monte Carlo. arXiv preprint arXiv:1805.00452, 2018.
• Brooks and Roberts  S. P. Brooks and G. O. Roberts. Assessing convergence of Markov chain Monte Carlo algorithms. Statistics and Computing, 8(4):319–335, 1998.
• Dalalyan  A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017. doi: 10.1111/rssb.12183.
• Dalalyan and Karagulyan  A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 2019.
• Durmus et al.  A. Durmus, G. Fort, and É. Moulines. Subgeometric rates of convergence in Wasserstein distance for Markov chains. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 52, pages 1799–1822. Institut Henri Poincaré, 2016.
• Dwivedi et al.  R. Dwivedi, Y. Chen, M. J. Wainwright, and B. Yu. Log-concave sampling: Metropolis–Hastings algorithms are fast! In S. Bubeck, V. Perchet, and P. Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 793–797. PMLR, 06–09 Jul 2018.
• Gelman and Brooks  A. Gelman and S. P. Brooks. General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4):434–455, 1998.
• Gelman and Rubin [1992a] A. Gelman and D. B. Rubin. A single series from the Gibbs sampler provides a false sense of security. Bayesian statistics, 4:625–631, 1992a.
• Gelman and Rubin [1992b] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992b.
• Geweke  J. Geweke. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. Bayesian Statistics, 4:169–193, 1998.
• Geyer  C. J. Geyer. Markov chain Monte Carlo maximum likelihood. Technical report, University of Minnesota, School of Statistics, 1991.
• Glynn and Rhee  P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. Journal of Applied Probability, 51(A):377–389, 2014.
• Gorham and Mackey  J. Gorham and L. Mackey. Measuring sample quality with Stein's method. Advances in Neural Information Processing Systems, 28:226–234, 2015.
• Heng and Jacob  J. Heng and P. E. Jacob. Unbiased Hamiltonian Monte Carlo with couplings. Biometrika, 106(2):287–302, 2019.
• Jacob et al.  P. E. Jacob, J. O’Leary, and Y. F. Atchadé. Unbiased Markov chain Monte Carlo with couplings. 2017. arXiv preprint arXiv:1708.03625v4.
• Jacob et al.  P. E. Jacob, F. Lindsten, and T. B. Schön. Smoothing with couplings of conditional particle filters. Journal of the American Statistical Association, 0(0):1–20, 2019.
• Johndrow et al.  J. E. Johndrow, P. Orenstein, and A. Bhattacharya. Scalable MCMC for Bayes shrinkage priors. arXiv preprint arXiv:1705.00841, 2017.
• Johnson  V. E. Johnson. Studying convergence of Markov chain Monte Carlo algorithms using coupled sample paths. Journal of the American Statistical Association, 91(433):154–166, 1996.
• Johnson  V. E. Johnson. A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. Journal of the American Statistical Association, 93(441):238–248, 1998.
• Kass et al.  R. E. Kass, B. P. Carlin, A. Gelman, and R. M. Neal. Markov chain Monte Carlo in practice: a roundtable discussion. The American Statistician, 52(2):93–100, 1998.
• Lichman  M. Lichman. UCI machine learning repository, 2013.
• Mangoubi and Smith  O. Mangoubi and A. Smith. Rapid mixing of Hamiltonian Monte Carlo on strongly log-concave distributions. arXiv preprint arXiv:1708.07114, 2017.
• Middleton et al.  L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased Markov chain Monte Carlo for intractable target distributions. arXiv preprint arXiv:1807.08691, 2018.
• Middleton et al.  L. Middleton, G. Deligiannidis, A. Doucet, and P. E. Jacob. Unbiased smoothing using particle independent Metropolis-Hastings. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89 of Proceedings of Machine Learning Research, pages 2378–2387. PMLR, 16–18 Apr 2019.
• Mossel and Sly  E. Mossel and A. Sly. Exact thresholds for Ising–Gibbs samplers on general graphs. The Annals of Probability, 41(1):294–328, 2013.
• Neal  R. M. Neal. Bayesian learning via stochastic dynamics. Advances in neural information processing systems, pages 475–482, 1993.
• Polson et al.  N. G. Polson, J. G. Scott, and J. Windle. Bayesian inference for logistic models using Polya-Gamma latent variables. Journal of the American Statistical Association, 108(504):1339–1349, 2013.
• R Core Team  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2013.
• Robert and Casella  C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer New York, 2013.
• Roberts and Rosenthal  G. O. Roberts and J. S. Rosenthal. Optimal scaling for various Metropolis–Hastings algorithms. Statist. Sci., 16(4):351–367, 11 2001.
• Roberts and Rosenthal  G. O. Roberts and J. S. Rosenthal. General state space Markov chains and MCMC algorithms. Probab. Surveys, 1:20–71, 2004.
• Rudolf and Schweizer  D. Rudolf and N. Schweizer. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.
• Sriperumbudur et al.  B. K. Sriperumbudur, K. Fukumizu, A. Gretton, B. Schölkopf, and G. R. Lanckriet. On the empirical estimation of integral probability metrics. Electronic Journal of Statistics, 6:1550–1599, 2012.
• Syed et al.  S. Syed, A. Bouchard-Côté, G. Deligiannidis, and A. Doucet. Non-reversible parallel tempering: an embarrassingly parallel MCMC scheme. arXiv preprint arXiv:1905.02939, 2019.
• Thorisson  H. Thorisson. On maximal and distributional coupling. The Annals of Probability, pages 873–876, 1986.
• Thorisson  H. Thorisson. Coupling, stationarity, and regeneration. Springer New York, 2000.
• Titsias and Yau  M. K. Titsias and C. Yau. The Hamming ball sampler. Journal of the American Statistical Association, 112(520):1598–1611, 2017.
• Vats et al.  D. Vats, J. M. Flegal, and G. L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 04 2019.
• Vihola  M. Vihola. Unbiased estimators and multilevel Monte Carlo. Operations Research, 66(2):448–462, 2017.
• Welling and Teh  M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. International Conference on Machine Learning, 2011.
• Wickham  H. Wickham. ggplot2: elegant graphics for data analysis. Springer, 2016.
• Wilke  C. O. Wilke. ggridges: Ridgeline plots in 'ggplot2'. R package version 0.4.1, 2017.
• Zanella  G. Zanella. Informed proposals for local MCMC in discrete spaces. Journal of the American Statistical Association, (just-accepted):1–35, 2019.

## Appendix A Proofs

### A.1 L-lag unbiased estimators

Our motivation for Theorem 2.5 comes from recent works on the removal of the burn-in bias of MCMC estimators using couplings [Jacob et al., 2017, Glynn and Rhee, 2014]. In particular, extending the unbiased estimator from Jacob et al. [2017] that corresponds to a lag L = 1, we first construct the L-lag estimator, with an arbitrary L ≥ 1, as

 H^{(L)}_t(X,Y) := h(Xt) + ∑_{j=1}^{⌈(τ(L)−L−t)/L⌉} ( h(X_{t+jL}) − h(Y_{t+(j−1)L}) ), (5)

where h ∈ H, and the chains (Xt)t≥0, (Yt)t≥0 marginally have the same initial distribution π0 and Markov transition kernel P with invariant distribution π, while jointly following the L-lag coupling algorithm (Algorithm 1). Following the proof technique for the 1-lag estimator in Jacob et al. [2017], we first prove an unbiasedness result for H^{(L)}_t.

###### Proposition A.1.

Under Assumptions 2.2, 2.3 and 2.4 of the main article, H^{(L)}_t(X,Y) has expectation E_{X∼π}[h(X)], finite variance, and finite expected computing time.

###### Proof.

The proof is nearly identical to those in Vihola [2017], Glynn and Rhee [2014], Jacob et al. [2017] and related articles, and is only reproduced here for completeness. Let t = 0 without loss of generality; otherwise start the chains at time t rather than at time 0. Secondly, we can focus on the component-wise behaviour of h and assume that h takes values in ℝ without loss of generality. For simplicity of notation we drop the superscript (L) and write τ to denote τ(L).

Define Δj := h(X_{jL}) − h(Y_{(j−1)L}) for j ≥ 1, and H^n_0(X,Y) := h(X0) + ∑_{j=1}^n Δj. By Assumption 2.3, E[τ] < ∞, so the computation time has finite expectation. When jL ≥ τ, Δj = 0 by faithfulness (Assumption 2.4). As τ < ∞ almost surely, this implies H^n_0(X,Y) → H^{(L)}_0(X,Y) almost surely as n → ∞.

We now show that (H^n_0(X,Y))_{n≥0} is a Cauchy sequence in L2, the space of random variables with finite first two moments, by showing

 sup_{n′≥n} E[ (H^{n′}_0(X,Y) − H^n_0(X,Y))² ] → 0 as n → ∞.

This follows by direct calculation. Firstly by Cauchy–Schwarz,

 E[ (H^{n′}_0(X,Y) − H^n_0(X,Y))² ] = ∑_{s=n+1}^{n′} ∑_{t=n+1}^{n′} E[ΔsΔt] ≤ ∑_{s=n+1}^{n′} ∑_{t=n+1}^{n′} E[Δs²]^{1/2} E[Δt²]^{1/2}.

By Hölder's inequality with conjugate exponents 1 + η/2 and (2 + η)/η, and Assumptions 2.2–2.3, for any t,

 E[Δt²] = E[ Δt² 1(τ > (1+t)L) ] ≤ E[ |Δt|^{2+η} ]^{1/(1+η/2)} E[ 1(τ > (1+t)L) ]^{η/(2+η)},

where the first factor on the right-hand side is bounded by Assumption 2.2. Overall this implies E[Δt²] ≤ Cδ^t for some C < ∞ and δ ∈ (0, 1), for all t. Hence (H^n_0(X,Y))_{n≥0} is a Cauchy sequence in L2, and H^{(L)}_0(X,Y) has finite first and second moments. Recall that Cauchy sequences in L2 are bounded in L2, so we can apply the dominated convergence theorem to get

 E[H^{(L)}_0(X,Y)] = E[ lim_{n→∞} H^n_0(X,Y) ] = lim_{n→∞} E[ H^n_0(X,Y) ].

Finally, note that by a telescoping sum argument and Assumption 2.2,

 lim_{n→∞} E[ H^n_0(X,Y) ] = lim_{n→∞} E[ h(X_{nL}) ] = E_{X∼π}[h(X)],

as required, by a telescoping sum argument and Assumption 2.2. Therefore, for a general time t, H^{(L)}_t(X,Y) has expectation E_{X∼π}[h(X)], finite variance, and a finite expected computing time. ∎

### A.2 Proof of Theorem 2.5

A proof of Theorem 2.5 now directly follows.

###### Proof.

We consider the L-lag estimator H^{(L)}_t in (5). Under Assumptions 2.2, 2.3 and 2.4, by Proposition A.1, H^{(L)}_t(X,Y) is an unbiased estimator of E_{X∼π}[h(X)], for any h ∈ H. Then,

 dH(πt,π) = sup_{h∈H} | E_{X∼π}[h(X)] − E[h(Xt)] | = sup_{h∈H} | E[ ∑_{j=1}^{⌈(τ(L)−L−t)/L⌉} h(X_{t+jL}) − h(Y_{t+(j−1)L}) ] | ≤ E[ ∑_{j=1}^{⌈(τ(L)−L−t)/L⌉} MH(X_{t+jL}, Y_{t+(j−1)L}) ].

The inequality above stems from 1) the triangle inequality applied ⌈(τ(L)−L−t)/L⌉ times, and 2) the bound h(x) − h(y) ≤ MH(x, y) assumed in the main article. We see that increasing the lag L reduces the number of applications of the triangle inequality performed above, which explains why it is beneficial to increase L in order to obtain sharper bounds. ∎

## Appendix B Coupling Algorithms

In this section, all the coupling algorithms used in our examples are presented. These are constructions used in recent work on unbiased MCMC estimation with couplings [Jacob et al., 2017, Heng and Jacob, 2019, Middleton et al., 2018].

##### Maximal Couplings.

To construct L-lag couplings, the pair of chains needs to meet exactly whilst preserving their respective marginal distributions. This can be achieved using a maximal coupling [Johnson, 1998, Thorisson, 2000], which we present below in Algorithm 2. Given two distributions p and q, Algorithm 2 samples a pair (X, Y) jointly such that marginally X ∼ p and Y ∼ q, and such that the event {X = Y} occurs with maximal probability. It requires sampling from p and q and evaluating the ratio of their probability density functions. Below, p and q denote both the distributions of X and Y and their respective probability density functions.
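This construction can be sketched as follows, under the assumption that we can sample from p and q and evaluate their log-densities up to a common additive constant; the function arguments are placeholders of this sketch, not notation from the paper.

```python
import numpy as np

def maximal_coupling(sample_p, sample_q, logpdf_p, logpdf_q, rng):
    """Sample (X, Y) with X ~ p, Y ~ q and P(X = Y) = 1 - d_TV(p, q)."""
    x = sample_p(rng)
    # Accept X as a common value with probability min(1, q(x) / p(x)).
    if np.log(rng.random()) + logpdf_p(x) <= logpdf_q(x):
        return x, x                      # the two marginals coincide
    while True:                          # otherwise sample Y from the residual of q
        y = sample_q(rng)
        if np.log(rng.random()) + logpdf_q(y) > logpdf_p(y):
            return x, y

# Example: maximal coupling of N(0, 1) and N(1, 1); the normalizing constant
# is the same for both densities, so it can be dropped from the log-pdfs.
rng = np.random.default_rng(1)
norm_logpdf = lambda z, m: -0.5 * (z - m) ** 2
x, y = maximal_coupling(lambda r: r.normal(0.0, 1.0),
                        lambda r: r.normal(1.0, 1.0),
                        lambda z: norm_logpdf(z, 0.0),
                        lambda z: norm_logpdf(z, 1.0), rng)
```

The expected cost of the rejection loop is finite whenever the two distributions overlap, and when p = q the first branch always fires, so the pair meets with probability one.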

For the particular case where the two distributions are Normal with the same covariance matrix, we can use a reflection-maximal coupling [Jacob et al., 2017, Bou-Rabee et al., 2018], which has a deterministic computational cost, dominated by a Cholesky decomposition calculation. This also samples (X, Y) jointly such that the marginal distributions are preserved and such that {X = Y} occurs with maximal probability. It is given in Algorithm 3 below, where φ denotes the probability density function of a standard Normal. Note that in the "meet" case below, we get an event {X = Y} as required.
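A sketch of the reflection-maximal coupling in the isotropic case, N(μ1, σ²I) versus N(μ2, σ²I) with μ1 ≠ μ2 (the general covariance case additionally applies a Cholesky factor of Σ); this is an illustration under those assumptions, not the paper's exact Algorithm 3.

```python
import numpy as np

def reflection_maximal_coupling(mu1, mu2, sigma, rng):
    """Sample (X, Y) with X ~ N(mu1, sigma^2 I), Y ~ N(mu2, sigma^2 I),
    meeting with maximal probability; assumes mu1 != mu2."""
    z = (np.asarray(mu1) - np.asarray(mu2)) / sigma
    e = z / np.linalg.norm(z)
    xdot = rng.standard_normal(z.shape[0])
    # Accept with probability min(1, phi(xdot + z) / phi(xdot)).
    log_ratio = -0.5 * np.sum((xdot + z) ** 2) + 0.5 * np.sum(xdot ** 2)
    if np.log(rng.random()) <= log_ratio:
        ydot = xdot + z                      # "meet": then mu2 + sigma*ydot = mu1 + sigma*xdot
    else:
        ydot = xdot - 2.0 * np.dot(e, xdot) * e   # reflect through the hyperplane orthogonal to e
    return mu1 + sigma * xdot, mu2 + sigma * ydot
```

In the accepted branch, Y = μ2 + σ(Ẋ + z) = μ1 + σẊ = X, so the chains meet exactly; in the other branch the reflection preserves the standard Normal distribution of Ẏ.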

When the random variables have discrete distributions on a finite state space, we can perform a maximal coupling with a deterministic computation cost. This is given in Algorithm 4. First, we define the overlap ν as ν_i = p_i ∧ q_i for each state i, with α = ∑_i ν_i. The notation a ∧ b stands for the minimum of a and b. We then define residual distributions p̃ and q̃ as p̃_i = (p_i − ν_i)/(1 − α) and q̃_i = (q_i − ν_i)/(1 − α). The vectors ν/α, p̃ and q̃ are probability vectors, and computing them takes a number of operations linear in the size of the state space. Note that the total variation distance between p and q is equal to 1 − α, and that p̃ and q̃ have disjoint supports.
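These quantities are straightforward to compute; a small sketch with hypothetical probability vectors p and q on a three-point state space:

```python
import numpy as np

# Hypothetical probability vectors on the state space {0, 1, 2}.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

nu = np.minimum(p, q)           # overlap component, nu_i = p_i ∧ q_i
alpha = nu.sum()                # coupling probability, alpha = 1 - d_TV(p, q); here ≈ 0.7
p_res = (p - nu) / (1 - alpha)  # residual of p; here ≈ [1, 0, 0]
q_res = (q - nu) / (1 - alpha)  # residual of q; here ≈ [0, 0, 1], disjoint support

def sample_discrete_maximal_coupling(rng):
    """Algorithm 4 sketch: with probability alpha draw a common value from nu/alpha,
    otherwise draw independently from the disjoint residuals."""
    if rng.random() < alpha:
        x = rng.choice(3, p=nu / alpha)
        return x, x
    return rng.choice(3, p=p_res), rng.choice(3, p=q_res)
```

Because the residuals have disjoint supports, the two outputs can only coincide in the first branch, so the meeting probability is exactly α.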

### B.1 Coupled Random-Walk Metropolis–Hastings

We couple a pair of random-walk Metropolis–Hastings chains in Sections 2.2.1 and 2.2.2 using Algorithm 5.

We can replace the acceptance probabilities by the usual Metropolis–Hastings ratios evaluated at each chain's current state and proposal, respectively, to obtain a coupling algorithm for general Metropolis–Hastings Markov chains with proposal kernel q, provided that we can sample from a maximal coupling of q(x, ·) and q(y, ·) for any pair (x, y).

### B.2 Coupled Ising Model

##### Single site Gibbs (SSG).

Our implementation of single-site Gibbs scans all the sites of the lattice systematically. We recall that the full conditionals of the Gibbs sampling updates are Bernoulli distributions. It is given in Algorithm 6.

##### Parallel Tempering (PT).

For parallel tempering, we introduce K chains denoted by X(1), …, X(K). Each chain X(k) targets the distribution π_{β_k}, where β_1, …, β_K are positive values interpreted as inverse temperatures. In the example of the article we use K = 12 chains, with the largest inverse temperature equal to the target's β = 0.46 and the remaining values equispaced. This is in no way optimal; see Syed et al. [2019] for practical tuning strategies. Our implementation of a coupled PT algorithm is given below.