 # Air Markov Chain Monte Carlo

We introduce a class of Adapted Increasingly Rarely Markov Chain Monte Carlo (AirMCMC) algorithms where the underlying Markov kernel is allowed to be changed based on the whole available chain output but only at specific time points separated by an increasing number of iterations. The main motivation is the ease of analysis of such algorithms. Under the assumption of either simultaneous or (weaker) local simultaneous geometric drift condition, or simultaneous polynomial drift we prove the L_2-convergence, Weak and Strong Laws of Large Numbers (WLLN, SLLN), Central Limit Theorem (CLT), and discuss how our approach extends the existing results. We argue that many of the known Adaptive MCMC algorithms may be transformed into the corresponding Air versions, and provide an empirical evidence that performance of the Air version stays virtually the same.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the problem of estimating integrals of the form

 π(f):=∫Xf(x)dπ(x)

for a target distribution on a general state space . The usual Markov Chain Monte Carlo (MCMC) procedure is to choose a Markov kernel from a collection of invariant kernels in order to simulate an ergodic Markov chain . Then the chain output average

 ^πN(f):=1NN−1∑i=0f(Xi) (1)

is taken as an estimate of . The properties of this estimator, both asymptotic and finite sample, will depend on the choice of . Usually the optimal value of is unknown a priori, as it depends on the intractable in a complicated way. However, in many settings there is constructive theoretical guidance of how to hand tune based on a pilot MCMC run (e.g., optimal scale and covariance of the proposals in the Random Walk Metropolis algorithm [34, 36]

, or optimal selection probabilities in the Random Scan Gibbs sampler

). Hand tuning is troublesome: it requires human expertise, human time and requires an ad hoc decision on how long the pilot run should be (after which the opportunity to learn from subsequent samples is lost). In many high dimensional settings, or complex algorithms that use many kernels, hand tuning is not practically feasible.

A more attractive alternative to hand tuning, is to design an automated algorithmic procedure that would adjust indefinitely, as further information accrues from the chain output. Formally, such an approach is called adaptive MCMC (AMCMC). To optimise different sampling scenarios, a variety of AMCMC algorithms have been developed, including, among others, the Adaptive Metropolis [20, 46], Adaptive MALA [5, 29], Adaptive Random Scan Gibbs Sampler , or samplers specialised to model selection [32, 19]. All these AMCMC advancements share the common design of generating the process by repeating the following two steps:

1. [label*=(0)]

2. Sample from .

3. Given update according to some adaptation rule.

Empirically, Adaptive MCMC methods largely outperform their non-adap-tive counterparts, often by a factor exponential in dimension, and enjoy great success in many challenging applications (see e.g. [43, 9]). Nevertheless, despite large body of work that we discuss in Section 4, their theoretical underpinning is lagging behind that of nonadaptive MCMC. AMCMC algorithms are notoriously difficult to analyse due to their intrinsic nonmarkovian dynamics resulting from alternating steps 1 and 2 above.

In this paper we propose to redesign Adaptive MCMC so that it becomes more tractable mathematically, but its ability to self tune to the sampling problem becomes unaffected.

We introduce Adapted Increasingly Rarely MCMC (AirMCMC), where adaptations of are only allowed to happen at prescheduled times with an increasing lag between them. Denote the consecutive lags as and set the adaptation times as

 Nj:=j∑k=1nk,with N0:=n0:=0. (2)

The generic design of an AirMCMC is presented in Algorithm 1 below.

Note that Step 1b of the above AirMCMC pseudo code allows a background precomputation of the parameter , analogous to that in step 2 of AMCMC. However, the dynamics of is driven by , and the value of is updated at prescheduled times only. It is intuitively clear that updating the transition kernel at every step is not necessary for efficient tuning because the new information about optimal acquired from in a single move of is infinitesimal as the total length of simulation increases. We demonstrate this empirically in Section 2 by comparing performance of adaptive scaling and Adaptive Metropolis algorithms to their Air versions for various choices of the lag sequence .

Theoretical analysis of AirMCMC benefits from the fact that the law

 L(XNj+1,…,XNj+nj+1∣∣Gj),where Gj:=σ(X0,..,XNj,γ0,..,γNj),

is that of a Markov chain with transition kernel

Consequently, the standard Markov chain arguments apply to individual epochs between adaptations of increasing length

. In Section 3 we state that AirMCMC algorithms preserve the main convergence properties, namely, the Weak and Strong Law of Large Numbers (WLLN, SLLN) and the Central Limit Theorem (CLT). Also, we show that the Mean Squared Error (MSE) of decays to at a rate that is arbitrary close or equal to and with constants that in principle can be made explicit. We establish these results under regularity conditions that are standard for MCMC and AMCMC analysis, namely simultaneous geometric drift conditions of (MSE, WLLN, SLLN, CLT) and simultaneous polynomial drift conditions (WLLN, SLLN, CLT), as well as assuming a weaker, and non-standard local simultaneous geometric drift conditions (MSE, WLLN, SLLN, CLT). No further technical assumptions are needed, in particular, neither diminishing adaptation, nor Markovianity of the bivariate process that are typically required in theoretical analysis of AMCMC. A detailed discussion of how these results relate to available AMCMC theory is in Section 4. Proofs of the theoretical properties of AirMCMC are gathered in Section 7.

In Section 5 we demonstrate how AirMCMC helps establish theoretical underpinning of advanced algorithms. We consider the recently proposed Adaptive Random Scan Gibbs Sampler (ARSGS)  and the Kernel Adaptive Metropolis Hastings (KAMH)  algorithms. Asymptotic properties of (1) for both the ARSGS and KAMH are not covered by the currently available AMCMC theory when applied to a target with unbounded support. However, for their Air versions, we establish MSE convergence, the WLLN and SLLN under mild regularity assumptions. We conclude the paper in Section 6 with a discussion.

## 2 Motivating Examples

In this section we examine the ability of AirMCMC to self tune, and see how it compares to standard Adaptive MCMC in its two most successful design versions that adapt the scaling and the covariance matrix of the proposal. We also empirically investigate sensitivity of AirMCMC to its key design parameter, the sequence of blocks lengths .

### 2.1 Adaptive Scaling of Random Walk Metropolis

In this example we shall study Air version of the Adaptive Random walk Metropolis (ARWM) for a one dimensional target distribution. We consider an adaptive algorithm with normal proposals that tunes the proposal variance in order to achieve the optimal acceptance ratio 0.44 (see

).

In Algorithm 2 we present Air version of the algorithm, where the adaptations of the variances are separated by the sequence of iterations. By taking we recover the original ARWM.

Below we compare performance of the ARWM with the AirRWM on sampling from a t-distribution.

 π(x)∼(1+x2ν)−(ν+1)/2,

where we set and consider three different sequences for . We start algorithms with the initial proposal variance . We also run a non-adaptive RWM with this initial variance to demonstrate the speed up of the adaptive algorithms.

Remark. To prevent from converging to a poor proposal variance, the sequence should be chosen so that , where are the adaptation times. For example, we could choose for some .

Below we present the simulation results. The sequence in the settings of the Algorithm 2 is chosen as in the above remark, . For every algorithm we run 1000 independent chains for 100,000 iterations all started from the origin.

We estimate the optimal variance to be around 6.5. We observe that AirRWM with approximates the optimal variance very well and performs only 446 adaptations; AirRWM with , performs 66 adaptations and underestimates the optimal variance to be 4.5; whereas in case , the AirRWM does only 24 iterations and estimates the variance only as 1.95. On the other hand, it is known that the adaptive algorithms are robust to the choice of the adapted parameters (see., e.g., ). As we can see in Figure 1

, all the adaptive algorithms estimate the 0.95 quantile equally well after 100,000 iterations. Note that the non-adaptive chain with proposal variance

converges extremely slowly, so that its running quantile estimation plot does not fit into Figure 1. We present trace plots of the non-adaptive and adaptive chains in Figure 2. Figure 1: Error in estimation of a quantile at 0.95 level. X-axis – number of iterations. Y-axis – error in estimation. Figure 2: Trace plots.

Remark. If the target distribution has polynomial tails, then under mild conditions, as follows from results of Jarner and Roberts , the Random Walk Metropolis (RWM) with normal proposals produces a polynomially ergodic chain. More precisely, for some consider a target distribution on the whole line with Lebesgue density given by

 π(x)=l(|x|)|x|1+r, x∈R, (3)

where is a normalised slowly varying function. By slowly varying function we understand a function such that for all is eventually increasing and is eventually decreasing.

From Proposition 3 of , it follows that the collection of RWM kernels (here is a variance of the proposal) are simultaneously polynomially ergodic (see Assumption 3 in Section 3).

Thus, we can see that Theorem 3 of Section 3 is applicable and, given a sequence is chosen as in the theorem, the AirRWM Algorithm 2 produces a chain for which the SLLN and WLLN hold. If, additionally, the adapted variance converges, then the CLT holds, although we do not investigate further these details in the present paper.

### 2.2 Adaptive Metropolis for high dimensional correlated posteriors

In this example we shall analyse ‘Air‘ version of the Adaptive Random Walk Metropolis (ARWM) algorithm introduced by Haario et al.  and studied in .

For a dimensional distribution with covariance matrix , consider a Metropolis-Hastigns algorithm with a sequence of proposals

 Qn(x,⋅)=0.9N(x,[(2.38)2d]Σn)+0.1 N(x,(0.1)2dId),

where is a

dimensional identity matrix and

is a covariance matrix estimated from the first steps of the adaptive algorithm.

The algorithm is aimed to approximate the optimal proposal (see [34, 36, 40]), where is a covariance matrix of the target distribution. Roberts & Rosenthal  argue that the ARWM may be very efficient in high-dimensional settings, where a good proposal is crucial. We shall analyse the same example as in Section 2 of . The target distribution is a multivariate normal

 π∼N(0,MMT),

where the covariance matrix is formed of a dimensional matrix with randomly generated entries .

For the ‘Air‘ version of the algorithm, introduce a sequence of increasing lags,

 nk=⌊kβ⌋ k≥1,

for some , and consider Algorithm 1, where adaptations are allowed to take place only at times (2), i.e., after non-adaptive iterations.

Roberts & Rosenthal  measure the efficiency of an adaptive algorithm by looking at two crucial properties. First, is the ability of the algorithm to learn the appropriate scale (variance), which is monitored by looking at the trace plot. Second, is the ability of the algorithm to learn the shape of the target distribution, which is measured by inhomogeneity factor introduced by  (see also [39, 40]). For a dimensional target distribution, the inhomogeneity factor is defined as

 b=d∑di=1λ−1i(∑di=1λ−1/2i)2,

where

are the eigenvalues of

, where, as before, is the covariance matrix of and is the empirical covariance matrix. Note that by Jensen’s inequality, , and only for the proposal, which shape is proportional to .

For three different values of the parameter we run ARWM and AirRWM algorithms to obtain 1 million samples for a 100 dimensional target distribution . Trace plots of the 1st coordinate can be found in Figure 3, whereas the running inhomogeneity factor estimator is plotted in Figure 4.

Surprisingly, it seems that AirRWM performs at least as well as the usual ARWM for any , whence we conclude that one does not need to adapt the covariance matrix after each iteration. Moreover, we present total computational cost of the adaptive algorithm in Table 1. One can observe that Airing delivers a 5 fold speed up to the ARWM, where adaptations are performed at every iteration.

 ARWM AirRWM β=1 AirRWM β=2 AirRWM β=4 Time (seconds) 507.6 90.5 86.9 80.2

## 3 AirMCMC Theory

Recall that we are interested in the long time behavior of the sample average defined in (1), where the sequence is generated by the generic AirMCMC Algorithm 1. Hence, for iterations between and the process is evolving according to , and it is the properties of these Markov transition kernels that play the key role in the analysis.

The transition kernel is a map such that is a probability measure on for every and is a measurable function for every . acts on the space of probability measures from the left, , with , and on the space of functions from the right, , with

Given a collection of transition kernels , a sequence of lags , an adaptation rule, say , and initialisation , the AirMCMC Algorithm 1 induces a probability measure on i.e. on the space of trajectories of . Denote this probability measure as and write for its expectation. Note that the construction allows for being a random sequence, being a randomised rule and being a random starting point. We will explore the possibility of being random in Section 3.4.

Properties of AirMCMC translate into statements about and, in particular,

• we say that the AirMCMC algorithm is ergodic, if it converges in distribution, i.e. for every ,

 limn→∞P(X0,γ0)(Xn∈A)=π(A); (4)
• the Mean Square Error of defined in (1) and obtained from AirMCMC, is

 MSE(^πN(f)):=E(X0,γ0)[^πN(f)−π(f)]2; (5)
• the Weak Law of Large Numbers holds for AirMCMC, if for every , converges in probability to , i.e.,

 limN→∞P(X0,γ0)(|^πN(f)−π(f)|>ε)=0, (6)

and we use to denote the convergence in probability;

• the Strong Law of Large Numbers holds for AirMCMC, if converges to almost surely, i.e.,

 P(X0,γ0)(limN→∞^πN(f)→π(f))=1, (7)

and we use to denote almost sure convergence;

• and finally, the Central Limit Theorem holds if for every ,

 limn→∞P(X0,γ0)(√N{^πN(f)−π(f)}≤u)=1√2πσ2f∫u−∞ev22σ2fdv. (8)

where is called the asymptotic variance. We use to denote convergence (8).

We start by introducing regularity conditions commonly used in analysis of MCMC and AMCMC algorithms. We refer to [30, 37] for the Markov chains and MCMC context of these conditions, and to [7, 11, 38] for the AMCMC context. Throughout the paper the following will hold:

###### Assumption 1 (Regularity and Small Set).
• All considered Markov kernels are -invariant, -irreducible, and aperiodic (see  for definitions);

• One step simultaneous minorisation condition holds, i.e., there exist a set , with positive mass , a probability measure on , and a constant such that

 Pγ(x,⋅)≥δν(⋅) for all x∈C, γ∈Γ. (9)

We shall consider AirMCMC in several stability settings.

###### Assumption 2 (Simultaneous Geometric Drift).

The collection of transition kernels satisfies the Simultaneous Geometric Drift condition, if there exist constants , , and a function , such that

 PγV≤λV+bI{C}, for all γ∈Γ, (10)

where is the small set defined in (9).

###### Assumption 3 (Simultaneous Polynomial Drift).

The collection of transition kernels satisfies the Simultaneous Polynomial Drift condition, if there exist constants , , , and a function , such that

 PγV≤V−cVα+bI{C}, for all γ∈Γ, (11)

where is the small set defined in (9).

Most theoretical work on Adaptive MCMC has been developed under simultaneous geometric or polynomial drift defined above, however these assumptions are not well suited for some classes of algorithms, such as the Random Scan Gibbs Samplers. Hence, in  we introduce a relaxed version of the simultaneous drift condition that only requires (10) to hold locally.

###### Assumption 4 (Local Simultaneous Geometric Drift).

The collection of transition kernels satisfies the Local Simultaneous Geometric Drift condition, if for every there exist an open neighborhood , such that , and there exist constants , , and a function , such that

 Pγ⋆Vγ≤λγVγ+bγI{C}, for all γ⋆∈Bγ, (12)

where is the small set defined in (9).

The above formulation of the Local Simultaneous Drift condition is easy to verify in some fairly general settings, c.f. Theorem 10 of 

for the case of Random Scan Gibbs Samplers indexed by the vector of selection probabilities. The following theorem makes Local Simultaneous Drift condition operational in the sense that it helps conclude global stability.

###### Theorem 11 of .

Let satisfy Assumption 4, and let be a compact set in some topology. Then, there exists a finite partition of into sets such that , and a version of the Local Simultaneous Geometric Drift condition (12) holds inside with a set dependent drift function and coefficients , are independent of , i.e.

 PγVi≤λVi+bI{C}, for all % γ∈Fi, (13)

where is the small set defined in (9).

For the CLT to hold, we require a bound on the regeneration times of the Markov chain generated by a kernel . Assumption 1 allows construction of a split chain on the space defined as

 P(Yn−1=1|Xn−1)=δIXn−1∈C, P(Xn∈A|Yn−1=1,Xn−1)=ν(A), P(Xn∈A|Yn−1=0,Xn−1)=Qγ(Xn−1,A),

where

 Qγ(x,⋅)=Pγ(x,⋅)−δν(⋅)I{C}1−δI{C}.

Note that marginally is a Markov chain that evolves according to . Regeneration time is defined as

 T=inf{n≥1:Yn−1=1}. (14)
###### Assumption 5.

For some a function of interest satisfies

 supγ∈ΓEν,γ[T−1∑j=0f(Xj)]2+δ<∞, (15)

where is a regeneration time of a Markov chain with transition kernel .

For functions and define a norm as

 ∥g∥V:=supx|g(x)|V(x).

For a singed measure , the corresponding norm is defined as

 ∥μ∥V:=supg:∥g∥V=1∥μ(g)∥V.

Suppose that the parameter space is a metric space. We say that the kernel is a continuous function of in norm if for any sequence such that ,

 supx∥Pγn(x,⋅)−Pγ(x,⋅)∥VV(x)→0 as n→∞.

We are now ready to state the main results of the paper.

### 3.1 Simultaneous Geometric Ergodicity

###### Theorem 1.

Let a collection of Markov kernels with an invariant distribution satisfy Assumptions 1 and 2, and let be the drift coefficients in (10).

Fix an arbitrary real number and let be a sequence such that for some , ,

 c2kβ≥nk≥c1kβ. (16)

For these parameters consider a chain generated by the AirMCMC Algorithm 1.

Then for any starting distribution such that , and any function such that :

1. [label=)]

2. For any , the MSE of converges to at a rate
, i.e.,

 limN→∞MSE(^πN(f))=O⎛⎜⎝1Nmin{1,2β1+β}⎞⎟⎠,

in particular, the WLLN holds.

3. If , the rate in mean-square convergence is , i.e.,

 MSE(^πN(f))=O(1N).
4. If , the SLLN holds,

 ^πN(f)a.s.−−→π(f).
5. Suppose , Assumption 5 holds, is a metric space, and the adapted parameter converges to a limit almost surely (where

itself might be a random variable). Assume that

is a continuous function of in norm. If also, has a positive asymptotic variance , then the CLT holds, i.e.,

Assumption 5 is standard to verify under simultaneous geometric ergodicity Assumption 2. We present the corresponding proposition below.

###### Proposition 1.

Let the Assumption 2 hold and . Then Assumption 5 holds for any function such that for some .

### 3.2 Local Simultaneous Geometric Ergodicity

In order to extend Theorem 1 to the local geometric ergodicity settings, we need to modify AirMCMC algorithm. We introduce a set , where all the drift functions that satisfy (12), are bounded on . Algorithm 3 is a modified version of AirMCMC, where the adaptations are allowed to take place only when the chain hits .

Remark. Efficiency of the algorithm depends on the choice of the set . If is too “small”, adaptations will not occur frequently. However under the conditions of Theorem 2, the set will be visited infinitely many times so that the adaptation will continue. Moreover, Theorem 11 of  (presented above) implies that, if the parameter set is compact, there exits a finite number of drift functions that satisfy Assumption 4. Theorem 14.2.5. of  implies that for large , level sets , cover most of the support of for large , meaning that, with the appropriate choice of , the adaptations will occur in most of the iterations of the modified Algorithm 3.

###### Theorem 2.

Let a collection of Markov kernels with an invariant distribution satisfy Assumptions 1 and 4, and let be the drift coefficients in (10). Assume that is a compact set in some topology and let be any set such that , .

Fix an arbitrary real number and let be a sequence such that for some , ,

 c2kβ≥nk≥c1kβ.

For these parameters consider the chain generated by the AirMCMC algorithm 3.

Then for any starting distribution such that , and any function such that , :

1. [label=)]

2. For any , the MSE of converges to at a rate
, i.e.,

 limN→∞MSE(^πN(f))=O⎛⎜⎝1Nmin{1,2β1+β}⎞⎟⎠,

in particular, the WLLN holds.

3. If , the rate in mean-square convergence is , i.e.,

 MSE(^πN(f))=O(1N).
4. If , the SLLN holds,

 ^πN(f)a.s.−−→π(f).
5. Suppose , Assumption 5 holds, is a metric space, and the adaptive parameter converges to a limit almost surely (where itself might be a random variable). Assume that for every , is a continuous function of in some open neighbourhood of in norm. If also, has a positive asymptotic variance
then the CLT holds, i.e.,

The following proposition allows to practically verify Assumption 5 in the local geometric ergodicity settings.

###### Proposition 2.

Let the Assumption 4 hold and for . Then Assumption 5 holds for any function such that for some and all .

### 3.3 Simultaneous Polynomial Ergodicity

In this section we extend Theorem 1 for the case of polynomially ergodic kernels , .

###### Theorem 3.

Let a collection of Markov kernels with an invariant distribution satisfy Assumptions 1 and 3, and let be the drift coefficients in (11). Assume also that and .

Fix an arbitrary real number and let be a sequence such that for some , ,

 c2kβ≥nk≥c1kβ.

For these parameters consider the chain generated by the AirMCMC algorithm 1.

Then for any starting distribution such that , and any function such that :

1. [label=)]

2. For any the WLLN holds, i.e, for any

 limN→∞PX0,γ0(∣∣^πN(f)−π(f)∣∣≥ε)=0.
3. If , the SLLN holds,

 ^πN(f)a.s.−−→π(f).
4. Suppose , Assumption 5 holds, is a metric space, and the adaptive parameter converges to a limit almost surely (where itself might be a random variable). Assume that is a continuous function of in norm. If also, has a positive asymptotic variance , then the CLT holds, i.e.,

Remark. It follows from the theorem that disregarding the value of .

As before, we present a proposition allows to practically verify Assumption 5 in simultaneous polynomial ergodicity settings.

###### Proposition 3.

Let the Assumption 3 hold and . Then Assumption 5 holds for any function such that for some .

### 3.4 Convergence in distribution

We have shown in the previous section that under regularity conditions of Theorems 1, 2, and 3, the AirMCMC algorithm produces a chain with various convergence properties. However, without any additional assumptions the chain might fail to converge in distribution, as we demonstrate in Example LABEL:example:slln_without_ergodicity below. On the other, we show in Theorem 4 that imposing an additional diminishing adaptation condition (17), guarantees ergodicity (i.e., convergence in distribution) of the AirMCMC algorithm. We argue that this is a minor condition that either holds in practice or can be easily enforced. In Theorem 5 we introduce an AirMCMC Algorithm 4, where the sequence of increasing lags is randomised, which ensures the diminishing adaptation condition.

The diminishing adaptation condition is a restriction on the adaptation size of the algorithm:

 supx∈X∥Pγn(x,⋅)−Pγn+1(x,⋅)∥TVP→0 as n→∞, (17)

where is the total variation distance, is a valued random variable. Here for a signed measure , where the supremum is taken over all measurable sets.

The following theorem demonstrates that the regularity conditions of the previous section together with the diminishing adaptation condition imply convergence in distribution of the AirMCMC algorithms.

###### Theorem 4.

Suppose that the diminishing adaptation condition (17) and Assumption 1 hold. Let also one of the following conditions hold

1. [label*=()]

2. Assumption 2 and for the corresponding drift function ,
;

3. Assumption 4 and for the corresponding collection of drift functions , for all ;

4. Assumption 3 and for the corresponding drift function , every level set is a uniform small set, i.e., satisfies (9).

Then any AirMCMC Algorithm 1 (or, in case 2, any modified AirMCMC Algorithm 3, where the set in the algorithm settings is such that for the corresponding drift functions , , ) produces an ergodic chain , i.e.,

 ∥L(Xn)−π∥TV→0,

where is the distribution law of .

As argued in , the diminishing adaptation condition is not an issue in practice. The condition holds for many typical adaptive MCMC algorithms (e.g., as for the standard Adaptive Metropolis or Adaptive Gibbs Samplers, see [10, 38]). For the adaptive algorithms where the condition does not hold (e.g., as for KAMH ) or it is hard to verify the condition, we could, nevertheless, easily modify the algorithms in order to enforce (17). For example, at the adaptation times , we could flip a coin with success probability to decide whether to adapt the Markov kernel. If , then (17) holds. Notice that the sequence can decay arbitrarily slowly.

Alternatively, for the AirMCMC algorithms, we could allow the sequence of increasing lags to be random. More precisely, let sequence be deterministic that satisfies (16) for some . We could consider an AirMCMC Algorithm 1, where in Step 2 we set for some . Since, satisfies (16), we could still prove the statements of Theorems 1, 2, 3 for this randomised version of the AirMCMC. Moreover, the resulting Algorithm 4 would be ergodic and satisfy the statements of Theorems 1, 2 or 3 under the corresponding regularity conditions. We summarise our observations in Theorem 5 below.

###### Theorem 5.

Consider settings of Theorem 1 (alternatively, of Theorem 2 or 3), where the condition (16) holds for a sequence . Consider an AirMCMC Algorithm 4 (in case of the settings of Theorem 2, we allow adaptations in Step 2 to happen only if the chain hits the corresponding set ). Then the adaptive chain produced by the algorithm satisfies statements of Theorem 1 (alternatively, of Theorem 2 or 3, respectively).

Moreover, for any sequence of lags , the AirMCMC Algorithm 4 satisfies the diminishing adaptation condition (17). Under regularity conditions of Theorem 4 (in case of the settings 2 of the theorem, we allow adaptations to happen only if the chain hits the corresponding set ), the adaptive chain produced by the algorithm converges in distribution.

We conclude this section with a counterexample that demonstrates that an AirMCMC Algorithm 1 might fail to be ergodic (i.e., the corresponding adaptive chain does not converge in distribution) without the diminishing adaptation condition.

Example 1 . This example is a modified version of Example 4 of Roberts & Rosenthal . Our goal is to construct an AirMCMC algorithm that satisfies conditions of Theorem 1 but fails to be ergodic. Let . For some , define a target as , , . For , let correspond to a Metropolis-Hastings kernel with proposals

 Q1(x,⋅)∼Uniform{x−1,x+1}, Q2(x,⋅)∼Uniform{x−2,x−1,x+1,x+2}.

proceeds as follows. At every iteration given , simulate proposal , with probability set , otherwise, reject the proposal, i.e., . If the proposal is outside , then we always reject it. Consider the following adaptive Algorithm 5.

###### Proposition 4.

Kernels , satisfy the simultaneous minorisation Assumption 1 and the simultaneous drift Assumption 2. Therefore, by virtue of Theorem 1, the SLLN holds for Algorithm 5. However, the adaptive chain produced by the algorithm does not converge in distribution.

Proof of Proposition 4. Algorithm 5 is designed in such a way, that Steps 1 and 2 “drift” the adaptive chain away from the correct stationary distribution. Since the chain approaches the stationary distribution arbitrarily closely after Step 3, we conclude that at times ,

 ∥∥L(X2k2+2)−π∥∥TV>δ

for some . Therefore, does not converge in distribution. We provide a detailed proof of the proposition in Appendix A.

## 4 Comparison with available Adaptive MCMC theory

AMCMC algorithms have received an increasing attention in the past two decades with much research devoted to studying ergodicity property [6, 7, 10, 20, 26, 38], robustness and stability of the algorithms [1, 11, 46], as well as asymptotic behaviour of the average (1) of the adaptive chain output [2, 4, 17, 41, 45]. In the current paper we are interested in the latter part, i.e., in studying the asymptotic behaviour of (1).

As discussed in , convergence of any AMCMC algorithm depends on the combination of two factors: the speed of convergence of the underlying Markov kernels (their mixing properties) and the adaptation scheme of the algorithm. In practice the appropriate combination of mixing and adaptation is established by verifying the containment and diminishing adaptation conditions. Together these conditions imply convergence in distribution of the AMCMC (see ). Violating either of the conditions can ruin the convergence of an AMCMC scheme (see e.g., example in Section 3 of ; Examples 1 and 2 in ). As we discussed in Section 3.4, the diminishing adaptation is a mild condition, that can be imposed, if necessary, by slightly modifying the adaptation procedure.

The containment condition is not necessary for convergence, but an ergodic adaptive algorithm that fails the containment, is also more inefficient than any of its non-adaptive counterparts, as was proven in . The containment is a technical condition, which is notoriously hard to verify directly. However, it is implied by the regularity assumptions presented in Section 3 (see [7, 10, 38]). As demonstrated in Example 4 of , even on finite state spaces the containment and diminishing adaptation conditions alone do not imply the SLLN.

Under the diminishing adaptation and simultaneous geometric drift condition (10), the SLLN was established in, e.g., [2, 4, 41, 45]. Moreover, under an additional assumption that the adapted parameters converge, the CLT was established in . The SLLN was also established under the simultaneous polynomial drift condition (11) in . Note, however, the authors effectively require the joint process to be an inhomogeneous Markov chain. The results are well-suited for many popular algorithms, e.g., Adaptive Metropolis-Hastings (see Section 3.2 in ), Adaptive Metropolis-within-Gibbs (see, e.g., [26, 40]), or Adaptive Metropolis adjusted Langevin Algorithm (see ).

On the other hand, there are algorithms that do not meet the conditions of [2, 4, 41, 45]. For example, the Adaptive Random Scan Gibbs (ARSG) sampler, recently presented in , generally does not satisfy the simultaneous drift condition, whereas the Kernel Adaptive Metropolis-Hastings (KAMH) algorithm, proposed by Sejdinovic et al. , produces an adaptive chain that is not Markov. Furthermore, none of the available adaptive MCMC results quantifies the MSE rate of convergence

We have introduced a concept of AirMCMC algorithms, for which we have relaxed the generally imposed conditions. First, we do not require the joint adaptive chain to be Markov. Secondly, for the modified AirMCMC Algorithm 3, instead of the simultaneous geometric drift condition (10), we require only the local geometric drift (12) to hold, which is a natural condition for the ARSGS. Thus, we could prove the SLLN, MSE convergence, and convergence in distribution for the Air versions of the ARSGS and the KAMH in Section 5.

Moreover, for the AirMCMC algorithms, under the local geometric drift Assumption 4 or the simultaneous polynomial drift Assumption 3, we have established the CLT. We have also derived the MSE convergence under the local or simultaneous geometric drift conditions (Assumptions 4 and 2, respectively).

We emphasize that virtually any AMCMC algorithm can be transformed into an Air version via lagging the adaptations in a way described in Algorithms 1, 3, and 4.

The technique we have used for analysis is tightly related to the one developed by Gilks et al. . The key idea in  is to allow adaptations of the Markov kernel to happen only at suitably constructed regeneration times of the chain. Under only Assumption 1, it is then possible to establish the SLLN, CLT, and MSE convergence. This is an effective idea for AMCMC in low dimensional spaces but impractical in higher dimensions, since the chain typically regenerates at a rate which recedes to 0 exponentially in dimension.

By introducing an increasing sequence of iteration between adaptation in Algorithms 1, 3, and 4, we have shown that the regularity conditions of Section 3 guarantee that the chain regenerates between adaptations with an increasingly high probability. Since grows sufficiently fast, we can use the technique of  to analyse the Markov tours of the adaptive chain between the regenerations, and control the remainder terms of the adaptive chain using the explicit bounds of .

## 5 Examples: Air versions of complex AMCMC algorithms

### 5.1 Adaptive Random Scan Gibbs Sampler

We could directly apply Theorem 2 to the ARSG sampler studied in . Let be a probability vector and assume that the target distribution sits on a product space . Recall, that the RSGS proceeds at each iteration by first choosing a coordinate with probability , and then updating the coordinate from its full conditional distributions.

In  we develop Adaptive RSGS that gradually tunes the selection probability vector to find its optimal value, which is based on spectral gap maximisation for a normal analogue of the target. We refer the reader to  for details. In the ARSGS, the adaptations of the sampling weights are separated by RSGS iterations. Therefore, if the sequence is chosen to be non-decreasing, the ARSGS already fits into AIRMCMC framework.

As we mentioned in Section 4, it is hard to verify the simultaneous geometric drift condition (10) for the ARSGS. On the other hand, the local simultaneous geometric drift condition (12) is a natural property for the ARSGS as long as the RSGS Markov kernel is geometrically ergodic for at least some selection probability vector (see Theorem 10 of ). We summarise our observations in the following theorem

###### Theorem 6.

Let be a target distribution on , where for some positive integers . Consider a collection of RSGS kernels parametrised by the sampling weights