Since the seminal article by gelfand1990sampling, Markov chain Monte Carlo (MCMC) methods have become ubiquitous in statistical computing. These methods suffer from a well-known problem: although it is easy to construct many MCMC algorithms that are guaranteed to converge eventually, it is typically very difficult to tell how long convergence will take for any given MCMC algorithm (see e.g. [jones2001honest, diaconis2008gibbs]). Even worse, it is usually difficult to tell whether an MCMC algorithm has converged even after running it (see e.g. [gelman1992inference] for an important early paper in the “diagnostics” literature, [bhatnagar2011computational] for a proof that the problem is intractable in general, and [hsu2015mixing, huber2016perfect] for cases where the problem becomes tractable).
In the decades since the publication of gelfand1990sampling, many techniques have been developed to try to solve this problem for classes of MCMC algorithms that are important in statistics (see e.g. the surveys by jones2001honest,diaconis2009markov). The most popular of these techniques is the “drift-and-minorization” method, introduced and popularized in rosenthal1995minorization, meyn1994computable, meyn2012markov. The purpose of this note is to replace the minorization condition (see Inequality 1) by a hitting condition (see Definition 2.5). It is straightforward to check that the usual drift-and-minorization condition immediately implies our drift-and-hitting condition, and in this sense our assumptions are weaker (and thus our result stronger) than typical drift-and-minorization conditions; see Section 4 for further discussion of the relationship between these conditions.
We now set some notation and recall an important special case of the drift-and-minorization bound. Throughout this paper, unless otherwise mentioned, we use to denote the state space of the underlying Markov process. Denote by the transition kernel of some Markov chain on state space with (unique) stationary measure on a Borel σ-field . For any , we write to denote the -step transition kernel generated from .
We say that exhibits a “drift” or “Lyapunov” condition if there exists a function and constants , so that
for all . We say that exhibits a “minorization” condition if there exists a set , constants , , and measure on so that
[rosenthal1995minorization, Thm. 12] says that any kernel satisfying these conditions compatibly is geometrically ergodic, and also gives quantitative bounds on the convergence rate:
Theorem 1.1 (Paraphrase of [rosenthal1995minorization, Thm. 12]).
There are more sophisticated versions of this drift-and-minorization bound, including results in the same paper [rosenthal1995minorization], but most are based on two conditions that are similar to 1 and 1.
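For orientation, conditions of this type are commonly written in the following form. The notation here is an assumption made for illustration and need not match the paper's own: $P$ is the kernel, $V$ a nonnegative Lyapunov function, $\lambda \in (0,1)$ and $b < \infty$ the drift constants, $C$ the small set, $\epsilon > 0$ the minorization constant, and $\mu$ a probability measure.

```latex
% Drift (Lyapunov) condition, standard form (assumed notation):
(PV)(x) := \int V(y)\, P(x, dy) \;\le\; \lambda V(x) + b
  \quad \text{for all } x,
% Minorization condition on a set C, standard one-step form (assumed notation):
P(x, A) \;\ge\; \epsilon\, \mu(A)
  \quad \text{for all } x \in C \text{ and all measurable } A.
```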
The goal of the present paper is to present results similar to Theorem 1.1, in which the minorization condition 1 has been replaced by related hitting conditions. Recall that the hitting time of a set for a Markov chain is:
We now introduce the maximum hitting time (of sets with large measure):
Let . The maximum hitting time (with parameter ) is
where is the expectation with respect to a measure on the space that generates the underlying Markov process, and the subscript is the starting point of the Markov process.
When the kernel is clear from the context, we write .
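On a finite state space the maximum hitting time can be computed by brute force, which may help fix ideas. The following is a minimal sketch under stated assumptions: the hitting time is taken to be the first time $t \ge 0$ at which the chain is in the set, the chain is assumed irreducible, and all function names (`solve_linear`, `expected_hitting_times`, `max_hitting_time`) are hypothetical helpers rather than anything from the paper.

```python
from itertools import combinations


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_hitting_times(p, target):
    """Return E_x[tau_A] for every state x, with tau_A = min{t >= 0 : X_t in A}.

    Assumes the chain with transition matrix p is irreducible, so that the
    linear system below is nonsingular.
    """
    n = len(p)
    outside = [x for x in range(n) if x not in target]
    if not outside:
        return [0.0] * n
    # For x outside A: h(x) = 1 + sum over y outside A of P(x, y) h(y).
    a = [[(1.0 if x == y else 0.0) - p[x][y] for y in outside] for x in outside]
    h_out = solve_linear(a, [1.0] * len(outside))
    h = [0.0] * n
    for x, value in zip(outside, h_out):
        h[x] = value
    return h


def max_hitting_time(p, pi, alpha):
    """Largest E_x[tau_A] over starting points x and sets A with pi(A) >= alpha."""
    n = len(p)
    worst = 0.0
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if sum(pi[x] for x in subset) >= alpha - 1e-12:
                worst = max(worst, max(expected_hitting_times(p, set(subset))))
    return worst


# Example: a chain that jumps to a uniformly random state at every step.
n_states = 4
p_uniform = [[1.0 / n_states] * n_states for _ in range(n_states)]
pi_uniform = [1.0 / n_states] * n_states
t_hit = max_hitting_time(p_uniform, pi_uniform, alpha=0.5)
```

In this toy example, hitting a two-point set from outside takes a geometric number of steps with success probability 1/2, so `t_hit` should be close to 2. The enumeration over subsets is exponential in the number of states and is only meant for very small chains.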
1.1. Guide to Paper
1.2. Previous Work
This paper is based on the asymptotic equivalence of mixing and hitting times established in the sequence of papers [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2]. This relationship allows us to show that our hitting condition (see Definition 2.5) implies something very close to the minorization condition 1, and thus to use the framework of rosenthal1995minorization. We note that our recent works anderson2018mixhit,anderson2019mixhit2 are based on the work of finitemixhit,oliveira2012mixing, which initially established this equivalence of mixing and hitting times in the special case that is finite. anderson2018mixhit and anderson2019mixhit2 make heavy use of the hyperfinite representation of Markov processes established in Markovpaper.
2. Bounded Mixing Times for Dominated Chains
We recall the definition of mixing time:
Fix and . The mixing time in with respect to is defined as
The lazy mixing time in with respect to is
For notational convenience, we write or to mean or , write or to mean or , and write or to mean or .
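For a finite chain, both quantities can be computed directly by powering the transition matrix. The sketch below assumes the usual conventions, which may differ in detail from the definitions intended above: the mixing time at accuracy `eps` is the smallest $t$ at which the worst-case total variation distance to stationarity is at most `eps`, and the lazy chain is the kernel $(P + I)/2$. All function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def tv_distance(mu, nu):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))


def mixing_time(p, pi, eps=0.25, max_steps=10_000):
    """Smallest t with max_x ||P^t(x, .) - pi||_TV <= eps, or None if not found."""
    n = len(p)
    pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # P^0
    for t in range(max_steps + 1):
        if max(tv_distance(row, pi) for row in pt) <= eps:
            return t
        pt = mat_mul(pt, p)
    return None


def lazy(p):
    """Lazy version (P + I) / 2 of a transition matrix."""
    n = len(p)
    return [[0.5 * p[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


# Example: the deterministic flip on two states is periodic and never mixes,
# while its lazy version is exactly stationary after one step.
p_flip = [[0.0, 1.0], [1.0, 0.0]]
pi_flip = [0.5, 0.5]
t_plain = mixing_time(p_flip, pi_flip, max_steps=50)
t_lazy = mixing_time(lazy(p_flip), pi_flip)
```

The example illustrates why a lazy version is often considered alongside the original kernel: periodicity alone can make the plain mixing time infinite even on a two-point space.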
In practice, it is rare for transition kernels corresponding to MCMC algorithms on unbounded state spaces to have finite mixing times (this would require jumps of unbounded size, and these are often difficult to construct). For this reason, the mixing and maximal hitting times cannot be directly compared in a meaningful way. We will fix this problem by constructing related Markov chains that both (i) have finite mixing times and (ii) have dynamics that are “slightly faster” than those of . We introduce the following definition to formalize this idea:
Definition 2.2 (Dominated Chain).
Let be a transition kernel on state space with associated σ-field . We say that a transition kernel with support contained in is -dominated by if it satisfies the following two properties:
The stationary measure of is given by
for all .
The transitions of satisfy
for all and with .
There are many constructions in the literature that satisfy this -domination property, such as the usual trace of a chain on . In this paper, we focus on restrictions of Gibbs samplers and Metropolis–Hastings chains.
Definition 2.3 (Restricted Sampler).
Recall that there is a unique Metropolis–Hastings kernel associated with any reversible proposal kernel and target distribution . Let be any reversible transition kernel with stationary measure and let have . We define the restriction of to to be the Metropolis–Hastings kernel with proposal and target as in Item 1.
In the special case of the Gibbs sampler, another restriction is sometimes useful:
Definition 2.4 (Restricted Gibbs Samplers).
Recall that the Gibbs sampler is uniquely defined by the target distribution. If is a Gibbs sampler targeting and has , we define the Gibbs restriction of to to be the Gibbs sampler targeting the measure defined in Item 1.
It is clear that both of these constructions give -dominated chains, as does the usual trace process (see e.g. Section 6 of [beltran2010tunneling] for the last result); in general, all three are different. For the remainder of this section, we will always fix a kernel and denote by any specific kernel that is -dominated by .
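The restricted Metropolis–Hastings construction is easy to experiment with on a finite state space. The sketch below is an illustration under assumptions, not the paper's construction verbatim: states are integers, the target is given by unnormalized weights, the restriction simply zeroes the weights outside a chosen subset, and the names `metropolis_hastings_kernel` and `restrict_weights` are hypothetical.

```python
def metropolis_hastings_kernel(proposal, weights):
    """Metropolis-Hastings transition matrix for a finite state space.

    proposal: row-stochastic proposal matrix.
    weights:  unnormalized target weights (zero means outside the support).
    """
    n = len(weights)
    p = [[0.0] * n for _ in range(n)]
    for x in range(n):
        if weights[x] == 0.0:
            p[x][x] = 1.0  # state outside the support: hold in place
            continue
        for y in range(n):
            if y != x and proposal[x][y] > 0.0:
                ratio = (weights[y] * proposal[y][x]) / (weights[x] * proposal[x][y])
                p[x][y] = proposal[x][y] * min(1.0, ratio)
        # Rejected proposals stay at x; p[x][x] is still 0.0 inside the sum.
        p[x][x] = 1.0 - sum(p[x])
    return p


def restrict_weights(weights, subset):
    """Target weights of the restriction: unchanged on subset, zero elsewhere."""
    return [w if x in subset else 0.0 for x, w in enumerate(weights)]


# Example: random-walk proposal on a cycle of six states.
size = 6
proposal = [[0.0] * size for _ in range(size)]
for x in range(size):
    proposal[x][(x + 1) % size] = 0.5
    proposal[x][(x - 1) % size] = 0.5
weights = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
subset = {1, 2, 3, 4}

p_full = metropolis_hastings_kernel(proposal, weights)
q_restricted = metropolis_hastings_kernel(proposal, restrict_weights(weights, subset))
```

In this example, moves between distinct states of the subset have the same probability under `q_restricted` as under `p_full`, because the acceptance ratio is unchanged by renormalizing the target; only proposals that would leave the subset are rejected.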
The following hitting condition plays an essential role throughout the paper.
Fix . For , denote by the maximum hitting time of large sets for . We say that satisfies the -hitting condition if there exists such that .
For a fixed collection of constants , define to be the collection of kernels satisfying .
We recall that the main results in [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2] show that there exist universal constants so that this is satisfied for all in certain large classes.
It is straightforward to check that these conditions imply that for all , but for convenience we focus on a single such value.
Finally, let denote the mixing time for . We now show that we can obtain a minorization condition similar to Section 1 by combining the hitting condition in Definition 2.5 and the drift condition in Section 1.
Fix and pick as in Definition 2.5. Let . As , by assumption, we have
Next, we denote by a Markov chain with transition kernel starting at point , and we denote by a Markov chain with transition kernel starting at the same point . By Section 1, we have for all
Iterating this bound, we have
By assumption, , so we have
for all . By a union bound and Markov’s inequality, we have
By the coupling inequality and Section 2, we have . By the triangle inequality again, we have for all . ∎
The following result is a minor modification of [rosenthal1995minorization, Thm. 12].
Let satisfy Section 1. Let be a measurable subset of such that
for some .
for some .
Then for every and every we have
We first show the condition implies a pseudo-minorization condition from [Roberts2001], which is weaker than the minorization condition. By the definition of total variation distance, for any , there exists such that
Then we have . For any , define
Then it can be verified that is a valid probability measure and
Therefore, is a -pseudo-small set in the sense of [Roberts2001]. The result then follows directly from [Roberts2001, Proposition 2]. ∎
We now present the main result of this section.
3. Application to Gibbs and Metropolis–Hastings Samplers
3.1. Application to Gibbs Samplers
We begin by studying a large class of Gibbs samplers defined in our companion paper anderson2019mixhit2. We consider a class of Gibbs samplers targeting a measure supported on a compact subset of Euclidean space; without loss of generality we assume that the support of is inside of . For , let denote the projection to the -th coordinate. For and , let
the line in that passes through and is parallel to the ’th coordinate axis. Let be the connected component (the largest connected subset) of that contains . For the remainder of this section, we consider fixed , so that e.g. either or . We make the following assumption on :
has a continuous density function with respect to the Lebesgue measure. Furthermore, for every and .
We now set some further notation. Set and let be the usual conditional distribution of on (that is, the measure with density and support ). Define the typical Gibbs sampler with target by
for every and . In the context of the Gibbs sampler only, the -dominated chain will always refer to the chain from Definition 2.4.
Theorem 3.1 ([anderson2019mixhit2, Corollary 4.5]).
We now state the following elementary facts about drift conditions.
Suppose is a transition kernel satisfying Section 1 with function and constants and . Then we have
If is furthermore a Gibbs sampler, then
for any set of the form , any and any .
We now present the main result in this section:
Let be the transition kernel of a Gibbs sampler with target distribution . Suppose satisfies Section 1 and Definition 2.5 with compatibly for some compact with . Suppose satisfies Assumption 1. Suppose
for some , where and . Then there exists and such that
for all and . Moreover, and depend only on the parameters .
3.2. Application to Metropolis–Hastings Chain
In this section, we give results analogous to Lemma 3.3 for the following class of Metropolis–Hastings chains defined in anderson2019mixhit2:
Definition 3.5 (Metropolis–Hastings Chain).
Fix a distribution with continuous density function supported on a compact subset of Euclidean space; without loss of generality we assume that the support of is a subset of . We fix a reversible kernel with unique stationary measure whose support contains that of . We assume that has continuous density and that, for all for which it is defined, has density . Finally, we assume that the mapping is continuous. We define the acceptance function by the formula
Define to be the transition kernel given by the formula
Note that is defined uniquely by its target and proposal .
Theorem 3.6 ([anderson2019mixhit2, Cor. 6.5]).
In the context of the Metropolis–Hastings sampler only, the -dominated chain will always refer to the chain from Definition 2.3. We note that Lemma 3.2 remains true with “Gibbs” replaced by “Metropolis–Hastings” in the one place where it appears. We then have the main result of this section:
There exist a universal set of constants with the following property:
4. Comparison to Minorization and Illustrative Example
The main goal of this paper was to illustrate how the minorization condition 1.1 in Theorem 1.1 can be replaced by a hitting-time condition. We have not yet seriously addressed the obvious question: why would anyone wish to do this?
We would like to be able to say that using maximum hitting times gives better bounds. However, the equivalence results of [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2] say that this nice response can’t be correct. After all, if mixing times and maximum hitting times are equal up to some universal constants, methods based on maximum hitting times must give bounds that are essentially equivalent to those obtained from the pseudo-minorization condition of [Roberts2001].
Instead of obtaining better results, we claim that bounds on the maximum hitting time often have easier proofs. The “difficulty” of a proof is of course subjective, but we give evidence based on the following concrete claims:
Maximum hitting time bounds are never “harder” than pseudo-minorization bounds: Any pseudo-minorization bound can immediately be turned into a maximum hitting time bound (see 4).
In some realistic examples, we can compute maximum hitting bounds but not pseudo-minorization bounds: We have a broad class of simple but realistic examples for which we can compute good bounds on the maximum hitting time, but have no realistic way to directly compute good bounds on the mixing time (see Theorem 4.2).
To see the first, consider any Markov chain with transition kernel , unique stationary measure and mixing time . Then for any measurable with and starting point ,
which immediately implies
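The standard argument behind this kind of comparison can be sketched as follows. The notation is assumed for illustration: $t = t_{\mathrm{mix}}(\epsilon)$ is the mixing time at accuracy $\epsilon$, the set $A$ satisfies $\pi(A) \ge \alpha$, and $\epsilon < \alpha$. The constants in the paper's own statement may differ.

```latex
% Assumed notation: t = t_mix(eps), pi(A) >= alpha > eps.
\begin{align*}
P^{t}(x, A)
  &\ge \pi(A) - \| P^{t}(x,\cdot) - \pi \|_{\mathrm{TV}}
   \ge \alpha - \epsilon
  && \text{for every starting point } x, \\
\mathbb{P}_x(\tau_A > k t)
  &\le \bigl(1 - (\alpha - \epsilon)\bigr)^{k}
  && \text{by the Markov property at times } t, 2t, \dots, \\
\mathbb{E}_x[\tau_A]
  &\le t \sum_{k \ge 0} \mathbb{P}_x(\tau_A > k t)
   \le \frac{t}{\alpha - \epsilon}.
\end{align*}
```

Taking the supremum over starting points and over sets with $\pi(A) \ge \alpha$ then bounds the maximum hitting time by a constant multiple of the mixing time.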
Next, we consider a class of examples with state space . We begin by defining the following class of nearly-unimodal measures. These are meant to mimic the typical “small set” used in a pseudo-minorization argument: we expect the target distribution to be reasonably close to unimodal, but we might not have much detailed control over the size of the mode or fluctuations.
Fix and . For a density and constant , denote by the ’th quantile of . We say that a density on is unimodal if there exists some such that
for all and
for all . We say that a density is -nearly unimodal if
is unimodal, and
for all , and
for all .
We have the following:
Fix , . There exists a universal constant so that the following holds:
Let be a continuous density that is -nearly unimodal for some . For , let be the transition kernel given in Definition 3.5 with proposal kernel and target . Then the maximum hitting time of satisfies
for all sufficiently large.
We check that the conclusion holds for a simple discretization of the process, then translate the bound back to our original kernel . For , we define the transition kernel of a closely-related “birth-and-death” chain on by the formula
and . Since is a birth-and-death chain, it is reversible with respect to some measure .
Next, fix a measurable set with ; by the (uniform) continuity of , it is straightforward to check that for all sufficiently large. Since , there exists a point so that
Similarly, there exists satisfying 4. Denote by , the closest points in to and , respectively (break ties arbitrarily).
Denote by the hitting time of a set for a transition kernel , extending the notion in Equation 1. Since birth-and-death chains can’t “skip” any values,
The expectations appearing in the middle expression have simple explicit formulas (see e.g. [palacios1996note]). Comparing the explicit formula for to the explicit formula for the simple random walk on , we see that all terms in the explicit formula differ by at most a multiplicative factor depending only on . Thus, there exists a universal constant such that
Since (see e.g. Example 10.20 of [markovmix]), we have shown that
Define . By 4, there exists a universal constant such that
We now relate this to . Fix some and consider sample paths , . Applying standard convergence results (e.g. Theorem 2.5 of [ethier2009markov]), one can quickly confirm that both of these processes converge to the same limiting continuous-time process in Skorohod’s topology on as and time is rescaled in the usual way, as . Since can’t jump more than distance , this convergence implies the existence of a constant such that
Since this holds for all measurable , this completes the proof. ∎
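The birth-and-death comparison in the proof relies on explicit hitting-time formulas. As a minimal illustration, the sketch below uses the standard one-step recursion for birth-and-death chains rather than the specific formula of [palacios1996note]; the function name and the argument names `up` and `down` are hypothetical. If $e_k$ denotes the expected time to move from $k$ to $k+1$, then $e_k = (1 + q_k e_{k-1}) / p_k$, where $p_k$ and $q_k$ are the probabilities of stepping up and down from $k$ and $q_0 = 0$.

```python
def birth_death_hitting_time(up, down, start, target):
    """Expected time for a birth-and-death chain on {0, 1, ...} to first reach
    `target` when started from `start` <= `target`.

    up[k]   : probability of moving from k to k + 1 (must be positive).
    down[k] : probability of moving from k to k - 1 (down[0] must be 0).
    """
    step_time = 0.0  # expected time to go from k - 1 to k; unused when k = 0
    total = 0.0
    for k in range(target):
        # One-step analysis: e_k = (1 + q_k * e_{k-1}) / p_k.
        step_time = (1.0 + down[k] * step_time) / up[k]
        if k >= start:
            total += step_time
    return total


# Example: simple random walk on {0, ..., n} reflected at 0.
n = 10
up = [1.0] + [0.5] * (n - 1)
down = [0.0] + [0.5] * (n - 1)
end_to_end = birth_death_hitting_time(up, down, start=0, target=n)
```

For the reflected simple random walk the recursion gives $e_k = 2k + 1$, so the end-to-end expected hitting time is $n^2$, which is the quadratic scaling that the comparison with the simple random walk exploits.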
When , the kernel analyzed above is exactly the usual “ball” walk on , so this upper bound is in fact sharp. While our analysis is fairly simple, we know of no simple way to get comparable results by directly analyzing the mixing time. See further discussion in e.g. [johndrow2018fast, yuen2000applications], where the methods studied cannot give bounds better than for any walk in this class.