Drift, Minorization, and Hitting Times

10/14/2019 ∙ by Robert M. Anderson, et al. ∙ 0

The “drift-and-minorization” method, introduced and popularized in (Rosenthal, 1995; Meyn and Tweedie, 1994; Meyn and Tweedie, 2012), remains the most popular approach for bounding the convergence rates of Markov chains used in statistical computation. This approach requires estimates of two quantities: the rate at which a single copy of the Markov chain “drifts” towards a fixed “small set”, and a “minorization condition” which gives the worst-case time for two Markov chains started within the small set to couple with moderately large probability. In this paper, we build on (Oliveira, 2012; Peres and Sousi, 2015) and our work (Anderson, Duanmu, Smith, 2019a; Anderson, Duanmu, Smith, 2019b) to replace the “minorization condition” with an alternative “hitting condition” that is stated in terms of only one Markov chain.


1. Introduction

Since the seminal article by gelfand1990sampling, Markov chain Monte Carlo (MCMC) methods have become ubiquitous in statistical computing. These methods suffer from a well-known problem: although it is easy to construct many MCMC algorithms that are guaranteed to converge eventually, it is typically very difficult to tell how long convergence will take for any given MCMC algorithm (see e.g. [jones2001honest, diaconis2008gibbs]). Even worse, it is usually difficult to tell whether an MCMC algorithm has converged even after running it (see e.g. [gelman1992inference] for an important early paper in the “diagnostics” literature, [bhatnagar2011computational] for a proof that the problem is intractable in general, and [hsu2015mixing, huber2016perfect] for cases where the problem becomes tractable).

In the decades since the publication of gelfand1990sampling, many techniques have been developed to try to solve this problem for classes of MCMC algorithms that are important in statistics (see e.g. the surveys by jones2001honest,diaconis2009markov). The most popular of these techniques is the “drift-and-minorization” method, introduced and popularized in rosenthal1995minorization, meyn1994computable, meyn2012markov. The purpose of this note is to replace the minorization condition (see Inequality 1) by a hitting condition (see Definition 2.5). It is straightforward to check that the usual drift-and-minorization condition immediately implies our drift-and-hitting condition, and in this sense our assumptions are weaker (and thus our result stronger) than typical drift-and-minorization conditions; see Section 4 for further discussion of the relationship between these conditions.

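We now set some notation and recall an important special case of the drift-and-minorization bound. Throughout this paper, unless otherwise mentioned, we use $\Omega$ to denote the state space of the underlying Markov process. Denote by $P$ the transition kernel of some Markov chain on state space $\Omega$ with (unique) stationary measure $\pi$ on a Borel $\sigma$-field $\mathcal{B}$. For any $t \in \mathbb{N}$, we write $P^{t}$ to denote the $t$-step transition kernel generated from $P$.

We say that $P$ exhibits a “drift” or “Lyapunov” condition if there exists a function $V : \Omega \to [0, \infty)$ and constants $0 < \lambda < 1$, $0 \le b < \infty$, so that

$$(PV)(x) \le \lambda V(x) + b$$

for all $x \in \Omega$. We say that $P$ exhibits a “minorization” condition if there exists a set $C \in \mathcal{B}$, constants $\epsilon > 0$, $t_0 \in \mathbb{N}$, and measure $\mu$ on $\Omega$ so that

$$P^{t_0}(x, \cdot) \ge \epsilon \, \mu(\cdot)$$

for all $x \in C$. We say that $P$ satisfies the drift and minorization conditions compatibly if it satisfies both and there exists $d > \frac{2b}{1 - \lambda}$ such that

$$C = \{x \in \Omega : V(x) \le d\}.$$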

[rosenthal1995minorization, Thm. 12] says that any kernel satisfying these conditions compatibly is geometrically ergodic, and also gives quantitative bounds on the convergence rate:

Theorem 1.1 (Paraphrase of [rosenthal1995minorization, Thm. 12]).

Let the kernel satisfy the drift and minorization conditions compatibly. Then there exists and such that

for all and . Moreover, there is an explicit formula for and in terms of the constants .

There are more sophisticated versions of this drift-and-minorization bound, including results in the same paper [rosenthal1995minorization], but most are based on two conditions that are similar to the drift and minorization conditions above.
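To make the drift condition concrete, the following sketch checks it numerically for a toy chain. Everything in it is an illustrative assumption rather than part of the paper: the state space is the finite set {0, ..., n}, the chain is a reflected random walk with a downward bias, the Lyapunov function is V(x) = r^x, and the names `reflected_walk` and `drift_offset` are hypothetical. It reads the drift condition as (PV)(x) ≤ λV(x) + b and the compatibility requirement as a small set {V ≤ d} with d > 2b/(1 − λ), as in Rosenthal's theorem.

```python
def reflected_walk(n, up=0.3):
    """Random walk on {0, ..., n}: up w.p. `up`, down otherwise; blocked moves hold."""
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for x in range(n + 1):
        P[x][min(x + 1, n)] += up
        P[x][max(x - 1, 0)] += 1.0 - up
    return P


def drift_offset(P, V, lam):
    """Smallest b >= 0 with (PV)(x) <= lam * V(x) + b at every state x."""
    worst = 0.0
    for x, row in enumerate(P):
        PV_x = sum(p * v for p, v in zip(row, V))
        worst = max(worst, PV_x - lam * V[x])
    return worst


n, up, r = 10, 0.3, 1.5
P = reflected_walk(n, up)
V = [r ** x for x in range(n + 1)]       # assumed Lyapunov function V(x) = r^x
lam = up * r + (1.0 - up) / r            # contraction factor at interior states
b = drift_offset(P, V, lam)              # only the boundary state 0 needs b > 0
d_threshold = 2.0 * b / (1.0 - lam)      # compatibility asks for d above this value
small_set = [x for x in range(n + 1) if V[x] <= d_threshold + 1.0]
print(lam, b, d_threshold, small_set)
```

In this toy example the drift inequality holds with equality in the interior, so the constant b is determined entirely by the behaviour at the reflecting boundary, and the candidate small set is a short interval near 0.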

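The goal of the present paper is to present results similar to Theorem 1.1, in which the minorization condition has been replaced by related hitting conditions. Recall that the hitting time of a measurable set $A$ for a Markov chain $\{X_t\}_{t \ge 0}$ is:

$$\tau_A = \min\{t \ge 0 : X_t \in A\}.$$

We now introduce the maximum hitting time (of sets with large measure):

Definition 1.2.

Let $\alpha \in (0, 1)$. The maximum hitting time (with parameter $\alpha$) is

$$t_H(\alpha, P) = \sup_{x, \, A \, : \, \pi(A) \ge \alpha} \mathbb{E}_x[\tau_A],$$

where the supremum is over starting points $x$ and measurable sets $A$ of stationary measure at least $\alpha$, $\mathbb{E}_x$ is the expectation with respect to a measure on the space that generates the underlying Markov process, and the subscript $x$ is the starting point of the Markov process.

When the kernel is clear from the context, we write $t_H(\alpha)$.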
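For a chain on a handful of states, the maximum hitting time can be computed by brute force, which may help build intuition for the definition. The sketch below is only an illustration under stated assumptions: the state space is finite, the kernel is a row-stochastic list of lists, and the helper names `solve`, `expected_hitting_times` and `max_hitting_time` are hypothetical. It enumerates every set of stationary mass at least α, so it is not meant to scale.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def expected_hitting_times(P, A):
    """E_x[tau_A] for every state x, where tau_A = min{t >= 0 : X_t in A}."""
    n = len(P)
    out = [s for s in range(n) if s not in A]
    h = [0.0] * n
    if out:
        # Off A, first-step analysis gives h(x) = 1 + sum_{y not in A} P[x][y] h(y).
        a = [[(1.0 if x == y else 0.0) - P[x][y] for y in out] for x in out]
        for x, v in zip(out, solve(a, [1.0] * len(out))):
            h[x] = v
    return h


def max_hitting_time(P, pi, alpha):
    """Largest E_x[tau_A] over all starts x and all sets A with pi(A) >= alpha."""
    n = len(P)
    best = 0.0
    for k in range(1, n + 1):
        for A in combinations(range(n), k):
            if sum(pi[s] for s in A) >= alpha - 1e-12:
                best = max(best, max(expected_hitting_times(P, set(A))))
    return best


# Lazy simple random walk on a 4-cycle; the stationary measure is uniform.
P = [[0.5 if i == j else (0.25 if (i - j) % 4 in (1, 3) else 0.0)
      for j in range(4)] for i in range(4)]
pi = [0.25] * 4
print(max_hitting_time(P, pi, 0.5), max_hitting_time(P, pi, 0.25))
```

Lowering α admits smaller target sets, so the maximum hitting time can only grow as α decreases.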

1.1. Guide to Paper

We give our main drift-and-hit theorem in Section 2, show that its conditions are satisfied for various common MCMC chains in Section 3, and describe why it can (sometimes) give better results than the usual drift-and-minorization theorem in Section 4.

1.2. Previous Work

This paper is based on the asymptotic equivalence of mixing and hitting times established in the sequence of papers [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2]. This relationship allows us to show that our hitting condition (see Definition 2.5) implies something very close to the minorization condition, and thus to use the framework of rosenthal1995minorization. We note that our recent works anderson2018mixhit, anderson2019mixhit2 are based on the work of finitemixhit, oliveira2012mixing, which initially established this equivalence of mixing and hitting times in the special case that the state space is finite. anderson2018mixhit and anderson2019mixhit2 make heavy use of the hyperfinite representation of Markov processes established by Markovpaper.

2. Bounded Mixing Times for Dominated Chains

We recall the definition of mixing time:

Definition 2.1.

Fix and . The mixing time in with respect to is defined as

The lazy mixing time in with respect to is

where .

For notational convenience, we write or to mean or , write or to mean or , and write or to mean or .
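The difference between the mixing time and the lazy mixing time is easiest to see on a periodic chain. The following sketch is an illustration only, under assumptions that are not taken from the paper: the state space is finite, the mixing time is the first time the worst-case total variation distance to stationarity drops to ε, and the lazy kernel is taken to be the usual average of the kernel with the identity. The function names and the cutoff `t_max` are hypothetical.

```python
def mat_mul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def worst_tv(Pt, pi):
    """Largest total variation distance between a row of Pt and pi."""
    return max(0.5 * sum(abs(p - q) for p, q in zip(row, pi)) for row in Pt)


def mixing_time(P, pi, eps, t_max=1000):
    """First t <= t_max with worst-case TV distance at most eps, else None."""
    n = len(P)
    Pt = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for t in range(t_max + 1):
        if worst_tv(Pt, pi) <= eps:
            return t
        Pt = mat_mul(Pt, P)
    return None


def lazy(P):
    """Lazy version of P: hold with probability 1/2, otherwise move according to P."""
    n = len(P)
    return [[0.5 * P[i][j] + (0.5 if i == j else 0.0) for j in range(n)]
            for i in range(n)]


flip = [[0.0, 1.0], [1.0, 0.0]]   # deterministic two-state flip, period 2
pi = [0.5, 0.5]
print(mixing_time(flip, pi, 0.25))        # None: the periodic chain never mixes
print(mixing_time(lazy(flip), pi, 0.25))  # 1: the lazy chain mixes in one step
```

The periodic chain keeps a total variation distance of 1/2 from stationarity forever, while its lazy version is exactly stationary after one step.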

In practice, it is rare for transition kernels corresponding to MCMC algorithms on unbounded state spaces to have finite mixing times (this would require jumps of unbounded size, and these are often difficult to construct). For this reason, the mixing and maximal hitting times cannot be directly compared in a meaningful way. We will fix this problem by constructing related Markov chains that both (i) have finite mixing times and (ii) have dynamics that are “slightly faster” than those of . We introduce the following definition to formalize this idea:

Definition 2.2 (Dominated Chain).

Let be a transition kernel on state space with associated -field . We say that a transition kernel with support contained in is -dominated by if it satisfies the following two properties:

  1. The stationary measure of is given by

    for all .

  2. The transitions of satisfy

    for all and with .

There are many constructions in the literature that satisfy this -domination property, such as the usual trace of a chain on . In this paper, we focus on restrictions of Gibbs samplers and Metropolis–Hastings chains.

Definition 2.3 (Restricted Sampler).

Recall that there is a unique Metropolis–Hastings kernel associated with any reversible proposal kernel and target distribution . Let be any reversible transition kernel with stationary measure and let have . We define the restriction of to to be the Metropolis–Hastings kernel with proposal and target as in Item 1.

In the special case of the Gibbs sampler, another restriction is sometimes useful:

Definition 2.4 (Restricted Gibbs Samplers).

Recall that the Gibbs sampler is uniquely defined by the target distribution. If is a Gibbs sampler targeting and has , we define the Gibbs restriction of to to be the Gibbs sampler targeting the measure defined in Item 1.

It is clear that both of these constructions give -dominated chains, as does the usual trace process (see e.g. Section 6 of [beltran2010tunneling] for the last result); in general, all three are different. For the remainder of this section, we will always fix a kernel and denote by any specific kernel that is -dominated by .
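On a finite state space the restricted sampler of Definition 2.3 can be built explicitly and checked by hand. The sketch below is a toy illustration, not the paper's construction in full generality. It assumes that domination means two things: the target conditioned on the restricting set is stationary for the new kernel, and transitions between distinct points of that set are no less likely than under the original kernel. The example chain, the weights `w`, and the function names are hypothetical.

```python
def metropolis_path(w):
    """Nearest-neighbour Metropolis kernel on {0, ..., n-1}, reversible w.r.t. w."""
    n = len(w)
    P = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in (x - 1, x + 1):
            if 0 <= y < n:
                P[x][y] = 0.5 * min(1.0, w[y] / w[x])
        P[x][x] = 1.0 - sum(P[x])   # rejected or out-of-range proposals hold
    return P


def restrict(P, pi, K):
    """Metropolis-Hastings kernel with proposal P and target pi conditioned on K."""
    mass = sum(pi[x] for x in K)
    pi_K = {x: pi[x] / mass for x in K}
    Q = {x: {y: 0.0 for y in K} for x in K}
    for x in K:
        for y in K:
            if y != x and P[x][y] > 0.0:
                accept = min(1.0, pi_K[y] * P[y][x] / (pi_K[x] * P[x][y]))
                Q[x][y] = P[x][y] * accept
        Q[x][x] = 1.0 - sum(Q[x].values())  # proposals leaving K are rejected
    return Q, pi_K


w = [1.0, 2.0, 4.0, 2.0, 1.0]
pi = [v / sum(w) for v in w]
P = metropolis_path(w)
K = [1, 2, 3]
Q, pi_K = restrict(P, pi, K)

# Assumed property 1: pi conditioned on K is stationary for Q.
stationary = all(abs(sum(pi_K[x] * Q[x][y] for x in K) - pi_K[y]) < 1e-12 for y in K)
# Assumed property 2: moves inside K are at least as likely under Q as under P.
dominated = all(Q[x][y] >= P[x][y] - 1e-12 for x in K for y in K if x != y)
print(stationary, dominated)
```

Because the proposal is already reversible with respect to the target, every proposal that stays inside the set is accepted, and the only change is that proposals leaving the set become holding steps.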

The following hitting condition plays an essential role throughout the paper.

Definition 2.5.

Fix . For , denote by the maximum hitting time of large sets for . We say that satisfies the -hitting condition if there exists such that .

For a fixed collection of constants , define to be the collection of kernels satisfying .

We recall that the main results in [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2] show that there exist universal constants so that this is satisfied for all in certain large classes.

Definition 2.6.

We say that satisfies the drift condition 1 and the -hitting condition (Definition 2.5) compatibly for if:

  1. satisfies both the drift condition 1 and the -hitting condition.

  2. satisfies the drift condition 1 with the same as .

  3. .

  4. for some .

It is straightforward to check that these conditions imply that for all , but for convenience we focus on a single such value.

Finally, let denote the mixing time for . We now show that we can obtain a condition similar to the minorization condition of Section 1 by combining the hitting condition in Definition 2.5 and the drift condition of Section 1.

Theorem 2.7.

Let be a transition kernel that satisfies the drift condition of Section 1 and the hitting condition of Definition 2.5 compatibly for some with and some . Define

for some . Suppose for some . Then, for , we have

Proof.

Fix and pick as in Definition 2.5. Let . As , by assumption, we have

Next, we denote by a Markov chain with transition kernel starting at point , and we denote by a Markov chain with transition kernel starting at the same point . By the drift condition, we have for all

Iterating this bound, we have

By assumption, , so we have

for all . By a union bound and Markov’s inequality, we have

By the coupling inequality and Section 2, we have . By the triangle inequality again, we have for all . ∎

The following result is a minor modification of [rosenthal1995minorization, Thm. 12].

Lemma 2.8.

Let the kernel satisfy the drift condition of Section 1. Let be a measurable subset of such that

  1. for some .

  2. for some .

Then for every and every we have

Proof.

We first show the condition implies a pseudo-minorization condition from [Roberts2001], which is weaker than the minorization condition. By the definition of total variation distance, for any , there exists such that

Then we have . For any , define

Then it can be verified that is a valid probability measure and

Therefore, is a -pseudo-small set in the sense of [Roberts2001]. The result then follows directly from [Roberts2001, Proposition 2]. ∎

We now present the main result of this section.

Theorem 2.9.

Let be a transition kernel that satisfies the drift condition of Section 1 with function and constants and , and the hitting condition of Definition 2.5, compatibly for some with and some . Define

for some . Suppose for some . Then there exists and such that

for all and . Moreover, there is an explicit formula for and in terms of the constants .

Proof.

This follows from Theorem 2.7 and Lemma 2.8. ∎

3. Application to Gibbs and Metropolis–Hastings Samplers

3.1. Application to Gibbs Samplers

We begin by studying a large class of Gibbs samplers defined in our companion paper anderson2019mixhit2. We consider a class of Gibbs samplers targeting a measure supported on a compact subset of Euclidean space; without loss of generality we assume that the support of is inside of . For , let denote the projection to the -th coordinate. For and , let

the line in that passes through and is parallel to the ’th coordinate axis. Let be the connected component (the largest connected subset) of that contains . For the remainder of this section, we consider fixed , so that e.g. either or . We make the following assumption on :

Assumption 1.

has a continuous density function with respect to the Lebesgue measure. Furthermore, for every and .

We now set some further notation. Set and let be the usual conditional distribution of on (that is, the measure with density and support ). Define the typical Gibbs sampler with target by

for every and . In the context of the Gibbs sampler only, the -dominated chain will always refer to the chain from Definition 2.4.

Theorem 3.1 ([anderson2019mixhit2, Corollary 4.5]).

Let . Suppose the target distribution is supported on a compact subset of Euclidean space. Then there exist universal constants with the following property: for every of the form 3.1 with satisfying Assumption 1, we have

We now state the following elementary facts about drift conditions.

Lemma 3.2.

Suppose is a transition kernel satisfying the drift condition of Section 1 with function and constants and . Then we have

If is furthermore a Gibbs sampler, then

for any set of the form , any and any .

We now present the main result in this section:

Lemma 3.3.

There exists a universal set of constants with the following property: Any Gibbs sampler that satisfies Assumption 1 and conditions (1), (4) of Definition 2.6 for some set of the form in fact satisfies all four conditions of Definition 2.6 with this choice of .

Proof.

Condition (2) is immediate from Lemma 3.2. Condition (3) is exactly the content of Theorem 3.1. ∎

Putting together Lemma 3.3 and Lemma 2.8 immediately gives:

Theorem 3.4.

Let be the transition kernel of a Gibbs sampler with target distribution . Suppose satisfies the drift condition of Section 1 and the hitting condition of Definition 2.5 with compatibly for some compact with . Suppose satisfies Assumption 1. Suppose

for some , where and . Then there exists and such that

for all and . Moreover, and depend only on the parameters .

3.2. Application to Metropolis–Hastings Chain

In this section, we give results analogous to Lemma 3.3 for the following class of Metropolis–Hastings chains defined in anderson2019mixhit2:

Definition 3.5 (Metropolis–Hastings Chain).

Fix a distribution with continuous density function supported on a compact subset of Euclidean space; without loss of generality we assume that the support of is a subset of . We fix a reversible kernel with unique stationary measure whose support contains that of . We assume that has continuous density and that, for all for which it is defined, has density . Finally, we assume that the mapping is continuous. We define the acceptance function by the formula

Define to be the transition kernel given by the formula

Note that is defined uniquely by its target and proposal .

Theorem 3.6 ([anderson2019mixhit2, Cor. 6.5]).

Let . Then there exist universal constants with the following property: for every Metropolis–Hastings chain of the form 3.5 satisfying the conditions in Definition 3.5, we have

In the context of the Metropolis–Hastings sampler only, the -dominated chain will always refer to the chain from Definition 2.3. We note that Lemma 3.2 remains true with “Gibbs” replaced by “Metropolis–Hastings” the one time it is used. We then have the main result of this section:

Lemma 3.7.

There exists a universal set of constants with the following property:

Any Metropolis–Hastings sampler of the form 3.5 satisfying the conditions in Definition 3.5 and conditions (1), (4) of Definition 2.6 for some set of the form in fact satisfies all four conditions of Definition 2.6 with this choice of .

Proof.

Condition (2) is immediate from Lemma 3.2 (as we have just noted, this applies with “Gibbs” replaced by “Metropolis–Hastings”). Condition (3) is exactly the content of Theorem 3.6. ∎

Combining Lemma 2.8 and Lemma 3.7 tells us that Theorem 3.4 holds for Metropolis–Hastings samplers as well as Gibbs samplers (to save space, we do not restate the theorem).

4. Comparison to Minorization and Illustrative Example

The main goal of this paper was to illustrate how the minorization condition in Theorem 1.1 can be replaced by a hitting-time condition. We have not yet seriously addressed the obvious question: why would anyone wish to do this?

We would like to be able to say that using maximum hitting times gives better bounds. However, the equivalence results of [oliveira2012mixing, finitemixhit, basu2015characterization, anderson2018mixhit, anderson2019mixhit2] say that this appealing answer cannot be correct. After all, if mixing times and maximum hitting times are equal up to universal constants, methods based on maximum hitting times must give bounds that are essentially equivalent to those obtained from the pseudo-minorization condition of [Roberts2001].

Instead of obtaining better results, we claim that bounds on the maximum hitting time often have easier proofs. The “difficulty” of a proof is of course subjective, but we give evidence based on the following concrete claims:

  1. Maximum hitting time bounds are never “harder” than pseudo-minorization bounds: Any pseudo-minorization bound can immediately be turned into a maximum hitting time bound (see the argument following this list).

  2. In some realistic examples, we can compute maximum hitting bounds but not pseudo-minorization bounds: We have a broad class of simple but realistic examples for which we can compute good bounds on the maximum hitting time, but have no realistic way to directly compute good bounds on the mixing time (see Theorem 4.2).

To see the first, consider any Markov chain with transition kernel , unique stationary measure and mixing time . Then for any measurable with and starting point ,

which immediately implies

Next, we consider a class of examples with state space . We begin by defining the following class of nearly-unimodal measures. These are meant to mimic the typical “small set” used in a pseudo-minorization argument: we expect the target distribution to be reasonably close to unimodal, but might not have much detailed control over the size of the mode or fluctuations.

Definition 4.1.

Fix and . For a density and constant , denote by the ’th quantile of . We say that a density on is unimodal if there exists some such that

for all and

for all . We say that a density is -nearly unimodal if

  1. is unimodal, and

  2. for all , and

  3. for all .

We have the following:

Theorem 4.2.

Fix , . There exists a universal constant so that the following holds:

Let be a continuous density that is -nearly unimodal for some . For , let be the transition kernel given in Definition 3.5 with proposal kernel and target . Then the maximum hitting time of satisfies

for all sufficiently large.

Proof.

We check that the conclusion holds for a simple discretization of the process, then translate the bound back to our original kernel . For , we define the transition kernel of a closely-related “birth-and-death” chain on by the formula

and . Since is a birth-and-death chain, it is reversible with respect to some measure .

Next, fix a measurable set with ; by the (uniform) continuity of , it is straightforward to check that for all sufficiently large. Since , there exists a point so that

Similarly, there exists satisfying 4. Denote by , the closest points in to and , respectively (break ties arbitrarily).

Denote by the hitting time of a set for a transition kernel , extending the notion in Equation 1. Since birth-and-death chains can’t “skip” any values,

The expectations appearing in the middle expression have simple explicit formulas (see e.g. [palacios1996note]). Comparing the explicit formula for to the explicit formula for the simple random walk on , we see that all terms in the explicit formula differ by at most a multiplicative factor depending only on . Thus, there exists a universal constant such that

Since (see e.g. Example 10.20 of [markovmix]), we have shown that

Define . By 4, there exists a universal constant such that

We now relate this to . Fix some and consider sample paths , . Applying standard convergence results (e.g. Theorem 2.5 of [ethier2009markov]), one can quickly confirm that both of these processes converge to the same limiting continuous-time process in Skorohod’s topology on as and time is rescaled in the usual way, as . Since can’t jump more than distance , this convergence implies the existence of a constant such that

Combining this with Inequalities 4 and 4, we have

Since this holds for all measurable , this completes the proof. ∎
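The comparison with simple random walk used in the proof can be explored numerically on a grid. The sketch below is a hedged illustration rather than the paper's discretization: it assumes a nearest-neighbour Metropolis walk on {0, ..., n} with positive target weights `w`, and uses first-step analysis to compute the expected time to cross from the left endpoint to the right endpoint. The function names and the example weights are hypothetical.

```python
def up_crossing_times(p, q):
    """e[k] = E_k[tau_{k+1}] for a birth-and-death chain with up/down probs p[k], q[k]."""
    e = []
    for k in range(len(p)):
        prev = e[k - 1] if k > 0 else 0.0
        # First-step analysis: e_k = 1 + (1 - p_k - q_k) e_k + q_k (e_{k-1} + e_k),
        # which rearranges to e_k = (1 + q_k e_{k-1}) / p_k.
        e.append((1.0 + q[k] * prev) / p[k])
    return e


def left_to_right_time(w):
    """E_0[tau_n] for the Metropolis walk on {0, ..., n} with target weights w."""
    n = len(w) - 1
    p = [0.5 * min(1.0, w[k + 1] / w[k]) for k in range(n)]
    q = [0.0] + [0.5 * min(1.0, w[k - 1] / w[k]) for k in range(1, n)]
    # Birth-and-death chains cannot skip states, so the crossing times add up.
    return sum(up_crossing_times(p, q))


n = 10
flat = [1.0] * (n + 1)                                # simple random walk case
bumpy = [1.0 + 0.5 * (k % 2) for k in range(n + 1)]   # ratios stay in [2/3, 3/2]
t_flat = left_to_right_time(flat)
t_bumpy = left_to_right_time(bumpy)
print(t_flat, t_bumpy, t_bumpy / t_flat)
```

For the flat target the crossing time is n(n + 1), the familiar quadratic growth for simple random walk on a path. Varying `bumpy` while keeping the neighbouring weight ratios bounded gives a feel for how the ratio to the flat case behaves.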

When , the kernel analyzed above is exactly the usual “ball” walk on , so this upper bound is in fact sharp. While our analysis is fairly simple, we know of no simple way to get comparable results by directly analyzing the mixing time. See further discussion in e.g. [johndrow2018fast, yuen2000applications], where the methods studied cannot give bounds better than for any walk in this class.