Restart perturbations for lazy, reversible Markov chains: trichotomy and pre-cutoff equivalence

07/05/2019
by   Daniel Vial, et al.
University of Michigan

Given a lazy, reversible Markov chain with n states and transition matrix P_n, a distribution σ_n over the states, and some α_n ∈ (0,1), we consider restart perturbations, which take the following form: with probability 1−α_n, sample the next state from P_n; with probability α_n, sample the next state from σ_n (i.e. "restart" at a random state). Our main object of study is ‖π_n − π̃_n‖, where π_n and π̃_n are the stationary distributions of the original and perturbed chains and ‖·‖ denotes total variation. Our first result characterizes ‖π_n − π̃_n‖ in terms of the ϵ-mixing times t_mix^(n)(ϵ) of the P_n chain, assuming these mixing times exhibit cutoff. Namely, we show that if α_n t_mix^(n)(ϵ) → 0, then ‖π_n − π̃_n‖ → 0 for any restart perturbation; if α_n t_mix^(n)(ϵ) → ∞, then ‖π_n − π̃_n‖ → 1 for some restart perturbation; and if α_n t_mix^(n)(ϵ) → c ∈ (0,∞), then lim sup_{n→∞} ‖π_n − π̃_n‖ ≤ 1 − e^{−c} for any restart perturbation (and the bound is tight). Similar "trichotomies" have appeared in several recent results; however, these existing results consider generative models for the chain, whereas ours applies more broadly. Our second result shows the weaker notion of pre-cutoff is (almost) equivalent to a certain notion of "sensitivity to perturbation", in the sense that ‖π_n − π̃_n‖ → 1 for certain perturbations. This complements a recent result by Basu, Hermon, and Peres, which shows that cutoff is equivalent to a certain notion of "hitting time cutoff".


1 Introduction

Markov chains are common tools for modeling complex phenomena, such as the movement of asset prices in financial markets or the processing of tasks in data centers. A fundamental concern is how modeling inaccuracies affect the chain’s steady-state behavior, i.e. how changes to the chain’s transition matrix affect its stationary distribution. Mathematically, we formalize this as follows. Let P_n be the transition matrix of a Markov chain with n states and stationary distribution π_n. Denote by P̃_n the transition matrix and by π̃_n the stationary distribution of another chain, obtained by perturbing each row of P_n by at most α_n (in total variation). Then the main question we study is as follows: how does the perturbation magnitude α_n relate to the error magnitude ‖π_n − π̃_n‖ (where ‖·‖ denotes total variation) as the number of states n grows?

Before previewing our results, we briefly outline two basic notions that play prominent roles. The first notion is a class of perturbations we call restart perturbations, for which P̃_n is obtained from P_n as follows. From the current state, flip a coin that lands heads with probability α_n. If heads, sample the next state from some auxiliary distribution σ_n (i.e. “restart” the chain at a random state, distributed as σ_n); if tails, sample the next state from P_n (i.e. follow the original chain). In the case where P_n describes the simple random walk on some underlying graph, this perturbation is more commonly known as PageRank [11], a model for Internet browsing. (Here nodes in the underlying graph are web pages and edges are hyperlinks between pages. Choosing the next state from P_n corresponds to following a hyperlink; “restarting” corresponds to typing in a new page’s web address.) Also, this perturbation yields an example of a Doeblin chain, for which so-called “perfect sampling” is possible [1, 12].
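To make restart perturbations concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the toy chain and parameter values are arbitrary) that builds P̃ = (1−α)P + α 1^⊤σ for a small chain and measures how far the stationary distribution moves in total variation:

    import numpy as np

    def stationary(P):
        """Stationary distribution of an irreducible P: the left eigenvector
        for eigenvalue 1, normalized to sum to 1."""
        w, V = np.linalg.eig(P.T)
        pi = np.real(V[:, np.argmax(np.real(w))])
        return pi / pi.sum()

    def restart_perturbation(P, alpha, sigma):
        """P_tilde = (1 - alpha) * P + alpha * 1^T sigma: each row mixes in sigma."""
        return (1 - alpha) * P + alpha * np.outer(np.ones(P.shape[0]), sigma)

    def tv(mu, nu):
        """Total variation distance between distributions mu and nu."""
        return 0.5 * np.abs(mu - nu).sum()

    # Toy chain: lazy random walk on a cycle of n states.
    n = 50
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = 0.5
        P[i, (i - 1) % n] = P[i, (i + 1) % n] = 0.25

    sigma = np.eye(n)[0]              # restart at state 0 (a point mass)
    for alpha in [1e-4, 1e-2, 0.5]:
        pi_tilde = stationary(restart_perturbation(P, alpha, sigma))
        print(alpha, tv(stationary(P), pi_tilde))

As α grows relative to the inverse mixing time of this walk, the printed error climbs from near 0 towards 1, previewing the trichotomy described below.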

A second important notion is that of mixing times and cutoff. Roughly, the ϵ-mixing time t_mix^(n)(ϵ) is the number of steps the chain with transition matrix P_n must take before its distribution is ϵ-close to π_n (see (7) for a formal definition). Certain chains exhibit cutoff, meaning

lim_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = 1 ∀ ϵ ∈ (0,1/2).     (1)

Intuitively, (1) says the chain is far from stationarity for many steps, then abruptly becomes close to stationarity. A weaker condition is pre-cutoff, which has similar intuition but only requires

sup_{ϵ∈(0,1/2)} lim sup_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) < ∞.     (2)

We now preview our two main results. Our first result, Theorem 1, says that the relative asymptotics of α_n and 1/t_mix^(n)(ϵ) fully characterize the asymptotics of ‖π_n − π̃_n‖ in the case of restart perturbations. More specifically, we prove that the following trichotomy occurs:

  • If α_n t_mix^(n)(ϵ) → 0, then ‖π_n − π̃_n‖ → 0 for any restart perturbation.

  • If α_n t_mix^(n)(ϵ) → ∞, then ‖π_n − π̃_n‖ → 1 for some restart perturbation.

  • If α_n t_mix^(n)(ϵ) → c ∈ (0,∞), an intermediate behavior occurs: all restart perturbations satisfy lim sup_{n→∞} ‖π_n − π̃_n‖ ≤ 1 − e^{−c}, and some restart perturbation attains the bound.

We note that Theorem 1 holds assuming the original chain is lazy (P_n(i,i) ≥ 1/2 ∀ i), reversible (π_n(i)P_n(i,j) = π_n(j)P_n(j,i) ∀ i,j), and exhibits cutoff. The laziness and reversibility assumptions are inherited from [4], which contains an inequality used to prove our lower bounds (see Section 3). Hence, we suspect these assumptions may be artifacts of our analysis. In contrast, we believe some notion of cutoff is fundamentally necessary (as will be discussed shortly). We also note that parts of our analysis hold more generally; see Lemmas 1 and 2.

Interestingly, Theorem 1 says that a threshold phenomenon for the original chain – cutoff – translates into a different threshold phenomenon for the perturbed chain – the trichotomy shown above. Another point of interest is that similar trichotomies have been established in several recent papers. For example, [7] shows that the restart perturbation adopts the cutoff behavior of the original chain when 1/α_n grows faster than the mixing time, has a distinct convergence to stationarity when 1/α_n grows slower, and exhibits an intermediate behavior when the two are comparable, assuming the original chain is the simple random walk on a particular random graph. Similar results were obtained in [2] for random walks on dynamic random graphs; there the role of α_n is played by the “rate of change” of edges. Finally, [13] studies the matrix whose i-th row is the stationary distribution of the perturbation that restarts at state i; the authors prove this matrix has small dimension when 1/α_n dominates the mixing time, prove it has large (degree-dependent) dimension when the mixing time dominates, and conjecture an intermediate dimension when the two are comparable, when the original chain is generated as in [7]. See Section 6 for more details on these papers.

Ultimately, this work, [7], [2], and [13] all study different questions, but the similarities speak to a much deeper phenomenon: some aspect of the original chain is unaffected when the restart probability is dominated by the inverse mixing time, this aspect is significantly altered when the restart probability dominates, and an intermediate behavior occurs when the two are comparable. However, in contrast to [7], [2], and [13], we work directly with the stationary distribution, which is arguably the most fundamental such aspect one would hope to understand. Additionally, unlike these works, we do not assume a generative model for the original chain; in this sense, our results are more general, while demonstrating a similar idea.

Our second result concerns pre-cutoff. As alluded to above, we believe some notion of cutoff is fundamental for lower bounds like those in Theorem 1. Indeed, in Theorem 2 we show that for lazy and reversible chains, pre-cutoff (defined in (2)) implies a certain perturbation condition, and

sup_{ϵ∈(0,1/2)} lim inf_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = ∞     (3)

(which is slightly stronger than the negation of (2)) implies the negation of the perturbation condition. Roughly, this condition is as follows: for certain sequences {α_n} of restart probabilities, there exists a sequence of restart perturbations with restart probabilities α_n and stationary distributions π̃_n s.t. ‖π_n − π̃_n‖ → 1. Hence, Theorem 2 says that chains with pre-cutoff are sensitive to perturbation, in the sense that certain perturbations maximally change the stationary distribution, and the converse (almost) holds. The only gap in our logic involves the case

sup_{ϵ∈(0,1/2)} lim sup_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = ∞, sup_{ϵ∈(0,1/2)} lim inf_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) < ∞,     (4)

which only occurs for a class of chains that are of little interest (see Section 4). Thus, for all intents and purposes, Theorem 2 is an equivalence between pre-cutoff and perturbation sensitivity.

The main utility of Theorem 2 stems from the fact that, while different notions of cutoff have been proven for many different chains, there is little general theory. In fact, only recently was an abstract condition equivalent to cutoff determined, in [4] (this being a certain notion of “hitting time cutoff”). Additionally, while Theorem 2 relies on an inequality from [4], we believe it is much more than a corollary of this inequality. Instead, we believe our work nicely complements [4], since we consider pre-cutoff instead of cutoff, and since our equivalent notion is different. See Section 6 for details on [4].

In short, this paper contributes to two lines of work. First, we add to the growing collection of “trichotomy” results; unlike existing results, however, we study the stationary distribution directly and do not assume a generative model for the chain. Second, we add to the general theory of cutoff in a similar vein to [4], but for a different notion of cutoff and a different equivalent notion.

The remainder of the paper is organized as follows. We begin in Section 2 with definitions. Sections 3 and 4 contain the two theorems described above. We present examples in Section 5, with details deferred to Appendix A. Section 6 discusses related work. Proofs can be found in Appendix B.

2 Preliminaries

We begin with some notation. Let ℕ_0 denote the set of nonnegative integers, and let {X_t}_{t∈ℕ_0} be a time-homogeneous, irreducible, and aperiodic Markov chain with state space [n] = {1,…,n}. We denote by P_n the transition matrix of this chain, i.e. the n×n matrix with (i,j)-th entry

P_n(i,j) = ℙ(X_{t+1} = j | X_t = i), i,j ∈ [n], t ∈ ℕ_0.     (5)

It is a standard result that this chain has a unique stationary distribution π_n, i.e. a unique vector π_n ∈ [0,1]^n satisfying π_n P_n = π_n and Σ_{i∈[n]} π_n(i) = 1. Here and moving forward, we treat all vectors as row vectors. For i ∈ [n], we let e_i denote the length-n vector with 1 in the i-th coordinate and zeros elsewhere. (For simplicity, we suppress the dependence on n in the notation e_i, but this dependence will be clear from context.) Also, we let Δ_n denote the set of distributions on [n], so that (for example) π_n ∈ Δ_n. Finally, we let 𝒫_n denote the set of transition matrices for time-homogeneous, irreducible, and aperiodic Markov chains with state space [n], so that (for example) P_n ∈ 𝒫_n.

Some of our results will only apply to a strict subset of 𝒫_n. In particular, certain results will require the chain to be lazy, meaning P_n(i,i) ≥ 1/2 ∀ i ∈ [n], and reversible, meaning π_n(i)P_n(i,j) = π_n(j)P_n(j,i) ∀ i,j ∈ [n]. We note that any chain can be made lazy without changing its stationary distribution; namely, by considering (P_n + I_n)/2 instead of P_n, where I_n is the n×n identity matrix. In this sense, reversibility is the most restrictive of our assumptions. However, this is a fairly common restriction in the mixing times literature, since it guarantees the eigenvalues of P_n are real and allows one to use certain linear algebraic techniques (see e.g. Chapter 12 of [8]).

As discussed in Section 1, the mixing time of P_n will play a pivotal role. To define mixing times, we first define the distance between the t-step distribution and stationarity as

d_n(t) = max_{i∈[n]} ‖e_i P_n^t − π_n‖, t ∈ ℕ_0,     (6)

where ‖μ − ν‖ = (1/2) Σ_{i∈[n]} |μ(i) − ν(i)| denotes total variation distance, for μ, ν ∈ Δ_n. For ϵ ∈ (0,1), we can now define the ϵ-mixing time as

t_mix^(n)(ϵ) = min{t ∈ ℕ_0 : d_n(t) ≤ ϵ}.     (7)

As is convention in the literature, we set t_mix^(n) = t_mix^(n)(1/4). We also note the following monotonicity property follows immediately from the definition, but we record it here as it will be used often:

t_mix^(n)(ϵ) ≥ t_mix^(n)(ϵ′) whenever ϵ ≤ ϵ′.     (8)
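For small chains, the quantities in (6)–(8) can be computed exactly by matrix powers; the sketch below (our own illustration, reusing the helpers and the cycle walk P from the sketch in Section 1) does so and reads off two ϵ-mixing times:

    def distance_to_stationarity(P, pi, t_max):
        """d(t) = max_i || e_i P^t - pi ||_TV for t = 0, 1, ..., t_max."""
        Pt = np.eye(P.shape[0])
        d = []
        for t in range(t_max + 1):
            d.append(0.5 * np.abs(Pt - pi).sum(axis=1).max())
            Pt = Pt @ P
        return d

    def t_mix(d, eps):
        """Smallest t with d(t) <= eps; by (8), nonincreasing as eps grows."""
        return next(t for t, dt in enumerate(d) if dt <= eps)

    d_vals = distance_to_stationarity(P, stationary(P), 5000)
    print(t_mix(d_vals, 1 / 4), t_mix(d_vals, 3 / 4))   # eps = 1/4 and 3/4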

Having defined mixing times, we recall the two notions of cutoff from Section 1. First, a sequence {P_n}_{n∈ℕ} with P_n ∈ 𝒫_n is said to exhibit cutoff if (note the fraction in (9) may be ill-defined for small n, since t_mix^(n)(1−ϵ) = 0 can occur; however, since d_n(0) ≥ 1 − 1/n, we have d_n(0) > 1 − ϵ for fixed ϵ and n large, so t_mix^(n)(1−ϵ) ≥ 1 for such n; along these lines, we at times assume t_mix^(n)(1−ϵ) > 0, with the implicit understanding that this holds for n large)

lim_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = 1 ∀ ϵ ∈ (0,1/2).     (9)

A basic result (see e.g. Section 18.1 of [8]) says that cutoff occurs if and only if

lim_{n→∞} d_n(⌈c t_mix^(n)⌉) = 1 ∀ c ∈ (0,1), lim_{n→∞} d_n(⌈c t_mix^(n)⌉) = 0 ∀ c ∈ (1,∞).     (10)

Thus, cutoff means the graph of d_n approaches a step function as n → ∞, when the time axis is normalized by t_mix^(n). Put differently, the chain is quite far from stationarity at time e.g. 0.99 t_mix^(n), then suddenly reaches stationarity at time e.g. 1.01 t_mix^(n). The weaker notion of pre-cutoff states

sup_{ϵ∈(0,1/2)} lim sup_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) < ∞.     (11)

For the perturbation analysis described in the introduction, it will be convenient to introduce some additional notation. First, given P_n ∈ 𝒫_n and α_n ∈ (0,1), we define

𝒫_n(P_n, α_n) = {P̃_n ∈ 𝒫_n : ‖e_i P̃_n − e_i P_n‖ ≤ α_n ∀ i ∈ [n]}.     (12)

In words, 𝒫_n(P_n, α_n) is the set of transition matrices for time-homogeneous, irreducible, and aperiodic chains whose rows differ from the rows of P_n by at most α_n in total variation. We will denote the unique stationary distribution of P̃_n ∈ 𝒫_n(P_n, α_n) by π̃_n. A particular subset of 𝒫_n(P_n, α_n) is the class of restart perturbations discussed in the introduction. Such perturbations have the form

P̃_n = (1 − α_n) P_n + α_n 1_n^⊤ σ_n     (13)

for some α_n ∈ (0,1) and σ_n ∈ Δ_n, where 1_n is the length-n row vector of ones. The corresponding chain {X̃_t}_{t∈ℕ_0} has the following dynamics: given X̃_t, flip a coin that lands heads with probability α_n; if heads, sample X̃_{t+1} from σ_n (i.e. restart at a state sampled from σ_n); if tails, sample X̃_{t+1} from e_{X̃_t} P_n (i.e. follow the original chain). Note that such perturbations only depend on the restart probability α_n and the restart distribution σ_n. Thus, we will use the notation

P̃_{α_n,σ_n} = (1 − α_n) P_n + α_n 1_n^⊤ σ_n     (14)

to define restart perturbations. We denote the corresponding stationary distribution by π̃_{α_n,σ_n}. Moving forward, the restart probability and distribution will typically depend on n, in which case we write P̃_{α_n,σ_n} and π̃_{α_n,σ_n}.
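For restart perturbations, π̃_{α_n,σ_n} also admits a closed form: solving π̃ P̃_{α_n,σ_n} = π̃ gives π̃_{α_n,σ_n} = α_n σ_n (I_n − (1−α_n) P_n)^{−1} = α_n Σ_{t≥0} (1−α_n)^t σ_n P_n^t, i.e. π̃_{α_n,σ_n} is the law of the original chain started from σ_n and run for a Geometric(α_n) number of steps. (This identity is standard for PageRank-type chains; the “restarts every 1/α_n steps” intuition used below comes directly from it.) A quick numerical check, reusing the helpers from the sketch in Section 1 (our own code, not the paper’s):

    alpha, sigma = 0.01, np.eye(n)[0]
    pi_tilde = alpha * sigma @ np.linalg.inv(np.eye(n) - (1 - alpha) * P)
    # Agrees with the eigenvector computation up to numerical error:
    print(tv(pi_tilde, stationary(restart_perturbation(P, alpha, sigma))))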

Finally, we note the following (standard) notation will be used for functions f, g : ℕ → (0,∞): we write f(n) = O(g(n)), f(n) = Ω(g(n)), f(n) = Θ(g(n)), and f(n) = o(g(n)), respectively, if lim sup_{n→∞} f(n)/g(n) < ∞, lim inf_{n→∞} f(n)/g(n) > 0, both of the previous conditions hold, and lim_{n→∞} f(n)/g(n) = 0, respectively.

3 Trichotomy

In this section, we formulate our first main result, the trichotomy described in Section 1. For transparency, we begin with two lemmas, parts of which require weaker assumptions than the theorem. We then collect these results under our strongest assumptions in Theorem 1.

The first of these lemmas concerns the cases α_n t_mix^(n)(ϵ) → 0 and α_n t_mix^(n)(ϵ) → ∞. The lemma states that, if the perturbation magnitude is dominated by the inverse mixing time, no perturbation can change the stationary distribution. On the other hand, if the perturbation magnitude dominates the inverse mixing time, one can find a perturbation that maximally changes the stationary distribution. Note the former case holds for all bounded perturbations (not just the restart variety). Also, while the latter case does require laziness and reversibility, it does not require cutoff (only pre-cutoff). Hence, Lemma 1 contains stronger results for these cases than will be stated in Theorem 1.

Lemma 1.

Let {P_n}_{n∈ℕ} with P_n ∈ 𝒫_n ∀ n ∈ ℕ, and let ϵ ∈ (0,1) be independent of n. Assume α_n ∈ (0,1) ∀ n ∈ ℕ. Then the following hold:

  • If α_n t_mix^(n)(ϵ) → 0 and ϵ ≤ 1/4, then for any {P̃_n}_{n∈ℕ} s.t. P̃_n ∈ 𝒫_n(P_n, α_n) ∀ n ∈ ℕ,

    lim_{n→∞} ‖π_n − π̃_n‖ = 0.     (15)
  • If α_n t_mix^(n)(ϵ) → ∞, {P_n}_{n∈ℕ} exhibits pre-cutoff, and each P_n is lazy and reversible, then ∃ {P̃_n}_{n∈ℕ} s.t. P̃_n ∈ 𝒫_n(P_n, α_n) ∀ n ∈ ℕ and

    lim_{n→∞} ‖π_n − π̃_n‖ = 1.     (16)

    In particular, ∀ n ∈ ℕ, P̃_n is a restart perturbation, i.e. P̃_n = P̃_{α_n,σ_n} for some σ_n ∈ Δ_n.

Proof.

See Appendix B.2. ∎

We offer several remarks on the proof. The case α_n t_mix^(n)(ϵ) → 0 is simpler and relies on standard mixing time results. In particular, we use the well-known fact that distance to stationarity decays exponentially after it reaches 1/4 (mathematically, d_n(k t_mix^(n)) ≤ 2^{−k} for k ∈ ℕ), hence the additional assumption ϵ ≤ 1/4 in this case. The case α_n t_mix^(n)(ϵ) → ∞ is more involved. The key step here is to establish a weaker version of (16): namely, ∀ δ ∈ (0,1), ∃ N(δ) ∈ ℕ s.t. ∀ n ≥ N(δ), ∃ P̃_n^{(δ)} ∈ 𝒫_n(P_n, α_n) s.t.

‖π_n − π̃_n^{(δ)}‖ ≥ 1 − δ.     (17)

After proving (17), we define a vanishing sequence {δ_j}_{j∈ℕ} and apply (17) to each δ_j to reach the stronger conclusion shown in (16). (The extension to (16) is not as immediate as taking δ → 0 in (17), because the left side of (17) has a dependence on δ through P̃_n^{(δ)}; however, it is still reasonably simple.)

Before proceeding, we discuss further the key step from the α_n t_mix^(n)(ϵ) → ∞ case, i.e. the proof of (17). This proof involves a construction of P̃_n that relies on a result from the aforementioned [4]. Roughly speaking, this result shows that one can find a state x_n, a subset of states A_n, and a time t_n, such that the chain is unlikely to reach A_n within t_n steps when started from x_n. Further, in the case of pre-cutoff, π_n(A_n) is large and t_n is comparable to t_mix^(n). In summary, the chain started from x_n makes its first visit to a “large” set just before t_mix^(n).

This argument suggests a good construction for the perturbed chain: set σ_n = e_{x_n}, i.e. perturb the chain by restarting at x_n with probability α_n at each step. On this perturbed chain, the number of steps between restarts at x_n is (in expectation) 1/α_n; hence, when α_n t_mix^(n) → ∞, restarts occur at intervals typically much shorter than t_mix^(n). In other words, the perturbed chain rarely wanders t_mix^(n) steps from x_n. But, per the previous paragraph, the chain started from x_n requires roughly t_mix^(n) steps to reach A_n. Hence, the perturbed chain rarely visits A_n and thus assigns a small stationary measure to A_n. Finally, since π_n(A_n) is large, the definition of total variation ensures ‖π_n − π̃_n‖ is also large. This intuition is the key idea behind (17).
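This starvation mechanism can be observed numerically. The sketch below (our own toy illustration, not the construction from [4]; this chain is chosen for its bottleneck rather than for cutoff, but the effect is the same) builds two cliques joined by a single edge, so that the far clique A is slow to reach from state 0, and restarts at state 0 with varying frequency:

    # Two m-cliques joined by one edge; A = the far clique, restarts at state 0.
    m = 25
    n2 = 2 * m
    adj = np.zeros((n2, n2))
    adj[:m, :m] = 1
    adj[m:, m:] = 1
    np.fill_diagonal(adj, 0)
    adj[m - 1, m] = adj[m, m - 1] = 1                # the bridge
    P2 = adj / adj.sum(axis=1, keepdims=True)        # simple random walk
    P2 = 0.5 * (P2 + np.eye(n2))                     # lazy version
    pi2 = stationary(P2)

    A = np.arange(m, n2)                             # far clique: pi2[A].sum() = 1/2
    for alpha in [1e-5, 1e-3, 1e-1]:                 # restarts every ~1/alpha steps
        pt = alpha * np.eye(n2)[0] @ np.linalg.inv(np.eye(n2) - (1 - alpha) * P2)
        print(alpha, round(pi2[A].sum(), 3), round(pt[A].sum(), 3))
    # Once restarts are more frequent than the bridge-crossing time, pt[A] collapses,
    # so ||pi2 - pt||_TV >= pi2[A].sum() - pt[A].sum() becomes large.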

We next turn to the second lemma, which considers the case α_n t_mix^(n)(ϵ) → c ∈ (0,∞). This lemma contains two bounds; one analogous to the upper bound (15) and one analogous to the lower bound (16). Here we require stronger assumptions than Lemma 1. For the upper bound, we restrict to restart perturbations and we assume t_mix^(n)(ϵ) → ∞ as n → ∞. This latter assumption is minor, since typically one studies the growth rate of t_mix^(n), and thus chains that mix in constant time are of less interest. For the lower bound, we again assume laziness and reversibility, as well as strengthening the pre-cutoff assumption of Lemma 1 to cutoff. The proof is similar to that of Lemma 1, but the stronger assumptions allow for a tighter analysis.

Lemma 2.

Let {P_n}_{n∈ℕ} with P_n ∈ 𝒫_n ∀ n ∈ ℕ, and let ϵ ∈ (0,1) be independent of n. Assume α_n t_mix^(n)(ϵ) → c ∈ (0,∞). Then the following hold:

  • If t_mix^(n)(ϵ) → ∞, then for any {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ,

    lim sup_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ ≤ 1 − e^{−c}.     (18)
  • If {P_n}_{n∈ℕ} exhibits cutoff and each P_n is lazy and reversible, then ∃ {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ and lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 1 − e^{−c}.

Proof.

See Appendix B.3. ∎

Before proceeding, we comment on the upper bound in the case ϵ ∈ (0,1/2], which (we note) includes the usual case of interest ϵ = 1/4. Here one can verify

1 − e^{−c} ≤ min{c, 1} ∀ c ∈ (0,∞).     (19)

Hence, for smaller c, the upper bound in Lemma 2 is at most c, while for larger c, the bound is close to 1. Note the former bound approaches 0 as c → 0, and thus approaches the α_n t_mix^(n)(ϵ) → 0 case of Lemma 1. Furthermore, the latter bound approaches 1 and thus becomes trivial as c → ∞; this is expected due to the α_n t_mix^(n)(ϵ) → ∞ case of Lemma 1.
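For intuition on where 1 − e^{−c} comes from, here is a back-of-envelope argument (ours, not the appendix proof), using the geometric-restart form of π̃_{α_n,σ_n} noted in Section 2. The weight that π̃_{α_n,σ_n} = α_n Σ_{t≥0} (1−α_n)^t σ_n P_n^t places on times t ≥ t_mix^(n)(ϵ) is

Σ_{t ≥ t_mix^(n)(ϵ)} α_n (1−α_n)^t = (1−α_n)^{t_mix^(n)(ϵ)} ≈ e^{−α_n t_mix^(n)(ϵ)} → e^{−c},

and at such times σ_n P_n^t is already ϵ-close to π_n. Hence at least e^{−c} of the mass of π̃_{α_n,σ_n} is (up to ϵ) distributed as π_n, and only the remaining 1 − e^{−c} can sit far from π_n, matching the shape of (18).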

Combining Lemmas 1 and 2, we arrive at our first main result. Theorem 1 collects the results of the lemmas under our strongest assumptions: the chain is lazy, reversible, and exhibits cutoff, and the perturbation is restricted to the restart variety. Under these assumptions, we can fully characterize perturbation behavior. Note that these assumptions are stronger than those required for the upper bounds in the lemmas; this in turn allows us to discard the ϵ ≤ 1/4 assumption from the Lemma 1 upper bound and the t_mix^(n)(ϵ) → ∞ assumption from the Lemma 2 upper bound. (Indeed, under cutoff and t_mix^(n) → ∞, t_mix^(n)(ϵ) → ∞ for every fixed ϵ, and the restriction ϵ ≤ 1/4 becomes unnecessary since t_mix^(n)(ϵ) and t_mix^(n)(1/4) agree asymptotically.)

Theorem 1.

Let {P_n}_{n∈ℕ} with P_n ∈ 𝒫_n ∀ n ∈ ℕ, and let ϵ ∈ (0,1) be independent of n. Assume {P_n}_{n∈ℕ} exhibits cutoff, each P_n is lazy and reversible, and t_mix^(n) → ∞. Then the following hold:

  • If α_n t_mix^(n)(ϵ) → 0, then for any {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ,

    lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 0.     (20)
  • If α_n t_mix^(n)(ϵ) → c ∈ (0,∞), then for any {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ,

    lim sup_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ ≤ 1 − e^{−c}.     (21)

    Furthermore, (21) is tight, i.e. ∃ {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ and

    lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 1 − e^{−c}.     (22)
  • If α_n t_mix^(n)(ϵ) → ∞, then ∃ {σ_n}_{n∈ℕ} s.t. σ_n ∈ Δ_n ∀ n ∈ ℕ and

    lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 1.     (23)
Proof.

See Appendix B.4. ∎

Note that, given any {α_n} with α_n t_mix^(n)(ϵ) → ∞, the theorem guarantees existence of a sequence of distributions {σ_n} s.t. ‖π_n − π̃_{α_n,σ_n}‖ → 1. This is somewhat surprising: we may not know the underlying chain’s structure explicitly, and thus we may lack expressions for (or even estimates of) π_n and π̃_{α_n,σ_n}; nevertheless, we obtain a precise asymptotic comparison of these distributions.

4 Pre-cutoff equivalence

We next turn to Theorem 2. As discussed in the introduction, the theorem provides a near-equivalence between pre-cutoff and a certain perturbation condition. More specifically, we will show that pre-cutoff implies a certain perturbation condition, and that this condition fails whenever

sup_{ϵ∈(0,1/2)} lim inf_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = ∞.     (24)

The caveat of Theorem 2 being a near-equivalence arises because (24) is stronger than the negation of pre-cutoff. Indeed, one can construct sequences of chains for which pre-cutoff and (24) both fail. For instance, in Section 5 we provide two example sequences with drastically different cutoff behaviors; if we construct a new sequence that oscillates between these two, we obtain

sup_{ϵ∈(0,1/2)} lim sup_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = ∞, sup_{ϵ∈(0,1/2)} lim inf_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) < ∞.     (25)

However, this oscillating sequence is pathological; the literature almost exclusively considers sequences of chains defined in the same manner for each n. Thus, the “near-equivalence” caveat is a small one.

Before presenting Theorem 2, we must define the perturbation condition. However, this condition is somewhat mysterious, so we first discuss the difficulty in deriving it, in hopes of making it less opaque. We begin with the most obvious candidate, the condition from Lemma 1: for every {α_n} with α_n t_mix^(n)(ϵ) → ∞, there exist {σ_n} with σ_n ∈ Δ_n ∀ n s.t.

lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 1.     (26)

Indeed, we have already proven that pre-cutoff implies (26) (assuming laziness and reversibility). The difficulty arises in showing that (26) fails whenever (24) holds. The most obvious approach is as follows. When (24) holds, it is possible that for some fixed ϵ,

lim_{n→∞} t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) = ∞,     (27)

which suggests setting α_n = λ / t_mix^(n)(1−ϵ) for some λ > 0 independent of n, since then

α_n t_mix^(n)(ϵ) = λ t_mix^(n)(ϵ) / t_mix^(n)(1−ϵ) → ∞.     (28)

Our task would then be reduced to upper bounding lim sup_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ away from 1 (perhaps via techniques used for upper bounds above). Unfortunately, while it is possible that (27) holds, this is not guaranteed.

While this first attempt fails, it illustrates the dissonance at hand: (26) considers sequences {α_n} depending only on n; the analogous sequence in (24) depends on both n and ϵ. Hence, as a second attempt, we could modify (26) to involve a sequence {α_{n,ϵ}} depending on both n and ϵ. However, if (26) is modified in this manner, it is no longer implied by pre-cutoff via Lemma 1, so this direction of the proof may become difficult.

It turns out this issue can be resolved by placing appropriate restrictions on the set of sequences {α_n} of restart probabilities appearing in the perturbation condition. In particular, we will say that the sequence {α_n} coincides with the mixing times {t_mix^(n)(ϵ)} if (as shown in the proof of Theorem 2, such sequences always exist, at least under the assumption of laziness)

sup_{ϵ∈(0,1/2)} lim inf_{n→∞} α_n t_mix^(n)(ϵ) = ∞,     (29)

and we will restrict to sequences that coincide with the mixing times. More specifically, we define the following perturbation condition for use in our second main result.

Condition 1.

∃ {α_n}_{n∈ℕ} that coincide with the mixing times {t_mix^(n)(ϵ)}, and ∃ {σ_n}_{n∈ℕ} with σ_n ∈ Δ_n ∀ n ∈ ℕ, such that

lim_{n→∞} ‖π_n − π̃_{α_n,σ_n}‖ = 1.     (30)

We note the definition of “coincides with” yields the following useful property: when pre-cutoff holds and {α_n} coincides with the mixing times, lim inf_{n→∞} α_n t_mix^(n)(ϵ) = ∞ for every ϵ ∈ (0,1/2). In words, not only is the supremum in (29) infinite, the limit inferior in (29) is infinite for every ϵ. This allows us to prove (via Lemma 1) that Condition 1 is implied by pre-cutoff, while also proving that Condition 1 fails (via the approach discussed above) whenever (24) holds.

With Condition 1 in place, we present Theorem 2; see Figure 1 for a graphical depiction.

Theorem 2.

Let {P_n}_{n∈ℕ} be a sequence with P_n ∈ 𝒫_n lazy and reversible for each n ∈ ℕ. If {P_n}_{n∈ℕ} exhibits pre-cutoff, Condition 1 holds; if {P_n}_{n∈ℕ} satisfies (24), Condition 1 fails.

Proof.

See Appendix B.5. ∎

Figure 1: Partition of lazy/reversible sequences induced by Condition 1. Theorem 2 says chains satisfying pre-cutoff and (24), respectively, are contained in the subsets for which Condition 1 holds and fails, respectively. The gray subset contains e.g. the pathological example from Section 4; if we disregard this subset, we obtain an equivalence between Condition 1 and pre-cutoff.

5 Illustrative examples

Our results suggest a deep connection between some notion of cutoff and some notion of perturbation sensitivity. Here we illustrate this with two example chains called the winning streak reversal (WSR) and the complete graph bijection (CGB). The key insights are summarized pictorially in Figure 2 and discussed here; most details are deferred to Appendix A.

At left in Figure 2, we plot d_n(t) versus t for a fixed value of n. Note the WSR exhibits a clear cutoff behavior, dropping suddenly from near 1 to near 0. In contrast, the CGB initially falls from 1 to 1/2, after which point d_n(t) decays gradually in t. Hence, roughly speaking, the WSR “makes no progress” towards stationarity until step t_mix^(n); in contrast, the CGB “makes half its progress” towards stationarity after a single step. However, despite this drastic difference, both chains have mixing times of the same order (see Proposition 1, Appendix A). At right in Figure 2, we show the error ‖π_n − π̃_{α_n,σ_n}‖ for a certain restart distribution σ_n and restart probability α_n. (Intuitively, one should choose σ_n “far from” π_n. Thus, in Figure 2 we let σ_n be uniform for the WSR, since π_n is highly non-uniform, and let σ_n be a point mass for the CGB, since π_n is roughly uniform; see Appendix A.) Note α_n t_mix^(n) is large for both chains, and that restarts occur every 1/α_n steps (in expectation). For the WSR, the error rapidly increases from near 0 to near 1; for the CGB, the error approaches a constant strictly less than 1. Beyond these example perturbations, we can also prove a perturbation result stronger than Theorem 1 for the WSR; in contrast, the conclusion of Theorem 1 fails for the CGB (see Proposition 2, Appendix A).

In summary, we can (roughly) say the following, which illustrates the intuition of our results:

  • The WSR requires roughly t_mix^(n) steps to make any progress to stationarity. Thus, with the perturbed chain restarting every 1/α_n ≪ t_mix^(n) steps, it never approaches the original stationary distribution. Consequently, the perturbed chain wanders far from this distribution.

  • The CGB makes half its progress to stationarity after a single step. Hence, one step after each restart, the perturbed chain comes close to the original stationary distribution. Consequently, the perturbed chain cannot wander too far from this distribution.

Ultimately, while the cutoff/perturbation connection is perhaps obvious for these chains, this is because their cutoff behaviors lie at opposite extremes among chains whose mixing times have the same order (see discussion preceding Proposition 1, Appendix A). The main contribution of our work is to extend this connection to a wider class of chains (lazy and reversible), for which it is far less obvious.
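The contrast can be reproduced qualitatively in a few lines. Since the WSR and CGB constructions live in Appendix A, the sketch below (our own stand-ins, reusing the helpers from Sections 1–3) substitutes two chains with the same opposite behaviors: the lazy random walk on the hypercube {0,1}^d, a standard cutoff example, and the two-clique chain P2 from the earlier sketch, whose distance drops to about 1/2 immediately and then lingers.

    def hypercube(d):
        """Lazy random walk on {0,1}^d: w.p. 1/2 stay put, else flip a uniform coordinate."""
        N = 2 ** d
        Q = np.zeros((N, N))
        for x in range(N):
            Q[x, x] = 0.5
            for k in range(d):
                Q[x, x ^ (1 << k)] = 0.5 / d
        return Q

    # Restart intervals 1/alpha are much shorter than each chain's mixing time,
    # yet the error approaches 1 only for the cutoff chain.
    for name, Q, alpha in [("hypercube (cutoff)", hypercube(10), 0.5),
                           ("two-clique (no cutoff)", P2, 0.05)]:
        piQ = stationary(Q)
        NQ = Q.shape[0]
        d_curve = distance_to_stationarity(Q, piQ, 30)
        pt = alpha * np.eye(NQ)[0] @ np.linalg.inv(np.eye(NQ) - (1 - alpha) * Q)
        print(name, "d(t):", [round(x, 2) for x in d_curve[::5]],
              "error:", round(tv(piQ, pt), 2))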

Figure 2: Convergence to stationarity d_n(t) (left) and restart perturbation error ‖π_n − π̃_{α_n,σ_n}‖ (right) for the example chains.

6 Related work

We now return to discuss the existing trichotomy results mentioned in the introduction, those from [7, 2, 13]. All of these works consider the directed configuration model (DCM), a means of constructing a graph from a given degree sequence via random edge pairings. It was recently shown that for random walks on the DCM, and for a wider class of randomly-generated chains, cutoff occurs at Θ(log n) steps [5, 6]. More precisely, [5, 6] prove an analogue of (10), namely

d_n(⌈c t_n^⋆⌉) →_ℙ 1 ∀ c ∈ (0,1), d_n(⌈c t_n^⋆⌉) →_ℙ 0 ∀ c ∈ (1,∞),     (31)

where t_n^⋆ is defined in terms of the given degrees and →_ℙ denotes convergence in probability. Using these results, Theorem 2 in [7] states that for certain sequences of distributions {σ_n}, the distance to stationarity d̃_n(t) corresponding to P̃_{α_n,σ_n} satisfies the following:

  • If α_n t_n^⋆ → 0, (31) holds with d_n replaced by d̃_n, i.e. d̃_n approaches a step function.

  • If α_n t_n^⋆ → ∞, d̃_n decays exponentially in t.

  • If α_n t_n^⋆ → c ∈ (0,∞), the behavior is intermediate: before the cutoff point, d̃_n decays exponentially, as in the α_n t_n^⋆ → ∞ case; beyond the cutoff point, it drops towards 0, as in the α_n t_n^⋆ → 0 case.

In [2], the authors study a dynamic version of the DCM for which a fraction α_n of edges are randomly sampled and re-paired at each time step. The main result (Theorem 1.4) says the distance to stationarity of the non-backtracking random walk on this dynamic DCM follows a trichotomy similar to the one from [7]. Finally, [13], also using ideas from [5, 6], studies the matrix with rows {π̃_{α_n,e_i}}_{i∈[n]}, i.e. the i-th row corresponds to restarting at node i. The authors study

(32)

which can be viewed as a measure of the dimension of this matrix, and whose form is motivated algorithmically (see Section 2.3 in [13]). When P_n describes the random walk on the DCM and the mixing time dominates 1/α_n, the authors show the dimension grows at a rate that also depends on the given degrees (see Theorem 1 in [13]). If instead 1/α_n dominates the mixing time, the authors show the dimension remains small, and in the intermediate regime they conjecture an intermediate growth rate (see Section 7.4 in [13]).

Ultimately, as discussed in the introduction, these results all echo Theorem 1 and hint at a deeper phenomenon. However, prior to this work, one may have (erroneously) suspected that such results rely crucially on some property of the DCM, since [7, 2, 13] all study this generative model. In contrast, the present paper suggests that some notion of cutoff is the crucial property. Accordingly, it is unsurprising that the trichotomy results in [7, 13] rely on the cutoff results from [5, 6].

Our other result, Theorem 2, relates closely to the aforementioned [4]. Here it is shown that mixing cutoff (9) is equivalent to a notion of “hitting time cutoff”. Namely, Theorem 3 in [4] shows that for sequences of lazy, reversible, and irreducible chains, (9) is equivalent to each of the following:

(33)