This work tackles the challenge of constructing confidence intervals for the mixing time of reversible Markov chains based on a single sample path. Letbe an irreducible, aperiodic time-homogeneous Markov chain on a finite state space with transition matrix . Under this assumption, the chain converges to its unique stationary distribution regardless of the initial state distribution :
The mixing time of the Markov chain is the number of time steps required for the chain to be within a fixed threshold of its stationary distribution:
Here, is the probability assigned to set by , and the supremum is over all possible initial distributions . The problem studied in this work is the construction of a non-trivial confidence interval , based only on the observed sample path and , that succeeds with probability in trapping the value of the mixing time .
This problem is motivated by the numerous scientific applications and machine learning tasks in which the quantity of interest is the meanfor some function
of the states of a Markov chain. This is the setting of the celebrated Markov Chain Monte Carlo (MCMC) paradigm liu2001monte, but the problem also arises in performance prediction involving time-correlated data, as is common in reinforcement learning sutton98. Observable, ora posteriori bounds on mixing times are useful in the design and diagnostics of these methods; they yield effective approaches to assessing the estimation quality, even when a priori knowledge of the mixing time or correlation structure is unavailable.
1.1. Main results.
Consider a reversible ergodic Markov chain on states with absolute spectral gap and stationary distribution minorized by . As is well-known (see, for example, [Theorems 12.3 and 12.4]LePeWi08),
where is the relaxation time. Hence, it suffices to estimate and . Our main results are summarized as follows.
In Section 3.1, we show that in some problems observations are necessary for any procedure to guarantee constant multiplicative accuracy in estimating (Theorems 3.2 and 3.1). Essentially, in some problems every state may need to be visited about times, on average, before an accurate estimate of the mixing time can be provided, regardless of the actual estimation procedure used.
In Section 3.2, we give a point estimator for , based an a single sample path, and prove in Theorem 3.4 that with high probability if the path is of length . (The notation suppresses logarithmic factors.) We also provide and analyze a point estimator for . This establishes the feasibility of estimating the mixing time in this setting, and the dependence on and in the path length matches our lower bound (up to logarithmic factors) in the case where . color=red!20!white,color=red!20!white,todo: color=red!20!white,Cs: I think this should be . We note, however, that these results give only a priori confidence intervals that depend on the unknown quantities and . As such, the results do not lead to a universal (chain-independent) stopping rule for stopping the chain when the relative error is below the prescribed accuracy.
In Section 4, we propose a procedure for a posteriori constructing confidence intervals for and that depend only on the observed sample path and not on any unknown parameters. We prove that the intervals shrink at a rate (Theorems 4.2 and 4.1). These confidence intervals trivially lead to a universal stopping rule to stop the chain when a prescribed relative error is achieved.
1.2. Related work.
There is a vast statistical literature on estimation in Markov chains. For instance, it is known that under the assumptions on
from above, the law of large numbers guarantees that the sample meanconverges almost surely to
meyn1993markov, while the central limit theorem tells us that as, the distribution of the deviation
will be normal with mean zero and asymptotic variancekipnis1986central.
Although these asymptotic results help us understand the limiting behavior of the sample mean over a Markov chain, they say little about the finite-time non-asymptotic behavior, which is often needed for the prudent evaluation of a method or even its algorithmic design ,DBLP:conf/valuetools/KontoyiannisLM06,flegal2011implementing,Gyori-paulin15. To address this need, numerous works have developed Chernoff-type bounds on , thus providing valuable tools for non-asymptotic probabilistic analysis gillman1998chernoff,leon2004optimal,DBLP:conf/valuetools/KontoyiannisLM06,kontorovich2014uniform,Paulin15. These probability bounds are larger than the corresponding bounds for independent and identically distributed (iid) data due to the temporal dependence; intuitively, for the Markov chain to yield a fresh draw that behaves as if it was independent of , one must wait time steps. Note that the bounds generally depend on distribution-specific properties of the Markov chain (e.g., , , ), which are often unknown a priori in practice. Consequently, much effort has been put towards estimating these unknown quantities, especially in the context of MCMC diagnostics, in order to provide data-dependent assessments of estimation accuracy [e.g.,]GaSmi00:eigval,jones2001,flegal2011implementing,1209.0703,Gyori-paulin15. However, these approaches generally only provide asymptotic guarantees, and hence fall short of our goal of empirical bounds that are valid with any finite-length sample path. In particular, they also fail to provide universal stopping rules that allow the estimation of (for example) the mixing time with a fixed relative accuracy.
Learning with dependent data is another main motivation to our work. Many results from statistical learning and empirical process theory have been extended to sufficiently fast mixing, dependent data [e.g.,]Yu94,MR1921877,gamarnik03,MoRo08,DBLP:conf/nips/SteinwartC09,Steinwart2009175, providing learnability assurances (e.g., generalization error bounds). These results are often given in terms of mixing coefficients, which can be consistently estimated in some cases McDoShaSche11. However, the convergence rates of the estimates from McDoShaSche11, which are needed to derive confidence bounds, are given in terms of unknown mixing coefficients. When the data comes from a Markov chain, these mixing coefficients can often be bounded in terms of mixing times, and hence our main results provide a way to make them fully empirical, at least in the limited setting we study.
It is possible to eliminate many of the difficulties presented above when allowed more flexible access to the Markov chain. For example, given a sampling oracle that generates independent transitions from any given state (akin to a “reset” device), the mixing time becomes an efficiently testable property in the sense studied by BaFoRuSmiWhi00,BaFoRuSmiWhi13,DBLP:conf/nips/BhattacharyaV15. Note that in this setting, DBLP:conf/nips/BhattacharyaV15 asked if one could approximate (up to logarithmic factors) with a number of queries that is linear in both and ; our work answers the question affirmatively (up to logarithmic corrections) in the case when the stationary distribution is near uniform. Finally, when one only has a circuit-based description of the transition probabilities of a Markov chain over an exponentially-large state space, there are complexity-theoretic barriers for many MCMC diagnostic problems BhaBoMo11.
This paper is based on the conference paper of HKS, combined with the results in the unpublished manuscript of LP:EstGap.
We denote the set of positive integers by , and the set of the first positive integers by . The non-negative part of a real number is , and . We use for natural logarithm, and for logarithm with an arbitrary constant base
. Boldface symbols are used for vectors and matrices (e.g.,, ), and their entries are referenced by subindexing (e.g., , ). For a vector , denotes its Euclidean norm; for a matrix , denotes its spectral norm. We use to denote the diagonal matrix whose -th entry is . The probability simplex is denoted by , and we regard vectors in as row vectors.
Let be a
row-stochastic matrix for an ergodic (i.e., irreducible and aperiodic) Markov chain. This implies there is a unique stationary distributionwith for all [Corollary 1.17]LePeWi08. We also assume that is reversible (with respect to ):
The minimum stationary probability is denoted by .
Define the matrices
The th entry of the matrix contains the doublet probabilities associated with : is the probability of seeing state followed by state when the chain is started from its stationary distribution. The matrix is symmetric on account of the reversibility of , and hence it follows that is also symmetric. (We will strongly exploit the symmetry in our results.) Further, , hence and
are similar and thus their eigenvalue systems are identical. Ergodicity and reversibility imply that the eigenvalues ofare contained in the interval , and that is an eigenvalue of with multiplicity [Lemmas 12.1 and 12.2]LePeWi08. Denote and order the eigenvalues of as
Let , and define the (absolute) spectral gap to be , which is strictly positive on account of ergodicity.
Let be a Markov chain whose transition probabilities are governed by . For each , let denote the marginal distribution of , so
Note that the initial distribution is arbitrary, and need not be the stationary distribution .
The goal is to estimate and from the length sample path , and also to construct confidence intervals that and with high probability; in particular, the construction of the intervals should be fully empirical and not depend on any unobservable quantities, including and themselves. As mentioned in the introduction, it is well-known that the mixing time of the Markov chain (defined in Eq. 1) is bounded in terms of and , as shown in Eq. 2. Moreover, convergence rates for empirical processes on Markov chain sequences are also often given in terms of mixing coefficients that can ultimately be bounded in terms of and (as we will show in the proof of our first result). Therefore, valid confidence intervals for and can be used to make these rates fully observable.
3. Point estimation
In this section, we present lower and upper bounds on achievable rates for estimating the spectral gap as a function of the length of the sample path .
3.1. Lower bounds
The purpose of this section is to show lower bounds on the number of observations necessary to achieve a fixed multiplicative (or even just additive) accuracy in estimating the spectral gap . By Eq. 2, the multiplicative accuracy lower bound for gives the same lower bound for estimating the mixing time. Our first result holds even for two state Markov chains and shows that a sequence length of is necessary to achieve even a constant additive accuracy in estimating .
Pick any . Consider any estimator that takes as input a random sample path of length from a Markov chain starting from any desired initial state distribution. There exists a two-state ergodic and reversible Markov chain distribution with spectral gap and minimum stationary probability such that
Next, considering state chains, we show that a sequence of length is required to estimate up to a constant multiplicative accuracy. Essentially, the sequence may have to visit all states at least times each, on average. This holds even if is within a factor of two of the largest possible value of that it can take, i.e., when is nearly uniform.
There is an absolute constant such that the following holds. Pick any positive integer and any . Consider any estimator that takes as input a random sample path of length from a -state reversible Markov chain starting from any desired initial state distribution. There is an ergodic and reversible Markov chain distribution with spectral gap and minimum stationary probability such that
3.2. A plug-in based point estimator and its accuracy
Let us now consider the problem of estimating . For this, we construct a natural plug-in estimator. Along the way, we also provide an estimator for the minimum stationary probability, allowing one to use the bounds from Eq. 2 to trap the mixing time.
Define the random matrixand random vector by
to be the symmetrized version of the (possibly non-symmetric) matrix
Let be the eigenvalues of . Our estimator of the minimum stationary probability is , and our estimator of the spectral gap is . The astute reader may notice that our estimator is ill-defined when is not positive valued. In this case, we can simply set .
These estimators have the following accuracy guarantees:
There exists an absolute constant such that the following holds. Let be an ergodic and reversible Markov chain with spectral gap and minimum stationary probability . Let and be the estimators described above. For any , with probability at least ,
Theorem 3.3 implies that the sequence lengths sufficient to estimate and to within constant multiplicative factors are, respectively,
The proof of Theorem 3.3 is based on analyzing the convergence of the sample averages and to their expectation, and then using perturbation bounds for eigenvalues to derive a bound on the error of . However, since these averages are formed using a single sample path from a (possibly) non-stationary Markov chain, we cannot use standard large deviation bounds; moreover applying Chernoff-type bounds for Markov chains to each entry of would result in a significantly worse sequence length requirement, roughly a factor of larger. Instead, we adapt probability tail bounds for sums of independent random matrices tropp2015intro to our non-iid setting by directly applying a blocking technique of Bernstein27 as described in the article of Yu94. Due to ergodicity, the convergence rate can be bounded without any dependence on the initial state distribution . The proof of Theorem 3.3 is given in Section 6.
3.3. Improving the plug-in estimator
We can bootstrap the plug-in estimator in Eq. 5 to show that in fact, to obtain any prescribed multiplicative accuracy, steps suffice to estimate . The idea is to apply the estimator from Eq. 5 to the -skipped chain for some . This chain has spectral gap . Thus, letting be the plug-in estimator for based on the -skipped chain, a natural estimator of is .
Why may this improve on the original plug-in estimator from Section 3.2? Observe that for , so the additive accuracy bound from Eq. 5 for the plug-in estimator on is roughly the same for all . However, when is bounded away from and , a small additive error in estimating with translates to a small multiplicative error in estimating using . So it suffices to use the skipped chain estimator with some . Since is not known (of course), we use a doubling trick to find a suitable value of .
The estimator is defined as follow. For simplicity, assume is a power of two. Initially, set . Let and . If or , then set and return . Otherwise, increment by one and repeat.
There exists a polynomial function of the logarithms of , , , and such that the following holds. Let be an ergodic and reversible Markov chain with spectral gap and minimum stationary probability . Let be the estimator defined above. For any , if , then with probability at least ,
4. A posteriori confidence intervals
In this section, we describe and analyze a procedure for constructing confidence intervals for the stationary probabilities and the spectral gap .
We first note that the point estimators from Theorem 3.3 and Theorem 3.4 fall short of being directly suitable for obtaining a fully empirical, a posteriori confidence interval for and . This is because the deviation terms themselves depend inversely both on and , and hence can never rule out (or an arbitrarily small positive value) as a possibility for or .111Using Theorem 3.3, it is possible to trap in the union of two empirical confidence intervals—one around and the other around zero, both of which shrink in width as the sequence length increases. In effect, the fact that the Markov chain could be slow mixing and the long-term frequency of some states could be small makes it difficult to be confident in the estimates provided by and .
The main idea behind our procedure, given as Algorithm 1, is to use the Markov property to eliminate the dependence of the confidence intervals on the unknown quantities (including and ). Specifically, we estimate the transition probabilities from the sample path using simple state visit counts: as a consequence of the Markov property, for each state, the frequency estimates converge at a rate that depends only on the number of visits to the state, and in particular the rate (given the visit count of the state) is independent of the mixing time of the chain.
With confidence intervals for the entries of in hand, it is possible to form a confidence interval for based on the eigenvalues of an estimated transition probability matrix by appealing to the Ostrowski-Elsner theorem (cf. Theorem 1.4 on Page 170 of stewart1990matrix.) However, directly using this perturbation result leads to very wide intervals, shrinking only at a rate of . We avoid this slow rate by constructing confidence intervals for the symmetric matrix , so that we can use a stronger perturbation result (namely Weyl’s inequality, as in the proof of Theorem 3.3) available for symmetric matrices.
To form an estimate of based on an estimate of the transition probabilities, one possibility is to estimate using state visit counts as was done in Section 3, and appeal to the relation to form a plug-in estimate of . However, it is not clear how to construct a confidence interval for the entries of because the accuracy of this estimator depends on the unknown mixing time.
We adopt a different strategy for estimating . We form the matrix using smoothed frequency estimates of (Step 1), then compute the group inverse of (Step 2), followed by finding the unique stationary distribution of (Step 3), this way decoupling the bound on the accuracy of from the mixing time. The group inverse of is uniquely defined; and if defines an ergodic chain (which is the case here due to the use of the smoothed estimates), can be computed at the cost of inverting an matrix [Theorem 5.2]meyer1975role.222 The group inverse of a square matrix , a special case of the Drazin inverse, is the unique matrix satisfying , and . Further, given , the unique stationary distribution of can be read out from the last row of [Theorem 5.3]meyer1975role. The group inverse is also used to determine the relative sensitivity of to , which is quantified by
We can regard as a plug-in estimator for , which is defined by substituting the group inverse of in for in Eq. 6.
We can now follow the strategy based on estimating alluded to above. Using and , we construct the plug-in estimate of , and use the eigenvalues of its symmetrization to form the estimate of the spectral gap (Steps 4 and 5). In the remaining steps, we use matrix perturbation analyses to relate and , viewing as the perturbation of ; and also to relate and , viewing as a perturbation of . Both analyses give error bounds entirely in terms of observable quantities (e.g., ), tracing back to empirical error bounds for the estimate of .
The most computationally expensive step in Algorithm 1 is the computation of the group inverse , which, as noted earlier, reduces to matrix inversion. Thus, with a standard implementation of matrix inversion, the algorithm’s time complexity is , while its space complexity is .
4.2. Main result
We now state our main theorems. Below, the big- notation should be interpreted as follows. For a random sequence and a (non-random) positive sequence parameterized by , we say “ holds almost surely as ” if there is some universal constant such that for all , holds almost surely.
Suppose Algorithm 1 is given as input a sample path of length from an ergodic and reversible Markov chain and confidence parameter . Let denote the spectral gap, the unique stationary distribution, and the minimum stationary probability. Then, on an event of probability at least ,
almost surely as .
The proof of Theorem 4.1 is given in Section 8. As mentioned above, the obstacle encountered in Theorem 3.3 is avoided by exploiting the Markov property. We establish fully observable upper and lower bounds on the entries of that converge at a rate using standard martingale tail inequalities; this justifies the validity of the bounds from Step 6. Properties of the group inverse meyer1975role,cho2001comparison and eigenvalue perturbation theory stewart1990matrix are used to validate the empirical bounds on and developed in the remaining steps of the algorithm.
The first part of Theorem 4.1 provides valid empirical confidence intervals for each and for , which are simultaneously valid at confidence level . The second part of Theorem 4.1 shows that the width of the intervals decrease as the sequence length increases. The rate at which the widths shrink is given in terms of , , , and . We show in Section 8.5 (Lemma 8.8) that
It is easy to combine Theorems 4.1 and 3.3 to yield intervals whose widths shrink at least as fast as both the non-empirical intervals from Theorem 3.3 and the empirical intervals from Theorem 4.1. Specifically, determine lower bounds on and using Algorithm 1, , ; then plug-in these lower bounds for and in the deviation bounds in Eq. 5 from Theorem 3.3. This yields a new interval centered around the estimate of from Theorem 3.3 and the new interval no longer depends on unknown quantities. The interval is a valid probability confidence interval for , and for sufficiently large , the width shrinks at the rate given in Eq. 5. We can similarly construct an empirical confidence interval for using Eq. 4, which is valid on the same probability event.333For the interval, we only plug-in lower bounds on and only where these quantities appear as and in Eq. 4. It is then possible to “solve” for observable bounds on . See Section 9 for details. Finally, we can take the intersection of these new intervals with the corresponding intervals from Algorithm 1. This is summarized in the following Theorem, which we prove in Section 9.
Finally, note that a stopping rule that stops when and are estimated with a given relative error can be obtained as follows. At time :
It is easy to see then that with probability , the algorithm only stops when the relative accuracy of its estimate is at least . Combined with the lower bounds, we conjecture that the expected stopping time of the resulting procedure is optimal up to factors. color=red!20!white,color=red!20!white,todo: color=red!20!white,Cs: I think this is true. Are we opening a can of worms?
5. Proofs of Theorems 3.2 and 3.1
5.1. Proof of Theorem 3.1
Fix . Consider two Markov chains given by the following stochastic matrices:
Each Markov chain is ergodic and reversible; their stationary distributions are, respectively, and . We have in both cases. For the first Markov chain, , and hence the spectral gap is ; for the second Markov chain, , so the spectral gap is .
In order to guarantee , it must be possible to distinguish the two Markov chains. Assume that the initial state distribution has mass at least on state . (If this is not the case, we swap the roles of states and in the constructions above.) With probability at least half, the initial state is ; and both chains have the same transition probabilities from state . The chains are indistinguishable unless the sample path eventually reaches state . But with probability at least , a sample path of length starting from state
always remains in the same state (this follows from properties of the geometric distribution and the assumption). ∎
5.2. Proof of Theorem 3.2
We consider -state Markov chains of the following form:
for some . Such a chain is ergodic and reversible, and its unique stationary distribution satisfies
We fix and set . Consider the following different Markov chains of the type described above:
: . For this Markov chain, .
for : for , and . For these Markov chains, , and . So .
The spectral gap in each chain satisfies ; in for , it is half of what it is in . Also for each .
In order to guarantee , it must be possible to distinguish from each , . But is identical to except for the transition probabilities from state . Therefore, regardless of the initial state, the sample path must visit all states in order to distinguish from each , . For any of the Markov chains above, the earliest time in which a sample path visits all states stochastically dominates a generalized coupon collection time , where is the number of steps required to see the -th distinct state in the sample path beyond the first
. The random variablesare independent, and are geometrically distributed, . We have that
where . By the Paley-Zygmund inequality,
Since (for an appropriate absolute constant ), with probability at least , the sample path does not visit all states. ∎
6. Proof of Theorem 3.3
In this section, we prove Theorem 3.3.