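Let $S$ be a finite state space and $(X_n)_{n \in \mathbb{Z}_{\ge 0}}$ be the coordinate process on $S^{\mathbb{Z}_{\ge 0}}$, where $\mathbb{Z}_{\ge 0}$ denotes the set of nonnegative integers. Given an initial distribution $q$ on $S$, and a stochastic matrix $P$, there exists a unique probability measure $\mathbb{P}_q$ on the sequence space such that the coordinate process is a Markov chain with respect to $\mathbb{P}_q$, with transition probability matrix $P$. If we assume further that $P$ is irreducible, then there exists a unique stationary distribution $\pi$, and for any real-valued function $f : S \to \mathbb{R}$ the empirical mean $\frac{1}{n} \sum_{k=1}^n f(X_k)$ converges $\mathbb{P}_q$-almost-surely to the stationary mean $\pi(f) = \sum_{x \in S} \pi(x) f(x)$. The goal of this work is to quantify the rate of this convergence by developing finite sample upper bounds for the large deviations probability
$$\mathbb{P}_q\left(\frac{1}{n} \sum_{k=1}^n f(X_k) \ge \mu\right), \quad \text{for } \mu \ge \pi(f).$$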
The significance of studying finite sample bounds for such tail probabilities is not only theoretical but also practical, since concentration inequalities for Markov dependent random variables have wide applicability in statistics, computer science and learning theory. Just to mention a few applications, first and foremost this convergence forms the backbone behind all Markov chain Monte Carlo (MCMC) integration techniques, see Metropolis et al. (1953). Moreover, tail bounds of this form have been used by Jerrum, Sinclair and Vigoda (2001) to develop an approximation algorithm for the permanent of a nonnegative matrix. In addition, in the stochastic multi-armed bandit literature the analysis of learning algorithms is based on tail bounds of this type, see the survey of Bubeck and Cesa-Bianchi (2012). More specifically, the work of Moulos (2019) uses such a bound to tackle a Markovian identification problem.
1.1 Chernoff Bound
The classic large deviations theory for Markov chains due to Miller (1961); Donsker and Varadhan (1975); Gärtner (1977); Ellis (1984); Dembo and Zeitouni (1998) suggests that asymptotically the large deviations probability decays exponentially and the rate is given by the convex conjugate of the log-Perron-Frobenius eigenvalue of the nonnegative irreducible matrix . In particular
Our objective is to develop a finite sample bound which captures this exponential decay and has a constant prefactor that does not depend on , and is thus useful in applications. A counting based approach by Davisson, Longo and Sgarro (1981) is able to capture this exponential decay but with a suboptimal prefactor that depends polynomially on . Through the development in the book of Dembo and Zeitouni (1998) (Theorem 3.1.2), which is also presented by Watanabe and Hayashi (2017), one is able to obtain a constant prefactor, which though depends on . This is unsatisfactory because exact large deviations for Markov chains, see Miller (1961); Kontoyiannis and Meyn (2003), yield that, at least when the supremum is attained at , then
is a right Perron-Frobenius eigenvector of . Here denotes that the ratio of the expressions on the left hand side and the right hand side converges to $1$, and denotes the second derivative in of at . Thus, if we allow dependence on $n$, then the prefactor should be able to capture a decay of the order $1/\sqrt{n}$. If we insist on a constant prefactor though, the best that we can hope for is a constant prefactor, because otherwise we will contradict the central limit theorem for Markov chains. This is argued formally after Remark 7 at the end of Section 3.
In our work we establish a tail bound with the optimal rate of exponential decay and a constant prefactor which depends only on the function $f$ and the stochastic matrix $P$, under the following conditions on $P$. Let $b = \max_x f(x)$, and $a = \min_x f(x)$. Based on $f$ we define two sets of states, the ones of maximum $f$ value $S_b = \{x : f(x) = b\}$, and the ones of minimum $f$ value $S_a = \{x : f(x) = a\}$. We require that $P$ satisfies the following assumptions:
A 1. the submatrix of $P$ with rows and columns in $S_b$ is irreducible;
A 2. for every $x \notin S_b$, there exists $y \in S_b$ such that $P(x, y) > 0$;
A 3. the submatrix of $P$ with rows and columns in $S_a$ is irreducible;
A 4. for every $x \notin S_a$, there exists $y \in S_a$ such that $P(x, y) > 0$.
With those assumptions we are essentially enforcing that after suitable tilts of the transition probability matrix we are able to produce new Markov chains that can realize any stationary mean in $(a, b)$. Our assumptions are general enough to capture all Markov chains for which all the transitions have a positive probability, reversible or not, as well as all finitely supported IID sequences.
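The four assumptions are mechanical to check for a concrete chain. The following is a minimal Python sketch, assuming that states are indexed 0, ..., n-1, that the matrix is given as a list of rows, and that a one-state submatrix counts as irreducible only when its single entry is positive. The helper names `is_irreducible` and `check_assumptions` and both example chains are hypothetical.

```python
def is_irreducible(P, states):
    """True if every state in `states` reaches every state in `states`
    in at least one step, using only transitions that stay inside `states`."""
    states = list(states)
    for start in states:
        reached, frontier = set(), [start]
        while frontier:
            x = frontier.pop()
            for y in states:
                if P[x][y] > 0 and y not in reached:
                    reached.add(y)
                    frontier.append(y)
        if reached != set(states):
            return False
    return True

def check_assumptions(P, f):
    """Evaluate the four assumptions for stochastic matrix P and function f."""
    n = len(P)
    top, bottom = max(f), min(f)
    s_max = [x for x in range(n) if f[x] == top]      # states of maximum f value
    s_min = [x for x in range(n) if f[x] == bottom]   # states of minimum f value
    return {
        "A1": is_irreducible(P, s_max),
        "A2": all(any(P[x][y] > 0 for y in s_max) for x in range(n) if x not in s_max),
        "A3": is_irreducible(P, s_min),
        "A4": all(any(P[x][y] > 0 for y in s_min) for x in range(n) if x not in s_min),
    }

# A chain with all transitions positive satisfies everything.
positive = [[0.5, 0.3, 0.2],
            [0.2, 0.6, 0.2],
            [0.1, 0.4, 0.5]]
report_positive = check_assumptions(positive, [0.0, 1.0, 2.0])

# A deterministic two-state cycle violates the two irreducibility assumptions.
cycle = [[0.0, 1.0],
         [1.0, 0.0]]
report_cycle = check_assumptions(cycle, [0.0, 1.0])
```

The irreducibility test is a plain reachability search, which is enough for small state spaces; for large chains a strongly connected components algorithm would be the more economical choice.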
The key technique to derive our Chernoff type bound is the old idea due to Esscher (1932) of an exponential tilt, which lies at the heart of large deviations. In the world of statistics those exponential changes of measure go by the name exponential families and the standard reference is the book of Brown (1986). Exponential tilts of stochastic matrices generalize those of finitely supported probability distributions, and were first introduced in the work of Miller (1961). Subsequently they formed one of the main tools in the study of large deviations for Markov chains, see Donsker and Varadhan (1975); Gärtner (1977); Ellis (1984); Dembo and Zeitouni (1998); Balaji and Meyn (2000); Kontoyiannis and Meyn (2003). Naturally they are also the key object when one conditions on the second-order empirical distribution of a Markov chain and considers conditional limit theorems as in Csiszár, Cover and Choi (1987). A more recent development by Nagaoka (2005) gives an information geometry perspective to this concept, while Hayashi and Watanabe (2016) examine the problem of parameter estimation for exponential families of stochastic matrices.
Here we build on exponential families of stochastic matrices and together with some Perron-Frobenius theory, analyticity of Perron-Frobenius eigenvalues and eigenvectors, as well as conjugate duality we are able to establish our main Chernoff type bound.
According to Proposition 1, when is a positive stochastic matrix, i.e. all the transitions have positive probability, we can replace with
1.2 Hoeffding Bound
Although Chernoff type bounds for Markov chains have not been extensively studied in the literature, and that is exactly the focus of this work, there is a vast literature on Hoeffding type inequalities for Markov chains. Gillman (1993) obtained the first Hoeffding type bound for finite reversible Markov chains. Reversibility is a key assumption in his work because it leads to self-adjoint operators and then it is possible to apply the matrix perturbation theory of Kato (1966) in order to derive a bound on the largest eigenvalue of the perturbed self-adjoint operator . Later on Dinwoodie (1995) obtained an improved prefactor. Using the same spectral techniques Lezaud (1998) obtained a Bernstein type inequality which is also applicable to some nonreversible finite Markov chains, and which was later improved in the work of Paulin (2015). Kahale (1997) introduced the idea of reducing the problem to a two state chain which turned out to be very fruitful. León and Perron (2004) employed this idea and by performing exact calculations they obtained a bound which is optimal for two state chains in the large deviations sense, as well as a Hoeffding type bound with variance proxy, where is the second largest eigenvalue of the reversible stochastic matrix , as opposed to the classic variance proxy for IID sequences due to Hoeffding (1963). Miasojedow (2014) extended this work to general state spaces without the reversibility assumption, Rao (2019) considered finite stationary Markov chains but allowed time-varying functions, and finally Jiang, Sun and Fan (2018); Fan, Jiang and Sun (2018) obtained both Bernstein and Hoeffding type bounds for general state space Markov chains and time-varying functions.
Here we develop a Hoeffding type bound by loosening up our Chernoff type bound in Theorem 1 using a Pinsker type inequality in Lemma 8. In the process a Hoeffding type lemma, in Lemma 9, is established as the dual of our Pinsker type inequality.
Our variance proxy , according to Lemma 2, has an interpretation as a worst case variance among all the tilted Markov chains, and thus parallels the variance proxy from the IID case which is the supremum of the variances among the tilted distributions and which can be upper bounded by .
1.3 Organization of Paper
The rest of the paper proceeds as follows. Section 2 contains the classic construction of exponential families of stochastic matrices, the duality between the canonical and mean parametrization, as well as many other useful properties for our bounds. In Section 3 we analyze the limiting behavior of the family under our assumptions A 1-A 4, and we establish our main Chernoff (Theorem 1) and Hoeffding (Theorem 2) type bounds. Finally in Section 5 we develop a uniform multiplicative ergodic theorem (Theorem 5).
2 Exponential Family of Stochastic Matrices
Exponential tilting of stochastic matrices originates in the work of Miller (1961). Following this we define an exponential family of stochastic matrices which is able to produce Markov chains with shifted stationary means. The generator of the exponential family is an irreducible stochastic matrix , which for this section is not assumed to satisfy A 1-A 4, and represents the canonical parameter of the family. Then we define
(or , where is thought of as an operator over matrices). has the same nonnegativity structure as , hence it is irreducible and we can use the Perron-Frobenius theory in order to normalize it and turn it into a stochastic matrix. Let (or ) be the spectral radius of , which from the Perron-Frobenius theory is a simple eigenvalue of , called the Perron-Frobenius eigenvalue, associated with unique left and right eigenvectors (or ) such that they are both positive, and , see for instance Theorem 8.4.4 in the book of Horn and Johnson (2013). Using we define a family of nonnegative irreducible matrices, parametrized by , in the following way
which are stochastic, since
In addition their stationary distributions are given by
Note that the generator stochastic matrix, , is the member of the family that corresponds to , i.e. , and , where is the all ones vector. In addition it is possible that the family is degenerate as the following example suggests.
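As a numerical illustration of this construction, the sketch below assumes that the tilted matrix has entries P(x, y) exp(θ f(y)), which is the classic form of the tilt, and normalizes it with the Perron-Frobenius eigenvalue and right eigenvector. The helper names `perron`, `tilted` and `tilt`, the example chain, and the use of shifted power iteration are all choices made here for illustration.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and positive right eigenvector of a
    nonnegative matrix M, by power iteration on M + I (the identity shift
    keeps the iteration from oscillating when M is periodic)."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)              # equals rho + 1 once v has converged
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def tilted(P, f, theta):
    """Unnormalized tilt (assumed form): entries P(x, y) * exp(theta * f(y))."""
    n = len(P)
    return [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]

def tilt(P, f, theta):
    """Normalize the tilted matrix into a stochastic matrix."""
    Pt = tilted(P, f, theta)
    rho, v = perron(Pt)
    n = len(P)
    P_theta = [[Pt[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]
    return rho, v, P_theta

P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f = [0.0, 1.0, 2.0]

rho, v, P_theta = tilt(P, f, 0.7)
row_sums = [sum(row) for row in P_theta]        # each is 1 up to rounding

# Stationary distribution of the tilted chain: product of left and right eigenvectors.
_, u = perron([list(col) for col in zip(*tilted(P, f, 0.7))])
z = sum(u[x] * v[x] for x in range(3))
pi_theta = [u[x] * v[x] / z for x in range(3)]

rho0, v0, P0 = tilt(P, f, 0.0)                  # theta = 0 gives back the generator
```

Dividing by the right eigenvector is what turns the eigenvalue equation into the statement that each row sums to one, which is the reason the normalized matrices are stochastic.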
Let , and . Then , and for any .
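A degenerate family can also be seen numerically. The sketch below uses a deterministic two-state cycle with f = (0, 1), a chain picked here purely for illustration (it is an assumption that it behaves like the example above), and assumes the tilt has entries P(x, y) exp(θ f(y)). For this chain every tilt returns the original matrix, and the Perron-Frobenius eigenvalue is exp(θ/2), so its logarithm is linear in θ. The helper names are hypothetical.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration
    on M + I; the shift is essential here because the cycle is periodic."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def tilt(P, f, theta):
    """Tilt by theta (assumed form P(x, y) * exp(theta * f(y))) and normalize."""
    n = len(P)
    Pt = [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]
    rho, v = perron(Pt)
    return rho, [[Pt[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]

cycle = [[0.0, 1.0],
         [1.0, 0.0]]
f = [0.0, 1.0]

results = {}
for theta in (-1.0, 0.5, 2.0):
    rho, P_theta = tilt(cycle, f, theta)
    # Largest entrywise distance between the tilted chain and the original one.
    gap = max(abs(P_theta[x][y] - cycle[x][y]) for x in range(2) for y in range(2))
    results[theta] = (rho, gap)
```

Since the cycle alternates deterministically between the two states, every tilted chain has the same stationary mean, which is what degeneracy amounts to.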
A basic property of the family is that the composition of with , is the transform , and so composition is commutative. Furthermore we can undo the transform by applying .
For any irreducible stochastic matrix , and any
It suffices to check that is a right eigenvector of the matrix with entries , with the corresponding eigenvalue being . This is a straightforward calculation. ∎
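The composition property can be checked numerically on a small chain. The sketch below tilts in two steps and compares with a single tilt by the sum of the two parameters, again assuming the tilt has entries P(x, y) exp(θ f(y)); the chain, the parameter values and the helper names are hypothetical.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration on M + I."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def tilt(P, f, theta):
    """Return the stochastic matrix obtained by tilting P by theta."""
    n = len(P)
    Pt = [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]
    rho, v = perron(Pt)
    return [[Pt[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]

def distance(A, B):
    """Largest entrywise difference between two matrices of the same shape."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f = [0.0, 1.0, 2.0]

two_steps = tilt(tilt(P, f, 0.4), f, -1.1)      # tilt by 0.4, then by -1.1
one_step = tilt(P, f, 0.4 - 1.1)                # tilt once by the sum
composition_gap = distance(two_steps, one_step)

undone = tilt(tilt(P, f, 0.9), f, -0.9)         # a tilt followed by its inverse
undo_gap = distance(undone, P)
```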
2.2 Mean Parametrization
The exponential family can be reparametrized using the mean parameters . The duality between the canonical parameters and the mean parameters is manifested through the log-Perron-Frobenius eigenvalue . More specifically, from Lemma 2 it follows that there are two cases for the mapping . In the nondegenerate case that this mapping is nonconstant, is a strictly increasing bijection between the set of canonical parameters and the set of mean parameters, which is an open interval. Therefore, with some abuse of notation, for any we may write for . In the degenerate case that the mapping is constant, , and the set is the singleton . An illustration of the degenerate case is Example 1.
Let be an irreducible stochastic matrix, and a real-valued function on the state space . Then
and are analytic functions of on .
, where denotes the bivariate distribution defined by .
Either for all (degenerate case), or is an injection (nondegenerate case).
Moreover, in the degenerate case is linear, while in the nondegenerate case is strictly convex.
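A well-known consequence of this duality is that the derivative of the log-Perron-Frobenius eigenvalue at θ equals the stationary mean of f under the chain tilted by θ. The sketch below compares a finite-difference derivative with that stationary mean, and also evaluates a second difference to illustrate strict convexity in a nondegenerate case. The tilt form P(x, y) exp(θ f(y)), the example chain, the step size and the helper names are assumptions made for illustration.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration on M + I."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def tilted(P, f, theta):
    n = len(P)
    return [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]

def log_pf(P, f, theta):
    """Logarithm of the Perron-Frobenius eigenvalue of the tilted matrix."""
    return math.log(perron(tilted(P, f, theta))[0])

def tilted_mean(P, f, theta):
    """Stationary mean of f for the chain tilted by theta."""
    Pt = tilted(P, f, theta)
    _, v = perron(Pt)                                   # right eigenvector
    _, u = perron([list(col) for col in zip(*Pt)])      # left eigenvector
    z = sum(a * b for a, b in zip(u, v))
    return sum(f[x] * u[x] * v[x] for x in range(len(f))) / z

P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f = [0.0, 1.0, 2.0]

theta, h = 0.3, 1e-4
slope = (log_pf(P, f, theta + h) - log_pf(P, f, theta - h)) / (2 * h)
curvature = (log_pf(P, f, theta + h) - 2 * log_pf(P, f, theta)
             + log_pf(P, f, theta - h)) / h ** 2
mean = tilted_mean(P, f, theta)
mean_at_zero = tilted_mean(P, f, 0.0)   # stationary mean of the generator, 51/49 here
```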
2.3 Relative Entropy Rate and Conjugate Duality
For two probability distributions and over the same measurable space we define the relative entropy between and as
Relative entropies of stochastic processes are most of the time trivial, and so we resort to the notion of relative entropy rate. Let be two stochastic matrices over the same state space . We further assume that is irreducible with associated stationary distribution . For any initial distribution on we define the relative entropy rate between the Markov chain induced by with initial distribution , and the Markov chain induced by with initial distribution as
where and denote the finite dimensional distributions of the probability measures restricted to the sigma algebra . Note that indeed the definition is independent of the initial distribution , since we can easily see using ergodic theory that
where denotes the bivariate distribution
and we use the standard notational conventions , and .
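The relative entropy rate between two finite Markov chains has a well-known closed form: the relative entropy between the two bivariate distributions obtained by pairing the stationary distribution of the first chain with each transition matrix. The sketch below evaluates it under the usual conventions 0 log 0 = 0 log(0/0) = 0 and p log(p/0) = ∞ for p > 0, which are assumed here. The names `A`, `B`, `stationary` and `entropy_rate` are hypothetical, and `A` plays the role of the irreducible matrix whose stationary distribution is used.

```python
import math

def stationary(A, iters=5000):
    """Stationary distribution of an irreducible stochastic matrix A.
    Uses the lazy chain (A + I) / 2 so that periodic chains also converge."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [0.5 * pi[y] + 0.5 * sum(pi[x] * A[x][y] for x in range(n))
              for y in range(n)]
    return pi

def entropy_rate(A, B):
    """Relative entropy rate of the chain driven by A with respect to the chain driven by B."""
    n = len(A)
    pi = stationary(A)
    total = 0.0
    for x in range(n):
        for y in range(n):
            if A[x][y] == 0.0:
                continue                 # convention: 0 log 0 = 0 log(0/0) = 0
            if B[x][y] == 0.0:
                return math.inf          # convention: p log(p/0) = infinity for p > 0
            total += pi[x] * A[x][y] * math.log(A[x][y] / B[x][y])
    return total

A = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
sticky = [[1.0, 0.0, 0.0] for _ in range(3)]    # always jumps to state 0

rate_self = entropy_rate(A, A)                  # zero: the two chains coincide
rate_uniform = entropy_rate(A, uniform)         # strictly positive
rate_sticky = entropy_rate(A, sticky)           # infinite: A uses transitions that sticky forbids
```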
For stochastic matrices which are elements of the exponential family we simplify the relative entropy rate notation as follows. For and we write
Let and . Then
We further define the convex conjugate of as . Moreover, since we saw in Lemma 2 that is convex and analytic, we have that the biconjugate of is itself, i.e. . The convex conjugate represents the rate of exponential decay for large deviation events, and in the following Lemma 4, which is established in Appendix B, we derive a closed form expression for it.
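The convex conjugate can also be evaluated numerically as a one-dimensional concave maximization over θ. The sketch below uses a ternary search on a fixed bracket, assuming the tilt has entries P(x, y) exp(θ f(y)) and that the maximizer lies inside the bracket, which holds for means strictly between the minimum and maximum of f in this example. The helper names, the bracket and the example chain are hypothetical.

```python
import math

def perron(M, iters=500):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration on M + I."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def log_pf(P, f, theta):
    """Logarithm of the Perron-Frobenius eigenvalue of the tilted matrix."""
    n = len(P)
    Pt = [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]
    return math.log(perron(Pt)[0])

def conjugate(P, f, mu, lo=-10.0, hi=10.0, steps=60):
    """sup over theta in [lo, hi] of theta * mu - log_pf(theta), by ternary search.
    The objective is concave in theta, so the search keeps the maximizer bracketed."""
    def objective(t):
        return t * mu - log_pf(P, f, t)
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) < objective(m2):
            lo = m1
        else:
            hi = m2
    return objective(0.5 * (lo + hi))

P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f = [0.0, 1.0, 2.0]
stationary_mean = 51.0 / 49.0       # stationary mean of f under this particular P

rate_at_mean = conjugate(P, f, stationary_mean)   # no deviation, so the rate vanishes
rate_above = conjugate(P, f, 1.5)
rate_higher = conjugate(P, f, 1.8)
```

The rate is zero at the stationary mean and grows as the target mean moves away from it, matching its role as the exponent of decay for large deviation events.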
3 Optimal Chernoff Bound
3.1 Class of Stochastic Matrices
In order to develop our bounds we assume that the irreducible stochastic matrix satisfies A 1-A 4. Under those conditions we are able to show in Proposition 1 that the ratio of the entries of the right Perron-Frobenius eigenvector is uniformly bounded. Moreover, those conditions capture a large class of Markov chains, for instance Markov chains where all the transitions have positive probabilities, and Markov chains that induce IID processes. For those two categories we provide explicit uniform bounds in Proposition 1.
The following examples suggest that we cannot meet the requirement that the ratios of the entries of the right Perron-Frobenius eigenvector are uniformly bounded if we drop assumption A 1 or assumption A 3.
Let , and . Then , and as .
Let and . Then , and as .
The natural interpretation of those conditions is that they allow us to create new Markov chains with any stationary mean in the interval , by selecting appropriate tilting levels . This is formalized in Corollary 2.
3.2 Limiting Behavior of the Family
Define the matrix
and note that , as well as . Hence will help us study the asymptotic behavior of , since
Due to the structure imposed on through A 1-A 4, the following Lemma 5, which constitutes a simple extension of the Perron-Frobenius theory for matrices which are not necessarily irreducible, suggests that is a simple eigenvalue of , which is associated with unique left and right eigenvectors such that for and for , is positive, and . Similarly, is a simple eigenvalue of , which is associated with unique left and right eigenvectors such that for and for , is positive, and .
Let be a nonnegative matrix such that after a consistent renumbering of its rows and columns we can assume that consists of an irreducible square block , and a rectangular block such that none of the rows of is zero, for some , assembled together in the following way
Then, is a simple eigenvalue of , which we call the Perron-Frobenius eigenvalue, and is associated with unique left and right eigenvectors such that has its first coordinates positive and its last coordinates equal to zero, is positive, , and .
Let be the unique left and right eigenvectors of corresponding to the Perron-Frobenius eigenvalue , such that both of them are positive, and . Observe that the vectors
are left and right eigenvectors of with associated eigenvalue , and satisfy all the conditions.
In addition, any eigentriple of eigenvalue and corresponding left and right eigenvectors of , will certainly have , and gives rise to an eigentriple for . Therefore, and the uniqueness of follows from the uniqueness of . ∎
Note that from Lemma 5 for we recover the classic Perron-Frobenius theorem.
A continuity argument for simple eigenvalues and their corresponding eigenvectors enables us to describe the asymptotic behavior of in Lemma 6.
, as , and so the following is a well defined stochastic matrix
, as , and so the following is a well defined stochastic matrix
Note that both and possess the structure of Lemma 5. Consider Lemma 10 in Appendix A, with taken to be . For in a sufficiently small neighborhood of the function identified in the proof of that lemma is analytic and equals for all in that neighborhood that have the structure in Lemma 5. Now, since as , we have is in this neighborhood for all sufficiently large , and , being irreducible, satisfies the conditions of Lemma 5. The conclusion is now immediate. A similar argument works when the in Lemma 10 is taken to be . ∎
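The limiting behavior for large positive tilts can be illustrated numerically. The sketch below assumes that the tilt has entries P(x, y) exp(θ f(y)) and that the limiting matrix keeps only the columns of P that correspond to states of maximum f value; both are readings made here for illustration, and the helper names and example chain are hypothetical. After rescaling by exp(-θ max f), the Perron-Frobenius eigenvalue of the tilted matrix approaches that of the limiting matrix, and the tilted chain concentrates on the maximum-value state.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration on M + I."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

def tilt(P, f, theta):
    """Tilt by theta and normalize into a stochastic matrix."""
    n = len(P)
    Pt = [[P[x][y] * math.exp(theta * f[y]) for y in range(n)] for x in range(n)]
    rho, v = perron(Pt)
    return rho, [[Pt[x][y] * v[y] / (rho * v[x]) for y in range(n)] for x in range(n)]

P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.1, 0.4, 0.5]]
f = [0.0, 1.0, 2.0]
top = max(f)

# Assumed limiting matrix: zero out every column outside the maximum-value states.
limit_matrix = [[P[x][y] if f[y] == top else 0.0 for y in range(3)] for x in range(3)]
rho_limit, v_limit = perron(limit_matrix)       # here rho_limit equals P[2][2] = 0.5

theta = 15.0
rho, P_theta = tilt(P, f, theta)
rescaled = rho * math.exp(-theta * top)         # close to rho_limit for large theta
mass_on_top = [P_theta[x][2] for x in range(3)] # close to 1 in every row
```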
which together with Lemma 2 (b) means that any mean in the interval can be realized by some exponential tilt .
A critical ingredient to obtain our tail bounds is the following Proposition 1 which states that under the assumptions A 1-A 4 the ratio of the entries of the right Perron-Frobenius eigenvector stays uniformly bounded.
where and are constants depending on the stochastic matrix , and the function . In particular
if induces an IID process, i.e. has identical rows, then and ;
if is a positive stochastic matrix, then
Moreover using the chain rule we see that for
To see why this formula holds, first observe that , being irreducible, satisfies the conditions of Lemma 5. Next, observe that the last coordinates of , in the notation of the proof of Lemma 10, are all strictly positive. With some abuse of notation since we are not really thinking of as being enumerated, let us write the last coordinates of as , for . Lemma 10 then implies, that for all , the ratio is analytic in a sufficiently small neighborhood of . Since a small enough variation in centered around the given results in a variation of centered around that lies in the set of matrices in this neighborhood that satisfy Lemma 5 (in fact all such matrices are irreducible and, even further, are of the form , for some ), we may use the notation for the ratio . The point is that what is really intended in the partial derivatives on the right hand side of the preceding equation is .
Here, to be able to write the expression on the right hand side of the preceding equation, we first observe that satisfies the conditions of Lemma 5 and so the last coordinates of , in the notation of the proof of Lemma 10, are all strictly positive, and so, by Lemma 10, for all the ratio is analytic in a neighborhood of . Further, the equality in the preceding equation is justified by the fact that, for all , is just an alternate notation for , and, for all large enough, lies in the neighborhood around guaranteed by Lemma 10.
Therefore , and similarly , establishing that .
Furthermore for the two cases for which we have a special handle on we argue as follows.
Let be the probability distribution driving the IID process, i.e. all the rows of are identical and equal to . Then we can see that and for all , since is the rank one matrix .
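The IID case is easy to verify numerically. In the sketch below all rows of the matrix equal a distribution p, and the tilt is assumed to have entries P(x, y) exp(θ f(y)); the Perron-Frobenius eigenvalue then coincides with the moment generating function of f under p, and the right eigenvector is constant. The distribution, the value of θ and the helper name `perron` are hypothetical.

```python
import math

def perron(M, iters=2000):
    """Perron-Frobenius eigenvalue and right eigenvector via power iteration on M + I."""
    n = len(M)
    v = [1.0 / n] * n
    rho = 0.0
    for _ in range(iters):
        w = [v[x] + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
        total = sum(w)
        rho = total - 1.0
        v = [wx / total for wx in w]
    return rho, v

p = [0.2, 0.5, 0.3]                 # distribution driving the IID process
f = [0.0, 1.0, 2.0]
theta = 0.8

P = [p[:] for _ in range(3)]        # identical rows
tilted = [[P[x][y] * math.exp(theta * f[y]) for y in range(3)] for x in range(3)]

rho, v = perron(tilted)
mgf = sum(p[y] * math.exp(theta * f[y]) for y in range(3))   # moment generating function
eigenvector_ratio = max(v) / min(v)                          # 1 when v is constant
```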
If is a positive stochastic matrix then, for any we have that