Optimal Chernoff and Hoeffding Bounds for Finite Markov Chains

07/10/2019

by Vrettos Moulos, et al. (UC Berkeley)

This paper develops an optimal Chernoff type bound for the probabilities of large deviations of sums ∑_{k=1}^n f(X_k), where f is a real-valued function and (X_k)_{k ∈ ℕ_0} is a finite Markov chain with an arbitrary initial distribution and an irreducible stochastic matrix drawn from a large class of stochastic matrices. Our bound is optimal in the large deviations sense, attaining a constant prefactor and an exponential decay with the optimal large deviations rate. Moreover, through a Pinsker type inequality and a Hoeffding type lemma, we are able to loosen our Chernoff type bound to a Hoeffding type bound and reveal the sub-Gaussian nature of the sums. Finally, we show a uniform multiplicative ergodic theorem for our class of Markov chains.


1 Introduction

Let S be a finite state space and (X_k)_{k ∈ ℕ_0} be the coordinate process on S^{ℕ_0}, where ℕ_0 denotes the set of nonnegative integers. Given an initial distribution q on S, and a stochastic matrix P, there exists a unique probability measure P_q on the sequence space S^{ℕ_0} such that the coordinate process is a Markov chain with respect to P_q, with initial distribution q and transition probability matrix P. If we assume further that P is irreducible, then there exists a unique stationary distribution π, and for any real-valued function f on S the empirical mean (1/n) ∑_{k=1}^n f(X_k) converges P_q-almost-surely to the stationary mean π(f) := ∑_{x ∈ S} π(x) f(x). The goal of this work is to quantify the rate of this convergence by developing finite sample upper bounds for the large deviations probability

P_q(∑_{k=1}^n f(X_k) ≥ n(π(f) + ε)), for ε > 0.
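As a concrete illustration of this convergence, the following minimal sketch (not from the paper; the chain, the function f, and the sample size are arbitrary choices) simulates a small chain and compares the empirical mean with the stationary mean π(f).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 3-state irreducible chain and function f.
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])
f = np.array([0.0, 1.0, 2.0])

# Stationary distribution: positive left eigenvector of P for eigenvalue 1.
w, V = np.linalg.eig(P.T)
pi = np.abs(V[:, np.argmax(w.real)].real)
pi = pi / pi.sum()

# Simulate the chain and compare the empirical mean with pi(f).
n, x, total = 100_000, 0, 0.0
for _ in range(n):
    x = rng.choice(3, p=P[x])
    total += f[x]
print(total / n, pi @ f)   # the two numbers should be close
```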

The significance of studying finite sample bounds for such tail probabilities is not only theoretical but also practical, since concentration inequalities for Markov dependent random variables have wide applicability in statistics, computer science and learning theory. To mention just a few applications, first and foremost this convergence forms the backbone behind all Markov chain Monte Carlo (MCMC) integration techniques, see Metropolis et al. (1953). Moreover, tail bounds of this form have been used by Jerrum, Sinclair and Vigoda (2001) to develop an approximation algorithm for the permanent of a nonnegative matrix. In addition, in the stochastic multi-armed bandit literature the analysis of learning algorithms is based on tail bounds of this type, see the survey of Bubeck and Cesa-Bianchi (2012). More specifically, the work of Moulos (2019) uses such a bound to tackle a Markovian identification problem.

1.1 Chernoff Bound

The classic large deviations theory for Markov chains due to Miller (1961); Donsker and Varadhan (1975); Gärtner (1977); Ellis (1984); Dembo and Zeitouni (1998) suggests that asymptotically the large deviations probability decays exponentially, and the rate is given by the convex conjugate

Λ*(μ) := sup_{θ ∈ ℝ} {θμ − Λ(θ)}

of the log-Perron-Frobenius eigenvalue

Λ(θ) := log ρ(P̃_θ)

of the nonnegative irreducible matrix P̃_θ, with entries P̃_θ(x, y) := P(x, y) e^{θ f(y)}. In particular, for μ > π(f),

lim_{n → ∞} (1/n) log P_q(∑_{k=1}^n f(X_k) ≥ nμ) = −Λ*(μ).

Our objective is to develop a finite sample bound which captures this exponential decay and has a constant prefactor that does not depend on n, and is thus useful in applications. A counting based approach by Davisson, Longo and Sgarro (1981) is able to capture this exponential decay but with a suboptimal prefactor that depends polynomially on n. Through the development in the book of Dembo and Zeitouni (1998) (Theorem 3.1.2), which is also presented by Watanabe and Hayashi (2017), one is able to obtain a constant prefactor, which though depends on the deviation level μ. This is unsatisfactory because exact large deviations for Markov chains, see Miller (1961); Kontoyiannis and Meyn (2003), yield that, at least when the supremum in the definition of Λ*(μ) is attained at some θ_μ > 0, then

P_q(∑_{k=1}^n f(X_k) ≥ nμ) ∼ (C_μ/√n) e^{−n Λ*(μ)},

where the constant C_μ is given explicitly in terms of θ_μ, Λ''(θ_μ), and a right Perron-Frobenius eigenvector of the tilted matrix P̃_{θ_μ}. Here ∼ denotes that the ratio of the expressions on the left hand side and the right hand side converges to 1, and Λ''(θ_μ) denotes the second derivative in θ of Λ(θ) at θ_μ. Thus, if we allow dependence on n, then the prefactor should be able to capture a decay of the order 1/√n. If we insist on a prefactor that does not depend on n though, the best that we can hope for is a constant, because otherwise we will contradict the central limit theorem for Markov chains. This is argued formally after Remark 7 at the end of Section 3.

In our work we establish a tail bound with the optimal rate of exponential decay, and a constant prefactor which depends only on the function f and the stochastic matrix P, under the following conditions on the pair (P, f). Let a := min_{x ∈ S} f(x), and b := max_{x ∈ S} f(x). Based on f we define two sets of states, the ones of maximum value, S_max := {x ∈ S : f(x) = b}, and the ones of minimum value, S_min := {x ∈ S : f(x) = a}. We require that the pair (P, f) satisfies the following assumptions:

A 1.

the submatrix of P with rows and columns in S_max is irreducible;

A 2.

for every x ∉ S_max, there exists y ∈ S_max such that P(x, y) > 0;

A 3.

the submatrix of P with rows and columns in S_min is irreducible;

A 4.

for every x ∉ S_min, there exists y ∈ S_min such that P(x, y) > 0.

With those assumptions we are essentially enforcing that, after suitable tilts of the transition probability matrix P, we are able to produce new Markov chains that can realize any stationary mean in (a, b). Our assumptions are general enough to capture all Markov chains for which all the transitions have positive probability, reversible or not, as well as all finitely supported IID sequences.
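These assumptions are mechanical to verify for a given pair (P, f). The following checker is a sketch under our reading of A 1-A 4 above; the function names and the tolerance are our own choices.

```python
import numpy as np

def satisfies_assumptions(P, f, tol=0.0):
    # A 1/A 3: the submatrices of P on the argmax/argmin sets of f are
    # irreducible; A 2/A 4: every state outside each of those sets has a
    # direct transition into it.
    S_max = np.where(f == f.max())[0]
    S_min = np.where(f == f.min())[0]

    def irreducible(A):
        d = A.shape[0]
        if d == 1:
            # Convention: a single state forms an irreducible block only
            # if it has a self-loop (cf. Example 2 in Section 3.1).
            return A[0, 0] > tol
        M = np.linalg.matrix_power(np.eye(d) + (A > tol), d - 1)
        return bool(np.all(M > 0))

    def reaches(block):
        others = np.setdiff1d(np.arange(len(f)), block)
        return all(P[x, block].sum() > tol for x in others)

    return (irreducible(P[np.ix_(S_max, S_max)]) and reaches(S_max) and
            irreducible(P[np.ix_(S_min, S_min)]) and reaches(S_min))

P = np.array([[0.5, 0.5], [0.5, 0.5]])   # a positive matrix always passes
f = np.array([0.0, 1.0])
print(satisfies_assumptions(P, f))       # True
```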

The key technique to derive our Chernoff type bound is the old idea, due to Esscher (1932), of an exponential tilt, which lies at the heart of large deviations. In the world of statistics those exponential changes of measure go by the name exponential families, and the standard reference is the book of Brown (1986). Exponential tilts of stochastic matrices generalize those of finitely supported probability distributions, and were first introduced in the work of Miller (1961). Subsequently they formed one of the main tools in the study of large deviations for Markov chains, see Donsker and Varadhan (1975); Gärtner (1977); Ellis (1984); Dembo and Zeitouni (1998); Balaji and Meyn (2000); Kontoyiannis and Meyn (2003). Naturally they are also the key object when one conditions on the second-order empirical distribution of a Markov chain and considers conditional limit theorems, as in Csiszár, Cover and Choi (1987). A more recent development by Nagaoka (2005) gives an information geometry perspective to this concept, while Hayashi and Watanabe (2016) examine the problem of parameter estimation for exponential families of stochastic matrices.

Here we build on exponential families of stochastic matrices, and together with some Perron-Frobenius theory, the analyticity of Perron-Frobenius eigenvalues and eigenvectors, as well as conjugate duality, we are able to establish our main Chernoff type bound.

Theorem 1.

Let P be an irreducible stochastic matrix on the finite state space S, with stationary distribution π, which combined with a real-valued function f satisfies A 1-A 4. Then for any initial distribution q, any ε > 0, and any n ∈ ℕ,

P_q(∑_{k=1}^n f(X_k) ≥ n(π(f) + ε)) ≤ C e^{−n Λ*(π(f) + ε)},

where C is the constant from Proposition 1, and depends only on the stochastic matrix P and the function f.

Remark 1.

Since q is arbitrary and our assumptions A 1-A 4 are symmetric among the maximum and the minimum value of f, Theorem 1 also yields a Chernoff type bound for the lower tail. In particular

P_q(∑_{k=1}^n f(X_k) ≤ n(π(f) − ε)) ≤ C e^{−n Λ*(π(f) − ε)}.

Remark 2.

According to Proposition 1, when P is a positive stochastic matrix, i.e. all the transitions have positive probability, we can replace C with

max_{x, y, z ∈ S} P(x, z)/P(y, z).

Remark 3.

According to Proposition 1, when P induces an IID sequence, i.e. all the rows of P are identical, then C = 1. Thus Theorem 1 generalizes the classic bound of Chernoff (1952) for finitely supported IID sequences.
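As a numerical sanity check of this remark (our own construction, with arbitrary p and deviation level x), one can verify that for an IID Bernoulli(p) sequence, viewed as a two state Markov chain with identical rows, the Markovian rate Λ* coincides with the Kullback-Leibler rate in Chernoff's bound.

```python
import numpy as np
from scipy.optimize import minimize_scalar

p = 0.3
P = np.array([[1 - p, p], [1 - p, p]])   # identical rows: IID Bernoulli(p)
f = np.array([0.0, 1.0])

def Lambda(theta):
    # log Perron-Frobenius eigenvalue of P(x,y) e^{theta f(y)}; for identical
    # rows it reduces to the log moment generating function of f.
    Pt = P * np.exp(theta * f)[None, :]
    return np.log(np.max(np.abs(np.linalg.eigvals(Pt))))

def Lambda_star(x):
    # Convex conjugate: sup_theta { theta x - Lambda(theta) }.
    return -minimize_scalar(lambda t: Lambda(t) - t * x).fun

x = 0.5
kl = x * np.log(x / p) + (1 - x) * np.log((1 - x) / (1 - p))
print(Lambda_star(x), kl)   # agree: the Markovian rate reduces to the KL rate
```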

1.2 Hoeffding Bound

Although Chernoff type bounds for Markov chains have not been extensively studied in the literature, and they are exactly the focus of this work, there is a vast literature on Hoeffding type inequalities for Markov chains. Gillman (1993) obtained the first Hoeffding type bound for finite reversible Markov chains. Reversibility is a key assumption in his work, because it leads to self-adjoint operators, and then it is possible to apply the matrix perturbation theory of Kato (1966) in order to derive a bound on the largest eigenvalue of the perturbed self-adjoint operator. Later on Dinwoodie (1995) obtained an improved prefactor. Using the same spectral techniques Lezaud (1998) obtained a Bernstein type inequality which is also applicable to some nonreversible finite Markov chains, and which was later improved in the work of Paulin (2015). Kahale (1997) introduced the idea of reducing the problem to a two state chain, which turned out to be very fruitful. León and Perron (2004) employed this idea and, by performing exact calculations, they obtained a bound which is optimal for two state chains in the large deviations sense, as well as a Hoeffding type bound with variance proxy ((1 + λ₂)/(1 − λ₂)) (b − a)²/4, where λ₂ is the second largest eigenvalue of the reversible stochastic matrix P, as opposed to the classic variance proxy (b − a)²/4 for IID sequences due to Hoeffding (1963). Miasojedow (2014) extended this work to general state spaces without the reversibility assumption, Rao (2019) considered finite stationary Markov chains but allowed time-varying functions f_k, and finally Jiang, Sun and Fan (2018); Fan, Jiang and Sun (2018) obtained both Bernstein and Hoeffding type bounds for general state space Markov chains and time-varying functions f_k.

Here we develop a Hoeffding type bound by loosening our Chernoff type bound of Theorem 1 using a Pinsker type inequality, Lemma 8. In the process a Hoeffding type lemma, Lemma 9, is established as the dual of our Pinsker type inequality.

Theorem 2.

Let P be an irreducible stochastic matrix on the finite state space S, with stationary distribution π, which combined with a real-valued function f satisfies A 1-A 4. Then for any initial distribution q, any ε > 0, and any n ∈ ℕ,

P_q(∑_{k=1}^n f(X_k) ≥ n(π(f) + ε)) ≤ C e^{−nε²/(2γ²)},

where γ² := sup_{θ ∈ ℝ} Λ''(θ) is the variance proxy, Λ''(θ) denotes the second derivative of Λ(θ) in θ, and C is the constant from Proposition 1.

Remark 4.

Since q is arbitrary and our assumptions A 1-A 4 are symmetric among the maximum and the minimum value of f, Theorem 2 also yields a Hoeffding type bound for the lower tail. In particular

P_q(∑_{k=1}^n f(X_k) ≤ n(π(f) − ε)) ≤ C e^{−nε²/(2γ²)}.

Remark 5.

According to Proposition 1, when P induces an IID sequence, i.e. all the rows of P are identical, then C = 1 and γ² ≤ (b − a)²/4. Thus Theorem 2 generalizes the classic bound of Hoeffding (1963) for finitely supported IID sequences.

Remark 6.

Our variance proxy γ², according to Lemma 2, has an interpretation as a worst case variance among all the tilted Markov chains, and thus parallels the variance proxy from the IID case, which is the supremum of the variances of the tilted distributions, and which can be upper bounded by (b − a)²/4.
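The variance proxy is also straightforward to approximate numerically. The sketch below (an arbitrary two state reversible chain; grid and step size are our choices) approximates γ² = sup_θ Λ''(θ) by central differences, and prints it next to the León-Perron proxy and the IID Hoeffding proxy for comparison.

```python
import numpy as np

P = np.array([[0.9, 0.1], [0.1, 0.9]])   # illustrative reversible chain
f = np.array([0.0, 1.0])                 # b - a = 1

def Lambda(theta):
    Pt = P * np.exp(theta * f)[None, :]
    return np.log(np.max(np.abs(np.linalg.eigvals(Pt))))

# gamma^2 = sup_theta Lambda''(theta), approximated on a grid.
h, grid = 1e-4, np.linspace(-20, 20, 2001)
gamma2 = max((Lambda(t + h) - 2 * Lambda(t) + Lambda(t - h)) / h ** 2
             for t in grid)

lam2 = np.sort(np.linalg.eigvals(P).real)[-2]   # second largest eigenvalue
print(gamma2)                         # variance proxy of Theorem 2
print((1 + lam2) / (1 - lam2) / 4)    # Leon-Perron proxy (1+l2)/(1-l2)(b-a)^2/4
print(1 / 4)                          # IID Hoeffding proxy (b-a)^2/4
```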

1.3 Organization of Paper

The rest of the paper proceeds as follows. Section 2 contains the classic construction of exponential families of stochastic matrices, the duality between the canonical and mean parametrization, as well as many other useful properties for our bounds. In Section 3 we analyze the limiting behavior of the family under our assumptions A 1-A 4, and we establish our main Chernoff (Theorem 1) and Hoeffding (Theorem 2) type bounds. Finally in Section 5 we develop a uniform multiplicative ergodic theorem (Theorem 5).

2 Exponential Family of Stochastic Matrices

2.1 Construction

Exponential tilting of stochastic matrices originates in the work of Miller (1961). Following this we define an exponential family of stochastic matrices which is able to produce Markov chains with shifted stationary means. The generator of the exponential family is an irreducible stochastic matrix P, which for this section is not assumed to satisfy A 1-A 4, and θ ∈ ℝ represents the canonical parameter of the family. Then we define

P̃_θ(x, y) := P(x, y) e^{θ f(y)}

(or P̃_θ = E_θ(P), where E_θ is thought of as an operator over matrices). P̃_θ has the same nonnegativity structure as P, hence it is irreducible, and we can use the Perron-Frobenius theory in order to normalize it and turn it into a stochastic matrix. Let ρ(θ) (or ρ(P̃_θ)) be the spectral radius of P̃_θ, which from the Perron-Frobenius theory is a simple eigenvalue of P̃_θ, called the Perron-Frobenius eigenvalue, associated with unique left and right eigenvectors u_θ, v_θ (or u(P̃_θ), v(P̃_θ)) such that they are both positive, ∑_x u_θ(x) = 1, and ∑_x u_θ(x) v_θ(x) = 1, see for instance Theorem 8.4.4 in the book of Horn and Johnson (2013). Using v_θ we define a family of nonnegative irreducible matrices {P_θ}_{θ ∈ ℝ}, parametrized by θ, in the following way

P_θ(x, y) := P̃_θ(x, y) v_θ(y) / (ρ(θ) v_θ(x)),

which are stochastic, since

∑_y P_θ(x, y) = ∑_y P̃_θ(x, y) v_θ(y) / (ρ(θ) v_θ(x)) = ρ(θ) v_θ(x) / (ρ(θ) v_θ(x)) = 1.

In addition their stationary distributions are given by

π_θ(x) := u_θ(x) v_θ(x),

since

∑_x π_θ(x) P_θ(x, y) = (1/ρ(θ)) ∑_x u_θ(x) P̃_θ(x, y) v_θ(y) = u_θ(y) v_θ(y) = π_θ(y).

Note that the generator stochastic matrix P is the member of the family that corresponds to θ = 0, i.e. P_0 = P, and v_0 = 1, where 1 is the all ones vector. In addition it is possible that the family is degenerate, as the following example suggests.

Example 1.

Let S = {0, 1}, f(x) = x, and let P be the deterministic cycle with P(0, 1) = P(1, 0) = 1. Then Λ(θ) = θ/2, and P_θ = P for any θ ∈ ℝ.
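The construction is easy to implement; the helper below (names are ours) computes P_θ for a given pair (P, f), and reproduces the degeneracy of Example 1.

```python
import numpy as np

def tilt(P, f, theta):
    # Exponential tilt: normalize Pt(x,y) = P(x,y) e^{theta f(y)} by its
    # Perron-Frobenius eigenvalue rho and right eigenvector v.
    Pt = P * np.exp(theta * f)[None, :]
    w, V = np.linalg.eig(Pt)
    k = np.argmax(w.real)                    # Perron-Frobenius eigenvalue
    rho, v = w[k].real, np.abs(V[:, k].real)
    return Pt * v[None, :] / (rho * v[:, None])

f = np.array([0.0, 1.0])

# Nondegenerate case: the tilt shifts mass toward large values of f.
P = np.array([[0.5, 0.5], [0.5, 0.5]])
print(tilt(P, f, 2.0))   # rows sum to 1, mass shifted toward state 1

# Degenerate case of Example 1: the deterministic two-state cycle.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
print(tilt(P, f, 2.0))   # equals P for every theta
```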

A basic property of the family is that the composition of the tilt by θ with the tilt by θ' is the tilt by θ + θ', and so composition is commutative. Furthermore we can undo the tilt by θ by applying the tilt by −θ.

Lemma 1.

For any irreducible stochastic matrix P, and any θ, θ' ∈ ℝ,

(P_θ)_{θ'} = P_{θ + θ'}.

Proof.

It suffices to check that the vector with entries v_{θ+θ'}(x)/v_θ(x) is a right eigenvector of the matrix with entries P_θ(x, y) e^{θ' f(y)}, with the corresponding eigenvalue being ρ(θ + θ')/ρ(θ). This is a straightforward calculation. ∎
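Lemma 1 is also easy to confirm numerically; in the sketch below (an arbitrary three state chain, reusing the tilt helper from the previous snippet) the double tilt and the single tilt agree to machine precision.

```python
import numpy as np

def tilt(P, f, theta):
    # Same construction as in Section 2.1 (sketch).
    Pt = P * np.exp(theta * f)[None, :]
    w, V = np.linalg.eig(Pt)
    k = np.argmax(w.real)
    rho, v = w[k].real, np.abs(V[:, k].real)
    return Pt * v[None, :] / (rho * v[:, None])

P = np.array([[0.2, 0.5, 0.3],
              [0.4, 0.1, 0.5],
              [0.6, 0.2, 0.2]])
f = np.array([-1.0, 0.0, 2.0])

lhs = tilt(tilt(P, f, 0.7), f, -1.9)   # tilt by theta, then by theta'
rhs = tilt(P, f, 0.7 - 1.9)            # single tilt by theta + theta'
print(np.allclose(lhs, rhs))           # True, illustrating Lemma 1
```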

2.2 Mean Parametrization

The exponential family can be reparametrized using the mean parameters π_θ(f). The duality between the canonical parameters θ and the mean parameters is manifested through the log-Perron-Frobenius eigenvalue Λ(θ) := log ρ(θ). More specifically, from Lemma 2 it follows that there are two cases for the mapping θ ↦ Λ'(θ) = π_θ(f). In the nondegenerate case that this mapping is nonconstant, it is a strictly increasing bijection between the set ℝ of canonical parameters and the set M := {π_θ(f) : θ ∈ ℝ} of mean parameters, which is an open interval. Therefore, with some abuse of notation, for any μ ∈ M we may write P_μ for P_θ with θ = (Λ')^{−1}(μ). In the degenerate case that the mapping is constant, Λ'(θ) = π(f) for all θ, and the set M is the singleton {π(f)}. An illustration of the degenerate case is Example 1.

Lemma 2.

Let P be an irreducible stochastic matrix, and f a real-valued function on the state space S. Then

  (a) ρ(θ), u_θ, and v_θ, and hence Λ(θ), are analytic functions of θ on ℝ;

  (b) Λ'(θ) = π_θ(f);

  (c) Λ''(θ) can be expressed as a variance with respect to the bivariate distribution Q_θ defined by Q_θ(x, y) := π_θ(x) P_θ(x, y);

  (d) either Λ'(θ) = π(f) for all θ ∈ ℝ (degenerate case), or θ ↦ Λ'(θ) is an injection (nondegenerate case). Moreover, in the degenerate case Λ(θ) is linear, while in the nondegenerate case Λ(θ) is strictly convex.

The proof of Lemma 2 can be found in Appendix B.
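As a quick numerical check of Lemma 2 (b) (an arbitrary two state chain; the eigenvector normalization follows Section 2.1), the derivative Λ'(θ), approximated by central differences, matches the stationary mean π_θ(f).

```python
import numpy as np

P = np.array([[0.7, 0.3], [0.2, 0.8]])
f = np.array([0.0, 1.0])

def pf(Pt):
    # Perron-Frobenius eigenvalue with positive left/right eigenvectors,
    # normalized so that sum_x u(x) = 1 and sum_x u(x) v(x) = 1.
    w, V = np.linalg.eig(Pt)
    k = np.argmax(w.real)
    v = np.abs(V[:, k].real)
    wl, U = np.linalg.eig(Pt.T)
    u = np.abs(U[:, np.argmax(wl.real)].real)
    u = u / u.sum()
    v = v / (u @ v)
    return w[k].real, u, v

theta, h = 0.8, 1e-6
rho, u, v = pf(P * np.exp(theta * f)[None, :])
pi_theta = u * v                         # stationary distribution of P_theta

Lam = lambda t: np.log(pf(P * np.exp(t * f)[None, :])[0])
print((Lam(theta + h) - Lam(theta - h)) / (2 * h))   # Lambda'(theta)
print(pi_theta @ f)                                  # pi_theta(f): they match
```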

2.3 Relative Entropy Rate and Conjugate Duality

For two probability distributions μ and ν over the same measurable space we define the relative entropy between μ and ν as

D(μ ‖ ν) := ∫ log (dμ/dν) dμ, if μ ≪ ν, and D(μ ‖ ν) := ∞, otherwise.

Relative entropies of stochastic processes are most of the time trivial (typically infinite), and so we resort to the notion of relative entropy rate. Let Q̃, Q be two stochastic matrices over the same state space S. We further assume that Q̃ is irreducible with associated stationary distribution π̃. For any initial distribution q on S we define the relative entropy rate between the Markov chain induced by Q̃ with initial distribution q, and the Markov chain induced by Q with initial distribution q as

K(Q̃ ‖ Q) := lim_{n → ∞} (1/n) D(Q̃_q^n ‖ Q_q^n),

where Q̃_q^n and Q_q^n denote the finite dimensional distributions of the corresponding probability measures restricted to the sigma algebra σ(X_0, …, X_n). Note that indeed the definition is independent of the initial distribution q, since we can easily see using ergodic theory that

K(Q̃ ‖ Q) = ∑_{x, y ∈ S} (π̃ ⊗ Q̃)(x, y) log (Q̃(x, y)/Q(x, y)),

where π̃ ⊗ Q̃ denotes the bivariate distribution

(π̃ ⊗ Q̃)(x, y) := π̃(x) Q̃(x, y),

and we use the standard notational conventions 0 log 0 := 0, 0 log (0/0) := 0, and x log (x/0) := ∞ for x > 0.
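The closed form expression for the relative entropy rate is immediate to compute. A minimal sketch, assuming Q̃ ≪ Q entrywise and using the convention 0 log 0 = 0:

```python
import numpy as np

def stationary(Q):
    # Positive left eigenvector of the irreducible stochastic matrix Q
    # for the eigenvalue 1, normalized to a probability distribution.
    w, V = np.linalg.eig(Q.T)
    pi = np.abs(V[:, np.argmax(w.real)].real)
    return pi / pi.sum()

def relative_entropy_rate(Qt, Q):
    # K(Qt || Q) = sum_{x,y} pit(x) Qt(x,y) log(Qt(x,y) / Q(x,y)).
    pit = stationary(Qt)
    J = pit[:, None] * Qt          # bivariate distribution pit (x) Qt
    mask = Qt > 0                  # 0 log 0 = 0 convention
    return float(np.sum(J[mask] * np.log(Qt[mask] / Q[mask])))

Q  = np.array([[0.5, 0.5], [0.5, 0.5]])
Qt = np.array([[0.9, 0.1], [0.2, 0.8]])
print(relative_entropy_rate(Qt, Q))
```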

For stochastic matrices which are elements of the exponential family we simplify the relative entropy rate notation as follows. For θ, θ' ∈ ℝ we write

K(θ ‖ θ') := K(P_θ ‖ P_{θ'}).

For those relative entropy rates Lemma 3 suggests an alternative representation based on the parametrization. Its proof can be found in Appendix B.

Lemma 3.

Let θ, θ' ∈ ℝ. Then

K(θ ‖ θ') = Λ(θ') − Λ(θ) + (θ − θ') Λ'(θ).

We further define the convex conjugate of Λ as

Λ*(x) := sup_{θ ∈ ℝ} {θx − Λ(θ)}.

Moreover, since we saw in Lemma 2 that Λ is convex and analytic, we have that the biconjugate of Λ is Λ itself, i.e. Λ** = Λ. The convex conjugate represents the rate of exponential decay for large deviation events, and in the following Lemma 4, which is established in Appendix B, we derive a closed form expression for it.

Lemma 4.

In the nondegenerate case, for any x ∈ M we have

Λ*(x) = K(θ_x ‖ 0), where θ_x := (Λ')^{−1}(x).

An inspection of how the supremum was obtained in the previous Lemma 4 yields the following Corollary 1.

Corollary 1.

For any x ∈ M, the supremum in the definition of Λ*(x) is attained at θ_x = (Λ')^{−1}(x), i.e. Λ*(x) = θ_x x − Λ(θ_x).
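The conjugate pair (Λ, Λ*) and the attainment property can be explored numerically. In the sketch below (the chain and the target mean are arbitrary choices) Λ*(x) is computed by direct maximization, and cross-checked against the relative entropy rate representation K(θ_x ‖ 0) of Lemmas 3 and 4.

```python
import numpy as np
from scipy.optimize import minimize_scalar

P = np.array([[0.7, 0.3], [0.2, 0.8]])
f = np.array([0.0, 1.0])

def Lambda(theta):
    Pt = P * np.exp(theta * f)[None, :]
    return np.log(np.max(np.abs(np.linalg.eigvals(Pt))))

# Lambda*(x) = sup_theta { theta x - Lambda(theta) }, with its maximizer.
res = minimize_scalar(lambda t: Lambda(t) - t * 0.6)
value, theta_x = -res.fun, res.x

# Lemma 3 with theta' = 0 gives K(theta_x || 0) = theta_x Lambda'(theta_x)
# - Lambda(theta_x); by Lemma 4 this equals Lambda*(0.6).
h = 1e-6
dLam = (Lambda(theta_x + h) - Lambda(theta_x - h)) / (2 * h)
print(value, theta_x * dLam - Lambda(theta_x))   # the two agree
```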

3 Optimal Chernoff Bound

3.1 Class of Stochastic Matrices

In order to develop our bounds we assume that the irreducible stochastic matrix P, together with the function f, satisfies A 1-A 4. Under those conditions we are able to show in Proposition 1 that the ratio of the entries of the right Perron-Frobenius eigenvector v_θ is uniformly bounded in θ. Moreover, those conditions capture a large class of Markov chains, for instance Markov chains where all the transitions have positive probabilities, and Markov chains that induce IID processes. For those two categories we provide explicit uniform bounds in Proposition 1.

The following example suggests that we cannot meet the requirement that the ratios of the entries of the right Perron-Frobenius eigenvector are uniformly bounded if we drop assumption A 1 or assumption A 3.

Example 2.

Let S = {0, 1}, f(x) = x, and let P(0, 0) = P(0, 1) = 1/2 and P(1, 0) = 1, so that the unique state of maximum value has no self-transition and A 1 fails. Then v_θ(0)/v_θ(1) → ∞ as θ → ∞.

Similarly a birth-death chain shows the necessity of assumptions A 2 and A 4.

Example 3.

Let S = {0, 1, 2}, f(x) = x, and let P be the birth-death chain with P(0, 0) = P(0, 1) = 1/2, P(1, 0) = P(1, 2) = 1/2, and P(2, 1) = P(2, 2) = 1/2. Then A 1 and A 3 hold, but A 2 and A 4 fail, since states 0 and 2 have no direct transitions into S_max = {2} and S_min = {0} respectively, and v_θ(2)/v_θ(0) → ∞ as θ → ∞.
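Numerically the blow-up in such examples is easy to observe. The sketch below uses a concrete two state instance in the spirit of Example 2 (our illustrative choice of P and f) and prints the ratio of the extreme entries of v_θ as θ grows.

```python
import numpy as np

# The unique maximizer of f has no self-loop, so A 1 fails.
P = np.array([[0.5, 0.5], [1.0, 0.0]])
f = np.array([0.0, 1.0])

def ratio(theta):
    Pt = P * np.exp(theta * f)[None, :]
    w, V = np.linalg.eig(Pt)
    v = np.abs(V[:, np.argmax(w.real)].real)
    return v.max() / v.min()

for theta in [1.0, 5.0, 10.0, 20.0]:
    print(theta, ratio(theta))   # the eigenvector ratio diverges with theta
```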

The natural interpretation of those conditions is that they allow us to create new Markov chains with any stationary mean in the interval (a, b), by selecting appropriate tilting levels θ. This is formalized in Corollary 2.

3.2 Limiting Behavior of the Family

Define the matrix

P̄_b(x, y) := P(x, y) 1{f(y) = b},

and note that e^{−θb} P̃_θ(x, y) = P(x, y) e^{θ(f(y) − b)} → P̄_b(x, y) as θ → ∞, as well as ρ(e^{−θb} P̃_θ) = e^{−θb} ρ(θ). Hence P̄_b will help us study the asymptotic behavior of P_θ as θ → ∞, since

P_θ(x, y) = (e^{−θb} P̃_θ(x, y)) v_θ(y) / (e^{−θb} ρ(θ) v_θ(x)).

Note that, after reordering the states so that the states in S_max come first, assumptions A 1 and A 2 give that P̄_b has the block structure of Lemma 5 below, and similarly assumptions A 3 and A 4 give that the matrix

P̄_a(x, y) := P(x, y) 1{f(y) = a}

has the block structure of Lemma 5, after reordering the states so that the states in S_min come first.

Due to the structure imposed on P̄_b and P̄_a through A 1-A 4, the following Lemma 5, which constitutes a simple extension of the Perron-Frobenius theory to matrices which are not necessarily irreducible, suggests that ρ(P̄_b) is a simple eigenvalue of P̄_b, which is associated with unique left and right eigenvectors u_∞, v_∞ such that u_∞(x) = 0 for x ∉ S_max and u_∞(x) > 0 for x ∈ S_max, v_∞ is positive, and ∑_x u_∞(x) = 1, ∑_x u_∞(x) v_∞(x) = 1. Similarly, ρ(P̄_a) is a simple eigenvalue of P̄_a, which is associated with unique left and right eigenvectors u_{−∞}, v_{−∞} such that u_{−∞}(x) = 0 for x ∉ S_min and u_{−∞}(x) > 0 for x ∈ S_min, v_{−∞} is positive, and ∑_x u_{−∞}(x) = 1, ∑_x u_{−∞}(x) v_{−∞}(x) = 1.

Lemma 5.

Let A be a nonnegative matrix such that, after a consistent renumbering of its rows and columns, we can assume that A consists of an irreducible square block B of size s × s, and a rectangular block C of size (|S| − s) × s none of whose rows is zero, for some 1 ≤ s ≤ |S| (when s = |S| the block C is absent), assembled together in the following way

A = ( B 0
      C 0 ).

Then, ρ(B) is a simple eigenvalue of A, which we call the Perron-Frobenius eigenvalue, and is associated with unique left and right eigenvectors u, v such that u has its first s coordinates positive and its last |S| − s coordinates equal to zero, v is positive, ∑_x u(x) = 1, and ∑_x u(x) v(x) = 1.

Proof.

Let ū, v̄ be the unique left and right eigenvectors of B corresponding to the Perron-Frobenius eigenvalue ρ(B), such that both of them are positive, ∑_x ū(x) = 1, and ∑_x ū(x) v̄(x) = 1. Observe that the vectors

u := (ū, 0) and v := (v̄, C v̄ / ρ(B))

are left and right eigenvectors of A with associated eigenvalue ρ(B), and satisfy all the conditions.

In addition, any eigentriple of A with a nonzero eigenvalue λ and corresponding left and right eigenvectors u, v, will certainly have the last |S| − s coordinates of u equal to zero, and gives rise to an eigentriple for B by restricting u and v to their first s coordinates. Therefore, ρ(A) = ρ(B), and the uniqueness of u, v follows from the uniqueness of ū, v̄. ∎

Note that from Lemma 5, in the case s = |S| where the block C is absent, we recover the classic Perron-Frobenius theorem.

A continuity argument for simple eigenvalues and their corresponding eigenvectors enables us to describe the asymptotic behavior of P_θ in Lemma 6.

Lemma 6.
  (a) e^{−θb} ρ(θ) → ρ(P̄_b) and v_θ → v_∞ (up to positive scaling), as θ → ∞, and so the following is a well defined stochastic matrix

P_∞(x, y) := lim_{θ → ∞} P_θ(x, y) = P̄_b(x, y) v_∞(y) / (ρ(P̄_b) v_∞(x));

  (b) e^{−θa} ρ(θ) → ρ(P̄_a) and v_θ → v_{−∞} (up to positive scaling), as θ → −∞, and so the following is a well defined stochastic matrix

P_{−∞}(x, y) := lim_{θ → −∞} P_θ(x, y) = P̄_a(x, y) v_{−∞}(y) / (ρ(P̄_a) v_{−∞}(x)).

Proof.

Note that both P̄_b and P̄_a possess the structure of Lemma 5. Consider Lemma 10 in Appendix A, with A taken to be P̄_b. For matrices in a sufficiently small neighborhood of P̄_b the function identified in the proof of that lemma is analytic, and equals the Perron-Frobenius eigenvalue for all matrices in that neighborhood that have the structure in Lemma 5. Now, since e^{−θb} P̃_θ → P̄_b as θ → ∞, we have that e^{−θb} P̃_θ is in this neighborhood for all sufficiently large θ, and e^{−θb} P̃_θ, being irreducible, satisfies the conditions of Lemma 5. The conclusion is now immediate. A similar argument works when the A in Lemma 10 is taken to be P̄_a. ∎

The combination of the extended Perron-Frobenius theory of Lemma 5, and the limiting behavior of the exponential family from Lemma 6, implies that

lim_{θ → ∞} π_θ(f) = π_∞(f) = b and lim_{θ → −∞} π_θ(f) = π_{−∞}(f) = a,

which together with Lemma 2 (b) means that any mean in the interval (a, b) can be realized by some exponential tilt P_θ.

Corollary 2.

Let P be an irreducible stochastic matrix on the finite state space S, which combined with a real-valued function f satisfies A 1-A 4. Then the exponential family generated by P can realize any stationary mean in the interval (a, b), i.e. M = (a, b).

A critical ingredient to obtain our tail bounds is the following Proposition 1, which states that under the assumptions A 1-A 4 the ratio of the entries of the right Perron-Frobenius eigenvector v_θ stays uniformly bounded in θ.

Proposition 1.

Let P be an irreducible stochastic matrix on the finite state space S, which combined with a real-valued function f satisfies A 1-A 4. Then

c ≤ v_θ(x)/v_θ(y) ≤ C, for all θ ∈ ℝ and all x, y ∈ S,

where c > 0 and C < ∞ are constants depending on the stochastic matrix P, and the function f. In particular

  • if P induces an IID process, i.e. P has identical rows, then c = 1 and C = 1;

  • if P is a positive stochastic matrix, then

    c ≥ min_{x, y, z ∈ S} P(x, z)/P(y, z) and C ≤ max_{x, y, z ∈ S} P(x, z)/P(y, z).

Proof.

Lemma 2 yields that θ ↦ v_θ is continuous, and so in conjunction with Lemma 6 we have that the ratio of the entries of the right Perron-Frobenius eigenvector is uniformly bounded in θ, hence 0 < c ≤ C < ∞. The rest of the proof makes this argument precise.

Moreover using the chain rule we see that for x, y ∈ S

(d/dθ) (v_θ(x)/v_θ(y)) = ∑_{w, z ∈ S} [∂(v_A(x)/v_A(y))/∂A(w, z)]_{A = P̃_θ} f(z) P̃_θ(w, z).

To see why this formula holds, first observe that P̃_θ, being irreducible, satisfies the conditions of Lemma 5. Next, observe that the last coordinates of the right Perron-Frobenius eigenvector of P̃_θ, in the notation of the proof of Lemma 10, are all strictly positive. With some abuse of notation, since we are not really thinking of S as being enumerated, let us write the coordinates of the right Perron-Frobenius eigenvector of a matrix A as v_A(x), for x ∈ S. Lemma 10 then implies that, for all x, y ∈ S, the ratio v_A(x)/v_A(y) is analytic in a sufficiently small neighborhood of P̃_θ. Since a small enough variation in θ centered around the given θ results in a variation of P̃_θ centered around the given P̃_θ that lies in the set of matrices in this neighborhood that satisfy Lemma 5 (in fact all such matrices are irreducible and, even further, are of the form P̃_{θ'} for some θ'), we may use the notation v_θ(x)/v_θ(y) for the ratio v_{P̃_θ}(x)/v_{P̃_θ}(y). The point is that what is really intended in the partial derivatives on the right hand side of the preceding equation is the derivative of v_A(x)/v_A(y) with respect to the entries of A, evaluated at A = P̃_θ.

Lemma 10 in Appendix A ensures that this ratio is also continuous in the limit θ → ∞, more precisely

lim_{θ → ∞} v_θ(x)/v_θ(y) = v_∞(x)/v_∞(y).

Here, to be able to write the expression on the right hand side of the preceding equation, we first observe that P̄_b satisfies the conditions of Lemma 5, and so the last coordinates of its right Perron-Frobenius eigenvector v_∞, in the notation of the proof of Lemma 10, are all strictly positive, and so, by Lemma 10, for all x, y ∈ S the ratio v_A(x)/v_A(y) is analytic in a neighborhood of P̄_b. Further, the equality in the preceding equation is justified by the fact that, for all θ, v_θ(x)/v_θ(y) is just an alternate notation for v_{e^{−θb} P̃_θ}(x)/v_{e^{−θb} P̃_θ}(y), and, for all θ large enough, e^{−θb} P̃_θ lies in the neighborhood around P̄_b guaranteed by Lemma 10.

Therefore lim_{θ → ∞} v_θ(x)/v_θ(y) exists in (0, ∞), and similarly lim_{θ → −∞} v_θ(x)/v_θ(y) exists in (0, ∞), establishing that 0 < c ≤ C < ∞.

Furthermore, for the two cases for which we have a special handle on v_θ we argue as follows.

  • Let p be the probability distribution driving the IID process, i.e. all the rows of P are identical and equal to p. Then we can see that v_θ = 1 and ρ(θ) = ∑_x p(x) e^{θ f(x)} for all θ, since P̃_θ is the rank one matrix whose rows are all equal to (p(y) e^{θ f(y)})_{y ∈ S}. In particular c = C = 1.

  • If P is a positive stochastic matrix then, for any θ ∈ ℝ and any x, y ∈ S, we have that

    v_θ(x)/v_θ(y) = (∑_z P(x, z) e^{θ f(z)} v_θ(z)) / (∑_z P(y, z) e^{θ f(z)} v_θ(z)) ≤ max_{z ∈ S} P(x, z)/P(y, z),

    and similarly v_θ(x)/v_θ(y) ≥ min_{z ∈ S} P(x, z)/P(y, z), which yields the claimed bounds. ∎