    # A Hoeffding Inequality for Finite State Markov Chains and its Applications to Markovian Bandits

This paper develops a Hoeffding inequality for the partial sums ∑_{k=1}^n f(X_k), where {X_k}_{k∈Z_{>0}} is an irreducible Markov chain on a finite state space S, and f : S → [a, b] is a real-valued function. Our bound is simple and general, since it only assumes irreducibility and finiteness of the state space, yet powerful. To demonstrate its usefulness we provide two applications to multi-armed bandit problems: the first concerns the identification of an approximately best Markovian arm, while the second concerns regret minimization in the context of Markovian bandits.


## 1 Introduction

Let {X_k}_{k∈Z_{>0}} be a Markov chain on a finite state space S, with initial distribution q, and irreducible transition probability matrix P, governed by the probability law P_q. Let π be its stationary distribution, and let f : S → [a, b] be a real-valued function on the state space. Then the strong law of large numbers for Markov chains states that,

 (1/n) ∑_{k=1}^n f(X_k) → E_π[f(X_1)], P_q-a.s., as n → ∞.

Moreover, the central limit theorem for Markov chains provides a rate for this convergence,

 √n ( (1/n) ∑_{k=1}^n f(X_k) − E_π[f(X_1)] ) →_d N(0, σ²), as n → ∞,

where σ² is the asymptotic variance of f along the Markov chain.

These asymptotic results are insufficient in many applications which require finite-sample estimates. One of the most central such applications is the convergence of Markov chain Monte Carlo (MCMC) approximation techniques [Metropolis et al., 1953], where a finite-sample estimate is needed to bound the approximation error. Further applications include theoretical computer science and the approximation of the permanent [Jerrum et al., 2001], as well as statistical learning theory and multi-armed bandit problems [Moulos, 2019].

Motivated by this discussion we provide a finite-sample Hoeffding inequality for finite Markov chains. In the special case that the random variables X_1, X_2, … are independent and identically distributed according to π, Hoeffding's classical inequality [Hoeffding, 1963] states that,

 P_π( (1/n) ∑_{k=1}^n f(X_k) ≥ E_π[f(X_1)] + ϵ ) ≤ exp{ −2nϵ² / (b−a)² }. (1)
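Inequality (1) is easy to check numerically. The following sketch is purely illustrative and not part of the original analysis: it draws IID Uniform[0, 1] samples (so a = 0, b = 1, and the mean is 1/2) and compares the empirical frequency of the deviation event with the right-hand side of (1).

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, trials = 1000, 0.05, 5000

# IID Uniform[0, 1] samples: a = 0, b = 1, and E[X_1] = 1/2.
means = rng.uniform(0.0, 1.0, size=(trials, n)).mean(axis=1)
empirical = (means >= 0.5 + eps).mean()   # frequency of the deviation event
bound = np.exp(-2 * n * eps**2)           # right-hand side of (1), with (b - a) = 1
print(empirical, bound)                   # the empirical frequency stays below the bound
```

The bound here equals exp(−5) ≈ 0.0067, while the true deviation probability is orders of magnitude smaller, consistent with Hoeffding's inequality being conservative but valid.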

In our Theorem 1 we develop a version of Hoeffding's inequality for finite state Markov chains. Our bound is simple and easily computable, since it is based on martingale techniques and only involves hitting times of Markov chains, which are very well studied for many types of Markov chains [Aldous and Fill, 2002]. It is worth mentioning that our bound relies solely on irreducibility, and does not make any extra assumptions, such as aperiodicity or reversibility, which prior works require.

There is a rich literature on finite-sample bounds for Markov chains. One of the earliest works [Davisson et al., 1981] uses counting and a generalization of the method of types in order to derive a Chernoff bound for ergodic, i.e. irreducible and aperiodic, Markov chains. An alternative approach [Watanabe and Hayashi, 2017, Moulos and Anantharam, 2019] uses the theory of large deviations to derive sharper Chernoff bounds. When reversibility is assumed, the transition probability matrix is symmetric with respect to the space ℓ²(π), which enables the use of matrix perturbation theory. This idea leads to Hoeffding inequalities that involve the spectral gap of the Markov chain, and was initiated in [Gillman, 1993]. Refinements of this bound were given in a series of works [Dinwoodie, 1995, Kahale, 1997, Lezaud, 1998, León and Perron, 2004, Miasojedow, 2014]. In [Rao, 2019, Fan et al., 2018] a generalized spectral gap is introduced in order to obtain bounds even for a certain class of irreversible Markov chains, as long as they possess a strictly positive generalized spectral gap. Information-theoretic ideas are used in [Kontoyiannis et al., 2006] in order to derive a Hoeffding inequality for Markov chains with general state spaces that satisfy Doeblin's condition, which in the case of a finite state space is equivalent to ergodicity. Our approach uses Doob's martingale combined with Azuma's inequality, and is probably closest to the work of [Glynn and Ormoneit, 2002], which establishes a bound for Markov chains with general state spaces using martingale techniques; however, their result relies heavily on the Markov chains satisfying Doeblin's condition, and is thus not applicable to periodic Markov chains.

To illustrate the applicability of our bound we use it to study two Markovian multi-armed bandit problems. The stochastic multi-armed bandits problem is a prototypical statistical problem, where one is given multiple options, referred to as arms, and each of them is associated with a probability distribution. The emphasis is put on focusing as quickly as possible on the best available option, rather than estimating with high confidence the statistics of each option. The cornerstone of this field is the pioneering work of Lai and Robbins

[Lai and Robbins, 1985]. Here we study two variants of the multi-armed bandit problem where the probability distributions of the arms form Markov chains. First we consider the task of identifying, with some fixed confidence, an approximately best arm, and we use our bound to analyze the median elimination algorithm, originally proposed in [Even-Dar et al., 2006] for the case of IID bandits. Then we turn to the problem of regret minimization for Markovian bandits, where we analyze the UCB algorithm that was introduced in [Auer et al., 2002] for IID bandits. For a thorough introduction to multi-armed bandits we refer the interested reader to the survey [Bubeck and Cesa-Bianchi, 2012].

## 2 A Hoeffding Inequality for Finite State Markov Chains

The central quantity that shows up in our Hoeffding inequality, and makes it differ from the classical IID Hoeffding inequality, is the maximum hitting time of a Markov chain with an irreducible transition probability matrix P. This is defined as HitT(P) := max_{x,y∈S} E[T_y ∣ X_1 = x], which is ensured to be finite due to irreducibility and the finiteness of the state space, where T_y := inf{n ≥ 1 : X_n = y} is the first time to visit state y.
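In practice HitT(P) can be computed by solving one small linear system per target state. The sketch below is an illustration (not from the paper), using the convention T_y = inf{n ≥ 1 : X_n = y}, under which E[T_y ∣ X_1 = x] is 1 plus the expected number of transitions needed to reach y from x.

```python
import numpy as np

def max_hitting_time(P: np.ndarray) -> float:
    """Compute HitT(P) = max_{x,y} E[T_y | X_1 = x] for an irreducible finite
    chain by solving, for each target y, the linear system
        h(x) = 1 + sum_{z != y} P(x, z) h(z),   h(y) = 0,
    where h(x) is the expected number of transitions from x to y."""
    m = P.shape[0]
    worst = 0.0
    for y in range(m):
        idx = [x for x in range(m) if x != y]    # unknowns h(x), x != y
        A = np.eye(m - 1) - P[np.ix_(idx, idx)]  # (I - P restricted to S \ {y}) h = 1
        h = np.linalg.solve(A, np.ones(m - 1))
        worst = max(worst, 1.0 + h.max())        # E[T_y | X_1 = x] = 1 + h(x)
    return worst

# Random walk on the 5-cycle: Example 2 predicts HitT(P) = 1 + floor(25/4) = 7.
m = 5
P = np.zeros((m, m))
for x in range(m):
    P[x, (x + 1) % m] = P[x, (x - 1) % m] = 0.5
print(max_hitting_time(P))
```

The computed value matches the closed form of Example 2 for the m-cycle.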

###### Theorem 1.

Let {X_k}_{k∈Z_{>0}} be a Markov chain on a finite state space S, driven by an initial distribution q and an irreducible transition probability matrix P, and let f : S → [a, b] be a real-valued function. Then, for any ϵ > 0,

 P_q( (1/n) ∑_{k=1}^n f(X_k) − (1/n) ∑_{k=1}^n E_q[f(X_k)] ≥ ϵ ) ≤ exp{ −nϵ² / (2(b−a)² HitT(P)²) }.
###### Proof.

We define the partial sums S_{i,j} := ∑_{k=i}^{j} f(X_k), for 1 ≤ i ≤ j ≤ n, and the filtration F_k := σ(X_1, …, X_k), for k = 1, …, n, with F_0 the trivial σ-field. Then M_k := E(S_{1,n} ∣ F_k), for k = 0, 1, …, n, is a martingale with respect to {F_k}_{k=0}^n, the so-called Doob martingale. We now proceed to derive bounds on the martingale differences.

For k = 1, …, n−1, using the triangle inequality we obtain,

 |M_{k+1} − M_k| = |(f(X_{k+1}) + E(S_{k+2,n} ∣ F_{k+1})) − (E(S_{k+1,n−1} ∣ F_k) + E(f(X_n) ∣ F_k))|
                ≤ |f(X_{k+1}) − E(f(X_n) ∣ F_k)| + |E(S_{k+2,n} ∣ F_{k+1}) − E(S_{k+1,n−1} ∣ F_k)|.

The first term can be upper bounded by b − a, using the fact that f takes values in [a, b]. For the second term, using the Markov property and the time-homogeneity of the Markov chain, we have that,

 |E(S_{k+2,n} ∣ F_{k+1}) − E(S_{k+1,n−1} ∣ F_k)| = |E(S_{k+2,n} ∣ X_{k+1}) − E(S_{k+1,n−1} ∣ X_k)|
                                                ≤ max_{x,x′∈S} |E[S_{2,n−k} ∣ X_1 = x] − E[S_{2,n−k} ∣ X_1 = x′]|.

We now use a hitting time argument. Fix x, x′ ∈ S. Due to the fact that a ≤ f(z) ≤ b, for all z ∈ S, we have the following pointwise inequality,

 S_{2,n−k} ≤ (b−a)(T_x − 1) + S_{T_x+1, T_x+n−k−1}.

Taking expectations given X_1 = x′, and using the strong Markov property, we obtain,

 E[S_{2,n−k} ∣ X_1 = x′] − E[S_{2,n−k} ∣ X_1 = x] ≤ (b−a)(E[T_x ∣ X_1 = x′] − 1) ≤ (b−a)(HitT(P) − 1).

Consequently, for k = 1, …, n−1,

 |M_{k+1} − M_k| ≤ (b−a) HitT(P).

For k = 0, by repeating the same steps we have that,

 |M_1 − M_0| ≤ |f(X_1) − E_q[f(X_1)]| + |E(S_{2,n} ∣ X_1) − E_q[S_{2,n}]|
            ≤ (b−a) + ∑_{x′∈S} |E(S_{2,n} ∣ X_1) − E[S_{2,n} ∣ X_1 = x′]| q(x′)
            ≤ (b−a) + max_{x,x′∈S} |E[S_{2,n} ∣ X_1 = x] − E[S_{2,n} ∣ X_1 = x′]|
            ≤ (b−a) HitT(P).

The conclusion now follows by observing that M_n − M_0 = ∑_{k=1}^n f(X_k) − ∑_{k=1}^n E_q[f(X_k)], and applying Azuma's inequality [Azuma, 1967]. ∎

###### Example 1.

Consider a two-state Markov chain on S = {1, 2}, with transition probabilities P(1, 2) = p and P(2, 1) = r, where p, r ∈ (0, 1]. Then,

 HitT(P) = max{ E[Geometric(p)], E[Geometric(r)] } = 1 / min{p, r},

and Theorem 1 takes the form,

 P_q( (1/n) ∑_{k=1}^n f(X_k) − (1/n) ∑_{k=1}^n E_q[f(X_k)] ≥ ϵ ) ≤ exp{ −n min{p², r²} ϵ² / (2(b−a)²) }.
###### Example 2.

Consider the random walk on the m-cycle, with state space S = {0, 1, …, m−1}, and transition probability matrix given by P(x, y) = 1/2 if y ≡ x ± 1 (mod m), and P(x, y) = 0 otherwise. If m is odd, then the Markov chain is aperiodic, while if m is even, then the Markov chain has period 2. In either case,

 HitT(P) = max_{y∈S} E[T_y ∣ X_1 = 0] = 1 + max_{y∈S} y(m−y) = 1 + ⌊m²/4⌋,

and Theorem 1 takes the form,

 P_q( (1/n) ∑_{k=1}^n f(X_k) − (1/n) ∑_{k=1}^n E_q[f(X_k)] ≥ ϵ ) ≤ exp{ −nϵ² / (2(b−a)² (1 + ⌊m²/4⌋)²) }.
###### Remark 1.

By substituting f with −f in Theorem 1 we obtain the following bound for the lower tail,

 P_q( (1/n) ∑_{k=1}^n f(X_k) − (1/n) ∑_{k=1}^n E_q[f(X_k)] ≤ −ϵ ) ≤ exp{ −nϵ² / (2(b−a)² HitT(P)²) },

and combining the upper and lower tail bounds we obtain the following two-sided bound,

 P_q( |(1/n) ∑_{k=1}^n f(X_k) − (1/n) ∑_{k=1}^n E_q[f(X_k)]| ≥ ϵ ) ≤ 2 exp{ −nϵ² / (2(b−a)² HitT(P)²) }.

Note that when the Markov chain is initialized with its stationary distribution π, this takes the form,

 P_π( |(1/n) ∑_{k=1}^n f(X_k) − E_π[f(X_1)]| ≥ ϵ ) ≤ 2 exp{ −nϵ² / (2(b−a)² HitT(P)²) }.
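As an illustration of this two-sided stationary bound, the following sketch simulates the two-state chain of Example 1 (relabeled to states {0, 1}, with f the identity, so a = 0, b = 1, and HitT(P) = 1/min{p, r} as computed there). All numerical choices here are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
p = r = 0.3                        # two-state chain of Example 1
n, eps, trials = 2000, 0.15, 2000
hit = 1.0 / min(p, r)              # HitT(P) from Example 1

# Stationary start: with p = r the stationary distribution is uniform on {0, 1},
# so the centering term E_pi[f(X_1)] equals 1/2 for f the identity.
x = rng.integers(0, 2, size=trials)
sums = np.zeros(trials)
for _ in range(n):                 # advance all `trials` chains one step at a time
    sums += x
    flip = np.where(x == 0, p, r)
    x = np.where(rng.random(trials) < flip, 1 - x, x)

empirical = (np.abs(sums / n - 0.5) >= eps).mean()
bound = 2 * np.exp(-n * eps**2 / (2 * hit**2))   # two-sided bound, (b - a) = 1
print(empirical, bound)
```

The empirical deviation frequency is far below the bound, as expected: the bound holds for every irreducible chain, so for a fast-mixing chain it is loose.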
###### Remark 2.

Observe that the technique used to establish Theorem 1 is limited to Markov chains with a finite state space S. Indeed, if {X_k}_{k∈Z_{>0}} is a Markov chain on a countably infinite state space S with an irreducible and positive recurrent transition probability matrix P and a stationary distribution π, then we claim that,

 1/π(y) ≤ sup_{x∈S} E[T_y ∣ X_1 = x], for all y ∈ S,

from which it follows that sup_{y∈S} sup_{x∈S} E[T_y ∣ X_1 = x] = ∞, due to the fact that ∑_{y∈S} π(y) = 1 and S is countably infinite. The aforementioned inequality can be established as follows.

 1/π(y) = E[inf{n ≥ 2 : X_n = y} ∣ X_1 = y] − 1
        = ∑_{x∈S} E[inf{n ≥ 2 : X_n = y} ∣ X_2 = x] P(y, x) − 1
        ≤ sup_{x∈S} E[inf{n ≥ 2 : X_n = y} ∣ X_2 = x] − 1
        = sup_{x∈S} E[T_y ∣ X_1 = x].

## 3 Markovian Multi-Armed Bandits

### 3.1 Setup

There are K arms, and each arm a ∈ {1, …, K} is associated with a parameter θ_a which uniquely encodes¹ an irreducible transition probability matrix P_{θ_a} on a finite state space S. We will denote the overall parameter configuration of all arms by θ = (θ_1, …, θ_K). Arm a evolves according to the stationary Markov chain {X_{a,k}}_{k∈Z_{>0}}, driven by the irreducible transition probability matrix P_{θ_a}, which has a unique stationary distribution π_{θ_a}, so that X_{a,1} ∼ π_{θ_a}. There is a common reward function f : S → [a, b] which generates the reward process {Y_{a,k}}_{k∈Z_{>0}}, with Y_{a,k} = f(X_{a,k}). The reward process, in general, is not going to be a Markov chain, unless f is injective, and it will have more complicated dependencies than the underlying Markov chain. Each time that we select arm a, this arm evolves by one transition and we observe the corresponding sample from the reward process {Y_{a,k}}_{k∈Z_{>0}}, while all the other arms stay rested.

¹ The parameter space and the set of irreducible transition probability matrices have the same cardinality, and hence there is a bijection between them.

The stationary reward of arm a is μ(θ_a) := ∑_{x∈S} f(x) π_{θ_a}(x). Let μ*(θ) := max_{1≤a≤K} μ(θ_a) be the maximum stationary mean, and for simplicity assume that there exists a unique arm, a*(θ), attaining this maximum stationary mean, i.e. μ*(θ) = μ(θ_{a*(θ)}). In the following sections we will consider two objectives: identifying an ϵ-best arm with some fixed confidence level using as few samples as possible, and minimizing the expected regret given some fixed time horizon T.

### 3.2 Approximate Best Arm Identification

In the approximate best arm identification problem, we are given an approximation accuracy ϵ > 0, and a confidence level δ ∈ (0, 1). Our goal is to come up with an adaptive algorithm A which collects a total of N samples, and returns an arm â that is within ϵ of the best arm, a*(θ), with probability at least 1 − δ, i.e.

 P^A_θ( μ*(θ) ≥ μ(θ_â) + ϵ ) ≤ δ.

Such an algorithm is called (ϵ, δ)-PAC (probably approximately correct).

In [Mannor and Tsitsiklis, 2004] a lower bound for the sample complexity of any (ϵ, δ)-PAC algorithm is derived. The lower bound states that no matter the (ϵ, δ)-PAC algorithm A, there exists a parameter configuration θ such that the expected number of samples is at least,

 E^A_θ[N] = Ω( (K/ϵ²) log(1/δ) ).

A matching upper bound is provided for IID bandits in [Even-Dar et al., 2006] in the form of the median elimination algorithm. We demonstrate the usefulness of our Hoeffding inequality, by providing an analysis of the median elimination algorithm in the more general setting of Markovian bandits.

###### Theorem 2.

If β ≥ 2(b−a)² max_{1≤a≤K} HitT(P_{θ_a})², then the β-Median-Elimination algorithm is (ϵ, δ)-PAC, and its sample complexity is upper bounded by O( (K/ϵ²) log(1/δ) ).

###### Proof.

The total number of sampling rounds is at most ⌈log₂ K⌉, and we can make the round accuracies sum to at most ϵ and the round confidences sum to at most δ by setting ϵ_r := (3/4)^{r−1} ϵ/4 and δ_r := δ/2^r, for r = 1, …, ⌈log₂ K⌉, where A_r denotes the set of arms surviving at the start of round r, with A_1 = {1, …, K}. Fix r. We claim that,

 P^{β-ME}_θ( max_{a∈A_r} μ(θ_a) ≥ max_{a∈A_{r+1}} μ(θ_a) + ϵ_r ) ≤ δ_r. (2)

We condition on the value of A_r. Let a*_r ∈ argmax_{a∈A_r} μ(θ_a), and μ*_r := μ(θ_{a*_r}). If a*_r ∈ A_{r+1}, then the claim is trivially true, so we only consider the case a*_r ∉ A_{r+1}. Writing Ȳ_a[r] for the empirical mean of the samples drawn from arm a during round r, we consider the following set of bad arms,

 B_r := { b ∈ A_r : Ȳ_b[r] ≥ Ȳ_{a*_r}[r], μ*_r ≥ μ(θ_b) + ϵ_r },

and observe that,

 P^{β-ME}_θ( μ*_r ≥ μ*_{r+1} + ϵ_r ) ≤ P^{β-ME}_θ( |B_r| ≥ |A_r|/2 ). (3)

In order to upper bound the latter, fix b ∈ A_r and write,

 P^{β-ME}_θ( Ȳ_b[r] ≥ Ȳ_{a*_r}[r], μ*_r ≥ μ(θ_b) + ϵ_r ∣ Ȳ_{a*_r}[r] > μ*_r − ϵ_r/2 )
   ≤ P_{θ_b}( Ȳ_b[r] ≥ μ(θ_b) + ϵ_r/2 ) ≤ δ_r/3,

where in the last inequality we used Theorem 1. Now via Markov's inequality this yields,

 P^{β-ME}_θ( |B_r| ≥ |A_r|/2 ∣ Ȳ_{a*_r}[r] > μ*_r − ϵ_r/2 ) ≤ 2δ_r/3. (4)

Furthermore, Remark 1 gives that for any a ∈ A_r,

 P_{θ_a}( Ȳ_a[r] ≤ μ(θ_a) − ϵ_r/2 ) ≤ δ_r/3. (5)

We obtain (2) by using (4) and (5) in (3).

With (2) in our possession, the fact that median elimination is (ϵ, δ)-PAC follows through a union bound,

 P^{β-ME}_θ( μ*(θ) ≥ μ(θ_â) + ϵ ) ≤ P^{β-ME}_θ( ⋃_{r=1}^{⌈log₂ K⌉} { μ*_r ≥ μ*_{r+1} + ϵ_r } ) ≤ ∑_{r=1}^∞ δ_r ≤ δ.

Regarding the sample complexity, we have that the total number of samples is at most,

 K ∑_{r=1}^{⌈log₂ K⌉} N_r / 2^{r−1} ≤ 2K + (64βK/ϵ²) ∑_{r=1}^∞ (8/9)^{r−1} log(2^r · 3/δ) = O( (K/ϵ²) log(1/δ) ). ∎
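To make the algorithmic side concrete, here is a minimal Python sketch in the spirit of β-Median-Elimination. The sampling interface `sample(a, n)`, the round schedule ϵ_r = (3/4)^{r−1} ϵ/4, δ_r = δ/2^r, and the sample sizes N_r are assumptions modeled on the proof above; this is an illustrative sketch, not the paper's exact pseudocode.

```python
import math
import numpy as np

def median_elimination(sample, K, eps, delta, beta):
    """Illustrative sketch of beta-Median-Elimination.

    `sample(a, n)` is assumed to return n fresh rewards from arm a (advancing
    that arm's Markov chain); `beta` plays the role of
    2 (b - a)^2 max_a HitT(P_{theta_a})^2.  Each round, every surviving arm is
    sampled N_r times and the empirically worse half is eliminated."""
    arms = list(range(K))
    eps_r, delta_r = eps / 4.0, delta / 2.0
    while len(arms) > 1:
        n_r = math.ceil(4.0 * beta / eps_r**2 * math.log(3.0 / delta_r))
        emp = {a: float(np.mean(sample(a, n_r))) for a in arms}
        arms.sort(key=lambda a: emp[a], reverse=True)
        arms = arms[: (len(arms) + 1) // 2]   # keep the empirically better half
        eps_r *= 3.0 / 4.0
        delta_r /= 2.0
    return arms[0]

# Toy usage with IID Gaussian arms (a degenerate Markovian case):
rng = np.random.default_rng(2)
mus = [0.1, 0.2, 0.3, 0.9]
best = median_elimination(lambda a, n: rng.normal(mus[a], 0.1, n),
                          K=4, eps=0.5, delta=0.1, beta=1.0)
print(best)
```

With these well-separated toy means, the procedure returns the best arm (index 3) with overwhelming probability.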

### 3.3 Regret Minimization

Our device to solve the regret minimization problem is an adaptive allocation rule ϕ = {ϕ_t}_{t∈Z_{>0}}, which is a sequence of random variables where ϕ_t ∈ {1, …, K} is the arm that we select at time t. Let N_a(t) := ∑_{s=1}^{t} I{ϕ_s = a} be the number of times we selected arm a up to time t. Our decision ϕ_{t+1} at time t+1 is based on the information that we have accumulated so far. More precisely, the event {ϕ_{t+1} = a} is measurable with respect to the σ-field generated by the past decisions ϕ_1, …, ϕ_t, and the past observations Y_{ϕ_1, N_{ϕ_1}(1)}, …, Y_{ϕ_t, N_{ϕ_t}(t)}.

Given a time horizon T, and a parameter configuration θ, the expected regret incurred when the adaptive allocation rule ϕ is used is defined as,

 R^ϕ_θ(T) = ∑_{b≠a*(θ)} E^ϕ_θ[N_b(T)] Δ_b(θ),

where Δ_b(θ) := μ*(θ) − μ(θ_b). Our goal is to come up with an adaptive allocation rule that makes the expected regret as small as possible.

There is a known asymptotic lower bound on how much we can minimize the expected regret. Any adaptive allocation rule that is uniformly good across all parameter configurations should satisfy the following instance-specific, asymptotic regret lower bound (see [Anantharam et al., 1987] for details),

 ∑_{b≠a*(θ)} Δ_b(θ) / D̄(θ_b ∥ θ_{a*(θ)}) ≤ liminf_{T→∞} R^ϕ_θ(T) / log T,

where D̄(θ_b ∥ θ_{a*(θ)}) is the Kullback–Leibler divergence rate between the Markov chains with transition probability matrices P_{θ_b} and P_{θ_{a*(θ)}}, given by,

 D̄(θ ∥ λ) = ∑_{x,y∈S} log( P_θ(x, y) / P_λ(x, y) ) π_θ(x) P_θ(x, y).
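The divergence rate above is straightforward to evaluate numerically. The helper names in the following sketch are my own, and it assumes P_θ(x, y) > 0 implies P_λ(x, y) > 0 (otherwise the rate is infinite):

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible P, via the left eigenvector
    of eigenvalue 1 (Perron-Frobenius)."""
    w, v = np.linalg.eig(P.T)
    pi = np.real(v[:, np.argmax(np.real(w))])
    return pi / pi.sum()

def kl_rate(P_theta, P_lam):
    """KL divergence rate: sum_{x,y} pi_theta(x) P_theta(x,y)
    log(P_theta(x,y) / P_lam(x,y)), with the convention 0 log 0 = 0."""
    pi = stationary(P_theta)
    total = 0.0
    for x in range(P_theta.shape[0]):
        for y in range(P_theta.shape[1]):
            if P_theta[x, y] > 0:
                total += pi[x] * P_theta[x, y] * np.log(P_theta[x, y] / P_lam[x, y])
    return total

# Two-state example: the rate reduces to a Bernoulli KL divergence here.
P1 = np.array([[0.7, 0.3], [0.3, 0.7]])
P2 = np.array([[0.5, 0.5], [0.5, 0.5]])
print(kl_rate(P1, P2))  # = 0.7 log 1.4 + 0.3 log 0.6, about 0.0823
```

Note that the rate is zero when the two chains coincide and strictly positive otherwise, as a divergence should be.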

Here we utilize our Theorem 1 to provide a finite-time analysis of the β-UCB adaptive allocation rule for Markovian bandits, which is order optimal. The β-UCB adaptive allocation rule is a simple and computationally efficient index policy based on upper confidence bounds, which was initially proposed in [Auer et al., 2002] for IID bandits. It has already been studied in the context of Markovian bandits in [Tekin and Liu, 2010], but in a more restrictive setting, under the further assumptions of aperiodicity and reversibility, due to the use of the bounds from [Gillman, 1993, Lezaud, 1998].

###### Theorem 3.

If β > 2(b−a)² max_{1≤a≤K} HitT(P_{θ_a})², then,

 R^{ϕ_{β-UCB}}_θ(T) ≤ 8β ( ∑_{b≠a*(θ)} 1/Δ_b(θ) ) log T + (γ/(γ−2)) ∑_{b≠a*(θ)} Δ_b(θ),

where γ := β / ( (b−a)² max_{1≤a≤K} HitT(P_{θ_a})² ) > 2.

###### Proof.

Fix b ≠ a*(θ), and observe that,

 N_b(T) ≤ 1 + (8β/Δ_b(θ)²) log T + ∑_{t=2}^{T−1} I{ ϕ_{t+1} = b, N_b(t) ≥ (8β/Δ_b(θ)²) log T }.

On the event { ϕ_{t+1} = b, N_b(t) ≥ (8β/Δ_b(θ)²) log T }, we have that either Ȳ_b(t) ≥ μ(θ_b) + √(2β log t / N_b(t)), or Ȳ_{a*(θ)}(t) ≤ μ*(θ) − √(2β log t / N_{a*(θ)}(t)), since otherwise the β-UCB index of a*(θ) is larger than the β-UCB index of b, which contradicts the assumption that ϕ_{t+1} = b.

In addition, using Theorem 1, we obtain,

 P^{ϕ_{β-UCB}}_θ( Ȳ_b(t) ≥ μ(θ_b) + √(2β log t / N_b(t)) )
   = ∑_{n=1}^{t} P^{ϕ_{β-UCB}}_θ( Ȳ_b(t) ≥ μ(θ_b) + √(2β log t / N_b(t)), N_b(t) = n )
   ≤ ∑_{n=1}^{t} P_{θ_b}( (1/n) ∑_{k=1}^n Y_{b,k} ≥ μ(θ_b) + √(2β log t / n) )
   ≤ ∑_{n=1}^{t} 1/t^γ = 1/t^{γ−1}.

Similarly we can see that,

 P^{ϕ_{β-UCB}}_θ( Ȳ_{a*(θ)}(t) ≤ μ*(θ) − √(2β log t / N_{a*(θ)}(t)) ) ≤ 1/t^{γ−1}.

The conclusion now follows by putting everything together and using the integral estimate,

 ∑_{t=2}^{T−1} 1/t^{γ−1} ≤ ∫_1^∞ t^{−(γ−1)} dt = 1/(γ−2). ∎
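For completeness, here is a minimal Python sketch of the β-UCB index rule analyzed above. The sampling interface and the toy Bernoulli arms (an IID special case of Markovian rewards) are illustrative assumptions, not the paper's pseudocode.

```python
import math
import numpy as np

def beta_ucb(sample, K, T, beta):
    """Illustrative sketch of beta-UCB: after playing each arm once, at time t
    play the arm maximizing  Ybar_a(t) + sqrt(2 beta log t / N_a(t)).
    `sample(a)` is assumed to advance arm a's chain by one transition and
    return the observed reward."""
    sums = np.zeros(K)
    counts = np.zeros(K)
    picks = []
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1   # initialization: play every arm once
        else:
            a = int(np.argmax(sums / counts + np.sqrt(2 * beta * math.log(t) / counts)))
        sums[a] += sample(a)
        counts[a] += 1
        picks.append(a)
    return picks

# Toy usage: two Bernoulli arms with stationary means 0.9 and 0.1.
rng = np.random.default_rng(3)
picks = beta_ucb(lambda a: float(rng.random() < (0.9 if a == 0 else 0.1)),
                 K=2, T=2000, beta=0.5)
print(picks.count(0) / len(picks))  # the better arm dominates the plays
```

Consistent with Theorem 3, the suboptimal arm is played only O(log T) times, so the fraction of plays of the best arm approaches one as T grows.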

## Acknowledgements

We would like to thank Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.

## References

• [Aldous and Fill, 2002] Aldous, D. and Fill, J. (2002). Reversible Markov chains and random walks on graphs.
• [Anantharam et al., 1987] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.
• [Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3):235–256.
• [Azuma, 1967] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Math. J. (2), 19:357–367.
• [Bubeck and Cesa-Bianchi, 2012] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.
• [Davisson et al., 1981] Davisson, L. D., Longo, G., and Sgarro, A. (1981). The error exponent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans. Inform. Theory, 27(4):431–438.
• [Dinwoodie, 1995] Dinwoodie, I. H. (1995). A probability inequality for the occupation measure of a reversible Markov chain. Ann. Appl. Probab., 5(1):37–43.
• [Even-Dar et al., 2006] Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105.
• [Fan et al., 2018] Fan, J., Jiang, B., and Sun, Q. (2018). Hoeffding's lemma for Markov chains and its applications to statistical learning.
• [Gillman, 1993] Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.
• [Glynn and Ormoneit, 2002] Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic Markov chains. Statist. Probab. Lett., 56(2):143–146.
• [Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30.
• [Jerrum et al., 2001] Jerrum, M., Sinclair, A., and Vigoda, E. (2001). A polynomial-time approximation algorithm for the permanent of a matrix with non-negative entries. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 712–721. ACM, New York.
• [Kahale, 1997] Kahale, N. (1997). Large deviation bounds for Markov chains. Combin. Probab. Comput., 6(4):465–474.
• [Kontoyiannis et al., 2006] Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. (2006). Exponential Bounds and Stopping Rules for MCMC and General Markov Chains. In Proceedings of the 1st International Conference on Performance Evaluation Methodologies and Tools, valuetools '06, New York, NY, USA. ACM.
• [Lai and Robbins, 1985] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.
• [León and Perron, 2004] León, C. A. and Perron, F. (2004). Optimal Hoeffding bounds for discrete reversible Markov chains. Ann. Appl. Probab., 14(2):958–970.
• [Lezaud, 1998] Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867.
• [Mannor and Tsitsiklis, 2004] Mannor, S. and Tsitsiklis, J. N. (2004). The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648.
• [Metropolis et al., 1953] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
• [Miasojedow, 2014] Miasojedow, B. (2014). Hoeffding's inequalities for geometrically ergodic Markov chains on general state space. Statist. Probab. Lett., 87:115–120.
• [Moulos, 2019] Moulos, V. (2019). Optimal Best Markovian Arm Identification with Fixed Confidence. In 33rd Annual Conference on Neural Information Processing Systems.
• [Moulos and Anantharam, 2019] Moulos, V. and Anantharam, V. (2019). Optimal Chernoff and Hoeffding bounds for finite state Markov chains.
• [Rao, 2019] Rao, S. (2019). A Hoeffding inequality for Markov chains. Electron. Commun. Probab., 24: Paper No. 14, 11 pp.
• [Tekin and Liu, 2010] Tekin, C. and Liu, M. (2010). Online algorithms for the multi-armed bandit problem with Markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682.
• [Watanabe and Hayashi, 2017] Watanabe, S. and Hayashi, M. (2017). Finite-length analysis on tail probability for Markov chain and application to simple hypothesis testing. Ann. Appl. Probab., 27(2):811–845.