1 Introduction
Let $(X_k)_{k \ge 1}$ be a Markov chain on a finite state space $S$, with initial distribution $q$, and irreducible transition probability matrix $P$, governed by the probability law $\Pr_q$. Let $\pi$ be its stationary distribution, and let $f : S \to \mathbb{R}$ be a real-valued function on the state space. Then the strong law of large numbers for Markov chains states that,
$$\frac{1}{n} \sum_{k=1}^{n} f(X_k) \;\xrightarrow[n \to \infty]{\Pr_q\text{-a.s.}}\; \bar{f} := \sum_{x \in S} f(x)\, \pi(x).$$
Moreover, the central limit theorem for Markov chains provides a rate for this convergence,
$$\sqrt{n} \left( \frac{1}{n} \sum_{k=1}^{n} f(X_k) - \bar{f} \right) \;\Rightarrow\; N(0, \sigma^2),$$
where
$$\sigma^2 := \lim_{n \to \infty} \frac{1}{n} \operatorname{Var}_\pi \left( \sum_{k=1}^{n} f(X_k) \right)$$
is the asymptotic variance.
Those asymptotic results are insufficient in many applications which require finite-sample estimates. One of the most central such applications is the convergence of Markov chain Monte Carlo (MCMC) approximation techniques [Metropolis et al., 1953], where a finite-sample estimate is needed to bound the approximation error. Further applications include theoretical computer science and the approximation of the permanent [Jerrum et al., 2001], as well as statistical learning theory and multi-armed bandit problems [Moulos, 2019].

Motivated by this discussion, we provide a finite-sample Hoeffding inequality for finite Markov chains. In the special case that the random variables $X_1, \ldots, X_n$ are independent and identically distributed according to $\pi$, and $f$ takes values in $[a, b]$, Hoeffding’s classical inequality [Hoeffding, 1963] states that, for any $\epsilon > 0$,
$$\Pr\left( \frac{1}{n} \sum_{k=1}^{n} f(X_k) - \bar{f} \ge \epsilon \right) \le \exp\left( - \frac{2 n \epsilon^2}{(b-a)^2} \right). \qquad (1)$$
In our Theorem 1 we develop a version of Hoeffding’s inequality for finite state Markov chains. Our bound is simple and easily computable, since it is based on martingale techniques and involves only hitting times of Markov chains, which are very well studied for many types of Markov chains [Aldous and Fill, 2002]. It is worth mentioning that our bound relies solely on irreducibility, and it does not make any extra assumptions, such as aperiodicity or reversibility, which prior works require.
There is a rich literature on finite-sample bounds for Markov chains. One of the earliest works [Davisson et al., 1981] uses counting and a generalization of the method of types, in order to derive a Chernoff bound for ergodic, i.e. irreducible and aperiodic, Markov chains. An alternative approach [Watanabe and Hayashi, 2017, Moulos and Anantharam, 2019] uses the theory of large deviations to derive sharper Chernoff bounds. When reversibility is assumed, the transition probability matrix is self-adjoint as an operator on the space $\ell^2(\pi)$, which enables the use of matrix perturbation theory. This idea leads to Hoeffding inequalities that involve the spectral gap of the Markov chain, and was initiated in [Gillman, 1993]. Refinements of this bound were given in a series of works [Dinwoodie, 1995, Kahale, 1997, Lezaud, 1998, León and Perron, 2004, Miasojedow, 2014]. In [Rao, 2019, Fan et al., 2018] a generalized spectral gap is introduced in order to obtain bounds even for a certain class of irreversible Markov chains, as long as they possess a strictly positive generalized spectral gap. Information-theoretic ideas are used in [Kontoyiannis et al., 2006] in order to derive a Hoeffding inequality for Markov chains with general state spaces that satisfy Doeblin’s condition, which in the case of a finite state space is equivalent to ergodicity. Our approach uses Doob’s martingale combined with Azuma’s inequality, and is probably closest to the work of [Glynn and Ormoneit, 2002], where a bound for Markov chains with general state spaces is established using martingale techniques; their result, however, relies heavily on the Markov chain satisfying Doeblin’s condition, and is thus not applicable to periodic Markov chains.
To illustrate the applicability of our bound we use it to study two Markovian multi-armed bandit problems. The stochastic multi-armed bandit problem is a prototypical statistical problem, where one is given multiple options, referred to as arms, each of which is associated with a probability distribution. The emphasis is put on focusing as quickly as possible on the best available option, rather than estimating with high confidence the statistics of each option. The cornerstone of this field is the pioneering work of Lai and Robbins [Lai and Robbins, 1985]. Here we study two variants of the multi-armed bandit problem where the probability distributions of the arms form Markov chains. First we consider the task of identifying with some fixed confidence an approximately best arm, and we use our bound to analyze the median elimination algorithm, originally proposed in [Even-Dar et al., 2006] for the case of IID bandits. Then we turn to the problem of regret minimization for Markovian bandits, where we analyze the UCB algorithm that was introduced in [Auer et al., 2002] for IID bandits. For a thorough introduction to multi-armed bandits we refer the interested reader to the survey [Bubeck and Cesa-Bianchi, 2012].

2 A Hoeffding Inequality for Finite State Markov Chains
The central quantity that shows up in our Hoeffding inequality, and makes it differ from the classical IID Hoeffding inequality, is the maximum hitting time of a Markov chain with an irreducible transition probability matrix $P$. This is defined as $H_{\max} := \max_{x, y \in S} E_x[T_y]$, which is ensured to be finite due to irreducibility and the finiteness of the state space, where $T_y := \inf\{ n \ge 1 : X_n = y \}$ is the first time to visit state $y$.
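Since the maximum hitting time is the only chain-dependent quantity entering the bound, it is worth noting that it can be computed exactly by solving one small linear system per target state. The following is a minimal sketch (the function name `max_hitting_time` and the two-state example are ours, not from the paper):

```python
import numpy as np

def max_hitting_time(P):
    """Maximum expected hitting time max_{x,y} E_x[T_y] of an irreducible
    transition matrix P.  For each target y, the vector h(x) = E_x[T_y]
    solves h(x) = 1 + sum_z P(x, z) h(z) for x != y, with h(y) = 0."""
    n = P.shape[0]
    H = 0.0
    for y in range(n):
        idx = [x for x in range(n) if x != y]
        Q = P[np.ix_(idx, idx)]
        h = np.linalg.solve(np.eye(n - 1) - Q, np.ones(n - 1))
        ret = 1.0 + P[y, idx] @ h  # mean return time E_y[T_y^+] = 1 / pi(y)
        H = max(H, h.max(), ret)
    return H

# Two-state chain that flips with probabilities p and q:
p, q = 0.3, 0.5
P = np.array([[1 - p, p], [q, 1 - q]])
print(max_hitting_time(P))  # max(1/p, 1/q) = 1/0.3 ≈ 3.33
```

The linear-system route requires nothing beyond irreducibility, so it applies to periodic chains as well.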
Theorem 1.
Let $(X_k)_{k \ge 1}$ be a Markov chain on a finite state space $S$, driven by an initial distribution $q$, and an irreducible transition probability matrix $P$. Let $f : S \to [a, b]$ be a real-valued function. Then, for any $\epsilon > 0$,
$$\Pr_q\left( \sum_{k=1}^{n} f(X_k) - E_q\left[ \sum_{k=1}^{n} f(X_k) \right] \ge n \epsilon \right) \le \exp\left( - \frac{n \epsilon^2}{2\, (b-a)^2\, (1 + H_{\max})^2} \right).$$
Proof.
We define the sums $S_k := \sum_{i=1}^{k} f(X_i)$, for $k = 1, \ldots, n$, and the filtration $\mathcal{F}_k := \sigma(X_1, \ldots, X_k)$, for $k = 1, \ldots, n$, with $\mathcal{F}_0$ the trivial $\sigma$-field. Then $M_k := E_q[S_n \mid \mathcal{F}_k]$, for $k = 0, 1, \ldots, n$, is a martingale with respect to $(\mathcal{F}_k)_{k=0}^{n}$, the so-called Doob martingale. We now proceed to derive bounds on the martingale differences.
For $k = 1, \ldots, n-1$, using the triangle inequality we obtain,
$$|M_k - M_{k-1}| \le \left| f(X_k) - E_q[f(X_k) \mid \mathcal{F}_{k-1}] \right| + \left| E_q\left[ \textstyle\sum_{i=k+1}^{n} f(X_i) \mid \mathcal{F}_k \right] - E_q\left[ \textstyle\sum_{i=k+1}^{n} f(X_i) \mid \mathcal{F}_{k-1} \right] \right|.$$
The first term can be upper bounded by $b - a$, using the fact that $f$ takes values in $[a, b]$. For the second term, using the Markov property and the time-homogeneity of the Markov chain we have that,
$$E_q\left[ \textstyle\sum_{i=k+1}^{n} f(X_i) \mid \mathcal{F}_k \right] = E_{X_k}\left[ \textstyle\sum_{i=1}^{n-k} f(X_i) \right], \qquad E_q\left[ \textstyle\sum_{i=k+1}^{n} f(X_i) \mid \mathcal{F}_{k-1} \right] = E_{X_{k-1}}\left[ \textstyle\sum_{i=2}^{n-k+1} f(X_i) \right].$$
We now use a hitting time argument. Due to the fact that $a \le f(x) \le b$, for all $x \in S$, we have the following pointwise inequality: for any states $x, y \in S$ and any horizon $m$, under $\Pr_y$,
$$\sum_{i=1}^{m} f(X_i) \le \sum_{i=T_x+1}^{T_x+m} f(X_i) + (b-a)\, T_x.$$
Taking expectations, and using the strong Markov property we obtain,
$$E_y\left[ \sum_{i=1}^{m} f(X_i) \right] \le E_x\left[ \sum_{i=1}^{m} f(X_i) \right] + (b-a)\, E_y[T_x],$$
and, exchanging the roles of $x$ and $y$, the two expected sums differ by at most $(b-a)\, H_{\max}$ in absolute value; the same bound persists when one of the chains is started from the distribution $P(X_{k-1}, \cdot)$, by averaging over the initial state. Consequently, for $k = 1, \ldots, n-1$,
$$|M_k - M_{k-1}| \le (b-a)\,(1 + H_{\max}).$$
For $k = n$, by repeating the same steps we have that,
$$|M_n - M_{n-1}| = \left| f(X_n) - E_q[f(X_n) \mid \mathcal{F}_{n-1}] \right| \le b - a \le (b-a)\,(1 + H_{\max}).$$
The conclusion now follows by observing that $M_0 = E_q[S_n]$ and $M_n = S_n$, and applying Azuma’s inequality [Azuma, 1967]. ∎
Example 1.
Example 2.
Remark 1.
By substituting $f$ with $-f$ in Theorem 1 we obtain the following bound for the lower tail,
$$\Pr_q\left( \sum_{k=1}^{n} f(X_k) - E_q\left[ \sum_{k=1}^{n} f(X_k) \right] \le -n \epsilon \right) \le \exp\left( - \frac{n \epsilon^2}{2\, (b-a)^2\, (1 + H_{\max})^2} \right),$$
and combining the upper and lower tail bounds we obtain the following two-sided bound,
$$\Pr_q\left( \left| \sum_{k=1}^{n} f(X_k) - E_q\left[ \sum_{k=1}^{n} f(X_k) \right] \right| \ge n \epsilon \right) \le 2 \exp\left( - \frac{n \epsilon^2}{2\, (b-a)^2\, (1 + H_{\max})^2} \right).$$
Note that when the Markov chain is initialized with its stationary distribution, so that $E_\pi[f(X_k)] = \bar{f}$ for every $k$, this takes the form,
$$\Pr_\pi\left( \left| \frac{1}{n} \sum_{k=1}^{n} f(X_k) - \bar{f} \right| \ge \epsilon \right) \le 2 \exp\left( - \frac{n \epsilon^2}{2\, (b-a)^2\, (1 + H_{\max})^2} \right).$$
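As a quick Monte Carlo sanity check, consider a hypothetical symmetric two-state chain that flips state with probability $p$ at each step, started from its (uniform) stationary distribution, with $f$ the identity on $\{0, 1\}$. The exponent below uses the $(b-a)^2 (1 + H_{\max})^2$ constant from our statement of the bound, so it should be treated as illustrative:

```python
import numpy as np

# Symmetric two-state chain on {0, 1}: flip with probability p each step.
# Stationary distribution is uniform, f(x) = x, stationary mean 1/2,
# maximal hitting time H = 1/p, and (b - a) = 1.
p, n, runs, eps = 0.2, 20_000, 200, 0.1
rng = np.random.default_rng(0)

x0 = rng.integers(0, 2, size=runs)             # stationary start
flips = rng.random((runs, n - 1)) < p          # chain = cumulative XOR of flips
steps = np.concatenate([np.zeros((runs, 1), dtype=int),
                        np.cumsum(flips, axis=1)], axis=1)
X = (x0[:, None] + steps) % 2
tail = (np.abs(X.mean(axis=1) - 0.5) >= eps).mean()

H = 1 / p
bound = 2 * np.exp(-n * eps**2 / (2 * (1 + H)**2))
print(tail, bound)  # the empirical tail sits far below the (loose) bound
```

As expected for a concentration inequality of this type, the bound holds with room to spare; its value lies in being fully explicit and finite-sample, not in being tight.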
Remark 2.
Observe that the technique used to establish Theorem 1 is limited to Markov chains with a finite state space $S$. Indeed, if $(X_k)_{k \ge 1}$ is a Markov chain on a countably infinite state space $S$ with an irreducible and positive recurrent transition probability matrix $P$ and a stationary distribution $\pi$, then we claim that,
$$\max_{x \in S} E_x[T_y] \ge \frac{1}{\pi(y)} - 1, \quad \text{for every } y \in S,$$
from which it follows that $\sup_{x, y \in S} E_x[T_y] = \infty$, due to the fact that $\sum_{y \in S} \pi(y) = 1$ and $S$ is countably infinite, so that $\inf_{y \in S} \pi(y) = 0$. The aforementioned inequality can be established as follows.
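The key ingredient is Kac's formula for the mean return time; a sketch of the standard derivation (not specific to this paper) is:

```latex
% Kac's formula: for a positive recurrent chain, E_y[T_y^+] = 1/\pi(y),
% where T_y^+ := \inf\{n \ge 1 : X_n = y\} when started from X_0 = y.
% Conditioning on the first step,
E_y[T_y^+] \;=\; 1 + \sum_{x \in S} P(y, x)\, E_x[T_y],
% and since the sum is an average over x, some state x must attain
\max_{x \in S} E_x[T_y] \;\ge\; E_y[T_y^+] - 1 \;=\; \frac{1}{\pi(y)} - 1.
% On a countably infinite S the probabilities \pi(y) sum to 1, hence
% \inf_y \pi(y) = 0 and the right-hand side is unbounded over y.
```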
3 Markovian Multi-Armed Bandits
3.1 Setup
There are $K$ arms, and each arm $i \in \{1, \ldots, K\}$ is associated with a parameter $\theta_i \in \Theta$ which uniquely encodes an irreducible transition probability matrix $P_{\theta_i}$ (the parameter space $\Theta$ and the set of irreducible transition probability matrices have the same cardinality, and hence there is a bijection between them). We will denote the overall parameter configuration of all $K$ arms with $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_K)$. Arm $i$ evolves according to the stationary Markov chain $(X_k^i)_{k \ge 1}$, driven by the irreducible transition probability matrix $P_{\theta_i}$, which has a unique stationary distribution $\pi_{\theta_i}$, so that $X_1^i \sim \pi_{\theta_i}$. There is a common reward function $f : S \to [a, b]$ which generates the reward process $(f(X_k^i))_{k \ge 1}$. The reward process, in general, is not going to be a Markov chain, unless $f$ is injective, and it will have more complicated dependencies than the underlying Markov chain. Each time that we select arm $i$, this arm evolves by one transition and we observe the corresponding sample from the reward process $(f(X_k^i))_{k \ge 1}$, while all the other arms stay rested.
The stationary reward of arm $i$ is $\mu(\theta_i) := \sum_{x \in S} f(x)\, \pi_{\theta_i}(x)$. Let $\mu^*(\boldsymbol{\theta}) := \max_{1 \le i \le K} \mu(\theta_i)$ be the maximum stationary mean, and for simplicity assume that there exists a unique arm, $i^* = i^*(\boldsymbol{\theta})$, attaining this maximum stationary mean, i.e. $\mu(\theta_{i^*}) = \mu^*(\boldsymbol{\theta})$. In the following sections we will consider two objectives: identifying an $\epsilon$-best arm with some fixed confidence level using as few samples as possible, and minimizing the expected regret given some fixed horizon $T$.
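The rested Markovian arm dynamics just described can be sketched as follows (an illustrative snippet; the class name and the two-state numbers are ours, not from the paper):

```python
import numpy as np

class MarkovianArm:
    """A rested arm: its chain advances one transition only when pulled."""
    def __init__(self, P, f, rng):
        self.P, self.f, self.rng = P, f, rng
        pi = self.stationary()
        self.state = rng.choice(len(pi), p=pi)   # stationary initialization
    def stationary(self):
        # Left eigenvector of P for eigenvalue 1, normalized to sum to 1.
        w, v = np.linalg.eig(self.P.T)
        pi = np.real(v[:, np.argmin(np.abs(w - 1))])
        return pi / pi.sum()
    def pull(self):
        self.state = self.rng.choice(len(self.f), p=self.P[self.state])
        return self.f[self.state]

rng = np.random.default_rng(1)
arm = MarkovianArm(P=np.array([[0.9, 0.1], [0.2, 0.8]]),
                   f=np.array([0.0, 1.0]), rng=rng)
rewards = [arm.pull() for _ in range(5000)]
print(np.mean(rewards))  # ≈ stationary mean pi(1) = 0.1 / (0.1 + 0.2) = 1/3
```

Note that the rewards are dependent across pulls, which is exactly why the classical IID Hoeffding inequality cannot be applied to their running mean.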
3.2 Approximate Best Arm Identification
In the approximate best arm identification problem, we are given an approximation accuracy $\epsilon > 0$, and a confidence level $\delta \in (0, 1)$. Our goal is to come up with an adaptive algorithm which collects a total of $N$ samples, and returns an arm $\hat{i}$ that is within $\epsilon$ from the best arm, $i^*$, with probability at least $1 - \delta$, i.e.
$$\Pr_{\boldsymbol{\theta}}\left( \mu(\theta_{\hat{i}}) > \mu^*(\boldsymbol{\theta}) - \epsilon \right) \ge 1 - \delta.$$
Such an algorithm is called $(\epsilon, \delta)$-PAC (probably approximately correct).
In [Mannor and Tsitsiklis, 2003/04] a lower bound for the sample complexity of any $(\epsilon, \delta)$-PAC algorithm is derived. The lower bound states that no matter the $(\epsilon, \delta)$-PAC algorithm, there exists an instance $\boldsymbol{\theta}$ such that the sample complexity is at least,
$$\Omega\left( \frac{K}{\epsilon^2} \log \frac{1}{\delta} \right).$$
A matching upper bound is provided for IID bandits in [Even-Dar et al., 2006] in the form of the median elimination algorithm. We demonstrate the usefulness of our Hoeffding inequality by providing an analysis of the median elimination algorithm in the more general setting of Markovian bandits.
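For concreteness, the round structure of median elimination can be sketched as follows; the $(\epsilon_\ell, \delta_\ell)$ schedule follows [Even-Dar et al., 2006], while the per-round sample size uses an IID-style Hoeffding constant as a placeholder (for Markovian arms it would be inflated by the hitting-time factor of Theorem 1):

```python
import numpy as np

def median_elimination(arms, eps, delta):
    """Sketch of MedianElimination: halve the arm set each round while
    relaxing (eps, delta); arms are callables returning one reward."""
    S = list(range(len(arms)))
    eps_l, delta_l = eps / 4, delta / 2
    while len(S) > 1:
        # Per-round sample size; illustrative Hoeffding-style constant.
        t = int(np.ceil((2 / (eps_l / 2) ** 2) * np.log(3 / delta_l)))
        means = {i: np.mean([arms[i]() for _ in range(t)]) for i in S}
        med = np.median(list(means.values()))
        S = [i for i in S if means[i] >= med]     # drop the worse half
        eps_l, delta_l = 0.75 * eps_l, delta_l / 2
    return S[0]

rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.1, 0.2, 0.3, 0.9)]
best = median_elimination(arms, eps=0.3, delta=0.1)
print(best)  # index 3, the arm with mean 0.9
```

Since both the accuracy budget $\sum_\ell \epsilon_\ell$ and the confidence budget $\sum_\ell \delta_\ell$ form convergent geometric series, the total error stays within $(\epsilon, \delta)$ across all $O(\log K)$ rounds.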
Theorem 2.
For any $\epsilon > 0$ and $\delta \in (0, 1)$, the MedianElimination algorithm is $(\epsilon, \delta)$-PAC, and its sample complexity is upper bounded by $O\left( \frac{(b-a)^2 (1 + H_{\max})^2 K}{\epsilon^2} \log \frac{1}{\delta} \right)$.
Proof.
The total number of sampling rounds is at most , and we can set them equal to by setting , for , where . Fix . We claim that,
(2) 
We condition on the value of . If , then the claim is trivially true, so we only consider the case . Let , and . We consider the following set of bad arms,
and observe that,
(3) 
In order to upper bound the latter, fix a bad arm $i$ and write,
where in the last inequality we used Theorem 1. Now via Markov’s inequality this yields,
(4) 
Furthermore, Remark 1 gives that for any ,
(5) 
With (2) in our possession, the fact that median elimination is PAC follows through a union bound,
Regarding the sample complexity, we have that the total number of samples is at most,
∎
3.3 Regret Minimization
Our device to solve the regret minimization problem is an adaptive allocation rule, $\phi = (\phi_t)_{t \ge 1}$, which is a sequence of random variables where $\phi_t \in \{1, \ldots, K\}$ is the arm that we select at time $t$. Let $N_i(t) := \sum_{s=1}^{t} \mathbb{1}\{\phi_s = i\}$ be the number of times we selected arm $i$ up to time $t$. Our decision, $\phi_t$, at time $t$ is based on the information that we have accumulated so far. More precisely, the event $\{\phi_t = i\}$ is measurable with respect to the $\sigma$-field generated by the past decisions, $\phi_1, \ldots, \phi_{t-1}$, and the past observations.
Given a time horizon $T$, and a parameter configuration $\boldsymbol{\theta}$, the expected regret incurred when the adaptive allocation rule $\phi$ is used, is defined as,
$$R_{\boldsymbol{\theta}}^{\phi}(T) := \sum_{i=1}^{K} \Delta_i\, E_{\boldsymbol{\theta}}[N_i(T)],$$
where $\Delta_i := \mu^*(\boldsymbol{\theta}) - \mu(\theta_i)$. Our goal is to come up with an adaptive allocation rule that makes the expected regret as small as possible.
There is a known asymptotic lower bound on how much we can minimize the expected regret. Any adaptive allocation rule that is uniformly good across all parameter configurations should satisfy the following instance-specific, asymptotic regret lower bound (see [Anantharam et al., 1987] for details),
$$\liminf_{T \to \infty} \frac{R_{\boldsymbol{\theta}}^{\phi}(T)}{\log T} \;\ge\; \sum_{i \,:\, \Delta_i > 0} \frac{\Delta_i}{I(\theta_i, \theta_{i^*})},$$
where
$$I(\theta, \lambda) := \sum_{x, y \in S} \pi_\theta(x)\, P_\theta(x, y) \log \frac{P_\theta(x, y)}{P_\lambda(x, y)}$$
is the Kullback-Leibler divergence rate between the Markov chains with transition probability matrices $P_\theta$ and $P_\lambda$.

Here we utilize our Theorem 1 to provide a finite-time analysis of the UCB adaptive allocation rule for Markovian bandits, which is order optimal. The UCB adaptive allocation rule is a simple and computationally efficient index policy based on upper confidence bounds, which was initially proposed in [Auer et al., 2002] for IID bandits. It has already been studied in the context of Markovian bandits in [Tekin and Liu, 2010], but in a more restrictive setting, under the further assumptions of aperiodicity and reversibility, due to the use of the bounds from [Gillman, 1993, Lezaud, 1998].
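In pseudocode form, the UCB rule indexes each arm by its sample mean plus an exploration bonus; the sketch below uses a generic exploration constant `c`, which for Markovian rewards would be inflated by the squared hitting-time factor suggested by Theorem 1 (our reading, not the paper's exact tuning):

```python
import numpy as np

def ucb(arms, T, c=2.0):
    """UCB index rule [Auer et al., 2002] for rested arms given as callables."""
    K = len(arms)
    counts, sums, choices = np.zeros(K), np.zeros(K), []
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                              # pull each arm once
        else:
            index = sums / counts + np.sqrt(c * np.log(t) / counts)
            i = int(np.argmax(index))              # highest upper confidence bound
        r = arms[i]()
        counts[i] += 1
        sums[i] += r
        choices.append(i)
    return choices

rng = np.random.default_rng(0)
arms = [lambda p=p: float(rng.random() < p) for p in (0.2, 0.8)]
picks = ucb(arms, T=2000)
print(sum(1 for i in picks if i == 1) / len(picks))  # mostly the better arm
```

The bonus shrinks like $\sqrt{\log t / N_i(t)}$, so suboptimal arms are pulled only $O(\log T)$ times, matching the logarithmic growth of the lower bound.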
Theorem 3.
If $T \ge K$, then
$$R_{\boldsymbol{\theta}}^{\mathrm{UCB}}(T) \le \sum_{i \,:\, \Delta_i > 0} \left( \frac{8\, (b-a)^2 (1 + H_{\max})^2}{\Delta_i} \log T + c\, \Delta_i \right),$$
where $c > 0$ is a universal constant, and $H_{\max} := \max_{1 \le i \le K} \max_{x, y \in S} E_x^{\theta_i}[T_y]$ is the largest maximal hitting time among the $K$ arms.
Proof.
Fix a suboptimal arm $i$, i.e. one with $\Delta_i > 0$, and observe that the expected number of pulls $E_{\boldsymbol{\theta}}[N_i(T)]$ can be decomposed over the events $\{\phi_t = i\}$.
On the event $\{\phi_t = i\}$, we have that either the UCB index of arm $i^*$ falls below $\mu^*(\boldsymbol{\theta})$, or the sample mean of arm $i$ exceeds $\mu(\theta_i)$ by more than its confidence radius, since otherwise the UCB index of arm $i^*$ would be larger than the UCB index of arm $i$, which contradicts the assumption that $\phi_t = i$.
In addition, using Theorem 1, we obtain,
Similarly we can see that,
The conclusion now follows by putting everything together and using the integral estimate,
∎
Acknowledgements
We would like to thank Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.
References
 [Aldous and Fill, 2002] Aldous, D. and Fill, J. (2002). Reversible Markov chains and random walks on graphs.
 [Anantharam et al., 1987] Anantharam, V., Varaiya, P., and Walrand, J. (1987). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.
 [Auer et al., 2002] Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3):235–256.
 [Azuma, 1967] Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Math. J. (2), 19:357–367.

 [Bubeck and Cesa-Bianchi, 2012] Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.
 [Davisson et al., 1981] Davisson, L. D., Longo, G., and Sgarro, A. (1981). The error exponent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans. Inform. Theory, 27(4):431–438.
 [Dinwoodie, 1995] Dinwoodie, I. H. (1995). A probability inequality for the occupation measure of a reversible Markov chain. Ann. Appl. Probab., 5(1):37–43.

 [Even-Dar et al., 2006] Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105.
 [Fan et al., 2018] Fan, J., Jiang, B., and Sun, Q. (2018). Hoeffding’s lemma for Markov chains and its applications to statistical learning.
 [Gillman, 1993] Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.
 [Glynn and Ormoneit, 2002] Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic Markov chains. Statist. Probab. Lett., 56(2):143–146.
 [Hoeffding, 1963] Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30.

 [Jerrum et al., 2001] Jerrum, M., Sinclair, A., and Vigoda, E. (2001). A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 712–721. ACM, New York.
 [Kahale, 1997] Kahale, N. (1997). Large deviation bounds for Markov chains. Combin. Probab. Comput., 6(4):465–474.
 [Kontoyiannis et al., 2006] Kontoyiannis, I., Lastras-Montaño, L. A., and Meyn, S. P. (2006). Exponential Bounds and Stopping Rules for MCMC and General Markov Chains. In Proceedings of the 1st International Conference on Performance Evaluation Methodologies and Tools, valuetools ’06, New York, NY, USA. ACM.
 [Lai and Robbins, 1985] Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.
 [León and Perron, 2004] León, C. A. and Perron, F. (2004). Optimal Hoeffding bounds for discrete reversible Markov chains. Ann. Appl. Probab., 14(2):958–970.
 [Lezaud, 1998] Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867.
 [Mannor and Tsitsiklis, 2003/04] Mannor, S. and Tsitsiklis, J. N. (2003/04). The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648.
 [Metropolis et al., 1953] Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092.
 [Miasojedow, 2014] Miasojedow, B. (2014). Hoeffding’s inequalities for geometrically ergodic Markov chains on general state space. Statist. Probab. Lett., 87:115–120.
 [Moulos, 2019] Moulos, V. (2019). Optimal Best Markovian Arm Identification with Fixed Confidence. In 33rd Annual Conference on Neural Information Processing Systems.
 [Moulos and Anantharam, 2019] Moulos, V. and Anantharam, V. (2019). Optimal Chernoff and Hoeffding bounds for finite state Markov chains.
 [Rao, 2019] Rao, S. (2019). A Hoeffding inequality for Markov chains. Electron. Commun. Probab., 24:Paper No. 14, 11.
 [Tekin and Liu, 2010] Tekin, C. and Liu, M. (2010). Online algorithms for the multiarmed bandit problem with Markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682.
 [Watanabe and Hayashi, 2017] Watanabe, S. and Hayashi, M. (2017). Finitelength analysis on tail probability for Markov chain and application to simple hypothesis testing. Ann. Appl. Probab., 27(2):811–845.