1 Introduction
In this paper we study a generalization of the stochastic multiarmed bandit problem, in which there are several independent arms, each associated with an unknown parameter and modeled as a discrete-time stochastic process governed by the corresponding probability law. A time horizon is prescribed, and at each round we select a fixed number of arms, strictly smaller than the total number of arms, without any prior knowledge of the statistics of the underlying stochastic processes. The stochastic processes that correspond to the selected arms evolve by one time step, and we observe this evolution through a reward function, while the stochastic processes for the rest of the arms stay frozen. Our goal is to select arms in such a way that the cumulative reward over the whole time horizon is as large as possible. For this task we are faced with an exploitation versus exploration dilemma. At each round we need to decide whether we are going to exploit the best arms according to the information that we have gathered so far, or we are going to explore some other arms which do not seem to be so rewarding, just in case the rewards we have observed so far deviate significantly from the expected rewards. This dilemma is usually resolved by calculating an index for each arm and ranking the arms according to those indices, where each index should incorporate both information on how good an arm seems to be and on how many times it has been played so far.
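As a concrete illustration of this index-based template, the following sketch implements a generic allocation loop with a UCB-style index. The index formula and the `pull` interface are illustrative placeholders of our own, not the rule analyzed in this paper.

```python
import math

def ucb_index(mean, plays, t):
    # A placeholder index: the sample mean plus an exploration bonus that
    # grows with the round t and shrinks with the number of plays.
    return mean + math.sqrt(2.0 * math.log(t) / plays)

def play_bandit(pull, num_arms, num_plays, horizon):
    # Generic index-based allocation rule: at each round, play the
    # num_plays arms with the largest indices; unexplored arms come first.
    # pull(a) is a hypothetical interface returning a stochastic reward
    # for arm a; the selected arms evolve by one step, the rest stay frozen.
    means = [0.0] * num_arms
    plays = [0] * num_arms
    total = 0.0
    for t in range(1, horizon + 1):
        ranked = sorted(
            range(num_arms),
            key=lambda a: (ucb_index(means[a], plays[a], t + 1)
                           if plays[a] > 0 else float("inf")),
            reverse=True,
        )
        chosen = ranked[:num_plays]
        for a in chosen:
            r = pull(a)
            plays[a] += 1
            means[a] += (r - means[a]) / plays[a]  # running sample mean
            total += r
    return total, plays
```

With deterministic rewards one can see the rule concentrating its plays on the arm with the largest mean while still exploring the others occasionally.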
1.1 Contributions

We first consider the case in which the stochastic processes are irreducible Markov chains coming from a one-parameter exponential family of Markov chains. The objective is to play as much as possible the arms with the largest stationary means, although we have no prior information about the statistics of the Markov chains. The difference between the best possible expected reward, coming from those best arms, and the expected reward coming from the arms that we actually played is the regret that we incur. To minimize the regret we consider an index-based adaptive allocation rule, Algorithm 1, which is based on sample means and on upper confidence bounds for the stationary expected rewards using the Kullback-Leibler divergence rate. We provide a finite-time analysis, Theorem 1, for this KL-UCB adaptive allocation rule, which shows that the regret depends logarithmically on the time horizon and matches exactly the asymptotic lower bound, Corollary 1.

In order to make the finite-time guarantee possible, we devise several deviation lemmata for Markov chains. The most profound one is an exponential martingale for Markov chains, Lemma 3, which leads to a maximal inequality for Markov chains, Lemma 4. In the literature there are two approaches that use martingale techniques in order to derive deviation inequalities for Markov chains. Glynn and Ormoneit (2002) use the so-called Dynkin martingale in order to develop a Hoeffding inequality for Markov chains, and Moulos (2020) uses the so-called Doob martingale for the same purpose. Neither of those two martingales is directly comparable with the exponential martingale, and there is no evidence that they lead to maximal inequalities. Moreover, a Chernoff bound for Markov chains is devised, Lemma 2, and its relation to the work of Moulos and Anantharam (2019) is discussed in Remark 1.

We then consider the case in which the stochastic processes are IID, each corresponding to a density coming from a one-parameter exponential family of densities. We establish, Theorem 2, that Algorithm 1 still enjoys the same finite-time regret guarantees, which are asymptotically optimal. The case where Theorem 2 follows directly from Theorem 1 is discussed in Remark 4. The setting of single plays is studied in Cappé et al. (2013), but as we discuss in Remark 5, their KL-UCB adaptive allocation rule is incapable of delivering optimal results in the case of multiple plays.
1.2 Motivation
Multiarmed bandits provide a simple abstract statistical model that can be applied to study real-world problems such as clinical trials, ad placement, gambling, adaptive routing, resource allocation in computer systems, etc. We refer the interested reader to the survey of Bubeck and Cesa-Bianchi (2012) for more context, and to the recent books of Lattimore and Szepesvári (2019) and Slivkins (2019). The need for multiple plays can be understood in the setting of resource allocation. Scheduling jobs on a single CPU is an instance of the multiarmed bandit problem with a single play at each round, where the arms correspond to the jobs. If there are multiple CPUs, we get an instance of the multiarmed bandit problem with multiple plays. The need for a richer model which allows the presence of Markovian dependence is illustrated in the context of gambling, where the arms correspond to slot machines. It is reasonable to try to model the assertion that, if a slot machine produced a high reward the last time it was played, then it is very likely to produce a much lower reward the next time it is played, simply because the casino wants us to lose money and decides to change the reward distribution to a much stingier one. This assertion requires the reward distributions to depend on the previous outcome, which is precisely captured by the Markovian reward model.
1.3 Related Work
The cornerstone of the multiarmed bandits literature is the pioneering work of Lai and Robbins (1985), which studies the problem in the case of IID rewards and single plays. Lai and Robbins (1985) introduce the change of measure argument to derive a lower bound for the problem, as well as adaptive allocation rules based on upper confidence bounds which are proven to be asymptotically optimal. Anantharam et al. (1987a) extend the results of Lai and Robbins (1985) to the case of IID rewards and multiple plays, while Agrawal (1995) considers index-based allocation rules which rely only on sample means and are computationally simpler, although they may not be asymptotically optimal. The work of Agrawal (1995) inspired the first finite-time analysis of an adaptive allocation rule, called UCB, by Auer et al. (2002), which is, however, asymptotically suboptimal. The works of Cappé et al. (2013); Garivier and Cappé (2011); Maillard et al. (2011) bridge this gap by providing the KL-UCB adaptive allocation rule, with finite-time guarantees which are asymptotically optimal.
The case of Markovian rewards and multiple plays is initiated in the work of Anantharam et al. (1987b). They report an asymptotic lower bound, as well as an upper confidence bound adaptive allocation rule which is proven to be asymptotically optimal. However, it is unclear whether the statistics that they use in order to derive the upper confidence bounds, in their equation (4.2), can be recursively computed, and the practical applicability of their results is therefore questionable. In addition, they do not provide any finite-time analysis, and they use a different type of assumption on their one-parameter family of Markov chains. In particular, they assume that their one-parameter family of transition probability matrices is log-concave in the parameter, equation (4.1) in Anantharam et al. (1987b), while we assume that it is a one-parameter exponential family of transition probability matrices. Tekin and Liu (2010, 2012) extend the UCB adaptive allocation rule of Auer et al. (2002) to the case of Markovian rewards and multiple plays. They provide a finite-time analysis, but their regret bounds are suboptimal. Moreover, they impose a different type of assumption on their configuration of Markov chains: they assume that the transition probability matrices are reversible, so that they can apply the Hoeffding bound for Markov chains from the work of Gillman (1993). In a recent work, Moulos (2020) developed a Hoeffding bound for Markov chains which does not assume any condition other than irreducibility, and using this he extended the analysis of UCB to an even broader class of Markov chains. One of our main contributions is to bridge this gap and provide a KL-UCB adaptive allocation rule with a finite-time guarantee which is asymptotically optimal.
2 Problem Formulation
2.1 One-Parameter Family of Markov Chains
We consider a one-parameter family of irreducible Markov chains on a finite state space. Each member of the family is indexed by a parameter, and is characterized by an initial distribution and an irreducible transition probability matrix, which together give rise to a probability law. There are several arms, with an overall parameter configuration, and each arm evolves internally as the Markov chain with the corresponding parameter. There is a common non-constant real-valued reward function on the state space, and successive plays of an arm result in observing samples of this reward function along the corresponding stochastic process. In other words, the distribution of the rewards coming from an arm is a function of the Markov chain with the corresponding parameter, and thus it can have more complicated dependencies. As a special case, if we pick the reward function to be injective, then the reward process itself is Markovian.
For each parameter, due to irreducibility, there exists a unique stationary distribution for the corresponding transition probability matrix. Furthermore, to each parameter there corresponds a stationary mean reward, namely the mean of the reward function under this stationary distribution. Without loss of generality we may assume that the arms are ordered so that their stationary mean rewards are nonincreasing,
with a strict gap between the stationary means of the best arms, the ones we aim to play at each round, and the stationary means of the remaining arms.
2.2 Regret Minimization
We fix a time horizon, and at each round we play a set of distinct arms of fixed size, the same throughout the rounds, and we observe rewards given by,
where the count is the number of times we played the given arm up to the given time. Using the stopping times at which an arm is played, we can also reconstruct the underlying process from the observed process. Our play at each round is based on the information that we have accumulated so far; in other words, the event that a particular arm is among those played at a given round belongs to the $\sigma$-field generated by the past plays and the rewards observed so far. We call the sequence of our plays an adaptive allocation rule. Our goal is to come up with an adaptive allocation rule that achieves the greatest possible expected value for the sum of the rewards,
which is equivalent to minimizing the expected regret,
(1) 
As a proxy for the regret we will use the following quantity, which directly involves the number of times each of the best arms has not been played, and the number of times each of the remaining arms has been played,
(2) 
In the IID case the proxy coincides with the expected regret, and in the more general Markovian case it is just a constant term apart from the expected regret. Note that a feature that makes the case of multiple plays more delicate than the case of single plays, even for IID rewards, is the presence of the first summand in Equation 2. Due to this summand we also need to analyze the number of times each of the best arms has not been played.
Lemma 1.
where .
2.3 Asymptotic Lower Bound
A quantity that naturally arises in the study of regret minimization for Markovian bandits is the Kullback-Leibler divergence rate
between two Markov chains, which is a generalization of the usual Kullback-Leibler divergence between two probability distributions. We denote by
the Kullback-Leibler divergence rate between the Markov chain with parameter $\theta$ and the Markov chain with parameter $\lambda$, which is given by,
(3) $D(\theta \,\|\, \lambda) = \sum_{x, y} \pi_\theta(x) P_\theta(x, y) \log \dfrac{P_\theta(x, y)}{P_\lambda(x, y)}$,
where $\pi_\theta$ denotes the stationary distribution of the transition probability matrix $P_\theta$, and we use the standard notational conventions $0 \log 0 = 0$ and $0 \log \frac{0}{0} = 0$. Indeed note that, if all the rows of $P_\theta$ equal a distribution $p_\theta$, and all the rows of $P_\lambda$ equal a distribution $p_\lambda$, i.e. in the special case that the Markov chains correspond to IID processes, then the Kullback-Leibler divergence rate is equal to the Kullback-Leibler divergence between $p_\theta$ and $p_\lambda$,
$D(\theta \,\|\, \lambda) = \sum_{x} p_\theta(x) \log \dfrac{p_\theta(x)}{p_\lambda(x)}$.
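The divergence rate in Equation 3 is easy to evaluate numerically. The following sketch uses our own names (`P`, `Q` for the two transition matrices), assumes `P` is irreducible and that `Q` is positive wherever `P` is, and recovers the usual Kullback-Leibler divergence in the IID special case.

```python
import numpy as np

def stationary_distribution(P):
    # Solve pi P = pi with sum(pi) = 1 for an irreducible stochastic matrix P,
    # as an overdetermined least-squares system.
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def kl_divergence_rate(P, Q):
    # sum_x pi_P(x) sum_y P(x,y) log(P(x,y) / Q(x,y)),
    # with the convention 0 log 0 = 0.  Assumes Q > 0 wherever P > 0.
    pi = stationary_distribution(P)
    mask = P > 0
    safe_P = np.where(mask, P, 1.0)
    safe_Q = np.where(mask, Q, 1.0)
    terms = np.where(mask, P * np.log(safe_P / safe_Q), 0.0)
    return float(pi @ terms.sum(axis=1))
```

When both matrices have identical rows, the rate reduces to the KL divergence between those rows, matching the IID special case above.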
Under some regularity assumptions on the one-parameter family of Markov chains, Anantharam et al. (1987b), in their Theorem 3.1, establish the following asymptotic lower bound on the expected regret for any adaptive allocation rule which is uniformly good across all parameter configurations,
(4) 
A further discussion of this lower bound, as well as an alternative derivation, can be found in Appendix D.
The main goal of this work is to derive a finite-time analysis for an adaptive allocation rule based on Kullback-Leibler divergence rate indices, which is asymptotically optimal. We do so for the one-parameter exponential family of Markov chains, which forms a generalization of the classic one-parameter exponential family generated by a probability distribution with finite support.
2.4 One-Parameter Exponential Family of Markov Chains
Let $S$ be a finite state space, $f : S \to \mathbb{R}$ a non-constant reward function on the state space, and $P$ an irreducible transition probability matrix on $S$, with associated stationary distribution $\pi$.
$P$ will serve as the generator stochastic matrix of the family. Let
$\mu(P) = \sum_{x \in S} f(x) \pi(x)$
be the stationary mean of the Markov chain induced by $P$ when $f$ is applied. By exponentially tilting the transitions of $P$ we are able to construct new transition matrices that realize a whole range of stationary means around $\mu(P)$, and which form the exponential family of stochastic matrices. Fix $\theta \in \mathbb{R}$, and consider the matrix $\tilde{P}_\theta(x, y) = P(x, y) e^{\theta f(y)}$. Denote by $\rho(\theta)$ its spectral radius. According to Perron-Frobenius theory, see Theorem 8.4.4 in the book of Horn and Johnson (2013), $\rho(\theta)$ is a simple eigenvalue of $\tilde{P}_\theta$, called the Perron-Frobenius eigenvalue, and we can associate to it unique left and right eigenvectors $v_\theta$ and $u_\theta$ such that they are both positive and suitably normalized. Using them we define the member of the exponential family which corresponds to the natural parameter $\theta$ as,
(5) $P_\theta(x, y) = \dfrac{P(x, y) e^{\theta f(y)} u_\theta(y)}{\rho(\theta) u_\theta(x)}$,
where $A(\theta) = \log \rho(\theta)$ is the log-Perron-Frobenius eigenvalue. It can be easily seen that $P_\theta$ is indeed a stochastic matrix, and that its stationary distribution is given by $\pi_\theta(x) \propto u_\theta(x) v_\theta(x)$. The initial distribution associated to the parameter $\theta$ can be any distribution on $S$, since the KL-UCB adaptive allocation rule that we devise, and its guarantees, are valid no matter the initial distributions.
Exponential families of Markov chains date back to the work of Miller (1961). For a short overview of one-parameter exponential families of Markov chains, as well as proofs of the following properties, we refer the reader to Section 2 in Moulos and Anantharam (2019). The log-Perron-Frobenius eigenvalue $A(\theta)$ is a convex analytic function on the real numbers, and through its derivative $\dot{A}(\theta)$ we obtain the stationary mean of the Markov chain with transition matrix $P_\theta$ when $f$ is applied, i.e. $\dot{A}(\theta) = \sum_x f(x) \pi_\theta(x)$. When $A$ is not a linear function, it is strictly convex, and thus its derivative $\dot{A}$ is strictly increasing and forms a bijection between the natural parameter space, $\mathbb{R}$, and the mean parameter space, which is a bounded open interval.
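The construction of Equation 5, and the identity between the derivative of the log-Perron-Frobenius eigenvalue and the stationary mean, can be checked numerically. The sketch below assumes the tilting acts on the destination state, i.e. the tilted matrix is $P(x, y) e^{\theta f(y)}$; all function names are our own.

```python
import numpy as np

def stationary_distribution(P):
    # Solve pi P = pi, sum(pi) = 1, by least squares.
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones(n)])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def tilted_chain(P, f, theta):
    # Tilt: P_tilde(x, y) = P(x, y) * exp(theta * f(y)).
    P_tilde = P * np.exp(theta * f)[None, :]
    # Perron-Frobenius eigenvalue rho(theta) and positive right eigenvector u.
    eigvals, eigvecs = np.linalg.eig(P_tilde)
    k = np.argmax(eigvals.real)
    rho = eigvals[k].real
    u = np.abs(eigvecs[:, k].real)
    # Member of the exponential family, as in Equation 5.
    P_theta = P_tilde * u[None, :] / (rho * u[:, None])
    return P_theta, float(np.log(rho))

def log_pf_derivative(P, f, theta, h=1e-6):
    # Central-difference derivative of the log-Perron-Frobenius eigenvalue.
    _, a_plus = tilted_chain(P, f, theta + h)
    _, a_minus = tilted_chain(P, f, theta - h)
    return (a_plus - a_minus) / (2 * h)
```

The two checks worth running are that each $P_\theta$ is stochastic with $P_0 = P$, and that the derivative of the log-Perron-Frobenius eigenvalue matches the stationary mean of $f$ under $P_\theta$.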
The Kullback-Leibler divergence rate from (3), when instantiated for the exponential family of Markov chains, can be expressed as,
$D(\theta \,\|\, \lambda) = A(\lambda) - A(\theta) - (\lambda - \theta) \dot{A}(\theta)$,
which is convex and differentiable. Since $\dot{A}$ forms a bijection from the natural parameter space, $\mathbb{R}$, to the mean parameter space, with some abuse of notation we will write $D(\mu \,\|\, \mu')$ for $D(\dot{A}^{-1}(\mu) \,\|\, \dot{A}^{-1}(\mu'))$, where $\mu$ and $\mu'$ are mean parameters. Furthermore, $D$ can be extended continuously to the closure of the mean parameter space, and even further to a convex function on $\mathbb{R} \times \mathbb{R}$, by setting it to $\infty$ when one of its arguments falls outside this closure. For fixed $\mu'$, the function $\mu \mapsto D(\mu \,\|\, \mu')$ is decreasing for $\mu \le \mu'$ and increasing for $\mu \ge \mu'$. Similarly, for fixed $\mu$, the function $\mu' \mapsto D(\mu \,\|\, \mu')$ is decreasing for $\mu' \le \mu$ and increasing for $\mu' \ge \mu$.
3 Concentration Lemmata for Markov Chains
In this section we present our concentration results for Markov chains. We start with a Chernoff bound which, remarkably, does not impose any conditions on the Markov chain other than irreducibility, which is in any case a mandatory requirement for the stationary mean to be well-defined.
Lemma 2 (Chernoff bound for irreducible Markov chains).
Let $(X_n)_{n \ge 1}$ be an irreducible Markov chain over a finite state space with transition probability matrix $P$, initial distribution, and stationary distribution $\pi$. Let $f$ be a non-constant function on the state space. Denote by $\mu(P) = \sum_x f(x) \pi(x)$ the stationary mean when $f$ is applied, and by $\hat{\mu}_n = \frac{1}{n} \sum_{k=1}^{n} f(X_k)$ the empirical mean. Let $C$ be a closed subset of the mean parameter space. Then,
where $D$ stands for the Kullback-Leibler divergence rate in the exponential family of stochastic matrices generated by $P$ and $f$, and the prefactor is a positive constant depending only on the transition probability matrix $P$, the function $f$, and the closed set $C$.
Remark 1.
This bound is a variant of Theorem 1 in Moulos and Anantharam (2019), where the authors derive a Chernoff bound under some structural assumptions on the transition probability matrix and the function. In our Lemma 2 we derive a Chernoff bound without any such assumptions, relying, though, on the fact that the deviations are measured over a closed subset of the mean parameter space.
Next we present an exponential martingale for Markov chains, which in turn leads to a maximal inequality.
Lemma 3 (Exponential martingale for Markov chains).
Let $(X_n)_{n \ge 0}$ be a Markov chain over a finite state space with an irreducible transition matrix and initial distribution. Let $f$ be a non-constant real-valued function on the state space. Fix $\theta \in \mathbb{R}$ and define,
(6) 
Then this process is a martingale with respect to the natural filtration, where the $\sigma$-field at time $n$ is generated by $X_0, \ldots, X_n$.
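The heart of such an exponential martingale is a one-step eigenvector identity: conditionally on the current state $x$, the expected value of $e^{\theta f(X_{n+1})} u_\theta(X_{n+1})$ equals $\rho(\theta) u_\theta(x)$, where $u_\theta$ denotes a right Perron-Frobenius eigenvector of the tilted matrix $P(x, y) e^{\theta f(y)}$ and $\rho(\theta)$ its spectral radius (our notation). This identity, which is what makes the suitably normalized exponential product a martingale, can be verified numerically:

```python
import numpy as np

def pf_data(P, f, theta):
    # Right Perron-Frobenius eigenvector u > 0 and eigenvalue rho of the
    # tilted matrix P(x, y) * exp(theta * f(y)).
    P_tilde = P * np.exp(theta * f)[None, :]
    eigvals, eigvecs = np.linalg.eig(P_tilde)
    k = np.argmax(eigvals.real)
    return np.abs(eigvecs[:, k].real), eigvals[k].real

def one_step_identity_gap(P, f, theta):
    # max over states x of | sum_y P(x,y) e^{theta f(y)} u(y) - rho u(x) |.
    # A zero gap is exactly the martingale property of the one-step factors.
    u, rho = pf_data(P, f, theta)
    lhs = (P * np.exp(theta * f)[None, :]) @ u
    return float(np.max(np.abs(lhs - rho * u)))
```

The gap is zero up to floating-point error for any irreducible chain and any value of the tilting parameter.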
The following definition is the condition that we will use for our maximal inequality to apply.
Definition 1 (Doeblin’s type of condition).
Let $P$ be a transition probability matrix on the finite state space $S$. For a nonempty set of states $A \subseteq S$, we say that $(P, A)$ is Doeblin if the submatrix of $P$ with rows and columns in $A$ is irreducible, and for every state $x \in S$ there exists a state $y \in A$ such that $P(x, y) > 0$.
Remark 2.
Our Definition 1 is inspired by the classic Doeblin's Theorem, see Theorem 2.2.1 in Stroock (2014). Doeblin's Theorem states that, if the transition probability matrix $P$ satisfies Doeblin's condition (namely, there exist $\epsilon > 0$ and a state $y$ such that for all states $x$ we have $P(x, y) \ge \epsilon$), then $P$ has a unique stationary distribution $\pi$, and for all initial distributions we have geometric convergence to stationarity, i.e. the total variation distance to $\pi$ after $n$ steps is at most $(1 - \epsilon)^n$. Doeblin's condition, according to our Definition 1, corresponds to $(P, \{y\})$ being Doeblin for some state $y$.
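Definition 1 can be checked mechanically for a given matrix and set of states. The sketch below implements our reading of the condition, namely that the submatrix of $P$ on the set $A$ is irreducible and every state transitions into $A$ in one step with positive probability; treat it as illustrative.

```python
import numpy as np

def is_irreducible(M):
    # A nonnegative n x n matrix M is irreducible iff (I + indicator(M > 0))^(n-1)
    # is entrywise positive.  (A 1 x 1 submatrix is treated as irreducible.)
    n = M.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + (M > 0), n - 1)
    return bool(np.all(reach > 0))

def is_doeblin(P, A):
    # Doeblin-type condition of Definition 1 for the set of states A
    # (a list of state indices): the submatrix of P with rows and columns
    # in A is irreducible, and every state has a one-step transition into A.
    A = list(A)
    sub = P[np.ix_(A, A)]
    if not is_irreducible(sub):
        return False
    return bool(np.all(P[:, A].sum(axis=1) > 0))
```

For instance, a matrix with an everywhere-accessible state satisfies the condition with the corresponding singleton set, while two disconnected absorbing states fail it.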
Lemma 4 (Maximal inequality for irreducible Markov chains satisfying Doeblin’s condition).
Let $(X_n)_{n \ge 1}$ be an irreducible Markov chain over a finite state space with transition matrix $P$, initial distribution, and stationary distribution $\pi$. Let $f$ be a non-constant function on the state space. Denote by $\mu(P)$ the stationary mean when $f$ is applied, and by $\hat{\mu}_n$ the empirical mean. Assume the Doeblin condition of Definition 1. Then for all deviation levels we have
where is a positive constant depending only on the transition probability matrix and the function .
Remark 3.
If we only consider deviation levels from a bounded subset of the reals, then we do not need to assume the Doeblin condition, and the constant will further depend on this bounded subset. But in the analysis of the KL-UCB adaptive allocation rule we will need to consider deviation levels that increase with the time horizon, therefore we have to impose the Doeblin assumption, so that the constant has no dependence on the time horizon.
IID versions of this maximal inequality have found applicability not only in multiarmed bandit problems, but also in context tree estimation, Garivier and Leonardi (2011), indicating that our Lemma 4 may be of interest for other applications as well.
4 The KL-UCB Adaptive Allocation Rule for Multiple Plays and Markovian Rewards
4.1 The Algorithm
For each arm we define the empirical mean at the global time as,
(7) 
and its local time counterpart as,
with their link being , where . At each round we calculate an upper confidence bound index,
(8) 
where is an increasing function, and we denote its local time version by,
It is straightforward to check, using the definition of , the following two relations,
(9)  
(10) 
Furthermore, in Appendix B we study the concentration properties of those upper confidence indices and of the sample means, using the concentration results for Markov chains from Section 3. The KL-UCB adaptive allocation rule and its guarantees are presented below.
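In practice, the index of Equation 8 is the largest mean value whose divergence from the empirical mean stays within an exploration budget, and the monotonicity of the divergence in its second argument (Section 2.3) allows it to be computed by bisection. The sketch below uses the Bernoulli Kullback-Leibler divergence as a stand-in for the divergence rate; the exact divergence and threshold function of Algorithm 1 are not reproduced here.

```python
import math

def bernoulli_kl(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), with 0 log 0 = 0.
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_ucb_index(mean, pulls, threshold, tol=1e-9):
    # Largest q in [mean, 1] with pulls * d(mean, q) <= threshold, found by
    # bisection: d(mean, .) is increasing to the right of mean.
    lo, hi = mean, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pulls * bernoulli_kl(mean, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the index always dominates the empirical mean, shrinks toward it as the number of pulls grows, and widens as the threshold (the increasing function of the round) grows.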
Proposition 1.
For each we have that , and so Algorithm 1 is well defined.
Theorem 1 (Markovian rewards and multiple plays: finitetime guarantees).
Let $P$ be an irreducible transition probability matrix on a finite state space, and $f$ a real-valued reward function, such that the Doeblin condition of Definition 1 is satisfied. Assume that the arms correspond to a parameter configuration from the exponential family of Markov chains, as described in Equation 5. Without loss of generality assume that the arms are ordered so that,
Fix a time horizon. The KL-UCB adaptive allocation rule for Markovian rewards and multiple plays, Algorithm 1, with the prescribed choice of the increasing function, enjoys the following finite-time upper bound on the regret,
where are constants with respect to , which are given more explicitly in the analysis.
Corollary 1 (Asymptotic optimality).
In the context of Theorem 1, the KL-UCB adaptive allocation rule, Algorithm 1, is asymptotically optimal, and,
4.2 Sketch of the Analysis
Due to Lemma 1, it suffices to upper bound the proxy for the expected regret given in Equation 2. Therefore, we can break the analysis into two parts: upper bounding the expected number of times each of the best arms has not been played, and upper bounding the expected number of times each of the remaining arms has been played.
For the first part, we show in Appendix C that the expected number of times that each of the best arms has not been played is of the order stated in the following lemma.
Lemma 5.
For every arm ,
where and are constants with respect to .
For the second part, if , and , then there are three possibilities:

, and for some ,

, and for all , and ,

.
This means that,
and we handle each of those three terms separately.
We show that the first term is upper bounded by .
Lemma 6.
where and are constants with respect to .
The second term is of the order of , and it is the term that causes the overall logarithmic regret.
Lemma 7.
where , and , are constants with respect to .
Finally, we show that the third term is upper bounded by .
Lemma 8.
where and are constants with respect to .
This concludes the proof of Theorem 1, modulo the four bounds of this subsection which are established in Appendix C.
5 The KLUCB Adaptive Allocation Rule for Multiple Plays and IID Rewards
As a byproduct of our work in Section 4, we further obtain a finite-time regret bound, which is asymptotically optimal, for the case of multiple plays and IID rewards from an exponential family of probability densities.
We first review the notion of an exponential family of probability densities, for which the standard reference is Brown (1986). Let a probability space with a reference measure be given. A one-parameter exponential family is a family of probability densities with respect to this reference measure, of the form,
(11) 
where $F$ is called the sufficient statistic, which is measurable and non-constant, $q$ is called the carrier density, which is a density with respect to the reference measure $\nu$, and $A(\theta)$
is called the log-moment-generating function, given by
$A(\theta) = \log \int e^{\theta F(x)} q(x) \, \nu(dx)$, which is finite for $\theta$ in the natural parameter space. The log-MGF, $A$, is strictly convex, and its derivative $\dot{A}$ forms a bijection between the natural parameters and the mean parameters. The Kullback-Leibler divergence between the densities with parameters $\theta$ and $\lambda$ can be written as $D(\theta \,\|\, \lambda) = A(\lambda) - A(\theta) - (\lambda - \theta) \dot{A}(\theta)$. For this section, each arm corresponds to an IID process of draws from the density with the corresponding parameter, which gives rise to the corresponding IID reward process.
Remark 4.
When the carrier density is supported on a finite set, the exponential family of probability densities in Equation 11 is just a special case of the exponential family of Markov chains in Equation 5, as can be seen by setting all the rows of the generator matrix equal to the carrier density. Then the log-Perron-Frobenius eigenvalue coincides with the log-MGF. Therefore, Theorem 1 already resolves the case of multiple plays and IID rewards from an exponential family of finitely supported densities.
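The identification in Remark 4 can be verified numerically: when every row of the generator matrix equals the same probability vector, the tilted matrix has rank one and its spectral radius is exactly the moment generating function of the sufficient statistic under that vector. A sketch (notation ours):

```python
import numpy as np

def log_pf_eigenvalue(P, f, theta):
    # Log spectral radius of the tilted matrix P(x, y) * exp(theta * f(y)).
    P_tilde = P * np.exp(theta * f)[None, :]
    return float(np.log(np.max(np.abs(np.linalg.eigvals(P_tilde)))))

def log_mgf(p, f, theta):
    # Log moment generating function of f under the distribution p.
    return float(np.log(np.sum(p * np.exp(theta * f))))
```

Building the generator with identical rows makes the two quantities agree for every value of the natural parameter.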
Theorem 2 (IID rewards and multiple plays: finitetime guarantees).
Let a probability space with a reference measure be given, together with a measurable sufficient statistic and a carrier density. Assume that the arms correspond to a parameter configuration from the exponential family of probability densities, as described in Equation 11. Without loss of generality assume that the arms are ordered so that,
Fix a time horizon. The KL-UCB adaptive allocation rule for IID rewards and multiple plays, Algorithm 1, with the prescribed choice of the increasing function, enjoys the following finite-time upper bound on the regret,
where are constants with respect to .
Consequently, the KL-UCB adaptive allocation rule, Algorithm 1, is asymptotically optimal, and,
Remark 5.
For the special case of single plays, such a finite-time regret bound is derived in Cappé et al. (2013), and here we generalize it to multiple plays. One striking difference between the case of single plays and the case of multiple plays is that, in the latter, one needs to further analyze the number of times that each of the best arms has not been played, as we do in Lemma 5; this is inevitable due to the decomposition of the regret in Equation 2. In the case of single plays no such analysis is needed, because there is only one best arm, and hence we can track the number of times it has been played by analyzing the number of times all the other arms have been played. But the KL-UCB adaptive allocation rule proposed in Cappé et al. (2013) uses only KL-UCB indices, which on their own are not enough to analyze the number of times each of the best arms has not been played. In order to achieve this, one needs to combine the KL-UCB indices, Equation 8, with the mean statistics, Equation 7, as performed in Algorithm 1. This indeed results in optimal regret guarantees for the case of multiple plays.
Acknowledgements
We would like to thank Venkat Anantharam, Jim Pitman and Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.
References
 Agrawal (1995) Agrawal, R. (1995). Sample mean based index policies with regret for the multiarmed bandit problem. Adv. in Appl. Probab., 27(4):1054–1078.
 Anantharam et al. (1987a) Anantharam, V., Varaiya, P., and Walrand, J. (1987a). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. I. I.I.D. rewards. IEEE Trans. Automat. Control, 32(11):968–976.
 Anantharam et al. (1987b) Anantharam, V., Varaiya, P., and Walrand, J. (1987b). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.
 Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3):235–256.
 Brown (1986) Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory, volume 9 of Institute of Mathematical Statistics Lecture Notes—Monograph Series. Institute of Mathematical Statistics, Hayward, CA.

 Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
 Cappé et al. (2013) Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist., 41(3):1516–1541.
 Combes and Proutiere (2014) Combes, R. and Proutiere, A. (2014). Unimodal bandits without smoothness.
 Cover and Thomas (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. WileyInterscience [John Wiley & Sons], Hoboken, NJ, second edition.
 Garivier and Cappé (2011) Garivier, A. and Cappé, O. (2011). The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In Kakade, S. M. and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 359–376, Budapest, Hungary. PMLR.
 Garivier and Leonardi (2011) Garivier, A. and Leonardi, F. (2011). Context tree selection: A unifying view. Stochastic Processes and their Applications, 121(11):2488 – 2506.
 Gillman (1993) Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.
 Glynn and Ormoneit (2002) Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic Markov chains. Statist. Probab. Lett., 56(2):143–146.
 Horn and Johnson (2013) Horn, R. A. and Johnson, C. R. (2013). Matrix analysis. Cambridge University Press, Cambridge, second edition.
 Kaufmann et al. (2016) Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the Complexity of Bestarm Identification in Multiarmed Bandit Models. J. Mach. Learn. Res., 17(1):1–42.
 Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.
 Lattimore and Szepesvári (2019) Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms.
 Maillard et al. (2011) Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler divergences. In Kakade, S. M. and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 497–514, Budapest, Hungary. PMLR.

 Miller (1961) Miller, H. D. (1961). A convexity property in the theory of random variables defined on a finite Markov chain. Ann. Math. Statist., 32:1260–1270.
 Moulos (2019) Moulos, V. (2019). Optimal Best Markovian Arm Identification with Fixed Confidence. In 33rd Annual Conference on Neural Information Processing Systems.
 Moulos (2020) Moulos, V. (2020). A Hoeffding Inequality for Finite State Markov Chains and its Applications to Markovian Bandits.
 Moulos and Anantharam (2019) Moulos, V. and Anantharam, V. (2019). Optimal Chernoff and Hoeffding Bounds for Finite State Markov Chains.
 Slivkins (2019) Slivkins, A. (2019). Introduction to Multi-Armed Bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286.
 Stroock (2014) Stroock, D. W. (2014). An introduction to Markov processes, volume 230 of Graduate Texts in Mathematics. Springer, Heidelberg, second edition.
 Tekin and Liu (2010) Tekin, C. and Liu, M. (2010). Online algorithms for the multiarmed bandit problem with Markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682.
 Tekin and Liu (2012) Tekin, C. and Liu, M. (2012). Online learning of rested and restless bandits. IEEE Trans. Inf. Theor., 58(8):5588–5611.
 Ville (1939) Ville, J. (1939). Étude critique de la notion de collectif. NUMDAM.
Appendix A Concentration Lemmata for Markov Chains
Proof of Lemma 2.
Using the standard exponential transform followed by Markov’s inequality we obtain that for any ,
We can upper bound this expectation in the following way,
where in the last equality we used the fact that the vector involved is a right Perron-Frobenius eigenvector of the tilted matrix.
From those two we obtain,
and if we plug in