# Finite-time Analysis of Kullback-Leibler Upper Confidence Bounds for Optimal Adaptive Allocation with Multiple Plays and Markovian Rewards

We study an extension of the classic stochastic multi-armed bandit problem which involves Markovian rewards and multiple plays. To tackle this problem we consider an index based adaptive allocation rule which, at each stage, combines calculations of sample means and of upper confidence bounds, based on the Kullback-Leibler divergence rate, for the stationary expected rewards of the Markovian arms. For rewards generated from a one-parameter exponential family of Markov chains, we provide a finite-time upper bound for the regret incurred by this adaptive allocation rule, which reveals the logarithmic dependence of the regret on the time horizon and is asymptotically optimal. For our analysis we devise several concentration results for Markov chains, including a maximal inequality, that may be of interest in their own right. As a byproduct of our analysis we also establish asymptotically optimal finite-time guarantees for the case of multiple plays and IID rewards drawn from a one-parameter exponential family of probability densities.


## 1 Introduction

In this paper we study a generalization of the stochastic multi-armed bandit problem, where there are K independent arms, and each arm a is associated with a parameter θ_a and modeled as a discrete time stochastic process governed by the probability law corresponding to θ_a. A time horizon T is prescribed, and at each round we select M of the arms, where M < K, without any prior knowledge of the statistics of the underlying stochastic processes. The stochastic processes that correspond to the selected arms evolve by one time step, and we observe this evolution through a reward function, while the stochastic processes for the rest of the arms stay frozen. Our goal is to select arms in such a way as to make the cumulative reward over the whole time horizon as large as possible. For this task we are faced with an exploitation versus exploration dilemma. At each round we need to decide whether to exploit the best arms according to the information gathered so far, or to explore other arms which do not seem to be as rewarding, just in case the rewards we have observed so far deviate significantly from the expected rewards. The answer to this dilemma usually comes from calculating indices for the arms and ranking them according to those indices, which should incorporate information both on how good an arm seems to be and on how many times it has been played so far.

### 1.1 Contributions

1. We first consider the case where the stochastic processes are irreducible Markov chains, coming from a one-parameter exponential family of Markov chains. The objective is to play as often as possible the arms with the largest stationary means, although we have no prior information about the statistics of the Markov chains. The difference between the best possible expected reward coming from those best arms and the expected reward coming from the arms that we actually played is the regret that we incur. To minimize the regret we consider an index based adaptive allocation rule, Algorithm 1, which is based on sample means and upper confidence bounds for the stationary expected rewards using the Kullback-Leibler divergence rate. We provide a finite-time analysis, Theorem 1, for this KL-UCB adaptive allocation rule which shows that the regret depends logarithmically on the time horizon T, and matches exactly the asymptotic lower bound, Corollary 1.

2. In order to make the finite-time guarantee possible we devise several deviation lemmata for Markov chains. The most central one is an exponential martingale for Markov chains, Lemma 3, which leads to a maximal inequality for Markov chains, Lemma 4. In the literature there are two approaches that use martingale techniques in order to derive deviation inequalities for Markov chains. Glynn and Ormoneit (2002) use the so-called Dynkin martingale in order to develop a Hoeffding inequality for Markov chains, and Moulos (2020) uses the so-called Doob martingale for the same purpose. Neither of those two martingales is directly comparable with the exponential martingale, and there is no evidence that they lead to maximal inequalities. Moreover, a Chernoff bound for Markov chains is devised, Lemma 2, and its relation with the work of Moulos and Anantharam (2019) is discussed in Remark 1.

3. We then consider the case where the stochastic processes are IID processes, each corresponding to a density coming from a one-parameter exponential family of densities. We establish, Theorem 2, that Algorithm 1 still enjoys the same finite-time regret guarantees, which are asymptotically optimal. The case where Theorem 2 follows directly from Theorem 1 is discussed in Remark 4. The setting of single plays is studied in Cappé et al. (2013), but as we discuss in Remark 5 their KL-UCB adaptive allocation rule is incapable of delivering optimal results for the case of multiple plays.

### 1.2 Motivation

Multi-armed bandits provide a simple abstract statistical model that can be applied to study real world problems such as clinical trials, ad placement, gambling, adaptive routing, resource allocation in computer systems, etc. We refer the interested reader to the survey of Bubeck and Cesa-Bianchi (2012) for more context, and to the recent books of Lattimore and Szepesvári (2019) and Slivkins (2019). The need for multiple plays can be understood in the setting of resource allocation. Scheduling jobs to a single CPU is an instance of the multi-armed bandit problem with a single play at each round, where the arms correspond to the jobs. If there are multiple CPUs we get an instance of the multi-armed bandit problem with multiple plays. The need for a richer model which allows the presence of Markovian dependence is illustrated in the context of gambling, where the arms correspond to slot-machines. It is reasonable to try to model the assertion that if a slot-machine produced a high reward the n-th time it was played, then it is very likely to produce a much lower reward the (n+1)-st time, simply because the casino wants us to lose money and decides to change the reward distribution to a much stingier one. This assertion requires the reward distributions to depend on the previous outcome, which is precisely captured by the Markovian reward model.

### 1.3 Related Work

The cornerstone of the multi-armed bandits literature is the pioneering work of Lai and Robbins (1985), which studies the problem for the case of IID rewards and single plays. Lai and Robbins (1985) introduce the change of measure argument to derive a lower bound for the problem, as well as adaptive allocation rules based on upper confidence bounds which are proven to be asymptotically optimal. Anantharam et al. (1987a) extend the results of Lai and Robbins (1985) to the case of IID rewards and multiple plays, while Agrawal (1995) considers index based allocation rules which are based only on sample means and are computationally simpler, although they may not be asymptotically optimal. The work of Agrawal (1995) inspired the first finite-time analysis, for the adaptive allocation rule called UCB, by Auer et al. (2002), which is, however, asymptotically suboptimal. The works of Cappé et al. (2013); Garivier and Cappé (2011); Maillard et al. (2011) bridge this gap by providing the KL-UCB adaptive allocation rule, with finite-time guarantees which are asymptotically optimal.

The case of Markovian rewards and multiple plays is initiated in the work of Anantharam et al. (1987b). They report an asymptotic lower bound, as well as an upper confidence bound adaptive allocation rule which is proven to be asymptotically optimal. However, it is unclear if the statistics that they use in order to derive the upper confidence bounds, in their equation (4.2), can be computed recursively, and the practical applicability of their results is therefore questionable. In addition, they do not provide any finite-time analysis, and they use a different type of assumption on their one-parameter family of Markov chains. In particular, they assume that their one-parameter family of transition probability matrices is log-concave in the parameter, equation (4.1) in Anantharam et al. (1987b), while we assume that it is a one-parameter exponential family of transition probability matrices. Tekin and Liu (2010, 2012) extend the UCB adaptive allocation rule of Auer et al. (2002) to the case of Markovian rewards and multiple plays. They provide a finite-time analysis, but their regret bounds are suboptimal. Moreover, they impose a different type of assumption on their configuration of Markov chains: they assume that the transition probability matrices are reversible, so that they can apply the Hoeffding bound for Markov chains from the work of Gillman (1993). In recent work, Moulos (2020) developed a Hoeffding bound for Markov chains which does not assume any conditions other than irreducibility, and using this he extended the analysis of UCB to an even broader class of Markov chains. One of our main contributions is to close this optimality gap and provide a KL-UCB adaptive allocation rule, with a finite-time guarantee which is asymptotically optimal.

## 2 Problem Formulation

### 2.1 One-Parameter Family of Markov Chains

We consider a one-parameter family of irreducible Markov chains on a finite state space S. Each member of the family is indexed by a parameter θ ∈ Θ, and is characterized by an initial distribution q_θ and an irreducible transition probability matrix P_θ, which together give rise to the probability law of the chain. There are K arms, with overall parameter configuration θ = (θ_1, …, θ_K), and each arm a evolves internally as the Markov chain with parameter θ_a, which we denote by {X^a_n}_{n≥0}. There is a common nonconstant real-valued reward function f on the state space S, and successive plays of arm a result in observing samples from the stochastic process {Y^a_n}_{n≥1}, where Y^a_n = f(X^a_n). In other words, the distribution of the rewards coming from arm a is a function of the Markov chain with parameter θ_a, and thus it can have more complicated dependencies. As a special case, if we pick the reward function to be injective, then the distribution of the rewards is itself Markovian.

For each θ ∈ Θ, due to irreducibility, there exists a unique stationary distribution for the transition probability matrix P_θ, which we denote by π_θ. Furthermore, let μ(θ) = Σ_{x∈S} f(x) π_θ(x) be the stationary mean reward corresponding to the Markov chain parametrized by θ. Without loss of generality we may assume that the arms are ordered so that,

 μ(θ_1) ≥ … ≥ μ(θ_N) > μ(θ_{N+1}) = … = μ(θ_M) = … = μ(θ_L) > μ(θ_{L+1}) ≥ … ≥ μ(θ_K),

for some integers N and L with 0 ≤ N < M ≤ L ≤ K: N counts the arms with stationary mean strictly greater than μ(θ_M), and L counts the arms with stationary mean at least μ(θ_M); we set N = 0 when no arm is strictly better than the M-th best, and L = K when no arm is strictly worse.

### 2.2 Regret Minimization

We fix a time horizon T, and at each round t = 1, …, T we play a set ϕ_t of M distinct arms, where M is the same throughout the rounds, and we observe rewards given by,

 Z^a_t = { Y^a_{N_a(t)}, if a ∈ ϕ_t; 0, if a ∉ ϕ_t },

where N_a(t) = Σ_{s=1}^t 1{a ∈ ϕ_s} is the number of times we played arm a up to time t. Using the stopping times at which arm a is played, we can also reconstruct the process {Y^a_n} from the observed process {Z^a_t}. Our play at each round is based on the information that we have accumulated so far. In other words, the event {a ∈ ϕ_{t+1}} belongs to the σ-field generated by ϕ_1, {Z^a_1}_{a∈[K]}, …, ϕ_t, {Z^a_t}_{a∈[K]}. We call the sequence ϕ = {ϕ_t}_{t≥1} of our plays an adaptive allocation rule. Our goal is to come up with an adaptive allocation rule that achieves the greatest possible expected value for the sum of the rewards,

 S_T = Σ_{t=1}^T Σ_{a∈[K]} Z^a_t = Σ_{a∈[K]} Σ_{n=1}^{N_a(T)} Y^a_n,

which is equivalent to minimizing the expected regret,

 R^ϕ_θ(T) = T Σ_{a=1}^M μ(θ_a) − E^ϕ_θ[S_T]. (1)

As a proxy for the regret we will use the following quantity which involves directly the number of times each arm hasn’t been played, and the number of times each arm has been played,

 ~R^ϕ_θ(T) = Σ_{a=1}^N (μ(θ_a) − μ(θ_M)) E^ϕ_θ[T − N_a(T)] + Σ_{b=L+1}^K (μ(θ_M) − μ(θ_b)) E^ϕ_θ[N_b(T)]. (2)

For the IID case ~R^ϕ_θ(T) = R^ϕ_θ(T), and in the more general Markovian case ~R^ϕ_θ(T) is just a constant term apart from the expected regret R^ϕ_θ(T). Note that a feature that makes the case of multiple plays more delicate than the case of single plays, even for IID rewards, is the presence of the first summand in Equation 2. For this we also need to analyze the number of times each of the N best arms hasn't been played.

###### Lemma 1.
 |R^ϕ_θ(T) − ~R^ϕ_θ(T)| ≤ Σ_{a=1}^K R_a ⋅ Σ_{x∈S} |f(x)|,

where the constants R_a depend only on the transition probability matrices P_{θ_a}.
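To make the decomposition concrete, the proxy in Equation 2 is directly computable from play counts. The following minimal sketch (the function name and inputs are our own illustration; arms are assumed pre-sorted by stationary mean, with M plays per round) recovers N and L from the means and evaluates both summands.

```python
import numpy as np

def regret_proxy(mu, M, counts, T):
    """Evaluate the proxy regret of Equation 2.

    mu     -- stationary means, sorted in nonincreasing order
    M      -- number of arms played per round
    counts -- counts[a] = number of plays of arm a over the horizon
    T      -- time horizon
    """
    mu = np.asarray(mu, dtype=float)
    K = len(mu)
    N = int(np.sum(mu > mu[M - 1]))    # arms strictly better than the M-th best
    L = int(np.sum(mu >= mu[M - 1]))   # arms at least as good as the M-th best
    under = sum((mu[a] - mu[M - 1]) * (T - counts[a]) for a in range(N))
    over = sum((mu[M - 1] - mu[b]) * counts[b] for b in range(L, K))
    return under + over
```

For instance, with means (0.9, 0.5, 0.5, 0.2), M = 2 plays per round, horizon T = 10, and play counts (8, 6, 4, 2), the proxy evaluates to 0.4·2 + 0.3·2 = 1.4.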

### 2.3 Asymptotic Lower Bound

A quantity that naturally arises in the study of regret minimization for Markovian bandits is the Kullback-Leibler divergence rate between two Markov chains, which is a generalization of the usual Kullback-Leibler divergence between two probability distributions. We denote by D(θ ∥ λ) the Kullback-Leibler divergence rate between the Markov chain with parameter θ and the Markov chain with parameter λ, which is given by,

 D(θ∥λ) = Σ_{x,y∈S} π_θ(x) P_θ(x,y) log ( P_θ(x,y) / P_λ(x,y) ), (3)

where we use the standard notational conventions 0 log 0 = 0 and 0 log (0/0) = 0. Indeed note that, if P_θ(x,y) = p_θ(y) and P_λ(x,y) = p_λ(y) for all x, y ∈ S, i.e. in the special case that the Markov chains correspond to IID processes, then the Kullback-Leibler divergence rate is equal to the Kullback-Leibler divergence between p_θ and p_λ,

 D(θ∥λ) = Σ_{x,y∈S} p_θ(x) p_θ(y) log ( p_θ(y) / p_λ(y) ) = Σ_{y∈S} p_θ(y) log ( p_θ(y) / p_λ(y) ) = D(p_θ ∥ p_λ).
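The divergence rate (3) and its IID specialization can be checked numerically for finite chains. In the sketch below the helper names `stationary` and `kl_rate` are ours, not the paper's; when both chains have identical rows the rate collapses to the classic Kullback-Leibler divergence.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible stochastic matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    w = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return w / w.sum()

def kl_rate(P, Q):
    """Kullback-Leibler divergence rate of Equation 3 between two chains
    P and Q on the same finite state space (convention 0 log 0 = 0)."""
    pi = stationary(P)
    rate = 0.0
    for x in range(P.shape[0]):
        for y in range(P.shape[1]):
            if P[x, y] > 0:
                rate += pi[x] * P[x, y] * np.log(P[x, y] / Q[x, y])
    return rate
```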

Under some regularity assumptions on the one-parameter family of Markov chains, Anantharam et al. (1987b) in their Theorem 3.1 are able to establish the following asymptotic lower bound on the expected regret of any adaptive allocation rule ϕ which is uniformly good across all parameter configurations,

 Σ_{b=L+1}^K ( μ(θ_M) − μ(θ_b) ) / D(θ_b ∥ θ_M) ≤ liminf_{T→∞} R^ϕ_θ(T) / log T. (4)

A further discussion of this lower bound, as well as an alternative derivation, can be found in Appendix D.

The main goal of this work is to derive a finite-time analysis for an adaptive allocation rule, based on Kullback-Leibler divergence rate indices, that is asymptotically optimal. We do so for the one-parameter exponential family of Markov chains, which forms a generalization of the classic one-parameter exponential family generated by a probability distribution with finite support.

### 2.4 One-Parameter Exponential Family Of Markov Chains

Let S be a finite state space, f a nonconstant reward function on the state space, and P an irreducible transition probability matrix on S, with associated stationary distribution π. P will serve as the generator stochastic matrix of the family. Let μ = Σ_{x∈S} f(x) π(x) be the stationary mean of the Markov chain induced by P when f is applied. By tilting exponentially the transitions of P we are able to construct new transition matrices that realize a whole range of stationary means around μ and form the exponential family of stochastic matrices. Let θ ∈ R, and consider the matrix ~P_θ(x,y) = P(x,y) e^{θ f(y)}. Denote by ρ(θ) its spectral radius. According to the Perron-Frobenius theory, see Theorem 8.4.4 in the book of Horn and Johnson (2013), ρ(θ) is a simple eigenvalue of ~P_θ, called the Perron-Frobenius eigenvalue, and we can associate to it unique left and right eigenvectors u_θ and v_θ such that they are both positive and suitably normalized. Using them we define the member of the exponential family which corresponds to the natural parameter θ as,

 Pθ(x,y)=vθ(y)vθ(x)exp{θf(y)−Λ(θ)}P(x,y), (5)

where Λ(θ) = log ρ(θ) is the log-Perron-Frobenius eigenvalue. It can be easily seen that P_θ is indeed a stochastic matrix, and its stationary distribution π_θ is given by π_θ(x) = u_θ(x) v_θ(x), under the normalization Σ_{x∈S} u_θ(x) v_θ(x) = 1. The initial distribution associated to the parameter θ can be any distribution on S, since the KL-UCB adaptive allocation rule that we devise, and its guarantees, will be valid no matter the initial distributions.

Exponential families of Markov chains date back to the work of Miller (1961). For a short overview of one-parameter exponential families of Markov chains, as well as proofs of the following properties, we refer the reader to Section 2 in Moulos and Anantharam (2019). The log-Perron-Frobenius eigenvalue Λ(θ) is a convex analytic function on the real numbers, and through its derivative Λ̇(θ) we obtain the stationary mean of the Markov chain with transition matrix P_θ when f is applied, i.e. μ(θ) = Λ̇(θ). When f is not constant, the log-Perron-Frobenius eigenvalue Λ is strictly convex, and thus its derivative Λ̇ is strictly increasing and forms a bijection between the natural parameter space R and the mean parameter space M = Λ̇(R), which is a bounded open interval.
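The construction of P_θ via Equation 5 is directly computable with a dense eigensolver. The sketch below (helper names `stationary` and `tilt` are ours) builds P_θ from a generator P and checks numerically that the stationary mean of P_θ agrees with the derivative Λ̇(θ), as stated above.

```python
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible stochastic matrix."""
    evals, evecs = np.linalg.eig(P.T)
    w = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return w / w.sum()

def tilt(P, f, theta):
    """Member P_theta of the exponential family (Equation 5)."""
    f = np.asarray(f, dtype=float)
    Pt = P * np.exp(theta * f)[None, :]     # tilted matrix ~P_theta(x,y) = P(x,y) e^{theta f(y)}
    evals, evecs = np.linalg.eig(Pt)
    k = np.argmax(np.abs(evals))
    rho = np.real(evals[k])                 # Perron-Frobenius eigenvalue = spectral radius
    v = np.real(evecs[:, k])
    v = v / v[0]                            # positive right eigenvector v_theta
    P_theta = Pt * v[None, :] / v[:, None] / rho   # rows sum to 1 by ~P_theta v = rho v
    return P_theta, np.log(rho)             # (P_theta, Lambda(theta))

# Illustrative two-state generator and reward function.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = [0.0, 1.0]
P_theta, Lam = tilt(P, f, 0.5)
assert np.allclose(P_theta.sum(axis=1), 1.0)   # P_theta is stochastic

# mu(theta) = dLambda/dtheta: compare stationary mean with a finite difference.
h = 1e-5
mu = stationary(P_theta) @ np.asarray(f)
dLam = (tilt(P, f, 0.5 + h)[1] - tilt(P, f, 0.5 - h)[1]) / (2 * h)
assert abs(mu - dLam) < 1e-6
```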

The Kullback-Leibler divergence rate from (3), when instantiated for the exponential family of Markov chains, can be expressed as,

 D(θ∥λ)=Λ(λ)−Λ(θ)−˙Λ(θ)(λ−θ),

which is convex and differentiable. Since Λ̇ forms a bijection from the natural parameter space R to the mean parameter space M, with some abuse of notation we will write D(μ(θ) ∥ μ(λ)) for D(θ ∥ λ), where μ(θ), μ(λ) ∈ M. Furthermore, D(· ∥ ·) can be extended continuously to a function on cl(M) × M, where cl(M) denotes the closure of M. This can even further be extended to a convex function on R × R, by setting D(x ∥ y) = ∞ if x ∉ cl(M) or y ∉ cl(M). For fixed y, the function x ↦ D(x ∥ y) is decreasing for x ≤ y and increasing for x ≥ y. Similarly, for fixed x, the function y ↦ D(x ∥ y) is decreasing for y ≤ x and increasing for y ≥ x.

## 3 Concentration Lemmata for Markov Chains

In this section we present our concentration results for Markov chains. We start with a Chernoff bound which, remarkably, does not impose any conditions on the Markov chain other than irreducibility, which is in any case a mandatory requirement for the stationary mean to be well-defined.

###### Lemma 2 (Chernoff bound for irreducible Markov chains).

Let {X_n}_{n≥0} be an irreducible Markov chain over the finite state space S with transition probability matrix P, initial distribution q, and stationary distribution π. Let f be a nonconstant real-valued function on the state space. Denote by μ(0) = Σ_{x∈S} f(x) π(x) the stationary mean when f is applied, and by ¯Y_n = (f(X_1) + … + f(X_n))/n the empirical mean. Let A be a closed subset of the mean parameter space M. Then, for every μ ∈ A with μ ≥ μ(0),

 P( ¯Y_n ≥ μ ) ≤ C e^{−n D(μ(0) ∥ μ)},

where D(· ∥ ·) stands for the Kullback-Leibler divergence rate in the exponential family of stochastic matrices generated by P and f, and C is a positive constant depending only on the transition probability matrix P, the function f, and the closed set A.

###### Remark 1.

This bound is a variant of Theorem 1 in Moulos and Anantharam (2019), where the authors derive a Chernoff bound under some structural assumptions on the transition probability matrix P and the function f. In our Lemma 2 we derive a Chernoff bound without any such assumptions, relying though on the fact that μ lies in a closed subset A of the mean parameter space M.

Next we present an exponential martingale for Markov chains, which in turn leads to a maximal inequality.

###### Lemma 3 (Exponential martingale for Markov chains).

Let {X_n}_{n≥0} be a Markov chain over the finite state space S with an irreducible transition matrix P and initial distribution q. Let f be a nonconstant real-valued function on the state space. Fix θ ∈ R and define,

 Mθn=vθ(Xn)vθ(X0)exp{θ(f(X1)+…+f(Xn))−nΛ(θ)}. (6)

Then {M^θ_n}_{n≥0} is a martingale with respect to the filtration {F_n}_{n≥0}, where F_n is the σ-field generated by X_0, …, X_n.
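The martingale property of (6) reduces to the one-step identity E[v_θ(X_{n+1}) e^{θ f(X_{n+1})} | X_n = x] = e^{Λ(θ)} v_θ(x), which is exactly the right-eigenvector equation ~P_θ v_θ = ρ(θ) v_θ. A small numerical sketch of this identity (toy chain, our own variable names):

```python
import numpy as np

# Toy chain and reward function (illustrative values).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
f = np.array([0.0, 1.0])
theta = 0.7

Pt = P * np.exp(theta * f)[None, :]   # tilted matrix ~P_theta
evals, evecs = np.linalg.eig(Pt)
k = np.argmax(np.abs(evals))
rho = np.real(evals[k])               # Perron-Frobenius eigenvalue rho(theta)
Lam = np.log(rho)                     # Lambda(theta)
v = np.real(evecs[:, k])
v = v / v[0]                          # positive right eigenvector v_theta

# One-step identity behind E[M_{n+1} | F_n] = M_n:
#   sum_y P(x,y) e^{theta f(y)} v(y) = e^{Lambda(theta)} v(x) for every x,
# i.e. the eigenvector equation ~P_theta v = rho(theta) v.
assert np.allclose(Pt @ v, rho * v)

# Consequently E[M_1] = 1 for any initial distribution q.
q = np.array([0.5, 0.5])
M1 = sum(q[x] * P[x, y] * (v[y] / v[x]) * np.exp(theta * f[y] - Lam)
         for x in range(2) for y in range(2))
assert abs(M1 - 1.0) < 1e-12
```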

The following definition is the condition that we will use for our maximal inequality to apply.

###### Definition 1 (Doeblin’s type of condition).

Let P be a transition probability matrix on the finite state space S. For a nonempty set of states D ⊆ S, we say that P is D-Doeblin if the submatrix of P with rows and columns in D is irreducible, and for every x ∈ S there exists y ∈ D such that P(x,y) > 0.

###### Remark 2.

Our Definition 1 is inspired by the classic Doeblin Theorem, see Theorem 2.2.1 in Stroock (2014). Doeblin's Theorem states that, if the transition probability matrix P satisfies Doeblin's condition (namely there exist ε > 0 and a state y ∈ S such that for all x ∈ S we have P(x,y) ≥ ε), then P has a unique stationary distribution π, and for all initial distributions we have geometric convergence to stationarity. Doeblin's condition, according to our Definition 1, corresponds to P being {y}-Doeblin for some state y ∈ S.

###### Lemma 4 (Maximal inequality for irreducible Markov chains satisfying Doeblin’s condition).

Let {X_n}_{n≥0} be an irreducible Markov chain over the finite state space S with transition matrix P, initial distribution q, and stationary distribution π. Let f be a nonconstant real-valued function on the state space. Denote by μ(0) = Σ_{x∈S} f(x) π(x) the stationary mean when f is applied, and by ¯Y_k = (f(X_1) + … + f(X_k))/k the empirical mean. Assume that P is D-Doeblin, for D the set of states at which f attains its maximum. Then for all ε > 0 we have

 P( ∪_{k=1}^n { μ(0) ≥ ¯Y_k and k D(¯Y_k ∥ μ(0)) ≥ ε } ) ≤ C e ⌈ε log n⌉ e^{−ε},

where C is a positive constant depending only on the transition probability matrix P and the function f.

###### Remark 3.

If we only consider values of ε from a bounded subset of (0, ∞), then we don't need to assume that P is D-Doeblin, and the constant C will further depend on this bounded subset. But in the analysis of the KL-UCB adaptive allocation rule we will need to consider values of ε that increase with the time horizon T, therefore we have to impose the assumption that P is D-Doeblin, so that C has no dependence on T.

IID versions of this maximal inequality have found applicability not only in multi-armed bandit problems, but also in the case of context tree estimation, Garivier and Leonardi (2011), indicating that our Lemma 4 may be of interest for other applications as well.

## 4 The KL-UCB Adaptive Allocation Rule for Multiple Plays and Markovian Rewards

### 4.1 The Algorithm

For each arm a we define the empirical mean at the global time t as,

 ¯Ya(t)=(Ya1+…+YaNa(t))/Na(t), (7)

and its local time counterpart as,

 ¯Yan=(Ya1+…+Yan)/n,

with their link being ¯Y^a(t) = ¯Y^a_{N_a(t)}. At each round t, for each arm a, we calculate an upper confidence bound index,

 U_a(t) = sup{ μ ∈ M : D(¯Y^a(t) ∥ μ) ≤ g(t)/N_a(t) }, (8)

where g is an increasing function, and we denote its local time version by,

 U^a_n(t) = sup{ μ ∈ M : D(¯Y^a_n ∥ μ) ≤ g(t)/n }.

It is straightforward to check, using the definition of U^a_n(t), the following two relations,

 ¯Y^a_n ≤ U^a_n(t) for all n ≤ t, (9)

 U^a_n(t) is increasing in t ≥ n, for fixed n. (10)

Furthermore, in Appendix B we study the concentration properties of those upper confidence indices and of the sample means, using the concentration results for Markov chains from Section 3. The KL-UCB adaptive allocation rule, and its guarantees are presented below.

###### Proposition 1.

For each arm a and round t we have that ¯Y^a(t) ∈ cl(M), and so Algorithm 1 is well defined.
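Because x ↦ D(¯Y ∥ x) is increasing to the right of ¯Y, the supremum in Equation 8 can be computed by bisection. The sketch below uses the Bernoulli divergence as a stand-in for D; the names `bern_kl` and `kl_ucb_index`, and the Bernoulli instance itself, are our illustration, not the divergence of the Markovian setting.

```python
import math

def bern_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), with 0 log 0 = 0."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    out = 0.0
    if p > 0:
        out += p * math.log(p / q)
    if p < 1:
        out += (1 - p) * math.log((1 - p) / (1 - q))
    return out

def kl_ucb_index(y_bar, n, g_t, tol=1e-9):
    """U = sup{ mu in [y_bar, 1] : d(y_bar, mu) <= g(t)/n } (cf. Equation 8),
    found by bisection, using that d(y_bar, .) increases on [y_bar, 1]."""
    level = g_t / n
    lo, hi = y_bar, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bern_kl(y_bar, mid) <= level:
            lo = mid       # mid still satisfies the constraint: move right
        else:
            hi = mid       # constraint violated: move left
    return lo
```

As expected, the index equals the sample mean when the exploration level is zero, and grows with the level, so less-played arms receive larger upper confidence bounds.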

###### Theorem 1 (Markovian rewards and multiple plays: finite-time guarantees).

Let P be an irreducible transition probability matrix on the finite state space S, and f a real-valued reward function, such that P is D-Doeblin for D the set of states at which f attains its maximum. Assume that the K arms correspond to the parameter configuration θ = (θ_1, …, θ_K) of the exponential family of Markov chains, as described in Equation 5. Without loss of generality assume that the arms are ordered so that,

 μ(θ_1) ≥ … ≥ μ(θ_N) > μ(θ_{N+1}) = … = μ(θ_M) = … = μ(θ_L) > μ(θ_{L+1}) ≥ … ≥ μ(θ_K).

Fix ε > 0. The KL-UCB adaptive allocation rule for Markovian rewards and multiple plays, Algorithm 1, with the choice of the increasing function g made in the analysis, enjoys the following finite-time upper bound on the regret,

 R^ϕ_θ(T) ≤ Σ_{b=L+1}^K ( μ(θ_M) − μ(θ_b) ) / D(μ(θ_b) ∥ μ(θ_M) − ε) · log T + c_1 √(log T) + c_2 log log T + c_3 √(log log T) + c_4,

where c_1, c_2, c_3, c_4 are constants with respect to T, which are given more explicitly in the analysis.

###### Corollary 1 (Asymptotic optimality).

In the context of Theorem 1 the KL-UCB adaptive allocation rule, Algorithm 1, is asymptotically optimal, and,

 lim_{T→∞} R^ϕ_θ(T) / log T = Σ_{b=L+1}^K ( μ(θ_M) − μ(θ_b) ) / D(μ(θ_b) ∥ μ(θ_M)).

### 4.2 Sketch of the Analysis

Due to Lemma 1, it suffices to upper bound the proxy for the expected regret given in Equation 2. Therefore, we can break the analysis in two parts: upper bounding E^ϕ_θ[T − N_a(T)] for a ∈ {1, …, N}, and upper bounding E^ϕ_θ[N_b(T)] for b ∈ {L+1, …, K}.

For the first part, we show in Appendix C that the expected number of times that an arm a ∈ {1, …, N} hasn't been played is of the order of log log T.

###### Lemma 5.

For every arm a ∈ {1, …, N},

 Eϕθ[T−Na(T)]≤4eγ2NC⌈2logγlog1δ⌉logγloglogT+γr0+cγ2ηδK(1−η)(1−ηδ)3,

where γ, δ, η, r_0, c, and C are constants with respect to T.

For the second part, if b ∈ {L+1, …, K} and b ∈ ϕ_{t+1}, then there are three possibilities:

1. L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| ≥ ε for some a ∈ L_t,

2. L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| < ε for all a ∈ L_t, and b ∈ ϕ_{t+1},

3. L_t ∩ {L+1, …, K} ≠ ∅.

This means that,

 E^ϕ_θ[N_b(T)] ≤ M + Σ_{t=K}^{T−1} P^ϕ_θ( L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| ≥ ε for some a ∈ L_t )
 + Σ_{t=K}^{T−1} P^ϕ_θ( L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| < ε for all a ∈ L_t, and b ∈ ϕ_{t+1} )
 + Σ_{t=K}^{T−1} P^ϕ_θ( L_t ∩ {L+1, …, K} ≠ ∅ ),

and we handle each of those three terms separately.

We show that the first term is upper bounded by a constant with respect to T.

###### Lemma 6.
 Σ_{t=K}^{T−1} P^ϕ_θ( L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| ≥ ε for some a ∈ L_t ) ≤ c L η^{δK} / ( (1−η)(1−η^δ) ),

where c, η, and δ are constants with respect to T.

The second term is of the order of log T, and it is the term responsible for the overall logarithmic regret.

###### Lemma 7.
 Σ_{t=K}^{T−1} P^ϕ_θ( L_t ⊆ [L], and |¯Y^a(t) − μ(θ_a)| < ε for all a ∈ L_t, and b ∈ ϕ_{t+1} ) ≤ log T / D(μ(θ_b) ∥ μ(θ_M) − ε) + 2 √(2π σ²_{μ(θ_a), μ(θ_M)−ε}) · √( Ḋ(μ(θ_b) ∥ μ(θ_M) − ε)² / D(μ(θ_b) ∥ μ(θ_M) − ε)³ ) · ( √(log T) + √(3 log log T) ),

where all the quantities appearing in the bound, other than log T and log log T, are constants with respect to T.

Finally, we show that the third term is also upper bounded by a constant with respect to T.

###### Lemma 8.

The third term, Σ_{t=K}^{T−1} P^ϕ_θ( L_t ∩ {L+1, …, K} ≠ ∅ ), is upper bounded by a constant with respect to T.

This concludes the proof of Theorem 1, modulo the four bounds of this subsection which are established in Appendix C.

## 5 The KL-UCB Adaptive Allocation Rule for Multiple Plays and IID Rewards

As a byproduct of our work in Section 4 we further obtain a finite-time regret bound, which is asymptotically optimal, for the case of multiple plays and IID rewards, from an exponential family of probability densities.

We first review the notion of an exponential family of probability densities, for which the standard reference is Brown (1986). Let (X, B, ν) be a probability space. A one-parameter exponential family is a family {p_θ}_{θ∈Θ} of probability densities with respect to the measure ν on (X, B), of the form,

 pθ(x)=exp{θf(x)−Λ(θ)}h(x), (11)

where f is called the sufficient statistic and is a B-measurable function which is not ν-a.s. constant; h is called the carrier density and is a probability density with respect to ν; and Λ is called the log-Moment-Generating-Function, given by Λ(θ) = log ∫ e^{θ f(x)} h(x) ν(dx), which is finite for θ in the natural parameter space Θ. The log-MGF Λ is strictly convex, and its derivative Λ̇ forms a bijection between the natural parameters Θ and the mean parameters M = Λ̇(Θ). The Kullback-Leibler divergence between p_θ and p_λ, for θ, λ ∈ Θ, can be written as D(p_θ ∥ p_λ) = Λ(λ) − Λ(θ) − Λ̇(θ)(λ − θ).
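As a concrete instance of Equation 11, take ν the counting measure on {0,1}, carrier h ≡ 1/2, and f(x) = x; then Λ(θ) = log((1 + e^θ)/2), the mean parameter μ(θ) = Λ̇(θ) is the logistic sigmoid, and the Bregman expression above recovers the classic Bernoulli divergence. A sketch with our own helper names:

```python
import math

def Lam(theta):
    """Log-MGF for carrier h = uniform density on {0,1} and f(x) = x,
    an illustrative instance of Equation 11."""
    return math.log((1.0 + math.exp(theta)) / 2.0)

def mean(theta):
    """Mean parameter mu(theta) = dLam/dtheta, the logistic sigmoid."""
    return math.exp(theta) / (1.0 + math.exp(theta))

def kl(theta, lam):
    """D(p_theta || p_lam) = Lam(lam) - Lam(theta) - dLam(theta) (lam - theta)."""
    return Lam(lam) - Lam(theta) - mean(theta) * (lam - theta)

def bern_kl(p, q):
    """Classic Bernoulli divergence, for comparison."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```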

For this section, each arm a with parameter θ_a corresponds to the IID process {X^a_n}_{n≥1}, where each X^a_n has density p_{θ_a} with respect to ν, which gives rise to the IID reward process {Y^a_n}_{n≥1}, with Y^a_n = f(X^a_n).

###### Remark 4.

When there is a finite set S such that the carrier density h is supported on S, then the exponential family of probability densities in Equation 11 is just a special case of the exponential family of Markov chains in Equation 5, as can be seen by setting P(x,y) = h(y), for all x, y ∈ S. Then for all θ, the log-Perron-Frobenius eigenvalue coincides with the log-MGF, and π_θ = p_θ. Therefore, Theorem 1 already resolves the case of multiple plays and IID rewards from an exponential family of finitely supported densities.

###### Theorem 2 (IID rewards and multiple plays: finite-time guarantees).

Let (X, B, ν) be a probability space, f a B-measurable function, and h a density with respect to ν. Assume that the K arms correspond to the parameter configuration θ = (θ_1, …, θ_K) of the exponential family of probability densities, as described in Equation 11. Without loss of generality assume that the arms are ordered so that,

 μ(θ_1) ≥ … ≥ μ(θ_N) > μ(θ_{N+1}) = … = μ(θ_M) = … = μ(θ_L) > μ(θ_{L+1}) ≥ … ≥ μ(θ_K).

Fix ε > 0. The KL-UCB adaptive allocation rule for IID rewards and multiple plays, Algorithm 1, with the choice of the increasing function g made in the analysis, enjoys the following finite-time upper bound on the regret,

 R^ϕ_θ(T) ≤ Σ_{b=L+1}^K ( μ(θ_M) − μ(θ_b) ) / D(μ(θ_b) ∥ μ(θ_M) − ε) · log T + c_1 √(log T) + c_2 log log T + c_3 √(log log T) + c_4,

where c_1, c_2, c_3, c_4 are constants with respect to T.

Consequently, the KL-UCB adaptive allocation rule, Algorithm 1, is asymptotically optimal, and,

 lim_{T→∞} R^ϕ_θ(T) / log T = Σ_{b=L+1}^K ( μ(θ_M) − μ(θ_b) ) / D(μ(θ_b) ∥ μ(θ_M)).
###### Remark 5.

For the special case of single plays, M = 1, such a finite-time regret bound is derived in Cappé et al. (2013), and here we generalize it to multiple plays, M > 1. One striking difference between the case of single plays and the case of multiple plays is that in the latter one needs to further analyze the number of times that each of the best arms hasn't been played, as we do in Lemma 5, and this is inevitable due to the decomposition of the regret in Equation 2. In the case of single plays no such analysis is needed, due to the fact that there is only one best arm, and hence we can track the number of times it has been played by analyzing the number of times all the other arms have been played. But the KL-UCB adaptive allocation rule proposed in Cappé et al. (2013) uses only KL-UCB indices, which on their own are not enough to analyze the number of times each of the best arms hasn't been played. In order to achieve this, one needs to combine the KL-UCB indices, Equation 8, with the mean statistics, Equation 7, as performed in Algorithm 1. This indeed results in optimal regret guarantees for the case of multiple plays.

## Acknowledgements

We would like to thank Venkat Anantharam, Jim Pitman and Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.

## References

• Agrawal (1995) Agrawal, R. (1995). Sample mean based index policies with regret for the multi-armed bandit problem. Adv. in Appl. Probab., 27(4):1054–1078.
• Anantharam et al. (1987a) Anantharam, V., Varaiya, P., and Walrand, J. (1987a). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. I. I.I.D. rewards. IEEE Trans. Automat. Control, 32(11):968–976.
• Anantharam et al. (1987b) Anantharam, V., Varaiya, P., and Walrand, J. (1987b). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.
• Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time Analysis of the Multiarmed Bandit Problem. Mach. Learn., 47(2-3):235–256.
• Brown (1986) Brown, L. D. (1986). Fundamentals of statistical exponential families with applications in statistical decision theory, volume 9 of Institute of Mathematical Statistics Lecture Notes—Monograph Series. Institute of Mathematical Statistics, Hayward, CA.
• Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
• Cappé et al. (2013) Cappé, O., Garivier, A., Maillard, O.-A., Munos, R., and Stoltz, G. (2013). Kullback-Leibler upper confidence bounds for optimal sequential allocation. Ann. Statist., 41(3):1516–1541.
• Combes and Proutiere (2014) Combes, R. and Proutiere, A. (2014). Unimodal bandits without smoothness.
• Cover and Thomas (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
• Garivier and Cappé (2011) Garivier, A. and Cappé, O. (2011). The KL-UCB Algorithm for Bounded Stochastic Bandits and Beyond. In Kakade, S. M. and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 359–376, Budapest, Hungary. PMLR.
• Garivier and Leonardi (2011) Garivier, A. and Leonardi, F. (2011). Context tree selection: A unifying view. Stochastic Processes and their Applications, 121(11):2488 – 2506.
• Gillman (1993) Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.
• Glynn and Ormoneit (2002) Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic Markov chains. Statist. Probab. Lett., 56(2):143–146.
• Horn and Johnson (2013) Horn, R. A. and Johnson, C. R. (2013). Matrix analysis. Cambridge University Press, Cambridge, second edition.
• Kaufmann et al. (2016) Kaufmann, E., Cappé, O., and Garivier, A. (2016). On the Complexity of Best-arm Identification in Multi-armed Bandit Models. J. Mach. Learn. Res., 17(1):1–42.
• Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.
• Lattimore and Szepesvári (2019) Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms.
• Maillard et al. (2011) Maillard, O.-A., Munos, R., and Stoltz, G. (2011). A Finite-Time Analysis of Multi-armed Bandits Problems with Kullback-Leibler divergences. In Kakade, S. M. and von Luxburg, U., editors, Proceedings of the 24th Annual Conference on Learning Theory, volume 19 of Proceedings of Machine Learning Research, pages 497–514, Budapest, Hungary. PMLR.
• Miller (1961) Miller, H. D. (1961). A convexity property in the theory of random variables defined on a finite Markov chain. Ann. Math. Statist., 32:1260–1270.
• Moulos (2019) Moulos, V. (2019). Optimal Best Markovian Arm Identification with Fixed Confidence. In 33rd Annual Conference on Neural Information Processing Systems.
• Moulos (2020) Moulos, V. (2020). A Hoeffding Inequality for Finite State Markov Chains and its Applications to Markovian Bandits.
• Moulos and Anantharam (2019) Moulos, V. and Anantharam, V. (2019). Optimal Chernoff and Hoeffding Bounds for Finite State Markov Chains.
• Slivkins (2019) Slivkins, A. (2019). Introduction to Multi-Armed Bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286.
• Stroock (2014) Stroock, D. W. (2014). An introduction to Markov processes, volume 230 of Graduate Texts in Mathematics. Springer, Heidelberg, second edition.
• Tekin and Liu (2010) Tekin, C. and Liu, M. (2010). Online algorithms for the multi-armed bandit problem with Markovian rewards. In 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 1675–1682.
• Tekin and Liu (2012) Tekin, C. and Liu, M. (2012). Online learning of rested and restless bandits. IEEE Trans. Inf. Theor., 58(8):5588–5611.
• Ville (1939) Ville, J. (1939). Étude critique de la notion de collectif. NUMDAM.

## Appendix A Concentration Lemmata for Markov Chains

###### Proof of Lemma 2.

Using the standard exponential transform followed by Markov's inequality, we obtain that for any θ > 0,

 P( ¯Y_n ≥ μ ) ≤ P( e^{nθ¯Y_n} ≥ e^{nθμ} ) ≤ exp{ −n( θμ − (1/n) log E[ e^{θ(f(X_1)+…+f(X_n))} ] ) }.

We can upper bound the expectation in the following way,

 E[ e^{θ(f(X_1)+…+f(X_n))} ] = Σ_{x_0,…,x_n∈S} q(x_0) P(x_0,x_1) e^{θ f(x_1)} … P(x_{n−1},x_n) e^{θ f(x_n)}
 = Σ_{x_0,x_n∈S} q(x_0) ~P^n_θ(x_0,x_n)
 ≤ ( 1 / min_{x∈S} v_θ(x) ) Σ_{x_0,x_n∈S} q(x_0) ~P^n_θ(x_0,x_n) v_θ(x_n)
 = ( ρ(θ)^n / min_{x∈S} v_θ(x) ) Σ_{x_0∈S} q(x_0) v_θ(x_0)
 ≤ max_{x,y∈S} ( v_θ(y)/v_θ(x) ) ρ(θ)^n,

where in the last equality we used the fact that v_θ is a right Perron-Frobenius eigenvector of ~P_θ.

From those two we obtain,

 P( ¯Y_n ≥ μ ) ≤ max_{x,y∈S} ( v_θ(y)/v_θ(x) ) exp{ −n( θμ − Λ(θ) ) },

and if we plug in