# Optimal Best Markovian Arm Identification with Fixed Confidence

We give a complete characterization of the sampling complexity of best Markovian arm identification in one-parameter Markovian bandit models. We derive instance specific nonasymptotic and asymptotic lower bounds which generalize those of the IID setting. We analyze the Track-and-Stop strategy, initially proposed for the IID setting, and we prove that asymptotically it is at most a factor of four apart from the lower bound. Our one-parameter Markovian bandit model is based on the notion of an exponential family of stochastic matrices for which we establish many useful properties. For the analysis of the Track-and-Stop strategy we derive a novel concentration inequality for Markov chains that may be of interest in its own right.


## 1 Introduction

This paper is about optimal best Markovian arm identification with fixed confidence. There are K independent options which are referred to as arms. Each arm a is associated with a discrete time stochastic process, which is characterized by a parameter θ_a and is governed by the probability law P_{θ_a}. At each round we select one arm, without any prior knowledge of the statistics of the stochastic processes. The stochastic process that corresponds to the selected arm evolves by one time step, and we observe this evolution through a reward function, while the stochastic processes for the rest of the arms stay still. A confidence level δ ∈ (0, 1) is prescribed, and our goal is to identify the arm that corresponds to the process with the highest stationary mean with probability at least 1 − δ, and using as few samples as possible.

### 1.1 Contributions

In the work of Garivier and Kaufmann (2016) the discrete time stochastic process associated with each arm is assumed to be an IID process. Here we go one step further and we study more complicated dependent processes, which allow us to use more expressive models in the stochastic multi-armed bandits framework. More specifically we consider the case that each P_{θ_a} is the law of an irreducible finite state Markov chain associated with a stationary mean μ(θ_a). We establish a lower bound (Theorem 1) for the expected sample complexity, as well as an analysis of the Track-and-Stop strategy, proposed for the IID setting in Garivier and Kaufmann (2016), which shows (Theorem 3) that asymptotically the Track-and-Stop strategy in the Markovian dependence setting attains a sample complexity which is at most a factor of four apart from our asymptotic lower bound. Both our lower and upper bounds extend the work of Garivier and Kaufmann (2016) to the more complicated and more general Markovian dependence setting.

The abstract framework of multi-armed bandits has numerous applications in areas like clinical trials, ad placement, adaptive routing, resource allocation, gambling etc. For more context we refer the interested reader to the survey of Bubeck and Cesa-Bianchi (2012). Here we generalize this model to allow for the presence of Markovian dependence, enabling this way the practitioner to use richer and more expressive models for the various applications. In particular, Markovian dependence allows models where the distribution of the next sample depends on the sample just observed. This way one can model, for instance, the evolution of a rigged slot machine which, as soon as it generates a big reward for the gambler, changes its reward distribution to a distribution which is skewed towards smaller rewards.

Our key technical contributions stem from the large deviations theory for Markov chains, Miller (1961); Donsker and Varadhan (1975); Ellis (1984); Dembo and Zeitouni (1998). In particular we utilize the concept of an exponential family of stochastic matrices, first introduced in Miller (1961), in order to model our one-parameter Markovian bandit model. Many properties of the family are established which are then used for our analysis of the Track-and-Stop strategy. The most important one is an optimal concentration inequality for the empirical means of Markov chains (Theorem 2). We are able to establish this inequality for a large class of Markov chains, including those for which all the transitions have positive probability. Prior work on the topic, Gillman (1993); Dinwoodie (1995); Lezaud (1998); León and Perron (2004), fails to capture the optimal exponential decay, or introduces a polynomial prefactor, Davisson et al. (1981), as opposed to our constant prefactor. This result may be of independent interest due to the wide applicability of Markov chains in many aspects of learning theory, such as various aspects of reinforcement learning, Markov chain Monte Carlo and others.

### 1.2 Related Work

The cornerstone of stochastic multi-armed bandits is the seminal work of Lai and Robbins (1985). They considered IID processes with the objective being to maximize the expected value of the sum of the observed rewards, or equivalently to minimize the so called regret. In the same spirit Anantharam et al. (1987a, b) examine the generalization where one is allowed to collect multiple rewards at each time step, first in the case that the processes are IID, Anantharam et al. (1987a), and then in the case that the processes are irreducible and aperiodic Markov chains, Anantharam et al. (1987b). A survey of the regret minimization literature is contained in Bubeck and Cesa-Bianchi (2012).

An alternative objective is the one of identifying the process with the highest stationary mean as fast and as accurately as possible, notions which are made precise in Subsection 2.1. In the IID setting, Even-Dar et al. (2006) establish an elimination based algorithm in order to find an approximate best arm, and Mannor and Tsitsiklis (2003/04) provide a matching lower bound. Jamieson et al. (2014) propose an upper confidence strategy, inspired by the law of the iterated logarithm, for exact best arm identification given some fixed level of confidence. In the asymptotic high confidence regime, the problem is settled by the work of Garivier and Kaufmann (2016), who provide instance specific matching lower and upper bounds. For their upper bound they propose the Track-and-Stop strategy, which is further explored in the work of Kaufmann and Koolen (2018).

The earliest reference for the exponential family of stochastic matrices which is being used to model the Markovian arms can be found in the work of Miller (1961). Exponential families of stochastic matrices lie at the heart of the theory of large deviations for Markov processes, which was popularized with the pioneering work of Donsker and Varadhan (1975). A comprehensive overview of the theory can be found in the book Dembo and Zeitouni (1998). Naturally they also show up when one conditions on the second order empirical distribution of a Markov chain, see the work of Csiszár et al. (1987) about conditional limit theorems. A variant of the exponential family that we are going to discuss has been developed in the context of hypothesis testing in Nakagawa and Kanaya (1993). A more recent development by Nagaoka (2005) gives an information geometry perspective to this concept, and the work Hayashi and Watanabe (2016) examines parameter estimation for the exponential family. Our development of the exponential family of stochastic matrices tries to parallel the development of simple exponential families of probability distributions of Wainwright and Jordan (2008).

Regarding concentration inequalities for Markov chains one of the earliest works Davisson et al. (1981) is based on counting, and is able to capture the optimal rate of exponential decay dictated by the theory of large deviations, but has a suboptimal polynomial prefactor. More recent approaches follow the line of work started by Gillman (1993), who used matrix perturbation theory to derive a bound for reversible Markov chains. This bound attains a constant prefactor but with a suboptimal rate of exponential decay which depends on the spectral gap of the transition matrix. This work was later extended by Dinwoodie (1995); Lezaud (1998) but still with a sub-optimal rate. The work of León and Perron (2004) reduces the problem to a two state Markov chain, and attains the optimal rate only for the case of a two state Markov chain. Chung et al. (2012) obtain rates that depend on the mixing time of the chain rather than the spectral gap, but which are still suboptimal.

## 2 Problem Formulation

### 2.1 One-parameter family of Markov Chains

In order to model the problem we will use a one-parameter family of Markov chains on a finite state space S. Each Markov chain in the family corresponds to a parameter θ ∈ Θ, where Θ is the parameter space, and is completely characterized by an initial distribution q_θ, and a stochastic transition matrix P_θ, which satisfy the following conditions.

 P_θ is irreducible, for all θ ∈ Θ. (1)
 P_θ(x,y) > 0 ⇒ P_λ(x,y) > 0, for all θ, λ ∈ Θ, x, y ∈ S. (2)
 q_θ(x) > 0, for all x ∈ S, θ ∈ Θ. (3)
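As a small sanity check, the three conditions can be verified numerically for a toy two-member family; the helper names (`is_irreducible`, `same_support`) and the example matrices below are ours, not from the paper.

```python
# Minimal numerical check of conditions (1)-(3) on a toy 3-state family.
import numpy as np

def is_irreducible(P):
    # A finite chain is irreducible iff sum_{k=0}^{n-1} P^k is entrywise positive.
    n = P.shape[0]
    acc, reach = np.eye(n), np.eye(n)
    for _ in range(n - 1):
        acc = acc @ P
        reach = reach + acc
    return bool(np.all(reach > 0))

def same_support(P, Q):
    # Condition (2): P(x,y) > 0 iff Q(x,y) > 0.
    return np.array_equal(P > 0, Q > 0)

P0 = np.array([[0.0, 1.0, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 1.0, 0.0]])
P1 = np.array([[0.0, 1.0, 0.0],
               [0.2, 0.0, 0.8],
               [0.0, 1.0, 0.0]])
q = np.array([0.3, 0.3, 0.4])

assert is_irreducible(P0) and is_irreducible(P1)  # condition (1)
assert same_support(P0, P1)                       # condition (2)
assert np.all(q > 0)                              # condition (3)
```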

There are K Markovian arms with parameters θ = (θ_1, …, θ_K), and each arm a evolves as a Markov chain with parameter θ_a which we denote by {X^a_n}_{n≥0}. A non-constant real valued reward function f : S → R is applied at each state and produces the reward process given by Y^a_n = f(X^a_n). We can only observe the reward process but not the internal Markov chain. Note that the reward process is a function of the Markov chain and so in general it will have more complicated dependencies than the Markov chain; the reward process is a Markov chain if and only if f is injective. For each θ there is a unique stationary distribution π_θ associated with the stochastic matrix P_θ, due to (1). This allows us to define the stationary mean of the Markov chain corresponding to the parameter θ as μ(θ) = ∑_{x∈S} f(x) π_θ(x). We will assume that among the K Markovian arms there exists precisely one that possesses the highest stationary mean, and we will denote this arm by a*(θ), so in particular

 {a*(θ)} = argmax_{a∈[K]} μ(θ_a).

The set of all parameter configurations that possess a unique highest mean is denoted by

 Θ = { θ ∈ Θ^K : |argmax_{a∈[K]} μ(θ_a)| = 1 }.

The Kullback-Leibler divergence rate characterizes the sample complexity of the Markovian identification problem that we are about to study. For two Markov chains of the one-parameter family that are indexed by θ and λ respectively it is given by

 D(θ ∥ λ) = ∑_{x,y∈S} π_θ(x) P_θ(x,y) log( P_θ(x,y) / P_λ(x,y) ),

where we use the standard notational conventions 0 log 0 = 0 and 0 log(0/0) = 0. It is always nonnegative, D(θ ∥ λ) ≥ 0, with equality occurring if and only if P_θ = P_λ. Furthermore, D(θ ∥ λ) < ∞ due to (2).
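The divergence rate above is straightforward to compute numerically. The following sketch uses illustrative helper names (`stationary_dist`, `kl_rate`) of our own, and assumes the two chains share a common support, as guaranteed by condition (2).

```python
# Sketch: Kullback-Leibler divergence rate between two irreducible chains.
import numpy as np

def stationary_dist(P):
    # Left eigenvector of P for eigenvalue 1, normalized to a distribution.
    w, V = np.linalg.eig(P.T)
    pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
    return pi / pi.sum()

def kl_rate(P, Q):
    # D(P||Q) = sum_{x,y} pi_P(x) P(x,y) log(P(x,y) / Q(x,y)), with 0 log 0 = 0.
    pi = stationary_dist(P)
    mask = P > 0
    W = pi[:, None] * P           # pi(x) P(x,y)
    return float(np.sum(W[mask] * np.log(P[mask] / Q[mask])))

P = np.array([[0.9, 0.1], [0.2, 0.8]])
Q = np.array([[0.5, 0.5], [0.5, 0.5]])

assert np.allclose(stationary_dist(P), [2 / 3, 1 / 3])
assert kl_rate(P, P) == 0.0   # vanishes when the chains coincide
assert kl_rate(P, Q) > 0.0    # nonnegative in general
```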

With some abuse of notation we will also write D(P ∥ Q) for the Kullback-Leibler divergence between two probability measures P and Q on the same measurable space, which is defined as

 D(P ∥ Q) = E_P[ log(dP/dQ) ], if P ≪ Q, and D(P ∥ Q) = ∞, otherwise,

where P ≪ Q means that P is absolutely continuous with respect to Q, and in that case dP/dQ denotes the Radon-Nikodym derivative of P with respect to Q.

### 2.2 Best Markovian Arm Identification with Fixed Confidence

Let θ ∈ Θ be an unknown parameter configuration for the K Markovian arms. Let δ ∈ (0, 1) be a given confidence level. Our goal is to identify a*(θ) with probability at least 1 − δ, using as few samples as possible. At each time t we select a single arm A_t and we observe the next sample from the corresponding reward process, while all the other reward processes stay still. Let N_a(t) be the number of transitions of the Markovian arm a up to time t. Let F_t be the σ-field generated by our choices A_1, …, A_t and the observations collected up to time t. A sampling strategy, A_δ, is a triple consisting of:

• a sampling rule (A_t)_{t≥1}, which based on the past decisions and observations determines which arm A_{t+1} we should sample next, so A_{t+1} is F_t-measurable;

• a stopping rule τ_δ, which denotes the end of the data collection phase and is a stopping time with respect to the filtration {F_t}_{t≥1}, such that P^{A_δ}_θ(τ_δ < ∞) = 1 for all θ ∈ Θ;

• a decision rule ^a_{τ_δ}, which is F_{τ_δ}-measurable, and determines the arm that we estimate to be the best one.

Sampling strategies need to perform well across all possible parameter configurations in , therefore we need to restrict our strategies to a class of uniformly accurate strategies. This motivates the following standard definition.

###### Definition 1 (δ-PC).

Given a confidence level δ ∈ (0, 1), a sampling strategy A_δ is called δ-PC (Probably Correct) if,

 P^{A_δ}_λ( ^a_{τ_δ} ≠ a*(λ) ) ≤ δ, for all λ ∈ Θ.

Therefore our goal is to study the quantity,

 inf_{A_δ : δ-PC} E^{A_δ}_θ[τ_δ],

both in terms of finding a lower bound, i.e. establishing that no δ-PC strategy can have expected sample complexity less than our lower bound, and also in terms of finding an upper bound, i.e. a δ-PC strategy with very small expected sample complexity. We will do so in the high confidence regime, δ → 0, by establishing instance specific lower and upper bounds which differ just by a factor of four.

## 3 Lower Bound on the Sample Complexity

Deriving lower bounds in the multi-armed bandits setting is a task performed via change of measure arguments initially introduced by Lai and Robbins (1985). Those change of measure arguments capture the simple idea that in order to identify the best arm we should at least be able to differentiate between two bandit models that exhibit different best arms but are statistically similar. Fix θ ∈ Θ, and define the set of parameter configurations that exhibit as best arm an arm different than a*(θ) by

 Alt(θ)={λ∈Θ:a∗(λ)≠a∗(θ)}.

Then we consider an alternative parameter configuration λ ∈ Alt(θ), and we write the log-likelihood ratio of the observations up to time t as

 log( dP^{A_δ}_θ|_{F_t} / dP^{A_δ}_λ|_{F_t} ) = ∑_{a=1}^K I{N_a(t) ≥ 0} log( q_{θ_a}(X^a_0) / q_{λ_a}(X^a_0) ) + ∑_{a=1}^K ∑_{x,y} N_a(x,y,0,t) log( P_{θ_a}(x,y) / P_{λ_a}(x,y) ), (4)

where N_a(x,y,0,t) denotes the number of transitions from x to y performed by the Markov chain of arm a up to time t. The log-likelihood ratio enables us to perform changes of measure for fixed times t, and more generally for stopping times τ with respect to {F_t} which are P_θ- and P_λ-a.s. finite, through the following change of measure formula,

 P^{A_δ}_λ(E) = E^{A_δ}_θ[ I_E · dP^{A_δ}_λ|_{F_τ} / dP^{A_δ}_θ|_{F_τ} ], for any E ∈ F_τ. (5)
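For a single arm, the log-likelihood ratio in (4) depends on the observations only through the initial state and the transition counts N(x, y, 0, t). A minimal sketch (all names are illustrative):

```python
# Sketch: log-likelihood ratio of one Markovian arm from transition counts.
import numpy as np

def log_likelihood_ratio(path, q_th, P_th, q_la, P_la):
    llr = float(np.log(q_th[path[0]]) - np.log(q_la[path[0]]))
    n = P_th.shape[0]
    counts = np.zeros((n, n))
    for x, y in zip(path[:-1], path[1:]):
        counts[x, y] += 1.0                     # N(x, y, 0, t)
    mask = counts > 0                           # common support by condition (2)
    llr += float(np.sum(counts[mask] * (np.log(P_th[mask]) - np.log(P_la[mask]))))
    return llr

P_th = np.array([[0.9, 0.1], [0.5, 0.5]])
P_la = np.array([[0.6, 0.4], [0.5, 0.5]])
q = np.array([0.5, 0.5])

path = [0, 0, 1, 0]                             # transitions: 0->0, 0->1, 1->0
llr = log_likelihood_ratio(path, q, P_th, q, P_la)
expected = np.log(0.9 / 0.6) + np.log(0.1 / 0.4)   # the 1->0 term cancels
assert abs(llr - expected) < 1e-12
```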

In order to derive our lower bound we use a technique developed for the IID case by Garivier and Kaufmann (2016) which combines several changes of measure at once. To make this technique work in the Markovian setting we need the following inequality which we derive in Appendix A using a renewal argument for Markov chains.

###### Lemma 1.

Let θ, λ ∈ Θ^K be two parameter configurations. Let τ be a stopping time with respect to {F_t}, with E^{A_δ}_θ[τ] < ∞. Then

 D( P^{A_δ}_θ|_{F_τ} ∥ P^{A_δ}_λ|_{F_τ} ) ≤ ∑_{a=1}^K D(q_{θ_a} ∥ q_{λ_a}) + ∑_{a=1}^K ( E^{A_δ}_θ[N_a(τ)] + R_{θ_a} − 1 ) D(θ_a ∥ λ_a),

where R_{θ_a} denotes the mean return time of the Markov chain of arm a; the first summand is finite due to (3), and the second summand is finite due to (2).

Combining those ingredients with the data processing inequality we derive our instance specific lower bound for the Markovian bandit identification problem in Appendix A.

###### Theorem 1.

Assume that the one-parameter family of Markov chains on the finite state space S satisfies conditions (1), (2), and (3). Fix δ ∈ (0, 1), let f : S → R be a nonconstant reward function, let A_δ be a δ-PC sampling strategy, and fix a parameter configuration θ ∈ Θ. Then

 T*(θ) ≤ liminf_{δ→0} E^{A_δ}_θ[τ_δ] / log(1/δ),

where

 T*(θ)^{−1} = sup_{w∈Σ} inf_{λ∈Alt(θ)} ∑_{a=1}^K w_a D(θ_a ∥ λ_a),

and Σ denotes the set of all probability distributions on [K].

As noted in Garivier and Kaufmann (2016), the supremum in the definition of T*(θ) is actually attained uniquely, and therefore we can define w*(θ) as the unique maximizer.

## 4 One-Parameter Exponential Family of Markov Chains

### 4.1 Definition and Basic Properties

In this section we instantiate the abstract one-parameter family of Markov chains from Subsection 2.1 with the one-parameter exponential family of Markov chains. Given the finite state space S, and the nonconstant reward function f : S → R, we define M = max_{x∈S} f(x) and m = min_{x∈S} f(x). Based on f we construct two subsets of the state space, S_M and S_m, corresponding to states of maximum and minimum f-value respectively. Our goal is to create a family of Markov chains which can realize any stationary mean in the interval (m, M), which will be later used in order to model the Markovian arms. Towards this goal we use as a generator for our family an irreducible stochastic matrix P which satisfies the following properties.

 The submatrix of P with rows and columns in S_M is irreducible. (6)
 For every x ∈ S − S_M, there is a y ∈ S_M such that P(x,y) > 0. (7)
 The submatrix of P with rows and columns in S_m is irreducible. (8)
 For every x ∈ S − S_m, there is a y ∈ S_m such that P(x,y) > 0. (9)

For example, a positive stochastic matrix, i.e. one where all the transition probabilities are positive, satisfies all those properties. Note that in practice this can always be attained by substituting the zero transition probabilities with ε, where ε is some small constant, and renormalizing.

Our parameter space will be the whole real line, Θ = R. Given a parameter θ ∈ R, we pick an arbitrary initial distribution q_θ such that q_θ(x) > 0 for all x ∈ S, and we tilt exponentially all the transitions of P by constructing the matrix ~P_θ(x,y) = P(x,y) e^{θ f(y)}. Note that ~P_θ is not a stochastic matrix, but we can normalize it and turn it into a stochastic matrix by invoking the Perron-Frobenius theory. Let ρ(θ) be the spectral radius of ~P_θ. From the Perron-Frobenius theory we know that ρ(θ) is a simple eigenvalue of ~P_θ, called the Perron-Frobenius eigenvalue, associated with unique left and right eigenvectors u_θ and v_θ which are both positive and normalized so that ∑_x u_θ(x) v_θ(x) = 1; see for instance Theorem 8.4.4 in the book Horn and Johnson (2013). Let A(θ) = log ρ(θ) be the log-Perron-Frobenius eigenvalue, a quantity which plays a role similar to that of a log-moment-generating function. From ~P_θ we can construct an irreducible nonnegative matrix

 P_θ(x,y) = ~P_θ(x,y) v_θ(y) / ( ρ(θ) v_θ(x) ) = ( v_θ(y) / v_θ(x) ) e^{θ f(y) − A(θ)} P(x,y),

which is stochastic, since

 ∑_y P_θ(x,y) = ( 1 / (ρ(θ) v_θ(x)) ) ∑_y ~P_θ(x,y) v_θ(y) = 1.

In addition its stationary distribution is given by

 πθ(x)=uθ(x)vθ(x),

since

 ∑_x π_θ(x) P_θ(x,y) = ( v_θ(y) / ρ(θ) ) ∑_x u_θ(x) ~P_θ(x,y) = u_θ(y) v_θ(y) = π_θ(y).

Note that the generator stochastic matrix P is the member of the family that corresponds to θ = 0, i.e. P_0 = P, and A(0) = 0.
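The construction above can be carried out numerically with any eigenvalue routine. The sketch below tilts a positive two-state generator, checks that the normalization indeed produces a stochastic matrix, and that θ = 0 recovers P with A(0) = 0; the helper name `tilt` is ours, not from the paper.

```python
# Sketch of the exponential-family construction via Perron-Frobenius theory.
import numpy as np

def tilt(P, f, theta):
    # Return (P_theta, A(theta)) for ~P_theta(x,y) = P(x,y) e^{theta f(y)}.
    Pt = P * np.exp(theta * f)[None, :]
    w, V = np.linalg.eig(Pt)
    i = np.argmax(np.abs(w))               # Perron-Frobenius eigenvalue rho(theta)
    rho = np.real(w[i])
    v = np.real(V[:, i])
    v = v * np.sign(v[0])                  # make the right eigenvector positive
    P_theta = Pt * v[None, :] / (rho * v[:, None])
    return P_theta, float(np.log(rho))

P = np.array([[0.5, 0.5], [0.5, 0.5]])     # positive generator
f = np.array([0.0, 1.0])                   # reward function

P1, A1 = tilt(P, f, 1.0)
assert np.allclose(P1.sum(axis=1), 1.0)    # the normalization is stochastic
P0, A0 = tilt(P, f, 0.0)
assert np.allclose(P0, P) and abs(A0) < 1e-12   # theta = 0 recovers P, A(0) = 0
```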

The following lemma, whose proof is presented in Appendix B, suggests that the family can be reparametrized using the mean parameters μ(θ). More specifically, θ ↦ μ(θ) is a strictly increasing bijection between the set of canonical parameters R and the set of mean parameters (m, M). Therefore with some abuse of notation, we will write P_μ for P_{θ(μ)}, and D(μ_1 ∥ μ_2) for D(θ(μ_1) ∥ θ(μ_2)).

###### Lemma 2.

Let P be an irreducible stochastic matrix on a finite state space S which combined with a real-valued function f : S → R satisfies (6), (7), (8) and (9). Then the following properties hold true for the exponential family of stochastic matrices generated by P and f.

1. A(θ) and μ(θ) are analytic functions of θ on R.

2. A′(θ) = μ(θ), for all θ ∈ R.

3. μ(θ) is strictly increasing.

4. lim_{θ→−∞} μ(θ) = m, and lim_{θ→+∞} μ(θ) = M.
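The monotonicity of μ(θ) and its limits m and M can be observed numerically for a small positive generator; `tilted_mean` is an illustrative helper of ours that computes μ(θ) as the stationary mean of the tilted chain.

```python
# Numerical look at the reparametrization theta -> mu(theta).
import numpy as np

def tilted_mean(P, f, theta):
    Pt = P * np.exp(theta * f)[None, :]
    w, V = np.linalg.eig(Pt)
    i = np.argmax(np.abs(w))
    rho, v = np.real(w[i]), np.real(V[:, i])
    v = v * np.sign(v[0])
    P_theta = Pt * v[None, :] / (rho * v[:, None])
    # stationary distribution: left eigenvector of P_theta for eigenvalue 1
    w2, V2 = np.linalg.eig(P_theta.T)
    pi = np.real(V2[:, np.argmin(np.abs(w2 - 1.0))])
    pi = pi / pi.sum()
    return float(pi @ f)

P = np.array([[0.5, 0.5], [0.5, 0.5]])
f = np.array([0.0, 1.0])

means = [tilted_mean(P, f, th) for th in (-8.0, -1.0, 0.0, 1.0, 8.0)]
assert all(a < b for a, b in zip(means, means[1:]))   # strictly increasing
assert means[0] < 0.01 and means[-1] > 0.99           # approaches m = 0, M = 1
```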

### 4.2 Concentration for Markov Chains

For a Markov chain {X_n}_{n≥0} driven by an irreducible transition matrix and an initial distribution, the large deviations theory, Miller (1961); Donsker and Varadhan (1975); Ellis (1984); Dembo and Zeitouni (1998), suggests that the probability of the large deviation event {f(X_1) + … + f(X_n) ≥ nμ}, when μ is greater than or equal to the stationary mean, asymptotically is an exponential decay with the rate of the decay given by a Kullback-Leibler divergence rate. In particular Theorem 3.1.2 from Dembo and Zeitouni (1998), applied to the member of the family with parameter θ = 0, can be written in our context as

 lim_{n→∞} (1/n) log P_0( f(X_1) + … + f(X_n) ≥ nμ ) = −A*(μ), for any μ ≥ μ(0),

where A*(μ) = sup_{θ∈R} { θμ − A(θ) } is the convex conjugate of the log-Perron-Frobenius eigenvalue A(θ), and represents a Kullback-Leibler divergence rate as we illustrate in Lemma 10.
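The convex conjugate A*(μ) can be approximated by a crude grid search over θ. In the sketch below the generator has identical rows, so the rewards f(X_n) are IID Bernoulli(1/2) and A*(μ) must coincide with the binary Kullback-Leibler divergence, which gives a convenient check; all helper names are ours.

```python
# Grid-search sketch of A*(mu) = sup_theta { theta * mu - A(theta) }.
import numpy as np

def A(P, f, theta):
    # log-Perron-Frobenius eigenvalue of ~P_theta(x,y) = P(x,y) e^{theta f(y)}
    Pt = P * np.exp(theta * f)[None, :]
    return float(np.log(np.max(np.abs(np.linalg.eigvals(Pt)))))

def conjugate(P, f, mu):
    # crude grid search: a sketch, not a robust solver
    thetas = np.linspace(-20.0, 20.0, 4001)
    return max(th * mu - A(P, f, th) for th in thetas)

P = np.array([[0.5, 0.5], [0.5, 0.5]])   # identical rows: IID Bernoulli(1/2) rewards
f = np.array([0.0, 1.0])

mu = 0.8
binary_kl = mu * np.log(2 * mu) + (1 - mu) * np.log(2 * (1 - mu))
assert abs(conjugate(P, f, mu) - binary_kl) < 1e-3
```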

In the following theorem we present a concentration inequality for Markov chains which attains the rate of exponential decay prescribed by the large deviations theory, as well as a constant prefactor which is independent of n.

###### Theorem 2.

Let S be a finite state space, and let P be an irreducible stochastic matrix on S, which combined with a function f : S → R satisfies (6), (7), (8), and (9). Fix θ ∈ R, and let {X_n}_{n≥0} be a Markov chain on S driven by P_θ, the stochastic matrix from the exponential family which corresponds to the parameter θ and has stationary mean μ(θ). Then

 P_θ( f(X_1) + … + f(X_n) ≥ nμ ) ≤ C² e^{−n D(μ ∥ μ(θ))}, for μ ∈ [μ(θ), M],

where C is a constant depending only on the generator stochastic matrix P and the function f. In particular, if P is a positive stochastic matrix then C can be given explicitly.

We note that in the special case that the process is an IID process the constant C can be taken to be 1, and thus Theorem 2 generalizes the classic Cramér-Chernoff bound, Chernoff (1952). Observe also that Theorem 2 has a straightforward counterpart for the lower tail as well.

Moreover our inequality is optimal up to the constant prefactor, since the exponential decay is unimprovable due to the large deviations theory, while with respect to the prefactor we can not expect anything better than a constant because otherwise we would contradict the central limit theorem for Markov chains. In particular, when our conditions on P and f are met, our bound dominates similar bounds given by Davisson et al. (1981); Gillman (1993); Dinwoodie (1995); Lezaud (1998); León and Perron (2004).

We give a proof of Theorem 2 in Appendix C, where the main techniques involved are a uniform upper bound on the ratio of the entries of the right Perron-Frobenius eigenvector, as well as an approximation of the log-Perron-Frobenius eigenvalue using the log-moment-generating function.

## 5 Upper Bound on the Sample Complexity: the (α,δ)-Track-and-Stop Strategy

The (α,δ)-Track-and-Stop strategy, which was proposed in Garivier and Kaufmann (2016) in order to tackle the IID setting, tries to track the optimal weights w*(θ). In the sequel we will also write w*(μ), with μ = (μ(θ_1), …, μ(θ_K)), to denote w*(θ). Not having access to w*(θ), the (α,δ)-Track-and-Stop strategy tries to approximate it using sample means. Let ^μ_a(t) be the sample mean of arm a when t samples have been observed overall, where the very first sample from each Markov chain is excluded from the calculation, i.e.

 ^μ_a(t) = ( 1 / N_a(t) ) ∑_{s=1}^{N_a(t)} Y^a_s.

By imposing sufficient exploration, the law of large numbers for Markov chains will kick in and the sample means ^μ_a(t) will almost surely converge to the true means μ(θ_a), as t → ∞.

We proceed by briefly describing the three components of the -Track-and-Stop strategy.

### 5.1 Sampling Rule: Tracking the Optimal Proportions

For initialization reasons the first K samples that we are going to observe are one from each arm. After that, for t ≥ K we let U_t denote the set of arms that are currently under-sampled, and we follow the tracking rule:

 A_{t+1} ∈ argmin_{a∈U_t} N_a(t), if U_t ≠ ∅ (forced exploration),
 A_{t+1} ∈ argmax_{a=1,…,K} { w*_a(^μ(t)) − N_a(t)/t }, otherwise (direct tracking).

The forced exploration step is there to ensure that N_a(t) → ∞ as t → ∞. Then the continuity of w*(·), combined with the direct tracking step, guarantees that almost surely the frequencies N_a(t)/t converge to the optimal weights w*_a(θ), for all a ∈ [K].
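The two branches of the rule can be sketched as follows, with the optimal-weight oracle stubbed out by a fixed vector. The forced-exploration threshold √t − K/2 below is the one used in the IID Track-and-Stop of Garivier and Kaufmann (2016), and all helper names are ours.

```python
# Sketch of the tracking sampling rule (forced exploration + direct tracking).
import numpy as np

def next_arm(N, t, w):
    K = len(N)
    threshold = max(np.sqrt(t) - K / 2.0, 0.0)
    undersampled = [a for a in range(K) if N[a] < threshold]
    if undersampled:                              # forced exploration
        return min(undersampled, key=lambda a: N[a])
    return int(np.argmax(w - N / t))              # direct tracking

w = np.array([0.5, 0.3, 0.2])                     # stubbed w*(mu_hat(t))
N = np.zeros(3)
for t in range(1, 5001):
    N[next_arm(N, t, w)] += 1

# the empirical frequencies N_a(t)/t drift toward the tracked weights
assert np.allclose(N / N.sum(), w, atol=0.05)
```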

### 5.2 Stopping Rule: (α,δ)-Chernoff’s Stopping Rule

For the stopping rule we will need the following statistics. For any two distinct arms a and b, if ^μ_a(N_a(t)) ≥ ^μ_b(N_b(t)), we define

 Z_{a,b}(t) = ( N_a(t) / (N_a(t)+N_b(t)) ) D( ^μ_a(N_a(t)) ∥ ^μ_{a,b}(N_a(t),N_b(t)) ) + ( N_b(t) / (N_a(t)+N_b(t)) ) D( ^μ_b(N_b(t)) ∥ ^μ_{a,b}(N_a(t),N_b(t)) ),

while if ^μ_a(N_a(t)) < ^μ_b(N_b(t)), we define Z_{a,b}(t) = −Z_{b,a}(t), where

 ^μ_{a,b}(N_a(t),N_b(t)) = ( N_a(t) / (N_a(t)+N_b(t)) ) ^μ_a(N_a(t)) + ( N_b(t) / (N_a(t)+N_b(t)) ) ^μ_b(N_b(t)).

Note that the statistics Z_{a,b}(t) do not arise as the closed form solutions of the Generalized Likelihood Ratio statistics for Markov chains, as is the case in the IID bandits setting.
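The structure of the statistics is easy to mirror in code. Since the divergence D(·∥·) between mean parameters depends on the underlying exponential family, the sketch below plugs in the unit-variance Gaussian divergence (x − y)²/2 purely as a stand-in; the shape of Z_{a,b} is the point, not the particular divergence.

```python
# Sketch of the stopping statistic Z_{a,b}(t) with a stand-in divergence.
import numpy as np

def D(x, y):
    return (x - y) ** 2 / 2.0   # stand-in divergence, not the Markovian one

def Z(mu_a, N_a, mu_b, N_b):
    mu_ab = (N_a * mu_a + N_b * mu_b) / (N_a + N_b)          # weighted mean
    val = (N_a * D(mu_a, mu_ab) + N_b * D(mu_b, mu_ab)) / (N_a + N_b)
    return val if mu_a >= mu_b else -val

assert Z(1.0, 100, 0.5, 100) > 0                              # a looks better than b
assert abs(Z(0.5, 100, 1.0, 100) + Z(1.0, 100, 0.5, 100)) < 1e-12   # Z_{a,b} = -Z_{b,a}
assert abs(Z(0.7, 50, 0.7, 80)) < 1e-12                       # equal sample means give zero
```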

For a confidence level δ, and a convergence parameter α > 1, we define the (α,δ)-Chernoff stopping rule following Garivier and Kaufmann (2016)

 τ_{α,δ} = inf{ t ∈ Z_{>0} : ∃a ∈ {1,…,K} ∀b ≠ a, Z_{a,b}(t) > (0 ∨ β_{α,δ}(t)) },

where β_{α,δ}(t) is a threshold function involving the constant C from Lemma 11. In the special case that P is a positive stochastic matrix we can set C explicitly. It is important to notice that the constant C does not depend on the bandit instance θ or the confidence level δ, but only on the generator stochastic matrix P and the reward function f. In other words it is a characteristic of the exponential family of Markov chains and not of the particular bandit instance, θ, under consideration.

### 5.3 Decision Rule: Best Sample Mean

For a fixed arm a it is clear that Z_{a,b}(t) ≥ 0 for all b ≠ a, if and only if ^μ_a(N_a(t)) ≥ ^μ_b(N_b(t)) for all b ≠ a. Hence the following simple decision rule is well defined when used in conjunction with the (α,δ)-Chernoff stopping rule:

 {^a_{τ_{α,δ}}} = argmax_{a=1,…,K} ^μ_a(N_a(τ_{α,δ})).

### 5.4 Sample Complexity Analysis

In this section we establish that the (α,δ)-Track-and-Stop strategy is δ-PC, and we upper bound its expected sample complexity. In order to do this we use our Markovian concentration bound, Theorem 2.

We first use it in order to establish the following uniform deviation bound.

###### Lemma 3.

Let δ ∈ (0, 1), α > 1, and θ ∈ Θ. Let A_δ be a sampling strategy that uses an arbitrary sampling rule, the (α,δ)-Chernoff stopping rule and the best sample mean decision rule. Then, for any arm a,

 P^{A_δ}_θ( ∃t ∈ Z_{>0} : N_a(t) D( ^μ_a(N_a(t)) ∥ μ_a ) ≥ β_{α,δ}(t)/2 ) ≤ δ/K.

With this in our possession we are able to prove in Appendix D that the -Track-and-Stop strategy is -PC.

###### Proposition 1.

Let δ ∈ (0, 1), and α > 1. The (α,δ)-Track-and-Stop strategy is δ-PC.

Finally, we obtain that in the high confidence regime, δ → 0, the (α,δ)-Track-and-Stop strategy has a sample complexity which is at most 4α times the asymptotic lower bound that we established in Theorem 1.

###### Theorem 3.

Let δ ∈ (0, 1), and α > 1. The (α,δ)-Track-and-Stop strategy, denoted here by A_{α,δ}, has its asymptotic expected sample complexity upper bounded by,

 limsup_{δ→0} E^{A_{α,δ}}_θ[τ_{α,δ}] / log(1/δ) ≤ 4α T*(θ).

## Acknowledgements

We would like to thank Venkat Anantharam, Jim Pitman and Satish Rao for many helpful discussions. This research was supported in part by the NSF grant CCF-1816861.

## References

• Anantharam et al. (1987a) Anantharam, V., Varaiya, P., and Walrand, J. (1987a). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. I. I.I.D. rewards. IEEE Trans. Automat. Control, 32(11):968–976.
• Anantharam et al. (1987b) Anantharam, V., Varaiya, P., and Walrand, J. (1987b). Asymptotically efficient allocation rules for the multiarmed bandit problem with multiple plays. II. Markovian rewards. IEEE Trans. Automat. Control, 32(11):977–982.
• Bubeck and Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning, 5(1):1–122.
• Chernoff (1952) Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Statistics, 23:493–507.
• Chung et al. (2012) Chung, K.-M., Lam, H., Liu, Z., and Mitzenmacher, M. (2012). Chernoff-Hoeffding Bounds for Markov Chains: Generalized and Simplified. In STACS.
• Cover and Thomas (2006) Cover, T. M. and Thomas, J. A. (2006). Elements of information theory. Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
• Csiszár et al. (1987) Csiszár, I., Cover, T. M., and Choi, B. S. (1987). Conditional limit theorems under Markov conditioning. IEEE Trans. Inform. Theory, 33(6):788–801.
• Davisson et al. (1981) Davisson, L. D., Longo, G., and Sgarro, A. (1981). The error exponent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans. Inform. Theory, 27(4):431–438.
• Dembo and Zeitouni (1998) Dembo, A. and Zeitouni, O. (1998). Large deviations techniques and applications, volume 38 of Applications of Mathematics (New York). Springer-Verlag, New York, second edition.
• Dinwoodie (1995) Dinwoodie, I. H. (1995). A probability inequality for the occupation measure of a reversible Markov chain. Ann. Appl. Probab., 5(1):37–43.
• Donsker and Varadhan (1975) Donsker, M. D. and Varadhan, S. R. S. (1975). Asymptotic evaluation of certain Markov process expectations for large time. I. II. Comm. Pure Appl. Math., 28:1–47; ibid. 28 (1975), 279–301.
• Durrett (2010) Durrett, R. (2010). Probability: theory and examples, volume 31 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, fourth edition.
• Ellis (1984) Ellis, R. S. (1984). Large deviations for a general class of random vectors. Ann. Probab., 12(1):1–12.
• Even-Dar et al. (2006) Even-Dar, E., Mannor, S., and Mansour, Y. (2006). Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. J. Mach. Learn. Res., 7:1079–1105.
• Garivier and Kaufmann (2016) Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. Proceedings of the 29th Conference On Learning Theory, 49:1–30.
• Gillman (1993) Gillman, D. (1993). A Chernoff bound for random walks on expander graphs. In 34th Annual Symposium on Foundations of Computer Science (Palo Alto, CA, 1993), pages 680–691. IEEE Comput. Soc. Press, Los Alamitos, CA.
• Hayashi and Watanabe (2016) Hayashi, M. and Watanabe, S. (2016). Information geometry approach to parameter estimation in Markov chains. Ann. Statist., 44(4):1495–1535.
• Horn and Johnson (2013) Horn, R. A. and Johnson, C. R. (2013). Matrix analysis. Cambridge University Press, Cambridge, second edition.
• Jamieson et al. (2014) Jamieson, K. G., Malloy, M., Nowak, R. D., and Bubeck, S. (2014). lil’ UCB : An Optimal Exploration Algorithm for Multi-Armed Bandits. In COLT, volume 35 of JMLR Workshop and Conference Proceedings, pages 423–439.
• Kaufmann and Koolen (2018) Kaufmann, E. and Koolen, W. (2018). Mixture martingales revisited with applications to sequential tests and confidence intervals.
• Lai and Robbins (1985) Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Adv. in Appl. Math., 6(1):4–22.
• Lax (2007) Lax, P. D. (2007). Linear algebra and its applications. Pure and Applied Mathematics (Hoboken). Wiley-Interscience [John Wiley & Sons], Hoboken, NJ, second edition.
• León and Perron (2004) León, C. A. and Perron, F. (2004). Optimal Hoeffding bounds for discrete reversible Markov chains. Ann. Appl. Probab., 14(2):958–970.
• Lezaud (1998) Lezaud, P. (1998). Chernoff-type bound for finite Markov chains. Ann. Appl. Probab., 8(3):849–867.
• Mannor and Tsitsiklis (2003/04) Mannor, S. and Tsitsiklis, J. N. (2003/04). The sample complexity of exploration in the multi-armed bandit problem. J. Mach. Learn. Res., 5:623–648.
• Miller (1961) Miller, H. D. (1961). A convexity property in the theory of random variables defined on a finite Markov chain. Ann. Math. Statist., 32:1260–1270.
• Nagaoka (2005) Nagaoka, H. (2005). The exponential family of Markov chains and its information geometry. In Proceedings of The 28th Symposium on Information Theory and Its Applications (SITA2005), pages 1091–1095, Okinawa, Japan.
• Nakagawa and Kanaya (1993) Nakagawa, K. and Kanaya, F. (1993). On the converse theorem in statistical hypothesis testing for Markov chains. IEEE Trans. Inform. Theory, 39(2):629–633.
• Ortega (1990) Ortega, J. M. (1990). Numerical analysis, volume 3 of Classics in Applied Mathematics. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA, second edition. A second course.
• Wainwright and Jordan (2008) Wainwright, M. J. and Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn., 1(1-2):1–305.

## Appendix A Lower Bound on the Sample Complexity

We first prove Lemma 1, for which we will apply a renewal argument. Using the strong Markov property we can derive the following standard decomposition of a Markov chain in IID blocks; see Durrett (2010).

###### Fact 1.

Let {X_n}_{n≥0} be an irreducible Markov chain with initial distribution q, and transition matrix P. Define recursively the k-th return time to the initial state as

 τ_0 = 0, and τ_k = inf{ n > τ_{k−1} : X_n = X_0 }, for k ≥ 1,

and for k ≥ 1 let r_k = τ_k − τ_{k−1} be the residual time. Those random times partition the Markov chain in a sequence of IID random blocks given by

 v_k = ( r_k, X_{τ_{k−1}}, …, X_{τ_k − 1} ), for k ≥ 1.

Let N(x,n,m) be the number of visits to x that occurred from time n up to time m, and let N(x,y,n,m) be the number of transitions from x to y that occurred from time n up to time m:

 N(x,n,m) = ∑_{s=n}^{m−1} 1{X_s = x}, N(x,y,n,m) = ∑_{s=n}^{m−1} 1{X_s = x, X_{s+1} = y}.

It is well known, see Durrett (2010), that the stationary distribution of the Markov chain is given by

 π(x) = E_{(q,P)}[N(x,0,τ_1)] / E_{(q,P)}[τ_1], for any x ∈ S. (10)
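Identity (10) can be checked by direct simulation: average the visit counts over many IID excursions and compare with the stationary distribution. The two-state example chain and the seed below are ours, chosen purely for illustration.

```python
# Monte Carlo check of the renewal identity (10) on a two-state chain:
# pi(x) = E[visits to x before the first return] / E[return time].
import numpy as np

rng = np.random.default_rng(0)
P = np.array([[0.9, 0.1], [0.5, 0.5]])
pi = np.array([5 / 6, 1 / 6])      # stationary distribution: solves pi P = pi
q = np.array([0.5, 0.5])           # initial distribution

visits = np.zeros(2)
total_time = 0.0
for _ in range(10000):
    x0 = rng.choice(2, p=q)
    x, t = x0, 0
    while True:                    # run one block until the first return to x0
        visits[x] += 1
        x = rng.choice(2, p=P[x])
        t += 1
        if x == x0:
            break
    total_time += t

assert np.allclose(visits / total_time, pi, atol=0.02)
```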

In the following lemma we establish a similar relation for the invariant distribution over pairs of the Markov chain.

###### Lemma 4.
 π(x)P(x,y) = E_{(q,P)}[N(x,y,0,τ_1)] / E_{(q,P)}[τ_1], for any x, y ∈ S.
###### Proof.

Using (10) it is enough to show that for any initial state x_0,

 E_{(x_0,P)}[N(x,0,τ_1)] P(x,y) = E_{(x_0,P)}[N(x,y,0,τ_1)],

or equivalently that,

 E_{(x_0,P)} ∑_{n=0}^{τ_1−1} 1{X_n = x} P(x,y) = E_{(x_0,P)} ∑_{n=0}^{τ_1−1} 1{X_n = x, X_{n+1} = y}.

Conditioning on the possible values of τ_1, and using Fubini's Theorem we obtain

 E_{(x_0,P)} ∑_{n=0}^{τ_1−1} 1{X_n = x} P(x,y)
  = ∑_{t=1}^∞ P_{x_0}(τ_1 = t) ∑_{n=0}^{t−1} P_{(x_0,P)}(X_n = x ∣ τ_1 = t) P(x,y)
  = ∑_{n=0}^∞ ∑_{t=n+1}^∞ P_{(x_0,P)}(X_n = x, τ_1 = t) P(x,y)
  = ∑_{n=0}^∞ P_{(x_0,P)}(X_n = x, τ_1 > n) P(x,y)
  = ∑_{n=0}^∞ P_{(x_0,P)}(X_n = x, X_{n+1} = y) P_{(x_0,P)}(τ_1 > n ∣ X_n = x)
  = ∑_{n=0}^∞ P_{(x_0,P)}(X_n = x, X_{n+1} = y, τ_1 > n)
  = E_{(x_0,P)} ∑_{n=0}^{τ_1−1} 1{X_n = x, X_{n+1} = y},

where the second to last equality holds true due to the reversed Markov property

 P(x0,P)(τ1>n∣Xn=x,Xn+1=y)=P(x0,P)(τ1>n∣Xn=x).

The following Lemma, which is a variant of Lemma 2.1 in Anantharam et al. (1987b), is the place where we use the IID block structure of the Markov chain.

###### Lemma 5.

Define the mean return time of the Markov chain with initial distribution q and irreducible transition matrix P by

 R = E_{(q,P)}[ inf{ n > 0 : X_n = X_0 } ] < ∞.

Let F_t be the σ-field generated by X_0, …, X_t. Let τ be a stopping time with respect to {F_t}_{t≥0}, with E_{(q,P)}[τ] < ∞. Then

 E_{(q,P)}[N(x,y,0,τ)] ≤ π(x)P(x,y)( E_{(q,P)}[τ] + R − 1 ), for all x, y ∈ S.
###### Proof.

Using the k-th return times from Fact 1 we decompose N(x,y,0,τ_k) into IID summands

 N(x,y,0,τ_k) = ∑_{i=0}^{k−1} N(x,y,τ_i,τ_{i+1}).

Now let κ = inf{ k ≥ 1 : τ_k ≥ τ }, so that τ_κ is the first return time to the initial state after or at time τ. By the definition of κ we have that

 τ_κ − τ ≤ τ_κ − τ_{κ−1} − 1.

Taking expectations we obtain

 E_{(q,P)}[τ_κ] − E_{(q,P)}[τ] ≤ R − 1,

which also gives that

 E(q,P)[τκ]≤E(q,P)[τ]+R−1<∞.

This allows us to use Wald’s identity, followed by Lemma 4, followed by Wald’s identity again, in order to get

 E_{(q,P)}[N(x,y,0,τ_κ)] = E_{(q,P)} ∑_{i=0}^{κ−1} N(x,y,τ_i,τ_{i+1})
  = E_{(q,P)}[N(x,y,0,τ_1)] E_{(q,P)}[κ]
  = π(x)P(x,y) E_{(q,P)}[τ_1] E_{(q,P)}[κ]
  = π(x)P(x,y) E_{(q,P)}[τ_κ].

Therefore,

 E(q,P)N(x,y,0,