The classical single player multi-armed bandit problem initially proposed by Robbins  is one of sequential decision making under uncertainty. A player (also referred to as a learner) has to decide between alternatives (or arms) when allowed to select only one at each time. Each arm returns a reward when selected. The arm selected at any time is a function only of the past choices of arms and the corresponding rewards obtained. In this setting, the goal of the learner is to maximise the long-term average reward by exercising a balanced trade-off between the decision to exploit an arm that, based on the current state of knowledge, appears to yield high rewards, versus the decision to explore other arms which could potentially yield high rewards in the future. Thus, every sequential arm selection strategy for the multi-armed bandit problem involves an exploration versus exploitation trade-off.
I-A Prior Work and Motivation
While the work of Robbins considers the setting in which the successive rewards received from any given arm are independent and identically distributed (iid) according to a fixed but unknown distribution, extensions to this work that consider more general structures on the rewards have appeared in the literature. The seminal paper of Gittins  considers the setting in which the rewards from each arm constitute a time homogeneous Markov process with a known transition law. Furthermore, at any given time, only the arm chosen by the learner exhibits a state transition, while rest of the arms remain frozen (or rested) at their previously observed states. Such a structure on the arms is not restrictive, and closely models a host of real-life scenarios, as outlined in [3, Chapter 1]. The central problem in  is one of maximising the long-term average discounted reward obtained by sequentially selecting the arms. For this problem, Gittins proposed and demonstrated the optimality of a simple index-based policy that involves constructing an index for each arm using the knowledge of the transition laws of the arms, and selecting at any given time the arm with the largest index.
Agarwal et al.  consider a similar setting as Gittins’, where the rewards from each arm are Markovian. However, an important feature of the work in  is that, unlike Gittins , the authors do not assume the knowledge of the transition laws of the arms. They then provide a strengthening of the results of . Since the setting and the results of Agarwal et al. will be of relevance to us in this paper, in what follows, we describe their work in some detail. In , a system whose transition law is parameterised by an unknown parameter belonging to a known, finite, parameter space is considered. At any given time, one out of finitely many possible controls (or actions) is applied to the system, resulting in a reward based on the action applied and the system’s current state, followed by a transition of the system’s current state. The choice of action to be applied at any given time is based on the past history of actions and states, and the current state of the system. The performance of any sequence of actions and states (known as a policy) is measured by the infinite horizon expected sum of the rewards generated by the policy.
The authors of  compare the performance of any given policy with that of a policy that knows the true parameter of the system by introducing a loss function, defined at any given time as the total reward generated up to that time by the policy that knows the system parameter, minus the expected total reward generated up to that time by the given policy. Subsequently, the authors of  define a policy to be optimal if its loss function grows as for every , and look for policies that are optimal within this regret minimisation framework.
The authors of  then show that the problem described above can be reduced to a problem of a multi-armed bandit with finitely many arms, each generating Markov rewards, as considered by Gittins , but with transition laws parameterised by an unknown parameter coming from a finite parameter set. Exploiting the structure arising from the multi-armed bandit setting, the authors of  then show that for any policy, the ratio of its loss function to that of may be lower bounded in terms of a problem-instance dependent constant that quantifies the effort required by any policy to distinguish the true parameter of the system from its nearest alternative. Furthermore, the authors of  propose a policy and demonstrate that its performance meets the lower bound, hence proving that their policy is optimal.
While the lower bound in  quantifies the growth rate of the loss in performance of a policy in relation to the policy that has full knowledge of the system’s true parameter, it does not reflect the quickness of learning by a policy. That is, the results in  do not answer the question of how many controls (or actions) are needed, on the average, in order to learn the true parameter of the system to a desired level of accuracy. In this paper, we answer this question at the same level of generality of , when one of the Markov arms is anomalous.
I-B Setting and Problem Description
Let us recall our setting. We have a multi-armed bandit problem with finitely many arms in which each arm is identified with a time homogeneous, irreducible and aperiodic discrete time Markov process on a common finite state space, as in . We assume that each arm yields Markov rewards independently of the other arms. Further, we impose the following condition on the arms: the evolution of states on one of the arms is governed by a transition matrix , while that of each of the other arms is governed by a common transition matrix , where , hence making one of the arms anomalous (hereinafter referred to as the odd arm). A learner, who has no knowledge of the above transition matrices, seeks to identify the index of the odd arm as quickly as possible. To do so, the learner devises sequential arm selection schemes in which, at each time step, he chooses one of the arms and observes a state transition on the chosen arm. During this time, all the unobserved arms remain frozen at their previously observed states and do not exhibit state transitions, thus making the arms rested as in .
Given a maximum acceptable error probability, the goal of the learner is to devise sequential schemes for identifying the index of the odd arm with (a) the least expected number of arm selections, and (b) the probability of error at stoppage being within the acceptable level. We note here that the unknown parameters of our problem are the transition laws of the odd arm and the non-odd arm Markov processes, and the index of the odd arm, thus making our parameter set a continuum, unlike a finite parameter set in . However, our goal is only to identify the index of the odd arm.
The structure of anomaly imposed on the arms in the context of odd arm identification is not new, and has been dealt with in the recent works of Vaidhiyan et al.  for the case of iid Poisson observations from each arm, and of Prabhu et al.  for the case of iid observations belonging to a generic exponential family. The works  and  can be embedded within the classical works of Chernoff  and Albert , and provide a general framework for the analysis of lower bounds on the expected number of samples required to identify the index of the odd arm. In addition, they also provide explicit schemes that achieve these lower bounds in the asymptotic regime of vanishing error probability. We also refer the reader to [9, 10, 11, 12, 13, 14, 15, 16] for other related works on iid observations.
In this paper, we present results similar in spirit to those of [8, 7, 5, 6], but for the important setting of Markov arms. To the best of our knowledge, there is no prior work on odd arm identification for the case of multi-armed bandits with Markov rewards, and this paper is the first to study this setting.
Below, we highlight the key contributions of our work. Further, we mention the similarities and differences of our work with the aforementioned ones, and also bring out the challenges that we need to overcome in the analysis for the Markov setting.
We derive an asymptotic lower bound on the expected number of arm selections required by any policy that the learner may use to identify the index of the odd arm. Here, the asymptotics is as the error probability vanishes. Similar to the lower bounds appearing in -, our lower bound has a problem-instance dependent constant that quantifies the effort required by any policy to identify the true index of the odd arm by guarding itself against identifying the nearest, incorrect alternative.
We characterise the growth rate of the expected number of arm selections required by any policy as a function of the maximum acceptable error probability, and show that in the regime of vanishingly small error probabilities, this growth rate is logarithmic in the inverse of the error probability. The analysis of the lower bounds in  and  uses the familiar data processing inequality presented in the work of Kaufmann et al.  that is based on Wald’s identity  for iid processes. However, the Markov setting in our problem does not permit the use of Wald’s identity. Therefore, we derive results for our Markov setting generalising those appearing in , and subsequently use these generalisations to arrive at the lower bound. See Section III for the details.
In the analysis of the lower bound, we bring out the key idea that the empirical proportion of times an arm is observed to exit out of a state is equal, in the long run, to the empirical proportion of times it is observed to enter into the same state. We then replace these common proportions by the probability of observing the arm occupying this state under the arm’s stationary distribution. This is possible due to the rested nature of the arms, and may not hold in a more general setting of “restless” arms where the unobserved arms continue to undergo state transitions.
We propose a sequential arm selection scheme that takes as inputs two parameters, one of which may be chosen appropriately to meet the acceptable error probability, while the other may be tuned to ensure that the performance of our scheme comes arbitrarily close to the lower bound, thus making our scheme near-optimal.
We now contrast the near-optimality of our scheme with the near-optimality of the scheme proposed by Vaidhiyan et al. in , and highlight a key simplification that our scheme entails. The scheme of Vaidhiyan et al. is built around the important fact that each arm is sampled at a non-trivial, strictly positive and optimal rate that is bounded away from zero, as given by the lower bound, thereby allowing for exploration of the arms in an optimal manner. This stemmed from their specific Poisson observations. However, the lower bound presented in Section III may not have this property in the context of Markov observations. Therefore, recognising the requirement of sampling the arms at a non-trivial rate for good performance of our scheme, in this paper, we use the idea of “forced exploration” proposed by Albert in . In particular, we propose a simplified way of sampling the arms by considering a mixture of uniform sampling and the optimal sampling given by the lower bound in Section III. We do this by introducing an appropriately tuneable parameter that controls the probability of switching between uniform sampling and optimal sampling, the latter being given by the lower bound. While this ensures that our policy samples each arm with a strictly positive probability at each step, it also gives us the flexibility to select an appropriate value for this parameter so that the upper bound on the performance of our scheme may be made arbitrarily close to our lower bound. We refer the reader to Section IV for the details.
The rest of the paper is organised as follows. In Section II, we set up some of the basic notation that will be used throughout the paper. In Section III, we present a lower bound on the performance of any policy. In Section IV, we present a sequential arm selection policy and demonstrate its near optimality. We present the main result of this paper in Section V, combining the results of Sections III and IV. In Section VI, we provide simulation results to support the theoretical development. We present the proofs of the main results in Section VII, and provide concluding remarks in Section VIII.
In this section, we set up the notations that will be used throughout the rest of this paper. Let denote the number of arms, and let denote the set of arms. We associate with each arm an irreducible, aperiodic, time homogeneous discrete-time Markov process on a finite state space , where the Markov process of each arm is independent of the Markov processes of the other arms. We denote by the cardinality of . Without loss of generality, we take . Hereinafter, we use the phrase ‘Markov process of arm ’ to refer to the Markov process associated with arm .
At each discrete time instant, one out of the arms is selected and its state is observed. We let denote the arm selected at time , and let denote the state of arm , where . We treat as the zeroth arm selection and as the zeroth observation. Selection of an arm at time is based on the history of past observations and arms selected; here, (resp. ) is a shorthand notation for the sequence (resp. ). We shall refer to such a sequence of arm selections and observations as a policy, which we generically denote by . For each , we denote the Markov process of arm by the collection
of random variables. Further, we denote by the number of times arm is selected by a policy up to (and including) time , i.e.,
Then, for each , we have the observation
We consider a scenario in which the Markov process of one of the arms (hereinafter referred to as the odd arm) follows a transition matrix , while those of the rest of the arms follow a transition matrix , where ; here, denotes the entry in the th row and th column of the matrix . Further, we let and denote the unique stationary distributions of and respectively. We denote by the common distribution for the initial state of each Markov process; that is, for each arm , we have , and this distribution is the same for all arms. We operate in a setting where the transition matrices and their associated stationary distributions are unknown to the learner.
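Although the transition matrices are unknown to the learner, it may help to recall how the unique stationary distribution of a known irreducible, aperiodic transition matrix is computed. Below is a minimal numerical sketch (the function name is ours), solving the balance equations together with the normalisation constraint as one linear system:

```python
import numpy as np

def stationary_distribution(P):
    """Unique stationary distribution of an irreducible, aperiodic chain:
    solve mu @ P = mu together with sum(mu) = 1 as a least-squares system."""
    n = P.shape[0]
    # Stack the balance equations (P^T - I) mu = 0 with the normalisation row.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu
```

The system is overdetermined but consistent for an irreducible chain, so the least-squares solution is exact.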
For each and state , we denote by the number of times up to (and including) time the Markov process of arm is observed to exit out of state , i.e.,
Similarly, for each , we denote by the number of times up to (and including) time the Markov process of arm is observed to exit out of state and enter into state , i.e.,
Clearly, then, the following hold:
For each and ,
For each ,
For each ,
We note here that the upper index of the summation in (3) is , and not , since the last observed transition on arm would be an exit out of the state given by and an entry into the state given by . This is further reflected by the summation in (5b).
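The counting quantities above can be tallied directly from an observed path. The sketch below (names are ours, not the paper's) counts exits out of each state and transitions between pairs of states; the identities noted above follow because every observed transition is simultaneously an exit and an entry, and the final state of the path is an entry only:

```python
from collections import defaultdict

def transition_counts(states):
    """N[i]: exits out of state i; M[i, j]: transitions from i to j.
    The final state of the path is an entry only, never an exit."""
    N, M = defaultdict(int), defaultdict(int)
    for i, j in zip(states, states[1:]):
        N[i] += 1
        M[i, j] += 1
    return N, M
```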
Fix transition matrices and , where , and let denote the hypothesis that is the index of the odd arm. The transition matrix of arm is ; all other arms have . We refer to the triplet as a configuration. Our problem is one of detecting the true hypothesis among all possible configurations given by
when and are unknown. Let denote the underlying configuration of the arms. For each , we denote by the log-likelihood process of arm under configuration , with being the true index of the odd arm. Using the notations introduced above, we may then express as
where denotes the conditional probability under hypothesis of observing state on arm given that state was observed on arm at the previous sampling instant, and is given by
Then, since the Markov processes of all the arms are independent of one another, for a given sequence of arm selections and observations under a policy and a configuration , denoting by the log-likelihood process under hypothesis of all arm selections and observations up to time , we have
where is as given in (6). On similar lines, for any two configurations and , where and , for each , we define the log-likelihood process of configuration with respect to configuration for arm as
We note that in the above equation, for , we use (7), whereas for , we use, for all and ,
Finally, we denote by the log-likelihood process of configuration with respect to as
which includes all arm selections and observations.
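As an illustrative sketch of how such a log-likelihood is assembled from observed transitions, the following code (in our own notation, not the paper's) mirrors the factorisation above: under a given hypothesis, the hypothesised odd arm uses one transition matrix and every other arm uses the other, with the common initial-state terms omitted since they cancel in likelihood ratios:

```python
import math

def log_likelihood(paths, odd_arm, P1, P2):
    """Transition log-likelihood of the observed paths under the hypothesis
    that `odd_arm` follows P1 while every other arm follows P2 (the common
    initial-state terms are omitted, as they cancel in likelihood ratios)."""
    total = 0.0
    for arm, path in enumerate(paths):
        P = P1 if arm == odd_arm else P2
        for i, j in zip(path, path[1:]):
            total += math.log(P[i][j])
    return total
```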
The observation process and the arm selection process are assumed to be defined on a common probability space . We define the filtration as
We use the convention that the zeroth arm selection is measurable with respect to the sigma algebra , whereas for all , the th arm selection is -measurable. For any stopping time with respect to the filtration in (12), we denote by the -algebra
Our focus will be on policies that identify the index of the odd arm by sequentially sampling the arms, one at every time instant, and learning from the arms selected and observations obtained in the past. Specifically, at any given time, a policy prescribes one of the following alternatives:
Select an arm, based on the history of past observations and arms selected, according to a fixed distribution independent of the underlying configuration of the arms, i.e., for each ,
Stop selecting arms, and declare the index as the odd arm.
Given a maximum acceptable error probability , we denote by the family of all policies whose probability of error at stoppage for any underlying configuration of the arms is at most . That is,
For a policy , we denote its stopping time by . Further, we write and to denote expectations and probabilities given that the underlying configuration of the arms is . In this paper, we characterise the behaviour of for any policy , as approaches zero. We re-emphasise that cannot depend on the knowledge of or , but could attempt to learn these along the way.
Fix an odd arm index , and consider the simpler case when , are known, . Let denote the set of all policies whose probability of error at stoppage is within . From the definition of in (15), it follows that
That is, policies in work for any , with . It is not a priori clear whether the set is nonempty. That it is nonempty for the case of iid observations was established in . In this paper, we show that is nonempty even for the setting of rested and Markov arms.
III The Lower Bound
For any two transition probability matrices and of dimension , and a probability distribution on , define as the quantity
with the convention . The following proposition gives an asymptotic lower bound on the expected stopping time of any policy , as .
Let denote the underlying configuration of the arms. Then,
where is a configuration-dependent constant that is a function only of and , and is given by
In (18), is a transition probability matrix whose entry in the th row and th column is given by
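The divergence quantity defined at the beginning of this section, a relative entropy between transition matrices weighted by a distribution over the states, can be sketched numerically as follows (our notation; the stated convention is handled by skipping zero entries of the first matrix):

```python
import math

def kl_transition(P1, P2, mu):
    """Weighted relative entropy sum_i mu(i) * D(P1(i, .) || P2(i, .)),
    skipping zero entries of P1 (the 0 * log 0 = 0 convention).
    Assumes P2 is strictly positive wherever P1 is."""
    total = 0.0
    for i, w in enumerate(mu):
        for j, p in enumerate(P1[i]):
            if p > 0.0:
                total += w * p * math.log(p / P2[i][j])
    return total
```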
The proof of Proposition 1 broadly follows the outline of the proof of the lower bound in , with necessary modifications for the setting of Markov rewards. We now outline some of the key steps in the proof. For an arbitrary choice of error probability , we first show that for any policy , the expected value of the total sum of log-likelihoods up to the stopping time can be lower bounded by the binary relative entropy function
Next, we express the expected sum of log-likelihoods up to the stopping time in terms of the expected value of the stopping time. It is in obtaining such an expression that works such as ,  and  that are based on iid observations use Wald’s identity, which greatly simplifies their analysis of the lower bound. Our setting of Markov rewards does not permit us to use Wald’s identity. Therefore, we first obtain a generalisation of [10, Lemma 18], a change of measure based argument, to the setting of Markov rewards, and subsequently use this generalisation to obtain the desired relation.
We then show that for any arm , the long run frequency of observing the arm exit out of a state is equal to that of observing arm enter into the state , and note that this common frequency is the stationary probability of observing the arm in state . This explains the appearance of the unique stationary distributions and of the odd arm and the non-odd arms respectively in the expression (18). We wish to emphasise that this step in the proof is possible due to the rested nature of the arms. The lower bound in the more general setting of “restless” arms in which the unobserved arms continue to undergo state transitions is still open.
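The exit/entry balance invoked in this step admits a simple pathwise check: along any finite sample path, the number of exits from a state and the number of entries into it differ only through the first and last states of the path, hence by at most one. A small simulation sketch (pure Python; function names are ours):

```python
import random

def simulate_chain(P, mu0, n, seed=0):
    """Sample a path of n transitions from a chain with transition matrix P
    (a list of rows) and initial distribution mu0; returns the state list."""
    rng = random.Random(seed)

    def draw(dist):
        # Inverse-CDF sampling from a finite distribution.
        u, acc = rng.random(), 0.0
        for s, p in enumerate(dist):
            acc += p
            if u < acc:
                return s
        return len(dist) - 1

    path = [draw(mu0)]
    for _ in range(n):
        path.append(draw(P[path[-1]]))
    return path
```

For any state, exits are counted over all but the last position of the path and entries over all but the first, so their difference is at most one regardless of the realisation.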
The right-hand side of (18) is a function only of the transition matrices and , and does not depend on the index of the odd arm. This is due to symmetry in the structure of arms, and we deduce that does not depend on . However, we include the index of the odd arm for the sake of consistency with the notation used to denote arm configurations.
Going further, we let denote the value of that achieves the maximum in (18). We then define as the probability distribution on given by
In the next section, we construct a policy that, at each time step, chooses arms with probabilities that match with those in (21) in the long run, in an attempt to reach the lower bound. While it is not a priori clear that this yields an asymptotically optimal policy, we show that this is indeed the case.
In this section, we propose a scheme that asymptotically achieves the lower bound of Section III, as the probability of error vanishes. Our policy is a modification of the policy proposed by Prabhu et al.  for the case of iid processes. We denote our policy by , where and are the parameters of the policy.
Our policy is based on a modification of the classical generalised likelihood ratio (GLR) test in which we replace the maximum that appears in the numerator of the classical GLR statistic by an average computed with respect to a carefully constructed artificial prior over the space of all probability distributions on the state space . We describe this modified GLR statistic in the next section.
Iv-a The Modified GLR Statistic
We revisit (8), and suppose that each arm is selected once in the first time slots. Note that this does not affect the asymptotic performance. Then, under configuration , the log-likelihood process may be expressed for any as
from which the likelihood process under , denoted by , may be written as
We now introduce an artificial prior on the space of all transition probability matrices for the state space . Let denote a Dirichlet distribution with parameters , where for all . Then, denoting by the space of all transition probability matrices of size , we specify a prior on using the above Dirichlet distribution as follows: for any , is chosen according to the above Dirichlet distribution, independently of for all . Further, for any two matrices , the rows of are independent of those of . Then, it follows that under this prior, the joint density at (, ) for is
where denotes the normalisation factor for the distribution , and the second line above follows by substituting , .
From the property that the Dirichlet distribution is the appropriate conjugate prior for the observation process, we have
where in the above expression, denotes the normalisation factor for a Dirichlet distribution with parameters . It can be shown that is also the expected value of the likelihood in (23) computed with respect to the prior in (24), i.e.,
where in the above set of equations, the random vectors and are independent with independent components, jointly distributed according to (24), and the expectation is also with respect to this joint density.
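By conjugacy, the averaged likelihood reduces, row by row, to a ratio of Dirichlet normalisation constants. A sketch for a single row of transition counts (our notation; computed in the log domain to avoid overflow):

```python
import math

def log_marginal_likelihood(counts, alpha):
    """log of B(alpha + counts) / B(alpha), where B is the multivariate
    Beta (Dirichlet normalisation) function: the log-likelihood of one row
    of transition counts averaged over a Dirichlet(alpha) prior."""
    def log_beta(a):
        return sum(math.lgamma(x) for x in a) - math.lgamma(sum(a))
    return log_beta([a + c for a, c in zip(alpha, counts)]) - log_beta(alpha)
```

For instance, a single observed transition into the first of two states under a uniform Dirichlet(1, 1) prior has averaged likelihood 1/2, as one would expect.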
Let and denote the maximum likelihood estimates of the transition matrices and respectively, under hypothesis . Taking partial derivatives of the right-hand side of (23) with respect to and for each , and setting each of these derivatives to zero, we get
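The resulting maximum likelihood estimates are the usual row-normalised transition counts; a minimal sketch (function name ours):

```python
def ml_transition_matrix(states, num_states):
    """Maximum likelihood estimate of a transition matrix from one path:
    each row is the empirical distribution of next states; rows of states
    never exited are left as all zeros (handled separately by the policy)."""
    counts = [[0] * num_states for _ in range(num_states)]
    for i, j in zip(states, states[1:]):
        counts[i][j] += 1
    P_hat = []
    for row in counts:
        n = sum(row)
        P_hat.append([c / n if n else 0.0 for c in row])
    return P_hat
```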
We now define our modified GLR statistic. Let and be any two hypotheses, with . Let be a policy whose sequence of arm selections and observations up to (and including) time is . Then, the modified GLR statistic of with respect to up to time is denoted by , and is defined as
where the terms appearing in (29) are as follows.
The term is given by
The term is given by
The term is given by
The term is given by
The term is given by
Note that , the distribution of the initial state of any arm, is irrelevant since it appears in both (25) and (28), and thus cancels out in writing (29). Let us emphasise that our modified GLR statistic is one in which the maximum in the numerator of the usual GLR statistic is replaced by an average in (25) computed with respect to the artificial prior over the space introduced in (24).
We wish to mention here that the expression on the right-hand side of (23) for represents the likelihood of all observations up to (and including) time “conditioned on” the actions up to (and including) time . In other words, a more precise expression for is as follows:
where represents the probability of selecting arm at time when the true hypothesis is (i.e., when is the index of the odd arm), with the convention that at time , this term represents . Note that for any policy (see description in the paragraph containing (14) and (15)), this must be independent of the true hypothesis , and is thus the same for any two hypotheses and , where .
IV-B The Policy
With the above ingredients in place, we now describe our policy based on the modified GLR statistic of (29). For each , let
denote the modified GLR of hypothesis with respect to its nearest alternative.
Fix parameters and . Let be a sequence of iid Bernoulli random variables such that is independent of the sequence for all . We choose each of the arms once in the first time steps. For each , at time , we follow the procedure described below:
Let be the index with the largest modified GLR after time steps. We resolve ties uniformly at random.
If , then we choose the next arm based on the sequence of observations and arms selected until time as per the following rule:
If , then we choose an arm uniformly at random.
If , then we choose according to the distribution .
If , then we stop selecting arms and declare as the true index of the odd arm.
In the above policy, provides the best estimate of the odd arm at time . If the modified GLR statistic of arm is sufficiently larger than that of its nearest incorrect alternative (), then this indicates that the policy is confident that is the odd arm. At this stage, the policy stops taking further samples and declares as the index of the odd arm. If not, the policy continues to obtain further samples.
We refer to the rule in item (2) above as forced exploration with parameter . A similar rule also appears in . Based on the description in items (2(a)) and (2(b)) above, it follows that for each ,
As we will see, the strictly positive lower bound in (37) will ensure that the policy selects each arm at a non-trivial frequency so as to allow for sufficient exploration of all arms. Also, we will show that the parameters and may be selected so that our policy achieves a desired target error probability, while also ensuring that the normalised expected stopping time of the policy is arbitrarily close to the lower bound in (17).
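One step of the forced-exploration sampling in items (2(a)) and (2(b)) can be sketched as follows (names are ours; the Bernoulli draw decides between uniform exploration and the estimated optimal allocation). Each arm is then selected with probability at least the exploration weight divided by the number of arms, matching the strictly positive lower bound discussed above:

```python
import random

def choose_arm(eps, lam_star, num_arms):
    """With probability eps explore uniformly; otherwise sample an arm
    from the estimated optimal allocation lam_star (a probability vector).
    Every arm is then chosen with probability at least eps / num_arms."""
    if random.random() < eps:
        return random.randrange(num_arms)
    # Inverse-CDF sampling from lam_star.
    u, acc = random.random(), 0.0
    for arm, p in enumerate(lam_star):
        acc += p
        if u < acc:
            return arm
    return num_arms - 1  # guard against floating-point round-off
```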
Evaluating the distribution in step (2(b)) of the policy involves solving the maximisation problem in (18), with the transition matrices and replaced by their corresponding ML estimates and respectively, at each time until stoppage. In the event that any row of the estimated matrices has all its entries equal to zero, we replace that row by a row with a single ‘1’ in a position picked uniformly at random. Since the ML estimates converge to their respective true values as more observations are accumulated, such a substitution (or any modification thereof that replaces the all-zero rows by an arbitrary probability vector) needs to be carried out only for finitely many time slots, and does not affect the asymptotic performance of the policy.
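The zero-row substitution described above can be sketched as follows (names ours):

```python
import random

def fix_zero_rows(P_hat):
    """Replace every all-zero row of an estimated transition matrix with a
    one-hot row, the '1' placed uniformly at random, so that each row is a
    valid probability vector."""
    n = len(P_hat)
    for i, row in enumerate(P_hat):
        if sum(row) == 0:
            j = random.randrange(n)
            P_hat[i] = [1.0 if k == j else 0.0 for k in range(n)]
    return P_hat
```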
IV-C Performance of
In this subsection, we show that the expected number of samples required by policy to identify the index of the odd arm can be made arbitrarily close to that in (17) in the regime of vanishing error probabilities. We show that this can be achieved by choosing the parameters and carefully. We organise this subsection as follows:
First, we show that when the true index of the odd arm is , the modified GLR of hypothesis with respect to its nearest alternative has a strictly positive drift under our policy. We then use this to show that our policy stops in finite time with probability .
For any fixed target error probability , we show that for an appropriate choice of the threshold parameter , our policy belongs to the family , i.e., its probability of error at stoppage is within .
We obtain an upper bound on the expected stopping time of our policy, and demonstrate that this upper bound may be made arbitrarily close to the lower bound in (17) by choosing an appropriate value of .
IV-C1 Strictly Positive Drift of the Modified GLR Statistic
The main result on the strictly positive drift of the modified GLR statistic is as described in the following proposition.
Fix , , and consider a version of the policy that never stops. Let be the underlying configuration of the arms. Then, for all , under the non-stopping version of our policy, we have
The proof is based on the key idea that forced exploration with parameter (see items (2(a)) and (2(b)) of policy ) results in sampling each arm with a strictly positive rate that grows linearly. It is in showing an analogue of Proposition 2 for iid Poisson observations that the authors of  use their result of [5, Proposition 3] on guaranteed exploration at a strictly positive rate. Since it is not clear if the analogue of [5, Proposition 3] holds in general, we use the idea in  of forced exploration. We present the details in Section VII-B. We refer the reader to  on how to make do with forced exploration at a sublinear rate.
As an immediate consequence of the above proposition, we have the following: suppose is the underlying configuration of the arms. Then, a.s.,
The result in (39) has the following implication. For any , we have the following set of inequalities holding a.s.:
From the above set of inequalities, it follows that under policy ,
for all sufficiently large values of .
We note here that when is the underlying configuration of the arms, (41) seems to suggest that policy almost surely outputs as the true index of the odd arm at the time of stopping, thereby implying that it almost surely commits no error. However, this is not true: there remains the possibility of the event that and for some , in which case the policy stops at time and outputs as the index of the odd arm, thereby committing an error. While we shall soon demonstrate that the probability of such an error event under our policy is small, we leverage the implication of (41) to define a version of our policy that, under the underlying configuration , waits until the event occurs, at which point it stops and declares as the index of the odd arm. We denote this version by . Thus, stops only at declaration .
It then follows that the stopping times of policies and are a.s. related as , as a consequence of which we have the following set of inequalities holding a.s.:
where the last line follows as a consequence of Proposition 2. This establishes that a.s. policy stops in finite time.
IV-C2 Probability of Error of Policy
We now show that for policy , the threshold parameter may be chosen to achieve any desired target error probability. This is formalised in the proposition below.
Fix . Then, for , we have . ∎