Consider a constrained multi-armed bandit (CMAB) problem with a finite set of arms, one of which must be played at each discrete time step. When an arm is played, the player obtains a bounded sample reward drawn from an unknown reward distribution associated with that arm, which has unknown finite expectation and finite variance, and also a bounded sample cost drawn from an unknown cost distribution associated with that arm, likewise with unknown expectation and variance. Sample rewards and costs are independent across arms and across time steps, and for any fixed arm the sample rewards (and, respectively, the sample costs) are identically distributed over time. We define the feasible set as the set of arms whose expected cost does not exceed a constant known to the player, and we assume that this set is nonempty. Our goal is to find an optimal feasible arm, that is, a feasible arm whose expected reward attains the optimal value, the maximum expected reward over the feasible set. (For simplicity, we consider the single-constraint case; extending our results to multiple constraints is straightforward.)
The model of unconstrained MAB has been used for studying many practical problems (see, e.g.,    for in-depth coverage of the topic and examples). However, there also exist related MAB problems that involve one or more objective functions in conflict with the main objective. These conflicting objectives play the role of constraints on optimizing the main objective in CMAB problems. For example, a trade-off exists between achieving a “small” delay (or “high” throughput) and “low” power consumption in wireless communication networks. To maximize the throughput (or to minimize the delay), we need to transmit at the highest available power level because doing so increases the probability of successful transmission. On the other hand, to minimize the power consumption, we need to transmit at the lowest available power level. We can then consider the problem of selecting, among all available power levels, an optimal feasible one that keeps the delay cost below some given bound. More generally, many scheduling and queueing control problems exhibit trade-offs between “throughput” and “delay.”
We define an algorithm as a sequence of mappings, one for each time step, such that the mapping for a given time step maps the history of past plays together with the observed rewards and costs (empty at the first time step) to the set of arms. We consider the set of all such deterministic algorithms. The asymptotic optimality introduced by Robbins , which concerns the transient behavior of an algorithm, is used as the performance measure. We say that an algorithm is asymptotically optimal if the probability that the arm it selects at a given time step is an optimal feasible arm converges to one as time goes to infinity.
This note begins by presenting an algorithm, called “Constrained Anytime Parameter-free Thresholding (CAPT),” in order to show by construction that there exists an asymptotically optimal index-based deterministic algorithm for CMAB. It has been conjectured that devising such an algorithm is difficult  because one must answer the question of how to “mix” the process of estimating the feasibility of each arm into the exploration (and possibly exploitation) process of estimating the reward-optimality of each feasible arm by means of a single deterministic index. This note answers the question in the affirmative by exhibiting such an index.
To approach a CMAB problem, instead of searching for a proper index for each action, one can consider a methodology that “separates” the two associated problems of CMAB “in time”: solve the cost-feasibility problem first and then solve the reward-optimality problem conditioned on the feasibility results, eventually achieving asymptotic optimality. Two questions immediately arise. The first is whether such a method is a deterministic algorithm at all (even putting aside index-based selection); a strong argument against is that the method can offer only a probabilistic judgement about when to switch from the first problem to the second. The second is whether the algorithm can be analyzed in terms of asymptotic optimality. A randomized strategy recently studied by Chang  extends the ε-greedy MAB strategy  and relies on uniform random selection as its underlying exploration method. It is worth noting that the analysis of the asymptotic optimality of that strategy rests on the very idea above: separating the estimation process in time according to the probabilistic estimate of feasibility. This is possible because the underlying uniform selection allows one to condition on a “guaranteed level” of feasibility estimation at a given time. By controlling the values of ε used over time, the strategy was proved to achieve asymptotic optimality after a sufficiently large horizon. The strategy is simple, although it is randomized and not index-based. Still, it provides an important direction towards designing solution methods for CMAB problems.
In addition, a potentially profitable feature missing from the constrained ε-greedy strategy is the use of problem characteristics regarding reward-optimality and cost-feasibility. For each arm, let us define a reward gap, which measures how far the expected reward of the arm is from the optimal value, and a cost gap, which measures how far the expected cost of the arm is from the constraint bound, each enlarged by a tolerance parameter. The role of the tolerance parameter is to provide some tolerance in what we measure. These gaps can guide how sampling effort is allocated over arms. The more closely competitive and closely feasible arms there are, the more difficult the problem is in general. For example, the more competitive the second best arm is with the best, the harder it is to distinguish the two. Likewise, as the cost gap of an arm approaches zero, checking the feasibility of that arm becomes more difficult. More sampling effort needs to be put into distinguishing closely feasible and closely competitive arms. Indeed, the probabilities related to convergence have been expressed as a function of such gaps, called the “complexity of the problem” (see, e.g.,    , etc., and compare, e.g.,  for related but different complexity measures). Notably, Locatelli et al.  developed an index-based deterministic algorithm, called “Anytime Parameter-free Thresholding (APT),” for “thresholding bandit” problems by using such gaps. It turns out that the problem they consider coincides exactly with the cost-feasibility problem in CMAB. In particular, the APT index of an arm at a given time is the square root of the number of times the arm has been played so far multiplied by an estimate of its gap, where the estimate is obtained by replacing the unknown mean with the sample mean of the cost samples observed from playing the arm. The index measures “to what degree the arm needs to be sampled” at that time, and APT plays an arm that achieves the minimum index. The index is thus structured so that the arm selection is affected jointly by the play count and the estimated gap.
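As a minimal sketch of the APT selection rule just described, the following Python fragment assumes the form of the index used by Locatelli et al., namely the square root of the play count times the sum of the absolute distance between the sample mean and the threshold and a tolerance. The names `apt_index`, `apt_select`, `threshold`, and `eps` are illustrative and do not come from the text.

```python
import math


def apt_index(n_plays, sample_mean, threshold, eps):
    """APT-style index: sqrt(play count) * (|sample mean - threshold| + eps)."""
    return math.sqrt(n_plays) * (abs(sample_mean - threshold) + eps)


def apt_select(n_plays, sample_means, threshold, eps):
    """Return the arm with the minimum index (lowest arm number wins ties)."""
    indices = [
        apt_index(n, m, threshold, eps)
        for n, m in zip(n_plays, sample_means)
    ]
    return min(range(len(indices)), key=indices.__getitem__)
```

With equal play counts, the arm whose sample mean is closest to the threshold is selected; once that arm has been played often enough, the square-root factor shifts the selection to the other arms.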
The CAPT algorithm presented in Section III employs the same form of index as APT, with an extension: the index of an arm is given as the minimum of two APT-type indices, one built from an estimate of the cost gap of the arm and the other from an estimate of its reward gap. As with the cost gap, the estimate of the reward gap uses the sample mean of the reward samples observed up to the current time in place of the unknown expected reward. The idea is simple. We pose the reward-optimality problem as another cost-feasibility problem, and the indices for the two feasibility problems are combined into a new index by the minimum operator. The index then measures to what degree an arm needs to be sampled not only for cost-feasibility but also for reward-optimality. We prove that CAPT, constructed from this simple fusion, achieves asymptotic optimality. We also provide a finite-time lower bound on the probability of finding an optimal feasible action for a given finite horizon.
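The fusion by the minimum operator can be sketched as follows. The displayed index formula is not recoverable here, so this is only an assumed form: both indices use the APT structure with a shared tolerance `eps`, and the reward gap is taken as the absolute distance between the sample average reward and the known optimal value. All function and parameter names are hypothetical.

```python
import math


def capt_index(n_plays, reward_mean, cost_mean, optimal_value, cost_bound, eps):
    """Assumed CAPT-style index: the minimum of two APT-type indices.

    reward_gap treats reward-optimality as a thresholding problem at the
    known optimal value; cost_gap is the thresholding gap at the cost bound.
    """
    reward_gap = abs(optimal_value - reward_mean) + eps
    cost_gap = abs(cost_mean - cost_bound) + eps
    root_n = math.sqrt(n_plays)
    return min(root_n * reward_gap, root_n * cost_gap)
```

Under this form, an arm is favored for sampling whenever either of its two classification problems is still hard, which matches the stated intent that the index measure the need for sampling for both cost-feasibility and reward-optimality.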
CAPT relies on the crucial assumption that the optimal value is known, because the estimate of the reward gap needs to be computed. However, we argue that this theoretical study is an important step towards understanding the solvability and the complexity of CMAB. In fact, some algorithms for MAB in the literature take the optimal value (or a function of it, or a known bound on it) as an input parameter and are analyzed accordingly (see, e.g., Theorem 3 for the ε-greedy algorithm in , Theorem 3 for APT in , Theorem 1 for UCB-E in , Theorem 2 and the related work section in , etc.).
In Section IV, we study an algorithm in a general form, called CAPT-E (CAPT with Estimation), in which the optimal value in the index of CAPT is replaced by an estimate of it computed from the samples observed so far. We discuss a sufficient condition under which CAPT-E achieves asymptotic optimality for a sufficiently large horizon, together with some examples of such an estimate.
The main goal of this note is to establish the existence of an asymptotically optimal index-based deterministic algorithm and to provide a theoretical characterization of CMAB problems (as in Theorem 2 of Lai and Robbins ). The performance of CAPT can serve as a baseline for comparison with, or improvement by, other index-based algorithms under the criterion of asymptotic optimality. Finally, we show that the critical assumption can be relaxed, which opens directions for further research in developing algorithms for solving CMAB problems.
II Related Works
The model of CMAB is a special case of the constrained Markov decision process (CMDP) , in which we assume that the distributions of rewards and costs associated with all arms are unknown to the decision maker. Because of this assumption, exact solution methods, e.g., linear programming, are not applicable to CMAB problems.
Much attention has been paid recently to the model of “Budgeted MAB (BMAB),” which adds a certain constraint to optimality (see, e.g., ). In our terms, consider the random variable given by the sum of the random costs incurred by running an algorithm over a horizon. The stopping time is then determined by a problem parameter called the budget: the player stops playing as soon as the budget is used up. Take the expected value of the sum of the random rewards obtained by following the algorithm over the sample path up to the stopping time. We wish to maximize this expected value over all possible algorithms. In other words, the goal is to obtain the maximum or an algorithm that achieves it. The key difference from CMAB is that in BMAB the budget constraint is placed on the sequence of played arms. Furthermore, while CMAB is a special case of CMDP, as mentioned above, BMAB does not seem to be directly related to CMDP.
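To make the contrast with CMAB concrete, the following sketch computes the stopping time and the accumulated reward for one realized sequence of plays under a budget. The convention that play stops just before the play that would overspend the budget is an assumption made here for illustration, since the exact definition of the stopping time is not recoverable from the text; the name `budgeted_run` is hypothetical.

```python
def budgeted_run(costs, rewards, budget):
    """Return (stopping_time, total_reward) for one realized play sequence.

    costs[t] and rewards[t] are the cost and reward observed at play t + 1.
    Assumed convention: stop before the first play that would exceed the budget.
    """
    spent = 0.0
    total_reward = 0.0
    for t, (cost, reward) in enumerate(zip(costs, rewards), start=1):
        if spent + cost > budget:
            return t - 1, total_reward
        spent += cost
        total_reward += reward
    return len(costs), total_reward
```

Here the constraint acts on the cumulative cost along the played sequence, whereas in CMAB it acts on the expected cost of each individual arm.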
Constrained simulation optimization, under the topic of “constrained ranking and selection,” considers a similar setting in which the values of the objective and constraint functions can be obtained only by a sequential sampling process. In contrast to common assumptions and approaches in that literature (see, e.g.,    and the references therein), however, we do not draw multiple samples of reward and cost at a single time step, we make no particular assumptions (e.g., normality) on the reward and cost distributions, and we do not compute a sampling plan or sampling allocation in advance.
Various measures for studying the behavior of MAB algorithms exist (see, e.g., a discussion in ). The most notable ones are probably the expected regret  for average behavior and the asymptotic optimality for transient behavior. Auer et al.  relate the asymptotic optimality to the “instantaneous” regret and note that the instantaneous regret is a stronger notion than the expected regret in terms of convergence. The asymptotic optimality is directly related to the probability of identifying a best arm    in the so-called “pure exploration” problem. In the simulation optimization literature, this probability is often referred to as the probability of correct selection (see, e.g., , etc.). Which name is used seems to depend on the context of the problem under study in the relevant literature.
The MAB literature has focused mostly on the expected regret since the work of Lai and Robbins  and, more particularly, since the finite-time analysis of index-based algorithms by Auer et al.  (see, e.g.,  and the references therein). It is difficult to find a work that studies the instantaneous behavior of existing MAB algorithms designed for the expected regret, e.g., UCB  or its variants . That is, even though the expected behavior of an algorithm relative to the best algorithm has been studied extensively, the expected behavior of the algorithm itself does not yet seem to be known. Note that obtaining the expected behavior for UCB essentially requires analyzing the transient behavior of UCB, i.e., the probability that an optimal arm is selected at a given time.
Defining the expected regret within CMAB is not straightforward. Suppose we define it as the expected loss relative to the cumulative expected reward of always playing an optimal feasible arm, the loss being due to the fact that the algorithm does not always play an optimal feasible arm. Then the loss can be negative for some algorithms. Minimizing the regret is then no longer meaningful, much as a shortest-path problem is no longer meaningful when it has a negative cycle. In some cases, the minimum is simply achieved by an algorithm that always plays an infeasible arm whose expected reward is higher than the optimal value. A possible remedy would be to introduce, inside the summation, a function that penalizes infeasibility to some degree. The design and analysis of proper algorithms will depend on how the expected “regret” is defined. The study of the expected regret in CMAB is beyond the scope of this note and is left for future research.
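A small numerical illustration of why the naive definition fails, using made-up expected rewards: the optimal feasible arm is assumed to have expected reward 0.5, and an infeasible arm is assumed to have expected reward 0.8. The function `naive_regret` is a hypothetical helper that sums the per-step expected loss relative to the optimal feasible arm.

```python
def naive_regret(optimal_feasible_mean, played_means):
    """Cumulative expected loss relative to always playing the optimal feasible arm.

    played_means[t] is the expected reward of the arm played at step t + 1.
    """
    return sum(optimal_feasible_mean - mean for mean in played_means)


# Always playing the (assumed) infeasible arm for 100 steps yields a negative
# "regret" of about -30, so minimizing this quantity rewards infeasibility.
regret_of_infeasible_policy = naive_regret(0.5, [0.8] * 100)
```

Any penalty function introduced inside the summation would have to offset this negative term for infeasible plays.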
III Constrained APT Algorithm
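Once an arm is played by CAPT at a given time, a sample reward and a sample cost of the arm are obtained independently. The number of times an arm has been played up to a given time is the sum, over the time steps so far, of the indicator function that takes the value 1 if the arm was played at that step and 0 otherwise. The sample average reward of an arm is then the sum of the sample rewards obtained from playing the arm divided by the number of times it has been played if that number is positive, and 0 otherwise. Similarly, the sample average cost of an arm is the sum of the sample costs obtained from playing the arm divided by the number of times it has been played if that number is positive, and 0 otherwise.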
Note that and for all .
A pseudocode for CAPT is provided below.
The Constrained APT (CAPT) algorithm
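Initialization: Play each arm once, one arm per time step, and obtain its sample reward and sample cost independently. Set the play count of every arm to one and set the current time to the number of arms.
Loop: While the current time is less than the horizon, play an arm that achieves the minimum index, obtain its sample reward and sample cost independently, and update its play count and sample averages.
Output: Obtain the set of arms estimated to be feasible and the output set of arms estimated to be optimal.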
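As a rough illustration of the pseudocode above, the following Python sketch runs a CAPT-style procedure on simulated arms. Several details are assumptions rather than the algorithm as specified: the index is taken as the minimum of two APT-type indices with a shared tolerance `eps`, and the Output step is assumed to threshold the sample averages at the cost bound and at the known optimal value, relaxed by a parameter `slack` that is introduced here purely for illustration. The names `capt`, `bernoulli_sampler`, and all parameters are hypothetical.

```python
import math
import random


def bernoulli_sampler(reward_probs, cost_probs):
    """Build a sampler for arms with independent Bernoulli rewards and costs."""
    def sample(arm, rng):
        reward = float(rng.random() < reward_probs[arm])
        cost = float(rng.random() < cost_probs[arm])
        return reward, cost
    return sample


def capt(sample, n_arms, horizon, optimal_value, cost_bound, eps,
         slack=0.0, seed=0):
    """Run a CAPT-style loop and return (output_set, play_counts).

    sample(arm, rng) must return a (reward, cost) pair with values in [0, 1].
    """
    if horizon < n_arms:
        raise ValueError("horizon must be at least the number of arms")
    rng = random.Random(seed)
    counts = [0] * n_arms
    reward_means = [0.0] * n_arms
    cost_means = [0.0] * n_arms

    def play(arm):
        # Draw one reward and one cost, then update the running averages.
        reward, cost = sample(arm, rng)
        counts[arm] += 1
        reward_means[arm] += (reward - reward_means[arm]) / counts[arm]
        cost_means[arm] += (cost - cost_means[arm]) / counts[arm]

    def index(arm):
        # Assumed index: sqrt(play count) times the smaller estimated gap.
        reward_gap = abs(optimal_value - reward_means[arm]) + eps
        cost_gap = abs(cost_means[arm] - cost_bound) + eps
        return math.sqrt(counts[arm]) * min(reward_gap, cost_gap)

    for arm in range(n_arms):            # Initialization: play each arm once.
        play(arm)
    for _ in range(horizon - n_arms):    # Loop: play a minimum-index arm.
        play(min(range(n_arms), key=index))

    # Output (assumed rule): empirically feasible and empirically optimal arms.
    output = [
        arm for arm in range(n_arms)
        if cost_means[arm] <= cost_bound + slack
        and reward_means[arm] >= optimal_value - slack
    ]
    return output, counts


# Example call with made-up Bernoulli arms; arm 1 is the optimal feasible arm.
sampler = bernoulli_sampler([0.9, 0.6, 0.3], [0.8, 0.2, 0.1])
output_set, play_counts = capt(sampler, 3, 2000, 0.6, 0.5, 0.05, slack=0.05)
```

The sampler is passed in as a function so that the same loop can be exercised with deterministic samples, which makes the selection pattern easy to check by hand.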
III-B Asymptotic Optimality
To analyze the behavior of CAPT, we start with the definition of a set of approximately feasible arms: For a given , . Given , any set in is referred to as an -feasible set of arms if , where is the power set of . An arm in is -feasible if is an -feasible set. We also define a set of competing (optimality-candidate) arms: For a given , . Given , any set in is referred to as a -competing set of arms if . An arm in a -competing set is -competing. Note that a -competing arm is not necessarily feasible and that for any given -feasible set and -competing set , . The set in the previous identity is said to be a -optimal set and an arm in is -optimal.
In the sequel, we consider the case where and refer to an -optimal set as just an -optimal set. An arm in an -optimal set is -optimal. If , the -feasible set corresponds to and the -competing set is equal to , and the intersection of the two sets is equal to the solution set of .
The theorem below states a finite-time lower bound on the probability that produced by CAPT at in the Output step is an -optimal set under some general conditions. The bound is given in terms of a problem complexity denoted by . This complexity is intuitive: the performance of CAPT depends on the sum, over actions, of the degree of hardness of each action between the cost-feasibility problem and the reward-optimality problem. Note that if , becomes infinite because for some . If the problem contains an arm that satisfies the constraint with equality such that , becomes infinite again. Therefore, we exclude such cases by requiring that but can be arbitrarily close to zero.
The assumption that in the theorem statement is made for a technical reason. Obviously, for CAPT to run, the condition that is necessary because of the Initialization step. We further observe that there always exists in such that . Suppose not. Then , which is a contradiction. We can then fix an action that satisfies the bound of and that has been played at least twice by , and use the inequality to “clean up” some terms and eventually obtain a bound on . In addition, for ease of exposition, we impose the condition that and are in for any and .
Theorem III.1. Assume that the reward and the cost distributions associated with all arms in have support in . Then for any and , the output by CAPT at satisfies
Before presenting the proof, we remark that the idea of the proof basically follows the reasoning in the proof of Theorem 2 by Locatelli et al.  since CAPT is built upon APT. However, the proof here requires more care because a different index has to be manipulated. We also polish some arguments given in . In particular, where a lower bound on some probability is obtained, the simpler Hoeffding inequality  is applied instead of the unidentifiable “Sub-Gaussian martingale inequality” referred to by Locatelli et al. The lower bound with the term of in our result is looser than the lower bound with stated for a related probability by Locatelli et al. However, the arguments of Locatelli et al. for the tighter -bound seem incomplete at the steps where the union bound is applied. In fact, Wang and Ahmed  provide a related result for the cost-feasibility problem that has the same order of . Their approach is within the context of the “sample average approximation” . Thus the method is neither index-based nor adaptive. In our terms, they analyzed the probability that is an -feasible set when each action in is played times equally, that is, for all . It is not clear how the adaptive index-based approach of APT makes a jump from to in the order in Locatelli et al.’s proof. Besides, Locatelli et al.’s theorem statement includes the case of , which does not lead to asymptotic optimality.
Define an event such that with a given ,
Fix any in such that and fix as the smallest in such that . In other words, is the last time was played and satisfies that .
On we have that for all ,
This implies that for all ,
because for all , where we recall and
Similarly, for all ,
where we recall and
Because was played at , achieves the value of the minimum index, i.e., for all . Recall that
From the two inequalities of and , it follows that
Thus we have that
Multiplying both sides of the above inequality by and using and rearranging the terms leads to a lower bound to :
We now upper bound for any . From the inequality for the bound of , we have that
for all . Rearranging the terms in the above inequality and from ,
In sum, implies that for any ,
Set by letting . We show that the event implies that
For any such that , . By ,
making . On the other hand, for any such that , and this results in .
We next consider the cost-feasibility case. By the same method as in (2), on we have that for any ,
Then following the similar arguments as in the reward-optimality case leads to the inequality of
With , for any such that , and it follows that because
by . Furthermore, for any such that , . Because
. It follows that
Putting the reward-optimality and the cost-feasibility arguments together (by independence), implies that
By applying the union bound (Boole’s inequality) and Hoeffding’s inequality , the probability of is lower bounded as follows:
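The theorem's exact bound is not reproduced here. The sketch below only illustrates the generic pattern of this step: Hoeffding's inequality for i.i.d. samples in [0, 1] bounds the two-sided deviation of one sample mean, and the union bound combines the failure probabilities over all arms and over the reward and cost sample means. The common sample size `n` per arm and the deviation `delta` are simplifying assumptions, and the function names are hypothetical.

```python
import math


def hoeffding_tail(n, delta):
    """Upper bound on P(|sample mean - true mean| >= delta) for n i.i.d.
    samples supported in [0, 1]: 2 * exp(-2 * n * delta**2), capped at 1."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * delta ** 2))


def joint_lower_bound(n_arms, n, delta):
    """Lower bound on the probability that the reward and cost sample means
    of every arm are within delta of their expectations (union bound over
    2 * n_arms deviation events)."""
    failure = 2 * n_arms * hoeffding_tail(n, delta)
    return max(0.0, 1.0 - failure)
```

The bound is vacuous for small sample sizes and approaches one exponentially fast in `n`, which is the mechanism behind the finite-time guarantee.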
IV CAPT with Estimation Algorithm
In this section, we provide an algorithm in a general form that replaces the optimal value in CAPT with an estimate of it computed at each time step. We call the algorithm “CAPT with Estimation” (CAPT-E). We discuss two examples of the estimation below.
The procedure is the same as that of CAPT except that the role of the optimal value is taken by its estimate. In particular, the index of CAPT is replaced by the index computed with the estimate in place of the optimal value, and the set in the Output step of CAPT is replaced accordingly.
We reuse the notation of the previous section, with a slight abuse.
The CAPT with Estimation (CAPT-E) algorithm
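Initialization: Same as the Initialization step of CAPT.
Loop: While the current time is less than the horizon, play an arm that achieves the minimum index computed with the current estimate of the optimal value, obtain its sample reward and sample cost independently, and update its play count and sample averages.
Output: Obtain the set of arms estimated to be feasible and the output set of arms estimated to be optimal, using the estimate in place of the optimal value.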
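To make the substitution concrete, the following sketch shows one possible plug-in estimate and the resulting index. The estimator used here, the largest sample average reward among arms whose sample average cost does not exceed the bound, is a hypothetical choice for illustration and is not claimed to be one of the examples discussed in this note; the index form is the same assumed minimum of two APT-type indices with tolerance `eps`.

```python
import math


def estimate_optimal_value(reward_means, cost_means, cost_bound):
    """Hypothetical plug-in estimate: best sample average reward among
    empirically feasible arms, or among all arms if none looks feasible."""
    feasible = [r for r, c in zip(reward_means, cost_means) if c <= cost_bound]
    return max(feasible) if feasible else max(reward_means)


def capt_e_index(arm, counts, reward_means, cost_means, cost_bound, eps):
    """Assumed CAPT-E index: the CAPT-style index with the optimal value
    replaced by the current estimate."""
    estimate = estimate_optimal_value(reward_means, cost_means, cost_bound)
    reward_gap = abs(estimate - reward_means[arm]) + eps
    cost_gap = abs(cost_means[arm] - cost_bound) + eps
    return math.sqrt(counts[arm]) * min(reward_gap, cost_gap)
```

Whether a particular estimate preserves asymptotic optimality depends on the sufficient condition discussed in the text; this sketch makes no such guarantee.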
In the next section, we discuss a general sufficient condition that makes CAPT-E achieve the asymptotic optimality and some example methods for estimation.
We again assume that and . Fix in and fix as the last time was played. Notice that as .
For CAPT-E to achieve asymptotic optimality, the following condition is sufficient: the relative distance to the optimal value from the reward sample-mean of each action at the horizon , , approaches the true value as approaches infinity. More precisely, if for all as , then the probability that the output by CAPT-E at is an -competing set converges to one as .
We argue now that the above statement is indeed true. As in the proof of Theorem III.1, let us define an event (from CAPT-E) such that with given and ,
On , because for all , for all where we define and It follows that
Multiplying both sides by and using and rearranging the terms lead to
An upper bound on for is obtained by
Combining the lower and the upper bounds, we have that for all ,