 # An Index-based Deterministic Asymptotically Optimal Algorithm for Constrained Multi-armed Bandit Problems

For the constrained multi-armed bandit model, we show by construction that there exists an index-based deterministic asymptotically optimal algorithm. Optimality here means that the probability of choosing an optimal feasible arm converges to one as the horizon grows to infinity. The algorithm is built upon Locatelli et al.'s "anytime parameter-free thresholding" algorithm under the assumption that the optimal value is known. We provide a finite-time bound on the probability of the asymptotic optimality, given as $1 - O(|A| T e^{-T})$, where $T$ is the horizon size and $A$ is the set of arms in the bandit. We then study a relaxed version of the algorithm, in a general form that estimates the optimal value, and discuss with examples its asymptotic optimality after a sufficiently large $T$.


## I Introduction

Consider a constrained multi-armed bandit (CMAB) problem where there is a finite set of arms, $A$, and a single arm in $A$ needs to be sequentially played. When $a$ in $A$ is played at discrete time $t$, the player not only obtains a sample bounded reward drawn from an unknown reward-distribution associated with $a$, whose unknown finite expectation and finite variance are $\mu_a$ and , respectively, but also obtains a sample bounded cost drawn from an unknown cost-distribution associated with $a$, whose unknown expectation and variance are $C_a$ and , respectively. Sample rewards and costs across arms are all independent for all time steps. That is, , and are independent for all and all . For any fixed $a$ in $A$, ’s and ’s for are identically distributed, respectively. We define the feasible set of arms such that for a constant $C$ known to the player and assume that . Our goal is to find an optimal feasible arm that achieves the optimal value . (For simplicity, we consider the case of one constraint. It is straightforward to extend our results to the case of multiple constraints.)

The unconstrained MAB model has been used for studying many (practical) problems (see, e.g.,  for in-depth coverage of the topic and examples). However, there also exist related MAB problems that involve one or more objective functions conflicting with the main objective function. These conflicting objective functions play the role of constraints for optimizing the main objective function in CMAB problems. For example, a trade-off exists between achieving a “small” delay (or “high” throughput) and “low” power consumption in wireless communication networks. To maximize the throughput (or to minimize the delay), we need to transmit at the highest available power level because doing so increases the probability of successful transmission. On the other hand, to minimize the power consumption, we need to transmit at the lowest available power level. We can consider the problem of selecting an optimal feasible power level, among all available power levels, that keeps the delay cost below some given bound. In fact, many scheduling and queueing control problems exhibit such trade-offs between “throughput” and “delay” in general.

We define an algorithm as a sequence of mappings such that maps from the set of past plays and rewards and costs, if and if , to the set . We denote the set of all possible such deterministic algorithms as . The asymptotic optimality introduced by Robbins  for the optimality in terms of transient behavior of an algorithm will be used as the measure of the performance. Let and let denote the arm selected by at time . Given in , we say that is asymptotically optimal if as .

This note begins with presenting an algorithm, called “Constrained Anytime Parameter-free Thresholding (CAPT),” in order to show that by construction there exists an asymptotically optimal index-based deterministic algorithm in . It has been conjectured that devising such an algorithm is difficult  because the question of how to “mix” a process of estimating the feasibility of each arm into the exploration (and possibly exploitation) process of estimating the reward-optimality of each feasible arm by one deterministic index needs to be answered. This note provides an affirmative report to the question.

To approach a CMAB problem, instead of searching a proper index for each action, one can consider a methodology that “separates” the two associated problems of CMAB “in time” by solving the cost-feasibility problem first and then solves the reward-optimality problem conditioned on the results about the feasibility, eventually achieving the asymptotic optimality. The immediate open questions are firstly if the method is in (even with putting aside an index-based selection) because a strong negative argument is that the method would provide only a probabilistic judgement in order to change the tune and secondly if the algorithm is analyzable in terms of the asymptotic optimality. A randomized strategy recently studied by Chang  extends the -greedy MAB strategy  and works with the underpinning exploration method of uniform random-selection. It is worthwhile to note that the arguments in the analysis regarding the asymptotic optimality of the strategy were provided with the very above idea of separating the estimation process in the bandit in time by the probabilistic estimation result for the feasibility. This is from the ground-level process of the uniform selection that allows conditioning a “guaranteed level” of the feasibility estimation for a given time. With controlling the values of used over time by the strategy, it was proved that the strategy achieves the asymptotic optimality after a sufficiently large horizon. The strategy is simple even if randomized but not index-based. Still, the strategy provides an important direction towards designing solution methods for CMAB problems.

In addition, a potentially profitable functionality missed by the constrained -greedy strategy is the usage of problem-characteristics about the reward-optimality and the cost-feasibility. Let us define and where . The role of is to provide some tolerance in what we measure. These values can guide us for determining degrees of allocating samples over arms. The more closely competitive and feasible arms there are, the more difficult problem is in general. For example, if the second best arm becomes very competitive with the best, the harder distinguishing the best from the second best is. If becomes closer to zero, checking the feasibility of becomes more difficult. More sampling efforts need to be put in distinguishing closely feasible and competitive arms. Indeed, the probabilities related with the convergences have been expressed in terms of a function of ’s, called “complexity of the problem” (see, e.g.,    , etc. and also confer with, e.g.,  about related but different complexity measures). Noticeably, Locatelli et al.  developed an index-based deterministic algorithm, called “Anytime Parameter-free Thresholding (APT)” for “thresholding bandit” problems by using ’s. It turns out that the problem considered exactly coincides with the cost-feasibility problem in CMAB. In particular, the index of in is given by where is an estimate of obtained by replacing by the sample mean up to time by cost-samples of playing . The number of times has been played up to time is denoted by . The index measures “to what degree needs to be sampled” at time of APT and APT plays an arm in the argument set that achieves the minimum index. We can see that the index has a structure that the arm selection is affected by the values of and together.
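As a reading aid, the gap quantities behind the index can be written out explicitly. The forms below are an inference from how $\Delta^{\epsilon}_i$ and $\Phi^{\epsilon}_i$ are expanded in the proof of Theorem III.1 and from the analogous quantity in APT, not a quotation of the original definitions; the symbols $\mu^*$ (optimal value), $C$ (cost threshold) and $\epsilon$ (tolerance) are assumed to carry the meaning used elsewhere in the text.

$$
\Delta^{\epsilon}_a := |\mu_a - \mu^*| + \epsilon, \qquad
\Phi^{\epsilon}_a := |C_a - C| + \epsilon,
$$

$$
\bar{\Delta}^{\epsilon}_a(t) := \left|\bar{X}_{T_a(t)} - \mu^*\right| + \epsilon, \qquad
\bar{\Phi}^{\epsilon}_a(t) := \left|\bar{Y}_{T_a(t)} - C\right| + \epsilon .
$$

Under this reading, with $\mu^* = 0.5$, $C = 0.5$ and $\epsilon = 0.1$, an arm with $\mu_a = 0.9$ and $C_a = 0.8$ has $\Delta^{\epsilon}_a = 0.5$ and $\Phi^{\epsilon}_a = 0.4$, so the cost-feasibility gap is the smaller of the two.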

The CAPT algorithm presented in Section III employs the same form of the index of APT but with some extension: the index of is given as

$$K_a(t) := \min\left(\bar{\Delta}^{\epsilon}_a(t), \bar{\Phi}^{\epsilon}_a(t)\right)\sqrt{T_a(t)}.$$

The term is an estimate of . Similar to , the sample mean of up to time by reward-samples is used in place of in . The idea is simple. We pose the reward-optimality problem as another cost-feasibility problem. Each index for the two feasibility problems are combined into a new index by the minimum operator. The index then measures not only to what degree needs to be sampled for cost-feasibility but also for reward-optimality. We prove that CAPT constructed from this simple fusion achieves the asymptotic optimality. We provide a finite-time lower bound to the probability of finding an optimal feasible action with for a given finite horizon .

CAPT works with the crucial assumption that the optimal value is known because needs to be computed. However, we argue that this theoretical study is an important step towards understanding the solvability and the complexity of CMAB. In fact, the procedures of some algorithms for MAB in the literature were given with the optimal value (or a functional value of it or a known bound to it) as an input parameter and accordingly analyzed (see, e.g., Theorem 3 for the -greedy algorithm in , Theorem 3 for APT in , Theorem 1 for UCB-E in , Theorem 2 and the related work section in , etc.).

In Section IV, we study an algorithm in a general form, called CAPT-E (CAPT with Estimation), where the index of CAPT is replaced by such that

$$\kappa_a(t) := \min\left(\bar{\Delta}^{\epsilon,*}_a(t), \bar{\Phi}^{\epsilon}_a(t)\right)\sqrt{T_a(t)},$$

where is an estimate of that substitutes into . We discuss a sufficient condition that makes CAPT-E achieve the asymptotic optimality after a sufficiently large and some examples of .

The main goal of this note is to establish the existence of an asymptotically optimal index-based deterministic algorithm in and to provide a theoretical characterization about the CMAB problems (as in Theorem 2 of Lai and Robbins ). The performance of CAPT would be a baseline for comparison or improvement for an index-based algorithm in with the criterion of the asymptotic optimality. Finally, we show that the critical assumption can be relaxed and open some direction for further research in developing algorithms for solving CMAB problems.

## II Related Works

The model of CMAB is a special case of the constrained Markov decision process (CMDP) , in which we assume that all of the distributions of rewards and costs associated with all arms are unknown to the decision maker. Because of this assumption, exact solution methods, e.g., linear programming, are not applicable for solving CMAB problems.

Much attention has been paid recently to the model, “Budgeted MAB (BMAB),” that adds a certain constraint for optimality (see, e.g., ). In our terms, consider a random variable that takes the value of the sum of the random costs obtained by running an algorithm in over horizon, i.e., . Let the stopping time where is a problem parameter called the budget. The player stops playing at once it consumes all of the budget given by . Take the expected value of the sum of the random rewards obtained by following over the sample path of length . We wish to maximize the expected value over all possible . In other words, the goal is to obtain or an algorithm that achieves it. The key difference from CMAB is that in BMAB, the budget constraint is put on the played arm sequence. Furthermore, while CMAB is a special case of CMDP as we mentioned before, it seems that BMAB is not directly related to CMDP.

Constrained simulation optimization, under the topic of “constrained ranking and selection,” considers a similar simulation setting where the values of objective and constraint functions can be obtained only by a sequential sampling process. However, we do not draw multiple samples of reward and cost at a single time step, we make no particular assumptions on the reward and cost distributions (e.g., normality), and we do not compute a sampling plan or sampling allocation in advance. These, or a subset of these, are common assumptions and approaches in that literature (see, e.g.,  and the references therein).

Various measures of studying the behaviours of the MAB algorithms exist (see, e.g., a discussion in ). The most notable ones are probably the expected regret  for average behaviour and the asymptotic optimality for transient behaviour. Auer et al.  relates the asymptotic optimality with “instantaneous” regret given as and note that the instantaneous regret is a stronger notion than the expected regret in the convergence. The asymptotic optimality is directly related with the probability of identifying a best arm    in the so-called “pure exploration” problem. In the simulation optimization literature, the probability has been often referred to as the probability of correct selection (see, e.g., , etc.). The different reference for the probability seems to depend on the context of the problem topic under study in the relevant literature.

The literature in MAB has rather focused on the expected regret since the work of Lai and Robbins  and more particularly since Auer et al.’s finite-time analysis on index-based algorithms  (see, e.g.,  and the references therein). It is difficult to find a work that studies the instantaneous behaviour of the existing MAB algorithms designed for the expected regret, e.g., UCB  or its variants . That is, even if the expected behaviour of an algorithm relative to the best algorithm has been extensively studied in the literature, the expected behavior of the algorithm itself seems to be not known yet. Note that obtaining the expected behavior of , for UCB essentially requires analyzing the transient behavior of UCB, i.e., the probability of .

Defining the expected regret within CMAB is not straightforward. Suppose we define it as the expected loss relative to the cumulative expected reward of taking an optimal feasible arm, where the loss arises because the algorithm does not always play an optimal feasible arm. Then, for in , the loss can be negative. The problem of minimizing the regret is no longer meaningful because this is like having a negative cycle in a shortest-path problem. In some cases, the minimum is simply achieved by an algorithm that always plays an infeasible arm whose reward average is higher than . A possible remedy would be introducing a function over that penalizes infeasibility to some degree inside the summation. The design and analysis of proper algorithms will depend on how the expected “regret” is defined. The study of the expected regret in CMAB is beyond the scope of this note and is left for future research.

## III Constrained APT Algorithm

### III-A Algorithm

Once in is played by CAPT (referred to as wherever possible) at time , a sample reward of and a sample cost of are obtained independently. We let where denotes the indicator function, i.e., if and 0 otherwise. The sample average-reward for in is then given such that if and 0 otherwise. Similarly, for in is given such that if and 0 otherwise. Note that and for all . We let and for . Pseudocode for CAPT is provided below.

The Constrained APT (CAPT) algorithm

• Initialization:
  • Select .
  • From to , play each once and obtain and independently.
  • Set for all and .

• Loop while
  • Play .
  • Obtain and independently and and .

• Output:
  • Obtain and .
  • Output .
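The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function name `capt`, the representation of arms as zero-argument callables, and the tie-breaking rule are hypothetical; the gap forms $|\bar{X} - \mu^*| + \epsilon$ and $|\bar{Y} - C| + \epsilon$ are inferred from the proof of Theorem III.1; and the horizon $T$ is assumed to count the initialization plays.

```python
import math


def capt(arms, mu_star, C, eps, T):
    """Sketch of CAPT for arms given as zero-argument callables.

    Each call arms[a]() must return one (reward, cost) sample in [0, 1]^2.
    mu_star is the (assumed known) optimal value, C the cost threshold,
    eps > 0 the tolerance, and T the total number of plays.
    """
    n = len(arms)
    if T < n:
        raise ValueError("horizon T must be at least the number of arms")
    pulls = [0] * n      # T_a(t): number of plays of arm a so far
    rsum = [0.0] * n     # running sum of reward samples per arm
    csum = [0.0] * n     # running sum of cost samples per arm

    def play(a):
        reward, cost = arms[a]()
        pulls[a] += 1
        rsum[a] += reward
        csum[a] += cost

    def index(a):
        # K_a(t) = min(reward gap, cost gap) * sqrt(T_a(t)), with sample means
        # standing in for the unknown expectations (assumed gap forms).
        reward_gap = abs(rsum[a] / pulls[a] - mu_star) + eps
        cost_gap = abs(csum[a] / pulls[a] - C) + eps
        return min(reward_gap, cost_gap) * math.sqrt(pulls[a])

    # Initialization: play every arm once.
    for a in range(n):
        play(a)
    # Loop: play an arm with the minimum index (ties go to the lowest arm id).
    for _ in range(n, T):
        play(min(range(n), key=index))

    # Output: arms whose sample means look both reward-optimal and cost-feasible.
    selected = [a for a in range(n)
                if rsum[a] / pulls[a] >= mu_star and csum[a] / pulls[a] <= C]
    return selected, pulls
```

To try it with random arms, each arm can be a closure over a seeded `random.Random` that returns, for example, a pair of Bernoulli draws; the returned `pulls` list then shows how the index concentrates samples on arms whose reward or cost gap is small.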

### III-B Asymptotic Optimality

To analyze the behavior of CAPT, we start with the definition of a set of approximately feasible arms: For a given , . Given , any set in is referred to as an -feasible set of arms if , where is the power set of . An arm in is -feasible if is an -feasible set. We also define a set of competing (optimality-candidate) arms: For a given , . Given , any set in is referred to as a -competing set of arms if . An arm in a -competing set is -competing. Note that a -competing arm is not necessarily feasible and that for any given -feasible set and -competing set , . The set in the previous identity is said to be a -optimal set and an arm in is -optimal.

In the sequel, we consider the case where and refer to an -optimal set as just an -optimal set. An arm in an -optimal set is -optimal. If , the -feasible set corresponds to and the -competing set is equal to , and the intersection of the two sets is equal to the solution set of .

The theorem below states a finite-time lower bound on the probability that produced by CAPT at in the Output step is an -optimal set under some general conditions. The bound is given in terms of a problem complexity denoted by . This complexity is intuitive: the performance of CAPT depends on the sum of the degrees of hardness of each action between the cost-feasibility problem and the reward-optimality problem. Note that if , becomes infinite because for some . If the problem contains an arm that satisfies the constraint with equality such that , becomes infinite again. Therefore, we exclude such cases by requiring that but can be arbitrarily close to zero.

The assumption that in the theorem statement is due to a technical reason: Obviously, to make CAPT run, the condition that is necessary due to the Initialization step. We further observe that there always exists in such that . Suppose not. Then , which is a contradiction. By then, we can fix an action that satisfies the bound of and that has been played at least two times by and can use the inequality in “cleaning up” some terms to eventually obtain a bound on . In addition, we impose the condition that and are in for any and for the better exposition.

###### Theorem III.1

Assume that the reward and the cost distributions associated with all arms in have the support in . Then for any and , the output by CAPT at satisfies

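$$\Pr\left\{A^{\epsilon}_{*} \cap A^{-\epsilon}_{f} \subseteq A^{*}_{T}(\epsilon) \cap A^{f}_{T}(\epsilon) \subseteq A^{-\epsilon}_{*} \cap A^{\epsilon}_{f}\right\} \geq 1 - 2|A| T e^{-T/(16H(\epsilon))}.$$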

Before presenting the proof, we remark that the idea of the proof basically follows the reasoning in the proof of Theorem 2 by Locatelli et al.  since CAPT is built upon APT. The proof here, however, requires more care because a different index has to be manipulated. We also polish some arguments given in . In particular, the simpler Hoeffding inequality  is applied where a lower bound on some probability is obtained, instead of the unidentifiable “Sub-Gaussian martingale inequality” referred to by Locatelli et al. The lower bound with the term of in our result is looser than the stated lower bound with to a related probability by Locatelli et al. However, the arguments of Locatelli et al. for the tighter -bound seem incomplete at the steps that apply the Union bound. In fact, Wang and Ahmed  provide a related result for the cost-feasibility problem that has the same order of . Their approach is within the context of the “sample average approximation” . Thus the method is not index-based and not adaptive. In our terms, they analyzed the probability that is an -feasible set when each action in is played times equally, that is, for all . It is not clear how the adaptive index-based approach of APT makes a jump from to in the order in Locatelli et al.’s proof. Besides, Locatelli et al.’s theorem statement includes the case of , which will lead to the non-asymptotic optimality.

Define an event such that with a given ,

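$$\xi = \left\{\forall a \in A,\ \forall T_a(T) \in \{1,2,\ldots,T\},\ \left|\bar{X}_{T_a(T)} - \mu_a\right| \leq \sqrt{\frac{T\delta^2}{H(\epsilon)T_a(T)}} \ \wedge\ \left|\bar{Y}_{T_a(T)} - C_a\right| \leq \sqrt{\frac{T\delta^2}{H(\epsilon)T_a(T)}}\right\}.$$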

Fix any in such that and fix as the smallest in such that . In other words, is the last time was played and satisfies that .

On we have that for all ,

$$\left|\bar{X}_{T_i(t)} - \mu_i\right| \leq \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}.$$

This implies that for all ,

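$$\Delta^{\epsilon}_i - \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}} \leq \bar{\Delta}^{\epsilon}_i(t) \leq \Delta^{\epsilon}_i + \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}$$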

because for all , where we recall and

Similarly, for all ,

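$$\Phi^{\epsilon}_i - \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}} \leq \bar{\Phi}^{\epsilon}_i(t) \leq \Phi^{\epsilon}_i + \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}$$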

where we recall and

Because was played at , achieves the value of the minimum index, i.e., for all . Recall that

$$K_{a^*}(t) = \min\left(\bar{\Delta}^{\epsilon}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right)\sqrt{T_{a^*}(t)}.$$

From the two inequalities of and , it follows that

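$$\min\left(\Delta^{\epsilon}_{a^*} - \sqrt{\frac{T\delta^2}{H(\epsilon)T_{a^*}(t)}},\ \Phi^{\epsilon}_{a^*} - \sqrt{\frac{T\delta^2}{H(\epsilon)T_{a^*}(t)}}\right) \leq \min\left(\bar{\Delta}^{\epsilon}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right).$$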

Thus we have that

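$$\min\left(\Delta^{\epsilon}_{a^*}, \Phi^{\epsilon}_{a^*}\right) - \sqrt{\frac{T\delta^2}{H(\epsilon)T_{a^*}(t)}} \leq \min\left(\bar{\Delta}^{\epsilon}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right).$$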

Multiplying both sides of the above inequality by and using and rearranging the terms leads to a lower bound to :

$$\left(\frac{1}{\sqrt{2}} - \delta\right)\sqrt{\frac{T}{H(\epsilon)}} \leq K_{a^*}(t). \tag{1}$$

We now upper bound for any . From the inequality for the bound of , we have that

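$$K_i(t) = \min\left(\bar{\Delta}^{\epsilon}_i(t), \bar{\Phi}^{\epsilon}_i(t)\right)\sqrt{T_i(t)} \leq \bar{\Delta}^{\epsilon}_i(t)\sqrt{T_i(t)} \leq \left(\Delta^{\epsilon}_i + \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}\right)\sqrt{T_i(t)}. \tag{2}$$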

Combining (1) and (2) results in

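$$\left(\frac{1}{\sqrt{2}} - \delta\right)\sqrt{\frac{T}{H(\epsilon)}} \leq \Delta^{\epsilon}_i\sqrt{T_i(t)} + \delta\sqrt{\frac{T}{H(\epsilon)}}$$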

for all . Rearranging the terms in the above inequality and from ,

$$\frac{\left(1 - 2\sqrt{2}\delta\right)^2 T}{2H(\epsilon)\left(\Delta^{\epsilon}_i\right)^2} \leq T_i(T).$$

In sum, implies that for any ,

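$$\mu_i - \Delta^{\epsilon}_i \times \frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta} \leq \bar{X}_{T_i(T)} \leq \mu_i + \Delta^{\epsilon}_i \times \frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta}. \tag{3}$$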

Set by letting . We show that the event implies that

$$A^{\epsilon}_{*} \subseteq \left\{j \in A \,\middle|\, \bar{X}_{T_j(T)} \geq \mu^*\right\} \subseteq A^{-\epsilon}_{*}.$$

For any such that , . By ,

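$$\bar{X}_{T_i(T)} - \mu^* \geq \mu_i - \frac{1}{2}\Delta^{\epsilon}_i - \mu^* = \mu_i - \frac{1}{2}\left(\mu_i - \mu^* + \epsilon\right) - \mu^* \geq 0$$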

making . On the other hand, for any such that , and this results in .
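The constant behind the factor $\frac{1}{2}$ in the bounds above can be recovered by a short calculation. The following is a sketch inferred from that factor and from the exponent in Theorem III.1, since the value of $\delta$ is not legible in this text:

$$
\frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta} = \frac{1}{2}
\iff 2\sqrt{2}\delta = 1 - 2\sqrt{2}\delta
\iff \delta = \frac{1}{4\sqrt{2}},
\qquad\text{so}\qquad
\delta^2 = \frac{1}{32}
\quad\text{and}\quad
\frac{2T\delta^2}{H(\epsilon)} = \frac{T}{16H(\epsilon)} .
$$

With this choice, $1 - 2\sqrt{2}\delta = \frac{1}{2} > 0$, so the lower bound on $T_i(T)$ derived before (3) is non-trivial.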

We next consider the cost-feasibility case. By the same method as in (2), on we have that for any ,

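$$K_i(t) \leq \bar{\Phi}^{\epsilon}_i(t)\sqrt{T_i(t)} \leq \left(\Phi^{\epsilon}_i + \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}\right)\sqrt{T_i(t)}.$$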

Then following the similar arguments as in the reward-optimality case leads to the inequality of

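$$C_i - \Phi^{\epsilon}_i \times \frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta} \leq \bar{Y}_{T_i(T)} \leq C_i + \Phi^{\epsilon}_i \times \frac{\sqrt{2}\delta}{1 - 2\sqrt{2}\delta}.$$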

With , for any such that , and it follows that because

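$$\bar{Y}_{T_i(T)} - C \geq C_i - \frac{1}{2}\Phi^{\epsilon}_i - C = C_i - \frac{1}{2}\left(C_i - C + \epsilon\right) - C = \frac{1}{2}\left(C_i - C - \epsilon\right) > 0$$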

by . Furthermore, for any such that , . Because

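$$\bar{Y}_{T_j(T)} - C \leq C_j + \frac{1}{2}\Phi^{\epsilon}_j - C = C_j + \frac{1}{2}\left(C - C_j + \epsilon\right) - C = \frac{1}{2}\left(C_j - C + \epsilon\right) \leq 0,$$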

. It follows that

Putting the reward-optimality and the cost-feasibility arguments together (by independence), implies that

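$$A^{\epsilon}_{*} \subseteq \left\{j \in A \,\middle|\, \bar{X}_{T_j(T)} \geq \mu^*\right\} \subseteq A^{-\epsilon}_{*} \quad\text{and}\quad A^{-\epsilon}_{f} \subseteq \left\{j \in A \,\middle|\, \bar{Y}_{T_j(T)} \leq C\right\} \subseteq A^{\epsilon}_{f}.$$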

By applying the Union bound (Boole’s inequality) and Hoeffding inequality , the probability of is lower bounded as follows:

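$$
\begin{aligned}
\Pr(\xi) = 1 - \Pr(\xi^c) &\geq 1 - \sum_{a \in A}\sum_{T_a(T)=1}^{T}\left(\Pr\left\{\left|\bar{X}_{T_a(T)} - \mu_a\right| > \sqrt{\frac{T\delta^2}{H(\epsilon)T_a(T)}}\right\} + \Pr\left\{\left|\bar{Y}_{T_a(T)} - C_a\right| > \sqrt{\frac{T\delta^2}{H(\epsilon)T_a(T)}}\right\}\right) \\
&\geq 1 - |A| T e^{-2T\delta^2/H(\epsilon)} - |A| T e^{-2T\delta^2/H(\epsilon)} = 1 - 2|A| T e^{-T/(16H(\epsilon))}.
\end{aligned}
$$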
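To get a feel for the scale of the bound in Theorem III.1, it can be evaluated numerically. The sketch below rests on an assumption that should be checked against the original: the complexity is taken to be $H(\epsilon) = \sum_{a \in A} \min(\Delta^{\epsilon}_a, \Phi^{\epsilon}_a)^{-2}$ with $\Delta^{\epsilon}_a = |\mu_a - \mu^*| + \epsilon$ and $\Phi^{\epsilon}_a = |C_a - C| + \epsilon$, which is how $H(\epsilon)$ appears to enter the proof. The function names `complexity` and `capt_lower_bound` are hypothetical.

```python
import math


def complexity(mu, cost, mu_star, C, eps):
    """Assumed form of H(eps): sum over arms of min(reward gap, cost gap)^-2."""
    if eps <= 0:
        # eps = 0 can make a gap zero and H infinite, the case the text excludes.
        raise ValueError("eps must be positive")
    total = 0.0
    for m, c in zip(mu, cost):
        gap = min(abs(m - mu_star) + eps, abs(c - C) + eps)
        total += 1.0 / gap ** 2
    return total


def capt_lower_bound(n_arms, T, H):
    """Evaluate 1 - 2|A| T exp(-T / (16 H)); negative values are vacuous."""
    return 1.0 - 2.0 * n_arms * T * math.exp(-T / (16.0 * H))
```

For small horizons the expression is negative and therefore says nothing; it becomes informative only once $T$ is large relative to $16H(\epsilon)\log(2|A|T)$, which shows how strongly closely competitive or nearly infeasible arms inflate the required horizon.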

## IV CAPT with Estimation Algorithm

In this section, we provide an algorithm in a general form that replaces with where denotes the estimate of at . We call the algorithm “CAPT with Estimation” (CAPT-E) and refer to it as wherever possible. We discuss two examples below for the estimation.

### IV-A Algorithm

The procedure is the same as that of CAPT except that the role of is replaced by . In particular, in CAPT is replaced with where . The set in the Output step of CAPT is also replaced with the set . We reuse the notation of the previous section with slight abuse.

The CAPT with Estimation (CAPT-E) algorithm

• Initialization of CAPT

• Loop while
  • Play .
  • Obtain and independently and and .

• Output:
  • Obtain and .
  • Output .
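The CAPT-E procedure can be sketched in the same style, with the known optimal value replaced by a running estimate. Everything specific here is an assumption made for illustration: the names `capt_e` and `plug_in_mu_star` are hypothetical, and the plug-in estimator (the best sample-mean reward among arms whose sample-mean cost is at most $C$) is only one conceivable choice, not necessarily one of the examples the text goes on to discuss.

```python
import math


def plug_in_mu_star(xbar, ybar, C):
    """Hypothetical estimator: best sample-mean reward among arms that
    currently look feasible; falls back to all arms if none does."""
    feasible = [x for x, y in zip(xbar, ybar) if y <= C]
    return max(feasible) if feasible else max(xbar)


def capt_e(arms, C, eps, T, estimate=plug_in_mu_star):
    """Sketch of CAPT-E; arms are zero-argument callables returning
    one (reward, cost) sample each, and T is the total number of plays."""
    n = len(arms)
    if T < n:
        raise ValueError("horizon T must be at least the number of arms")
    pulls, rsum, csum = [0] * n, [0.0] * n, [0.0] * n

    def play(a):
        reward, cost = arms[a]()
        pulls[a] += 1
        rsum[a] += reward
        csum[a] += cost

    def means():
        xbar = [rsum[a] / pulls[a] for a in range(n)]
        ybar = [csum[a] / pulls[a] for a in range(n)]
        return xbar, ybar

    # Initialization of CAPT: play every arm once.
    for a in range(n):
        play(a)
    # Loop: same index as CAPT, but with the estimate in place of mu_star.
    for _ in range(n, T):
        xbar, ybar = means()
        mu_hat = estimate(xbar, ybar, C)
        play(min(range(n), key=lambda i: min(abs(xbar[i] - mu_hat) + eps,
                                             abs(ybar[i] - C) + eps)
                 * math.sqrt(pulls[i])))

    # Output: compare sample means against the final estimate and C.
    xbar, ybar = means()
    mu_hat = estimate(xbar, ybar, C)
    selected = [a for a in range(n) if xbar[a] >= mu_hat and ybar[a] <= C]
    return selected, mu_hat, pulls
```

Passing the estimator as a parameter keeps the sketch in the "general form" the section describes: any rule that maps the current sample means to an estimate of the optimal value can be swapped in without touching the loop.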

In the next section, we discuss a general sufficient condition that makes CAPT-E achieve the asymptotic optimality and some example methods for estimation.

### IV-B Convergence

We again assume that and . Fix in and fix as the last time was played. Notice that as .

Obviously, in order for CAPT-E to achieve the asymptotic optimality, the following condition is sufficient: the relative distance to the optimal value from the reward sample-mean of each action at the horizon , , approaches the true value as approaches infinity. More precisely, if for all as , then the probability that the output by CAPT-E at is an -competing set converges to one as .

We argue now that the above statement is indeed true. As in the proof of Theorem III.1, let us define an event (from CAPT-E) such that with given and ,

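$$\xi = \left\{\forall a \in A,\ \forall T_a(T) \in \{1,2,\ldots,T\},\ \left|\bar{X}_{T_a(T)} - \mu_a\right| \leq \sqrt{\frac{T\delta^2}{H(\epsilon)T_a(T)}} \ \wedge\ \left|\bar{Y}_{T_a(T)} - C_a\right| \leq \sqrt{\frac{T\delta_f^2}{H(\epsilon)T_a(T)}}\right\}.$$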

On , because for all , for all where we define and It follows that

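$$\min\left(\Delta^{\epsilon,*}_{a^*}(t), \Phi^{\epsilon}_{a^*}\right) - \sqrt{\frac{T\delta^2}{H(\epsilon)T_{a^*}(t)}} \leq \min\left(\bar{\Delta}^{\epsilon,*}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right).$$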

Multiplying both sides by and using and rearranging the terms lead to

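$$\left(\frac{1}{\sqrt{2}} \times \frac{\min\left(\bar{\Delta}^{\epsilon,*}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right)}{\min\left(\Delta^{\epsilon}_{a^*}, \Phi^{\epsilon}_{a^*}\right)} - \delta\right)\sqrt{\frac{T}{H(\epsilon)}} \leq \kappa_{a^*}(t). \tag{4}$$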

An upper bound on for is obtained by

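$$\kappa_i(t) = \min\left(\bar{\Delta}^{\epsilon,*}_i(t), \bar{\Phi}^{\epsilon}_i(t)\right)\sqrt{T_i(t)} \leq \bar{\Delta}^{\epsilon,*}_i(t)\sqrt{T_i(t)} \leq \left(\Delta^{\epsilon,*}_i(t) + \sqrt{\frac{T\delta^2}{H(\epsilon)T_i(t)}}\right)\sqrt{T_i(t)}.$$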

Let

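$$f_{a^*}(t) = \frac{\min\left(\bar{\Delta}^{\epsilon,*}_{a^*}(t), \bar{\Phi}^{\epsilon}_{a^*}(t)\right)}{\min\left(\Delta^{\epsilon}_{a^*}, \Phi^{\epsilon}_{a^*}\right)}.$$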

Combining the lower and the upper bounds, we have that for all ,

 (1√2f