# Continuous Time Bandits With Sampling Costs

We consider a continuous-time multi-arm bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a large reward and incurring sampling cost as a function of the sampling frequency. The goal is to design a learning algorithm that minimizes regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-arm bandit problem (MAB); e.g., even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2/\mu)$, where $\mu$ is the mean of the arm and $T$ is the time horizon. For the multiple arms case, we show that the lower bound on the regret is $\Omega((\log T)^2 \mu/\Delta^2)$, where $\mu$ now represents the mean of the best arm, and $\Delta$ is the difference between the means of the best and the second-best arm. We then propose an algorithm that achieves the bound up to constant terms.

## 1 Introduction

The classical discrete-time multi-arm bandit (DMAB) is a versatile learning problem (Bubeck et al., 2012; Lattimore and Szepesvári, 2019) that has been extensively studied in the literature. By discrete-time, we mean that there are $T$ discrete slots in total, and in each slot, a learning algorithm can choose to ‘play’ any one of the $K$ possible arms. If the learning algorithm plays arm $k$ in slot $t$, it gets a random reward with mean $\mu_k$, independent of the slot index. The total reward of the learning algorithm at the end of the $T$ slots is the random reward accumulated over the slots, which is compared against the reward of the oracle policy, which knows the true means of the arms throughout and thus always plays the best arm (the arm with the highest mean).

The typically considered performance metric is regret, that is defined as the expected difference of the total reward of the oracle policy and that of a learning algorithm. The optimal regret for the DMAB problem is known to be

, when the rewards are Bernoulli distributed or follow any sub-Gaussian distribution, and

is the sub-optimality gap. Multiple algorithms, such as UCB (Auer et al., 2002) and Thompson sampling (Agrawal and Goyal, 2012), are known to achieve the asymptotically optimal regret.

In this paper, we consider a continuous-time multi-arm bandit problem (CTMAB) that is well motivated from pertinent applications discussed later in this section. In particular, in CTMAB, the total time horizon is , and there are arms. The main distinction between the DMAB and the CTMAB is that with the CTMAB, an arm can be sampled/played at any (continuous) time before . Once the sampling/playing time is selected, similar to the DMAB problem, any one of arms can be played, and if arm is played at time , the learning algorithm gets a random reward with mean independent of the time . Without loss of generality, we let .

Without any other restriction, an algorithm can play arms an infinite number of times in the time horizon $T$ by repeatedly playing arms at arbitrarily small intervals. Thus, to make the problem meaningful, we account for a sampling (arm playing) cost that depends on how often any arm is sampled. Specifically, if two consecutive plays (of any two arms) are made at times $t$ and $t + \Delta t$, then the sampling cost for the interval $[t, t + \Delta t]$ is $f(\Delta t)$, where $f$ is a decreasing function. The total sampling cost is defined as the sum of the sampling costs over the total time horizon $T$. The sampling cost penalizes high-frequency sampling, i.e., the higher the frequency of consecutive plays, the higher the sampling cost. Letting the sampling cost depend instead on consecutive plays of a specific arm results in multiple decoupled single-arm problems, and is thus a special case of the considered problem.

The overall payoff in CTMAB is then defined as the accumulated random reward obtained from each sample minus the total sampling cost over the time horizon . The regret is defined as usual, the expected difference between the overall payoff of the oracle policy and that of a learning algorithm. There is a natural tradeoff in CTMAB, higher frequency sampling increases the accumulated reward but also increases the sampling cost at the same time. Compared to DMAB, where an algorithm has to decide which arm to play next at each discrete time, in CTMAB, there are two decisions to be made, given the history of decisions and the current payoff (reward minus the sampling cost); i) which arm to play next, and ii) at what time.

It is worth noting that CTMAB cannot be posed as a special case of DMAB, since in DMAB with time slots, the number of samples that can be obtained is at most . Whereas, in CTMAB, there is no exogenous restriction on the number of samples an algorithm can obtain in a time horizon , and we get a general model to capture the tradeoff between the reward and the sampling cost. In fact, DMAB can be posed as a special case of CTMAB by putting appropriate restrictions on the rate of sampling and ignoring the sampling cost.

We next discuss some practical motivations for considering the CTMAB. We first motivate the CTMAB in the single arm case () which itself is non-trivial. Consider that there is a single agent (human/machine) that processes same types of jobs (e.g. working in call center, data entry operation, sorting/classification job etc.) with an inherent random quality of work , and random per-job (unknown) utility . The agent demands a payment depending on how frequently it is asked to accomplish a job (Gong and Shroff, 2019). Alternatively, as argued in (Gopalakrishnan et al., 2016), the quality of agents’ work suffers depending on the speed of operation/load experienced. Thus, the payoff is the total accumulated utility minus the payment or the speed penalty (that depends on the frequency of work), and the objective is to find the optimal speed of operation, to maximize the overall payoff.

To motivate the CTMAB with multiple arms, the natural extension of this model is to consider a platform or aggregator that has multiple agents, each having a random work quality and corresponding per-job utility. The random utility of any agent can be estimated by assigning jobs to it and observing its outputs. The platform charges a cost depending on the speed (frequency) of job arrivals to the platform, that is indifferent to the actual choice of the agent being requested for processing a particular job. Given a large number of jobs to be processed in a finite time, the objective is to maximize the overall payoff; the total accumulated utility minus the payment made to the platform. When the platform cost is the sum of the cost across agents, where each agent cost depends on the rate at which it processes jobs, the problem decouples into multiple single arm CTMAB problems.

Another example where CTMAB is relevant comes from the supply chain management paradigm, where retailers aim to offer good quality items while keeping their operational costs low. In particular, consider a retailer who can source an item from multiple suppliers. The quality offered by each supplier is random with an unknown mean. The retailer gets to observe the quality offered by a supplier each time the order is fulfilled. By ordering the item more frequently, the retailer gets more information about the quality offered by each supplier. However, more frequent orders increases the available stock and which in turn increases the holding cost for the retailer. Thus the cost incurred by the retailer increases with the frequency of order. Then to maximize its revenues, the retailer has to not only identify which supplier offers the best quality but also needs to decide at what rate to order the items. There are few other settings, such as communication networks with multiple parallel servers (Krishnasamy et al., 2016), age of information (Rajaraman et al., 2021), and pricing under uncertain demands (Lobo and Boyd, 2003), where CTMAB is relevant.

In this paper, for ease of exposition, we assume that the sampling cost function is , i.e., if two consecutive plays are made at time and , then the sampling cost for interval is , which is intuitively appealing and satisfies natural constraints on . How to extend results for general functions is discussed in Remark 4. Under this sampling cost function, assuming arm has the highest mean , it turns out (Proposition 2) that the oracle policy always plays the best arm (arm with the highest mean) times at equal intervals in interval . Importantly, the number of samples (or sampling frequency) obtained by the oracle policy depends on the mean of the best arm.

This dependence of the oracle policy’s choice of the sampling frequency on the mean of the best arm, results in two main distinguishing features of the CTMAB compared to the DMAB problem, described as follows.

• CTMAB is non-trivial even when there is only a single arm unlike the DMAB problem, where it is trivial. The non-triviality arises since the learning algorithm for the CTMAB has to choose the time at which to obtain the next sample that depends on the mean of that arm, which is also unknown.

• In CTMAB, it is not enough to identify the optimal arm, the quality of the estimates also matters as the sampling cost depends on the quality of the estimates. In contrast, with DMAB, it is sufficient for an algorithm to identify the right ordering of the arms.

Essentially, CTMAB is a special case of a multi-objective problem, where in addition to having a penalty for misidentifying the best arm, there is additional cost of the form , where is the estimate of the mean of the best arm at time , and is some convex increasing function.

Using the fact that the oracle policy always plays the best arm times at equal intervals in time at a frequency of , it follows that the optimal payoff of the oracle policy is . Thus, in order to capture the regret of any learning algorithm, we represent the regret of an algorithm in the form for some , and try to establish tight bounds on , where can depend on any parameter of the problem, e.g, . Recall that we have assumed . The problem when means is also of interest but is easier than when , since estimates of have to be accurately estimated in CTMAB, and that becomes harder as decreases. We leave the problem with for future work.

Our contributions for the CTMAB are as follows.

1. For the single arm CTMAB, where , on the lower bound side, we show that if the regret of any online algorithm is , then , and hence the regret of any algorithm is . Exact result is little more involved and presented in Theorem 3.1.

Next, we propose an algorithm whose regret is at most . Thus, up to constant terms, the proposed algorithm has the optimal regret. The result has an intuitive appeal: as $\mu$ decreases, the regret increases, since for the CTMAB, $\mu$ has to be estimated, and that becomes harder as $\mu$ decreases. The fact that regret scales with as depends on the choice of .

2. For the general CTMAB with multiple arms, we show a lower bound that if the regret of any online algorithm for the CTMAB is , then , where , and hence the regret of any algorithm is . Exact result is presented in Theorem 5.1.

Next, we propose an algorithm whose regret is at most . Thus, order-wise, the proposed algorithm has the optimal regret. Compared to DMAB, where the optimal regret is , with CTMAB we have an extra order term in the regret.

The interpretation of the result in this case is similar to the single arm case, and the optimal regret (neglecting term) is , where the additional factor of results because of the additional requirement of identification of the best arm. Since , the regret with multiple arms is larger than the single arm case.

### 1.1 Related Works

In prior work, various cost models have been considered for the bandit learning problems. The cost could be related to the consumption of limited resources, operational, or quality of information required.

Cost of resources: In many applications (e.g., routing, scheduling) resource could be consumed as actions are applied. Various models have been explored to study learning under limited resources or cost constraints. The authors in Badanidiyuru et al. (2018) introduce Bandits with Knapsack that combines online learning with integer programming for learning under constraints. This setting has been extended to various other settings like linear contextual bandits (Agrawal and Devanur, 2016), combinatorial semi-bandits (Abinav and Slivkins, 2018), adversarial setting (Immorlica et al., 2019), cascading bandits (Zhou et al., 2018). The authors in (Combes et al., 2015) establish lower bound for budgeted bandits and develop algorithms with matching upper bounds. The case where the cost is not fixed but can vary is studied in (Ding et al., 2013).

Switching cost: Another set of works studies bandits with switching costs, where a cost is incurred when the learner switches from one arm to another (Dekel et al., 2014; Cesa-Bianchi et al., 2013). The extension to the case where partial information about the arms is available through a feedback graph is studied in (Arora et al., 2019). For a detailed survey on bandits with switching costs, we refer to (Jun, 2004).

Information cost: In many applications, the quality of information acquired depends on the associated costs (e.g., crowd-sourcing, advertising). While there is no bound on the cost incurred in these settings, the goal is to learn the optimal action while incurring minimum cost. Hanawal et al. (2015a, b) study trade-offs between cost and information in linear bandits, exploiting the smoothness properties of the rewards. Several works consider the problem of arm selection in online settings (e.g., (Trapeznikov and Saligrama, 2013; Seldin et al., 2014)) involving costs in acquiring labels (Zolghadr et al., 2013).

Variants of bandit problems where the rewards of an arm are delay dependent are studied in (Cella and Cesa-Bianchi, 2020; Pike-Burke and Grunewalder, 2019; Kleinberg and Immorlica, 2018). In these works, the mean reward of each arm is characterized as some unknown function of time. These setups differ from the CTMAB problem considered in this paper, as they deal with a discrete-time setup and do not capture the cost associated with the sampling rate of arms. Rested and restless bandit setups (Whittle, 1988) consider that the distribution of each arm changes in each round or when it is played, but do not assign any penalty on the rate of sampling.

In this work, our cost accounting is different from the above referenced prior work. The cost is related to how frequently the information/reward is collected: the higher the frequency, the higher the cost. Also, unlike in the DMAB problem, there is no limit on the number of samples collected in a given time interval; however, increasing the sampling frequency also increases the cost.

A multi-arm bandit problem, where pulling an arm excludes the pulling of any arm in the future for a random amount of time (called delay), similar to our inter-sampling time, has been considered in (György et al., 2007). However, in (György et al., 2007) the delay experienced (inter-sampling time) is an exogenous random variable, while it is a decision variable in our setup. Moreover, the problem considered in (György et al., 2007) is trivial with a single arm, similar to the usual DMAB, while it is non-trivial in our case as the accuracy of the mean estimates plays a crucial role.

## 2 The Model

There are $K$ arms and the total time horizon is $T$. At any time $t \in [0, T]$, any one of the $K$ arms can be played/sampled. On sampling arm $k$ at any time $t$, a random binary reward is obtained which follows a Bernoulli distribution with mean $\mu_k$. We consider the Bernoulli distribution here; however, all results hold for bounded distributions. If the time difference between any two consecutive samples is $\Delta t$, then the sampling cost for the interval $\Delta t$ is $f(\Delta t) = 1/\Delta t$. We make this choice for $f$ to keep the exposition simple; more general convex functions can be analysed similarly, see Remark 4. The learning algorithm is aware of $T$. The arms ordered by decreasing mean are denoted by $[1], \dots, [K]$, so that $\mu_{[1]}$ is the mean of the best arm.

Let the consecutive instants at which any arm is sampled by a learning algorithm, denoted as $\mathcal{A}$, be $t_1, t_2, \dots$, and let the inter-sampling time be $\Delta t_i = t_i - t_{i-1}$. Let $k(t_i)$ denote the arm sampled at time $t_i$. Then the instantaneous expected payoff of $\mathcal{A}$ from the $i$-th sample is given as $p_i = \mu_{k(t_i)} - \lambda/\Delta t_i$, where $\lambda$ is the trade-off parameter between the sampling cost and the reward. The cumulative expected payoff of the algorithm is given by

$$P_{\mathcal{A}}(T) = \sum_{i=1}^{N_T} p_i, \tag{1}$$

where $N_T$ is the total number of samples obtained by $\mathcal{A}$ over the horizon $T$.

The oracle policy that knows the mean values always samples the best arm . For the oracle policy, let the total number of samples obtained over the time horizon be , where the sample is obtained at time , with for , and . Then the optimal cumulative payoff obtained by the oracle policy over the horizon is given as

$$P^\star(T) = \max_{(\Delta t^o_i)_{i=1}^{N^o_T}} \ \mu_{[1]} N^o_T - \sum_{i=1}^{N^o_T} \frac{\lambda}{\Delta t^o_i}, \tag{2}$$

such that . A simple property of the cost function described next, follows from the convexity of .

###### Proposition 1

If samples are obtained in time at times with , then the cumulative sampling cost over time horizon , where is minimized if the samples are obtained at equal intervals in for any , i.e., .
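The claim of Proposition 1 can be checked numerically. The following is a minimal sketch, assuming the cost $\lambda/\Delta t$ per inter-sample gap used in this section; the helper names `sampling_cost` and `random_gaps`, the seed, and the values of `n` and `horizon` are illustrative choices, not part of the paper.

```python
import random

def sampling_cost(gaps, lam=1.0):
    """Total cost when each inter-sample gap dt costs lam / dt."""
    return sum(lam / dt for dt in gaps)

def random_gaps(n, horizon, rng):
    """n positive gaps summing to horizon, built from sorted uniform cut points."""
    cuts = sorted(rng.uniform(0.0, horizon) for _ in range(n - 1))
    points = [0.0] + cuts + [horizon]
    return [b - a for a, b in zip(points, points[1:])]

rng = random.Random(0)          # fixed seed so the run is repeatable
n, horizon = 10, 100.0

# Equal spacing: n gaps of length horizon / n, total cost n**2 / horizon.
uniform_cost = sampling_cost([horizon / n] * n)

# Cheapest of many random spacings with the same n and horizon.
best_random_cost = min(
    sampling_cost(random_gaps(n, horizon, rng)) for _ in range(1000)
)

print(uniform_cost, best_random_cost)
```

No random spacing should come in below the equal-spacing cost, which is the inequality between the arithmetic and harmonic means that underlies the convexity argument.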

The proof of Proposition 1 is immediate by noticing that $f(x) = 1/x$ is a convex function for $x > 0$, and that for a convex function $f$, $x_i = T/N$ for all $i$ is the optimal solution to $\min \sum_{i=1}^{N} f(x_i)$ subject to $\sum_{i=1}^{N} x_i = T$. Using Proposition 1, we have that the payoff of the oracle policy is

$$P^\star(T) = \max_{N^o_T} \ \mu_{[1]} N^o_T - N^o_T \frac{\lambda}{T/N^o_T} = \max_{N^o_T} \ \mu_{[1]} N^o_T - \frac{(N^o_T)^2 \lambda}{T}. \tag{3}$$

Directly optimizing (3) over $N^o_T$, we obtain the optimal number of samples obtained by the oracle policy and the corresponding optimal payoff, given in Proposition 2, assuming $\mu_{[1]} T/(2\lambda)$ to be an integer. (If it is not an integer, we check whether the floor or the ceiling is optimal and use that as the value of $N^o_T$.)

###### Proposition 2

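The oracle policy always samples arm $[1]$, $N^o_T = \mu_{[1]} T/(2\lambda)$ times in time horizon $T$ at equal intervals, i.e., at a uniform frequency of $\mu_{[1]}/(2\lambda)$. With $N^o_T = \mu_{[1]} T/(2\lambda)$, the optimal payoff (3) is given by

$$P^\star(T) = \frac{\mu_{[1]}^2 T}{4\lambda}.$$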

Note that the sampling frequency of the oracle policy depends on the mean of the best arm, which distinguishes the CTMAB from the well studied DMAB.
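The optimization of (3) can be made concrete with a short sketch. It assumes the objective $\mu_{[1]} N - \lambda N^2/T$ from (3), whose real-valued maximizer is $N = \mu_{[1]} T/(2\lambda)$, and applies the floor/ceiling check for the integer case; the function name `oracle_plan` and the example parameters are hypothetical.

```python
import math

def oracle_plan(mu_best, horizon, lam=1.0):
    """Return (N, payoff) maximizing mu_best*N - lam*N**2/horizon over integers N >= 0."""
    def payoff(n):
        return mu_best * n - lam * n * n / horizon

    # Real-valued maximizer of the concave quadratic in N.
    n_real = mu_best * horizon / (2.0 * lam)
    # The integer optimum is the floor or the ceiling of the real maximizer.
    candidates = {max(math.floor(n_real), 0), math.ceil(n_real)}
    n_star = max(candidates, key=payoff)
    return n_star, payoff(n_star)

n_star, p_star = oracle_plan(mu_best=0.5, horizon=1000.0, lam=1.0)
print(n_star, p_star)
```

For these example values the real-valued maximizer is already an integer, so the payoff coincides with $\mu_{[1]}^2 T/(4\lambda)$.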

Thus, the regret for an algorithm is defined as

$$R_{\mathcal{A}}(T) = P^\star(T) - P_{\mathcal{A}}(T), \tag{4}$$

and the objective of the algorithm is to minimize . Since , we characterize the regret of any algorithm as for some that could depend on parameters of the problem etc.
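As a worked illustration of why estimation accuracy matters, consider a hypothetical policy (an assumption made for this example, not one of the algorithms in the paper) that already knows the best arm but samples it at a uniform rate $r$ over the whole horizon, obtaining $rT$ samples at spacing $1/r$. With a sampling cost of $\lambda/\Delta t$ per interval, and the oracle payoff $\mu_{[1]}^2 T/(4\lambda)$ obtained by maximizing (3), the regret (4) of this policy is

$$R(T) = \frac{\mu_{[1]}^2 T}{4\lambda} - \left(\mu_{[1]} r - \lambda r^2\right) T = \lambda T \left(r - \frac{\mu_{[1]}}{2\lambda}\right)^2 .$$

In particular, plugging in $r = \hat{\mu}/(2\lambda)$ for an estimate $\hat{\mu}$ of $\mu_{[1]}$ gives $R(T) = T(\hat{\mu} - \mu_{[1]})^2/(4\lambda)$, so under this assumption the regret grows quadratically in the estimation error even when the best arm is known.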

## 3 CTMAB with Single Arm

In this section, we consider the CTMAB, when there is only a single arm with true mean , where for technical reasons. With the single arm, we denote the binary random reward obtained by sampling at time as , and . As described before, even the single arm case is non-trivial with the CTMAB. First, we derive a lower bound on the regret of any algorithm, and then propose an algorithm that achieves the lower bound upto logarithmic terms.

### 3.1 Lower Bound

To lower bound the regret of any algorithm for the single arm CTMAB, we will need the following preliminaries.

Prediction Problem : Consider two Bernoulli distributions and with means and , respectively, where

. A coin is tossed repeatedly with probability of heads distributed according to either

or , where repeated tosses are independent. From the observed samples, the problem is to predict the correct distribution or such that the success probability of the prediction is at least .

Let be fixed. Consider any algorithm that obtains samples and solves the prediction problem with success probability at least . Then for .

We use Lemma 3.1 for deriving a lower bound on the regret (4) of any algorithm in the following way. Let an algorithm obtain samples in time . Let be the empirical average of the reward obtained by using the samples, where . Let be the error in estimating by at time . Let be the probability that the error in estimating with algorithm at time is at least .

Notice that in defining we are comparing the error (in estimating ) against itself which is not usual, however, it is useful because of Lemma 3.1, where, we show that if any algorithm has at time , then the payoff (defined in (1)) of until time is at most .

We next present Lemma 3.1 and Lemma 3.1 that are useful for deriving a lower bound on the regret in Theorem 3.1. All missing proofs are provided in the appendices. For the single arm CTMAB, if the number of samples obtained till time when , and and are constants.

If for an algorithm , the error in estimating is greater than or equal to at time , i.e., , then the payoff (1) of for the time period is at most .

Using Lemma 3.1, we next derive a lower bound on the regret of any algorithm for the single arm CTMAB. We will characterize the regret of any algorithm as for some , where this choice is driven by the oracle payoff being . Note that can depend on any parameter of the problem, e.g., , hence, this does not limit the generality. Next, given that an algorithm has regret , we will derive a lower bound on .

Let . If the regret of any online algorithm for the single arm CTMAB is , then must satisfy, where

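$$p_1 = \min\left\{k \ge 0 : T^{k+\alpha} \ge \frac{2 c_1 \lambda}{\mu^3} \log(c_2 T^\alpha) \ \ \forall\, \alpha > 0\right\}, \quad \text{and} \tag{5}$$

$$p_2 = \min\left\{k \ge 0 : \frac{1}{T^{k+\alpha}} \left(\frac{c_1}{\mu^2} \log(c_2 T^\alpha)\right)^2 = \frac{1}{\lambda} \max\left\{\mu^2 T^k, c_3 \mu\right\} \ \ \forall\, \alpha > 0\right\}, \tag{6}$$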

. Ignoring lower order terms, and satisfy and . Thus, the regret of any algorithm is .

The basic idea used to derive the lower bound is that if suppose the regret of any algorithm is , then it must be that at time , the probability that the (error in estimating at time is greater than ) is at most , i.e., . This condition is necessary, since otherwise, from Lemma 3.1, the payoff obtained in time is zero with probability contradicting the regret of for . To ensure that , the number of samples to obtain in time is lower bounded by Lemma 3.1. Accounting for the sampling cost resulting out of this lower bound on in time gives us the required lower bound.

## 4 Algorithm With a Matching Upper Bound

In this section, we propose an algorithm that achieves a regret within constant terms of the lower bound derived in Theorem 3.1.

Algorithm CTSAB: Divide the total time horizon in two periods: learning and exploit. Pick . The choice of will determine the speed of the algorithm, and the regret guarantee. The smaller the , the better is the regret but slower the speed. The algorithm works in phases, where phase starts at time and ends at time . Subsequently, phase ( is defined in (8)) starts at time and ends at with duration . For each phase , , the algorithm obtains samples in phase equally spaced in time, i.e., at uniform frequency in that phase. At the end of phase , the total number of samples obtained is , and let

$$\hat{\mu}_i = \frac{1}{N_i} \sum_{k=1}^{N_i} X_k, \tag{7}$$

be the empirical average of all the sample rewards obtained until the end of phase . With abuse of notation, we interchangeably use to denote the error after samples or after phase or time . Thus, the error in estimating at the end of phase is .

We next define , and the algorithm to follow after phase . For a given (input to the algorithm), let be the earliest phase at which

$$\sqrt{\frac{\log(2/\delta)}{N_{i^\star}}} < \frac{\hat{\mu}_{i^\star}}{2}, \tag{8}$$

where . If no such is found, then we define that the algorithm fails.
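The stopping rule (8) can be sketched in isolation. The code below is a minimal illustration, assuming Bernoulli rewards and a fixed batch of samples per phase; the actual phase lengths and per-phase sample counts of CTSAB are not reproduced here, and `learning_period`, `batch`, and `max_samples` are hypothetical names introduced only for this sketch.

```python
import math
import random

def learning_period(mu, delta, batch=50, max_samples=10**6, seed=0):
    """Draw Bernoulli(mu) samples in batches until condition (8) holds.

    Returns (N, mu_hat) at the first batch boundary where
    sqrt(log(2/delta) / N) < mu_hat / 2, or None if the sample budget
    is exhausted first (the case where the algorithm fails).
    """
    rng = random.Random(seed)
    successes, n = 0, 0
    while n < max_samples:
        # One "phase": a fixed batch of samples (an assumption of this sketch).
        successes += sum(rng.random() < mu for _ in range(batch))
        n += batch
        mu_hat = successes / n   # empirical mean over all samples so far, as in (7)
        if math.sqrt(math.log(2.0 / delta) / n) < mu_hat / 2.0:
            return n, mu_hat
    return None

result = learning_period(mu=0.5, delta=0.01)
print(result)
```

Since the left side of (8) shrinks like $1/\sqrt{N}$ while the right side concentrates around $\mu/2$, smaller means need more samples before the rule fires, and a mean of zero never satisfies it.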

The learning period ends at phase , and the exploit period starts from the next phase as follows. Each of the phases is of the same time duration till the total time horizon is reached. Starting from phase and for all subsequent phases , the algorithm assuming (7) to be the true value of , obtains samples in phase , equally spaced in time, and is updated at the end of each phase using all the samples obtained so far since time . The pseudo code for the algorithm is given by Algorithm 1.

The proposed algorithm CTSAB follows the usual approach of exploration and exploitation, however, there are two non-trivial problems being addressed, whose high level idea is as follows. The aim of the learning period is to obtain sufficient number of samples , such that . Since otherwise, the payoff obtained in phases after the learning period is over will be at most zero, following Lemma 3.1. So the first problem is a stopping problem, checking for , which is non-trivial, since is unknown. For this purpose, a surrogate condition (8) is defined, and the learning period is terminated as soon as (8) is satisfied for a particular choice of . Choosing , using Corollary A and Lemma F, we show that whenever (8) is satisfied, with probability at least .

The second problem then is to bound the time by which the learning period ends, i.e., (8) is satisfied. We need this bound, since non-zero payoff can be guaranteed only for phases that belong to the exploit period that starts after the learning period. Towards that end, we show that the length of learning period is with high probability in Lemma F, where is defined as follows.

$$p^\star = \min\left\{k \ge 0 : \frac{1}{\mu^3 T^k} = O(1)\right\}. \tag{9}$$

The expected regret of algorithm CTSAB, while choosing is

$$O\left(\mu^2 T^{p^\star + 4/3\epsilon} \ln(2T^2) \ln(2T)\right)\left(1 - \frac{1}{T^{1+\epsilon}}\right) + \mu^2/(4\lambda T^{\epsilon}),$$

for any , where as defined in (9). Since , using (9), we get that upto constant terms, the regret bound of the CTSAB algorithm (Theorem 4) matches the lower bound derived in Theorem 3.1, and the regret scales as .

In the proof of Theorem 4, we show that the learning period ends in at most time in Lemma F for which we count zero payoff for the algorithm. Thus, the regret of the CTSAB algorithm in the learning period is at most . To complete the proof, we show that the payoff obtained in the remaining time defined as the exploit period differs from that of the oracle policy by only constant terms.

Extension of Theorem 3.1 and Theorem 4 (with the same algorithm CTSAB where the sampling frequency is chosen so as to optimize (3)) to general convex functions for the sampling cost, other than is readily possible. Specifically, Prop. 1 remains unchanged as long as is convex, while the optimal payoff in Prop. 2 will depend on the exact function . Moreover, the arguments made in Lemma 3.1 and 3.1 directly follow for general convex functions , and so does the lower bound Theorem 3.1 and upper bound Theorem 4, where the expressions will depend on .

Unlike the DMAB setting, where algorithms like UCB or Thompson sampling can work without the knowledge of , in the current setting, the CTSAB algorithm we propose, crucially uses the information about to define phases and its decisions. Developing an algorithm without the knowledge of for the CTMAB appears challenging and is part of ongoing work.

## 5 CTMAB with Multiple Arms

In this section, we consider arms with means and , and the objective is as before, to minimize the regret defined in (4). In the CTMAB, any algorithm has to identify the best arm (arm 1) together with closely estimating the mean . We exploit this dual requirement to derive a lower bound in Theorem 5.1, and then propose an algorithm that achieves the lower bound upto logarithmic terms.

### 5.1 Lower Bound

Exploration Problem with K arms of unknown means: Let there be arms, with i.i.d. Bernoulli distribution for arm with mean . Let the product distribution over the arms be denoted as . An arm is called -optimal if . An algorithm is -correct if for the arm that it outputs as the best arm, we have .

(Mannor and Tsitsiklis, 2004) There exist positive constants such that for there exists distribution such that for every , the sample complexity of any -correct algorithm is at least , where and .

Using Lemma 5.1, next, we derive a lower bound on the regret of any online algorithm for the CTMAB.

If the regret (4) of any algorithm for the CTMAB is , then

$$p_m \ge \max\{p_{m1}, p_{m2}\},$$

where

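$$p_{m1} = \min\left\{k \ge 0 : T^{k+\alpha} = \frac{2\lambda K}{(\Delta/2)^2 \mu_{[1]}} \log\left(\frac{T^\alpha}{8}\right) \ \ \forall\, \alpha > 0\right\}, \quad \text{and} \tag{10}$$

$$p_{m2} = \min\left\{k \ge 0 : \frac{\left(\frac{K}{\Delta^2} \log\left(\frac{T^\alpha}{8}\right)\right)^2}{T^{k+\alpha}} = \frac{1}{8} \max\left\{\mu^2 T^k, \frac{\mu_{[1]} c_{m2}}{(\Delta/2)^2}\right\} \ \ \forall\, \alpha > 0\right\}. \tag{11}$$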

Ignoring lower order terms, and , satisfy and . Hence the regret of any algorithm is

The main idea used to derive this lower bound is as follows. Let the regret of any algorithm be . Then consider time for any . We show that if the probability of correctly identifying the best arm with algorithm is less than at time , then the regret of is for , contradicting the assertion that the regret of is . Thus, the probability of identifying the best arm with at time must be greater than . Next, using Lemma 5.1, we get a lower bound on the number of samples to be obtained in time , so that the probability of identifying the best arm at time must be greater than . Therefore, we get a lower bound on the frequency of sampling () to be employed in time . Accounting for the sampling cost resulting out of this lower bound on the frequency of sampling, gives us the lower bound of Theorem 5.1.

### 5.2 Upper Bound - Algorithm CTMAB

We propose an algorithm for CTMAB, called the CTMAB algorithm, that is neither aware of or the actual means , and show that its regret is within logarithmic terms of the lower bound (Theorem 5.1). The algorithm first tries to identify the best arm with high enough probability (called the identification phase), and thereafter uses the Exploit Period of Algorithm 1 for the single arm problem with the identified best arm. There are multiple algorithms Jamieson et al. (2013); Auer et al. (2002); Karnin et al. (2013) known in literature for identifying the best arm (called the pure exploration problem), that require the number of samples within a logarithmic gap from the lower bound (Lemma 5.1). More recently, an algorithm called the Track and Stop Garivier and Kaufmann (2016) has been shown to achieve the lower bound (order-wise).

For our algorithm, we can use any algorithm for the pure exploration problem, and we generically call it Pure-Exp. Once the Pure-Exp algorithm terminates, the Exploit Period of algorithm CTSAB (Algorithm 1) is executed only on the arm identified as the best by the Pure-Exp algorithm. The pseudo code of the CTMAB algorithm is given in Algorithm 2.

For the CTMAB, we use a Pure-Exp algorithm that is designed for the discrete setting by obtaining the samples it requests for identifying the best arm at a frequency of one sample per unit time.
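The two-stage structure described above can be sketched as follows. This is a simplified illustration under stated assumptions: successive elimination with Hoeffding-style confidence radii stands in for Pure-Exp (it is not Track and Stop), the exploit stage samples at a fixed rate $\hat{\mu}/(2\lambda)$ without the per-phase updates of Algorithm 1, and the payoff during identification is not counted. The names `pure_exp` and `ctmab_sketch` and all parameter values are hypothetical.

```python
import math
import random

def pure_exp(means, delta, rng):
    """Successive elimination, a stand-in for Pure-Exp.

    Returns (best arm index, its empirical mean, total samples used).
    """
    k = len(means)
    active = list(range(k))
    sums = [0.0] * k
    n = 0        # samples drawn so far from each still-active arm
    used = 0     # total samples across all arms
    while len(active) > 1:
        n += 1
        for a in active:
            sums[a] += 1.0 if rng.random() < means[a] else 0.0
            used += 1
        radius = math.sqrt(math.log(4.0 * k * n * n / delta) / (2.0 * n))
        top = max(sums[a] / n for a in active)
        # Keep an arm only if its upper bound reaches the leader's lower bound.
        active = [a for a in active if sums[a] / n + 2.0 * radius > top]
    best = active[0]
    return best, sums[best] / max(n, 1), used

def ctmab_sketch(means, horizon, lam=1.0, delta=0.01, seed=0):
    rng = random.Random(seed)
    best, mu_hat, used = pure_exp(means, delta, rng)
    t_ident = float(used)            # one sample per unit time during identification
    rate = mu_hat / (2.0 * lam)      # oracle-style rate with the estimate plugged in
    remaining = max(horizon - t_ident, 0.0)
    # Expected exploit payoff of uniform sampling at `rate` on the chosen arm.
    exploit_payoff = (means[best] * rate - lam * rate ** 2) * remaining
    return best, t_ident, rate, exploit_payoff

best, t_ident, rate, exploit_payoff = ctmab_sketch([0.9, 0.2, 0.1], horizon=10000.0)
print(best, t_ident, rate, exploit_payoff)
```

The design mirrors the text: the identification stage pays only in elapsed time, and everything after it reduces to the single-arm problem on the selected arm.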

We next define a quantity that is useful for deriving an upper bound on the regret of the CTMAB algorithm.

$$p^\star_m = \min\left\{r \ge 0 : \frac{K}{\Delta^2 \mu_{[1]}^2 T^r} = O(1)\right\}. \tag{12}$$

The main result of this subsection is as follows.

With the choice of , and using Track and Stop algorithm Garivier and Kaufmann (2016) as the Pure-Exp algorithm, the expected regret of the CTMAB algorithm is

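$$O\left(\mu_{[1]}^2 T^{p^\star_m} \log(T^2) \log(T)\right)\left(1 - \frac{T}{T^{\nu_m}} \frac{1}{T^2}\right)\left(1 - \frac{1}{T^2}\right) + o(T)$$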

where is defined in (12), and is the width of each phase after the identification phase is over. Using (11), we can conclude that, order-wise, the regret of the CTMAB algorithm is the same as the lower bound derived in Theorem 5.1.

The sketch of the proof is as follows. The first phase (called identification) ensures that the best arm is identified with probability at least using the Track and Stop algorithm as the Pure-Exp algorithm, and equally importantly, we show that the identification phase is complete by time . Moreover, the number of samples obtained by the Pure-Exp algorithm in the identification phase for the identified best arm is which is more than , since . Thus, at the end of the identification phase, for the best arm with high probability, as required by the end of the Learning Period of Algorithm 1 for the single arm case. Since the CTMAB algorithm executes the Exploit Period of Algorithm 1 after the end of the identification phase, it ensures that the payoff of the algorithm differs from that of the oracle policy only in logarithmic terms after the identification phase (Theorem 4). Thus, the loss of payoff of the algorithm with respect to the oracle policy occurs only during the identification phase, which is of length , incurring a regret of .

## 6 Numerical Results

In this section, we compare the performance of our algorithms against the oracle policy and a baseline policy that does not adapt to the estimates of the arm means. The baseline policy samples the optimal arm at a fixed interval of , where is a constant that determines the rate of sampling. The payoff of the baseline policy over a period is , and the payoff is positive and increasing for all , achieving its maximum at . We compare the performance of this baseline policy, the CTSAB algorithm, and the oracle policy for the single arm case in Fig. 1. We set the parameters to satisfy the relations discussed in Thm. 3.1. The total payoffs over the time horizon of the oracle policy, the baseline policy, and the CTSAB algorithm are plotted in Fig. 1 for different values of . The two baseline policies correspond to the cases where the sampling is uniform at rates and . As seen from Fig. 1, the total reward from the CTSAB algorithm is close to optimal, while with , we see that the reward for the baseline with is negative as . Further, note that the gap between the total payoff of the oracle policy and that of the CTSAB algorithm increases as decreases. This is natural, as the learning problem gets harder as becomes smaller. This is explicitly depicted in Fig. 2, where we plot the regret of the CTSAB algorithm as a function of . As seen, the regret has an inverse relation with , in agreement with Thm. 3.1.
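To make the comparison between the oracle and a fixed-rate baseline concrete, the following sketch evaluates expected payoffs in closed form. It assumes the payoff model implied by the regret expression in Appendix D: with N equally spaced samples in an interval of length t, the expected payoff is μN − λN²/t, which is maximized at N = μt/(2λ) and then equals μ²t/(4λ). The function names and the numerical values of `mu`, `lam`, `t`, and `rate` are illustrative assumptions, not the parameters used for the figures.

```python
def expected_payoff(mu, lam, t, n_samples):
    """Expected payoff of n_samples equally spaced samples in [0, t].

    Assumed model: reward mu per sample, sampling cost lam * n_samples**2 / t.
    """
    return mu * n_samples - lam * n_samples**2 / t


def oracle_samples(mu, lam, t):
    """Number of samples that maximizes expected_payoff when mu is known."""
    return mu * t / (2 * lam)


def fixed_rate_payoff(mu, lam, t, rate):
    """Expected payoff of a non-adaptive baseline sampling at a constant rate."""
    return expected_payoff(mu, lam, t, rate * t)


mu, lam, t = 0.5, 1.0, 100.0  # illustrative values only
oracle = expected_payoff(mu, lam, t, oracle_samples(mu, lam, t))
print("oracle payoff:", oracle)  # equals mu**2 * t / (4 * lam)
for rate in (0.1, 1.0):
    print("rate", rate, "payoff:", fixed_rate_payoff(mu, lam, t, rate))
```

Under this model a fixed rate f gives payoff t·f·(μ − λf), so the baseline payoff turns negative once f exceeds μ/λ, which is the qualitative behaviour reported for the faster baseline.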

Next, we consider the multiple arms case, and evaluate the performance of the CTMAB algorithm with arms. As the number of samples prescribed by the Median Elimination Algorithm is overly pessimistic, instead of taking number of samples for the arms that are not eliminated, we fix the number of samples at a constant value of , as is done when executing pure exploration algorithms in practice. We simulate the CTMAB algorithm on three sets of mean vectors , with values and , and plot the cumulative payoff for the oracle policy and the CTMAB algorithm in Fig. 3. As seen, the performance of the CTMAB algorithm is close to that of the oracle policy, and the regret worsens as decreases.

## 7 Conclusions

In this paper, we have introduced a new continuous-time multi-arm bandit model (CTMAB) that is well motivated by applications in crowdsourcing and inventory management systems. The CTMAB is fundamentally different from the popular DMAB and, to the best of our knowledge, has not been considered before. The distinguishing feature of the CTMAB is that the oracle policy’s decision depends on the mean of the best arm, which makes even the single arm problem non-trivial. To keep the model simple, we considered a simple sampling cost function and derived almost tight upper and lower bounds on the optimal regret for any learning algorithm.

## References

• K. Abinav and S. A. Slivkins (2018) Combinatorial semi-bandits with knapsacks. In International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §1.1.
• S. Agrawal and N. R. Devanur (2016) Linear contextual bandits with knapsacks. In Neural Information Processing Systems (NIPS 2016), Cited by: §1.1.
• S. Agrawal and N. Goyal (2012) Analysis of thompson sampling for the multi-armed bandit problem. In Conference on learning theory, pp. 39–1. Cited by: §1.
• R. Arora, T. V. Marinov, and M. Mohr (2019) Bandits with feedback graphs and switching costs. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1.1.
• P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine learning 47 (2-3), pp. 235–256. Cited by: §1, §5.2.
• A. Badanidiyuru, R. Kleinberg, and A. Slivkins (2018) Bandits with knapsacks. Journal of ACM (13). Cited by: §1.1.
• S. Bubeck, N. Cesa-Bianchi, et al. (2012) Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning 5 (1), pp. 1–122. Cited by: §1.
• L. Cella and N. Cesa-Bianchi (2020) Stochastic bandits with delay-dependent payoffs. In International Conference on Artificial Intelligence and Statistics, pp. 1168–1177. Cited by: §1.1.
• N. Cesa-Bianchi, O. Dekel, and O. Shamir (2013) Online learning with switching costs and other adaptive adversaries. In Advances in Neural Information Processing Systems (NIPS), Cited by: §1.1.
• R. Combes, C. Jiang, and R. Srikant (2015) Bandits with budgets: regret lower bounds and optimal algorithms. In International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Cited by: §1.1.
• O. Dekel, J. Ding, T. Koren, and Y. Peres (2014) Bandits with switching costs: regret. In ACM Symposium on Theory of computing (STOC), pp. 459–467. Cited by: §1.1.
• W. Ding, T. Q. X. Zhang, and T. Liu (2013) Multi-armed bandit with budget constraint and variable costs. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, Cited by: §1.1.
• A. Garivier and E. Kaufmann (2016) Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998–1027. Cited by: Appendix H, §5.2, §5.2.
• X. Gong and N. B. Shroff (2019) Truthful data quality elicitation for quality-aware data crowdsourcing. IEEE Transactions on Control of Network Systems 7 (1), pp. 326–337. Cited by: §1.
• R. Gopalakrishnan, S. Doroudi, A. R. Ward, and A. Wierman (2016) Routing and staffing when servers are strategic. Operations research 64 (4), pp. 1033–1050. Cited by: §1.
• A. György, L. Kocsis, I. Szabó, and C. Szepesvári (2007) Continuous time associative bandit problems.. In IJCAI, pp. 830–835. Cited by: §1.1.
• M. K. Hanawal, A. Leshem, and V. Saligrama (2015a) Cost effective algorithms for spectral bandits. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1323–1329. Cited by: §1.1.
• M. K. Hanawal, V. Saligrama, M. Valko, and R. Munos (2015b) Cheap bandits. In International Conference on Machine Learning (ICML), Cited by: §1.1.
• N. Immorlica, K. A. Sankararaman, R. Schapire, and A. Slivkins (2019) Adversarial bandits with knapsacks. In Annual Symposium on Foundations of Computer Science (FOCS), Cited by: §1.1.
• K. Jamieson, M. Malloy, R. Nowak, and S. Bubeck (2013) On finding the largest mean among many. arXiv preprint arXiv:1306.3917. Cited by: §5.2.
• T. Jun (2004) A survey on the bandit problem with switching costs. In De Economist, pp. 513–541. Cited by: §1.1.
• Z. Karnin, T. Koren, and O. Somekh (2013) Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246. Cited by: §5.2.
• S. M. Kay (1993) Fundamentals of statistical signal processing. Prentice Hall PTR. Cited by: Appendix D.
• R. Kleinberg and N. Immorlica (2018) Recharging bandits. In 2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS), pp. 309–319. Cited by: §1.1.
• S. Krishnasamy, R. Sen, R. Johari, and S. Shakkottai (2016) Regret of queueing bandits. In Advances in Neural Information Processing Systems, pp. 1669–1677. Cited by: §1.
• T. Lattimore and C. Szepesvári (2019) Bandit algorithms. Cited by: §1.
• T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: Appendix B.
• M. S. Lobo and S. Boyd (2003) Pricing and learning with uncertain demand. In INFORMS Revenue Management Conference, Cited by: §1.
• S. Mannor and J. N. Tsitsiklis (2004) The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research 5 (Jun), pp. 623–648. Cited by: §5.1.
• C. Pike-Burke and S. Grunewalder (2019) Recovering bandits. In Advances in Neural Information Processing Systems, pp. 14122–14131. Cited by: §1.1.
• N. Rajaraman, R. Vaze, and G. Reddy (2021) Not just age but age and quality of information. IEEE Journal on Selected Areas in Communications 39 (5), pp. 1325–1338. Cited by: §1.
• Y. Seldin, P. L. Bartlett, K. Crammer, and Y. Abbasi-Yadkori (2014) Prediction with limited advice and multiarmed bandits with paid observations.. In ICML, pp. 280–287. Cited by: §1.1.
• K. Trapeznikov and V. Saligrama (2013) Supervised sequential classification under budget constraints. In Artificial Intelligence and Statistics, pp. 581–589. Cited by: §1.1.
• P. Whittle (1988) Restless bandits: activity allocation in a changing world. Journal of applied probability, pp. 287–298. Cited by: §1.1.
• R. Zhou, C. Gan, J. Yang, and C. Shen (2018) Cost-aware cascading bandits. In International Joint Conference on Artificial Intelligence (IJCAI), Cited by: §1.1.
• N. Zolghadr, G. Bartók, R. Greiner, A. György, and C. Szepesvári (2013) Online learning with costly features and labels.. In Advances in Neural Information Processing Systems, pp. 1241–1249. Cited by: §1.1.

## Appendix A Preliminaries

Let ’s be independent and identically distributed Bernoulli random variables with mean , and . (Chernoff Bound) Choosing , we get that with probability at least .

## Appendix B Proof of Lemma 3.1

Let the product distributions over samples derived from and be and , respectively. Let be the event that the algorithm outputs as the correct distribution, and be its complement. Then from Theorem 14.2 of Lattimore and Szepesvári (2020), we have that

$$\mathbb{P}_{P}(E^c) + \mathbb{P}_{Q}(E) \ge \frac{1}{2} \exp\left(-D(P^n \,\|\, Q^n)\right), \tag{13}$$

where is the Kullback-Leibler divergence between and . Since the probability of success for is , we have that both and . Thus, from (13), we get

$$2\gamma > \frac{1}{2} \exp\left(-D(P^n \,\|\, Q^n)\right). \tag{14}$$

Moreover, we have that , and for for and being Bernoulli with means and .

Therefore, we get that

$$4\gamma > \exp\left(-n\, 8\beta^2 \ln 2\right), \tag{15}$$

which implies the result.

## Appendix C Proof of Lemma 3.1

Recall that . For the single arm CTMAB problem, let any algorithm know that either the mean is or , when the true mean is . This can only reduce the regret of any algorithm.

From Proposition 2, the sampling frequency of depends on the estimate of the true value of mean . Thus, algorithm has to distinguish whether the true mean is or . Let be successful in identifying the true mean with probability at least by some time using samples. Then Lemma 3.1 lower bounds the number of samples , since the difference between the two means for Lemma 3.1. Therefore, if then the algorithm makes a mistake in predicting the true mean and declares it to be with probability at least .

## Appendix D Proof of Lemma 3.1

For an algorithm , let the (sample mean) estimate of at time be such that . We want to upper bound the expected payoff of in . Towards that end, we bound the expected payoff if knew at time itself.

With the Bernoulli distribution, the empirical estimate is a sufficient and complete statistic Kay (1993) for , and the minimum variance unbiased estimator (MVUE).

Knowing at time , let be the number of samples (at equal intervals) obtained by algorithm in time . Then maximizing the expected payoff of in interval is equivalent to minimizing the expected regret (4) of in , given by

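$$\begin{aligned}
R([0,t]) &= \min_{N(\hat{\mu}_t)} \mathbb{E}\left\{ \frac{\mu^2 t}{4\lambda} - \mu N(\hat{\mu}_t) + \frac{\lambda}{t} N(\hat{\mu}_t)^2 \right\}, \\
&= \min_{N(\hat{\mu}_t)} \mathbb{E}\left\{ \left( \sqrt{\frac{t}{4\lambda}}\, \mu - \sqrt{\frac{\lambda}{t}}\, N(\hat{\mu}_t) \right)^2 \right\}, \\
&= \min_{N(\hat{\mu}_t)} \mathbb{E}\left\{ \frac{t}{4\lambda} \left( \mu - \frac{2\lambda}{t} N(\hat{\mu}_t) \right)^2 \right\}.
\end{aligned}$$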

Since is MVUE for , the number of samples