Multi-armed Bandits with Cost Subsidy

11/03/2020
by Deeksha Sinha, et al.

In this paper, we consider a novel variant of the multi-armed bandit (MAB) problem, MAB with cost subsidy, which models many real-life applications where the learning agent has to pay to select an arm and is concerned about optimizing cumulative costs and rewards. We present two applications, the intelligent SMS routing problem and the ad audience optimization problem faced by a number of businesses (especially online platforms), and show how our problem uniquely captures key features of these applications. We show that naive generalizations of existing MAB algorithms like Upper Confidence Bound and Thompson Sampling do not perform well for this problem. We then establish a fundamental lower bound of Ω(K^1/3 T^2/3) on the performance of any online learning algorithm for this problem, highlighting the hardness of our problem in comparison to the classical MAB problem (where T is the time horizon and K is the number of arms). We also present a simple variant of explore-then-commit and establish near-optimal regret bounds for this algorithm. Lastly, we perform extensive numerical simulations to understand the behavior of a suite of algorithms for various instances and provide a practical guide on when to employ the different algorithms.

1 Introduction

In the traditional (stochastic) MAB problem (Robbins (1952)), the learning agent has access to a set of actions (arms) with unknown but fixed reward distributions and has to repeatedly select an arm to maximize the cumulative reward. The challenge is to design a policy that balances the tension between acquiring information about arms with few historical observations and exploiting the most rewarding arm based on existing information. This exploration-exploitation trade-off has been extensively studied, leading to a number of simple but extremely effective algorithms like Upper Confidence Bound (Auer et al. (2002a)) and Thompson Sampling (Thompson (1933); Agrawal and Goyal (2017a)), which have been further generalized and applied in a wide range of application domains including online advertising (Langford and Zhang (2008); Cesa-Bianchi et al. (2014); Oliver and Li (2011)), recommendation systems (Li et al. (2015, 2011); Agrawal et al. (2016)), and social networks and crowdsourcing (Anandkumar et al. (2011); Sankararaman et al. (2019); Slivkins and Vaughan (2014)); see Bubeck and Cesa-Bianchi (2012) and Slivkins (2019) for a detailed review. However, most of these approaches cannot be generalized to settings involving multiple metrics (for example, reward and cost) when the underlying trade-offs between these metrics are not known a priori.

In many real-world applications of MAB, some of which we elaborate on below, it is common for the agent to incur a cost to play an arm, with high-performing arms costing more. Though one can model this in the traditional MAB framework by taking the reward minus the cost as the modified objective, such a modification is not always meaningful, particularly in settings where the reward and cost of an arm represent different quantities (for example, the click rate and the cost of an ad). In such problems, it is natural for the learning agent to optimize both metrics, typically trying to avoid incurring exorbitant costs for a marginal increase in cumulative reward. Motivated by this scenario, in this paper we consider a variant of the MAB problem in which the agent must balance not only the exploration-exploitation trade-off needed to maximize the cumulative reward but also the trade-offs between multiple objectives that are intrinsic to several practical applications. More specifically, we study a stylized problem in which, to manage costs, the agent is willing to tolerate a small loss relative to the highest reward, i.e., the reward that could be obtained in the traditional MAB problem in the absence of costs. We refer to this problem as the MAB problem with a cost subsidy (see Section 1.1 for the exact formulation), where the subsidy refers to the amount of reward the learning agent is willing to forgo to reduce costs. Before explaining our problem and technical contributions in detail, we elaborate on the applications that motivate it.

Intelligent SMS Routing. Many businesses such as banks, delivery services, airlines, hotels, and various online platforms send SMSes (text messages) to their users for a variety of reasons, including two-factor authentication, order confirmations, appointment reminders, transaction alerts, and direct marketing (see Twilio and Uber (2020)). These text messages, referred to as Application-to-Person (A2P) messages, constitute a significant portion of all text messages sent through cellular networks today. In fact, A2P messaging is forecast to be an $86.3 billion business by 2025 (MarketWatch (2020)).

To deliver these messages, businesses typically enlist the support of telecom aggregators, who have private agreements with mobile operators. Each aggregator offers a unique combination of quality, measured by the fraction of text messages it successfully delivers, and price per message. Surprisingly, it is common for delivery rates of text messages to not be very high (see Canlas et al. (2010); Meng et al. (2007); Zerfos et al. (2006); Osunade and Nurudeen for QoS analyses in different geographies) and for an aggregator's quality to fluctuate over time for reasons ranging from network outages to traffic congestion. Therefore, the platform's problem of balancing the tension between inferring aggregators' quality through exploration and exploiting the current best-performing aggregator to maximize the number of messages delivered to users leads to a standard MAB formulation. However, given the large volume of messages that need to be dispatched, an MAB-based solution that focuses exclusively on aggregator quality could result in exorbitant spending for the business. A survey of businesses shows that the number of text messages they are willing to send drops significantly if the cost per SMS increases by a few cents (Ovum (2017)). Moreover, in many situations platforms have backup communication channels, such as email-based authentication or notifications via in-app/website features, which, though not as effective as a text message in terms of read rate, can be used if guaranteeing text message delivery proves very costly. Therefore, it is natural for businesses to prefer an aggregator with lower costs as long as its quality is comparable to that of the best-quality aggregator.

Ad-audience Optimization. We now describe another real-world application in the context of online advertisements. Many advertisers (especially small-to-medium scale businesses) have increasingly embraced the notion of auto-targeting, where they let the advertising platform identify a high-quality audience group (e.g., Koningstein (2006); Amazon (2019); Facebook (2016); Google (2014)). To enable this, the platform explores the audience space to identify cheaper opportunities that also give a high click-through rate (CTR) and conversion rate. Different audience groups may have different yields, i.e., quality (CTR/conversion rate), for a specific ad, and may require vastly different bids to reach due to auction overlap with other ad campaigns that target narrower audiences. Thus, the algorithm faces a similar trade-off: among the audience groups that deliver a sufficiently high yield, the goal is to find the cheapest one.

We now present a novel formulation of a multi-armed bandit problem that captures the key features of these applications, where our goal is to develop a cost-sensitive MAB algorithm that balances both the exploration-exploitation trade-off and the tension between conflicting metrics in a multi-objective setting.

1.1 Problem formulation

To formally state our problem, consider an instance $\mathbf{I}$ with $K$ arms. In every round $t = 1, \ldots, T$, the agent chooses an arm $\pi_t \in \{1, \ldots, K\}$ and realizes a reward sampled independently from a fixed but unknown distribution with mean $\mu_{\pi_t}$, while incurring a cost $c_{\pi_t}$, which is known a priori. Here, in order to manage costs, we allow the agent to be agnostic between arms whose expected reward is at least a $(1-\alpha)$ fraction of the highest expected reward, for a fixed and known value of $\alpha \in [0, 1)$, which we refer to as the subsidy factor. The agent's objective is to learn and pull the cheapest arm among these high-quality arms as frequently as possible.

More specifically, let $a^* = \arg\max_k \mu_k$ denote the arm with the highest expected reward, and let $\{k : \mu_k \ge (1-\alpha)\,\mu_{a^*}\}$ be the set of arms whose expected reward is within a $(1-\alpha)$ factor of the highest expected reward. We refer to the quantity $(1-\alpha)\,\mu_{a^*}$ as the smallest tolerated reward. Without loss of generality, we assume the reward distributions have support $[0, 1]$. The goal of the agent is to design a policy (algorithm) that learns the cheapest arm whose expected reward is at least the smallest tolerated reward, i.e., $\ell^* = \arg\min\{c_k : \mu_k \ge (1-\alpha)\,\mu_{a^*}\}$. In other words, the agent needs to learn the identity of $\ell^*$ and simultaneously maximize the number of plays of arm $\ell^*$. Since in the SMS application the reward is the delivery quality of the chosen aggregator, we use the terms reward and quality interchangeably.
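For concreteness, consider the following small illustrative example (the numbers are hypothetical and not taken from the paper). Suppose $K = 3$, $\mu = (0.9, 0.85, 0.6)$, $c = (1.0, 0.4, 0.1)$, and $\alpha = 0.1$. Then

$a^* = 1, \quad (1-\alpha)\,\mu_{a^*} = 0.9 \times 0.9 = 0.81, \quad \{k : \mu_k \ge 0.81\} = \{1, 2\}, \quad \ell^* = \arg\min_{k \in \{1,2\}} c_k = 2.$

Arm 3 is the cheapest overall, but its expected reward falls below the smallest tolerated reward, so the agent should learn to play arm 2 as often as possible.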

To measure the performance of any policy $\pi$, we propose two notions of regret, quality regret and cost regret, and the agent's goal is to minimize both:

$\text{Quality\_Reg}(T, \pi, \mathbf{I}) = \sum_{t=1}^{T} \mathbb{E}\left[\max\left(0,\ (1-\alpha)\,\mu_{a^*} - \mu_{\pi_t}\right)\right], \qquad \text{Cost\_Reg}(T, \pi, \mathbf{I}) = \sum_{t=1}^{T} \mathbb{E}\left[\max\left(0,\ c_{\pi_t} - c_{\ell^*}\right)\right], \qquad (1)$

where $\pi_t$ is the arm played by policy $\pi$ at time $t$ and the expectation is over the randomness in the policy and in the observed rewards. Here, an instance $\mathbf{I}$ is defined by the reward distributions and costs of all the arms. The objective is to design a policy $\pi$ that simultaneously keeps both the cost and quality regret small for all possible choices of $\alpha$ and $T$ (equivalently, for all instances $\mathbf{I}$).
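As a sanity check on these definitions, the following minimal Python sketch computes the two regret metrics in (1) ex post for a given sequence of pulls; the function and variable names are our own, and the instance values in the usage line are hypothetical.

import numpy as np

def regret_metrics(pulls, mu, cost, alpha):
    # Compute cumulative quality and cost regret, as in Eq. (1), for a
    # sequence of arm pulls on an instance with known means and costs.
    mu, cost = np.asarray(mu), np.asarray(cost)
    tolerated = (1 - alpha) * mu.max()              # smallest tolerated reward
    feasible = np.where(mu >= tolerated)[0]         # arms meeting the tolerance
    ell_star = feasible[np.argmin(cost[feasible])]  # cheapest feasible arm
    quality_reg = sum(max(0.0, tolerated - mu[k]) for k in pulls)
    cost_reg = sum(max(0.0, cost[k] - cost[ell_star]) for k in pulls)
    return quality_reg, cost_reg

# Example usage with two arms: an expensive high-quality arm and a free, slightly worse arm.
q_reg, c_reg = regret_metrics(pulls=[0, 1, 1, 0], mu=[0.5, 0.46], cost=[1.0, 0.0], alpha=0.1)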

Remark 1 (Choice of objective function).

Note that a parametrized linear combination of the reward and cost metrics, i.e., maximizing $\mu_k - \lambda c_k$ for an appropriately chosen $\lambda \ge 0$, is a popular approach to balancing the cost-reward trade-off. However, the setting considered in this paper is not equivalent to this approach. In particular, for any specified subsidy factor $\alpha$, the value of $\lambda$ required in the linear objective for $\ell^*$ to be the optimal arm depends on the cost and reward distributions of the arms. Therefore, using a single value of $\lambda$ and relying on standard MAB algorithms would not lead to the desired outcome for our problem. Further, from an application standpoint, there are two important considerations. First, in a real-world system we need explicit control, via a parameter that is not instance-dependent, to understand and defend the trade-off between the various objectives. Second, for the intelligent SMS routing application discussed earlier, different sets of aggregators operate in different regions; separate $\lambda$ values would then need to be configured for each region, making the process cumbersome.
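The following illustrative calculation (with hypothetical numbers) makes the instance dependence of $\lambda$ explicit. Take $\alpha = 0.1$ and two arms with mean rewards $\mu = (1.0, 0.92)$, so that arm 2 meets the tolerance $(1-\alpha)\mu_{a^*} = 0.9$ and is $\ell^*$ whenever it is the cheaper arm. If $c = (1.0, 0.1)$, arm 2 maximizes $\mu_k - \lambda c_k$ if and only if

$0.92 - 0.1\lambda \ \ge\ 1.0 - 1.0\lambda \ \iff\ \lambda \ge 0.089,$

whereas if $c = (0.2, 0.1)$, the same condition becomes $0.92 - 0.1\lambda \ge 1.0 - 0.2\lambda$, i.e., $\lambda \ge 0.8$. A single choice such as $\lambda = 0.1$ therefore recovers $\ell^*$ in the first instance but selects the expensive arm in the second, even though $\ell^*$ is the same arm in both.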

1.2 Related Work

Our problem is closely related to the line of work on MAB with multiple objectives, which has attracted considerable attention in recent times. The existing literature on multi-objective MAB can be broadly classified into three different categories.

Bandits with Knapsacks (BwK). Bandits with knapsacks (BwK), introduced in the seminal work of Badanidiyuru et al. (2018), is a general framework that considers the standard MAB problem under additional budget/resource constraints. The BwK problem encapsulates a large number of constrained bandit problems that naturally arise in many application domains, including dynamic pricing, auction bidding, routing, and scheduling (see Tran-Thanh et al. (2012); Agrawal and Devanur (2014); Immorlica et al. (2019)). In this formulation, the agent has access to a set of finite resources and arms, each associated with a reward distribution. Upon playing an arm at time $t$, the agent realizes a reward and incurs a penalty for each resource, all drawn from a fixed but unknown distribution corresponding to the arm. The objective of the agent is to maximize the cumulative reward before any resource is completely depleted. Although appealing in many applications, the BwK formulation requires a hard constraint on resources (costs in our setting) and hence cannot be easily generalized to our problem. In particular, in the cost-subsidized MAB problem, the equivalent budget limits depend on the problem instance and therefore cannot be determined a priori.

Pareto Optimality and Composite Objective. The second formulation focuses on identifying Pareto-optimal alternatives and choosing uniformly among these options (see Drugan and Nowe (2013); Yahyaa et al. (2014); Paria et al. (2018); Yahyaa and Manderick (2015)). These approaches do not apply to our problem, since some Pareto alternatives could have extreme values for one of the metrics, for example very low cost and low quality or extremely high cost and quality, making them undesirable for the applications discussed earlier. Closely related to this line of work are approaches that optimize a composite objective obtained by appropriately weighting the different metrics (see Paria et al. (2018); Yahyaa and Manderick (2015)). Such formulations also do not immediately apply to our problem, since in the SMS and ad applications discussed earlier it is not acceptable to let the quality drop below the allowed level irrespective of the cost savings that could be obtained. Furthermore, in the SMS application the trade-off between quality and cost can vary from region to region, making it hard to identify a good set of weights for the composite objective (see Remark 1).

Conservative Bandits and Bandits with Safety Constraints. Two other lines of work that have recently received increased attention, particularly from practitioners, are bandits with safety constraints (e.g., Daulton et al. (2019); Amani et al. (2020); Galichet et al. (2013)) and conservative bandits (e.g., Wu et al. (2016); Kazerouni et al. (2017)). In both formulations, the algorithm chooses one of the arms and receives an associated reward and cost. The goal is to maximize the total reward while ensuring either that the cost of the chosen arm stays within a pre-specified threshold (when the costs of arms are unknown a priori) or that the reward of the chosen arm is at least a specified fraction of that of a known benchmark arm. Neither of these models exactly captures the requirements of our applications: a) we do not have a hard constraint on the acceptable cost of a pulled arm, and choosing low-quality aggregators to avoid high costs (even for a few rounds) could be disastrous, since it leads to a bad user experience on the platform and eventual churn; b) the equivalent benchmark arm in our case, i.e., the arm with the highest mean reward, is not known a priori.

Best Arm Identification. Apart from the closely related works mentioned above, our problem of identifying the cheapest arm whose expected reward is within an acceptable margin of the highest reward can be formulated as a stylized version of the best-arm identification problem (Katz-Samuels and Scott (2019); Jamieson and Nowak (2014); Chen et al. (2014); Cao et al. (2015); Chen et al. (2016)). However, in many settings, and particularly in the applications discussed earlier, the agent's objective is to optimize the cumulative reward and not just to identify the best arm.

1.3 Our Contributions

Novel Problem Formulation. In this work, we propose a stylized model, MAB with a cost subsidy, and introduce new performance metrics that uniquely capture the salient features of many real-world online learning problems involving multiple objectives. For this problem, we first show that naive generalizations of popular algorithms like Upper Confidence Bound (UCB) and Thompson Sampling (TS) can perform poorly on these metrics. In particular, we show that a naive generalization of TS incurs a linear cost regret on some problem instances.

Lower Bound. We establish a fundamental limit on the performance of any online algorithm for our problem. More specifically, we show that any online learning algorithm must incur a regret of $\Omega(K^{1/3}T^{2/3})$ on either the cost or the quality metric (refer to (1)), further establishing the hardness of our problem relative to the standard MAB problem, for which it is possible to design algorithms that achieve a worst-case regret bound of $\tilde{O}(\sqrt{KT})$. We introduce a novel reduction technique to derive this lower bound, which may be of independent interest.

Cost-Subsidized Explore-Then-Commit. We present a simple algorithm based on the explore-then-commit (ETC) principle and show that it achieves near-optimal performance guarantees. In particular, we establish that our algorithm achieves a worst-case bound of $\tilde{O}(K^{1/3}T^{2/3})$ on both the cost and the quality regret. A key challenge in generalizing the ETC algorithm to this problem arises from having to balance two asymmetric objectives. We also discuss generalizations of the algorithm to settings where the costs of the arms are not known a priori. Furthermore, we consider a special scenario in which differences in arm costs are bounded by differences in rewards, where naive generalizations of TS and UCB work reasonably well, and establish worst-case regret bounds for them.

Numerical Simulation. Lastly, we perform extensive simulations to understand various regimes of the problem parameters and to compare the different algorithms. More specifically, we identify settings where naive generalizations of UCB and TS, which have been adopted in real-life implementations (see Daulton et al. (2019)), perform well and settings where they perform poorly, which should be of interest to practitioners.

1.4 Outline

The rest of this paper is structured as follows. In Section 2, we show that naive generalizations of the TS and UCB algorithms perform poorly. In Section 3, we establish lower bounds on the performance of any algorithm for the MAB with cost subsidy problem. In Section 4, we present a variant of the ETC algorithm and show that it achieves a near-optimal regret bound of $\tilde{O}(K^{1/3}T^{2/3})$ on both metrics. In Section 5, we show that, with additional assumptions, improved performance bounds can be established for naive generalizations of existing algorithms. Finally, in Section 6, we perform numerical simulations to explore various regimes of the instance space.

2 Performance of Existing MAB Algorithms

In this section, we consider natural extensions of two popular MAB algorithms, TS and UCB, to our problem and show that such adaptations perform poorly. This highlights the challenges involved in developing good algorithms for the MAB problem with cost subsidy. In particular, we establish theoretically that the TS variant incurs a linear cost regret on some problem instances, and we observe similar behavior for the UCB variant empirically. Our focus on TS in this section is primarily motivated by its superior empirical performance relative to more traditional approaches such as UCB, observed in a stream of recent papers (see Scott (2010); Oliver and Li (2011); May et al. (2012); Agrawal et al. (2017)).

We present the details of the TS and UCB adaptations in Algorithm 1; we refer to them as Cost-Subsidized TS (CS-TS) and Cost-Subsidized UCB (CS-UCB), respectively. These extensions are inspired by Daulton et al. (2019), which demonstrates empirical efficacy on a related (but different) problem. Briefly, in the CS-TS (CS-UCB) variation, we follow the standard TS (UCB) algorithm and obtain a quality score for each arm, namely a sample from its posterior distribution (its upper confidence bound). We then construct a feasible set consisting of the arms whose quality scores are at least a $(1-\alpha)$ fraction of the highest quality score, and pull the cheapest arm in this feasible set.

Result: Arm to be pulled in each round t
Input : subsidy factor α, costs c_1, …, c_K, prior distribution for the mean rewards of all arms, reward likelihood function
for t = 1, …, K do
       Play arm k_t = t and observe reward X_t;
       Update the number of pulls n_{k_t} and the empirical mean reward μ̂_{k_t};
end for
for t = K + 1, …, T do
       for k = 1, …, K do
              UCB: quality score μ̃_k(t) = μ̂_k(t) + confidence width (e.g., sqrt(2 log t / n_k(t)));
              TS: quality score μ̃_k(t) = sample from the posterior distribution of arm k;
       end for
       Feasible set: F_t = { k : μ̃_k(t) ≥ (1 − α) max_j μ̃_j(t) };
       Play arm k_t = argmin_{k ∈ F_t} c_k and observe reward X_t;
       Update the number of pulls n_{k_t} and the empirical mean reward μ̂_{k_t};
end for
Algorithm 1 Cost-Subsidized TS and UCB Algorithms
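For concreteness, here is a minimal Python sketch of the CS-UCB/CS-TS selection rule in Algorithm 1, under the assumptions of Bernoulli rewards, a Beta(1,1) prior for the TS variant, and a standard UCB1-style confidence width; the function and variable names, as well as the exact exploration bonus, are our own illustrative choices rather than the paper's exact specification.

import numpy as np

rng = np.random.default_rng(0)

def cs_index_policy(mu, cost, alpha, T, variant="ucb"):
    # Cost-subsidized UCB/TS sketch: in each round, compute a quality score per arm
    # and play the cheapest arm whose score is within a (1 - alpha) factor of the best score.
    K = len(mu)
    cost = np.asarray(cost)
    pulls = np.zeros(K)        # number of times each arm has been played
    successes = np.zeros(K)    # cumulative (Bernoulli) reward of each arm
    history = []
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1          # play each arm once to initialize the estimates
        else:
            if variant == "ucb":
                score = successes / pulls + np.sqrt(2 * np.log(t) / pulls)
            else:              # Thompson sampling with a Beta(1, 1) prior
                score = rng.beta(1 + successes, 1 + pulls - successes)
            feasible = np.where(score >= (1 - alpha) * score.max())[0]
            k = feasible[np.argmin(cost[feasible])]
        reward = float(rng.random() < mu[k])   # simulated Bernoulli reward draw
        pulls[k] += 1
        successes[k] += reward
        history.append(k)
    return history

The output of this sketch can be passed to the regret_metrics sketch from Section 1.1 to reproduce qualitative comparisons of the kind shown in Figure 1.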

We now show that CS-TS with Gaussian priors and posteriors (i.e., a Gaussian posterior whose mean is the empirical mean reward of the arm and whose variance is inversely proportional to the number of times the arm has been pulled), as described in Algorithm 1, incurs a linear cost regret in the worst case. More precisely, we prove the following result.

Theorem 1.

For any given time horizon $T$ and subsidy factor $\alpha > 0$, there exists an instance of problem (1) such that the cost regret of CS-TS, $\text{Cost\_Reg}(T, \text{CS-TS}, \mathbf{I})$, is $\Omega(T)$.

Proof Sketch.

The proof closely follows the lower bound argument in Agrawal and Goyal (2017b). We briefly describe the intuition behind the result. Consider a scenario where the highest-reward arm is an expensive arm while all other arms are cheap and have mean rewards marginally above the smallest tolerated reward. In the traditional MAB problem, the anti-concentration property of the Gaussian distribution (see Agrawal and Goyal (2017b)) ensures that samples from good arms are large enough sufficiently often, ensuring appropriate exploration and good performance. However, in our problem, the same anti-concentration property results in playing the expensive arm too often, since the difference in mean qualities is small; the algorithm thus incurs a linear cost regret while achieving zero quality regret. A complete proof of the theorem is provided in Appendix C. ∎

The poor performance of the algorithm is not limited to the above instance or to the use of a Gaussian prior. More generally, the CS-TS and CS-UCB algorithms perform poorly whenever the mean reward of the optimal arm is very close to the smallest tolerated reward. We illustrate this through another empirical example. Consider an instance with two arms, each with Bernoulli rewards, in which one arm is expensive and the other is cheap, and the mean reward of the cheaper (optimal) arm lies just above the smallest tolerated reward. The prior on the mean reward of each arm is a Beta(1,1) distribution. Here, the quality regret is zero irrespective of which arm is played, but both CS-TS and CS-UCB incur significant cost regret, as shown in Figure 1. (In the figure, we also plot the performance of the key algorithm we propose in this paper, Algorithm 2, and note that its performance is far superior to that of CS-TS and CS-UCB.)

Figure 1:

Cost regret of various algorithms for an instance where the mean reward of the optimal arm is very close to the smallest tolerated reward. CS-TS and CS-UCB incur significant regret. But CS-ETC attains low cost regret. The width of the error bands is two standard deviations based on 50 runs of the simulation.

3 Lower Bound

In this section, we establish that any policy must incur a regret of $\Omega(K^{1/3}T^{2/3})$ on at least one of the two regret metrics. More precisely, we prove the following result.

Theorem 2.

For any given subsidy factor $\alpha$ and any (possibly randomized) policy $\pi$, there exists an instance $\mathbf{I}$ of problem (1) with $K$ arms such that the sum of the cost and quality regret of $\pi$ on $\mathbf{I}$ is $\Omega(K^{1/3}T^{2/3})$ when $T$ is sufficiently large relative to $K$.

3.1 Proof Overview

We consider the following families of instances to establish the lower bound. More specifically, we first prove the result for the special case $\alpha = 0$ and then establish a reduction from the general case $\alpha > 0$ to this special case.

Definition 1 (Family of instances).

Define a family of instances $\{\mathbf{I}_j\}$, each with Bernoulli arms and with the same arm costs: arm 0 has cost 0 and every other arm has cost 1. In the base instance $\mathbf{I}_0$, the cheap arm 0 meets the tolerance criterion and is therefore the optimal arm. The instance $\mathbf{I}_j$ with $j \ge 1$ is identical to $\mathbf{I}_0$ except that the mean reward of the expensive arm $j$ is raised by a small gap $\epsilon$, which pushes the cheap arm 0 below the smallest tolerated reward.

Lemma 1.

For any given $T$, $K$, and any (possibly randomized) policy $\pi$, when the subsidy factor $\alpha = 0$, there exists an instance $\mathbf{I}$ from the family $\{\mathbf{I}_j\}$ such that the sum of the cost and quality regret of $\pi$ on $\mathbf{I}$ is $\Omega(K^{1/3}T^{2/3})$ when $T$ is sufficiently large relative to $K$.

Lemma 1 establishes that when $\alpha = 0$, any policy must incur a regret of $\Omega(K^{1/3}T^{2/3})$ on some instance from the family $\{\mathbf{I}_j\}$. To prove Lemma 1, we argue that any online learning algorithm cannot differentiate the base instance $\mathbf{I}_0$ from the instance $\mathbf{I}_j$ for $j \ge 1$ without sampling the expensive arms sufficiently often, and therefore must either incur a high cost regret (if it does not select the cheap arm 0 frequently) or a high quality regret (if it selects arm 0 frequently on an instance where it is no longer within the tolerance). More specifically, any online algorithm requires on the order of $1/\epsilon^2$ samples of each expensive arm, i.e., on the order of $K/\epsilon^2$ rounds, to distinguish $\mathbf{I}_0$ from $\mathbf{I}_j$. Hence, any policy either avoids high quality regret by exploring for roughly $K/\epsilon^2$ rounds, incurring a cost regret of order $K/\epsilon^2$, or keeps the cost regret low at the expense of a quality regret of order $\epsilon T$. This suggests a trade-off between $K/\epsilon^2$ and $\epsilon T$, which are of the same magnitude at $\epsilon = (K/T)^{1/3}$, resulting in the aforementioned lower bound. The complete proof generalizes techniques from the standard MAB lower bound proof and is provided in Appendix B. ∎
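A back-of-the-envelope version of this trade-off, in terms of the gap parameter $\epsilon$ of the family of instances (a heuristic summary of the argument above, not a replacement for the formal proof): distinguishing Bernoulli arms whose means differ by $\epsilon$ requires on the order of $1/\epsilon^2$ samples per expensive arm, so

$\text{Cost\_Reg} \gtrsim \frac{K}{\epsilon^2} \ \text{(if the policy explores the expensive arms enough)}, \qquad \text{Quality\_Reg} \gtrsim \epsilon T \ \text{(if it commits to the cheap arm without enough exploration)},$

and the two terms are balanced at $\epsilon = (K/T)^{1/3}$, where $K/\epsilon^2 = \epsilon T = K^{1/3} T^{2/3}$.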

Now, we generalize the above result from $\alpha = 0$ to any $\alpha \in (0, 1)$. The main idea in our reduction is to show that if there exists an algorithm $\mathcal{A}$ for subsidy factor $\alpha > 0$ whose sum of cost and quality regret is $o(K^{1/3}T^{2/3})$ on every instance, then we can use $\mathcal{A}$ as a subroutine to construct an algorithm $\mathcal{B}$ for problem (1) with $\alpha = 0$ whose sum of cost and quality regret is $o(K^{1/3}T^{2/3})$ on every instance in the family $\{\mathbf{I}_j\}$, thus contradicting the lower bound of Lemma 1. This proves Theorem 2 by contradiction. In order to construct the aforementioned subroutine, we leverage techniques from the Bernoulli factory literature (Keane and O’Brien (1994); Huber (2013)) to generate a sample from a Bernoulli random variable whose parameter is a scaled version of the parameter of another Bernoulli random variable from which we can draw samples. We provide the exact subroutine and the complete proof in Appendix B.

4 Explore-Then-Commit Based Algorithm

We propose an explore-then-commit algorithm, named Cost-Subsidized Explore-Then-Commit (CS-ETC), which has better worst-case performance guarantees than the extensions of the TS and UCB algorithms. As the name suggests, the algorithm first plays each arm for a specified number of rounds. After this exploration phase, the algorithm continues in a UCB-like fashion: in every round, based on upper and lower confidence bounds on the reward of each arm, a feasible set of arms is constructed as an estimate of the set of arms whose mean reward exceeds the smallest tolerated reward, and the lowest-cost arm in this feasible set is pulled. This is detailed in Algorithm 2. The key question is how many exploration rounds are needed before the UCB phase can begin. In the following result (proof in Appendix C), we establish that on the order of $(T/K)^{2/3}$ exploration pulls per arm (up to logarithmic factors) suffice.

Result: Arm to be pulled in each round t
Input : subsidy factor α, costs c_1, …, c_K, τ = no. of exploration pulls per arm
Pure exploration phase:
for t = 1, …, Kτ do
       Pull arm k_t = (t mod K) + 1 (round robin over the arms) to obtain reward X_t;
       Update the number of pulls n_{k_t} and the empirical mean reward μ̂_{k_t};
end for
UCB phase:
for t = Kτ + 1, …, T do
       For each arm k, compute the empirical mean μ̂_k(t) and upper and lower confidence bounds UCB_k(t) and LCB_k(t) on its mean reward;
       Feasible set: F_t = { k : UCB_k(t) ≥ (1 − α) max_j LCB_j(t) };
       Pull arm k_t = argmin_{k ∈ F_t} c_k to obtain reward X_t;
       Update the number of pulls n_{k_t} and the empirical mean reward μ̂_{k_t};
end for
Algorithm 2 Cost-Subsidized Explore-Then-Commit
Theorem 3.

For an instance with $K$ arms, when the number of exploration pulls of each arm is $\tau$ of order $(T/K)^{2/3}$ (up to logarithmic factors), the sum of the cost and quality regret incurred by CS-ETC (Algorithm 2) on any instance $\mathbf{I}$, i.e., $\text{Quality\_Reg}(T, \text{CS-ETC}, \mathbf{I}) + \text{Cost\_Reg}(T, \text{CS-ETC}, \mathbf{I})$, is $\tilde{O}(K^{1/3}T^{2/3})$.

The key reason that sufficient exploration is needed for our problem is that there can be arms with mean rewards very close to each other but significantly different costs. If cost regret were not of concern, then playing either arm would have led to satisfactory performance by giving low quality regret. But the need for performing well on both cost and quality regrets necessitates differentiating between the two arms and finding the one with the cheapest cost among the arms with mean reward above the smallest tolerated reward.

The regret guarantee mainly stems from the exploration phase of the algorithm. In fact, an algorithm that estimates the optimal arm only once after the exploration phase and pulls that arm for the remaining rounds would have the same regret upper bound as CS-ETC. However, we empirically observed that the non-asymptotic performance of such an algorithm is worse than that of Algorithm 2.
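To complement the pseudocode, here is a minimal Python sketch of CS-ETC under the same Bernoulli-reward simulation assumptions as the earlier CS-UCB/CS-TS sketch; the confidence widths, the clipping to [0, 1], and the exact form of the feasibility test are our own illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

def cs_etc(mu, cost, alpha, T, tau):
    # Cost-Subsidized Explore-Then-Commit sketch.
    # tau: number of pure-exploration pulls per arm, e.g. of order (T / K) ** (2 / 3).
    K = len(mu)
    cost = np.asarray(cost)
    pulls = np.zeros(K)
    successes = np.zeros(K)
    history = []

    def pull(k):
        reward = float(rng.random() < mu[k])   # simulated Bernoulli reward draw
        pulls[k] += 1
        successes[k] += reward
        history.append(k)

    # Pure exploration phase: tau round-robin pulls of every arm.
    for t in range(K * tau):
        pull(t % K)

    # UCB phase: keep the arms that could still meet the tolerance, play the cheapest one.
    for t in range(K * tau, T):
        mean = successes / pulls
        width = np.sqrt(2 * np.log(T) / pulls)
        ucb = np.minimum(mean + width, 1.0)
        lcb = np.maximum(mean - width, 0.0)
        feasible = np.where(ucb >= (1 - alpha) * lcb.max())[0]
        pull(feasible[np.argmin(cost[feasible])])
    return history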

5 Performance With Constraints on Costs and Rewards

In this section, we present some extensions of the previous results.

5.1 Consistent Cost and Quality

The lower bound in Theorem 2 is driven by an extreme instance in which arms with very similar mean rewards have very different costs. This raises the following question: can better-performing algorithms be obtained if the rewards and costs are consistent with each other? We show that this is indeed the case. Motivated by the instance that led to the worst-case performance, we consider a constraint that upper bounds the difference in costs of every pair of arms by a multiple of the difference in their mean rewards. Under this constraint, CS-UCB performs well, as stated in the following result (proof in Appendix C).

Theorem 4.

If, for an instance with $K$ arms, $|c_i - c_j| \le \gamma\,|\mu_i - \mu_j|$ for every pair of arms $i, j$ and some (possibly unknown) constant $\gamma > 0$, then the sum of the cost and quality regret of CS-UCB is $\tilde{O}(\sqrt{KT})$, with constants depending on $\gamma$.

Note that, in general, $\gamma$ can be unknown. Hence, even under the above consistency assumption on costs and quality, an algorithm cannot a priori bound the quality difference between arms merely by virtue of knowing their costs.

5.2 Unknown Costs

In some applications, the costs of the arms may also be unknown and random. Hence, in addition to the mean rewards, the mean costs also need to be estimated. Without loss of generality, we assume that the cost distribution of each arm has support [0,1]. Not knowing the costs of the arms does not fundamentally change the regret minimization problem discussed in the preceding sections; in particular, the lower bound result is still valid. Algorithm 2 can be generalized to the unknown-costs setting with a minor modification of its UCB phase, described as Algorithm 4 in Appendix D: we maintain confidence bounds on the cost of each arm and, instead of picking the arm with the lowest cost among all feasible arms, pick the arm with the lowest lower confidence bound on cost. Theorem 3 holds for this modified algorithm as well.

Similarly, when costs and quality are consistent as described in Section 5.1, the CS-UCB algorithm can be modified to pick the arm with the lowest lower confidence bound on cost, and Theorem 4 continues to hold.
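A minimal sketch of this modification for the UCB-phase selection step, assuming observed costs lie in [0, 1] and reusing the confidence-width convention of the earlier sketches (the function name and the shared width are our own choices):

import numpy as np

def pick_arm_unknown_costs(mean_reward, mean_cost, pulls, t, alpha):
    # Unknown-costs variant of the UCB-phase selection: among arms whose reward UCB
    # clears the tolerance, pick the arm with the lowest lower confidence bound on cost.
    width = np.sqrt(2 * np.log(t) / pulls)            # shared confidence width per arm
    reward_ucb = np.minimum(mean_reward + width, 1.0)
    reward_lcb = np.maximum(mean_reward - width, 0.0)
    cost_lcb = np.maximum(mean_cost - width, 0.0)     # optimistic (low) cost estimate
    feasible = np.where(reward_ucb >= (1 - alpha) * reward_lcb.max())[0]
    return feasible[np.argmin(cost_lcb[feasible])]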

6 Numerical Experiments

In the previous sections, we presented theoretical results on the worst-case performance of different algorithms for problem (1). We now illustrate the empirical performance of these algorithms and shed light on which algorithm performs better in which regime of parameter values. The key quantity that differentiates the performance of the algorithms is how close the mean rewards of the different arms are to each other. We consider a setting with two Bernoulli arms and vary the mean reward of one arm (the cheaper arm) while keeping the other quantities (the reward distribution of the other arm and the costs of both arms) fixed. The parameter values are listed in Table 1. The reward in each round follows a Bernoulli distribution, whereas the cost is a known fixed value. The cost and quality regret of the different algorithms at time $T$ are plotted in Figure 2.

Parameter Value
Mean reward of arm 1 0.5
Mean reward of arm 2 0.3-0.6
Cost of arm 1 1
Cost of arm 2 0
Subsidy factor 0.1
Time horizon 5000
Table 1: Parameter values
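A sketch of a driver for this two-arm experiment, reusing the hypothetical cs_etc and regret_metrics sketches from the earlier sections (and therefore inheriting their simplifying assumptions); the step size of the sweep is arbitrary.

import numpy as np

T, alpha = 5000, 0.1
cost = [1.0, 0.0]                        # arm 1 is expensive, arm 2 is free
tau = int((T / 2) ** (2 / 3))            # exploration pulls per arm for CS-ETC

for mu2 in np.arange(0.30, 0.61, 0.05):  # sweep the cheaper arm's mean reward
    mu = [0.5, mu2]
    pulls = cs_etc(mu, cost, alpha, T, tau)
    q_reg, c_reg = regret_metrics(pulls, mu, cost, alpha)
    print(f"mu2={mu2:.2f}  quality regret={q_reg:.1f}  cost regret={c_reg:.1f}")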

We observe that the performance of CS-TS and CS-UCB is close for the entire range of mean reward values. To compare these algorithms with CS-ETC, we focus on how close the mean reward of the cheaper arm, $\mu_2$, is to the smallest tolerated reward. When $\mu_2 \le 0.5$ (so that arm 1 is the highest-reward arm), the smallest tolerated reward is $(1 - 0.1) \times 0.5 = 0.45$. In terms of quality regret, when $\mu_2$ is much smaller than 0.45, CS-TS and CS-UCB perform much better than CS-ETC. This is because the number of exploration rounds in CS-ETC is fixed (independent of the difference in the mean rewards of the two arms), leading to higher quality regret, whereas the large gap between $\mu_2$ and 0.45 allows CS-TS and CS-UCB to easily identify the optimal arm and incur low quality regret. The cost regret of all algorithms is 0 in this regime because the optimal arm is the expensive arm.

When $\mu_2$ is close to 0.45 (so that the cheaper arm just clears the smallest tolerated reward), CS-TS and CS-UCB incur much higher cost regret than CS-ETC. This is in line with the intuition established in Section 2: CS-TS and CS-UCB are unable to conclusively determine that the second (cheaper) arm is optimal and end up pulling the first (expensive) arm many times, leading to high cost regret. CS-ETC, on the other hand, is able to correctly identify the second arm as the optimal arm after its exploration rounds.

Figure 2: Performance of algorithms with varying mean reward of the cheaper arm. The length of the error bars correspond to two standard deviations in regret obtained by running the experiment 50 times.

Thus, we recommend using CS-TS/CS-UCB when the mean rewards of the arms are well separated, and CS-ETC when the mean rewards are close to one another (as is often the case in the SMS application). This is in line with the general observation that algorithms which perform well in the worst case need not perform best in the average case.

6.1 Conclusion and Future Work

In this paper, we proposed a new variant of the MAB problem that accounts for the costs of playing arms and introduced new metrics that uniquely capture the features of multiple real-world applications. We established the hardness of this problem by deriving fundamental limits on the performance of any online algorithm and by demonstrating that traditional MAB algorithms perform poorly from both a theoretical and an empirical standpoint. We presented a simple near-optimal algorithm and, through numerical simulations, prescribed the appropriate algorithmic choice for different problem regimes.

An important question that naturally arises from this work is the development of an algorithm for the adversarial variant of the MAB with cost subsidy problem. In particular, it is not immediately clear whether the EXP3 family of algorithms (Auer et al. (2002b)), which is popular for the non-stochastic MAB problem, can be generalized to the setting where the reward distributions are not stationary.

Acknowledgements

We would like to thank Nicolas Stier-Moses for introducing us to the SMS routing application and helping us in formulating it as an MAB problem.

References

  • Abramowitz and Stegun (1948) M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1948.
  • Agrawal and Devanur (2014) S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, page 989–1006, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325653.
  • Agrawal and Goyal (2017a) S. Agrawal and N. Goyal. Near-optimal regret bounds for thompson sampling. J. ACM, 64(5), 2017a.
  • Agrawal and Goyal (2017b) S. Agrawal and N. Goyal. Near-optimal regret bounds for thompson sampling. Journal of the ACM (JACM), 64(5):1–24, 2017b.
  • Agrawal et al. (2016) S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. A near-optimal exploration-exploitation approach for assortment selection. Proceedings of the 2016 ACM Conference on Economics and Computation (EC), pages 599–600, 2016.
  • Agrawal et al. (2017) S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for the mnl-bandit. In Conference on Learning Theory, pages 76–78, 2017.
  • Amani et al. (2020) S. Amani, M. Alizadeh, and C. Thrampoulidis. Generalized linear bandits with safety constraints. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3562–3566. IEEE, 2020.
  • Amazon (2019) Amazon. Amazon auto-targeting, 2019. URL https://tinyurl.com/yx9lyfwq.
  • Anandkumar et al. (2011) A. Anandkumar, N. Michael, A. K. Tang, and A. Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.
  • Auer et al. (2002a) P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.
  • Auer et al. (2002b) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002b.
  • Badanidiyuru et al. (2018) A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. J. ACM, 65(3):13:1–13:55, Mar. 2018. ISSN 0004-5411. doi: 10.1145/3164539. URL http://doi.acm.org/10.1145/3164539.
  • Bubeck and Cesa-Bianchi (2012) S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems.

    Foundations and Trends® in Machine Learning

    , 5(1):1–122, 2012.
  • Canlas et al. (2010) M. Canlas, K. P. Cruz, M. K. Dimarucut, P. Uyengco, G. Tangonan, M. L. Guico, N. Libatique, and C. Pineda. A quantitative analysis of the quality of service of short message service in the philippines. In 2010 IEEE International Conference on Communication Systems, pages 710–714. IEEE, 2010.
  • Cao et al. (2015) W. Cao, J. Li, Y. Tao, and Z. Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Advances in Neural Information Processing Systems, pages 1036–1044, 2015.
  • Cesa-Bianchi et al. (2014) N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564, 2014.
  • Chen et al. (2016) L. Chen, A. Gupta, and J. Li. Pure exploration of multi-armed bandit under matroid constraints. In Conference on Learning Theory, pages 647–669, 2016.
  • Chen et al. (2014) S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.
  • Daulton et al. (2019) S. Daulton, S. Singh, V. Avadhanula, D. Dimmery, and E. Bakshy. Thompson sampling for contextual bandit problems with auxiliary safety constraints. arXiv preprint arXiv:1911.00638, 2019.
  • Drugan and Nowe (2013) M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In

    The 2013 International Joint Conference on Neural Networks (IJCNN)

    , pages 1–8. IEEE, 2013.
  • Facebook (2016) Facebook. Facebook targeting expansion, 2016. URL https://tinyurl.com/y3ss2j8g.
  • Galichet et al. (2013) N. Galichet, M. Sebag, and O. Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
  • Google (2014) Google. Google auto-targeting, 2014. URL https://tinyurl.com/y3c4bdaj.
  • Huber (2013) M. Huber. Nearly optimal bernoulli factories for linear functions. arXiv preprint arXiv:1308.1562, 2013.
  • Immorlica et al. (2019) N. Immorlica, K. A. Sankararaman, R. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202–219, 2019.
  • Jamieson and Nowak (2014) K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
  • Katz-Samuels and Scott (2019) J. Katz-Samuels and C. Scott. Top feasible arm identification. In

    The 22nd International Conference on Artificial Intelligence and Statistics

    , pages 1593–1601, 2019.
  • Kazerouni et al. (2017) A. Kazerouni, M. Ghavamzadeh, Y. A. Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.
  • Keane and O’Brien (1994) M. Keane and G. L. O’Brien. A bernoulli factory. ACM Transactions on Modeling and Computer Simulation (TOMACS), 4(2):213–219, 1994.
  • Koningstein (2006) R. Koningstein. Suggesting and/or providing targeting information for advertisements, July 6 2006. US Patent App. 11/026,508.
  • Langford and Zhang (2008) J. Langford and T. Zhang.

    The epoch-greedy algorithm for multi-armed bandits with side information.

    In Advances in neural information processing systems, pages 817–824, 2008.
  • Li et al. (2011) L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
  • Li et al. (2015) L. Li, S. Chen, J. Kleban, and A. Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pages 929–934, 2015.
  • MarketWatch (2020) MarketWatch. Marketwatch a2p report, 2020. URL https://rb.gy/0w96oi.
  • May et al. (2012) B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, (13):2069–2106, 2012.
  • Meng et al. (2007) X. Meng, P. Zerfos, V. Samanta, S. H. Wong, and S. Lu. Analysis of the reliability of a nationwide short message service. In IEEE INFOCOM 2007-26th IEEE International Conference on Computer Communications, pages 1811–1819. IEEE, 2007.
  • Oliver and Li (2011) C. Oliver and L. Li. An empirical evaluation of thompson sampling. In Advances in Neural Information Processing Systems (NIPS), 24:2249–2257, 2011.
  • (38) O. Osunade and S. O. Nurudeen. Route optimization for delivery of short message service in telecommunication networks.
  • Ovum (2017) Ovum. Sustaining a2p sms growth while securing mobile network, 2017. URL https://rb.gy/qqonzd.
  • Paria et al. (2018) B. Paria, K. Kandasamy, and B. Póczos. A flexible framework for multi-objective bayesian optimization using random scalarizations. arXiv preprint arXiv:1805.12168, 2018.
  • Robbins (1952) H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • Sankararaman et al. (2019) A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
  • Scott (2010) S. L. Scott. A modern bayesian look at the multi-armed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
  • Slivkins (2019) A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019. ISSN 1935-8237. URL http://dx.doi.org/10.1561/2200000068.
  • Slivkins and Vaughan (2014) A. Slivkins and J. W. Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges. ACM SIGecom Exchanges, 12(2):4–23, 2014.
  • Thompson (1933) W. Thompson.

    On the likelihood that one unknown probability exceeds another in view of the evidence of two samples.

    Biometrika, 25(3/4):285–294, 1933.
  • Tran-Thanh et al. (2012) L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget–limited multi–armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
  • Twilio and Uber (2020) Twilio and Uber. Uber built a great ridesharing experience with sms & voice, 2020. URL https://customers.twilio.com/208/uber/.
  • Wu et al. (2016) Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016.
  • Yahyaa and Manderick (2015) S. Yahyaa and B. Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings, page 47. Presses universitaires de Louvain, 2015.
  • Yahyaa et al. (2014) S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Annealing-pareto multi-objective multi-armed bandit algorithm. In

    2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)

    , pages 1–8. IEEE, 2014.
  • Zerfos et al. (2006) P. Zerfos, X. Meng, S. H. Wong, V. Samanta, and S. Lu. A study of the short message service of a nationwide cellular network. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 263–268, 2006.

Multi-armed Bandits with Cost Subsidy:

Supplementary Material

Outline

The supplementary material of the paper is organized as follows.

  • Appendix A contains technical lemmas used in subsequent proofs.

  • Appendix B contains a proof of the lower bound.

  • Appendix C contains proofs related to the performance of various algorithms presented in the paper.

  • Appendix D gives a detailed description of the CS-ETC algorithm when the costs of the arms are unknown and random.


Appendix A Technical Lemmas

Lemma 2 (Taylor’s Series Approximation).

For .

Proof.

For ,

Lemma 3 (Taylor’s Series Approximation).

For .

Proof.

For ,

Lemma 4 (Pinsker’s inequality).

Let $B_p$ denote a Bernoulli distribution with mean $p$, where $p \in (0, 1)$. Then the total variation distance between two Bernoulli distributions $B_p$ and $B_q$ satisfies Pinsker's inequality, $\mathrm{TV}(B_p, B_q) \le \sqrt{\tfrac{1}{2}\,\mathrm{KL}(B_p, B_q)}$, where the KL divergence between two Bernoulli distributions with means $p$ and $q$ is given by $\mathrm{KL}(B_p, B_q) = p \log\frac{p}{q} + (1-p)\log\frac{1-p}{1-q}$.

Proof.

Thus,

Appendix B Proof of Lower Bound

Proof of Lemma 1.

In the family of instances $\{\mathbf{I}_j\}$, the costs of the arms are the same across instances, and arm 0 is the cheapest arm in every instance. With this, we define a modified notion of quality regret that accrues regret only in the rounds in which this cheap arm is pulled:

$\widetilde{\text{Quality\_Reg}}(T, \pi, \mathbf{I}) = \sum_{t=1}^{T} \mathbb{E}\left[\mathbb{1}\{\pi_t = 0\}\,\max\left(0,\ (1-\alpha)\,\mu_{a^*} - \mu_0\right)\right]. \qquad (2)$

This modified quality regret is at most the quality regret. To prove the lemma, we will show the stronger result that there exists an instance on which the sum of the cost regret and the modified quality regret is $\Omega(K^{1/3}T^{2/3})$, which implies the required result.

Let us first consider any deterministic policy (algorithm) $\pi$. For a deterministic algorithm, the number of times an arm is pulled is a function of the observed rewards. Let the number of times arm $j$ is played be denoted by $N_j$, and let the total number of times any arm with cost 1 (i.e., an expensive arm) is played be $N_{\text{exp}}$. For any $j \ge 1$, we can use the proof of Lemma A.1 in Auer et al. (2002b) to relate the expected number of pulls under instance $\mathbf{I}_j$ to that under the base instance $\mathbf{I}_0$, where $\mathbb{E}_j$ denotes the expectation operator with respect to the probability distribution defined by the random rewards in instance $\mathbf{I}_j$. Thus, using Lemma 4, we get

(3)

Now, let us look at the regret of the algorithm for each instance in the family . We have

  1. .

Now, define randomized instance as the instance obtained by randomly choosing from the family of instances such that with probability and with probability for . The expected regret of this randomized instance is

Taking , we get is when .

Using Yao’s principle, for any randomized algorithm , there exists an instance with such that is . Also, since , we have is . ∎

Proof of Theorem 2.

Notation: For any instance $\mathbf{I}$, we define $a^*(\mathbf{I})$ and $\ell^*(\mathbf{I})$ as the arm with the highest mean reward and the cheapest arm whose mean reward is at least the smallest tolerated reward, respectively. When the instance is clear from context, we use the simplified notation $a^*$ and $\ell^*$.

Proof Sketch:

Lemma 1 establishes that when $\alpha = 0$, for any given policy, there exists an instance on which the sum of the quality and cost regret is $\Omega(K^{1/3}T^{2/3})$. We now generalize this result to any $\alpha \in (0, 1)$. The main idea in our reduction is to show that if there exists an algorithm $\mathcal{A}$ for subsidy factor $\alpha > 0$ whose sum of cost and quality regret is $o(K^{1/3}T^{2/3})$ on every instance, then we can use $\mathcal{A}$ as a subroutine to construct an algorithm $\mathcal{B}$ for problem (1) with $\alpha = 0$ whose sum of cost and quality regret is $o(K^{1/3}T^{2/3})$ on every instance in the family $\{\mathbf{I}_j\}$, thus contradicting the lower bound of Lemma 1. This proves the theorem by contradiction. To construct the aforementioned subroutine, we leverage techniques from the Bernoulli factory literature to generate a sample from a Bernoulli random variable whose parameter is a scaled version of the parameter of another Bernoulli random variable from which we can draw samples.

Aside on Bernoulli Factory:

The key tool we use in constructing the algorithm $\mathcal{B}$ from $\mathcal{A}$ is a Bernoulli factory for a linear function. A Bernoulli factory with a specified scaling factor $C$ uses a sequence of independent and identically distributed samples from $\text{Bern}(p)$ and returns a sample from $\text{Bern}(Cp)$. The key aspect of a Bernoulli factory is the number of samples from $\text{Bern}(p)$ needed to generate one sample from $\text{Bern}(Cp)$. We use the Bernoulli factory described in Huber (2013), which comes with a guarantee on the expected number of samples from $\text{Bern}(p)$ needed to generate a sample from $\text{Bern}(Cp)$. In particular, for a specified tolerance,

(4)
Detailed proof:

For some value of (to be specified later in the proof) such that and , consider the family of instances and . Let be any algorithm for the family . Using , we construct an algorithm for the family . This algorithm is described in Algorithm 3. We will use to denote the arm pulled by algorithm at time after having observed rewards through arm pulls . The function returns two values - a random sample from the distribution and the number of samples of needed to generate this random sample.

Result: Arm to be pulled in each round , total number of arm pulls
input : Algorithm , - Number of arm pulls for algorithm
;
for