1 Introduction
In the traditional (stochastic) MAB problem (Robbins (1952)), the learning agent has access to a set of actions (arms) with unknown, but fixed reward distributions and has to repeatedly select an arm to maximize the cumulative reward. Here, the challenge is designing a policy that balances the tension between acquiring information about arms with few historical observations and exploiting the most rewarding arm based on existing information. This exploration-exploitation tradeoff has been extensively studied, leading to a number of simple but extremely effective algorithms like Upper Confidence Bound (Auer et al. (2002a)) and Thompson Sampling (Thompson (1933); Agrawal and Goyal (2017a)), which have been further generalized and applied in a wide range of application domains including online advertising (Langford and Zhang (2008); Cesa-Bianchi et al. (2014); Oliver and Li (2011)), recommendation systems (Li et al. (2015, 2011); Agrawal et al. (2016)), and social networks and crowdsourcing (Anandkumar et al. (2011); Sankararaman et al. (2019); Slivkins and Vaughan (2014)); see Bubeck and Cesa-Bianchi (2012) and Slivkins (2019) for a detailed review. However, most of these approaches cannot be generalized to settings involving multiple metrics (for example, reward and cost) when the underlying tradeoffs between these metrics are not known a priori.
In many real-world applications of MAB, some of which we elaborate on below, it is common for the agent to incur costs to play an arm, with high performing arms costing more. Though one can model this in the traditional MAB framework by considering cost subtracted from the reward as the modified objective, such a modification is not always meaningful, particularly in settings where the reward and cost associated with an arm represent different quantities (for example, the click rate and the cost of an ad). In such problems, it is natural for the learning agent to optimize both metrics, typically trying to avoid incurring exorbitant costs for a marginal increase in cumulative reward. Motivated by this scenario, in this paper we consider a variant of the MAB problem where the agent is not only concerned with balancing the exploration-exploitation tradeoff to maximize the cumulative reward, but also with balancing the tradeoffs associated with multiple objectives that are intrinsic to several practical applications. More specifically, we study a stylized problem where, to manage costs, the agent is willing to tolerate a small loss from the highest reward, measured as the reward that could be obtained in the traditional MAB problem in the absence of costs. We refer to this problem as the MAB problem with a cost subsidy (see Sec 1.1 for the exact problem formulation), where the subsidy refers to the amount of reward the learning agent is willing to forgo to improve costs. Before we explain our problem and technical contributions in detail, we elaborate on the applications that motivate this problem.
Intelligent SMS Routing. Many businesses such as banks, delivery services, airlines, hotels, and various online platforms send SMSes (text messages) to their users for a variety of reasons, including two-factor authentication, order confirmations, appointment reminders, transaction alerts, and as a direct marketing line (see Twilio and Uber (2020)). These text messages, referred to as Application-to-Person (A2P) messages, constitute a significant portion of all text messages sent through cellular networks today. In fact, A2P messaging is forecasted to be an $86.3 billion business by 2025 (MarketWatch (2020)).
To deliver these messages, businesses typically enlist the support of telecom aggregators, who have private agreements with mobile operators. Each aggregator offers a unique combination of quality, as measured by the fraction of text messages it successfully delivers, and price per message. Surprisingly, it is common for delivery rates of text messages to not be very high (see Canlas et al. (2010); Meng et al. (2007); Zerfos et al. (2006); Osunade and Nurudeen for QoS analysis in different geographies) and for an aggregator's quality to fluctuate with time due to various reasons ranging from network outages to traffic congestion. Therefore, the platform's problem of balancing the tension between inferring aggregators' quality through exploration and exploiting the current best performing aggregator to maximize the number of messages delivered to users leads to a standard MAB formulation. However, given the large volume of messages that need to be dispatched, an MAB-based solution that focuses exclusively on the quality of the aggregator could result in exorbitant spending for the business. A survey of businesses shows that the number of text messages they are willing to send would drop significantly if the cost per SMS increased by a few cents (Ovum (2017)). Moreover, in many situations platforms have backup communication channels such as email-based authentication or notifications via in-app/website features, which, though not as effective as a text message in terms of read rate, can be used if guaranteeing text message delivery proves to be very costly. Therefore, it is natural for businesses to prefer an aggregator with lower costs as long as its quality is comparable to that of the aggregator with the best quality.
Ad-audience Optimization. We now describe another real-world application in the context of online advertisements. Many advertisers (especially small-to-medium scale businesses) have increasingly embraced the notion of auto-targeting, where they let the advertising platform identify a high-quality audience group (e.g., Koningstein (2006); Amazon (2019); Facebook (2016); Google (2014)). To enable this, the platform explores the audience space to identify cheaper opportunities that also give a high click-through rate (CTR) and conversion rate. Here, it is possible for different audience groups to have different yields, i.e., quality (CTR/conversion rate), for a specific ad. However, it may require vastly different bids to reach different audiences due to auction overlap with other ad campaigns with narrower audience targeting. Thus, the algorithm is faced with a similar tradeoff: among the audience groups that give a high yield, the goal is to find the cheapest one.
We now present a novel formulation of a multi-armed bandit problem that captures key features of these applications, where our goal is to develop a cost-sensitive MAB algorithm that balances both the exploration-exploitation tradeoff and the tension between conflicting metrics in a multi-objective setting.
1.1 Problem formulation
To formally state our problem, given an instance $I$, in every round $t \in \{1, \ldots, T\}$ the agent chooses an arm $\ell_t$ from a set of $K$ arms and realizes a reward sampled independently from a fixed, but unknown distribution with mean $\mu_{\ell_t}$, and incurs a cost $c_{\ell_t}$, which is known a priori. Here, in order to manage costs, we allow the agent to be agnostic between arms whose expected reward is greater than a $1-\alpha$ fraction of the highest expected reward, for a fixed and known value of $\alpha \in [0,1]$, which we refer to as the subsidy factor. The agent's objective is to learn and pull the cheapest arm among these high-quality arms as frequently as possible.
More specifically, let $\ell^*$ denote the arm with the highest expected reward, i.e., $\ell^* = \arg\max_{\ell} \mu_\ell$, and consider the set of arms whose expected reward is within a $1-\alpha$ factor of the highest expected reward, i.e., $\{\ell : \mu_\ell \geq (1-\alpha)\mu_{\ell^*}\}$. We refer to the quantity $(1-\alpha)\mu_{\ell^*}$ as the smallest tolerated reward. Without loss of generality, we assume each reward distribution has support $[0,1]$. The goal of the agent is to design a policy (algorithm) that will learn the cheapest arm whose expected reward is at least as large as the smallest tolerated reward. In other words, the agent needs to learn the identity of the arm $\ell_0 = \arg\min\{c_\ell : \mu_\ell \geq (1-\alpha)\mu_{\ell^*}\}$ and simultaneously maximize the number of plays of arm $\ell_0$. Since in the SMS application the reward is the quality of the chosen aggregator, we will use the terms reward and quality interchangeably.
To measure the performance of any policy $\pi$, we propose two notions of regret, quality regret and cost regret, with the agent's goal being to minimize both of them:

$$\mathrm{Qual\_Reg}(T, I) = \mathbb{E}\left[\sum_{t=1}^{T} \left((1-\alpha)\mu_{\ell^*} - \mu_{\ell_t}\right)^+\right], \qquad \mathrm{Cost\_Reg}(T, I) = \mathbb{E}\left[\sum_{t=1}^{T} \left(c_{\ell_t} - c_{\ell_0}\right)^+\right], \tag{1}$$

where $\ell_0 = \arg\min\{c_\ell : \mu_\ell \geq (1-\alpha)\mu_{\ell^*}\}$ and the expectation is over the randomness in policy $\pi$. The cost and quality regret of policy $\pi$ are thus defined on an instance $I$ of the problem, where the instance is specified by the distributions of the reward and the cost of each arm. The objective then is to design a policy that simultaneously minimizes both the cost and the quality regret for all possible choices of the problem parameters (equivalently, all instances $I$).
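As a concrete illustration, the two regret metrics in (1) can be computed from a trace of arm pulls as follows (a minimal sketch in Python; the array-based encoding of an instance is our own convention):

```python
import numpy as np

def regrets(mu, c, alpha, pulls):
    """Quality and cost regret of a sequence of pulls on one instance.

    mu, c : per-arm mean rewards and (known) costs.
    alpha : subsidy factor; any arm with mean reward at least
            (1 - alpha) * max(mu) is considered acceptable.
    pulls : sequence of arm indices chosen by the policy.
    """
    mu, c = np.asarray(mu, float), np.asarray(c, float)
    tol = (1 - alpha) * mu.max()          # smallest tolerated reward
    feasible = mu >= tol                  # acceptable arms
    c_star = c[feasible].min()            # cost of the benchmark arm
    pulls = np.asarray(pulls)
    qual_reg = np.maximum(tol - mu[pulls], 0.0).sum()
    cost_reg = np.maximum(c[pulls] - c_star, 0.0).sum()
    return qual_reg, cost_reg
```

Note that both regrets use positive parts: pulling an arm better than the smallest tolerated reward earns no negative quality regret, and pulling an arm cheaper than the benchmark earns no negative cost regret.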
Remark 1 (Choice of objective function).
Note that a parametrized linear combination of the reward and cost metrics, i.e., maximizing $\mu_\ell - \lambda c_\ell$ for an appropriately chosen $\lambda > 0$, is a popular approach to balance the cost-reward tradeoff. However, the setting considered in this paper is not equivalent to this approach. In particular, for any specified subsidy factor $\alpha$, the value of $\lambda$ required in the linear objective function for $\ell_0$ to be the optimal arm would depend on the cost and reward distributions of the arms. Therefore, using a single value of $\lambda$ and relying on standard MAB algorithms would not lead to the desired outcome for our problem. Further, from an application standpoint, there are two important considerations. First, in a real-world system we need explicit control over a parameter that is not instance-dependent in order to understand and defend the tradeoff between the various objectives. Second, for the intelligent SMS routing application discussed earlier, different sets of aggregators operate in different regions. Thus, separate values of $\lambda$ would need to be configured for each region, making the process cumbersome.
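To see why a single weight cannot work across instances, the following sketch computes the smallest weight $\lambda$ for which the subsidized-optimal arm maximizes $\mu_\ell - \lambda c_\ell$ on a given instance; two instances with the same subsidy factor can require very different weights (the helper `min_lambda` and the example numbers are illustrative, not from the paper):

```python
def min_lambda(mu, c, alpha):
    """Smallest weight lam making the subsidized-optimal arm the
    argmax of mu[i] - lam * c[i] (illustrative helper)."""
    tol = (1 - alpha) * max(mu)
    feas = [i for i in range(len(mu)) if mu[i] >= tol]
    best = min(feas, key=lambda i: c[i])      # cheapest acceptable arm
    # Need mu[best] - lam*c[best] >= mu[i] - lam*c[i] for every arm i,
    # which binds only for arms more expensive than the best arm.
    lams = [(mu[i] - mu[best]) / (c[i] - c[best])
            for i in range(len(mu)) if c[i] > c[best]]
    return max(lams, default=0.0)
```

For example, with $\alpha = 0.1$, the instance with means $(1.0, 0.95)$ and costs $(1.0, 0.0)$ needs only $\lambda \geq 0.05$, whereas the same means with costs $(0.05, 0.0)$ need $\lambda \geq 1$: the required weight depends on the instance, not just on $\alpha$.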
1.2 Related Work
Our problem is closely related to the line of work on MAB with multiple objectives, which has attracted considerable attention in recent times. The existing literature on multi-objective MAB can be broadly classified into three different categories.
Bandits with Knapsacks (BwK). Bandits with knapsacks (BwK), introduced in the seminal work of Badanidiyuru et al. (2018), is a general framework that considers the standard MAB problem under additional budget/resource constraints. The BwK problem encapsulates a large number of constrained bandit problems that naturally arise in many application domains, including dynamic pricing, auction bidding, routing, and scheduling (see Tran-Thanh et al. (2012); Agrawal and Devanur (2014); Immorlica et al. (2019)). In this formulation, the agent has access to a set of finite resources and a set of arms, each associated with a reward distribution. Upon playing an arm at time $t$, the agent realizes a reward and incurs a penalty for each resource, all drawn from a fixed, but unknown distribution corresponding to the arm. The objective of the agent is to maximize the cumulative reward before one of the resources is completely depleted. Although appealing in many applications, the BwK formulation requires a hard constraint on resources (cost in our setting) and hence cannot be easily generalized to our problem. In particular, in the cost subsidized MAB problem, the equivalent budget limits depend on the problem instance and therefore cannot be determined a priori.
Pareto Optimality and Composite Objective. The second category focuses on identifying Pareto optimal alternatives and uniformly choosing among these options (see Drugan and Nowe (2013); Yahyaa et al. (2014); Paria et al. (2018); Yahyaa and Manderick (2015)). These approaches do not apply to our problem, since some of the Pareto optimal alternatives could have extreme values for one of the metrics, for example very low cost and low quality, or extremely high cost and quality, making them undesirable for the applications discussed earlier. Closely related to this line of work are approaches that focus on a composite objective obtained by appropriately weighting the different metrics (see Paria et al. (2018); Yahyaa and Manderick (2015)). Such formulations also do not immediately apply to our problem, since in the SMS and ad applications discussed earlier it is not acceptable to drop the quality beyond the allowed level irrespective of the cost savings that could be obtained. Furthermore, in the SMS application, the tradeoffs between quality and costs could vary from region to region, making it hard to identify a good set of weights for the composite objective (see Remark 1).
Conservative Bandits and Bandits with Safety Constraints. Two other lines of work that have recently received increased attention, particularly from practitioners, are bandits with safety constraints (e.g., Daulton et al. (2019); Amani et al. (2020); Galichet et al. (2013)) and conservative bandits (e.g., Wu et al. (2016); Kazerouni et al. (2017)). In both of these formulations, the algorithm chooses one of the arms and receives the associated reward and cost. The goal of the algorithm is to maximize the total reward obtained while ensuring that either the cost of the chosen arm is within a pre-specified threshold (when the costs of arms are unknown a priori) or the reward of the arm is at least a specified fraction of that of a known benchmark arm. Neither of these models exactly captures the requirements of our applications: (a) we do not have a hard constraint on the acceptable cost of a pulled arm; in particular, choosing low-quality aggregators to avoid high costs (even for a few rounds) could be disastrous, since it leads to a bad user experience on the platform and eventual churn; (b) the equivalent benchmark arm in our case, i.e., the arm with the highest mean reward, is not known a priori.
Best Arm Identification. Apart from the closely related works mentioned above, our problem of identifying the cheapest arm whose expected reward is within an acceptable margin of the highest reward can be formulated as a stylized version of the best-arm identification problem (Katz-Samuels and Scott (2019); Jamieson and Nowak (2014); Chen et al. (2014); Cao et al. (2015); Chen et al. (2016)). However, in many settings, and particularly in the applications discussed earlier, the agent's objective is optimizing the cumulative reward and not just identifying the best arm.
1.3 Our Contributions
Novel Problem Formulation. In this work, we propose a stylized model, MAB with a cost subsidy, and introduce new performance metrics that uniquely capture the salient features of many real-world online learning problems involving multiple objectives. For this problem, we first show that naive generalizations of popular algorithms like Upper Confidence Bound (UCB) and Thompson Sampling (TS) can lead to poor performance on these metrics. In particular, we show that the naive generalization of TS for this problem incurs a linear cost regret on some problem instances.
Lower Bound. We establish a fundamental limit on the performance of any online algorithm for our problem. More specifically, we show that any online learning algorithm must incur a regret of $\Omega(T^{2/3})$ on either the cost or the quality metric (refer to (1)), establishing the hardness of our problem relative to the standard MAB problem, for which it is possible to design algorithms that achieve a worst-case regret bound of $O(\sqrt{KT})$. We introduce a novel reduction technique to derive the above lower bound, which is of independent interest.
Cost-Subsidized Explore-Then-Commit. We present a simple algorithm, based on the explore-then-commit (ETC) principle, and show that it achieves near-optimal performance guarantees. In particular, we establish that our algorithm achieves a worst-case bound of $\tilde{O}(K^{1/3}T^{2/3})$ for both the cost and the quality regret. A key challenge in generalizing the ETC algorithm to this problem arises from having to balance two asymmetric objectives. We also discuss generalizations of the algorithm to settings where the costs of the arms are not known a priori. Furthermore, we consider a special scenario, in which the differences in arm costs are bounded by a multiple of the differences in their mean rewards, where naive generalizations of TS and UCB work reasonably well, and establish worst-case regret bounds.
Numerical Simulation. Lastly, we perform extensive simulations to understand various regimes of the problem parameters and to compare different algorithms. More specifically, we identify scenarios where the naive generalizations of UCB and TS, which have been adopted in real-life implementations (see Daulton et al. (2019)), perform well, and settings where they perform poorly, which should be of interest to practitioners.
1.4 Outline
The rest of this paper is structured as follows. In Section 2, we show that naive generalizations of the TS and UCB algorithms perform poorly, and in Section 3, we establish lower bounds on the performance of any algorithm for the MAB with cost subsidy problem. In Section 4, we present a variation of the ETC algorithm and show that it achieves a near-optimal regret bound for both metrics. In Section 5, we show that with additional assumptions it is possible to obtain improved performance bounds for naive generalizations of existing algorithms. Finally, in Section 6, we perform numerical simulations to explore various regimes of the instance space.
2 Performance of Existing MAB Algorithms
In this section, we consider natural extensions of two popular MAB algorithms, TS and UCB, to our problem and show that such adaptations perform poorly. This highlights the challenges involved in developing good algorithms for the MAB problem with cost subsidy. In particular, we establish theoretically that for some problem instances the TS variant incurs a linear cost regret, and we observe similar performance for the UCB variant empirically. Our focus on TS in this section is primarily motivated by the superior performance of TS over more traditional approaches such as UCB that has been observed in a stream of recent papers (see Scott (2010); Oliver and Li (2011); May et al. (2012); Agrawal et al. (2017)).
We present the details of the TS and UCB adaptations in Algorithm 1; we refer to them as Cost-Subsidized TS (CS-TS) and Cost-Subsidized UCB (CS-UCB), respectively. These extensions are inspired by Daulton et al. (2019), which demonstrates empirical efficacy on a related (but different) problem. Briefly, in the CS-TS (CS-UCB) variation, we follow the standard TS (UCB) algorithm and obtain a quality score for each arm, which is a sample from the posterior distribution (the upper confidence bound). We then construct a feasible set consisting of the arms whose quality scores are greater than a $1-\alpha$ fraction of the highest quality score. Finally, we pull the cheapest arm in the feasible set.
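A single round of CS-TS can be sketched as follows (an illustrative sketch with Gaussian posteriors; the posterior variance $1/(n+1)$ and the clipping of scores to $[0,1]$ are our own simplifying choices, not necessarily those of Algorithm 1):

```python
import numpy as np

rng = np.random.default_rng(0)

def cs_ts_round(sum_rewards, n_pulls, costs, alpha):
    """One round of CS-TS with Gaussian posteriors (illustrative sketch).

    Posterior for arm i: N(empirical mean, 1 / (n_i + 1)).  Scores are
    clipped to [0, 1] since rewards are supported on [0, 1] (our choice).
    """
    means = sum_rewards / np.maximum(n_pulls, 1)
    scores = rng.normal(means, 1.0 / np.sqrt(n_pulls + 1))
    scores = np.clip(scores, 0.0, 1.0)
    # Feasible set: arms whose score is within a (1 - alpha) factor of the best.
    feasible = scores >= (1 - alpha) * scores.max()
    candidates = np.flatnonzero(feasible)
    return candidates[np.argmin(costs[candidates])]   # cheapest feasible arm
```

The CS-UCB variant is identical except that the sampled score is replaced by an upper confidence bound on the mean reward.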
We will now show that CS-TS with Gaussian priors and posteriors (i.e., a Gaussian posterior whose mean is the empirical mean reward of the arm and whose variance is inversely proportional to the number of times the arm has been pulled), as described in Algorithm 1, incurs a linear cost regret in the worst case. More precisely, we prove the following result.

Theorem 1.
For any given $\alpha \in (0, 1)$, there exists an instance $I$ of problem (1) such that the cost regret of CS-TS, $\mathrm{Cost\_Reg}(T, I)$, is $\Omega(T)$.
Proof Sketch.
The proof closely follows the lower bound argument in Agrawal and Goyal (2017b). We briefly describe the intuition behind the result. Consider a scenario where the highest reward arm is an expensive arm, while all other arms are cheap and have rewards marginally above the smallest tolerated reward. In the traditional MAB problem, the anti-concentration property of the Gaussian distribution (see Agrawal and Goyal (2017b)) ensures that samples from a good arm are large often enough, ensuring appropriate exploration and good performance. However, in our problem, the same anti-concentration property results in playing the expensive arm too often, since the difference in the mean qualities is small; the algorithm thus incurs a linear cost regret while achieving zero quality regret. A complete proof of the theorem is provided in Appendix C. ∎
The poor performance of the algorithm is not limited to the above instance and the use of a Gaussian prior. More generally, the CS-TS and CS-UCB algorithms appear to perform poorly whenever the mean reward of the optimal arm is very close to the smallest tolerated reward. We illustrate this through another empirical example. Consider an instance with two arms, each having Bernoulli rewards, where the expensive arm has the higher mean reward and the mean reward of the cheaper arm is at least as large as (and close to) the smallest tolerated reward. The prior on the mean reward of both arms is a Beta(1,1) distribution. Here, the quality regret will be zero irrespective of which arm is played. But both CS-TS and CS-UCB incur significant cost regret, as shown in Figure 1. (In the figure, we also plot the performance of the key algorithm we propose in this paper (Algorithm 2) and note that its performance is much superior to that of CS-TS and CS-UCB.)
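A small simulation of a Beta-Bernoulli variant of CS-TS on such an instance makes the effect visible (the means, costs, and subsidy factor below are our own illustrative choices, not the exact values used for Figure 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two Bernoulli arms; the cheap arm's mean equals the smallest tolerated
# reward (1 - alpha) * 0.9 = 0.72, so quality regret is zero either way.
mu, cost, alpha, T = np.array([0.9, 0.72]), np.array([1.0, 0.0]), 0.2, 2000
a, b = np.ones(2), np.ones(2)             # Beta(1,1) priors
cost_reg = 0.0
for _ in range(T):
    scores = rng.beta(a, b)               # posterior samples (quality scores)
    cand = np.flatnonzero(scores >= (1 - alpha) * scores.max())
    i = cand[np.argmin(cost[cand])]       # cheapest feasible arm
    r = rng.random() < mu[i]              # Bernoulli reward
    a[i] += r
    b[i] += 1 - r
    cost_reg += cost[i]                   # benchmark (cheap) arm has cost 0
print(cost_reg)
```

Because the cheap arm's posterior samples hover right at the feasibility threshold, the expensive arm keeps being pulled and the accumulated cost regret grows roughly linearly in the horizon.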
3 Lower Bound
In this section, we establish that any policy must incur a regret of $\Omega(T^{2/3})$ on at least one of the two regret metrics. More precisely, we prove the following result.
Theorem 2.
For any given $K$ and any (possibly randomized) policy $\pi$, there exists an instance $I$ of problem (1) with $K$ arms such that $\mathrm{Cost\_Reg}(T, I) + \mathrm{Qual\_Reg}(T, I)$ is $\Omega(T^{2/3})$, for all sufficiently large $T$ and a suitable range of the subsidy factor $\alpha$.
3.1 Proof Overview
We consider the following family of instances to establish the lower bound. More specifically, we first prove the result for a special case of the problem parameters and then establish a reduction from the general case to this special case.
Definition 1 (Family of instances $\mathcal{I}$).
Define a family $\mathcal{I} = \{I_1, \ldots, I_K\}$ of instances, each with $K$ Bernoulli arms indexed by $\ell \in [K]$. In the instance $I_1$, one arm is expensive and all other arms are cheap, with the mean rewards of all arms lying within a small margin of one another. The instance $I_j$, for $j \geq 2$, is identical to $I_1$ except that the mean reward of arm $j$ is raised by $\epsilon$, where $\epsilon > 0$ is a parameter to be chosen later.
Lemma 1.
For any given $K$ and any (possibly randomized) policy $\pi$, there exists an instance $I$ from the family $\mathcal{I}$ such that $\mathrm{Cost\_Reg}(T, I) + \mathrm{Qual\_Reg}(T, I)$ is $\Omega(T^{2/3})$.
Lemma 1 establishes that any policy must incur a regret of $\Omega(T^{2/3})$ on some instance from the family $\mathcal{I}$. To prove Lemma 1, we argue that any online learning algorithm will not be able to differentiate the instance $I_1$ from the instance $I_j$ for some $j$ without sufficient exploration, and therefore must either incur a high cost regret (if it explores enough to tell the instances apart) or a high quality regret (if it does not). More specifically, any online algorithm requires $\Omega(1/\epsilon^2)$ samples to distinguish instance $I_1$ from instance $I_j$. Hence, any policy can avoid high quality regret by exploring for $\Omega(1/\epsilon^2)$ rounds, incurring a cost regret of $\Omega(1/\epsilon^2)$, or it can incur zero cost regret at the expense of $\Omega(\epsilon T)$ regret on the reward metric. This suggests a tradeoff between $1/\epsilon^2$ and $\epsilon T$, which are of the same magnitude at $\epsilon = T^{-1/3}$, resulting in the aforementioned lower bound. The complete proof generalizes techniques from the standard MAB lower bound proof and is provided in Appendix B. ∎
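The balancing step in this argument can be written out explicitly: the adversary picks the gap $\epsilon$ to maximize the smaller of the two regret sources.

```latex
% Exploration cost ~ 1/eps^2 vs. quality loss from not exploring ~ eps * T:
\max_{\epsilon > 0} \; \min\!\left( \epsilon^{-2},\; \epsilon T \right)
\quad\text{is attained when}\quad
\epsilon^{-2} = \epsilon T
\;\Longleftrightarrow\;
\epsilon = T^{-1/3},
\quad\text{giving a regret of order } T^{2/3}.
```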
Now, we generalize the above result to the full range of problem parameters. The main idea in our reduction is to show that if there exists an algorithm $\mathcal{A}$ for the general problem such that $\mathrm{Cost\_Reg} + \mathrm{Qual\_Reg}$ is $o(T^{2/3})$ on every instance, then we can use $\mathcal{A}$ as a subroutine to construct an algorithm for problem (1) such that $\mathrm{Cost\_Reg} + \mathrm{Qual\_Reg}$ is $o(T^{2/3})$ on every instance in $\mathcal{I}$, thus contradicting the lower bound of Lemma 1. This proves Theorem 2 by contradiction. In order to construct the aforementioned subroutine, we leverage techniques from the Bernoulli factory literature (Keane and O'Brien (1994); Huber (2013)) to generate a sample from a Bernoulli random variable with one parameter using samples from a Bernoulli random variable with another parameter. We provide the exact subroutine and the complete proof in Appendix B.

4 Explore-Then-Commit Based Algorithm
We propose an explore-then-commit algorithm, named Cost-Subsidized Explore-Then-Commit (CS-ETC), which has better worst-case performance guarantees than the extensions of the TS and UCB algorithms. As the name suggests, the algorithm first plays each arm for a specified number of rounds. After sufficient exploration, the algorithm continues in a UCB-like fashion: in every round, based on upper and lower confidence bounds on the reward of each arm, a feasible set of arms is constructed as an estimate of the set of arms with mean reward greater than the smallest tolerated reward, and the lowest-cost arm in this feasible set is pulled. This is detailed in Algorithm 2. The key question that arises in this algorithm is how many exploration rounds are needed before exploitation can begin. We establish in the following result that on the order of $(T/K)^{2/3}$ exploration pulls per arm (up to logarithmic factors) are sufficient (proof in Appendix C).

Theorem 3.
For an instance with $K$ arms, when the number of exploration pulls of each arm is chosen appropriately (of order $(T/K)^{2/3}$, up to logarithmic factors), the sum of the cost and quality regret incurred by CS-ETC (Algorithm 2) on any instance, i.e., $\mathrm{Cost\_Reg}(T, I) + \mathrm{Qual\_Reg}(T, I)$, is $\tilde{O}(K^{1/3}T^{2/3})$.
The key reason that sufficient exploration is needed for our problem is that there can be arms with mean rewards very close to each other but significantly different costs. If cost regret were not of concern, then playing either arm would lead to satisfactory performance with low quality regret. But the need to perform well on both the cost and the quality regret necessitates differentiating between the two arms and finding the cheapest one among the arms with mean reward above the smallest tolerated reward.
The regret guarantee stems mainly from the exploration phase of the algorithm. In fact, an algorithm that estimates the optimal arm only once after the exploration phase and pulls that arm for the remaining rounds would have the same regret upper bound as CS-ETC. However, we empirically observed that the non-asymptotic performance of such an algorithm is worse than that of Algorithm 2.
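A simplified simulation of CS-ETC illustrates the two phases (a sketch; the confidence radius and the feasibility rule below are plausible choices for illustration, not necessarily the exact ones used in Algorithm 2):

```python
import numpy as np

def cs_etc(mu, costs, alpha, T, tau, rng):
    """Simulate a CS-ETC-style algorithm on Bernoulli arms (sketch).

    Each arm is explored tau times round-robin; afterwards, a feasible
    set is formed from confidence bounds on the rewards and the cheapest
    feasible arm is pulled.
    """
    K = len(mu)
    n, s = np.zeros(K), np.zeros(K)       # pull counts and reward sums
    pulls = []
    for t in range(T):
        if t < K * tau:
            i = t % K                     # round-robin exploration
        else:
            mean = s / n
            rad = np.sqrt(2 * np.log(T) / n)
            ucb, lcb = mean + rad, mean - rad
            # Plausibly acceptable arms: UCB above a (1 - alpha) fraction
            # of the best lower confidence bound.
            cand = np.flatnonzero(ucb >= (1 - alpha) * lcb.max())
            i = cand[np.argmin(np.asarray(costs)[cand])]
        r = rng.random() < mu[i]          # Bernoulli reward
        n[i] += 1
        s[i] += r
        pulls.append(i)
    return pulls
```

On an instance where the cheap arm is acceptable, the committed phase quickly settles on the cheap arm while the expensive arm is pulled only during exploration.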
5 Performance With Constraints on Costs and Rewards
In this section, we present some extensions of the previous results.
5.1 Consistent Cost and Quality
The lower bound result in Theorem 2 is driven by an extreme instance in which arms with very similar mean rewards have very different costs. This raises the following question: can better-performing algorithms be obtained if the rewards and costs are consistent with each other? We show that this is indeed the case. Motivated by the instance that led to the worst-case performance, we consider a constraint that upper bounds the difference in costs of every pair of arms by a multiple of the difference in the qualities of those arms. Under this constraint, CS-UCB has good performance, as per the following result (proof in Appendix C).
Theorem 4.
If, for an instance with $K$ arms, the costs satisfy $|c_\ell - c_j| \leq \gamma |\mu_\ell - \mu_j|$ for every pair of arms $(\ell, j)$ and some (possibly unknown) $\gamma > 0$, then $\mathrm{Cost\_Reg}(T, I) + \mathrm{Qual\_Reg}(T, I)$ of CS-UCB is $\tilde{O}((1+\gamma)\sqrt{KT})$.
Note that, in general, $\gamma$ can be unknown. Hence, even with the above assumption on the consistency of cost and quality, an algorithm cannot a priori bound the quality difference between arms merely by virtue of knowing their costs.
5.2 Unknown Costs
In some applications, it is possible that the costs of the arms are also unknown and random. Hence, in addition to the mean rewards, the mean costs also need to be estimated. Without loss of generality, we assume that the distribution of the random cost of each arm has support $[0,1]$. Not knowing the costs of the arms does not fundamentally change the regret minimization problem discussed in the previous sections. Clearly, the lower bound result is still valid. Algorithm 2 can be generalized to the unknown-costs setting with a minor modification in the UCB phase of the algorithm; the modified UCB phase is described in Algorithm 4 in Appendix D. In this algorithm, we maintain confidence bounds on the cost of each arm. Instead of picking the arm with the lowest cost among all feasible arms, the algorithm now picks the arm with the lowest lower confidence bound on cost. Theorem 3 holds for this modified algorithm as well.
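The modified arm-selection rule can be sketched as follows (a hypothetical helper; how the reward and cost confidence bounds are maintained is left abstract):

```python
import numpy as np

def pick_arm_unknown_costs(r_ucb, r_lcb, c_mean, c_rad, alpha):
    """Committed-phase choice when costs are also learned (sketch).

    r_ucb, r_lcb : upper/lower confidence bounds on mean rewards.
    c_mean, c_rad: empirical mean costs and cost confidence radii.
    Picks, among plausibly acceptable arms, the one with the lowest
    lower confidence bound on cost (optimism on the cost side).
    """
    feasible = r_ucb >= (1 - alpha) * r_lcb.max()
    cand = np.flatnonzero(feasible)
    c_lcb = c_mean - c_rad                 # optimistic (low) cost estimates
    return cand[np.argmin(c_lcb[cand])]
```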
6 Numerical Experiments
In the previous sections, we established theoretical results on the worst-case performance of different algorithms for problem (1). We now illustrate the empirical performance of these algorithms and shed light on which algorithm performs better in which regime of parameter values. The key quantity that differentiates the performance of the algorithms is how close the mean rewards of the arms are to each other. We consider a setting with two Bernoulli arms and vary the mean reward of one arm (the cheaper arm) while keeping the other quantities (the reward distribution of the other arm and the costs of both arms) fixed. The values of these parameters are given in Table 1. The reward in each round follows a Bernoulli distribution, whereas the cost is a known fixed value. The cost and quality regret at time $T$ of the different algorithms are plotted in Figure 2.

Table 1: Simulation parameters.

  Parameter             Value
  Mean reward of arm 1  0.5
  Mean reward of arm 2  0.3–0.6
  Cost of arm 1         1
  Cost of arm 2         0
  Subsidy factor        0.1
  Time horizon          5000
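Sweeping the mean reward of arm 2 as in Table 1 shows where the optimal arm switches from the expensive arm to the cheap one (a sketch; the helper `optimal_arm` is our own):

```python
import numpy as np

def optimal_arm(mu1, mu2, alpha):
    """1-based index of the cheapest acceptable arm for the Table 1 setup,
    where arm 1 has cost 1 and arm 2 has cost 0 (helper of our own)."""
    tol = (1 - alpha) * max(mu1, mu2)    # smallest tolerated reward
    return 2 if mu2 >= tol else 1

# Sweep the mean reward of arm 2 over the Table 1 range (0.3 to 0.6).
for mu2 in np.arange(0.30, 0.61, 0.05):
    print(f"mu2={mu2:.2f} -> optimal arm {optimal_arm(0.5, mu2, 0.1)}")
```

With $\mu_1 = 0.5$ and $\alpha = 0.1$, the cheap arm becomes optimal exactly when its mean reward reaches the smallest tolerated reward of 0.45.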
We observe that the performance of CS-TS and CS-UCB is close for the entire range of mean reward values. To compare these algorithms with CS-ETC, we focus on how close the mean reward of the cheaper arm, $\mu_2$, is to the smallest tolerated reward. With $\mu_1 = 0.5$ and $\alpha = 0.1$, the smallest tolerated reward is $(1-\alpha)\mu_1 = 0.45$. In terms of quality regret, when $\mu_2$ is much smaller than 0.45, CS-TS and CS-UCB perform much better than CS-ETC. This is because the number of exploration rounds in the CS-ETC algorithm is fixed (independent of the difference in the mean rewards of the two arms), leading to higher quality regret when $\mu_2$ is much smaller than 0.45. On the other hand, because of the large difference between $\mu_2$ and 0.45, the CS-TS and CS-UCB algorithms are easily able to find the optimal arm and incur low quality regret. The cost regret of all algorithms is 0 because the optimal arm is the expensive arm.
When $\mu_2$ is close to (and greater than) 0.45, CS-TS and CS-UCB incur much higher cost regret than CS-ETC. This is in line with the intuition established in Section 2. Here, CS-TS and CS-UCB are unable to conclude decisively that the second (cheaper) arm is optimal; thus, they end up pulling the first (expensive) arm many times, leading to high cost regret. On the other hand, CS-ETC, after the exploration rounds, is able to correctly identify the second arm as the optimal arm.
Thus, we recommend using the CS-TS/CS-UCB algorithms when the mean rewards of the arms are well differentiated, and CS-ETC when the mean rewards are close to one another (as is often the case in the SMS application). This is in line with the notion that algorithms that perform well in the worst case might not perform well in the average case.
6.1 Conclusion and Future Work
In this paper, we have proposed a new variant of the MAB problem that factors in the costs associated with playing an arm and introduces new metrics that uniquely capture the features of multiple real-world applications. We argue for the hardness of this problem by establishing fundamental limits on the performance of any online algorithm and by demonstrating that traditional MAB algorithms perform poorly from both a theoretical and an empirical standpoint. We present a simple near-optimal algorithm and, through numerical simulations, prescribe the ideal algorithmic choice for different problem regimes.
An important question that naturally arises from this work is developing an algorithm for the adversarial variant of the MAB with cost subsidy problem. In particular, it is not immediately clear whether the EXP3 (Auer et al. (2002b)) family of algorithms, which is popular for the non-stochastic MAB problem, can be generalized to the setting where the reward distribution is not stationary.
Acknowledgements
We would like to thank Nicolas Stier-Moses for introducing us to the SMS routing application and for helping us formulate it as an MAB problem.
References
 Abramowitz and Stegun (1948) M. Abramowitz and I. A. Stegun. Handbook of mathematical functions with formulas, graphs, and mathematical tables, volume 55. US Government printing office, 1948.
 Agrawal and Devanur (2014) S. Agrawal and N. R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation, EC ’14, page 989–1006, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325653.
 Agrawal and Goyal (2017a) S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. J. ACM, 64(5), 2017a.
 Agrawal and Goyal (2017b) S. Agrawal and N. Goyal. Near-optimal regret bounds for Thompson sampling. Journal of the ACM (JACM), 64(5):1–24, 2017b.
 Agrawal et al. (2016) S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. A near-optimal exploration-exploitation approach for assortment selection. Proceedings of the 2016 ACM Conference on Economics and Computation (EC), pages 599–600, 2016.
 Agrawal et al. (2017) S. Agrawal, V. Avadhanula, V. Goyal, and A. Zeevi. Thompson sampling for the mnlbandit. In Conference on Learning Theory, pages 76–78, 2017.
 Amani et al. (2020) S. Amani, M. Alizadeh, and C. Thrampoulidis. Generalized linear bandits with safety constraints. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3562–3566. IEEE, 2020.
 Amazon (2019) Amazon. Amazon autotargeting, 2019. URL https://tinyurl.com/yx9lyfwq.
 Anandkumar et al. (2011) A. Anandkumar, N. Michael, A. K. Tang, and A. Swami. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications, 29(4):731–745, 2011.
 Auer et al. (2002a) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002a.
 Auer et al. (2002b) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b.
 Badanidiyuru et al. (2018) A. Badanidiyuru, R. Kleinberg, and A. Slivkins. Bandits with knapsacks. J. ACM, 65(3):13:1–13:55, Mar. 2018. ISSN 0004-5411. doi: 10.1145/3164539. URL http://doi.acm.org/10.1145/3164539.
 Bubeck and Cesa-Bianchi (2012) S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Canlas et al. (2010) M. Canlas, K. P. Cruz, M. K. Dimarucut, P. Uyengco, G. Tangonan, M. L. Guico, N. Libatique, and C. Pineda. A quantitative analysis of the quality of service of short message service in the Philippines. In 2010 IEEE International Conference on Communication Systems, pages 710–714. IEEE, 2010.
 Cao et al. (2015) W. Cao, J. Li, Y. Tao, and Z. Li. On top-k selection in multi-armed bandits and hidden bipartite graphs. In Advances in Neural Information Processing Systems, pages 1036–1044, 2015.
 Cesa-Bianchi et al. (2014) N. Cesa-Bianchi, C. Gentile, and Y. Mansour. Regret minimization for reserve prices in second-price auctions. IEEE Transactions on Information Theory, 61(1):549–564, 2014.
 Chen et al. (2016) L. Chen, A. Gupta, and J. Li. Pure exploration of multi-armed bandit under matroid constraints. In Conference on Learning Theory, pages 647–669, 2016.
 Chen et al. (2014) S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pages 379–387, 2014.
 Daulton et al. (2019) S. Daulton, S. Singh, V. Avadhanula, D. Dimmery, and E. Bakshy. Thompson sampling for contextual bandit problems with auxiliary safety constraints. arXiv preprint arXiv:1911.00638, 2019.
 Drugan and Nowe (2013) M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.
 Facebook (2016) Facebook. Facebook targeting expansion, 2016. URL https://tinyurl.com/y3ss2j8g.
 Galichet et al. (2013) N. Galichet, M. Sebag, and O. Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
 Google (2014) Google. Google autotargeting, 2014. URL https://tinyurl.com/y3c4bdaj.
 Huber (2013) M. Huber. Nearly optimal bernoulli factories for linear functions. arXiv preprint arXiv:1308.1562, 2013.
 Immorlica et al. (2019) N. Immorlica, K. A. Sankararaman, R. Schapire, and A. Slivkins. Adversarial bandits with knapsacks. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 202–219, 2019.
 Jamieson and Nowak (2014) K. Jamieson and R. Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
 Katz-Samuels and Scott (2019) J. Katz-Samuels and C. Scott. Top feasible arm identification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1593–1601, 2019.
 Kazerouni et al. (2017) A. Kazerouni, M. Ghavamzadeh, Y. A. Yadkori, and B. Van Roy. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems, pages 3910–3919, 2017.
 Keane and O’Brien (1994) M. Keane and G. L. O’Brien. A bernoulli factory. ACM Transactions on Modeling and Computer Simulation (TOMACS), 4(2):213–219, 1994.
 Koningstein (2006) R. Koningstein. Suggesting and/or providing targeting information for advertisements, July 6 2006. US Patent App. 11/026,508.
 Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
 Li et al. (2011) L. Li, W. Chu, J. Langford, and X. Wang. Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 297–306, 2011.
 Li et al. (2015) L. Li, S. Chen, J. Kleban, and A. Gupta. Counterfactual estimation and optimization of click metrics in search engines: A case study. In Proceedings of the 24th International Conference on World Wide Web, pages 929–934, 2015.
 MarketWatch (2020) MarketWatch. Marketwatch a2p report, 2020. URL https://rb.gy/0w96oi.
 May et al. (2012) B. C. May, N. Korda, A. Lee, and D. S. Leslie. Optimistic bayesian sampling in contextual-bandit problems. Journal of Machine Learning Research, 13:2069–2106, 2012.
 Meng et al. (2007) X. Meng, P. Zerfos, V. Samanta, S. H. Wong, and S. Lu. Analysis of the reliability of a nationwide short message service. In IEEE INFOCOM 2007 - 26th IEEE International Conference on Computer Communications, pages 1811–1819. IEEE, 2007.
 Oliver and Li (2011) C. Oliver and L. Li. An empirical evaluation of thompson sampling. Advances in Neural Information Processing Systems (NIPS), 24:2249–2257, 2011.
 (38) O. Osunade and S. O. Nurudeen. Route optimization for delivery of short message service in telecommunication networks.
 Ovum (2017) Ovum. Sustaining a2p sms growth while securing mobile network, 2017. URL https://rb.gy/qqonzd.
 Paria et al. (2018) B. Paria, K. Kandasamy, and B. Póczos. A flexible framework for multi-objective bayesian optimization using random scalarizations. arXiv preprint arXiv:1805.12168, 2018.
 Robbins (1952) H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
 Sankararaman et al. (2019) A. Sankararaman, A. Ganesh, and S. Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
 Scott (2010) S. L. Scott. A modern bayesian look at the multiarmed bandit. Applied Stochastic Models in Business and Industry, 26(6):639–658, 2010.
 Slivkins (2019) A. Slivkins. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1–2):1–286, 2019. ISSN 1935-8237. URL http://dx.doi.org/10.1561/2200000068.
 Slivkins and Vaughan (2014) A. Slivkins and J. W. Vaughan. Online decision making in crowdsourcing markets: Theoretical challenges. ACM SIGecom Exchanges, 12(2):4–23, 2014.
 Thompson (1933) W. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Tran-Thanh et al. (2012) L. Tran-Thanh, A. Chapman, A. Rogers, and N. R. Jennings. Knapsack based optimal policies for budget-limited multi-armed bandits. In Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
 Twilio and Uber (2020) Twilio and Uber. Uber built a great ridesharing experience with sms & voice, 2020. URL https://customers.twilio.com/208/uber/.
 Wu et al. (2016) Y. Wu, R. Shariff, T. Lattimore, and C. Szepesvári. Conservative bandits. In International Conference on Machine Learning, pages 1254–1262, 2016.
 Yahyaa and Manderick (2015) S. Yahyaa and B. Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings, page 47. Presses universitaires de Louvain, 2015.
 Yahyaa et al. (2014) S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Annealing-Pareto multi-objective multi-armed bandit algorithm. In 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 1–8. IEEE, 2014.
 Zerfos et al. (2006) P. Zerfos, X. Meng, S. H. Wong, V. Samanta, and S. Lu. A study of the short message service of a nationwide cellular network. In Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 263–268, 2006.
Multi-armed Bandits with Cost Subsidy:
Supplementary Material
Outline
The supplementary material of the paper is organized as follows.

Appendix A contains technical lemmas used in subsequent proofs.

Appendix B contains a proof of the lower bound.

Appendix C contains proofs related to the performance of various algorithms presented in the paper.

Appendix D gives a detailed description of the CS-ETC algorithm when the costs of the arms are unknown and random.
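The exact CS-ETC algorithm is specified in the main text. Purely as an illustrative sketch of the explore-then-commit idea in the cost-subsidy setting, the following code commits to the cheapest arm whose empirical mean clears the subsidized reward threshold. The per-arm exploration budget `T0`, the subsidy factor `alpha`, and the arm/cost interface here are illustrative assumptions, not the paper's exact specification.

```python
def cs_etc_sketch(arms, costs, alpha, T0, T):
    """Illustrative explore-then-commit sketch for the cost-subsidy MAB setting.

    arms  : list of no-argument callables; arms[i]() draws a reward in [0, 1]
    costs : known cost of each arm
    alpha : subsidy factor -- tolerate rewards down to (1 - alpha) * best
    T0    : number of exploration pulls per arm
    T     : total horizon (assumed >= len(arms) * T0)
    """
    K = len(arms)
    pulls, sums, history = [0] * K, [0.0] * K, []
    # Exploration phase: pull every arm T0 times.
    for i in range(K):
        for _ in range(T0):
            r = arms[i]()
            pulls[i] += 1
            sums[i] += r
            history.append((i, r))
    # Commit phase: among arms whose empirical mean clears the subsidized
    # threshold, play the cheapest one for the remaining rounds.
    means = [sums[i] / pulls[i] for i in range(K)]
    threshold = (1 - alpha) * max(means)
    feasible = [i for i in range(K) if means[i] >= threshold]
    chosen = min(feasible, key=lambda i: costs[i])
    for _ in range(T - K * T0):
        history.append((chosen, arms[chosen]()))
    return chosen, history
```

With two deterministic arms of means 1.0 (cost 1.0) and 0.85 (cost 0.1) and alpha = 0.2, the cheap arm clears the threshold 0.8 and is the one committed to.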
Appendix A Technical Lemmas
Lemma 2 (Taylor’s Series Approximation).
For .
Proof.
For ,
∎
Lemma 3 (Taylor’s Series Approximation).
For .
Proof.
For ,
∎
Lemma 4 (Pinsker’s inequality).
Let $\mathrm{Ber}(\mu)$ denote a Bernoulli distribution with mean $\mu$, where $\mu \in (0,1)$. Then, $\mathrm{KL}(\mathrm{Ber}(\mu_1), \mathrm{Ber}(\mu_2)) \geq 2(\mu_1 - \mu_2)^2$, where $\mathrm{KL}(\cdot,\cdot)$ denotes the KL divergence, and the KL divergence between two Bernoulli distributions with means $\mu_1$ and $\mu_2$ is given as $\mathrm{KL}(\mathrm{Ber}(\mu_1), \mathrm{Ber}(\mu_2)) = \mu_1 \log\frac{\mu_1}{\mu_2} + (1-\mu_1)\log\frac{1-\mu_1}{1-\mu_2}$.
Proof.
Thus,
∎
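As a quick numerical sanity check of the Bernoulli form of Pinsker's inequality (an illustration only, not part of the proof), one can verify the bound $\mathrm{KL} \geq 2(\mu_1 - \mu_2)^2$ on a grid of mean pairs:

```python
import math

def kl_bernoulli(mu1, mu2):
    """KL divergence KL(Ber(mu1) || Ber(mu2)) for mu1, mu2 in (0, 1)."""
    return mu1 * math.log(mu1 / mu2) + (1 - mu1) * math.log((1 - mu1) / (1 - mu2))

# Check Pinsker's bound KL >= 2 * (mu1 - mu2)^2 on a grid of mean pairs.
grid = [i / 20 for i in range(1, 20)]
for mu1 in grid:
    for mu2 in grid:
        assert kl_bernoulli(mu1, mu2) >= 2 * (mu1 - mu2) ** 2 - 1e-12
```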
Appendix B Proof of Lower Bound
Proof of Lemma 1.
In the family of instances , the costs of the arms are the same across instances. Arm 0 is the cheapest arm in all the instances. With this, we define a modified notion of quality regret that penalizes the regret only when this cheap arm is pulled, as
(2) 
An equivalent notation for denoting the modified regret of policy on an instance of the problem is . This modified quality regret is at most equal to the quality regret. To prove the lemma, we will show the stronger result that there exists an instance such that is , which will imply the required result.
Let us first consider any deterministic policy (or algorithm) . For a deterministic algorithm, the number of times an arm is pulled is a function of the observed rewards. Let the number of times arm is played be denoted by , and let the total number of times any arm with cost 1 (i.e., an expensive arm) is played be . For any such that , we can use the proof of Lemma A.1 in Auer et al. (2002b), with function , to get
where
is the expectation operator with respect to the probability distribution defined by the random rewards in instance
. Thus, using Lemma 4, we get (3)
Now, let us look at the regret of the algorithm for each instance in the family . We have


.
Now, define randomized instance as the instance obtained by randomly choosing from the family of instances such that with probability and with probability for . The expected regret of this randomized instance is
Taking , we get is when .
Using Yao’s principle, for any randomized algorithm , there exists an instance with such that is . Also, since , we have is . ∎
Proof of Theorem 2.
Notation: For any instance , we define the arms and as and . When the instance is clear, we will use the simplified notation and instead of and .
Proof Sketch:
Lemma 1 establishes that when , for any given policy, there exists an instance on which the sum of the quality and cost regrets is . Now, we generalize the above result from to any for . The main idea in our reduction is to show that if there exists an algorithm for that achieves regret on every instance in the family , then we can use as a subroutine to construct an algorithm for problem (1) that achieves regret on every instance in the family , thus contradicting the lower bound of Lemma 1. This proves the theorem by contradiction. To construct the aforementioned subroutine, we leverage techniques from the Bernoulli factory literature to generate a sample from a Bernoulli random variable with parameter using samples from a Bernoulli random variable with parameter , for any .
Aside on Bernoulli Factory:
The key tool we use in constructing the algorithm from is the Bernoulli factory for the linear function. The Bernoulli factory for a specified scaling factor , i.e. , uses a sequence of independent and identically distributed samples from and returns a sample from . The key aspect of a Bernoulli factory is the number of samples needed from to generate a sample from . We use the Bernoulli factory described in Huber (2013), which has a guarantee on the expected number of samples from needed to generate a sample from . In particular, for a specified ,
(4) 
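Huber's factory handles the hard direction of the linear map (scaling a coin up). As a simpler illustration of the Bernoulli-factory idea (not the construction used in the paper), scaling a coin down by a known factor lam in (0, 1] admits an exact one-call factory: multiply the p-coin by an independent lam-coin. The `(sample, call-count)` return convention below is an assumption made to mirror a factory interface:

```python
import random

def downscale_factory(sample_p, lam, rng):
    """Exact Bernoulli factory for the map p -> lam * p, for lam in (0, 1].

    sample_p : no-argument callable returning a Ber(p) sample (p unknown)
    rng      : random.Random instance used for the auxiliary coin
    Returns (sample, n_calls): one Ber(lam * p) sample together with the
    number of calls made to sample_p (always 1 for this easy direction).
    """
    x = sample_p()                      # Ber(p)
    y = 1 if rng.random() < lam else 0  # Ber(lam), independent of x
    return x * y, 1                     # x * y is Ber(lam * p)

# Empirical check that the output coin has mean close to lam * p.
rng = random.Random(1)
p, lam, n = 0.8, 0.5, 200_000
hits = sum(downscale_factory(lambda: 1 if rng.random() < p else 0, lam, rng)[0]
           for _ in range(n))
```

The product trick is exact because the two coins are independent, so the output is 1 with probability exactly lam * p; scaling up by C > 1 has no such one-call construction, which is why Huber's more involved algorithm is needed there.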
Detailed proof:
For some value of (to be specified later in the proof) such that and , consider the family of instances and . Let be any algorithm for the family . Using , we construct an algorithm for the family . This algorithm is described in Algorithm 3. We will use to denote the arm pulled by algorithm at time after having observed rewards through arm pulls . The function returns two values: a random sample from the distribution and the number of samples of needed to generate this random sample.