1 Introduction
The classical stochastic multi-armed bandit (MAB) model provides an elegant abstraction of a number of important sequential decision-making problems. In this setting, the planner chooses (or pulls) a single arm in each discrete time instant from a fixed pool of finitely many arms, for a finite number of time instants. Typically, it is assumed that the number of arms is much smaller than the number of time instants. Each arm, when pulled, generates a reward from a fixed but a priori unknown stochastic distribution corresponding to the pulled arm. The arms that are not pulled do not generate any reward for the planner. The planner’s goal is to minimize the regret, i.e., the loss incurred in expected cumulative reward due to not knowing the reward distributions of the arms beforehand. The MAB problem encapsulates the classical exploration versus exploitation dilemma, in that the planner’s algorithm has to arrive at an optimal trade-off between exploration (pulling relatively unexplored arms) and exploitation (pulling the best arms according to the history of pulls thus far). This problem has been extensively studied in the literature. These studies include analysis of the lower bound on regret [lai85], design and analysis of asymptotically optimal algorithms [auer2010ucb; agrawal2012analysis; thompson1933likelihood; auer2002finite], empirical studies [chapelle2011empirical; devanand2017empirical; russo2018tutorial], and several extensions [slivkins2019introduction; bubeck2012regret]. We provide a detailed review of the relevant literature in Section 7.
The theoretical results in MAB are complemented by a wide variety of modern applications which can be seamlessly modelled in the MAB setup. Internet advertising [babaioff2009characterizing; nuara2018combinatorial], crowdsourcing [JAIN2018], clinical trials [villar2015multi], and wireless communication [maghsudi2014joint] represent a few of the many applications. Owing to its wide range of applications and elegant theoretical foundation, many variants of the MAB problem have been proposed. In this paper, we propose a novel variant which we call Ballooning Multi-Armed Bandits (BLMAB). In contrast to the classical MAB where the set of available arms is fixed throughout the run of an algorithm, the set of arms in BLMAB grows (or balloons) over time. As the number of arms increases (potentially linearly) with time, it is clear that an optimal algorithm has to ignore (or drop) some arms. Hence, in addition to achieving an optimal trade-off between the number of exploratory pulls and exploitation pulls, the algorithm must also ensure that it does not drop too many (or too few) arms.
To see that the traditional algorithms are not regret-optimal in the BLMAB setting, consider the following thought experiment. Let a new arm arrive at each time instant in decreasing order of mean reward, and let the MAB algorithm run for a total of $T$ time instants. The traditional MAB algorithms (such as UCB1 and Moss) would pull the newly arrived arm at each time and thus would incur regret that grows linearly in $T$. Note that the best arm appeared at the first time instant itself; however, as the set of arms expands monotonically over time, the algorithm could not sufficiently explore every arm. Observe that the regret in BLMAB depends not only on the mean reward of the arms, but also on when they arrive. Hence, any BLMAB algorithm ought to be aware of the arrival of the arms.
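The thought experiment above can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the specific qualities (the arm arriving in round j has mean 1 - j/(T+1)) are assumptions made only for illustration. Rewards are taken to be noise-free, which does not affect the outcome here, because an unpulled arm always has an infinite UCB1 index and a fresh arm arrives in every round.

```python
import math


def ucb1_choice(counts, sums, t):
    """Pick the UCB1 arm among available arms; an unpulled arm has an infinite index."""
    best_arm, best_index = 0, -math.inf
    for arm, n in enumerate(counts):
        if n == 0:
            index = math.inf
        else:
            index = sums[arm] / n + math.sqrt(2.0 * math.log(t) / n)
        if index > best_index:
            best_arm, best_index = arm, index
    return best_arm


def ballooning_ucb1_regret(horizon):
    """Regret of UCB1 when one arm arrives per round in decreasing order of mean."""
    # Hypothetical qualities: the arm arriving in round j has mean 1 - j / (horizon + 1).
    means = [1.0 - j / (horizon + 1) for j in range(1, horizon + 1)]
    counts, sums, regret = [], [], 0.0
    for t in range(1, horizon + 1):
        counts.append(0)                 # a new arm arrives in round t
        sums.append(0.0)
        arm = ucb1_choice(counts, sums, t)
        counts[arm] += 1
        sums[arm] += means[arm]          # noise-free reward (simplifying assumption)
        regret += means[0] - means[arm]  # the first arm is always the best available
    return regret


if __name__ == "__main__":
    for horizon in (500, 1000, 2000):
        print(horizon, round(ballooning_ucb1_regret(horizon), 1))
```

Under these assumptions the regret equals T(T-1)/(2(T+1)), roughly T/2, so doubling the horizon roughly doubles the regret.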
1.1 Motivation
We motivate the practical significance of BLMAB with a few applications. In general, BLMAB is directly applicable in any scenario where the set of options grows over time and the objective is to choose the best option available at any given time.
A contemporary example is provided by question and answer (Q&A) platforms such as Reddit, Stack Overflow, Quora, Yahoo! Answers, and ResearchGate, where the platform’s goal is to discover the highest quality answer that should be displayed in the most prominent slot, for a given question. Each answer post is modeled as a distinct arm of a BLMAB instance, and the rewards are distributed according to a Bernoulli distribution parameterized by the quality of the posted answer. Note that the quality parameter is a priori unknown to the platform and hence needs to be learnt. For this, the platform employs certain endorsement mechanisms with indicators such as upvotes, likes, and shares (or reposts). A user endorses the answer that is displayed to her, if she likes it. Each display of a posted answer corresponds to a pull of the corresponding arm. At each time instant, a new user observes the existing answer posts shown by the platform, decides whether to endorse them, and may also choose to post her own answer, thus increasing the number of available arms. Hence, the number of available arms (answers) monotonically increases over time.
The problem of learning qualities of the answers on Q&A forums has been modeled under the MAB framework in various studies [ghosh2013learning; tang2019bandit; LIU18]. However, these studies resort to existing MAB variations which are not well suited for Q&A forums. For instance, Ghosh and Hummel [ghosh2013learning] model the problem with a classical MAB framework by limiting the number of arms via strategic choice of an agent, by assuming that a user incurs a certain cost for posting an answer and hence posts it only if she derives a positive utility by doing so. However, a user’s behavior on the platform may be driven by simple cognitive heuristics rather than a well calibrated strategic decision [burghardt2016myopia]. In another work [LIU18], the number of arms is limited by randomly dropping some of the arms from consideration. The regret is then computed with respect to only the considered arms. That is, they do not account for the regret incurred due to the randomly dropped arms.
Other applications of the BLMAB framework arise in websites that feature user reviews, such as Amazon and Flipkart (product reviews), Tripadvisor (hotel reviews), and IMDB (movie reviews). As time progresses, the reviews for a product (or a hotel or a movie) keep arriving, and the website aims to display the most useful reviews for that product (or hotel or movie) at the top. The usefulness of a review is estimated using users’ endorsements for that review, similar to that in Q&A forums. BLMAB is also applicable in scenarios where users comment on a video or news article on a video or news hosting website, where the website’s objective is to display the most popular or interesting comment at the top. The BLMAB setting thus provides a natural framework for such applications.
BLMAB needs an independent investigation for a number of reasons. For instance, one of the MAB variants that holds some similarity with BLMAB is the sleeping multi-armed bandit (SMAB) [kleinberg2010; chatterjee2017analysis], where a subset of a fixed set of base arms is available at each time instant. Though the SMAB framework captures the availability of a small subset of arms at each time, it assumes that the set of base arms is fixed and is small as compared to the time horizon. In contrast, the BLMAB framework allows for the number of available arms to increase, potentially linearly with time. Hence, an optimal sleeping bandits algorithm such as AUER [kleinberg2010] would end up incurring linear regret in the BLMAB setting.
Another MAB variant which is somewhat similar is the many-armed (potentially infinite) bandit [wang2009algorithms; carpentier2015simple; berry1997bandit], where the number of arms could be potentially equal to or greater than the time horizon. Berry et al. [berry1997bandit] consider the case of an infinite-arm bandit with Bernoulli reward distribution. However, they consider that the optimal arm has a quality of , which is seldom the case in practical applications. Other investigations considering infinitely many arms [wang2009algorithms; carpentier2015simple] make certain assumptions on the distribution of the near-optimal arms to achieve sublinear regret. Further, all the above works consider that all the arms are available in all time instants, and hence use the traditional notion of regret. In our case, the regret incurred by an algorithm in a given time instant is the difference between the quality of the best available arm during that time and the quality of the arm pulled by the algorithm (same as the notion of regret considered in sleeping bandits). The BLMAB framework is thus an interesting blend of both the sleeping bandit model and the many-armed bandit model.
1.2 Our Contributions


We introduce the BLMAB model that allows the set of arms to grow over time.

We show for the BLMAB model that the instance-independent regret grows linearly with time in the absence of any distributional assumption on the arrival time of the highest quality arm. Specifically, we show that if the best arm is equally likely to arrive at any time, sublinear expected regret cannot be achieved (Theorem 1).

We propose an algorithm (BLMoss) which determines (1) the fraction of the time horizon until which the newly arriving arms should be explored at least once and (2) the sequence of arm pulls during the exploitation phase. Our key finding is that BLMoss achieves sublinear regret under practical and minimal assumptions on the arrival distribution of the best arm, namely, a subexponential tail (Theorem 3) and a subPareto tail (Theorem 4). Note that we make no assumption on the arrival of the other arms. As the regret depends on the qualities of the arms and the sequence of their arrivals, it is interesting that with subexponential and subPareto assumptions on only the best arm’s arrival pattern, we can achieve sublinear regret.

We carry out a pertinent simulation study to empirically observe how the expected regret varies with the time horizon. We find a strong validation for our theoretically derived regret bounds.
The paper is organized as follows. In Section 2, we present the BLMAB model. In Section 3, we first show that if the best arm arrives uniformly at random, one cannot achieve sublinear regret. We then define two distributional assumptions on the arrival time of the best arm which enable us to achieve sublinear regret. Next, we present some preliminaries in Section 4, followed by our proposed algorithm and its theoretical analysis in Section 5. Section 6 presents our simulation results. We conclude the paper with related work (Section 7) and future directions (Section 8).
2 The Model
A classical MAB instance is given by the tuple $\langle K, (\mathcal{D}_i)_{i \in K} \rangle$. Here, $K$ is a fixed set of arms and $\mathcal{D}_i$ is the reward distribution corresponding to an arm $i \in K$. Further, it is assumed that $(\mathcal{D}_i)_{i \in K}$ are mutually independent distributions. Denote by $\mu_i$ the mean of distribution $\mathcal{D}_i$. Consider that each of the distributions is supported over a finite interval and is unknown to the algorithm. Throughout the paper, without loss of generality, we consider that $\mathcal{D}_i$ is supported over $[0, 1]$. Further, we will refer to $\mu_i$ as the quality of the arm $i$. A MAB algorithm is run in discrete time instants, and the total number of time instants is denoted by the time horizon $T$. In each time instant (also called a round), the algorithm selects a single arm and observes the reward corresponding to the selected arm. The arms which are not selected do not give any reward. More precisely, a MAB algorithm is a mapping from the history of arm pulls and obtained rewards to the set of arms.
At each time instant, a BLMAB algorithm chooses a single arm from the set of available arms and receives a reward generated randomly according to the reward distribution of the chosen arm. New arms may spring up at each time instant. Throughout the paper, we consider that at most one new arm arrives at each time, and the arms are never dropped. Let $K(t)$ denote the set of arms available at round $t$. In the BLMAB model, this set of available arms grows by at most one arm per round, i.e., $K(t) \subseteq K(t+1)$ and $|K(t+1)| \le |K(t)| + 1$. A BLMAB instance, therefore, is given by the sequence of available arm sets $(K(t))_{t=1}^{T}$ together with the reward distributions of the arms.
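Similar to the notion of regret in the sleeping stochastic MAB model, we introduce a notion of regret in the BLMAB setting that takes into account the availability of the arms at each time $t$. Let $K(t)$ be the set of arms available at time $t$ and $\mu_i$ the quality of arm $i$. Let $i_t$ denote the arm pulled by the algorithm and $i_t^\star$ be the best available arm at time $t$, i.e., $i_t^\star \in \arg\max_{i \in K(t)} \mu_i$. Further, let $\mathcal{I}$ denote a BLMAB instance and $\mathcal{A}$ be a BLMAB algorithm. The instance-dependent regret of $\mathcal{A}$ is given by
$$\mathcal{R}_{\mathcal{A}}(\mathcal{I}, T) = \mathbb{E}\left[\sum_{t=1}^{T} \left(\mu_{i_t^\star} - \mu_{i_t}\right)\right].$$
Throughout the paper, we consider instance-independent regret given as $\mathcal{R}_{\mathcal{A}}(T) = \sup_{\mathcal{I}} \mathcal{R}_{\mathcal{A}}(\mathcal{I}, T)$. Note that the instance-independent regret bound is a worst-case regret bound over all the arrival sequences of the arms and all possible reward distributions. In the next section, we show, for the BLMAB setting, that it is not possible to achieve a sublinear instance-independent regret bound.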
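To make the availability-aware notion of regret concrete, the following minimal sketch computes the cumulative regret of a given sequence of pulls against the best arm available in each round. The function name and the input encoding (a list of qualities, a list of arrival rounds, and a list of pulled arm indices) are assumptions made for illustration only.

```python
def ballooning_regret(qualities, arrival_round, pulls):
    """Cumulative regret against the best arm available in each round.

    qualities[i]     : quality (mean reward) of arm i
    arrival_round[i] : round (1-based) in which arm i becomes available
    pulls[t - 1]     : index of the arm pulled in round t
    """
    regret = 0.0
    for t, pulled in enumerate(pulls, start=1):
        available = [i for i, a in enumerate(arrival_round) if a <= t]
        if pulled not in available:
            raise ValueError(f"arm {pulled} is not available in round {t}")
        best_quality = max(qualities[i] for i in available)
        regret += best_quality - qualities[pulled]
    return regret


# Arm 1 (quality 0.9) arrives in round 2 but is first pulled in round 3,
# so only round 2 contributes regret (0.9 - 0.5).
print(ballooning_regret([0.5, 0.9, 0.2], [1, 2, 3], [0, 0, 1]))
```

Note that an arm that has not yet arrived never contributes to the regret, which is what distinguishes this notion from the classical one.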
3 Lower Bound on Regret
As pointed out in Section 1, UCB-style algorithms (which pull arms based on uncertainty) would pull each incoming arm at least once, leaving no rounds for exploitation. Hence, they incur linear regret in the ballooning bandit setup when a new arm arrives in every round. However, it is not obvious whether a different, more sophisticated algorithm (such as one which randomly drops some arms) could achieve sublinear regret. Our first result (Theorem 1) shows that no algorithm can attain sublinear regret under a general BLMAB setting.
Consider the following BLMAB instance . Let there be a unique best arm with quality and all other arms have quality . A new arm arrives at each time, i.e., . Further, let the arrival of the best arm be uniformly distributed over time, i.e., for all . Let denote the optimal arm till time . Further, let be the set of arms pulled by the algorithm, i.e., . We will show that for any fixed , the expected regret is lower bounded by . We first observe that an algorithm that pulls number of arms achieves the minimum regret when it pulls the first arms.
Claim 1.
The minimum regret is achieved when algorithm considers the first arms.
Theorem 1.
For the BLMAB instance , the expected regret of any algorithm is lower bounded by .
Proof.
The expected regret of a BLMAB algorithm is given by
In the above expression, the first term represents the internal regret of the learning algorithm and the second quantity is the external regret. Here, if and otherwise. From Claim 1, we have that an algorithm will incur least regret if it pulls first arms. Further, from the classical result in [lai85], in order to separate the quality of the arms, we should have for some positive problem dependent constant , for all and . Hence, we have
Note that the above expression is quadratic in . For , the minimum occurs when the value of is the least (in the positive domain), which is 1. In this case we have
For the case where , the minimum occurs when . Hence,
∎
Theorem 1 provides a strong impossibility result on the achievable instance-independent regret bound under the BLMAB setting. However, one can still achieve sublinear regret by imposing appropriate structure on the BLMAB instances. Observe that the regret depends on the arrival of arms and their reward distributions. We impose restrictions on the arrival of the best arm so that the probability that it arrives early is large enough; this would allow a learning algorithm to explore the best arm enough to estimate the true quality of that arm with high probability. As noted previously, the other arms may arrive arbitrarily. Further note that we make no assumption on the qualities of individual arms.
3.1 Arrival of the Best Arm
Let $X$ be the random variable denoting the time at which the best arm arrives. Further, let $F$ denote the cumulative distribution function of $X$. In our first result, we use the following subexponential tail assumption on the arrival time of the best arm.
Subexponential tail.
There exists a constant such that the probability of the best arm arriving later than rounds, is upper bounded by , i.e., .
Next, we consider a relaxed condition on the tail probabilities, i.e., when the tail does not shrink as fast as in the subexponential case. We consider the family of distributions whose tail is thicker than that of subexponential arrival distribution.
SubPareto tail.
There exists a constant such that the probability of the best arm arriving later than rounds, is upper bounded by , i.e., .
The aforementioned assumptions naturally arise in the context of Q&A forums as observed in extensive empirical studies on the nature of answering as well as voting behavior of the users. Anderson et al. [anderson2012discovering] observe that high reputation users hasten to post their answers early. One possible explanation for this phenomenon could be that the users who are motivated by the visibility that their answers receive, tend to be more active on the platform and also provide high quality answers early on, which explains their reputation score. Similarly, the most useful product reviews are likely to arrive early after the product release. Thus, it is reasonable to assume that the best arm arrives, with high probability, in early rounds.
Note that the uniform distribution is the limiting case of the subexponential case, when . We show that, while the uniform distribution results in linear regret (Theorem 1), a sublinear regret can be achieved for BLMAB instances having the best arm arrival distribution with even slightly thinner tail than that of uniform distribution.
4 Preliminaries
We now present some essentials which will be useful for our analysis in the remainder of the paper.
4.1 Lambert Function
Definition 1.
For any , the Lambert function, , is defined as the solution to the equation , i.e., .
Lambert function satisfies the following properties [hoorfar2008inequalities] (proofs are provided in Appendix for completeness):
Property 1.
The Lambert function can be equivalently written as the inverse of the function , i.e., .
Property 2.
For any , the Lambert function is unique, nonnegative, and strictly increasing.
Property 3.
For any , we have .
The Lambert function is easy to approximate numerically for a given argument, using the Newton-Raphson method or Halley’s method. Moreover, there exist efficient numerical methods for evaluating it to arbitrary precision [Corless1996].
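As an illustration of the numerical approximation mentioned above, here is a minimal Newton-Raphson sketch for the principal branch of the Lambert function on nonnegative arguments, i.e., the unique w >= 0 with w * exp(w) = x. The function name, the starting point log(1 + x), and the tolerance are assumptions chosen for this sketch; it is not the arbitrary-precision method of [Corless1996].

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Approximate W(x) for x >= 0, where W(x) * exp(W(x)) = x."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    if x == 0:
        return 0.0
    # log(1 + x) is never below the root, so Newton's iterates decrease monotonically
    # (w * exp(w) - x is convex and increasing for w >= 0).
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))  # f(w) / f'(w) with f(w) = w * exp(w) - x
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


w = lambert_w(10.0)
print(w, w * math.exp(w))  # the second value should be close to 10
```

Halley’s method would use the second derivative as well and typically needs fewer iterations; the Newton variant is shown only because it is shorter.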
4.2 The Moss Algorithm
We use Moss (Minimax Optimal Strategy in the Stochastic case) [audibert2010regret] as a black-box learning algorithm. For a fixed set of arms $K$ and time horizon $T$, the Moss algorithm pulls an arm $i_t$ at time $t$ such that
$$i_t \in \arg\max_{i \in K} \; \hat{\mu}_{i, n_i(t)} + \sqrt{\frac{\max\left(\log\left(\frac{T}{|K|\, n_i(t)}\right), 0\right)}{n_i(t)}}.$$
Here, $K$ denotes the set of arms, $n_i(t)$ is the number of times arm $i$ was pulled before (and excluding) round $t$, and $\hat{\mu}_{i, n_i(t)}$ is the empirical estimate of the quality of arm $i$ from $n_i(t)$ samples. Each arm is pulled once in the beginning, and ties are broken arbitrarily. The following result gives an upper bound on the expected regret of Moss which is optimal up to a constant factor (it matches the lower bound on regret given by [auer1995gambling]). Throughout the paper, we use the notation Moss($K$) to denote that the Moss algorithm is run with the set of arms $K$.
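For readers who prefer code, the following is a small sketch of the Moss selection rule as given in [audibert2010regret]: the index of an arm is its empirical mean plus the exploration bonus sqrt(max(log(T / (K n)), 0) / n), where n is the number of pulls of that arm so far. The function names, the Bernoulli rewards, and the tie-breaking by lowest index are assumptions of this sketch.

```python
import math
import random


def moss_index(mean_estimate, pulls, horizon, num_arms):
    """Moss index: empirical mean plus sqrt(max(log(T / (K * n)), 0) / n)."""
    bonus = math.sqrt(max(math.log(horizon / (num_arms * pulls)), 0.0) / pulls)
    return mean_estimate + bonus


def run_moss(qualities, horizon, rng):
    """Run Moss on a fixed set of Bernoulli arms; returns (regret, pull counts)."""
    num_arms = len(qualities)
    if horizon < num_arms:
        raise ValueError("the horizon must be at least the number of arms")
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    best, regret = max(qualities), 0.0
    for t in range(horizon):
        if t < num_arms:
            arm = t  # pull every arm once in the beginning
        else:
            arm = max(
                range(num_arms),
                key=lambda i: moss_index(sums[i] / counts[i], counts[i], horizon, num_arms),
            )
        reward = 1.0 if rng.random() < qualities[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        regret += best - qualities[arm]
    return regret, counts


regret, counts = run_moss([0.9, 0.1], 500, random.Random(0))
print(regret, counts)
```

Once an arm has been pulled more than T / K times its bonus vanishes, so Moss then relies on the empirical mean alone for that arm.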
Theorem 2.
[audibert2010regret] For any time horizon , the expected regret of Moss is given by .
5 The BLMoss Algorithm and Regret Analysis
5.1 The BLMoss Algorithm
We now present our algorithm, BLMoss (Algorithm 1), that uses Moss as a blackbox. The number of arms explored by BLMoss is dependent on the distribution of arrival of the best arm. In particular, BLMoss considers only the first arms in its execution (). Later in this section, we show how to derive the value of for distributions with subexponential and subPareto tails. Observe that the proposed BLMoss is a simple extension of Moss and this algorithm is practically easy to implement.
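The description above can be sketched in code as follows. This is a hedged illustration rather than a transcription of Algorithm 1: the number of considered arms is passed in as a hypothetical parameter `k` (its derivation from the arrival distribution is not reproduced here), and the sketch assumes that arms are indexed in order of arrival, that the first arm arrives in round 1, that rewards are Bernoulli, and that a considered arm is pulled once as soon as it arrives.

```python
import math
import random


def moss_index(mean_estimate, pulls, horizon, num_arms):
    """Moss index: empirical mean plus sqrt(max(log(T / (K * n)), 0) / n)."""
    bonus = math.sqrt(max(math.log(horizon / (num_arms * pulls)), 0.0) / pulls)
    return mean_estimate + bonus


def bl_moss(qualities, arrival_round, horizon, k, rng):
    """Run Moss on the first k arrived arms only; returns (pulls, regret).

    The regret is measured against the best arm available in each round,
    including arms that the policy ignores.
    """
    counts = [0] * len(qualities)
    sums = [0.0] * len(qualities)
    pulls, regret = [], 0.0
    for t in range(1, horizon + 1):
        available = [i for i, a in enumerate(arrival_round) if a <= t]
        considered = [i for i in available if i < k]  # ignore arms after the first k
        unpulled = [i for i in considered if counts[i] == 0]
        if unpulled:
            arm = unpulled[0]  # explore a newly arrived considered arm once
        else:
            arm = max(
                considered,
                key=lambda i: moss_index(sums[i] / counts[i], counts[i], horizon, k),
            )
        reward = 1.0 if rng.random() < qualities[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
        regret += max(qualities[i] for i in available) - qualities[arm]
    return pulls, regret


# With k = 1 the better arm arriving in round 3 is ignored (external regret only).
print(bl_moss([0.5, 0.9], [1, 3], 4, 1, random.Random(0)))
```

In this sketch, regret accumulated while the best available arm is among the first `k` corresponds to the learning cost of Moss, while regret accumulated after a better arm arrives outside the first `k` is the price of ignoring it.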
5.2 Regret Analysis of BLMoss
For a given BLMAB instance , let and . Clearly, we have that . As stated earlier, the regret of the algorithm can be decomposed into internal regret, i.e., the regret incurred by the learning algorithm considering only arms and external regret, i.e., the regret incurred by BLMoss due to the fact that BLMoss might have ignored the best arm. Write and let be the time of arrival of arm . Further, let denote the best arm till time . The instancedependent regret is given as
The first and the second terms respectively denote the internal regret and the external regret of BLMoss. We ignore the ceiling in throughout this section to avoid notation clutter.
Note that for all . This is true for any time horizon . From Theorem 2, we have the following observation about the internal regret of BLMoss.
Observation 1.
For the value of computed by BLMoss , we have .
In order to bound the overall regret, we begin with the following lemma which explicitly shows the relation between the expected regret of the algorithm and . Recall that the random variable denotes the time of arrival of the best arm.
Lemma 1.
The upper bound on the expected regret for any BLMAB instance is given by , with BLMoss exploring only the first arrived arms.
Proof.
For a given BLMAB instance , let denote the time at which arm becomes available for the first time. Let denote the best arm till rounds, i.e., . Further, let be the best arm among the arms considered by BLMoss, i.e., . Notice that . This implies .
()  
(From Observation 1 and since )  
( )  
Note that the above inequality holds for any BLMAB instance and hence we have ∎
Next, we show that under the subexponential tail property on , BLMoss achieves sublinear regret. We begin with the following lemma that lower bounds the probability of the arrival of the best quality arm in the initial rounds.
Lemma 2.
Let the arm arrival distribution of the best arm satisfy subexponential tail property for some . Then for any and , we have that .
Proof.
Note that . Hence, by Property P1 we have
By Property P2 we have . Equivalently, . So, we have . The last inequality follows from the subexponential tail property. ∎
Theorem 3.
Let the arrival distribution of the best arm satisfy the subexponential tail property for some , and let be large enough such that for some . Then with , the upper bound on the expected regret of BLMoss, , is . The upper bound on the expected regret is minimized when and is given by .
Proof.
Note that for achieving sublinear regret, it is necessary that is strictly positive, for which it is necessary that . From Lemma 2, we also have . Since such a feasible may not exist for small values of , we consider that is large enough. It can be easily shown that (see Claim 2 in Appendix).
Thus, for , we have: . Recall that by definition, we have . Thus when , the term dominates the other terms in the regret expression, whereas when , the term dominates. We analyze these cases separately.
Case 1 (): In this case, the regret is given by . Note that the regret is minimized for the lowest feasible value of , i.e., , resulting in . The last equality follows from the equivalent definition of Lambert W function (Property P1).
Case 2 (): In this case, the regret is given by . Again, the regret is minimized when . The regret in this case is given by .
We now prove the sublinear regret of BLMoss under the subPareto tail property.
Lemma 3.
Let the arm arrival distribution of the best arm satisfy subPareto tail property for some . Then for any and , we have that .
Proof.
First note that . This implies that . Further, from the subPareto tail property, we have that . ∎
Theorem 4.
Let the arrival distribution of the best arm satisfy the subPareto tail property for some , and let be large enough such that for some . Then with , the upper bound on the expected regret of BLMoss, , is . The upper bound on the expected regret is minimized when and is given by .
Proof.
From Lemmas 1 and 3, we have . For achieving sublinear regret, it is necessary that is strictly positive. So, we should have . Further, from Lemma 3, we have . So, for a feasible to exist, it is necessary that , i.e., is large enough. Thus, for , we have . As earlier, we analyze two cases.
Case 1 (): In this case, the regret is given by . The minimum regret is obtained when and is given by .
Case 2 (): In this case, the regret is given by . Again, the regret is minimum when and is given by .
Furthermore, it is easy to see that in Case 1, for any . Similarly, in Case 2, for any . This shows that the minimum regret is achieved when . ∎
5.3 Important Observations
We conclude the section with some key observations.


In the subexponential tail case, as , we have . This implies that the upper bound on the expected regret goes to . Note that in this case, . Since BLMoss considers a single arm, the internal regret is zero. Further, we have , i.e., the first arm is optimal with probability approaching 1, the external regret is also zero. As , the tail bounds are trivial and are satisfied by uniform distribution. From Theorem 1, we have that the regret in this case cannot be sublinear.

In the subPareto tail case, following the similar argument as in the subexponential tail case, we have that as , the regret . On the other hand, as , the regret goes to . The larger value of implies that the probability that the optimal arm arrives by is close to 1; then we have that the regret of BLMoss is asymptotically optimal. Further, it asymptotically achieves the information theoretic lower bound (which is ).

One could also consider UCB1 instead of Moss. While UCB1 is an anytime algorithm, Moss needs the time horizon as an input. However, the important distinction between the two algorithms is that the instance-independent regret of UCB1, which is , is greater than that of Moss; hence we use Moss in BLMoss. One can similarly use UCB1 to get an anytime version of BLMoss with a slightly weaker (up to ) regret guarantee.
6 Simulation Study
Figure 1: (a) Expected regret as a function of time horizon for various subexponential tails. (b) Expected regret as a function of time horizon for various subPareto tails. (c) Empirical exponents vs. theoretical bounds over the simulated time horizons.
So far, we focused on deriving upper bounds on regret for distributions (on the arrival time of the best arm) having subexponential and subPareto tail with different values of and , respectively. In particular, for the case of subPareto tail, we deduced that the extent of sublinearity of the regret (the exponent of in the order of the regret) depends on the value of . On the other hand, the upper bound on regret for the case of subexponential tail had the same order with respect to for any reasonable value of , albeit with different multiplicative and additive terms for different values of . In this section, we aim to illustrate how the expected regret varies with the time horizon , and how the empirical exponents compare with their theoretical bounds for different values of and , for time horizons up to rounds.
6.1 Simulation Setup
Note that in a traditional MAB setup, a simulation for a larger time horizon could be conducted as an extension of a simulation for a smaller time horizon . In other words, after obtaining the results for time horizon , the results for time horizon can be obtained by running simulations for an additional rounds. However, in the BLMAB setup where new arms continue arriving with time and the desired time horizon is known, we have seen that the optimal value of and hence depend on the time horizon. Owing to different values of for different time horizons , the simulation for a time horizon are not extendable to time horizon . So even if we have simulation results for time horizon , it is necessary to run a fresh set of simulations for obtaining results for time horizon . In our simulation study, we consider the following values of time horizon: .
We consider that a new arm arrives in each round, and the probability of an arm arriving at time being the best arm is determined by the distribution function . Thereafter, this best arm () is assigned a quality () between 0 and 1 uniformly at random, and the rest of the arms are assigned quality parameters between 0 and uniformly at random. Given a time horizon , the value of and hence are obtained based on our theoretical analysis. The arm to be pulled in a round is determined by Algorithm 1, wherein the pulled arm generates unit reward with probability equal to its quality, and no reward otherwise (i.e., as per a Bernoulli distribution). The regret in each round is computed as the difference between the quality of the best arm available in that round and the quality of the pulled arm. The overall regret is the sum of the regrets over all rounds from till . Note that we are concerned with the regret irrespective of the numerical values of the arms’ qualities. So, for a given instance of the arrival of the best arm, we consider the worst-case regret over 50 subinstances, where the quality parameters assigned to the arms in different subinstances are independent of each other. Also, since different instances would have the best arm arriving in different rounds, the expected regret is obtained by simulating over 1000 such random instances and averaging over the corresponding worst-case regret values.
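The sampling scheme described above (1000 instances of the best arm’s arrival, each evaluated by its worst case over 50 quality assignments) can be sketched as below. Several details are assumptions of this sketch and are not taken from the paper: the geometric sampler is only one example of an arrival distribution with an exponentially decaying tail, the arrival round is truncated to the horizon, the non-best arms draw their qualities uniformly between 0 and the best arm’s quality, and `regret_of` is a hypothetical callback that runs the bandit algorithm on one assignment and returns its regret.

```python
import math
import random


def geometric_arrival(p):
    """Sampler for a geometric arrival round on {1, 2, ...}: P(X > n) = (1 - p)**n."""
    def sample(rng):
        u = 1.0 - rng.random()  # u lies in (0, 1]
        return 1 + int(math.log(u) / math.log(1.0 - p))
    return sample


def assign_qualities(horizon, best_round, rng):
    """One arm per round; the arm arriving in best_round gets the highest quality."""
    best_quality = rng.random()
    # Assumption: other arms are uniform on [0, best_quality].
    qualities = [rng.uniform(0.0, best_quality) for _ in range(horizon)]
    qualities[best_round - 1] = best_quality
    return qualities


def expected_worst_case_regret(horizon, sample_best_arrival, regret_of, rng,
                               instances=1000, subinstances=50):
    """Average, over arrival instances, of the worst regret over quality subinstances."""
    total = 0.0
    for _ in range(instances):
        # Assumption: arrival rounds beyond the horizon are truncated to the horizon.
        best_round = min(max(1, sample_best_arrival(rng)), horizon)
        worst = max(
            regret_of(assign_qualities(horizon, best_round, rng), rng)
            for _ in range(subinstances)
        )
        total += worst
    return total / instances
```

A call such as `expected_worst_case_regret(T, geometric_arrival(0.1), regret_of, random.Random(0))` would then produce one point of a regret-versus-horizon curve; as discussed in the text, each horizon needs a fresh run.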
Our primary objective is to observe how the expected regret varies with the time horizon . In order to observe the influence of various subexponential and subPareto tail distributions over the arrival time of the best arm, we conduct simulations for different values of parameters and : . The other objective is to determine the empirical exponent of the plots (i.e., the value of such that the expected regret is approximately a constant multiple of ). To achieve this, we first estimate the constant factor by dividing the expected regret for by , for a given value of . We then compute the squared error when attempting to fit the expected regret with . Considering candidate values of to be between 0 and 1 with intervals of 0.01, we deduce the empirical exponent to be the value of which results in the least squared error. We also consider another method for determining the empirical exponent: we produce the line of best fit for the scatter plot of versus the log of the expected regret for that ; the slope of this line gives the empirical exponent. The empirical exponents obtained using the two methods are almost identical (differing by less than 0.01).
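Both procedures for estimating the empirical exponent can be sketched in a few lines. The function names are assumptions, the constant factor is calibrated at the largest horizon (the text does not say here which horizon was used), and the synthetic data in the usage example is purely illustrative. The log-log fit requires strictly positive regret values.

```python
import math


def exponent_by_grid(horizons, regrets, step=0.01):
    """Grid-search the exponent g in [0, 1] minimising sum_T (R(T) - c * T**g)**2."""
    t_ref, r_ref = horizons[-1], regrets[-1]  # assumption: calibrate c at the largest horizon
    best_g, best_err = 0.0, math.inf
    for i in range(int(round(1.0 / step)) + 1):
        g = i * step
        c = r_ref / t_ref ** g
        err = sum((r - c * t ** g) ** 2 for t, r in zip(horizons, regrets))
        if err < best_err:
            best_g, best_err = g, err
    return best_g


def exponent_by_loglog_fit(horizons, regrets):
    """Slope of the least-squares line through the points (log T, log R(T))."""
    xs = [math.log(t) for t in horizons]
    ys = [math.log(r) for r in regrets]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Synthetic check: regret exactly 3 * sqrt(T) should give an exponent of 0.5.
horizons = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
regrets = [3.0 * t ** 0.5 for t in horizons]
print(exponent_by_grid(horizons, regrets), exponent_by_loglog_fit(horizons, regrets))
```

The grid search pins the fitted curve to one reference point and is therefore more sensitive to noise at that horizon, whereas the log-log fit weights all horizons equally on a logarithmic scale.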
6.2 Simulation Results
As mentioned at the end of our theoretical analysis, for the subexponential tail case when , the upper bound on the expected regret goes to 0. In our simulations with the maximum observed time horizon of , the expected regret was observed to be uniformly zero, even for (see Figure 1(a)). Further, for other considered values of , the plots exhibit a prominent sublinear nature. In particular, considering the maximum observed time horizon of , the empirical exponents for different values of were consistently observed to be between 0.45 and 0.5 (Theorem 3 showed the order of the regret with respect to , for reasonable values of , to be bounded by , which is an exponent close to 0.5).
For the subPareto tail case illustrated in Figure 1(b), note that we have no result for because the value of for obtaining a feasible should be greater than , which is beyond our maximum observed time horizon of . Moreover, we have partial results for because the value of for obtaining a feasible should be greater than ; so the plot starts with . It can be seen, in general, that the plots in Figure 1(b) follow a far less sublinear nature and exhibit a much higher expected regret than those in Figure 1(a). This is intuitive from our analysis that the subexponential tail case is likely to result in a much lower regret than the subPareto tail case. In particular, the empirical exponent for was deduced to be 0.8, which is close to linear (its theoretical upper bound as per our analysis is 0.83). In general, considering the maximum observed time horizon of , it can be seen from Figure 1(c) that the upper bound on the theoretical exponent (which is from Theorem 4) and the empirical exponent are close to each other.
Note that the gap between the empirical exponents and the corresponding theoretical upper bounds could be attributed to the fact that it is difficult to find the worst-case distribution over the reward parameters of the arms. Hence, it is unlikely that the worst-case (or instance-independent) expected regret could be attained in the simulations with a random reward structure. Since the gap is not very significant, the simulation results suggest that the bounds derived in our regret analysis of BLMoss (in Section 5) are likely to be tight.
7 Additional Related Work
A standard stochastic MAB framework considers that the number of available arms is fixed (say ) and typically much less than the time horizon (say ). In the seminal work of Lai and Robbins [lai85], the authors showed that any MAB algorithm in such a setting must incur a regret of where is the Kullback-Leibler divergence between the best arm and the second best arm. Auer [auer2002using] proposed the UCB1 algorithm which attains a matching upper bound on the expected regret. However, the instance-independent (i.e., worst-case) regret of the variant of UCB1, UCB, is given by [bubeck2012regret]. The Moss algorithm proposed by Audibert and Bubeck [audibert2010regret] achieves the instance-independent regret of . Bubeck and Cesa-Bianchi [bubeck2012regret] present a detailed survey on regret bounds of these algorithms.
The problem of learning the qualities of answers on Q&A forums was first modeled under the MAB framework by Ghosh and Hummel [ghosh2013learning], where the generation of a new arm was considered a consequence of the strategic choice of an agent. Though this model captures strategic aspects of the contributors, there is an important practical issue with such modelling. Each agent, being a strategic attention seeker, is assumed to produce just enough effort to satisfy incentive compatibility in the equilibrium. We do not assume an efforts-and-costs model and show that, even when the number of answers grows linearly with time, if the qualities of arriving answers satisfy a certain mild distributional assumption, the proposed algorithm achieves sublinear regret.
Tang and Ho [tang2019bandit] consider a model with a fixed number of arms but with a platform where agents provide biased feedback. On such Q&A forums, it is more relevant to consider the problem with an increasing number of arms. A recent work by Liu and Ho [LIU18] limits the growth of the bandit arms by randomly dropping some of the arms from consideration, and computing the regret with respect to only the considered arms. That is, they do not account for the regret incurred due to the randomly dropped arms.
8 Discussion and Future Work
In this paper, we presented a novel extension to the classical MAB model, which we call the Ballooning bandits model (BLMAB). We showed that, in the absence of any distributional assumption on the arrival of the best quality arm, it is impossible to achieve a sublinear regret. We proposed an algorithm for the BLMAB model and provided sufficient conditions under which the proposed algorithm achieves sublinear regret. In particular, when the arrival distribution of the best quality arm has a subexponential or sub-Pareto tail, our algorithm BLMoss achieves sublinear regret by restricting the number of arms to be explored in an intelligent way. Our results indicate that the number of arms to be explored depends on the distributional parameters, namely, (for the subexponential case) and (for the sub-Pareto case), which must be known to the algorithm. It would be interesting to see how a learning algorithm can be designed to learn these parameters as well. As noted earlier, we only consider a structure on the arrival of the best arm. One could also consider a more sophisticated arrival process of the arms to obtain tighter regret guarantees.
References
Appendix
Appendix A Omitted Proofs
See 1
Proof.
Suppose there exists an arm such that , then there exists a corresponding arm such that . Construct a set . As the best arm is uniformly distributed, both and have equal probability of being an optimal arm. Thus, we have . Hence, the internal regret of the algorithm will be the same. The external regret will decrease as . ∎
See 1
Proof.
The forward direction is straightforward. As is a one-to-one function on the nonnegative domain, we have . Let then we have . To show that implies , observe that . We get the required result by taking the inverse. ∎
See 2
Proof.
Observe that . Note that is continuous, one-to-one and strictly increasing. Hence, its inverse, , is also increasing. ∎
See 3
Proof.
By definition, the Lambert function satisfies . It is easy to see that . Further we have,
Let . We have that . We have that . Hence we have that is increasing i.e. for all . This shows that .
Now, let . Here, we have , and (since for ). So, for all , implying that . This completes the proof. ∎
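The proof above rests on the defining identity of the Lambert function, W(x) e^{W(x)} = x. As a numerical sanity check, the following is a minimal sketch that evaluates the principal branch on the nonnegative domain with Newton's method and confirms the identity at a few points. The function name `lambert_w`, the tolerance, and the test points are assumptions made for illustration; the sketch does not reproduce the specific bounds established in the proof.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch of the Lambert function for x >= 0.

    Solves w * exp(w) = x by Newton's method. The starting point
    log(1 + x) is never below the root, and w * exp(w) is convex and
    increasing for w >= 0, so the iterates decrease towards the root.
    """
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        # Newton step for f(w) = w * exp(w) - x, f'(w) = exp(w) * (w + 1).
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


# Check the defining identity W(x) * exp(W(x)) = x at a few points.
sample_points = [0.0, 0.5, 1.0, math.e, 10.0, 1e6]
for x in sample_points:
    w = lambert_w(x)
    residual = abs(w * math.exp(w) - x)
    print(f"x = {x:g}, W(x) = {w:.6f}, residual = {residual:.2e}")
```

Newton's method is used here only to keep the sketch free of external dependencies. A library routine for the Lambert function would serve equally well where one is available.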
Claim 2.
Proof.
We have the following equivalent inequalities.
The second-to-last inequality is obtained by applying the monotone increasing function on both sides, and then using Definition 1 of the Lambert function. ∎
Claim 3.
is decreasing in for .
Proof.
For , we have
∎