The classical stochastic multi-armed bandit (MAB) model provides an elegant abstraction to a number of important sequential decision making problems. In this setting, the planner chooses (or pulls) a single arm in each discrete time instant from a fixed pool of finitely many arms for a finite number of time instants. Typically it is assumed that the number of arms is much smaller than the number of time instances. Each arm, when pulled, generates a reward from a fixed but a-priori unknown stochastic distribution corresponding to the pulled arm. However, the arms that are not pulled do not generate any reward to the planner. The planner’s goal is to minimize the regret, i.e., the loss incurred in expected cumulative reward due to not knowing the reward distribution of the arms beforehand. The MAB problem encapsulates the classical exploration versus exploitation dilemma, in that the planner’s algorithm has to arrive at an optimal trade-off between exploration (pulling relatively unexplored arms) and exploitation (pulling the best arms according to the history of pulls thus far). This problem has been extensively studied in the literature. These studies include analyzing the lower bound on regret [lai85], design and analysis of asymptotically optimal algorithms [auer2010ucb; agrawal2012analysis; thompson1933likelihood; auer2002finite], empirical studies [chapelle2011empirical; devanand2017empirical; russo2018tutorial], and several extensions [slivkins2019introduction; bubeck2012regret]. We provide a detailed review of the relevant literature in Section 7.
The theoretical results in MAB are complemented by a wide variety of modern applications that can be seamlessly modelled in the MAB setup. Internet advertising [babaioff2009characterizing; nuara2018combinatorial], crowdsourcing [JAIN2018], clinical trials [villar2015multi], and wireless communication [maghsudi2014joint] represent a few of the many applications. Owing to its wide range of applications and elegant theoretical foundation, many variants of the MAB problem have been proposed. In this paper, we propose a novel variant which we call Ballooning Multi-Armed Bandits (BL-MAB). In contrast to the classical MAB, where the set of available arms is fixed throughout the run of an algorithm, the set of arms in BL-MAB grows (or balloons) over time. As the number of arms increases (potentially linearly) with time, an optimal algorithm has to ignore (or drop) some arms. Hence, in addition to achieving an optimal trade-off between exploratory pulls and exploitation pulls, the algorithm must also ensure that it does not drop too many (or too few) arms.
To see that the traditional algorithms are not regret-optimal in the BL-MAB setting, consider the following thought experiment. Let a new arm arrive at each time instant in decreasing order of mean reward, and let the MAB algorithm run for a total of $T$ time instants. The traditional MAB algorithms (such as UCB1 and Moss) would pull the newly arrived arm at each time instant and thus would incur regret that is linear in $T$. Note that the best arm appeared at the very first time instant; however, as the set of arms expands monotonically over time, the algorithm could not sufficiently explore every arm. Observe that the regret in BL-MAB depends not only on the mean rewards of the arms, but also on when they arrive. Hence, any BL-MAB algorithm ought to be aware of the arrival of the arms.
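The thought experiment can be checked with a short simulation. The following is a minimal sketch, assuming Bernoulli rewards, the standard UCB1 index, and a hypothetical linearly decreasing quality sequence (none of these specifics are fixed by the text above). Because a never-pulled arm has an infinite index, UCB1 pulls the newly arrived arm in every round, and the regret against the best available arm grows linearly.

```python
import math
import random


def ucb1_on_ballooning_arms(qualities, seed=0):
    """Run UCB1 when arm t arrives at round t (0-indexed, one new arm per round).

    Returns the list of pulled arms and the cumulative regret measured
    against the best arm available in each round.
    """
    rng = random.Random(seed)
    horizon = len(qualities)
    pulls = [0] * horizon    # number of pulls of each arm
    sums = [0.0] * horizon   # cumulative reward of each arm
    chosen, regret = [], 0.0
    best_available = 0.0

    for t in range(horizon):
        def index(i):
            # An arm that was never pulled gets an infinite index.
            if pulls[i] == 0:
                return math.inf
            return sums[i] / pulls[i] + math.sqrt(2.0 * math.log(t + 1) / pulls[i])

        arm = max(range(t + 1), key=index)   # arms 0..t have arrived
        reward = 1.0 if rng.random() < qualities[arm] else 0.0  # Bernoulli reward
        pulls[arm] += 1
        sums[arm] += reward
        best_available = max(best_available, qualities[t])
        regret += best_available - qualities[arm]
        chosen.append(arm)
    return chosen, regret


horizon = 1000
# Hypothetical qualities, decreasing in arrival order from 0.9 towards 0.1.
qualities = [0.9 - 0.8 * t / horizon for t in range(horizon)]
chosen, regret = ucb1_on_ballooning_arms(qualities)
print(chosen[:5], round(regret, 1))
```

In this sketch the pulled sequence is simply the arrival order, so the best arm (the first one) is pulled exactly once regardless of the observed rewards.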
We motivate the practical significance of BL-MAB with a few applications. In general, BL-MAB is directly applicable in any scenario where the set of options grows over time and the objective is to choose the best option available at any given time.
A contemporary example is provided by question and answer (Q&A) platforms such as Reddit, Stack Overflow, Quora, Yahoo! Answers, and ResearchGate, where the platform’s goal is to discover the highest quality answer that should be displayed in the most prominent slot, for a given question. Each answer post is modeled as a distinct arm of a BL-MAB instance, and the rewards are distributed according to a Bernoulli distribution parameterized by the quality of the posted answer. Note that the quality parameter is a priori unknown to the platform and hence needs to be learnt. For this, the platform employs certain endorsement mechanisms with indicators such as upvotes, likes, and shares (or re-posts). A user endorses the answer that is displayed to her, if she likes it. Each display of a posted answer corresponds to a pull of the corresponding arm. At each time instant, a new user observes the existing answer posts shown by the platform, decides whether to endorse them, and may also choose to post her own answer, thus increasing the number of available arms. Hence, the number of available arms (answers) monotonically increases over time.
The problem of learning qualities of the answers on Q&A forums has been modeled under the MAB framework in various studies [ghosh2013learning; tang2019bandit; LIU18]. However, these studies resort to the existing MAB variations which are not well suited for Q&A forums. For instance, Ghosh and Hummel [ghosh2013learning] model the problem with a classical MAB framework by limiting the number of arms via strategic choice of an agent, by assuming that a user incurs a certain cost for posting an answer and hence posts it only if she derives a positive utility by doing so. However, a user’s behavior on the platform may be driven by simple cognitive heuristics rather than a well calibrated strategic decision [burghardt2016myopia]. In another work [LIU18], the number of arms is limited by randomly dropping some of the arms from consideration. The regret is then computed with respect to only the considered arms. That is, they do not account for the regret incurred due to the randomly dropped arms.
Other applications of the BL-MAB framework arise in websites that feature user reviews, such as Amazon and Flipkart (product reviews), Tripadvisor (hotel reviews), and IMDB (movie reviews). As time progresses, the reviews for a product (or a hotel or a movie) keep arriving, and the website aims to display the most useful reviews for that product (or hotel or movie) at the top. The usefulness of a review is estimated using users’ endorsements for that review, similar to that in Q&A forums. BL-MAB is also applicable in scenarios where users comment on a video or news article on a video or news hosting website, where the website’s objective is to display the most popular or interesting comment at the top. The BL-MAB setting thus provides a natural framework for such applications.
BL-MAB needs an independent investigation for a number of reasons. For instance, one of the MAB variants that bears some similarity to BL-MAB is the sleeping multi-armed bandit (S-MAB) [kleinberg2010; chatterjee2017analysis], where a subset of a fixed set of base arms is available at each time instant. Though the S-MAB framework captures the availability of a small subset of arms at each time, it assumes that the set of base arms is fixed and is small compared to the time horizon. In contrast, the BL-MAB framework allows the number of available arms to increase, potentially linearly with time. Hence, an optimal sleeping bandits algorithm such as Auer [kleinberg2010] would incur linear regret in the BL-MAB setting.
Another MAB variant which is somewhat similar is the many-armed (potentially infinite) bandit [wang2009algorithms; carpentier2015simple; berry1997bandit], where the number of arms could be potentially equal to or greater than the time horizon. berry1997bandit consider the case of an infinite arm bandit with Bernoulli reward distribution. However, they consider that the optimal arm has a quality of , which is seldom the case in practical applications. Other investigations considering infinitely many arms [wang2009algorithms; carpentier2015simple] make certain assumptions on the distribution of the near optimal arm to achieve sub-linear regret. Further, all the above works consider that all the arms are available in all time instants, and hence use the traditional notion of regret. In our case, the regret incurred by an algorithm in a given time instant is the difference between the quality of the best available arm during that time and the quality of the arm pulled by the algorithm (same as the notion of regret considered in sleeping bandits). The BL-MAB framework is thus an interesting blend of both the sleeping bandit model and the many-armed bandit model.
1.2 Our Contributions
We introduce the BL-MAB model that allows the set of arms to grow over time.
We show for the BL-MAB model that the instance-independent regret grows linearly with time in the absence of any distributional assumption on the arrival time of the highest quality arm. Specifically, we show that if the best arm is equally likely to arrive at any time, sublinear expected regret cannot be achieved (Theorem 1).
We propose an algorithm (BL-Moss) which determines: (1) the fraction of the time horizon until which the newly arriving arms should be explored at least once and (2) the sequence of arm pulls during the exploitation phase. Our key finding is that BL-Moss achieves sub-linear regret under practical and minimal assumptions on the arrival distribution of the best arm, namely, sub-exponential tail (Theorem 3) and sub-Pareto tail (Theorem 4). Note that we make no assumption on the arrival of the other arms. As the regret depends on the qualities of the arms and the sequence of their arrivals, it is interesting that with sub-exponential and sub-Pareto assumption on only the best arm’s arrival pattern, we can achieve sub-linear regret.
We carry out a pertinent simulation study to empirically observe how the expected regret varies with the time horizon. We find a strong validation for our theoretically derived regret bounds.
The paper is organized as follows. In Section 2, we present the BL-MAB model. In Section 3, we first show that if the best arm arrives uniformly at random, one cannot achieve sub-linear regret. We then define two distributional assumptions on the arrival time of the best arm which enable us to achieve sub-linear regret. Next, we present some preliminaries in Section 4, followed by our proposed algorithm and its theoretical analysis in Section 5. Section 6 presents our simulation results. We conclude the paper with related work (Section 7) and future directions (Section 8).
2 The Model
A classical MAB instance is given by the tuple $\langle K, (D_i)_{i \in K} \rangle$. Here, $K$ is a fixed set of arms and $D_i$ is the reward distribution corresponding to an arm $i \in K$. Further, it is assumed that $(D_i)_{i \in K}$ are mutually independent distributions. Denote by $q_i$ the mean of distribution $D_i$. Consider that each of the distributions is supported over a finite interval and is unknown to the algorithm. Throughout the paper, without loss of generality, we consider that $D_i$ is supported over $[0, 1]$. Further, we will refer to $q_i$ as the quality of the arm $i$. A MAB algorithm is run in discrete time instants, and the total number of time instants is denoted by the time horizon $T$. In each time instant (also called a round), the algorithm selects a single arm and observes the reward corresponding to the selected arm. The arms which are not selected do not give any reward. More precisely, a MAB algorithm is a mapping from the history of arm pulls and obtained rewards to the set of arms.
At each time instant $t$, a BL-MAB algorithm chooses a single arm from the set of available arms and receives a reward generated randomly according to the reward distribution of the chosen arm. New arms may spring up at each time instant. Throughout the paper, we consider that at most one new arm arrives at each time, and the arms are never dropped. Let $K(t)$ denote the set of arms available at round $t$. In the BL-MAB model, this set of available arms grows by at most one arm per round, i.e., $K(t) \subseteq K(t+1)$ and $|K(t+1)| \leq |K(t)| + 1$. A BL-MAB instance, therefore, is given by $\langle (K(t))_{t=1}^{T}, (D_i)_{i \in K(T)} \rangle$.
Similar to the notion of regret in the sleeping stochastic MAB model, we introduce the notion of regret in the BL-MAB setting that takes into account the availability of the arms at each time $t$. Let $i_t$ denote the arm pulled by the algorithm and $i_t^\star$ be the best available arm at time $t$, i.e., $i_t^\star \in \arg\max_{i \in K(t)} q_i$. Further, let $I$ denote a BL-MAB instance and $A$ be a BL-MAB algorithm. The instance-dependent regret of $A$ is given by
$$ R_A(T, I) = \mathbb{E}\left[ \sum_{t=1}^{T} \left( q_{i_t^\star} - q_{i_t} \right) \right]. $$
Throughout the paper, we consider the instance-independent regret given as $R_A(T) = \sup_{I} R_A(T, I)$. Note that the instance-independent regret bound is a worst-case regret bound over all the arrival sequences of the arms and all possible reward distributions. In the next section, we show, for the BL-MAB setting, that it is not possible to achieve a sublinear instance-independent regret bound.
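To make the availability-aware notion of regret concrete, the following is a minimal sketch that evaluates it for one realized sequence of pulls (the expectation over the algorithm's randomness is omitted). The function name and the list-based encoding of arrival rounds are hypothetical choices made for this illustration only.

```python
def ballooning_regret(arrival_rounds, qualities, pulled_arms):
    """Regret of a realized pull sequence against the best *available* arm.

    arrival_rounds[i] is the round (1-indexed) at which arm i becomes available,
    qualities[i] is its mean reward, and pulled_arms[t - 1] is the arm pulled
    in round t.
    """
    regret = 0.0
    for t, arm in enumerate(pulled_arms, start=1):
        available = [i for i, a in enumerate(arrival_rounds) if a <= t]
        if arm not in available:
            raise ValueError(f"arm {arm} is not available in round {t}")
        best_quality = max(qualities[i] for i in available)
        regret += best_quality - qualities[arm]
    return regret


# Three arms arriving in rounds 1, 2 and 3; the second arm is the best one.
example = ballooning_regret([1, 2, 3], [0.5, 0.9, 0.2], [0, 0, 1])
print(example)
```

In the example, only round 2 contributes regret: the best available arm has quality 0.9 while the pulled arm has quality 0.5.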
3 Lower Bound on Regret
As pointed out in Section 1, UCB-style algorithms (which pull arms based on uncertainty) would pull each incoming arm at least once, leaving no rounds for exploitation. Hence, they incur linear regret in the ballooning bandit setup when a new arm arrives in every round, as in the thought experiment of Section 1. However, it is not obvious whether a different, more sophisticated algorithm (such as one that randomly drops some arms) could achieve sub-linear regret. Our first result (Theorem 1) shows that no algorithm can attain sub-linear regret under a general BL-MAB setting.
Consider the following BL-MAB instance . Let there be a unique best arm with quality and all other arms have quality . A new arm arrives at each time, i.e., exactly $t$ arms are available at round $t$. Further, let the arrival of the best arm be uniformly distributed over time, i.e., the best arm arrives at round $t$ with probability $1/T$ for all $t \in \{1, \ldots, T\}$, where $T$ is the time horizon. Let denote the optimal arm till time . Further, let be the set of arms pulled by the algorithm, i.e., . We will show that for any fixed , the expected regret is lower bounded by . We first observe that an algorithm that pulls number of arms achieves the minimum regret when it pulls the first arms.
Claim 1. Among algorithms that pull a given number of distinct arms, the minimum regret is achieved when the algorithm considers the arms that arrive first.
Theorem 1. For the BL-MAB instance described above, the expected regret of any algorithm is $\Omega(T)$, i.e., it grows linearly with the time horizon $T$.
The expected regret of a BL-MAB algorithm is given by
In the above expression, the first term represents the internal regret of the learning algorithm and the second quantity is the external regret. Here, if and otherwise. From Claim 1, we have that an algorithm will incur least regret if it pulls first arms. Further, from the classical result in [lai85], in order to separate the quality of the arms, we should have for some positive problem dependent constant , for all and . Hence, we have
Note that the above expression is quadratic in . For , the minimum occurs when the value of is the least (in the positive domain), which is 1. In this case we have
For the case where , the minimum occurs when . Hence,
Theorem 1 provides a strong impossibility result on the achievable instance-independent regret bound under the BL-MAB setting. However, one can still achieve sub-linear regret by imposing appropriate structure on the BL-MAB instances. Observe that the regret depends on the arrival of the arms and on their reward distributions. We impose restrictions on the arrival of the best arm so that the probability that it arrives early is large enough; this would allow a learning algorithm to explore the best arm enough to estimate the true quality of that arm with high probability. As noted previously, the other arms may arrive arbitrarily. Further note that we make no assumption on the qualities of individual arms.
3.1 Arrival of the Best Arm
Let $X$ be the random variable denoting the time at which the best arm arrives. Further, let $F_X$ denote the cumulative distribution function of $X$. In our first result, we use the following sub-exponential tail assumption on the arrival time of the best arm.
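Sub-exponential tail property. There exists a constant $\lambda > 0$ such that the probability of the best arm arriving later than $t$ rounds is upper bounded by $e^{-\lambda t}$, i.e., $P(X > t) \leq e^{-\lambda t}$, where $X$ is the arrival time of the best arm.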
Next, we consider a relaxed condition on the tail probabilities, i.e., when the tail does not shrink as fast as in the sub-exponential case. We consider the family of distributions whose tail is thicker than that of sub-exponential arrival distribution.
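Sub-Pareto tail property. There exists a constant $\beta > 0$ such that the probability of the best arm arriving later than $t$ rounds is upper bounded by $t^{-\beta}$, i.e., $P(X > t) \leq t^{-\beta}$, where $X$ is the arrival time of the best arm.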
The aforementioned assumptions naturally arise in the context of Q&A forums as observed in extensive empirical studies on the nature of answering as well as voting behavior of the users. Anderson et al. [anderson2012discovering] observe that high reputation users hasten to post their answers early. One possible explanation for this phenomenon could be that the users who are motivated by the visibility that their answers receive, tend to be more active on the platform and also provide high quality answers early on, which explains their reputation score. Similarly, the most useful product reviews are likely to arrive early after the product release. Thus, it is reasonable to assume that the best arm arrives, with high probability, in early rounds.
Note that the uniform distribution is the limiting case of the sub-exponential case, when . We show that, while the uniform distribution results in linear regret (Theorem 1), a sub-linear regret can be achieved for BL-MAB instances having the best arm arrival distribution with even slightly thinner tail than that of uniform distribution.
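For intuition about the two tail conditions, the following is a minimal sketch of how arrival rounds with an exponential or a Pareto tail could be sampled by inverse transform. The parameter names `lam` and `beta`, the truncation to the horizon, and the choice of distributions that meet the tail bounds with equality are assumptions of this illustration, not part of the analysis.

```python
import math
import random


def sample_arrival_subexp(lam, horizon, rng):
    """Arrival round X in {1..horizon} with P(X > t) = exp(-lam * t) for t < horizon."""
    u = rng.random()                           # u in [0, 1)
    t = math.ceil(-math.log(1.0 - u) / lam)    # ceiling of an Exp(lam) draw
    return min(max(t, 1), horizon)


def sample_arrival_subpareto(beta, horizon, rng):
    """Arrival round X in {1..horizon} with P(X > t) = t ** (-beta) for 1 <= t < horizon."""
    u = rng.random()
    log_y = -math.log(1.0 - u) / beta          # log of a Pareto(beta) draw on [1, inf)
    if log_y >= math.log(horizon):
        return horizon                         # truncate, and avoid overflow in exp
    return min(max(math.ceil(math.exp(log_y)), 1), horizon)


rng = random.Random(1)
exp_draws = [sample_arrival_subexp(0.5, 1000, rng) for _ in range(20000)]
par_draws = [sample_arrival_subpareto(1.0, 1000, rng) for _ in range(20000)]
exp_tail = sum(x > 5 for x in exp_draws) / len(exp_draws)   # compare with exp(-2.5)
par_tail = sum(x > 4 for x in par_draws) / len(par_draws)   # compare with 4 ** -1
print(exp_tail, par_tail)
```

The heavier Pareto tail places noticeably more mass on late arrivals of the best arm, which is why it is the harder of the two cases.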
4 Preliminaries
We now present some essentials which will be useful for our analysis in the remainder of the paper.
4.1 Lambert Function
For any $x \geq 0$, the Lambert function, $W(x)$, is defined as the solution to the equation $w e^{w} = x$, i.e., $W(x) e^{W(x)} = x$.
The Lambert function satisfies the following properties [hoorfar2008inequalities] (proofs are provided in the Appendix for completeness):
P1: The Lambert function can be equivalently written as the inverse of the function $f(w) = w e^{w}$, i.e., $W(x) = f^{-1}(x)$.
P2: For any $x \geq 0$, the Lambert function is unique, non-negative, and strictly increasing.
P3: For any , we have .
It is easy to numerically approximate $W(x)$ for a given $x$ using the Newton-Raphson method or Halley’s method. Moreover, there exist efficient numerical methods for evaluating it to arbitrary precision [Corless1996].
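The numerical approximation mentioned above can be sketched as follows. This is a minimal Newton-Raphson iteration on $w e^{w} - x = 0$, restricted to non-negative inputs of moderate size; the function name, the tolerance, and the starting point $\log(1 + x)$ are choices made for this illustration.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Approximate the Lambert function W(x) for x >= 0 by Newton-Raphson."""
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    if x == 0:
        return 0.0
    w = math.log1p(x)   # starting point; log(1 + x) is an upper bound on W(x) for x >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        # Newton step for f(w) = w * e^w - x, with f'(w) = e^w * (w + 1).
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * max(1.0, abs(w)):
            break
    return w


w_e = lambert_w(math.e)   # W(e) = 1, since 1 * e^1 = e
print(w_e)
```

Since $w e^{w}$ is convex on the non-negative reals and the starting point lies to the right of the root, the iterates decrease monotonically towards $W(x)$.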
4.2 The Moss Algorithm
We use Moss (Minimax Optimal Strategy in the Stochastic case) [audibert2010regret] as a black box learning algorithm. For a fixed number of arms, the Moss algorithm pulls an arm $i_t$ at time $t$ such that
$$ i_t \in \arg\max_{i \in K} \left( \hat{q}_{i, n_i(t)} + \sqrt{\frac{\max\left\{ \log\left( \frac{T}{|K| \, n_i(t)} \right), 0 \right\}}{n_i(t)}} \right). $$
Here, $K$ denotes the set of arms, $T$ is the time horizon, $n_i(t)$ is the number of times arm $i$ was pulled before (and excluding) round $t$, and $\hat{q}_{i, n_i(t)}$ is the empirical estimate of the quality of arm $i$ from $n_i(t)$ samples. Each arm is pulled once in the beginning, and ties are broken arbitrarily. The following result gives an upper bound on the expected regret of Moss which is optimal up to a constant factor (it achieves the lower bound on regret given by [auer1995gambling]). Throughout the paper, we use the notation Moss($K$) to denote that the Moss algorithm is run with the set of arms $K$.
Theorem 2 [audibert2010regret]. For any time horizon $T$, the expected regret of Moss($K$) is $O(\sqrt{|K| T})$.
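The arm-selection rule of Moss can be sketched in a few lines. This is an illustrative implementation of the index from [audibert2010regret] as commonly stated (empirical mean plus a truncated logarithmic bonus); the function names and the list-based bookkeeping are hypothetical.

```python
import math


def moss_index(mean_estimate, n_pulls, horizon, n_arms):
    """Moss index of one arm; an arm that was never pulled has an infinite index."""
    if n_pulls == 0:
        return math.inf
    bonus = math.sqrt(max(math.log(horizon / (n_arms * n_pulls)), 0.0) / n_pulls)
    return mean_estimate + bonus


def moss_choose(means, pulls, horizon):
    """Return the arm with the largest Moss index (first such arm on ties)."""
    n_arms = len(means)
    return max(range(n_arms),
               key=lambda i: moss_index(means[i], pulls[i], horizon, n_arms))


# With a horizon of 100 and 2 arms, the bonus vanishes once an arm has 50 pulls.
print(moss_index(0.5, 50, 100, 2), moss_choose([0.9, 0.1], [50, 0], 100))
```

Unlike the UCB1 bonus, the Moss bonus is truncated at zero, so heavily sampled arms are compared on their empirical means alone.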
5 The BL-Moss Algorithm and Regret Analysis
5.1 The BL-Moss Algorithm
We now present our algorithm, BL-Moss (Algorithm 1), that uses Moss as a black-box. The number of arms explored by BL-Moss is dependent on the distribution of arrival of the best arm. In particular, BL-Moss considers only the first arms in its execution (). Later in this section, we show how to derive the value of for distributions with sub-exponential and sub-Pareto tails. Observe that the proposed BL-Moss is a simple extension of Moss and this algorithm is practically easy to implement.
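As a rough illustration of the idea described above (Algorithm 1 itself is not reproduced here), the following sketch runs Moss on only the first `n_considered` arrived arms and ignores all later arrivals. It assumes that exactly one arm arrives per round, that rewards are Bernoulli, and that the Moss index uses `n_considered` as the number of arms; the name `n_considered` and these modelling choices are assumptions of the sketch, and the cutoff is passed in rather than derived from the arrival distribution.

```python
import math
import random


def bl_moss(qualities, n_considered, seed=0):
    """Sketch of BL-Moss: Moss restricted to the first n_considered arrived arms.

    Arm t arrives at round t (0-indexed). Returns the pull counts of the
    considered arms and the regret against the best arm available per round.
    """
    rng = random.Random(seed)
    horizon = len(qualities)
    pulls = [0] * n_considered
    sums = [0.0] * n_considered
    regret, best_available = 0.0, 0.0

    def index(i):
        if pulls[i] == 0:
            return math.inf   # each considered arm is pulled once on arrival
        bonus = math.sqrt(
            max(math.log(horizon / (n_considered * pulls[i])), 0.0) / pulls[i])
        return sums[i] / pulls[i] + bonus

    for t in range(horizon):
        # Considered arms that have already arrived; later arrivals are ignored.
        candidates = range(min(t + 1, n_considered))
        arm = max(candidates, key=index)
        reward = 1.0 if rng.random() < qualities[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        best_available = max(best_available, qualities[t])
        regret += best_available - qualities[arm]
    return pulls, regret


# Hypothetical instance: the best arm arrives first, all others are much worse.
qualities = [0.9] + [0.1] * 499
pulls_one, regret_one = bl_moss(qualities, n_considered=1)
pulls_many, regret_many = bl_moss(qualities, n_considered=20)
print(regret_one, regret_many)
```

The regret computed here includes both the cost of learning among the considered arms and the cost of ignoring a better arm that arrives after the cutoff.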
5.2 Regret Analysis of BL-Moss
For a given BL-MAB instance , let and . Clearly, we have that . As stated earlier, the regret of the algorithm can be decomposed into internal regret, i.e., the regret incurred by the learning algorithm considering only arms and external regret, i.e., the regret incurred by BL-Moss due to the fact that BL-Moss might have ignored the best arm. Write and let be the time of arrival of arm . Further, let denote the best arm till time . The instance-dependent regret is given as
The first and the second terms respectively denote the internal regret and the external regret of BL-Moss. We ignore the ceiling in throughout this section to avoid notation clutter.
Note that for all . This is true for any time horizon . From Theorem 2, we have the following observation about the internal regret of BL-Moss.
Observation 1. For the value of computed by BL-Moss, we have .
In order to bound the overall regret, we begin with the following lemma which explicitly shows the relation between the expected regret of the algorithm and . Recall that the random variable denotes the time of arrival of the best arm.
Lemma 1. The upper bound on the expected regret for any BL-MAB instance is given by , with BL-Moss exploring only the first arrived arms.
For a given BL-MAB instance , let denote the time at which arm becomes available for the first time. Let denote the best arm till rounds, i.e., . Further, let be the best arm among the arms considered by BL-Moss, i.e., . Notice that . This implies .
|(From Observation 1 and since )|
Note that the above inequality holds for any BL-MAB instance and hence we have ∎
Next, we show that under the sub-exponential tail property on , BL-Moss achieves sub-linear regret. We begin with the following lemma that lower bounds the probability of the arrival of the best quality arm in the initial rounds.
Lemma 2. Let the arrival distribution of the best arm satisfy the sub-exponential tail property for some . Then for any and , we have that .
Note that . Hence, by Property P1 we have
By Property P2 we have . Equivalently, . So, we have . The last inequality follows from the sub-exponential tail property. ∎
Theorem 3. Let the arrival distribution of the best arm satisfy the sub-exponential tail property for some , and let be large enough such that for some . Then with , the upper bound on the expected regret of BL-Moss, , is . The upper bound on the expected regret is minimized when and is given by .
Note that for achieving sub-linear regret, it is necessary that is strictly positive, for which it is necessary that . From Lemma 2, we also have . Since such a feasible may not exist for small values of , we consider that is large enough. It can be easily shown that (see Claim 2 in Appendix).
Thus, for , we have: . Recall that by definition, we have . Thus when , the term dominates the other terms in the regret expression, whereas when , the term dominates. We analyze these cases separately.
Case 1 (): In this case, the regret is given by . Note that the regret is minimized for the lowest feasible value of , i.e., , resulting in . The last equality follows from the equivalent definition of Lambert W function (Property P1).
Case 2 (): In this case, the regret is given by . Again, the regret is minimized when . The regret in this case is given by .
We now prove the sub-linear regret of BL-Moss under the sub-Pareto tail property.
Lemma 3. Let the arrival distribution of the best arm satisfy the sub-Pareto tail property for some . Then for any and , we have that .
First note that . This implies that . Further, from the sub-Pareto tail property, we have that . ∎
Theorem 4. Let the arrival distribution of the best arm satisfy the sub-Pareto tail property for some , and let be large enough such that for some . Then with , the upper bound on the expected regret of BL-Moss, , is . The upper bound on the expected regret is minimized when and is given by .
From Lemmas 1 and 3, we have . For achieving sub-linear regret, it is necessary that is strictly positive. So, we should have . Further, from Lemma 3, we have . So, for a feasible to exist, it is necessary that , i.e., is large enough. Thus, for , we have . As earlier, we analyze two cases.
Case 1 (): In this case, the regret is given by . The minimum regret is obtained when and is given by .
Case 2 (): In this case, the regret is given by . Again, the regret is minimum when and is given by .
Furthermore, it is easy to see that in Case 1, for any . Similarly, in Case 2, for any . This shows that the minimum regret is achieved when . ∎
5.3 Important Observations
We conclude the section with some key observations.
In the sub-exponential tail case, as , we have . This implies that the upper bound on the expected regret goes to . Note that in this case, . Since BL-Moss considers a single arm, the internal regret is zero. Further, we have , i.e., the first arm is optimal with probability approaching 1, the external regret is also zero. As , the tail bounds are trivial and are satisfied by uniform distribution. From Theorem 1, we have that the regret in this case cannot be sub-linear.
In the sub-Pareto tail case, following the similar argument as in the sub-exponential tail case, we have that as , the regret . On the other hand, as , the regret goes to . The larger value of implies that the probability that the optimal arm arrives by is close to 1; then we have that the regret of BL-Moss is asymptotically optimal. Further, it asymptotically achieves the information theoretic lower bound (which is ).
One could also consider UCB1 instead of Moss. While UCB1 is an anytime algorithm, Moss needs the time horizon as an input. However, the important distinction between the two algorithms is that the instance-independent regret of UCB1, which is $O(\sqrt{KT \log T})$ for $K$ arms and time horizon $T$, is greater than that of Moss; hence we use Moss in BL-Moss. One can similarly use UCB1 to obtain an anytime version of BL-Moss with a slightly larger regret guarantee (up to a factor of $\sqrt{\log T}$).
6 Simulation Study
Figure 1: (a) Expected regret as a function of time horizon for various sub-exponential tails. (b) Expected regret as a function of time horizon for various sub-Pareto tails. (c) Empirical exponents vs. theoretical bounds for time horizons up to the maximum simulated value.
So far, we focused on deriving upper bounds on regret for distributions (on the arrival time of the best arm) having sub-exponential and sub-Pareto tail with different values of and , respectively. In particular, for the case of sub-Pareto tail, we deduced that the extent of sublinearity of the regret (the exponent of in the order of the regret) depends on the value of . On the other hand, the upper bound on regret for the case of sub-exponential tail had the same order with respect to for any reasonable value of , albeit with different multiplicative and additive terms for different values of . In this section, we aim to illustrate how the expected regret varies with the time horizon , and how the empirical exponents compare with their theoretical bounds for different values of and , for time horizons up to rounds.
6.1 Simulation Setup
Note that in a traditional MAB setup, a simulation for a larger time horizon can be conducted as an extension of a simulation for a smaller one. In other words, after obtaining the results for the smaller time horizon, the results for the larger time horizon can be obtained by running the simulation for the additional rounds. However, in the BL-MAB setup, where new arms continue arriving with time and the desired time horizon is known, we have seen that the fraction of the horizon over which arriving arms are considered, and hence the number of arms considered by BL-Moss, depends on the time horizon. Since these values differ across time horizons, a simulation for one time horizon cannot be extended to a larger one. So even if we have simulation results for a smaller time horizon, it is necessary to run a fresh set of simulations to obtain results for a larger one. In our simulation study, we consider several values of the time horizon.
We consider that a new arm arrives in each round, and the probability of an arm arriving at time being the best arm is determined by the distribution function . Thereafter, this best arm () is assigned a quality () between 0 and 1 uniformly at random, and the rest of the arms are assigned quality parameters between 0 and uniformly at random. Given a time horizon , the value of and hence are obtained based on our theoretical analysis. The arm to be pulled in a round is determined by Algorithm 1, wherein the pulled arm generates unit reward with probability equal to its quality, and no reward otherwise (i.e., as per Bernoulli distribution). The regret in each round is computed as the difference between the quality of the best arm available in that round and the quality of the pulled arm. The overall regret is the sum of the regrets over all rounds from till . Note that we are concerned with the regret irrespective of the numerical values of the arms’ qualities. So, for a given instance of the arrival of the best arm, we consider the worst-case regret over 50 sub-instances, where the quality parameters assigned to the arms in different sub-instances are independent of each other. Also, since different instances would have the best arm arriving in different rounds, the expected regret is obtained by simulating over 1000 such random instances and averaging over the corresponding worst-case regret values.
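The instance generation and the averaging scheme described above can be sketched as follows. This is an illustrative skeleton only: `run_once` and `sample_arrival` are hypothetical callables standing in for one run of the bandit algorithm and for a draw of the best arm's arrival round, and the upper limit for the qualities of the remaining arms is assumed here to be the quality of the best arm.

```python
import random


def sample_qualities(horizon, best_arrival, rng):
    """One arm per round; the arm arriving at index best_arrival is the best one."""
    q_best = rng.random()   # quality of the best arm, uniform on [0, 1)
    # Assumption: the other arms are uniform between 0 and the best quality.
    qualities = [rng.uniform(0.0, q_best) for _ in range(horizon)]
    qualities[best_arrival] = q_best
    return qualities


def expected_worst_case_regret(run_once, horizon, sample_arrival,
                               n_instances=1000, n_sub=50, seed=0):
    """Average, over arrival instances, of the worst regret over sub-instances."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_instances):
        best_arrival = sample_arrival(rng)   # 0-indexed arrival of the best arm
        worst = max(
            run_once(sample_qualities(horizon, best_arrival, rng), rng)
            for _ in range(n_sub)
        )
        total += worst
    return total / n_instances


rng = random.Random(0)
demo_qualities = sample_qualities(100, 7, rng)
print(max(demo_qualities) == demo_qualities[7])
```

Taking the maximum over sub-instances before averaging is what approximates the worst case over reward parameters for a fixed arrival of the best arm.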
Our primary objective is to observe how the expected regret varies with the time horizon . In order to observe the influence of various sub-exponential and sub-Pareto tail distributions over the arrival time of the best arm, we conduct simulations for different values of parameters and : . The other objective is to determine the empirical exponent of the plots (i.e., the value of such that the expected regret is approximately a constant multiple of ). To achieve this, we first estimate the constant factor by dividing the expected regret for by , for a given value of . We then compute the squared error when attempting to fit the expected regret with . Considering candidate values of to be between 0 and 1 with intervals of 0.01, we deduce the empirical exponent to be the value of which results in the least squared error. We also consider another method for determining the empirical exponent: we produce the line of best fit for the scatter plot of versus the log of the expected regret for that ; the slope of this line gives the empirical exponent. The empirical exponents obtained using the two methods are almost identical (differing by less than 0.01).
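Both ways of estimating the empirical exponent can be sketched in a few lines. In this illustration, the constant factor in the grid-search method is estimated from the largest time horizon, which is an assumption; the function names and the synthetic data are likewise hypothetical, and strictly positive regret values are required for the logarithms.

```python
import math


def exponent_by_grid_search(horizons, regrets, step=0.01):
    """Exponent in [0, 1] minimizing the squared error of the fit c * T ** gamma."""
    t_ref, r_ref = max(zip(horizons, regrets))   # largest horizon as reference
    best_gamma, best_err = 0.0, math.inf
    for j in range(int(round(1.0 / step)) + 1):
        gamma = j * step
        c = r_ref / t_ref ** gamma               # constant factor for this gamma
        err = sum((r - c * t ** gamma) ** 2 for t, r in zip(horizons, regrets))
        if err < best_err:
            best_gamma, best_err = gamma, err
    return best_gamma


def exponent_by_loglog_fit(horizons, regrets):
    """Slope of the least-squares line through (log T, log regret)."""
    xs = [math.log(t) for t in horizons]
    ys = [math.log(r) for r in regrets]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


# Synthetic check: regret exactly proportional to the square root of the horizon.
horizons = [10 ** 3, 10 ** 4, 10 ** 5, 10 ** 6]
regrets = [3.0 * math.sqrt(t) for t in horizons]
grid_exponent = exponent_by_grid_search(horizons, regrets)
slope_exponent = exponent_by_loglog_fit(horizons, regrets)
print(grid_exponent, slope_exponent)
```

On noisy simulation output the two estimates need not coincide exactly, since the grid search weights large horizons more heavily than the log-log fit does.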
6.2 Simulation Results
As mentioned at the end of our theoretical analysis, for the sub-exponential tail case when , the upper bound on the expected regret goes to 0. In our simulations with the maximum observed time horizon of , the expected regret was observed to be uniformly zero, even for (see Figure 1(a)). Further, for other considered values of , the plots exhibit a prominent sub-linear nature. In particular, considering the maximum observed time horizon of , the empirical exponents for different values of were consistently observed to be between 0.45 and 0.5 (Theorem 3 showed the order of the regret with respect to , for reasonable values of , to be bounded by , which is an exponent close to 0.5).
For the sub-Pareto tail case illustrated in Figure 1(b), note that we have no result for because the value of for obtaining a feasible should be greater than , which is beyond our maximum observed time horizon of . Moreover, we have partial results for because the value of for obtaining a feasible should be greater than ; so the plot starts with . It can be seen, in general, that the plots in Figure 1(b) follow a far less sub-linear nature and exhibit a much higher expected regret than those in Figure 1(a). This is intuitive from our analysis that the sub-exponential tail case is likely to result in a much lower regret than the sub-Pareto tail case. In particular, the empirical exponent for was deduced to be 0.8, which is close to linear (its theoretical upper bound as per our analysis is 0.83). In general, considering the maximum observed time horizon of , it can be seen from Figure 1(c) that the upper bound on the theoretical exponent (which is from Theorem 4) and the empirical exponent are close to each other.
Note that the gap between the empirical exponents and the corresponding theoretical upper bounds could be attributed to the difficulty of finding the worst-case distribution over the reward parameters of the arms. Hence, it is unlikely that the worst-case (or instance-independent) expected regret could be attained in simulations with a random reward structure. Since the gap is small, the simulation results suggest that the bounds derived in our regret analysis of BL-Moss (in Section 5) are likely to be tight.
7 Additional Related Work
A standard stochastic MAB framework considers that the number of available arms is fixed (say ) and typically much less than the time horizon (say ). In the seminal work of Lai and Robbins [lai85], the authors showed that any MAB algorithm in such a setting must incur a regret of where
is the Kullback-Leibler divergence between the reward distributions of the best arm and the second best arm. Auer [auer2002using] proposed the UCB1 algorithm, which attains a matching upper bound on the expected regret. However, the instance-independent (i.e., worst-case) regret of the variant of UCB1, -UCB, is given by [bubeck2012regret]. The Moss algorithm proposed by Audibert and Bubeck [audibert2010regret] achieves the instance-independent regret of . Bubeck and Cesa-Bianchi [bubeck2012regret] present a detailed survey on regret bounds of these algorithms.
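For readers unfamiliar with Moss, the following is a minimal sketch of the index rule of Audibert and Bubeck on a fixed set of arms: each arm's index is its empirical mean plus an exploration bonus that vanishes once the arm has been pulled often relative to the horizon. Bernoulli rewards, the function names `moss_index` and `run_moss`, and the example arm means are assumptions made for illustration. This is the classical fixed-arm algorithm, not the BL-Moss variant analysed in this paper.

```python
import math
import random


def moss_index(mean, pulls, horizon, num_arms):
    """Moss index: empirical mean plus sqrt(max(log(T / (K * n)), 0) / n)."""
    bonus = math.sqrt(max(math.log(horizon / (num_arms * pulls)), 0.0) / pulls)
    return mean + bonus


def run_moss(arm_means, horizon, seed=0):
    """Run Moss on Bernoulli arms; return (expected regret, pull counts)."""
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    best = max(arm_means)
    regret = 0.0
    for t in range(horizon):
        if t < k:
            arm = t  # initialisation: pull every arm once
        else:
            arm = max(
                range(k),
                key=lambda i: moss_index(sums[i] / pulls[i], pulls[i], horizon, k),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        regret += best - arm_means[arm]  # regret measured against true means
    return regret, pulls


regret, pulls = run_moss([0.2, 0.5, 0.8], horizon=2000)
print(f"regret: {regret:.1f}, pulls per arm: {pulls}")
```

Note that the index requires the horizon in advance, which is consistent with the finite-horizon setting considered throughout.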
The problem of learning the qualities of answers on Q&A forums was first modeled under the MAB framework by Ghosh and Hummel [ghosh2013learning], where the generation of a new arm was considered to be the consequence of a strategic choice by an agent. Though this model captures the strategic aspects of the contributors, there is an important practical issue with such modelling: each agent, being a strategic attention seeker, is assumed to produce just enough effort to satisfy incentive compatibility in equilibrium. We do not assume an efforts-and-costs model and show that, even when the number of answers grows linearly with time, the proposed algorithm achieves sub-linear regret if the qualities of the arriving answers follow a certain mild distributional assumption.
Tang and Ho [tang2019bandit] consider a model with a fixed number of arms but with a platform where agents provide biased feedback. On such Q&A forums, it is more relevant to consider the problem with an increasing number of arms. A recent work by Liu and Ho [LIU18] limits the growth of the bandit arms by randomly dropping some of the arms from consideration and computing the regret with respect to only the considered arms. That is, they do not account for the regret incurred due to the randomly dropped arms.
8 Discussion and Future Work
In this paper, we presented a novel extension to the classical MAB model, which we call the Ballooning bandits model (BL-MAB). We showed that, in the absence of any distributional assumption on the arrival of the best quality arm, it is impossible to achieve a sub-linear regret. We proposed an algorithm for the BL-MAB model and provided sufficient conditions under which the proposed algorithm achieves sub-linear regret. In particular, when the arrival distribution of the best quality arm has a sub-exponential or sub-Pareto tail, our algorithm BL-Moss achieves sub-linear regret by restricting the number of arms to be explored in an intelligent way. Our results indicate that the number of arms to be explored depends on the distributional parameters, namely, (for the sub-exponential case) and (for the sub-Pareto case), which must be known to the algorithm. It would be interesting to see how a learning algorithm can be designed to learn these parameters as well. As noted earlier, we only consider a structure on the arrival of the best arm. One could also consider a more sophisticated arrival process of the arms to obtain tighter regret guarantees.
Appendix A Omitted Proofs
Suppose there exists an arm such that , then there exists a corresponding arm such that . Construct a set . As the best arm is uniformly distributed, both and have equal probability of being an optimal arm. Thus, we have . Hence, the internal regret of the algorithm will be the same. The external regret will decrease as . ∎
The forward direction is straightforward. As is a one-to-one function on the non-negative domain, we have . Let then we have . To show that implies , observe that . We get the required result by taking the inverse. ∎
Observe that . Note that is continuous, one-to-one, and strictly increasing. Hence, its inverse, , is also increasing. ∎
By definition, the Lambert function satisfies . It is easy to see that . Further we have,
Let . We have that . We also have that . Hence, is increasing, i.e., for all . This shows that .
Now, let . Here, we have , and (since for ). So, for all , implying that . This completes the proof. ∎
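The two properties of the Lambert function that these proofs rely on, its defining identity and its monotonicity on the non-negative domain, can be checked numerically. The sketch below writes the function as W (our notation here, assumed to be the principal branch), so that the defining identity reads W(x) e^{W(x)} = x, and evaluates it by Newton's method. The function name `lambert_w` and the tolerance are hypothetical choices for illustration.

```python
import math


def lambert_w(x, tol=1e-12, max_iter=100):
    """Principal branch W(x) for x >= 0, i.e. the w >= 0 with w * e^w = x.

    Newton's method on f(w) = w * e^w - x. The start log(1 + x) lies at or
    above the root, and f is convex and increasing for w >= 0, so the
    iterates decrease monotonically towards W(x).
    """
    if x < 0:
        raise ValueError("this sketch only handles x >= 0")
    w = math.log1p(x)
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) <= tol * (1.0 + abs(w)):
            break
    return w


# Check the defining identity and monotonicity on a small grid.
grid = [0.0, 0.1, 1.0, math.e, 10.0, 1e3, 1e6]
values = [lambert_w(x) for x in grid]
for x, w in zip(grid, values):
    assert abs(w * math.exp(w) - x) <= 1e-9 * (1.0 + x)
assert all(a < b for a, b in zip(values, values[1:]))
print([round(w, 6) for w in values])
```

A library routine such as `scipy.special.lambertw` could be used instead; the hand-written iteration only keeps the example free of external dependencies.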
We have the following equivalent inequalities.
The second-to-last inequality is obtained by applying the monotone increasing function on both sides and then using Definition 1 of the Lambert function. ∎
is decreasing in for .
For , we have