In a classical multi-armed bandit (MAB) problem, the objective is to find a strategy (or policy) that sequentially explores and exploits sources of gain, referred to as arms, so as to maximize the expected cumulative gain. Each arm
is characterized by an unknown probability distribution. At each round , a strategy picks an arm and receives a random reward sampled from distribution . Whereas usual strategies aim at finding and exploiting the arm with highest expectation, the quantity of interest in many applications such as medicine, insurance or finance may not be the sum of the rewards, but rather the extreme observations (even if it might mean replacing loss minimization by gain maximization in the formulation of the practical problem). In such situations, classical bandit algorithms can be significantly sub-optimal: the "best" arm should not be defined as that with highest expectation, but as that producing the maximal values. This setting, referred to as extreme bandits in Carpentier and Valko (2014), was originally introduced by Cicirello and Smith (2005) under the name of max K-armed bandit problem. In this framework, the goal pursued is to obtain the highest possible reward during the first steps. For a given arm , we denote by
the maximal value taken until round and assume that, in expectation, there is a unique optimal arm
The expected regret of a strategy is here defined as
where is the maximal value observed when implementing strategy . When the supports of the reward distributions (i.e. the ’s) are bounded, no-regret is expected provided that every arm can be sufficiently explored, see Nishihara et al. (2016) (see also David and Shimkin (2016) for a PAC approach). If infinitely many arms are possibly involved in the learning strategy, the challenge is then to explore and exploit optimally the unknown reservoir of arms, see Carpentier and Valko (2015). When the rewards are unbounded, in contrast, the situation is quite different: the best arm is that for which the maximum tends to infinity faster than for the others. In Nishihara et al. (2016), it is shown that, for unbounded distributions, no policy can achieve no-regret without restrictive assumptions on the distributions. In accordance with the literature, we focus on a classical framework in extreme value analysis. Namely, we assume that the reward distributions are heavy-tailed. Such Pareto-like laws are widely used to model extremes in many applications where a conservative approach to risk assessment might be relevant (e.g. finance, environmental risks). As in Carpentier and Valko (2014), rewards are assumed to be distributed as second order Pareto laws in the present article. For the sake of completeness, we recall that a probability law with cdf belongs to the -second order Pareto family if, for every ,
where are strictly positive constants, see e.g. Resnick (2007). In this context, Carpentier and Valko (2014) have proposed the ExtremeHunter algorithm to solve the extreme bandit problem and provided a regret analysis.
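To make the heavy-tailed setting concrete, the following minimal sketch draws rewards from an exact Pareto law, the simplest member of the second order Pareto family. The parametrization (survival function C x^(-alpha) on x >= C^(1/alpha)) and the names `sample_pareto` and `max_reward` are assumptions made for this illustration only; they are not taken from the paper.

```python
import random

def sample_pareto(alpha, C, rng):
    """Draw one reward with P(X > x) = C * x**(-alpha) for x >= C**(1/alpha).

    Assumed parametrization (illustrative): inverse-transform sampling.
    """
    u = 1.0 - rng.random()  # uniform on (0, 1], so the division below is safe
    return (C / u) ** (1.0 / alpha)

def max_reward(alpha, C, n, rng):
    """Largest reward among n independent draws from the same arm."""
    return max(sample_pareto(alpha, C, rng) for _ in range(n))

rng = random.Random(0)
samples = [sample_pareto(2.0, 1.0, rng) for _ in range(20000)]
# With alpha = 2 and C = 1, the theoretical tail probability at x = 2 is 1/4.
tail_fraction = sum(x > 2.0 for x in samples) / len(samples)
```

The empirical fraction of samples above 2 should be close to C * 2^(-alpha) = 0.25, which is a quick sanity check of the sampler.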
The contribution of this paper is twofold. First, the regret analysis of the ExtremeHunter algorithm is significantly improved, in a nearly optimal fashion. This essentially relies on a new technical result of independent interest (see Theorem 2.1 below), which provides a bound for the difference between the expectation of the maximum among independent realizations of a -second order Pareto distribution, namely, and its rough approximation . As a by-product, we propose a simpler Explore-Then-Commit strategy that offers the same theoretical guarantees as ExtremeHunter. Second, we explain how the extreme bandit problem can be reduced, to a certain extent, to a classical bandit problem. We show that a regret-minimizing strategy such as Robust UCB (see Bubeck et al. (2013)), applied to suitably left-censored rewards, may also perform very well. This claim is supported both by theoretical guarantees on the number of pulls of the best arm and by numerical experiments. From a practical angle, the main drawback of this alternative approach is that its implementation requires some knowledge of the complexity of the problem (i.e. of the gap between the first-order Pareto coefficients of the first and second arms). Regarding its theoretical analysis, efficiency is proved for large horizons only.
This paper is organized as follows. Section 2 presents the technical result mentioned above, which then allows us to carry out a refined regret analysis of the ExtremeHunter algorithm in Section 3. In Section 4, the regret bound thus obtained is proved to be nearly optimal: precisely, we establish a lower bound, under the assumption that the distributions are close enough to Pareto distributions, showing that the regret bound is sharp in this situation. In Section 5, the reduction of the extreme bandit problem to a classical bandit problem is explained at length, and an algorithm resulting from this original view is then described. Finally, we provide a preliminary numerical study that allows us to compare the two approaches from an experimental perspective. Due to space limitations, certain technical proofs are deferred to the Supplementary Material.
2 Second-order Pareto distributions: approximation of the expected maximum among i.i.d. realizations
In the extreme bandit problem, the key to controlling the behavior of explore-exploit strategies is to approximate the expected payoff of a fixed arm . The main result of this section, stated in Theorem 2.1, provides such control: it significantly improves upon the result originally obtained by Carpentier and Valko (2014) (see Theorem 1 therein). As shown in Section 3, this refinement has substantial consequences for the regret bound.
In Carpentier and Valko (2014), the distance between the expected maximum of independent realizations of a -second order Pareto and the corresponding expectation of a Fréchet distribution is controlled as follows:
Notice that the leading term of this bound is as . Below, we state a sharper result where, remarkably, this (exploding) term disappears, the contribution of the related component in the approximation error decomposition being proved as (asymptotically) negligible in contrast.
(Fréchet approximation bound) If are i.i.d. r.v.’s drawn from a -second order Pareto distribution with and , where is the constant depending only on and given in Eq. 3 below, then,
where . In particular, if , we have:
We emphasize that the bound above shows that the distance of to the Fréchet mean actually vanishes as as soon as , a property that shall be useful in Section 3 to study the behavior of learning algorithms in the extreme bandit setting.
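The vanishing gap can be checked by hand in the special case of an exact Pareto law, where the expected maximum has a closed form. The sketch below assumes the parametrization P(X > x) = C x^(-alpha) with alpha > 1, for which the maximum of n draws equals C^(1/alpha) times a Beta(1, n) variable raised to the power -1/alpha, and compares it with the mean (nC)^(1/alpha) Γ(1 - 1/alpha) of the rescaled Fréchet limit. Function names are hypothetical, and the example illustrates only this special case, not the general bound of the theorem.

```python
import math

def expected_max_exact_pareto(n, alpha, C):
    """E[max of n i.i.d. draws] when P(X > x) = C * x**(-alpha), alpha > 1.

    Closed form: C**(1/alpha) * Gamma(1 - 1/alpha) * Gamma(n + 1) / Gamma(n + 1 - 1/alpha).
    """
    a = 1.0 / alpha
    # Work with log-gamma to avoid overflow of Gamma(n + 1) for large n.
    log_ratio = math.lgamma(n + 1) - math.lgamma(n + 1 - a)
    return C ** a * math.gamma(1.0 - a) * math.exp(log_ratio)

def frechet_approximation(n, alpha, C):
    """Mean of the Fréchet limit, rescaled by (n * C)**(1/alpha)."""
    a = 1.0 / alpha
    return (n * C) ** a * math.gamma(1.0 - a)

def gap(n, alpha=2.0, C=1.0):
    """Absolute distance between the exact expected maximum and its approximation."""
    return abs(expected_max_exact_pareto(n, alpha, C) - frechet_approximation(n, alpha, C))

gaps = {n: gap(n) for n in (10**3, 10**4, 10**5)}
```

In this exact Pareto case the gap behaves like a constant times n^(1/alpha - 1), so it shrinks as n grows, in line with the statement above.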
Assume that , where
As in the proof of Theorem 1 in Carpentier and Valko (2014), we consider the quantity that serves as a cut-off between tail and bulk behaviors. Observe that
For , we set . Equipped with this notation, we may write
Instead of loosely bounding the bulk term by , we write
Second, the integral in Eq. 4 can be bounded as follows:
This concludes the proof.
3 The ExtremeHunter and ExtremeETC algorithms
In this section, the tighter control provided by Theorem 2.1 is used to refine the analysis of the ExtremeHunter algorithm (Algorithm 1) carried out in Carpentier and Valko (2014). This theoretical analysis is also shown to be valid for ExtremeETC, a novel algorithm we propose next, which greatly improves upon ExtremeHunter in terms of computational efficiency.
3.1 Further Notations and Preliminaries
Throughout the paper, the indicator function of any event is denoted by and means the complementary event of . We assume that the reward related to each arm is drawn from a -second order Pareto distribution. Sorting the tail indices by increasing order of magnitude, we use the classical notation for order statistics: . We assume that , so that the random rewards have finite expectations, and suppose that the strict inequality holds true. We also denote by the number of times the arm is pulled up to time . For and , the r.v. is the reward obtained at the -th draw of arm if or a new r.v. drawn from independent from the other r.v.’s otherwise.
We start with a preliminary lemma supporting the intuition that the tail index fully governs the extreme bandit problem. It will allow us to show next that the algorithm picks the right arm after the exploration phase, see Lemma 2.
(Optimal arm) For larger than some constant depending only on and , the optimal arm for the extreme bandit problem is given by:
We first prove the first equality. It follows from Theorem 2.1 that there exists a constant , depending only on and , such that for any arm , . Then for we have, for all , . Recalling that is proportional to , it follows that . Now consider the following quantity:
For , we have for any suboptimal arm , which proves the second equality.
From now on, we assume that is large enough for Lemma 1 to apply.
3.2 The ExtremeHunter algorithm (Carpentier and Valko, 2014)
Theorem 2.1 states that for any arm , . Consequently, the optimal strategy in hindsight always pulls the arm . At each round and for each arm , ExtremeHunter algorithm (Carpentier and Valko, 2014) estimates the coefficients and (but not , see Remark 2 in Carpentier and Valko (2014)). The corresponding confidence intervals are detailed below. Then, following the optimism-in-the-face-of-uncertainty principle (see (Auer et al., 2002) and references therein), the strategy plays the arm maximizing an optimistic plug-in estimate of . To that purpose, Theorem 3.8 in Carpentier and Kim (2014) and Theorem 2 in Carpentier et al. (2014) provide estimators and for and respectively, after draws of arm . Precisely, the estimate is given by
where is chosen in an adaptive fashion based on Lepski’s method, see (Lepskiĭ, 1990), while the estimator of considered is
The authors also provide finite sample error bounds for , where
with a known lower bound on the ’s (), and a constant depending only on and . These error bounds naturally define confidence intervals of respective widths and at level defined by
More precisely, we have
denoting by and some constants depending only on and . When , denote by and the estimators based on the observations for simplicity. ExtremeHunter’s index for arm at time , the optimistic proxy for , can be then written as
where if and otherwise.
On computational complexity. Notice that after the initialization phase, at each time , ExtremeHunter computes estimators and , each having a time complexity linear with the number of samples pulled from arm up to time . Summing on the rounds reveals that ExtremeHunter’s time complexity is quadratic with the time horizon .
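The quadratic growth can be seen with a simple operation count. The sketch below assumes, for illustration only, that recomputing the estimators at round t costs one unit per sample collected so far, so that round t costs t units in total across arms; the function name is hypothetical.

```python
def recompute_every_round_cost(horizon):
    """Total cost when linear-time estimators are recomputed at every round.

    Assumption (illustrative): round t costs t units, one per sample seen so far.
    """
    return sum(range(1, horizon + 1))  # equals horizon * (horizon + 1) / 2

costs = {T: recompute_every_round_cost(T) for T in (1000, 2000)}
# Doubling the horizon roughly quadruples the total cost.
ratio = costs[2000] / costs[1000]
```

Under this simplified cost model, the total is T(T + 1)/2, which is the quadratic behavior described above.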
3.3 ExtremeETC: a computationally appealing alternative
In order to reduce the restrictive time complexity discussed previously, we now propose the ExtremeETC algorithm, an Explore-Then-Commit version of ExtremeHunter, which offers similar theoretical guarantees.
After the initialization phase, the winner arm, which has maximal index , is fixed and is pulled in all remaining rounds. Then ExtremeETC’s time complexity, due to the computation of and only, is , which is considerably faster than quadratic time achieved by ExtremeHunter. For clarity, Table 1 summarizes time and memory complexities of both algorithms.
Due to the significant gain of computational time, we used the ExtremeETC algorithm in our simulation study (Section 6) rather than ExtremeHunter.
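The explore-then-commit structure can be sketched as follows. This is a simplified stand-in, not the algorithm analyzed here: it replaces the adaptive tail-index estimator and the optimistic index by a plain Hill estimator, ignores the second coefficient, and uses exact Pareto rewards with an assumed parametrization P(X > x) = C x^(-alpha). All function names and parameter values are hypothetical.

```python
import math
import random

def sample_pareto(alpha, C, rng):
    """Inverse-transform sample with P(X > x) = C * x**(-alpha) (assumed form)."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return (C / u) ** (1.0 / alpha)

def hill_inverse_tail_index(samples, k):
    """Hill estimate of 1/alpha based on the k largest observations."""
    top = sorted(samples, reverse=True)[: k + 1]
    return sum(math.log(x / top[k]) for x in top[:k]) / k

def explore_then_commit(arms, horizon, n_explore, k, rng):
    """Pull every arm n_explore times, then commit to the heaviest estimated tail.

    arms is a list of (alpha, C) pairs. Returns (committed arm, best reward seen).
    """
    rewards = [[sample_pareto(a, c, rng) for _ in range(n_explore)] for a, c in arms]
    scores = [hill_inverse_tail_index(r, k) for r in rewards]
    winner = max(range(len(arms)), key=scores.__getitem__)
    best = max(max(r) for r in rewards)
    alpha, C = arms[winner]
    # Commit phase: no further estimation, only pulls of the selected arm.
    for _ in range(horizon - n_explore * len(arms)):
        best = max(best, sample_pareto(alpha, C, rng))
    return winner, best

rng = random.Random(1)
winner, best = explore_then_commit(
    [(3.0, 1.0), (1.5, 1.0)], horizon=5000, n_explore=500, k=100, rng=rng
)
```

Because estimation happens only once, at the end of the exploration phase, the commit phase adds no estimation cost, which is the design choice that separates this scheme from an index recomputed at every round.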
Controlling the number of suboptimal rounds. We introduce a high probability event that corresponds to the favorable situation where, at each round, all coefficients simultaneously belong to the confidence intervals recalled in the previous subsection.
The event is the event on which the bounds
hold true for any and .
The union bound combined with (11) yields
For , where is the constant defined in (15), ExtremeETC and ExtremeHunter always pull the optimal arm after the initialization phase on the event . Hence, for any suboptimal arm , we have on :
Here we place ourselves on the event . For any arm , Lemma 1 in Carpentier and Valko (2014) provides lower and upper bounds for when
where is a constant which depends only on and C’. Introduce the horizon , which depends on and C’
If , we have under the event that for any suboptimal arm and any time that .
Hence the optimal arm is pulled at any time .
The following result immediately follows from Lemma 2.
For larger than some constant depending only on and we have under
Upper bounding the expected extreme regret. The upper bound on the expected extreme regret stated in the theorem below improves upon that given in Carpentier and Valko (2014) for ExtremeHunter. It is also valid for ExtremeETC.
For ExtremeETC and ExtremeHunter, the expected extreme regret is upper bounded as follows
as . If , we have in particular as .
The proof of Theorem 3.1 is deferred to Appendix 0.A. It closely follows that of Theorem 2 in Carpentier and Valko (2014), the main difference being that their concentration bound (Theorem 1 therein) can be replaced by our tighter bound (see Theorem 2.1 in the present paper). Recall that in Theorem 2 in Carpentier and Valko (2014), the upper bound on the expected extreme regret for ExtremeHunter goes to infinity when :
4 Lower bound on the expected extreme regret
In this section we prove a lower bound on the expected extreme regret for ExtremeETC and ExtremeHunter in specific cases. We assume now that and we start with a preliminary result on second order Pareto distributions, proved in Appendix 0.A.
If is a r.v. drawn from a -second order Pareto distribution and is a strictly positive constant, the distribution of the r.v. is a -second order Pareto.
In order to prove the lower bound on the expected extreme regret, we first establish that the event corresponding to the situation where the highest reward obtained by ExtremeETC and ExtremeHunter comes from the optimal arm occurs with overwhelming probability. Precisely, we denote by the event such that the bound
holds true. The following lemma, proved in Appendix 0.A, provides a control of its probability of occurrence.
For larger than some constant depending only on and , the following assertions hold true.
where is given in Eq. 10.
Under the event , the maximum reward obtained by ExtremeETC and ExtremeHunter comes from the optimal arm:
The following lower bound shows that the upper bound (Theorem 3.1) is actually tight in the case .
If and , the expected extreme regret of ExtremeETC and ExtremeHunter are lower bounded as follows
Here, refers to either ExtremeETC or else ExtremeHunter. In order to bound from below , we start with bounding as follows
In addition, in the sum of expectations on the right-hand-side of Eq. 17, may be roughly bounded from above by . A straightforward application of Hölder inequality yields
From in Lemma 5 and Eq. 13, we have . By virtue of Lemma 4, the r.v. follows a -second order Pareto distribution. Then, applying Theorem 2.1 to the right-hand side of (19) and using the identity (18), the upper bound (17) becomes
5 A reduction to classical bandits
The goal of this section is to make explicit the connections between the max K-armed bandit considered in the present paper and a particular instance of the classical Multi-Armed Bandit (MAB) problem.
5.1 MAB setting for extreme rewards
In a situation where only the large rewards matter, an alternative to the max K-armed problem would be to consider the expected cumulative sum of the most 'extreme' rewards, that is, those which exceed a given high threshold . For and , we denote by these new rewards
In this context, the classical MAB problem consists in maximizing the expected cumulative gain
It turns out that for a high enough threshold , the unique optimal arm for this MAB problem, , is also the optimal arm for the max K-armed problem. We still assume second order Pareto distributions for the random variables and that all the hypotheses listed in Section 3.1 hold true. The rewards are also heavy-tailed so that it is legitimate to attack this MAB problem with the Robust UCB algorithm (Bubeck et al., 2013), which assumes that the rewards have finite moments of order
where and are known constants. Given our second order Pareto assumptions, it follows that Eq. 21 holds with . Even if the knowledge of such constants and is a strong assumption, it is still fair to compare Robust UCB to ExtremeETC/Hunter, which also has strong requirements. Indeed, ExtremeETC/Hunter assumes that and are known and verify conditions depending on unknown problem parameters (e.g. , see Eq. 3).
The following lemma, whose proof is postponed to Appendix 0.A, ensures that the two bandit problems are equivalent for high thresholds.
then the unique best arm for the MAB problem is .
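The effect of the threshold can be illustrated in closed form for exact Pareto arms. The sketch below assumes that the censored reward is the original reward when it exceeds the threshold and zero otherwise, and that each arm has survival function C x^(-alpha) with alpha > 1; both the parametrization and the arm values are illustrative choices, not the settings of the paper.

```python
def thresholded_mean(alpha, C, u):
    """E[X * 1{X > u}] when P(X > x) = C * x**(-alpha), alpha > 1 (assumed form)."""
    scale = C ** (1.0 / alpha)  # lower end of the support
    u = max(u, scale)           # a threshold below the support censors nothing
    return C * alpha / (alpha - 1.0) * u ** (1.0 - alpha)

light_tail = (3.0, 1000.0)  # (alpha, C): larger mean, lighter tail
heavy_tail = (1.5, 1.0)     # (alpha, C): smaller mean, heavier tail

for u in (0.0, 100.0):
    means = (thresholded_mean(*light_tail, u), thresholded_mean(*heavy_tail, u))
    print(f"threshold {u}: light tail {means[0]:.3f}, heavy tail {means[1]:.3f}")
```

Without censoring, the light-tailed arm has the larger mean, whereas at threshold 100 the heavy-tailed arm has the larger expected censored reward, so a regret-minimizing strategy run on censored rewards would favor the arm that also produces the largest maxima.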
Tuning the threshold based on the data is a difficult question, outside our scope. A standard practice is to monitor a relevant output (e.g. estimate of ) as a function of the threshold and to pick the latter as low as possible in the stability region of the output. This is related to the Lepski’s method, see e.g. Boucheron and Thomas (2015), Carpentier and Kim (2014), Hall and Welsh (1985).
5.2 Robust UCB algorithm (Bubeck et al., 2013)
For the sake of completeness, we recall below the main feature of Robust UCB and make explicit its theoretical guarantees in our setting. The bound stated in the following proposition is a direct consequence of the regret analysis conducted by Bubeck et al. (2013).
Applying the Robust UCB algorithm of (Bubeck et al., 2013) to our MAB problem, the expected number of times we pull any suboptimal arm is upper bounded as follows
See proof of Proposition 1 in Bubeck et al. (2013).
Hence, in expectation, Robust UCB pulls suboptimal arms fewer times than ExtremeETC/Hunter. Indeed with ExtremeETC/Hunter, .
Proposition 1 may be an indication that the Robust UCB approach performs better than ExtremeETC/Hunter. Nevertheless, guarantees on its expected extreme regret require sharp concentration bounds on (), which is out of the scope of this paper and left for future work.
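Robust UCB replaces the empirical mean by a robust mean estimator. As an illustration, the sketch below implements the median-of-means estimator, one of the robust estimators considered by Bubeck et al. (2013); whether this particular estimator is the one used in the experiments is not stated here, so it should be read as an assumption. The function name and the toy data are hypothetical.

```python
import statistics

def median_of_means(values, n_blocks):
    """Split values into n_blocks consecutive blocks and return the median of block means."""
    block_size = len(values) // n_blocks
    if block_size == 0:
        raise ValueError("need at least one value per block")
    means = [
        sum(values[i * block_size:(i + 1) * block_size]) / block_size
        for i in range(n_blocks)
    ]
    return statistics.median(means)

rewards = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3000.0]
robust = median_of_means(rewards, 3)  # block means are 2.0, 2.0 and 1001.0
plain = sum(rewards) / len(rewards)   # dominated by the single very large reward
```

The single very large observation moves the empirical mean to 335 while the median of means stays at 2, which is the kind of stability that makes confidence bounds possible under heavy tails.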
6 Numerical experiments
In order to illustrate some aspects of the theoretical results presented previously, we consider a time horizon with arms and exact Pareto distributions with parameters given in Table 2. Here, the optimal arm is the second one (incidentally, the distribution with highest mean is the first one).
Figure 1: Extreme regret averaged over the simulated trajectories; panel (b) uses a log-log scale, with fitted linear regressions.
We have implemented Robust UCB with parameters , which satisfies , achieving the equality in Eq. 21 (ideal case) and a threshold equal to the lower bound in Eq. 22 plus to respect the strict inequality. ExtremeETC is run with . In this setting, the most restrictive condition on the time horizon, (given by Eq. 9), is satisfied, which places us in the validity framework of ExtremeETC. The resulting strategies are compared to each other and to the random strategy pulling each arm uniformly at random, but not to the Threshold Ascent algorithm (Streeter and Smith, 2006), which is designed only for bounded rewards. Precisely, simulations have been run and Figure 1 depicts the extreme regret (1) in each setting averaged over these trajectories. These experiments empirically support the theoretical bounds in Theorem 3.1: the expected extreme regret of ExtremeETC converges to zero for large horizons. On the log-log scale (Fig. 1(b)), ExtremeETC’s extreme regret starts decreasing linearly after the initialization phase, at , which is consistent with Lemma 2. The corresponding linear regression reveals a slope (with a coefficient of determination ), which confirms Theorem 3.1 and Theorem 4.1 yielding the theoretical slope .
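The slope check on the log-log scale amounts to an ordinary least-squares fit of the logarithm of the regret against the logarithm of time. The sketch below shows this computation on synthetic values following an exact power law; the function name, the time grid and the exponent -0.5 are illustrative assumptions and do not reproduce the experimental figures.

```python
import math

def loglog_slope(times, values):
    """Least-squares slope of log(values) regressed on log(times)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

times = [10**3, 10**4, 10**5]
synthetic_regret = [3.0 * t ** -0.5 for t in times]  # exact power law, exponent -0.5
slope = loglog_slope(times, synthetic_regret)
```

On an exact power law the fitted slope recovers the exponent, so comparing the fitted slope with a theoretical rate is a direct way to read a regret curve on the log-log scale.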
This paper brings two main contributions. First, it provides a refined regret bound analysis of the performance of the ExtremeHunter algorithm in the context of the max K-armed bandit problem, which significantly improves upon the results obtained in the seminal contribution of Carpentier and Valko (2014) and is also proved to be valid for ExtremeETC, a computationally appealing alternative we introduce. In particular, the obtained upper bound on the regret converges to zero for large horizons and is shown to be tight when the tail of the rewards is sufficiently close to a Pareto tail (second order parameter ). Second, this paper offers a novel view of this approach, interpreted here as a specific version of a classical solution (Robust UCB) of the MAB problem, in the situation when only very large rewards matter.
Based on these encouraging results, several lines of further research can be sketched. In particular, future work will investigate to which extent the lower bound established for ExtremeETC/Hunter holds true for any strategy with exploration stage of the same duration, and whether improved performance is achievable with alternative stopping criteria for the exploration stage.
This work was supported by a public grant (Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH) and by the industrial chair Machine Learning for Big Data from Télécom ParisTech.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256.
- Boucheron and Thomas (2015) Boucheron, S. and Thomas, M. (2015). Tail index estimation, concentration and adaptivity. Electron. J. Statist., 9(2):2751–2792.
- Bubeck et al. (2013) Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
- Carpentier and Kim (2014) Carpentier, A. and Kim, A. K. (2014). Adaptive and minimax optimal estimation of the tail coefficient. Statistica Sinica, 25:1133–1144.
- Carpentier et al. (2014) Carpentier, A., Kim, A. K., et al. (2014). Honest and adaptive confidence interval for the tail coefficient in the Pareto model. Electronic Journal of Statistics, 8(2):2066–2110.
- Carpentier and Valko (2014) Carpentier, A. and Valko, M. (2014). Extreme bandits. In Advances in Neural Information Processing Systems 27, pages 1089–1097. Curran Associates, Inc.
- Carpentier and Valko (2015) Carpentier, A. and Valko, M. (2015). Simple regret for infinitely many armed bandits. In Proceedings of The 32nd International Conference on Machine Learning, pages 1133–1141.
- Cicirello and Smith (2005) Cicirello, V. A. and Smith, S. F. (2005). The max k-armed bandit: A new model of exploration applied to search heuristic selection. In The Proceedings of the Twentieth National Conference on Artificial Intelligence, volume 3, pages 1355–1361. AAAI Press.
- David and Shimkin (2016) David, Y. and Shimkin, N. (2016). PAC lower bounds and efficient algorithms for the max k-armed bandit problem. In Proceedings of The 33rd International Conference on Machine Learning.
- Hall and Welsh (1985) Hall, P. and Welsh, A. H. (1985). Adaptive estimates of parameters of regular variation. Ann. Statist., 13(1):331–341.
- Lepskiĭ (1990) Lepskiĭ, O. V. (1990). A problem of adaptive estimation in Gaussian white noise. Teor. Veroyatnost. i Primenen., 35(3):459–470.
- Nishihara et al. (2016) Nishihara, R., Lopez-Paz, D., and Bottou, L. (2016). No regret bound for extreme bandits. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS).
- Resnick (2007) Resnick, S. (2007). Heavy-Tail Phenomena: Probabilistic and Statistical Modeling. Springer.
- Streeter and Smith (2006) Streeter, M. J. and Smith, S. F. (2006). A simple distribution-free approach to the max k-armed bandit problem. In International Conference on Principles and Practice of Constraint Programming, pages 560–574. Springer.
Appendix 0.A Appendix
0.a.1 Proof of Lemma 3
For (defined in Eq. 6), one has