# Almost Optimal Algorithms for Linear Stochastic Bandits with Heavy-Tailed Payoffs

In linear stochastic bandits, it is commonly assumed that payoffs come with sub-Gaussian noise. In this paper, under a weaker assumption on the noise, we study the problem of linear stochastic bandits with heavy-tailed payoffs (LinBET), where the distributions have finite moments of order $1+\epsilon$, for some $\epsilon\in(0,1]$. We rigorously analyze the regret lower bound of LinBET as $\Omega(T^{\frac{1}{1+\epsilon}})$, implying that finite moments of order 2 (i.e., finite variances) yield the bound of $\Omega(\sqrt{T})$, with $T$ being the total number of rounds to play bandits. The provided lower bound also indicates that the state-of-the-art algorithms for LinBET are far from optimal. By adopting median of means with a well-designed allocation of decisions and truncation based on historical information, we develop two novel bandit algorithms, where the regret upper bounds match the lower bound up to polylogarithmic factors. To the best of our knowledge, we are the first to solve LinBET optimally in the sense of the polynomial order on $T$. Our proposed algorithms are evaluated on synthetic datasets, and outperform the state-of-the-art results.

## 1 Introduction

The decision-making model named Multi-Armed Bandits (MAB), where at each time step an algorithm chooses an arm among a given set of arms and then receives a stochastic payoff with respect to the chosen arm, elegantly characterizes the tradeoff between exploration and exploitation in sequential learning. The algorithm usually aims at maximizing cumulative payoffs over a sequence of rounds. A natural and important variant of MAB is linear stochastic bandits with the expected payoff of each arm satisfying a linear mapping from the arm information to a real number. The model of linear stochastic bandits enjoys some good theoretical properties, e.g., there exists a closed-form solution of the linear mapping at each time step in light of ridge regression. Many practical applications take advantage of MAB and its variants to control decision performance, e.g., online personalized recommendations (Li et al., 2010) and resource allocations (Lattimore et al., 2014).

In most previous studies of MAB and linear stochastic bandits, a common assumption is that noises in observed payoffs are sub-Gaussian conditional on historical information (Abbasi-Yadkori et al., 2011, Bubeck et al., 2012), which encompasses cases of all bounded payoffs and many unbounded payoffs, e.g., payoffs of an arm following a Gaussian distribution. However, there do exist practical scenarios of non-sub-Gaussian noises in observed payoffs for sequential decisions, such as high-probability extreme returns in investments for financial markets (Cont and Bouchaud, 2000) and fluctuations of neural oscillations (Roberts et al., 2015), which are called heavy-tailed noises. Thus, it is important to study thoroughly the theoretical behaviour of sequential decisions in the case of heavy-tailed noises.

Many practical distributions, e.g., Pareto distributions and Weibull distributions, are heavy-tailed, which means they exhibit higher tail probabilities than exponential family distributions. We consider a general characterization of heavy-tailed payoffs in bandits, where the distributions have finite moments of order $1+\epsilon$, where $\epsilon\in(0,1]$. When $\epsilon=1$, stochastic payoffs are generated from distributions with finite variances. When $\epsilon\in(0,1)$, stochastic payoffs are generated from distributions with infinite variances (Shao and Nikias, 1993). Note that, different from sub-Gaussian noises in the traditional bandit setting, noises from heavy-tailed distributions do not enjoy exponentially decaying tails, and thus make it more difficult to learn a parameter of an arm.
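To make the moment condition concrete, the following minimal sketch evaluates the closed-form raw moment of a Pareto distribution, which is finite only when the moment order is below the shape parameter. The function name and the parameter values (shape 2, scale 1) are illustrative assumptions, not settings taken from this paper.

```python
import math


def pareto_raw_moment(shape: float, scale: float, order: float) -> float:
    """Raw moment E[X^order] of a Pareto(shape, scale) random variable.

    The moment is finite only when order < shape; otherwise it diverges.
    """
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    if order >= shape:
        return math.inf
    return shape * scale**order / (shape - order)


# Hypothetical parameters: with shape 2, every moment of order 1 + eps with
# eps < 1 is finite, while the second moment (eps = 1) is infinite.
for eps in (0.25, 0.5, 1.0):
    print(eps, pareto_raw_moment(2.0, 1.0, 1.0 + eps))
```

Under these assumed parameters the distribution has finite moments of every order below 2 but an infinite second moment, so it falls in the infinite-variance regime described above.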

The regret of MAB with heavy-tailed payoffs has been well addressed by Bubeck et al. (2013), where stochastic payoffs have bounds on raw or central moments of order $1+\epsilon$. For MAB with finite variances (i.e., $\epsilon=1$), the regret of truncation algorithms or median of means recovers the optimal regret for MAB under the sub-Gaussian assumption. Recently, Medina and Yang (2016) investigated theoretical guarantees for the problem of linear stochastic bandits with heavy-tailed payoffs (LinBET). It is surprising to find that, for $\epsilon=1$, the regret of bandit algorithms by Medina and Yang (2016) to solve LinBET is $\tilde{O}(T^{\frac{3}{4}})$ (we use $\tilde{O}(\cdot)$ to omit polylogarithmic factors of $T$), which is far away from the regret of the state-of-the-art algorithms (i.e., $\tilde{O}(\sqrt{T})$) in linear stochastic bandits under the sub-Gaussian assumption (Dani et al., 2008a, Abbasi-Yadkori et al., 2011). Thus, the most interesting and non-trivial question is

Is it possible to recover the regret of $\tilde{O}(\sqrt{T})$ when $\epsilon=1$ for LinBET?

In this paper, we answer this question affirmatively. Specifically, we investigate the problem of LinBET characterized by finite $(1+\epsilon)$-th moments, where $\epsilon\in(0,1]$. The problem of LinBET raises several interesting challenges. The first challenge is the lower bound of the problem, which remains unknown. The technical issues come from the construction of an elegant setting for LinBET, and the derivation of a lower bound with respect to $\epsilon$. The second challenge is how to develop a robust estimator for the parameter in LinBET, because heavy-tailed noises greatly affect errors of the conventional least-squares estimator. It is worth mentioning that Medina and Yang (2016) have tried to tackle this challenge, but their estimators do not make full use of the contextual information of chosen arms to eliminate the effect of heavy-tailed noises, which eventually leads to large regrets. The third challenge is how to successfully adopt median of means and truncation to solve LinBET with regret upper bounds matching the lower bound as closely as possible.

#### Our Results.

First of all, we rigorously analyze the lower bound on the problem of LinBET, which enjoys a polynomial order on $T$ as $\Omega(T^{\frac{1}{1+\epsilon}})$. The lower bound provides two essential hints: one is that finite variances in LinBET yield a bound of $\Omega(\sqrt{T})$, and the other is that algorithms by Medina and Yang (2016) are sub-optimal. Then, we develop two novel bandit algorithms to solve LinBET based on the basic techniques of median of means and truncation. Both algorithms adopt the optimism in the face of uncertainty principle, which is common in bandit problems (Abbasi-Yadkori et al., 2011, Munos et al., 2014). The regret upper bounds of the two proposed algorithms, which are $\tilde{O}(T^{\frac{1}{1+\epsilon}})$, match the lower bound up to polylogarithmic factors. To the best of our knowledge, we are the first to solve LinBET almost optimally. We conduct experiments based on synthetic datasets, which are generated by Student's $t$-distribution and Pareto distribution, to demonstrate the effectiveness of our algorithms. Experimental results show that our algorithms outperform the state-of-the-art results. The contributions of this paper are summarized as follows:

• We provide the lower bound for the problem of LinBET characterized by finite $(1+\epsilon)$-th moments, where $\epsilon\in(0,1]$. In the analysis, we construct an elegant setting of LinBET, which results in a regret bound of $\Omega(T^{\frac{1}{1+\epsilon}})$ in expectation for any bandit algorithm.

• We develop two novel bandit algorithms, named MENU and TOFU (with details shown in Section 4). The MENU algorithm adopts median of means with a well-designed allocation of decisions and the TOFU algorithm adopts truncation via historical information. Both algorithms achieve the regret $\tilde{O}(T^{\frac{1}{1+\epsilon}})$ with high probability.

• We conduct experiments based on synthetic datasets to demonstrate the effectiveness of our proposed algorithms. By comparing our algorithms with the state-of-the-art results, we show improvements on cumulative payoffs for MENU and TOFU, which are strictly consistent with theoretical guarantees in this paper.

## 2 Preliminaries and Related Work

In this section, we first present preliminaries, i.e., notations and learning setting of LinBET. Then, we give a detailed discussion on the line of research for bandits with heavy-tailed payoffs.

### 2.1 Notations

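For a positive integer $K$, $[K]\triangleq\{1,2,\ldots,K\}$. Let the $\ell$-norm of a vector $x\in\mathbb{R}^d$ be $\|x\|_\ell\triangleq(x_1^\ell+\cdots+x_d^\ell)^{\frac{1}{\ell}}$, where $\ell\geq 1$ and $x_i$ is the $i$-th element of $x$ with $i\in[d]$. For $r\in\mathbb{R}$, its absolute value is $|r|$, its ceiling integer is $\lceil r\rceil$, and its floor integer is $\lfloor r\rfloor$. The inner product of two vectors $x,y$ is denoted by $x^\top y=\langle x,y\rangle$. Given a positive definite matrix $V\in\mathbb{R}^{d\times d}$, the weighted Euclidean norm of a vector $x\in\mathbb{R}^d$ is $\|x\|_V=\sqrt{x^\top Vx}$. $B(x,r)$ denotes a Euclidean ball centered at $x$ with radius $r\in\mathbb{R}_+$, where $\mathbb{R}_+$ is the set of positive numbers. Let $e$ be Euler's number, and $I_d\in\mathbb{R}^{d\times d}$ an identity matrix. Let $\mathbb{1}_{\{\cdot\}}$ be an indicator function, and $\mathbb{E}[X]$ the expectation of $X$.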

### 2.2 Learning Setting

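For a bandit algorithm $\mathcal{A}$, we consider sequential decisions with the goal to maximize cumulative payoffs, where the total number of rounds for playing bandits is $T$. For each round $t=1,\ldots,T$, the bandit algorithm $\mathcal{A}$ is given a decision set $D_t\subseteq\mathbb{R}^d$ such that $\|x\|_2\leq D$ for any $x\in D_t$. $\mathcal{A}$ has to choose an arm $x_t\in D_t$ and then observes a stochastic payoff $y_t(x_t)$. For notational simplicity, we also write $y_t=y_t(x_t)$. The expectation of the observed payoff for the chosen arm satisfies a linear mapping from the arm to a real number as $y_t(x_t)\triangleq\langle x_t,\theta_*\rangle+\eta_t$, where $\theta_*$ is an underlying parameter with $\|\theta_*\|_2\leq S$ and $\eta_t$ is a random noise. Without loss of generality, we assume $\mathbb{E}[\eta_t\mid\mathcal{F}_{t-1}]=0$, where $\mathcal{F}_{t-1}\triangleq\{x_1,\ldots,x_t\}\cup\{\eta_1,\ldots,\eta_{t-1}\}$ is a $\sigma$-filtration and $\mathcal{F}_0=\emptyset$. Clearly, we have $\mathbb{E}[y_t(x_t)\mid\mathcal{F}_{t-1}]=\langle x_t,\theta_*\rangle$. For an algorithm $\mathcal{A}$, to maximize cumulative payoffs is equivalent to minimizing the regret as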

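$$R(\mathcal{A},T)\triangleq\left(\sum_{t=1}^{T}\langle x_t^*,\theta_*\rangle\right)-\left(\sum_{t=1}^{T}\langle x_t,\theta_*\rangle\right)=\sum_{t=1}^{T}\langle x_t^*-x_t,\theta_*\rangle, \tag{1}$$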

where $x_t^*$ denotes the optimal decision at time $t$ for $\theta_*$, i.e., $x_t^*\in\arg\max_{x\in D_t}\langle x,\theta_*\rangle$. In this paper, we will provide a high-probability upper bound of $R(\mathcal{A},T)$ with respect to $\mathcal{A}$, and provide the lower bound for LinBET in expectation for any algorithm. The problem of LinBET is defined as below.
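As an illustration of the regret in Eq. (1), the sketch below sums, over rounds, the gap between the best expected payoff in the decision set and the expected payoff of the chosen arm. It assumes finite decision sets given as Python lists; the helper names `inner` and `cumulative_regret` are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def inner(x: Vector, theta: Vector) -> float:
    """Inner product <x, theta>."""
    return sum(a * b for a, b in zip(x, theta))


def cumulative_regret(
    decision_sets: Sequence[Sequence[Vector]],
    chosen: Sequence[Vector],
    theta_star: Vector,
) -> float:
    """Sum over rounds of <x_t^* - x_t, theta_*> for finite decision sets."""
    total = 0.0
    for arms, x_t in zip(decision_sets, chosen):
        best = max(inner(x, theta_star) for x in arms)  # payoff of x_t^*
        total += best - inner(x_t, theta_star)
    return total


# Two rounds over the same two arms; the first choice is sub-optimal.
arms = [(1.0, 0.0), (0.0, 1.0)]
print(cumulative_regret([arms, arms], [(0.0, 1.0), (1.0, 0.0)], (1.0, 0.0)))
```

Note that the regret is defined through expected payoffs, so it does not depend on the realized heavy-tailed noise.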

###### Definition 1 (LinBET).

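Given a decision set $D_t$ for time step $t=1,\ldots,T$, an algorithm $\mathcal{A}$, of which the goal is to maximize cumulative payoffs over $T$ rounds, chooses an arm $x_t\in D_t$. With $\mathcal{F}_{t-1}$, the observed stochastic payoff $y_t(x_t)$ is conditionally heavy-tailed, i.e., $\mathbb{E}[|y_t|^{1+\epsilon}\mid\mathcal{F}_{t-1}]\leq b$ or $\mathbb{E}[|y_t-\langle x_t,\theta_*\rangle|^{1+\epsilon}\mid\mathcal{F}_{t-1}]\leq c$, where $\epsilon\in(0,1]$, and $b,c\in(0,+\infty)$.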

### 2.3 Related Work

The model of MAB dates back to 1952 with the original work by Robbins et al. (1952), and its inherent characteristic is the trade-off between exploration and exploitation. The asymptotic lower bound of MAB was developed by Lai and Robbins (1985), which is logarithmic with respect to the total number of rounds. An important technique called upper confidence bound was developed to achieve the lower bound (Lai and Robbins, 1985, Agrawal, 1995). Other related techniques to solve the problem of sequential decisions include Thompson sampling (Thompson, 1933, Chapelle and Li, 2011, Agrawal and Goyal, 2012) and Gittins index (Gittins et al., 2011).

The problem of MAB with heavy-tailed payoffs characterized by finite $(1+\epsilon)$-th moments has been well investigated (Bubeck et al., 2013, Vakili et al., 2013, Yu et al., 2018). Bubeck et al. (2013) pointed out that finite variances in MAB are sufficient to achieve regret bounds of the same order as the optimal regret for MAB under the sub-Gaussian assumption, and the order of $T$ in regret bounds increases when $\epsilon$ decreases. The lower bound of MAB with heavy-tailed payoffs has been analyzed (Bubeck et al., 2013), and robust algorithms by Bubeck et al. (2013) are optimal. Theoretical guarantees by Bubeck et al. (2013) and Vakili et al. (2013) are for the setting of finite arms. In Vakili et al. (2013), primary theoretical results were presented for the case of $\epsilon>1$. We notice that the case of $\epsilon>1$ is not interesting, because it reduces to the case of finite variances in MAB.

For the problem of linear stochastic bandits, which is also named linear reinforcement learning by Auer (2002), the lower bound is $\Omega(d\sqrt{T})$ when contextual information of arms is from a $d$-dimensional space (Dani et al., 2008b). Bandit algorithms matching the lower bound up to polylogarithmic factors have been well developed (Auer, 2002, Dani et al., 2008a, Abbasi-Yadkori et al., 2011, Chu et al., 2011). Notice that all these studies assume that stochastic payoffs contain sub-Gaussian noises. More variants of MAB are discussed by Bubeck et al. (2012).

It is surprising to find that the lower bound of LinBET remains unknown. In Medina and Yang (2016), bandit algorithms based on truncation and median of means were presented. When $\epsilon$ is finite for LinBET, the algorithms by Medina and Yang (2016) cannot recover the bound of $\tilde{O}(\sqrt{T})$, which is the regret of the state-of-the-art algorithms in linear stochastic bandits under the sub-Gaussian assumption. Medina and Yang (2016) conjectured that it is possible to recover $\tilde{O}(\sqrt{T})$ with $\epsilon$ being a finite number for LinBET. Thus, a thorough analysis of the conjecture is needed, given the importance of heavy-tailed noises in real scenarios. Resolving the conjecture broadens the practical applications of bandit models. Practical motivating examples for bandits with heavy-tailed payoffs include delays in end-to-end network routing (Liebeherr et al., 2012) and sequential investments in financial markets (Cont and Bouchaud, 2000).

Recently, the assumption in stochastic payoffs of MAB was relaxed from sub-Gaussian noises to bounded kurtosis (Lattimore, 2017), which can be viewed as an extension of Bubeck et al. (2013). The interesting point of Lattimore (2017) is the scale free algorithm, which might be practical in applications. Besides, Carpentier and Valko (2014) investigated extreme bandits, where stochastic payoffs of MAB follow Fréchet distributions. The setting of extreme bandits fits the real scenario of anomaly detection without contextual information. The order of regret in extreme bandits is characterized by distributional parameters, which is similar to the results by Bubeck et al. (2013).

It is worth mentioning that, for linear regression with heavy-tailed noises, several interesting studies have been conducted. Hsu and Sabato (2016) proposed a generalized method in light of median of means for loss minimization with heavy-tailed noises. Heavy-tailed noises in Hsu and Sabato (2016) might come from contextual information, which is more complicated than the setting of stochastic payoffs in this paper. Therefore, linear regression with heavy-tailed noises usually requires a finite fourth moment. In Audibert et al. (2011), the basic technique of truncation was adopted to solve robust linear regression in the absence of exponential moment condition. The related studies in this line of research are not directly applicable to the problem of LinBET.

## 3 Lower Bound

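In this section, we provide the lower bound for LinBET. We consider heavy-tailed payoffs with finite $(1+\epsilon)$-th raw moments in the analysis. In particular, we construct the following setting. Assume $d$ is even (when $d$ is odd, similar results can be easily derived by considering the first $d-1$ dimensions). For $D_t\subseteq\mathbb{R}^d$ with $t\in[T]$, we fix the decision set as $D_1=\cdots=D_T=D_{(d)}$. Then, the fixed decision set is constructed as $D_{(d)}\triangleq\{(x_1,\ldots,x_d)\in\mathbb{R}_+^d : x_1+x_2=\cdots=x_{d-1}+x_d=1\}$, which is a subset of the intersection of the cube $[0,1]^d$ and the hyperplane $x_1+\cdots+x_d=d/2$. We define a set $S_d\triangleq\{(\theta_1,\ldots,\theta_d) : \forall i\in[d/2],\ (\theta_{2i-1},\theta_{2i})\in\{(2\Delta,\Delta),(\Delta,2\Delta)\}\}$ with $\Delta\in(0,1/d]$. The payoff functions take values in $\{0,(1/\Delta)^{\frac{1}{\epsilon}}\}$ such that, for every $x\in D_{(d)}$, the expected payoff is $\theta_*^\top x$. To be more specific, we have the payoff function of $x$ as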

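$$y(x)=\begin{cases}\left(\frac{1}{\Delta}\right)^{\frac{1}{\epsilon}} & \text{with a probability of } \Delta^{\frac{1}{\epsilon}}\theta_*^\top x,\\ 0 & \text{with a probability of } 1-\Delta^{\frac{1}{\epsilon}}\theta_*^\top x.\end{cases} \tag{2}$$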
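The two-point payoff in Eq. (2) can be checked numerically. The sketch below is a minimal illustration: it computes the nonzero payoff value and its probability, samples from the law, and verifies that the mean equals $\theta_*^\top x$ while the $(1+\epsilon)$-th raw moment equals $\theta_*^\top x/\Delta$. The function names and the numeric values of $\Delta$, $\epsilon$, $\theta_*$ and $x$ are assumptions chosen for illustration.

```python
import math
import random


def payoff_law(x, theta_star, delta, eps):
    """Return (value, prob): the nonzero payoff of Eq. (2) and its probability."""
    mean = sum(a * b for a, b in zip(x, theta_star))
    value = (1.0 / delta) ** (1.0 / eps)
    prob = delta ** (1.0 / eps) * mean
    if not 0.0 <= prob <= 1.0:
        raise ValueError("parameters do not define a valid probability")
    return value, prob


def sample_payoff(x, theta_star, delta, eps, rng):
    """Draw one payoff: `value` with probability `prob`, otherwise 0."""
    value, prob = payoff_law(x, theta_star, delta, eps)
    return value if rng.random() < prob else 0.0


# Hypothetical instance with d = 2 and the parameter tuple (2*delta, delta).
delta, eps = 0.1, 0.5
theta_star, x = (2 * delta, delta), (1.0, 0.0)
value, prob = payoff_law(x, theta_star, delta, eps)
mean = value * prob                        # equals <theta_star, x>
raw_moment = value ** (1.0 + eps) * prob   # equals <theta_star, x> / delta
print(mean, raw_moment)
print(sample_payoff(x, theta_star, delta, eps, random.Random(0)))
```

The payoff is large but rare, which is what keeps the $(1+\epsilon)$-th moment bounded while making the mean hard to estimate from few samples.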

We have the theorem for the lower bound of LinBET as below.

###### Theorem 1 (Lower Bound of LinBET).

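If $\theta_*$ is chosen uniformly at random from $S_d$, and the payoff for each $x\in D_{(d)}$ is in $\{0,(1/\Delta)^{\frac{1}{\epsilon}}\}$ with mean $\theta_*^\top x$, then for any algorithm $\mathcal{A}$ and every sufficiently large $T$, we have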

$$\mathbb{E}[R(\mathcal{A},T)]\geq\frac{d}{192}T^{\frac{1}{1+\epsilon}}. \tag{3}$$

In the proof of Theorem 1, we first prove the lower bound when $d=2$, and then generalize the argument to any $d>2$. We notice that the parameter in the original $d$-dimensional space is rearranged to $d/2$ tuples, each of which is a $2$-dimensional vector as $(\theta_{2i-1},\theta_{2i})$ with $i\in[d/2]$. If the $i$-th tuple of the parameter is selected as $(2\Delta,\Delta)$, then the $i$-th tuple of the optimal arm is $(1,0)$. In this case, if we define the $i$-th tuple of the chosen arm as $(x_{t,2i-1},x_{t,2i})$, the instantaneous regret is $\Delta(1-x_{t,2i-1})$. Then, the regret can be represented as an integration of $\Delta(1-x_{t,2i-1})$ over $D_{(d)}$. Finally, with common inequalities in information theory, we obtain the regret lower bound by setting $\Delta=T^{-\frac{\epsilon}{1+\epsilon}}/12$.
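As a worked illustration of the per-tuple regret, assume (as a reading of the construction, not a quotation of the proof) that the $i$-th parameter tuple is $(2\Delta,\Delta)$ and that the chosen arm's $i$-th tuple is $(a,1-a)$ with $a\in[0,1]$. The optimal tuple $(1,0)$ earns $2\Delta$ in expectation, so the gap is

$$2\Delta-\bigl(2\Delta a+\Delta(1-a)\bigr)=\Delta(1-a),$$

which vanishes only when the arm puts all of its weight on the better coordinate. Summing such gaps over the $d/2$ tuples and over $T$ rounds shows why the lower bound scales linearly in $d$.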

We notice that martingale differences to prove the lower bound for linear stochastic bandits in (Dani et al., 2008a) are not directly feasible for the proof of lower bound in LinBET, because under our construction of heavy-tailed payoffs (i.e., Eq. (4)), the information of is excluded. Besides, our proof is partially inspired by Bubeck (2010). We show the detailed proof of Theorem 1 in Appendix A.

#### Remark 1.

The above lower bound provides two essential hints: one is that finite variances in LinBET yield a bound of $\Omega(\sqrt{T})$, and the other is that algorithms proposed by Medina and Yang (2016) are far from optimal. The result in Theorem 1 strongly indicates that it is possible to design bandit algorithms recovering $\tilde{O}(\sqrt{T})$ with finite variances.

## 4 Algorithms and Upper Bounds

In this section, we develop two novel bandit algorithms to solve LinBET, which turn out to be almost optimal. We rigorously prove regret upper bounds for the proposed algorithms. In particular, our core idea is based on the optimism in the face of uncertainty principle (OFU). The first algorithm is median of means under OFU (MENU) shown in Algorithm 1, and the second algorithm is truncation under OFU (TOFU) shown in Algorithm 2. For comparison, we refer to the bandit algorithm based on median of means in Medina and Yang (2016) as MoM, and to the bandit algorithm based on confidence region with truncation in Medina and Yang (2016) as CRT.
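For readers unfamiliar with the two basic techniques, the sketch below shows their scalar forms for estimating a mean from heavy-tailed samples. It is a generic illustration in the spirit of Bubeck et al. (2013), not the MENU or TOFU procedure; the block count and the truncation threshold are hypothetical inputs.

```python
import statistics


def median_of_means(samples, num_blocks):
    """Split samples into equal blocks, average each block, return the median."""
    if not 1 <= num_blocks <= len(samples):
        raise ValueError("num_blocks must be between 1 and len(samples)")
    size = len(samples) // num_blocks  # leftover samples are ignored
    means = [
        statistics.fmean(samples[i * size:(i + 1) * size])
        for i in range(num_blocks)
    ]
    return statistics.median(means)


def truncated_mean(samples, threshold):
    """Average the samples after zeroing those with magnitude above threshold."""
    kept = [y if abs(y) <= threshold else 0.0 for y in samples]
    return sum(kept) / len(samples)


data = [1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]  # one extreme observation
print(statistics.fmean(data))        # empirical mean, dominated by the outlier
print(median_of_means(data, 3))      # block means 1, 1, 500.5 give median 1
print(truncated_mean(data, 10.0))    # the outlier is zeroed before averaging
```

Both estimators limit the influence of a single extreme payoff, which the plain empirical mean cannot do.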

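Both algorithms in this paper adopt the tool of ridge regression. At time step $t$, let $\hat{\theta}_t$ be the $\ell^2$-regularized least-squares estimate (LSE) of $\theta_*$ as $\hat{\theta}_t=V_t^{-1}X_t^\top Y_t$, where $X_t\in\mathbb{R}^{t\times d}$ is a matrix of which rows are $x_1^\top,\ldots,x_t^\top$, $V_t\triangleq X_t^\top X_t+\lambda I_d$, $Y_t\triangleq(y_1,\ldots,y_t)$ is a vector of the historical observed payoffs until time $t$ and $\lambda>0$ is a regularization parameter.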
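The regularized least-squares estimate can be computed in a few lines. This is a minimal sketch assuming NumPy is available; the function name `ridge_lse` and the toy data are hypothetical.

```python
import numpy as np


def ridge_lse(X: np.ndarray, Y: np.ndarray, lam: float):
    """Return (theta_hat, V) with V = X^T X + lam * I and theta_hat = V^{-1} X^T Y."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    # Solving the linear system avoids forming the explicit inverse of V.
    theta_hat = np.linalg.solve(V, X.T @ Y)
    return theta_hat, V


# Toy data: two rounds, each playing one coordinate arm.
X = np.eye(2)
Y = np.array([1.0, 2.0])
theta_hat, V = ridge_lse(X, Y, lam=1.0)
print(theta_hat)
```

Because each entry of the estimate is a weighted sum of observed payoffs, a single extreme payoff can move it arbitrarily far, which motivates the robust variants developed in this section.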

### 4.1 MENU and Regret

#### Description of MENU.

To conduct median of means in LinBET, it is common to allocate pulls of bandits among epochs, and for each epoch the same arm is played multiple times to obtain an estimate of $\theta_*$. We find that there exist different ways to construct the epochs. We design the framework of MENU in Figure 1(a), and show the framework of MoM designed by Medina and Yang (2016) in Figure 1(b). MENU and MoM differ in three ways. First, within each epoch, MENU and MoM play the same arm a different number of times. Second, at each epoch, MENU conducts multiple LSEs on the historical payoffs, while MoM conducts a single LSE based on intermediate payoffs calculated via median of means of observed payoffs. Third, MENU adopts median of means of LSEs, while MoM adopts median of means of the observed payoffs. Intuitively, the execution of multiple LSEs leads to the improved regret of MENU. With a better trade-off between the number of epochs and the number of plays per epoch in Figure 1(a), we derive an improved upper bound of regret in Theorem 2.
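One way to take a "median" of several vector-valued LSEs is the generalized median used in robust regression (in the spirit of Hsu and Sabato, 2016): pick the estimate whose median distance to the other estimates is smallest. The sketch below assumes this selection rule and a weighted Euclidean norm; it is an illustration rather than a transcription of Algorithm 1, and the function name is hypothetical.

```python
import numpy as np


def select_median_estimate(estimates: np.ndarray, V: np.ndarray) -> int:
    """Index of the estimate with the smallest median V-norm distance to the rest."""
    k = len(estimates)
    radii = []
    for j in range(k):
        dists = []
        for s in range(k):
            if s == j:
                continue
            diff = estimates[j] - estimates[s]
            dists.append(float(np.sqrt(diff @ V @ diff)))  # ||diff||_V
        radii.append(float(np.median(dists)))
    return int(np.argmin(radii))


# Four estimates cluster together; the last one is corrupted by an outlier.
estimates = np.array(
    [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.05], [10.0, -10.0]]
)
best = select_median_estimate(estimates, np.eye(2))
print(best, estimates[best])
```

As long as more than half of the estimates are accurate, the selected one cannot be the corrupted estimate, which is the property a median-based rule is meant to provide.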

In light of Figure 1(a), we develop algorithmic procedures in Algorithm 1 for MENU. We notice that, in order to guarantee the median of means of LSEs not far away from the true underlying parameter with high probability, we construct the confidence interval in Line 10 of Algorithm 1. Now we have the following theorem for the regret upper bound of MENU.

###### Theorem 2 (Regret Analysis for the MENU Algorithm).

Assume that for all and with , , and . Then, with probability at least , for every , the regret of the MENU algorithm satisfies

The technical challenges in MENU (i.e., Algorithm  1) and its proofs are discussed as follows. Based on the common techniques in linear stochastic bandits (Abbasi-Yadkori et al., 2011), to guarantee the instantaneous regret in LinBET, we need to guarantee with high probability. We attack this issue by guaranteeing with a probability of

, which could reduce to a problem of bounding a weighted sum of historical noises. Interestingly, by conducting singular value decomposition on

(of which rows are ), we find that -norm of the weights is no greater than . Then the weighted sum can be bounded by a term as . With a standard analysis in linear stochastic bandits from the instantaneous regret to the regret, we achieve the above results for MENU. We show the detailed proof of Theorem 2 in Appendix B.

#### Remark 2.

For MENU, we adopt the assumption of heavy-tailed payoffs on central moments, which is required in the basic technique of median of means (Bubeck et al., 2013). Besides, there exists an implicit mild assumption in Algorithm 1 that, at each epoch , the decision set must contain the selected arm at least times, which is practical in applications, e.g., online personalized recommendations (Li et al., 2010). The condition of is required for . The regret upper bound of MENU is , which implies that finite variances in LinBET are sufficient to achieve .

### 4.2 TOFU and Regret

#### Description of TOFU.

We demonstrate the algorithmic procedures of TOFU in Algorithm 2. We point out two subtle differences between our TOFU and the algorithm of CRT as follows. In TOFU, to obtain the accurate estimate of , we need to trim all historical payoffs for each dimension individually. Besides, the truncating operations depend on the historical information of arms. By contrast, in CRT, the historical payoffs are trimmed once, which is controlled only by the number of rounds for playing bandits. Compared to CRT, our TOFU achieves a tighter confidence interval, which can be found from the setting of . Now we have the following theorem for the regret upper bound of TOFU.
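The idea of trimming historical payoffs separately for each dimension can be sketched as follows. The ridge estimate factors as $V^{-1/2}(V^{-1/2}X^\top Y)$, so each coordinate of the inner vector is a weighted sum of payoffs whose individual terms can be truncated. This is an illustrative reading of the description above, not a transcription of Algorithm 2; the function name and the `threshold` argument are hypothetical, and NumPy is assumed.

```python
import numpy as np


def truncated_lse(X: np.ndarray, Y: np.ndarray, lam: float, threshold: float):
    """Ridge-style estimate whose per-dimension weighted payoffs are truncated."""
    d = X.shape[1]
    V = X.T @ X + lam * np.eye(d)
    # Symmetric inverse square root of V from its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(V)
    V_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    U = V_inv_sqrt @ X.T        # shape (d, t); row i holds the weights for dimension i
    contrib = U * Y             # contrib[i, j] = U[i, j] * Y[j]
    # Zero every weighted payoff whose magnitude exceeds the threshold.
    kept = np.where(np.abs(contrib) <= threshold, contrib, 0.0)
    return V_inv_sqrt @ kept.sum(axis=1)


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
Y = rng.normal(size=30)
print(truncated_lse(X, Y, lam=1.0, threshold=0.5))
```

With an infinite threshold nothing is trimmed and the function returns the ordinary ridge estimate, which gives a simple sanity check for the factorization.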

###### Theorem 3 (Regret Analysis for the TOFU Algorithm).

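Assume that for all $t$ and $x_t\in D_t$ with $\|x_t\|_2\leq D$, $\|\theta_*\|_2\leq S$, $|x_t^\top\theta_*|\leq L$ and $\mathbb{E}[|y_t|^{1+\epsilon}\mid\mathcal{F}_{t-1}]\leq b$. Then, with probability at least $1-\delta$, for every $T\geq 1$, the regret of the TOFU algorithm satisfies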

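$$R(\text{TOFU},T)\leq 2T^{\frac{1}{1+\epsilon}}\left(4\sqrt{d}\,b^{\frac{1}{1+\epsilon}}\left(\log\left(\frac{2dT}{\delta}\right)\right)^{\frac{\epsilon}{1+\epsilon}}+\lambda^{\frac{1}{2}}S+L\right)\sqrt{2d\log\left(1+\frac{TD^2}{\lambda d}\right)}.$$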

Similarly to the proof in Theorem 2, we can achieve the above results for TOFU. Due to space limitation, we show the detailed proof of Theorem 3 in Appendix C.

#### Remark 3.

For TOFU, we adopt the assumption of heavy-tailed payoffs on raw moments. It is worth pointing out that, when $\epsilon=1$, we have the regret upper bound for TOFU as $\tilde{O}(d\sqrt{T})$, which implies that we recover the same order of $d$ as that under the sub-Gaussian assumption (Abbasi-Yadkori et al., 2011). A weakness of TOFU is its high time complexity, because for each round TOFU needs to truncate all historical payoffs. The time complexity might be reasonably reduced by dividing $T$ into multiple epochs, each of which contains only one truncation.
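To see the finite-variance rate explicitly, substitute $\epsilon=1$ into the bound of Theorem 3:

$$R(\text{TOFU},T)\leq 2\sqrt{T}\left(4\sqrt{d\,b\log\left(\frac{2dT}{\delta}\right)}+\lambda^{\frac{1}{2}}S+L\right)\sqrt{2d\log\left(1+\frac{TD^2}{\lambda d}\right)}.$$

The leading term multiplies $\sqrt{T}$ by $\sqrt{d}\cdot\sqrt{d}=d$ and by logarithmic factors in $T$, so the bound is of order $d\sqrt{T}$ up to polylogarithmic factors.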

## 5 Experiments

In this section, we conduct experiments based on synthetic datasets to evaluate the performance of our proposed bandit algorithms: MENU and TOFU. For comparison, we adopt two baselines: MoM and CRT proposed by Medina and Yang (2016). We run multiple independent repetitions for each dataset on a personal computer running Windows 7 with a 3.70 GHz Intel CPU and 16 GB of memory.

### 5.1 Datasets and Setting

To show the effectiveness of bandit algorithms, we report cumulative payoffs with respect to the number of rounds for playing bandits over a fixed finite-arm decision set. For verification, we adopt four synthetic datasets (named S1–S4) in the experiments, of which statistics are shown in Table 1. The experiments on heavy tails require $c$ or $b$ to be known, which corresponds to the assumptions of Theorem 2 or Theorem 3. According to the required information, we can apply MENU or TOFU in practical applications. We adopt Student's $t$ and Pareto distributions because they are common in practice. For Student's $t$-distributions, we easily estimate $c$, while for Pareto distributions, we easily estimate $b$. Besides, we can choose different parameters (e.g., larger values) in the distributions, and recalculate the parameters $b$ and $c$.

For S1 and S2, which contain different numbers of arms and different dimensions for the contextual information, we adopt a standard Student's $t$-distribution to generate heavy-tailed noises. For the chosen arm $x_t$, the expected payoff is $x_t^\top\theta_*$, and the observed payoff adds a noise generated from a standard Student's $t$-distribution. We generate each dimension of contextual information for an arm, as well as the underlying parameter, from a uniform distribution. The standard Student's $t$-distribution determines the bound for the second central moment of S1 and S2.
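A synthetic instance of this kind can be generated with the standard library alone. The sketch below is illustrative only: the degrees of freedom (3), the uniform range $[0,1)$, and the numbers of arms and dimensions are assumptions, not the settings of S1 or S2. A Student's $t$ variate is built as a standard normal divided by the square root of a scaled chi-square variate.

```python
import math
import random


def student_t(rng: random.Random, dof: float) -> float:
    """Student's t variate: Z / sqrt(chi2 / dof), where chi2 ~ Gamma(dof/2, scale 2)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return z / math.sqrt(chi2 / dof)


def make_instance(rng: random.Random, num_arms: int, dim: int):
    """Draw arms and the underlying parameter entrywise from Uniform[0, 1)."""
    arms = [[rng.random() for _ in range(dim)] for _ in range(num_arms)]
    theta_star = [rng.random() for _ in range(dim)]
    return arms, theta_star


def observe(rng: random.Random, arm, theta_star, dof: float) -> float:
    """Expected payoff <arm, theta_star> plus heavy-tailed Student's t noise."""
    return sum(a * b for a, b in zip(arm, theta_star)) + student_t(rng, dof)


rng = random.Random(0)
arms, theta_star = make_instance(rng, num_arms=20, dim=10)  # hypothetical sizes
payoffs = [observe(rng, arms[0], theta_star, dof=3.0) for _ in range(5)]
print(payoffs)
```

With 3 degrees of freedom the noise has variance $3$ but infinite fourth moment, so it is heavy-tailed while still satisfying a second-central-moment bound.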

For S3 and S4, we adopt Pareto distribution, where the shape parameter is set as . We know implying . Then, we set leading to the bound of raw moment as . We take the maximum of among all arms as the bound of the -th raw moment. We generate arms and the parameter similar to S1 and S2.

In figures, we show the average of cumulative payoffs with time evolution over ten independent repetitions for each dataset, and show error bars of a standard variance for comparing the robustness of algorithms. For S1 and S2, we run MENU and MoM and set . For S3 and S4, we run TOFU and CRT and set . For all algorithms, we set , and .

### 5.2 Results and Discussions

We show experimental results in Figure 2. From the figure, we clearly find that our proposed two algorithms outperform MoM and CRT, which is consistent with the theoretical results in Theorems 2 and 3. We also evaluate our algorithms with other synthetic datasets, as well as different and , and observe similar superiority of MENU and TOFU. Finally, for further comparison on regret, complexity and storage of four algorithms, we list the results shown in Table 2.

## 6 Conclusion

We have studied the problem of LinBET, where stochastic payoffs are characterized by finite $(1+\epsilon)$-th moments with $\epsilon\in(0,1]$. We relaxed the traditional assumption of sub-Gaussian noises in payoffs of bandits, and derived theoretical guarantees based on the prior information of bounds on finite moments. We rigorously analyzed the lower bound of LinBET, and developed two novel bandit algorithms with regret upper bounds matching the lower bound up to polylogarithmic factors. The two proposed algorithms are based on median of means and truncation. In the sense of polynomial dependence on $T$, we provided optimal algorithms for the problem of LinBET, and thus solved an open problem, which has been pointed out by Medina and Yang (2016). Finally, our proposed algorithms have been evaluated on synthetic datasets, and outperformed the state-of-the-art results. Since both algorithms in this paper require a priori knowledge of $\epsilon$, future directions in this line of research include automatic learning of LinBET without information of distributional moments, and evaluation of our proposed algorithms in real-world scenarios.

## Acknowledgments

The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the General Research Fund), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative Research Award).

## References

• Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
• Agrawal (1995) R. Agrawal. Sample mean based index policies by regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
• Agrawal and Goyal (2012) S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
• Audibert et al. (2011) J.-Y. Audibert, O. Catoni, et al. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.
• Auer (2002) P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
• Bubeck (2010) S. Bubeck. Bandits games and clustering foundations. PhD thesis, Université des Sciences et Technologie de Lille-Lille I, 2010.
• Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
• Bubeck et al. (2013) S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
• Carpentier and Valko (2014) A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097, 2014.
• Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.
• Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
• Cont and Bouchaud (2000) R. Cont and J.-P. Bouchaud. Herd behavior and aggregate fluctuations in financial markets. Macroeconomic Dynamics, 4(2):170–196, 2000.
• Dani et al. (2008a) V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008a.
• Dani et al. (2008b) V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008b.
• Gittins et al. (2011) J. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
• Hsu and Sabato (2014) D. Hsu and S. Sabato. Heavy-tailed regression with a generalized median-of-means. In International Conference on Machine Learning, pages 37–45, 2014.
• Hsu and Sabato (2016) D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
• Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
• Lattimore (2017) T. Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1583–1592, 2017.
• Lattimore et al. (2014) T. Lattimore, K. Crammer, and C. Szepesvári. Optimal resource allocation with semi-bandit feedback. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 477–486. AUAI Press, 2014.
• Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web, pages 661–670. ACM, 2010.
• Liebeherr et al. (2012) J. Liebeherr, A. Burchard, and F. Ciucu. Delay bounds in communication networks with heavy-tailed and self-similar traffic. IEEE Transactions on Information Theory, 58(2):1010–1024, 2012.
• Medina and Yang (2016) A. M. Medina and S. Yang. No-regret algorithms for heavy-tailed linear bandits. In International Conference on Machine Learning, pages 1642–1650, 2016.
• Munos et al. (2014) R. Munos et al. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129, 2014.
• Robbins et al. (1952) H. Robbins et al. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
• Roberts et al. (2015) J. A. Roberts, T. W. Boonstra, and M. Breakspear. The heavy tail of the human brain. Current Opinion in Neurobiology, 31:164–172, 2015.
• Seldin et al. (2012) Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.
• Shao and Nikias (1993) M. Shao and C. L. Nikias. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986–1010, 1993.
• Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
• Vakili et al. (2013) S. Vakili, K. Liu, and Q. Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.
• Yu et al. (2018) X. Yu, H. Shao, M. R. Lyu, and I. King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937–946. AUAI Press, 2018.

## Appendix A Proof of Theorem 1 (Lower Bound of LinBET)

We prove the lower bound for . Assume $d$ is even (when $d$ is odd, similar results can be easily derived by considering the first $d-1$ dimensions). For with , we fix the decision set as . Then, the fixed decision set is constructed as , which is a subset of the intersection of the cube and the hyperplane . We define a set with . The payoff functions take values in $\{0, (1/\Delta)^{1/\epsilon}\}$ with , and for every , the expected payoff is $\theta_*^\top x$, where $\theta_*$ is the underlying parameter drawn from . To be more specific, we have the payoff function of $x$ as

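$$
y(x) = \begin{cases} \left(\dfrac{1}{\Delta}\right)^{\frac{1}{\epsilon}} & \text{with a probability of } \Delta^{\frac{1}{\epsilon}} \theta_*^\top x, \\[4pt] 0 & \text{with a probability of } 1 - \Delta^{\frac{1}{\epsilon}} \theta_*^\top x. \end{cases} \tag{4}
$$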

In this setting, the $(1+\epsilon)$-th raw moments of payoffs are bounded by and . We start the proof with the $2$-dimensional case in Subsection A.1. Its extension to the general case (i.e., $d > 2$) is provided in Subsection A.2. Though we set a fixed decision set in the proofs, we can easily extend the lower bound here to the setting of time-varying decision sets, as discussed by Dani et al. (2008a).
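The two-point payoff in Eq. (4) can be made concrete with a small numerical sketch. The function names and the input `mean`, which stands for the inner product $\theta_*^\top x$, are hypothetical and chosen only for illustration. The sketch computes the mean and the $(1+\epsilon)$-th raw moment in closed form and draws samples from the same distribution.

```python
import random

def payoff_stats(delta, eps, mean):
    """Exact mean and (1+eps)-th raw moment of the two-point payoff in Eq. (4).

    `mean` stands for theta_*^T x (an illustrative input, not a name from the paper).
    """
    value = (1.0 / delta) ** (1.0 / eps)   # the nonzero payoff level
    prob = delta ** (1.0 / eps) * mean     # probability of observing that level
    assert 0.0 <= prob <= 1.0, "parameters must give a valid probability"
    first_moment = value * prob
    raw_moment = value ** (1.0 + eps) * prob
    return first_moment, raw_moment

def sample_payoff(delta, eps, mean, rng):
    """Draw one payoff from the two-point distribution in Eq. (4)."""
    value = (1.0 / delta) ** (1.0 / eps)
    prob = delta ** (1.0 / eps) * mean
    return value if rng.random() < prob else 0.0

first, raw = payoff_stats(delta=0.1, eps=0.5, mean=0.15)
print(first, raw)
```

In this construction the mean stays at $\theta_*^\top x$ while the $(1+\epsilon)$-th raw moment equals $\theta_*^\top x / \Delta$, so a smaller $\Delta$ concentrates the payoff on a rarer and larger value.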

### A.1 $d=2$ Case

Let , and . The $2$-dimensional decision set is . Our payoff functions take values in $\{0, (1/\Delta)^{1/\epsilon}\}$, and for every , the expected payoff is $\theta_*^\top x$, where $\theta_*$ is chosen uniformly at random from . It is easy to find which is maximized at for , and for any .

###### Lemma 1.

If $\theta_*$ is chosen uniformly at random from , and the payoff for each is in $\{0, (1/\Delta)^{1/\epsilon}\}$ with mean $\theta_*^\top x$, then for every algorithm $A$ and every , the regret satisfies

$$
\mathbb{E}[R(A,T)] \ge \frac{1}{96}\, T^{\frac{1}{1+\epsilon}}. \tag{5}
$$
###### Proof.

We consider a deterministic algorithm $A$ first. Let , where denotes the number of pulls of arm . is the empirical distribution of arms with respect to and is drawn from . We let $P_j$ and $\mathbb{E}_j$ denote, respectively, the probability distribution of $X$ conditional on and the expectation conditional on , where . Thus, we have for any . At each time step $t$, $x_t$ is selected. We let . Hence, for , we have

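$$
\begin{aligned}
\mathbb{E}_j\left[\sum_{t=1}^{T} \left(y_t^* - y_t(x_t)\right)\right]
&= \sum_{t=1}^{T} \mathbb{E}_j\left[\Delta (1 - x_{t,j})\right]
= T \int_{D_{(2)}} \Delta (1 - x_j)\, dP_j(x) \\
&= T\Delta \left(1 - \int_{D_{(2)}} x_j\, dP_j(x)\right) \\
&= T\Delta \left(1 - \left(\int_{0 \le x_j \le \frac{1}{2}} x_j\, dP_j(x) + \int_{\frac{1}{2} < x_j \le 1} x_j\, dP_j(x)\right)\right),
\end{aligned} \tag{6}
$$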

which implies

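$$
\mathbb{E}[R(A,T)] \ge T\Delta \left(1 - \frac{1}{2} \sum_{j=1}^{2} \left(\frac{1}{2} P_j\left(0 \le X_j \le \tfrac{1}{2}\right) + P_j\left(\tfrac{1}{2} < X_j \le 1\right)\right)\right). \tag{7}
$$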

According to Pinsker’s inequality, for any , we have

$$
P_j(X \in E) \le P_0(X \in E) + \sqrt{\frac{1}{2}\,\mathrm{KL}(P_0, P_j)}, \tag{8}
$$

where $\mathrm{KL}(P_0, P_j)$ denotes the Kullback-Leibler divergence (simply, KL divergence). Hence,

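$$
\mathbb{E}[R(A,T)] \ge T\Delta \left(1 - \frac{1}{2} \sum_{j=1}^{2} \left(\frac{1}{2} P_0\left(0 \le X_j \le \tfrac{1}{2}\right) + P_0\left(\tfrac{1}{2} < X_j \le 1\right)\right) - \frac{3}{4} \sum_{j=1}^{2} \sqrt{\frac{1}{2}\,\mathrm{KL}(P_0, P_j)}\right). \tag{9}
$$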
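Pinsker's inequality in Eq. (8) can be checked numerically on small discrete distributions. The following is a minimal sketch: the helper names `kl` and `pinsker_gap` and the two example distributions are assumptions made for illustration and do not come from the paper. It enumerates every event $E$ over a finite support and reports the smallest slack in Eq. (8).

```python
import itertools
import math

def kl(p, q):
    """KL divergence between two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pinsker_gap(p0, pj):
    """Smallest value of P0(E) + sqrt(KL(P0, Pj) / 2) - Pj(E) over all events E."""
    bound = math.sqrt(0.5 * kl(p0, pj))
    slacks = []
    # Each 0/1 mask selects the outcomes that belong to the event E.
    for mask in itertools.product([0, 1], repeat=len(p0)):
        prob0 = sum(p for p, m in zip(p0, mask) if m)
        probj = sum(q for q, m in zip(pj, mask) if m)
        slacks.append(prob0 + bound - probj)
    return min(slacks)

# Illustrative distributions; a non-negative gap means Eq. (8) holds for every event.
print(pinsker_gap([0.5, 0.3, 0.2], [0.4, 0.3, 0.3]))
```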

Since $A$ is deterministic, the sequence of received rewards uniquely determines the empirical distribution and thus, conditional on is the same for any . We let be the probability distribution of conditional on . Based on the chain rule for KL divergence, we have

$$
\mathrm{KL}(P_0, P_j) \le \mathrm{KL}(P_0^T, P_j^T). \tag{10}
$$

Further, iteratively using the chain rule for KL divergence, we have

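$$
\begin{align}
\mathrm{KL}(P_0^T, P_j^T) &= \mathrm{KL}(P_0^1, P_j^1) + \sum_{t=2}^{T} \int_{W_{t-1}} \mathrm{KL}\left(P_0^t(\cdot \mid w_{t-1}),\, P_j^t(\cdot \mid w_{t-1})\right) dP_0^{t-1}(W_{t-1}) \notag \\
&= \mathrm{KL}(P_0^1, P_j^1) \,+ \tag{11} \\
&\quad\; \sum_{t=2}^{T} \int_{x_t \in D_{(2)}} \int_{W_{t-1} \mid x_{t,j} = x_j} \mathrm{KL}\left(\Delta^{\frac{1+\epsilon}{\epsilon}},\, \Delta^{\frac{1+\epsilon}{\epsilon}}(1 + x_j)\right) dP_0^{t-1}(W_{t-1} \mid x_{t,j} = x_j)\, dP_0^{t-1}(x_{t,j} = x_j) \tag{12} \\
&\le 2\Delta^{\frac{1+\epsilon}{\epsilon}} + \sum_{t=2}^{T} \int_{x_t \in D_{(2)}} \int_{W_{t-1} \mid x_{t,j} = x_j} 2\Delta^{\frac{1+\epsilon}{\epsilon}}\, dP_0^{t-1}(W_{t-1} \mid x_{t,j} = x_j)\, dP_0^{t-1}(x_{t,j} = x_j) \tag{13} \\
&= 2T\Delta^{\frac{1+\epsilon}{\epsilon}}, \tag{14}
\end{align}
$$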

where Eq. (13) could be derived by setting . Note that for any , let and denote the Bernoulli distributions with parameters and , respectively. We denote as in Eq. (12). Therefore, we have

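$$
\mathbb{E}[R(A,T)] \ge T\Delta \left(\frac{1}{4} - \frac{3}{2} \sqrt{T \Delta^{\frac{1+\epsilon}{\epsilon}}}\right) \ge \frac{1}{96}\, T^{\frac{1}{1+\epsilon}}, \tag{15}
$$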

where setting .
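The arithmetic behind the second inequality in Eq. (15) can be sketched numerically. The choice $\Delta = \frac{1}{12} T^{-\frac{\epsilon}{1+\epsilon}}$ used below is an assumption: it is one setting that reproduces the constant $\frac{1}{96}$, since it makes $T\Delta^{\frac{1+\epsilon}{\epsilon}} = (1/12)^{\frac{1+\epsilon}{\epsilon}} \le 1/144$ for $\epsilon \in (0,1]$. The function names are hypothetical.

```python
import math

def eq15_middle(T, eps, c=1.0 / 12.0):
    """Middle expression of Eq. (15) under the assumed Delta = c * T^(-eps / (1 + eps))."""
    delta = c * T ** (-eps / (1.0 + eps))
    kl_term = T * delta ** ((1.0 + eps) / eps)   # equals c^((1 + eps) / eps)
    return T * delta * (0.25 - 1.5 * math.sqrt(kl_term))

def eq15_target(T, eps):
    """Right-hand side of Eq. (15): T^(1 / (1 + eps)) / 96."""
    return T ** (1.0 / (1.0 + eps)) / 96.0

for eps in (0.25, 0.5, 1.0):
    for T in (10, 1000, 10 ** 6):
        middle, target = eq15_middle(T, eps), eq15_target(T, eps)
        # A tiny relative tolerance absorbs rounding when eps = 1, where both sides coincide.
        print(eps, T, middle >= target * (1.0 - 1e-9))
```

Under this assumed choice the two sides coincide at $\epsilon = 1$, and the middle expression is strictly larger for $\epsilon < 1$.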

So far we have discussed the case where $A$ is a deterministic algorithm. When $A$ is a randomized algorithm, the result is the same. In particular, let $\mathbb{E}_A$ denote the expectation with respect to the randomness of $A$. Then, we have

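$$
\mathbb{E}[R(A,T)] = \mathbb{E}_A\left[\mathbb{E}_{\theta_*}\left[\mathbb{E}_j\left[\sum_{t=1}^{T} \left(y_t^* - y_t(x_t)\right)\right]\right]\right]. \tag{16}
$$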

If we fix the realization of the algorithm’s randomization, the results of the previous steps for a deterministic algorithm apply, and the inner expectation in Eq. (16) can be lower bounded as before. Hence, $\mathbb{E}[R(A,T)]$ is lower bounded as in Eq. (15). ∎
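The step from Eq. (12) to Eq. (13) in the proof above bounds a KL divergence between two Bernoulli distributions, with parameters $p$ and $p(1+x)$, by $2p$. A minimal numerical sketch of that bound follows. Here $p$ plays the role of $\Delta^{\frac{1+\epsilon}{\epsilon}}$, and the restriction to $p \le 1/4$ and $x \in [0,1]$ is an assumption made for this illustration rather than the exact condition used in the proof. The helper names are hypothetical.

```python
import math

def bernoulli_kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q), for 0 < p, q < 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def worst_ratio(p_values, x_values):
    """Largest value of KL(p, p(1 + x)) / (2p) over a grid of p and x."""
    return max(
        bernoulli_kl(p, p * (1.0 + x)) / (2.0 * p)
        for p in p_values
        for x in x_values
    )

# Assumed grid: p in (0, 1/4], x in [0, 1].
p_grid = [k / 400.0 for k in range(1, 101)]
x_grid = [k / 50.0 for k in range(0, 51)]
print(worst_ratio(p_grid, x_grid))  # a value of at most 1 means KL <= 2p on the grid
```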

### A.2 General Case ($d>2$)

Now we suppose $d$ is even. If $d$ is odd, we just take the first $d-1$ dimensions into consideration. Then we consider the contribution to the total expected regret from the choice of , for all . We call the $i$-th component of .

Analogously to the $d=2$ case, we set . The decision region is . Then, by following the proof for the $d=2$ case, we could derive the regret due to the $i$-th component of as

$$
\mathbb{E}[R^{(i)}(A,T)] \ge \frac{1}{96}\, T^{\frac{1}{1+\epsilon}}, \tag{17}
$$

where . Summing over the components of Eq. (17) completes the proof for Theorem 1.

## Appendix B Proof of Theorem 2 (Regret Analysis for the MENU Algorithm)

To prove Theorem 2, we start by proving the following two lemmas. Recall that the algorithm in the paper is based on the least-squares estimate (LSE).

###### Lemma 2 (Confidence Ellipsoid of LSE).

Let denote the LSE of with the sequence of decisions and observed payoffs . Assume that for all and all ,