1 Introduction
The Multi-Armed Bandit (MAB) model of decision making, where at each time step an algorithm chooses an arm from a given set of arms and then receives a stochastic payoff with respect to the chosen arm, elegantly characterizes the tradeoff between exploration and exploitation in sequential learning. The algorithm usually aims at maximizing cumulative payoffs over a sequence of rounds. A natural and important variant of MAB is linear stochastic bandits, where the expected payoff of each arm is a linear mapping from the arm information to a real number. Linear stochastic bandits enjoy good theoretical properties, e.g., ridge regression yields a closed-form estimate of the linear mapping at each time step. Many practical applications take advantage of MAB and its variants to control decision performance, e.g., online personalized recommendations
(Li et al., 2010) and resource allocations (Lattimore et al., 2014). In most previous studies of MAB and linear stochastic bandits, a common assumption is that the noises in observed payoffs are sub-Gaussian conditional on historical information (Abbasi-Yadkori et al., 2011, Bubeck et al., 2012)
, which encompasses all bounded payoffs and many unbounded payoffs, e.g., payoffs of an arm following a Gaussian distribution. However, there do exist practical scenarios of non-sub-Gaussian noises in observed payoffs for sequential decisions, such as high-probability extreme returns in investments in financial markets (Cont and Bouchaud, 2000) and fluctuations of neural oscillations (Roberts et al., 2015); such noises are called heavy-tailed. Thus, it is significant to thoroughly study the theoretical behaviour of sequential decisions under heavy-tailed noises.

Many practical distributions, e.g., Pareto distributions and Weibull distributions, are heavy-tailed, exhibiting higher tail probabilities than exponential-family distributions. We consider a general characterization of heavy-tailed payoffs in bandits, where the payoff distributions have finite moments of order 1+ε with ε ∈ (0, 1]. When ε = 1, stochastic payoffs are generated from distributions with finite variances. When ε ∈ (0, 1), stochastic payoffs may be generated from distributions with infinite variances (Shao and Nikias, 1993). Note that, different from sub-Gaussian noises in the traditional bandit setting, noises from heavy-tailed distributions do not enjoy exponentially decaying tails, which makes it more difficult to learn the parameter of an arm.
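As a concrete check of this characterization, the raw moments of a Pareto distribution are finite exactly up to its shape parameter; the helper below (our own illustration, not from the paper) makes the ε boundary explicit.

```python
import math

def pareto_raw_moment(alpha, x_m, p):
    """p-th raw moment of a Pareto(alpha, x_m) distribution.

    E[X^p] = alpha * x_m**p / (alpha - p) for p < alpha, and infinite otherwise.
    """
    if p >= alpha:
        return math.inf
    return alpha * x_m ** p / (alpha - p)

# A Pareto tail with shape alpha = 1.5 has a finite (1 + eps)-th moment
# only for eps < 0.5: the mean (p = 1) is finite, the variance (p = 2) is not.
print(pareto_raw_moment(1.5, 1.0, 1.0))  # finite mean: 3.0
print(pareto_raw_moment(1.5, 1.0, 2.0))  # infinite variance
```

Such a distribution is a valid payoff distribution in the heavy-tailed setting above with ε strictly below 0.5, even though no variance exists.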
The regret of MAB with heavy-tailed payoffs has been well addressed by Bubeck et al. (2013), where stochastic payoffs have bounds on raw or central moments of order 1+ε. For MAB with finite variances (i.e., ε = 1), the regret of truncation algorithms or median of means recovers the optimal regret for MAB under the sub-Gaussian assumption. Recently, Medina and Yang (2016) investigated theoretical guarantees for the problem of linear stochastic bandits with heavy-tailed payoffs (LinBET). It is surprising to find that, for ε = 1, the regret of the bandit algorithms by Medina and Yang (2016) to solve LinBET scales as Õ(T^{3/4}) in T (throughout, Õ(·) omits polylogarithmic factors of T), which is far away from the regret of the state-of-the-art algorithms (i.e., Õ(d√T)) in linear stochastic bandits under the sub-Gaussian assumption (Dani et al., 2008a, Abbasi-Yadkori et al., 2011). Thus, the most interesting and non-trivial question is
Is it possible to recover the regret of Õ(d√T) when ε = 1 for LinBET?
In this paper, we answer this question affirmatively. Specifically, we investigate the problem of LinBET characterized by finite (1+ε)-th moments, where ε ∈ (0, 1]. The problem of LinBET raises several interesting challenges. The first challenge is the lower bound of the problem, which remains unknown. The technical issues come from the construction of an elegant setting for LinBET, and the derivation of a lower bound with respect to ε. The second challenge is how to develop a robust estimator for the parameter in LinBET, because heavy-tailed noises greatly affect the errors of the conventional least-squares estimator. It is worth mentioning that
Medina and Yang (2016) have tried to tackle this challenge, but their estimators do not make full use of the contextual information of chosen arms to eliminate the effect of heavy-tailed noises, which eventually leads to large regrets. The third challenge is how to successfully adopt median of means and truncation to solve LinBET with regret upper bounds matching the lower bound as closely as possible.

Our Results.
First of all, we rigorously analyze the lower bound for the problem of LinBET, which has a polynomial order in T of Ω(dT^{1/(1+ε)}). The lower bound provides two essential hints: one is that finite variances in LinBET yield a bound of Ω(d√T), and the other is that the algorithms by Medina and Yang (2016) are suboptimal. Then, we develop two novel bandit algorithms to solve LinBET based on the basic techniques of median of means and truncation. Both algorithms adopt the optimism in the face of uncertainty principle, which is common in bandit problems (Abbasi-Yadkori et al., 2011, Munos et al., 2014). The regret upper bounds of the proposed two algorithms, which are Õ(dT^{1/(1+ε)}), match the lower bound up to polylogarithmic factors. To the best of our knowledge, we are the first to solve LinBET almost optimally. We conduct experiments based on synthetic datasets, which are generated by Student's t distribution and Pareto distribution, to demonstrate the effectiveness of our algorithms. Experimental results show that our algorithms outperform the state-of-the-art results. The contributions of this paper are summarized as follows:

We provide the lower bound for the problem of LinBET characterized by finite (1+ε)-th moments, where ε ∈ (0, 1]. In the analysis, we construct an elegant setting of LinBET, which results in a regret bound of Ω(dT^{1/(1+ε)}) in expectation for any bandit algorithm.

We develop two novel bandit algorithms, named MENU and TOFU (with details shown in Section 4). The MENU algorithm adopts median of means with a well-designed allocation of decisions, and the TOFU algorithm adopts truncation via historical information. Both algorithms achieve regret Õ(dT^{1/(1+ε)}) with high probability.

We conduct experiments based on synthetic datasets to demonstrate the effectiveness of our proposed algorithms. By comparing our algorithms with the state-of-the-art results, we show improvements in cumulative payoffs for MENU and TOFU, which are strictly consistent with the theoretical guarantees in this paper.
2 Preliminaries and Related Work
In this section, we first present preliminaries, i.e., notations and the learning setting of LinBET. Then, we give a detailed discussion of the line of research on bandits with heavy-tailed payoffs.
2.1 Notations
For a positive integer n, let [n] ≜ {1, 2, …, n}. Let the ℓ_p norm of a vector x ∈ ℝ^d be ‖x‖_p = (Σ_{i=1}^d |x_i|^p)^{1/p}, where p ≥ 1 and x_i is the i-th element of x with i ∈ [d]. For a ∈ ℝ, its absolute value is |a|, its ceiling integer is ⌈a⌉, and its floor integer is ⌊a⌋. The inner product of two vectors x, y is denoted by ⟨x, y⟩. Given a positive definite matrix A ∈ ℝ^{d×d}, the weighted Euclidean norm of a vector x ∈ ℝ^d is ‖x‖_A = √(x^⊤Ax). B(c, r) denotes a Euclidean ball centered at c with radius r ∈ ℝ_+, where ℝ_+ is the set of positive numbers. Let e be Euler's number, and I_d an identity matrix. Let 𝟙{·} be an indicator function, and E[X] the expectation of a random variable X.

2.2 Learning Setting
For a bandit algorithm A, we consider sequential decisions with the goal of maximizing cumulative payoffs, where the total number of rounds for playing bandits is T. For each round t ∈ [T], the bandit algorithm A is given a decision set D_t ⊆ ℝ^d such that ‖x‖_2 ≤ D for any x ∈ D_t. The algorithm A has to choose an arm x_t ∈ D_t and then observes a stochastic payoff y_t(x_t). For notational simplicity, we also write y_t = y_t(x_t). The expectation of the observed payoff for the chosen arm satisfies a linear mapping from the arm to a real number as E[y_t | x_t] = ⟨θ_*, x_t⟩, where θ_* ∈ ℝ^d is an underlying parameter with ‖θ_*‖_2 ≤ S, and η_t = y_t − ⟨θ_*, x_t⟩ is a random noise. Without loss of generality, we assume E[η_t | F_t] = 0, where F_t = {x_1, y_1, …, x_{t−1}, y_{t−1}, x_t} is a filtration. Clearly, we have E[y_t | F_t] = ⟨θ_*, x_t⟩. For an algorithm A, maximizing cumulative payoffs is equivalent to minimizing the regret

R(A, T) = Σ_{t=1}^{T} ⟨θ_*, x_t^*⟩ − Σ_{t=1}^{T} ⟨θ_*, x_t⟩,    (1)

where x_t^* denotes the optimal decision at time t in D_t, i.e., x_t^* = argmax_{x ∈ D_t} ⟨θ_*, x⟩. In this paper, we will provide high-probability upper bounds of R(A, T) for our algorithms, and provide the lower bound for LinBET in expectation for any algorithm. The problem of LinBET is defined as below.
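Under these definitions, the regret above can be computed directly for a simulated run. The sketch below (with hypothetical names, assuming a finite decision set per round) makes the bookkeeping concrete.

```python
import numpy as np

def pseudo_regret(theta, chosen_arms, decision_sets):
    """Cumulative pseudo-regret: for each round, the gap between the best
    expected payoff over the round's decision set and the expected payoff
    of the arm actually chosen."""
    total = 0.0
    for x_t, D_t in zip(chosen_arms, decision_sets):
        best = max(np.dot(theta, x) for x in D_t)
        total += best - np.dot(theta, x_t)
    return total

theta = np.array([1.0, 0.0])
D = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two fixed arms

# Choosing the suboptimal arm in round 1 and the optimal arm in round 2
# incurs total pseudo-regret 1.0.
print(pseudo_regret(theta, [D[1], D[0]], [D, D]))  # 1.0
```

Note that this is the pseudo-regret, i.e., the gap in expected payoffs; the realized noisy payoffs never enter the computation.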
Definition 1 (LinBET).
Given decision sets {D_t ⊆ ℝ^d} for time steps t ∈ [T], an algorithm A, whose goal is to maximize cumulative payoffs over T rounds, chooses an arm x_t ∈ D_t at each round. With η_t = y_t − ⟨θ_*, x_t⟩, the observed stochastic payoff y_t is conditionally heavy-tailed, i.e., E[|y_t|^{1+ε} | F_t] ≤ b or E[|η_t|^{1+ε} | F_t] ≤ c, where ε ∈ (0, 1], b < +∞ and c < +∞.
2.3 Related Work
The model of MAB dates back to 1952 with the original work by Robbins et al. (1952), and its inherent characteristic is the tradeoff between exploration and exploitation. The asymptotic lower bound of MAB was developed by Lai and Robbins (1985), which is logarithmic with respect to the total number of rounds. An important technique called upper confidence bound was developed to achieve the lower bound (Lai and Robbins, 1985, Agrawal, 1995)
. Other related techniques to solve the problem of sequential decisions include Thompson sampling
(Thompson, 1933, Chapelle and Li, 2011, Agrawal and Goyal, 2012) and the Gittins index (Gittins et al., 2011).

The problem of MAB with heavy-tailed payoffs characterized by finite (1+ε)-th moments has been well investigated (Bubeck et al., 2013, Vakili et al., 2013, Yu et al., 2018). Bubeck et al. (2013) pointed out that finite variances in MAB are sufficient to achieve regret bounds of the same order as the optimal regret for MAB under the sub-Gaussian assumption, and the order of T in regret bounds increases as ε decreases. The lower bound of MAB with heavy-tailed payoffs has been analyzed (Bubeck et al., 2013), and the robust algorithms by Bubeck et al. (2013) are optimal. Theoretical guarantees by Bubeck et al. (2013), Vakili et al. (2013) are for the setting of finite arms. In Vakili et al. (2013), primary theoretical results were presented for the case of ε > 1. We notice that the case of ε > 1 is not interesting, because it reduces to the case of finite variances in MAB.
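The median-of-means technique referenced throughout this line of work can be sketched as follows; the function and data are our own illustration, assuming the sample count is a multiple of the group count.

```python
import statistics

def median_of_means(samples, k):
    """Split the samples into k equally sized groups, average each group,
    and return the median of the k group means."""
    m = len(samples) // k
    means = [sum(samples[i * m:(i + 1) * m]) / m for i in range(k)]
    return statistics.median(means)

# One wild heavy-tailed observation corrupts a single group mean,
# so the median over groups barely moves, unlike the plain mean.
data = [1.0] * 8 + [1000.0]
print(median_of_means(data, 3))   # 1.0
print(sum(data) / len(data))      # dominated by the outlier
```

The robustness comes from the median step: as long as a majority of groups are uncontaminated, their means determine the estimate.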
For the problem of linear stochastic bandits, which is also named linear reinforcement learning by Auer (2002), the lower bound is Ω(d√T) when the contextual information of arms comes from a d-dimensional space (Dani et al., 2008b). Bandit algorithms matching the lower bound up to polylogarithmic factors have been well developed (Auer, 2002, Dani et al., 2008a, Abbasi-Yadkori et al., 2011, Chu et al., 2011). Notice that all these studies assume that stochastic payoffs contain sub-Gaussian noises. More variants of MAB are discussed by Bubeck et al. (2012).

It is surprising to find that the lower bound of LinBET remains unknown. In Medina and Yang (2016), bandit algorithms based on truncation and median of means were presented. Even when variances are finite in LinBET, the algorithms by Medina and Yang (2016) cannot recover the bound of Õ(d√T), which is the regret of the state-of-the-art algorithms in linear stochastic bandits under the sub-Gaussian assumption. Medina and Yang (2016) conjectured that it is possible to recover Õ(d√T) with finite variances for LinBET. Thus, it is urgent to conduct a thorough analysis of the conjecture in consideration of the importance of heavy-tailed noises in real scenarios. Solving the conjecture generalizes the practical applications of bandit models. Practical motivating examples for bandits with heavy-tailed payoffs include delays in end-to-end network routing (Liebeherr et al., 2012) and sequential investments in financial markets (Cont and Bouchaud, 2000).
Recently, the assumption on stochastic payoffs of MAB was relaxed from sub-Gaussian noises to bounded kurtosis (Lattimore, 2017), which can be viewed as an extension of Bubeck et al. (2013). The interesting point of Lattimore (2017) is the scale-free algorithm, which might be practical in applications. Besides, Carpentier and Valko (2014) investigated extreme bandits, where stochastic payoffs of MAB follow Fréchet distributions. The setting of extreme bandits fits the real scenario of anomaly detection without contextual information. The order of regret in extreme bandits is characterized by distributional parameters, which is similar to the results by Bubeck et al. (2013).

It is worth mentioning that, for linear regression with heavy-tailed noises, several interesting studies have been conducted. Hsu and Sabato (2016) proposed a generalized method in light of median of means for loss minimization with heavy-tailed noises. Heavy-tailed noises in Hsu and Sabato (2016) might come from contextual information, which is more complicated than the setting of stochastic payoffs in this paper; as a result, linear regression with heavy-tailed noises usually requires a finite fourth moment. In Audibert et al. (2011), the basic technique of truncation was adopted to solve robust linear regression in the absence of an exponential moment condition. The related studies in this line of research are not directly applicable to the problem of LinBET.

3 Lower Bound
In this section, we provide the lower bound for LinBET. We consider heavy-tailed payoffs with finite (1+ε)-th raw moments in the analysis. In particular, we construct the following setting. Assume d is even (when d is odd, similar results can be easily derived by considering the first d−1 dimensions). The fixed decision set is constructed as a subset of the intersection of a cube and a hyperplane, so that each arm decomposes into d/2 two-dimensional tuples. We define a finite candidate set from which the underlying parameter θ_* is drawn uniformly at random. The payoff distributions take two values such that, for every arm x, the expected payoff is ⟨θ_*, x⟩. To be more specific, we have the payoff function as

(2)
We state the theorem for the lower bound of LinBET below.
Theorem 1 (Lower Bound of LinBET).
If θ_* is chosen uniformly at random from the candidate set described above, and the payoff of each arm x takes two values with mean ⟨θ_*, x⟩, then for any algorithm A and every sufficiently large T, we have

E[R(A, T)] = Ω(dT^{1/(1+ε)}).    (3)
In the proof of Theorem 1, we first prove the lower bound for the two-dimensional case, and then generalize the argument to any even d. We notice that the parameter in the original d-dimensional space is rearranged into d/2 tuples, each of which is a two-dimensional vector. If the i-th tuple of the parameter is selected, then the i-th tuple of the optimal arm is determined accordingly, and the instantaneous regret decomposes over the tuples; the total regret can then be represented as a sum of the per-tuple regrets. Finally, with common inequalities in information theory, we obtain the regret lower bound by an appropriate setting of the gap parameter.
We notice that the martingale differences used to prove the lower bound for linear stochastic bandits in (Dani et al., 2008a) are not directly feasible for the proof of the lower bound for LinBET, because our construction of heavy-tailed payoffs (i.e., Eq. (4)) excludes the information they rely on. Besides, our proof is partially inspired by Bubeck (2010). We show the detailed proof of Theorem 1 in Appendix A.
Remark 1.
The above lower bound provides two essential hints: one is that finite variances in LinBET yield a bound of Ω(d√T), and the other is that the algorithms proposed by Medina and Yang (2016) are far from optimal. The result in Theorem 1 strongly indicates that it is possible to design bandit algorithms recovering Õ(d√T) with finite variances.
4 Algorithms and Upper Bounds
In this section, we develop two novel bandit algorithms to solve LinBET, which turn out to be almost optimal. We rigorously prove regret upper bounds for the proposed algorithms. In particular, our core idea is based on the optimism in the face of uncertainty principle (OFU). The first algorithm is median of means under OFU (MENU), shown in Algorithm 1, and the second algorithm is truncation under OFU (TOFU), shown in Algorithm 2. For comparison, we directly name the bandit algorithm based on median of means in Medina and Yang (2016) as MoM, and the bandit algorithm based on a confidence region with truncation in Medina and Yang (2016) as CRT.
Both algorithms in this paper adopt the tool of ridge regression. At time step t, let θ̂_t be the regularized least-squares estimate (LSE) of θ_* given by θ̂_t = (X_t^⊤X_t + λI_d)^{−1} X_t^⊤ Y_t, where X_t is the matrix whose rows are x_1^⊤, …, x_{t−1}^⊤, Y_t is the vector of the historical observed payoffs up to time t, and λ > 0 is a regularization parameter.
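A minimal NumPy sketch of this regularized LSE, under the assumption of a noiseless toy instance (the names are ours, not the paper's):

```python
import numpy as np

def ridge_lse(X, Y, lam):
    """Regularized least-squares estimate (X^T X + lam * I)^{-1} X^T Y,
    computed via a linear solve instead of an explicit inverse."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

theta_star = np.array([2.0, -1.0])
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
Y = X @ theta_star                  # noiseless payoffs for illustration
theta_hat = ridge_lse(X, Y, lam=1e-8)
print(np.round(theta_hat, 6))       # close to [2, -1]
```

With heavy-tailed noise added to Y, this plain estimator can be badly perturbed by a single extreme payoff, which is exactly why the robust aggregation and truncation steps below are needed.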
4.1 MENU and Regret
Description of MENU.
To apply median of means in LinBET, it is common to allocate the T pulls of bandits among epochs, and within each epoch to play the same arm multiple times to obtain a robust estimate. We find that there exist different ways to construct the epochs. We design the framework of MENU in Figure 1(a), and show the framework of MoM designed by Medina and Yang (2016) in Figure 1(b). MENU and MoM differ in three respects. First, within each epoch, MENU and MoM play the same arm for different numbers of times. Second, at each epoch, MENU conducts multiple LSEs, each based on a subset of the historical payoffs, while MoM conducts a single LSE based on intermediate payoffs calculated via the median of means of the observed payoffs. Third, MENU adopts the median of means of the LSEs, while MoM adopts the median of means of the observed payoffs. Intuitively, the execution of multiple LSEs leads to the improved regret of MENU. With a better tradeoff between the number of epochs and the number of plays per epoch in Figure 1(a), we derive an improved regret upper bound in Theorem 2.
In light of Figure 1(a), we develop the algorithmic procedures of MENU in Algorithm 1. We notice that, in order to guarantee that the median of means of the LSEs is not far away from the true underlying parameter with high probability, we construct the confidence interval in Line 10 of Algorithm 1. Now we have the following theorem for the regret upper bound of MENU.

Theorem 2 (Regret Analysis for the MENU Algorithm).
Assume that for all t ∈ [T] and x ∈ D_t, E[|η_t|^{1+ε} | F_t] ≤ c with ε ∈ (0, 1], ‖θ_*‖_2 ≤ S, and ‖x‖_2 ≤ D. Then, with probability at least 1 − δ, for every sufficiently large T, the regret of the MENU algorithm satisfies R(MENU, T) = Õ(dT^{1/(1+ε)}).
The technical challenges in MENU (i.e., Algorithm 1) and its proof are discussed as follows. Based on the common techniques in linear stochastic bandits (Abbasi-Yadkori et al., 2011), to bound the instantaneous regret in LinBET, we need to guarantee that the estimate lies in a confidence ellipsoid around θ_* with high probability. We attack this issue by guaranteeing the concentration of each individual LSE with constant probability, which reduces to bounding a weighted sum of the historical noises. Interestingly, by conducting a singular value decomposition on X_t (whose rows are the chosen arms), we find that the norm of the weights is suitably bounded, and the weighted sum can then be bounded in terms of the moment bound. With a standard analysis in linear stochastic bandits passing from the instantaneous regret to the cumulative regret, we achieve the above results for MENU. We show the detailed proof of Theorem 2 in Appendix B.

Remark 2.
For MENU, we adopt the assumption of heavy-tailed payoffs on central moments, which is required by the basic technique of median of means (Bubeck et al., 2013). Besides, there exists an implicit mild assumption in Algorithm 1 that, at each epoch, the decision set must contain the selected arm for a sufficient number of consecutive rounds, which is practical in applications, e.g., online personalized recommendations (Li et al., 2010). The regret upper bound of MENU is Õ(dT^{1/(1+ε)}), which implies that finite variances in LinBET are sufficient to achieve Õ(d√T).
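Taking a "median of means" of vector-valued LSEs can be done in several ways. The sketch below uses the generalized median-of-means selection rule in the style of Hsu and Sabato (cited in Section 2.3), picking the estimate with the smallest median distance to the others; this is an illustrative stand-in, not necessarily the exact aggregation rule of Algorithm 1, and the names and toy values are our own.

```python
import numpy as np

def mom_select(estimates):
    """Among candidate parameter estimates, return the one whose median
    Euclidean distance to the remaining estimates is smallest."""
    best, best_score = None, np.inf
    for i, th_i in enumerate(estimates):
        dists = [np.linalg.norm(th_i - th_j)
                 for j, th_j in enumerate(estimates) if j != i]
        score = np.median(dists)
        if score < best_score:
            best, best_score = th_i, score
    return best

# Two estimates agree and one is corrupted by heavy-tailed noise;
# the corrupted estimate is never selected.
ests = [np.array([0.0]), np.array([0.1]), np.array([10.0])]
print(mom_select(ests))  # [0.1]
```

The appeal of such a rule is that a minority of wildly wrong estimates cannot drag the selected parameter away, mirroring the scalar median-of-means guarantee.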
4.2 TOFU and Regret
Description of TOFU.
We demonstrate the algorithmic procedures of TOFU in Algorithm 2. We point out two subtle differences between our TOFU and the CRT algorithm. In TOFU, to obtain an accurate estimate of θ_*, we truncate all historical payoffs for each dimension individually; besides, the truncation operations depend on the historical information of the arms. By contrast, in CRT, the historical payoffs are truncated once, controlled only by the number of rounds for playing bandits. Compared to CRT, our TOFU achieves a tighter confidence interval, which can be seen from the setting of its truncation threshold. Now we have the following theorem for the regret upper bound of TOFU.
Theorem 3 (Regret Analysis for the TOFU Algorithm).
Assume that for all t ∈ [T] and x ∈ D_t, E[|y_t|^{1+ε} | F_t] ≤ b with ε ∈ (0, 1], ‖θ_*‖_2 ≤ S, and ‖x‖_2 ≤ D. Then, with probability at least 1 − δ, for every sufficiently large T, the regret of the TOFU algorithm satisfies R(TOFU, T) = Õ(dT^{1/(1+ε)}).
Remark 3.
For TOFU, we adopt the assumption of heavy-tailed payoffs on raw moments. It is worth pointing out that, when ε = 1, the regret upper bound of TOFU is Õ(d√T), which implies that we recover the same order in T as that under the sub-Gaussian assumption (Abbasi-Yadkori et al., 2011). A weakness of TOFU is its high time complexity, because at each round TOFU needs to truncate all historical payoffs. The time complexity might be reasonably reduced by dividing the T rounds into multiple epochs, each of which contains only one truncation.
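The underlying truncation idea, discarding observations whose magnitude exceeds a threshold before averaging, can be sketched as follows (a simplified scalar version for illustration, not the full per-dimension TOFU update):

```python
def truncated_mean(payoffs, threshold):
    """Mean of the payoffs after zeroing out any observation whose
    magnitude exceeds the truncation threshold."""
    return sum(y if abs(y) <= threshold else 0.0 for y in payoffs) / len(payoffs)

# A single heavy-tailed observation is discarded rather than
# dominating the estimate.
print(truncated_mean([1.0, 2.0, 100.0], threshold=10.0))  # 1.0
```

Truncation introduces a bias (large payoffs are zeroed), so the threshold must grow with the number of samples; balancing this bias against the variance reduction is what drives the choice of threshold in truncation-based bandit algorithms.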
dataset | {arms, dimensions} | distribution {parameters} | mean of the optimal arm | {ε, b, c}
S1 | {20, 10} | Student's t distribution {ν = 3} | – | {1.00, NA, 3.00}
S2 | {100, 20} | Student's t distribution {ν = 3} | – | {1.00, NA, 3.00}
S3 | {20, 10} | Pareto distribution | – | {0.50, 7.72, NA}
S4 | {100, 20} | Pareto distribution | – | {0.50, 54.37, NA}

For Student's t distribution, ν denotes the degree of freedom, μ denotes the location and σ denotes the scale. For the Pareto distribution, α denotes the shape and x_m denotes the scale. NA denotes not available.

5 Experiments
In this section, we conduct experiments based on synthetic datasets to evaluate the performance of our proposed bandit algorithms: MENU and TOFU. For comparison, we adopt two baselines: MoM and CRT, proposed by Medina and Yang (2016). We run multiple independent repetitions for each dataset on a personal computer under Windows 7 with an Intel 3.70GHz CPU and 16GB memory.
5.1 Datasets and Setting
To show the effectiveness of bandit algorithms, we demonstrate cumulative payoffs with respect to the number of rounds for playing bandits over a fixed finite-arm decision set. For verification, we adopt four synthetic datasets (named S1–S4) in the experiments, of which the statistics are shown in Table 1. The experiments on heavy tails require c or b to be known, which corresponds to the assumptions of Theorem 2 or Theorem 3. According to the required information, we can apply MENU or TOFU in practical applications. We adopt Student's t and Pareto distributions because they are common in practice. For Student's t distributions, we easily estimate c, while for Pareto distributions, we easily estimate b. Besides, we can choose different parameters (e.g., larger values) in the distributions, and recalculate b and c accordingly.
For S1 and S2, which contain different numbers of arms and different dimensions for the contextual information, we adopt the standard Student's t distribution to generate heavy-tailed noises. For the chosen arm, the expected payoff is the inner product of the arm and the underlying parameter, and the observed payoff adds a noise generated from a standard Student's t distribution. We generate each dimension of the contextual information of an arm, as well as the underlying parameter, from a uniform distribution. The standard Student's t distribution implies that the bound on the second central moment for S1 and S2 is c = 3.

For S3 and S4, we adopt the Pareto distribution with ε = 0.5, choosing the shape parameter so that the (1+ε)-th raw moment is finite, and setting the scale accordingly. We take the maximum of the (1+ε)-th raw moments among all arms as the bound b. We generate the arms and the parameter similarly to S1 and S2.
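As an illustration, the generation of an S1-style dataset can be sketched as follows; the uniform range, the function names, and the seed are our own assumptions, not the paper's exact protocol. Degrees of freedom df = 3 keep the noise variance finite at df / (df − 2) = 3, matching the moment bound c = 3 used above.

```python
import numpy as np

def make_dataset(n_arms, d, df, seed=0):
    """Draw arms and the underlying parameter uniformly at random;
    each pull returns the linear expected payoff plus additive
    Student's t noise with df degrees of freedom."""
    rng = np.random.default_rng(seed)
    arms = rng.uniform(0.0, 1.0, size=(n_arms, d))
    theta = rng.uniform(0.0, 1.0, size=d)

    def pull(i):
        # heavy-tailed noisy payoff of arm i
        return arms[i] @ theta + rng.standard_t(df)

    return arms, theta, pull

arms, theta, pull = make_dataset(n_arms=20, d=10, df=3)
print(arms.shape, theta.shape)  # (20, 10) (10,)
```

Averaging many pulls of a fixed arm and comparing against `arms[i] @ theta` gives a quick sanity check that the noise is centered.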
In the figures, we show the average cumulative payoffs over time across ten independent repetitions for each dataset, and show error bars of one standard deviation to compare the robustness of the algorithms. For S1 and S2, we run MENU and MoM with ε = 1. For S3 and S4, we run TOFU and CRT with ε = 0.5. All remaining parameters are set identically across the algorithms.
5.2 Results and Discussions
We show the experimental results in Figure 2. From the figure, we clearly find that our proposed algorithms outperform MoM and CRT, which is consistent with the theoretical results in Theorems 2 and 3. We also evaluate our algorithms on other synthetic datasets, as well as with other parameter settings, and observe similar superiority of MENU and TOFU. Finally, for a further comparison of the regret, time complexity and storage of the four algorithms, we list the results in Table 2.
algorithm | MoM | MENU | CRT | TOFU
regret | – | Õ(dT^{1/(1+ε)}) | – | Õ(dT^{1/(1+ε)})
complexity | – | – | – | –
storage | – | – | – | –
6 Conclusion
We have studied the problem of LinBET, where stochastic payoffs are characterized by finite (1+ε)-th moments with ε ∈ (0, 1]. We broke the traditional assumption of sub-Gaussian noises in the payoffs of bandits, and derived theoretical guarantees based on prior information of bounds on finite moments. We rigorously analyzed the lower bound of LinBET, and developed two novel bandit algorithms with regret upper bounds matching the lower bound up to polylogarithmic factors. The two proposed algorithms are based on median of means and truncation. In the sense of polynomial dependence on T, we provided optimal algorithms for the problem of LinBET, and thus solved an open problem pointed out by Medina and Yang (2016). Finally, our proposed algorithms have been evaluated on synthetic datasets, and outperformed the state-of-the-art results. Since both algorithms in this paper require a priori knowledge of the moment bounds, future directions in this line of research include automatic learning in LinBET without information of distributional moments, and evaluation of our proposed algorithms in real-world scenarios.
Acknowledgments
The work described in this paper was partially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (No. CUHK 14208815 and No. CUHK 14210717 of the General Research Fund), and Microsoft Research Asia (2018 Microsoft Research Asia Collaborative Research Award).
References
 Abbasi-Yadkori et al. (2011) Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 Agrawal (1995) R. Agrawal. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27(4):1054–1078, 1995.
 Agrawal and Goyal (2012) S. Agrawal and N. Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In Conference on Learning Theory, pages 39–1, 2012.
 Audibert et al. (2011) J.-Y. Audibert, O. Catoni, et al. Robust linear least squares regression. The Annals of Statistics, 39(5):2766–2794, 2011.

 Auer (2002) P. Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Bubeck (2010) S. Bubeck. Bandits games and clustering foundations. PhD thesis, Université des Sciences et Technologie de Lille - Lille I, 2010.
 Bubeck et al. (2012) S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
 Bubeck et al. (2013) S. Bubeck, N. Cesa-Bianchi, and G. Lugosi. Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717, 2013.
 Carpentier and Valko (2014) A. Carpentier and M. Valko. Extreme bandits. In Advances in Neural Information Processing Systems, pages 1089–1097, 2014.
 Chapelle and Li (2011) O. Chapelle and L. Li. An empirical evaluation of Thompson sampling. In Advances in Neural Information Processing Systems, pages 2249–2257, 2011.

 Chu et al. (2011) W. Chu, L. Li, L. Reyzin, and R. Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
 Cont and Bouchaud (2000) R. Cont and J.-P. Bouchaud. Herd behavior and aggregate fluctuations in financial markets. Macroeconomic Dynamics, 4(2):170–196, 2000.
 Dani et al. (2008a) V. Dani, T. P. Hayes, and S. M. Kakade. Stochastic linear optimization under bandit feedback. In Conference on Learning Theory, pages 355–366, 2008a.
 Dani et al. (2008b) V. Dani, S. M. Kakade, and T. P. Hayes. The price of bandit information for online optimization. In Advances in Neural Information Processing Systems, pages 345–352, 2008b.
 Gittins et al. (2011) J. Gittins, K. Glazebrook, and R. Weber. Multi-armed bandit allocation indices. John Wiley & Sons, 2011.
 Hsu and Sabato (2014) D. Hsu and S. Sabato. Heavytailed regression with a generalized medianofmeans. In International Conference on Machine Learning, pages 37–45, 2014.
 Hsu and Sabato (2016) D. Hsu and S. Sabato. Loss minimization and parameter estimation with heavy tails. The Journal of Machine Learning Research, 17(1):543–582, 2016.
 Lai and Robbins (1985) T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
 Lattimore (2017) T. Lattimore. A scale free algorithm for stochastic bandits with bounded kurtosis. In Advances in Neural Information Processing Systems, pages 1583–1592, 2017.
 Lattimore et al. (2014) T. Lattimore, K. Crammer, and C. Szepesvári. Optimal resource allocation with semibandit feedback. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pages 477–486. AUAI Press, 2014.
 Li et al. (2010) L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextualbandit approach to personalized news article recommendation. In Proceedings of the Nineteenth International Conference on World Wide Web, pages 661–670. ACM, 2010.
 Liebeherr et al. (2012) J. Liebeherr, A. Burchard, and F. Ciucu. Delay bounds in communication networks with heavytailed and selfsimilar traffic. IEEE Transactions on Information Theory, 58(2):1010–1024, 2012.
 Medina and Yang (2016) A. M. Medina and S. Yang. Noregret algorithms for heavytailed linear bandits. In International Conference on Machine Learning, pages 1642–1650, 2016.
 Munos et al. (2014) R. Munos et al. From bandits to Monte-Carlo tree search: The optimistic principle applied to optimization and planning. Foundations and Trends® in Machine Learning, 7(1):1–129, 2014.
 Robbins et al. (1952) H. Robbins et al. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
 Roberts et al. (2015) J. A. Roberts, T. W. Boonstra, and M. Breakspear. The heavy tail of the human brain. Current Opinion in Neurobiology, 31:164–172, 2015.
 Seldin et al. (2012) Y. Seldin, F. Laviolette, N. Cesa-Bianchi, J. Shawe-Taylor, and P. Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 58(12):7086–7093, 2012.
 Shao and Nikias (1993) M. Shao and C. L. Nikias. Signal processing with fractional lower order moments: stable processes and their applications. Proceedings of the IEEE, 81(7):986–1010, 1993.
 Thompson (1933) W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
 Vakili et al. (2013) S. Vakili, K. Liu, and Q. Zhao. Deterministic sequencing of exploration and exploitation for multi-armed bandit problems. IEEE Journal of Selected Topics in Signal Processing, 7(5):759–767, 2013.
 Yu et al. (2018) X. Yu, H. Shao, M. R. Lyu, and I. King. Pure exploration of multi-armed bandits with heavy-tailed payoffs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, pages 937–946. AUAI Press, 2018.
Appendix A Proof of Theorem 1 (Lower Bound of LinBET)
We prove the lower bound for LinBET with finite (1+ε)-th raw moments. Assume d is even (when d is odd, similar results can be easily derived by considering the first d−1 dimensions). The fixed decision set is constructed as a subset of the intersection of a cube and a hyperplane, so that each arm decomposes into d/2 two-dimensional tuples. We define a finite candidate set from which the underlying parameter θ_* is drawn uniformly at random. The payoff distributions take two values such that, for every arm x, the expected payoff is ⟨θ_*, x⟩. To be more specific, we have the payoff function as
(4) 
In this setting, the (1+ε)-th raw moments of the payoffs are bounded. We start the proof with the two-dimensional case in Subsection A.1. Its extension to the general case is provided in Subsection A.2. Though we use a fixed decision set in the proofs, we can easily extend the lower bound to the setting of time-varying decision sets, as discussed by Dani et al. (2008a).
A.1 The Two-Dimensional Case
We specialize the construction above to two dimensions: payoffs take two values, and for every arm x the expected payoff is ⟨θ_*, x⟩, where θ_* is chosen uniformly at random from the candidate set. The expected payoff is maximized at the arm aligned with the realized θ_*.
Lemma 1.
If θ_* is chosen uniformly at random from the candidate set, and the payoff of each arm x takes two values with mean ⟨θ_*, x⟩, then for every algorithm A and every T, the regret satisfies
(5) 
Proof.
We consider a deterministic algorithm A first. Let T_i denote the number of pulls of arm i, so that the empirical distribution of arms is defined with respect to the pulls, and θ_* is drawn from the candidate set. We let P_θ and E_θ denote, respectively, the probability distribution and the expectation conditional on θ_* = θ. At each time step, an arm is selected, and we have
which implies
(7) 
According to Pinsker's inequality, for any arm, we have
(8) 
where KL(·‖·) denotes the Kullback-Leibler divergence (KL divergence for short). Hence,
(9) 
Since A is deterministic, the sequence of received rewards uniquely determines the empirical distribution of arms; thus, the conditional distribution of that sequence is what matters. We let P denote the probability distribution of the reward sequence conditional on the underlying parameter. Based on the chain rule for KL divergence, we have
(10) 
Further, iteratively using the chain rule for KL divergence, we have
(11)  
(12)  
(13)  
(14) 
where Eq. (13) is derived by an appropriate substitution. Note that, for any arm, the two relevant conditional payoff laws are Bernoulli distributions with the corresponding parameters; we denote the resulting term as in Eq. (12). Therefore, we have
where the final step follows by an appropriate setting of the gap parameter.
So far we have discussed the case where A is a deterministic algorithm. When A is a randomized algorithm, the result is the same. In particular, taking the expectation with respect to the randomness of A, we have
(16) 
If we fix the realization of the algorithm's randomization, the results of the previous steps for a deterministic algorithm apply, and the conditional regret can be lower bounded as before. Hence, the regret is lower bounded as in Eq. (15). ∎
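For reference, the two information-theoretic tools invoked in the proof, Pinsker's inequality and the chain rule for KL divergence, read in standard form:

```latex
% Pinsker's inequality: total variation distance is controlled by KL divergence
\delta(P, Q) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}(P \,\|\, Q)}

% Chain rule for KL divergence over a pair (X, Y)
\mathrm{KL}\!\left(P_{XY} \,\|\, Q_{XY}\right)
  \;=\; \mathrm{KL}\!\left(P_X \,\|\, Q_X\right)
  \;+\; \mathbb{E}_{X \sim P_X}\!\left[\mathrm{KL}\!\left(P_{Y\mid X} \,\|\, Q_{Y\mid X}\right)\right]
```

Iterating the chain rule over the reward sequence is what reduces the KL divergence between the two reward processes to a sum of per-round Bernoulli divergences, as in Eqs. (11)–(14).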
A.2 The General Case
Now we suppose d is even. If d is odd, we just take the first d−1 dimensions into consideration. We then consider the contribution to the total expected regret from the choice of each two-dimensional tuple, and call this the per-tuple component of the regret.
Appendix B Proof of Theorem 2 (Regret Analysis for the MENU Algorithm)
To prove Theorem 2, we start by proving the following two lemmas. Recall that the algorithm in the paper is based on the least-squares estimate (LSE).
Lemma 2 (Confidence Ellipsoid of LSE).
Let θ̂_t denote the LSE of θ_* given the sequence of decisions and observed payoffs. Assume that for all t ∈ [T] and all x ∈ D_t,