# Minimax Policy for Heavy-tailed Multi-armed Bandits

We study the stochastic Multi-Armed Bandit (MAB) problem under worst-case regret and heavy-tailed reward distributions. We modify the minimax policy MOSS [1], designed for sub-Gaussian reward distributions, by using a saturated empirical mean to design a new algorithm called Robust MOSS. We show that if the moment of order 1+ϵ of the reward distribution exists, then the refined strategy has a worst-case regret matching the lower bound while maintaining a distribution-dependent logarithmic regret.

## I Introduction

The dilemma of exploration versus exploitation is common in scenarios involving decision-making in unknown environments. In these contexts, exploration means learning the environment while exploitation means taking empirically computed best actions. When finite time performance is concerned, i.e., scenarios in which one cannot learn indefinitely, ensuring a good balance of exploration and exploitation is the key to a good performance. Multi-armed bandit and its variations are prototypical models for these problems, and they are broadly applied in many areas such as economics, communication systems, and robotics.

The stochastic MAB problem was originally proposed by Robbins [2]. In this problem, an agent chooses an arm from a set of $K$ arms and receives a reward associated with the arm at each time slot. The reward at each arm is a stationary random variable with unknown mean. The objective is to design a policy that maximizes the cumulative reward or, equivalently, minimizes the expected cumulative regret, defined as the difference between the expected cumulative reward obtained by selecting the arm with the maximum mean reward at each time and that obtained by selecting arms according to the designed policy.

The notion of expected cumulative regret can be generalized to the worst-case regret, which is defined as the supremum of the expected cumulative regret computed over all possible reward distributions within a certain class, such as distributions with bounded support or sub-Gaussian distributions. The minimax regret is defined as the minimum worst-case regret, where the minimum is computed over all policies. By construction, the worst-case regret uses minimal information about the underlying distribution, and the associated regret bounds are called distribution-free bounds. In contrast, the standard regret bounds depend on the difference between the mean rewards associated with the optimal arm and the suboptimal arms, and the corresponding bounds are referred to as distribution-dependent bounds.

In their seminal work, Lai and Robbins [3] establish that the expected cumulative regret admits an asymptotic distribution-dependent lower bound that is a logarithmic function of the time horizon $T$. Here, asymptotic refers to the limit $T \to +\infty$. They also propose a general method of constructing Upper Confidence Bound (UCB) based policies that attain the lower bound asymptotically. By assuming rewards to be bounded or, more generally, sub-Gaussian, several subsequent works design simpler algorithms with finite-time performance guarantees, e.g., the UCB1 algorithm by Auer et al. [4]. By using Kullback-Leibler (KL) divergence based upper confidence bounds, Garivier and Cappé [5] design KL-UCB, which is proved to have efficient finite-time performance as well as asymptotic optimality.

In the worst-case setting, the lower and upper bounds are distribution-free bounds. Assuming the rewards are bounded, Audibert and Bubeck [1] establish a lower bound on the minimax regret. They also study a modified UCB algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) and prove that it achieves an order-optimal worst-case regret while maintaining a logarithmic distribution-dependent regret. Degenne and Perchet [6] extend MOSS to an any-time version called MOSS-anytime.

Assuming the rewards to be bounded or sub-Gaussian is common; it gives the sample mean exponential convergence and simplifies the MAB problem. However, in many applications, such as social networks [7] and financial markets [8], the rewards are heavy-tailed. For the standard stochastic MAB problem, Bubeck et al. [9] relax the sub-Gaussian assumption by only assuming the rewards to have finite moments of order $1+\epsilon$ for some $\epsilon \in (0,1]$. They present the robust UCB algorithm and show that it attains an upper bound on the cumulative regret that is within a constant factor of the distribution-dependent lower bound in the heavy-tailed setting. However, to the best of our knowledge, no algorithm in the literature provably achieves an order-optimal worst-case regret for heavy-tailed bandits. A polylogarithmic extra factor exists in the solutions provided in [9].

In this paper, we study the minimax heavy-tailed bandit problem. We propose and analyze the Robust MOSS algorithm and show that if the reward distributions admit moments of order $1+\epsilon$, with $\epsilon \in (0,1]$, then it achieves a minimax regret matching the lower bound while maintaining a distribution-dependent logarithmic regret. Our results build on techniques in [1] and [9], and augment them with new analysis based on maximal Bennett inequalities.

The remainder of the paper is organized as follows. We describe the minimax heavy-tailed multi-armed bandit problem and present some background material in Section II. We present and analyze the Robust MOSS algorithm in Sections III and IV, respectively. Our conclusions are presented in Section VI.

## II Background & Problem Description

### II-A Stochastic MAB Problem

In a stochastic MAB problem, an agent chooses an arm $\varphi_t$ from the set of $K$ arms $\{1,\dots,K\}$ at each time $t \in \{1,\dots,T\}$ and receives the associated reward. The reward at each arm $k$ is drawn from an unknown distribution $f_k$ with unknown mean $\mu_k$. Let the maximum mean reward among all arms be $\mu^*$. We use $\Delta_k = \mu^* - \mu_k$ to measure the suboptimality of arm $k$. The objective is to maximize the expected cumulative reward or, equivalently, to minimize the expected cumulative regret defined by

$$R_T := \mathbb{E}\Big[\sum_{t=1}^{T}\big(\mu^* - X_{\varphi_t}\big)\Big] = \mathbb{E}\Big[\sum_{t=1}^{T}\Delta_{\varphi_t}\Big],$$

which is the difference between the expected cumulative reward obtained by selecting the arm with the maximum mean reward and that obtained by selecting arms $\varphi_1,\dots,\varphi_T$.

The expected cumulative regret is implicitly defined for a fixed distribution of rewards $f_k$ from each arm $k \in \{1,\dots,K\}$. The worst-case regret is the expected cumulative regret for the worst possible choice of reward distributions $\{f_1,\dots,f_K\}$. In particular,

$$R_T^{\text{worst}} = \sup_{\{f_1,\dots,f_K\}} R_T.$$

The regret associated with the policy that minimizes the above worst case regret is called minimax regret.
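As a small illustration of the regret definition above, the following sketch evaluates the regret of a fixed pull sequence when the arm means are known. The function name `cumulative_regret` and the example means are hypothetical and only serve to make the definition concrete; in the bandit problem itself the means are unknown to the agent.

```python
import numpy as np

def cumulative_regret(means, pulls):
    """Sum of gaps Delta_{phi_t} over a fixed sequence of pulled arms.

    means: true mean reward of each arm (known here only for illustration).
    pulls: sequence of arm indices phi_1, ..., phi_T (0-based).
    """
    means = np.asarray(means, dtype=float)
    gaps = means.max() - means            # Delta_k = mu* - mu_k
    return float(gaps[np.asarray(pulls)].sum())

# Example: three arms, four pulls; only pulls of suboptimal arms add regret.
regret = cumulative_regret([0.5, 0.2, 0.4], [0, 1, 2, 1])
```

Taking the supremum of this quantity (in expectation) over all admissible reward distributions gives the worst-case regret.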

### II-B Problem Description: Heavy-tailed Stochastic MAB

In this paper, we study the heavy-tailed stochastic MAB problem, which is the stochastic MAB problem with the following assumptions.

###### Assumption 1

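Let $X$ be a random reward drawn from any arm $k \in \{1,\dots,K\}$. There exists a constant $u \in \mathbb{R}_{>0}$ such that $\mathbb{E}\big[|X|^{1+\epsilon}\big] \le u^{1+\epsilon}$ for some $\epsilon \in (0,1]$.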

###### Assumption 2

Parameters , , and are known.

### II-C MOSS Algorithm for Worst-Case Regret

We now present the MOSS algorithm proposed in [1]. The MOSS algorithm is designed for the stochastic MAB problem with bounded rewards; in this paper, we extend it to design the Robust MOSS algorithm for heavy-tailed bandits.

Suppose that arm $k$ is sampled $n_k(t)$ times until time $t-1$ and that $\bar{\mu}^k_{n_k(t)}$ is the associated empirical mean. Then, at time $t$, MOSS picks the arm that maximizes the following UCB

$$g^k_{n_k(t)} = \bar{\mu}^k_{n_k(t)} + \sqrt{\frac{\max\Big(\ln\Big(\frac{T}{K n_k(t)}\Big),\,0\Big)}{n_k(t)}}.$$

If the rewards from the arms have bounded support $[0,1]$, then the worst-case regret for MOSS satisfies $R_T^{\text{worst}} \in O(\sqrt{KT})$, which is order optimal [1]. Meanwhile, MOSS maintains a logarithmic distribution-dependent regret bound.
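The MOSS index above is simple to evaluate. The sketch below is a minimal, non-authoritative rendering of that formula; the function name `moss_index` and its argument names are assumptions made for illustration.

```python
import math

def moss_index(emp_mean, n_k, T, K):
    """MOSS upper confidence bound for an arm sampled n_k >= 1 times.

    emp_mean: empirical mean of the arm's rewards.
    T, K: time horizon and number of arms.
    """
    # The logarithm is clipped at zero, so the bonus vanishes once n_k >= T / K.
    bonus = math.sqrt(max(math.log(T / (K * n_k)), 0.0) / n_k)
    return emp_mean + bonus

# An arm sampled once receives a large exploration bonus, while an arm
# sampled at least T / K times receives none.
fresh = moss_index(0.5, n_k=1, T=100, K=10)
saturated = moss_index(0.5, n_k=50, T=100, K=10)
```

Compared with UCB1, the bonus depends on $T/(K n_k)$ rather than on the current time, which is what removes the extra logarithmic factor in the worst case.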

### II-D A Lower Bound for Heavy-tailed Minimax Regret

We now present the lower bound on the minimax regret for the heavy-tailed bandit problem derived in [9].

###### Theorem 1 ([9, Th. 2])

For any fixed time horizon $T$ and the stochastic MAB problem under Assumptions 1 and 2 with $u = 1$,

$$R_T^{\text{worst}} \ge 0.01\, K^{\frac{\epsilon}{1+\epsilon}}\, T^{\frac{1}{1+\epsilon}}.$$

###### Remark 1

Since $R_T$ scales with $u$, the lower bound for the heavy-tailed bandit is $\Omega\big(u K^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}\big)$. This lower bound also indicates that within a finite horizon $T$, it is almost impossible to differentiate the optimal arm from arm $k$ if $\Delta_k \in O\big(u (K/T)^{\frac{\epsilon}{1+\epsilon}}\big)$.

## III A Robust Minimax Policy

To deal with heavy-tailed reward distributions, we replace the empirical mean with a saturated empirical mean. Although the saturated empirical mean is a biased estimator, it has better convergence properties. We construct a novel UCB index to evaluate the arms, and at each time slot the arm with the maximum UCB is picked.

### III-A Robust MOSS

In Robust MOSS, we consider a robust mean estimator called the saturated empirical mean, which is formally defined in the following subsection. Let $n_k(t)$ be the number of times that arm $k$ has been selected until time $t-1$. At time $t$, let $\hat{\mu}^k_{n_k(t)}$ be the saturated empirical mean reward computed from the $n_k(t)$ samples at arm $k$. Robust MOSS initializes by selecting each arm once and subsequently, at each time $t \in \{K+1,\dots,T\}$, selects the arm that maximizes the following UCB

$$g^k_{n_k(t)} = \hat{\mu}^k_{n_k(t)} + (1+\eta)\, c_{n_k(t)},$$

where $\eta > 0$ is an appropriate constant, $c_n = u \times [\phi(n)]^{\frac{\epsilon}{1+\epsilon}}$, and

$$\phi(n) = \frac{\ln_+\big(\frac{T}{Kn}\big)}{n},$$

where $\ln_+(x) := \max(\ln x, 1)$.
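The pieces of the Robust MOSS index can be sketched as follows. Two points are assumptions inferred from the analysis rather than taken from a surviving definition: $c_n = u[\phi(n)]^{\epsilon/(1+\epsilon)}$ is read off equation (2), and $\ln_+(x) = \max(\ln x, 1)$ is chosen because the proofs use $\phi(n) \ge 1/n$. All function names are hypothetical.

```python
import math

def ln_plus(x):
    # Assumed form: logarithm clipped from below at 1, so that phi(n) >= 1 / n.
    return max(math.log(x), 1.0)

def phi(n, T, K):
    """phi(n) = ln_+(T / (K n)) / n, decreasing in n."""
    return ln_plus(T / (K * n)) / n

def confidence_width(n, T, K, u, eps):
    """c_n = u * phi(n)^(eps / (1 + eps)), read off equation (2)."""
    return u * phi(n, T, K) ** (eps / (1.0 + eps))

def robust_moss_index(sat_mean, n_k, T, K, u, eps, eta):
    """UCB g = saturated mean + (1 + eta) * c_{n_k}."""
    return sat_mean + (1.0 + eta) * confidence_width(n_k, T, K, u, eps)

# Example with eps = 1 (finite second moment) and moment bound u = 2.
index = robust_moss_index(0.3, n_k=100, T=100, K=10, u=2.0, eps=1.0, eta=2.2)
```

For $\epsilon = 1$ the width reduces to $u\sqrt{\phi(n)}$, which has the same shape as the MOSS bonus up to the clipping level of the logarithm.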

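### III-B Saturated Empirical Mean

The robust saturated empirical mean is similar to the truncated empirical mean used in [9], which is employed to extend UCB1 to achieve logarithmic distribution-dependent regret for the heavy-tailed MAB problem. Let $\{X_i\}_{i \in \{1,\dots,m\}}$ be a sequence of i.i.d. random variables with mean $\mu$ and $\mathbb{E}\big[|X_i|^{1+\epsilon}\big] \le u^{1+\epsilon}$, where $u > 0$. Pick $a > 1$ and let $h(m) = a^{\lfloor \log_a(m) \rfloor + 1}$. Define the saturation point $B_m$ by

$$B_m := u \times [\phi(h(m))]^{-\frac{1}{1+\epsilon}}.$$

Then, the saturated empirical mean estimator is defined by

$$\hat{\mu}_m := \frac{1}{m}\sum_{i=1}^{m} \operatorname{sat}(X_i, B_m), \tag{1}$$

where

$$\operatorname{sat}(X_i, B_m) := \operatorname{sign}(X_i)\,\min\{|X_i|, B_m\}.$$

Define $d_i := \operatorname{sat}(X_i, B_m) - \mathbb{E}[\operatorname{sat}(X_i, B_m)]$. The following lemma examines the estimator bias and provides an upper bound on the error of the saturated empirical mean.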

###### Lemma 2 (Error of saturated empirical mean)

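For an i.i.d. sequence of random variables $\{X_i\}_{i \in \{1,\dots,m\}}$ such that $\mathbb{E}[X_i] = \mu$ and $\mathbb{E}\big[|X_i|^{1+\epsilon}\big] \le u^{1+\epsilon}$, the saturated empirical mean (1) satisfies

$$\Big|\hat{\mu}_m - \mu - \frac{1}{m}\sum_{i=1}^{m} d_i\Big| \le \frac{u^{1+\epsilon}}{B_m^{\epsilon}}.$$

By the definition of $d_i$, the error of the estimator $\hat{\mu}_m$ satisfies

$$\hat{\mu}_m - \mu = \frac{1}{m}\sum_{i=1}^{m}\big(\operatorname{sat}(X_i, B_m) - \mu\big) = \frac{1}{m}\sum_{i=1}^{m} d_i + \frac{1}{m}\sum_{i=1}^{m}\big(\mathbb{E}[\operatorname{sat}(X_i, B_m)] - \mu\big),$$

where the second term is the bias of $\hat{\mu}_m$. We now compute an upper bound on the bias.

$$\big|\mathbb{E}[\operatorname{sat}(X_i, B_m)] - \mu\big| \le \mathbb{E}\big[|X_i|\,\mathbf{1}\{|X_i| > B_m\}\big] \le \mathbb{E}\Big[\frac{|X_i|^{1+\epsilon}}{(B_m)^{\epsilon}}\Big] \le \frac{u^{1+\epsilon}}{(B_m)^{\epsilon}},$$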

which concludes the proof.
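The estimator (1) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: saturation is taken to be symmetric clipping to $[-B_m, B_m]$, $h(m)$ is taken to be the smallest power $a^j$ ($j \ge 1$) strictly larger than $m$, which agrees with $h(a^s) = a^{s+1}$ used in the analysis, and $\ln_+(x) = \max(\ln x, 1)$. The names `next_power` and `saturated_mean` are hypothetical.

```python
import math
import numpy as np

def phi(n, T, K):
    # Assumed ln_+(x) = max(ln x, 1), so that phi(n) >= 1 / n.
    return max(math.log(T / (K * n)), 1.0) / n

def next_power(m, a):
    """h(m): smallest a^j with j >= 1 and a^j > m, computed without logarithms."""
    p = a
    while p <= m:
        p *= a
    return p

def saturated_mean(x, T, K, u, eps, a):
    """Return the saturated empirical mean of x and the saturation point B_m."""
    x = np.asarray(x, dtype=float)
    m = x.size
    B = u * phi(next_power(m, a), T, K) ** (-1.0 / (1.0 + eps))
    # Clipping to [-B, B] equals sign(x) * min(|x|, B).
    return float(np.clip(x, -B, B).mean()), B

# A single extreme sample is clipped to B instead of dominating the mean.
mu_hat, B = saturated_mean([0.1, -0.2, 1000.0], T=1000, K=2, u=1.0, eps=1.0, a=2.0)
```

Because $B_m$ grows with $m$, the clipping becomes less aggressive as more samples arrive, which is how the bias bound $u^{1+\epsilon}/B_m^{\epsilon}$ of Lemma 2 shrinks over time.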

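We now establish properties of $d_i$.

###### Lemma 3 (Properties of $d_i$)

For any $i \in \{1,\dots,m\}$, $d_i$ satisfies (i) $|d_i| \le 2B_m$ and (ii) $\mathbb{E}[d_i^2] \le u^{1+\epsilon} B_m^{1-\epsilon}$.

Property (i) follows immediately from the definition of $d_i$, and property (ii) follows from

$$\mathbb{E}[d_i^2] \le \mathbb{E}\big[\operatorname{sat}^2(X_i, B_m)\big] \le \mathbb{E}\big[|X_i|^{1+\epsilon} B_m^{1-\epsilon}\big].$$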
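Putting the index of Section III-A and the estimator of Section III-B together gives the following end-to-end sketch of Robust MOSS. It is a simplified illustration under assumptions: `sample` is a hypothetical callback returning one reward from a given arm, saturation is symmetric clipping, $\ln_+(x) = \max(\ln x, 1)$, and all statistics are recomputed from scratch at every step for readability rather than efficiency.

```python
import math
import numpy as np

def phi(n, T, K):
    # Assumed ln_+(x) = max(ln x, 1).
    return max(math.log(T / (K * n)), 1.0) / n

def next_power(m, a):
    """h(m): smallest a^j with j >= 1 and a^j > m."""
    p = a
    while p <= m:
        p *= a
    return p

def robust_moss(sample, K, T, u, eps, eta, a):
    """Run Robust MOSS for T rounds and return the list of pulled arms.

    sample(k) -> float is a hypothetical reward oracle for arm k (0-based).
    """
    rewards = [[] for _ in range(K)]
    pulls = []
    for t in range(T):
        if t < K:
            k = t  # initialization: select each arm once
        else:
            index = np.empty(K)
            for j in range(K):
                x = np.asarray(rewards[j], dtype=float)
                m = x.size
                # Saturation point B_m and saturated empirical mean (1).
                B = u * phi(next_power(m, a), T, K) ** (-1.0 / (1.0 + eps))
                mu_hat = np.clip(x, -B, B).mean()
                # Confidence width c_m and UCB index.
                c_m = u * phi(m, T, K) ** (eps / (1.0 + eps))
                index[j] = mu_hat + (1.0 + eta) * c_m
            k = int(np.argmax(index))
        rewards[k].append(sample(k))
        pulls.append(k)
    return pulls

# Example: heavy-tailed Student-t noise (3 degrees of freedom) around the means.
rng = np.random.default_rng(0)
means = [0.0, 1.0]
pulls = robust_moss(lambda k: means[k] + rng.standard_t(3), K=2, T=500,
                    u=2.0, eps=1.0, eta=2.2, a=1.1)
```

In the example, a Student-t variable with 3 degrees of freedom has variance 3, so $\mathbb{E}[X^2] \le 4$ for both arms and $u = 2$ is a valid moment bound with $\epsilon = 1$. The values $a = 1.1$ and $\eta = 2.2$ are illustrative choices.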

## IV Analysis of Robust MOSS

In this section, we analyze Robust MOSS to provide both distribution-free and distribution-dependent regret bounds.

### IV-A Properties of Saturated Empirical Mean Estimator

To derive the concentration property of saturated empirical mean, we use a maximal Bennett type inequality as shown in Lemma 4.

###### Lemma 4 (Maximal Bennett’s inequality [10])

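Let $\{X_i\}_{i \in \{1,\dots,n\}}$ be a sequence of bounded random variables with support $[-B, B]$, where $B \ge 0$. Suppose that $\mathbb{E}[X_i \mid X_1,\dots,X_{i-1}] = \mu_i$ and $\operatorname{Var}[X_i \mid X_1,\dots,X_{i-1}] \le v$. Let $S_m = \sum_{i=1}^{m}(X_i - \mu_i)$ for any $m \in \{1,\dots,n\}$. Then, for any $\delta \ge 0$,

$$\mathbb{P}\big(\exists\, m \in \{1,\dots,n\} : S_m \ge \delta\big) \le \exp\Big(-\frac{\delta}{B}\,\psi\Big(\frac{B\delta}{nv}\Big)\Big),$$

where $\psi(x) = (1 + 1/x)\ln(1+x) - 1$.

###### Remark 2

For $x \in (0, +\infty)$, the function $\psi(x)$ is monotonically increasing in $x$.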
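The analysis below repeatedly uses the condition $\eta\,\psi(2\eta/a) \ge 2a$ on the tuning constants. The sketch here evaluates $\psi$ and searches for the smallest admissible $\eta$ for a given $a$ by bisection. It assumes the standard Bennett form $\psi(x) = (1 + 1/x)\ln(1+x) - 1$; the helper name `min_eta` is hypothetical.

```python
import math

def psi(x):
    """Bennett-type function psi(x) = (1 + 1/x) ln(1 + x) - 1 for x > 0 (assumed form)."""
    return (1.0 + 1.0 / x) * math.log1p(x) - 1.0

def min_eta(a, tol=1e-9):
    """Smallest eta (up to tol) with eta * psi(2 eta / a) >= 2 a.

    The left-hand side is increasing in eta because psi is positive and
    increasing, so bisection on eta is valid.
    """
    def slack(eta):
        return eta * psi(2.0 * eta / a) - 2.0 * a

    lo, hi = 1e-6, 1.0
    while slack(hi) < 0.0:        # grow the bracket until the condition holds
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if slack(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return hi

eta = min_eta(1.1)   # any eta at or above this value satisfies the condition for a = 1.1
```

Smaller values of $a$ give a finer peeling grid and admit smaller $\eta$, at the cost of a larger factor $a/\ln(a)$ in the bound of Lemma 5.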

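Now, we establish an upper bound on the probability that the UCB underestimates the mean at arm $k$ by an amount $x$.

###### Lemma 5

For any arm $k \in \{1,\dots,K\}$ and any $t \in \{K+1,\dots,T\}$ and $x > 0$, if $\eta\,\psi(2\eta/a) \ge 2a$, the probability of event $\{g^k_{n_k(t)} \le \mu_k - x\}$ is no greater than

$$\frac{K}{T}\,\frac{a}{\ln(a)}\,\Gamma\Big(\frac{1}{\epsilon}+2\Big)\Big(\frac{\psi(2\eta/a)}{2a}\,\frac{x}{u}\Big)^{-\frac{1+\epsilon}{\epsilon}}.$$

It follows from Lemma 2 that

$$\begin{aligned}
\mathbb{P}\big(g^k_{n_k(t)} \le \mu_k - x\big)
&\le \mathbb{P}\big(\exists\, m \in \{1,\dots,T\} : \hat{\mu}^k_m + (1+\eta)c_m \le \mu_k - x\big) \\
&\le \mathbb{P}\Big(\exists\, m \in \{1,\dots,T\} : \frac{1}{m}\sum_{i=1}^{m} d^k_i \le \frac{u^{1+\epsilon}}{B_m^{\epsilon}} - (1+\eta)c_m - x\Big) \\
&\le \mathbb{P}\Big(\exists\, m \in \{1,\dots,T\} : \frac{1}{m}\sum_{i=1}^{m} d^k_i \le -x - \eta c_m\Big),
\end{aligned}$$

where $d^k_i$ is defined similarly to $d_i$ for the i.i.d. reward sequence at arm $k$, and the last inequality is due to

$$\frac{u^{1+\epsilon}}{B_m^{\epsilon}} = u\,[\phi(h(m))]^{\frac{\epsilon}{1+\epsilon}} \le u\,[\phi(m)]^{\frac{\epsilon}{1+\epsilon}} = c_m. \tag{2}$$

Recall $a > 1$. We apply a peeling argument [11, Sec 2.2] with geometric grid $a^s \le m < a^{s+1}$ over the time interval $\{1,\dots,T\}$. Since $c_m$ is monotonically decreasing with $m$,

$$\mathbb{P}\Big(\exists\, m \in \{1,\dots,T\} : \frac{1}{m}\sum_{i=1}^{m} d^k_i \le -x - \eta c_m\Big) \le \sum_{s \ge 0} \mathbb{P}\Big(\exists\, m \in [a^s, a^{s+1}) : \sum_{i=1}^{m} d^k_i \le -a^s\big(x + \eta c_{a^{s+1}}\big)\Big).$$

Also notice that $B_m = B_{a^s}$ for all $m \in [a^s, a^{s+1})$. Then, with the properties in Lemma 3, we apply Lemma 4 to get

$$\begin{aligned}
&\sum_{s \ge 0} \mathbb{P}\Big(\exists\, m \in [a^s, a^{s+1}) : \sum_{i=1}^{m} d^k_i \le -a^s\big(x + \eta c_{a^{s+1}}\big)\Big) \\
&\quad\le \sum_{s \ge 0} \exp\Big(-\frac{a^s (x + \eta c_{a^{s+1}})}{2B_{a^s}}\,\psi\Big(\frac{2B_{a^s}(x + \eta c_{a^{s+1}})}{a\,u^{1+\epsilon} B_{a^s}^{1-\epsilon}}\Big)\Big) \\
&\quad\le \sum_{s \ge 0} \exp\Big(-\frac{a^s (x + \eta c_{a^{s+1}})}{2B_{a^s}}\,\psi\Big(\frac{2\eta B_{a^s}^{\epsilon} c_{a^{s+1}}}{a\,u^{1+\epsilon}}\Big)\Big) && \text{(since $\psi(x)$ is monotonically increasing)} \\
&\quad= \sum_{s \ge 1} \exp\Big(-a^s\Big(\frac{x}{B_{a^{s-1}}} + \eta\,\phi(a^s)\Big)\frac{\psi(2\eta/a)}{2a}\Big) && \text{(substituting $c_{a^{s+1}}$, $B_{a^s}$ and using $h(a^s) = a^{s+1}$)} \\
&\quad\le \frac{K}{T}\sum_{s \ge 1} a^s \exp\Big(-\frac{a^s x}{B_{a^{s-1}}}\,\frac{\psi(2\eta/a)}{2a}\Big) && \text{(since $\eta\,\psi(2\eta/a) \ge 2a$)}. \qquad (3)
\end{aligned}$$

Let $b = \frac{x\,\psi(2\eta/a)}{2au}$. Since $B_{a^{s-1}} \le u\,a^{\frac{s}{1+\epsilon}}$ for all $s \ge 1$,

$$\begin{aligned}
\frac{K}{T}\sum_{s \ge 1} a^s \exp\big(-b\,a^{\frac{\epsilon s}{1+\epsilon}}\big)
&\le \frac{K}{T}\int_{1}^{+\infty} a^{y} \exp\big(-b\,a^{\frac{(y-1)\epsilon}{1+\epsilon}}\big)\,dy \\
&= \frac{K}{T}\,a\int_{0}^{+\infty} a^{y} \exp\big(-b\,a^{\frac{y\epsilon}{1+\epsilon}}\big)\,dy \\
&= \frac{K}{T}\,\frac{a}{\ln(a)}\,\frac{1+\epsilon}{\epsilon}\,b^{-\frac{1+\epsilon}{\epsilon}}\int_{b}^{+\infty} z^{\frac{1+\epsilon}{\epsilon}-1}\exp(-z)\,dz && \text{(where we set $z = b\,a^{\frac{y\epsilon}{1+\epsilon}}$)} \\
&\le \frac{K}{T}\,\frac{a}{\ln(a)}\,\Gamma\Big(\frac{1}{\epsilon}+2\Big)\,b^{-\frac{1+\epsilon}{\epsilon}},
\end{aligned}$$

which concludes the proof.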

The following is a straightforward corollary of Lemma 5.

###### Corollary 6

For any arm and any and , if , the probability of event shares the same bound in Lemma 5.

### IV-B Distribution Free Regret Bound

The distribution free upper bound for Robust MOSS, which is the main result for the paper, is presented in this section. We show that the algorithm achieves order optimal worst case regret.

###### Theorem 7

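For the heavy-tailed stochastic MAB problem with $K$ arms and time horizon $T$, if $\eta$ and $a$ are selected such that $\eta\,\psi(2\eta/a) \ge 2a$, then Robust MOSS satisfies

$$R_T^{\text{worst}} \le C u K^{\frac{\epsilon}{1+\epsilon}} (T/e)^{\frac{1}{1+\epsilon}} + 2uK,$$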

where .

Since both the UCB and the regret scale with $u$ defined in Assumption 1, to simplify the expressions, we assume $u = 1$. Also notice that Assumption 1 implies $|\mu_k| \le u$, so $\Delta_k \le 2$ for any $k \in \{1,\dots,K\}$. Furthermore, any terms with superscript or subscript "$*$" and "$k$" are with respect to the best and the $k$-th arm, respectively. The proof is divided into four steps.

Step 1: We follow a decoupling technique inspired by the proof of the regret upper bound for MOSS [1]. Take the set of $\delta$-bad arms $\mathcal{B}_\delta$ as

$$\mathcal{B}_\delta := \{k \in \{1,\dots,K\} \mid \Delta_k > \delta\}, \tag{4}$$

where we assign . Thus,

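$$R_T \le T\delta + \sum_{k=1}^{K}\Delta_k + \mathbb{E}\Big[\sum_{t=K+1}^{T} \mathbf{1}\{\varphi_t \in \mathcal{B}_\delta\}\big(\Delta_{\varphi_t} - \delta\big)\Big]. \tag{5}$$

Furthermore, we make the following decomposition

$$\begin{aligned}
\sum_{t=K+1}^{T} \mathbf{1}\{\varphi_t \in \mathcal{B}_\delta\}\big(\Delta_{\varphi_t} - \delta\big)
&\le \sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t \in \mathcal{B}_\delta,\ g^*_{n^*(t)} \le \mu^* - \frac{\Delta_{\varphi_t}}{3}\Big\}\big(\Delta_{\varphi_t} - \delta\big) && (6) \\
&\quad + \sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t \in \mathcal{B}_\delta,\ g^*_{n^*(t)} > \mu^* - \frac{\Delta_{\varphi_t}}{3}\Big\}\big(\Delta_{\varphi_t} - \delta\big).
\end{aligned}$$

Notice that (6) describes the regret from underestimating the optimal arm $*$. For the second summand, since $g^{\varphi_t}_{n_{\varphi_t}(t)} \ge g^*_{n^*(t)}$,

$$\begin{aligned}
\sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t \in \mathcal{B}_\delta,\ g^*_{n^*(t)} > \mu^* - \frac{\Delta_{\varphi_t}}{3}\Big\}\big(\Delta_{\varphi_t} - \delta\big)
&\le \sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t \in \mathcal{B}_\delta,\ g^{\varphi_t}_{n_{\varphi_t}(t)} > \mu_{\varphi_t} + \frac{2\Delta_{\varphi_t}}{3}\Big\}\Delta_{\varphi_t} \\
&= \sum_{k \in \mathcal{B}_\delta}\sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t = k,\ g^k_{n_k(t)} > \mu_k + \frac{2\Delta_k}{3}\Big\}\Delta_k, && (7)
\end{aligned}$$

which characterizes the regret caused by overestimating $\delta$-bad arms.

Step 2: In this step, we bound the expectation of (6). When the event $\big\{\varphi_t \in \mathcal{B}_\delta,\ g^*_{n^*(t)} \le \mu^* - \frac{\Delta_{\varphi_t}}{3}\big\}$ happens, we know

$$\Delta_{\varphi_t} \le 3\mu^* - 3g^*_{n^*(t)} \quad\text{and}\quad g^*_{n^*(t)} < \mu^* - \frac{\delta}{3}.$$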

Thus, we get

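$$\mathbf{1}\Big\{\varphi_t \in \mathcal{B}_\delta,\ g^*_{n^*(t)} \le \mu^* - \frac{\Delta_{\varphi_t}}{3}\Big\}\big(\Delta_{\varphi_t} - \delta\big) \le \mathbf{1}\Big\{g^*_{n^*(t)} < \mu^* - \frac{\delta}{3}\Big\}\big(3\mu^* - 3g^*_{n^*(t)} - \delta\big) =: Y_t.$$

Since $Y_t$ is a nonnegative random variable, its expected value can be computed from its cumulative distribution function alone:

$$\mathbb{E}[Y_t] = \int_{0}^{+\infty} \mathbb{P}(Y_t > x)\,dx \le \int_{0}^{+\infty} \mathbb{P}\big(3\mu^* - 3g^*_{n^*(t)} - \delta > x\big)\,dx = \int_{\delta}^{+\infty} \mathbb{P}\Big(\mu^* - g^*_{n^*(t)} > \frac{x}{3}\Big)\,dx.$$

Then we apply Lemma 5 at the optimal arm $*$ to get

$$\mathbb{E}[Y_t] \le \frac{K C_1}{T}\int_{\delta}^{+\infty} \frac{1}{\epsilon}\,x^{-\frac{1+\epsilon}{\epsilon}}\,dx = \frac{K C_1}{T\,\delta^{\frac{1}{\epsilon}}},$$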

where . We conclude this step by

$$\mathbb{E}[(6)] \le \sum_{t=K+1}^{T} \mathbb{E}[Y_t] \le C_1 K \delta^{-\frac{1}{\epsilon}}.$$

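Step 3: In this step, we bound the expectation of (7). For each arm $k \in \mathcal{B}_\delta$,

$$\begin{aligned}
\sum_{t=K+1}^{T} \mathbf{1}\Big\{\varphi_t = k,\ g^k_{n_k(t)} \ge \mu_k + \frac{2\Delta_k}{3}\Big\}
&= \sum_{t=K+1}^{T}\sum_{m=1}^{t-K} \mathbf{1}\{\varphi_t = k,\ n_k(t) = m\}\,\mathbf{1}\Big\{g^k_m \ge \mu_k + \frac{2\Delta_k}{3}\Big\} \\
&= \sum_{m=1}^{T-K} \mathbf{1}\Big\{g^k_m \ge \mu_k + \frac{2\Delta_k}{3}\Big\}\sum_{t=m+K}^{T} \mathbf{1}\{\varphi_t = k,\ n_k(t) = m\} \\
&\le \sum_{m=1}^{T} \mathbf{1}\Big\{g^k_m \ge \mu_k + \frac{2\Delta_k}{3}\Big\} \\
&\le \sum_{m=1}^{T} \mathbf{1}\Big\{\frac{1}{m}\sum_{i=1}^{m} d^k_i \ge \frac{2\Delta_k}{3} - (2+\eta)c_m\Big\}, && (8)
\end{aligned}$$

where in the last inequality we apply Lemma 2 and $u^{1+\epsilon}/B_m^{\epsilon} \le c_m$ in (2). We set

$$l_k = \left\lceil \Big(\frac{6+3\eta}{\Delta_k}\Big)^{\frac{1+\epsilon}{\epsilon}} \ln\Big(\frac{T}{K}\Big(\frac{\Delta_k}{6+3\eta}\Big)^{\frac{1+\epsilon}{\epsilon}}\Big) \right\rceil.$$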

With , we get is no less than

Furthermore, since $c_m$ is monotonically decreasing with $m$, for $m \ge l_k$,

$$c_m \le \frac{\Delta_k}{6+3\eta}. \tag{9}$$

With this result and , we continue from (8) to get

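$$\begin{aligned}
\mathbb{E}[(8)] &\le l_k - 1 + \sum_{m=l_k}^{T} \mathbb{P}\Big\{\frac{1}{m}\sum_{i=1}^{m} d^k_i \ge \frac{2\Delta_k}{3} - (2+\eta)c_m\Big\} \\
&\le l_k - 1 + \sum_{m=l_k}^{T} \mathbb{P}\Big\{\frac{1}{m}\sum_{i=1}^{m} d^k_i \ge \frac{\Delta_k}{3}\Big\}. && (10)
\end{aligned}$$

Therefore, by using Lemma 4 together with (ii) from Lemma 3, we get

$$\begin{aligned}
\sum_{m=l_k}^{T} \mathbb{P}\Big\{\frac{1}{m}\sum_{i=1}^{m} d^k_i \ge \frac{\Delta_k}{3}\Big\}
&\le \sum_{m=l_k}^{T} \exp\Big(-\frac{m\Delta_k}{3B_m}\,\psi\big(B_m^{\epsilon}\Delta_k\big)\Big) \\
&\le \sum_{m=l_k}^{T} \exp\Big(-\frac{m\Delta_k}{3B_m}\,\psi(6+3\eta)\Big) \\
&\le \sum_{m=1}^{T} \exp\Big(-m^{\frac{\epsilon}{1+\epsilon}}\,a^{-\frac{1}{1+\epsilon}}\,\psi(6+3\eta)\,\frac{\Delta_k}{3}\Big). && (11)
\end{aligned}$$

The second inequality holds since $\psi(x)$ is monotonically increasing and $B_m^{\epsilon}\Delta_k \ge (6+3\eta)B_m^{\epsilon}c_m \ge 6+3\eta$ due to (9) and (2). The third inequality holds since $B_m = \phi(h(m))^{-\frac{1}{1+\epsilon}} \le \phi(am)^{-\frac{1}{1+\epsilon}} \le (am)^{\frac{1}{1+\epsilon}}$.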

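Let $\beta = a^{-\frac{1}{1+\epsilon}}\,\psi(6+3\eta)\,\Delta_k/3$. Then we have

$$\begin{aligned}
\int_{0}^{+\infty} \exp\big(-\beta y^{\frac{\epsilon}{1+\epsilon}}\big)\,dy
&= \frac{1+\epsilon}{\epsilon}\,\beta^{-\frac{1+\epsilon}{\epsilon}}\int_{0}^{+\infty} z^{\frac{1+\epsilon}{\epsilon}-1}\exp(-z)\,dz && \text{(where we set $z = \beta y^{\frac{\epsilon}{1+\epsilon}}$)} \\
&= \Gamma\Big(\frac{1}{\epsilon}+2\Big)\,\beta^{-\frac{1+\epsilon}{\epsilon}}.
\end{aligned}$$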
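The closed form of this integral can be sanity-checked numerically. The sketch below compares a plain trapezoidal approximation with $\Gamma(1/\epsilon + 2)\,\beta^{-(1+\epsilon)/\epsilon}$; the grid size, the truncation point, and the function names are arbitrary illustrative choices.

```python
import math
import numpy as np

def stretched_exp_integral(beta, eps, n=2_000_001):
    """Trapezoidal approximation of the integral of exp(-beta * y^(eps/(1+eps))) over y >= 0."""
    p = eps / (1.0 + eps)
    # Truncate where the integrand has dropped to exp(-40).
    y_max = (40.0 / beta) ** (1.0 / p)
    y = np.linspace(0.0, y_max, n)
    f = np.exp(-beta * y ** p)
    return float(np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y)))

def closed_form(beta, eps):
    """Gamma(1/eps + 2) * beta^(-(1+eps)/eps), as in the display above."""
    return math.gamma(1.0 / eps + 2.0) * beta ** (-(1.0 + eps) / eps)

numeric = stretched_exp_integral(1.0, 1.0)
exact = closed_form(1.0, 1.0)   # Gamma(3) = 2 for eps = 1 and beta = 1
```

For $\epsilon = 1$ the identity reduces to $\int_0^{+\infty} e^{-\beta\sqrt{y}}\,dy = 2/\beta^2$, which is easy to verify by hand with the substitution $y = s^2$.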

Plugging it into (10),

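$$\mathbb{E}[(8)] \le C_2\,\Delta_k^{-\frac{1+\epsilon}{\epsilon}} + C_3\,\Delta_k^{-\frac{1+\epsilon}{\epsilon}}\ln\Big(\frac{T}{K C_3}\,\Delta_k^{\frac{1+\epsilon}{\epsilon}}\Big),$$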

where and . Put it together with for all ,

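$$\begin{aligned}
\mathbb{E}[(7)] &\le \sum_{k \in \mathcal{B}_\delta}\Big[C_2\,\Delta_k^{-\frac{1}{\epsilon}} + C_3\,\Delta_k^{-\frac{1}{\epsilon}}\ln\Big(\frac{T}{K C_3}\,\Delta_k^{\frac{1+\epsilon}{\epsilon}}\Big)\Big] \\
&\le C_2 K \delta^{-\frac{1}{\epsilon}} + (1+\epsilon)\,e^{-\frac{\epsilon}{1+\epsilon}}\,C_3 K \delta^{-\frac{1}{\epsilon}},
\end{aligned}$$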

where we use the fact that takes its maximum at .

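Step 4: Plugging the results of Step 2 and Step 3 into (5),

$$R_T^{\text{worst}} \le T\delta + \Big[C_1 + C_2 + (1+\epsilon)\,e^{-\frac{\epsilon}{1+\epsilon}}\,C_3\Big] K \delta^{-\frac{1}{\epsilon}} + 2K.$$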

Straightforward calculation concludes the proof.
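To see where the rate $K^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}$ comes from, one can minimize the leading part $T\delta + CK\delta^{-1/\epsilon}$ of the Step 4 bound over $\delta$. The sketch below uses the calculus minimizer, which may differ by a constant factor from the value of $\delta$ used in the proof. Here `C` is a hypothetical lumped constant standing for the bracketed sum of $C_1$, $C_2$, and $C_3$ terms, and the additive $2K$ is ignored.

```python
def regret_bound(delta, T, K, C, eps):
    """Leading part of the Step 4 bound: T * delta + C * K * delta^(-1/eps)."""
    return T * delta + C * K * delta ** (-1.0 / eps)

def best_delta(T, K, C, eps):
    """Stationary point of the convex bound: delta^((1+eps)/eps) = C K / (eps T)."""
    return (C * K / (eps * T)) ** (eps / (1.0 + eps))

T, K, C, eps = 10_000, 5, 3.0, 0.5
d_star = best_delta(T, K, C, eps)
bound = regret_bound(d_star, T, K, C, eps)

# Doubling the horizon scales the optimized bound by 2^(1 / (1 + eps)).
ratio = regret_bound(best_delta(2 * T, K, C, eps), 2 * T, K, C, eps) / bound
```

At the minimizer both terms are proportional to $(CK)^{\frac{\epsilon}{1+\epsilon}} T^{\frac{1}{1+\epsilon}}$, which matches the order of the lower bound in Theorem 1.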

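### IV-C Distribution Dependent Regret Upper Bound

We now show that Robust MOSS also preserves a logarithmic upper bound on the distribution-dependent regret.

###### Theorem 8

For the heavy-tailed stochastic MAB problem with $K$ arms and time horizon $T$, if $\eta\,\psi(2\eta/a) \ge 2a$, the regret for Robust MOSS is no greater than

$$\sum_{k : \Delta_k > 0} \Big(\frac{u^{1+\epsilon}}{\Delta_k}\Big)^{\frac{1}{\epsilon}}\Big[C_1 \ln\Big(\frac{T}{K C_1}\Big(\frac{\Delta_k}{u}\Big)^{\frac{1+\epsilon}{\epsilon}}\Big) + C_2 K\Big] + \Delta_k.$$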

where and .

Let and define the same as (4). Since for all , the regret satisfies

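$$\begin{aligned}
R_T &\le \sum_{k \notin \mathcal{B}_\delta} T\Delta_k + \sum_{t=1}^{T} \mathbf{1}\{\varphi_t \in \mathcal{B}_\delta\}\Delta_{\varphi_t} \\
&\le \sum_{k \notin \mathcal{B}_\delta} eK\Big(\frac{4+4\eta}{\Delta_k}\Big)^{\frac{1+\epsilon}{\epsilon}}\Delta_k + \sum_{k \in \mathcal{B}_\delta}\sum_{t=1}^{T} \mathbf{1}\{\varphi_t = k\}\Delta_k. && (12)
\end{aligned}$$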

Pick arbitrary , thus

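$$\begin{aligned}
\sum_{t=1}^{T} \mathbf{1}\{\varphi_t = k\} &\le l_k + \sum_{t=K+1}^{T} \mathbf{1}\{\varphi_t = k,\ n_k(t) \ge l_k\} \\
&\le l_k + \sum_{t=K+1}^{T} \mathbf{1}\big\{g^k_{n_k(t)} \ge g^*_{n^*(t)},\ n_k(t) \ge l_k\big\}.
\end{aligned}$$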

Observe that implies at least one of the following is true

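$$\begin{aligned}
g^*_{n^*(t)} &\le \mu^* - \frac{\Delta_k}{4}, && (13) \\
g^k_{n_k(t)} &\ge \mu_k + \frac{\Delta_k}{4} + 2(1+\eta)\,c_{n_k(t)}, && (14) \\
(1+\eta)\,c_{n_k(t)} &> \frac{\Delta_k}{4}. && (15)
\end{aligned}$$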

We select

 lk=⎡⎢ ⎢ ⎢⎢(4+4ηΔk)1+ϵϵln(TK(Δk4+4η)1