The dilemma of exploration versus exploitation is common in scenarios involving decision-making in unknown environments. In these contexts, exploration means learning the environment, while exploitation means taking the empirically best actions. When finite-time performance is concerned, i.e., when one cannot learn indefinitely, a good balance of exploration and exploitation is key to good performance. The multi-armed bandit (MAB) problem and its variations are prototypical models for these problems, and they are broadly applied in many areas such as economics, communication systems, and robotics.
The stochastic MAB problem was originally proposed by Robbins. In this problem, at each time slot an agent chooses an arm from a set of $K$ arms and receives a reward associated with that arm. The reward at each arm is a stationary random variable with unknown mean. The objective is to design a policy that maximizes the cumulative reward or, equivalently, minimizes the expected cumulative regret, defined as the difference between the expected cumulative reward obtained by selecting the arm with the maximum mean reward at each time and that obtained by following the designed policy.
The notion of expected cumulative regret can be generalized to the worst-case regret, which is defined as the supremum of the expected cumulative regret computed over all possible reward distributions within a certain class, such as distributions with bounded support or sub-Gaussian distributions. The minimax regret is defined as the minimum worst-case regret, where the minimum is computed over all policies. By construction, the worst-case regret uses minimal information about the underlying distributions, and the associated regret bounds are called distribution-free bounds. In contrast, the standard regret bounds depend on the differences between the mean rewards of the optimal arm and the suboptimal arms, and the corresponding bounds are referred to as distribution-dependent bounds.
In their seminal work, Lai and Robbins establish that the expected cumulative regret admits an asymptotic distribution-dependent lower bound that is a logarithmic function of the time horizon $T$. Here, asymptotic refers to the limit $T \to \infty$. They also propose a general method of constructing Upper Confidence Bound (UCB) based policies that attain the lower bound asymptotically. By assuming rewards to be bounded or, more generally, sub-Gaussian, several subsequent works design simpler algorithms with finite-time performance guarantees, e.g., the UCB1 algorithm by Auer et al. By using Kullback-Leibler (KL) divergence based upper confidence bounds, Garivier and Cappé design KL-UCB, which is proved to have efficient finite-time performance as well as asymptotic optimality.
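As a concrete illustration of such index policies, the arm-selection step of a UCB1-style algorithm can be sketched as follows (a minimal sketch with hypothetical variable names; the exploration bonus follows Auer et al.'s $\sqrt{2\ln t / s}$ form):

```python
import math

def ucb1_index(mean, pulls, t):
    """UCB1 index: empirical mean plus exploration bonus sqrt(2 ln t / s)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)

def select_arm(means, counts, t):
    """Pick the arm maximizing the UCB1 index; unpulled arms are tried first."""
    for i, s in enumerate(counts):
        if s == 0:
            return i
    return max(range(len(means)), key=lambda i: ucb1_index(means[i], counts[i], t))
```

Note how an arm pulled only once can still win the index competition against an arm with a much larger empirical mean, which is exactly the exploration mechanism the lower bound requires.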
In the worst-case setting, the lower and upper bounds are distribution-free. Assuming the rewards are bounded, Audibert and Bubeck establish a lower bound on the minimax regret. They also study a modified UCB algorithm called Minimax Optimal Strategy in the Stochastic case (MOSS) and prove that it achieves an order-optimal worst-case regret while maintaining a logarithmic distribution-dependent regret. Degenne and Perchet extend MOSS to an anytime version called MOSS-anytime.
Bounded or sub-Gaussian rewards are a common assumption that gives the sample mean exponential concentration and simplifies the MAB problem. However, in many applications, such as social networks and financial markets, the rewards are heavy-tailed. For the standard stochastic MAB problem, Bubeck et al. relax the sub-Gaussian assumption by only assuming the rewards to have finite moments of order $1+\epsilon$ for some $\epsilon \in (0,1]$. They present the Robust UCB algorithm and show that it attains an upper bound on the cumulative regret that is within a constant factor of the distribution-dependent lower bound in the heavy-tailed setting. However, to the best of our knowledge, the literature so far lacks an algorithm that provably achieves an order-optimal worst-case regret for heavy-tailed bandits: an extra polylogarithmic factor exists in the solutions provided in .
In this paper, we study the minimax heavy-tailed bandit problem. We propose and analyze the Robust MOSS algorithm and show that if the reward distributions admit moments of order $1+\epsilon$, with $\epsilon \in (0,1]$, then it achieves a minimax regret matching the lower bound while maintaining a logarithmic distribution-dependent regret. Our results build on techniques in  and , and augment them with new analysis based on maximal Bennett inequalities.
II Background & Problem Description
II-A Stochastic MAB Problem
In a stochastic MAB problem, an agent chooses an arm from a set of $K$ arms at each time and receives the associated reward. The reward at each arm $i$ is drawn from an unknown distribution with unknown mean $\mu_i$. Let the maximum mean reward among all arms be $\mu^* = \max_{i \in \{1,\dots,K\}} \mu_i$. We use the gap $\Delta_i = \mu^* - \mu_i$ to measure the suboptimality of arm $i$. The objective is to maximize the expected cumulative reward or, equivalently, to minimize the expected cumulative regret, defined as the difference between the expected cumulative reward obtained by selecting the arm with the maximum mean reward at each time and that obtained by the policy's arm selections.
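In standard notation, with $\varphi(t)$ denoting the arm selected by the policy at time $t$ and $n_i(T)$ the number of times arm $i$ is pulled up to the horizon $T$ (symbols assumed here for illustration), the expected cumulative regret reads:

```latex
R_T \;=\; T\mu^* - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{\varphi(t)}\right]
     \;=\; \sum_{i=1}^{K} \Delta_i\, \mathbb{E}\big[n_i(T)\big],
```

where the second equality is the standard regret decomposition obtained by grouping the pulls by arm.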
The expected cumulative regret is implicitly defined for a fixed reward distribution at each arm. The worst-case regret is the expected cumulative regret under the worst possible choice of reward distributions. In particular,
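Writing $R_T(\pi,\nu)$ for the regret of policy $\pi$ under the tuple of reward distributions $\nu = (\nu_1,\dots,\nu_K)$ drawn from a class $\mathcal{D}$ (notation assumed for illustration), the worst-case and minimax regrets are:

```latex
\bar{R}_T(\pi) \;=\; \sup_{\nu \in \mathcal{D}^K} R_T(\pi,\nu),
\qquad
R_T^{\mathrm{minimax}} \;=\; \inf_{\pi}\; \sup_{\nu \in \mathcal{D}^K} R_T(\pi,\nu).
```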
The regret associated with the policy that minimizes the above worst-case regret is called the minimax regret.
II-B Problem Description: Heavy-Tailed Stochastic MAB
In this paper, we study the heavy-tailed stochastic MAB problem, which is the stochastic MAB problem under the following assumptions.
Let $X$ be a random reward drawn from any arm. There exists a constant $u > 0$ such that $\mathbb{E}\big[|X|^{1+\epsilon}\big] \le u^{1+\epsilon}$ for some $\epsilon \in (0,1]$.
Parameters $u$ and $\epsilon$ and the time horizon $T$ are known.
II-C MOSS Algorithm for Worst-Case Regret
We now present the MOSS algorithm proposed in . MOSS is designed for the stochastic MAB problem with bounded rewards; in this paper, we extend it to design the Robust MOSS algorithm for heavy-tailed bandits.
Suppose that arm $i$ has been sampled $s$ times until time $t$ and $\hat{\mu}_{i,s}$ is the associated empirical mean. Then, at time $t$, MOSS picks the arm that maximizes the following UCB:
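In Audibert and Bubeck's standard form (with $\hat{\mu}_{i,s}$ the empirical mean of arm $i$ after $s$ pulls, $T$ the horizon, and $K$ the number of arms), the MOSS index reads:

```latex
\hat{\mu}_{i,s} + \sqrt{\frac{\max\left(\ln \frac{T}{K s},\, 0\right)}{s}}.
```

The truncation at zero deactivates the exploration bonus once an arm has been pulled more than $T/K$ times, which is the feature that removes the extra logarithmic factor from the worst-case regret.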
If the rewards from the arms have bounded support, then the worst-case regret of MOSS is of order $\sqrt{KT}$, which is order optimal. Meanwhile, MOSS maintains a logarithmic distribution-dependent regret bound.
II-D A Lower Bound for Heavy-Tailed Minimax Regret
We now present the lower bound on the minimax regret for the heavy-tailed bandit problem derived in .
Theorem 1 ([9, Th. 2])
This yields a lower bound of order $K^{\epsilon/(1+\epsilon)}\, T^{1/(1+\epsilon)}$ on the minimax regret for the heavy-tailed bandit. The lower bound also indicates that, within a finite horizon $T$, it is almost impossible to differentiate the optimal arm from arm $i$ if $\Delta_i$ is of order $(K/T)^{\epsilon/(1+\epsilon)}$ or smaller.
III A Robust Minimax Policy
To deal with heavy-tailed reward distributions, we replace the empirical mean with a saturated empirical mean. Although the saturated empirical mean is a biased estimator, it has better convergence properties. We construct a novel UCB index to evaluate the arms, and at each time slot the arm with the maximum UCB is picked.
III-A Robust MOSS
In Robust MOSS, we employ a robust mean estimator called the saturated empirical mean, which is formally defined in the following subsection. Let $n_i(t)$ be the number of times that arm $i$ has been selected until time $t$, and let $\bar{\mu}_i(t)$ be the saturated empirical mean reward computed from the $n_i(t)$ samples at arm $i$. Robust MOSS initializes by selecting each arm once and subsequently, at each time $t$, selects the arm that maximizes the following UCB,
where $a$ is an appropriate constant, and
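By analogy with the MOSS index, adapted to the heavy-tailed minimax rate, the UCB index can be expected to take the following shape (an illustrative sketch, not the exact index and constants; $\bar{\mu}_i(t)$ denotes the saturated empirical mean):

```latex
\bar{\mu}_i(t) + (1+a)\left(\frac{\max\left(\ln \frac{T}{K\, n_i(t)},\, 0\right)}{n_i(t)}\right)^{\frac{\epsilon}{1+\epsilon}}.
```

Note that for $\epsilon = 1$ (finite variance) the exponent reduces to $1/2$ and the index recovers the MOSS-type square-root radius.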
III-B Saturated Empirical Mean
The saturated empirical mean is similar to the truncated empirical mean used in , which is employed to extend UCB1 to achieve a logarithmic distribution-dependent regret for the heavy-tailed MAB problem. Let  be a sequence of i.i.d. random variables with mean  and , where . Pick  and let . Define the saturation point by
Then, the saturated empirical mean estimator is defined by
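A minimal sketch of such an estimator in code (the per-sample saturation schedule `B_t = (u * t) ** (1 / (1 + eps))` is a placeholder assumption for illustration, not the exact schedule above):

```python
def saturate(x, level):
    """Clip x to the interval [-level, level]."""
    return max(-level, min(level, x))

def saturated_mean(samples, eps, u=1.0):
    """Average of per-sample clipped rewards.

    The saturation point B_t grows with the sample index t, so early
    samples are clipped harder; clipping bounds the influence of rare
    very large rewards at the price of a small, controllable bias."""
    total = 0.0
    for t, x in enumerate(samples, start=1):
        B_t = (u * t) ** (1.0 / (1.0 + eps))
        total += saturate(x, B_t)
    return total / len(samples)
```

For instance, a single outlier of size 1000 among small rewards moves the plain empirical mean by hundreds, but moves the saturated mean by at most $B_t / s$.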
Define . The following lemma examines the estimator's bias and provides an upper bound on the error of the saturated empirical mean.
Lemma 2 (Error of saturated empirical mean)
For an i.i.d. sequence of random variables such that and , the saturated empirical mean (1) satisfies
Since , the error of the estimator satisfies
where the second term is the bias of . We now compute an upper bound on the bias.
which concludes the proof.
We now establish properties of .
Lemma 3 (Properties of )
For any , satisfies (i) (ii) .
Property (i) follows immediately from the definition of , and property (ii) follows from
IV Analysis of Robust MOSS
In this section, we analyze Robust MOSS to provide both distribution-free and distribution-dependent regret bounds.
IV-A Properties of the Saturated Empirical Mean Estimator
To derive the concentration property of the saturated empirical mean, we use a maximal Bennett-type inequality, stated in Lemma 4.
Lemma 4 (Maximal Bennett’s inequality )
Let be a sequence of bounded random variables with support , where . Suppose that and . Let for any . Then, for any
For , the function  is monotonically increasing in .
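For reference, the classical Bennett bound, from which a maximal version follows by a Doob-type martingale argument, has the following shape (a sketch under the stated boundedness assumptions, with $h(x) = (1+x)\ln(1+x) - x$, $B$ a bound on the support, and $V$ a bound on the variance of each $Y_t$):

```latex
\mathbb{P}\left(\max_{1 \le k \le n} \sum_{t=1}^{k} \big(Y_t - \mathbb{E}[Y_t]\big) \ge x\right)
\;\le\; \exp\left(-\frac{nV}{B^2}\, h\!\left(\frac{Bx}{nV}\right)\right).
```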
We now establish an upper bound on the probability that the UCB underestimates the mean at arm  by an amount .
For any arm  and any  and , if , the probability of the event  is no greater than
It follows from Lemma 2 that
where  is defined similarly to  for the i.i.d. reward sequence at arm , and the last inequality is due to
Recall . We apply a peeling argument [11, Sec. 2.2] with a geometric grid over the time interval . Since  is monotonically decreasing with ,
(substituting  and using )
Let . Since for all ,
which concludes the proof.
The following is a straightforward corollary of Lemma 5.
For any arm  and any  and , if , the probability of the event  satisfies the same bound as in Lemma 5.
IV-B Distribution-Free Regret Bound
The distribution-free upper bound for Robust MOSS, which is the main result of this paper, is presented in this section. We show that the algorithm achieves an order-optimal worst-case regret.
For the heavy-tailed stochastic MAB problem with $K$ arms and time horizon $T$, if  and  are selected such that , then Robust MOSS satisfies
Since both the UCB and the regret scale with $u$ defined in Assumption 1, to simplify the expressions, we assume $u = 1$. Also notice that Assumption 1 implies $|\mu_i| \le u$, so $\Delta_i \le 2$ for any $i$. Furthermore, any terms with superscript or subscript "$*$" and "$i$" are with respect to the best and the $i$-th arm, respectively. The proof is divided into four steps.
Step 1: We follow a decoupling technique inspired by the proof of the regret upper bound of MOSS . Take the set of -bad arms as
where we assign . Thus,
Furthermore, we make the following decomposition
Notice that (6) describes the regret from underestimating the optimal arm . For the second summand, since ,
which characterizes the regret caused by overestimating -bad arms.
Step 2: In this step, we bound the expectation of (6). When event happens, we know
Thus, we get
Since  is a positive random variable, its expected value can be computed using only its cumulative distribution function:
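This is the standard tail-integral identity for a nonnegative random variable $Z$:

```latex
\mathbb{E}[Z] \;=\; \int_{0}^{\infty} \mathbb{P}(Z > x)\, dx,
```

which lets the tail bound of Lemma 5 be integrated directly into a bound on the expectation.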
Then we apply Lemma 5 at the optimal arm to get
where . We conclude this step by
Step 3: In this step, we bound the expectation of (7). For each arm ,
With , we get is no less than
Furthermore, since is monotonically decreasing with , for ,
With this result and , we continue from (8) to get
Let . Then we have
Plugging it into (10),
where  and . Putting this together with  for all ,
where we use the fact that takes its maximum at .
Step 4: Plugging the results from Steps 2 and 3 into (5),
Straightforward calculation concludes the proof.
IV-C Distribution-Dependent Regret Upper Bound
We now show that Robust MOSS also preserves a logarithmic upper bound on the distribution-dependent regret.
For the heavy-tailed stochastic MAB problem with $K$ arms and time horizon $T$, if , the regret of Robust MOSS is no greater than
where and .
Let  and define  the same as in (4). Since  for all , the regret satisfies
Pick an arbitrary ; thus
Observe that implies at least one of the following is true