Beginning with the seminal work of Hannan (1957), researchers have been interested in algorithms that use random perturbations to generate a distribution over available actions. Kalai and Vempala (2005) showed that the perturbation idea leads to efficient algorithms for many online learning problems with large action sets. Due to the Gumbel lemma (Hazan et al., 2017), the well known exponential weights algorithm (Freund and Schapire, 1997) also has an interpretation as a perturbation based algorithm that uses Gumbel distributed perturbations.
There have been several attempts to analyze the regret of perturbation-based algorithms with specific distributions such as uniform, double-exponential, drop-out, and random walk (see, e.g., Kalai and Vempala, 2005; Kujala and Elomaa, 2005; Devroye et al., 2013; Van Erven et al., 2014). These works provided rigorous guarantees, but their techniques did not extend to arbitrary perturbations. The work of Abernethy et al. (2014) provided a general framework for understanding perturbations and clarified the relation between regularization and perturbation by viewing them as different ways to smooth an underlying non-smooth potential function.
Abernethy et al. (2015) extended the analysis of general perturbations to the partial information setting of the adversarial multi-armed bandit problem. They isolated bounded hazard rate as an important property of a perturbation and gave several examples of perturbations that lead to the near-optimal regret bound of $O(\sqrt{KT \log K})$. Since Tsallis entropy regularization can achieve the minimax regret of $O(\sqrt{KT})$ (Audibert and Bubeck, 2009, 2010), the question of whether perturbations can match the power of regularizers remained open for the adversarial multi-armed bandit problem.
In this paper, we build upon previous works (Abernethy et al., 2014, 2015) in two distinct but related directions. First, we provide the first general result for perturbation algorithms in the stochastic multi-armed bandit problem. However, instead of the hazard rate, the parameter that shows up in our stochastic regret analysis is the sub-Weibull parameter of the perturbation distribution. The sub-Weibull family includes sub-Gaussian and sub-Exponential distributions as special cases. Moreover, our regret is instance optimal for a range of the sub-Weibull parameter. Since the uniform distribution is sub-Gaussian, a corollary of our results is a regret bound for a randomized version of UCB where the algorithm picks a random number in the confidence interval instead of the upper bound. Our analysis relies on the simple but powerful observation that Thompson sampling with Gaussian priors and rewards can also be interpreted as a perturbation algorithm with Gaussian perturbations. We are able to generalize both the upper bound and the lower bound of Agrawal and Goyal (2013) from the special Gaussian case to general sub-Weibull distributions.
Second, we return to the open problem mentioned above: is there a perturbation that gives us minimax optimality? We do not resolve it, but we provide rigorous proofs that there are barriers to two natural approaches to solving the open problem. (A) One cannot simply find a perturbation that is exactly equivalent to Tsallis entropy. This is surprising since Shannon entropy does have an exact equivalent perturbation, viz. Gumbel. (B) One cannot simply do a better analysis of perturbations used by Abernethy et al. (2015) and plug the results into their general regret bound to eliminate the extra $\sqrt{\log K}$ factor. In proving the first barrier, we use a fundamental result in discrete choice theory. For the second barrier, we rely on tools from extreme value theory.
2 Problem Setup
In every round $t = 1, \dots, T$, a learner chooses an action $A_t \in [K] := \{1, \dots, K\}$ out of $K$ arms, and the environment picks a response in the form of a real-valued reward vector $g_t$. In the full information setting, the entire reward vector $g_t$ is revealed to the learner. In the bandit setting, the learner receives only the reward associated with his choice and no information about the other arms. We denote the reward corresponding to his choice $A_t$ by $X_t = g_{t, A_t}$.
Stochastic and Adversarial setting
In the stochastic multi-armed bandit, the rewards $g_{t,i}$ of each arm $i$ are sampled i.i.d. from a fixed but unknown distribution with mean $\mu_i$. The adversarial multi-armed bandit is more general in that all assumptions on how rewards are assigned to arms are dropped. It only assumes that rewards are assigned by an adversary before the interaction begins. Such an adversary is called an oblivious adversary. In both environments, the learner makes a sequence of decisions $A_t$ based on the history $\mathcal{H}_{t-1} = (A_1, X_1, \dots, A_{t-1}, X_{t-1})$, where $X_s$ is the reward received in round $s$, to maximize the cumulative reward $\sum_{t=1}^{T} X_t$.
Regret and No-regret algorithm
Regret measures the performance of a learner: it is the difference between the rewards the learner would have received had he played the best arm in hindsight and the rewards he actually received. Therefore, minimizing the regret is equivalent to maximizing the expected cumulative reward. In the adversarial setting, the expected regret, $\mathbb{E}[R(T)] = \mathbb{E}\left[\max_{i \in [K]} \sum_{t=1}^{T} g_{t,i} - \sum_{t=1}^{T} X_t\right]$, is considered, and in the stochastic setting one often considers the pseudo regret instead, $R'(T) = T \max_{i \in [K]} \mu_i - \mathbb{E}\left[\sum_{t=1}^{T} X_t\right]$. Note that the expected regret is equal to the pseudo regret with an oblivious adversary. An online decision-making algorithm is called a no-regret algorithm if, for every adversary, the expected regret with respect to every action is sub-linear in $T$, that is, $\mathbb{E}[R(T)] = o(T)$ as $T$ goes to infinity. Thus, it is of main interest in online learning to study the rate of growth of regret for various algorithms in various environments.
A Note on the Meaning of FTPL
We use FTPL (Follow The Perturbed Leader) to denote families of algorithms for both stochastic and adversarial settings. The common core of FTPL algorithms consists in adding random perturbations to the estimates of rewards of each arm prior to computing the current best arm (or “leader”). However, the estimates used are different in the two settings: the stochastic setting uses sample means and the adversarial setting uses inverse probability weighted estimates. A consequence of this convention is that “FTPL with Gumbel perturbations” can refer to two different algorithms: one meant for the stochastic setting and the other meant for the adversarial setting. The meaning of FTPL will be clear from the section of the paper we are in.
3 Stochastic Bandits
In this section, we propose an FTPL algorithm for stochastic multi-armed bandits and characterize a family of perturbations, the sub-Weibull family, for which the algorithm is optimal in terms of regret bounds. This work is mainly motivated by Thompson Sampling (Thompson, 1933), one of the standard algorithms in the stochastic setting. We also provide a regret lower bound for this FTPL algorithm.
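In the stochastic multi-armed bandit, arm 1 is simply assumed to be optimal, $\mu_1 = \max_{i \in [K]} \mu_i$, and the sub-optimality gap is denoted as $\Delta_i = \mu_1 - \mu_i$. Let $\hat{\mu}_i(t)$ be the average reward received from arm $i$ after round $t$, written formally as $\hat{\mu}_i(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = i\} X_s / T_i(t)$, where $T_i(t) = \sum_{s=1}^{t} \mathbb{1}\{A_s = i\}$ is the number of times arm $i$ has been pulled after round $t$. The regret for stochastic bandits can be decomposed into $R(T) = \sum_{i=1}^{K} \Delta_i \mathbb{E}[T_i(T)]$. The reward distributions are generally assumed to be sub-Gaussian with parameter 1. First, we introduce the definitions of the sub-Gaussian and sub-Exponential families, and the Hoeffding bound in the sub-Gaussian case.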
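The regret decomposition above can be sketched in a few lines. This is a minimal illustration, assuming the standard form "sum over arms of gap times expected pull count"; the function name `pseudo_regret` and the example numbers are hypothetical and not taken from the paper.

```python
def pseudo_regret(means, expected_pulls):
    """Regret as the sum over arms of (gap to the best mean) * (expected pulls).

    `means` and `expected_pulls` are illustrative inputs: one entry per arm.
    """
    best = max(means)
    return sum((best - mu) * n for mu, n in zip(means, expected_pulls))


# Hypothetical three-armed instance: the optimal arm contributes nothing,
# each sub-optimal arm contributes its gap times how often it was pulled.
example_means = [0.9, 0.5, 0.4]
example_pulls = [80, 15, 5]
example_regret = pseudo_regret(example_means, example_pulls)
```

Only the pull counts of sub-optimal arms matter, which is why the analysis later in this section focuses on bounding the expected number of pulls of each sub-optimal arm.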
Definition 1 (sub-Gaussian).
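A random variable $Z$ with mean $\mu = \mathbb{E}[Z]$ is sub-Gaussian with parameter $\sigma > 0$ if it satisfies $P(|Z - \mu| \ge t) \le 2\exp(-t^2/(2\sigma^2))$ for all $t \ge 0$.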
Lemma 1 (Hoeffding bound of sub-Gaussian).
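Suppose $Z_i$, $i \in [n]$, are i.i.d. random variables with $\mathbb{E}[Z_i] = \mu$ and sub-Gaussian with parameter $\sigma$. Then $P(\bar{Z}_n - \mu \ge t) \vee P(\bar{Z}_n - \mu \le -t) \le \exp(-n t^2/(2\sigma^2))$ for all $t \ge 0$, where $\bar{Z}_n = \frac{1}{n}\sum_{i=1}^{n} Z_i$.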
Definition 2 (sub-Exponential).
A random variable $Z$ with mean $\mu = \mathbb{E}[Z]$ is sub-Exponential with parameter $\lambda > 0$ if it satisfies $P(|Z - \mu| \ge t) \le 2\exp(-t/\lambda)$ for all $t \ge 0$.
A Gaussian distribution with variance $\sigma^2$ is sub-Gaussian with parameter $\sigma$, and every random variable bounded in $[a, b]$ is sub-Gaussian with parameter $(b - a)/2$. The double exponential distribution with density $f(z) = e^{-|z|}/2$ is sub-Exponential with $\lambda = 1$.
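The difference between the two tail classes can be checked numerically. The sketch below assumes tail-based definitions of the form $2\exp(-t^2/(2\sigma^2))$ and $2\exp(-t/\lambda)$, which are a common convention but may differ by constants from the ones intended here; the helper names are illustrative. It uses the exact two-sided tails of a standard Gaussian and of a double exponential with density $e^{-|z|}/2$.

```python
import math


def gaussian_tail(t):
    """Exact P(|Z| >= t) for Z ~ N(0, 1)."""
    return math.erfc(t / math.sqrt(2.0))


def laplace_tail(t):
    """Exact P(|Z| >= t) for the double exponential with density exp(-|z|)/2."""
    return math.exp(-t)


def sub_gaussian_bound(t, sigma=1.0):
    """Assumed sub-Gaussian tail envelope 2 exp(-t^2 / (2 sigma^2))."""
    return 2.0 * math.exp(-t * t / (2.0 * sigma * sigma))


def sub_exponential_bound(t, lam=1.0):
    """Assumed sub-Exponential tail envelope 2 exp(-t / lam)."""
    return 2.0 * math.exp(-t / lam)


grid = [0.25 * k for k in range(41)]  # t = 0, 0.25, ..., 10

gaussian_is_sub_gaussian = all(gaussian_tail(t) <= sub_gaussian_bound(t) for t in grid)
laplace_is_sub_exponential = all(laplace_tail(t) <= sub_exponential_bound(t) for t in grid)
# The double exponential tail exceeds the sigma = 1 Gaussian envelope once t is large enough.
laplace_breaks_gaussian_envelope = any(laplace_tail(t) > sub_gaussian_bound(t) for t in grid)
```

The check only uses $\sigma = 1$ on a finite grid, so it illustrates the heavier tail rather than proving that no $\sigma$ works.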
3.2 Upper Confidence Bound and Thompson Sampling
The standard algorithms for stochastic bandits are Upper Confidence Bound (UCB1) (Auer, 2002) and Thompson Sampling (Thompson, 1933). UCB1 compares the largest plausible estimate of the mean of each arm, following the principle of optimism in the face of uncertainty, and is therefore deterministic, in contrast to Thompson Sampling. In each round, UCB1 chooses the action that maximizes its upper confidence bound. Regarding the instance-dependent regret of UCB1, there exists a universal constant $C > 0$ such that $R(T) \le C \sum_{i : \Delta_i > 0} (\Delta_i + \log T / \Delta_i)$.
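As a rough illustration of the optimism principle, the sketch below simulates UCB1 on Bernoulli arms. The index used, sample mean plus $\sqrt{2 \log t / T_i}$, is the textbook form from Auer (2002) and is an assumption here, since the exact index is not recoverable from the text above; the function name and the test instance are hypothetical.

```python
import math
import random


def ucb1(means, horizon, seed=0):
    """Simulate UCB1 on Bernoulli arms and return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            # Assumed index: sample mean plus an exploration bonus.
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i] + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


ucb_pulls = ucb1([0.9, 0.5], horizon=2000)
```

Given the reward sequence, the choice of arm is deterministic, which is the contrast with Thompson Sampling drawn in the paragraph above.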
Thompson Sampling is a Bayesian solution for the stochastic bandit problem with innate randomness. Given the number of arms $K$ and a prior distribution over the mean rewards, in each round it computes the posterior distribution from the observed data, draws a sample for each arm from its posterior, and then chooses the arm with the largest sample. In Gaussian Thompson Sampling, the reward of arm $i$ is assumed to be Gaussian with mean $\mu_i$ and unit variance for conjugacy, and the prior distribution for each $\mu_i$ is also i.i.d. Gaussian with mean $\mu_0$ and variance $\sigma_0^2$. As the prior variance $\sigma_0^2$ goes to infinity, the policy in round $t+1$ is to choose the index that maximizes $\theta_i(t)$ sampled from the Gaussian posterior distribution $\mathcal{N}(\hat{\mu}_i(t), 1/T_i(t))$. The details of Gaussian Thompson Sampling are in Algorithm 1, and its regret is analyzed in Theorem 2.
Theorem 2 (Agrawal and Goyal (2013)).
Assume that reward distribution of each arm is Gaussian with mean and unit variance. Thompson sampling policy via Gaussian prior defined in Algorithm 1 has the following instance dependent and independent regret bounds, for ,
Viewpoint of Follow-The-Perturbed-Leader
A more generic view of Thompson Sampling is via the idea of perturbation. Gaussian Thompson Sampling can be interpreted as a Follow-The-Perturbed-Leader (FTPL) algorithm with Gaussian perturbations (Lattimore and Szepesvári, 2018). The Gaussian posterior sample for arm $i$, $\theta_i(t)$, can be decomposed into the average reward of the arm, $\hat{\mu}_i(t)$, and a scaled Gaussian perturbation $\eta_{it} Z_{it}$, where $\eta_{it} = 1/\sqrt{T_i(t)}$, $T_i(t)$ is the number of pulls of arm $i$, and $Z_{it} \sim \mathcal{N}(0, 1)$. In each round $t+1$, the FTPL algorithm chooses the action according to $A_{t+1} = \arg\max_{i \in [K]} \hat{\mu}_i(t) + \eta_{it} Z_{it}$. The only difference is the assumption under which the algorithm is analyzed. The regret analysis in Agrawal and Goyal (2013) is done under a Gaussian assumption on both prior and rewards. The algorithm still achieves the same regret bound even if the assumption on the reward distribution is relaxed to 1-sub-Gaussian.
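The equivalence between the posterior sample and the perturbed sample mean can be seen directly in code. The sketch below assumes the flat-prior Gaussian posterior with mean equal to the sample mean and variance equal to one over the number of pulls; the function names are illustrative. Both functions are fed generators with the same seed so that they consume the same underlying standard normal draw.

```python
import math
import random


def thompson_sample(mean_hat, pulls, rng):
    """Draw from the assumed flat-prior Gaussian posterior N(mean_hat, 1/pulls)."""
    return rng.gauss(mean_hat, 1.0 / math.sqrt(pulls))


def ftpl_sample(mean_hat, pulls, rng):
    """Sample mean plus a standard Gaussian perturbation scaled by 1/sqrt(pulls)."""
    z = rng.gauss(0.0, 1.0)
    return mean_hat + z / math.sqrt(pulls)


# Identical seeds: both generators produce the same standard normal variate,
# so the two draws agree up to floating-point rounding.
ts_draw = thompson_sample(0.6, 25, random.Random(7))
ftpl_draw = ftpl_sample(0.6, 25, random.Random(7))
```

Writing the sample this way separates the data-dependent part (the sample mean) from the noise, which is what allows the Gaussian noise to be swapped for other perturbation distributions.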
3.3 Follow-The-Perturbed-Leader via sub-Weibull Perturbation
In this section, we show that the FTPL algorithm with Gaussian perturbation can be extended to a family of sub-Weibull perturbations with parameters $p$ and $\sigma$. The sub-Weibull family covers sub-Gaussian and sub-Exponential distributions since they are sub-Weibull with parameter $p = 2$ and $p = 1$, respectively (Faradonbeh et al., 2018). We propose perturbation-based algorithms for the stochastic multi-armed bandit with sub-Weibull($p$) perturbations in Algorithm 2. Their regret analysis, stated in Theorem 3 in terms of the parameter $p$, builds on the work of Agrawal and Goyal (2013).
Definition 3 (sub-Weibull).
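A random variable $Z$ with mean $\mu = \mathbb{E}[Z]$ is sub-Weibull($p$) with parameter $\sigma > 0$ if it satisfies $P(|Z - \mu| \ge t) \le C_a \exp(-t^p/(2\sigma^p))$ for all $t \ge 0$.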
Theorem 3 (FTPL via sub-Weibull Perturbation).
Assume that reward distribution of each arm is 1-sub-Gaussian with mean . Follow-The-Perturbed-Leader algorithm via sub-Weibull () perturbation with parameter and in Algorithm 2 has the following instance dependent and independent regret bounds, for ,
For each arm , we will choose two thresholds such that and define two types of events, , and . Intuitively, and are the events that the estimate and the sample value are not too far above the mean , respectively. Set , , and let and . is decomposed into following three parts according to events and ,
Let denote the time at which -th trial of arm happens. Set .
The second last inequality above holds by Hoeffding bound of sample mean of sub-Gaussian rewards, in Lemma 1. The probability in part (b) is upper bounded by 1 if is less than and by otherwise. The latter can be proved as below,
The third inequality holds by sub-Weibull() assumption on perturbation . Let be the largest step until , then part (b) is bounded,
Define as the probability where is defined as the history of plays until time . Let denote the time at which -th trial of arm happens. The following two lemmas are helpful for handling part (c).
Lemma 4 (Lemma 1 (Agrawal and Goyal, 2013)).
See Appendix A.1. ∎
There exists a finite constant such that
Given , let , and is bounded by a constant if is strictly greater than 0 shown below,
Define events and let be a history up to time where an event is true. To get the tighter bound for large ,
Part (c) is upper bounded by,
Conditioned on the history, the probability of choosing a sub-optimal arm $i$ can be bounded by a linear function of the probability of playing the optimal arm 1. By Lemma 4, this gives the first inequality above. The last inequality follows from Lemma 5. Combining parts (a), (b) and (c) and letting ,
Thus, the instance-dependent regret bound in equation (1) is obtained, and the instance-independent one is derived with optimal choice of . ∎
3.4 Follow-The-Perturbed-Leader via sub-Gaussian and sub-Exponential Perturbation
Corollaries 6 and 7 show that sub-Gaussian perturbations yield the optimal instance-dependent regret bound, whereas a group of sub-Exponential distributions performs sub-optimally, with an extra term in the regret bound.
Corollary 6 (FTPL via sub-Gaussian Perturbation).
Assume that reward distribution of each arm is 1-sub-Gaussian with mean . Follow-The-Perturbed-Leader algorithm via sub-Gaussian perturbation with parameter and in Algorithm 2 has the following instance dependent and independent regret bounds, for
Randomized Confidence Bound algorithm
Corollary 6 implies that the optimism embedded in UCB can be replaced by simple randomization. In the UCB1 algorithm (Auer, 2002), the action in each round is chosen by maximizing an upper confidence bound on the mean of each arm. Our modification is to introduce a random perturbation into the UCB1 index. Any bounded random variable, including a uniformly distributed one, is sub-Gaussian. It follows that the FTPL algorithm with uniform perturbation can be viewed as a randomized version of the UCB algorithm, namely the RCB (Randomized Confidence Bound) algorithm, and that it achieves a regret bound comparable to that of UCB. The RCB algorithm is meaningful in that it can be interpreted from the perspectives of two standard solutions, UCB and Thompson Sampling.
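A randomized confidence bound rule can be sketched as follows. This is only one plausible reading of the idea: the index is the sample mean plus a Uniform$[-1, 1]$ draw times a confidence width of $\sqrt{2 \log T / T_i}$, so that the index is a random point in the confidence interval. Both the width and the perturbation range are assumptions made for illustration and are not taken from Algorithm 2; the function name is hypothetical.

```python
import math
import random


def randomized_confidence_bound(means, horizon, seed=0):
    """Simulate a randomized-index rule on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # initialization: pull every arm once
        else:
            scores = []
            for i in range(k):
                # Assumed confidence width; the perturbation picks a random
                # point in [mean - width, mean + width] instead of the top.
                width = math.sqrt(2.0 * math.log(horizon) / pulls[i])
                scores.append(sums[i] / pulls[i] + rng.uniform(-1.0, 1.0) * width)
            arm = scores.index(max(scores))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls


rcb_pulls = randomized_confidence_bound([0.9, 0.5], horizon=2000)
```

Replacing `rng.uniform(-1.0, 1.0)` with a standard Gaussian draw and the width with $1/\sqrt{T_i}$ gives back the perturbation view of Gaussian Thompson Sampling, which is the sense in which the rule sits between the two standard algorithms.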
Corollary 7 (FTPL via sub-Exponential Perturbation).
Assume that reward distribution of each arm is 1-sub-Gaussian with mean . FTPL algorithm via perturbation with that is sub-Exponential with parameter in Algorithm 2 has the following instance dependent and independent regret bounds, for ,
FTPL via Gumbel Perturbation
The Gumbel distribution is not sub-Gaussian but sub-Exponential. If , it is a sub-Exponential distribution with . Corollary 7 implies that Gumbel perturbation in FTPL yields the regret bound in equation (8), which matches the regret obtained from the FTRL algorithm with Shannon entropy (Zimmert and Seldin, 2018).
3.5 Regret lower bound
The regret lower bound of FTPL algorithm in Algorithm 2 is built on the work of Agrawal and Goyal (2013). Theorem 8 states that regret lower bound depends on the lower bound of tail probability of perturbation.
Theorem 8.
If the perturbation with has lower bound of tail probability as for , FTPL algorithm via perturbation has the lower bound of expected regret, .
See Appendix A.2. ∎
As special cases, FTPL algorithms via Gaussian ($p = 2$) and Gumbel ($p = 1$) perturbations have regret lower bounds and , respectively. For FTPL algorithms via Gaussian and Gumbel perturbations, the lower and upper regret bounds match, .
4 Adversarial Bandits
In this section we introduce the Gradient Based Prediction Algorithm (GBPA) family for solving the adversarial multi-armed bandit problem. Then, we will mention an important open problem regarding the existence of an optimal FTPL algorithm. The main contributions of this section are theoretical results showing that two natural approaches to solving the open problem are not going to work. We also make some conjectures on what alternative ideas might work.
4.1 The GBPA Algorithm Family
Following the work of Abernethy et al. (2015) we consider a general algorithmic framework, Algorithm 3. They also derived the following general result to analyze the regret of both regularization based and perturbation based algorithms in adversarial bandit problems.
Lemma 9 (Decomposition of the Expected Regret, (Abernethy et al., 2015)).
Define the non-smooth potential function . The expected regret of GBPA() can be written as . Furthermore, it is bounded by sum of an overestimation, an underestimation and a divergence term:
where is Bregman divergence induced by .
There are two main ingredients of GBPA. The first ingredient is the smoothed potential $\tilde{\Phi}$, whose gradient is used to map the current estimate of the cumulative reward vector to a probability distribution $p_t$ over arms. Since the gradient has to map vectors in $\mathbb{R}^K$ to probability vectors, the map $\nabla\tilde{\Phi}$ must give us a choice probability function, that is, a function of the type $C : \mathbb{R}^K \to \Delta_{K-1}$, where $\Delta_{K-1}$ is the probability simplex over the $K$ arms. The second ingredient is the construction of an unbiased estimate $\hat{g}_t$ of the reward vector using the reward of the pulled arm only. This step reduces the bandit setting to the full-information setting so that any algorithm for the full-information setting can be immediately applied to the bandit setting. Because the unbiased estimation involves inverse probability weighting, we need to ensure that $p_{t,i}$ does not become too small.
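The second ingredient can be made concrete with a small sketch of an inverse probability weighted estimate. The function names and the example vectors are hypothetical; the code computes the exact expectation of the estimate over the random arm choice rather than simulating it.

```python
def ipw_estimate(rewards, probs, arm):
    """Estimate of the full reward vector when only `arm` was pulled.

    The observed reward is divided by the probability of pulling that arm;
    all unobserved coordinates are set to zero.
    """
    estimate = [0.0] * len(rewards)
    estimate[arm] = rewards[arm] / probs[arm]
    return estimate


def expected_estimate(rewards, probs):
    """Exact expectation of the estimate over the arm drawn from `probs`."""
    k = len(rewards)
    expectation = [0.0] * k
    for arm in range(k):
        estimate = ipw_estimate(rewards, probs, arm)
        for i in range(k):
            expectation[i] += probs[arm] * estimate[i]
    return expectation


true_rewards = [0.2, 0.7, 0.5]
arm_probs = [0.5, 0.3, 0.2]
mean_estimate = expected_estimate(true_rewards, arm_probs)  # coordinate-wise equal to true_rewards

# A rarely pulled arm produces a very large estimate when it is pulled.
rare_arm_estimate = ipw_estimate([0.5, 0.5], [0.99, 0.01], arm=1)
```

The last line shows why small probabilities are a concern: the estimate stays unbiased, but its magnitude, and hence its variance, grows as the probability of the pulled arm shrinks.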
4.2 FTRL and FTPL as Two Types of Smoothings
If we did not use any smoothing and directly used the baseline potential $\Phi(G) = \max_{w \in \Delta_{K-1}} \langle w, G \rangle$ (where $\Delta_{K-1}$ is the $(K-1)$-dimensional simplex), we would be running Follow The Leader (FTL) as our full information algorithm. It is well known that FTL does not have good regret guarantees. Therefore, we need to smooth the baseline potential to induce stability in the algorithm. It turns out that two major algorithm families in online learning, namely Follow The Regularized Leader (FTRL) and Follow The Perturbed Leader (FTPL), correspond to two different types of smoothings.
The smoothing used by FTRL is achieved by adding a strongly convex regularizer in the dual representation of the baseline potential. That is, we set $\tilde{\Phi}(G) = \max_{w \in \Delta_{K-1}} \{\langle w, G \rangle - \eta R(w)\}$, where $R$ is a strongly convex function. The well-known exponential weights algorithm (Freund and Schapire, 1997) uses the Shannon entropy regularizer, $R(w) = \sum_{i=1}^{K} w_i \log w_i$. GBPA with the resulting smoothed potential becomes the EXP3 algorithm (Auer et al., 2002), which achieves a near-optimal regret bound of $O(\sqrt{KT \log K})$, just logarithmically worse than the lower bound $\Omega(\sqrt{KT})$. This lower bound was matched by the Implicit Normalized Forecaster with polynomial function (Poly-INF algorithm) (Audibert and Bubeck, 2009, 2010), and later work of Abernethy et al. (2015) showed that the Poly-INF algorithm is equivalent to the FTRL algorithm with the Tsallis entropy regularizer, expressed by
$$S_{\alpha}(w) = \frac{1}{1-\alpha}\left(1 - \sum_{i=1}^{K} w_i^{\alpha}\right), \quad \alpha \in (0, 1).$$
It converges to Shannon entropy as $\alpha$ approaches 1, which is why Tsallis entropy is considered a generalization of Shannon entropy. Therefore, FTRL via Tsallis entropy generalizes EXP3.
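The limit can be checked numerically. The sketch below assumes the regularizer form $\frac{1}{1-\alpha}(1 - \sum_i w_i^{\alpha})$ for the Tsallis case and $\sum_i w_i \log w_i$ for the Shannon case (both are convex, negative-entropy conventions); the function names and the test vector are illustrative.

```python
import math


def tsallis_regularizer(w, alpha):
    """Assumed Tsallis regularizer (1 - sum_i w_i^alpha) / (1 - alpha), alpha in (0, 1)."""
    return (1.0 - sum(x ** alpha for x in w)) / (1.0 - alpha)


def negative_shannon_entropy(w):
    """Shannon regularizer sum_i w_i log w_i, with 0 log 0 treated as 0."""
    return sum(x * math.log(x) for x in w if x > 0.0)


weights = [0.5, 0.3, 0.2]
shannon_value = negative_shannon_entropy(weights)

# The gap between the two regularizers shrinks as alpha moves toward 1.
gaps = [abs(tsallis_regularizer(weights, a) - shannon_value) for a in (0.5, 0.9, 0.99, 0.999999)]
```

The convergence follows from the first-order expansion $w^{\alpha} \approx w\,(1 + (\alpha - 1)\log w)$ near $\alpha = 1$.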
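An alternate way of smoothing is stochastic smoothing, which is what FTPL algorithms use. It injects stochastic perturbations into the cumulative rewards of each arm and then finds the best arm. Given a perturbation distribution $\mathcal{D}$ and $Z = (Z_1, \dots, Z_K)$ consisting of i.i.d. draws from $\mathcal{D}$, the resulting stochastically smoothed potential is $\tilde{\Phi}(G; \mathcal{D}) = \mathbb{E}_{Z}[\Phi(G + \eta Z)]$, where $\Phi(G) = \max_{i \in [K]} G_i$ is the baseline potential. Its gradient is $\nabla \tilde{\Phi}(G; \mathcal{D}) = \mathbb{E}_{Z}[e_{i^*}]$, where $i^* = \arg\max_{i \in [K]} (G_i + \eta Z_i)$. The corresponding choice probability function is given by
$$C_i(G) = P\left(\arg\max_{j \in [K]} (G_j + \eta Z_j) = i\right). \tag{11}$$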
In Section 4.5, we will recall the general regret bound proved by Abernethy et al. (2015) for distributions with bounded hazard rate. They showed that a variety of natural perturbation distributions can yield a near-optimal regret bound of $O(\sqrt{KT \log K})$. However, none of the distributions they tried yielded the minimax optimal rate $O(\sqrt{KT})$.
4.3 Open Problem and Two Natural Solution Approaches
Since FTRL with Tsallis entropy regularizer can achieve the minimax optimal rate in adversarial bandits, the following is an important unresolved question regarding the power of perturbations.
Is there a perturbation $\mathcal{D}$ such that GBPA with a stochastically smoothed potential using $\mathcal{D}$ achieves the optimal regret bound $O(\sqrt{KT})$ in adversarial $K$-armed bandits?
Given what we currently know, there are two very natural approaches to resolving the open question in the affirmative. Approach 1: Find a perturbation so that we get the exact same choice probability function as the one used by FTRL via Tsallis entropy. Approach 2: Provide a tighter control on expected block maxima of random variables considered as perturbations by Abernethy et al. (2015).
4.4 Barrier Against First Approach: Discrete Choice Theory
The first approach is motivated by a folklore observation in online learning theory. Namely, that the exponential weights algorithm (Freund and Schapire, 1997) can be viewed as FTRL via Shannon entropy regularizer or as FTPL via Gumbel-distributed perturbation. Thus, we might hope to find a perturbation which is an exact equivalent of the Tsallis entropy regularizer. Since FTRL via Tsallis entropy is optimal for adversarial bandits, finding such a perturbation would immediately settle the open problem.
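The folklore equivalence can be illustrated by simulation. The sketch below draws standard Gumbel perturbations by inverse transform sampling, adds them to a fixed score vector, and compares the empirical frequency of each arm being the perturbed leader with the exponential weights (softmax) probabilities. The function names, the score vector, and the scale parameter `eta` are illustrative assumptions.

```python
import math
import random


def softmax(scores, eta=1.0):
    """Exponential weights probabilities proportional to exp(score / eta)."""
    top = max(scores)
    exps = [math.exp((s - top) / eta) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_gumbel(rng):
    """Standard Gumbel draw via inverse transform: -log(-log(U)), U in (0, 1)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def ftpl_gumbel_frequencies(scores, eta=1.0, draws=50000, seed=0):
    """Empirical frequency with which each arm maximizes score + eta * Gumbel noise."""
    rng = random.Random(seed)
    counts = [0] * len(scores)
    for _ in range(draws):
        perturbed = [s + eta * sample_gumbel(rng) for s in scores]
        counts[perturbed.index(max(perturbed))] += 1
    return [c / draws for c in counts]


scores = [1.0, 0.5, 0.0]
exact = softmax(scores)
empirical = ftpl_gumbel_frequencies(scores)
largest_gap = max(abs(e - f) for e, f in zip(exact, empirical))
```

With enough draws the two probability vectors agree up to Monte Carlo error, which is the exact-equivalence property one would like to reproduce for the Tsallis entropy regularizer.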
The relation between regularizers and perturbations has been theoretically studied in discrete choice theory (McFadden, 1981; Hofbauer and Sandholm, 2002). The theorem below states that, for any perturbation, there is always a regularizer which gives the same choice probability function.
Theorem 10 (Theorem 2.1 (Hofbauer and Sandholm, 2002)).
Let $C$ be the choice probability function defined in equation (11), where the random vector $Z$ has a strictly positive density on $\mathbb{R}^K$ and the function $C$ is continuously differentiable. Then there exists a regularizer $R$ such that $C(G) = \arg\max_{w \in \Delta_{K-1}} \{\langle w, G \rangle - R(w)\}$.
However, to solve our open problem, we need the converse of Theorem 10, and such a converse does not hold. The Williams-Daly-Zachary Theorem provides a characterization of choice probability functions that can be derived via additive perturbations.
Theorem 11 (Williams-Daly-Zachary Theorem (McFadden, 1981)).
Let be the choice probability function and derivative matrix . The following 4 conditions are necessary and sufficient for the existence of perturbations such that this choice probability function can be written in the form of (11). (1) is symmetric, (2) is positive definite, (3) , and (4) All mixed partial derivatives of , for each .
The entropy regularizer induces the choice probability function where which satisfies all conditions in Theorem 11. Therefore, there exists a perturbation distribution, namely Gumbel, which induces this choice probability function . We now show that if the number of arms is greater than three, there does not exist any perturbation exactly equivalent to Tsallis entropy regularization. Therefore, the first approach to solving the open problem is doomed to failure.
Theorem 12.
When $K \ge 4$, there is no stochastic perturbation that yields the same choice probability function as the Tsallis entropy regularizer.
4.5 Barrier Against Second Approach: Extreme Value Theory
The second approach is built on the work of Abernethy et al. (2015), who provided the state-of-the-art perturbation-based algorithm for adversarial multi-armed bandits. Their framework covers all distributions with bounded hazard rate and shows that the regret of GBPA with such a perturbation is upper bounded by a trade-off between the bound on the hazard rate and the expected block maxima, as stated below.
Theorem 13 (Theorem 4.2 (Abernethy et al., 2015)).
Assume support of is unbounded in positive direction and hazard rate is bounded, then the expected regret of GBPA() in adversarial bandit is bounded by
where . The optimal choice of leads to the regret bound where .
Abernethy et al. (2015) considered several perturbations such as Gumbel, Gamma, Weibull, Frechet and Pareto. The best tuning of distribution parameters (to minimize upper bounds on the product of the hazard rate bound and the expected block maximum) always leads to the bound $O(\sqrt{KT \log K})$, which is tantalizingly close to the lower bound but does not match it. It is possible that some of their upper bounds on expected block maxima are loose and that we can get closer to, or perhaps even match, the lower bound by simply doing a better job of bounding expected block maxima (we will not worry about the supremum of the hazard rate since their bounds can easily be shown to be tight, up to constants, using elementary calculations). We will show that this approach will also not work by characterizing the asymptotic (as $K \to \infty$) behavior of block maxima of perturbations using extreme value theory. Extreme value theory deals with the stochastic behavior of the extreme values in a process. The statistical behavior of block maxima, $M_K = \max_{i \in [K]} Z_i$, where $Z_1, \dots, Z_K$ is a sequence of i.i.d. random variables with distribution function $F$, can be described by one of three extreme value distributions: Gumbel, Frechet and Weibull (Fisher and Tippett, 1928; Coles et al., 2001; Resnick, 2013).
Theorem 14 (Proposition 0.3 (Resnick, 2013)).
Suppose that there exist and such that
where $G$ is a non-degenerate distribution function. Then $G$ belongs to one of three families: Gumbel, Frechet, or Weibull. In that case, $F$ is in the domain of attraction of $G$, written as $F \in D(G)$.
The normalizing sequences and are thoroughly characterized when is one of the three extreme value distributions (Leadbetter et al., 2012). See Theorems 15 and 16 in Appendix B for more details. Under the mild condition, as where is constant, the expected block maxima behave asymptotically as .
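The growth of expected block maxima can be explored with a small Monte Carlo sketch. It relies on two standard facts that are not specific to this paper: the maximum of $N$ i.i.d. standard Gumbel variables is again Gumbel with location $\log N$, so its mean is $\log N + \gamma$ with $\gamma$ the Euler-Mascheroni constant, and the expected maximum of $N$ standard Gaussians is at most $\sqrt{2 \log N}$. The function names, block size, and number of trials are illustrative choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def sample_gumbel(rng):
    """Standard Gumbel draw via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def sample_gaussian(rng):
    """Standard Gaussian draw."""
    return rng.gauss(0.0, 1.0)


def mean_block_maximum(sampler, block_size, trials, seed=0):
    """Monte Carlo estimate of E[max of `block_size` i.i.d. draws from `sampler`]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(sampler(rng) for _ in range(block_size))
    return total / trials


block_size = 20
gumbel_estimate = mean_block_maximum(sample_gumbel, block_size, trials=10000)
gumbel_exact = math.log(block_size) + EULER_GAMMA
gaussian_estimate = mean_block_maximum(sample_gaussian, block_size, trials=10000)
gaussian_upper = math.sqrt(2.0 * math.log(block_size))
```

The Gumbel maximum grows like $\log N$ while the Gaussian maximum grows like $\sqrt{\log N}$, which is the kind of distribution-dependent growth rate that enters the trade-off with the hazard rate bound.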