# Equipping Experts/Bandits with Long-term Memory

We propose the first reduction-based approach to obtaining long-term memory guarantees for online learning in the sense of Bousquet and Warmuth (2002), by reducing the problem to achieving typical switching regret. Specifically, for the classical expert problem with $K$ actions and $T$ rounds, using our framework we develop various algorithms with a regret bound of order $\mathcal{O}(\sqrt{T(S\ln T + n\ln K)})$ compared to any sequence of experts with $S-1$ switches among $n \le \min\{S, K\}$ distinct experts. In addition, by plugging specific adaptive algorithms into our framework we also achieve the best of both stochastic and adversarial environments simultaneously. This resolves an open problem of Warmuth and Koolen (2014). Furthermore, we extend our results to the sparse multi-armed bandit setting and show both negative and positive results for long-term memory guarantees. As a side result, our lower bound also implies that sparse losses do not help improve the worst-case regret for contextual bandits, a sharp contrast with the non-contextual case.


## 1 Introduction

In this work, we propose a black-box reduction for obtaining long-term memory guarantees for two fundamental problems in online learning: the expert problem (Freund and Schapire, 1997) and the multi-armed bandit (MAB) problem (Auer et al., 2002). In both problems, a learner interacts with the environment for $T$ rounds, with $K$ fixed available actions. At each round, the environment decides the loss for each action while simultaneously the learner selects one of the actions and suffers the loss of this action. In the expert problem, the learner observes the loss of every action at the end of each round (a.k.a. full-information feedback), while in MAB, the learner only observes the loss of the selected action (a.k.a. bandit feedback).

For both problems, the classical performance measure is the learner's (static) regret, defined as the difference between the learner's total loss and the loss of the best fixed action. It is well-known that the minimax optimal regret is $\Theta(\sqrt{T\ln K})$ (Freund and Schapire, 1997) and $\Theta(\sqrt{TK})$ (Auer et al., 2002; Audibert and Bubeck, 2010) for the expert problem and MAB respectively. Comparing against a fixed action, however, does not always lead to meaningful guarantees, especially when the environment is non-stationary and no single fixed action performs well. To address this issue, prior work has considered a stronger measure called switching/tracking/shifting regret, which is the difference between the learner's total loss and the loss of a sequence of actions with at most $S-1$ switches. Various existing algorithms (including some black-box approaches) achieve the following switching regret:

For the expert problem (Herbster and Warmuth, 1998; Hazan and Seshadhri, 2007; Adamskiy et al., 2012a; Luo and Schapire, 2015; Jun et al., 2017):

$$\mathcal{O}\left(\sqrt{TS\ln(TK)}\right) \tag{1}$$

For multi-armed bandits (Auer et al., 2002; Luo et al., 2018):

$$\mathcal{O}\left(\sqrt{TKS\ln(TK)}\right) \tag{2}$$

We call these typical switching regret bounds. Such bounds essentially imply that the learner pays the worst-case static regret for each switch in the benchmark sequence. While this makes sense in the worst case, intuitively one would hope to perform better if the benchmark sequence frequently switches back to previous actions, as long as the algorithm remembers which actions have performed well previously.

Indeed, for the expert problem, algorithms with long-term memory were developed that guarantee switching regret of order $\mathcal{O}\big(\sqrt{T(S\ln\frac{nT}{S} + n\ln\frac{K}{n})}\big)$, where $n \le \min\{S, K\}$ is the number of distinct actions in the benchmark sequence (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b; Cesa-Bianchi et al., 2012). (Footnote 1: The setting considered in (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b) is in fact slightly different from, yet closely related to, the expert problem. One can easily translate their regret bounds into the bounds we present here.) Compared to the typical switching regret bound of form (1) (which can be written as $\mathcal{O}(\sqrt{T(S\ln T + S\ln K)})$), this long-term memory guarantee implies that the learner pays the worst-case static regret only for each distinct action encountered in the benchmark sequence, and pays less for each switch, especially when $n$ is very small. Algorithms with long-term memory guarantees have been found to have better empirical performance (Bousquet and Warmuth, 2002). We are not aware of any similar studies for the bandit setting.

#### Overview of our contributions.

The main contribution of this work is to propose a simple black-box approach to equip expert or MAB algorithms with long-term memory and to achieve switching regret guarantees of similar flavor to those of (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b; Cesa-Bianchi et al., 2012). The key idea of our approach is to utilize a variant of the confidence-rated expert framework of (Blum and Mansour, 2007), and to use a sub-routine to learn the confidence/importance of each action for each time. Importantly this sub-routine itself is an expert/bandit algorithm over only two actions and needs to enjoy some typical switching regret guarantee (for example of form (1) for the expert problem). In other words, our approach reduces the problem of obtaining long-term memory to the well-studied problem of achieving typical switching regret. Compared to existing methods (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b; Cesa-Bianchi et al., 2012), the advantages of our approach are the following:

1. While existing methods are all restricted to variants of the classical Hedge algorithm (Freund and Schapire, 1997), our approach allows one to plug in a variety of existing algorithms and to obtain a range of different algorithms with switching regret $\mathcal{O}(\sqrt{T(S\ln T + n\ln K)})$. (Section 3.1)

2. Due to this flexibility, by plugging in specific adaptive algorithms, we develop a parameter-free algorithm whose switching regret is simultaneously $\mathcal{O}(\sqrt{T(S\ln T + n\ln K)})$ in the worst case and $\mathcal{O}(S\ln T + n\ln K)$ if the losses are piece-wise stochastic (see Section 2 for the formal definition). This is a generalization of previous best-of-both-worlds results for static or switching regret (Gaillard et al., 2014; Luo and Schapire, 2015), and resolves an open problem of Warmuth and Koolen (2014). The best previous bound for the stochastic case is $\mathcal{O}(S\ln(TK))$ (Luo and Schapire, 2015). (Section 3.2)

3. Our framework allows us to derive the first nontrivial long-term memory guarantees for the bandit setting, while existing approaches fail to do so (more discussion to follow). For example, when $n$ is a constant and the losses are sparse, our algorithm achieves switching regret $\mathcal{O}(S^{1/3}T^{2/3} + K^3\ln T)$ for MAB, which is better than the typical bound (2) when $S$ and $K$ are large. For instance, when $S = \Theta(T^{7/10})$ and $K = \Theta(T^{3/10})$, our bound is of order $\mathcal{O}(T^{9/10})$ while bound (2) becomes vacuous (linear in $T$), demonstrating a strict separation in learnability. (Section 4)
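Returning to the expert-problem bounds in item 1, a quick numerical comparison can make the gap concrete. The sketch below evaluates the shape of the typical bound (1), $\sqrt{TS\ln(TK)}$, against the shape of the long-term memory bound, $\sqrt{T(S\ln T + n\ln K)}$, with all constants set to one. The function names and the parameter values are illustrative assumptions and do not come from the paper.

```python
import math


def typical_switching_bound(T: int, S: int, K: int) -> float:
    """Shape of bound (1): sqrt(T * S * ln(T * K)), constant factors dropped."""
    return math.sqrt(T * S * math.log(T * K))


def long_term_memory_bound(T: int, S: int, K: int, n: int) -> float:
    """Shape of the long-term memory bound: sqrt(T * (S ln T + n ln K))."""
    return math.sqrt(T * (S * math.log(T) + n * math.log(K)))


# Hypothetical problem size: many switches among only a handful of experts.
T, S, K, n = 10**6, 1000, 10**6, 5
typical = typical_switching_bound(T, S, K)
memory = long_term_memory_bound(T, S, K, n)
print(f"typical: {typical:.0f}, long-term memory: {memory:.0f}")
```

Since $n \le S$, the second expression never exceeds the first, and the two coincide when $n = S$. The saving comes entirely from the $\ln K$ term, so it is most visible when $K$ is large.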

To motivate our results on long-term memory guarantees for MAB, a few remarks are in order. It is not hard to verify that existing approaches achieve switching regret $\mathcal{O}\big(\sqrt{TK(S\ln\frac{nT}{S} + n\ln\frac{K}{n})}\big)$ for MAB. However, the polynomial dependence on the number of actions $K$ makes the improvement of this bound over the typical bound (2) negligible. It is well-known that such polynomial dependence on $K$ is unavoidable in the worst case due to the bandit feedback. This motivates us to consider situations where the necessary dependence on $K$ is much smaller. In particular, Bubeck et al. (2018) recently showed that if the loss vectors are $\rho$-sparse, then a static regret bound of order $\mathcal{O}(\sqrt{T\rho\ln K} + K\ln T)$ is achievable, exhibiting a much more favorable dependence on $K$. We therefore focus on this sparse MAB problem and study what nontrivial switching regret bounds are achievable.

We first show that a bound of order $\mathcal{O}(\sqrt{T\rho S\ln(KT)} + KS\ln T)$, a natural generalization of the typical switching regret bound of (2) to the sparse setting, is impossible. In fact, we show that for any $\rho \ge 2$ the worst-case switching regret is at least $\Omega(\sqrt{TKS})$, even when $\rho = 2$. Since achieving switching regret for MAB can be seen as a special case of contextual bandits (Auer et al., 2002; Langford and Zhang, 2008), this negative result also implies that, surprisingly, sparse losses do not help improve the worst-case regret for contextual bandits, which is a sharp contrast with the non-contextual case studied in (Bubeck et al., 2018) (see Theorem 6 and Corollary 7). Despite this negative result, however, as mentioned we are able to utilize our general framework to still obtain improvements over bound (2) when $n$ is small. Our construction is fairly sophisticated, requiring a special sub-routine that uses a novel one-sided log-barrier regularizer and admits a new kind of “local-norm” guarantee, which may be of independent interest.

## 2 Preliminaries

Throughout the paper, we use $[m]$ to denote the set $\{1, \ldots, m\}$ for some integer $m$. The learning protocol for the expert problem and MAB with $K$ actions and $T$ rounds is as follows: For each time $t = 1, \ldots, T$, (1) the learner first randomly selects an action $I_t \in [K]$ according to a distribution $p_t \in \Delta_K$ (the $(K-1)$-dimensional simplex); (2) simultaneously the environment decides the loss vector $\ell_t \in [-1, 1]^K$; (3) the learner suffers loss $\ell_t(I_t)$ and observes either $\ell_t$ in the expert problem (full-information feedback) or only $\ell_t(I_t)$ in MAB (bandit feedback). For any sequence of $T$ actions $i_1, \ldots, i_T \in [K]$, the expected regret of the learner against this sequence is defined as

$$R(i_{1:T}) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t(I_t) - \sum_{t=1}^{T} \ell_t(i_t)\right] = \mathbb{E}\left[\sum_{t=1}^{T} r_t(i_t)\right],$$

where the expectation is with respect to both the learner and the environment and $r_t(i)$, the instantaneous regret (against action $i$), is defined as $r_t(i) = p_t^\top \ell_t - \ell_t(i)$. When $i_1 = \cdots = i_T$, this becomes the traditional static regret against a fixed action. Most existing works on switching regret impose a constraint on the number of switches for the benchmark sequence: $\sum_{t=2}^{T} \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$. In other words, the sequence can be decomposed into $S$ disjoint intervals, each with a fixed comparator as in static regret. Typical switching regret bounds hold for any sequence with this constraint and are in terms of $T$, $K$ and $S$, such as Eq. (1) and Eq. (2).

The number of switches, however, does not fully characterize the difficulty of the problem. Intuitively, a sequence that frequently switches back to previous actions should be an easier benchmark for an algorithm with long-term memory that remembers which actions performed well in the past. To encode this intuition, prior works (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b; Cesa-Bianchi et al., 2012) introduced another parameter $n = |\{i_1, \ldots, i_T\}|$, the number of distinct actions in the sequence, to quantify the difficulty of the problem, and developed switching regret bounds in terms of $T$, $K$, $S$ and $n$. Clearly one has $n \le \min\{S, K\}$, and we are especially interested in the case when $n \ll \min\{S, K\}$, which is natural if the data exhibits some periodic pattern. Our goal is to understand what improvements are achievable in this case and how to design algorithms that can leverage this property via a unified framework.
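The quantities that parameterize a benchmark sequence can be computed directly. The following minimal sketch counts the switches and the distinct actions of a benchmark and evaluates the realized (single sample path) regret against it. The helper names and the toy data are assumptions made for illustration, and actions are indexed from 0 rather than 1.

```python
from typing import List, Sequence


def count_switches(benchmark: Sequence[int]) -> int:
    """Number of rounds at which the benchmark action differs from the previous one."""
    return sum(1 for prev, cur in zip(benchmark, benchmark[1:]) if prev != cur)


def count_distinct(benchmark: Sequence[int]) -> int:
    """Number of distinct actions appearing in the benchmark."""
    return len(set(benchmark))


def realized_regret(losses: Sequence[Sequence[float]],
                    played: Sequence[int],
                    benchmark: Sequence[int]) -> float:
    """Learner's total loss minus the benchmark's total loss on one sample path."""
    return sum(l[a] - l[b] for l, a, b in zip(losses, played, benchmark))


# A periodic benchmark that keeps returning to the same two actions.
benchmark: List[int] = [0, 0, 2, 2, 0, 0, 2, 2]
good_first = [0.0, 1.0, 1.0]   # action 0 is the only good action
good_last = [1.0, 1.0, 0.0]    # action 2 is the only good action
losses = [good_first, good_first, good_last, good_last] * 2
played = [1] * 8               # a learner stuck on the always-bad action

switches = count_switches(benchmark)   # at most S - 1 in the paper's notation
distinct = count_distinct(benchmark)   # n in the paper's notation
regret = realized_regret(losses, played, benchmark)
```

In this toy sequence the benchmark switches three times but only ever uses two actions, which is exactly the regime where long-term memory is expected to help.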

#### Stochastic setting.

In general, we do not make any assumptions on how the losses are generated by the environment, which is known as the adversarial setting in the literature. We do, however, develop an algorithm (for the expert problem) that enjoys the best of both worlds: it not only enjoys some robust worst-case guarantee in the adversarial setting, but also achieves much smaller logarithmic regret in a stochastic setting. Specifically, in this stochastic setting, without loss of generality, we assume the $n$ distinct actions in $i_{1:T}$ are $1, \ldots, n$. It is further assumed that for each $i \in [n]$, there exists a constant gap $\Delta_i > 0$ such that $\mathbb{E}_t[\ell_t(j) - \ell_t(i)] \ge \Delta_i$ for all $j \neq i$ and all $t$ such that $i_t = i$, where the expectation is with respect to the randomness of the environment conditioned on the history up to the beginning of round $t$. In other words, for every time step the algorithm is compared to the best action, whose expected loss is separated from those of the other actions by a constant gap. This is a natural generalization of the stochastic setting studied for static regret or typical switching regret (Gaillard et al., 2014; Luo and Schapire, 2015).

#### Confidence-rated actions.

Our approach makes use of the confidence-rated expert setting of Blum and Mansour (2007), a generalization of the expert problem (and the sleeping expert problem (Freund et al., 1997)). The protocol of this setting is the same as the expert problem, except that at the beginning of each round, the learner first receives a confidence score $z_t(i) \in [0, 1]$ for each action $i$. The regret against a fixed action $i$ is also scaled by its confidence and is now defined as $\mathbb{E}\big[\sum_{t=1}^{T} z_t(i)\, r_t(i)\big]$. The expert problem is clearly a special case with $z_t(i) = 1$ for all $t$ and $i$. There are a number of known examples showing why this formulation is useful, and our work will add one more to this list.

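To obtain a bound on this new regret measure, one can in fact simply reduce it to the regular expert problem (Blum and Mansour, 2007; Gaillard et al., 2014; Luo and Schapire, 2015). Specifically, let $\mathcal{A}$ be some expert algorithm over the same $K$ actions producing sampling distributions $w_1, \ldots, w_T \in \Delta_K$. The reduction works by sampling $I_t$ according to $p_t$ such that $p_t(i) \propto z_t(i)\, w_t(i)$ and then feeding $c_t$ to $\mathcal{A}$ where $c_t(i) = -z_t(i)\, r_t(i)$. Note that by the definition of $p_t$ one has $w_t^\top c_t = -\sum_{i} w_t(i)\, z_t(i)\, r_t(i) = 0$. Therefore, one can directly equalize the confidence-rated regret and the regular static regret of the reduced problem: $\sum_{t=1}^{T} z_t(i)\, r_t(i) = \sum_{t=1}^{T} \big(w_t^\top c_t - c_t(i)\big)$.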
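One round of this reduction can be sketched as follows, assuming the standard form in which the play distribution is proportional to the product of confidence score and base weight, and the loss fed to the base algorithm is the negated, confidence-scaled instantaneous regret. The function name and the numbers are illustrative assumptions. The final lines check numerically that the base algorithm's own expected loss under the fed loss vector is zero.

```python
from typing import List, Tuple


def confidence_rated_step(w: List[float],
                          z: List[float],
                          loss: List[float]) -> Tuple[List[float], List[float], List[float]]:
    """One round of the confidence-rated reduction.

    w: base algorithm's distribution over actions.
    z: confidence scores in [0, 1], one per action.
    loss: the round's loss vector.
    Returns (p, r, c): play distribution, instantaneous regrets, loss fed to the base algorithm.
    """
    scaled = [zi * wi for zi, wi in zip(z, w)]
    total = sum(scaled)
    if total <= 0.0:
        raise ValueError("at least one action needs positive confidence and weight")
    p = [s / total for s in scaled]                       # p(i) proportional to z(i) * w(i)
    learner_loss = sum(pi * li for pi, li in zip(p, loss))
    r = [learner_loss - li for li in loss]                # r(i) = <p, loss> - loss(i)
    c = [-zi * ri for zi, ri in zip(z, r)]                # c(i) = -z(i) * r(i)
    return p, r, c


w = [0.5, 0.3, 0.2]
z = [1.0, 0.5, 0.0]          # the third action is "asleep" this round
loss = [0.2, -0.4, 0.9]
p, r, c = confidence_rated_step(w, z, loss)
inner = sum(wi * ci for wi, ci in zip(w, c))   # equals 0 up to rounding
```

Because the inner product vanishes, the static regret of the base algorithm against action $i$ under the losses `c` is exactly the confidence-scaled regret against $i$ in the original problem.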

## 3 General Framework for the Expert Problem

In this section, we introduce our general framework to obtain long-term memory regret bounds and demonstrate how it leads to various new algorithms for the expert problem. We start with a simpler version and then move on to a more elaborate construction that is essential to obtain best-of-both-worlds results.

### 3.1 A simple approach for adversarial losses

A simple version of our approach is described in Algorithm 1. At a high level, it simply makes use of the confidence-rated action framework described in Section 2. The reduction to the standard expert problem is executed in Lines 1 and 1, with a black-box expert algorithm $\mathcal{A}$.

It remains to specify how to come up with the confidence score $z_t(i)$. We propose to learn these scores via a separate black-box expert algorithm $\mathcal{A}_i$ for each $i \in [K]$. More specifically, each $\mathcal{A}_i$ is learning over two actions 0 and 1, where action 0 corresponds to confidence score 0 and action 1 corresponds to score 1. Therefore, the probability of picking action 1 at time $t$ naturally represents a confidence score between 0 and 1, which we denote by $z_t(i)$, overloading the notation (Line 1).

As for the losses fed to $\mathcal{A}_i$, we fix the loss of action 0 to be 0 (since shifting losses by the same amount has no real effect), and set the loss of action 1 to be $5\eta - r_t(i)$ (Line 1). The role of the term $-r_t(i)$ is intuitively clear: the larger the loss of action $i$ compared to the algorithm, the less confident we should be about it. The role of the constant bias term $5\eta$ will become clear in the analysis (in fact, it can even be removed at the cost of a worse bound; see Appendix B.2).
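The construction described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1: plain exponential weights (`Hedge`) stands in for the base algorithm, a simple fixed-share update (`FixedShare`) stands in for each two-action sub-routine, and the bias is taken to be `5 * eta`. The class names, the mixing rate `alpha`, and the toy loss sequence are all hypothetical choices, and no claim is made that these particular sub-routines meet the conditions required by the analysis.

```python
import math
import random
from typing import List, Tuple


class Hedge:
    """Exponential weights over k actions (illustrative base algorithm)."""

    def __init__(self, k: int, eta: float) -> None:
        self.eta = eta
        self.log_w = [0.0] * k

    def distribution(self) -> List[float]:
        top = max(self.log_w)                       # subtract the max for stability
        e = [math.exp(x - top) for x in self.log_w]
        s = sum(e)
        return [x / s for x in e]

    def update(self, loss: List[float]) -> None:
        self.log_w = [lw - self.eta * l for lw, l in zip(self.log_w, loss)]


class FixedShare(Hedge):
    """Hedge followed by mixing with the uniform distribution after every update."""

    def __init__(self, k: int, eta: float, alpha: float) -> None:
        super().__init__(k, eta)
        self.alpha = alpha

    def update(self, loss: List[float]) -> None:
        super().update(loss)
        k = len(self.log_w)
        mixed = [(1 - self.alpha) * x + self.alpha / k for x in self.distribution()]
        self.log_w = [math.log(x) for x in mixed]


def run_reduction(losses: List[List[float]], eta: float, alpha: float,
                  seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Run the sketch on a full-information loss sequence; returns plays and play distributions."""
    rng = random.Random(seed)
    k = len(losses[0])
    base = Hedge(k, eta)
    gates = [FixedShare(2, eta, alpha) for _ in range(k)]   # one two-action learner per action
    played, dists = [], []
    for loss in losses:
        w = base.distribution()
        z = [g.distribution()[1] for g in gates]            # confidence = P(action "1")
        scaled = [zi * wi for zi, wi in zip(z, w)]
        total = sum(scaled)
        p = [s / total for s in scaled]
        played.append(rng.choices(range(k), weights=p)[0])
        dists.append(p)
        learner_loss = sum(pi * li for pi, li in zip(p, loss))
        r = [learner_loss - li for li in loss]
        base.update([-zi * ri for zi, ri in zip(z, r)])     # confidence-rated loss
        for g, ri in zip(gates, r):
            g.update([0.0, 5 * eta - ri])                   # action 0 has loss 0, action 1 is biased
    return played, dists


# Toy periodic environment: the good action alternates between 0 and 2 every 25 rounds.
losses = [[0.0, 1.0, 1.0] if (t // 25) % 2 == 0 else [1.0, 1.0, 0.0] for t in range(200)]
played, dists = run_reduction(losses, eta=0.2, alpha=0.01)
```

The design point worth noticing is that the base learner never resets: when an action stops performing well, its two-action learner lowers the confidence score rather than the base weight, so the base weight is still available when the action becomes good again.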

Finally we specify what properties we require from the black-box algorithms $\mathcal{A}, \mathcal{A}_1, \ldots, \mathcal{A}_K$. In short, $\mathcal{A}$ needs to ensure a static regret bound, while $\mathcal{A}_1, \ldots, \mathcal{A}_K$ need to ensure a switching regret bound. The trick is that since $\mathcal{A}_1, \ldots, \mathcal{A}_K$ are learning over only two actions, this construction helps us to separate the dependence on $K$ and the number of switches $S$. These (static or switching) regret bounds could be the standard worst-case $T$-dependent bounds mentioned in Section 1, in which case we would obtain looser long-term memory guarantees (specifically, $\sqrt{n}$ times worse; see Appendix B.2). Instead, we require these bounds to be data-dependent and in particular of the form specified below:

###### Condition 1.

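There exists a constant $C > 0$ such that for any $\eta \in (0, \frac{1}{5}]$ and any loss sequence $c_1, \ldots, c_T \in [-2, 2]^K$, algorithm $\mathcal{A}$ (possibly with knowledge of $\eta$) produces sampling distributions $w_1, \ldots, w_T \in \Delta_K$ and ensures one of the following static regret bounds:

$$\sum_{t=1}^{T} w_t^\top c_t - \sum_{t=1}^{T} c_t(i) \le \frac{C\ln K}{\eta} + \eta \sum_{t=1}^{T} |c_t(i)|, \quad \forall i \in [K] \tag{3}$$

or

$$\sum_{t=1}^{T} w_t^\top c_t - \sum_{t=1}^{T} c_t(i) \le \frac{C\ln K}{\eta} + \eta \sum_{t=1}^{T} \left|w_t^\top c_t - c_t(i)\right|, \quad \forall i \in [K]. \tag{4}$$
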
###### Condition 2.

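There exists a constant $C > 0$ such that for any $\eta \in (0, \frac{1}{5}]$, any loss sequence $h_1, \ldots, h_T \in [-3, 3]^2$, and any $S \in [T]$, algorithm $\mathcal{A}_i$ (possibly with knowledge of $\eta$) produces sampling distributions $q_1, \ldots, q_T \in \Delta_2$ and ensures one of the following switching regret bounds against any sequence $b_1, \ldots, b_T \in \{0, 1\}$ with $\sum_{t=2}^{T} \mathbf{1}\{b_t \neq b_{t-1}\} \le S - 1$ (Footnote 2: In terms of notation in Algorithm 1, $q_t = (1 - z_t(i), z_t(i))$ and $h_t = (0, 5\eta - r_t(i))$.):

$$\sum_{t=1}^{T} q_t^\top h_t - \sum_{t=1}^{T} h_t(b_t) \le \frac{CS\ln T}{\eta} + \eta \sum_{t=1}^{T} |h_t(b_t)|, \tag{5}$$

or

$$\sum_{t=1}^{T} q_t^\top h_t - \sum_{t=1}^{T} h_t(b_t) \le \frac{CS\ln T}{\eta} + \eta \sum_{t=1}^{T} \left|q_t^\top h_t - h_t(b_t)\right|, \tag{6}$$

or

$$\sum_{t=1}^{T} q_t^\top h_t - \sum_{t=1}^{T} h_t(b_t) \le \frac{CS\ln T}{\eta} + \eta \sum_{t=1}^{T} \sum_{b \in \{0, 1\}} q_t(b)\, |h_t(b)|. \tag{7}$$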

We emphasize that these data-dependent bounds are all standard in the online learning literature, and provide a few examples below (see Appendix A for brief proofs). (Footnote 3: In fact, most standard bounds replace the absolute value we present here with a square, leading to even smaller bounds (up to a constant). We choose to use the looser ones with absolute values since this makes the conditions weaker while still being sufficient for all of our analysis.)

###### Proposition 1.

The following algorithms all satisfy Condition 1: Variants of Hedge (Hazan and Kale, 2010; Steinhardt and Liang, 2014), Prod (Cesa-Bianchi et al., 2007), Adapt-ML-Prod (Gaillard et al., 2014), AdaNormalHedge (Luo and Schapire, 2015), and iProd/Squint (Koolen and Van Erven, 2015).

###### Proposition 2.

The following algorithms all satisfy Condition 2: Fixed-share (Herbster and Warmuth, 1998), a variant of Fixed-share (Algorithm 5 in Appendix A), and AdaNormalHedge.TV (Luo and Schapire, 2015).

We are now ready to state the main result for Algorithm 1 (see Appendix B.1 for the proof).

###### Theorem 3.

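Suppose Conditions 1 and 2 both hold. With $\eta = \min\Big\{\frac{1}{5}, \sqrt{\frac{S\ln T + n\ln K}{T}}\Big\}$, Algorithm 1 ensures $R(i_{1:T}) = \mathcal{O}\big(\sqrt{T(S\ln T + n\ln K)}\big)$ for any loss sequence $\ell_1, \ldots, \ell_T$ and benchmark sequence $i_{1:T}$ such that $\sum_{t=2}^{T} \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$.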

Our bound in Theorem 3 is slightly worse than the existing bound of $\mathcal{O}\big(\sqrt{T(S\ln\frac{nT}{S} + n\ln\frac{K}{n})}\big)$ (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b), but still improves over the typical switching regret $\mathcal{O}(\sqrt{TS\ln(TK)})$ (Eq. (1)), especially when $n$ is small and $S$ and $K$ are large. (Footnote 4: In fact, using the adaptive guarantees of AdaNormalHedge (Luo and Schapire, 2015) or iProd/Squint (Koolen and Van Erven, 2015) that replace the $\ln K$ dependence in Eq. (4) by a KL divergence term, one can further improve the $n\ln K$ term in our bound to $n\ln\frac{K}{n}$, matching previous bounds. Since this improvement is small, we omit the details.) To better understand the implication of our bounds, consider the following thought experiment. If the learner knew about the switch points (that is, $\{t : i_t \neq i_{t-1}\}$) that naturally divide the whole game into $S$ intervals, she could simply pick any algorithm with optimal static regret ($\mathcal{O}(\sqrt{T\ln K})$) and apply $S$ instances of this algorithm, one for each interval, which, via a direct application of the Cauchy-Schwarz inequality, leads to switching regret $\mathcal{O}(\sqrt{TS\ln K})$. Compared to bound (1), this implies that the price of not knowing the switch points is $\mathcal{O}(\sqrt{TS\ln T})$. Similarly, if the learner knew not only the switch points, but also the information on which intervals share the same competitor, then she could naturally apply $n$ instances of the static algorithm, one for each set of intervals with the same competitor. Again by the Cauchy-Schwarz inequality, this leads to switching regret $\mathcal{O}(\sqrt{Tn\ln K})$. Therefore, our bound implies that the price of not having any prior information of the benchmark sequence is still $\mathcal{O}(\sqrt{TS\ln T})$.
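The two Cauchy-Schwarz steps in this thought experiment can be written out explicitly. The interval lengths $T_k$ and the per-competitor totals $T^{(i)}$ are notation introduced here for illustration, and the only assumption is that a static-regret algorithm run for $m$ rounds pays $\mathcal{O}(\sqrt{m \ln K})$.

```latex
% T_k: length of the k-th interval, so that \sum_{k=1}^{S} T_k = T.
\sum_{k=1}^{S} \sqrt{T_k \ln K}
  \;\le\; \sqrt{S \sum_{k=1}^{S} T_k \ln K}
  \;=\; \sqrt{T S \ln K}.

% T^{(i)}: total length of all intervals whose competitor is i,
% so that \sum_{i=1}^{n} T^{(i)} = T.
\sum_{i=1}^{n} \sqrt{T^{(i)} \ln K}
  \;\le\; \sqrt{n \sum_{i=1}^{n} T^{(i)} \ln K}
  \;=\; \sqrt{T n \ln K}.
```

Both inequalities are instances of $\sum_{k=1}^{m} a_k \le \sqrt{m \sum_{k=1}^{m} a_k^2}$, applied with $m = S$ and $m = n$ respectively.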

Compared to existing methods, our framework is more flexible and allows one to plug in any combination of the algorithms listed in Propositions 1 and 2. This flexibility is crucial and allows us to solve the problems discussed in the following sections. The approach of (Adamskiy et al., 2012b) makes use of a sleeping expert framework, a special case of the confidence-rated expert framework. However, their approach is not a general reduction and does not allow plugging in different algorithms. Finally, we note that our construction also shares some similarity with the black-box approach of (Christiano, 2017) for a multi-task learning problem.

### 3.2 Best of both worlds

To further demonstrate the power of our approach, we now show how to use our framework to construct a parameter-free algorithm that enjoys the best of both adversarial and stochastic environments, resolving the open problem of (Warmuth and Koolen, 2014) (see Algorithm 2). The key is to derive an adaptive switching regret bound that replaces the dependence on $T$ by the sum of the magnitudes of the instantaneous regret $\sum_{t=1}^{T} |r_t(i_t)|$, which previous works (Gaillard et al., 2014; Luo and Schapire, 2015) show is sufficient for adapting to the stochastic setting and achieving logarithmic regret.

To achieve this goal, the first modification we need is to change the bias term for the loss of action “1” for $\mathcal{A}_i$ from $5\eta$ to $5\eta\,|r_t(i)|$. Following the proof of Theorem 3, one can show that the dependence on $T$ now becomes $\sum_{t : i_t = i} |r_t(i)|$ for the regret against $i$. If we could tune $\eta$ optimally in terms of this data-dependent quantity, then this would imply logarithmic regret in the stochastic setting by the same reasoning as in (Gaillard et al., 2014; Luo and Schapire, 2015).

However, the difficulty is that the optimal tuning of $\eta$ is unknown beforehand, and more importantly, different actions require tuning $\eta$ differently. To address this issue, at a high level we discretize the learning rate and pick $M$ exponentially increasing values $\eta_1, \ldots, \eta_M$ (Line 2), then we make $M$ copies of each action $i \in [K]$, one for each learning rate $\eta_j$. More specifically, this means that the number of actions for $\mathcal{A}$ increases from $K$ to $KM$, and so does the number of sub-routines with switching regret, now denoted as $\mathcal{A}_{ij}$ for $i \in [K]$ and $j \in [M]$. Different copies of an action $i$ share the same loss $\ell_t(i)$ for $\mathcal{A}$, while action “1” for $\mathcal{A}_{ij}$ now suffers loss $5\eta_j |r_t(i)| - r_t(i)$ (Line 2). The rest of the construction remains the same. Note that selecting a copy of an action is the same as selecting the corresponding action, which explains the update rule of the sampling probability $p_t$ in Line 2 that marginalizes over $j$. Also note that for a vector in $\mathbb{R}^{KM}$ (e.g., $w_t$, $c_t$), we use $(i, j)$ to index its coordinates for $i \in [K]$ and $j \in [M]$.
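The two mechanical ingredients of this construction, a grid of exponentially increasing learning rates and a play distribution that marginalizes over the copies of each action, can be sketched as follows. The specific grid (doubling from $\sqrt{\ln K / T}$ up to a cap of $1/5$) is an assumption chosen for illustration, as are the function names; the paper's exact grid may differ.

```python
import math
from typing import List


def learning_rate_grid(horizon: int, k: int, cap: float = 0.2) -> List[float]:
    """Hypothetical grid: start at sqrt(ln k / horizon) and double until reaching cap."""
    if k < 2 or horizon < 1:
        raise ValueError("need at least two actions and one round")
    grid = [min(cap, math.sqrt(math.log(k) / horizon))]
    while grid[-1] < cap:
        grid.append(min(cap, 2 * grid[-1]))
    return grid


def marginal_play_distribution(w: List[List[float]], z: List[List[float]]) -> List[float]:
    """p(i) proportional to sum_j z[i][j] * w[i][j], where j indexes the copies of action i."""
    mass = [sum(zij * wij for zij, wij in zip(zi, wi)) for zi, wi in zip(z, w)]
    total = sum(mass)
    if total <= 0.0:
        raise ValueError("total confidence-weighted mass must be positive")
    return [m / total for m in mass]


grid = learning_rate_grid(horizon=10_000, k=16)

# Two actions with two copies each; w sums to one over all (i, j) pairs.
w = [[0.1, 0.2], [0.3, 0.4]]
z = [[1.0, 0.0], [0.5, 0.5]]
p = marginal_play_distribution(w, z)
```

With a doubling grid the number of copies per action grows only logarithmically in the horizon, which is why enlarging the action set from $K$ to $KM$ costs little in a bound that depends on $\ln(KM)$.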

Finally, with this new construction, we need algorithm $\mathcal{A}$ to exhibit a more adaptive static regret bound and in some sense be aware of the fact that different actions now correspond to different learning rates. More precisely, we replace Condition 1 with the following condition:

###### Condition 3.

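There exists a constant $C > 0$ such that for any $\eta_1, \ldots, \eta_M \in (0, \frac{1}{5}]$ and any loss sequence $c_1, \ldots, c_T \in [-2, 2]^{KM}$, algorithm $\mathcal{A}$ (possibly with knowledge of $\eta_1, \ldots, \eta_M$) produces sampling distributions $w_1, \ldots, w_T \in \Delta_{KM}$ and ensures the following static regret bounds for all $i \in [K]$ and $j \in [M]$ (Footnote 5: In fact an analogue of Eq. (3) with individual learning rates would also suffice, but we are not aware of any algorithms that achieve such a guarantee.):

$$\sum_{t=1}^{T} w_t^\top c_t - \sum_{t=1}^{T} c_t(i, j) \le \frac{C\ln(KM)}{\eta_j} + \eta_j \sum_{t=1}^{T} \left|w_t^\top c_t - c_t(i, j)\right|. \tag{8}$$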

Once again, this requirement is achievable by many existing algorithms and we provide some examples below (see Appendix A for proofs).

###### Proposition 4.

The following algorithms all satisfy Condition 3: A variant of Hedge (Algorithm 6 in Appendix A), Adapt-ML-Prod (Gaillard et al., 2014), AdaNormalHedge (Luo and Schapire, 2015), and iProd/Squint (Koolen and Van Erven, 2015).

We now state our main result for Algorithm 2 (see Appendix B.3 for the proof).

###### Theorem 5.

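Suppose algorithm $\mathcal{A}$ satisfies Condition 3 and all $\mathcal{A}_{ij}$ satisfy Condition 2. Algorithm 2 ensures that for any benchmark sequence $i_{1:T}$ such that $\sum_{t=2}^{T} \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$, the following hold:

• In the adversarial setting, we have $R(i_{1:T}) = \mathcal{O}\big(\sqrt{T(S\ln T + n\ln(K\ln T))}\big)$.

• In the stochastic setting (defined in Section 2), we have $R(i_{1:T}) = \mathcal{O}\Big(\sum_{i=1}^{n} \frac{S_i\ln T + \ln(K\ln T)}{\Delta_i}\Big)$, where $S_i = 1 + \sum_{t=2}^{T} \mathbf{1}\big\{\mathbf{1}\{i_t = i\} \neq \mathbf{1}\{i_{t-1} = i\}\big\}$ s.t. $\sum_{i \in [n]} S_i \le 3S$. (Footnote 6: This definition of $S_i$ is the same as the one in the proof of Theorem 3.)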

In other words, with a negligible price of $\ln\ln T$ for the adversarial setting, our algorithm achieves logarithmic regret in the stochastic setting with favorable dependence on $S$ and $n$. The best prior result is achieved by AdaNormalHedge.TV (Luo and Schapire, 2015), with regret $\mathcal{O}(\sqrt{TS\ln(TK)})$ for the adversarial case and $\mathcal{O}\big(\sum_{i=1}^{n} \frac{S_i\ln(TK)}{\Delta_i}\big)$ for the stochastic case. We also remark that a variant of the algorithm of (Bousquet and Warmuth, 2002) with a doubling trick can achieve a guarantee similar to ours, but weaker in the sense that each $\Delta_i$ is replaced by $\min_{i} \Delta_i$. To the best of our knowledge this was previously unknown and we provide the details in Appendix B.4 for completeness.

## 4 Long-term Memory under Bandit Feedback

In this section, we move on to the bandit setting where the learner only observes the loss of the selected action $\ell_t(I_t)$ instead of $\ell_t$. As mentioned in Section 1, one could directly generalize the approach of (Bousquet and Warmuth, 2002; Adamskiy et al., 2012b; Cesa-Bianchi et al., 2012) to obtain a bound of order $\mathcal{O}\big(\sqrt{TK(S\ln\frac{nT}{S} + n\ln\frac{K}{n})}\big)$, a natural generalization of the full-information guarantee, but such a bound is not a meaningful improvement compared to (2), due to the $\sqrt{K}$ dependence that is unavoidable for MAB in the worst case. Therefore, we consider a special case where the dependence on $K$ is much smaller: the sparse MAB problem (Bubeck et al., 2018). Specifically, in this setting we make the additional assumption that all loss vectors are $\rho$-sparse for some $\rho \in [K]$, that is, $\|\ell_t\|_0 \le \rho$ for all $t \in [T]$. It was shown in (Bubeck et al., 2018) that for sparse MAB the static regret is of order $\mathcal{O}(\sqrt{T\rho\ln K} + K\ln T)$, exhibiting a much more favorable dependence on $K$.

#### Negative result.

To the best of our knowledge, there are no prior results on switching regret for sparse MAB. In light of bound (2), a natural conjecture would be that it is possible to achieve switching regret of $\mathcal{O}(\sqrt{T\rho S\ln(KT)} + KS\ln T)$ with $S - 1$ switches. Perhaps surprisingly, we show that this is in fact impossible.

###### Theorem 6.

For any $\rho \ge 2$ and any MAB algorithm, there exists a sequence of loss vectors that are $\rho$-sparse, such that the switching regret of this algorithm is at least $\Omega(\sqrt{TKS})$.

The high level idea of the proof is to force the algorithm to overfocus on one good action and thus miss an even better action later. This is similar to the construction of (Daniely et al., 2015, Lemma 3) and (Wei et al., 2016, Theorem 4.1), and we defer the proof to Appendix C.1. This negative result implies that sparsity does not help improve the typical switching regret bound (2). In fact, since switching regret for MAB can be seen as a special case of the contextual bandits problem (Auer et al., 2002; Langford and Zhang, 2008), this result also immediately implies the following corollary, a sharp contrast compared to the positive result for the non-contextual case mentioned earlier (see Appendix C.1 for the definition of contextual bandit and related discussions).

###### Corollary 7.

Sparse losses do not help improve the worst-case regret for contextual bandits.

#### Long-term memory to the rescue.

Despite the above negative results, we next show how long-term memory can still help improve the switching regret for sparse MAB. Specifically, we use our general framework to develop a MAB algorithm whose switching regret is smaller than $\mathcal{O}(\sqrt{TKS})$ whenever $\rho$ and $n$ are small while $S$ and $K$ are large. Note that this is not a contradiction with Theorem 6, since in the construction of its proof, $n$ is as large as $\min\{S, K\}$.

At a high level, our algorithm (Algorithm 3) works by constructing the standard unbiased importance-weighted loss estimator (Line 3) and plugging it into our general framework (Algorithm 1). However, we emphasize that it is highly nontrivial to control the variance of these estimators without leading to bad dependence on $K$ in this framework, where two types of sub-routines interact with each other. To address this issue, we design specialized sub-algorithms $\mathcal{A}$ and $\mathcal{A}_i$ to learn $w_t$ and $z_t(i)$ respectively. For learning $w_t$, we essentially deploy the algorithm of (Bubeck et al., 2018) for sparse MAB, which is an instance of the standard follow-the-regularized-leader algorithm with a special hybrid regularizer, combining the entropy and the log-barrier (Lines 3 and 3). However, note that the loss we feed to this algorithm is not sparse and we cannot directly apply the guarantee from (Bubeck et al., 2018), but it turns out that one can still utilize the implicit exploration of this algorithm, as shown in our analysis. Compared to Algorithm 1, we also incorporate an extra bias term in the definition of $c_t$ (Line 3), which is important for canceling the large variance of the loss estimator.
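The standard importance-weighted estimator referred to above assigns the observed loss divided by its selection probability to the played action and zero to every other action. The sketch below, with hypothetical function names and toy numbers, builds the estimator and verifies its unbiasedness by enumerating all possible plays rather than by sampling. It also computes the second moment, which illustrates why small selection probabilities inflate the variance.

```python
from typing import List


def importance_weighted_estimate(observed_loss: float, played: int,
                                 p: List[float]) -> List[float]:
    """Estimator that is loss / p on the played coordinate and 0 elsewhere."""
    estimate = [0.0] * len(p)
    estimate[played] = observed_loss / p[played]
    return estimate


def expected_estimate(loss: List[float], p: List[float]) -> List[float]:
    """Exact expectation of the estimator over the random play I ~ p."""
    k = len(p)
    mean = [0.0] * k
    for a in range(k):
        if p[a] == 0.0:
            continue                     # an action with zero probability is never played
        est = importance_weighted_estimate(loss[a], a, p)
        for i in range(k):
            mean[i] += p[a] * est[i]
    return mean


def second_moment(loss: List[float], p: List[float]) -> List[float]:
    """E[estimate(i)^2] = loss(i)^2 / p(i), which blows up as p(i) shrinks."""
    return [li * li / pi for li, pi in zip(loss, p)]


loss = [0.0, 0.7, 0.0, -0.3]             # a 2-sparse loss vector
p = [0.1, 0.2, 0.3, 0.4]
mean = expected_estimate(loss, p)
moments = second_moment(loss, p)
```

Unbiasedness requires every action to have positive selection probability, which is one reason bandit algorithms of this kind typically mix in some exploration.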

For learning $z_t(i)$ for each $i$, we design a new algorithm that is an instance of the standard Online Mirror Descent algorithm (see e.g., (Hazan et al., 2016)). Recall that this is a one-dimensional problem, as we are trying to learn the distribution over actions $\{0, 1\}$. We design a special one-dimensional regularizer $\phi(z) = \frac{1}{\eta}\ln\frac{1}{z}$, which can be seen as a one-sided log-barrier, to bias towards action “1”. (Footnote 7: The usual log-barrier regularizer (see e.g. (Foster et al., 2016; Agarwal et al., 2017; Wei and Luo, 2018)) would be $\frac{1}{\eta}\big(\ln\frac{1}{z} + \ln\frac{1}{1-z}\big)$ in this case.) Technically, this provides a special “local-norm” guarantee that is critical for our analysis and may be of independent interest (see Lemma 14 in Appendix C.2). In addition, we remove the bias term in the loss for action “1” (so that only the negative instantaneous-regret term remains) as it does not help in the bandit case, and we also force $z_t(i)$ to be at least $\delta$ for some parameter $\delta$, which is important for achieving switching regret. Line 3 summarizes the update for $z_t(i)$.

Finally, we also enforce a small amount of uniform exploration by sampling $I_t$ from $\tilde{p}_t$, a smoothed version of $p_t$ (Line 3). We present the main result of our algorithm below (proven in Appendix C.2).

###### Theorem 8.

With , Algorithm 3 ensures

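$$R(i_{1:T}) = \mathcal{O}\left((\rho S)^{\frac{1}{3}} (nT)^{\frac{2}{3}} + n\sqrt{T\rho\ln K} + nK^3\ln T\right) \tag{9}$$

for any sequence of $\rho$-sparse losses $\ell_1, \ldots, \ell_T$ and any benchmark sequence $i_{1:T}$ such that $\sum_{t=2}^{T} \mathbf{1}\{i_t \neq i_{t-1}\} \le S - 1$ and $|\{i_1, \ldots, i_T\}| \le n$.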

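In the case when $\rho$ and $n$ are constants, our bound (9) becomes $\mathcal{O}(S^{1/3}T^{2/3} + K^3\ln T)$ (ignoring the lower-order $\sqrt{T\ln K}$ term), which improves over the existing bound $\mathcal{O}(\sqrt{TKS\ln(TK)})$ when $S$ and $K$ are large (recall the example in Section 1 where our bound is sublinear in $T$ while existing bounds become vacuous).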
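A small piece of exponent arithmetic makes this comparison concrete. Assume, purely for illustration, that $S = T^{a}$ and $K = T^{b}$ with $\rho$ and $n$ constant, and ignore logarithmic factors. The sketch below computes the resulting power of $T$ in bound (9) and in bound (2) using exact fractions. The function names and the choice $a = 7/10$, $b = 3/10$ are assumptions of this example.

```python
from fractions import Fraction


def exponent_bound_9(a: Fraction, b: Fraction) -> Fraction:
    """Power of T in S^(1/3) T^(2/3) + sqrt(T) + K^3 when S = T^a and K = T^b."""
    return max(a / 3 + Fraction(2, 3), Fraction(1, 2), 3 * b)


def exponent_bound_2(a: Fraction, b: Fraction) -> Fraction:
    """Power of T in sqrt(T K S) when S = T^a and K = T^b."""
    return (1 + a + b) / 2


a, b = Fraction(7, 10), Fraction(3, 10)
ours = exponent_bound_9(a, b)
typical = exponent_bound_2(a, b)
print(f"bound (9): T^{ours}, bound (2): T^{typical}")
```

An exponent of 1 means the bound is linear in $T$ and therefore vacuous, while any exponent strictly below 1 is a sublinear (nontrivial) guarantee.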

As a final remark, one might wonder if similar best-of-both-worlds results are also possible for MAB in terms of switching regret, given the positive results for static regret (Bubeck and Slivkins, 2012; Seldin and Slivkins, 2014; Auer and Chiang, 2016; Seldin and Lugosi, 2017; Wei and Luo, 2018; Zimmert et al., 2019). We point out that the answer is negative — the proof of (Wei et al., 2016, Theorem 4.1) implicitly implies that even with one switch, logarithmic regret is impossible for MAB in the stochastic setting.

#### Acknowledgments.

The authors would like to thank Alekh Agarwal, Sébastien Bubeck, Dylan Foster, Wouter Koolen, Manfred Warmuth, and Chen-Yu Wei for helpful discussions. Haipeng Luo was supported by NSF Grant IIS-1755781. Ilias Diakonikolas was supported by NSF Award CCF-1652862 (CAREER) and a Sloan Research Fellowship.

## References

• Adamskiy et al. (2012a) D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk. A closer look at adaptive regret. In International Conference on Algorithmic Learning Theory, pages 290–304. Springer, 2012a.
• Adamskiy et al. (2012b) D. Adamskiy, M. K. Warmuth, and W. M. Koolen. Putting Bayes to sleep. In Advances in Neural Information Processing Systems, pages 135–143, 2012b.
• Agarwal et al. (2017) A. Agarwal, H. Luo, B. Neyshabur, and R. E. Schapire. Corralling a band of bandit algorithms. In Conference on Learning Theory, 2017.
• Audibert and Bubeck (2010) J.-Y. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11(Oct):2785–2836, 2010.
• Auer and Chiang (2016) P. Auer and C.-K. Chiang. An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Conference on Learning Theory, pages 116–120, 2016.
• Auer et al. (2002) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
• Blum and Mansour (2007) A. Blum and Y. Mansour. From external to internal regret. Journal of Machine Learning Research, 8(Jun):1307–1324, 2007.
• Bousquet and Warmuth (2002) O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.
• Bubeck and Slivkins (2012) S. Bubeck and A. Slivkins. The best of both worlds: stochastic and adversarial bandits. In Conference on Learning Theory, pages 42–1, 2012.
• Bubeck et al. (2018) S. Bubeck, M. Cohen, and Y. Li. Sparsity, variance and curvature in multi-armed bandits. In Algorithmic Learning Theory, pages 111–127, 2018.
• Bubeck et al. (2019) S. Bubeck, Y. Li, H. Luo, and C.-Y. Wei. Improved path-length regret bounds for bandits. In Conference On Learning Theory, 2019.
• Cesa-Bianchi et al. (2007) N. Cesa-Bianchi, Y. Mansour, and G. Stoltz. Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321–352, 2007.
• Cesa-Bianchi et al. (2012) N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Advances in Neural Information Processing Systems, pages 980–988, 2012.
• Christiano (2017) P. Christiano. Manipulation-resistant online learning. PhD thesis, University of California, Berkeley, 2017.
• Daniely et al. (2015) A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In International Conference on Machine Learning, pages 1405–1411, 2015.
• Foster et al. (2016) D. J. Foster, Z. Li, T. Lykouris, K. Sridharan, and E. Tardos. Learning in games: Robustness of fast convergence. In Advances in Neural Information Processing Systems, pages 4734–4742, 2016.
• Freund and Schapire (1997) Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
• Freund et al. (1997) Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize. In Proceedings of the Twenty-Ninth Annual ACM Symposium on the Theory of Computing. Citeseer, 1997.
• Gaillard et al. (2014) P. Gaillard, G. Stoltz, and T. Van Erven. A second-order bound with excess losses. In Conference on Learning Theory, pages 176–196, 2014.
• Hazan and Kale (2010) E. Hazan and S. Kale. Extracting certainty from uncertainty: Regret bounded by variation in costs. Machine learning, 80(2-3):165–188, 2010.
• Hazan and Seshadhri (2007) E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. In Electronic colloquium on computational complexity (ECCC), volume 14, 2007.
• Hazan et al. (2016) E. Hazan et al. Introduction to online convex optimization. Foundations and Trends® in Optimization, 2(3-4):157–325, 2016.
• Herbster and Warmuth (1998) M. Herbster and M. K. Warmuth. Tracking the best expert. Machine learning, 32(2):151–178, 1998.
• Jun et al. (2017) K.-S. Jun, F. Orabona, S. Wright, R. Willett, et al. Online learning for changing environments using coin betting. Electronic Journal of Statistics, 11(2):5282–5310, 2017.
• Koolen and Van Erven (2015) W. M. Koolen and T. Van Erven. Second-order quantile methods for experts and combinatorial games. In Conference on Learning Theory, pages 1155–1175, 2015.
• Langford and Zhang (2008) J. Langford and T. Zhang. The epoch-greedy algorithm for multi-armed bandits with side information. In Advances in Neural Information Processing Systems, pages 817–824, 2008.
• Luo and Schapire (2015) H. Luo and R. E. Schapire. Achieving all with no parameters: AdaNormalHedge. In Conference on Learning Theory, pages 1286–1304, 2015.
• Luo et al. (2018) H. Luo, C.-Y. Wei, A. Agarwal, and J. Langford. Efficient contextual bandits in non-stationary worlds. In Conference On Learning Theory, pages 1739–1776, 2018.
• Seldin and Lugosi (2017) Y. Seldin and G. Lugosi. An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In Conference on Learning Theory, 2017.
• Seldin and Slivkins (2014) Y. Seldin and A. Slivkins. One practical algorithm for both stochastic and adversarial bandits. In International Conference on Machine Learning, pages 1287–1295, 2014.
• Steinhardt and Liang (2014) J. Steinhardt and P. Liang. Adaptivity and optimism: An improved exponentiated gradient algorithm. In International Conference on Machine Learning, pages 1593–1601, 2014.
• Warmuth and Koolen (2014) M. K. Warmuth and W. M. Koolen. Open problem: Shifting experts on easy data. In Conference on Learning Theory, pages 1295–1298, 2014.
• Wei and Luo (2018) C.-Y. Wei and H. Luo. More adaptive algorithms for adversarial bandits. In Conference On Learning Theory, pages 1263–1291, 2018.
• Wei et al. (2016) C.-Y. Wei, Y.-T. Hong, and C.-J. Lu. Tracking the best expert in non-stationary stochastic environments. In Advances in neural information processing systems, pages 3972–3980, 2016.
• Zimmert et al. (2019) J. Zimmert, H. Luo, and C.-Y. Wei. Beating stochastic and adversarial semi-bandits optimally and simultaneously. In International Conference on Machine Learning, 2019.

## Appendix A Examples of Sub-routines

In this section, we briefly discuss why the algorithms listed in Propositions 1, 2, and 4 satisfy Conditions 1, 2, and 3 respectively. We first note that, except for AdaNormalHedge [Luo and Schapire, 2015], all other algorithms satisfy even tighter bounds with the absolute value replaced by the square (see also Footnote 3).

### A.1 Condition 1

Prod [Cesa-Bianchi et al., 2007] with learning rate satisfies Eq. (4) according to its original analysis. Adapt-ML-Prod [Gaillard et al., 2014], AdaNormalHedge [Luo and Schapire, 2015], and iProd/Squint [Koolen and Van Erven, 2015] are all parameter-free algorithms that satisfy for all ,

$$
\sum_{t=1}^{T}\left(w_t^\top c_t - c_t(i)\right) \le \mathcal{O}\left(\sqrt{(\ln K)\sum_{t=1}^{T}\left|w_t^\top c_t - c_t(i)\right|} + \ln K\right). \tag{10}
$$

By the AM-GM inequality, the square root term can be upper bounded by for any . Also, the constraint in Condition 1 allows one to bound the extra term by . This leads to Eq. (4).
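For readers who want the AM-GM step spelled out, the following is one standard way to write it. The shorthand $V_i$ and the free parameter $\eta$ are introduced here for illustration only (they are assumptions, not notation taken from the original), and the exact constants appearing in Eq. (4) may differ.

$$
\sqrt{(\ln K)\,V_i} = \sqrt{(\eta V_i)\cdot\frac{\ln K}{\eta}} \le \frac{1}{2}\left(\eta V_i + \frac{\ln K}{\eta}\right), \qquad V_i = \sum_{t=1}^{T}\left|w_t^\top c_t - c_t(i)\right|, \quad \eta > 0.
$$

The first equality only regroups the product under the square root, and the inequality is $\sqrt{ab} \le (a+b)/2$ for nonnegative $a$ and $b$.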

Finally, for completeness we present a variant of Hedge (Algorithm 4) that can be extracted from [Hazan and Kale, 2010, Steinhardt and Liang, 2014] and that satisfies Eq. (3).

###### Proposition 9.

Algorithm 4 satisfies Eq. (3).

###### Proof.

Define where with and . The goal is to show , which implies for any , and thus Eq. (3) after rearranging. Indeed, for any we have

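$$
\begin{aligned}
\Phi_t - \Phi_{t-1}
&= \exp\left(\eta w_t^\top c_t\right)\sum_i \exp\left(\eta R_{t-1}(i) - \eta^2 G_{t-1}(i)\right)\left(\exp\left(-\eta c_t(i) - \eta^2 c_t^2(i)\right) - \exp\left(-\eta w_t^\top c_t\right)\right) \\
&\le \exp\left(\eta w_t^\top c_t\right)\sum_i \exp\left(\eta R_{t-1}(i) - \eta^2 G_{t-1}(i)\right)\left(1 - \eta c_t(i) - \exp\left(-\eta w_t^\top c_t\right)\right) \\
&\le \exp\left(\eta w_t^\top c_t\right)\sum_i \exp\left(\eta R_{t-1}(i) - \eta^2 G_{t-1}(i)\right)\eta\, r_t(i) \\
&= 0,
\end{aligned}
$$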

where the first inequality uses the fact for any , the second inequality uses the fact for any , and the last equality holds since and . ∎
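The potential argument above can be checked numerically. The sketch below is not Algorithm 4 itself (its pseudocode is not reproduced here); it assumes, as read off from the displayed derivation, that the weights are $w_t(i) \propto \exp(\eta R_{t-1}(i) - \eta^2 G_{t-1}(i))$ with $r_t(i) = w_t^\top c_t - c_t(i)$, $R_t(i) = \sum_{s \le t} r_s(i)$, and $G_t(i) = \sum_{s \le t} c_s^2(i)$. The function name `second_order_hedge`, the loss range $[-1, 1]$, and the choice $\eta = 1/2$ are illustrative assumptions.

```python
import math
import random


def second_order_hedge(losses, eta):
    """Hedge-style update with a second-order correction (illustrative sketch).

    Assumes every loss lies in [-1, 1] and 0 < eta <= 1/2, so that
    -eta * c_t(i) >= -1/2 as needed by the first inequality in the proof.
    Returns the played distributions, the potentials Phi_0..Phi_T,
    and the final vectors R_T and G_T.
    """
    K = len(losses[0])
    R = [0.0] * K  # cumulative instantaneous regret R_t(i)
    G = [0.0] * K  # cumulative squared loss G_t(i)
    potentials = [float(K)]  # Phi_0 = K since R_0 = G_0 = 0
    plays = []
    for c in losses:
        logits = [eta * R[i] - eta ** 2 * G[i] for i in range(K)]
        shift = max(logits)  # subtract the max for numerical stability
        unnorm = [math.exp(x - shift) for x in logits]
        total = sum(unnorm)
        w = [u / total for u in unnorm]
        plays.append(w)
        learner_loss = sum(w[i] * c[i] for i in range(K))
        for i in range(K):
            R[i] += learner_loss - c[i]
            G[i] += c[i] ** 2
        potentials.append(
            sum(math.exp(eta * R[i] - eta ** 2 * G[i]) for i in range(K))
        )
    return plays, potentials, R, G


random.seed(0)
T, K, eta = 200, 5, 0.5
losses = [[random.uniform(-1.0, 1.0) for _ in range(K)] for _ in range(T)]
plays, potentials, R, G = second_order_hedge(losses, eta)
```

Under these assumptions the potential should never increase, and taking logarithms of $\Phi_T \le \Phi_0 = K$ gives $R_T(i) \le \frac{\ln K}{\eta} + \eta\, G_T(i)$ for every $i$, which is the kind of second-order bound the proof rearranges into.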

### A.2 Condition 2

We first note that the three algorithms we include in Proposition 2 all work for an arbitrary number of actions (instead of just two actions), and the general guarantee will be in the same form as Eqs. (5), (6), and (7) except that is replaced by .

Fixed-share [Herbster and Warmuth, 1998] with learning rate satisfies Eq. (7), and the proof can be extracted from the proof of [Auer et al., 2002, Theorem 8.1] or [Luo et al., 2018, Theorem 2]. AdaNormalHedge.TV [Luo and Schapire, 2015] is again a parameter-free algorithm and achieves the bound of Eq. (6) using tricks similar to those mentioned earlier for Condition 1.

Finally we provide a variant of Fixed-share that satisfies Eq. (5). The pseudocode is in Algorithm 5, where we adopt the notation from Condition 2 ( for distribution, for loss, for action index) but present the general case with actions.

###### Proposition 10.

Algorithm 5 satisfies Eq. (5).

###### Proof.

We first write the algorithm as an instance of Online Mirror Descent. Let be the entropy regularizer, and be such that where represents the element-wise square. Then one can verify and , where is the Bregman divergence associated with . Now we have for any ,

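$$
\begin{aligned}
\left\langle q_t - q,\ \eta h_t + \eta^2 h_t^2 \right\rangle
&= \left\langle q_t - q,\ \nabla\psi(q_t) - \nabla\psi(\bar{q}_{t+1}) \right\rangle \\
&= D_\psi(q, q_t) - D_\psi(q, \bar{q}_{t+1}) + D_\psi(q_t, \bar{q}_{t+1}) \\
&\le D_\psi(q, q_t) - D_\psi(q, \tilde{q}_{t+1}) + D_\psi(q_t, \bar{q}_{t+1}) \\
&= D_\psi(q, q_t) - D_\psi(q, \tilde{q}_{t+1}) + \sum_{b=1}^{K} q_t(b)\left(\eta h_t(b) + \eta^2 h_t^2(b) + \exp\left(-\eta h_t(b) - \eta^2 h_t^2(b)\right) - 1\right) \\
&\le D_\psi(q, q_t) - D_\psi(q, \tilde{q}_{t+1}) + \eta^2 \sum_{b=1}^{K} q_t(b)\, h_t^2(b) \\
&\le D_\psi(q, q_t) - D_\psi(q, q_{t+1}) + 2\gamma + \eta^2 \sum_{b=1}^{K} q_t(b)\, h_t^2(b),
\end{aligned}
$$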

where the first inequality is by the generalized Pythagorean theorem, the second inequality is by the fact for all , and the last one is by the definition of and the fact for any . Rearranging then gives

 ⟨qt−q,ht⟩≤Dψ(q,qt)−Dψ(q,qt+1)+2γη