1 Introduction
Prediction with expert advice is perhaps the single most fundamental problem in online learning and sequential decision making. In this problem, the goal of a learner is to aggregate decisions from multiple experts and achieve performance that approaches that of the best individual expert in hindsight. The standard performance criterion is the regret: the difference between the loss of the learner and that of the best single expert. The experts problem is often considered in the so-called adversarial setting, where the losses of the individual experts may be virtually arbitrary and even be chosen by an adversary so as to maximize the learner's regret. The canonical algorithm in this setup is the Multiplicative Weights algorithm (Littlestone and Warmuth, 1989; Freund and Schapire, 1995), which guarantees an optimal regret of $O(\sqrt{T \log N})$ in any problem with $N$ experts and $T$ decision rounds.
A long line of research in online learning has focused on obtaining better regret guarantees, often referred to as "fast rates," on benign problem instances in which the loss generation process behaves more favourably than in a fully adversarial setup. A prototypical example of such an instance is the stochastic setting of the experts problem, where the losses of the experts are drawn i.i.d. over time from a fixed and unknown distribution, and there is a constant gap between the mean losses of the best and second-best experts. In this setting, it has been established that the optimal expected regret scales as $\Theta\big(\tfrac{\log N}{\Delta}\big)$, and in particular, is bounded by a constant independent of the number of rounds $T$ (De Rooij et al., 2014; Koolen et al., 2016). More recently, Mourtada and Gaïffas (2019) have shown that this optimal regret is in fact achieved by an adaptive variant of the multiplicative weights algorithm. Other works have studied various intermediate regimes between stochastic and adversarial, where the challenge is to adapt to the complexity of the problem with little or no prior knowledge (e.g., Cesa-Bianchi et al., 2007; Hazan and Kale, 2010; Chiang et al., 2012; Rakhlin and Sridharan, 2013; Koolen et al., 2014; Sani et al., 2014).
In this work, we consider a different, natural intermediate regime of the experts problem: an adversarially-corrupted stochastic setting. In this setting, the adversary can modify the stochastic losses with arbitrary corruptions, as long as the sum of the corruptions is bounded by a parameter $C$, which is unknown to the learner. The injection of adversarial corruptions implies that the learner observes losses which are not distributed i.i.d. across time steps. In principle, one could use the adversarial online learning approach to overcome this challenge, but this would result in significantly inferior regret bounds that scale polynomially with the time horizon $T$. The challenge is then to extend the favourable constant bounds on the regret achievable in the purely stochastic setting to allow for moderate adversarial corruptions.
In the closely related Multi-Armed Bandit (MAB) partial-information model in online learning, the adversarially-corrupted stochastic setting has recently received considerable attention (Lykouris et al., 2018; Gupta et al., 2019; Zimmert and Seldin, 2019; Jun et al., 2018; Kapoor et al., 2019; Liu and Shroff, 2019). Yet, the natural question of determining the optimal regret rate in the analogous full-information problem remained open. In particular, given that the optimal bounds in the bandit setting scale linearly with the number of experts (or "arms" in the context of MAB), it becomes a fundamental question whether this dependence can be reduced to logarithmic with full information, while preserving the dependence on the other parameters of the problem.
Indeed, our main result shows that the optimal regret in the adversarially-corrupted stochastic setting scales as $\Theta\big(\tfrac{\log N}{\Delta} + C\big)$, independently of the horizon $T$, and moreover, this optimal bound is attained by a simple adaptive variant of the classic multiplicative weights algorithm that does not require knowing the corruption level in advance. In fact, it turns out that this simple algorithm performs optimally in all three regimes simultaneously: the pure stochastic setting, the adversarially-corrupted setting, and the fully-adversarial setting.
Our strategy for proving these results is based on a novel and delicate analysis of the multiplicative weights algorithm in the stochastic case, which can be seen as analogous to the approach taken by Zimmert and Seldin (2019) in multi-armed bandits. The first step in this analysis adapts a standard worst-case regret bound for multiplicative weights with an explicit dependence on the second moments of the losses (often called a "second-order" regret bound) to the case of an adaptive step-size sequence. Then, a key observation is that the second-order terms admit a "self-bounding" property and their sum can be bounded by the (pseudo) regret itself. The other expression in the regret bound, which is a sum of entropy terms that arises from the changing step sizes and captures the stability of the algorithm, is more challenging to handle but can also be shown to be self-bounded by the regret up to exponentially decreasing terms that sum up to a constant. Putting these observations together yields a constant regret bound in the stochastic case, which is also shown to be directly robust to corruptions.
An interesting byproduct of our analysis is a surprising disparity between two common online learning meta-algorithms: Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD). We show that while both FTRL and OMD give rise to optimal (adaptive) multiplicative weights algorithms in the pure stochastic experts setting (more precisely, the algorithm derived from OMD achieves a near-optimal, yet still constant and independent of $T$, bound), the OMD variant becomes strictly inferior to the FTRL variant once corruptions are introduced, and suffers a much weaker regret guarantee for a fixed number of experts $N$. In contrast, the non-adaptive (i.e., fixed step size) variants of the meta-algorithms are well-known to be equivalent in the more general setting of online linear optimization. We also present a few basic numerical simulations in which this gap is clearly visible and tightly supports our theoretical bounds.
2 Preliminaries
2.1 Problem setup
We consider the classic problem of prediction with expert advice, with a set of $N$ experts indexed by $[N] = \{1,\ldots,N\}$. In each time step $t = 1,\ldots,T$, the learner chooses a probability vector $x_t$ from the simplex $\Delta_N = \{x \in \mathbb{R}^N : x_i \ge 0,\ \sum_{i=1}^{N} x_i = 1\}$. Thereafter, a loss vector $\ell_t \in [0,1]^N$ is revealed. We will consider three variants of the problem, as follows.

In the adversarial (non-stochastic) setting, the loss vectors are entirely arbitrary and may be chosen by an adversary. The goal of the learner is to minimize the regret, given by
$$R_T = \sum_{t=1}^{T} \ell_t \cdot x_t - \min_{i \in [N]} \sum_{t=1}^{T} \ell_{t,i}.$$
In the stochastic setting, the loss vectors are drawn i.i.d. over time from a fixed (and unknown) distribution. We denote the vector of the mean losses by $\mu = \mathbb{E}[\ell_t]$ and let $i^\star = \arg\min_{i \in [N]} \mu_i$ be the index of the best expert, which we assume is unique. The gap between any expert $i$ and the best one is denoted $\Delta_i = \mu_i - \mu_{i^\star}$, and we let $\Delta = \min_{i \ne i^\star} \Delta_i$. The goal of the learner in the stochastic setting is to minimize the pseudo regret, defined as
$$\overline{R}_T = \mathbb{E}\Bigg[\sum_{t=1}^{T} \mu \cdot x_t\Bigg] - T\mu_{i^\star} = \mathbb{E}\Bigg[\sum_{t=1}^{T} \sum_{i=1}^{N} \Delta_i\, x_{t,i}\Bigg]. \tag{1}$$
Finally, in the adversarially-corrupted stochastic setting (following Lykouris et al., 2018; Gupta et al., 2019), which is the main focus of this paper, loss vectors are drawn i.i.d. from a fixed and unknown distribution as in the stochastic setting, with the same mean losses $\mu$ and the same definitions of the best expert $i^\star$ and gap $\Delta$. Subsequently, an adversary is allowed to manipulate the feedback observed by the learner, up to some budget $C$ which we refer to as the corruption level. Formally, on each round $t$:

(1) A stochastic loss vector $\ell_t$ is drawn i.i.d. from a fixed and unknown distribution;

(2) The adversary observes the loss vector $\ell_t$ and generates corrupted losses $\tilde{\ell}_t$;

(3) The player picks a distribution $x_t$ over experts, suffers the loss $\ell_t \cdot x_t$, and observes only the corrupted loss vector $\tilde{\ell}_t$.
Notice that we allow the adversary to be fully adaptive, in the sense that the corruption on round $t$ may depend on past choices of the learner (before round $t$) as well as on the realizations of the random loss vectors in all rounds up to (and including) round $t$.
We consider the following measure of corruption, which we assume to be unknown to the learner:
$$C = \sum_{t=1}^{T} \big\|\tilde{\ell}_t - \ell_t\big\|_\infty. \tag{2}$$
As in the stochastic setting, the goal of the learner is to minimize the pseudo regret (defined in Eq. 1). Note that, crucially, the pseudo regret of the learner depends only on the means of the stochastic losses; the adversarial corruption appears only in the feedback observed by the learner.
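The interaction protocol above can be sketched in a few lines of code. The following is a minimal, illustrative simulator (not from the paper): Bernoulli losses with means `mu`, and a hypothetical adversary that spends its budget on early rounds by inflating the best expert's loss and zeroing the others; the returned pseudo regret depends only on the uncorrupted means, as in Eq. 1.

```python
import random

def run_corrupted_experts(learner, T, mu, budget, seed=0):
    """Simulate the corrupted stochastic experts protocol (illustrative sketch).

    `learner` maps the history of observed (corrupted) loss vectors to a
    probability vector over the N experts; `mu` holds the true mean losses.
    The adversary here inflates the best expert's loss to 1 and zeroes the
    others until its corruption `budget` is exhausted (one possible strategy).
    """
    rng = random.Random(seed)
    n = len(mu)
    best = min(range(n), key=lambda i: mu[i])
    history, pseudo_regret, spent = [], 0.0, 0.0
    for t in range(T):
        x = learner(history, n)                      # learner's distribution x_t
        loss = [1.0 if rng.random() < m else 0.0 for m in mu]  # stochastic losses
        corrupted = list(loss)
        if spent < budget:                           # adversary corrupts the feedback
            corrupted = [0.0] * n
            corrupted[best] = 1.0
            spent += max(abs(corrupted[i] - loss[i]) for i in range(n))  # ||c_t||_inf
        history.append(corrupted)                    # only corrupted losses are observed
        pseudo_regret += sum(x[i] * (mu[i] - mu[best]) for i in range(n))
    return pseudo_regret
```

For instance, a learner that always plays the uniform distribution incurs pseudo regret $\frac{T}{N}\sum_i \Delta_i$ regardless of the corruption, since the pseudo regret is measured against the true means.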
2.2 Multiplicative Weights
We recall two variants of the classic Multiplicative Weights (MW) algorithm that we revisit in this work. The standard MW algorithm (Littlestone and Warmuth, 1989; Freund and Schapire, 1995) is parameterized by a fixed step-size parameter $\eta > 0$. For an arbitrary sequence of loss vectors $\ell_1, \ell_2, \ldots \in [0,1]^N$, it admits the following update rule, on every round $t$:
$$x_{1,i} = \frac{1}{N}, \qquad x_{t+1,i} = \frac{x_{t,i}\, e^{-\eta \ell_{t,i}}}{\sum_{j=1}^{N} x_{t,j}\, e^{-\eta \ell_{t,j}}} \qquad \forall\, i \in [N]. \tag{3}$$
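In code, a single step of this update is a one-liner plus normalization; the following minimal sketch (function name ours) implements the exponential reweighting of Eq. 3.

```python
import math

def mw_update(x, loss, eta):
    """One Multiplicative Weights step (Eq. 3): multiply each weight by
    exp(-eta * loss_i), then renormalize to the simplex."""
    w = [xi * math.exp(-eta * li) for xi, li in zip(x, loss)]
    z = sum(w)
    return [wi / z for wi in w]

# starting from the uniform prior x_1 = (1/N, ..., 1/N)
x = [0.25] * 4
x = mw_update(x, [1.0, 0.0, 0.5, 0.5], eta=0.1)
```

Experts with smaller losses gain weight, so after this step the second expert (loss 0) holds the largest share and the first (loss 1) the smallest.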
For the basic, fixed step-size version of our results, we will need a standard second-order regret bound for MW.
Lemma 1 (Cesa-Bianchi et al., 2007; see also Arora et al., 2012).
If $\eta\, |\ell_{t,i}| \le 1$ for all $t$ and $i$, the regret of the MW updates in Eq. 3 is bounded as
$$\sum_{t=1}^{T} \ell_t \cdot x_t - \min_{i \in [N]} \sum_{t=1}^{T} \ell_{t,i} \;\le\; \frac{\log N}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{N} x_{t,i}\, \ell_{t,i}^2.$$
In particular, the bound implies the well-known optimal $O(\sqrt{T \log N})$ regret bound for MW in the adversarial setting, if the step size is properly tuned to $\eta = \sqrt{(\log N)/T}$. Notably, the right setting of $\eta$ depends on the time horizon $T$.
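As a quick sanity check of this tuning (assuming the usual form of the second-order bound, $\frac{\log N}{\eta} + \eta\sum_t\sum_i x_{t,i}\ell_{t,i}^2$), since the losses lie in $[0,1]$ the second-order term per round is at most $\sum_i x_{t,i} \le 1$, so:

```latex
R_T \;\le\; \frac{\log N}{\eta} + \eta T,
\qquad \text{minimized at } \eta = \sqrt{\frac{\log N}{T}},
\qquad \text{giving } R_T \;\le\; 2\sqrt{T \log N}.
```

The dependence of the minimizer on $T$ is exactly what the adaptive variant below removes.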
An adaptive variant of the MW algorithm that does not require knowledge of $T$ was proposed in Auer et al. (2002). This variant employs a diminishing step-size sequence, and takes the form:
$$x_{t,i} = \frac{e^{-\eta_t \sum_{s=1}^{t-1} \ell_{s,i}}}{\sum_{j=1}^{N} e^{-\eta_t \sum_{s=1}^{t-1} \ell_{s,j}}} \qquad \forall\, i \in [N], \tag{4}$$
with $\eta_t = \sqrt{(\log N)/t}$ for all $t$. This algorithm was shown to obtain the optimal $O(\sqrt{T \log N})$ regret in the adversarial setup for any $T$ (Auer et al., 2002; Cesa-Bianchi and Lugosi, 2006). We will show that, remarkably, the adaptive MW algorithm also achieves the optimal performance in the adversarially-corrupted experts setting, for any level of corruption.
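The distinguishing feature of Eq. 4 is that the weights are recomputed from scratch from the cumulative losses with the current step size, rather than updated incrementally. A minimal sketch (function name ours; the max-shift is a standard numerical stabilization):

```python
import math

def adaptive_mw_weights(cum_loss, t, n_experts):
    """Adaptive MW (Eq. 4), FTRL form: the distribution at round t is computed
    from the *cumulative* losses L_{t-1,i} using the current step size
    eta_t = sqrt(log(N)/t)."""
    eta_t = math.sqrt(math.log(n_experts) / t)
    m = min(cum_loss)                                   # stabilize the exponentials
    w = [math.exp(-eta_t * (L - m)) for L in cum_loss]
    z = sum(w)
    return [wi / z for wi in w]

# round t=3, with cumulative losses accumulated over the first two rounds
x = adaptive_mw_weights([2.0, 0.5, 1.5], t=3, n_experts=3)
```

Note that no per-round state other than the cumulative losses is needed; this is what makes the algorithm an instance of FTRL.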
We remark that the MW algorithm in Eq. 4 is in fact an instantiation of the canonical Follow-the-Regularized-Leader (FTRL) framework in online optimization with entropy as regularization, when one allows the magnitude of regularization to change from round to round. MW can also be obtained by instantiating the closely related Online Mirror Descent (OMD) meta-algorithm, which also allows for the regularization to vary across rounds. (For more background on online optimization, FTRL and OMD, see Section 4.1 below.) When the regularization is fixed, it is a well-known fact that the two frameworks are generically equivalent and give rise to precisely the same algorithm, presented in Eq. 3. However, when the regularization is time-dependent, they produce different algorithms. We discuss the disparities between these different variants in more detail in Section 3.3.
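To make the difference concrete, one can unroll the two update rules (a standard calculation, under entropic regularization of magnitude $1/\eta_t$):

```latex
\text{FTRL:}\quad x_{t,i} \;\propto\; \exp\Big(-\eta_t \sum_{s=1}^{t-1} \ell_{s,i}\Big),
\qquad
\text{OMD:}\quad x_{t,i} \;\propto\; \exp\Big(-\sum_{s=1}^{t-1} \eta_s\, \ell_{s,i}\Big).
```

With a fixed step size $\eta_t \equiv \eta$ the two expressions coincide and recover Eq. 3. With a decreasing step-size sequence, FTRL rescales the entire past by the current (small) $\eta_t$, whereas OMD keeps each loss weighted by the (larger) step size that was in force when it was observed.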
3 Main Results
In this section, we consider the adversarially-corrupted stochastic setting and present our main results. As a warmup, we analyze the Multiplicative Weights algorithm with a fixed step size while assuming the minimal gap is known to the learner. Then, we consider the general case where neither the gap nor the corruption level is known, and prove that the adaptive multiplicative weights algorithm attains optimal performance.
3.1 A warmup analysis for known minimal gap
We begin with an easier case where the gap $\Delta$ is known to the learner, and can be used to tune the step-size parameter $\eta$ of multiplicative weights (Eq. 3). In this case, a fixed step-size algorithm suffices and we have the following.
Theorem 2.
The Multiplicative Weights algorithm (Eq. 3), with step size $\eta$ tuned proportionally to the gap $\Delta$, in the adversarially-corrupted stochastic regime with corruption level $C$ over $T$ rounds, achieves constant expected pseudo regret.
Two key observations in the analysis are the following. The first observation gives a straightforward bound on the corrupted losses of an expert in terms of its pseudo regret.
Observation 3.
For any $i \in [N]$ and $t \in [T]$, the following holds:
Proof.
For , note that , and on the other hand, since . Moreover, for we have and .
The second observation relates the regret with respect to the corrupted and uncorrupted losses.
Observation 4.
For any probability vectors $x_1, \ldots, x_T \in \Delta_N$, the following holds:
Proof.
Denoting by $c_{t,i} = \tilde{\ell}_{t,i} - \ell_{t,i}$ the corruption added to expert $i$ at time step $t$, we get
By the definition of the corruption level we have $\sum_{t=1}^{T} \|c_t\|_\infty \le C$; using the triangle inequality, this implies that
We now turn to prove the theorem.
Proof (of Theorem 2).
We start off with the basic bound of (fixed step size) MW in Lemma 1:
First, note that the regret of playing a fixed sequence is not affected by an additive translation of the losses, i.e., by shifting the losses of all experts on a given round by the same constant. In addition, the iterates of the Multiplicative Weights algorithm are also unaffected by such translations. Thus, translating the losses appropriately yields
Applying Observations 4 and 3 and rearranging terms implies
Taking expectations, and using the fact that $x_t$ and $\ell_t$ are independent, we obtain
Finally, by setting and rearranging we can conclude that
3.2 General analysis with decreasing step sizes
We now formally state and prove our main result: a constant regret bound in the adversarially-corrupted case for the adaptive MW algorithm (Eq. 4), which does not require the learner to know either the gap $\Delta$ or the corruption level $C$.
Theorem 5.
The adaptive MW algorithm in Eq. 4 with $\eta_t = \sqrt{(\log N)/t}$, in the adversarially-corrupted stochastic regime with corruption level $C$ over $T$ rounds, achieves constant expected pseudo regret.
Note that this result is tight (up to constants): a lower bound of $\Omega\big(\tfrac{\log N}{\Delta}\big)$ was shown by Mourtada and Gaïffas (2019), and a lower bound of $\Omega(C)$ is straightforward: consider an instance with $N = 2$ experts, means $0$ and $\Delta$ (assigned randomly to the experts), and an adversary that corrupts the first $\lfloor C/\Delta \rfloor$ rounds and assigns a loss of zero to both experts on those rounds; the learner receives no information about the identity of the best expert (whose mean loss is the smallest) during the first $\lfloor C/\Delta \rfloor$ rounds and thus incurs, in expectation, at least $\Omega(C)$ pseudo regret over these rounds.
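The accounting behind this lower bound can be spelled out as follows (taking, for concreteness, mean losses $0$ and $\Delta$ for the two experts):

```latex
\underbrace{\Delta \cdot \Big\lfloor \frac{C}{\Delta} \Big\rfloor}_{\text{expected corruption spent}} \;\le\; C,
\qquad
\overline{R}_T \;\ge\; \underbrace{\frac{\Delta}{2}}_{\substack{\text{expected weight } 1/2 \\ \text{on the wrong expert}}} \cdot \Big\lfloor \frac{C}{\Delta} \Big\rfloor \;=\; \Omega(C).
```

Zeroing a loss with mean $\Delta$ costs the adversary $\Delta$ per round in expectation, so the budget lasts for $\lfloor C/\Delta \rfloor$ rounds; by the symmetry of the random assignment, any learner places expected weight $1/2$ on the suboptimal expert throughout these uninformative rounds.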
For the proof of Theorem 5 we require two main lemmas. The first lemma is a second-order regret bound for adaptive MW, analogous to the one stated in Lemma 1 for the fixed step-size case. Here and throughout the section, we use $H(x)$ to denote the entropy of a probability vector $x$, that is, $H(x) = \sum_{i=1}^{N} x_i \log \frac{1}{x_i}$.
Lemma 6.
For any sequence of loss vectors , the regret of the adaptive MW algorithm in Eq. 4 satisfies
provided that $\eta_t\, |\ell_{t,i}| \le 1$ for all $t$ and $i$.
The lemma is obtained from a more general bound for Follow-the-Regularized-Leader, and follows from standard arguments adapted to the case of time-varying regularization. For completeness, we give the proof of this general bound in Appendix A, and use it to derive the lemma in Section 4.
The second lemma, which is key to our refined analysis of adaptive MW, shows that a properly scaled version of the entropy of any probability vector is upper bounded by the instantaneous pseudo regret of that vector, up to an exponentially decaying additive term.
Lemma 7.
For any , and , we have the following bound for the entropy of any probability vector and any :
We prove the lemma below, but first let us show how it is used to derive our main theorem.
Proof (of Theorem 5).
Applying Lemma 6 on the corrupted loss vectors and introducing additive translations of as before, yields the bound
In Lemma 12 below we bound the last term in the bound in terms of the pseudo regret (similarly to the proof of Theorem 2), as follows:
For bounding the first summation in the bound, we use Lemma 7. Summing the lemma's bound over $t$ and bounding the sum of the exponential terms by an integral (refer to Lemma 13 for the details), we obtain
Plugging the two inequalities into the regret bound, we obtain
Using Observation 4 and taking expectation we get
Rearranging terms gives the theorem.
We conclude this section with a proof of our key lemma.
Proof (of Lemma 7).
We split the analysis of the sum for and . Considering first the case , we apply the inequality for to obtain, for ,
Next, we examine the remaining terms with . The main idea is to look at two different regimes: one when and the other for . In the former case, we have
For the latter case, we can use the inequality of for to obtain
Combining both observations for implies
Finally, note that for it holds that . This together with our first inequality concludes the proof.
3.3 Gap between Follow the Regularized Leader and Online Mirror Descent
Here we present a surprising contrast between the variants of the adaptive MW algorithm obtained by instantiating the Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) metaalgorithms, in the adversarially corrupted regime. We show that while both give optimal algorithms in the stochastic experts setting, the OMD variant becomes strictly inferior to the FTRL variant once corruptions are introduced.
As remarked above, when the step size (i.e., the magnitude of regularization) is fixed, the two meta-algorithms are equivalent, and produce the classic MW algorithm in Eq. 3 when their regularization is set to the negative entropy function over the probability simplex. (For more background and references, see Section 4.1.) Once one allows the step sizes to vary across rounds, FTRL gives the adaptive MW algorithm in Eq. 4, while OMD yields the following updates:
$$x_{1,i} = \frac{1}{N}, \qquad x_{t+1,i} = \frac{x_{t,i}\, e^{-\eta_t \ell_{t,i}}}{\sum_{j=1}^{N} x_{t,j}\, e^{-\eta_t \ell_{t,j}}} \qquad \forall\, i \in [N]. \tag{5}$$
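Unlike the FTRL form, this update is incremental: it multiplies the *previous* iterate by the exponentiated current loss, so each past loss remains weighted by the step size that was in force when it arrived. A minimal sketch (function name ours):

```python
import math

def omd_mw_update(x, loss, t, n_experts):
    """Adaptive MW, OMD form (Eq. 5): a multiplicative update of the previous
    iterate x_t with the current step size eta_t = sqrt(log(N)/t)."""
    eta_t = math.sqrt(math.log(n_experts) / t)
    w = [xi * math.exp(-eta_t * li) for xi, li in zip(x, loss)]
    z = sum(w)
    return [wi / z for wi in w]
```

Starting from the uniform prior, a single round with losses $(1, 0)$ shifts weight toward the second expert, exactly as in the fixed-step update but with $\eta$ replaced by the current $\eta_t$.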
First, we show that the OMD variant of MW in Eq. 5 obtains the same constant regret bound in the pure stochastic regime, up to small lower-order factors. (The proof appears in Section 4.4.)
Theorem 8.
The adaptive MW variant in Eq. 5 with $\eta_t = \sqrt{(\log N)/t}$, in the stochastic regime (with no corruption), achieves constant expected pseudo regret for any $T$.
On the other hand, we give a simple example which demonstrates that the OMD variant of MW exhibits a strictly inferior performance compared to the FTRL variant (see Eq. 4) when adversarial corruptions are present. For simplicity, assume that the corruption level $C$ is a positive integer. Consider the following corrupted stochastic instance with $N = 2$ experts, where the mean loss of the best expert is smaller than that of the other expert by a gap of $\Delta$. The adversary introduces corruption over the first rounds of the game: it modifies the losses of the best expert to $1$'s and those of the other expert to $0$'s, until its budget $C$ is exhausted.
For this simple problem instance, we show the following (see Section 4.3 for the proof).
Theorem 9.
The expected pseudo regret of the adaptive MW algorithm in Eq. 5 with step sizes $\eta_t = \eta/\sqrt{t}$, where $\eta > 0$ is a fixed parameter, on the instance described above for $T$ rounds is at least $\min\big\{2^{\Omega(C/\eta)},\, \Omega(\sqrt{T})\big\}$.
In particular, if the learner does not have nontrivial bounds on the corruption level $C$ and gap $\Delta$ (that is, $\eta$ is a constant independent of $C$ and $\Delta$), then the regret is necessarily at least $\Omega(\sqrt{T})$ or is exponentially large in $C$.
3.4 Numerical simulations
We conducted a basic numerical experiment to illustrate our regret bounds and the gap between OMD and FTRL discussed above. The experiment setup consists of two experts with different gaps $\Delta$. The losses were taken to be Bernoulli, and the corruption strategy injected contamination in the first rounds, up to a total budget of $C$, inflicting maximal loss on the best expert while zeroing the losses of the other expert. The results, shown in Fig. 1, demonstrate that in the stochastic case without corruption ($C = 0$) OMD achieves better pseudo regret, but is substantially outperformed by FTRL when $C > 0$. In Fig. 2 we further show the inverse dependence of the pseudo regret on the minimal gap $\Delta$, which precisely supports our theoretical findings discussed in Section 3.3.
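An experiment of this flavor can be reproduced in a few dozen lines. The sketch below (with hypothetical parameter values, not the paper's exact setup) runs the FTRL variant (Eq. 4) and the OMD variant (Eq. 5) side by side on two Bernoulli experts, with the adversary spending its budget on the first rounds as described above, and reports the pseudo regret of each.

```python
import math
import random

def simulate(T=5000, gap=0.2, corruption=200, seed=1):
    """Compare the FTRL (Eq. 4) and OMD (Eq. 5) variants of adaptive MW on a
    corrupted stochastic instance with two experts (illustrative parameters).
    Returns the pseudo regret of each variant."""
    rng = random.Random(seed)
    mu = [0.5 - gap, 0.5]          # expert 0 is the best expert
    cum = [0.0, 0.0]               # cumulative corrupted losses (FTRL state)
    x_omd = [0.5, 0.5]             # current OMD iterate
    reg_ftrl = reg_omd = spent = 0.0
    for t in range(1, T + 1):
        eta = math.sqrt(math.log(2) / t)
        m = min(cum)
        w = [math.exp(-eta * (c - m)) for c in cum]
        z = sum(w)
        x_ftrl = [wi / z for wi in w]
        # pseudo regret is charged on the distributions chosen before round t's feedback
        reg_ftrl += x_ftrl[1] * gap
        reg_omd += x_omd[1] * gap
        loss = [1.0 if rng.random() < mi else 0.0 for mi in mu]
        obs = list(loss)
        if spent < corruption:     # adversary spends its budget on the early rounds:
            obs = [1.0, 0.0]       # best expert looks maximally bad, the other perfect
            spent += max(abs(obs[i] - loss[i]) for i in range(2))
        cum = [c + l for c, l in zip(cum, obs)]
        w = [xi * math.exp(-eta * li) for xi, li in zip(x_omd, obs)]
        z = sum(w)
        x_omd = [wi / z for wi in w]
    return reg_ftrl, reg_omd
```

With these (assumed) parameters, FTRL recovers shortly after the corruption phase ends, because the shrinking step size discounts the corrupted prefix of the cumulative losses, while OMD remains stuck on the wrong expert for far longer, consistent with the gap discussed in Section 3.3.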
4 Proofs
4.1 Preliminaries: Online optimization with timedependent regularization
We give a brief background on Follow the Regularized Leader and Online Mirror Descent algorithmic templates, in the case where the regularization is varying and timedependent.
The setup is the standard one of online linear optimization. Let $\mathcal{X} \subseteq \mathbb{R}^N$ be a convex domain. On each prediction round $t$, the learner has to produce a prediction $x_t \in \mathcal{X}$ based on $\ell_1, \ldots, \ell_{t-1}$, and subsequently observes a new loss vector $\ell_t$ and incurs the loss $\ell_t \cdot x_t$. The goal is to minimize the regret compared to any $x^\star \in \mathcal{X}$, given by $\sum_{t=1}^{T} \ell_t \cdot x_t - \sum_{t=1}^{T} \ell_t \cdot x^\star$.
Follow the Regularized Leader (FTRL).
The FTRL template generates predictions $x_t$, for $t = 1, \ldots, T$, as follows:
$$x_t = \arg\min_{x \in \mathcal{X}} \Bigg\{ \sum_{s=1}^{t-1} \ell_s \cdot x + R_t(x) \Bigg\}. \tag{6}$$
Here, $R_1, R_2, \ldots$ is a sequence of twice-differentiable, strictly convex functions.
The derivation and analysis of FTRL-type algorithms is standard; see, e.g., Shalev-Shwartz (2012); Hazan (2016). In our analysis, however, we require a particular regret bound that we could not find stated explicitly in the literature; for completeness, we provide the bound here with a proof in Appendix A.
Theorem 10.
Suppose that for all for some strictly convex , with . Then there exists a sequence of points such that the following regret bound holds for all :
where is the local norm induced by at an appropriate , and is its dual norm.
Online Mirror Descent (OMD).
The closely-related OMD framework produces predictions via the following procedure: initialize $x_1 = \arg\min_{x \in \mathcal{X}} R_1(x)$, and for $t = 1, 2, \ldots$, compute
$$x_{t+1} = \arg\min_{x \in \mathcal{X}} \big\{ \ell_t \cdot x + D_{R_t}(x, x_t) \big\}. \tag{7}$$
Here, $R_1, R_2, \ldots$ is a sequence of twice-differentiable, strictly convex functions, and $D_R(x, y) = R(x) - R(y) - \nabla R(y) \cdot (x - y)$ denotes the Bregman divergence of a convex function $R$ at a point $y$.
The proof of the following regret bound (which is again a somewhat specialized variant of standard bounds for OMD) appears in Appendix A.
Theorem 11.
Suppose that for all for some strictly convex , with . Then there exists a sequence of points such that the following regret bound holds for all :
where is the local norm induced by at an appropriate , and is its dual norm.
4.2 Upper bounds for FTRL
Proof (of Lemma 6).
We observe that Eq. 4 is an instantiation of FTRL with $R_t = \frac{1}{\eta_t} R$ as regularizations, where $R(x) = \sum_{i=1}^{N} x_i \log x_i$ is the negative entropy. Hence, we can invoke Theorem 10 to bound the regret compared to any probability distribution $x^\star \in \Delta_N$. It suffices to bound the regret for the $x^\star$ that minimizes the cumulative loss, which is always a point-mass on a single expert $i^\star$. Therefore, Theorem 10 in our case reads as follows. For the first two terms in the bound, observe that
(8) 
For the final sum, we have to evaluate the Hessian $\nabla^2 R$ at a point $z$. A straightforward differentiation shows that this matrix is diagonal, with diagonal elements $(\nabla^2 R(z))_{ii} = 1/z_i$. Thus,
(9) 
The final sum can be divided and bounded as follows
where we used the fact that . To conclude the proof it suffices to show that for . To see this, denote and write
For , the following relations hold:
Hence, for we have
and consequently
Since , the same inequality holds for ; that is, for all , and the proof is complete.
Lemma 12.
For the adaptive MW algorithm in Eq. 4 with loss vectors , we have
Proof.
By setting and we obtain
where in the final inequality we used Observation 3. To conclude we note that , thus we can modify the last summation to range over .
Lemma 13.
For the adaptive MW algorithm in Eq. 4, we have