# Prediction with Corrupted Expert Advice

We revisit the fundamental problem of prediction with expert advice, in a setting where the environment is benign and generates losses stochastically, but the feedback observed by the learner is subject to a moderate adversarial corruption. We prove that a variant of the classical Multiplicative Weights algorithm with decreasing step sizes achieves constant regret in this setting and performs optimally in a wide range of environments, regardless of the magnitude of the injected corruption. Our results reveal a surprising disparity between the often comparable Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) frameworks: we show that for experts in the corrupted stochastic regime, the regret performance of OMD is in fact strictly inferior to that of FTRL.


## 1 Introduction

Prediction with expert advice is perhaps the single most fundamental problem in online learning and sequential decision making. In this problem, the goal of a learner is to aggregate decisions from multiple experts and achieve performance that approaches that of the best individual expert in hindsight. The standard performance criterion is the regret: the difference between the loss of the learner and that of the best single expert. The experts problem is often considered in the so-called adversarial setting, where the losses of the individual experts may be virtually arbitrary and may even be chosen by an adversary so as to maximize the learner's regret. The canonical algorithm in this setup is the Multiplicative Weights algorithm (Littlestone and Warmuth, 1989; Freund and Schapire, 1995), which guarantees an optimal regret of $O(\sqrt{T \log N})$ in any problem with $N$ experts and $T$ decision rounds.

A long line of research in online learning has focused on obtaining better regret guarantees, often referred to as “fast rates,” on benign problem instances in which the loss generation process behaves more favourably than in a fully adversarial setup. A prototypical example of such an instance is the stochastic setting of the experts problem, where the losses of the experts are drawn i.i.d. over time from a fixed and unknown distribution, and there is a constant gap $\Delta$ between the mean losses of the best and second-best experts. In this setting, it has been established that the optimal expected regret scales as $\Theta(\log(N)/\Delta)$, and in particular, is bounded by a constant independent of the number of rounds $T$ (De Rooij et al., 2014; Koolen et al., 2016). More recently, Mourtada and Gaïffas (2019) have shown that this optimal regret is in fact achieved by an adaptive variant of the multiplicative weights algorithm. Other works have studied various intermediate regimes between stochastic and adversarial, where the challenge is to adapt to the complexity of the problem with little or no prior knowledge (e.g., Cesa-Bianchi et al., 2007; Hazan and Kale, 2010; Chiang et al., 2012; Rakhlin and Sridharan, 2013; Koolen et al., 2014; Sani et al., 2014).

In this work, we consider a different, natural intermediate regime of the experts problem: an adversarially-corrupted stochastic setting. In this setting, the adversary can modify the stochastic losses with arbitrary corruptions, as long as the sum of the corruptions is bounded by a parameter $C$, which is unknown to the learner. The injection of adversarial corruptions implies that the learner observes losses which are not distributed i.i.d. across time steps. In principle, one could use the adversarial online learning approach to overcome this challenge, but this would result in significantly inferior regret bounds that scale polynomially with the time horizon. The challenge is then to extend the favourable constant bounds on the regret achievable in the purely stochastic setting to allow for moderate adversarial corruptions.

In the closely related Multi-Armed Bandit (MAB) partial-information model in online learning, the adversarially-corrupted stochastic setting has recently received considerable attention (Lykouris et al., 2018; Gupta et al., 2019; Zimmert and Seldin, 2019; Jun et al., 2018; Kapoor et al., 2019; Liu and Shroff, 2019). Yet, the natural question of determining the optimal regret rate in the analogous full-information problem remained open. In particular, given that the optimal bounds in the bandit setting scale linearly with the number of experts (or “arms” in the context of MAB), it becomes a fundamental question if this dependence can be reduced to logarithmic with full-information, while preserving the dependence on the other parameters of the problem.

Indeed, our main result shows that the optimal regret in the adversarially-corrupted stochastic setting scales as $\Theta(\log(N)/\Delta + C)$, independently of the horizon $T$. Moreover, this optimal bound is attained by a simple adaptive variant of the classic multiplicative weights algorithm, which does not require knowing the corruption level $C$ in advance. In fact, it turns out that this simple algorithm performs optimally in all three regimes simultaneously: the pure stochastic setting, the adversarially-corrupted setting, and the fully-adversarial setting.

Our strategy for proving these results is based on a novel and delicate analysis of the multiplicative weights algorithm in the stochastic case, which can be seen as analogous to the approach taken by Zimmert and Seldin (2019) in multi-armed bandits. The first step in this analysis adapts a standard worst-case regret bound for multiplicative weights with an explicit dependence on the second moments of the losses (often called a “second-order” regret bound) to the case of an adaptive step-size sequence. Then, a key observation is that the second-order terms admit a “self-bounding” property and their sum can be bounded by the (pseudo-)regret itself. The other expression in the regret bound, which is a sum of entropy terms that arises from the changing step sizes and captures the stability of the algorithm, is more challenging to handle but can also be shown to be self-bounded by the regret up to exponentially-decreasing terms that sum up to a constant. Putting these observations together yields a constant regret bound in the stochastic case, which is also shown to be directly robust to corruptions.

An interesting byproduct of our analysis is a surprising disparity between two common online learning meta-algorithms: Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD). We show that while both FTRL and OMD give rise to optimal (adaptive) multiplicative weights algorithms in the pure stochastic experts setting (more precisely, the algorithm derived from OMD achieves a near-optimal bound that is still constant and independent of $T$, and is tight up to small factors), the OMD variant becomes strictly inferior to the FTRL variant once corruptions are introduced, and has a much weaker regret guarantee for a fixed number of experts $N$. In contrast, the non-adaptive (i.e., fixed step size) variants of the meta-algorithms are well known to be equivalent in the more general setting of online linear optimization. We also show a few basic numerical simulations in which this gap is clearly visible and tightly supports our theoretical bounds.

## 2 Preliminaries

### 2.1 Problem setup

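We consider the classic problem of prediction with expert advice, with a set of $N$ experts indexed by $[N] = \{1, \ldots, N\}$. In each time step $t = 1, \ldots, T$ the learner chooses a probability vector $p_t$ from the simplex $\Delta_N = \{p \in \mathbb{R}^N : p_i \ge 0,\ \sum_{i=1}^N p_i = 1\}$. Thereafter, a loss vector $\ell_t \in [0,1]^N$ is revealed. We will consider three variants of the problem, as follows.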

In the adversarial (non-stochastic) setting, the loss vectors are entirely arbitrary and may be chosen by an adversary. The goal of the learner is to minimize the regret, given by

$$R_T := \sum_{t=1}^{T} p_t \cdot \ell_t - \min_{i \in [N]} \sum_{t=1}^{T} \ell_{t,i}.$$

In the stochastic setting, the loss vectors $\ell_t$ are drawn i.i.d. from a fixed (and unknown) distribution. We denote the vector of mean losses by $\mu = \mathbb{E}[\ell_t]$ and let $i^\star = \arg\min_{i} \mu_i$ be the index of the best expert, which we assume is unique. The gap between any expert $i$ and the best one is denoted $\Delta_i = \mu_i - \mu_{i^\star}$, and we let $\Delta = \min_{i \ne i^\star} \Delta_i$. The goal of the learner in the stochastic setting is to minimize the pseudo regret, defined as

$$\overline{R}_T := \sum_{t=1}^{T} p_t \cdot \mu - \sum_{t=1}^{T} \mu_{i^\star} = \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\mu_i - \mu_{i^\star}). \tag{1}$$

Finally, in the adversarially-corrupted stochastic setting (following Lykouris et al., 2018; Gupta et al., 2019), which is the main focus of this paper, loss vectors are drawn i.i.d. from a fixed and unknown distribution as in the stochastic setting, with mean losses $\mu$ and the same definitions of the best expert $i^\star$ and the gap $\Delta$. Subsequently, an adversary is allowed to manipulate the feedback observed by the learner, up to some budget $C$ which we refer to as the corruption level. Formally, on each round $t = 1, \ldots, T$:

1. A stochastic loss vector $\ell_t$ is drawn i.i.d. from a fixed and unknown distribution;

2. The adversary observes the loss vector $\ell_t$ and generates corrupted losses $\tilde{\ell}_t$;

3. The player picks a distribution $p_t$ over experts, suffers the loss $p_t \cdot \ell_t$, and observes only the corrupted loss vector $\tilde{\ell}_t$.

Notice that we allow the adversary to be fully adaptive, in the sense that the corruption on round $t$ may depend on past choices of the learner (before round $t$) as well as on the realizations of the random loss vectors $\ell_s$ in all rounds up to (and including) round $t$.

We consider the following measure of corruption, which we assume to be unknown to the learner:

$$C = \sum_{t=1}^{T} \| \tilde{\ell}_t - \ell_t \|_\infty. \tag{2}$$

Like in the stochastic setting, the goal of the learner is to minimize the pseudo regret (defined in Eq. 1). Note that, crucially, the pseudo regret of the learner depends only on the (means of) the stochastic losses and the adversarial corruption appears only in the feedback observed by the learner.
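The interaction protocol and the two quantities in Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the Bernoulli loss distribution, and the specific corruption strategy (maximal loss on the best expert, zero loss elsewhere, during the first `C` rounds) are assumptions made for the example.

```python
import numpy as np

def corrupted_losses(mu, T, C, rng):
    """Draw i.i.d. Bernoulli losses with means `mu`, then corrupt the first C rounds.

    Assumed corruption strategy (for illustration only): in each of the first
    C rounds the best expert's observed loss is set to 1 and all others to 0.
    """
    mu = np.asarray(mu, dtype=float)
    losses = (rng.random((T, mu.size)) < mu).astype(float)  # clean losses l_t
    observed = losses.copy()                                 # feedback seen by the learner
    best = int(np.argmin(mu))
    observed[:C, :] = 0.0
    observed[:C, best] = 1.0
    return losses, observed

def corruption_level(losses, observed):
    """Eq. (2): sum over rounds of the sup-norm of the per-round corruption."""
    return float(np.abs(observed - losses).max(axis=1).sum())

def pseudo_regret(P, mu):
    """Eq. (1): sum_t sum_i p_{t,i} (mu_i - mu_{i*}) for a (T, N) array of plays P."""
    mu = np.asarray(mu, dtype=float)
    return float((P @ (mu - mu.min())).sum())
```

Note that `pseudo_regret` takes only the means `mu` and the learner's plays, mirroring the fact that the corruption enters only through the feedback and never through the performance measure.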

### 2.2 Multiplicative Weights

We recall two variants of the classic Multiplicative Weights (MW) algorithm that we revisit in this work. The standard MW algorithm (Littlestone and Warmuth, 1989; Freund and Schapire, 1995) is parameterized by a fixed step-size parameter $\eta > 0$. For an arbitrary sequence of loss vectors $g_1, \ldots, g_T$, it admits the following update rule, on every round $t$:

$$p_{t,i} = \frac{e^{-\eta \sum_{s=1}^{t-1} g_{s,i}}}{\sum_{j=1}^{N} e^{-\eta \sum_{s=1}^{t-1} g_{s,j}}}, \qquad \forall\, i \in [N]. \tag{3}$$

For the basic, fixed step-size version of our results, we will need a standard second-order regret bound for MW.

###### Lemma 1 (Cesa-Bianchi et al., 2007; see also Arora et al., 2012).

If for all and , the regret of the MW updates in Eq. 3 is bounded as

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (g_{t,i} - g_{t,i^\star}) \le \frac{\log N}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, g_{t,i}^2.$$

In particular, the bound implies the well-known optimal $O(\sqrt{T \log N})$ regret bound for MW in the adversarial setting, if the step size is properly tuned to $\eta = \sqrt{\log(N)/T}$. Note that the right setting of $\eta$ depends on the time horizon $T$.
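As an illustration of the fixed step-size update in Eq. 3, here is a minimal NumPy sketch. The function name `mw_fixed` and the array layout (one row of losses per round) are assumptions of this example; the max-shift inside the loop is a standard numerical-stability device and does not change the resulting distribution.

```python
import numpy as np

def mw_fixed(G, eta):
    """Play Eq. (3) on a (T, N) array of losses G; returns the (T, N) array of p_t."""
    T, N = G.shape
    cum = np.zeros(N)              # cumulative losses sum_{s<t} g_s
    P = np.empty((T, N))
    for t in range(T):
        z = -eta * cum
        z -= z.max()               # shift for numerical stability; p_t is unchanged
        w = np.exp(z)
        P[t] = w / w.sum()         # p_t depends only on losses from earlier rounds
        cum += G[t]
    return P
```

The first play is always uniform, and the weight of an expert decays exponentially in its cumulative loss at rate `eta`.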

An adaptive variant of the MW algorithm that does not require knowledge of $T$ was proposed in Auer et al. (2002). This variant employs a diminishing step size sequence, and takes the form:

$$p_{t,i} = \frac{e^{-\eta_t \sum_{s=1}^{t-1} g_{s,i}}}{\sum_{j=1}^{N} e^{-\eta_t \sum_{s=1}^{t-1} g_{s,j}}}, \qquad \forall\, i \in [N], \tag{4}$$

with $\eta_t = \sqrt{\log(N)/t}$ for all $t$. This algorithm was shown to obtain the optimal $O(\sqrt{T \log N})$ regret in the adversarial setup for any $T$ (Auer et al., 2002; Cesa-Bianchi and Lugosi, 2006). We will show that, remarkably, the adaptive MW algorithm also achieves the optimal performance in the adversarially-corrupted experts setting, for any level of corruption.
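The adaptive variant in Eq. 4 differs from the fixed step-size rule only in that the current step size multiplies the entire cumulative loss. A minimal sketch follows, assuming the schedule $\eta_t = \sqrt{\log(N)/t}$ that is consistent with the step-size computation in Eq. 8; the function name `mw_adaptive` is hypothetical.

```python
import numpy as np

def mw_adaptive(G):
    """Play Eq. (4) on a (T, N) array of losses G with eta_t = sqrt(log N / t)."""
    T, N = G.shape
    cum = np.zeros(N)                      # cumulative losses sum_{s<t} g_s
    P = np.empty((T, N))
    for t in range(1, T + 1):
        eta_t = np.sqrt(np.log(N) / t)     # assumed schedule; needs no knowledge of T
        z = -eta_t * cum                   # the *current* step size scales all past losses
        z -= z.max()                       # numerical stability only
        w = np.exp(z)
        P[t - 1] = w / w.sum()
        cum += G[t - 1]
    return P
```

Because the step size is re-applied to the whole history on every round, the influence of any fixed prefix of the loss sequence shrinks as $\eta_t$ decreases.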

We remark that the MW algorithm in Eq. 4 is in fact an instantiation of the canonical Follow-the-Regularized Leader (FTRL) framework in online optimization with entropy as regularization, when one allows the magnitude of regularization to change from round to round. MW can also be obtained by instantiating the closely related Online Mirror Descent (OMD) meta-algorithm, which also allows for the regularization to vary across rounds. (For more background on online optimization, FTRL and OMD, see Section 4.1 below.) When the regularization is fixed, it is a well-known fact that the two frameworks are generically equivalent and give rise to precisely the same algorithm, presented in Eq. 3. However, when the regularization is time-dependent, they produce different algorithms. We discuss the disparities between these different variants in more detail in Section 3.3.

## 3 Main Results

In this section, we consider the adversarially-corrupted stochastic setting and present our main results. As a warm-up, we analyze the Multiplicative Weights algorithm with fixed step sizes while assuming the minimal gap $\Delta$ is known to the learner. Then, we consider the general case where neither the gap nor the corruption level is known, and prove that the adaptive multiplicative weights algorithm attains optimal performance.

### 3.1 A warm-up analysis for known minimal gap

We begin with an easier case where the gap $\Delta$ is known to the learner, and can be used to tune the step size parameter $\eta$ of multiplicative weights (Eq. 3). In this case, a fixed step-size algorithm suffices and we have the following.

###### Theorem 2.

The Multiplicative Weights algorithm (Eq. 3) with $\eta = \Delta/2$, in the adversarially-corrupted stochastic regime with corruption level $C$ over $T$ rounds, achieves constant $O(\log(N)/\Delta + C)$ expected pseudo regret.

Two key observations in the analysis are the following. The first observation gives a straightforward bound on the corrupted losses of an expert in terms of its pseudo regret.

###### Observation 3.

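For any $t$ and $i$ the following holds:

$$(\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \le \frac{1}{\Delta} (\mu_i - \mu_{i^\star}).$$

###### Proof.

For $i \ne i^\star$, note that $(\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \le 1$, and on the other hand, $\frac{1}{\Delta}(\mu_i - \mu_{i^\star}) \ge 1$ since $\mu_i - \mu_{i^\star} \ge \Delta$. Moreover, for $i = i^\star$ we have $(\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 = 0$ and $\mu_i - \mu_{i^\star} = 0$.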

The second observation relates the regret with respect to the corrupted and uncorrupted losses.

###### Observation 4.

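For any probability vectors $p_1, \ldots, p_T$ the following holds:

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\ell_{t,i} - \ell_{t,i^\star}) \le \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) + 2C.$$

###### Proof.

Denoting $\delta_{t,i} = \tilde{\ell}_{t,i} - \ell_{t,i}$ as the corruption for expert $i$ at time step $t$, we get

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) = \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\ell_{t,i} - \ell_{t,i^\star}) + \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\delta_{t,i} - \delta_{t,i^\star}).$$

By definition of the corruption, $\sum_{t=1}^{T} \|\delta_t\|_\infty = C$, and therefore $\sum_{t=1}^{T} |\delta_{t,i}| \le C$ for every $i$. Using the triangle inequality implies that

$$\Big| \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\delta_{t,i} - \delta_{t,i^\star}) \Big| \le \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} \big( |\delta_{t,i}| + |\delta_{t,i^\star}| \big) \le 2 \sum_{t=1}^{T} \|\delta_t\|_\infty = 2C.$$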
We now turn to prove the theorem.

###### Proof (of Theorem 2).

We start off with the basic bound of (fixed step size) MW in Lemma 1:

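$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) \le \frac{\log N}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i}\, \tilde{\ell}_{t,i}^2.$$

First, note that the regret of playing a fixed sequence $p_1, \ldots, p_T$ is not affected by an additive translation of the losses of the form $\tilde{\ell}_{t,i} \mapsto \tilde{\ell}_{t,i} - c_t$, for any constants $c_t$ for which the translated losses still satisfy the condition of Lemma 1. In addition, for the Multiplicative Weights algorithm the sequences $p_1, \ldots, p_T$ are also not affected by additive translation. Thus, taking $c_t = \tilde{\ell}_{t,i^\star}$ yields

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) \le \frac{\log N}{\eta} + \eta \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2.$$

Applying Observations 4 and 3 and rearranging terms implies

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\ell_{t,i} - \ell_{t,i^\star}) \le \frac{\log N}{\eta} + 2C + \frac{\eta}{\Delta} \sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\mu_i - \mu_{i^\star}).$$

Taking expectation while using the fact that $p_t$ and $\ell_t$ are independent, we obtain

$$\mathbb{E}[\overline{R}_T] \le \frac{\log N}{\eta} + 2C + \frac{\eta}{\Delta}\, \mathbb{E}[\overline{R}_T].$$

Finally, by setting $\eta = \Delta/2$ and rearranging we can conclude that

$$\mathbb{E}[\overline{R}_T] \le \frac{4 \log N}{\Delta} + 4C.$$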
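For readers who want the last step spelled out, the rearrangement can be written as follows. It assumes the step size $\eta = \Delta/2$, which is the value consistent with the constants in the final bound.

```latex
\begin{align*}
\mathbb{E}[\overline{R}_T]
  &\le \frac{\log N}{\eta} + 2C + \frac{\eta}{\Delta}\,\mathbb{E}[\overline{R}_T]
   = \frac{2\log N}{\Delta} + 2C + \frac{1}{2}\,\mathbb{E}[\overline{R}_T]
   && \text{(substituting } \eta = \Delta/2\text{)} \\
\frac{1}{2}\,\mathbb{E}[\overline{R}_T]
  &\le \frac{2\log N}{\Delta} + 2C
   && \text{(moving the regret term to the left)} \\
\mathbb{E}[\overline{R}_T]
  &\le \frac{4\log N}{\Delta} + 4C .
\end{align*}
```

The self-bounding structure is what makes the bound constant: the regret appears on both sides, and any choice of $\eta < \Delta$ leaves a positive fraction of it on the left-hand side.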

### 3.2 General analysis with decreasing step sizes

We now formally state and prove our main result: a constant regret bound in the adversarially-corrupted case for the adaptive MW algorithm (in Eq. 4), which does not require the learner to know either the gap $\Delta$ or the corruption level $C$.

###### Theorem 5.

The adaptive MW algorithm in Eq. 4 with $\eta_t = \sqrt{\log(N)/t}$, in the adversarially-corrupted stochastic regime with corruption level $C$ over $T$ rounds, achieves constant $O(\log(N)/\Delta + C)$ expected pseudo regret.

Note that this result is tight (up to constants): a lower bound of $\Omega(\log(N)/\Delta)$ was shown by Mourtada and Gaïffas (2019), and a lower bound of $\Omega(C)$ is straightforward: consider an instance with two experts whose distinct mean losses are assigned randomly to the experts, and an adversary that corrupts the first $C$ rounds and assigns a loss of zero to both experts on those rounds; the learner receives no information about the identity of the best expert (whose mean loss is the smallest) during the first $C$ rounds and thus incurs, in expectation, at least $\Omega(C)$ pseudo regret over these rounds.

For the proof of Theorem 5 we require two main lemmas. The first lemma is a second-order regret bound for adaptive MW, analogous to the one stated in Lemma 1 for the fixed step size case. Here and throughout the section, we use $H(p)$ to denote the entropy of a probability vector $p$, that is, $H(p) = \sum_{i=1}^{N} p_i \log \frac{1}{p_i}$.

###### Lemma 6.

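For any sequence of loss vectors $g_1, \ldots, g_T$, the regret of the adaptive MW algorithm in Eq. 4 satisfies

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (g_{t,i} - g_{t,i^\star}) \le 2 \log N + \frac{1}{2 \log N} \sum_{t=1}^{T} \eta_t H(p_{t+1}) + 5 \sum_{t=1}^{T} \eta_t \sum_{i=1}^{N} p_{t,i}\, g_{t,i}^2,$$

provided that $|g_{t,i}| \le 1$ for all $t$ and $i$.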

The lemma is obtained from a more general bound for Follow-the-Regularized Leader, and follows from standard arguments adapted to the case of time-varying regularization. For completeness, we give the proof of this general bound in Appendix A, and use it to derive the lemma in Section 4.

The second lemma, key to our refined analysis of adaptive MW, shows that a properly scaled version of the entropy of any probability vector $p$ is upper bounded by the instantaneous pseudo regret of $p$, up to an exponentially decaying additive term.

###### Lemma 7.

For any , and , we have the following bound for the entropy of any probability vector and any :

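$$\frac{1}{\sqrt{\tau}} H(p) \le \frac{5}{8} \sum_{i \ne i^\star} p_i \Delta + \frac{2}{\sqrt{\tau}}\, e^{-\frac{1}{8} \Delta \sqrt{\tau}}.$$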

We prove the lemma below, but first let us show how it is used to derive our main theorem.

###### Proof (of Theorem 5).

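Applying Lemma 6 on the corrupted loss vectors $\tilde{\ell}_t$ and introducing additive translations of $\tilde{\ell}_{t,i^\star}$ as before yields the bound

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) \le 2 \log N + \frac{1}{2 \log N} \sum_{t=1}^{T} \eta_t H(p_{t+1}) + 5 \sum_{t=1}^{T} \eta_t \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2.$$

In Lemma 12 below we bound the last term in the bound in terms of the pseudo regret (similarly to the proof of Theorem 2), as follows:

$$\sum_{t=1}^{T} \eta_t \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \le \frac{16 \log N}{\Delta} + \frac{1}{8} \overline{R}_T.$$

For bounding the first summation in the bound, we use Lemma 7. Summing the lemma's bound over $t$ and bounding the sum of the exponential terms by an integral (refer to Lemma 13 for the details), we obtain

$$\frac{1}{\log N} \sum_{t=1}^{T} \eta_t H(p_{t+1}) \le \frac{50 \log N}{\Delta} + \frac{5}{8} \overline{R}_T.$$

Plugging the two inequalities into the regret bound, and using $\Delta \le 1$ to absorb the $2 \log N$ term, we obtain

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star}) \le \frac{107 \log N}{\Delta} + \frac{15}{16} \overline{R}_T.$$

Using Observation 4 and taking expectation we get

$$\mathbb{E}[\overline{R}_T] \le \frac{107 \log N}{\Delta} + 2C + \frac{15}{16}\, \mathbb{E}[\overline{R}_T].$$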

Rearranging terms gives the theorem.
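The explicit constants implied by this rearrangement are not stated in the text; carrying out the arithmetic on the last displayed inequality gives the following.

```latex
\begin{align*}
\mathbb{E}[\overline{R}_T]
  &\le \frac{107\log N}{\Delta} + 2C + \frac{15}{16}\,\mathbb{E}[\overline{R}_T] \\
\frac{1}{16}\,\mathbb{E}[\overline{R}_T]
  &\le \frac{107\log N}{\Delta} + 2C \\
\mathbb{E}[\overline{R}_T]
  &\le \frac{1712\log N}{\Delta} + 32C
   = O\!\left(\frac{\log N}{\Delta} + C\right).
\end{align*}
```

The constants are large only because the coefficient $15/16$ of the self-bounding term is close to one; no attempt is made here to optimize them.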

We conclude this section with a proof of our key lemma.

###### Proof (of Lemma 7).

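We split the analysis of the sum $H(p) = \sum_{i} p_i \log \frac{1}{p_i}$ into the terms $i = i^\star$ and $i \ne i^\star$. Considering first the case $i = i^\star$, we apply the inequality $\log x \le x - 1$ for $x > 0$ to obtain, for $\sqrt{\tau} \ge 8/\Delta$,

$$\frac{1}{\sqrt{\tau}}\, p_{i^\star} \log \frac{1}{p_{i^\star}} \le \frac{1}{\sqrt{\tau}} (1 - p_{i^\star}) \le \frac{1}{8} \sum_{i \ne i^\star} p_i \Delta.$$

Next, we examine the remaining terms with $i \ne i^\star$. The main idea is to look at two different regimes: one when $p_i \ge e^{-\frac{1}{2} \Delta \sqrt{\tau}}$ and the other for $p_i < e^{-\frac{1}{2} \Delta \sqrt{\tau}}$. In the former case, we have

$$\frac{1}{\sqrt{\tau}}\, p_i \log \frac{1}{p_i} \le \frac{1}{2 \sqrt{\tau}}\, p_i \Delta \sqrt{\tau} = \frac{1}{2} p_i \Delta.$$

For the latter case, we can use the inequality $x \log \frac{1}{x} \le 2 \sqrt{x}$ for $x \in (0, 1]$ to obtain

$$\frac{1}{\sqrt{\tau}}\, p_i \log \frac{1}{p_i} \le \frac{2}{\sqrt{\tau}} \sqrt{p_i} \le \frac{2}{\sqrt{\tau}}\, e^{-\frac{1}{4} \Delta \sqrt{\tau}}.$$

Combining both observations for $i \ne i^\star$ implies

$$\frac{1}{\sqrt{\tau}} \sum_{i \ne i^\star} p_i \log \frac{1}{p_i} \le \frac{1}{2} \sum_{i \ne i^\star} p_i \Delta + \frac{2N}{\sqrt{\tau}}\, e^{-\frac{1}{4} \Delta \sqrt{\tau}}.$$

Finally, note that for $\sqrt{\tau} \ge 8 \log(N)/\Delta$ it holds that $N e^{-\frac{1}{4} \Delta \sqrt{\tau}} \le e^{-\frac{1}{8} \Delta \sqrt{\tau}}$. This together with our first inequality concludes the proof.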
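The proof rests on two elementary bounds on the entropy term $x \log \frac{1}{x}$, namely $x \log \frac{1}{x} \le 1 - x$ and $x \log \frac{1}{x} \le 2\sqrt{x}$ on $(0, 1]$. The following sketch checks both on a fine grid; it is a numerical sanity check under an assumed grid resolution, not a substitute for the proof.

```python
import numpy as np

def entropy_term(x):
    """x * log(1/x), the per-coordinate contribution to the entropy H(p)."""
    return x * np.log(1.0 / x)

# Grid over (0, 1]; the lower endpoint is strictly positive to avoid log(1/0).
x = np.linspace(1e-12, 1.0, 100_001)

# Slack in  x log(1/x) <= 1 - x   (follows from log y <= y - 1 with y = 1/x).
gap_linear = (1.0 - x) - entropy_term(x)

# Slack in  x log(1/x) <= 2 sqrt(x)   (follows from log y <= 2 sqrt(y) with y = 1/x).
gap_sqrt = 2.0 * np.sqrt(x) - entropy_term(x)
```

Both slack arrays should be nonnegative up to floating-point error; the first bound is tight at $x = 1$, which is why it is the right tool for the best expert, whose probability approaches one.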

### 3.3 Gap between Follow the Regularized Leader and Online Mirror Descent

Here we present a surprising contrast between the variants of the adaptive MW algorithm obtained by instantiating the Follow the Regularized Leader (FTRL) and Online Mirror Descent (OMD) meta-algorithms, in the adversarially corrupted regime. We show that while both give optimal algorithms in the stochastic experts setting, the OMD variant becomes strictly inferior to the FTRL variant once corruptions are introduced.

As remarked above, when the step size (i.e., the magnitude of regularization) is fixed, the two meta-algorithms are equivalent, and produce the classic MW algorithm in Eq. 3 when their regularization is set to the negative entropy function over the probability simplex. (For more background and references, see Section 4.1.) Once one allows the step-sizes to vary across rounds, the FTRL gives the adaptive MW algorithm in Eq. 4, while OMD yields the following updates:

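$$p_{t,i} = \frac{e^{-\sum_{s=1}^{t-1} \eta_s \ell_{s,i}}}{\sum_{j=1}^{N} e^{-\sum_{s=1}^{t-1} \eta_s \ell_{s,j}}}, \qquad \forall\, i \in [N]. \tag{5}$$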
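To make the difference from the FTRL update concrete, here is a minimal sketch of the OMD-style rule in Eq. 5, in which each loss vector is weighted by the step size in force when it was observed. The function name `mw_omd` is hypothetical, and the step-size sequence is passed in as an argument because the schedule is a free parameter of this sketch.

```python
import numpy as np

def mw_omd(L, etas):
    """Play Eq. (5) on a (T, N) array of losses L with per-round step sizes etas."""
    T, N = L.shape
    weighted = np.zeros(N)         # sum_{s<t} eta_s * l_s: each loss keeps its own step size
    P = np.empty((T, N))
    for t in range(T):
        z = -weighted
        z -= z.max()               # numerical stability only
        w = np.exp(z)
        P[t] = w / w.sum()
        weighted += etas[t] * L[t]
    return P
```

With a constant sequence `etas` this coincides with the fixed step-size rule in Eq. 3. With a decreasing sequence, early losses are permanently weighted by the large early step sizes, whereas in Eq. 4 the whole history is rescaled by the current, smaller step size.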

First, we show that the OMD variant of MW in Eq. 5 obtains the same constant regret bound in the pure stochastic regime, up to small factors. (The proof appears in Section 4.4.)

###### Theorem 8.

The adaptive MW variant in Eq. 5 with in the stochastic regime (with no corruption), achieves constant expected pseudo regret for any .

On the other hand, we give a simple example which demonstrates that the OMD variant of MW exhibits a strictly inferior performance compared to the FTRL variant (see Eq. 4) when adversarial corruptions are present. For simplicity, assume that the corruption level is a positive integer. Consider the following corrupted stochastic instance with experts. The mean loss of expert is while the mean loss of expert is . The adversary introduces corruption over the first rounds, and modifies the first losses of expert to ’s and those of expert to ’s.

For this simple problem instance, we show the following (see Section 4.3 for the proof).

###### Theorem 9.

The expected pseudo regret of the adaptive MW algorithm in Eq. 5 with where on the instance described above for rounds is at least .

In particular, if the learner does not have non-trivial bounds on the corruption level and gap  (that is, is a constant independent of and ), then the regret is necessarily at least or is exponentially large in .

### 3.4 Numerical simulations

We conducted a basic numerical experiment to illustrate our regret bounds and the gap between OMD and FTRL discussed above. The experiment setup consists of two experts with different gaps $\Delta$. The losses were taken as Bernoullis and the corruption strategy injected contamination in the first rounds up to a total budget of $C$, inflicting maximal loss on the best expert while zeroing the losses of the other expert. The results, shown in Fig. 1, demonstrate that for the stochastic case without corruption ($C = 0$) OMD achieves better pseudo regret, but is substantially outperformed by FTRL when $C > 0$. In Fig. 2 we further show the inverse dependence of the pseudo regret on the minimal gap $\Delta$, which precisely supports our theoretical finding discussed in Section 3.3.
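An experiment of this kind can be sketched as below. This is not the authors' simulation code: the function name `simulate`, the step-size schedule $\eta_t = \sqrt{\log(N)/t}$ used for both variants, and the parameter values in the usage note are assumptions made for illustration, and no numerical results are claimed here.

```python
import numpy as np

def simulate(mu, T, C, variant, rng):
    """Pseudo regret of adaptive MW ("ftrl", Eq. 4) or its OMD variant ("omd", Eq. 5).

    Losses are Bernoulli with means `mu`; during the first C rounds the learner
    instead observes loss 1 on the best expert and 0 on the others.
    """
    mu = np.asarray(mu, dtype=float)
    N, best = mu.size, int(np.argmin(mu))
    state = np.zeros(N)            # FTRL: sum of losses; OMD: sum of eta_s * losses
    regret = 0.0
    for t in range(1, T + 1):
        eta = np.sqrt(np.log(N) / t)               # assumed schedule for both variants
        z = -(eta * state) if variant == "ftrl" else -state
        w = np.exp(z - z.max())
        p = w / w.sum()
        regret += float(p @ (mu - mu[best]))       # pseudo regret uses the means only
        loss = (rng.random(N) < mu).astype(float)  # clean stochastic loss
        if t <= C:                                 # corrupted feedback
            loss = np.zeros(N)
            loss[best] = 1.0
        state += loss if variant == "ftrl" else eta * loss
    return regret
```

A typical use would average `simulate(mu, T, C, variant, rng)` over several seeds for each variant and each corruption budget, and then compare the two curves as a function of `C` or of the gap between the entries of `mu`.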

## 4 Proofs

### 4.1 Preliminaries: Online optimization with time-dependent regularization

We give a brief background on Follow the Regularized Leader and Online Mirror Descent algorithmic templates, in the case where the regularization is varying and time-dependent.

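The setup is the standard setup of online linear optimization. Let $W$ be a convex domain. On each prediction round $t = 1, \ldots, T$, the learner has to produce a prediction $w_t \in W$ based on $g_1, \ldots, g_{t-1}$, and subsequently observes a new loss vector $g_t$ and incurs the loss $g_t \cdot w_t$. The goal is to minimize the regret compared to any $w^\star \in W$, given by $\sum_{t=1}^{T} g_t \cdot (w_t - w^\star)$.

The FTRL template generates predictions $w_1, \ldots, w_T$, for $t = 1, \ldots, T$, as follows:

$$w_t = \arg\min_{w \in W} \Big\{ w \cdot \sum_{s=1}^{t-1} g_s + R_t(w) \Big\}. \tag{6}$$

Here, $R_1, \ldots, R_T$ is a sequence of twice-differentiable, strictly convex functions.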
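For the experts problem, the FTRL minimization over the simplex with a scaled negative-entropy regularizer has the familiar closed-form softmax solution, which is how Eq. 6 reduces to multiplicative weights. The sketch below checks this numerically by comparing the objective at the softmax point against random points of the simplex; the regularizer $R_t(w) = \frac{1}{\eta} \sum_i w_i \log w_i$, the cumulative loss vector, and all names are assumptions of this example.

```python
import numpy as np

def ftrl_objective(w, G, eta):
    """w . G + (1/eta) * sum_i w_i log w_i, with negative entropy as the regularizer."""
    return float(w @ G + np.sum(w * np.log(w)) / eta)

def softmax_play(G, eta):
    """Candidate minimizer: w_i proportional to exp(-eta * G_i)."""
    z = -eta * G
    z = z - z.max()                # numerical stability only
    w = np.exp(z)
    return w / w.sum()

rng = np.random.default_rng(0)
G = rng.random(5) * 10             # hypothetical cumulative losses sum_{s<t} g_s
eta = 0.3
w_star = softmax_play(G, eta)

# Random competitors drawn uniformly from the probability simplex.
candidates = rng.dirichlet(np.ones(5), size=1000)
best_random = min(ftrl_objective(w, G, eta) for w in candidates)
```

Since the objective is strictly convex on the simplex, `ftrl_objective(w_star, G, eta)` should not exceed `best_random`, which is what the closed form predicts.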

The derivation and analysis of FTRL-type algorithms is standard; see, e.g., Shalev-Shwartz (2012); Hazan (2016). In our analysis, however, we require a particular regret bound that we could not find stated explicitly in the literature; for completeness, we provide the bound here with a proof in Appendix A.

###### Theorem 10.

Suppose that for all for some strictly convex , with . Then there exists a sequence of points such that the following regret bound holds for all :

where is the local norm induced by at an appropriate , and is its dual norm.

#### Online Mirror Descent (OMD).

The closely-related OMD framework produces predictions via the following procedure: initialize , and for , compute

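$$w'_{t+1} = \arg\min_{w} \big\{ g_t \cdot w + D_{R_t}(w, w_t) \big\} = (\nabla R_t)^{-1} \big( \nabla R_t(w_t) - g_t \big); \qquad w_{t+1} = \arg\min_{w \in W} D_{R_t}(w, w'_{t+1}). \tag{7}$$

Here, $R_1, \ldots, R_T$ is a sequence of twice-differentiable, strictly convex functions and $D_R(w, w') = R(w) - R(w') - \nabla R(w') \cdot (w - w')$ is the Bregman divergence of a convex function $R$ at point $w'$.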

The proof of the following regret bound (which is again a somewhat specialized variant of standard bounds for OMD) appears in Appendix A.

###### Theorem 11.

Suppose that for all for some strictly convex , with . Then there exists a sequence of points such that the following regret bound holds for all :

where is the local norm induced by at an appropriate , and is its dual norm.

### 4.2 Upper bounds for FTRL

###### Proof (of Lemma 6).

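We observe that Eq. 4 is an instantiation of FTRL with $R_t(p) = \frac{1}{\eta_t} R(p)$ as regularizations, where $R(p) = \sum_{i=1}^{N} p_i \log p_i$ is the negative entropy. Hence, we can invoke Theorem 10 to bound the regret compared to any probability distribution $p^\star$. It suffices to bound the regret for $p^\star$ that minimizes $p^\star \cdot \sum_{t=1}^{T} g_t$, which is always a point-mass on a single expert $i^\star$, for which $R(p^\star) = 0$. Therefore, Theorem 10 in our case reads

$$\sum_{t=1}^{T} \sum_{i=1}^{N} p_{t,i} (g_{t,i} - g_{t,i^\star}) \le -\frac{1}{\eta_1} R(p_1) - \sum_{t=1}^{T} \Big( \frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} \Big) R(p_{t+1}) + \frac{1}{2} \sum_{t=1}^{T} \eta_t \big( \|g_t\|_t^* \big)^2.$$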

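Now set $\eta_t = \sqrt{\log(N)/t}$. For the first two terms in the bound, observe that $-R(p) = H(p) \le \log N$ for any probability vector $p$, and further, that

$$\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t} = \frac{1}{\sqrt{\log N}} \cdot \frac{1}{\sqrt{t} + \sqrt{t+1}} \le \frac{1}{2 \sqrt{t \log N}} = \frac{\eta_t}{2 \log N}. \tag{8}$$

For the final sum, we have to evaluate the Hessian $\nabla^2 R(p'_t)$ at a point $p'_t \in [p_t, p_{t+1}]$. A straightforward differentiation shows that this matrix is diagonal, with diagonal elements $1/p'_{t,i}$. Thus, writing $g_t^2$ for the entrywise square of $g_t$,

$$\big( \|g_t\|_t^* \big)^2 = g_t^\top \big( \nabla^2 R(p'_t) \big)^{-1} g_t = \sum_{i=1}^{N} p'_{t,i}\, g_{t,i}^2 = p'_t \cdot g_t^2. \tag{9}$$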

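The final sum can be divided and bounded as follows:

$$\sum_{t=1}^{T} \eta_t (p'_t \cdot g_t^2) = \sum_{t=1}^{\log N} \eta_t (p'_t \cdot g_t^2) + \sum_{t=1+\log N}^{T} \eta_t (p'_t \cdot g_t^2) \le 2 \log N + \sum_{t=1+\log N}^{T} \eta_t (p'_t \cdot g_t^2),$$

where we used the fact that $\sum_{t=1}^{\log N} \eta_t \le 2 \log N$ (recall that $p'_t \cdot g_t^2 \le 1$). To conclude the proof it suffices to show that $p'_{t,i} \le 9 p_{t,i}$ for $t > \log N$. To see this, denote $G_{t,i} = \sum_{s=1}^{t-1} g_{s,i}$ and write

$$\frac{e^{-\eta_{t+1} G_{t+1,i}}}{e^{-\eta_t G_{t,i}}} = e^{-\eta_{t+1} g_{t,i}}\, e^{(\eta_t - \eta_{t+1}) G_{t,i}}.$$

For $t \ge \log N$, the following relations hold:

$$0 \le \eta_{t+1} |g_{t,i}| \le \eta_{t+1} \le 1; \qquad 0 \le (\eta_t - \eta_{t+1}) |G_{t,i}| \le \sqrt{\log N}\, \frac{\sqrt{t+1} - \sqrt{t}}{\sqrt{t(t+1)}}\, t \le \frac{\sqrt{\log N}}{\sqrt{t} + \sqrt{t+1}} \le \eta_t \le 1.$$

Hence, for $t \ge \log N$ we have

$$\frac{1}{3} \le \frac{e^{-\eta_{t+1} G_{t+1,i}}}{e^{-\eta_t G_{t,i}}} \le 3,$$

and consequently

$$p_{t+1,i} = \frac{e^{-\eta_{t+1} G_{t+1,i}}}{\sum_{j=1}^{N} e^{-\eta_{t+1} G_{t+1,j}}} \le 9\, \frac{e^{-\eta_t G_{t,i}}}{\sum_{j=1}^{N} e^{-\eta_t G_{t,j}}} = 9 p_{t,i}.$$

Since $p'_t \in [p_t, p_{t+1}]$, the same inequality holds for $p'_t$; that is, $p'_{t,i} \le 9 p_{t,i}$ for all $i$, and the proof is complete.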

###### Lemma 12.

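For the adaptive MW algorithm in Eq. 4 run on the corrupted loss vectors $\tilde{\ell}_1, \ldots, \tilde{\ell}_T$, we have

$$\sum_{t=1}^{T} \eta_t \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \le \frac{16 \log N}{\Delta} + \frac{1}{8} \overline{R}_T.$$

###### Proof.

By setting $c = \sqrt{\log N}$ (so that $\eta_t = c/\sqrt{t}$) and $t_0 = 64 \log(N)/\Delta^2$ (so that $\eta_{t_0} = \Delta/8$) we obtain

$$\begin{aligned}
\sum_{t=1}^{T} \eta_t \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2
&\le \sum_{t=1}^{t_0} \eta_t + \sum_{t=t_0+1}^{T} \eta_{t_0} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \\
&\le 2c \sqrt{t_0} + \frac{\Delta}{8} \sum_{t=t_0+1}^{T} \sum_{i=1}^{N} p_{t,i} (\tilde{\ell}_{t,i} - \tilde{\ell}_{t,i^\star})^2 \\
&\le \frac{16 \log N}{\Delta} + \frac{1}{8} \sum_{t=t_0+1}^{T} \sum_{i=1}^{N} p_{t,i} (\mu_i - \mu_{i^\star}),
\end{aligned}$$

where in the final inequality we used Observation 3. To conclude we note that $p_{t,i} (\mu_i - \mu_{i^\star}) \ge 0$, thus we can modify the last summation to range over $t = 1, \ldots, T$.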

###### Lemma 13.

For the adaptive MW algorithm in Eq. 4, we have

$$\frac{1}{\log N} \sum_{t=1}^{T} \eta_t H(p_{t+1}) \le \frac{50 \log N}{\Delta} + \frac{5}{8} \overline{R}_T.$$

###### Proof.

First we split the sum as follows,

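$$\frac{1}{\log N} \sum_{t=1}^{T} \eta_t H(p_{t+1}) = \frac{1}{\log N} \sum_{t=1}^{t_0} \eta_t H(p_{t+1}) + \frac{1}{\log N} \sum_{t=t_0+1}^{T} \eta_t H(p_{t+1}),$$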

where . For the summation of we use Lemma 7 with to obtain

 1logNT∑t=t0+1ηtH(pt+1) =T∑t=t0+11√tlogNN∑i=1pt+1,ilog1pt+1,i ≤58T∑t=t0+1∑i≠i⋆pt+1,iΔ+2T∑t=t0+11√tlogNe−18Δ√tlogN ≤58T∑t=t0+1N∑i=1pt,i(μi−μi⋆)+Δ+2T∑t=t0+11√tlog