# Proximal Online Gradient is Optimum for Dynamic Regret

In online learning, the dynamic regret metric allows the reference (optimal) solution to change over time, while the typical (static) regret metric assumes the reference solution to be constant over the whole time horizon. The dynamic regret metric is particularly interesting for applications such as online recommendation, since customers' preferences evolve over time. While the online gradient method has been shown to be optimal for the static regret metric, the optimal algorithm for the dynamic regret remains unknown. In this paper, we show that proximal online gradient (a general version of online gradient) is optimal for the dynamic regret: we prove a lower bound that matches our upper bound, which in turn slightly improves the existing upper bound.

## 1 Introduction

Online learning (Zinkevich, 2003; Shalev-Shwartz, 2012; Hazan, 2016; Mohri and Yang, 2016; Zhang et al., 2018; Jun et al., 2017; Jain et al., 2012; Zhang et al., 2017b) has been a hot research topic over the last decade, due to its applications in practice such as online recommendation (Chaudhuri and Tewari, 2016), online collaborative filtering (Liu et al., 2017; Awerbuch and Hayes, 2007), moving object detection (Nair and Clark, 2004) and many others, as well as its close connection with other research areas such as stochastic optimization (Rakhlin et al., 2011; Liu et al., 2018) (Gao et al., 2017), multiple kernel learning (Lu et al., 2016; Shen et al., 2018), and bandit problems (Flaxman et al., 2005; Arora et al., 2012; Kwon and Perchet, 2017; Kocák et al., 2016).

The typical objective in online learning is to minimize the (static) regret defined below:

$$\sum_{t=1}^{T} f_t(x_t) - \underbrace{\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)}_{\text{the optimal reference}}, \tag{1}$$

where $x_t$ is the decision made at step $t$ after receiving the information available before that step (e.g., the losses and gradients observed in earlier steps). The optimal reference is chosen as the point that minimizes the sum of all component functions up to time $T$. However, this way of choosing the optimal reference may not fit some important applications in practice. For example, in the recommendation task, $f_t(x_t)$ is the regret at time $t$ decided by the $t$-th coming customer and our recommendation strategy $x_t$. The definition of regret in (1) implicitly assumes that the optimal recommendation strategy is constant over time, which is not necessarily true for the recommendation task (as well as many other applications), since customers' preferences usually evolve over time.

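Zinkevich (2003) proposed to use the dynamic regret as the metric for online learning, which allows the optimal strategy to change over time. More specifically, it is defined by

$$R_T^{A} := \sum_{t=1}^{T} f_t(x_t) - \min_{\{y_t\}_{t=1}^{T} \in \mathcal{L}^T_{D_0}} \sum_{t=1}^{T} f_t(y_t), \tag{2}$$

where $A$ denotes the algorithm that decides $x_t$ iteratively, $\{y_t\}_{t=1}^{T}$ is short for the sequence $y_1, y_2, \ldots, y_T$, and the set $\mathcal{L}^T_{D_0}$ of reference sequences whose dynamics are upper bounded by $D_0(T)$ is defined by

$$\mathcal{L}^T_{D_0} := \left\{ \{y_t\}_{t=1}^{T} : \sum_{t=1}^{T-1} \|y_{t+1} - y_t\| \le D_0(T) \right\}. \tag{3}$$

It was shown that the dynamic regret of Online Gradient (OG) is bounded (Hall and Willett, 2013, 2015; Zinkevich, 2003) by

$$R_T^{\text{OG}} = O\left(\sqrt{T} + \sqrt{T}\, D_0(T)\right). \tag{4}$$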

This raises a few fundamental questions:

• The dependence on $T$ is known to be tight, since OG is optimal for static regret. But is the dependence on the dynamics $D_0(T)$ tight? In other words, is OG also optimal for dynamic regret?

• If this bound is not tight, how can we design a "smarter" algorithm to follow the dynamics?

• How difficult is it to follow the dynamics in online learning?

Although the dynamic regret has received more and more attention recently (Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017a; Shahrampour and Jadbabaie, 2018; Hall and Willett, 2015; Jadbabaie et al., 2015), and some subsequent studies claim to improve this result by considering specific function types (e.g., strongly convex $f_t$) or by restricting the definition of dynamic regret, these fundamental questions remain unsolved.

In this paper, we consider a more general setup for the problem

$$f_t(x) = F_t(x) + H(x), \tag{5}$$

with $F_t$ and $H$ being only convex and closed, and a more general definition of the dynamic constraint:

$$\mathcal{L}^T_{D_\beta} := \left\{ \{y_t\}_{t=1}^{T} : \sum_{t=1}^{T-1} t^{\beta} \cdot \|y_{t+1} - y_t\| \le D_\beta(T) \right\}. \tag{6}$$
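To make the constraint in (6) concrete, the following is a minimal Python sketch that evaluates the weighted path length of a candidate reference sequence and checks whether it fits a given budget. The function names `weighted_path_length` and `in_dynamic_set`, the use of the Euclidean norm, and the toy sequence are assumptions made for illustration only.

```python
import numpy as np

def weighted_path_length(ys, beta=0.0):
    """Return sum_{t=1}^{T-1} t**beta * ||y_{t+1} - y_t||_2 for a (T, d) array ys."""
    ys = np.asarray(ys, dtype=float)
    steps = np.linalg.norm(ys[1:] - ys[:-1], axis=1)  # ||y_{t+1} - y_t|| for t = 1..T-1
    t = np.arange(1, len(ys))                          # 1-based time index of each step
    return float(np.sum(t ** beta * steps))

def in_dynamic_set(ys, budget, beta=0.0):
    """Check whether the sequence satisfies the budget D_beta(T) = budget."""
    return weighted_path_length(ys, beta) <= budget

# Toy reference sequence (made-up data): one jump early, one jump late.
ys = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 4.0], [0.0, 0.0]])
print(weighted_path_length(ys, beta=0.0))  # 10.0
print(weighted_path_length(ys, beta=1.0))  # 20.0 (the late jump is weighted by t = 3)
```

With `beta=0.0` the function computes the unweighted path length used in (3); a positive `beta` charges later movements more heavily.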

We show that the upper bound of the Proximal Online Gradient (POG) algorithm, which is a general version of online gradient, can be improved to

$$R_T^{\text{POG}} = O\left(\sqrt{T} + \sqrt{T^{1-\beta} \cdot D_\beta(T)}\right). \tag{7}$$

To understand the difficulty of following dynamics in online learning, we derive a lower bound (which measures the dynamic regret of the optimal algorithm) and show that the proved upper bound for POG matches it. This indicates that POG is an optimal algorithm for dynamic regret, not just for static regret.

## 2 Related work

We briefly review previous research on regret in static and dynamic environments.

### 2.1 Static regrets

Online gradient in the static environment has been extensively investigated over the last decade (Shalev-Shwartz, 2012; Hazan, 2016; Duchi et al., 2011). Specifically, when $f_t$ is strongly convex, the regret of online gradient is $O(\log T)$. When $f_t$ is only convex, the regret of online gradient is $O(\sqrt{T})$.

### 2.2 Regrets bounded by other dynamics

Zinkevich (2003) obtains a regret on the order of $O(\sqrt{T} + \sqrt{T}\, D_0(T))$ for convex $f_t$, as in (4). Similarly, Hall and Willett (2013, 2015) define the dynamic constraint through an inequality involving a model that provides a prediction about the dynamic environment. When this model predicts the dynamic environment accurately, they obtain a better regret than Zinkevich (2003), though the regret remains bounded in terms of the corresponding dynamics.

Additionally, assume $f_t$ is strongly convex and smooth, and the dynamic constraint is defined by

$$D^*(T) := \sum_{t=1}^{T-1} \|y^*_{t+1} - y^*_t\|, \quad \text{where } y^*_t := \operatorname*{argmin}_{y \in \mathcal{X}} f_t(y).$$

Mokhtari et al. (2016) obtain a regret bound in terms of $D^*(T)$. When querying noisy gradients, Bedi et al. (2018) obtain a regret bound that additionally depends on the cumulative gradient error. Yang et al. (2016) and Gao et al. (2018) extend it to non-strongly convex and non-convex functions, respectively. Shahrampour and Jadbabaie (2018) extend it to the decentralized setting, where the definition of $D^*(T)$ is changed slightly. Furthermore, define

$$S^*(T) := \sum_{t=1}^{T-1} \|y^*_{t+1} - y^*_t\|^2, \quad \text{where } y^*_t := \operatorname*{argmin}_{y \in \mathcal{X}} f_t(y).$$

When querying gradients in every iteration, Zhang et al. (2017a) improve the dynamic regret to a bound that involves $S^*(T)$. Compared with the previous work, we obtain a tight regret by using a more general definition of the dynamic constraint, i.e., (6), and our analysis does not assume smoothness or strong convexity of $f_t$.

Other regularities, including the functional variation (Jenatton et al., 2016a; Zhu and Xu, 2015; Besbes et al., 2015; Zhang et al., 2018), the gradient variation (Chiang et al., 2012), and the mixed regularity (Jadbabaie et al., 2015; Chen et al., 2017; Jenatton et al., 2016b), have been investigated to bound the dynamic regret. These regularities cannot be compared directly because they measure different aspects of the variation in the dynamic environment. In this paper, we use (6) to bound the regret; extending our analysis to other regularities is left for future work.

### 2.3 Shifting regret

There is a connection between the dynamic regret and the shifting regret (György and Szepesvári, 2016). The shifting regret and the dynamic regret in our paper use different norms as the metric to account for the dynamics. Also note that the shifting regret can be considered a special case of the dynamic regret in the following sense: our result implies an upper bound on the shifting regret. In particular, by restricting the reference sequence to a suitable region, our upper bound for dynamic regret implies a bound on the shifting regret that is consistent with the result in Jun et al. (2017).

## 3 Notations and Assumptions

In this section, we introduce notations and important assumptions for the online learning algorithm used throughout this paper.

### 3.1 Notations

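• $\mathcal{A}$ represents the family of all possible online algorithms.

• $\mathcal{F}$ represents the family of loss functions available to the adversary, where any loss function $f_t \in \mathcal{F}$ satisfies the three assumptions below. $\mathcal{F}^T$ denotes the function product space $\mathcal{F} \times \mathcal{F} \times \cdots \times \mathcal{F}$ ($T$ times).

• $\{y_t\}_{t=1}^{T}$ represents a sequence of $T$ vectors, namely, $y_1, y_2, \ldots, y_T$. $\{f_t\}_{t=1}^{T}$ denotes a sequence of $T$ functions, which is $f_1, f_2, \ldots, f_T$.

• $R_T^{A}$ is the regret for a loss function sequence $\{f_t\}_{t=1}^{T} \in \mathcal{F}^T$ with a learning algorithm $A \in \mathcal{A}$, where $A$ can be POG or OG.

• $\|\cdot\|_p$ denotes the $\ell_p$ norm. $\|\cdot\|$ represents the $\ell_2$ norm by default.

• $\partial$ represents the subgradient operator. $\mathbb{E}$ represents the mathematical expectation.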

### 3.2 Assumptions

We use the following assumptions to analyze the regret of the online gradient.

###### Assumption 1.

Functions $F_t$ for all $t$ and $H$ are convex and closed but possibly nonsmooth. Particularly, $f_t$ is defined as $f_t = F_t + H$, as in (5).

###### Assumption 2.

The convex compact set $\mathcal{X}$ is the domain for $F_t$ and $H$, and $\|x - y\|^2 \le R$ for any $x, y \in \mathcal{X}$.

###### Assumption 3.

For any $x \in \mathcal{X}$ and function $F_t$, $\|G_t(x)\|^2 \le G$, where $G_t(x) \in \partial F_t(x)$.

## 4 Algorithm

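We use the proximal online gradient (POG) for solving the online learning problem with $f_t$ in the form of (5). The POG algorithm is a general version of OG that takes care of the regularizer component $H$ in $f_t$. The complete POG algorithm is presented in Algorithm 1. Line 4 of Algorithm 1 is the proximal gradient descent step defined by

$$x_{t+1} = \operatorname{prox}_{H, \eta_t}\left(x_t - \eta_t G_t(x_t)\right),$$

where the proximal operator is defined as

$$\operatorname{prox}_{H, \eta_t}(x') := \operatorname*{argmin}_{x \in \mathcal{X}} \left\{ H(x) + \frac{1}{2\eta_t} \|x - x'\|^2 \right\}.$$

Therefore, the update of $x_{t+1}$ is also equivalent to

$$x_{t+1} = \operatorname{prox}_{H, \eta_t}\left(x_t - \eta_t G_t(x_t)\right) = \operatorname*{argmin}_{x \in \mathcal{X}} \; \langle G_t(x_t), x \rangle + \frac{1}{2\eta_t} \|x - x_t\|^2 + H(x).$$

The POG algorithm reduces to the OG algorithm when $H$ is a constant function.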
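The update rule can be sketched in a few lines of Python. This is only an illustration under assumptions that are not from the paper: the regularizer is taken to be $H(x) = \lambda \|x\|_1$, the domain is the box $[-r, r]^d$, and the names `prox_l1_box` and `pog` are hypothetical. For this particular choice the proximal problem separates over coordinates, and each one-dimensional convex problem is solved by soft-thresholding followed by clipping to the interval.

```python
import numpy as np

def prox_l1_box(v, eta, lam, radius):
    """Proximal step for the assumed H(x) = lam * ||x||_1 over the box [-radius, radius]^d."""
    shrunk = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)  # soft-thresholding
    return np.clip(shrunk, -radius, radius)                        # stay inside the box

def pog(subgrads, x1, etas, lam, radius):
    """Iterate x_{t+1} = prox(x_t - eta_t * G_t(x_t)); return [x_1, ..., x_{T+1}]."""
    xs = [np.asarray(x1, dtype=float)]
    for G_t, eta_t in zip(subgrads, etas):
        x_t = xs[-1]
        xs.append(prox_l1_box(x_t - eta_t * G_t(x_t), eta_t, lam, radius))
    return xs

# Toy run (made-up data): F_t(x) = 0.5 * ||x - c_t||^2 with a slowly drifting target c_t.
targets = [np.array([1.0, -1.0]) + 0.01 * t for t in range(50)]
subgrads = [lambda x, c=c: x - c for c in targets]
etas = [0.5 / np.sqrt(t) for t in range(1, 51)]
xs = pog(subgrads, np.zeros(2), etas, lam=0.1, radius=2.0)
print(xs[-1])
```

Setting `lam=0.0` makes the assumed regularizer constant, so the sketch reduces to projected online gradient on the box.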

## 5 Theoretical results

Compared with the previous definition of the dynamic constraint in (3), we consider a more general dynamic constraint, which is defined in (6).

When $\beta = 0$, $\mathcal{L}^T_{D_\beta}$ reduces to the previous definition of the dynamic constraint. When $\beta > 0$, $\mathcal{L}^T_{D_\beta}$ allocates larger weights to the later parts of the dynamics than to the earlier parts.

In this section, we first prove an upper bound for the regret of proximal online gradient under our general dynamic constraint, which also slightly improves the existing upper bound. Then we present a lower bound, which to the best of our knowledge was not well studied in previous literature. We show that our upper bound matches the lower bound, implying the optimality of the proximal online gradient algorithm.

### 5.1 Upper bound

###### Theorem 1.

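Choose the positive learning rate sequence $\{\eta_t\}_{t=1}^{T}$ in Algorithm 1 to be non-increasing. Then the following upper bound for the dynamic regret holds:

$$\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{\text{POG}} \le \sqrt{R} \max_{1 \le t \le T} \left\{ \frac{1}{\eta_t \cdot t^{\beta}} \right\} \cdot D_\beta(T) + \frac{R}{2\eta_T} + \frac{G}{2} \sum_{t=1}^{T} \eta_t + H(x_1) - H(x_{T+1}). \tag{8}$$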

To make the dynamic regret bound clearer, we choose the learning rate appropriately, which leads to the following result.

###### Corollary 1.

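Choose an appropriate $\gamma$ such that $\beta \le \gamma < 1$. Then, set the learning rate in Algorithm 1 to

$$\eta_t = t^{-\gamma} \cdot \sqrt{\frac{(1-\gamma)\left(2\sqrt{R}\, T^{2\gamma-\beta-1} D_\beta(T) + R\, T^{2\gamma-1}\right)}{G}}.$$

We have

$$\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{\text{POG}} = O\left(\sqrt{D_\beta(T) \cdot T^{1-\beta}} + \sqrt{T}\right). \tag{9}$$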
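The step-size schedule of Corollary 1 can be checked numerically. The Python sketch below builds the schedule and evaluates the right-hand side of the Theorem 1 bound without the $H(x_1) - H(x_{T+1})$ term. The function names `pog_step_sizes` and `theorem1_bound`, the extra requirement $\gamma \ge 0$, and the numeric values in the usage lines are assumptions chosen for illustration.

```python
import math

def pog_step_sizes(T, beta, gamma, R, G, D):
    """Return eta_t = sigma * t**(-gamma), t = 1..T, with sigma as in Corollary 1."""
    if not (0.0 <= gamma < 1.0 and beta <= gamma):
        raise ValueError("this sketch assumes 0 <= gamma < 1 and beta <= gamma")
    sigma = math.sqrt((1.0 - gamma)
                      * (2.0 * math.sqrt(R) * T ** (2 * gamma - beta - 1) * D
                         + R * T ** (2 * gamma - 1)) / G)
    return [sigma * t ** (-gamma) for t in range(1, T + 1)]

def theorem1_bound(etas, beta, R, G, D):
    """Right-hand side of the Theorem 1 bound, omitting H(x_1) - H(x_{T+1})."""
    worst = max(1.0 / (eta * t ** beta) for t, eta in enumerate(etas, start=1))
    return math.sqrt(R) * worst * D + R / (2.0 * etas[-1]) + 0.5 * G * sum(etas)

# Hypothetical problem constants, for illustration only.
T, beta, gamma, R, G, D = 1000, 0.0, 0.5, 4.0, 1.0, 10.0
etas = pog_step_sizes(T, beta, gamma, R, G, D)
bound = theorem1_bound(etas, beta, R, G, D)
rate = math.sqrt(D * T ** (1 - beta)) + math.sqrt(T)
print(bound, rate, bound / rate)
```

The printed ratio gives a rough sense of the constant hidden in the $O(\cdot)$ notation for these particular values.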

Letting $\beta = 0$, our upper bound is $O(\sqrt{T \cdot D_0(T)} + \sqrt{T})$, which slightly improves the known regret $O(\sqrt{T} + \sqrt{T}\, D_0(T))$ (Zinkevich, 2003; Hall and Willett, 2013, 2015) in the sense that it has a better dependence on $D_0(T)$.
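As a worked illustration of the gap between the two bounds, suppose the dynamics grow as $D_0(T) = \sqrt{T}$ (a growth rate assumed here only for the example). Then the bound with the dynamics inside the square root and the bound with the dynamics outside it evaluate to

$$\sqrt{T \cdot D_0(T)} + \sqrt{T} = T^{3/4} + T^{1/2} = O\left(T^{3/4}\right), \qquad \sqrt{T} + \sqrt{T}\, D_0(T) = T^{1/2} + T = O(T).$$

In this example the first bound is sublinear in $T$, while the second is linear.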

### 5.2 Lower bound

Once we obtain the upper bound for dynamic regret via POG, namely $O(\sqrt{D_\beta(T) \cdot T^{1-\beta}} + \sqrt{T})$, a question remains: is the dependence of our upper bound on $D_\beta(T)$ and $T$ tight, or even optimal?

To the best of our knowledge, this question has not been fully investigated in the existing literature, even for the case of the dynamic regret defined with $\beta = 0$.

To answer this question, we explore the value of $\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{A}$ for the optimal algorithm $A \in \mathcal{A}$, which is formally written as $\inf_{A \in \mathcal{A}} \sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{A}$. If a lower bound for this quantity matches the upper bound in (9), then we can say that POG is optimal for dynamic regret in online learning.

###### Theorem 2.

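For the dynamic regret defined with the constraint (6), the lower bound for our problem is

$$\inf_{A \in \mathcal{A}} \sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{A} = \Omega\left(\sqrt{D_\beta(T) \cdot T^{1-\beta}} + \sqrt{T}\right),$$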

where $\mathcal{A}$ is the set of all possible learning algorithms.

Theorem 2 shows that the lower bound matches the upper bound in terms of the order of $D_\beta(T)$ and $T$. This theoretical result implies that proximal online gradient is an optimal algorithm for finding decisions in the dynamic environment defined by $\mathcal{L}^T_{D_\beta}$, and that our upper bound is sufficiently tight. In addition, this lower bound reveals the difficulty of following dynamics in online learning.

## 6 Conclusion

The online learning problem with the dynamic regret metric is particularly interesting for many real scenarios. Although the online gradient method has been shown to be optimal for the static regret metric, the optimal algorithm for the dynamic regret remains unknown. This paper studies this problem from a theoretical perspective. We show that proximal online gradient, a general version of online gradient, is optimal for the dynamic regret by showing that our proved lower bound matches the upper bound, which slightly improves the existing upper bound.

## Appendix: Proofs

In this section, we present the detailed proofs for the necessary lemmas and the theorems in our paper. In particular, Lemma 1 and Lemma 2 are for the proofs of Theorem 1 and Corollary 1. Lemma 3 is for the proof of Theorem 2.

In our proofs, we abuse the notation $\partial H(x)$ a little bit to represent any vector in the subdifferential of $H$ at $x$. $G_t(x_t)$ still represents any vector in $\partial F_t(x_t)$.

We use $B_\psi$ to denote the Bregman divergence w.r.t. the function $\psi$.

###### Lemma 1.

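Given any sequence $\{y_t\}_{t=1}^{T}$, and setting any $\eta_t > 0$ in Algorithm 1, we have

$$\sum_{t=1}^{T} \left(F_t(x_t) + H(x_{t+1}) - F_t(y_t) - H(y_t)\right) \le \sum_{t=1}^{T} \frac{1}{2\eta_t} \left(\|y_t - x_t\|_2^2 - \|y_t - x_{t+1}\|_2^2\right) + \frac{1}{2} \sum_{t=1}^{T} \eta_t \|G_t(x_t)\|^2.$$

###### Proof.

Define $\psi(x) := \frac{1}{2}\|x\|^2$, so that $B_\psi(x, y) = \frac{1}{2}\|x - y\|^2$. According to the optimality condition of the update of $x_{t+1}$, for any $x \in \mathcal{X}$, we have

$$0 \le \langle x - x_{t+1}, \eta_t G_t(x_t) \rangle + \langle x - x_{t+1}, \nabla\psi(x_{t+1}) - \nabla\psi(x_t) + \eta_t \partial H(x_{t+1}) \rangle. \tag{10}$$

Then, using the convexity of $F_t$ and $H$ in the first step, we have

$$\begin{aligned}
&\eta_t \left(F_t(x_t) + H(x_{t+1}) - F_t(y_t) - H(y_t)\right) \\
&\quad\le \eta_t \langle x_t - y_t, G_t(x_t) \rangle + \eta_t \langle x_{t+1} - y_t, \partial H(x_{t+1}) \rangle \\
&\quad= \eta_t \langle x_{t+1} - y_t, G_t(x_t) \rangle + \eta_t \langle x_{t+1} - y_t, \partial H(x_{t+1}) \rangle + \eta_t \langle x_t - x_{t+1}, G_t(x_t) \rangle \\
&\quad\overset{\text{①}}{\le} \langle y_t - x_{t+1}, \nabla\psi(x_{t+1}) - \nabla\psi(x_t) \rangle + \eta_t \langle x_t - x_{t+1}, G_t(x_t) \rangle \\
&\quad\overset{\text{②}}{=} B_\psi(y_t, x_t) - B_\psi(x_{t+1}, x_t) - B_\psi(y_t, x_{t+1}) + \eta_t \langle x_t - x_{t+1}, G_t(x_t) \rangle \\
&\quad\overset{\text{③}}{\le} B_\psi(y_t, x_t) - B_\psi(y_t, x_{t+1}) + \frac{\eta_t^2}{2} \|G_t(x_t)\|^2.
\end{aligned}$$

① holds due to (10) with $x = y_t$. ② holds due to the three-point identity for the Bregman divergence, which is, for any vectors $x$, $y$, and $z$,

$$B_\psi(x, y) = B_\psi(x, z) + B_\psi(z, y) - \langle x - z, \nabla\psi(y) - \nabla\psi(z) \rangle.$$

③ holds due to $B_\psi(x_{t+1}, x_t) = \frac{1}{2}\|x_{t+1} - x_t\|^2$, so that $\eta_t \langle x_t - x_{t+1}, G_t(x_t) \rangle - B_\psi(x_{t+1}, x_t) \le \frac{\eta_t^2}{2} \|G_t(x_t)\|^2$. Dividing by $\eta_t$ and summing over $t$, we finally obtain

$$\begin{aligned}
\sum_{t=1}^{T} \left(F_t(x_t) + H(x_{t+1}) - F_t(y_t) - H(y_t)\right)
&\le \sum_{t=1}^{T} \left(\frac{1}{\eta_t} B_\psi(y_t, x_t) - \frac{1}{\eta_t} B_\psi(y_t, x_{t+1})\right) + \frac{1}{2} \sum_{t=1}^{T} \eta_t \|G_t(x_t)\|^2 \\
&= \sum_{t=1}^{T} \frac{1}{2\eta_t} \left(\|y_t - x_t\|_2^2 - \|y_t - x_{t+1}\|_2^2\right) + \frac{1}{2} \sum_{t=1}^{T} \eta_t \|G_t(x_t)\|^2.
\end{aligned}$$

It completes the proof. ∎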

###### Lemma 2.

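Given any sequence $\{y_t\}_{t=1}^{T}$, and setting a non-increasing series $\{\eta_t\}_{t=1}^{T}$ in Algorithm 1, we have

$$\sum_{t=1}^{T} \frac{1}{\eta_t} \left(\|y_t - x_t\|^2 - \|y_t - x_{t+1}\|^2\right) \le 2\sqrt{R} \sum_{t=1}^{T-1} \frac{1}{\eta_t} \|y_{t+1} - y_t\| + \frac{R}{\eta_T}.$$

###### Proof.

According to the law of cosines (equivalently, expanding the difference of the two squared norms) and Assumption 2, we have

$$\begin{aligned}
-\|y_t - x_{t+1}\|^2 + \|y_{t+1} - x_{t+1}\|^2
&= \langle y_{t+1} - y_t, \; y_{t+1} + y_t - 2x_{t+1} \rangle \\
&\le \|y_{t+1} - y_t\| \left(\|y_{t+1} - x_{t+1}\| + \|y_t - x_{t+1}\|\right) \\
&\le 2\sqrt{R}\, \|y_{t+1} - y_t\|.
\end{aligned} \tag{11}$$

Thus, we obtain

$$\begin{aligned}
&\sum_{t=1}^{T} \frac{1}{\eta_t} \left(\|y_t - x_t\|^2 - \|y_t - x_{t+1}\|^2\right) \\
&\quad= \sum_{t=1}^{T-1} \left(-\frac{1}{\eta_t} \|y_t - x_{t+1}\|^2 + \frac{1}{\eta_{t+1}} \|y_{t+1} - x_{t+1}\|^2\right) + \frac{1}{\eta_1} \|y_1 - x_1\|^2 - \frac{1}{\eta_T} \|y_T - x_{T+1}\|^2 \\
&\quad\le \sum_{t=1}^{T-1} \left(-\frac{1}{\eta_t} \|y_t - x_{t+1}\|^2 + \frac{1}{\eta_{t+1}} \|y_{t+1} - x_{t+1}\|^2\right) + \frac{R}{\eta_1} \\
&\quad\le \sum_{t=1}^{T-1} \left(-\frac{1}{\eta_t} \|y_t - x_{t+1}\|^2 + \frac{1}{\eta_t} \|y_{t+1} - x_{t+1}\|^2\right) + R \sum_{t=1}^{T-1} \left(\frac{1}{\eta_{t+1}} - \frac{1}{\eta_t}\right) + \frac{R}{\eta_1} \\
&\quad\overset{\text{①}}{\le} 2\sqrt{R} \sum_{t=1}^{T-1} \frac{1}{\eta_t} \|y_{t+1} - y_t\| + \frac{R}{\eta_T}.
\end{aligned}$$

The second inequality uses that $\{\eta_t\}$ is non-increasing together with Assumption 2. ① holds due to (11) and the telescoping sum. The proof is completed. ∎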

Proof of Theorem 1:

###### Proof.

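For any sequence of loss functions $\{f_t\}_{t=1}^{T} \in \mathcal{F}^T$ and any reference sequence $\{y_t\}_{t=1}^{T} \in \mathcal{L}^T_{D_\beta}$, we have

$$\sum_{t=1}^{T} \left(F_t(x_t) + H(x_t) - F_t(y_t) - H(y_t)\right) = \underbrace{\sum_{t=1}^{T} \left(F_t(x_t) + H(x_{t+1}) - F_t(y_t) - H(y_t)\right)}_{I_0} + H(x_1) - H(x_{T+1}).$$

According to Lemma 1, we have

$$\begin{aligned}
I_0 &= \sum_{t=1}^{T} \left(F_t(x_t) + H(x_{t+1}) - F_t(y_t) - H(y_t)\right) \\
&\le \sum_{t=1}^{T} \frac{1}{2\eta_t} \left(\|y_t - x_t\|^2 - \|y_t - x_{t+1}\|^2\right) + \frac{1}{2} \sum_{t=1}^{T} \eta_t \|G_t(x_t)\|^2 \\
&\overset{\text{①}}{\le} \sqrt{R} \sum_{t=1}^{T-1} \frac{1}{\eta_t} \|y_{t+1} - y_t\| + \frac{R}{2\eta_T} + \frac{G}{2} \sum_{t=1}^{T} \eta_t \\
&\le \sqrt{R} \max_{1 \le t \le T} \left\{ \frac{1}{\eta_t \cdot t^{\beta}} \right\} \cdot D_\beta(T) + \frac{R}{2\eta_T} + \frac{G}{2} \sum_{t=1}^{T} \eta_t.
\end{aligned}$$

① holds due to Lemma 2 and Assumption 3. The last inequality holds due to the definition of $\mathcal{L}^T_{D_\beta}$ in (6). Thus, we have

$$\sum_{t=1}^{T} \left(F_t(x_t) + H(x_t) - F_t(y_t) - H(y_t)\right) \le \sqrt{R} \max_{1 \le t \le T} \left\{ \frac{1}{\eta_t \cdot t^{\beta}} \right\} \cdot D_\beta(T) + \frac{R}{2\eta_T} + \frac{G}{2} \sum_{t=1}^{T} \eta_t + H(x_1) - H(x_{T+1}). \tag{12}$$

Since (12) holds for any sequence of loss functions $\{f_t\}_{t=1}^{T} \in \mathcal{F}^T$, we obtain

$$\sup_{\{f_t\}_{t=1}^{T} \in \mathcal{F}^T} R_T^{\text{POG}} \le \sqrt{R} \max_{1 \le t \le T} \left\{ \frac{1}{\eta_t \cdot t^{\beta}} \right\} \cdot D_\beta(T) + \frac{R}{2\eta_T} + \frac{G}{2} \sum_{t=1}^{T} \eta_t + H(x_1) - H(x_{T+1}).$$

It completes the proof. ∎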

Proof of Corollary 1:

###### Proof.

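Assume

$$\eta_t := t^{-\gamma} \cdot \sigma_1,$$

where $\sigma_1$ is a constant that does not depend on $t$. According to Theorem 1, when $\gamma \ge \beta$,

$$\max_{1 \le t \le T} \left\{ \frac{1}{\eta_t \cdot t^{\beta}} \right\} = \frac{T^{\gamma-\beta}}{\sigma_1}.$$

Substituting it into (8), we have

$$\begin{aligned}
R_T^{\text{POG}} &\le \frac{\sqrt{R}\, D_\beta(T)}{\sigma_1} T^{\gamma-\beta} + \frac{R}{2\sigma_1} T^{\gamma} + \frac{G \sigma_1}{2} \left(1 + \int_{1}^{T} t^{-\gamma}\, dt\right) + H(x_1) - H(x_{T+1}) \\
&\overset{\text{①}}{\le} \frac{\sqrt{R}\, D_\beta(T)}{\sigma_1} T^{\gamma-\beta} + \frac{R}{2\sigma_1} T^{\gamma} + \frac{G \sigma_1}{2(1-\gamma)} T^{1-\gamma} + H(x_1) - H(x_{T+1}).
\end{aligned}$$

① holds due to $1 + \int_{1}^{T} t^{-\gamma}\, dt = \frac{T^{1-\gamma} - \gamma}{1-\gamma} \le \frac{T^{1-\gamma}}{1-\gamma}$ for $0 \le \gamma < 1$. Choosing the optimal $\sigma_1$ with

$$\sigma_1 = \sqrt{\frac{(1-\gamma)\left(2\sqrt{R}\, T^{2\gamma-\beta-1} D_\beta(T) + R\, T^{2\gamma-1}\right)}{G}},$$

and using $\sqrt{a+b} \le \sqrt{a} + \sqrt{b}$, we have

$$R_T^{\text{POG}} \le \sqrt{\frac{2G\sqrt{R}\, D_\beta(T)\, T^{1-\beta}}{1-\gamma}} + \sqrt{\frac{G R T}{1-\gamma}} + H(x_1) - H(x_{T+1}) = O\left(\sqrt{D_\beta(T) \cdot T^{1-\beta}} + \sqrt{T}\right).$$

It completes the proof. ∎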

###### Lemma 3.

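Consider a sequence $\{v_t\}_{t=1}^{T}$ of $d$-dimensional vectors. For any $t$, the dimensions of $v_t$ are i.i.d. sampled from the Rademacher distribution. We have

$$\mathbb{E}_{\{v_t\}_{t=1}^{T}} \left\| \sum_{t=1}^{T} v_t \right\|_1 = \Omega\left(d\sqrt{T}\right).$$

###### Proof.

We consider the left hand side

$$\mathbb{E}_{\{v_t\}_{t=1}^{T}} \left\| \sum_{t=1}^{T} v_t \right\|_1 = \mathbb{E}_{\{v_t\}_{t=1}^{T}} \sum_{i=1}^{d} \left| \sum_{t=1}^{T} v_t(i) \right| = d \cdot \mathbb{E}_{\{v_t(1)\}_{t=1}^{T}} \left| \sum_{t=1}^{T} v_t(1) \right|, \tag{13}$$

where $v_t(i)$ denotes the $i$-th dimension of $v_t$. The second equality holds because the dimensions of $v_t$ are independent and identically distributed.

Consider the sequence $\{v_t(1)\}_{t=1}^{T}$. If the event that $+1$ is picked happens $m$ times with probability $P_m$, then the event that $-1$ is picked happens $T - m$ times. Denote $S_T := \sum_{t=1}^{T} v_t(1)$, and we have

$$S_T = m - (T - m) = 2m - T.$$

Since each sign is picked with probability $\frac{1}{2}$, we have

$$P(S_T = 2m - T) = P_m = \frac{1}{2^T} \cdot \binom{T}{m},$$

and

$$\mathbb{E}|S_T| = \sum_{m=0}^{T} \frac{|2m - T|}{2^T} \cdot \binom{T}{m} = \frac{1}{2^T} \cdot \sum_{m=0}^{T} |2m - T| \cdot \frac{T!}{m! \cdot (T-m)!}.$$

When $T$ is even, denote $T = 2J$. Thus,

$$\begin{aligned}
\mathbb{E}|S_T| &= \frac{1}{2^{2J}} \cdot \sum_{m=0}^{2J} |2m - 2J| \cdot \frac{(2J)!}{m! \cdot (2J-m)!} \\
&= \frac{(2J)!}{2^{2J}} \cdot \sum_{m=0}^{2J} \frac{|2m - 2J|}{m! \cdot (2J-m)!} \\
&\overset{\text{①}}{=} \frac{(2J)!}{2^{2J-2}} \cdot \sum_{n=1}^{J} \frac{n}{(J+n)! \cdot (J-n)!} \\
&= \frac{1}{2^{2J-2}} \cdot \left( \sum_{n=0}^{J} (n+J) \binom{2J}{J+n} - \sum_{n=0}^{J} J \binom{2J}{J+n} \right) \\
&= \frac{1}{2^{2J-2}} \cdot \left( \sum_{i=J}^{2J} i \binom{2J}{i} - \sum_{i=J}^{2J} J \binom{2J}{i} \right) \\
&\overset{\text{②}}{=} \frac{1}{2^{2J-2}} \cdot \left( \sum_{i=J}^{2J} 2J \binom{2J-1}{i-1} - \sum_{i=J}^{2J} J \binom{2J}{i} \right) \\
&\overset{\text{③}}{=} \frac{2J}{2^{2J-2}} \cdot \left( \sum_{k=J-1}^{2J-1} \binom{2J-1}{k} - \frac{1}{4} \left( 2^{2J} + \binom{2J}{J} \right) \right) \\
&\overset{\text{④}}{=} \frac{2J}{2^{2J-2}} \cdot \left( \frac{1}{2} \cdot 2^{2J-1} + \binom{2J-1}{J-1} - \frac{1}{4} \left( 2^{2J} + \binom{2J}{J} \right) \right) \\
&= \frac{2J}{2^{2J-2}} \cdot \left( \frac{1}{4} \binom{2J}{J} \right) \\
&= \frac{2J}{4^{J}} \cdot \frac{(2J)!}{J! \cdot J!} \\
&\overset{\text{⑤}}{\ge} 2J \cdot \frac{1}{2\sqrt{J}} \\
&= \Omega\left(\sqrt{T}\right).
\end{aligned}$$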
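The closed form reached in this chain can be sanity-checked numerically. The short Python sketch below compares the exact expectation, computed directly from the binomial probabilities, with the expression $\frac{2J}{4^J} \binom{2J}{J}$ and with $\sqrt{J}$. The function names are hypothetical and the values of $J$ are arbitrary.

```python
from math import comb, sqrt

def expected_abs_rademacher_sum(T):
    """Exact E|S_T| for S_T a sum of T independent Rademacher signs."""
    return sum(abs(2 * m - T) * comb(T, m) for m in range(T + 1)) / 2 ** T

def closed_form_even(J):
    """The closed form 2J * C(2J, J) / 4**J for T = 2J."""
    return 2 * J * comb(2 * J, J) / 4 ** J

# Print the exact value, the closed form, and the sqrt(J) lower bound side by side.
for J in (1, 5, 50):
    print(J, expected_abs_rademacher_sum(2 * J), closed_form_even(J), sqrt(J))
```

Integer arithmetic is used for the binomial sums, so the only rounding happens in the final division.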

Here, ① holds due to

 (2J)!22J⋅2J∑m=0|2m−2J|m!⋅(2J−m)! = (2J)!22J⋅(J∑m=02J−2mm!⋅(2J−m)!+2J∑m=J+12m−2Jm!⋅(2J−m)!) = (2J)!22J⋅(J∑n1=02n1(J−n1)!⋅(J+n1)!+J∑n2=12n2(J+n2