
# Online Convex Optimization with Unbounded Memory

Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in some convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their chosen decision. However, in many of the motivating applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and existing generalizations thereof fail to capture this. In this work we introduce a generalization of the OCO framework, “Online Convex Optimization with Unbounded Memory”, that captures long-term dependence on past decisions. We introduce the notion of p-effective memory capacity, H_p, that quantifies the maximum influence of past decisions on current losses. We prove a O(√(H_1 T)) policy regret bound and a stronger O(√(H_p T)) policy regret bound under mild additional assumptions. These bounds are optimal in terms of their dependence on the time horizon T. We show the broad applicability of our framework by using it to derive regret bounds, and to simplify existing regret bound derivations, for a variety of online learning problems including an online variant of performative prediction and online linear control.


## 1 Introduction

Online learning models sequential decision-making problems, where a learner must choose a sequence of decisions while interacting with the environment. We refer the reader to cesabianchi2006prediction; shalevshwartz2012online for more background. One of the most popular frameworks in online learning is online convex optimization (OCO) (see, for example, hazan2019oco).

The OCO framework is as follows. In each round, the learner chooses a decision from some convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. The performance of an algorithm is measured by regret: the difference between the algorithm’s total loss and that of the best fixed decision. This framework has numerous applications, such as prediction from expert advice (littlestone1989weighted; littlestone1994weighted), portfolio selection (cover1991universal), routing (awerbuch2008online), etc. There has been a lot of work in developing a variety of algorithms and proving optimal regret bounds, and we refer the reader to hazan2019oco for a survey. In many applications of the OCO framework the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. For example, consider the problem of online linear control (agarwal2019online), where in each round the learner chooses a “control policy” (i.e., decision), suffers a loss that is a function of the action taken by the chosen policy and the current state of the system, and the system’s state evolves according to linear dynamics. Note that the current state depends on the entire history of actions and, therefore, the current loss depends not only on the current decision but also on the entire history of decisions. The OCO framework fails to capture such long-term dependence of the current loss on past decisions, as do existing generalizations of OCO that allow the loss to depend on a constant number of past decisions (anava2015online). Although a series of approximation arguments can be used to apply the finite-memory generalization of OCO to the online linear control problem, no existing OCO framework captures the complete long-term dependence of current losses on past decisions.

Contributions In this paper we introduce a generalization of the OCO framework, “Online Convex Optimization with Unbounded Memory”, that allows the loss in the current round to depend on the entire history of decisions until that point. We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on present losses. We prove an $O(\sqrt{H_1 T})$ policy regret bound and a stronger $O(\sqrt{H_p T})$ policy regret bound under mild additional assumptions about the history (Section 3). Both bounds use the notion of policy regret with respect to constant action sequences, as defined in dekel2012online. Our bounds are optimal in terms of their dependence on the time horizon $T$, and we show that our results match existing ones for the special case of OCO with finite memory (Section 4.1). Finally, we show the broad applicability of our framework by using it to derive regret bounds, and to simplify existing regret bound derivations, for online learning problems including an online variant of performative prediction and online linear control (Sections 4.3 and 4.4).

Related work The most closely related work to ours is the OCO with finite memory framework (anava2015online). They consider a generalization of the OCO framework that allows the current loss to depend on a constant number of past decisions. There have been a number of follow-up works that extend the framework in a variety of other ways, such as non-stationarity (zhao2020nonstationary), incorporating switching costs (shi2020online), etc. However, none of the existing works go beyond a constant memory length. In contrast, our framework allows the current loss to depend on an unbounded number of past decisions.

Reinforcement learning (sutton2018reinforcement) is another popular framework for sequential decision-making that considers very general state-action models of feedback and dynamics. In reinforcement learning one typically measures regret with respect to the best state-action policy from some policy class, rather than the best fixed decision as in online learning and OCO. In the special case of linear control, policies can be reformulated as decisions while preserving convexity; we discuss this application in Section 4. Considering the general framework is an active area of research. For example, bhatia2020online provide only non-constructive upper bounds on regret.

We defer discussion of related work for specific applications to Section 4.

## 2 Framework

Let $T$ denote the time horizon. Let $\mathcal{X}$ be a closed and convex subset of a Hilbert space, $\mathcal{H}$ be a Banach space, $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{X} \to \mathcal{H}$ be linear operators, and $f_1, \dots, f_T : \mathcal{H} \to \mathbb{R}$ be loss functions chosen by an oblivious adversary. The game between the learner and the adversary proceeds as follows. Let the initial history be $h_0 = 0$. In each round $t$, the learner chooses $x_t \in \mathcal{X}$, the history is updated according to $h_t = A h_{t-1} + B x_t$, and the learner suffers loss $f_t(h_t)$.

Consider a strategy that chooses the same decision in every round, i.e., $x_t = x$ for some $x \in \mathcal{X}$ and for all rounds $t$. The history after round $t$ for such a strategy is described by $h_t = \sum_{k=0}^{t-1} A^k B x$, which motivates the following definition.

Given a function $f_t : \mathcal{H} \to \mathbb{R}$, the function $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ is defined by $\tilde{f}_t(x) = f_t\left( \sum_{k=0}^{t-1} A^k B x \right)$.

We define the policy regret of a learner as the difference between its total loss and the total loss of a strategy that plays the best fixed action in every round (dekel2012online). [Policy Regret] The policy regret of an algorithm is defined as

$$R_T(\mathcal{A}) = \sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x). \tag{1}$$
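As a concrete illustration of the protocol and of the policy regret in Eq. 1, the game can be simulated numerically. The sketch below assumes a scalar instantiation ($A$ and $B$ are scalars $a$ and $b$) and hypothetical absolute-value losses; these specific choices are illustrative and not from the paper.

```python
import numpy as np

# Minimal simulation of the protocol: h_t = a*h_{t-1} + b*x_t, with
# hypothetical losses f_t(h) = |h - target_t|. The scalars a, b and the
# targets are illustrative choices, not from the paper.
a, b, T = 0.5, 1.0, 50
rng = np.random.default_rng(0)
targets = rng.uniform(-1.0, 1.0, size=T)

def total_loss(decisions):
    """Play a sequence of decisions and return the total loss suffered."""
    h, total = 0.0, 0.0
    for t, x in enumerate(decisions):
        h = a * h + b * x                # history update h_t = A h_{t-1} + B x_t
        total += abs(h - targets[t])     # loss f_t(h_t)
    return total

# Policy regret (Eq. 1) compares against the best *constant* decision,
# whose history is the geometric sum h_t = sum_{k<t} a^k b x.
grid = np.linspace(-1.0, 1.0, 201)
best_fixed = min(total_loss(np.full(T, x)) for x in grid)
policy_regret = total_loss(rng.uniform(-1.0, 1.0, size=T)) - best_fixed
```

Here the best fixed decision is approximated by a grid search, which suffices for a one-dimensional sketch.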

### 2.1 Notation

We use $\|\cdot\|_{\mathcal{X}}$ to denote the norm associated with the Hilbert space $\mathcal{X}$ and $\|\cdot\|_{\mathcal{H}}$ to denote the norm associated with the Banach space $\mathcal{H}$. The operator norm for a linear operator $M$ from space $\mathcal{V}$ to space $\mathcal{W}$ is defined as $\|M\| = \sup_{v \neq 0} \|Mv\|_{\mathcal{W}} / \|v\|_{\mathcal{V}}$. For convenience, sometimes we simply use $\|\cdot\|$ when the meaning is clear from the context. For a finite-dimensional matrix, we use $\|\cdot\|_F$, $\|\cdot\|_2$, and $\sigma_k(\cdot)$ to denote its Frobenius norm, operator norm, and $k$-th singular value respectively. When we say that $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, we mean that elements $h \in \mathcal{H}$ are sequences $h = (h^{(1)}, h^{(2)}, \dots)$ with $\|h\|_{\mathcal{H}} = \left( \sum_{k} \|h^{(k)}\|^p \right)^{1/p}$.

### 2.2 Assumptions

We make the following assumptions on $f_t$ and $\tilde{f}_t$.

1. The functions $f_t$ are $L$-Lipschitz continuous, i.e., for all $h, \tilde{h} \in \mathcal{H}$,

$$|f_t(h) - f_t(\tilde{h})| \leq L \|h - \tilde{h}\|_{\mathcal{H}}.$$

2. The functions $\tilde{f}_t$ are differentiable. We use $\nabla \tilde{f}_t(x)$ to denote the gradient at $x$. Furthermore, the gradients are bounded, i.e., for all $x \in \mathcal{X}$,

$$\|\nabla \tilde{f}_t(x)\| \leq L.$$

3. The functions $\tilde{f}_t$ are convex, i.e., for all $x, y \in \mathcal{X}$,

$$\tilde{f}_t(y) \geq \tilde{f}_t(x) + \langle \nabla \tilde{f}_t(x), y - x \rangle.$$

Assumptions 1, 2, and 3 are standard in the optimization literature. Given these, we assume the following feedback model for the learner.

4. There exists a gradient oracle such that at the end of each round $t$, the learner receives the gradient $\nabla \tilde{f}_t(x_t)$.

Assumption 4 is a standard assumption in the literature on first-order optimization methods. Before presenting our final assumption, we introduce the notion of $p$-effective memory capacity that quantifies the maximum influence of past decisions on present losses.

[$p$-Effective Memory Capacity] Consider an online convex optimization with unbounded memory problem specified by the Hilbert space $\mathcal{X}$, Banach space $\mathcal{H}$, and linear operators $A$ and $B$. For $p \geq 1$, the $p$-effective memory capacity of this problem is defined as

$$H_p(\mathcal{X}, \mathcal{H}, A, B) = \left( \sum_{k=0}^{\infty} k^p \|A^k B\|^p \right)^{\frac{1}{p}}. \tag{2}$$

When the meaning is clear from the context we simply use $H_p$ to denote the $p$-effective memory capacity. One way to understand this definition is the following. Consider two sequences of decisions whose elements differ by no more than $k\delta$ at time $t-k$: $\|x_{t-k} - \tilde{x}_{t-k}\| \leq k\delta$. Then the histories generated by the two sequences have difference bounded as $\|h_t - \tilde{h}_t\|_{\mathcal{H}} \leq \sum_{k} \|A^k B\| \, k\delta \leq \delta H_1$. A similar bound holds with $H_p$ instead when $\mathcal{H}$ is a sequence space with the $p$-norm. Therefore, the $p$-effective memory capacity is an upper bound on the difference in histories for two sequences of decisions whose difference grows at most linearly in time.
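This interpretation can be checked numerically. The sketch below assumes a scalar system ($A = a$, $B = b$, so $\|A^k B\| = |a|^k |b|$); the particular constants are illustrative only.

```python
import numpy as np

# Two decision sequences, indexed by lag k (d[k] is the decision at lag k),
# whose entries differ by at most k*delta. The resulting history difference
# is then bounded by delta * H_1, as claimed above.
a, b, delta, t = 0.7, 1.0, 0.1, 200
lags = np.arange(t)
weights = (abs(a) ** lags) * abs(b)      # ||A^k B|| for the scalar system
H1 = float(np.dot(lags, weights))        # truncated 1-effective memory capacity

rng = np.random.default_rng(1)
d = rng.uniform(-1.0, 1.0, size=t)
d_tilde = d + delta * lags * rng.uniform(-1.0, 1.0, size=t)

gap = abs(np.dot(weights, d - d_tilde))  # |h_t - h~_t| for the scalar system
```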

5. The $1$-effective memory capacity is well-defined, i.e., $H_1 < \infty$.

When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, note that $H_p$ is well-defined under assumption 5. In this case we obtain stronger regret bounds that depend on $H_p$ instead of $H_1$.

## 3 Regret Analysis

In this section we present a regret analysis for OCO with unbounded memory. The approach consists of writing the regret as

$$\begin{aligned} R_T(\mathcal{A}) &= \sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x) \\ &= \sum_{t=1}^{T} \left( f_t(h_t) - \tilde{f}_t(x_t) \right) + \sum_{t=1}^{T} \tilde{f}_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x) \\ &\leq \underbrace{\sum_{t=1}^{T} \left( f_t(h_t) - \tilde{f}_t(x_t) \right)}_{(a)} + \underbrace{\max_{x \in \mathcal{X}} \sum_{t=1}^{T} \langle \nabla \tilde{f}_t(x_t), x_t - x \rangle}_{(b)}, \end{aligned}$$

where the last inequality follows from the convexity of $\tilde{f}_t$ (assumption 3). The second term (b) can be bounded by using the follow-the-regularized-leader (FTRL) algorithm (see the lemmas below), which we present in the context of our framework in Algorithm 1. The main novelty comes from bounding the first term (a) by measuring the difference between the histories $h_t$ and $\sum_{k=0}^{t-1} A^k B x_t$ in terms of the $p$-effective memory capacity (see the final lemma of this section).

Let $R : \mathcal{X} \to \mathbb{R}$ be an $\alpha$-strongly-convex regularizer. Since we assume $\mathcal{X}$ is a subset of a Hilbert space, such an $\alpha$-strongly-convex regularizer exists. The FTRL algorithm (DBLP:conf/colt/AbernethyHR08; DBLP:conf/colt/Shalev-ShwartzS06; hazan2019oco) chooses $x_t$ as the minimizer of the past losses plus a regularizer. In our framework, this corresponds to

$$x_t \in \operatorname*{argmin}_{x \in \mathcal{X}} \; \sum_{s=1}^{t-1} \langle \nabla \tilde{f}_s(x_s), x \rangle + \frac{R(x)}{\eta}.$$

For notational convenience, let $g_t(x) = \langle \nabla \tilde{f}_t(x_t), x \rangle$ for $t \in \{1, \dots, T\}$, and $x^* \in \operatorname*{argmin}_{x \in \mathcal{X}} \sum_{t=1}^{T} g_t(x)$.
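For intuition, with the quadratic regularizer $R(x) = \|x\|^2/2$ over a Euclidean ball, the FTRL update has a closed form: the unconstrained minimizer of the linearized losses plus $R(x)/\eta$ is a scaled negative gradient sum, which is then projected onto the ball. This is an illustrative sketch; the specific regularizer and the ball-shaped $\mathcal{X}$ are assumptions made here, not requirements of the framework.

```python
import numpy as np

def ftrl_step(grad_sum, eta, radius):
    """One FTRL iterate for R(x) = ||x||^2 / 2 (1-strongly convex) over the
    ball {x : ||x|| <= radius}: the unconstrained minimizer of
    <grad_sum, x> + R(x)/eta is -eta * grad_sum; then project onto the ball."""
    x = -eta * grad_sum
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

# Usage: accumulate gradients of the surrogate losses f~_t between steps.
eta, radius = 0.1, 1.0
grad_sum = np.zeros(3)
for g in [np.array([1.0, 0.0, 0.0]), np.array([0.0, 2.0, 0.0])]:
    x_t = ftrl_step(grad_sum, eta, radius)
    grad_sum += g
```

The closed form follows because $\langle g, x \rangle + \|x\|^2/(2\eta) = \|x + \eta g\|^2/(2\eta) + \text{const}$, so the constrained minimizer is the projection of $-\eta g$ onto the ball.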

Consider an online convex optimization with unbounded memory problem specified by the decision space $\mathcal{X}$, history space $\mathcal{H}$, and linear operators $A$ and $B$. Let the regularizer $R$ be $\alpha$-strongly-convex and satisfy $|R(x) - R(\tilde{x})| \leq D$ for all $x, \tilde{x} \in \mathcal{X}$. Algorithm 1 with step-size $\eta$ satisfies the following regret bound:

$$R_T(\mathrm{FTRL}) \leq \frac{D}{\eta} + \frac{\eta T L^2}{\alpha} + \frac{\eta T L^2}{\alpha} H_1. \tag{3}$$

If we set $\eta = \sqrt{\frac{D \alpha}{T L^2 (1 + H_1)}}$, then the regret bound satisfies

$$R_T(\mathrm{FTRL}) \leq 3 \sqrt{\frac{D L^2 (1 + H_1)}{\alpha}} \sqrt{T}. \tag{4}$$

When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, then the above bounds hold with $H_p$ instead of $H_1$.

We prove this theorem using a sequence of lemmas that formalize the approach mentioned at the start of this section. The proofs of the first two lemmas below are standard in the literature on FTRL, but we include them in the supplementary material for completeness (Appendix A).


If the regularizer $R$ satisfies $|R(x) - R(\tilde{x})| \leq D$ for all $x, \tilde{x} \in \mathcal{X}$, then Algorithm 1 with step-size $\eta$ satisfies

$$\sum_{t=1}^{T} g_t(x_t) - \sum_{t=1}^{T} g_t(x^*) \leq \sum_{t=1}^{T} \left( g_t(x_t) - g_t(x_{t+1}) \right) + \frac{D}{\eta}.$$

If the regularizer $R$ is $\alpha$-strongly-convex and satisfies $|R(x) - R(\tilde{x})| \leq D$ for all $x, \tilde{x} \in \mathcal{X}$, then Algorithm 1 with step-size $\eta$ satisfies $\|x_t - x_{t+1}\| \leq \frac{\eta L}{\alpha}$ and

$$\sum_{t=1}^{T} g_t(x_t) - \sum_{t=1}^{T} g_t(x^*) \leq \frac{D}{\eta} + \frac{\eta T L^2}{\alpha}.$$

The following lemma bounds the difference between $f_t(h_t)$ and $\tilde{f}_t(x_t)$ in terms of the $p$-effective memory capacity (Section 2.2) using the Lipschitz continuity of $f_t$ (assumption 1) and the bound on $\|x_t - x_{t+1}\|$ from the preceding lemma.

Consider an online convex optimization with unbounded memory problem specified by the decision space $\mathcal{X}$, history space $\mathcal{H}$, and linear operators $A$ and $B$. If the decisions $x_t$ are generated by Algorithm 1, then

$$|f_t(h_t) - \tilde{f}_t(x_t)| \leq \frac{\eta L^2}{\alpha} H_1 \tag{5}$$

for all rounds $t$. When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, then the above bound holds with $H_p$ instead of $H_1$.

###### Proof.

By our Lipschitz continuity assumption (assumption 1), we have

$$\begin{aligned} \left| f_t(h_t) - \tilde{f}_t(x_t) \right| &= \left| f_t(h_t) - f_t\!\left( \sum_{k=0}^{t-1} A^k B x_t \right) \right| && (6) \\ &\leq L \left\| h_t - \sum_{k=0}^{t-1} A^k B x_t \right\|_{\mathcal{H}} \\ &= L \left\| \sum_{k=0}^{t-1} A^k B x_{t-k} - \sum_{k=0}^{t-1} A^k B x_t \right\|_{\mathcal{H}}, && (7) \end{aligned}$$

where Eq. 6 follows by definition of $\tilde{f}_t$. In the general case we can bound the norm on the right-hand side as

$$\begin{aligned} \left\| \sum_{k=0}^{t-1} A^k B (x_t - x_{t-k}) \right\|_{\mathcal{H}} &\leq \sum_{k=0}^{t-1} \|A^k B\| \, \|x_t - x_{t-k}\| \\ &\leq \sum_{k=0}^{t-1} \|A^k B\| \, k \, \frac{\eta L}{\alpha} && (8) \\ &\leq \frac{\eta L}{\alpha} H_1, \end{aligned}$$

where Eq. 8 follows by the stability bound $\|x_t - x_{t+1}\| \leq \frac{\eta L}{\alpha}$ of the preceding lemma, which implies $\|x_t - x_{t-k}\| \leq k \frac{\eta L}{\alpha}$. Combining this with Eq. 7 completes the proof for the general case.

When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, then we can bound the same term as

$$\begin{aligned} \left\| \sum_{k=0}^{t-1} A^k B (x_t - x_{t-k}) \right\|_{\mathcal{H}} &= \left( \sum_{k=0}^{t-1} \left\| A^k B (x_t - x_{t-k}) \right\|^p \right)^{\frac{1}{p}} \\ &\leq \left( \sum_{k=0}^{t-1} \|A^k B\|^p \, \|x_t - x_{t-k}\|^p \right)^{\frac{1}{p}} \\ &\leq \frac{\eta L}{\alpha} \left( \sum_{k=0}^{t-1} k^p \|A^k B\|^p \right)^{\frac{1}{p}} && (9) \\ &\leq \frac{\eta L}{\alpha} H_p, \end{aligned}$$

where the first equality holds because the terms $A^k B (x_t - x_{t-k})$ occupy disjoint coordinates of the sequence space, and Eq. 9 follows by the stability bound of the preceding lemma. Combining this with Eq. 7 completes the proof for the case when $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$. ∎

###### Proof of the theorem.

As stated at the start of this section, we can write the regret as

$$\begin{aligned} R_T(\mathrm{FTRL}) &\leq \underbrace{\sum_{t=1}^{T} \left( f_t(h_t) - \tilde{f}_t(x_t) \right)}_{(a)} + \underbrace{\sum_{t=1}^{T} g_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} g_t(x)}_{(b)} \\ &\leq \frac{\eta T L^2}{\alpha} H_1 + \frac{D}{\eta} + \frac{\eta T L^2}{\alpha}, \end{aligned}$$

where the last inequality follows by bounding term (a) using Eq. 5 and bounding term (b) using the FTRL regret lemma above. When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, we can bound term (a) by $\frac{\eta T L^2}{\alpha} H_p$ instead. This completes the proof. ∎

## 4 Applications

### 4.1 Special Case: OCO with Finite Memory

We start by showing how OCO with finite memory (anava2015online) is a special case of our framework.¹ Formally, let $\mathcal{X}$ be closed and convex. Let $\mathcal{H} = \mathcal{X}^m$ with the $p$-norm, where $m$ denotes the memory length. Define the linear operators $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{X} \to \mathcal{H}$ as $A (x^{(1)}, \dots, x^{(m)}) = (0, x^{(1)}, \dots, x^{(m-1)})$ and $B x = (x, 0, \dots, 0)$ respectively. In words, the history consists of the past $m$ decisions and gets updated as $h_t = (x_t, x_{t-1}, \dots, x_{t-m+1})$. Note that $\mathcal{H}$ is a sequence space with the $p$-norm. Therefore, we can apply the theorem of Section 3 with $H_p$. It remains to bound $H_p$. ¹The only difference is that the summation in our definition of regret starts at $t = 1$ whereas it starts at $t = m$ in the OCO with finite memory framework. However, this difference only adds a constant to the regret in the OCO with finite memory framework, so we can ignore it.

Fix $x \in \mathcal{X}$. By definition, we have $\|B x\| = \|x\|$, for $k \leq m - 1$ we have $\|A^k B x\| = \|x\|$, and for $k \geq m$ we have $A^k B x = 0$. Therefore, $H_p = \left( \sum_{k=0}^{m-1} k^p \right)^{1/p} \leq m^{1 + \frac{1}{p}}$. Plugging this into the theorem of Section 3, we obtain a regret bound that matches existing results (anava2015online).
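The claimed pattern of $\|A^k B\|$, and the resulting bound on $H_p$, can be verified numerically for the shift operators above. The sketch below assumes scalar decisions and memory length $m = 5$, both illustrative choices.

```python
import numpy as np

# Finite-memory embedding with scalar decisions: B x = (x, 0, ..., 0) and
# A the downward shift, so A^k B places x in coordinate k+1.
m = 5
A = np.diag(np.ones(m - 1), -1)   # shift: (h1,...,hm) -> (0, h1, ..., h_{m-1})
B = np.zeros((m, 1)); B[0, 0] = 1.0

norms = [np.linalg.norm(np.linalg.matrix_power(A, k) @ B, 2) for k in range(2 * m)]

# ||A^k B|| = 1 for k < m and 0 for k >= m, hence
# H_p = (sum_{k<m} k^p)^{1/p} <= m^{1 + 1/p}.
p = 2
Hp = sum(k ** p for k in range(m)) ** (1.0 / p)
```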

### 4.2 OCO with Infinite Memory

We now show how our framework can model OCO with infinite memory problems that are not modelled by existing works. Formally, let $\mathcal{X}$ be closed and convex. Let $\mathcal{H}$ consist of sequences $h = (h^{(1)}, h^{(2)}, \dots)$ satisfying $\sum_k \|h^{(k)}\|^2 < \infty$, equipped with the $2$-norm. Let $\rho \in (0, 1)$. Define the linear operators $A$ and $B$ as $A (h^{(1)}, h^{(2)}, \dots) = (0, \rho h^{(1)}, \rho h^{(2)}, \dots)$ and $B x = (x, 0, \dots)$ respectively, so that $h_t = (x_t, \rho x_{t-1}, \rho^2 x_{t-2}, \dots)$. Note that $\mathcal{H}$ is a sequence space with the $2$-norm. Therefore, we can apply the theorem of Section 3 with $H_2$. It remains to bound $H_2$.

By definition, we have $\|B x\| = \|x\|$ and $\|A^k B\| \leq \rho^k$. Therefore,

$$H_2 \leq \sqrt{ \sum_{k=1}^{\infty} k^2 \rho^{2k} } \leq O\!\left( \frac{\rho}{(1 - \rho^2)^{\frac{3}{2}}} \right).$$
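A quick numerical check of this bound, taking $\sqrt{2}$ as an assumed explicit constant hidden by the $O(\cdot)$ and truncating the series at a large $K$:

```python
import math

def H2_truncated(rho, K=20000):
    """sqrt(sum_{k=1}^{K-1} k^2 rho^{2k}), a truncation of the series above."""
    return math.sqrt(sum(k * k * rho ** (2 * k) for k in range(1, K)))

# The closed form sum_{k>=1} k^2 q^k = q(1+q)/(1-q)^3 with q = rho^2 gives
# H_2 <= rho*sqrt(1+rho^2)/(1-rho^2)^{3/2} <= sqrt(2)*rho/(1-rho^2)^{3/2}.
bounds_hold = all(
    H2_truncated(rho) <= math.sqrt(2) * rho / (1 - rho ** 2) ** 1.5
    for rho in (0.1, 0.5, 0.9, 0.99)
)
```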

A useful extension of the above is when elements of $\mathcal{H}$ are matrices and the linear operator $A$ is defined as $A (h^{(1)}, h^{(2)}, \dots) = (0, W h^{(1)}, W h^{(2)}, \dots)$, where $W$ satisfies $\|W\| \leq \rho < 1$. An even more general extension of the above is the crux of our simplified regret bounds for online linear control (Section 4.4).

### 4.3 Online Performative Prediction

#### 4.3.1 Background

In many applications of machine learning the algorithm’s decisions influence the data distribution, for example, online labor markets (DBLP:conf/kdd/Anagnostopoulos18; horton2010online), predictive policing (lum2016predict), on-street parking (DBLP:journals/tits/DowlingRZ20; pierce2018sfpark), and vehicle sharing markets (banerjee2015pricing), to name a few. Motivated by such applications, several works have studied the problem of performative prediction, which models the data distribution as a function of the decision-maker’s decision (DBLP:conf/icml/PerdomoZMH20; DBLP:conf/nips/Mendler-DunnerP20; DBLP:conf/icml/MillerPZ21; DBLP:conf/aistats/BrownHK22; DBLP:conf/aaai/RayRDF22). These works view the problem as a stochastic optimization problem; we refer the reader to these citations for more details. As a natural extension to existing works, we introduce an online learning variant of performative prediction with geometric decay (DBLP:conf/aaai/RayRDF22) that differs from the original formulation in a few key ways.

Let $\mathcal{X}$ denote the decision set with $\max_{x \in \mathcal{X}} \|x\|_2 \leq R_{\mathcal{X}}$. Let $p_0$ denote the initial data distribution over the instance space $\mathcal{Z}$. In each round $t$, the learner chooses a decision $x_t \in \mathcal{X}$ and an oblivious adversary chooses a loss function $l_t : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$, and then the learner suffers the loss $\mathbb{E}_{z \sim p_t}[l_t(x_t, z)]$, where $p_t$ is the data distribution in round $t$. We also use the shorthand $p_t(x)$ for the distribution in round $t$ when the decision $x$ is chosen in all rounds.

The goal in our online learning setting is to minimize the difference between the algorithm’s total loss and the total loss of the best fixed decision,

$$\sum_{t=1}^{T} \mathbb{E}_{z \sim p_t}[l_t(x_t, z)] - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \mathbb{E}_{z \sim p_t(x)}[l_t(x, z)],$$

as a natural generalization of performative optimality (DBLP:conf/icml/PerdomoZMH20) for our online learning formulation.

We make the following assumptions about the distributions and loss functions. The data distribution in the first round satisfies $p_1 = p_0$, and for all $t \geq 1$, $p_{t+1} = (1 - \rho) \mathcal{D}(x_t) + \rho \, p_t$, where $\rho \in (0, 1)$ and $\mathcal{D}(x)$ is a distribution over $\mathcal{Z}$ that depends on the decision $x$. The distributions $\mathcal{D}(x)$ are assumed to belong to the following location-scale family: $z \sim \mathcal{D}(x)$ iff $z = A x + \xi$, where the matrix $A$ satisfies $\|A\|_2 < \infty$ and $\xi$ is a random variable with mean $\mu$ and covariance $\Sigma$. The loss functions $l_t$ are assumed to be convex, differentiable, and $\tilde{L}$-Lipschitz continuous.

Our problem formulation differs from existing work in the following ways. First, we adopt an online learning perspective, whereas DBLP:conf/aaai/RayRDF22 adopt a stochastic optimization perspective. Therefore, we assume that the loss functions $l_t$ are adversarially chosen, whereas DBLP:conf/aaai/RayRDF22 assume they are fixed. Second, our gradient oracle (assumption 4) assumes that the dynamics ($\rho$ and $\mathcal{D}$) are known, whereas DBLP:conf/aaai/RayRDF22 estimate the gradient using samples from this distribution. We believe that an appropriate extension of our framework that can deal with unknown linear operators $A$ and $B$ can be applied to this more difficult setting, and we leave this as future work.

Before formulating this problem in our OCO with unbounded memory framework, we state the definition of the $1$-Wasserstein distance that we use in our regret analysis. Informally, the $1$-Wasserstein distance is a measure of the distance between two probability measures.

[1-Wasserstein Distance] Let $(\mathcal{Z}, d)$ be a metric space. Let $\mathcal{P}(\mathcal{Z})$ denote the set of Radon probability measures $\nu$ on $\mathcal{Z}$ with finite first moment. That is, there exists $z_0 \in \mathcal{Z}$ such that $\mathbb{E}_{z \sim \nu}[d(z, z_0)] < \infty$. The $1$-Wasserstein distance between two probability measures $\nu, \nu' \in \mathcal{P}(\mathcal{Z})$ is defined as

$$W_1(\nu, \nu') = \sup \left\{ \mathbb{E}_{z \sim \nu}[f(z)] - \mathbb{E}_{z \sim \nu'}[f(z)] \right\},$$

where the supremum is taken over all $1$-Lipschitz continuous functions $f : \mathcal{Z} \to \mathbb{R}$.

#### 4.3.2 Formulation as OCO with Unbounded Memory

Let $\mathcal{X}$ be closed and convex with $\max_{x \in \mathcal{X}} \|x\|_2 \leq R_{\mathcal{X}}$. Let $\mathcal{H}$ consist of sequences $h = (h^{(1)}, h^{(2)}, \dots)$, where each $h^{(k)}$ lies in the ambient Hilbert space of $\mathcal{X}$ and $\sum_k \|h^{(k)}\|_2^2 < \infty$, equipped with the $2$-norm. Define the linear operators $A$ and $B$ as $A (h^{(1)}, h^{(2)}, \dots) = (0, \rho h^{(1)}, \rho h^{(2)}, \dots)$ and $B x = (x, 0, \dots)$ respectively. Note that $\mathcal{H}$ is a sequence space with the $2$-norm.

By definition, given a sequence of decisions $(x_1, \dots, x_t)$, the history at the end of round $t$ is given by $h_t = (x_t, \rho x_{t-1}, \rho^2 x_{t-2}, \dots)$. Define the functions $f_t : \mathcal{H} \to \mathbb{R}$ by $f_t(h_t) = \mathbb{E}_{z \sim p_t}[l_t(x_t, z)]$, where $p_t$ is the distribution that satisfies $z \sim p_t$ iff

$$z \sim \sum_{k=1}^{t-1} (1 - \rho) \rho^{k-1} \, \mathcal{D}(x_{t-k}) + \rho^{t-1} p_0. \tag{11}$$

Note that the distribution $p_t$ is a function of the history $h_t$. The above follows from the recursive definition of $p_t$ and the parametric assumption about $\mathcal{D}(\cdot)$.
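The mixture form in Eq. 11 can be sanity-checked on means: unrolling the recursion for $p_t$ reproduces exactly the weights $(1 - \rho)\rho^{k-1}$. The sketch below assumes a hypothetical scalar family where $\mathcal{D}(x)$ has mean $x$ and $p_0$ has mean $0$; these are illustrative simplifications.

```python
# Unroll the distribution recursion on means: starting from mean(p_0) = 0 and
# applying mean <- (1 - rho)*x + rho*mean once per decision reproduces the
# weights (1 - rho)*rho^{k-1} on x_{t-k} from Eq. (11).
rho, t = 0.8, 6
xs = [0.3, -0.5, 1.2, 0.7, -0.1]   # decisions x_1, ..., x_{t-1}

mean = 0.0
for x in xs:                        # one update per round
    mean = (1 - rho) * x + rho * mean

unrolled = sum((1 - rho) * rho ** (k - 1) * xs[t - 1 - k] for k in range(1, t))
```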

With the above formulation and definition of $f_t$, the original goal of minimizing the difference between the algorithm’s total loss and the total loss of the best fixed decision is equivalent to minimizing the regret,

$$\sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x).$$

#### 4.3.3 Regret Analysis

In order to apply our regret bounds, we must bound $H_2$, the Lipschitz continuity constant (assumption 1), and the norm of the gradients (assumption 2).

We have $H_2 \leq O\!\left( \frac{\rho}{(1 - \rho^2)^{3/2}} \right)$.

###### Proof.

Note that $\|B x\| = \|x\|$ and $\|A^k B\| \leq \rho^k$. Therefore,

$$H_2 = \sqrt{ \sum_{k=0}^{\infty} k^2 \|A^k B\|^2 } \leq \sqrt{ \sum_{k=1}^{\infty} k^2 \rho^{2k} } \leq O\!\left( \frac{\rho}{(1 - \rho^2)^{\frac{3}{2}}} \right). \qquad ∎$$

We have the following bound on the Lipschitz continuity constant: $L \leq \tilde{L} \left( 1 + \frac{1 - \rho}{\rho} \|A\|_2 \right)$.

###### Proof.

Let $(x_1, \dots, x_t)$ and $(\tilde{x}_1, \dots, \tilde{x}_t)$ be two sequences of decisions, where $x_s, \tilde{x}_s \in \mathcal{X}$. Let $h_t$ and $\tilde{h}_t$ be the corresponding histories, and $p_t$ and $\tilde{p}_t$ be the corresponding distributions at the end of round $t$. We have

$$\begin{aligned} \left| f_t(h_t) - f_t(\tilde{h}_t) \right| &= \left| \mathbb{E}_{z \sim p_t}[l_t(x_t, z)] - \mathbb{E}_{z \sim \tilde{p}_t}[l_t(\tilde{x}_t, z)] \right| \\ &= \big| \mathbb{E}_{z \sim p_t}[l_t(x_t, z)] - \mathbb{E}_{z \sim p_t}[l_t(\tilde{x}_t, z)] \\ &\quad + \mathbb{E}_{z \sim p_t}[l_t(\tilde{x}_t, z)] - \mathbb{E}_{z \sim \tilde{p}_t}[l_t(\tilde{x}_t, z)] \big| \\ &\leq \tilde{L} \|x_t - \tilde{x}_t\|_2 + \tilde{L} \, W_1(p_t, \tilde{p}_t), \end{aligned}$$

where the last inequality follows from the assumptions about the functions $l_t$ and the definition of the Wasserstein distance $W_1$. By the definition of $p_t$ in Eq. 11, we have that

$$\begin{aligned} W_1(p_t, \tilde{p}_t) &\leq \sum_{k=1}^{t-1} \frac{1 - \rho}{\rho} \rho^k \|A\|_2 \|x_{t-k} - \tilde{x}_{t-k}\|_2 \\ &\leq \frac{1 - \rho}{\rho} \|A\|_2 \|h_t - \tilde{h}_t\|_2, \end{aligned}$$

where the last inequality follows from the definition of $\|\cdot\|_{\mathcal{H}}$. Therefore, $\left| f_t(h_t) - f_t(\tilde{h}_t) \right| \leq \tilde{L} \left( 1 + \frac{1 - \rho}{\rho} \|A\|_2 \right) \|h_t - \tilde{h}_t\|_2$. ∎

We have the following bound on the norm of the gradient: $\|\nabla \tilde{f}_t(x)\|_2 \leq \tilde{L} \left( 1 + \|A\|_2 \right)$.

###### Proof.

We have (DBLP:conf/aaai/RayRDF22)

$$\nabla \tilde{f}_t(x) = \mathbb{E}_{z \sim p_t(x)} \left[ \nabla_x l_t(x, z) + (1 - \rho^t) A^\top \nabla_z l_t(x, z) \right].$$

The result follows because the functions $l_t$ are assumed to be $\tilde{L}$-Lipschitz continuous in $(x, z)$. ∎

For the online performative prediction problem, we have

$$R_T(\mathrm{FTRL}) \leq O\!\left( R_{\mathcal{X}} \tilde{L} L' \, \frac{1 - \rho}{\rho} \sqrt{1 + \frac{\rho}{(1 - \rho^2)^{\frac{3}{2}}}} \; \sqrt{T} \right),$$

where $L' = 1 + \|A\|_2$.

###### Proof.

Choosing the squared $\ell_2$-norm $R(x) = \|x\|_2^2$ as the strongly convex regularizer, we have that for all $x, x' \in \mathcal{X}$,

$$|R(x) - R(x')| \leq R_{\mathcal{X}}^2 = D.$$

Recall the regret bound from the theorem of Section 3. Using the above bound on $D$ and using the three lemmas above to bound the other constants, we obtain the result. ∎

### 4.4 Online Linear Control

#### 4.4.1 Background

Now we use our framework to simplify regret bound derivations for online linear control (agarwal2019online). Existing works model it (and its extensions, such as DBLP:conf/nips/AgarwalHS19; DBLP:conf/alt/HazanKS20; DBLP:conf/aaai/LiD021, to name a few) as an OCO with finite memory problem. This necessitates a lot of truncation and error analysis that constitutes the bulk of existing proofs. On the other hand, this problem fits very naturally into our framework because it is inherently an OCO with unbounded memory problem. We refer the reader to existing literature (agarwal2019online) for more details and background on online linear control. Here, we introduce the basic mathematical setup of the problem.

Let $\mathcal{S}$ denote the state space, $\mathcal{U}$ denote the control space, and $s_t$ and $u_t$ denote the state and control at time $t$, with $s_0$ being the initial state. The system evolves according to the linear dynamics $s_{t+1} = F s_t + G u_t + w_t$, where $F$ and $G$ are matrices of appropriate dimensions and $w_t$ is an adversarially chosen “disturbance” with $\|w_t\| \leq W$. We assume, without loss of generality and for notational convenience, that $s_0 = 0$.
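Unrolling the dynamics makes the unbounded dependence explicit: with $s_0 = 0$, $s_t = \sum_{k=0}^{t-1} F^k (G u_{t-1-k} + w_{t-1-k})$, so every past control enters the current state. The following numerical sketch checks this identity; the matrix names $F$, $G$ and all constants are assumptions for illustration.

```python
import numpy as np

# Forward-simulate s_{t+1} = F s_t + G u_t + w_t from s_0 = 0 and compare
# with the unrolled form s_t = sum_{k<t} F^k (G u_{t-1-k} + w_{t-1-k}).
rng = np.random.default_rng(2)
ds, du, t = 3, 2, 10
F = 0.5 * rng.standard_normal((ds, ds))
G = rng.standard_normal((ds, du))
us = rng.standard_normal((t, du))
ws = rng.standard_normal((t, ds))

s = np.zeros(ds)
for u, w in zip(us, ws):
    s = F @ s + G @ u + w

s_unrolled = sum(
    np.linalg.matrix_power(F, k) @ (G @ us[t - 1 - k] + ws[t - 1 - k])
    for k in range(t)
)
```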

For $t \in \{0, \dots, T-1\}$, let $c_t : \mathcal{S} \times \mathcal{U} \to \mathbb{R}$ be convex loss functions chosen by an oblivious adversary. The functions $c_t$ are assumed to satisfy the following Lipschitz condition: if $\|s\|, \|\tilde{s}\|, \|u\|, \|\tilde{u}\|$ are bounded, then $|c_t(s, u) - c_t(\tilde{s}, \tilde{u})| \leq L \left( \|s - \tilde{s}\| + \|u - \tilde{u}\| \right)$. The goal in online linear control is to choose a sequence of controls to minimize the following regret

$$R_T(\Pi) = \sum_{t=0}^{T-1} c_t(s_t, u_t) - \min_{\pi^* \in \Pi} \sum_{t=0}^{T-1} c_t(s_t^{\pi^*}, u_t^{\pi^*}), \tag{12}$$

where $s_t$ evolves according to the linear dynamics as stated above, $\Pi$ denotes an arbitrary policy class, and $s_t^{\pi^*}$ and $u_t^{\pi^*}$ denote the state and control at time $t$ when the controls are chosen according to the policy $\pi^*$.

Like previous work (agarwal2019online) we focus on the policy class of $(\kappa, \gamma)$-strongly stable linear controllers: matrices $K$ for which there exist matrices $Q$ and $\Lambda$ satisfying $F - G K = Q \Lambda Q^{-1}$ with $\|\Lambda\| \leq 1 - \gamma$ and $\|K\|, \|Q\|, \|Q^{-1}\| \leq \kappa$. Given such a controller $K$, the controls are chosen as linear functions of the current state, i.e., $u_t = -K s_t$. The motivation for considering this policy class comes from the fact that for an infinite horizon problem with quadratic costs, the optimal controller is a fixed linear controller.

Unfortunately, parameterizing directly with a linear controller as