1 Introduction
Online learning models sequential decision-making problems, where a learner must choose a sequence of decisions while interacting with the environment. We refer the reader to cesabianchi2006prediction; shalevshwartz2012online for more background. One of the most popular frameworks in online learning is online convex optimization (OCO) (see, for example, hazan2019oco).
The OCO framework is as follows. In each round, the learner chooses a decision from some convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their current decision. The performance of an algorithm is measured by regret: the difference between the algorithm’s total loss and that of the best fixed decision. This framework has numerous applications, such as prediction from expert advice (littlestone1989weighted; littlestone1994weighted), portfolio selection (cover1991universal), routing (awerbuch2008online), etc. There has been a lot of work on developing a variety of algorithms and proving optimal regret bounds, and we refer the reader to hazan2019oco for a survey. In many applications of the OCO framework, however, the loss of the learner depends not only on the current decision but on the entire history of decisions up to that point. For example, consider the problem of online linear control (agarwal2019online), where in each round the learner chooses a “control policy” (i.e., decision), suffers a loss that is a function of the action taken by the chosen policy and the current state of the system, and the system’s state evolves according to linear dynamics. Note that the current state depends on the entire history of actions and, therefore, the current loss depends not only on the current decision but also on the entire history of decisions. The OCO framework fails to capture such long-term dependence of the current loss on past decisions, as do existing generalizations of OCO that allow the loss to depend on a constant number of past decisions (anava2015online). Although a series of approximation arguments can be used to apply the finite-memory generalization of OCO to the online linear control problem, no existing OCO framework captures the complete long-term dependence of current losses on past decisions.
Contributions In this paper we introduce a generalization of the OCO framework, “Online Convex Optimization with Unbounded Memory”, that allows the loss in the current round to depend on the entire history of decisions until that point. We introduce the notion of the $p$-effective memory capacity, $H_p$, which quantifies the maximum influence of past decisions on present losses. We prove a policy regret bound in terms of $H_1$ and a stronger policy regret bound in terms of $H_p$ under mild additional assumptions about the history space (Section 3). Both bounds use the notion of policy regret with respect to constant action sequences, as defined in dekel2012online. Our bounds are optimal in terms of their dependence on the time horizon $T$, and we show that our results match existing ones for the special case of OCO with finite memory (Section 4.1). Finally, we show the broad applicability of our framework by deriving regret bounds, and simplifying existing regret bound derivations, for online learning problems including an online variant of performative prediction and online linear control (Sections 4.3 and 4.4).
Related work The most closely related work to ours is the OCO with finite memory framework (anava2015online). They consider a generalization of the OCO framework that allows the current loss to depend on a constant number of past decisions. There have been a number of follow-up works that extend the framework in a variety of other ways, such as non-stationarity (zhao2020nonstationary), incorporating switching costs (shi2020online), etc. However, none of the existing works go beyond a constant memory length. In contrast, our framework allows the current loss to depend on an unbounded number of past decisions.
Reinforcement learning (sutton2018reinforcement) is another popular framework for sequential decision-making that considers very general state-action models of feedback and dynamics. In reinforcement learning one typically measures regret with respect to the best state-action policy from some policy class, rather than the best fixed decision as in online learning and OCO. In the special case of linear control, policies can be reformulated as decisions while preserving convexity; we discuss this application in Section 4. Obtaining regret bounds in the general reinforcement learning framework is an active area of research; for example, bhatia2020online provide only non-constructive upper bounds on regret.
We defer discussion of related work for specific applications to Section 4.
2 Framework
Let $T$ denote the time horizon. Let $\mathcal{X}$ be a closed and convex subset of a Hilbert space, $\mathcal{H}$ be a Banach space, $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{X} \to \mathcal{H}$ be linear operators, and $f_t : \mathcal{H} \to \mathbb{R}$ be loss functions chosen by an oblivious adversary. The game between the learner and the adversary proceeds as follows. Let the initial history be $h_0 = 0$. In each round $t \in [T]$, the learner chooses $x_t \in \mathcal{X}$, the history is updated according to $h_t = A h_{t-1} + B x_t$, and the learner suffers loss $f_t(h_t)$.
Consider a strategy that chooses the same action in every round, i.e., $x_t = x$ for some $x \in \mathcal{X}$ and for all rounds $t$. The history after round $t$ for such a strategy is given by $h_t = \sum_{k=0}^{t-1} A^k B x$, which motivates the following definition.
Given a function $f_t : \mathcal{H} \to \mathbb{R}$, the function $\tilde{f}_t : \mathcal{X} \to \mathbb{R}$ is defined by $\tilde{f}_t(x) = f_t\!\left( \sum_{k=0}^{t-1} A^k B x \right)$.
We define the policy regret of a learner as the difference between its total loss and the total loss of a strategy that plays the best fixed action in every round (dekel2012online). [Policy Regret] The policy regret of an algorithm is defined as
\[
R_T \;=\; \sum_{t=1}^{T} f_t(h_t) \;-\; \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x). \qquad (1)
\]
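To make the protocol concrete, the following is a minimal numerical sketch, assuming finite-dimensional stand-ins for $\mathcal{X}$, $\mathcal{H}$, $A$, and $B$; the specific matrices, the quadratic losses, and the random learner are illustrative choices and not part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, T = 2, 4, 50
A = 0.5 * np.eye(n, k=-1)                          # history operator: a damped shift on R^n (illustrative)
B = np.vstack([np.eye(d), np.zeros((n - d, d))])   # embeds a decision into the history space
targets = rng.normal(size=(T, n))                  # parameters of the losses f_t(h) = ||h - g_t||^2

def project(x, radius=1.0):
    """Project onto the Euclidean ball of the given radius (the decision set X)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else radius * x / nrm

def f(t, h):
    return float(np.sum((h - targets[t]) ** 2))

def f_tilde(t, x):
    """Loss of the constant strategy x: its history at (0-indexed) round t is sum_{k=0}^{t} A^k B x."""
    h = sum(np.linalg.matrix_power(A, k) @ (B @ x) for k in range(t + 1))
    return f(t, h)

h = np.zeros(n)
alg_loss = 0.0
for t in range(T):
    x_t = project(rng.normal(size=d))   # an arbitrary learner's decision
    h = A @ h + B @ x_t                 # history update: h_t = A h_{t-1} + B x_t
    alg_loss += f(t, h)                 # learner suffers f_t(h_t)

# Policy regret (Eq. (1)): compare against the best fixed decision, found here by crude random search.
candidates = [project(rng.normal(size=d)) for _ in range(200)]
best_fixed_loss = min(sum(f_tilde(t, x) for t in range(T)) for x in candidates)
print("policy regret:", alg_loss - best_fixed_loss)
```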
2.1 Notation
We use $\| \cdot \|_{\mathcal{X}}$ to denote the norm associated with the Hilbert space $\mathcal{X}$ and $\| \cdot \|_{\mathcal{H}}$ to denote the norm associated with the Banach space $\mathcal{H}$. The operator norm of a linear operator $M$ from a normed space $(\mathcal{U}, \| \cdot \|_{\mathcal{U}})$ to a normed space $(\mathcal{V}, \| \cdot \|_{\mathcal{V}})$ is defined as $\|M\|_{op} = \sup_{u \neq 0} \|M u\|_{\mathcal{V}} / \|u\|_{\mathcal{U}}$. For convenience, we sometimes simply write $\| \cdot \|$ when the meaning is clear from the context. For a finite-dimensional matrix $M$, we use $\|M\|_F$, $\|M\|_{op}$, and $\sigma_k(M)$ to denote its Frobenius norm, operator norm, and $k$-th singular value respectively. When we say that $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, we mean that elements of $\mathcal{H}$ are sequences $h = (h^{(0)}, h^{(1)}, \ldots)$ with $\|h\|_{\mathcal{H}} = \big( \sum_{k \geq 0} \|h^{(k)}\|^p \big)^{1/p}$.

2.2 Assumptions
We make the following assumptions on the loss functions $f_t$ and $\tilde{f}_t$.
- (A1) The functions $f_t$ are $L$-Lipschitz continuous, i.e., $|f_t(h) - f_t(h')| \leq L \, \|h - h'\|_{\mathcal{H}}$ for all $h, h' \in \mathcal{H}$ and all rounds $t$.
- (A2) The functions $\tilde{f}_t$ are differentiable. We use $\nabla \tilde{f}_t(x)$ to denote the gradient of $\tilde{f}_t$ at $x$. Furthermore, the gradients are bounded, i.e., for all $x \in \mathcal{X}$ and all rounds $t$, $\|\nabla \tilde{f}_t(x)\|_{\mathcal{X}} \leq \tilde{L}$.
- (A3) The functions $\tilde{f}_t$ are convex, i.e., for all $x, y \in \mathcal{X}$, $\tilde{f}_t(y) \geq \tilde{f}_t(x) + \langle \nabla \tilde{f}_t(x), \, y - x \rangle$.
Assumptions A1, A2, and A3 are standard in the optimization literature. Given these, we assume the following feedback model for the learner.
- (A4) There exists a gradient oracle: at the end of each round $t$, the learner receives the gradient $\nabla \tilde{f}_t(x_t)$ (see the chain-rule sketch below).
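Although the framework treats the oracle abstractly, one concrete way such an oracle could be implemented, assuming the operators $A$ and $B$ and the gradients $\nabla f_t$ are known (as in the applications of Section 4), is via the chain rule applied to the surrogate loss; the adjoint below reduces to a matrix transpose in finite dimensions:
\[
\nabla \tilde{f}_t(x) \;=\; \Big( \sum_{k=0}^{t-1} A^k B \Big)^{\!*} \, \nabla f_t\Big( \sum_{k=0}^{t-1} A^k B \, x \Big),
\]
since $\tilde{f}_t$ is the composition of $f_t$ with the linear map $x \mapsto \sum_{k=0}^{t-1} A^k B x$.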
Assumption A4 is a standard assumption in the literature on first-order optimization methods. Before presenting our final assumption, we introduce the notion of the $p$-effective memory capacity, which quantifies the maximum influence of past decisions on present losses.
[$p$-Effective Memory Capacity] Consider an online convex optimization with unbounded memory problem specified by the Hilbert space $\mathcal{X}$, Banach space $\mathcal{H}$, and linear operators $A$ and $B$. For $p \in [1, \infty)$, the $p$-effective memory capacity of this problem, $H_p$, is defined as
\[
H_p \;=\; \Big\| \big( k \, \|A^k B\|_{op} \big)_{k \geq 0} \Big\|_p \;=\; \Big( \sum_{k=0}^{\infty} k^p \, \|A^k B\|_{op}^p \Big)^{1/p}. \qquad (2)
\]
When the meaning is clear from the context we simply use $H_p$ to denote the $p$-effective memory capacity. One way to understand this definition is the following. Consider two sequences of decisions whose elements differ by no more than $k \epsilon$ at time $t - k$: $\|x_{t-k} - \tilde{x}_{t-k}\|_{\mathcal{X}} \leq k \epsilon$. Then the histories generated by the two sequences have difference bounded as $\|h_t - \tilde{h}_t\|_{\mathcal{H}} \leq \sum_{k=0}^{t-1} \|A^k B\|_{op} \, k \epsilon \leq H_1 \epsilon$. A similar bound holds with $H_p$ instead of $H_1$ when $\mathcal{H}$ is a sequence space with the $p$-norm. Therefore, the $p$-effective memory capacity is an upper bound on the difference in histories for two sequences of decisions whose difference grows at most linearly in time.
- (A5) The $1$-effective memory capacity is well-defined, i.e., $H_1 < \infty$.
When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, note that $H_p$ is also well-defined under Assumption A5. In this case we obtain stronger regret bounds that depend on $H_p$ instead of $H_1$.
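As a sanity check, the series defining the effective memory capacity can be estimated numerically for finite-dimensional instances by truncating the sum. The sketch below assumes the definition in Eq. (2) and uses an illustrative damped-shift operator; the dimensions and truncation level are arbitrary choices.

```python
import numpy as np

def effective_memory_capacity(A, B, p=1, K=1000):
    """Truncated estimate of H_p = (sum_{k>=0} (k * ||A^k B||_op)^p)^(1/p) from Eq. (2)."""
    total, AkB = 0.0, B.copy()
    for k in range(K):
        total += (k * np.linalg.norm(AkB, ord=2)) ** p   # ord=2 is the matrix operator norm
        AkB = A @ AkB                                    # advance from A^k B to A^{k+1} B
    return total ** (1.0 / p)

d, n = 2, 6
A = 0.5 * np.eye(n, k=-1)                          # damped shift: ||A^k B|| decays geometrically and vanishes for k >= n
B = np.vstack([np.eye(d), np.zeros((n - d, d))])
print("H_1:", effective_memory_capacity(A, B, p=1))
print("H_2:", effective_memory_capacity(A, B, p=2))  # p-norm of the same sequence, so H_2 <= H_1
```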
3 Regret Analysis
In this section we present a regret analysis for OCO with unbounded memory. The approach consists of writing the regret as
\[
\sum_{t=1}^{T} f_t(h_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x)
\;=\;
\underbrace{\sum_{t=1}^{T} \big( f_t(h_t) - \tilde{f}_t(x_t) \big)}_{(a)}
\;+\;
\underbrace{\sum_{t=1}^{T} \tilde{f}_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x)}_{(b)},
\]
where term (b) is controlled using the convexity of $\tilde{f}_t$ (Assumption A3) and the follow-the-regularized-leader (FTRL) algorithm (see the three FTRL lemmas below), which we present in the context of our framework in Algorithm 1. The main novelty comes from bounding term (a) by measuring the difference between the histories $h_t$ and $\sum_{k=0}^{t-1} A^k B x_t$ in terms of the $p$-effective memory capacity (see the final lemma of this section).
Let $R : \mathcal{X} \to \mathbb{R}$ be an $\alpha$-strongly-convex regularizer. Since we assume $\mathcal{X}$ is a closed and convex subset of a Hilbert space, such an $\alpha$-strongly-convex regularizer exists (for example, $R(x) = \frac{\alpha}{2} \|x\|_{\mathcal{X}}^2$). The FTRL algorithm (DBLP:conf/colt/AbernethyHR08; DBLP:conf/colt/Shalev-ShwartzS06; hazan2019oco) chooses $x_{t+1}$ as the minimizer of the past losses plus a regularizer. In our framework, this corresponds to
\[
x_{t+1} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{X}} \; \sum_{s=1}^{t} \tilde{f}_s(x) + \frac{R(x)}{\eta}.
\]
For notational convenience, let for , and .
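The following is a minimal sketch of this update, assuming illustrative quadratic surrogate losses on a Euclidean-ball decision set and the standard quadratic regularizer $R(x) = \tfrac{1}{2}\|x\|^2$ (which is $1$-strongly convex on a Hilbert space); it is not tied to any particular application.

```python
import numpy as np
from scipy.optimize import minimize

d, eta, radius = 2, 0.1, 1.0
rng = np.random.default_rng(1)
centers = []   # parameters of the surrogate losses revealed so far: tilde_f_s(x) = ||x - c_s||^2

def ftrl_step(centers, eta):
    """x_{t+1} = argmin_{x in X} sum_{s<=t} tilde_f_s(x) + R(x) / eta, with R(x) = ||x||^2 / 2."""
    def objective(x):
        past = sum(np.sum((x - c) ** 2) for c in centers)
        return past + 0.5 * np.dot(x, x) / eta
    ball = {"type": "ineq", "fun": lambda x: radius**2 - np.dot(x, x)}  # constraint: x stays in X
    return minimize(objective, x0=np.zeros(d), constraints=[ball]).x

for t in range(1, 6):
    x_t = ftrl_step(centers, eta)        # decision for round t uses only the losses from rounds < t
    centers.append(rng.normal(size=d))   # adversary reveals tilde_f_t after the decision is made
    print(f"round {t}: x_t = {np.round(x_t, 3)}")
```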
Consider an online convex optimization with unbounded memory problem specified by the decision space $\mathcal{X}$, history space $\mathcal{H}$, and linear operators $A$ and $B$. Let the regularizer $R$ be $\alpha$-strongly-convex and satisfy $|R(x)| \leq D$ for all $x \in \mathcal{X}$. Algorithm 1 with step-size $\eta$ satisfies the following regret bound
(3)
If we set the step-size $\eta$ optimally, then the regret bound satisfies
(4)
When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, the above bounds hold with $H_p$ instead of $H_1$.
We prove this theorem using a sequence of lemmas that formalize the approach mentioned at the start of this section. The proofs of the three FTRL lemmas below are standard in the literature on FTRL, but we include them in the supplementary material for completeness (Appendix A).
For all $x \in \mathcal{X}$, $\sum_{t=1}^{T} \tilde{f}_t(x_{t+1}) + \frac{R(x_1)}{\eta} \leq \sum_{t=1}^{T} \tilde{f}_t(x) + \frac{R(x)}{\eta}$.
If the regularizer $R$ satisfies $|R(x)| \leq D$ for all $x \in \mathcal{X}$, then Algorithm 1 with step-size $\eta$ satisfies $\sum_{t=1}^{T} \tilde{f}_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x) \leq \sum_{t=1}^{T} \big( \tilde{f}_t(x_t) - \tilde{f}_t(x_{t+1}) \big) + \frac{2D}{\eta}$.
If the regularizer $R$ is $\alpha$-strongly-convex and satisfies $|R(x)| \leq D$ for all $x \in \mathcal{X}$, then Algorithm 1 with step-size $\eta$ satisfies $\|x_t - x_{t+1}\|_{\mathcal{X}} \leq \frac{\eta \tilde{L}}{\alpha}$ for all rounds $t$, and $\sum_{t=1}^{T} \tilde{f}_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} \tilde{f}_t(x) \leq \frac{\eta \tilde{L}^2 T}{\alpha} + \frac{2D}{\eta}$.
The following lemma bounds the difference between $f_t(h_t)$ and $\tilde{f}_t(x_t)$ in terms of the $p$-effective memory capacity (Section 2.2) using the Lipschitz continuity of $f_t$ (Assumption A1) and the bound on $\|x_t - x_{t+1}\|_{\mathcal{X}}$ from the preceding lemma.
Consider an online convex optimization with unbounded memory problem specified by the decision space $\mathcal{X}$, history space $\mathcal{H}$, and linear operators $A$ and $B$. If the decisions $x_t$ are generated by Algorithm 1, then
\[
\big| f_t(h_t) - \tilde{f}_t(x_t) \big| \;\leq\; \frac{\eta L \tilde{L} H_1}{\alpha} \qquad (5)
\]
for all rounds $t$. When $\mathcal{H}$ is a sequence space with the $p$-norm for $p \in [1, \infty)$, the above bound holds with $H_p$ instead of $H_1$.
Proof.
Proof of the theorem in Section 3.
4 Applications
4.1 Special Case: OCO with Finite Memory
We start by showing how OCO with finite memory (anava2015online) is a special case of our framework.[1] Formally, let $\mathcal{X}$ be closed and convex with $\sup_{x \in \mathcal{X}} \|x\|_{\mathcal{X}} \leq D_{\mathcal{X}}$. Let $\mathcal{H} = \mathcal{X}^m$ with $\|h\|_{\mathcal{H}} = \big( \sum_{i=0}^{m-1} \|h^{(i)}\|_{\mathcal{X}}^p \big)^{1/p}$, where $p \in [1, \infty)$. Define the linear operators $A : \mathcal{H} \to \mathcal{H}$ and $B : \mathcal{X} \to \mathcal{H}$ as $A\big( (h^{(0)}, \ldots, h^{(m-1)}) \big) = (0, h^{(0)}, \ldots, h^{(m-2)})$ and $B(x) = (x, 0, \ldots, 0)$ respectively. In words, the history consists of the past $m$ decisions and gets updated as $h_t = (x_t, x_{t-1}, \ldots, x_{t-m+1})$. Note that $\mathcal{H}$ is a sequence space with the $p$-norm. Therefore, we can apply the regret bound of Section 3 with $H_p$ in place of $H_1$. It remains to bound $H_p$.

[1] The only difference is that the summation in our definition of regret starts at $t = 1$ whereas it starts at $t = m$ in the OCO with finite memory framework. However, this difference only adds a term that is a constant in the OCO with finite memory framework (where the memory length $m$ is constant), so we can ignore it.
By definition, $A^k B(x)$ is the history with $x$ in position $k$ and zeros elsewhere, so $\|A^k B\|_{op} = 1$ for $0 \leq k \leq m - 1$ and $\|A^k B\|_{op} = 0$ for $k \geq m$. Therefore, $H_p = \big( \sum_{k=0}^{m-1} k^p \big)^{1/p} = O\big( m^{1 + 1/p} \big)$. Plugging this into the regret bound of Section 3, we obtain a regret bound that matches existing results (anava2015online).
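A short numerical check of this computation, assuming the block-shift construction above and the definition of $H_p$ in Eq. (2); the dimension $d$ and memory length $m$ are arbitrary illustrative values.

```python
import numpy as np

d, m, p = 2, 5, 2
# A drops the oldest of the m stored decisions and shifts the rest; B writes the new
# decision into the first block. Both act on the stacked history vector in R^{m*d}.
A = np.zeros((m * d, m * d))
for i in range(m - 1):
    A[(i + 1) * d:(i + 2) * d, i * d:(i + 1) * d] = np.eye(d)
B = np.vstack([np.eye(d), np.zeros(((m - 1) * d, d))])

norms = [np.linalg.norm(np.linalg.matrix_power(A, k) @ B, ord=2) for k in range(2 * m)]
print("||A^k B||_op:", np.round(norms, 3))   # 1 for k < m, then 0 for k >= m
H_p = sum((k * nrm) ** p for k, nrm in enumerate(norms)) ** (1 / p)
print("H_p:", H_p, "  (sum_{k<m} k^p)^(1/p):", sum(k**p for k in range(m)) ** (1 / p))
```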
4.2 OCO with Infinite Memory
We now show how our framework can model OCO with infinite memory problems that are not modelled by existing works. Formally, let $\mathcal{X}$ be closed and convex with $\sup_{x \in \mathcal{X}} \|x\|_{\mathcal{X}} \leq D_{\mathcal{X}}$. Let $\mathcal{H}$ consist of sequences $h = (h^{(0)}, h^{(1)}, \ldots)$ of elements of the Hilbert space containing $\mathcal{X}$ satisfying $\|h\|_{\mathcal{H}} = \big( \sum_{k \geq 0} \|h^{(k)}\|_{\mathcal{X}}^p \big)^{1/p} < \infty$. Let $\lambda \in (0, 1)$. Define the linear operators $A$ and $B$ as $A\big( (h^{(0)}, h^{(1)}, \ldots) \big) = (0, \lambda h^{(0)}, \lambda h^{(1)}, \ldots)$ and $B(x) = (x, 0, 0, \ldots)$ respectively. Note that $\mathcal{H}$ is a sequence space with the $p$-norm. Therefore, we can apply the regret bound of Section 3 with $H_p$ in place of $H_1$. It remains to bound $H_p$.
By definition, we have $\|A^k B\|_{op} = \lambda^k$. Therefore, $H_p = \big( \sum_{k=0}^{\infty} k^p \lambda^{kp} \big)^{1/p} < \infty$; in particular, $H_1 = \sum_{k=0}^{\infty} k \lambda^k = \frac{\lambda}{(1 - \lambda)^2}$.
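A quick numerical check of this calculation, assuming the scalar-decay construction above (the value of $\lambda$ is arbitrary):

```python
lam = 0.8
H1_truncated = sum(k * lam**k for k in range(10_000))   # truncated series sum_k k * ||A^k B||
H1_closed_form = lam / (1 - lam) ** 2
print(H1_truncated, H1_closed_form)                     # both approximately 20.0
```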
A useful extension of the above is when the elements of $\mathcal{X}$ are matrices and the linear operator $A$ is defined via multiplication by a matrix $W$ as $A\big( (h^{(0)}, h^{(1)}, \ldots) \big) = (0, W h^{(0)}, W h^{(1)}, \ldots)$, where $W$ satisfies $\|W\|_{op} \leq \lambda < 1$. An even more general extension of the above is the crux of our simplified regret bounds for online linear control (Section 4.4).
4.3 Online Performative Prediction
4.3.1 Background
In many applications of machine learning the algorithm’s decisions influence the data distribution, for example, online labor markets
(DBLP:conf/kdd/Anagnostopoulos18; horton2010online), predictive policing (lum2016predict), on-street parking (DBLP:journals/tits/DowlingRZ20; pierce2018sfpark), and vehicle sharing markets (banerjee2015pricing), to name a few. Motivated by such applications, several works have studied the problem of performative prediction, which models the data distribution as a function of the decision-maker’s decision (DBLP:conf/icml/PerdomoZMH20; DBLP:conf/nips/Mendler-DunnerP20; DBLP:conf/icml/MillerPZ21; DBLP:conf/aistats/BrownHK22; DBLP:conf/aaai/RayRDF22). These works view the problem as a stochastic optimization problem; we refer the reader to these citations for more details. As a natural extension of existing works, we introduce an online learning variant of performative prediction with geometric decay (DBLP:conf/aaai/RayRDF22) that differs from the original formulation in a few key ways.

Let $\Theta$ denote the decision set with $\sup_{\theta \in \Theta} \|\theta\|_2 \leq D_{\Theta}$. Let $p_0$ denote the initial data distribution over the instance space $\mathcal{Z}$. In each round $t$, the learner chooses a decision $\theta_t \in \Theta$ and an oblivious adversary chooses a loss function $f_t$, and then the learner suffers the loss $\mathbb{E}_{z \sim p_t}[f_t(z; \theta_t)]$, where $p_t$ is the data distribution in round $t$. We also use the shorthand $f_t(p_t, \theta_t) = \mathbb{E}_{z \sim p_t}[f_t(z; \theta_t)]$.
The goal in our online learning setting is to minimize the difference between the algorithm’s total loss and the total loss of the best fixed decision,
as a natural generalization of performative optimality (DBLP:conf/icml/PerdomoZMH20) for our online learning formulation.
We make the following assumptions about the distributions and loss functions. The data distribution satisfies $p_t = (1 - \gamma)\, p_{t-1} + \gamma\, \mathcal{D}(\theta_t)$ for all $t \geq 1$, where $\gamma \in (0, 1]$ and $\mathcal{D}(\theta)$ is a distribution over $\mathcal{Z}$ that depends on the decision $\theta$. The distributions $\mathcal{D}(\theta)$ are assumed to belong to the following location-scale family: $z \sim \mathcal{D}(\theta)$ iff $z = W \theta + \zeta$, where $W$ is a fixed matrix with bounded operator norm and $\zeta$ is a random variable with mean $\mu$ and covariance $\Sigma$. The loss functions $f_t$ are assumed to be convex, differentiable, and Lipschitz continuous.

Our problem formulation differs from existing work in the following ways. First, we adopt an online learning perspective, whereas DBLP:conf/aaai/RayRDF22 adopt a stochastic optimization perspective. Therefore, we assume that the loss functions $f_t$ are adversarially chosen, whereas DBLP:conf/aaai/RayRDF22 assume they are fixed. Second, our gradient oracle (Assumption A4) assumes that the dynamics (i.e., $\gamma$ and $\mathcal{D}$) are known, whereas DBLP:conf/aaai/RayRDF22 estimate the gradient using samples from this distribution. We believe that an appropriate extension of our framework that can deal with unknown linear operators $A$ and $B$ can be applied to this more difficult setting, and we leave this as future work.
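To illustrate the dynamics, the following sketch simulates sampling from $p_t$ by unrolling the recursion $p_t = (1 - \gamma) p_{t-1} + \gamma \mathcal{D}(\theta_t)$ into a geometric mixture over past decisions; the matrix $W$, the noise law, the value of $\gamma$, and the decision sequence are all illustrative assumptions rather than quantities from the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, gamma, T = 2, 0.3, 30
W = 0.5 * np.eye(dim)    # location-scale map of D(theta): z = W @ theta + zeta

def sample_p_t(thetas, n_samples=1000):
    """Sample from p_t = gamma * sum_k (1-gamma)^k D(theta_{t-k}) + (1-gamma)^t p_0."""
    t = len(thetas)
    ks = rng.geometric(gamma, size=n_samples) - 1   # mixture index k with P(k) = gamma*(1-gamma)^k
    out = np.empty((n_samples, dim))
    for i, k in enumerate(ks):
        if k < t:
            out[i] = W @ thetas[t - 1 - k] + rng.normal(size=dim)   # draw from D(theta_{t-k})
        else:
            out[i] = rng.normal(size=dim)                           # fall back to p_0 = N(0, I)
    return out

thetas = [rng.uniform(-1.0, 1.0, size=dim) for _ in range(T)]   # an arbitrary decision sequence
samples = sample_p_t(thetas)
print("empirical mean of z under p_T:", np.round(samples.mean(axis=0), 3))
```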
Before formulating this problem in our OCO with unbounded memory framework, we state the definition of the $1$-Wasserstein distance that we use in our regret analysis. Informally, the $1$-Wasserstein distance is a measure of the distance between two probability measures.

[1-Wasserstein Distance] Let $(\mathcal{Z}, d)$ be a metric space. Let $\mathcal{P}_1(\mathcal{Z})$ denote the set of Radon probability measures on $\mathcal{Z}$ with finite first moment. That is, for every $\mu \in \mathcal{P}_1(\mathcal{Z})$ there exists $z_0 \in \mathcal{Z}$ such that $\mathbb{E}_{z \sim \mu}[d(z, z_0)] < \infty$. The $1$-Wasserstein distance between two probability measures $\mu, \nu \in \mathcal{P}_1(\mathcal{Z})$ is defined as
\[
W_1(\mu, \nu) \;=\; \sup_{g} \; \mathbb{E}_{z \sim \mu}[g(z)] - \mathbb{E}_{z \sim \nu}[g(z)],
\]
where the supremum is taken over all $1$-Lipschitz continuous functions $g : \mathcal{Z} \to \mathbb{R}$.
4.3.2 Formulation as OCO with Unbounded Memory
Let $\mathcal{X} = \Theta$ be closed and convex with $\sup_{\theta \in \Theta} \|\theta\|_2 \leq D_{\Theta}$. Let $\mathcal{H}$ consist of sequences $h = (h^{(0)}, h^{(1)}, \ldots)$, where $h^{(k)} \in \mathbb{R}^d$ and the entries satisfy $\sum_{k \geq 0} \|h^{(k)}\|_2 < \infty$, with $\|h\|_{\mathcal{H}} = \sum_{k \geq 0} \|h^{(k)}\|_2$.
Define the linear operators $A$ and $B$ as $A\big( (h^{(0)}, h^{(1)}, \ldots) \big) = (0, (1 - \gamma) h^{(0)}, (1 - \gamma) h^{(1)}, \ldots)$ and $B(\theta) = (\theta, 0, 0, \ldots)$ respectively. Note that $\mathcal{H}$ is a sequence space with the $1$-norm.
By definition, given a sequence of decisions $(\theta_1, \ldots, \theta_t)$, the history at the end of round $t$ is given by $h_t = \big( \theta_t, (1 - \gamma)\theta_{t-1}, \ldots, (1 - \gamma)^{t-1}\theta_1, 0, \ldots \big)$. Define the functions $f_t : \mathcal{H} \to \mathbb{R}$ by $f_t(h_t) = \mathbb{E}_{z \sim p(h_t)}[f_t(z; \theta_t)]$, where $p(h_t)$ is the distribution that satisfies $z \sim p(h_t)$ iff
(11)
Note that the distribution $p(h_t)$ is a function of the history $h_t$. The above follows from the recursive definition of $p_t$ and the parametric assumption about $\mathcal{D}(\theta)$.
With the above formulation and definition of $f_t$, the original goal of minimizing the difference between the algorithm’s total loss and the total loss of the best fixed decision is equivalent to minimizing the policy regret defined in Eq. (1).
4.3.3 Regret Analysis
In order to apply our regret bounds, we must bound $H_1$, the Lipschitz continuity constant $L$ (Assumption A1), and the norm of the gradients $\tilde{L}$ (Assumption A2).
We have $H_1 \leq \frac{1 - \gamma}{\gamma^2}$.
Proof.
Note that $\|A^k B\|_{op} = (1 - \gamma)^k$ and $\sum_{k=0}^{\infty} k (1 - \gamma)^k = \frac{1 - \gamma}{\gamma^2}$. Therefore, $H_1 \leq \frac{1 - \gamma}{\gamma^2}$. ∎
We have the following bound on the Lipschitz continuity constant: .
Proof.
Let and be two sequences of decisions, where . Let and be the corresponding histories, and and be the corresponding distributions at the end of round . We have
where the last inequality follows from the assumptions about the functions and the definition of the Wasserstein distance . By definition of Eq. 11, we have that
where the last inequality follows from the definition of . Therefore, . ∎
We have the following bound on the norm of the gradient: .
Proof.
We have (DBLP:conf/aaai/RayRDF22)
The result follows because the functions $f_t$ are assumed to be Lipschitz continuous. ∎
For the online performative prediction problem, we have
where .
Proof.
Choosing $R(\theta) = \frac{1}{2} \|\theta\|_2^2$ as the $1$-strongly-convex regularizer, we have that $|R(\theta)| \leq \frac{1}{2} D_{\Theta}^2$ for all $\theta \in \Theta$.
Recall the regret bound from Section 3. Using the above bound on the regularizer and using the three lemmas of Section 4.3.3 to bound the other constants, we obtain the result. ∎
4.4 Online Linear Control
4.4.1 Background
Now we use our framework to simplify regret bound derivations for online linear control (agarwal2019online). Existing works model it (and its extensions, such as DBLP:conf/nips/AgarwalHS19; DBLP:conf/alt/HazanKS20; DBLP:conf/aaai/LiD021, to name a few) as an OCO with finite memory problem. This necessitates a lot of truncation and error analysis that constitutes the bulk of existing proofs. On the other hand, this problem fits very naturally into our framework because it is inherently an OCO with unbounded memory problem. We refer the reader to existing literature (agarwal2019online) for more details and background on online linear control. Here, we introduce the basic mathematical setup of the problem.
Let $\mathbb{R}^{d_x}$ denote the state space, $\mathbb{R}^{d_u}$ denote the control space, and $x_t$ and $u_t$ denote the state and control at time $t$, with $x_0$ being the initial state. The system evolves according to the linear dynamics $x_{t+1} = A x_t + B u_t + w_t$, where $A$ and $B$ are matrices satisfying $\|A\|_{op}, \|B\|_{op} \leq \kappa$ and $w_t$ is an adversarially chosen “disturbance” with $\|w_t\|_2 \leq W$. We assume, without loss of generality and for notational convenience, that $x_0 = 0$.
For $t \in [T]$, let $c_t : \mathbb{R}^{d_x} \times \mathbb{R}^{d_u} \to \mathbb{R}$ be convex loss functions chosen by an oblivious adversary. The functions are assumed to satisfy the following Lipschitz condition: if $\|x\|_2, \|u\|_2 \leq D$, then $\|\nabla_x c_t(x, u)\|_2, \|\nabla_u c_t(x, u)\|_2 \leq G D$. The goal in online linear control is to choose a sequence of controls to minimize the following regret
\[
R_T \;=\; \sum_{t=1}^{T} c_t(x_t, u_t) \;-\; \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t\big( x_t^{\pi}, u_t^{\pi} \big), \qquad (12)
\]
where $x_t$ evolves according to the linear dynamics as stated above, $\Pi$ denotes an arbitrary policy class, and $x_t^{\pi}$ and $u_t^{\pi}$ denote the state and control at time $t$ when the controls are chosen according to the policy $\pi$.
Like previous work (agarwal2019online) we focus on the policy class of $(\kappa, \rho)$-strongly stable linear controllers, $\Pi = \{ K : K \text{ is } (\kappa, \rho)\text{-strongly stable} \}$, where a controller $K$ is $(\kappa, \rho)$-strongly stable if it satisfies $A - B K = Q L Q^{-1}$ with $\|L\|_{op} \leq 1 - \rho$ and $\|Q\|_{op}, \|Q^{-1}\|_{op} \leq \kappa$. Given such a controller, the controls are chosen as linear functions of the current state, i.e., $u_t = -K x_t$. The motivation for considering this policy class comes from the fact that for an infinite-horizon problem with quadratic costs, the optimal controller is a fixed linear controller.
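The following is a minimal simulation sketch of this setup, with illustrative system matrices, disturbances, cost, and controller (none of which are taken from the cited works); it also illustrates why the loss at time $t$ depends on the entire history of controls rather than a constant number of them.

```python
import numpy as np

rng = np.random.default_rng(0)
dx, du, T = 3, 2, 25
A = 0.9 * np.eye(dx)                        # stable dynamics matrix (illustrative)
B = rng.normal(size=(dx, du)) / np.sqrt(dx)
K = 0.1 * rng.normal(size=(du, dx))         # a fixed linear controller: u_t = -K x_t

x = np.zeros(dx)                            # x_0 = 0
total_cost = 0.0
for t in range(T):
    u = -K @ x
    w = rng.uniform(-0.1, 0.1, size=dx)     # bounded "disturbance" (adversarial in the model)
    total_cost += float(np.sum(x**2) + np.sum(u**2))   # an illustrative quadratic cost c_t(x_t, u_t)
    x = A @ x + B @ u + w                   # closed-loop update: x_{t+1} = (A - B K) x_t + w_t
print("total cost of the fixed controller K:", total_cost)
```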
Unfortunately, parameterizing directly with a linear controller as