# Online Control with Adversarial Disturbances

We study the control of a linear dynamical system with adversarial disturbances (as opposed to statistical noise). The objective we consider is one of regret: we desire an online control procedure that can do nearly as well as a procedure that has full knowledge of the disturbances in hindsight. Our main result is an efficient algorithm that provides nearly tight regret bounds for this problem. From a technical standpoint, this work generalizes previous work in two main aspects: our model allows for adversarial noise in the dynamics, and allows for general convex costs.


## 1 Introduction

This paper studies the robust control of linear dynamical systems. A linear dynamical system is governed by the dynamics equation

$$x_{t+1} = A x_t + B u_t + w_t, \tag{1.1}$$

where $x_t$ is the state, $u_t$ is the control and $w_t$ is a disturbance to the system. At every time step $t$, the controller suffers a cost $c(x_t, u_t)$ to enforce the control. In this paper, we consider the setting of online control with arbitrary disturbances. Formally, the setting involves, at every time step $t$, an adversary selecting a convex cost function $c_t(x, u)$ and a disturbance $w_t$, and the goal of the controller is to generate a sequence of controls $u_t$ such that a sequence of convex costs $c_t(x_t, u_t)$ is minimized.

The above setting generalizes a fundamental problem in control theory (including the Linear Quadratic Regulator) which has been studied over several decades, surveyed below. However, despite the significant research literature on the problem, our generalization and results address several challenges that have remained.

Challenge 1. Perhaps the most important challenge we address is in dealing with arbitrary disturbances in the dynamics. This is a difficult problem, and so standard approaches almost exclusively assume i.i.d. Gaussian noise. Worst-case approaches in the control literature, also known as $H_\infty$-control and its variants, are overly pessimistic. Instead, we take an online (adaptive) approach to dealing with adversarial disturbances.

Challenge 2. Another limitation for efficient methods is the classical assumption that the costs are quadratic, as is the case for the linear quadratic regulator. Part of the focus in the literature on the quadratic costs is due to special properties that allow for efficient computation of the best linear controller in hindsight. One of our main goals is to introduce a more general technique that allows for efficient algorithms even when faced with arbitrary convex costs.

##### Our contributions.

In this paper, we tackle both challenges outlined above: coping with adversarial noise, and general loss functions in an online setting. For this we turn to the time-trusted methodology of regret minimization in online learning. In the field of online learning, regret minimization is known to be more robust and general than statistical learning, and a host of convex relaxation techniques are readily available. To define the performance metric, denote for any control algorithm $\mathcal{A}$,

$$J_T(\mathcal{A}) = \sum_{t=1}^{T} c_t(x_t, u_t).$$

The standard comparator in control is a linear controller, which generates a control signal as a linear function of the state, i.e. $u_t = -K x_t$. Let $J_T(K)$ denote the cost of a linear controller $K$ from a certain class $\mathcal{K}$. For an algorithm $\mathcal{A}$, we define the regret as the sub-optimality of its cost with respect to the best linear controller from a certain set

$$\text{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$

Our main result is an efficient algorithm for control which achieves $O(\sqrt{T})$ regret in the setting described above. A similar setting has been considered in the literature before [9], but our work generalizes previous work in the following ways:

1. Our algorithm achieves $O(\sqrt{T})$ regret even in the presence of bounded adversarial disturbances. Previous regret bounds needed to assume that the disturbances $w_t$ are drawn from a distribution with zero mean and bounded variance.

2. Our regret bounds apply to any sequence of adversarially chosen convex loss functions. Previous efficient algorithms applied to convex quadratic costs only.

Our results above are obtained using a host of techniques from online learning and online convex optimization, notably online learning for loss functions with memory and improper learning using convex relaxation.

## 2 Related Work

##### Online Learning:

Our approach stems from the study of regret minimization in online learning; this paper advocates for worst-case regret as a robust performance metric in the presence of adversarial noise. A special case of our study is that of regret minimization in stateless (with $A = 0$) systems, which is a well studied problem in machine learning. See books and surveys on the subject [8, 15, 20]. Of particular interest to our study is the setting of online learning with memory [4].

##### Learning and Control in Linear Dynamical Systems:

The modern setting for linear dynamical systems arose in the seminal work of Kalman [18], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system in latent-state systems. The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis; indeed, the term Kalman filter model is often used interchangeably with LDS. We refer the reader to the classic survey [19], and the extensive overview of recent literature in [14]. Most of this literature, as well as most of classical control theory, deals with zero-mean random noise, mostly Normally distributed.

Recently, there has been a renewed interest in learning both fully-observable & latent-state linear dynamical systems. Sample complexity and regret bounds (for Gaussian noise) were obtained in [2, 1]. The fully-observable and convex cases were revisited in [10, 21]. The technique of spectral filtering for learning and controlling non-observable systems was introduced and studied in [16, 6, 17]. Provable control in the Gaussian noise setting was also studied in [13].

##### Robust Control:

The most notable attempts to handle adversarial perturbations in the dynamics are called $H_\infty$ control [25, 22]. In this setting, the controller solves for the best linear controller assuming worst case noise to come, i.e.

$$\min_{K_1} \max_{\varepsilon_{1:T}} \min_{K_2} \cdots \min_{K_t} \max_{\varepsilon_T} \sum_t c_t(x_t, u_t),$$

assuming similar linear dynamics as in equation (1.1). In comparison, we do not solve for the entire noise trajectory in advance, but adjust for it iteratively. Another difference is computational: the above mathematical program may be hard to compute for general cost functions, as compared to our efficient gradient-based algorithm.

##### Non-stochastic MDPs:

The setting we consider, control in systems with linear transition dynamics [7] in the presence of adversarial disturbances, can be cast as that of planning in an adversarially changing MDP [5, 11]. The results obtained via this reduction are unsatisfactory because these regret bounds scale with the size of the state space, which is usually exponential in the dimension of the system. In addition, the regret in these scales as $T^{2/3}$. In comparison, [24, 12] solve the online planning problem for MDPs with fixed dynamics and changing costs. The satisfying aspect of their result is that the regret bound does not explicitly depend on the size of the state space, and scales as $\sqrt{T}$. However, the dynamics are fixed and without (adversarial) noise.

##### LQR with changing costs:

For the Linear Quadratic Regulator problem, [9] consider changing quadratic costs with stochastic noise to get an $O(\sqrt{T})$ regret bound. This work is well aligned with our results, and the present paper employs some notions developed therein (e.g. strong stability). However, the techniques used in [9] (e.g. the SDP formulation for a linear controller) are strongly reliant on the quadratic nature of the cost functions and stochasticity of the disturbances. In particular, even for the offline problem, to the best of our knowledge, there does not exist an SDP formulation to determine the best linear controller for convex losses. In an earlier work, [3] considers a more restricted setting with fixed, deterministic dynamics (hence, noiseless) and changing quadratic costs.

## 3 Problem Setting

### 3.1 Interaction Model

The Linear Dynamical System is a Markov decision process on continuous state and action spaces, with linear transition dynamics. In each round $t$, the learner outputs an action $u_t$ on observing the state $x_t$ and incurs a cost of $c_t(x_t, u_t)$, where $c_t$ is convex. The system then transitions to a new state $x_{t+1}$ according to

$$x_{t+1} = A x_t + B u_t + w_t.$$

In the above definition, $w_t$ is the disturbance the system suffers at time step $t$. In this paper, we make no distributional assumptions on $w_t$. The sequence $w_t$ is not made known to the learner in advance.

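For any algorithm $\mathcal{A}$, the cost we attribute to it is

$$J_T(\mathcal{A}) = \sum_{t=1}^{T} c_t(x_t, u_t),$$

where $x_{t+1} = A x_t + B u_t + w_t$ and $u_t = \mathcal{A}(x_1, \dots, x_t)$. With some abuse of notation, we shall use $J_T(K)$ to denote the cost of a linear controller $K$ which chooses the action as $u_t = -K x_t$.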
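As a concrete illustration of the interaction model, the following minimal sketch rolls out a scalar system under a linear controller and accumulates the cost. The scalar dynamics, the quadratic cost, and the helper names `rollout_linear_controller` and `quadratic_cost` are assumptions made for this example only; the setting above allows matrices and arbitrary convex costs.

```python
def rollout_linear_controller(A, B, K, disturbances, cost):
    """Roll out x_{t+1} = A x_t + B u_t + w_t with u_t = -K x_t, starting at x_0 = 0."""
    x, total, states = 0.0, 0.0, []
    for t, w in enumerate(disturbances):
        u = -K * x                  # linear controller acts on the observed state
        total += cost(t, x, u)      # c_t(x_t, u_t) is charged for this round
        states.append(x)
        x = A * x + B * u + w       # the disturbance w_t enters the transition
    return total, states


def quadratic_cost(t, x, u):
    # Hypothetical time-invariant cost; any convex c_t could be plugged in here.
    return x * x + u * u


J, states = rollout_linear_controller(1.0, 1.0, 0.5, [1.0, -1.0, 1.0, 1.0], quadratic_cost)
print(J, states)
```

The disturbance list plays the role of the adversary's choices; nothing in the rollout assumes it was drawn from a distribution.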

### 3.2 Assumptions

We make the following assumptions throughout the paper. We remark that they are less restrictive, and hence allow for more general systems, than those considered by previous works. In particular, we allow for adversarial (rather than i.i.d. stochastic) noise, and convex cost functions. Also, the non-stochastic nature of the disturbances permits, without loss of generality, the assumption that $x_0 = 0$.

###### Assumption 3.1.

The matrices that govern the dynamics are bounded, i.e., $\|A\| \le \kappa_A$, $\|B\| \le \kappa_B$. The perturbation introduced per time step is bounded, i.e., $\|w_t\| \le W$.

###### Assumption 3.2.

The costs $c_t(x, u)$ are convex. Further, as long as it is guaranteed that $\|x\|, \|u\| \le D$, it holds that

$$|c_t(x, u)| \le \beta D^2, \qquad \|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G D.$$

Following the definitions in [9], we work on the following class of linear controllers.

###### Definition 3.3.

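A linear policy $K$ is $(\kappa, \gamma)$-strongly stable if there exist matrices $L, Q$ satisfying $A - BK = Q L Q^{-1}$, such that the following two conditions are met:

1. The spectral norm of $L$ is strictly smaller than unity, i.e., $\|L\| \le 1 - \gamma$.

2. The controller and the transforming matrices are bounded, i.e., $\|K\| \le \kappa$ and $\|Q\|, \|Q^{-1}\| \le \kappa$.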

### 3.3 Regret Formulation

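Let $\mathcal{K} = \{K : K \text{ is } (\kappa, \gamma)\text{-strongly stable}\}$. For an algorithm $\mathcal{A}$, the regret is the sub-optimality of its cost with respect to the best linear controller in $\mathcal{K}$:

$$\text{Regret} = J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K).$$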
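To make the comparator concrete, the sketch below evaluates the regret of an arbitrary policy on a scalar system against the best linear controller taken from a finite grid. The grid `K_grid` is a stand-in (an assumption of this example) for the class of strongly stable controllers, and the quadratic cost is likewise illustrative.

```python
def rollout_cost(A, B, policy, disturbances):
    """Total cost of x^2 + u^2 along x_{t+1} = A x_t + B u_t + w_t from x_0 = 0."""
    x, total = 0.0, 0.0
    for w in disturbances:
        u = policy(x)
        total += x * x + u * u
        x = A * x + B * u + w
    return total


def regret_against_grid(A, B, policy, disturbances, K_grid):
    """J_T(policy) minus the cost of the best u = -K x with K taken from K_grid."""
    J_alg = rollout_cost(A, B, policy, disturbances)
    J_best = min(
        rollout_cost(A, B, lambda x, K=K: -K * x, disturbances) for K in K_grid
    )
    return J_alg - J_best


disturbances = [1.0, -1.0, 1.0, 1.0, -1.0, 1.0]
K_grid = [0.1 * k for k in range(1, 10)]   # hypothetical finite comparator class
lazy_regret = regret_against_grid(1.0, 1.0, lambda x: 0.0, disturbances, K_grid)
member_regret = regret_against_grid(1.0, 1.0, lambda x: -K_grid[4] * x, disturbances, K_grid)
print(lazy_regret, member_regret)
```

The minimum is taken in hindsight, after the whole disturbance sequence is known, which is what makes regret a meaningful benchmark for an online procedure.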

### 3.4 Proof Techniques and Overview

##### Choice of Policy Class:

We begin by parameterizing the policy we execute at every step as a linear function of the disturbances in the past in Definition 4.1. A similar parameterization has been considered in the system level synthesis framework (see [23]). This leads to a convex relaxation of the problem. Optimization on alternative parameterizations, including an SDP-based framework [9] or a direct parameterization [13], has been studied in the literature, but these seem unable to capture general convex functions as well as adversarial disturbances, or lead to a non-convex loss. To avoid a linear dependence on time for the number of parameters in our policy, we additionally include a stable linear controller in our policy, allowing us to effectively consider only the $H$ most recent perturbations. Lemma 5.2 makes this notion of approximation precise.

##### Reduction to OCO with memory:

The choice of the policy class with an appropriately chosen horizon $H$ allows us to reduce the problem to competing with functions with truncated memory. This naturally falls under the class of online convex optimization with memory (see Section 4.5). Theorem 5.3 makes this reduction precise. Finally, to bound the regret on truncated functions we use the Online Gradient Descent based approach specified in [4], which requires a bound on Lipschitz constants which we provide in Section 5.3.1. This reduction is inspired by the ideas introduced in [12].

The next section provides the suite of definitions and notation required to define our algorithm and regret bounds. Section 5 contains our main algorithm (Algorithm 1) and the main regret guarantee (Theorem 5.1), followed by the proof and the requisite lemmas with their respective proofs.

## 4 Preliminaries

In this section, we establish some important definitions that will prove useful throughout the paper.

### 4.1 Notation

We reserve the letters $x, y$ for states and $u, v$ for control actions. We denote by $d = \max(\dim(x), \dim(u))$ a bound on the dimensionality of the problem. We reserve capital letters $A, B, K, M$ for matrices associated with the system and the policy. Other capital letters are reserved for universal constants in the paper.

### 4.2 A Disturbance-Action Policy Class

We put forth the notion of a disturbance-action controller which chooses the action as a linear map of the past disturbances. Any disturbance-action controller ensures that the state of a system executing such a policy may be expressed as a linear function of the parameters of the policy. This property is convenient in that it permits efficient optimization over the parameters of such a policy. The situation may be contrasted with that of a linear controller. While the action recommended by a linear controller is also linear in past disturbances (a consequence of being linear in the current state), the state sequence produced on the execution of a linear policy is not a linear function of its parameters.

###### Definition 4.1 (Disturbance-Action Policy).

A disturbance-action policy $\pi(M, K)$ is specified by parameters $M = (M^{[1]}, \dots, M^{[H]})$ and a fixed matrix $K$. At every time $t$, such a policy chooses the recommended action $u_t$ at a state $x_t$, defined as

$$u_t = -K x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i}.$$

(The state $x_t$ is completely determined given $w_0, \dots, w_{t-1}$. Hence, the use of $x_t$ only serves to ease presentation.) For notational convenience, here it may be considered that $w_i = 0$ for all $i < 0$.

We refer to the policy played at time $t$ as $M_t = \{M_t^{[i]}\}$, where the subscript $t$ refers to the time index and the superscript $[i]$ refers to the action of $M_t$ on $w_{t-i}$. Note that such a policy can be executed because $w_{t-1}$ is perfectly determined on the specification of $x_t$ as $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$. It shall be established in later sections that such a policy class can approximate any linear policy with a strongly stable matrix in terms of the total cost suffered.
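The two ingredients of this policy class, computing the action from past disturbances and recovering each disturbance from consecutive states, can be sketched for a scalar system as follows. The function names and the convention that `past_w[0]` holds the most recent disturbance are choices made for this example.

```python
def recover_disturbance(A, B, x_prev, u_prev, x_now):
    """w_{t-1} = x_t - A x_{t-1} - B u_{t-1}: the controller never observes w directly."""
    return x_now - A * x_prev - B * u_prev


def dac_action(K, M, x, past_w):
    """u_t = -K x_t + sum_i M[i] w_{t-i}; past_w[0] is w_{t-1}, missing entries count as 0."""
    u = -K * x
    for i, m in enumerate(M):        # M[i] plays the role of M^{[i+1]}
        if i < len(past_w):
            u += m * past_w[i]
    return u


def run_dac(A, B, K, M, disturbances):
    x, past_w, actions, recovered = 0.0, [], [], []
    for w in disturbances:
        u = dac_action(K, M, x, past_w)
        x_next = A * x + B * u + w                       # environment step
        w_hat = recover_disturbance(A, B, x, u, x_next)  # what the learner can infer
        past_w.insert(0, w_hat)
        actions.append(u)
        recovered.append(w_hat)
        x = x_next
    return actions, recovered


actions, recovered = run_dac(1.0, 1.0, 0.5, [-0.5, 0.25], [1.0, 2.0, -1.0])
print(actions, recovered)
```

Because the dynamics are known, the recovered sequence coincides with the true disturbances, which is what makes the policy executable.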

### 4.3 Evolution of State

In this section, we reason about the evolution of the state of a linear dynamical system under a non-stationary policy $\pi = (\pi_0, \dots, \pi_{T-1})$ composed of $T$ policies, where each $\pi_t$ is specified by $\pi_t(M_t = (M_t^{[1]}, \dots, M_t^{[H]}), K)$. Again, with some abuse of notation, we shall use $\pi((M_0, \dots, M_{T-1}), K)$ to denote such a non-stationary policy.

The following definitions serve to ease the burden of notation.

1. Define $\tilde{A}_K = A - BK$. $\tilde{A}_K$ shall be helpful in describing the evolution of state starting from a non-zero state in the absence of disturbances.

2. $x_t^K(M_0, \dots, M_{t-1})$ is the state attained by the system upon execution of a non-stationary policy $\pi((M_0, \dots, M_{t-1}), K)$. We drop the arguments $M_i$ and the $K$ from the definition of $x_t$ when it is clear from the context. If the same policy $M$ is used across all time steps, we compress the notation to $x_t^K(M)$. Note that $x_t^K(0)$ refers to running the linear policy $K$ in the standard way.

3. $\Psi_{t,i}^K(M_{t-H}, \dots, M_{t-1})$ is a transfer matrix that describes the effect of $w_{t-i}$ on the state $x_{t+1}$, formally defined below. When the arguments to $\Psi_{t,i}^K$ are clear from the context, we drop the arguments. When $M$ is the same across all arguments we suppress the notation to $\Psi_{t,i}^K(M)$.

###### Definition 4.2.

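Define the disturbance-state transfer matrix $\Psi_{t,i}^K$ to be

$$\Psi_{t,i}^K(M_{t-H}, \dots, M_{t-1}) = \tilde{A}_K^{i} \mathbf{1}_{i \le H} + \sum_{j=1}^{H} \tilde{A}_K^{j} B M_{t-j}^{[i-j]} \mathbf{1}_{i-j \in [1, H]}.$$

It will be worthwhile to note that $\Psi_{t,i}^K$ is linear in $(M_{t-H}, \dots, M_{t-1})$.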

###### Lemma 4.3.

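If $u_t$ is chosen as a non-stationary policy $\pi((M_0, \dots, M_T), K)$ recommends, then the state sequence is governed as follows:

$$x_{t+1} = \sum_{i=0}^{t} \Psi_{t,i} w_{t-i}, \tag{4.1}$$

which can equivalently be written as

$$x_{t+1} = \tilde{A}_K^{H+1} x_{t-H} + \sum_{i=0}^{2H} \Psi_{t,i} w_{t-i}. \tag{4.2}$$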

### 4.4 Idealized Setting

Note that the counterfactual nature of regret in the control setting implies that the loss at a time step $t$ depends on all the choices made in the past. To deal with this efficiently, we propose that our optimization problem only consider the effect of the past $H$ steps while planning, forgetting about the state the system was in at time $t - H$. We will show later that the above scheme tracks the true cost suffered up to a small additional loss. To formally define this idea, we need the following definition of the ideal state.

###### Definition 4.4 (Ideal State & Action).

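Define an ideal state $y_{t+1}^K(M_{t-H}, \dots, M_t)$, which is the state the system would have reached if it played the non-stationary policy $(M_{t-H}, \dots, M_t)$ at all time steps from $t - H$ to $t$, assuming the state at $t - H$ is $0$. Similarly, define $v_t^K(M_{t-H}, \dots, M_t)$ to be an idealized action that would have been executed at time $t$ if the state observed at time $t$ is $y_t^K(M_{t-H}, \dots, M_{t-1})$. Formally,

$$\begin{aligned}
y_{t+1}^K(M_{t-H}, \dots, M_t) &= \sum_{i=0}^{2H} \Psi_{t,i} w_{t-i}, \\
v_t^K(M_{t-H}, \dots, M_t) &= -K y_t^K + \sum_{i=1}^{H} M_t^{[i]} w_{t-i}.
\end{aligned}$$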

We can now consider the loss of the ideal state and the ideal action.

###### Definition 4.5 (Ideal Cost).

Define the idealized cost function $f_t$ to be the cost associated with the idealized state and idealized action, i.e.,

$$f_t(M_{t-H}, \dots, M_t) = c_t\big(y_t^K(M_{t-H}, \dots, M_{t-1}),\, v_t^K(M_{t-H}, \dots, M_t)\big).$$

The linearity of $y_t^K$ in past controllers and the linearity of $v_t^K$ in its immediate state imply that $f_t$ is a convex function of a linear transformation of $M_{t-H}, \dots, M_t$ and hence convex in $M_{t-H}, \dots, M_t$. This renders it amenable to algorithms for online convex optimization.

In Theorem 5.3 we show that $f_t$ and $c_t$ evaluated on a sequence are close, and this reduction allows us to consider only the truncated $f_t$ while planning, allowing for efficiency. The precise notion of minimizing regret over such truncated $f_t$ was considered in the online learning literature [4] before as online convex optimization (OCO) with memory. We present an overview of this framework next.
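The truncation behind the ideal state can be checked numerically on a scalar system with a fixed $M$: restarting from a zero state at time $t - H$ and replaying the same disturbances leaves a gap of exactly $\tilde{A}_K^{H+1} x_{t-H}$, the term dropped from (4.2). The sketch below computes both states by direct simulation rather than through $\Psi_{t,i}$; the system values and helper names are assumptions made for this example.

```python
def dac_step(A, B, K, M, x, w, s):
    """One closed-loop step at time s; w[s] is w_s and negative indices count as zero."""
    u = -K * x + sum(m * w[s - 1 - i] for i, m in enumerate(M) if s - 1 - i >= 0)
    return A * x + B * u + w[s]


def true_states(A, B, K, M, w):
    xs = [0.0]                        # xs[t] is the true state x_t, with x_0 = 0
    for s in range(len(w)):
        xs.append(dac_step(A, B, K, M, xs[-1], w, s))
    return xs


def ideal_state(A, B, K, M, w, t, H):
    """y_{t+1}: pretend the state at time t-H was zero, then replay steps t-H..t."""
    y = 0.0
    for s in range(max(t - H, 0), t + 1):
        y = dac_step(A, B, K, M, y, w, s)
    return y


A, B, K, M, H = 0.9, 1.0, 0.5, [0.2, -0.1], 3
w = [1.0, -0.5, 0.25, 1.0, -1.0, 0.5, 0.0, 1.0, -0.25, 0.75]
xs = true_states(A, B, K, M, w)
gaps = [
    xs[t + 1] - ideal_state(A, B, K, M, w, t, H) - (A - B * K) ** (H + 1) * xs[t - H]
    for t in range(H, len(w))
]
print(max(abs(g) for g in gaps))
```

Since $|A - BK| < 1$ in this example, the dropped term shrinks geometrically in $H$, which is why planning with the ideal state loses little.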

### 4.5 OCO with Memory

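We now present an overview of the online convex optimization (OCO) with memory framework, as established by [4]. In particular, we consider the setting where, for every $t$, an online player chooses some point $x_t \in \mathcal{K}$, a loss function $f_t$ is revealed, and the learner suffers a loss of $f_t(x_{t-H}, \dots, x_t)$. We assume a certain coordinate-wise Lipschitz regularity on $f_t$ of the form such that, for any $j \in \{1, \dots, H\}$, for any $x_1, \dots, x_H, \tilde{x}_j \in \mathcal{K}$,

$$\big| f_t(x_1, \dots, x_j, \dots, x_H) - f_t(x_1, \dots, \tilde{x}_j, \dots, x_H) \big| \le L \|x_j - \tilde{x}_j\|. \tag{4.3}$$

In addition, we define $\tilde{f}_t(x) = f_t(x, \dots, x)$, and we let

$$G_f = \sup_{t \in \{1, \dots, T\},\, x \in \mathcal{K}} \|\nabla \tilde{f}_t(x)\|, \qquad D = \sup_{x, y \in \mathcal{K}} \|x - y\|. \tag{4.4}$$

The resulting goal is to minimize the policy regret [5], which is defined as

$$\text{PolicyRegret} = \sum_{t=H}^{T} f_t(x_{t-H}, \dots, x_t) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} f_t(x, \dots, x).$$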

As shown by [4], by running a memory-based OGD, we may bound the policy regret by the following theorem.

###### Theorem 4.6.

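Let $\{f_t\}_{t=1}^{T}$ be Lipschitz continuous loss functions with memory such that $\tilde{f}_t$ are convex, and let $L$, $D$, and $G_f$ be as defined in (4.3) and (4.4). Then, Algorithm 2 generates a sequence $\{x_t\}_{t=1}^{T}$ such that

$$\sum_{t=H}^{T} f_t(x_{t-H}, \dots, x_t) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} f_t(x, \dots, x) \le \frac{D^2}{\eta} + T G_f^2 \eta + L H^2 \eta G_f T.$$

Furthermore, setting $\eta = \frac{D}{\sqrt{G_f (G_f + L H^2) T}}$ implies that

$$\text{PolicyRegret} \le O\Big(D \sqrt{G_f (G_f + L H^2) T}\Big).$$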
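A minimal sketch of gradient descent for losses with memory follows. Algorithm 2 is not reproduced in this section, so the update used here, a projected gradient step on $\tilde{f}_t$ at the current point, is our reading of the memory-based OGD of [4] rather than a transcription. The scalar domain $[-1, 1]$, the averaging loss, and the constants $D = 2$, $G_f = 4$, $L = 1$ are assumptions chosen for the example.

```python
import math


def project(x, lo=-1.0, hi=1.0):
    return max(lo, min(hi, x))


def theorem_step_size(D, G_f, L, H, T):
    """eta = D / sqrt(G_f (G_f + L H^2) T), the choice that balances the three terms."""
    return D / math.sqrt(G_f * (G_f + L * H * H) * T)


def ogd_with_memory(targets, H, eta, x0=0.0):
    xs = [x0] * (H + 1)                      # pad the history with the initial point
    losses = []
    for z in targets:
        window = xs[-(H + 1):]               # the last H+1 decisions
        avg = sum(window) / (H + 1)
        losses.append((avg - z) ** 2)        # f_t depends on the whole window
        grad = 2.0 * (xs[-1] - z)            # gradient of f~_t(x) = (x - z)^2 at x_t
        xs.append(project(xs[-1] - eta * grad))
    return xs, losses


T, H = 100, 3
D, G_f, L = 2.0, 4.0, 1.0                    # diameter and (loose) Lipschitz bounds
eta = theorem_step_size(D, G_f, L, H, T)
xs, losses = ogd_with_memory([0.5] * T, H, eta)
bound = D * D / eta + T * G_f * G_f * eta + L * H * H * eta * G_f * T
print(sum(losses), bound)
```

With a constant target the best fixed point incurs zero loss, so the sum of losses is itself the policy regret in this toy instance and can be compared directly with the bound.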

## 5 Algorithm & Main Result

Algorithm 1 describes our proposed algorithm for controlling linear dynamical systems with adversarial disturbances, which at all times maintains a disturbance-action controller. The algorithm implements the memory-based OGD on the loss $f_t$ as described in the previous section. The algorithm requires the specification of a $(\kappa, \gamma)$-strongly stable matrix $K$ once before the online game. Such a matrix can be obtained offline using an SDP relaxation as described in [9]. The following theorem states the regret bound Algorithm 1 guarantees.

###### Theorem 5.1 (Main Theorem).

Suppose Algorithm 1 is executed with , on an LDS satisfying Assumption 3.1 with control costs satisfying Assumption  3.2. Then, it holds true that

$$J_T(\mathcal{A}) - \min_{K \in \mathcal{K}} J_T(K) \le O\big(G W^2 \sqrt{T} \log(T)\big).$$

Furthermore, the algorithm maintains at most parameters can be implemented in time per time step. Here , contain polynomial factors in .

###### Proof of Theorem 5.1.

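Note that by the definition of the algorithm we have that all $M_t \in \mathcal{M}$, where

$$\mathcal{M} = \Big\{ M = \{M^{[1]} \dots M^{[H]}\} : \|M^{[i]}\| \le \kappa^3 \kappa_B (1-\gamma)^i \Big\}.$$

Let $D$ be defined as

$$D \triangleq \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma \big(1 - \kappa^2 (1-\gamma)^{H+1}\big)} + \frac{\kappa_B \kappa^3 W}{\gamma}.$$

Let $K^*$ be the optimal linear policy in hindsight. By definition $K^*$ is a $(\kappa, \gamma)$-strongly stable matrix. Using Lemma 5.2 and Theorem 5.3, we have that

$$\begin{aligned}
&\min_{M_* \in \mathcal{M}} \Big( \sum_{t=0}^{T} f_t(M_*, \dots, M_*) \Big) - \sum_{t=0}^{T} c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big) \qquad \text{(5.1)} \\
&\quad \le \min_{M_* \in \mathcal{M}} \Big( \sum_{t=0}^{T} c_t\big(x_t^{K}(M_*), u_t^{K}(M_*)\big) \Big) - \sum_{t=0}^{T} c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big) + 2 T G D^2 \kappa^3 (1-\gamma)^{H+1} \\
&\quad \le 2 T G D (1-\gamma)^{H+1} \Big( \frac{W H \kappa_B^2 \kappa^5}{\gamma} + D \kappa^3 \Big). \qquad \text{(5.2)}
\end{aligned}$$

Let $M_1, \dots, M_T$ be the sequence of policies played by the algorithm. Note that by definition of the constraint set $\mathcal{M}$, we have that

$$\forall t \in [T],\ \forall i \in [H]: \quad \|M_t^{[i]}\| \le \kappa_B \kappa^3 (1-\gamma)^i.$$

Using Theorem 5.3 we have that

$$\sum_{t=0}^{T} c_t(x_t^K, u_t^K) - \sum_{t=0}^{T} f_t(M_{t-H} \dots M_t) \le 2 T G D^2 \kappa^3 (1-\gamma)^{H+1}. \tag{5.3}$$

Finally, using Theorem 4.6, using Lemmas 5.6 and 5.7 to bound the constants $G_f$ and $L$ associated with the function $f_t$, and noting that

$$\max_{M_1, M_2 \in \mathcal{M}} \|M_1 - M_2\| \le \frac{\kappa_B \kappa^3 \sqrt{d}}{\gamma},$$

we have that

$$\sum_{t=0}^{T} f_t(M_{t-H} \dots M_t) - \min_{M_* \in \mathcal{M}} \sum_{t=0}^{T} f_t(M_*, \dots, M_*) \le 8 G W D d^{3/2} \kappa_B^2 \kappa^6 H^{2.5} \gamma^{-1} \sqrt{T}. \tag{5.4}$$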

Summing up (5.1), (5.3) and (5.4), and using the condition that , we get the result.∎
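The overall loop that the proof analyses (play a disturbance-action controller, recover the disturbance, take a projected gradient step on the idealized cost) can be sketched on a scalar system as below. Algorithm 1 itself is not reproduced in this section, so this is an illustration under stated assumptions rather than the authors' procedure: the gradient is taken by central finite differences instead of analytically, the constraint set is a simple box given by the hypothetical `bounds` argument rather than $\kappa^3 \kappa_B (1-\gamma)^i$, and the step size and system values are arbitrary.

```python
def ideal_cost(A, B, K, M, w_hist, t, H, cost):
    """f_t(M,...,M): restart from a zero state at time t-H-1 and replay with fixed M."""
    def action(x, s):
        return -K * x + sum(m * w_hist[s - 1 - i] for i, m in enumerate(M) if s - 1 - i >= 0)
    y = 0.0
    for s in range(max(t - H - 1, 0), t):
        y = A * y + B * action(y, s) + w_hist[s]
    return cost(y, action(y, t))


def grad_fd(f, M, eps=1e-5):
    """Central finite differences; a stand-in for the analytic gradient."""
    g = []
    for i in range(len(M)):
        up = M[:i] + [M[i] + eps] + M[i + 1:]
        dn = M[:i] + [M[i] - eps] + M[i + 1:]
        g.append((f(up) - f(dn)) / (2 * eps))
    return g


def online_control_sketch(A, B, K, H, eta, bounds, disturbances, cost):
    M = [0.0] * H
    x, w_hist, total = 0.0, [], 0.0
    for t, w in enumerate(disturbances):
        u = -K * x + sum(m * w_hist[t - 1 - i] for i, m in enumerate(M) if t - 1 - i >= 0)
        total += cost(x, u)
        x_next = A * x + B * u + w
        w_hist.append(x_next - A * x - B * u)            # recovered disturbance w_t
        g = grad_fd(lambda Mv: ideal_cost(A, B, K, Mv, w_hist, t, H, cost), M)
        M = [max(-b, min(b, m - eta * gi)) for m, gi, b in zip(M, g, bounds)]
        x = x_next
    return total, M


quad = lambda x, u: x * x + u * u
w = [1.0] * 200                                          # a persistent, non-zero-mean disturbance
learned, M_final = online_control_sketch(0.9, 1.0, 0.5, 3, 0.05, [1.0] * 3, w, quad)
baseline, _ = online_control_sketch(0.9, 1.0, 0.5, 3, 0.0, [1.0] * 3, w, quad)
print(learned, baseline, M_final)
```

Setting `eta` to zero freezes $M$ at zero and recovers the plain linear controller $K$, which gives a baseline for the same disturbance sequence.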

### 5.1 Sufficiency of Disturbance-Action Policies

The class of policies described in Definition 4.1 is powerful enough in its representational capacity to capture any fixed linear policy. Lemma 5.2 establishes this equivalence in terms of the state and action sequence each policy produces.

###### Lemma 5.2 (Sufficiency).

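For any two $(\kappa, \gamma)$-strongly stable matrices $K^*, K$, there exists a policy $\pi(M_*, K)$, with $M_* = (M_*^{[1]}, \dots, M_*^{[H]})$ defined as

$$M_*^{[i]} = (K - K^*)(A - BK^*)^{i-1}$$

such that

$$\sum_{t=0}^{T} \Big( c_t\big(x_t^K(M_*), u_t^K(M_*)\big) - c_t\big(x_t^{K^*}(0), u_t^{K^*}(0)\big) \Big) \le T \cdot \frac{2 G D W H \kappa_B^2 \kappa^5 (1-\gamma)^{H+1}}{\gamma}. \tag{5.5}$$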
###### Proof of Lemma 5.2.

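By definition we have that

$$x_{t+1}(K^*) = \sum_{i=0}^{t} \tilde{A}_{K^*}^{i} w_{t-i}.$$

Consider the following calculation for $M_*$ with $M_*^{[i]} = (K - K^*)(A - BK^*)^{i-1}$ and for any $i \in \{0, \dots, H\}$. Since $B(K - K^*) = \tilde{A}_{K^*} - \tilde{A}_K$, we have that

$$\begin{aligned}
\Psi_{t,i}^K(M_*) &= \tilde{A}_K^{i} + \sum_{j=1}^{i} \tilde{A}_K^{i-j} B M_*^{[j]} \\
&= \tilde{A}_K^{i} + \sum_{j=1}^{i} \tilde{A}_K^{i-j} B (K - K^*) \tilde{A}_{K^*}^{j-1} \\
&= \tilde{A}_K^{i} + \sum_{j=1}^{i} \tilde{A}_K^{i-j} (\tilde{A}_{K^*} - \tilde{A}_K) \tilde{A}_{K^*}^{j-1} \\
&= \tilde{A}_K^{i} + \sum_{j=1}^{i} \big( \tilde{A}_K^{i-j} \tilde{A}_{K^*}^{j} - \tilde{A}_K^{i-j+1} \tilde{A}_{K^*}^{j-1} \big) \\
&= \tilde{A}_{K^*}^{i}.
\end{aligned}$$

The final equality follows as the sum telescopes. Therefore, we have that

$$x_{t+1}^K(M_*) = \sum_{i=0}^{H} \tilde{A}_{K^*}^{i} w_{t-i} + \sum_{i=H+1}^{t} \Psi_{t,i}^K(M_*) w_{t-i}.$$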

From the above we get that

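$$\|x_t^{K^*}(0) - x_t^K(M_*)\| \le W \sum_{i=H+1}^{t} \|\Psi_{t,i}^K(M_*)\| \le \frac{W H \kappa_B^2 \kappa^5 (1-\gamma)^{H+1}}{\gamma}, \tag{5.6}$$

where the last inequality follows from using Lemma 5.4 and the fact that $\|M_*^{[i]}\| \le \kappa_B \kappa^3 (1-\gamma)^i$.

Further comparing the actions taken by the two policies we get that

$$\begin{aligned}
\|u_t^{K^*} - u_t^K(M_*)\| &= \Big\| -K^* x_t^{K^*} + K x_t^K(M_*) - \sum_{i=0}^{t} (K - K^*) \tilde{A}_{K^*}^{i} w_{t-i} \Big\| \\
&\le \Big\| \sum_{i=H+1}^{t} K \big( \tilde{A}_{K^*}^{i} + \Psi_{t,i}^K(M_*) \big) w_{t-i} \Big\| \\
&\le \frac{2 W H \kappa_B^2 \kappa^5 (1-\gamma)^{H+1}}{\gamma}.
\end{aligned}$$

Using the above, Assumption 3.2 and Lemma 5.5, we get that

$$\sum_{t=0}^{T} \Big( c_t\big(x_t^K(M_*), u_t^K(M_*)\big) - c_t\big(x_t^{K^*}, u_t^{K^*}\big) \Big) \le T \cdot \frac{2 G D W H \kappa_B^2 \kappa^5 (1-\gamma)^{H+1}}{\gamma}. \tag{5.7}$$

∎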
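The sufficiency claim can be checked numerically on a scalar system: a disturbance-action policy built on one stabilizing gain $K$ can imitate a different gain $K^*$ up to a truncation error that vanishes geometrically in $H$. The sketch uses the parameterization $M^{[i]} = (K - K^*)(A - BK^*)^{i-1}$, the sign for which $B M^{[j]} = (\tilde{A}_{K^*} - \tilde{A}_K)\tilde{A}_{K^*}^{j-1}$ holds, as the telescoping argument requires. The system values, horizon and disturbance sequence are assumptions made for this example.

```python
def rollout_linear(A, B, K, w):
    """States under u_t = -K x_t, starting from x_0 = 0."""
    xs = [0.0]
    for wt in w:
        xs.append((A - B * K) * xs[-1] + wt)
    return xs


def rollout_dac(A, B, K, M, w):
    """States under u_t = -K x_t + sum_i M[i] w_{t-1-i}, starting from x_0 = 0."""
    xs = [0.0]
    for t, wt in enumerate(w):
        u = -K * xs[-1] + sum(m * w[t - 1 - i] for i, m in enumerate(M) if t - 1 - i >= 0)
        xs.append(A * xs[-1] + B * u + wt)
    return xs


A, B, K, K_star, H = 0.9, 1.0, 0.5, 0.3, 30
# M_star[i] stores M^{[i+1]} = (K - K*) (A - B K*)^i
M_star = [(K - K_star) * (A - B * K_star) ** i for i in range(H)]
w = [(-1.0) ** t * (1 + t % 3) / 3 for t in range(80)]   # bounded, non-stochastic

x_target = rollout_linear(A, B, K_star, w)
x_dac = rollout_dac(A, B, K, M_star, w)
max_gap = max(abs(a - b) for a, b in zip(x_target, x_dac))
print(max_gap)
```

Shrinking `H` makes the gap grow, in line with the $(1-\gamma)^{H+1}$ factor in (5.6).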

### 5.2 Approximation Theorems

The following theorem relates the idealized cost $f_t(M_{t-H}, \dots, M_t)$ to the actual cost $c_t(x_t, u_t)$.

###### Theorem 5.3.

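For any $(\kappa, \gamma)$-strongly stable $K$, any number $a$ and any sequence of policies $M_1, \dots, M_T$ satisfying $\|M_t^{[i]}\| \le a (1-\gamma)^i$, if the perturbations are bounded by $W$, we have that

$$\sum_{t=1}^{T} f_t(M_{t-H}, \dots, M_t) - \sum_{t=1}^{T} c_t(x_t^K, u_t^K) \le 2 T G D^2 \kappa^3 (1-\gamma)^{H+1}, \tag{5.8}$$

where

$$D \triangleq \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma \big(1 - \kappa^2 (1-\gamma)^{H+1}\big)} + \frac{a W}{\gamma}.$$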

Before giving the proof of the above theorem, we will need a few lemmas which will be useful.

###### Lemma 5.4.

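Let $K$ be a $(\kappa, \gamma)$-strongly stable matrix, $a$ be any number and $M_t$ be a sequence such that for all $i, t$, we have $\|M_t^{[i]}\| \le a (1-\gamma)^i$. Then we have that for all $i, t$

$$\|\Psi_{t,i}^K\| \le \kappa^2 (1-\gamma)^i \cdot \mathbf{1}_{i \le H} + H \kappa_B \kappa^2 a (1-\gamma)^i.$$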
###### Proof of Lemma 5.4.

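The proof follows by noticing that

$$\begin{aligned}
\|\Psi_{t,i}^K\| &\le \|\tilde{A}_K^{i}\| \mathbf{1}_{i \le H} + \sum_{j=1}^{H} \|\tilde{A}_K^{j}\| \|B\| \|M_{t-j}^{[i-j]}\| \mathbf{1}_{i-j \in [1,H]} \\
&\le \kappa^2 (1-\gamma)^i \cdot \mathbf{1}_{i \le H} + \sum_{j=1}^{H} \kappa_B \kappa^2 a (1-\gamma)^i \\
&\le \kappa^2 (1-\gamma)^i \cdot \mathbf{1}_{i \le H} + H \kappa_B \kappa^2 a (1-\gamma)^i,
\end{aligned}$$

where the second and the third inequalities follow by using the fact that $K$ is a $(\kappa, \gamma)$-strongly stable matrix and the conditions on the spectral norm of $M$. ∎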

We now derive a bound on the norm of each of the states.

###### Lemma 5.5.

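Suppose the system satisfies Assumption 3.1 and let $M_t$ be a sequence such that for all $i, t$, we have that $\|M_t^{[i]}\| \le a (1-\gamma)^i$ for a number $a$. Define

$$D \triangleq \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma \big(1 - \kappa^2 (1-\gamma)^{H+1}\big)} + \frac{a W}{\gamma}.$$

Further suppose $K^*$ is a $(\kappa, \gamma)$-strongly stable matrix. We have that for all $t$

$$\begin{aligned}
\max\big( \|x_t^K\|,\ \|y_t^K(M_{t-H-1} \dots M_{t-1})\|,\ \|x_t(K^*)\| \big) &\le D, \\
\max\big( \|u_t^K\|,\ \|v_t^K(M_{t-H} \dots M_t)\| \big) &\le D, \\
\|x_t^K - y_t^K(M_{t-H-1} \dots M_{t-1})\| &\le \kappa^2 (1-\gamma)^{H+1} D, \\
\|u_t^K - v_t^K(M_{t-H} \dots M_t)\| &\le \kappa^3 (1-\gamma)^{H+1} D.
\end{aligned}$$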
###### Proof of Lemma 5.5.

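Using the definition of $x_t^K$ we have that

$$\begin{aligned}
\|x_t^K\| &\le \kappa^2 (1-\gamma)^{H+1} \|x_{t-H}\| + W \cdot \Big( \sum_{i=0}^{2H} \|\Psi_{t,i}\| \Big) \\
&\le \kappa^2 (1-\gamma)^{H+1} \|x_{t-H}\| + W \cdot \Big( \frac{\kappa^2 + H \kappa_B \kappa^2 a}{\gamma} \Big).
\end{aligned}$$

The above recurrence can be seen to easily satisfy the following upper bound.

$$\|x_t^K\| \le \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma \big(1 - \kappa^2 (1-\gamma)^{H+1}\big)} \le D. \tag{5.9}$$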
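The step from the recurrence to the closed-form bound can be spelled out by a short induction. The shorthand $r$ and $c$ below is introduced only for this remark, and the argument assumes $r < 1$, which the denominator in the definition of $D$ implicitly requires.

$$\begin{aligned}
&r \triangleq \kappa^2 (1-\gamma)^{H+1}, \qquad c \triangleq \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma}, \qquad \|x_t^K\| \le r \|x_{t-H}\| + c. \\
&\text{If } \|x_s\| \le \frac{c}{1-r} \text{ for all } s < t \text{ (true at } s = 0 \text{ since } x_0 = 0\text{), then} \\
&\|x_t^K\| \le r \cdot \frac{c}{1-r} + c = \frac{c}{1-r} = \frac{W(\kappa^2 + H \kappa_B \kappa^2 a)}{\gamma \big(1 - \kappa^2 (1-\gamma)^{H+1}\big)}.
\end{aligned}$$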

A similar bound can easily be established for

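$$\|y_t^K(M_{t-H-1} \dots M_{t-1})\| \le W \cdot \Big( \frac{\kappa^2 + H \kappa_B \kappa^2 a}{\gamma} \Big) \le D. \tag{5.10}$$

It is also easy to see via the definitions that

$$\|x_t^K - y_t^K(M_{t-H-1} \dots M_{t-1})\| \le \|\tilde{A}_K^{H+1}\| \|x_{t-H}\| \le \kappa^2 (1-\gamma)^{H+1} D. \tag{5.11}$$

We can finally bound

$$\|x_t^{K^*}(0)\| \le \frac{W \kappa^2}{\gamma} \le D.$$

For the actions we can use the definitions to bound the actions as follows using (5.9) and (5.10):

$$\begin{aligned}
\|u_t^K\| &\le \|K x_t\| + \sum_{i=1}^{H} \|M_t^{[i]} w_{t-i}\| \le \kappa \|x_t^K\| + \frac{a W}{\gamma} \le D, \\
\|v_t^K(M_{t-H} \dots M_t)\| &\le \|K y_t^K(M_{t-H-1} \dots M_{t-1})\| + \sum_{i=1}^{H} \|M_t^{[i]} w_{t-i}\| \le D.
\end{aligned}$$

We also have that using (5.11)

$$\|u_t^K - v_t^K(M_{t-H} \dots M_t)\| = \big\| K \big( x_t^K - y_t^K(M_{t-H-1} \dots M_{t-1}) \big) \big\| \le \kappa^3 (1-\gamma)^{H+1} D.$$

∎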

Finally, we prove Theorem 5.3.

###### Proof of Theorem 5.3.

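Using the above lemmas we can now bound the approximation error between $c_t$ and $f_t$ using Assumption 3.2:

$$\begin{aligned}
|c_t(x_t, u_t) - f_t(M_{t-H} \dots M_t)| &= \big| c_t(x_t, u_t) - c_t\big( y_t^K(M_{t-H-1}, \dots, M_{t-1}),\, v_t^K(M_{t-H}, \dots, M_t) \big) \big| \\
&\le G D \|x_t - y_t^K(M_{t-H-1}, \dots, M_{t-1})\| + G D \|u_t - v_t^K(M_{t-H}, \dots, M_t)\| \\
&\le 2 G D^2 \kappa^3 (1-\gamma)^{H+1}.
\end{aligned}$$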

This finishes the proof of Theorem 5.3. ∎

### 5.3 Bounding the properties of the OCO game with Memory

#### 5.3.1 Bounding the Lipschitz Constant

###### Lemma 5.6.

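Consider two policy sequences $\{M_{t-H} \dots M_{t-k} \dots M_t\}$ and $\{M_{t-H} \dots \tilde{M}_{t-k} \dots M_t\}$ which differ in exactly one policy played at a time step $t - k$ for $k \in \{0, \dots, H\}$. Then we have that

$$\big| f_t(M_{t-H} \dots M_{t-k} \dots M_t) - f_t(M_{t-H} \dots \tilde{M}_{t-k} \dots M_t) \big| \le 2 G D W \kappa_B \kappa^3 (1-\gamma)^k \sum_{i=0}^{H} \big( \|M_{t-k}^{[i]} - \tilde{M}_{t-k}^{[i]}\| \big).$$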
###### Proof of Lemma 5.6.

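For the rest of the proof, we will denote $y_t^K(M_{t-H-1} \dots M_{t-k} \dots M_{t-1})$ as $y_t^K$ and $y_t^K(M_{t-H-1} \dots \tilde{M}_{t-k} \dots M_{t-1})$ as $\tilde{y}_t^K$. Similarly define $v_t^K$ and $\tilde{v}_t^K$. It follows immediately from the definitions that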

 ∥yKt−~yKt∥ =∥~AkKB2H∑i=0(M[i−k]t−k−