1 Introduction
This paper studies the robust control of linear dynamical systems. A linear dynamical system is governed by the dynamics equation
(1.1) $x_{t+1} = A x_t + B u_t + w_t$,
where $x_t$ is the state, $u_t$ is the control and $w_t$ is a disturbance to the system. At every time step $t$, the controller suffers a cost $c(x_t, u_t)$ to enforce the control. In this paper, we consider the setting of online control with arbitrary disturbances. Formally, the setting involves, at every time step $t$, an adversary selecting a convex cost function $c_t(x, u)$ and a disturbance $w_t$, and the goal of the controller is to generate a sequence of controls $u_t$ such that the sequence of convex costs $c_t(x_t, u_t)$ is minimized.
The above setting generalizes a fundamental problem in control theory (including the Linear Quadratic Regulator) which has been studied over several decades and is surveyed below. However, despite the significant research literature on the problem, our generalization and results address several challenges that have remained open.
Challenge 1. Perhaps the most important challenge we address is in dealing with arbitrary disturbances in the dynamics. This is a difficult problem, and so standard approaches almost exclusively assume i.i.d. Gaussian noise. Worst-case approaches in the control literature, also known as $H_\infty$ control and its variants, are overly pessimistic. Instead, we take an online (adaptive) approach to dealing with adversarial disturbances.
Challenge 2. Another limitation for efficient methods is the classical assumption that the costs are quadratic, as is the case for the linear quadratic regulator. Part of the focus in the literature on the quadratic costs is due to special properties that allow for efficient computation of the best linear controller in hindsight. One of our main goals is to introduce a more general technique that allows for efficient algorithms even when faced with arbitrary convex costs.
Our contributions.
In this paper, we tackle both challenges outlined above: coping with adversarial noise, and general loss functions in an online setting. For this we turn to the time-trusted methodology of regret minimization in online learning. In the field of online learning, regret minimization is known to be more robust and general than statistical learning, and a host of convex relaxation techniques are readily available. To define the performance metric, denote for any control algorithm $\mathcal{A}$,
$J(\mathcal{A}) = \sum_{t=0}^{T} c_t(x_t, u_t)$.
The standard comparator in control is a linear controller, which generates a control signal as a linear function of the state, i.e. $u_t = -K x_t$. Let $J(K)$ denote the cost of a linear controller from a certain class $K \in \mathcal{K}$. For an algorithm $\mathcal{A}$, we define the regret as the suboptimality of its cost with respect to the best linear controller from a certain set:
$\mathrm{Regret} = J(\mathcal{A}) - \min_{K \in \mathcal{K}} J(K)$.
Our main result is an efficient algorithm for control which achieves $O(\sqrt{T})$ regret in the setting described above. A similar setting has been considered in the literature before [9], but our work generalizes previous work in the following ways:

Our algorithm achieves $O(\sqrt{T})$ regret even in the presence of bounded adversarial disturbances. Previous regret bounds needed to assume that the disturbances are drawn from a distribution with zero mean and bounded variance.

Our regret bounds apply to any sequence of adversarially chosen convex loss functions. Previous efficient algorithms applied to convex quadratic costs only.
Our results above are obtained using a host of techniques from online learning and online convex optimization, notably online learning for loss functions with memory and improper learning using convex relaxation.
2 Related Work
Online Learning:
Our approach stems from the study of regret minimization in online learning. This paper advocates for worst-case regret as a robust performance metric in the presence of adversarial noise. A special case of our study is that of regret minimization in stateless (with $A = 0$) systems, which is a well studied problem in machine learning. See books and surveys on the subject [8, 15, 20]. Of particular interest to our study is the setting of online learning with memory [4].
Learning and Control in Linear Dynamical Systems:
The modern setting for linear dynamical systems arose in the seminal work of Kalman [18], who introduced the Kalman filter as a recursive least-squares solution for maximum likelihood estimation (MLE) of Gaussian perturbations to the system in latent-state systems. The framework and filtering algorithm have proven to be a mainstay in control theory and time-series analysis; indeed, the term Kalman filter model is often used interchangeably with LDS. We refer the reader to the classic survey [19], and the extensive overview of recent literature in [14]. Most of this literature, as well as most of classical control theory, deals with zero-mean random noise, mostly Normally distributed.
Recently, there has been a renewed interest in learning both fully-observable and latent-state linear dynamical systems. Sample complexity and regret bounds (for Gaussian noise) were obtained in [2, 1]. The fully-observable and convex cases were revisited in [10, 21]. The technique of spectral filtering for learning and controlling non-observable systems was introduced and studied in [16, 6, 17]. Provable control in the Gaussian noise setting was also studied in [13].
Robust Control:
The most notable attempts to handle adversarial perturbations in the dynamics are called $H_\infty$ control [25, 22]. In this setting, the controller solves for the best linear controller assuming worst case noise to come, i.e.
$\min_{K_1} \max_{w_1} \min_{K_2} \max_{w_2} \cdots \min_{K_T} \max_{w_T} \sum_{t} c_t(x_t, u_t)$,
assuming similar linear dynamics as in equation (1.1). In comparison, we do not solve for the entire noise trajectory in advance, but adjust for it iteratively. Another difference is computational: the above mathematical program may be hard to compute for general cost functions, as compared to our efficient gradient-based algorithm.
Nonstochastic MDPs:
The setting we consider, control in systems with linear transition dynamics [7] in the presence of adversarial disturbances, can be cast as that of planning in an adversarially changing MDP [5, 11]. The results obtained via this reduction are unsatisfactory because these regret bounds scale with the size of the state space, which is usually exponential in the dimension of the system. In addition, the regret in these scales as $T^{2/3}$. In comparison, [24, 12] solve the online planning problem for MDPs with fixed dynamics and changing costs. The satisfying aspect of their result is that the regret bound does not explicitly depend on the size of the state space, and scales as $T^{1/2}$. However, the dynamics are fixed and without (adversarial) noise.
LQR with changing costs:
For the Linear Quadratic Regulator problem, [9] consider changing quadratic costs with stochastic noise to get an $O(\sqrt{T})$ regret bound. This work is well aligned with our results, and the present paper employs some notions developed therein (e.g. strong stability). However, the techniques used in [9] (e.g. the SDP formulation for a linear controller) are strongly reliant on the quadratic nature of the cost functions and stochasticity of the disturbances. In particular, even for the offline problem, to the best of our knowledge, there does not exist an SDP formulation to determine the best linear controller for convex losses. In an earlier work, [3] considers a more restricted setting with fixed, deterministic dynamics (hence, noiseless) and changing quadratic costs.
3 Problem Setting
3.1 Interaction Model
The Linear Dynamical System is a Markov decision process on continuous state and action spaces, with linear transition dynamics. In each round $t$, the learner outputs an action $u_t$ on observing the state $x_t$ and incurs a cost of $c_t(x_t, u_t)$, where $c_t$ is convex. The system then transitions to a new state $x_{t+1}$ according to
$x_{t+1} = A x_t + B u_t + w_t$.
In the above definition, $w_t$ is the disturbance sequence the system suffers at each time step. In this paper, we make no distributional assumptions on $w_t$. The sequence $w_t$ is not made known to the learner in advance.
For any algorithm $\mathcal{A}$, the cost we attribute to it is
$J(\mathcal{A}) = \sum_{t=0}^{T} c_t(x_t, u_t)$,
where $x_{t+1} = A x_t + B u_t + w_t$ and $u_t = \mathcal{A}(x_0, \ldots, x_t)$. With some abuse of notation, we shall use $J(K)$ to denote the cost of a linear controller $K$ which chooses the action as $u_t = -K x_t$.
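The interaction protocol can be simulated directly. The following is a minimal sketch rather than code from the paper: it assumes the sign convention $u_t = -K x_t$ and the start state $x_0 = 0$, and the system matrices, the gain `K`, the constant-drift disturbance, the quadratic cost and the helper name `rollout_linear_controller` are all illustrative choices.

```python
import numpy as np

def rollout_linear_controller(A, B, K, disturbances, cost):
    """Play u_t = -K x_t on x_{t+1} = A x_t + B u_t + w_t, starting at x_0 = 0.

    Returns the accumulated cost and the visited states x_0, ..., x_{T-1}.
    """
    x = np.zeros(A.shape[0])
    total, states = 0.0, []
    for w in disturbances:
        u = -K @ x
        total += cost(x, u)
        states.append(x)
        x = A @ x + B @ u + w          # the disturbance enters additively
    return total, np.array(states)

# Hypothetical 2-state, 1-input system (values chosen only for illustration).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 2.0]])             # A - B K has both eigenvalues at 0.9
T = 200
# A bounded, non-stochastic disturbance: constant drift with norm 0.1.
disturbances = np.tile(np.array([0.0, 0.1]), (T, 1))
quadratic_cost = lambda x, u: float(x @ x + u @ u)

J_K, states = rollout_linear_controller(A, B, K, disturbances, quadratic_cost)
```

Under this constant drift the linear controller does not return the state to the origin; it settles at the fixed point of $x = (A - BK)x + w$, which is $(1, 0)$ for these values. This kind of persistent, non-zero-mean disturbance is what a zero-mean noise model excludes.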
3.2 Assumptions
We make the following assumptions throughout the paper. We remark that they are less restrictive, and hence allow for more general systems, than those considered by previous works. In particular, we allow for adversarial (rather than i.i.d. stochastic) noise, and convex cost functions. Also, the non-stochastic nature of the disturbances permits, without loss of generality, the assumption that $x_0 = 0$.
Assumption 3.1.
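The matrices that govern the dynamics are bounded, i.e., $\|A\| \le \kappa_A$, $\|B\| \le \kappa_B$. The perturbation introduced per time step is bounded, i.e., $\|w_t\| \le W$.
Assumption 3.2.
The costs $c_t(x, u)$ are convex. Further, as long as it is guaranteed that $\|x\|, \|u\| \le D$, it holds that
$\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G D$.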
Following the definitions in [9], we work on the following class of linear controllers.
Definition 3.3.
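A linear policy $K$ is $(\kappa, \gamma)$-strongly stable if there exist matrices $L, H$ satisfying $A - BK = H L H^{-1}$, such that the following two conditions are met:

The spectral norm of $L$ is strictly smaller than unity, i.e., $\|L\| \le 1 - \gamma$.

The controller and the transforming matrices are bounded, i.e., $\|K\| \le \kappa$ and $\|H\|, \|H^{-1}\| \le \kappa$.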
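The two conditions of the definition can be checked numerically once a candidate transforming matrix is supplied. The sketch below assumes the factorization $A - BK = H L H^{-1}$ with constants $\kappa$ and $\gamma$ as in [9]; the function name `is_strongly_stable` and the example matrices are hypothetical, and finding a good $H$ is left aside.

```python
import numpy as np

def is_strongly_stable(A, B, K, H_mat, kappa, gamma):
    """Check (kappa, gamma)-strong stability of K for a given transforming matrix.

    L is recovered as H^{-1} (A - B K) H, so that A - B K = H L H^{-1}.
    All norms are spectral norms (largest singular value).
    """
    H_inv = np.linalg.inv(H_mat)
    L = H_inv @ (A - B @ K) @ H_mat
    spec = lambda X: np.linalg.norm(X, 2)
    contraction_ok = spec(L) <= 1.0 - gamma
    bounded_ok = (spec(K) <= kappa and spec(H_mat) <= kappa
                  and spec(H_inv) <= kappa)
    return bool(contraction_ok and bounded_ok)

# Illustrative system in which A - B K = 0.5 * I, so H = I already works.
A = np.array([[0.9, 0.5], [0.0, 0.9]])
B = np.eye(2)
K = np.array([[0.4, 0.5], [0.0, 0.4]])

stable_loose = is_strongly_stable(A, B, K, np.eye(2), kappa=2.0, gamma=0.4)
stable_tight = is_strongly_stable(A, B, K, np.eye(2), kappa=2.0, gamma=0.7)
```

Here `stable_loose` holds because $\|L\| = 0.5 \le 0.6$, while `stable_tight` fails because $0.5 > 0.3$: the same controller is strongly stable only for sufficiently small $\gamma$.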
3.3 Regret Formulation
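Let $\mathcal{K} = \{K : K \text{ is } (\kappa, \gamma)\text{-strongly stable}\}$. For an algorithm $\mathcal{A}$, the regret is the suboptimality of its cost with respect to a best linear controller:
$\mathrm{Regret} = J(\mathcal{A}) - \min_{K \in \mathcal{K}} J(K)$.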
3.4 Proof Techniques and Overview
Choice of Policy Class:
We begin by parameterizing the policy we execute at every step as a linear function of the disturbances in the past in Definition 4.1. A similar parameterization has been considered in the system level synthesis framework (see [23]). This leads to a convex relaxation of the problem. Optimization on alternative parameterizations, including an SDP based framework [9] or a direct parametrization [13], has been studied in the literature, but these seem unable to capture general convex functions as well as adversarial disturbances, or lead to a nonconvex loss. To avoid a linear dependence on time for the number of parameters in our policy, we additionally include a stable linear controller in our policy, allowing us to effectively consider only a bounded number of the most recent perturbations. Lemma 5.2 makes this notion of approximation precise.
Reduction to OCO with memory:
The choice of the policy class with an appropriately chosen horizon allows us to reduce the problem to competing with functions with truncated memory. This naturally falls under the class of online convex optimization with memory (see Section 4.5). Theorem 5.3 makes this reduction precise. Finally, to bound the regret on truncated functions we use the Online Gradient Descent based approach specified in [4], which requires a bound on Lipschitz constants which we provide in Section 5.3.1. This reduction is inspired by the ideas introduced in [12].
3.5 Roadmap
4 Preliminaries
In this section, we establish some important definitions that will prove useful throughout the paper.
4.1 Notation
We reserve the letters $x, y$ for states and $u, v$ for control actions. We denote by $d = \max(\dim(x), \dim(u))$, i.e., a bound on the dimensionality of the problem. We reserve the capital letters $A, B, K, M$ for matrices associated with the system and the policy. Other capital letters are reserved for universal constants in the paper.
4.2 A Disturbance-Action Policy Class
We put forth the notion of a disturbance-action controller which chooses the action as a linear map of the past disturbances. Any disturbance-action controller ensures that the state of a system executing such a policy may be expressed as a linear function of the parameters of the policy. This property is convenient in that it permits efficient optimization over the parameters of such a policy. The situation may be contrasted with that of a linear controller. While the action recommended by a linear controller is also linear in past disturbances (a consequence of being linear in the current state), the state sequence produced on the execution of a linear policy is not a linear function of its parameters.
Definition 4.1 (Disturbance-Action Policy).
A disturbance-action policy $\pi(M, K)$ is specified by parameters $M = (M^{[0]}, \ldots, M^{[H-1]})$ and a fixed matrix $K$. At every time $t$, such a policy $\pi(M, K)$ chooses the recommended action $u_t$ at a state $x_t$, defined as
$u_t(M) = -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$.
(Footnote 1: $x_t$ is completely determined given $w_0, \ldots, w_{t-1}$. Hence, the use of $x_t$ only serves to ease presentation.)
For notational convenience, here it may be considered that $w_i = 0$ for all $i < 0$.
We refer to the policy played at time $t$ as $M_t = \{M_t^{[i]}\}$, where the subscript $t$ refers to the time index and the superscript $[i-1]$ refers to the action of $M_t$ on $w_{t-i}$. Note that such a policy can be executed because $w_{t-1}$ is perfectly determined on the specification of $x_t$ as $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$. It shall be established in later sections that such a policy class can approximate any linear policy with a strongly stable matrix in terms of the total cost suffered.
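A small sketch may help make the execution of such a policy concrete. It assumes the action rule $u_t = -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$ and the recovery rule $w_{t-1} = x_t - A x_{t-1} - B u_{t-1}$; the helper names, the system matrices and the random parameters are hypothetical and serve only as an illustration.

```python
from collections import deque
import numpy as np

def disturbance_action(K, M, x_t, past_w):
    """u_t = -K x_t + sum_{i=1..H} M[i-1] @ w_{t-i}; past_w[i-1] holds w_{t-i}."""
    u = -K @ x_t
    for M_i, w in zip(M, past_w):
        u = u + M_i @ w
    return u

def recover_disturbance(A, B, x_next, x, u):
    """Invert the dynamics: w_t = x_{t+1} - A x_t - B u_t."""
    return x_next - A @ x - B @ u

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 2.0]])
H = 3
M = 0.1 * rng.standard_normal((H, 1, 2))      # M[i] plays the role of M^{[i]}

x = np.zeros(2)
past_w = deque([np.zeros(2)] * H, maxlen=H)   # w_i = 0 before the start
true_w, recovered_w = [], []
for t in range(10):
    u = disturbance_action(K, M, x, past_w)
    w = rng.uniform(-0.1, 0.1, size=2)        # hidden from the controller
    x_next = A @ x + B @ u + w
    w_hat = recover_disturbance(A, B, x_next, x, u)
    past_w.appendleft(w_hat)                  # newest first, oldest dropped
    true_w.append(w)
    recovered_w.append(w_hat)
    x = x_next
```

The controller never sees `w` directly; it only uses `w_hat`, which matches `w` up to floating-point error because the dynamics are known and the state is fully observed.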
4.3 Evolution of State
In this section, we reason about the evolution of the state of a linear dynamical system under a nonstationary policy composed of $T$ policies, where each policy is specified by $M_t = (M_t^{[0]}, \ldots, M_t^{[H-1]})$. Again, with some abuse of notation, we shall use $\pi((M_0, \ldots, M_{T-1}), K)$ to denote such a nonstationary policy.
The following definitions serve to ease the burden of notation.

Define $\tilde{A}_K = A - BK$. $\tilde{A}_K$ shall be helpful in describing the evolution of state starting from a nonzero state in the absence of disturbances.

$x_t^K(M_0, \ldots, M_{t-1})$ is the state attained by the system upon execution of a nonstationary policy $\pi((M_0, \ldots, M_{t-1}), K)$. We drop the arguments $M_i$ and the $K$ from the definition of $x_t$ when it is clear from the context. If the same policy $M$ is used across all time steps, we compress the notation to $x_t^K(M)$. Note that $x_t^K(0)$ refers to running the linear policy $K$ in the standard way.

$\Psi_{t,i}^{K,h}(M_{t-h}, \ldots, M_t)$ is a transfer matrix that describes the effect of $w_{t-i}$ on the state $x_{t+1}$, formally defined below. When the arguments to $\Psi_{t,i}^{K,h}$ are clear from the context, we drop the arguments. When $M$ is the same across all arguments we suppress the notation to $\Psi_{t,i}^{K,h}(M)$.
Definition 4.2.
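Define the disturbance-state transfer matrix $\Psi_{t,i}^{K,h}$ to be
$\Psi_{t,i}^{K,h}(M_{t-h}, \ldots, M_t) = \tilde{A}_K^{i} \mathbf{1}_{i \le h} + \sum_{j=0}^{h} \tilde{A}_K^{j} B M_{t-j}^{[i-j-1]} \mathbf{1}_{i-j \in [1, H]}$.
It will be worthwhile to note that $\Psi_{t,i}^{K,h}$ is linear in $(M_{t-h}, \ldots, M_t)$.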
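The structural claim behind this definition, that the state depends on the policy parameters through a linear map plus a term that does not involve them, can be checked numerically without forming any transfer matrix. The sketch below assumes the action rule $u_t = -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$ with $x_0 = 0$ and a single $M$ used at every step; `final_state` and all numerical values are hypothetical.

```python
import numpy as np

def final_state(A, B, K, M, ws):
    """State after len(ws) steps of u_t = -K x_t + sum_i M[i-1] w_{t-i}, x_0 = 0."""
    x = np.zeros(A.shape[0])
    for t in range(len(ws)):
        u = -K @ x
        for i in range(1, min(len(M), t) + 1):   # disturbances before time 0 are zero
            u = u + M[i - 1] @ ws[t - i]
        x = A @ x + B @ u + ws[t]
    return x

rng = np.random.default_rng(1)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 2.0]])
ws = rng.uniform(-0.1, 0.1, size=(20, 2))
M1 = rng.standard_normal((3, 1, 2))
M2 = rng.standard_normal((3, 1, 2))
alpha = 0.3

# Mixing the parameters M mixes the resulting states in the same proportion.
mixed = final_state(A, B, K, alpha * M1 + (1 - alpha) * M2, ws)
combo = (alpha * final_state(A, B, K, M1, ws)
         + (1 - alpha) * final_state(A, B, K, M2, ws))
gap_M = float(np.linalg.norm(mixed - combo))

# The same experiment over the gain K of a plain linear controller (M = 0).
K2 = np.array([[0.5, 1.5]])
Z = np.zeros((3, 1, 2))
mixed_K = final_state(A, B, alpha * K + (1 - alpha) * K2, Z, ws)
combo_K = (alpha * final_state(A, B, K, Z, ws)
           + (1 - alpha) * final_state(A, B, K2, Z, ws))
gap_K = float(np.linalg.norm(mixed_K - combo_K))
```

`gap_M` is zero up to rounding, whereas `gap_K` is generally not, since the state is a high-degree polynomial in $K$. This is the property that makes a convex cost of the state convex in $M$.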
Lemma 4.3.
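If $u_t$ is chosen as a nonstationary policy $\pi((M_0, \ldots, M_T), K)$ recommends, then the state sequence is governed as follows:
(4.1) $x_{t+1} = \sum_{i=0}^{H+t} \Psi_{t,i}^{K,t}(M_0, \ldots, M_t) \, w_{t-i}$,
which can equivalently be written as
(4.2) $x_{t+1} = \tilde{A}_K^{h+1} x_{t-h} + \sum_{i=0}^{H+h} \Psi_{t,i}^{K,h}(M_{t-h}, \ldots, M_t) \, w_{t-i}$.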
4.4 Idealized Setting
Note that the counterfactual nature of regret in the control setting implies that the loss at a time step $t$ depends on all the choices made in the past. To deal with this efficiently, we propose that our optimization problem only consider the effect of the past $H$ steps while planning, forgetting about the state the system was in at time $t - H$. We will show later that the above scheme tracks the true cost suffered up to a small additional loss. To formally define this idea, we need the following definition of an ideal state.
Definition 4.4 (Ideal State & Action).
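Define an ideal state $y_{t+1}^K(M_{t-H}, \ldots, M_t)$ which is the state the system would have reached if it played the nonstationary policy $(M_{t-H}, \ldots, M_t)$ at all time steps from $t - H$ to $t$, assuming the state at $t - H$ is $0$. Similarly, define $v_{t+1}^K(M_{t-H}, \ldots, M_{t+1})$ to be an idealized action that would have been executed at time $t + 1$ if the state observed at time $t + 1$ is $y_{t+1}^K(M_{t-H}, \ldots, M_t)$. Formally,
$y_{t+1}^K(M_{t-H}, \ldots, M_t) = \sum_{i=0}^{2H} \Psi_{t,i}^{K,H}(M_{t-H}, \ldots, M_t) \, w_{t-i}$,
$v_{t+1}^K(M_{t-H}, \ldots, M_{t+1}) = -K y_{t+1}^K(M_{t-H}, \ldots, M_t) + \sum_{i=1}^{H} M_{t+1}^{[i-1]} w_{t+1-i}$.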
We can now consider the loss of the ideal state and the ideal action.
Definition 4.5 (Ideal Cost).
Define the idealized cost function $f_t$ to be the cost associated with the idealized state and idealized action, i.e.,
$f_t(M_{t-H-1}, \ldots, M_t) = c_t\big(y_t^K(M_{t-H-1}, \ldots, M_{t-1}), \, v_t^K(M_{t-H-1}, \ldots, M_t)\big)$.
The linearity of $y_t$ in past controllers and the linearity of $v_t$ in its immediate state imply that $f_t$ is a convex function of a linear transformation of $M_{t-H-1}, \ldots, M_t$ and hence convex in $M_{t-H-1}, \ldots, M_t$. This renders it amenable to algorithms for online convex optimization.
In Theorem 5.3 we show that $f_t$ and $c_t$ are close on any sequence of policies, and this reduction allows us to consider only the truncated $f_t$ while planning, which allows for efficiency. The precise notion of minimizing regret against such truncated functions was considered before in the online learning literature [4] as online convex optimization (OCO) with memory. We present an overview of this framework next.
4.5 OCO with Memory
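We now present an overview of the online convex optimization (OCO) with memory framework, as established by [4]. In particular, we consider the setting where, for every $t$, an online player chooses some point $x_t \in \mathcal{K} \subset \mathbb{R}^d$, a loss function $f_t : \mathcal{K}^{H+1} \to \mathbb{R}$ is revealed, and the learner suffers a loss of $f_t(x_{t-H}, \ldots, x_t)$. We assume a certain coordinate-wise Lipschitz regularity on $f_t$ of the form such that, for any $j \in \{0, \ldots, H\}$, for any $x_0, \ldots, x_H, \tilde{x}_j \in \mathcal{K}$,
(4.3) $|f_t(x_0, \ldots, x_j, \ldots, x_H) - f_t(x_0, \ldots, \tilde{x}_j, \ldots, x_H)| \le L \|x_j - \tilde{x}_j\|$.
In addition, we define $\tilde{f}_t(x) = f_t(x, \ldots, x)$, and we let
(4.4) $G_f = \sup_{t, \, x \in \mathcal{K}} \|\nabla \tilde{f}_t(x)\|, \quad D = \sup_{x, y \in \mathcal{K}} \|x - y\|$.
The resulting goal is to minimize the policy regret [5], which is defined as
$\mathrm{PolicyRegret} = \sum_{t=H}^{T} f_t(x_{t-H}, \ldots, x_t) - \min_{x \in \mathcal{K}} \sum_{t=H}^{T} f_t(x, \ldots, x)$.
As shown by [4], by running a memory-based OGD, we may bound the policy regret by the following theorem.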
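The update behind memory-based OGD is short enough to sketch. The version below is an illustration under stated assumptions rather than the exact procedure of [4]: it takes a projected gradient step on the unary function $\tilde{f}_t(x) = f_t(x, \ldots, x)$ while the loss actually suffered depends on several past iterates. The toy loss with memory length one, the interval constraint set, the step size and the name `ogd_with_memory` are all hypothetical.

```python
import numpy as np

def ogd_with_memory(grad_unary, project, x0, eta, T):
    """Projected gradient steps on the unary losses f~_t(x) = f_t(x, ..., x).

    grad_unary(t, x) returns the gradient of f~_t at x; project maps a point
    back into the constraint set. Returns the iterates x_0, ..., x_T.
    """
    iterates = [np.asarray(x0, dtype=float)]
    for t in range(T):
        x = iterates[-1]
        iterates.append(project(x - eta * grad_unary(t, x)))
    return iterates

# Toy loss with memory: it penalizes distance to a target and movement
# between consecutive decisions.
target = np.array([0.5])

def loss(x_prev, x_cur):
    return float(np.sum((x_cur - target) ** 2) + np.sum((x_cur - x_prev) ** 2))

grad_unary = lambda t, x: 2.0 * (x - target)   # gradient of loss(x, x) in x
project = lambda x: np.clip(x, -1.0, 1.0)      # constraint set is [-1, 1]

iterates = ogd_with_memory(grad_unary, project, x0=[0.0], eta=0.1, T=100)
total_loss = sum(loss(iterates[t - 1], iterates[t])
                 for t in range(1, len(iterates)))
```

Because consecutive iterates differ by at most the step size times the gradient norm, the loss with memory stays close to the unary loss at the current iterate, which is where the Lipschitz constant of (4.3) enters the analysis.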
5 Algorithm & Main Result
Algorithm 1 describes our proposed algorithm for controlling linear dynamical systems with adversarial disturbances which at all times maintains a disturbance-action controller. The algorithm implements the memory-based OGD on the loss $f_t$ as described in the previous section. The algorithm requires the specification of a strongly stable matrix $K$ once before the online game. Such a matrix can be obtained offline using an SDP relaxation as described in [9]. The following theorem states the regret bound Algorithm 1 guarantees.
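The loop described in this paragraph can be sketched as follows. This is a hedged illustration, not the paper's Algorithm 1: it assumes a time-invariant quadratic cost, replaces the analytic gradient with central finite differences, and uses a simplified projection that clips each $M^{[i]}$ to a fixed Frobenius-norm `bound` instead of the paper's constraint set. All function names and numerical values are hypothetical.

```python
import numpy as np

def ideal_cost(M, A, B, K, w_hist, cost):
    """Cost of the idealized state and action when M is played throughout.

    w_hist holds the last 2H+1 disturbances, oldest first. The state is reset
    to zero H+1 steps ago, so only recent disturbances matter.
    """
    H, n = len(M), len(w_hist)
    y = np.zeros(A.shape[0])
    for k in range(H, n):
        u = -K @ y + sum(M[i - 1] @ w_hist[k - i] for i in range(1, H + 1))
        y = A @ y + B @ u + w_hist[k]
    v = -K @ y + sum(M[i - 1] @ w_hist[n - i] for i in range(1, H + 1))
    return cost(y, v)

def numerical_grad(f, M, eps=1e-5):
    """Central finite differences; a stand-in for the analytic gradient."""
    g = np.zeros_like(M)
    for idx in np.ndindex(M.shape):
        E = np.zeros_like(M)
        E[idx] = eps
        g[idx] = (f(M + E) - f(M - E)) / (2 * eps)
    return g

def project(M, bound):
    """Simplified projection: rescale each M[i] to Frobenius norm <= bound."""
    out = M.copy()
    for i in range(len(out)):
        nrm = np.linalg.norm(out[i])
        if nrm > bound:
            out[i] *= bound / nrm
    return out

def run_controller(A, B, K, ws_true, cost, H, eta, bound):
    d_x, d_u = B.shape
    M = np.zeros((H, d_u, d_x))
    x = np.zeros(d_x)
    w_hist = [np.zeros(d_x)] * (2 * H + 1)      # oldest first, zeros before start
    total = 0.0
    for w in ws_true:
        u = -K @ x + sum(M[i - 1] @ w_hist[-i] for i in range(1, H + 1))
        total += cost(x, u)
        x_next = A @ x + B @ u + w
        f = lambda M_: ideal_cost(M_, A, B, K, w_hist, cost)
        M = project(M - eta * numerical_grad(f, M), bound)   # OGD step on f_t
        w_hist = w_hist[1:] + [x_next - A @ x - B @ u]       # recovered w_t
        x = x_next
    return total, M

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 2.0]])
ws_true = np.tile(np.array([0.0, 0.1]), (100, 1))
cost = lambda y, v: float(y @ y + v @ v)
total, M = run_controller(A, B, K, ws_true, cost, H=3, eta=0.01, bound=5.0)
```

Finite differences keep the sketch dependency-free; in practice one would differentiate the idealized cost analytically or with automatic differentiation, since it is a convex function composed with a linear map of $M$.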
Theorem 5.1 (Main Theorem).
Proof of Theorem 5.1.
Note that by the definition of the algorithm we have that all , where
Let be defined as
Let $K^*$ be the optimal linear policy in hindsight. By definition $K^*$ is a strongly stable matrix. Using Lemma 5.2 and Theorem 5.3, we have that
(5.1)  
(5.2) 
Let be the sequence of policies played by the algorithm. Note that by definition of the constraint set , we have that
Using Theorem 5.3 we have that
(5.3) 
5.1 Sufficiency of Disturbance-Action Policies
The class of policies described in Definition 4.1 is powerful enough in its representational capacity to capture any fixed linear policy. Lemma 5.2 establishes this equivalence in terms of the state and action sequence each policy produces.
Lemma 5.2 (Sufficiency).
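For any two strongly stable matrices $K^*, K$, there exists a policy $\pi(M_*, K)$, with $M_* = (M_*^{[0]}, \ldots, M_*^{[H-1]})$ defined as
$M_*^{[i]} = (K - K^*)(A - BK^*)^{i}$,
such that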
(5.5) 
Proof of Lemma 5.2.
By definition we have that
Consider the following calculation for with and for any . We have that
The final equality follows as the sum telescopes. Therefore, we have that
From the above we get that
(5.6) 
where the last inequality follows from using Lemma 5.4 and using the fact that .
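The mechanism behind the lemma can be verified numerically in the untruncated case. The sketch below assumes the choice $M_*^{[i]} = (K - K^*)(A - BK^*)^{i}$ and takes $H$ equal to the number of steps, in which case the disturbance-action policy built on $K$ reproduces the trajectory of the linear controller $K^*$ exactly. With a shorter $H$ the two trajectories differ by terms that decay geometrically, which is what the lemma quantifies. The system, both gains and the helper `rollout` are hypothetical.

```python
import numpy as np

def rollout(A, B, K, M, ws):
    """States and actions under u_t = -K x_t + sum_i M[i-1] w_{t-i}, x_0 = 0."""
    x = np.zeros(A.shape[0])
    xs, us = [], []
    for t in range(len(ws)):
        u = -K @ x
        for i in range(1, min(len(M), t) + 1):   # disturbances before time 0 are zero
            u = u + M[i - 1] @ ws[t - i]
        xs.append(x)
        us.append(u)
        x = A @ x + B @ u + ws[t]
    return np.array(xs), np.array(us)

rng = np.random.default_rng(2)
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 2.0]])        # fixed stable gain used inside the policy
K_star = np.array([[2.0, 3.0]])   # comparator; A - B K* has eigenvalues 0.8, 0.9
T = 30
ws = rng.uniform(-0.1, 0.1, size=(T, 2))

# M_*^{[i]} = (K - K*) (A - B K*)^i with H = T, i.e. no truncation.
A_star = A - B @ K_star
M_star = np.array([(K - K_star) @ np.linalg.matrix_power(A_star, i)
                   for i in range(T)])

xs_da, us_da = rollout(A, B, K, M_star, ws)                     # disturbance-action
xs_lin, us_lin = rollout(A, B, K_star, np.zeros((0, 1, 2)), ws)  # plain u = -K* x
```

The agreement follows because, under $K^*$, the state is $x_t = \sum_{i \ge 1} (A - BK^*)^{i-1} w_{t-i}$, so the correction term $\sum_i M_*^{[i-1]} w_{t-i}$ equals $(K - K^*) x_t$ and the action collapses to $-K^* x_t$.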
5.2 Approximation Theorems
The following theorem relates the idealized cost $f_t$ to the actual cost $c_t$.
Theorem 5.3.
For any strongly stable , any number and any sequence of policies satisfying , if the perturbations are bounded by , we have that
(5.8) 
where
Before giving the proof of the above theorem, we will need a few lemmas which will be useful.
Lemma 5.4.
Let be a strongly stable matrix, be any number and be a sequence such that for all , we have , then we have that for all
Proof of Lemma 5.4.
The proof follows by noticing that
where the second and the third inequalities follow by using the fact that is a strongly stable matrix and the conditions on the spectral norm of . ∎
We now derive a bound on the norm of each of the states.
Lemma 5.5.
Suppose the system satisfies Assumption 3.1 and let be a sequence such that for all , we have that for a number . Define
Further suppose is a strongly stable matrix. We have that for all
Proof of Lemma 5.5.
Using the definition of we have that
The above recurrence can be seen to easily satisfy the following upper bound.
(5.9) 
A similar bound can easily be established for
(5.10) 
It is also easy to see via the definitions that
(5.11) 
Finally, we prove Theorem 5.3.
5.3 Bounding the properties of the OCO game with Memory
5.3.1 Bounding the Lipschitz Constant
Lemma 5.6.
Consider two policy sequences and which differ in exactly one policy played at a time step for . Then we have that
Proof of Lemma 5.6.
For the rest of the proof, we will denote as and as . Similarly define and . It follows immediately from the definitions that