1 Introduction
Classical control theory assumes that nature evolves according to well-specified dynamics perturbed by i.i.d. noise. While this approximation has proven very useful for controlling some real-world systems, it does not allow for the construction of truly robust controllers. The focus of this paper is the construction of truly robust controllers even when the underlying system is unknown.
Specifically, we consider the case in which the underlying system is linear, but has potentially adversarial perturbations (that can model deviations from linearity), i.e.
$x_{t+1} = A x_t + B u_t + w_t$, (1)
where $x_t$ is the (observed) system state, $u_t$ is a learner-chosen control and $w_t$ is an adversarial disturbance. The goal of the controller is to minimize a sum of sequentially revealed adversarial cost functions $c_t(x_t, u_t)$.
The goal in this game-theoretic setting is to minimize policy regret, or the regret compared to the best controller, from a class $\Pi$, that is made aware of the system dynamics, the cost sequence, and all the disturbances ahead of time:
$\text{Regret} = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^{\pi}, u_t^{\pi})$.
Note that the cost of the benchmark is measured on the counterfactual state-action sequence $(x_t^{\pi}, u_t^{\pi})$ that the policy under consideration visits, as opposed to the state sequence visited by the learner. In contrast to the worst-case optimality of robust control, achieving low regret demands stronger promises of instance-wise (near) optimality on every perturbation sequence.
Nonstochastic Control: Without knowledge of the system matrices $(A, B)$ and the adversarial perturbations $w_t$, iteratively generate controls $u_t$ to minimize regret over sequentially revealed adversarial convex costs $c_t$.
The above constitutes a powerful (nonstochastic) adversarial generalization of stochastic control.
Our starting point is the recent work of [ABH+19a], where the authors proposed a novel class of policies, namely choosing actions as a linear combination of past disturbances, $u_t = \sum_{i} M_i w_{t-i}$. They demonstrate that learning the coefficients $M_i$, via online convex optimization, allows their controller to compete with the class of all linear policies in the state. This latter class is important, since it is known to be optimal for the standard setting of normal i.i.d. noise and quadratic loss functions, also known as the Linear Quadratic Regulator (LQR), and associated robust control settings (see [BB08] for examples).
The caveat in [ABH+19a] is that the system matrices need to be known. In the case of a known system, the disturbances can be simply computed via observations of the state, i.e. $w_t = x_{t+1} - A x_t - B u_t$. However, if the system is unknown, it is not clear how to generalize their approach. Fundamentally, the central difficulty lies in identifying the system, i.e. the matrices $A, B$, from the observations. This is non-trivial since the noise is assumed to be adversarial, and was posed as a question in [TU19].
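As a minimal sketch of the known-system case, the disturbances can be read off from a trajectory by rearranging the dynamics (1). The function name, the array layout and the toy matrices below are illustrative assumptions, not part of [ABH+19a].

```python
import numpy as np

def recover_disturbances(A, B, xs, us):
    """Return w_t = x_{t+1} - A x_t - B u_t for t = 0..T-1.

    xs has shape (T+1, state_dim) and us has shape (T, control_dim).
    """
    xs, us = np.asarray(xs), np.asarray(us)
    return xs[1:] - xs[:-1] @ A.T - us @ B.T

# Toy system (hypothetical values chosen only for illustration).
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
T = 50
# A deterministic, non-stochastic disturbance sequence of shape (T, 2).
w_true = np.sin(0.3 * np.arange(T))[:, None] * np.array([1.0, -0.5])
us = rng.standard_normal((T, 1))
xs = np.zeros((T + 1, 2))
for t in range(T):
    xs[t + 1] = A @ xs[t] + B @ us[t] + w_true[t]

w_hat = recover_disturbances(A, B, xs, us)
```

With exact knowledge of $A$ and $B$ the recovery is exact up to floating-point error; any error in the matrices propagates directly into the recovered disturbances, which is why the unknown-system case needs a separate identification step.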
In this paper we show how to overcome this difficulty and obtain sublinear regret for controlling an unknown system in the presence of adversarial noise and adversarial loss functions. The regret notion we adopt is policy regret against linear policies, exactly as in [ABH+19a]. The crucial component we introduce is adversarial sys-id: an efficient method for uncovering the underlying system even in the presence of adversarial perturbations. This method is not based on the naive least squares approach of regressing $x_{t+1}$ on $(x_t, u_t)$. In particular, without independent, zero-mean $w_t$'s, the latter approach can produce inconsistent estimates of the system matrices.
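The inconsistency of naive least squares can be illustrated on a scalar toy system. The sketch below is an assumption-laden example (scalar dynamics, a constant disturbance, Rademacher inputs, all values hypothetical) and not the paper's construction: a disturbance that is not zero-mean gets absorbed into the estimate of the state coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = 0.5, 1.0
T = 5000
w = np.ones(T)                       # constant, hence non-zero-mean, disturbance
u = rng.choice([-1.0, 1.0], size=T)  # random exploratory inputs
x = np.zeros(T + 1)
for t in range(T):
    x[t + 1] = a_true * x[t] + b_true * u[t] + w[t]

# Naive least squares: regress x_{t+1} on (x_t, u_t), ignoring w_t.
features = np.column_stack([x[:-1], u])
(a_hat, b_hat), *_ = np.linalg.lstsq(features, x[1:], rcond=None)
bias_a = abs(a_hat - a_true)
```

In this example the estimate of the input coefficient stays close to its true value, because the inputs are independent of the disturbance, while the estimate of the state coefficient remains biased no matter how long the trajectory is.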
Informally, the main result is:
Theorem 1 (Informal Statement).
For an unknown linear dynamical system as (1), where are chosen by an oblivious adversary to be bounded in the range , there exists an efficient algorithm that generates an adaptive sequence of controls for which
1.1 Related Work
There has been a resurgence of literature on the control of linear dynamical systems in recent machine learning venues. The case of known systems was extensively studied in the control literature; see the survey [STE94]. Sample complexity and regret bounds for control (under Gaussian noise) were obtained in [AS11, DMM+18, ALS19, MTR19, CKM19]. The works of [ABK14], [CHK+18] and [AHS19b] allow for control of LDS with adversarial loss functions. Provable control in the Gaussian noise setting via the policy gradient method was studied in [FGK+18]. These works operate in the absence of perturbations or assume that the perturbations are i.i.d., as opposed to our adversarial setting.
The most relevant reformulation of the control problem that enables our result is the recent work of [ABH+19a], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations. However, the result and the algorithm make extensive use of the availability of the system matrices.
Recently [SBR19] showed how to use least squares to learn an underlying Markov operator in lieu of the system and in the presence of noise. It is possible that their recovery technique can also be used to generate perturbation estimates, and then apply the techniques of [ABH+19a] for control. However, the conditions on the system they assume are even more general than ours, and it is not clear if they are sufficient for control. For system identification in the stochastic noise setting, [OO19] prove sample complexity bounds for the Ho-Kalman algorithm [HK66]. A stronger result that holds under partial observability was shown in [SRD19]. While these results apply to stochastic noise, parameter recovery in the setting of adversarial noise was recently studied in the contextual bandits literature [KWS18]. Other relevant work from the machine learning literature includes the technique of spectral filtering for learning and open-loop control of partially observable systems [HSZ17, AHL+18, HLS+18].
2 Problem Setting
Linear Dynamical Systems
We consider the setting of linear dynamical systems with time-invariant dynamics, i.e.
$x_{t+1} = A x_t + B u_t + w_t$,
where $x_t$ is the state and $u_t$ is the control. The perturbation sequence $w_t$ may be adversarially chosen at the beginning of the interaction, and is unknown to the learner. Likewise, the system is augmented with time-varying convex cost functions $c_t(x, u)$. The total cost associated with a sequence of (random) controls, derived through an algorithm $\mathcal{A}$, is
$J(\mathcal{A}) = \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right]$.
With some abuse of notation, we will denote by the cost associated with the execution of controls as a linear controller would suggest, ie. . The following conditions are assumed on the cost and the perturbations.
Assumption 2.
The perturbation sequence is bounded, i.e. $\|w_t\| \le W$, and chosen at the start of the interaction, implying that this sequence $w_t$ does not depend on the choice of $u_t$.
Assumption 3.
As long as , the convex costs admit .
The fundamental Linear Quadratic Regulator problem is a specialization of the above to the case when the perturbations are i.i.d. Gaussian and the cost functions are positive quadratics, i.e. $c_t(x, u) = x^\top Q x + u^\top R u$.
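To make the cost of a linear controller concrete, the following sketch rolls out the dynamics under a fixed linear controller with quadratic costs. The sign convention $u_t = -K x_t$, the function name and the double-integrator example are assumptions made for illustration; the disturbance here is a fixed deterministic sequence rather than Gaussian noise.

```python
import numpy as np

def linear_controller_cost(A, B, K, ws, Q, R, x0=None):
    """Total quadratic cost of playing u_t = -K x_t against disturbances ws.

    The sign convention u_t = -K x_t is an assumption of this sketch.
    """
    x = np.zeros(A.shape[0]) if x0 is None else np.asarray(x0, dtype=float)
    total = 0.0
    for w in ws:
        u = -K @ x
        total += x @ Q @ x + u @ R @ u   # c_t(x, u) = x'Qx + u'Ru
        x = A @ x + B @ u + w            # x_{t+1} = A x_t + B u_t + w_t
    return total

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # double integrator (illustrative)
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
ws = 0.1 * np.ones((100, 2))             # constant, non-stochastic disturbance
K_stable = np.array([[0.5, 1.0]])        # A - B K has spectral radius below 1
cost_stable = linear_controller_cost(A, B, K_stable, ws, Q, R)
cost_open_loop = linear_controller_cost(A, B, np.zeros((1, 2)), ws, Q, R)
cost_no_noise = linear_controller_cost(A, B, K_stable, np.zeros((100, 2)), Q, R)
```

On this example the stabilizing gain keeps the state bounded, whereas the zero controller lets the constant disturbance accumulate, so its cost is far larger.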
Objective
We consider the setting where the learner has no knowledge of $A$, $B$ and the perturbation sequence $w_t$. In this case, any inference of these quantities may only take place indirectly through the observation of the state $x_t$. Furthermore, the learner is made aware of the cost function $c_t$ only once the choice of $u_t$ has been made.
Under such constraints, the objective of the algorithm is to choose an (adaptive) sequence of controls that ensures that the cost suffered in this manner is comparable to that of the best choice of a linear controller with complete knowledge of the system dynamics $A, B$ and foreknowledge of the cost and perturbation sequences $\{c_t, w_t\}$. Formally, with $J(\cdot)$ denoting the total cost, we measure regret as
$\text{Regret} = J(\mathcal{A}) - \min_{K \in \mathcal{K}} J(K)$.
Here $\mathcal{K}$ is the set of strongly stable linear controllers defined below. The notion of strong stability, introduced in [CHK+18], offers a quantification of the classical notion of a stable controller in a manner that permits a discussion of non-asymptotic regret bounds.
Definition 4 (Strong Stability).
A linear controller is strongly stable for a linear dynamical system specified via if there exists a decomposition of with , and .
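A certificate of strong stability can be verified numerically. The sketch below assumes the form commonly used following [CHK+18]: a closed loop $A - BK = Q L Q^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|A\|, \|B\|, \|K\|, \|Q\|, \|Q^{-1}\| \le \kappa$. The symbols $Q$, $\kappa$, $\gamma$, the sign convention and the function name are assumptions of this example.

```python
import numpy as np

def is_strongly_stable(A, B, K, Q, kappa, gamma):
    """Check a candidate (kappa, gamma) certificate Q for the closed loop A - B K.

    Assumed conditions: ||L|| <= 1 - gamma with L = Q^{-1} (A - B K) Q,
    and ||A||, ||B||, ||K||, ||Q||, ||Q^{-1}|| <= kappa (spectral norms).
    """
    Q_inv = np.linalg.inv(Q)
    L = Q_inv @ (A - B @ K) @ Q
    op = lambda M: np.linalg.norm(M, 2)  # largest singular value
    norms_ok = all(op(M) <= kappa for M in (A, B, K, Q, Q_inv))
    return bool(norms_ok and op(L) <= 1 - gamma)

# Hypothetical example: one unstable mode pulled inside the unit circle by K.
A = np.array([[1.2, 0.0], [0.0, 0.5]])
B = np.eye(2)
K = np.array([[0.7, 0.0], [0.0, 0.0]])
Q = np.eye(2)

ok_loose = is_strongly_stable(A, B, K, Q, kappa=2.0, gamma=0.4)
ok_tight_gamma = is_strongly_stable(A, B, K, Q, kappa=2.0, gamma=0.6)
ok_small_kappa = is_strongly_stable(A, B, K, Q, kappa=1.0, gamma=0.4)
```

The check only validates a given certificate; finding a good $Q$ is a separate problem, and an eigenvector basis of the closed loop is one possible candidate when it is well conditioned.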
We also assume the learner has access to a fixed stabilizing controller for the transition matrices . When operating under unknown transition matrices, the knowledge of a stabilizing controller permits the learner to prevent an inflation of the size of the state beyond reasonable bounds.
Assumption 5.
The learner knows a linear controller that is strongly stable for the true, but unknown, transition matrices .
The non-triviality of the regret guarantee rests on the benchmark set not being empty. As noted in [CHK+18], a sufficient condition to ensure the existence of a strongly stable controller is the controllability of the linear system $(A, B)$. Informally, controllability for a linear system is characterized by the ability to drive the system to any desired state through appropriate control inputs in the presence of deterministic dynamics, i.e. $x_{t+1} = A x_t + B u_t$.
Definition 6 (Strong Controllability).
For a linear dynamical system $(A, B)$, define, for $k \ge 1$, a matrix $C_k$ as
$C_k = [B, AB, A^2 B, \dots, A^{k-1} B]$.
A linear dynamical system $(A, B)$ is controllable with controllability index $k$ if $C_k$ has full row-rank. In addition, such a system is also strongly controllable if .
As with stability, a quantitative analog of controllability first suggested in [CHK+18] is presented above. It is useful to note that, as a consequence of the Cayley-Hamilton theorem, for a controllable system the controllability index is always at most the dimension of the state space. We adopt the assumption that the system is strongly controllable.
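The controllability index can be computed directly from the definition by stacking the blocks $B, AB, \dots, A^{k-1}B$ and testing for full row rank. The function names and the two toy systems below are illustrative assumptions.

```python
import numpy as np

def controllability_matrix(A, B, k):
    """Return the block matrix [B, AB, ..., A^{k-1} B]."""
    blocks, block = [], B
    for _ in range(k):
        blocks.append(block)
        block = A @ block
    return np.hstack(blocks)

def controllability_index(A, B):
    """Smallest k <= dim(state) giving full row rank, or None if uncontrollable.

    The search stops at the state dimension, as justified by Cayley-Hamilton.
    """
    dim = A.shape[0]
    for k in range(1, dim + 1):
        if np.linalg.matrix_rank(controllability_matrix(A, B, k)) == dim:
            return k
    return None

# Double integrator: a single input reaches both coordinates in two steps.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
index_double_integrator = controllability_index(A, B)

# The input never influences the second coordinate, so no k works.
index_uncontrollable = controllability_index(np.eye(2), np.array([[1.0], [0.0]]))
```

Rank tests in floating point depend on a tolerance, so for nearly uncontrollable systems the singular values of the stacked matrix are a more informative diagnostic than the rank alone.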
Assumption 7.
The linear dynamical system is strongly controllable.
3 Preliminaries
This section sets up the concepts that aid the algorithmic description and the analysis.
3.1 Parameterization of the Controller
The total cost objective of a linear controller is non-convex in the canonical parameterization [FGK+18], i.e. $J(K)$ is not convex in $K$. To remedy this, we use an alternative perturbation-based parameterization for the controller, recently proposed in [ABH+19a], where the advised control is linear in the past perturbations (as opposed to the state). This permits the offline search for an optimal controller to be posed as a convex program.
Definition 8.
A perturbation-based policy chooses the recommended action $u_t$ at a state $x_t$ as
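A minimal sketch of such a policy follows, assuming the form used in [ABH+19a]: a fixed stabilizing feedback term plus a linear combination of the last $H$ perturbations, $u_t = -K x_t + \sum_{i=1}^{H} M^{[i-1]} w_{t-i}$. The sign of the feedback term, the indexing of the matrices and the zero-padding of missing history are assumptions of this example.

```python
import numpy as np

def perturbation_policy_action(K, M, x_t, past_w):
    """Compute u_t = -K x_t + sum_{i=1}^{H} M[i-1] @ w_{t-i}.

    M has shape (H, control_dim, state_dim); past_w is a sequence ordered
    oldest to newest, so past_w[-i] is w_{t-i}. History shorter than H is
    treated as zero (an assumption of this sketch).
    """
    u = -K @ x_t
    for i in range(1, min(len(M), len(past_w)) + 1):
        u = u + M[i - 1] @ past_w[-i]
    return u

K = np.array([[1.0, 1.0]])
M = np.array([[[1.0, 0.0]], [[0.0, 2.0]]])                 # H = 2
x_t = np.array([1.0, 1.0])
past_w = [np.array([3.0, 4.0]), np.array([5.0, 6.0])]      # w_{t-2}, w_{t-1}

u = perturbation_policy_action(K, M, x_t, past_w)
u_zero = perturbation_policy_action(K, np.zeros_like(M), x_t, past_w)
u_double = perturbation_policy_action(K, 2 * M, x_t, past_w)
```

For a fixed state and disturbance history the action is affine in the parameters $M$, which is the property that makes the search over such policies convex.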
3.2 State Evolution
Under the execution of a stationary policy, the state may be expressed as a linear transformation of the perturbations, where the linear transformation itself is linear in the parameterizing matrices. We set up this notation below.
Definition 9.
For a matrix pair , define the state-perturbation transfer matrix:
In [ABH+19a] the authors note that, under the linear dynamical system with perturbations , the state produced by this policy evolves as specified below. Of particular importance in this context is the observation that ’s are linear in ’s.
Following [ABH+19a], we adopt the definition of the surrogate setting.
Definition 10.
Define the surrogate state and the surrogate action as stated below. The surrogate cost is chosen to be the specialization of the $t$-th cost function with the surrogate state-action pair as the argument.
4 The Algorithm
Our approach follows the explore-then-commit paradigm, identifying the underlying deterministic-equivalent dynamics to within a specified accuracy using random inputs in the exploration phase. Such an approximate recovery of parameters permits an approximate recovery of the perturbations, thus facilitating the execution of the perturbation-based controller on the approximated perturbations.
Theorem 11.
Under the assumptions listed in Section 2, the regret incurred by Algorithm 1 admits the upper bound stated below (see Footnote 1). In particular, this is the case when , , .
Footnote 1: The unnatural scaling in occurs because in the analysis, we (sometimes) take . Also, the magnitude of is not tuned to produce the optimal dependence in parameters other than .
5 Regret Analysis
To present the proof concisely, we set up a few articles of use. For a generic algorithm operating on a generic linear dynamical system specified via a matrix pair and perturbations , let

be the cost of executing , as incurred on the last time steps,

be the state achieved at time step , and

be the control executed at time step .
We also note the following result from [ABH+19a], which applies to the case when the matrices that govern the underlying dynamics are made known to the algorithm. In such a case, an exact inference of $w_t$ is possible.
Theorem 12 ([ABH+19a]).
Proof of Theorem 11.
Define . Let be the contribution to the regret associated with the first rounds of exploration. By Lemma 19, we have that
Let , in the arguments below.
From this point on, let be the algorithm, from [ABH+19a], executed in Phase 2. By the Simulation Lemma (Lemma 13), we have
Regret  
The middle term in the regret can be upper bounded by the regret of algorithm on the fictional system and the perturbation sequence . Before we can invoke Theorem 12, observe that

By the Preservation of Stability Lemma (Lemma 14), is strongly stable on , as long as ,

Lemma 17 ensures that the iterates produced by the execution of Algorithm satisfy , as long as .
With the above observations in place, Theorem 12 guarantees
The last expression in the preceding line can be bounded as the Stability of Value Function Lemma (Lemma 15) indicates, while constraining our choices as , and , and observing that
The last line follows by Theorem 18, and suggests the optimal (by this analysis) choice of . Apart from this proof, all other proofs and statements list exact upper bounds on the polynomial factors. Here they have been omitted for ease of presentation. ∎
The regret-minimizing algorithm, Phase 2 of Algorithm 1, chooses so as to optimize for the cost of the perturbation-based controller in a fictional linear dynamical system described via the matrix pair and the perturbation sequence . The following lemma shows that the choice of ensures that the state-control sequence visited by Algorithm 1 coincides with the sequence visited by the regret-minimizing algorithm in the fictional system.
Lemma 13 (Simulation Lemma).
Proof.
This proof follows by induction on . Note that at the start of , it is fed the initial state by the choice of . Say that for some , it happens that the inductive hypothesis is true. Consequently
This, in turn, implies by choice of that
The claim follows. ∎
The lemma stated below guarantees that the strong stability of is approximately preserved under small deviations of the system matrices.
Lemma 14 (Preservation of Stability).
If a linear controller is strongly stable for a linear dynamical system , ie.
then is strongly stable for a linear dynamical system , ie.
as long as . Furthermore, when in agreement with the said conditions, the transforming matrices that certify strong stability in both these cases coincide, and the transformed matrices obey .
Proof.
Let with , . Now
It suffices to note that . ∎
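The mechanism behind this lemma can be checked numerically. Writing the nominal closed loop as $Q L Q^{-1}$ and conjugating the perturbed closed loop by the same $Q$, the triangle inequality and submultiplicativity give $\|\hat{L}\| \le \|L\| + \|Q^{-1}\| \|Q\| (\|\hat{A} - A\| + \|\hat{B} - B\| \|K\|)$. The sketch below verifies this generic inequality on random matrices; the symbols, the sign convention $A - BK$ and the constants are assumptions, and the lemma's exact constants are not reproduced here.

```python
import numpy as np

def transformed_norm_bound(A, B, A_hat, B_hat, K, Q):
    """Return (||L_hat||, ||L|| + slack) with both closed loops conjugated by Q."""
    op = lambda M: np.linalg.norm(M, 2)
    Q_inv = np.linalg.inv(Q)
    L = Q_inv @ (A - B @ K) @ Q
    L_hat = Q_inv @ (A_hat - B_hat @ K) @ Q
    slack = op(Q_inv) * op(Q) * (op(A_hat - A) + op(B_hat - B) * op(K))
    return op(L_hat), op(L) + slack

rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((3, 3))
B = rng.standard_normal((3, 2))
K = 0.1 * rng.standard_normal((2, 3))
Q = np.eye(3) + 0.1 * rng.standard_normal((3, 3))   # a well-conditioned transform
A_hat = A + 0.01 * rng.standard_normal((3, 3))      # small estimation errors
B_hat = B + 0.01 * rng.standard_normal((3, 2))

lhs, rhs = transformed_norm_bound(A, B, A_hat, B_hat, K, Q)
```

The slack term shrinks linearly with the estimation error, so a sufficiently accurate estimate leaves a strictly positive stability margin under the same transforming matrix.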
The next lemma establishes that if the same linear state-feedback policy is executed on the actual and the fictional linear dynamical system, the difference between the costs incurred in the two scenarios varies proportionally with some measure of distance between the two systems.
Lemma 15 (Stability of Value Function).
Let , . As long as it happens that and , it is true that
for any strongly stable controller with respect to .
Proof.
Under the action of a linear controller , which is strongly stable for , it may be verified that
Now, if , note that . By the Preservation of Stability Lemma (Lemma 14), is strongly stable for . Consequently, .
Finally, note
The last inequality follows from the addendum attached to Lemma 14. Lastly, observe, as a consequence of Lemma 16 and Lemma 14, that
∎
Lemma 16.
For any matrix pair , such that , we have
Proof.
We make the inductive claim that . The truth of this claim for is easily verifiable. Assuming the inductive claim for some , observe
Finally, observe that . ∎
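An inductive argument of this kind can be sanity-checked numerically. The sketch below assumes the standard telescoping bound $\|(L + \Delta)^t - L^t\| \le t (1-\gamma)^{t-1} \|\Delta\|$ whenever $\|L\|, \|L + \Delta\| \le 1 - \gamma$, together with the identity $\sum_{t \ge 1} t (1-\gamma)^{t-1} = 1/\gamma^2$. That this is the precise form of the lemma is an assumption, since the displayed statement is not available here; the matrices are hypothetical.

```python
import numpy as np

def power_gap_check(L, dL, gamma, horizon=200):
    """Check the per-step bound and return (ok, sum of gaps, ||dL|| / gamma^2)."""
    op = lambda M: np.linalg.norm(M, 2)
    assert op(L) <= 1 - gamma and op(L + dL) <= 1 - gamma
    P, P_pert = np.eye(len(L)), np.eye(len(L))      # zeroth powers
    total, termwise_ok = 0.0, True
    for t in range(1, horizon + 1):
        P, P_pert = P @ L, P_pert @ (L + dL)
        gap = op(P_pert - P)
        termwise_ok = termwise_ok and gap <= t * (1 - gamma) ** (t - 1) * op(dL) + 1e-12
        total += gap
    return bool(termwise_ok), total, op(dL) / gamma**2

L = np.array([[0.5, 0.2], [0.0, 0.4]])
dL = 0.05 * np.array([[1.0, -1.0], [1.0, 1.0]])
ok, gap_sum, bound = power_gap_check(L, dL, gamma=0.3)
```

The truncated sum over a finite horizon is a lower bound on the infinite sum, so the check is conservative in the right direction.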
While we have that is bounded by assumption, the next lemma bounds the possible value of . Please see the proof for the exact polynomial coefficients.
Lemma 17.
During the execution of Phase 2 of Algorithm 1, the iterates produced satisfy, as long as , that
Proof.
Consider a generic linear dynamical system that evolves as , where the control is chosen as . Such a choice entails that for any
Note that Phase 2 of Algorithm 1 is a specific instance of this setting where is chosen as . We put forward the (strong) inductive hypothesis that for all , we have
Now, observe that
The base case can be verified via computation. To simplify the expression, choose , to obtain that as long as ,
∎
6 System Identification via Random Inputs
This section details the guarantees afforded by Algorithm 2. Define . The said algorithm attempts to identify the deterministic-equivalent dynamics, i.e. the matrix pair $(A, B)$, by first identifying matrices of the form , and then recovering by solving an associated linear system of equations.
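The two-stage recovery described above can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithm 2: the learner plays $u_t = -K x_t + \eta_t$ with i.i.d. Rademacher $\eta_t$, estimates the moments $N_j \approx (A - BK)^j B$ by averaging $x_{t+j+1} \eta_t^\top$, and then solves a linear system for the closed loop. The sign convention, the `step` interface, the symbols $N_j$, $C_0$, $C_1$ and all numerical values are hypothetical.

```python
import itertools
import numpy as np

def identify_system(step, K, k, T0, rng):
    """Estimate (A, B) from one trajectory driven by u_t = -K x_t + eta_t.

    `step(x, u)` is a hypothetical black box returning the next state.
    """
    n, m = K.shape                                  # control dim, state dim
    x = np.zeros(m)
    xs, etas = [x], []
    for _ in range(T0):
        eta = rng.choice([-1.0, 1.0], size=n)       # Rademacher exploration
        x = step(x, -K @ x + eta)
        xs.append(x)
        etas.append(eta)
    xs, etas = np.array(xs), np.array(etas)
    # N[j] estimates (A - B K)^j B as the empirical mean of x_{t+j+1} eta_t^T.
    span = T0 - k
    N = [xs[j + 1 : j + 1 + span].T @ etas[:span] / span for j in range(k + 1)]
    C0, C1 = np.hstack(N[:k]), np.hstack(N[1:])     # C1 is roughly (A - B K) C0
    B_hat = N[0]
    closed_loop = C1 @ np.linalg.pinv(C0)           # least-squares solve
    return closed_loop + B_hat @ K, B_hat

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.5, 1.0]])                          # stabilizing for (A, B)
clock = itertools.count()

def step(x, u):
    t = next(clock)
    # Oblivious, deterministic disturbance that is not zero-mean over short windows.
    w = 0.3 * np.array([np.sin(0.05 * t), np.cos(0.02 * t)])
    return A @ x + B @ u + w

A_hat, B_hat = identify_system(step, K, k=2, T0=50000, rng=np.random.default_rng(0))
err_A = np.linalg.norm(A_hat - A, 2)
err_B = np.linalg.norm(B_hat - B, 2)
```

The key design choice is that the exploration noise is drawn independently of the disturbances, so cross terms between $\eta_t$ and the disturbance-driven part of the state average out even though the disturbances themselves are not random.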
Theorem 18 (System Recovery).
Under the assumptions listed in Section 2, when Algorithm 2 is run for steps, it is guaranteed that the output pair satisfies, with probability , that
Proof.
The evolution of the state sequence during the identification phase in terms of the control perturbations is stated below. Following this, we state an upper bound on the said sequence.
(2) 
Lemma 19.
The states produced by Algorithm 2 satisfy
Proof.
The strong stability of suffices to establish this claim.
In addition to submultiplicativity of the norm, we use that . ∎
6.1 Step 1: Moment Recovery
The following lemma promises an approximate recovery of ’s through an appeal to arguments involving measures of concentration.
Lemma 20.
Algorithm 2 satisfies for all , with probability or more
Proof.
Let . With Equation 2, the fact that is zero-mean with isotropic unit covariance, and that it is chosen independently of implies .
’s are bounded as