# The Nonstochastic Control Problem

We consider the problem of controlling an unknown linear dynamical system with non-stochastic adversarial perturbations and adversarial convex loss functions. The possibility of sublinear regret in this setting was posed as an open question in [Tu 2019]. We answer this question in the affirmative by giving an efficient algorithm that guarantees regret of O(T^{2/3}). Crucial to our result is a system identification procedure that is provably effective under adversarial perturbations.


## 1 Introduction

Classical control theory assumes that nature evolves according to well-specified dynamics perturbed by i.i.d. noise. While this approximation has proven very useful for controlling some real-world systems, it does not allow for the construction of truly robust controllers. The focus of this paper is the construction of such robust controllers even when the underlying system is unknown.

Specifically, we consider the case in which the underlying system is linear, but has potentially adversarial perturbations (that can model deviations from linearity), i.e.

 x_{t+1} = A x_t + B u_t + w_t, (1)

where x_t is the (observed) system state, u_t is a learner-chosen control, and w_t is an adversarial disturbance. The goal of the controller is to minimize a sum of sequentially revealed adversarial cost functions c_t(x_t, u_t).
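As a concrete illustration, the dynamics of Eq. (1) can be simulated in a few lines. The sketch below uses illustrative matrices and a sign-alternating (hence deliberately non-i.i.d.) disturbance sequence; none of these choices come from the paper:

```python
import numpy as np

def rollout(A, B, controls, disturbances, x0):
    """Roll x_{t+1} = A x_t + B u_t + w_t forward; return all visited states."""
    xs = [x0]
    for u, w in zip(controls, disturbances):
        xs.append(A @ xs[-1] + B @ u + w)
    return xs

n, m, T = 2, 1, 10
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
# A sign-alternating disturbance sequence: bounded, but not zero-mean i.i.d.
ws = [0.1 * ((-1) ** t) * np.ones(n) for t in range(T)]
us = [np.zeros(m) for _ in range(T)]  # zero controls, for illustration
xs = rollout(A, B, us, ws, np.zeros(n))
```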

The goal in this game-theoretic setting is to minimize policy regret, i.e. the regret compared to the best controller from a class Π that is made aware of the system dynamics, the cost sequence, and all the disturbances ahead of time:

 Regret = ∑_t c_t(x_t, u_t) − min_{π∈Π} ∑_t c_t(x_t^π, u_t^π).

It may be noted that the cost of the benchmark is measured on the counterfactual state-action sequence that the benchmark policy would visit, as opposed to the state sequence visited by the learner. In contrast to the worst-case optimality of robust control, achieving low regret demands stronger promises of instance-wise (near-)optimality on every perturbation sequence.

Non-stochastic Control: Without knowledge of the system matrices (A, B) or the adversarial perturbations w_t, iteratively generate controls u_t to minimize regret over sequentially revealed adversarial convex costs.

The above constitutes a powerful (non-stochastic) adversarial generalization of stochastic control.

Our starting point is the recent work of [ABH+19a], where the authors proposed a novel class of policies, namely choosing actions as a linear combination of past disturbances (on top of a stabilizing linear term), u_t = −K x_t + ∑_{i=1}^{H} M^{[i−1]} w_{t−i}. They demonstrate that learning the coefficient matrices M^{[i]} via online convex optimization allows their controller to compete with the class of all linear policies in the state. This latter class is important, since it is known to be optimal for the standard setting of i.i.d. Gaussian noise and quadratic loss functions, also known as the Linear Quadratic Regulator (LQR), and associated robust control settings (see [BB08] for examples).

The caveat in [ABH+19a] is that the system matrices (A, B) need to be known. In the case of a known system, the disturbances can simply be computed via observations of the state, i.e. w_t = x_{t+1} − A x_t − B u_t. However, if the system is unknown, it is not clear how to generalize their approach. Fundamentally, the difficulty lies in identifying the system, i.e. the matrices (A, B), from the observations. This is non-trivial since the noise is assumed to be adversarial, and was posed as a question in [TU19].

In this paper we show how to overcome this difficulty and obtain sublinear regret for controlling an unknown system in the presence of adversarial noise and adversarial loss functions. The regret notion we adopt is policy regret against linear policies, exactly as in [ABH+19a]. The crucial component we introduce is adversarial sys-id: an efficient method for uncovering the underlying system even in the presence of adversarial perturbations. This method is not based on the naive least-squares method of regressing x_{t+1} on (x_t, u_t). In particular, without independent, zero-mean w_t's, the latter approach can produce inconsistent estimates of the system matrices.
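By contrast, when (A, B) are known the disturbances can be read off the observed states exactly, w_t = x_{t+1} − A x_t − B u_t. A toy check of this identity (the matrices are illustrative, not from the paper):

```python
import numpy as np

def recover_disturbances(A, B, xs, us):
    """With known (A, B), read off w_t = x_{t+1} - A x_t - B u_t."""
    return [xs[t + 1] - A @ xs[t] - B @ us[t] for t in range(len(us))]

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
w_true = [np.array([0.1, -0.2]), np.array([-0.3, 0.05])]
us = [np.array([0.5]), np.array([-0.5])]
xs = [np.zeros(2)]
for t in range(2):
    xs.append(A @ xs[-1] + B @ us[t] + w_true[t])
w_hat = recover_disturbances(A, B, xs, us)
```

It is exactly this identity that is unavailable when only estimates of (A, B) are at hand, which is what the adversarial sys-id procedure must work around.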

Informally, the main result is:

###### Theorem 1 (Informal Statement).

For an unknown linear dynamical system as in (1), where the w_t are chosen by an oblivious adversary and bounded, ‖w_t‖ ≤ W, there exists an efficient algorithm that generates an adaptive sequence of controls for which

 Regret = O(poly(natural parameters) · T^{2/3}).

### 1.1 Related Work

There has been a resurgence of literature on the control of linear dynamical systems in recent machine learning venues. The case of known systems was extensively studied in the control literature; see the survey [STE94]. Sample complexity and regret bounds for control (under Gaussian noise) were obtained in [AS11, DMM+18, ALS19, MTR19, CKM19]. The works of [ABK14], [CHK+18] and [AHS19b] allow for control of an LDS with adversarial loss functions. Provable control in the Gaussian noise setting via the policy gradient method was studied in [FGK+18]. These works operate in the absence of perturbations, or assume that the perturbations are i.i.d., as opposed to adversarial as in our setting.

The most relevant reformulation of the control problem that enables our result is the recent work of [ABH+19a], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations. However, the result and the algorithm make extensive use of the availability of the system matrices.

Recently [SBR19] showed how to use least squares to learn an underlying Markov operator in lieu of the system and in the presence of noise. It is possible that their recovery technique can also be used to generate perturbation estimates, and then apply the techniques of [ABH+19a] for control. However, the conditions on the system they assume are even more general than ours, and it is not clear if they are sufficient for control. For system identification in the stochastic noise setting, [OO19] prove sample complexity bounds for the Ho-Kalman algorithm [HK66]. A stronger result that holds under partial observability was shown in [SRD19]. While these results apply to stochastic noise, parameter recovery in the setting of adversarial noise was recently studied in the contextual bandits literature [KWS18]. Other relevant work from the machine learning literature includes the technique of spectral filtering for learning and open-loop control of partially observable systems [HSZ17, AHL+18, HLS+18].

We make extensive use of techniques from online learning [CL06, So12, HAZ16].

## 2 Problem Setting

#### Linear Dynamical Systems

We consider the setting of linear dynamical systems with time-invariant dynamics, i.e.

 x_{t+1} = A x_t + B u_t + w_t,

where x_t ∈ ℝ^n is the state and u_t ∈ ℝ^m is the control. The perturbation sequence {w_t} may be adversarially chosen at the beginning of the interaction, and is unknown to the learner. Likewise, the system is augmented with time-varying convex cost functions c_t. The total cost associated with a sequence of (random) controls, derived through an algorithm 𝒜, is

 J(𝒜) = E[∑_{t=1}^{T} c_t(x_t, u_t)].

With some abuse of notation, we will denote by J(K) the cost associated with the execution of the controls a linear controller K would suggest, i.e. u_t = −K x_t. The following conditions are assumed on the costs and the perturbations.

###### Assumption 2.

The perturbation sequence is bounded, i.e. ‖w_t‖ ≤ W, and chosen at the start of the interaction, implying that this sequence does not depend on the choice of the controls.

###### Assumption 3.

As long as ‖x‖, ‖u‖ ≤ D, the convex costs admit the gradient bound ‖∇_x c_t(x, u)‖, ‖∇_u c_t(x, u)‖ ≤ G D.

The fundamental Linear Quadratic Regulator (LQR) problem is a specialization of the above to the case when the perturbations are i.i.d. Gaussian and the cost functions are positive quadratics, i.e.

 c_t(x, u) = x^⊤ R x + u^⊤ Q u.

#### Objective

We consider the setting where the learner has no knowledge of the system matrices (A, B) or the perturbation sequence {w_t}. In this case, any inference of these quantities may only take place indirectly, through the observation of the states x_t. Furthermore, the learner is made aware of the cost function c_t only once the choice of u_t has been made.

Under such constraints, the objective of the algorithm is to choose an (adaptive) sequence of controls so that the cost suffered is comparable to that of the best choice of a linear controller with complete knowledge of the system dynamics and the foreknowledge of the cost and perturbation sequences {c_t}, {w_t}. Formally, we measure regret as

 Regret = J(𝒜) − min_{K∈𝒦} J(K).

Here 𝒦 is the set of (κ, γ)-strongly stable linear controllers, defined below. The notion of strong stability, introduced in [CHK+18], offers a quantification of the classical notion of a stable controller in a manner that permits a discussion of non-asymptotic regret bounds.

###### Definition 4 (Strong Stability).

A linear controller K is (κ, γ)-strongly stable for a linear dynamical system specified via (A, B) if there exists a decomposition A − BK = Q L Q⁻¹ with ‖L‖ ≤ 1 − γ, and ‖A‖, ‖B‖, ‖K‖, ‖Q‖, ‖Q⁻¹‖ ≤ κ.
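A strong-stability certificate is directly checkable. The helper below verifies a proposed decomposition against the norm bounds of Definition 4; the example system and the tolerance are illustrative choices, not from the paper:

```python
import numpy as np

def certifies_strong_stability(A, B, K, Q, L, kappa, gamma, tol=1e-9):
    """Check that (Q, L) certifies (kappa, gamma)-strong stability of K for (A, B)."""
    op = lambda M: np.linalg.norm(M, ord=2)  # spectral norm
    if not np.allclose(A - B @ K, Q @ L @ np.linalg.inv(Q)):
        return False
    norms_ok = max(op(K), op(Q), op(np.linalg.inv(Q))) <= kappa + tol
    return norms_ok and op(L) <= 1 - gamma + tol

# A - BK is already diagonal here, so Q = I and L = A - BK certify stability.
A = np.diag([0.5, 0.3])
B = np.eye(2)
K = np.zeros((2, 2))
ok = certifies_strong_stability(A, B, K, np.eye(2), A, kappa=1.0, gamma=0.5)
```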

We also assume the learner has access to a fixed stabilizing controller K for the transition matrices (A, B). When operating under unknown transition matrices, the knowledge of a stabilizing controller permits the learner to prevent an inflation of the size of the state beyond reasonable bounds.

###### Assumption 5.

The learner knows a linear controller K that is (κ, γ)-strongly stable for the true, but unknown, transition matrices (A, B).

The non-triviality of the regret guarantee rests on the benchmark set 𝒦 not being empty. As noted in [CHK+18], a sufficient condition to ensure the existence of a strongly stable controller is the controllability of the linear system (A, B). Informally, controllability for a linear system is characterized by the ability to drive the system to any desired state through appropriate control inputs under deterministic dynamics, i.e. w_t = 0.

###### Definition 6 (Strong Controllability).

For a linear dynamical system (A, B), define, for k ≥ 1, the matrix C_k as

 C_k = [B, AB, A²B, …, A^{k−1}B].

A linear dynamical system is controllable with controllability index k if C_k has full row rank. In addition, such a system is strongly controllable if the smallest singular value of C_k is, quantitatively, bounded away from zero.

As with stability, a quantitative analog of controllability first suggested in [CHK+18] is presented above. It is useful to note that, as a consequence of the Cayley-Hamilton theorem, for a controllable system the controllability index is always at most the dimension of the state space. We adopt the assumption that the system is strongly controllable.
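Definition 6 and the Cayley-Hamilton remark can be checked mechanically. A sketch with an illustrative two-dimensional system:

```python
import numpy as np

def controllability_matrix(A, B, k):
    """C_k = [B, AB, ..., A^{k-1} B] as in Definition 6."""
    blocks = [B]
    for _ in range(k - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def controllability_index(A, B):
    """Smallest k with rank(C_k) = n, or None; k <= n by Cayley-Hamilton."""
    n = A.shape[0]
    for k in range(1, n + 1):
        if np.linalg.matrix_rank(controllability_matrix(A, B, k)) == n:
            return k
    return None  # uncontrollable

A = np.array([[0.0, 1.0], [0.0, 0.0]])  # a double integrator, illustratively
B = np.array([[0.0], [1.0]])
k = controllability_index(A, B)
```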

###### Assumption 7.

The linear dynamical system (A, B) is strongly controllable.

## 3 Preliminaries

This section sets up the concepts that aid the algorithmic description and the analysis.

### 3.1 Parameterization of the Controller

The total-cost objective of a linear controller is non-convex in the canonical parameterization [FGK+18], i.e. J(K) is not convex in K. To remedy this, we use an alternative, perturbation-based parameterization for the controller, recently proposed in [ABH+19a], where the advised control is linear in the past perturbations (as opposed to the state). This permits the offline search for an optimal controller to be posed as a convex program.

###### Definition 8.

A perturbation-based policy M = (M^{[0]}, …, M^{[H−1]}) chooses the recommended action at state x_t as

 u_t = −K x_t + ∑_{i=1}^{H} M^{[i−1]} w_{t−i}.
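Definition 8 amounts to a short computation at each step. A sketch with illustrative dimensions, where M[i-1] plays the role of M^{[i−1]} and past_w[i-1] holds w_{t−i}:

```python
import numpy as np

def perturbation_policy(K, M, x_t, past_w):
    """u_t = -K x_t + sum_{i=1}^{H} M^{[i-1]} w_{t-i}, per Definition 8."""
    u = -K @ x_t
    for i in range(1, len(M) + 1):
        u = u + M[i - 1] @ past_w[i - 1]
    return u

n, m, H = 2, 1, 3
K = np.zeros((m, n))                      # illustrative stabilizing controller
M = [0.1 * np.ones((m, n)) for _ in range(H)]
past_w = [np.ones(n) for _ in range(H)]   # w_{t-1}, ..., w_{t-H}
u = perturbation_policy(K, M, np.zeros(n), past_w)
```

The key point is that u is linear in the M^{[i]}'s, which is what makes the offline search convex.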

### 3.2 State Evolution

Under the execution of a stationary policy M, the state may be expressed as a linear transformation of the perturbations, where the linear transformation itself is linear in the parameterizing matrices. We set up this notation below.

###### Definition 9.

For a matrix pair , define the state-perturbation transfer matrix:

 Ψ_i(M | A, B) = (A − BK)^i 1_{i≤H} + ∑_{j=0}^{H} (A − BK)^j B M^{[i−j−1]} 1_{i−j∈[1,H]}.

In [ABH+19a] the authors note that, under the linear dynamical system (A, B) with perturbations {w_t}, the state produced by this policy evolves as specified below. Of particular importance in this context is the observation that the Ψ_i's are linear in the M^{[j]}'s.

 x_{t+1} = (A − BK)^{H+1} x_{t−H} + ∑_{i=0}^{2H} Ψ_i(M | A, B) w_{t−i}
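The transfer matrices of Definition 9 can be evaluated directly from the defining sum. A sketch with illustrative dimensions, which also makes their linearity in M evident:

```python
import numpy as np

def psi(M, A, B, K, i, H):
    """Psi_i(M | A, B): (A-BK)^i when i <= H, plus the M-dependent sum."""
    n = A.shape[0]
    ABK = A - B @ K
    out = np.linalg.matrix_power(ABK, i) if i <= H else np.zeros((n, n))
    for j in range(H + 1):
        if 1 <= i - j <= H:  # the indicator 1_{i-j in [1, H]}
            out = out + np.linalg.matrix_power(ABK, j) @ B @ M[i - j - 1]
    return out

A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.array([[0.0], [1.0]])
K = np.zeros((1, 2))
H = 3
M_zero = [np.zeros((1, 2)) for _ in range(H)]  # with M = 0, Psi_i = (A-BK)^i
```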

Following [ABH+19a], we adopt the definition of the surrogate setting.

###### Definition 10.

Define the surrogate state y_t and the surrogate action v_t as stated below. The surrogate cost f_t is chosen to be the specialization of the t-th cost function with the surrogate state-action pair as the argument.

 y_{t+1}(M | A, B, {w}) = ∑_{i=0}^{2H} Ψ_i(M | A, B) w_{t−i}
 v_t(M | A, B, {w}) = −K y_t(M | A, B, {w}) + ∑_{i=1}^{H} M^{[i−1]} w_{t−i}
 f_t(M | A, B, {w}) = c_t(y_t(M | A, B, {w}), v_t(M | A, B, {w}))

## 4 The Algorithm

Our approach follows the explore-then-commit paradigm, identifying the underlying deterministic-equivalent dynamics to within a specified accuracy using random inputs in the exploration phase. Such an approximate recovery of parameters permits an approximate recovery of the perturbations, thus facilitating the execution of the perturbation-based controller on the approximated perturbations.

###### Theorem 11.

Under the assumptions listed in Section 2, the regret incurred by Algorithm 1 admits the upper bound stated below.¹ In particular, this is the case under the parameter choices made in Algorithm 1. (¹ The unnatural scaling in some parameters occurs because of loose substitutions made (sometimes) in the analysis; also, the magnitude of the exploration inputs is not tuned to produce the optimal dependence in parameters other than T.)

 Regret = O(poly(κ, γ⁻¹, k, m, n, W) · T^{2/3})

## 5 Regret Analysis

To present the proof concisely, we set up a few articles of use. For a generic algorithm 𝒜 operating on a generic linear dynamical system specified via a matrix pair (A, B) and perturbations {w_t}, let

1. J(𝒜 | A, B, {w}) be the cost of executing 𝒜, as incurred on the last T − T_0 time steps,

2. x_t(𝒜 | A, B, {w}) be the state achieved at time step t, and

3. u_t(𝒜 | A, B, {w}) be the control executed at time step t.

We also note the following result from [ABH+19a], which applies to the case when the matrices that govern the underlying dynamics are made known to the algorithm. In such a case, an exact inference of the perturbations w_t is possible.

###### Theorem 12 ([ABH+19a]).

Let K be a (κ, γ)-strongly stable controller for a generic system (A, B), {w_t} be a perturbation sequence with ‖w_t‖ ≤ W, and {c_t} be costs satisfying Assumption 3. Then there exists an algorithm (specifically, Phase 2 of Algorithm 1, with a suitable choice of learning rate and horizon), utilizing K, that guarantees

 J(𝒜 | A, B, {w}) − min_{K∈𝒦} J(K | A, B, {w}) ≤ O(poly(κ, γ⁻¹) · G W² √T log T).
###### Proof of Theorem 11.

Let J_0 be the contribution to the regret associated with the first T_0 rounds of exploration. By Lemma 19, we have that

 J_0 ≤ 8 T_0 G (κ³ γ⁻¹ (W + κ√m) + √m)².

In the arguments below, ε_{A,B} and ε_w denote the errors in the recovered system matrices and perturbations, respectively.

From this point on, let 𝒜 be the algorithm, from [ABH+19a], executed in Phase 2. By the Simulation Lemma (Lemma 13), we have

 Regret ≤ J_0 + ∑_{t=T_0+1}^{T} c_t(x_t, u_t) − J(K | A, B, {w})
        ≤ J_0 + (J(𝒜 | Â, B̂, {ŵ}) − J(K | Â, B̂, {ŵ})) + (J(K | Â, B̂, {ŵ}) − J(K | A, B, {w})).

The middle term in the regret can be upper bounded by the regret of the algorithm 𝒜 on the fictional system (Â, B̂) with the perturbation sequence {ŵ_t}. Before we can invoke Theorem 12, observe that

1. By the Preservation of Stability Lemma (Lemma 14), K is strongly stable for (Â, B̂), as long as ε_{A,B} is suitably small,

2. Lemma 17 ensures that the iterates produced by the execution of the algorithm satisfy ‖ŵ_t‖ ≤ poly(κ, γ⁻¹, m)(1 + W), as long as ε_{A,B} is suitably small.

With the above observations in place, Theorem 12 guarantees

 J(𝒜 | Â, B̂, {ŵ}) − J(K | Â, B̂, {ŵ}) ≤ poly(κ, γ⁻¹, m) · G (1 + W)² √T log T

The last term in the preceding display can be bounded as the Stability of Value Function Lemma (Lemma 15) indicates, constraining the choices of ε_w and ε_{A,B} suitably, and observing that

 |J(K | Â, B̂, {ŵ}) − J(K | A, B, {w})| ≤ 32 T κ³ γ⁻¹ G W (κ² γ⁻¹ ε_w + 16 W κ⁵ γ⁻² ε_{A,B})
  ≤ poly(κ, γ⁻¹, m) · G (1 + W)² T ε_{A,B}
  ≤ poly(κ, γ⁻¹, k, m, n) · G (1 + W)^{2.5} T · T_0^{−1/2}

The last line follows by Theorem 18, and suggests the optimal (by this analysis) choice T_0 = Θ(T^{2/3}). Apart from this proof, all other proofs and statements list exact upper bounds on the polynomial factors. Here they have been omitted for ease of presentation. ∎

The regret-minimizing algorithm, Phase 2 of Algorithm 1, chooses the matrices M_t so as to optimize the cost of the perturbation-based controller in a fictional linear dynamical system described via the matrix pair (Â, B̂) and the perturbation sequence {ŵ_t}. The following lemma shows that this choice ensures that the state-control sequence visited by Algorithm 1 coincides with the sequence visited by the regret-minimizing algorithm in the fictional system.

###### Lemma 13 (Simulation Lemma).

Let 𝒜 be the algorithm, from [ABH+19a], executed in Phase 2, and (x_t, u_t) be the state-control iterates produced by Algorithm 1. Then

 x_t = x_t(𝒜 | Â, B̂, {ŵ}),  u_t = u_t(𝒜 | Â, B̂, {ŵ}),  and  ∑_{t=T_0+1}^{T} c_t(x_t, u_t) = J(𝒜 | Â, B̂, {ŵ}).

###### Proof.

This proof follows by induction on t. Note that at the start of Phase 2, the algorithm 𝒜 is fed the state the true system is in, so the base case holds by construction. Say that for some t, the inductive hypothesis is true. Consequently

 u_t(𝒜 | Â, B̂, {ŵ}) = −K x_t(𝒜 | Â, B̂, {ŵ}) + ∑_{i=1}^{H} M_t^{[i−1]} ŵ_{t−i}
  = −K x_t + ∑_{i=1}^{H} M_t^{[i−1]} ŵ_{t−i} = u_t.

This, in turn, implies by the choice of ŵ_t that

 x_{t+1}(𝒜 | Â, B̂, {ŵ}) = Â x_t(𝒜 | Â, B̂, {ŵ}) + B̂ u_t(𝒜 | Â, B̂, {ŵ}) + ŵ_t
  = Â x_t + B̂ u_t + ŵ_t = x_{t+1}.

The claim follows. ∎

The lemma stated below guarantees that the strong stability of K is approximately preserved under small deviations of the system matrices.

###### Lemma 14 (Preservation of Stability).

If a linear controller K is (κ, γ)-strongly stable for a linear dynamical system (A, B), i.e.

 A − BK = Q L Q⁻¹,  ‖A‖, ‖B‖, ‖K‖, ‖Q‖, ‖Q⁻¹‖ ≤ κ,  ‖L‖ ≤ 1 − γ,

then K is strongly stable for any linear dynamical system (Â, B̂) with ‖Â − A‖, ‖B̂ − B‖ ≤ ε_{A,B}, i.e.

 Â − B̂K = Q L̂ Q⁻¹,  ‖Â‖, ‖B̂‖ ≤ κ + ε_{A,B},  ‖L̂‖ ≤ 1 − γ + 2κ³ ε_{A,B},

as long as 2κ³ ε_{A,B} < γ. Furthermore, when in agreement with the said conditions, the transforming matrices Q that certify strong stability in both these cases coincide, and the transformed matrices obey ‖L̂ − L‖ ≤ 2κ³ ε_{A,B}.

###### Proof.

Let Â = A + Δ_A and B̂ = B + Δ_B with ‖Δ_A‖, ‖Δ_B‖ ≤ ε_{A,B}. Now

 Â − B̂K = Q L Q⁻¹ + (Â − A) − (B̂ − B)K
  = Q(L + Q⁻¹(Â − A − (B̂ − B)K)Q)Q⁻¹.

It suffices to note that ‖Q⁻¹(Â − A − (B̂ − B)K)Q‖ ≤ κ²(ε_{A,B} + κ ε_{A,B}) ≤ 2κ³ ε_{A,B}. ∎

The next lemma establishes that if the same linear state-feedback policy is executed on the actual and the fictional linear dynamical system, the difference between the costs incurred in the two scenarios varies proportionally with some measure of distance between the two systems.

###### Lemma 15 (Stability of Value Function).

Let ‖Â − A‖, ‖B̂ − B‖ ≤ ε_{A,B} and sup_t ‖ŵ_t − w_t‖ ≤ ε_w. As long as ε_{A,B} and ε_w are suitably small, it is true that

 |J(K | Â, B̂, {ŵ}) − J(K | A, B, {w})| ≤ 32 T κ³ γ⁻¹ G W (κ² γ⁻¹ ε_w + 16 W κ⁵ γ⁻² ε_{A,B}),

for any controller K that is strongly stable with respect to (A, B).

###### Proof.

Under the action of a linear controller K, which is strongly stable for (A, B), it may be verified that

 x_{t+1}(K | A, B, {w}) = ∑_{i=0}^{t} (A − BK)^i w_{t−i}.

Now, note that ‖x_{t+1}(K | A, B, {w})‖ ≤ κ² γ⁻¹ W. By the Preservation of Stability Lemma (Lemma 14), K is strongly stable for (Â, B̂), and the analogous bound applies in the fictional system. Consequently,

 |J(K | Â, B̂, {ŵ}) − J(K | A, B, {w})| ≤ 32 T κ³ γ⁻¹ G W · max_t ‖x_{t+1}(K | A, B, {w}) − x_{t+1}(K | Â, B̂, {ŵ})‖

Finally, note

 ‖x_{t+1}(K | A, B, {w}) − x_{t+1}(K | Â, B̂, {ŵ})‖
  ≤ ∑_{i=0}^{t} ‖(A − BK)^i w_{t−i} − (Â − B̂K)^i ŵ_{t−i}‖
  ≤ ∑_{i=0}^{t} (‖(A − BK)^i w_{t−i} − (A − BK)^i ŵ_{t−i}‖ + ‖(A − BK)^i ŵ_{t−i} − (Â − B̂K)^i ŵ_{t−i}‖)
  ≤ κ² γ⁻¹ ε_w + 2W κ² ∑_{i=0}^{t} ‖L̂^i − L^i‖

The last inequality follows from the addendum attached to Lemma 14. Lastly, observe, as a consequence of Lemma 16 and Lemma 14, that

 ∑_{i=0}^{t} ‖L̂^i − L^i‖ = ∑_{i=0}^{t} ‖(L + (L̂ − L))^i − L^i‖ ≤ 4γ⁻² ‖L̂ − L‖. ∎

###### Lemma 16.

For any matrix pair (L, Δ_L) such that ‖L‖, ‖L + Δ_L‖ ≤ 1 − γ, we have

 ∑_{t=0}^{∞} ‖(L + Δ_L)^t − L^t‖ ≤ γ⁻² ‖Δ_L‖
###### Proof.

We make the inductive claim that ‖(L + Δ_L)^t − L^t‖ ≤ t(1 − γ)^{t−1} ‖Δ_L‖. The truth of this claim for t = 0 is easily verifiable. Assuming the inductive claim for some t, observe

 ‖(L + Δ_L)^{t+1} − L^{t+1}‖ = ‖(L + Δ_L)(L + Δ_L)^t − L L^t‖
  ≤ ‖L((L + Δ_L)^t − L^t)‖ + ‖Δ_L (L + Δ_L)^t‖
  ≤ t(1 − γ)^{t−1} ‖L‖ ‖Δ_L‖ + ‖Δ_L‖(1 − γ)^t
  ≤ (t + 1)(1 − γ)^t ‖Δ_L‖

Finally, observe that ∑_{t=0}^{∞} t(1 − γ)^{t−1} = γ⁻². ∎
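A quick numerical spot-check of Lemma 16 with illustrative diagonal matrices satisfying ‖L‖, ‖L + Δ_L‖ ≤ 1 − γ (the infinite series is truncated at a length where the tail is negligible):

```python
import numpy as np

gamma = 0.5
L = (1 - gamma - 1e-3) * np.eye(2)   # ||L|| <= 1 - gamma
Delta = 1e-3 * np.eye(2)             # ||L + Delta|| <= 1 - gamma as well

total = 0.0
P, Q = np.eye(2), np.eye(2)
for _ in range(200):
    P = P @ (L + Delta)              # (L + Delta)^t
    Q = Q @ L                        # L^t
    total += np.linalg.norm(P - Q, 2)

bound = np.linalg.norm(Delta, 2) / gamma**2  # the lemma's right-hand side
```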

While ‖w_t‖ is bounded by assumption, the next lemma bounds the possible value of ‖ŵ_t‖. Please see the proof for the exact polynomial coefficients.

###### Lemma 17.

During the execution of Phase 2 of Algorithm 1, the iterates produced satisfy, as long as ε_{A,B} is suitably small, that

 ‖x_t‖ ≤ poly(κ, γ⁻¹, m)(1 + W),  and  ‖ŵ_t‖ ≤ poly(κ, γ⁻¹, m)(1 + W).
###### Proof.

Consider a generic linear dynamical system that evolves as x_{t+1} = A x_t + B u_t + w_t, where the control is chosen as u_t = −K x_t + ũ_t. Such a choice entails that for any H′ ≤ t,

 x_{t+1} = (A − BK)^{H′+1} x_{t−H′} + ∑_{i=0}^{H′} (A − BK)^i (w_{t−i} + B ũ_{t−i}).

Note that Phase 2 of Algorithm 1 is a specific instance of this setting, where ũ_t is chosen as ∑_{i=1}^{H} M_t^{[i−1]} ŵ_{t−i}. We put forward the (strong) inductive hypothesis that for all t ≤ t_0, we have

 ‖x_t‖ ≤ X := κ⁷ γ⁻² (W + Y) / (1 − κ²(1 − γ)^{H′+1}),
 ‖ŵ_{t−1}‖ ≤ Y := ((1 + 2κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹) / (1 − 4κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹)) W + κ² γ⁻¹ (W + κ√m).

Now, observe that

 ‖x_{t_0+1}‖ ≤ κ²(1 − γ)^{H′+1} X + κ² γ⁻¹ (W + κ⁵ γ⁻¹ Y) ≤ X
 ‖ŵ_{t_0}‖ = ‖w_{t_0} + (A − Â) x_{t_0} + (B − B̂) u_{t_0}‖
  ≤ W + ε_{A,B} (2κ X + κ⁴ γ⁻¹ Y)
  ≤ W + 2κ⁴ γ⁻¹ ε_{A,B} (X + Y)
  ≤ W (1 + 2κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹) + (2κ⁴ γ⁻¹ ε_{A,B} + 2κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹) Y
  ≤ W (1 + 2κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹) + 4κ¹¹ γ⁻³ ε_{A,B} (1 − κ²(1 − γ)^{H′+1})⁻¹ Y ≤ Y

The base case may be verified via computation. To simplify the expressions, choose H′ large enough that κ²(1 − γ)^{H′+1} ≤ 1/2, to obtain that, as long as ε_{A,B} is suitably small,

 ‖ŵ_t‖ ≤ (100 + κ² γ⁻¹) W + κ³ γ⁻¹ √m,  ‖x_t‖ ≤ 2γ⁻² κ⁷ ((101 + κ² γ⁻¹) W + κ³ γ⁻¹ √m). ∎

## 6 System Identification via Random Inputs

This section details the guarantees afforded by Algorithm 2. Define A′ = A − BK, the closed-loop transition matrix under the known stabilizing controller K. The said algorithm attempts to identify the deterministic-equivalent dynamics, i.e. the matrix pair (A, B), by first identifying matrices of the form (A′)^j B, and then recovering (A, B) by solving an associated linear system of equations.
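The idea behind the moment-recovery step can be sketched in isolation: inject i.i.d. zero-mean, isotropic inputs η_t and correlate the state against a lagged input, so that an average of the form (1/T)∑_t x_{t+j+1} η_t^⊤ concentrates around the j-th moment. The sketch below strips away the stabilizing controller and sets w_t = 0, so the target is simply A^j B; it is an illustrative simplification, not Algorithm 2 itself:

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.array([[0.5, 0.2], [0.0, 0.4]])
B = np.array([[1.0], [0.5]])
T = 20000

etas = rng.choice([-1.0, 1.0], size=(T, 1))  # Rademacher: zero-mean, isotropic
xs = [np.zeros(2)]
for t in range(T):
    xs.append(A @ xs[-1] + B @ etas[t])      # w_t = 0 in this toy run

# Estimate the j = 1 moment, A B, by correlating x_{t+2} with eta_t.
N1 = sum(np.outer(xs[t + 2], etas[t]) for t in range(T - 2)) / (T - 2)
```

Only the lag-matched term of the unrolled state survives in expectation; the remaining terms are mean-zero and average out, which is what Lemma 20 makes quantitative.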

###### Theorem 18 (System Recovery).

Under the assumptions listed in Section 2, when Algorithm 2 is run for

 T_0 = 10³ k n m (W + κ√m)² κ⁶ γ⁻² ε_{A,B}⁻² log((n + m) k δ⁻¹)

steps, it is guaranteed that the output pair (Â, B̂) satisfies, with probability at least 1 − δ, that

 ‖Â − A‖_F, ‖B̂ − B‖_F ≤ ε_{A,B}.
###### Proof.

As a consequence of controllability, observe that A′ is the unique solution X of the system of equations presented below.

 A′ C_k = [A′B, (A′)²B, …, (A′)^k B] = X [B, A′B, …, (A′)^{k−1} B] = X C_k

Now, if, for all j ≤ k, the moment estimates satisfy ‖N_j − (A′)^j B‖ ≤ ε, it follows that ‖Ĉ_k − C_k‖_F ≤ ε√k; in addition, the right-hand side above is perturbed by at most ε√k in Frobenius norm. By Lemma 21, we have

 ‖Â′ − A′‖_F ≤ ε κ √(kn) / (σ_min(C_k) − ε√k).

So, setting ε small enough, inverse-polynomially in the stated parameters, suffices. This may be accomplished by Lemma 20. ∎
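The second step of the proof, recovering the transition matrix from the moments by solving a linear system over C_k, can be sketched as follows. The sketch uses exact moments, a plain A in place of the paper's closed-loop A′, and an illustrative system:

```python
import numpy as np

def recover_A(moments):
    """Given moments[j] ~ A^j B for j = 0..k, solve X C_k = [AB, ..., A^k B]."""
    Ck = np.hstack(moments[:-1])        # [B, AB, ..., A^{k-1} B]
    shifted = np.hstack(moments[1:])    # [AB, A^2 B, ..., A^k B]
    # X Ck = shifted  <=>  Ck^T X^T = shifted^T, solved in least squares.
    Xt, *_ = np.linalg.lstsq(Ck.T, shifted.T, rcond=None)
    return Xt.T

A = np.array([[0.9, 1.0], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
moments = [np.linalg.matrix_power(A, j) @ B for j in range(3)]  # k = 2, exact
A_hat = recover_A(moments)
```

With full row rank of C_k the solution is unique, which is where the controllability assumption enters; with noisy moments the least-squares solve degrades gracefully, as Lemma 21 quantifies.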

The evolution of the state sequence during the identification phase, in terms of the exploratory control inputs η_t, is stated below. Following this, we state an upper bound on the said sequence.

 x_{t+1} = ∑_{i=0}^{t} (A′)^{t−i} (w_i + B η_i) (2)
###### Lemma 19.

The states produced by Algorithm 2 satisfy

 ‖x_t‖ ≤ κ² γ⁻¹ (W + κ√m),
 ‖u_t‖ ≤ κ³ γ⁻¹ (W + κ√m) + √m,
 c_t(x_t, u_t) − min_{x,u} c_t(x, u) ≤ 8G (κ³ γ⁻¹ (W + κ√m) + √m)².
###### Proof.

The strong stability of K suffices to establish this claim:

 ‖x_t‖ ≤ (W + ‖B‖√m) ‖Q‖ ‖Q⁻¹‖ ∑_{i=0}^{t−1} ‖L^i‖ ≤ (W + κ√m) κ² ∑_{i=0}^{t−1} (1 − γ)^i

In addition to sub-multiplicativity of the norm, we use that ‖η_i‖ ≤ √m. ∎

### 6.1 Step 1: Moment Recovery

The following lemma promises an approximate recovery of the moments (A′)^j B through an appeal to arguments involving measures of concentration.

###### Lemma 20.

Algorithm 2 satisfies, for all j ≤ k, with probability 1 − δ or more,

 ‖N_j − (A′)^j B‖ ≤ √m (W + κ√m) κ² γ⁻¹ √(8 log((n + m) k δ⁻¹) / (T_0 − k)).
###### Proof.

With Equation 2, the fact that η_t is zero-mean with isotropic unit covariance, and that it is chosen independently of {w_t}, implies