The Nonstochastic Control Problem

11/27/2019 ∙ by Elad Hazan, et al. ∙ Princeton University, University of Washington

We consider the problem of controlling an unknown linear dynamical system with non-stochastic adversarial perturbations and adversarial convex loss functions. The possibility of sub-linear regret was posed as an open question by [Tu 2019]. We answer this question in the affirmative by giving an efficient algorithm that guarantees regret of $O(T^{2/3})$. Crucial to our result is a system identification procedure that is provably effective under adversarial perturbations.


1 Introduction

Classical control theory assumes that nature evolves according to well-specified dynamics that are perturbed by i.i.d. noise. While this approximation has proven very useful for controlling some real-world systems, it does not allow for the construction of truly robust controllers. The focus of this paper is the construction of truly robust controllers even when the underlying system is unknown.

Specifically, we consider the case in which the underlying system is linear, but has potentially adversarial perturbations (that can model deviations from linearity), i.e.

$$x_{t+1} = A x_t + B u_t + w_t, \qquad (1)$$

where $x_t$ is the (observed) system state, $u_t$ is a learner-chosen control and $w_t$ is an adversarial disturbance. The goal of the controller is to minimize a sum of sequentially revealed adversarial cost functions $c_t(x_t, u_t)$.

The goal in this game-theoretic setting is to minimize policy regret, i.e. the regret compared to the best controller from a class $\Pi$ that is made aware of the system dynamics, the cost sequence, and all the disturbances ahead of time:

$$\mathrm{Regret} = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^{\pi}, u_t^{\pi}).$$

It may be noted that the cost of the benchmark is measured on the counterfactual state-action sequence $(x_t^{\pi}, u_t^{\pi})$ that the policy in consideration visits, as opposed to the state sequence visited by the learner. In contrast to the worst-case optimality of robust control, achieving low regret demands the stronger promise of instance-wise (near) optimality on every perturbation sequence.

Non-stochastic Control: Without knowledge of the system matrices $(A, B)$ or the adversarial perturbations $w_t$, iteratively generate controls $u_t$ to minimize regret over sequentially revealed adversarial convex costs.

The above constitutes a powerful (non-stochastic) adversarial generalization of stochastic control.

Our starting point is the recent work of [ABH+19a], where the authors proposed a novel class of policies, namely choosing actions as a linear combination of past disturbances, $u_t = \sum_{i=1}^{H} M^{[i]} w_{t-i}$. They demonstrate that learning the coefficients $M^{[i]}$ via online convex optimization allows their controller to compete with the class of all linear policies in the state. This latter class is important, since it is known to be optimal for the standard setting of normal i.i.d. noise and quadratic loss functions, also known as the Linear Quadratic Regulator (LQR), and for associated robust control settings (see [BB08] for examples).

The caveat in [ABH+19a] is that the system matrices need to be known. In the case of a known system, the disturbances can simply be computed via observations of the state, i.e. $w_t = x_{t+1} - A x_t - B u_t$. However, if the system is unknown, it is not clear how to generalize their approach. Fundamentally, the difficulty lies in identifying the system, i.e. the matrices $(A, B)$, from the observations. This is non-trivial since the noise is assumed to be adversarial, and was posed as an open question in [Tu19].

In this paper we show how to overcome this difficulty and obtain sublinear regret for controlling an unknown system in the presence of adversarial noise and adversarial loss functions. The regret notion we adopt is policy regret against linear policies, exactly as in [ABH+19a]. The crucial component we introduce is adversarial sys-id: an efficient method for uncovering the underlying system even in the presence of adversarial perturbations. This method is not based on the naive least squares approach of regressing $x_{t+1}$ on $(x_t, u_t)$. In particular, without independent, zero-mean $w_t$'s, the latter approach can produce inconsistent estimates of the system matrices.
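To make the failure mode concrete, the following minimal simulation sketch (illustrative only; the scalar dynamics, the constants, and the choice of a constant disturbance are assumptions, not taken from the paper) contrasts ordinary least squares under i.i.d. zero-mean noise with the same regression under a bounded, non-zero-mean disturbance sequence that an oblivious adversary is free to choose.

```python
import numpy as np

# Illustrative sketch: regressing x_{t+1} on (x_t, u_t) recovers (A, B) under
# i.i.d. zero-mean noise, but a bounded, non-zero-mean (oblivious) disturbance
# sequence correlates with the state and biases the estimate.
rng = np.random.default_rng(0)
A, B, T = 0.9, 1.0, 10_000

def estimate(disturbance):
    xs, us, xnext = [], [], []
    x = 0.0
    for t in range(T):
        u = rng.normal()                      # exploratory control input
        x_new = A * x + B * u + disturbance(t)
        xs.append(x); us.append(u); xnext.append(x_new)
        x = x_new
    Z = np.column_stack([xs, us])             # regressors (x_t, u_t)
    theta, *_ = np.linalg.lstsq(Z, np.array(xnext), rcond=None)
    return theta                              # estimated [A, B]

print("zero-mean noise :", estimate(lambda t: rng.normal(scale=0.1)))
print("adversarial bias:", estimate(lambda t: 0.5))    # constant, bounded w_t
```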

Informally, the main result is:

Theorem 1 (Informal Statement).

For an unknown linear dynamical system as in (1), where the perturbations $w_t$ are chosen by an oblivious adversary and bounded as $\|w_t\| \le W$, there exists an efficient algorithm that generates an adaptive sequence of controls $\{u_t\}$ for which

$$\sum_{t=1}^{T} c_t(x_t, u_t) - \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t(x_t^{K}, u_t^{K}) \le O\big(\mathrm{poly}(\text{system parameters})\, T^{2/3}\big).$$

1.1 Related Work

There has been a resurgence of literature on the control of linear dynamical systems in recent machine learning venues. The case of known systems was extensively studied in the control literature; see the survey [Ste94]. Sample complexity and regret bounds for control (under Gaussian noise) were obtained in [AS11, DMM+18, ALS19, MTR19, CKM19]. The works of [ABK14], [CHK+18] and [AHS19b] allow for control of LDS with adversarial loss functions. Provable control in the Gaussian noise setting via the policy gradient method was studied in [FGK+18]. These works operate in the absence of perturbations or assume that they are i.i.d., as opposed to the adversarial perturbations we consider.

The most relevant reformulation of the control problem that enables our result is the recent work of [ABH+19a], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations. However, the result and the algorithm make extensive use of the availability of the system matrices.

Recently [SBR19] showed how to use least squares to learn an underlying Markov operator in lieu of the system and in the presence of noise. It is possible that their recovery technique can also be used to generate perturbation estimates, and then apply the techniques of [ABH+19a] for control. However, the conditions on the system they assume are even more general than ours, and it is not clear if they are sufficient for control. For system identification in the stochastic noise setting, [OO19] prove sample complexity bounds for the Ho-Kalman algorithm [HK66]. A stronger result that holds under partial observability was shown in [SRD19]. While these results apply to stochastic noise, parameter recovery in the setting of adversarial noise was recently studied in the contextual bandits literature [KWS18]. Other relevant work from the machine learning literature includes the technique of spectral filtering for learning and open-loop control of partially observable systems [HSZ17, AHL+18, HLS+18].

We make extensive use of techniques from online learning [CL06, So12, HAZ16].

2 Problem Setting

Linear Dynamical Systems

We consider the setting of linear dynamical systems with time-invariant dynamics, i.e.

$$x_{t+1} = A x_t + B u_t + w_t,$$

where $x_t$ is the system state and $u_t$ is the control input. The perturbation sequence $\{w_t\}$ may be adversarially chosen at the beginning of the interaction, and is unknown to the learner. Likewise, the system is augmented with time-varying convex cost functions $c_t(x, u)$. The total cost associated with a sequence of (random) controls, derived through an algorithm $\mathcal{A}$, is

$$J(\mathcal{A}) = \sum_{t=1}^{T} c_t(x_t, u_t).$$

With some abuse of notation, we will denote by $J(K)$ the cost associated with the execution of the controls a linear controller $K$ would suggest, i.e. $u_t = K x_t$. The following conditions are assumed on the costs and the perturbations.

Assumption 2.

The perturbation sequence $\{w_t\}$ is bounded, i.e. $\|w_t\| \le W$, and is chosen at the start of the interaction, implying that this sequence does not depend on the choice of the controls $u_t$.

Assumption 3.

As long as $\|x\|, \|u\| \le D$, the convex costs admit $|c_t(x, u)| \le \beta D^2$ and $\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G D$.

The fundamental Linear Quadratic Regulator (LQR) problem is a specialization of the above to the case when the perturbations are i.i.d. Gaussian and the cost functions are positive semidefinite quadratics, i.e. $c_t(x, u) = x^{\top} Q x + u^{\top} R u$ with $Q, R \succeq 0$.

Objective

We consider the setting where the learner has no knowledge of the system matrices $(A, B)$ or the perturbation sequence $\{w_t\}$. In this case, any inference of these quantities may only take place indirectly, through observations of the state $x_t$. Furthermore, the learner is made aware of the cost function $c_t$ only once the choice of $u_t$ has been made.

Under such constraints, the objective of the algorithm $\mathcal{A}$ is to choose an (adaptive) sequence of controls so that the cost suffered in this manner is comparable to that of the best choice of a linear controller with complete knowledge of the system dynamics and the foreknowledge of the cost and perturbation sequences $\{c_t\}, \{w_t\}$. Formally, we measure regret as

$$\mathrm{Regret} = J(\mathcal{A}) - \min_{K \in \mathcal{K}} J(K),$$

where $\mathcal{K}$ is the set of $(\kappa, \gamma)$-strongly stable linear controllers defined below. The notion of strong stability, introduced in [CHK+18], offers a quantification of the classical notion of a stable controller in a manner that permits a discussion of non-asymptotic regret bounds.

Definition 4 (Strong Stability).

A linear controller $K$ is $(\kappa, \gamma)$-strongly stable for a linear dynamical system specified via $(A, B)$ if $\|K\| \le \kappa$ and there exists a decomposition $A + BK = H L H^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|H\|, \|H^{-1}\| \le \kappa$.

We also assume the learner has access to a fixed stabilizing controller for the true transition matrices $(A, B)$. When operating under unknown transition matrices, the knowledge of a stabilizing controller permits the learner to prevent an inflation of the size of the state beyond reasonable bounds.

Assumption 5.

The learner knows a linear controller $K$ that is $(\kappa, \gamma)$-strongly stable for the true, but unknown, transition matrices $(A, B)$.

The non-triviality of the regret guarantee rests on the benchmark set $\mathcal{K}$ not being empty. As noted in [CHK+18], a sufficient condition to ensure the existence of a strongly stable controller is the controllability of the linear system $(A, B)$. Informally, controllability for a linear system is characterized by the ability to drive the system to any desired state through appropriate control inputs in the presence of deterministic dynamics, i.e. $w_t = 0$.

Definition 6 (Strong Controllability).

For a linear dynamical system $(A, B)$, define, for $k \ge 1$, a matrix $C_k$ as

$$C_k = \begin{bmatrix} B & AB & A^2 B & \cdots & A^{k-1} B \end{bmatrix}.$$

A linear dynamical system $(A, B)$ is controllable with controllability index $k$ if $C_k$ has full row rank. In addition, such a system is $(k, \kappa)$-strongly controllable if $\|(C_k C_k^{\top})^{-1}\| \le \kappa$.

As with stability, a quantitative analog of controllability, first suggested in [CHK+18], is presented above. It is useful to note that, as a consequence of the Cayley-Hamilton theorem, for a controllable system the controllability index is always at most the dimension of the state space. We adopt the assumption that the system is strongly controllable.
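As a quick numerical companion to Definition 6, the sketch below (a toy example with assumed matrices, not taken from the paper) builds the matrix $C_k$, checks that it has full row rank, and evaluates the quantity $\|(C_k C_k^{\top})^{-1}\|$ that strong controllability bounds.

```python
import numpy as np

def controllability_matrix(A: np.ndarray, B: np.ndarray, k: int) -> np.ndarray:
    """Stack the blocks [B, AB, A^2 B, ..., A^{k-1} B] side by side."""
    blocks, M = [], B.copy()
    for _ in range(k):
        blocks.append(M)
        M = A @ M
    return np.hstack(blocks)

# Toy example (assumed values): a double integrator with a single input.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
k = 2

C_k = controllability_matrix(A, B, k)
controllable = np.linalg.matrix_rank(C_k) == A.shape[0]    # full row rank
kappa = np.linalg.norm(np.linalg.inv(C_k @ C_k.T), 2)      # strong-controllability quantity
print(controllable, kappa)
```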

Assumption 7.

The linear dynamical system is strongly controllable.

3 Preliminaries

This section sets up the concepts that aid the algorithmic description and the analysis.

3.1 Parameterization of the Controller

The total cost objective of a linear controller is non-convex in the canonical parameterization [FGK+18], i.e. $J(K)$ is not convex in $K$. To remedy this, we use an alternative perturbation-based parameterization for the controller, recently proposed in [ABH+19a], where the advised control is linear in the past perturbations (as opposed to the state). This permits the offline search for an optimal controller to be posed as a convex program.

Definition 8.

A perturbation-based policy $M = (M^{[1]}, \dots, M^{[H]})$ chooses the recommended action at a state $x_t$ as

$$u_t = K x_t + \sum_{i=1}^{H} M^{[i]} w_{t-i},$$

where $K$ is the fixed stabilizing controller of Assumption 5.
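For concreteness, a minimal sketch of how such an action could be computed, assuming a stabilizing gain $K$ and a buffer of the most recent (estimated) disturbances; the helper name and argument layout are illustrative, not the paper's.

```python
import numpy as np

def perturbation_based_action(K, M, x, w_hist):
    """u = K x + sum_{i=1}^{H} M[i-1] @ w_{t-i}; M is the list [M^[1], ..., M^[H]]
    and w_hist is ordered oldest-to-newest, so w_hist[-i] is w_{t-i}."""
    u = K @ x
    for i, M_i in enumerate(M, start=1):
        if i <= len(w_hist):
            u = u + M_i @ w_hist[-i]
    return u

# Toy usage with assumed shapes: 2-dimensional state, scalar control, H = 3.
K = np.zeros((1, 2))
M = [0.1 * np.ones((1, 2)) for _ in range(3)]
x = np.array([1.0, -1.0])
w_hist = [np.zeros(2), np.array([0.5, 0.0]), np.array([0.0, 0.2])]
print(perturbation_based_action(K, M, x, w_hist))
```

Since, for fixed disturbances, the induced state is linear in the matrices $M^{[i]}$, the resulting cost is convex in $M$, which is what makes the coefficients learnable by online convex optimization.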

3.2 State Evolution

Under the execution of a stationary policy $M$, the state may be expressed as a linear transformation of the perturbations, where the linear transformation itself is linear in the parameterizing matrices $M^{[i]}$. We set up this notation below.

Definition 9.

For a matrix pair $(K, M)$, define the state-perturbation transfer matrix as

$$\Psi_i(M) = (A + BK)^{i}\, \mathbf{1}_{i \le H} + \sum_{j=0}^{H} (A + BK)^{j} B M^{[i-j]}\, \mathbf{1}_{1 \le i - j \le H}.$$

In [ABH+19a] the authors note that, under the linear dynamical system with perturbations $\{w_t\}$, the state produced by this policy evolves as $x_{t+1} = (A + BK)^{H+1} x_{t-H} + \sum_{i=0}^{2H} \Psi_i(M)\, w_{t-i}$. Of particular importance in this context is the observation that the $\Psi_i$'s are linear in the $M^{[j]}$'s.

Following [ABH+19a], we adopt the definition of the surrogate setting.

Definition 10.

Define the surrogate state $y_t(M)$ and the surrogate action $v_t(M)$ as the state and action reached upon executing the stationary policy $M$ for the last $H$ steps, starting from a zero state and using the recorded perturbations. The surrogate cost $f_t(M) = c_t(y_t(M), v_t(M))$ is chosen to be the specialization of the $t$-th cost function with the surrogate state-action pair as the argument.

4 The Algorithm

Our approach follows the explore-then-commit paradigm, identifying the underlying deterministic-equivalent dynamics to within a specified accuracy using random inputs in the exploration phase. Such an approximate recovery of parameters permits an approximate recovery of the perturbations, thus facilitating the execution of the perturbation-based controller on the approximated perturbations.

  Input: learning rate $\eta$, horizon $H$, number of iterations $T$, rounds of exploration $T_0$.
  Phase 1: System Identification.
  Call Algorithm 2 with a budget of $T_0$ rounds to obtain system estimates $(\hat{A}, \hat{B})$.
  Phase 2: Robust Control.
  Define the constraint set $\mathcal{M}$ as a set of suitably norm-bounded perturbation-based policies $M = (M^{[1]}, \dots, M^{[H]})$.
  Initialize $M_{T_0+1} \in \mathcal{M}$ arbitrarily and set $\hat{w}_t = 0$ for $t \le T_0$.
  for $t = T_0 + 1, \dots, T$ do
     Choose the action $u_t = K x_t + \sum_{i=1}^{H} M_t^{[i]} \hat{w}_{t-i}$.
     Observe the new state $x_{t+1}$ and the cost function $c_t$.
     Record an estimate $\hat{w}_t = x_{t+1} - \hat{A} x_t - \hat{B} u_t$.
     Update $M_{t+1} = \Pi_{\mathcal{M}}\big(M_t - \eta \nabla f_t(M_t)\big)$.
  end for
Algorithm 1 Adversarial control via system identification.
  Input: number of iterations $T_0$.
  for $t = 1, \dots, T_0$ do
     Execute the control $u_t = K x_t + \xi_t$, where $\xi_t$ is a vector of i.i.d. Rademacher ($\pm 1$) entries.
     Record the observed state $x_{t+1}$.
  end for
  Declare $\hat{N}_j = \frac{1}{T_0 - k} \sum_{t=1}^{T_0 - k} x_{t+j+1}\, \xi_t^{\top}$, for all $j \in \{0, 1, \dots, k\}$.
  Return $(\hat{A}, \hat{B})$ as $\hat{B} = \hat{N}_0$, $\hat{A} = \hat{C}_1 \hat{C}_0^{\top} (\hat{C}_0 \hat{C}_0^{\top})^{-1} - \hat{B} K$,
where $\hat{C}_0 = (\hat{N}_0, \dots, \hat{N}_{k-1})$, $\hat{C}_1 = (\hat{N}_1, \dots, \hat{N}_k)$.
Algorithm 2 System identification via random inputs.
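The following is a minimal end-to-end numerical sketch of the identification phase under assumed toy dynamics: Rademacher inputs are injected on top of a stabilizing controller, the moments $N_j \approx (A + BK)^j B$ are estimated by averaging $x_{t+j+1} \xi_t^{\top}$, and $(\hat{A}, \hat{B})$ are recovered by a least-squares solve. All constants, the toy disturbance, and the exact normalization are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k, T0 = 2, 1, 2, 20_000
A = np.array([[0.6, 0.2], [0.0, 0.5]])       # toy, already stable: take K = 0
B = np.array([[0.0], [1.0]])
K = np.zeros((m, n))                         # assumed stabilizing controller

# Exploration: inject i.i.d. Rademacher inputs on top of the stabilizing controller.
xs, xis = [np.zeros(n)], []
for t in range(T0 + k + 1):
    xi = rng.choice([-1.0, 1.0], size=m)
    u = K @ xs[-1] + xi
    w = 0.1 * np.sin(0.01 * t) * np.ones(n)  # a bounded, oblivious disturbance
    xs.append(A @ xs[-1] + B @ u + w)
    xis.append(xi)

# Moment estimates N_j ~ (A + BK)^j B from averages of x_{t+j+1} xi_t^T.
N = [np.mean([np.outer(xs[t + j + 1], xis[t]) for t in range(T0)], axis=0)
     for j in range(k + 1)]

# Recover B_hat = N_0 and (A + BK) by least squares on N_{j+1} = (A + BK) N_j.
C0, C1 = np.hstack(N[:k]), np.hstack(N[1:k + 1])
ABK_hat = C1 @ C0.T @ np.linalg.inv(C0 @ C0.T)
B_hat, A_hat = N[0], ABK_hat - N[0] @ K
print(np.round(A_hat, 2), np.round(B_hat, 2))
```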
Theorem 11.

Under the assumptions listed in Section 2, the regret incurred by Algorithm 1 admits the upper bound¹ $\mathrm{Regret} = O\big(\mathrm{poly}(\text{system parameters})\, T^{2/3}\big)$. In particular, this is the case for the choices $T_0 = \Theta(T^{2/3})$, $H = \Theta(\log T)$, and a suitably chosen learning rate $\eta$.

¹ The unnatural scaling in some of the parameters occurs because of simplifying choices made in the analysis; also, the constants are not tuned to produce the optimal dependence on parameters other than $T$.

5 Regret Analysis

To present the proof concisely, we set up a few articles of use. For a generic algorithm $\mathcal{A}$ operating on a generic linear dynamical system specified via a matrix pair $(A, B)$ and perturbations $\{w_t\}$, let

  1. $J(\mathcal{A}; A, B, \{w_t\})$ be the cost of executing $\mathcal{A}$, as incurred on the last $T - T_0$ time steps,

  2. $x_t^{\mathcal{A}}$ be the state achieved at time step $t$, and

  3. $u_t^{\mathcal{A}}$ be the control executed at time step $t$.

We also note the following result from [ABH+19a], which applies to the case when the matrices that govern the underlying dynamics are made known to the algorithm. In such a case, an exact inference of the perturbations $w_t$ is possible.

Theorem 12 ([ABH+19a]).

Let $K$ be a $(\kappa, \gamma)$-strongly stable controller for a generic system $(A, B)$, $\{w_t\}$ be a perturbation sequence with $\|w_t\| \le W$, and $\{c_t\}$ be costs satisfying Assumption 3. Then there exists an algorithm $\mathcal{A}$ (specifically, Phase 2 of Algorithm 1 with a suitable learning rate and $(\hat{A}, \hat{B}) = (A, B)$), utilizing $K$, that guarantees

$$J(\mathcal{A}; A, B, \{w_t\}) - \min_{K^{\star} \in \mathcal{K}} J(K^{\star}; A, B, \{w_t\}) \le O\big(\mathrm{poly}(\text{system parameters})\,\sqrt{T}\big).$$

Proof of Theorem 11.

Let $R_{\mathrm{explore}}$ denote the contribution to the regret associated with the first $T_0$ rounds of exploration. By Lemma 19, we have that $R_{\mathrm{explore}} = O\big(\mathrm{poly}(\text{system parameters})\, T_0\big)$.

Let $\varepsilon = \max\{\|\hat{A} - A\|, \|\hat{B} - B\|\}$ in the arguments below.

From this point on, let $\mathcal{A}$ be the algorithm, from [ABH+19a], executed in Phase 2. By the Simulation Lemma (Lemma 13), the iterates of Algorithm 1 on the true system coincide with those of $\mathcal{A}$ on the fictional system $(\hat{A}, \hat{B})$ with perturbations $\{\hat{w}_t\}$; the regret therefore decomposes into the cost of exploration, the regret of $\mathcal{A}$ on the fictional system, and the difference in cost between the true and the fictional systems for the learner and for the benchmark controllers.

The middle term in the regret can be upper bounded by the regret of algorithm $\mathcal{A}$ on the fictional system $(\hat{A}, \hat{B})$ and the perturbation sequence $\{\hat{w}_t\}$. Before we can invoke Theorem 12, observe that

  1. By the Preservation of Stability Lemma (Lemma 14), $K$ is strongly stable on $(\hat{A}, \hat{B})$, as long as the estimation error $\varepsilon$ is sufficiently small,

  2. Lemma 17 ensures that the iterates produced by the execution of the algorithm satisfy the required bound on the magnitude of the state, as long as $\varepsilon$ is sufficiently small.

With the above observations in place, Theorem 12 guarantees that the regret of $\mathcal{A}$ on the fictional system is at most $O\big(\mathrm{poly}(\text{system parameters})\sqrt{T}\big)$.

The remaining terms can be bounded as the Stability of Value Function Lemma (Lemma 15) indicates, while constraining our choices of $T_0$, $H$ and $\eta$ accordingly, and observing that the identification error $\varepsilon$ decays as $O(1/\sqrt{T_0})$.

The last claim follows by Theorem 18, and suggests the optimal (by this analysis) choice of $T_0 = \Theta(T^{2/3})$. Apart from this proof, all other proofs and statements list exact upper bounds on the polynomial factors. Here they have been omitted for ease of presentation. ∎
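The rough accounting behind the $T^{2/3}$ rate can be summarized as follows; this is a sketch with all polynomial factors suppressed, matching the discussion above rather than the omitted exact bounds.

```latex
% Exploration for T_0 rounds costs O(T_0); by Theorem 18 it yields estimates with
% error eps = O(1/sqrt(T_0)); by Lemma 15 this mismatch inflates the cost of the
% remaining rounds by O(eps * T); Phase 2 itself incurs O(sqrt(T)) regret
% (Theorem 12). Balancing the first and third terms gives T_0 = Theta(T^{2/3}).
\[
  \mathrm{Regret} \;\lesssim\;
  \underbrace{T_0}_{\text{exploration}}
  \;+\; \underbrace{\sqrt{T}}_{\text{Phase 2 regret}}
  \;+\; \underbrace{\varepsilon\, T}_{\text{model mismatch}},
  \qquad \varepsilon \approx \frac{1}{\sqrt{T_0}}
  \;\Longrightarrow\;
  T_0 = \Theta\big(T^{2/3}\big), \quad \mathrm{Regret} = O\big(T^{2/3}\big).
\]
```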

The regret minimizing algorithm, Phase 2 of Algorithm 1, chooses its actions so as to optimize for the cost of the perturbation-based controller in a fictional linear dynamical system described via the matrix pair $(\hat{A}, \hat{B})$ and the perturbation sequence $\{\hat{w}_t\}$. The following lemma shows that the choice of $\hat{w}_t$ ensures that the state-control sequence visited by Algorithm 1 coincides with the sequence visited by the regret-minimizing algorithm in the fictional system.

Lemma 13 (Simulation Lemma).

Let $\mathcal{A}$ be the algorithm, from [ABH+19a], executed in Phase 2, and let $(x_t, u_t)$ be the state-control iterates produced by Algorithm 1. Then, for all $t > T_0$, $x_t = x_t^{\mathcal{A}}$ and $u_t = u_t^{\mathcal{A}}$, where the latter are the iterates of $\mathcal{A}$ on the fictional system $(\hat{A}, \hat{B})$ with perturbations $\{\hat{w}_t\}$.

Proof.

This proof follows by induction on $t$. Note that at the start of Phase 2, $\mathcal{A}$ is fed the same initial state as Algorithm 1 by the choice of the initial disturbance estimates. Say that for some $t$, the inductive hypothesis is true, so that the two executions agree up to time $t$; consequently they choose the same action $u_t$. This, in turn, implies, by the choice of $\hat{w}_t = x_{t+1} - \hat{A} x_t - \hat{B} u_t$, that the state of the fictional system at time $t+1$ is

$$\hat{A} x_t + \hat{B} u_t + \hat{w}_t = x_{t+1}.$$

The claim follows. ∎

The lemma stated below guarantees that the strong stability of a linear controller is approximately preserved under small deviations of the system matrices.

Lemma 14 (Preservation of Stability).

If a linear controller $K$ is $(\kappa, \gamma)$-strongly stable for a linear dynamical system $(A, B)$, i.e. $A + BK = H L H^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|K\|, \|H\|, \|H^{-1}\| \le \kappa$, then $K$ is strongly stable (with decay parameter $\gamma/2$) for a linear dynamical system $(\hat{A}, \hat{B})$, i.e. $\hat{A} + \hat{B} K = H \hat{L} H^{-1}$ with $\|\hat{L}\| \le 1 - \gamma/2$, as long as $\|\hat{A} - A\| + \kappa \|\hat{B} - B\| \le \frac{\gamma}{2\kappa^2}$. Furthermore, when in agreement with the said conditions, the transforming matrices that certify strong stability in both these cases coincide, and the transformed matrices obey $\|\hat{L} - L\| \le \gamma/2$.

Proof.

Let $A + BK = H L H^{-1}$ with $\|L\| \le 1 - \gamma$, $\|H\|, \|H^{-1}\| \le \kappa$. Now

$$\hat{A} + \hat{B} K = H\Big(L + H^{-1}\big((\hat{A} - A) + (\hat{B} - B)K\big)H\Big)H^{-1} =: H \hat{L} H^{-1}.$$

It suffices to note that $\|\hat{L} - L\| \le \kappa^2\big(\|\hat{A} - A\| + \kappa\|\hat{B} - B\|\big) \le \gamma/2$. ∎

The next lemma establishes that if the same linear state-feedback policy is executed on the actual and the fictional linear dynamical system, the difference between the costs incurred in the two scenarios varies proportionally with some measure of distance between the two systems.

Lemma 15 (Stability of Value Function).

Let $(A, B)$ and $(\hat{A}, \hat{B})$ be two systems, with perturbation sequences $\{w_t\}$ and $\{\hat{w}_t\}$ respectively. As long as the deviations $\|\hat{A} - A\|, \|\hat{B} - B\|$ and $\max_t \|\hat{w}_t - w_t\|$ are sufficiently small, the costs $J(K; A, B, \{w_t\})$ and $J(K; \hat{A}, \hat{B}, \{\hat{w}_t\})$ of executing the same controller on the two systems differ by at most a multiple (polynomial in the system parameters) of $T$ times these deviations, for any controller $K$ that is strongly stable with respect to $(A, B)$.

Proof.

Under the action of a linear controller $K$, which is strongly stable for $(A, B)$, it may be verified that the state admits an expansion in powers of the closed-loop matrix $A + BK$ applied to the past perturbations.

Now, if the deviation between the two systems is small enough, then by the Preservation of Stability Lemma (Lemma 14), $K$ is strongly stable for $(\hat{A}, \hat{B})$ as well. Consequently, the analogous expansion holds for the fictional system.

Finally, note that the difference in costs is controlled by the difference between the corresponding state-action pairs; the last inequality follows from the addendum attached to Lemma 14. Lastly, observe, as a consequence of Lemma 16 and Lemma 14, that the accumulated difference between the states of the two systems scales with the deviations in the system matrices and the perturbation sequences. ∎

Lemma 16.

For any matrix pair sufficiently close to $(A, B)$, in the sense of Lemma 14, we have the bound stated below.

Proof.

We make the inductive claim stated below. The truth of this claim for the base case is easily verifiable. Assuming the inductive claim for some $j$, observe that it extends to $j + 1$.

Finally, observe that the inductive claim implies the stated bound. ∎

While $\|w_t\|$ is bounded by assumption, the next lemma bounds the possible magnitude of the state during the execution of Phase 2. Please see the proof for the exact polynomial coefficients.

Lemma 17.

During the execution of Phase 2 of Algorithm 1, the iterates $x_t$ produced satisfy the bound stated below, as long as the estimation errors $\|\hat{A} - A\|, \|\hat{B} - B\|$ are sufficiently small.

Proof.

Consider a generic linear dynamical system that evolves as $x_{t+1} = A x_t + B u_t + w_t$, where the control is chosen as a perturbation-based policy on an auxiliary disturbance sequence. Such a choice entails, for any $t$, a recursion on the magnitude of the state.

Note that Phase 2 of Algorithm 1 is a specific instance of this setting, with the disturbance estimates $\hat{w}_t$ playing the role of the auxiliary sequence. We put forward the (strong) inductive hypothesis that the bound below holds for all time steps up to $t$.

Now, observe that the recursion yields the inductive step.

The base case may be verified via computation. To simplify the expression, choose the constants as stated, to obtain that, as long as the estimation error is sufficiently small, the claimed bound holds. ∎

6 System Identification via Random Inputs

This section details the guarantees afforded by Algorithm 2. Define $N_j = (A + BK)^j B$. The said algorithm attempts to identify the deterministic-equivalent dynamics, i.e. the matrix pair $(A, B)$, by first identifying matrices of the form $N_j$, and then recovering $(A, B)$ by solving an associated linear system of equations.
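To see why random inputs expose these matrices, the following sketch records the population identity that the empirical averages in Algorithm 2 estimate; it assumes the exploration inputs $\xi_t$ are i.i.d., zero-mean, with identity covariance, and independent of the oblivious disturbances.

```latex
% Unrolling the dynamics under u_s = K x_s + \xi_s gives
%   x_{t+j+1} = (A+BK)^{j+1} x_t
%             + \sum_{i=0}^{j} (A+BK)^i \big( B \xi_{t+j-i} + w_{t+j-i} \big).
% Since \xi_t is zero-mean, has identity covariance, and is independent of x_t,
% of the other inputs, and of the oblivious disturbances, only the i = j term
% survives upon right-multiplying by \xi_t^\top and taking expectations:
\[
  \mathbb{E}\!\left[ x_{t+j+1}\, \xi_t^{\top} \right]
  \;=\; (A + BK)^{j} B \;=\; N_j .
\]
```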

Theorem 18 (System Recovery).

Under the assumptions listed in Section 2, when Algorithm 2 is run for $T_0 = \Omega\big(\mathrm{poly}(\text{system parameters})\,\varepsilon^{-2}\log(1/\delta)\big)$ steps, it is guaranteed that the output pair $(\hat{A}, \hat{B})$ satisfies, with probability at least $1 - \delta$, that $\|\hat{A} - A\|, \|\hat{B} - B\| \le \varepsilon$.

Proof.

As a consequence of controllability, observe that the pair $(A + BK, B)$ is the unique solution of the system of equations

$$N_{j+1} = (A + BK) N_j, \quad j = 0, \dots, k - 1, \qquad N_0 = B.$$

Now, if, for all $j$, the estimate $\hat{N}_j$ is $\varepsilon_0$-close to $N_j$, it follows that $\hat{B} = \hat{N}_0$ is $\varepsilon_0$-close to $B$; in addition, the error in $\hat{A}$ is controlled by the conditioning of this linear system. By Lemma 21, we have a quantitative version of this statement. So, setting $\varepsilon_0$ sufficiently small suffices. This may be accomplished by Lemma 20. ∎

The evolution of the state sequence during the identification phase, in terms of the control perturbations, is stated below. Following this, we state an upper bound on the said sequence.

$$x_{t+1} = (A + BK)^{t} x_1 + \sum_{i=0}^{t-1} (A + BK)^{i}\big(B \xi_{t-i} + w_{t-i}\big). \qquad (2)$$
Lemma 19.

The states $x_t$ produced by Algorithm 2 remain bounded, uniformly in $t$, by a polynomial in the system parameters.

Proof.

The strong stability of $K$ suffices to establish this claim.

In addition to the sub-multiplicativity of the operator norm, we use that $\|(A + BK)^{i}\| \le \kappa^2 (1 - \gamma)^{i}$ for all $i$, which follows from the decomposition $A + BK = H L H^{-1}$. ∎

6.1 Step 1: Moment Recovery

The following lemma promises an approximate recovery of the $N_j$'s through an appeal to arguments involving measures of concentration.

Lemma 20.

Algorithm 2 satisfies, for all $j \in \{0, \dots, k\}$, with probability $1 - \delta$ or more, $\|\hat{N}_j - N_j\| = O\big(\mathrm{poly}(\text{system parameters})\sqrt{\log(1/\delta)/T_0}\big)$.

Proof.

With Equation 2, the fact that $\xi_t$ is zero-mean with isotropic unit covariance, and that it is chosen independently of the perturbations and the preceding states, implies