Classical control theory assumes that nature evolves according to well-specified dynamics perturbed by i.i.d. noise. While this approximation has proven very useful for controlling some real-world systems, it does not allow for the construction of truly robust controllers. The focus of this paper is the construction of truly robust controllers even when the underlying system is unknown.
Specifically, we consider the case in which the underlying system is linear, but has potentially adversarial perturbations (that can model deviations from linearity), i.e.
$$x_{t+1} = A x_t + B u_t + w_t,$$
where $x_t$ is the (observed) system state, $u_t$ is a learner-chosen control and $w_t$ is an adversarial disturbance. The goal of the controller is to minimize a sum of sequentially revealed adversarial cost functions $c_t(x_t, u_t)$.
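As a concrete illustration, the interaction protocol above can be simulated in a few lines. The dimensions, the matrices $A, B$, the gain, and the (non-i.i.d.) disturbance sequence below are illustrative stand-ins, not quantities from the paper:

```python
import numpy as np

# Minimal sketch of the protocol x_{t+1} = A x_t + B u_t + w_t.
# A, B, the gain K, and the disturbances are illustrative.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.5]])               # a hypothetical stabilizing gain

x = np.zeros(2)
states = []
for t in range(50):
    u = -K @ x                           # learner-chosen control
    w = 0.1 * np.sin(np.arange(2) + t)   # adversarial: bounded, not i.i.d.
    x = A @ x + B @ u + w                # system evolves
    states.append(x.copy())

print(max(np.linalg.norm(s) for s in states))
```

Note that the disturbance here is a deterministic, time-correlated sequence; nothing in the protocol requires it to be zero-mean or independent across time.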
The goal in this game-theoretic setting is to minimize policy regret, or the regret compared to the best controller from a class $\Pi$ that is made aware of the system dynamics, the cost sequence, and all the disturbances ahead of time:
$$\text{Regret} = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{\pi \in \Pi} \sum_{t=1}^{T} c_t(x_t^{\pi}, u_t^{\pi}).$$
It may be noted that the cost of the benchmark is measured on the counterfactual state-action sequence that the benchmark policy would visit, as opposed to the state sequence visited by the learner. In contrast to the worst-case optimality of robust control, achieving low regret demands the stronger promise of instance-wise (near) optimality on every perturbation sequence.
Non-stochastic Control: Without knowledge of the system matrices $(A, B)$ and the adversarial perturbations $w_t$, iteratively generate controls $u_t$ to minimize regret over sequentially revealed adversarial convex costs.
The above constitutes a powerful (non-stochastic) adversarial generalization of stochastic control.
Our starting point is the recent work of [ABH+19a], where the authors proposed a novel class of policies, namely choosing actions as a linear combination of past disturbances, $u_t = \sum_{i=1}^{H} M_i w_{t-i}$. They demonstrate that learning the coefficients $M_i$ via online convex optimization allows their controller to compete with the class of all linear policies in the state. This latter class is important, since it is known to be optimal for the standard setting of i.i.d. Gaussian noise and quadratic loss functions, also known as the Linear Quadratic Regulator (LQR), and for associated robust control settings (see [BB08] for examples).
The caveat in [ABH+19a] is that the system matrices need to be known. In the case of a known system, the disturbances can simply be computed via observations of the state, i.e. $w_t = x_{t+1} - A x_t - B u_t$. However, if the system is unknown, it is not clear how to generalize their approach. Fundamentally, the difficulty lies in identifying the system, i.e. the matrices $(A, B)$, from the observations. This is non-trivial since the noise is assumed to be adversarial, and was posed as an open question in [TU19].
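To see why knowledge of the system suffices, the recovery identity above can be checked numerically. The matrices and the bounded disturbance sequence below are illustrative; the point is only that, given $(A, B)$, the recovery is exact regardless of how the disturbances were generated:

```python
import numpy as np

# With known (A, B), disturbances are recovered exactly from observed
# states via w_t = x_{t+1} - A x_t - B u_t.  Values are illustrative.
rng = np.random.default_rng(1)
A = np.array([[0.5, 0.2], [0.0, 0.6]])
B = np.eye(2)
w_true = rng.uniform(-1, 1, size=(20, 2))     # adversarial, bounded

x = np.zeros(2)
recovered = []
for t in range(20):
    u = rng.standard_normal(2)
    x_next = A @ x + B @ u + w_true[t]        # observed next state
    recovered.append(x_next - A @ x - B @ u)  # exact when (A, B) known
    x = x_next

print(np.allclose(recovered, w_true))
```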
In this paper we show how to overcome this difficulty and obtain sublinear regret for controlling an unknown system in the presence of adversarial noise and adversarial loss functions. The regret notion we adopt is policy regret against linear policies, exactly as in [ABH+19a]. The crucial component we introduce is adversarial sys-id: an efficient method for uncovering the underlying system even in the presence of adversarial perturbations. This method is not based on the naive least-squares approach of regressing $x_{t+1}$ on $(x_t, u_t)$. In particular, without independent, zero-mean $w_t$'s, the latter approach can produce inconsistent estimates of the system matrices.
Informally, the main result is:
Theorem 1 (Informal Statement).
For an unknown linear dynamical system as in (1), where the perturbations $w_t$ are chosen by an oblivious adversary and are bounded in norm, there exists an efficient algorithm that generates an adaptive sequence of controls $\{u_t\}$ for which the regret is sublinear in $T$.
1.1 Related Work
There has been a resurgence of literature on the control of linear dynamical systems in recent machine learning venues. The case of known systems was extensively studied in the control literature; see the survey [STE94]. Sample complexity and regret bounds for control (under Gaussian noise) were obtained in [AS11, DMM+18, ALS19, MTR19, CKM19]. The works of [ABK14], [CHK+18] and [AHS19b] allow for control of LDS with adversarial loss functions. Provable control in the Gaussian noise setting via the policy gradient method was studied in [FGK+18]. These works operate in the absence of perturbations or assume that the perturbations are i.i.d., as opposed to adversarial, as in our setting.
The most relevant reformulation of the control problem that enables our result is the recent work of [ABH+19a], who use online learning techniques and convex relaxation to obtain provable bounds for LDS with adversarial perturbations. However, the result and the algorithm make extensive use of the availability of the system matrices.
Recently [SBR19] showed how to use least squares to learn an underlying Markov operator in lieu of the system and in the presence of noise. It is possible that their recovery technique can also be used to generate perturbation estimates, and then apply the techniques of [ABH+19a] for control. However, the conditions on the system they assume are even more general than ours, and it is not clear if they are sufficient for control. For system identification in the stochastic noise setting, [OO19] prove sample complexity bounds for the Ho-Kalman algorithm [HK66]. A stronger result that holds under partial observability was shown in [SRD19]. While these results apply to stochastic noise, parameter recovery in the setting of adversarial noise was recently studied in the contextual bandits literature [KWS18]. Other relevant work from the machine learning literature includes the technique of spectral filtering for learning and open-loop control of partially observable systems [HSZ17, AHL+18, HLS+18].
2 Problem Setting
Linear Dynamical Systems
We consider the setting of linear dynamical systems with time-invariant dynamics, i.e.
$$x_{t+1} = A x_t + B u_t + w_t,$$
where $x_t \in \mathbb{R}^{d_x}$ is the state and $u_t \in \mathbb{R}^{d_u}$ is the control. The perturbation sequence $\{w_t\}$ may be adversarially chosen at the beginning of the interaction, and is unknown to the learner. Likewise, the system is augmented with time-varying convex cost functions $c_t(x, u)$. The total cost associated with a sequence of (random) controls, derived through an algorithm $\mathcal{A}$, is
$$J(\mathcal{A}) = \sum_{t=1}^{T} c_t(x_t, u_t).$$
With some abuse of notation, we will denote by $J(K)$ the cost associated with the execution of the controls a linear controller $K$ would suggest, i.e. $u_t = -K x_t$. The following conditions are assumed on the costs and the perturbations.
The perturbation sequence is bounded, i.e. $\|w_t\| \le W$, and chosen at the start of the interaction, implying that this sequence does not depend on the choice of the controls $\{u_t\}$.
As long as $\|x\|, \|u\| \le D$, the convex costs satisfy $\|\nabla_x c_t(x, u)\|, \|\nabla_u c_t(x, u)\| \le G D$.
The fundamental Linear Quadratic Regulator (LQR) problem is a specialization of the above to the case when the perturbations are i.i.d. Gaussian and the cost functions are positive quadratics, i.e. $c_t(x, u) = x^\top Q x + u^\top R u$ with $Q, R \succeq 0$.
We consider the setting where the learner has no knowledge of the system matrices $(A, B)$ or the perturbation sequence $\{w_t\}$. In this case, any inference of these quantities may only take place indirectly, through observations of the state $x_t$. Furthermore, the learner is made aware of the cost function $c_t$ only once the choice of $u_t$ has been made.
Under such constraints, the objective of the algorithm is to choose an (adaptive) sequence of controls that ensures that the cost suffered is comparable to that of the best linear controller with complete knowledge of the system dynamics and foreknowledge of the cost and perturbation sequences. Formally, we measure regret as
$$\text{Regret} = \sum_{t=1}^{T} c_t(x_t, u_t) - \min_{K \in \mathcal{K}} \sum_{t=1}^{T} c_t(x_t^{K}, u_t^{K}),$$
where $\mathcal{K}$ is the set of $(\kappa, \gamma)$-strongly stable linear controllers defined below. The notion of strong stability, introduced in [CHK+18], offers a quantification of the classical notion of a stable controller in a manner that permits a discussion of non-asymptotic regret bounds.
Definition 4 (Strong Stability).
A linear controller $K$ is $(\kappa, \gamma)$-strongly stable for a linear dynamical system specified via $(A, B)$ if there exists a decomposition $A - BK = Q L Q^{-1}$ with $\|L\| \le 1 - \gamma$, and $\|K\|, \|Q\|, \|Q^{-1}\| \le \kappa$.
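The definition can be made concrete with a small numeric check. A simple (but not always applicable) way to produce the decomposition is an eigendecomposition of the closed-loop matrix; the system below is illustrative, and the sketch assumes $A - BK$ is diagonalizable:

```python
import numpy as np

# Illustrative certification of strong stability: decompose
# A - BK = Q L Q^{-1} via an eigendecomposition, then read off
# gamma (from ||L||) and kappa (from ||K||, ||Q||, ||Q^{-1}||).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.1, 0.5]])

M = A - B @ K
eigvals, Q = np.linalg.eig(M)            # assumes M is diagonalizable
L = np.diag(eigvals)
gamma = 1.0 - max(abs(eigvals))          # so that ||L|| <= 1 - gamma
kappa = max(np.linalg.norm(K, 2),
            np.linalg.norm(Q, 2),
            np.linalg.norm(np.linalg.inv(Q), 2))

assert np.allclose(Q @ L @ np.linalg.inv(Q), M)
print(round(gamma, 3), round(kappa, 3))
```

Note that the eigendecomposition is only one admissible choice of $(Q, L)$; other decompositions may certify better constants.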
We also assume the learner has access to a fixed stabilizing controller $K_0$ for the transition matrices $(A, B)$. When operating under unknown transition matrices, the knowledge of a stabilizing controller permits the learner to prevent an inflation of the size of the state beyond reasonable bounds.
The learner knows a linear controller $K_0$ that is $(\kappa, \gamma)$-strongly stable for the true, but unknown, transition matrices $(A, B)$.
The non-triviality of the regret guarantee rests on the benchmark set $\mathcal{K}$ not being empty. As noted in [CHK+18], a sufficient condition to ensure the existence of a strongly stable controller is the controllability of the linear system $(A, B)$. Informally, controllability for a linear system is characterized by the ability to drive the system to any desired state through appropriate control inputs under deterministic dynamics, i.e. $w_t = 0$.
Definition 6 (Strong Controllability).
For a linear dynamical system $(A, B)$, define, for $k \ge 1$, a matrix $C_k \in \mathbb{R}^{d_x \times k d_u}$ as
$$C_k = [B, AB, A^2 B, \dots, A^{k-1} B].$$
A linear dynamical system $(A, B)$ is controllable with controllability index $k$ if $C_k$ has full row-rank. In addition, such a system is also $(k, \kappa)$-strongly controllable if $\|(C_k C_k^\top)^{-1}\| \le \kappa$.
As with stability, a quantitative analog of controllability, first suggested in [CHK+18], is presented above. It is useful to note that, as a consequence of the Cayley-Hamilton theorem, for a controllable system the controllability index is always at most the dimension of the state space. We adopt the assumption that the system is strongly controllable.
The linear dynamical system $(A, B)$ is $(k, \kappa)$-strongly controllable.
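The controllability matrix and its row-rank condition are straightforward to compute; the double-integrator-like system below is an illustrative example whose controllability index is $2$:

```python
import numpy as np

# Illustrative controllability check: build C_k = [B, AB, ..., A^{k-1}B]
# and test full row-rank.  The system here is a stand-in example.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])

def controllability_matrix(A, B, k):
    """Stack B, AB, ..., A^{k-1}B horizontally."""
    return np.hstack([np.linalg.matrix_power(A, j) @ B for j in range(k)])

C2 = controllability_matrix(A, B, 2)
print(np.linalg.matrix_rank(C2))   # full row-rank at k = 2
```

Here $C_1 = [B]$ has rank $1 < d_x$, while $C_2$ has full row-rank, so the controllability index is exactly $2$.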
3 Preliminaries
This section sets up the concepts that aid the algorithmic description and the analysis.
3.1 Parameterization of the Controller
The total cost objective of a linear controller is non-convex in the canonical parameterization [FGK+18], i.e. $J(K)$ is not convex in $K$. To remedy this, we use an alternative, perturbation-based parameterization for the controller, recently proposed in [ABH+19a], in which the advised control is linear in the past perturbations (as opposed to the state). This permits the offline search for an optimal controller to be posed as a convex program.
A perturbation-based policy $\pi(K, \{M_i\})$ chooses the recommended action at a state $x_t$ as
$$u_t = -K x_t + \sum_{i=1}^{H} M_i w_{t-i}.$$
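A sketch of such a policy is below; the horizon $H$, the dimensions, and the (constant) weights $M_i$ are illustrative placeholders for the parameters the algorithm would learn:

```python
import numpy as np

# Sketch of a perturbation-based policy:
# u_t = -K x_t + sum_{i=1}^{H} M_i w_{t-i}.
# H, the dimensions, and the weights M_i are illustrative.
H = 3
d_x, d_u = 2, 1
K = np.zeros((d_u, d_x))
M = [0.1 * np.ones((d_u, d_x)) for _ in range(H)]  # learnable parameters

def dac_action(x, past_w):
    """past_w[0] = w_{t-1}, past_w[1] = w_{t-2}, and so on."""
    u = -K @ x
    for i in range(min(H, len(past_w))):
        u = u + M[i] @ past_w[i]
    return u

u = dac_action(np.ones(d_x), [np.ones(d_x)] * H)
print(u)
```

The key property is that, for fixed $K$, the action (and hence, as shown next, the state) is linear in the matrices $M_i$, which is what makes the offline search convex.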
3.2 State Evolution
Under the execution of a stationary policy $\pi(K, \{M_i\})$, the state may be expressed as a linear transformation of the perturbations, where the linear transformation itself is linear in the parameterizing matrices $M_i$. We set up this notation below.
For a matrix pair $(A, B)$, define the state-perturbation transfer matrix:
In [ABH+19a] the authors note that, under the linear dynamical system with perturbations $\{w_t\}$, the state produced by this policy evolves as specified below. Of particular importance in this context is the observation that the $x_t$'s are linear in the $M_i$'s.
Following [ABH+19a], we adopt the definition of the surrogate setting.
Define the surrogate state and the surrogate action as stated below. The surrogate cost is chosen to be the specialization of the $t$-th cost function with the surrogate state-action pair as the argument.
4 The Algorithm
Our approach follows the explore-then-commit paradigm, identifying the underlying deterministic-equivalent dynamics to within a specified accuracy using random inputs in the exploration phase. Such an approximate recovery of parameters permits an approximate recovery of the perturbations, thus facilitating the execution of the perturbation-based controller on the approximated perturbations.
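The two-phase structure can be sketched as follows. The `step`, `estimate`, and `control` callables are hypothetical stand-ins for the paper's subroutines, not the paper's actual implementations:

```python
import numpy as np

# Skeleton of the explore-then-commit structure (illustrative):
# Phase 1 plays random sign inputs for T0 rounds to identify the system;
# Phase 2 commits to a controller computed from the estimate.
def explore_then_commit(T, T0, step, estimate, control, seed=0):
    rng = np.random.default_rng(seed)
    observations = []
    for t in range(T0):                        # Phase 1: exploration
        eta = rng.choice([-1.0, 1.0], size=2)  # random sign inputs
        observations.append(step(eta))
    A_hat, B_hat = estimate(observations)      # adversarial sys-id
    for t in range(T0, T):                     # Phase 2: commit
        step(control(A_hat, B_hat, t))
    return A_hat, B_hat
```

The choice of the exploration length $T_0$ trades off the cost of exploration against the accuracy of the recovered dynamics, which is what drives the overall regret bound.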
Under the assumptions listed in Section 2, the regret incurred by Algorithm 1 admits the upper bound stated below. (The unnatural scaling in the bound occurs because of certain coarse substitutions made in the analysis; also, the magnitude of the exploration is not tuned to produce the optimal dependence on parameters other than $T$.) In particular, this is the case for suitable choices of the algorithm's parameters.
5 Regret Analysis
To present the proof concisely, we set up some notation. For a generic algorithm operating on a generic linear dynamical system specified via a matrix pair $(A, B)$ and perturbations $\{w_t\}$, let
be the cost of executing , as incurred on the last time steps,
be the state achieved at time step , and
be the control executed at time step .
We also note the following result from [ABH+19a], which applies to the case when the matrices that govern the underlying dynamics are made known to the algorithm. In such a case, an exact inference of the perturbations is possible.
Theorem 12 ([ABH+19a]).
Proof of Theorem 11.
Let $R_1$ be the contribution to the regret associated with the first $T_0$ rounds of exploration. By Lemma 19, we have that
Let , in the arguments below.
The middle term in the regret can be upper bounded by the regret of the algorithm on the fictional system $(\hat{A}, \hat{B})$ and the perturbation sequence $\{\hat{w}_t\}$. Before we can invoke Theorem 12, observe that
By the Preservation of Stability Lemma (Lemma 14), $K_0$ is strongly stable for $(\hat{A}, \hat{B})$, as long as the estimation error is sufficiently small,
Lemma 17 ensures that the iterates produced by the execution of the algorithm satisfy the requisite norm bound, as long as the estimation error is sufficiently small.
With the above observations in place, Theorem 12 guarantees
The last expression in the preceding line can be bounded as the Stability of Value Function Lemma (Lemma 15) indicates, while constraining our choices of the parameters suitably, and observing that
The last line follows by Theorem 18, and suggests the (by this analysis) optimal choice of $T_0$. Apart from this proof, all other proofs and statements list exact upper bounds on the polynomial factors; here they have been omitted for ease of presentation. ∎
The regret-minimizing algorithm, Phase 2 of Algorithm 1, chooses controls so as to optimize the cost of the perturbation-based controller in a fictional linear dynamical system described via the matrix pair $(\hat{A}, \hat{B})$ and the perturbation sequence $\{\hat{w}_t\}$. The following lemma shows that this choice ensures that the state-control sequence visited by Algorithm 1 coincides with the sequence visited by the regret-minimizing algorithm in the fictional system.
Lemma 13 (Simulation Lemma).
The proof follows by induction on $t$. Note that at the start of Phase 2, the algorithm is fed the correct initial state by construction. Say that for some $t$, the inductive hypothesis is true. Consequently
This, in turn, implies by the choice of the controls that
The claim follows. ∎
The lemma stated below guarantees that the strong stability of a controller is approximately preserved under small deviations of the system matrices.
Lemma 14 (Preservation of Stability).
If a linear controller $K$ is strongly stable for a linear dynamical system $(A, B)$, i.e. $A - BK = Q L Q^{-1}$ with $\|L\| \le 1 - \gamma$ and $\|K\|, \|Q\|, \|Q^{-1}\| \le \kappa$, then $K$ is strongly stable, with slightly degraded parameters, for a linear dynamical system $(A', B')$, as long as $\|A - A'\|$ and $\|B - B'\|$ are sufficiently small. Furthermore, when in agreement with the said conditions, the transforming matrices that certify strong stability in both these cases coincide, and the transformed matrices are correspondingly close.
Let with , . Now
It suffices to note that . ∎
The next lemma establishes that if the same linear state-feedback policy is executed on the actual and the fictional linear dynamical system, the difference between the costs incurred in the two scenarios varies proportionally with some measure of distance between the two systems.
Lemma 15 (Stability of Value Function).
As long as $\|A - A'\|$ and $\|B - B'\|$ are sufficiently small, it is true that
for any strongly stable controller with respect to .
Under the action of a linear controller , which is strongly stable for , it may be verified that
Now, if , note that . By the Preservation of Stability Lemma (Lemma 14), is strongly stable for . Consequently, .
For any matrix pair , such that , we have
We make the inductive claim stated below. The truth of this claim for the base case is easily verifiable. Assuming the inductive claim for some index, observe
Finally, observe that . ∎
While the perturbations $w_t$ are bounded by assumption, the next lemma bounds the possible magnitude of the states $x_t$. Please see the proof for the exact polynomial coefficients.
During the execution of Phase 2 of Algorithm 1, the iterates produced satisfy, as long as the estimation error is sufficiently small, that
Consider a generic linear dynamical system that evolves as $x_{t+1} = A x_t + B u_t + w_t$, where the control is chosen as $u_t = -K x_t + \sum_{i=1}^{H} M_i w_{t-i}$. Such a choice entails that for any $t$
Note that Phase 2 of Algorithm 1 is a specific instance of this setting, with the perturbations chosen as the estimated perturbations $\hat{w}_t$. We put forward the (strong) inductive hypothesis that for all earlier time steps, we have
Now, observe that
The base case may be verified via direct computation. To simplify the expression, choose the parameters as stated, to obtain that, as long as the estimation error is sufficiently small,
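The kind of state bound this lemma establishes can be sanity-checked numerically: for a $(\kappa, \gamma)$-strongly-stable closed loop with $\|w_t\| \le W$, unrolling the recursion gives $\|x_t\| \le \kappa^2 W / \gamma$. The closed-loop matrix and $W$ below are illustrative:

```python
import numpy as np

# Numeric check of the standard bound ||x_t|| <= kappa^2 W / gamma for a
# strongly-stable closed loop driven by bounded disturbances.
# The closed-loop matrix and W are illustrative.
A_cl = np.array([[0.7, 0.2], [0.0, 0.6]])   # closed-loop A - BK
eigvals, Q = np.linalg.eig(A_cl)            # assumes diagonalizability
gamma = 1.0 - max(abs(eigvals))
kappa = max(np.linalg.norm(Q, 2), np.linalg.norm(np.linalg.inv(Q), 2), 1.0)
W = 1.0

rng = np.random.default_rng(3)
x = np.zeros(2)
worst = 0.0
for t in range(2000):
    w = rng.uniform(-1, 1, 2)
    w = W * w / max(np.linalg.norm(w), 1.0)  # enforce ||w_t|| <= W
    x = A_cl @ x + w
    worst = max(worst, np.linalg.norm(x))

print(worst <= kappa ** 2 * W / gamma)
```

The bound follows from $\|A_{cl}^j\| \le \kappa^2 (1 - \gamma)^j$ and summing the geometric series, so the check must pass for any bounded disturbance sequence.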
6 System Identification via Random Inputs
This section details the guarantees afforded by Algorithm 2. Define $N_j = (A - BK_0)^j B$. The said algorithm attempts to identify the deterministic-equivalent dynamics, i.e. the matrix pair $(A, B)$, by first identifying matrices of the form $N_j$, and then recovering $(A, B)$ by solving an associated linear system of equations.
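The idea behind the first step can be sketched as follows: inject i.i.d. random sign inputs $\eta_t$ on top of the stabilizing controller and estimate $N_j$ from the empirical correlation between later states and earlier inputs. The system below is illustrative, $K_0 = 0$ is assumed stabilizing, and the adversarial disturbances are set to zero so that the estimate is easy to check; with disturbances present, their independence from $\eta_t$ is what keeps the correlation estimate consistent:

```python
import numpy as np

# Sketch of moment recovery: play u_t = -K0 x_t + eta_t with i.i.d. sign
# inputs eta_t, and estimate N_1 = (A - B K0) B from the empirical
# correlation (1/T) sum_t x_{t+2} eta_t^T.  Matrices, K0 = 0, and T are
# illustrative; w_t = 0 here for a clean sanity check.
rng = np.random.default_rng(2)
A = np.array([[0.5, 0.1], [0.0, 0.4]])
B = np.eye(2)
K0 = np.zeros((2, 2))                  # assume A itself is stable
T = 100_000

eta = rng.choice([-1.0, 1.0], size=(T, 2))
X = np.zeros((T + 1, 2))
for t in range(T):
    u = -K0 @ X[t] + eta[t]
    X[t + 1] = A @ X[t] + B @ u        # w_t = 0 for this check

# Since eta_t is zero-mean with isotropic unit covariance,
# E[x_{t+2} eta_t^T] = (A - B K0) B.
N1_hat = (X[2:].T @ eta[:-1]) / (T - 1)
print(np.linalg.norm(N1_hat - A @ B))
```

The same correlation with a lag of $j$ estimates $N_j$; the second step then extracts $(A, B)$ from the collection $\{N_j\}$ by solving a linear system.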
Theorem 18 (System Recovery).
The evolution of the state sequence during the identification phase, in terms of the control perturbations, is stated below. Following this, we state an upper bound on the norm of the said sequence.
The states produced by Algorithm 2 satisfy
The strong stability of $K_0$ suffices to establish this claim.
In addition to the sub-multiplicativity of the norm, we use that $\|(A - BK_0)^j\| \le \kappa^2 (1 - \gamma)^j$. ∎
6.1 Step 1: Moment Recovery
The following lemma promises an approximate recovery of the matrices $N_j$ through an appeal to concentration-of-measure arguments.
Algorithm 2 satisfies, for all $j$, with probability $1 - \delta$ or more,
Using Equation 2, the fact that $\eta_t$ is zero-mean with isotropic unit covariance, and that it is chosen independently of the perturbations, implies