# On the Sample Complexity of the Linear Quadratic Regulator

This paper addresses the optimal control problem known as the Linear Quadratic Regulator in the case when the dynamics are unknown. We propose a multi-stage procedure, called Coarse-ID control, that estimates a model from a few experimental trials, estimates the error in that model with respect to the truth, and then designs a controller using both the model and uncertainty estimate. Our technique uses contemporary tools from random matrix theory to bound the error in the estimation procedure. We also employ a recently developed approach to control synthesis called System Level Synthesis that enables robust control design by solving a convex optimization problem. We provide end-to-end bounds on the relative error in control cost that are nearly optimal in the number of parameters and that highlight salient properties of the system to be controlled such as closed-loop sensitivity and optimal control magnitude. We show experimentally that the Coarse-ID approach enables efficient computation of a stabilizing controller in regimes where simple control schemes that do not take the model uncertainty into account fail to stabilize the true system.


## 1 Introduction

Having surpassed human performance in video games [38] and Go [47], there has been a renewed interest in applying machine learning techniques to planning and control. In particular, there has been a considerable amount of effort in developing new techniques for continuous control where an autonomous system interacts with a physical environment [15, 31]. A tremendous opportunity lies in deploying these data-driven systems in more demanding interactive tasks including self-driving vehicles, distributed sensor networks, and agile robotics. As the role of machine learning expands to more ambitious tasks, however, it is critical these new technologies be safe and reliable. Failure of such systems could have severe social and economic consequences including the potential loss of human life. How can we guarantee that our new data-driven automated systems are robust?

Unfortunately, there are no clean baselines delineating the possible control performance achievable given a fixed amount of data collected from a system. Such baselines would enable comparisons of different techniques and would allow engineers to trade off between data collection and action in scenarios with high uncertainty. Typically, a key difficulty in establishing baselines is in proving lower bounds that state the minimum amount of knowledge needed to achieve a particular performance, regardless of method. However, in the context of controls, even upper bounds describing the worst-case performance of competing methods are exceptionally rare. Without such estimates, we are left to compare algorithms on a case-by-case basis, and we may have trouble diagnosing whether poor performance is due to algorithm choice or some other error such as a software bug or a mechanical flaw.

In this paper, we attempt to build a foundation for a theoretical understanding of how machine learning interfaces with control by analyzing one of the most well-studied problems in classical optimal control, the Linear Quadratic Regulator (LQR). Here we assume that the system to be controlled obeys linear dynamics, and we wish to minimize some quadratic function of the system state and control action. This problem has been studied for decades in control: it has a simple, closed form solution on the infinite time horizon and an efficient, dynamic programming solution on finite time horizons. When the dynamics are unknown, however, there are far fewer results about achievable performance.

Our contribution is to analyze the LQR problem when the dynamics of the system are unknown, and we can measure the system’s response to varied inputs. A naïve solution to this problem would be to collect some data of how the system behaves over time, fit a model to this data, and then solve the original LQR problem assuming this model is accurate. Unfortunately, while this procedure might perform well given sufficient data, it is difficult to determine how many experiments are necessary in practice. Furthermore, it is easy to construct examples where the procedure fails to find a stabilizing controller.

As an alternative, we propose a method that couples our uncertainty in estimation with the control design. Our main approach uses the following framework of Coarse-ID control to solve the problem of LQR with unknown dynamics:

1. Use supervised learning to learn a coarse model of the dynamical system to be controlled. We refer to the system estimate as the nominal system.

2. Using either prior knowledge or statistical tools like the bootstrap, build probabilistic guarantees about the distance between the nominal system and the true, unknown dynamics.

3. Solve a robust optimization problem over controllers that optimizes performance of the nominal system while penalizing signals with respect to the estimated uncertainty, ensuring stable and robust execution.

We will show that for a sufficient number of observations of the system, this approach is guaranteed to return a control policy with small relative error. In particular, it guarantees asymptotic stability of the closed-loop system. In the case of LQR, step 1 of Coarse-ID control simply requires solving a linear least squares problem, step 2 uses a standard bootstrap technique, and step 3 requires solving a small semidefinite program. Analyzing this approach, on the other hand, requires contemporary techniques in non-asymptotic statistics and a novel parameterization of control problems that renders nonconvex problems convex [35, 54].

We demonstrate the utility of our method on a simple simulation. In the presented example, we show that simply using the nominal system to design a control policy frequently results in unstable closed-loop behavior, even when there is an abundance of data from the true system. However, the Coarse-ID approach finds a stabilizing controller with very few system observations.

### 1.1 Problem Statement and Our Contributions

The standard optimal control problem aims to find a control sequence that minimizes an expected cost. We assume a dynamical system with state $x_t \in \mathbb{R}^n$ can be acted on by a control $u_t \in \mathbb{R}^p$ and obeys the stochastic dynamics

$$x_{t+1} = f_t(x_t, u_t, w_t), \tag{1.1}$$

where $w_t$ is a random process with $w_t$ independent of $w_{t'}$ for all $t \neq t'$. Optimal control then seeks to minimize

$$\begin{array}{ll} \text{minimize} & \mathbb{E}\left[\sum_{t=1}^{T} c_t(x_t, u_t)\right] \\ \text{subject to} & x_{t+1} = f_t(x_t, u_t, w_t). \end{array} \tag{1.2}$$

Here, $c_t$ denotes the state-control cost at every time step, and the input $u_t$ is allowed to depend on the current state $x_t$ and all previous states and actions. In this generality, problem (1.2) encapsulates many of the problems considered in the reinforcement learning literature.

The simplest optimal control problem with continuous state is the Linear Quadratic Regulator (LQR), in which costs are a fixed quadratic function of state and control and the dynamics are linear and time-invariant:

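$$\begin{array}{ll} \text{minimize} & \mathbb{E}\left[\sum_{t=1}^{T} x_t^* Q x_t + u_{t-1}^* R u_{t-1}\right] \\ \text{subject to} & x_{t+1} = A x_t + B u_t + w_t. \end{array} \tag{1.3}$$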

Here $Q$ (resp. $R$) is an $n \times n$ (resp. $p \times p$) positive definite matrix, $A$ and $B$ are called the state transition matrices, and $w_t \in \mathbb{R}^n$ is Gaussian noise with zero mean and covariance $\Sigma_w$. Throughout, $M^*$ denotes the Hermitian transpose of the matrix $M$.

In what follows, we will be concerned with the infinite time horizon variant of the LQR problem where we let the time horizon $T$ go to infinity and minimize the average cost. When the dynamics are known, this problem has a celebrated closed form solution based on the solution of matrix Riccati equations [59]. Indeed, the optimal solution sets $u_t = K x_t$ for a fixed $p \times n$ matrix $K$, and the corresponding optimal cost will serve as our gold-standard baseline to which we will compare the achieved cost of all algorithms.
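As a minimal numerical sketch of this known-dynamics baseline (not part of the original analysis), the static gain can be obtained by iterating the discrete Riccati recursion to a fixed point. The function names `solve_riccati` and `optimal_average_cost`, the double-integrator example, and the assumption of isotropic noise with covariance $\sigma_w^2 I$ are illustrative choices rather than anything specified in the text.

```python
import numpy as np

def solve_riccati(A, B, Q, R, tol=1e-12, max_iter=100_000):
    """Iterate P <- Q + A'P(A + BK), with K = -(R + B'PB)^{-1} B'PA, to a fixed point."""
    P = Q.copy()
    for _ in range(max_iter):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P_next = Q + A.T @ P @ (A + B @ K)
        converged = np.max(np.abs(P_next - P)) < tol
        P = P_next
        if converged:
            break
    # Gain of the static state-feedback law u_t = K x_t at the fixed point.
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

def optimal_average_cost(P, sigma_w):
    """Infinite-horizon average cost, assuming noise covariance sigma_w^2 * I."""
    return sigma_w**2 * np.trace(P)

# Hypothetical example system (a double integrator), chosen only for illustration.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

P, K = solve_riccati(A, B, Q, R)
print("K =", K)
print("spectral radius of A + BK:", max(abs(np.linalg.eigvals(A + B @ K))))
print("optimal average cost:", optimal_average_cost(P, sigma_w=1.0))
```

Plain fixed-point iteration is used here to keep the sketch dependency-free; a dedicated Riccati solver would be the more usual choice in practice.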

In the case when the state transition matrices are unknown, fewer results have been established about what cost is achievable. We will assume that we can conduct experiments of the following form: given some initial state $x_0$, we can evolve the dynamics for $T$ time steps using any control sequence $(u_0, \ldots, u_{T-1})$, measuring the resulting output $(x_1, \ldots, x_T)$. If we run $N$ such independent experiments, what infinite time horizon control cost is achievable using only the data collected? For simplicity of bookkeeping, in our analysis we further assume that we can prepare the system in initial state $x_0 = 0$.

In what follows we will examine the performance of the Coarse-ID control framework in this scenario. We will estimate the errors accrued by least squares estimates of the system dynamics. This estimation error is not easily handled by standard techniques because the design matrix is highly correlated with the model to be estimated. Regardless, for theoretical tractability, we can build a least squares estimate using only the final sample of each of the experiments. Indeed, in Section 2 we prove the following

###### Proposition 1.1.

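Define the matrices

$$G_T = \begin{bmatrix} A^{T-1}B & A^{T-2}B & \cdots & B \end{bmatrix} \quad \text{and} \quad F_T = \begin{bmatrix} A^{T-1} & A^{T-2} & \cdots & I_n \end{bmatrix}. \tag{1.4}$$

Assume we collect data from the linear, time-invariant system initialized at $x_0 = 0$, using i.i.d. inputs $u_t \sim \mathcal{N}(0, \sigma_u^2 I_p)$ at every time step of each rollout. Suppose that the process noise is $w_t \sim \mathcal{N}(0, \sigma_w^2 I_n)$, drawn i.i.d., and that

$$N \geq 8(n+p) + 16\log(4/\delta).$$

Then, with probability at least $1 - \delta$, the least squares estimator using only the final sample of each trajectory satisfies both the inequality

$$\|\hat{A} - A\|_2 \leq \frac{16\,\sigma_w}{\sqrt{\lambda_{\min}\left(\sigma_u^2 G_T G_T^* + \sigma_w^2 F_T F_T^*\right)}} \sqrt{\frac{(n+2p)\log(36/\delta)}{N}}, \tag{1.5}$$

and the inequality

$$\|\hat{B} - B\|_2 \leq \frac{16\,\sigma_w}{\sigma_u} \sqrt{\frac{(n+2p)\log(36/\delta)}{N}}. \tag{1.6}$$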

The details of the estimation procedure are described in Section 2 below. Note that this estimation result seems to yield an optimal dependence in terms of the number of parameters: $(A, B)$ together have $n(n+p)$ parameters to learn and each measurement consists of $n$ values. Moreover, this proposition further illustrates that not all linear systems are equally easy to estimate. The matrices $G_T G_T^*$ and $F_T F_T^*$ are finite time controllability Gramians for the control and noise inputs, respectively. These are standard objects in control: each eigenvalue/vector pair of such a Gramian characterizes how much input energy is required to move the system in that particular direction of the state-space. Therefore $\lambda_{\min}\left(\sigma_u^2 G_T G_T^* + \sigma_w^2 F_T F_T^*\right)$ quantifies the least controllable, and hence most difficult to excite and estimate, mode of the system. This property is captured nicely in our bound, which indicates that for systems for which all modes are easily excitable (i.e., all modes of the system amplify the applied inputs and disturbances), the identification task becomes easier.
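To make the role of the Gramians concrete, the following sketch evaluates the right-hand sides of (1.5) and (1.6) for a given system. It is an illustration under stated assumptions: real-valued matrices (so the Hermitian transpose is a plain transpose), and the hypothetical helper names `gramian_blocks` and `prop_1_1_bounds`. Since the bounds depend on the true $(A, B)$, this is only a way to explore how they scale, not a usable error certificate.

```python
import numpy as np

def gramian_blocks(A, B, T):
    """Return G_T = [A^{T-1}B ... B] and F_T = [A^{T-1} ... I] for real A, B."""
    powers = [np.linalg.matrix_power(A, k) for k in range(T - 1, -1, -1)]
    G = np.hstack([Ak @ B for Ak in powers])
    F = np.hstack(powers)
    return G, F

def prop_1_1_bounds(A, B, T, N, sigma_u, sigma_w, delta):
    """Evaluate the right-hand sides of (1.5) and (1.6)."""
    n, p = B.shape
    if N < 8 * (n + p) + 16 * np.log(4 / delta):
        raise ValueError("N is below the sample-size requirement of the proposition")
    G, F = gramian_blocks(A, B, T)
    cov_x = sigma_u**2 * G @ G.T + sigma_w**2 * F @ F.T
    lam_min = np.linalg.eigvalsh(cov_x)[0]  # eigvalsh returns ascending eigenvalues
    rate = np.sqrt((n + 2 * p) * np.log(36 / delta) / N)
    bound_A = 16 * sigma_w / np.sqrt(lam_min) * rate
    bound_B = 16 * sigma_w / sigma_u * rate
    return bound_A, bound_B

# Hypothetical system: longer rollouts excite the state more, which shrinks the
# bound on A while leaving the bound on B unchanged.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
for T in (1, 5, 20):
    print(T, prop_1_1_bounds(A, B, T, N=2000, sigma_u=1.0, sigma_w=0.5, delta=0.05))
```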

Note that we cannot compute the operator norm error bounds (1.5) and (1.6) without knowing the true system matrices $(A, B)$. However, as we show in Section 2.3, a simple bootstrap procedure can efficiently upper bound the errors $\epsilon_A := \|A - \hat{A}\|_2$ and $\epsilon_B := \|B - \hat{B}\|_2$ from simulation.

With our estimates and error bounds in hand, we can turn to the problem of synthesizing a controller. We can assert with high probability that $A = \hat{A} + \Delta_A$ and $B = \hat{B} + \Delta_B$, for $\|\Delta_A\|_2 \leq \epsilon_A$ and $\|\Delta_B\|_2 \leq \epsilon_B$, where the size of the error terms is determined by the number of samples collected. In light of this, it is natural to pose the following robust variant of the standard LQR optimal control problem (1.3), which computes a robustly stabilizing controller that seeks to minimize the worst-case performance of the system given the (high-probability) norm bounds on the perturbations $\Delta_A$ and $\Delta_B$:

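$$\begin{array}{ll} \text{minimize} & \displaystyle \sup_{\|\Delta_A\|_2 \leq \epsilon_A,\ \|\Delta_B\|_2 \leq \epsilon_B} \ \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}\left[x_t^* Q x_t + u_{t-1}^* R u_{t-1}\right] \\ \text{subject to} & x_{t+1} = (\hat{A} + \Delta_A) x_t + (\hat{B} + \Delta_B) u_t + w_t. \end{array} \tag{1.7}$$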

Although classic methods exist for computing such controllers [20, 42, 48, 55], they typically require solving nonconvex optimization problems, and it is not readily obvious how to extract interpretable measures of controller performance as a function of the perturbation sizes $\epsilon_A$ and $\epsilon_B$. To that end, we leverage the recently developed System Level Synthesis (SLS) framework [54] to create an alternative robust synthesis procedure. Described in detail in Section 3, SLS lifts the system description into a higher dimensional space that enables efficient search for controllers. At the cost of some conservatism, we are able to guarantee robust stability of the resulting closed-loop system for all admissible perturbations and bound the performance gap between the resulting controller and the optimal LQR controller. This is summarized in the following proposition.

###### Proposition 1.2.

Let $(\hat{A}, \hat{B})$ be estimated via the independent data collection scheme used in Proposition 1.1 and $\hat{K}$ synthesized using robust SLS. Let $\hat{J}$ denote the infinite time horizon LQR cost accrued by using the controller $\hat{K}$ and $J_\star$ denote the optimal LQR cost achieved when $(A, B)$ are known. Then the relative error in the LQR cost is bounded as

$$\frac{\hat{J} - J_\star}{J_\star} \leq \mathcal{O}\left(\mathcal{C}_{\mathrm{LQR}} \sqrt{\frac{(n+p)\log(1/\delta)}{N}}\right) \tag{1.8}$$

with probability $1 - \delta$ provided $N$ is sufficiently large.

The complexity term $\mathcal{C}_{\mathrm{LQR}}$ depends on the rollout length $T$, the true dynamics, the matrices $(Q, R)$ which define the LQR cost, and the variances $\sigma_u^2$ and $\sigma_w^2$ of the control and noise inputs, respectively. The $1 - \delta$ probability comes from the probability of estimation error from Proposition 1.1. The particular form of $\mathcal{C}_{\mathrm{LQR}}$ and concrete bounds on $N$ are both provided in Section 4.
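The relative error on the left-hand side of (1.8) can be evaluated numerically for any static gain once the true system is available, for instance in simulation. The sketch below is an illustration only: it assumes a static state-feedback law $u_t = K x_t$, real matrices, and isotropic noise with covariance $\sigma_w^2 I$, and the names `lqr_cost` and `relative_error` are hypothetical. It reports an infinite cost when the closed loop is unstable.

```python
import numpy as np

def lqr_cost(A, B, Q, R, K, sigma_w=1.0):
    """Average cost of u_t = K x_t on x_{t+1} = A x_t + B u_t + w_t, w_t ~ N(0, sigma_w^2 I)."""
    L = A + B @ K  # closed-loop matrix
    if np.max(np.abs(np.linalg.eigvals(L))) >= 1.0:
        return np.inf  # unstable closed loop: the average cost diverges
    n = A.shape[0]
    M = Q + K.T @ R @ K
    # Solve the Lyapunov equation P = M + L' P L through its vectorized form.
    P = np.linalg.solve(np.eye(n * n) - np.kron(L.T, L.T), M.reshape(-1)).reshape(n, n)
    return sigma_w**2 * np.trace(P)

def relative_error(J_hat, J_star):
    """Relative suboptimality (J_hat - J_star) / J_star."""
    return (J_hat - J_star) / J_star

# Scalar illustration with A = B = Q = R = 1. The Riccati equation reduces to
# P^2 - P - 1 = 0, so the optimal cost is the golden ratio and K_star = -P / (1 + P).
one = np.array([[1.0]])
J_star = (1 + np.sqrt(5)) / 2
K_star = np.array([[-J_star / (1 + J_star)]])
K_perturbed = np.array([[-0.5]])

print("cost of K_star:", lqr_cost(one, one, one, one, K_star))
print("relative error of K = -0.5:", relative_error(lqr_cost(one, one, one, one, K_perturbed), J_star))
print("cost with no control:", lqr_cost(one, one, one, one, np.zeros((1, 1))))
```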

Though the optimization problem formulated by SLS is infinite dimensional, in Section 5 we provide two finite dimensional upper bounds on the optimization that inherit the stability guarantees of the SLS formulation. Moreover, we show via numerical experiments in Section 6 that the controllers synthesized by our optimization do indeed provide stabilizing controllers with small relative error. We further show that settings exist wherein a naïve synthesis procedure that ignores the uncertainty in the state-space parameter estimates produces a controller that performs poorly (or has unstable closed-loop behavior) relative to the controller synthesized using the SLS procedure.

### 1.2 Related Work

We first describe related work in the estimation of unknown dynamical systems and then turn to connections in the literature on robust control with uncertain models. We will end this review with a discussion of a few works from the reinforcement learning community that have attempted to address the LQR problem and related variants.

#### Estimation of unknown dynamical systems.

Estimation of unknown systems, especially linear dynamical systems, has a long history in the system identification subfield of control theory. While the text of Ljung [33] covers the classical asymptotic results, our interest is primarily in nonasymptotic results. Early results [10, 53] on nonasymptotic rates for parameter identification featured conservative bounds which are exponential in the system degree and other relevant quantities. More recently, Bento et al. [5] show that when the $A$ matrix is stable and induced by a sparse graph, then one can recover the support of $A$ from a single trajectory using $\ell_1$-penalized least squares. Furthermore, Hardt et al. [25] provide the first polynomial time guarantee for identifying general, but stable linear systems. Their guarantees, however, are in terms of predictive output performance of the model, and require an assumption on the true system that is more stringent than stability. It is not clear how their statistical risk guarantee can be used in a downstream robust synthesis procedure.

Next, we turn our attention to system identification of linear systems in the frequency domain. A comprehensive text on these methods (which differ from the aforementioned state-space methods) is the work by Chen and Gu [11]. For stable systems, Helmicki et al. [26] propose to identify a finite impulse response (FIR) approximation by directly estimating the first impulse response coefficients. This method is analyzed in a non-adversarial probabilistic setting by [22, 49], who prove that a polynomial number of samples are sufficient to recover a FIR filter which approximates the true system in both -norm and -norm. However, transfer function methods do not easily allow for optimal control with state variables, since they only model the input/output behavior of the system.

In parallel to the system identification community, identification of auto-regressive time series models is a widely studied topic in the statistics literature (see e.g. Box et al. [7] for the classical results). Goldenshluger and Zeevi [23] show that the coefficients of a stationary autoregressive model can be estimated via least squares from a single trajectory whose length is polynomial in a quantity determined by the stability radius of the process. They also prove that their rate is minimax optimal. More recently, several authors [30, 36, 39] have studied generalization bounds for non-i.i.d. data, extending the standard learning theory guarantees for independent data. At the crux of these arguments lie various mixing assumptions [58], which limit the analysis to stable dynamical systems. The general trend in this line of research is that systems with smaller mixing time (i.e. systems that are more stable) are easier to identify (i.e. take fewer samples). Our result in Proposition 1.1, however, suggests instead that identification benefits from more easily excitable systems. While this analysis holds for a different setting in which we have access to full state observations, empirical testing suggests that Proposition 1.1 reflects reality more accurately than arguments based on mixing. We leave reconciling this to future work.

#### Robust controller design.

For end-to-end guarantees, parameter estimation is only half the picture. Our procedure provides us with a family of system models described by a nominal estimate and a set of unknown but bounded model errors. It is therefore necessary to ensure that the computed controller has stability and performance guarantees for any such admissible realization. The problem of robustly stabilizing such a family of systems is one with a rich history in the controls community. When modelling errors to the nominal system are taken as arbitrary norm-bounded linear time-invariant (LTI) operators in feedback with the nominal plant, traditional small-gain theorems and robust synthesis techniques can be applied to exactly solve the problem [13, 59]. However, when the errors are known to have more structure, e.g., as in our case where perturbations to the nominal system are static and real additive errors to the state-space parameters, there are more sophisticated techniques based on structured singular values and corresponding $\mu$-synthesis techniques [14, 18, 41, 57] or integral quadratic constraints (IQCs) [37]. While theoretically appealing and much less conservative than traditional small-gain approaches, the resulting synthesis methods are both computationally intractable (although effective heuristics do exist) and difficult to interpret analytically. In particular, we know of no results in the literature that bound the degradation in performance of an uncertain system in terms of the size of the possible perturbations affecting it.

To circumvent this issue, we leverage a novel parameterization of robustly stabilizing controllers based on the SLS framework for controller synthesis [54]. We describe this framework in more detail in Section 3. Originally developed to allow for optimal and robust controller synthesis techniques to be applicable to large-scale systems, the SLS framework can be viewed as a generalization of the celebrated Youla parameterization [56], and as we show, allows for model uncertainty to be included in a transparent as well as analytically tractable way.

#### PAC learning and reinforcement learning.

Concerning end-to-end guarantees for LQR which couple estimation and control synthesis, our work is most comparable to that of Fiechter [21], who shows that the discounted LQR problem is PAC-learnable. Fiechter analyzes an identify-then-control scheme similar to the one we propose, but there are several key differences. First, our probabilistic bounds on identification are much sharper, by leveraging modern tools from high-dimensional statistics. Second, Fiechter implicitly assumes that the true closed-loop system with the estimated controller is not only stable but also contractive. While this very strong assumption is nearly impossible to verify in practice, contractive closed-loop assumptions are actually pervasive throughout the literature, as we describe below. To the best of our knowledge, our work is the first to properly lift this technical restriction. Third, and most importantly, Fiechter proposes to directly solve the discounted LQR problem with the identified model, and does not take into account any uncertainty in the controller synthesis step. This is problematic for two reasons. First, it is easy to construct an instance of a discounted LQR problem where the optimal solution does not stabilize the true system (see e.g. [43]). Therefore, even in the limit of infinite data, there is no guarantee that the closed-loop system will be stable. Second, even if the optimal solution does stabilize the underlying system, failing to take uncertainty into account can lead to situations where the synthesized controller does not. We will demonstrate this behavior in our experiments.

We are also particularly interested in the LQR problem as a baseline for more complicated problems in reinforcement learning (RL). LQR should be a relatively easy problem in RL because one can learn the dynamics from anywhere in the state space, vastly simplifying the problem of exploration. Hence, it is important to establish how well a pure exploration followed by exploitation strategy can fare on this simple baseline.

There are indeed some related efforts in RL and online learning. Abbasi-Yadkori and Szepesvari [2] propose to use the optimism in the face of uncertainty (OFU) principle for the LQR problem, by maintaining confidence ellipsoids on the true parameter, and using the controller which, in feedback, minimizes the cost objective the most among all systems in the confidence ellipsoid. Ignoring the computational intractability of this approach, their analysis reveals an exponential dependence in the system order in their regret bound, and also makes the very strong assumption that the optimal closed-loop systems are contractive for every system in the confidence ellipsoid. The regret bound is improved by Ibrahimi et al. [27] to depend linearly on the state dimension under additional sparsity constraints on the dynamics.

In response to the computational intractability of the OFU principle, researchers in RL and online learning have proposed the use of Thompson sampling [45] for exploration. Abeille and Lazaric [3] show that the regret of a Thompson sampling approach for LQR scales as $\tilde{\mathcal{O}}(T^{2/3})$, and this result was improved by Ouyang et al. [40] to $\tilde{\mathcal{O}}(\sqrt{T})$, matching that of [2], where $\tilde{\mathcal{O}}(\cdot)$ hides poly-logarithmic factors. We note, however, that both works make the same restrictive assumption that the optimal closed-loop systems are uniformly contractive over some known set.

Jiang et al. [28] propose a general exploration algorithm for contextual decision processes (CDPs) and show that CDPs with low Bellman rank are PAC-learnable; in the LQR setting, they show the Bellman rank is bounded by . While this result is appealing from an information-theoretic standpoint, the proposed algorithm is computationally intractable for continuous problems.

Finally, we note that the self-normalized martingale techniques underpinning the regret analysis of Abbasi-Yadkori and Szepesvari may be of use in extending Proposition 1.1 to the dependent-data setting [1, 52].

## 2 System Identification through Least-Squares

To estimate a coarse model of the unknown system dynamics, we turn to the simple and classical method of linear least squares. By running experiments in which the system starts at $x_0 = 0$ and the dynamics evolve with a given input, we can record the resulting state observations. The set of inputs and outputs from each such experiment will be called a rollout. For system estimation, we excite the system with Gaussian noise for $N$ rollouts, each of length $T$. The resulting dataset is $\{(x_t^{(\ell)}, u_t^{(\ell)}) : 1 \leq \ell \leq N,\ 0 \leq t \leq T\}$, where $t$ indexes the time in one rollout and $\ell$ indexes independent rollouts. Therefore, we can estimate the system dynamics by

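$$(\hat{A}, \hat{B}) \in \arg\min_{(A, B)} \sum_{\ell=1}^{N} \sum_{t=0}^{T-1} \frac{1}{2} \left\| A x_t^{(\ell)} + B u_t^{(\ell)} - x_{t+1}^{(\ell)} \right\|_2^2. \tag{2.1}$$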

For the Coarse-ID control setting, a good estimate of error is just as important as the estimate of the dynamics. Statistical theory and tools allow us to quantify the error of the least squares estimator. First, we present a theoretical analysis of the error in a simplified setting. Then, we describe a computational bootstrap procedure for error estimation from data alone.

### 2.1 Least Squares Estimation as a Random Matrix Problem

We begin by explicitly writing the form of the least squares estimator. First, fixing notation to simplify the presentation, let $\Theta := \begin{bmatrix} A & B \end{bmatrix}^* \in \mathbb{R}^{(n+p) \times n}$ and let $z_t := \begin{bmatrix} x_t \\ u_t \end{bmatrix} \in \mathbb{R}^{n+p}$. Then the system dynamics can be rewritten, for all $t \geq 0$,

$$x_{t+1}^* = z_t^* \Theta + w_t^*.$$

Then in a single rollout, we will collect

$$X := \begin{bmatrix} x_1^* \\ x_2^* \\ \vdots \\ x_T^* \end{bmatrix}, \quad Z := \begin{bmatrix} z_0^* \\ z_1^* \\ \vdots \\ z_{T-1}^* \end{bmatrix}, \quad W := \begin{bmatrix} w_0^* \\ w_1^* \\ \vdots \\ w_{T-1}^* \end{bmatrix}. \tag{2.2}$$

The system dynamics give the identity $X = Z\Theta + W$. Resetting the state of the system to $x_0 = 0$ each time, we can perform $N$ rollouts and collect $N$ datasets like (2.2). Having the ability to reset the system to a state independent of past observations will be important for the analysis in the following section, and it is also practically important for potentially unstable systems. Denote the data for each rollout as $(X^{(\ell)}, Z^{(\ell)}, W^{(\ell)})$. With slight abuse of notation, let $X_N$ be composed of vertically stacked $X^{(\ell)}$, and similarly for $Z_N$ and $W_N$. Then we have

$$X_N = Z_N \Theta + W_N.$$

The full data least squares estimator for $\Theta$ is (assuming for now invertibility of $Z_N^* Z_N$),

$$\hat{\Theta} = (Z_N^* Z_N)^{-1} Z_N^* X_N = \Theta + (Z_N^* Z_N)^{-1} Z_N^* W_N. \tag{2.3}$$

Then the estimation error is given by

$$E := \hat{\Theta} - \Theta = (Z_N^* Z_N)^{-1} Z_N^* W_N. \tag{2.4}$$

The magnitude of this error is the quantity of interest in determining confidence sets around estimates $(\hat{A}, \hat{B})$. However, since $W_N$ and $Z_N$ are not independent, this estimator is difficult to analyze using standard methods. While this type of analysis is an open problem of interest, in this paper we turn instead to a simplified estimator.
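The full-data estimator (2.3) can be sketched in a few lines. This is an illustrative simulation rather than the paper's own code: the helper names `collect_rollouts` and `least_squares_estimate`, the example system, and the use of `numpy.linalg.lstsq` in place of forming $(Z_N^* Z_N)^{-1}$ explicitly are all assumptions made for the example.

```python
import numpy as np

def collect_rollouts(A, B, N, T, sigma_u, sigma_w, rng):
    """Run N rollouts of length T from x_0 = 0 and stack the rows z_t^* and x_{t+1}^*."""
    n, p = B.shape
    Z_rows, X_rows = [], []
    for _ in range(N):
        x = np.zeros(n)
        for _ in range(T):
            u = sigma_u * rng.standard_normal(p)
            w = sigma_w * rng.standard_normal(n)
            x_next = A @ x + B @ u + w
            Z_rows.append(np.concatenate([x, u]))
            X_rows.append(x_next)
            x = x_next
    return np.array(Z_rows), np.array(X_rows)

def least_squares_estimate(Z, X, n):
    """Solve min ||Z Theta - X||_F and split Theta = [A B]^* into (A_hat, B_hat)."""
    Theta_hat, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return Theta_hat[:n].T, Theta_hat[n:].T

# Hypothetical true system, used only to generate data for the illustration.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
rng = np.random.default_rng(0)

Z, X = collect_rollouts(A, B, N=200, T=6, sigma_u=1.0, sigma_w=0.1, rng=rng)
A_hat, B_hat = least_squares_estimate(Z, X, n=2)
print("||A_hat - A||_2 =", np.linalg.norm(A_hat - A, 2))
print("||B_hat - B||_2 =", np.linalg.norm(B_hat - B, 2))
```

Solving the least squares problem directly avoids inverting $Z_N^* Z_N$, which can be poorly conditioned when the system is weakly excited.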

### 2.2 Theoretical Bounds on Least Squares Error

In this section, we work out the statistical rate for the least squares estimator which uses just the last sample of each trajectory. This estimation procedure is made precise in Algorithm 1. Our analysis ideas are analogous to those used to prove statistical rates for standard linear regression, and they leverage recent tools in nonasymptotic analysis of random matrices. The result is presented above in Proposition 1.1.

In the context of Proposition 1.1, a single data point from each $T$-step rollout is used. We emphasize that this strategy results in independent data, which can be seen by defining the estimator matrix directly. The previous estimator (2.3) is amended as follows: the matrices defined in (2.2) instead include only the final timestep of each trial, so that $X_N$ has one row per rollout, and similar modifications are made to $Z_N$ and $W_N$. The estimator (2.3) uses these modified matrices, which now contain independent rows. To see this, recall the definition of $G_T$ and $F_T$ from (1.4),

$$G_T = \begin{bmatrix} A^{T-1}B & A^{T-2}B & \cdots & B \end{bmatrix}, \quad F_T = \begin{bmatrix} A^{T-1} & A^{T-2} & \cdots & I_n \end{bmatrix}.$$

We can unroll the system dynamics and see that

$$x_T = G_T \begin{bmatrix} u_0 \\ u_1 \\ \vdots \\ u_{T-1} \end{bmatrix} + F_T \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_{T-1} \end{bmatrix}. \tag{2.5}$$

Using Gaussian excitation, $u_t \sim \mathcal{N}(0, \sigma_u^2 I_p)$, gives

$$\begin{bmatrix} x_T \\ u_T \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} \sigma_u^2 G_T G_T^* + \sigma_w^2 F_T F_T^* & 0 \\ 0 & \sigma_u^2 I_p \end{bmatrix}\right). \tag{2.6}$$

Since $F_T F_T^* \succ 0$, as long as both $\sigma_u$ and $\sigma_w$ are positive, this is a non-degenerate distribution.
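The unrolling identity (2.5) is easy to confirm numerically. The short check below is a sketch with randomly drawn, purely hypothetical system matrices: it simulates one rollout from $x_0 = 0$ and compares the final state with $G_T$ and $F_T$ applied to the stacked inputs and noise. It also confirms that the smallest eigenvalue of $F_T F_T^*$ is at least one, which follows from the identity block in $F_T$ and is what makes (2.6) non-degenerate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, T = 3, 2, 6

# Hypothetical real-valued system and one realization of inputs and noise.
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, p))
U = rng.standard_normal((T, p))  # rows are u_0, ..., u_{T-1}
W = rng.standard_normal((T, n))  # rows are w_0, ..., w_{T-1}

# Simulate x_{t+1} = A x_t + B u_t + w_t from x_0 = 0.
x = np.zeros(n)
for t in range(T):
    x = A @ x + B @ U[t] + W[t]

# Build G_T = [A^{T-1}B ... B] and F_T = [A^{T-1} ... I].
powers = [np.linalg.matrix_power(A, k) for k in range(T - 1, -1, -1)]
G_T = np.hstack([Ak @ B for Ak in powers])
F_T = np.hstack(powers)

# Row-major flattening stacks u_0, ..., u_{T-1} in the order used by (2.5).
x_unrolled = G_T @ U.reshape(-1) + F_T @ W.reshape(-1)

print("max deviation:", np.max(np.abs(x - x_unrolled)))
print("lambda_min(F_T F_T^*):", np.linalg.eigvalsh(F_T @ F_T.T)[0])
```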

Therefore, bounding the estimation error can be achieved via proving a result on the error in random design linear regression with vector valued observations. First, we present a lemma which bounds the spectral norm of the product of two independent Gaussian matrices.

###### Lemma 2.1.

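Fix a $\delta \in (0, 1)$ and $N \geq 2\log(1/\delta)$. Let $f_k \in \mathbb{R}^m$, $g_k \in \mathbb{R}^n$ be independent random vectors $f_k \sim \mathcal{N}(0, \Sigma_f)$ and $g_k \sim \mathcal{N}(0, \Sigma_g)$ for $1 \leq k \leq N$. With probability at least $1 - \delta$,

$$\left\| \sum_{k=1}^{N} f_k g_k^* \right\|_2 \leq 4 \|\Sigma_f\|_2^{1/2} \|\Sigma_g\|_2^{1/2} \sqrt{N(m+n)\log(9/\delta)}.$$

We believe this bound to be standard, and include a proof in the appendix for completeness. Lemma 2.1 shows that if $X$ is $n_1 \times N$ with i.i.d. $\mathcal{N}(0, 1)$ entries and $Y$ is $N \times n_2$ with i.i.d. $\mathcal{N}(0, 1)$ entries, and $X$ and $Y$ are independent, then with probability at least $1 - \delta$ we have

$$\|XY\|_2 \leq 4\sqrt{N(n_1 + n_2)\log(9/\delta)}.$$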

Next, we state a standard nonasymptotic bound on the minimum singular value of a standard Wishart matrix (see e.g. Corollary 5.35 of [51]).

###### Lemma 2.2.

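Let $X \in \mathbb{R}^{N \times n}$ have i.i.d. $\mathcal{N}(0, 1)$ entries. With probability at least $1 - \delta$,

$$\sqrt{\lambda_{\min}(X^* X)} \geq \sqrt{N} - \sqrt{n} - \sqrt{2\log(1/\delta)}.$$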
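As an informal sanity check on Lemma 2.2 (a Monte Carlo illustration, not a proof), one can draw Gaussian matrices and count how often the smallest singular value falls below the stated threshold. The function names and the chosen sizes are hypothetical; the observed failure fraction should not exceed $\delta$, and in practice it tends to be far smaller because the bound is conservative.

```python
import numpy as np

def min_singular_value_bound(N, n, delta):
    """Right-hand side of Lemma 2.2: sqrt(N) - sqrt(n) - sqrt(2 log(1/delta))."""
    return np.sqrt(N) - np.sqrt(n) - np.sqrt(2 * np.log(1 / delta))

def empirical_failure_rate(N, n, delta, trials, rng):
    """Fraction of draws whose smallest singular value falls below the bound."""
    bound = min_singular_value_bound(N, n, delta)
    failures = 0
    for _ in range(trials):
        X = rng.standard_normal((N, n))
        # sqrt(lambda_min(X^* X)) equals the smallest singular value of X.
        sigma_min = np.linalg.svd(X, compute_uv=False).min()
        failures += int(sigma_min < bound)
    return failures / trials

rng = np.random.default_rng(0)
print("bound:", min_singular_value_bound(200, 5, 0.05))
print("empirical failure rate:", empirical_failure_rate(200, 5, 0.05, trials=200, rng=rng))
```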

We combine the previous lemmas into a statement on the error of random design regression.

###### Lemma 2.3.

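Let $z_1, \ldots, z_N \in \mathbb{R}^n$ be i.i.d. from $\mathcal{N}(0, \Sigma)$ with $\Sigma$ invertible. Let $Z^* := \begin{bmatrix} z_1 & \cdots & z_N \end{bmatrix}$. Let $W \in \mathbb{R}^{N \times p}$ with each entry i.i.d. $\mathcal{N}(0, \sigma_w^2)$ and independent of $Z$. Let $E := (Z^* Z)^\dagger Z^* W$, and suppose that

$$N \geq 8n + 16\log(2/\delta). \tag{2.7}$$

For any fixed matrix $Q$, we have with probability at least $1 - \delta$,

$$\|QE\|_2 \leq 16\,\sigma_w \|Q\Sigma^{-1/2}\|_2 \sqrt{\frac{(n+p)\log(18/\delta)}{N}}.$$

###### Proof.

First, observe that $Z$ is equal in distribution to $X\Sigma^{1/2}$, where $X \in \mathbb{R}^{N \times n}$ has i.i.d. $\mathcal{N}(0, 1)$ entries. By Lemma 2.2, with probability at least $1 - \delta/2$,

$$\sqrt{\lambda_{\min}(X^* X)} \geq \sqrt{N} - \sqrt{n} - \sqrt{2\log(2/\delta)} \geq \sqrt{N}/2.$$

The last inequality uses (2.7) combined with the inequality $(a+b)^2 \leq 2(a^2 + b^2)$: indeed, $(\sqrt{n} + \sqrt{2\log(2/\delta)})^2 \leq 2n + 4\log(2/\delta) \leq N/4$. Furthermore, by Lemma 2.1 and (2.7), with probability at least $1 - \delta/2$,

$$\|X^* W\|_2 \leq 4\,\sigma_w \sqrt{N(n+p)\log(18/\delta)}.$$

Let $\mathcal{E}$ denote the event which is the intersection of the two previous events. By a union bound, $\mathbb{P}(\mathcal{E}) \geq 1 - \delta$. We continue the rest of the proof assuming the event $\mathcal{E}$ holds. Since $X^* X$ is invertible,

$$QE = Q(Z^* Z)^\dagger Z^* W = Q(\Sigma^{1/2} X^* X \Sigma^{1/2})^\dagger \Sigma^{1/2} X^* W = Q\Sigma^{-1/2}(X^* X)^{-1} X^* W.$$

Taking operator norms on both sides,

$$\|QE\|_2 \leq \|Q\Sigma^{-1/2}\|_2 \|(X^* X)^{-1}\|_2 \|X^* W\|_2 = \|Q\Sigma^{-1/2}\|_2 \frac{\|X^* W\|_2}{\lambda_{\min}(X^* X)}.$$

Combining the inequalities above,

$$\frac{\|X^* W\|_2}{\lambda_{\min}(X^* X)} \leq 16\,\sigma_w \sqrt{\frac{(n+p)\log(18/\delta)}{N}}.$$

The result now follows. ∎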

Using this result on random design linear regression, we are now ready to analyze the estimation errors of the identification in Algorithm 1 and provide a proof of Proposition 1.1.

###### Proof.

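Consider the least squares estimation error (2.4) with modified single-sample-per-rollout matrices. Recall that rows of the design matrix $Z_N$ are distributed as independent normals, as in (2.6). Then applying Lemma 2.3 with $Q_A = \begin{bmatrix} I_n & 0 \end{bmatrix}$ so that $Q_A E$ extracts only the estimate for $A$, we conclude that with probability at least $1 - \delta/2$,

$$\|\hat{A} - A\|_2 \leq \frac{16\,\sigma_w}{\sqrt{\lambda_{\min}\left(\sigma_u^2 G_T G_T^* + \sigma_w^2 F_T F_T^*\right)}} \sqrt{\frac{(n+2p)\log(36/\delta)}{N}}, \tag{2.8}$$

as long as $N \geq 8(n+p) + 16\log(4/\delta)$. Now applying Lemma 2.3 under the same condition on $N$ with $Q_B = \begin{bmatrix} 0 & I_p \end{bmatrix}$, we have with probability at least $1 - \delta/2$,

$$\|\hat{B} - B\|_2 \leq \frac{16\,\sigma_w}{\sigma_u} \sqrt{\frac{(n+2p)\log(36/\delta)}{N}}. \tag{2.9}$$

The result follows by application of the union bound. ∎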

There are several interesting points to make about the guarantees offered by Proposition 1.1. First, as mentioned in the introduction, there are $n(n+p)$ parameters to learn and our bound states that we need $O(n+p)$ measurements, each measurement providing $n$ values. Hence, this appears to be an optimal dependence with respect to the parameters $n$ and $p$. Second, note that intuitively, if the system amplifies the control and noise inputs in all directions of the state-space, as captured by the minimum eigenvalues of the control and disturbance Gramians $G_T G_T^*$ or $F_T F_T^*$, respectively, then the system has a larger “signal-to-noise” ratio and the system matrix $A$ is easier to estimate. On the other hand, this measure of the excitability of the system has no impact on learning $B$. Finally, unlike in Fiechter’s work [21], we do not need to assume that $G_T G_T^*$ is invertible. As long as the process noise is not degenerate, it will excite all modes of the system.

### 2.3 Estimating Model Uncertainty with the Bootstrap

In the previous sections we offered a theoretical guarantee on the statistical rate for the least squares estimation of $A$ and $B$ from independent samples. However, there are two important limitations to using such guarantees in practice to offer upper bounds on $\|A-\hat{A}\|_2$ and $\|B-\hat{B}\|_2$. First, using only one sample per system rollout is empirically less efficient than using all available data for estimation. Second, even optimal statistical analyses often do not recover constant factors that match practice. For purposes of robust control, it is important to obtain upper bounds on $\|A-\hat{A}\|_2$ and $\|B-\hat{B}\|_2$ that are not too conservative. Thus, we aim to find $\hat{\epsilon}_A$ and $\hat{\epsilon}_B$ such that $\|A-\hat{A}\|_2 \le \hat{\epsilon}_A$ and $\|B-\hat{B}\|_2 \le \hat{\epsilon}_B$ with high probability.

We propose a vanilla bootstrap method for estimating $\hat{\epsilon}_A$ and $\hat{\epsilon}_B$. Bootstrap methods have had a profound impact in both theoretical and applied statistics since their introduction [17]. These methods are used to estimate statistical quantities (e.g. confidence intervals) by sampling synthetic data from an empirical distribution determined by the available data. For the problem at hand we propose the procedure described in Algorithm 2.¹

¹ We assume that $\sigma_u$ and $\sigma_w$ are known. Otherwise they can be estimated from data.

For $\hat{\epsilon}_A$ and $\hat{\epsilon}_B$ estimated by Algorithm 2 we intuitively have

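$$\mathbb{P}\big(\|A-\hat{A}\|_2 \le \hat{\epsilon}_A\big) \approx 1-\delta \quad\text{and}\quad \mathbb{P}\big(\|B-\hat{B}\|_2 \le \hat{\epsilon}_B\big) \approx 1-\delta.$$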

There are many known asymptotic and finite sample theoretical guarantees for the bootstrap, particularly for the parametric version we use. We do not discuss these results here; for more details see texts by Van Der Vaart and Wellner [50], Shao and Tu [46], and Hall [24]. Instead, in Appendix D we show empirically the performance of the bootstrap for our estimation problem.
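Since Algorithm 2 is not reproduced in this section, the following is only a minimal sketch of a parametric bootstrap of this kind, under stated assumptions: synthetic rollouts are drawn from the estimated model with known noise scales, the least squares estimator is re-run on each synthetic data set using all transitions, and the $100(1-\delta)$th percentile of the resulting errors is reported. The function names, the example system, and the number of bootstrap trials `M` are hypothetical choices for illustration.

```python
import numpy as np

def rollouts(A, B, N, T, sigma_u, sigma_w, rng):
    """Simulate N rollouts of length T from x_0 = 0 with Gaussian inputs and noise."""
    n, p = B.shape
    X = np.zeros((N, T + 1, n))
    U = sigma_u * rng.standard_normal((N, T, p))
    W = sigma_w * rng.standard_normal((N, T, n))
    for t in range(T):
        X[:, t + 1] = X[:, t] @ A.T + U[:, t] @ B.T + W[:, t]
    return X, U

def least_squares(X, U):
    """Least squares estimate of (A, B) using every transition in the data."""
    n, p = X.shape[2], U.shape[2]
    Z = np.concatenate([X[:, :-1], U], axis=2).reshape(-1, n + p)
    Y = X[:, 1:].reshape(-1, n)
    Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return Theta[:n].T, Theta[n:].T

def bootstrap_bounds(A_hat, B_hat, N, T, sigma_u, sigma_w, delta, M, rng):
    """Percentile bootstrap estimates of the operator norm errors."""
    errs_A, errs_B = [], []
    for _ in range(M):
        # Synthetic data from the estimated model, treated as if it were the truth.
        X, U = rollouts(A_hat, B_hat, N, T, sigma_u, sigma_w, rng)
        A_tilde, B_tilde = least_squares(X, U)
        errs_A.append(np.linalg.norm(A_hat - A_tilde, 2))
        errs_B.append(np.linalg.norm(B_hat - B_tilde, 2))
    q = 100 * (1 - delta)
    return np.percentile(errs_A, q), np.percentile(errs_B, q)

# Hypothetical example system and experiment sizes.
rng = np.random.default_rng(0)
A = np.array([[0.9, 0.1], [0.0, 0.9]])
B = np.eye(2)
N, T, sigma_u, sigma_w = 20, 6, 1.0, 1.0

X, U = rollouts(A, B, N, T, sigma_u, sigma_w, rng)
A_hat, B_hat = least_squares(X, U)
eps_A_hat, eps_B_hat = bootstrap_bounds(A_hat, B_hat, N, T, sigma_u, sigma_w,
                                        delta=0.05, M=100, rng=rng)
print(eps_A_hat, eps_B_hat)
```

The sketch resamples from the fitted model rather than from the observed residuals, which is what makes it a parametric bootstrap; a residual-resampling variant would be a different design choice.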

## 3 Robust Synthesis

With estimates of the system and operator norm error bounds in hand, we now turn to control design. In this section we review the main framework of System Level Synthesis (SLS), a recently developed approach to control design that relies on a particular parameterization of signals in a control system [35, 54], highlighting the key constructions that we will use to solve the robust LQR problem. As we show in this and the following section, using the SLS framework, as opposed to traditional techniques from robust control, allows us to (a) compute robust controllers using semidefinite programming, and (b) provide sub-optimality guarantees in terms of the size of the uncertainties on our system estimates.

### 3.1 Useful Results from System Level Synthesis

Though the frequency domain is not mainstream in contemporary machine learning, it is the most convenient setting for establishing SLS results. To that end, we use bold face letters to denote transfer functions and signals in the frequency domain. For readers not familiar with control theory, we note that the basic components used herein are fairly minimal. The use of transfer functions is simply a convenient bookkeeping tool that gives us compact ways of writing out the semi-infinite Toeplitz matrices which encode the evolution of a dynamical system over time. Indeed, the relevant connections for LQR are illustrated in Appendix B.

We start our discussion by introducing notation that is common in the controls literature. For a thorough introduction to the functional analysis commonly used in control theory, see Chapters 2 and 3 of [59]. Let $\mathbb{T}$ (resp. $\mathbb{D}$) denote the unit circle (resp. open unit disk) in the complex plane. The restriction of the Hardy spaces $\mathcal{H}_\infty(\mathbb{T})$ and $\mathcal{H}_2(\mathbb{T})$ to matrix-valued real-rational functions that are analytic on the complement of $\mathbb{D}$ will be referred to as $\mathcal{RH}_\infty$ and $\mathcal{RH}_2$, respectively. In controls parlance, this corresponds to (discrete-time) stable matrix-valued transfer functions. For these two function spaces, the $\mathcal{H}_\infty$ and $\mathcal{H}_2$ norms simplify to

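$$\|\mathbf{G}\|_{\mathcal{H}_\infty} = \sup_{z\in\mathbb{T}}\|\mathbf{G}(z)\|_2, \qquad \|\mathbf{G}\|_{\mathcal{H}_2} = \sqrt{\frac{1}{2\pi}\int_{\mathbb{T}}\|\mathbf{G}(z)\|_F^2\,dz}. \tag{3.1}$$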

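Finally, the notation $\frac{1}{z}\mathcal{RH}_\infty$ refers to the set of transfer functions $\mathbf{G}$ such that $z\mathbf{G} \in \mathcal{RH}_\infty$. Equivalently, $\mathbf{G} \in \frac{1}{z}\mathcal{RH}_\infty$ if $\mathbf{G} \in \mathcal{RH}_\infty$ and $\mathbf{G}$ is strictly proper.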
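To make the two norms in (3.1) concrete, the sketch below approximates them for a stable state-space transfer function $\mathbf{G}(z) = C(zI-A)^{-1}B$ by sampling the unit circle on a uniform grid. This is an illustrative assumption rather than a method from the paper: gridding only yields a lower bound on the $\mathcal{H}_\infty$ norm, and the function names are hypothetical. The scalar example $\mathbf{G}(z) = 1/(z-a)$ with $a = 0.5$ has closed-form values $1/(1-a)$ and $1/\sqrt{1-a^2}$, which serve as a sanity check.

```python
import numpy as np

def freq_response(A, B, C, z):
    """Evaluate G(z) = C (zI - A)^{-1} B at a complex point z."""
    n = A.shape[0]
    return C @ np.linalg.solve(z * np.eye(n) - A, B)

def hinf_h2_norms(A, B, C, num_points=2048):
    """Grid approximation of the H-infinity and H2 norms in (3.1)."""
    thetas = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    spectral, frob_sq = [], []
    for theta in thetas:
        G = freq_response(A, B, C, np.exp(1j * theta))
        spectral.append(np.linalg.norm(G, 2))          # largest singular value
        frob_sq.append(np.linalg.norm(G, 'fro') ** 2)  # squared Frobenius norm
    # The maximum over the grid lower-bounds the supremum; the mean over a
    # uniform grid approximates (1 / 2 pi) times the integral over the circle.
    return max(spectral), np.sqrt(np.mean(frob_sq))

# Scalar example G(z) = 1 / (z - 0.5).
a = 0.5
A = np.array([[a]])
B = np.array([[1.0]])
C = np.array([[1.0]])
hinf, h2 = hinf_h2_norms(A, B, C)
print(hinf, 1.0 / (1.0 - a))
print(h2, 1.0 / np.sqrt(1.0 - a ** 2))
```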

The most important transfer functions for the LQR problem are the map from the state sequence to the control actions (this is the control policy) and the map from the noise disturbance to the states and controls. Let $\mathbf{K}$ denote the transfer function from state to control action. Then we have the equation $\mathbf{u} = \mathbf{K}\mathbf{x}$. Here we assume that $\mathbf{K}$ can be an arbitrary transfer function, not necessarily just a static feedback law. Then the closed-loop transfer matrices from the process noise $\mathbf{w}$ to the state $\mathbf{x}$ and control action $\mathbf{u}$ can be checked to satisfy

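$$\begin{bmatrix}\mathbf{x}\\ \mathbf{u}\end{bmatrix} = \begin{bmatrix}(zI-A-B\mathbf{K})^{-1}\\ \mathbf{K}(zI-A-B\mathbf{K})^{-1}\end{bmatrix}\mathbf{w}. \tag{3.2}$$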

We then have the following theorem parameterizing the set of stable closed-loop transfer matrices, as described in equation (3.2), that are achievable by a given stabilizing controller $\mathbf{K}$.

###### Theorem 3.1 (State-Feedback Parameterization [54]).

The following are true:

• The affine subspace defined by

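$$\begin{bmatrix} zI-A & -B\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix} = I, \qquad \boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u \in \frac{1}{z}\mathcal{RH}_\infty \tag{3.3}$$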

parameterizes all system responses (3.2) from $\mathbf{w}$ to $(\mathbf{x}, \mathbf{u})$ achievable by an internally stabilizing state-feedback controller $\mathbf{K}$.

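• For any transfer matrices $\{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u\}$ satisfying (3.3), the controller $\mathbf{K} = \boldsymbol{\Phi}_u\boldsymbol{\Phi}_x^{-1}$ is internally stabilizing and achieves the desired system response (3.2).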

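Note that in particular, $\{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u\} = \{(zI-A-B\mathbf{K})^{-1}, \mathbf{K}(zI-A-B\mathbf{K})^{-1}\}$ as in (3.2) are elements of the affine space defined by (3.3) whenever $\mathbf{K}$ is a causal stabilizing controller.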
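The algebra behind Theorem 3.1 can be checked numerically for a static gain. The sketch below, which uses a hypothetical system and gain chosen only for illustration, forms the responses in (3.2) at several points on the unit circle, verifies the affine constraint (3.3), and recovers the controller as $\boldsymbol{\Phi}_u\boldsymbol{\Phi}_x^{-1}$. It checks only the pointwise identities and not membership in $\frac{1}{z}\mathcal{RH}_\infty$, which for a static gain follows from the spectral radius of $A+BK$ being less than one.

```python
import numpy as np

# Hypothetical open-loop unstable system with a stabilizing static gain.
A = np.array([[1.01, 0.01],
              [0.00, 1.01]])
B = np.eye(2)
K = -0.6 * np.eye(2)
n = A.shape[0]
I = np.eye(n)

# Closed-loop stability of A + BK.
spectral_radius = max(abs(np.linalg.eigvals(A + B @ K)))

max_residual = 0.0  # deviation from the affine constraint (3.3)
max_K_error = 0.0   # deviation of Phi_u Phi_x^{-1} from K
for theta in np.linspace(0.0, 2.0 * np.pi, 16, endpoint=False):
    z = np.exp(1j * theta)
    Phi_x = np.linalg.inv(z * I - A - B @ K)  # noise-to-state response in (3.2)
    Phi_u = K @ Phi_x                         # noise-to-input response in (3.2)
    constraint = np.hstack([z * I - A, -B]) @ np.vstack([Phi_x, Phi_u])
    K_recovered = Phi_u @ np.linalg.inv(Phi_x)
    max_residual = max(max_residual, np.linalg.norm(constraint - I))
    max_K_error = max(max_K_error, np.linalg.norm(K_recovered - K))

print(spectral_radius, max_residual, max_K_error)
```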

It follows from Theorem 3.1 and the standard equivalence between infinite horizon LQR and $\mathcal{H}_2$ optimal control that, for a disturbance process distributed as $w_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_w^2 I)$, the standard LQR problem (1.3) can be equivalently written as

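$$\min_{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u}\ \sigma_w^2\left\|\begin{bmatrix}Q^{1/2} & 0\\ 0 & R^{1/2}\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_2}^2 \quad \text{s.t. equation (3.3)}. \tag{3.4}$$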

We provide a full derivation of this equivalence in Appendix B. Going forward, we drop the $\sigma_w^2$ multiplier in the objective function as it affects neither the optimal controller nor the sub-optimality guarantees that we compute in Section 4.

We will also make extensive use of a robust variant of Theorem 3.1.

###### Theorem 3.2 (Robust Stability [35]).

Suppose that the transfer matrices $\{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u\} \in \frac{1}{z}\mathcal{RH}_\infty$ satisfy

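$$\begin{bmatrix} zI-A & -B\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix} = I + \boldsymbol{\Delta}. \tag{3.5}$$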

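Then the controller $\mathbf{K} = \boldsymbol{\Phi}_u\boldsymbol{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$ if and only if $(I+\boldsymbol{\Delta})^{-1} \in \mathcal{RH}_\infty$. Furthermore, the resulting system response is given by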

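$$\begin{bmatrix}\mathbf{x}\\ \mathbf{u}\end{bmatrix} = \begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}(I+\boldsymbol{\Delta})^{-1}\mathbf{w}. \tag{3.6}$$

###### Corollary 3.3.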

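Under the assumptions of Theorem 3.2, if $\|\boldsymbol{\Delta}\| < 1$ for any induced norm $\|\cdot\|$, then the controller $\mathbf{K} = \boldsymbol{\Phi}_u\boldsymbol{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$.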

###### Proof.

Follows immediately from the small gain theorem, see for example Section 9.2 in [59]. ∎

### 3.2 Robust LQR Synthesis

We return to the problem setting where estimates $(\hat{A}, \hat{B})$ of a true system $(A, B)$ satisfy

$$\|\Delta_A\|_2 \le \epsilon_A, \qquad \|\Delta_B\|_2 \le \epsilon_B$$

where $\Delta_A := \hat{A} - A$ and $\Delta_B := \hat{B} - B$.

We begin with a simple sufficient condition under which any controller $\mathbf{K}$ that stabilizes $(\hat{A}, \hat{B})$ also stabilizes the true system $(A, B)$. To state the lemma, we introduce one additional piece of notation. For a matrix $M$, we let $\mathfrak{R}_M$ denote the resolvent

$$\mathfrak{R}_M := (zI-M)^{-1}. \tag{3.7}$$

We now can state our robustness lemma.

###### Lemma 3.4.

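Let the controller $\mathbf{K}$ stabilize $(\hat{A}, \hat{B})$ and $(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)$ be its corresponding system response (3.2) on system $(\hat{A}, \hat{B})$. Then if $\mathbf{K}$ stabilizes $(A, B)$, it achieves the following LQR cost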

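$$J(A, B, \mathbf{K}) = \left\|\begin{bmatrix}Q^{1/2} & 0\\ 0 & R^{1/2}\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}\left(I + \begin{bmatrix}\Delta_A & \Delta_B\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}\right)^{-1}\right\|_{\mathcal{H}_2}. \tag{3.8}$$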

Furthermore, letting

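$$\hat{\boldsymbol{\Delta}} := \begin{bmatrix}\Delta_A & \Delta_B\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}, \tag{3.9}$$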

a sufficient condition for $\mathbf{K}$ to stabilize $(A, B)$ is that $\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} < 1$.

###### Proof.

Follows immediately from Theorems 3.1, 3.2 and Corollary 3.3 by noting that for system responses satisfying

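$$\begin{bmatrix} zI-\hat{A} & -\hat{B}\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix} = I,$$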

it holds that

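$$\begin{bmatrix} zI-A & -B\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix} = I + \hat{\boldsymbol{\Delta}}$$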

for $\hat{\boldsymbol{\Delta}}$ as defined in equation (3.9). ∎
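The identity used in this proof is purely algebraic and can be checked at a single frequency. The sketch below assumes the conventions $\Delta_A = \hat{A} - A$, $\Delta_B = \hat{B} - B$ and $\hat{\boldsymbol{\Delta}} = \Delta_A\boldsymbol{\Phi}_x + \Delta_B\boldsymbol{\Phi}_u$, uses randomly generated matrices and a static gain as hypothetical stand-ins, and also confirms that the response on the true system is $\boldsymbol{\Phi}_x(I+\hat{\boldsymbol{\Delta}})^{-1}$, as in (3.6). Stability is not examined here, only the pointwise identities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 3, 2

# Hypothetical true system, perturbed estimate, and static gain.
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, p))
A_hat = A + 0.05 * rng.standard_normal((n, n))
B_hat = B + 0.05 * rng.standard_normal((n, p))
K = 0.1 * rng.standard_normal((p, n))

# Assumed sign convention for the estimation errors.
Delta_A = A_hat - A
Delta_B = B_hat - B

z = np.exp(1j * 0.7)  # an arbitrary point on the unit circle
I = np.eye(n)

# System response of K on the estimated system (A_hat, B_hat).
Phi_x = np.linalg.inv(z * I - A_hat - B_hat @ K)
Phi_u = K @ Phi_x
Delta_hat = Delta_A @ Phi_x + Delta_B @ Phi_u

# Left-hand side of the identity on the true system (A, B).
lhs = np.hstack([z * I - A, -B]) @ np.vstack([Phi_x, Phi_u])

# Response of the same K on the true system.
true_Phi_x = np.linalg.inv(z * I - A - B @ K)
via_delta = Phi_x @ np.linalg.inv(I + Delta_hat)

print(np.linalg.norm(lhs - (I + Delta_hat)))
print(np.linalg.norm(true_Phi_x - via_delta))
```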

We can therefore recast the robust LQR problem (1.7) in the following equivalent form

 (3.10)

The resulting robust control problem is one subject to real-parametric uncertainty, a class of problems known to be computationally intractable [8]. Although effective computational heuristics (e.g., DK iteration [59]) exist, the performance of the resulting controller on the true system is difficult to characterize analytically in terms of the size of the perturbations.

To circumvent this issue, we take a slightly conservative approach and find an upper bound on the cost $J(A, B, \mathbf{K})$ that is independent of the uncertainties $\Delta_A$ and $\Delta_B$. First, note that if $\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} < 1$, we can write

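$$J(A, B, \mathbf{K}) \le \|(I+\hat{\boldsymbol{\Delta}})^{-1}\|_{\mathcal{H}_\infty}\, J(\hat{A}, \hat{B}, \mathbf{K}) \le \frac{1}{1-\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty}}\, J(\hat{A}, \hat{B}, \mathbf{K}). \tag{3.11}$$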

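Because $J(\hat{A}, \hat{B}, \mathbf{K})$ captures the performance of the controller on the nominal system $(\hat{A}, \hat{B})$, it is not subject to any uncertainty. It therefore remains to compute a tractable bound for $\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty}$, which we do using the following fact.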

###### Proposition 3.5.

For any $\alpha \in (0, 1)$ and $\hat{\boldsymbol{\Delta}}$ as defined in (3.9)

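$$\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} \le \left\|\begin{bmatrix}\frac{\epsilon_A}{\sqrt{\alpha}}\boldsymbol{\Phi}_x\\ \frac{\epsilon_B}{\sqrt{1-\alpha}}\boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_\infty} =: H_\alpha(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u). \tag{3.12}$$

###### Proof.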

Note that for any block matrix of the form $\begin{bmatrix}M_1 & M_2\end{bmatrix}$, we have

$$\left\|\begin{bmatrix}M_1 & M_2\end{bmatrix}\right\|_2 \le \left(\|M_1\|_2^2 + \|M_2\|_2^2\right)^{1/2}. \tag{3.13}$$

To verify this assertion, note that

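$$\left\|\begin{bmatrix}M_1 & M_2\end{bmatrix}\right\|_2^2 = \lambda_{\max}(M_1M_1^* + M_2M_2^*) \le \lambda_{\max}(M_1M_1^*) + \lambda_{\max}(M_2M_2^*) = \|M_1\|_2^2 + \|M_2\|_2^2.$$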

With (3.13) in hand, we have

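$$\begin{aligned}
\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} &= \left\|\begin{bmatrix}\frac{\sqrt{\alpha}}{\epsilon_A}\Delta_A & \frac{\sqrt{1-\alpha}}{\epsilon_B}\Delta_B\end{bmatrix}\begin{bmatrix}\frac{\epsilon_A}{\sqrt{\alpha}}\boldsymbol{\Phi}_x\\ \frac{\epsilon_B}{\sqrt{1-\alpha}}\boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_\infty}\\
&\le \left\|\begin{bmatrix}\frac{\sqrt{\alpha}}{\epsilon_A}\Delta_A & \frac{\sqrt{1-\alpha}}{\epsilon_B}\Delta_B\end{bmatrix}\right\|_2\left\|\begin{bmatrix}\frac{\epsilon_A}{\sqrt{\alpha}}\boldsymbol{\Phi}_x\\ \frac{\epsilon_B}{\sqrt{1-\alpha}}\boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_\infty} \le \left\|\begin{bmatrix}\frac{\epsilon_A}{\sqrt{\alpha}}\boldsymbol{\Phi}_x\\ \frac{\epsilon_B}{\sqrt{1-\alpha}}\boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_\infty},
\end{aligned}$$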

completing the proof. ∎
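The argument in this proof holds frequency by frequency, so it can be sanity-checked with random matrices standing in for the values of $\boldsymbol{\Phi}_x(z)$ and $\boldsymbol{\Phi}_u(z)$ at a fixed $z$. The sketch below is an illustrative assumption rather than part of the paper: it draws perturbations with $\|\Delta_A\|_2 = \epsilon_A$ and $\|\Delta_B\|_2 = \epsilon_B$ (the extreme case allowed by the constraints) and compares $\|\Delta_A\boldsymbol{\Phi}_x + \Delta_B\boldsymbol{\Phi}_u\|_2$ with the norm of the scaled stacked matrix in (3.12). The helper names and the chosen values of $\epsilon_A$, $\epsilon_B$ and $\alpha$ are hypothetical.

```python
import numpy as np

def scaled_to_norm(M, bound):
    """Rescale M so that its spectral norm equals `bound`."""
    return M * (bound / np.linalg.norm(M, 2))

def pointwise_bound(n, p, eps_A, eps_B, alpha, rng):
    """Return both sides of the pointwise version of (3.12) for random data."""
    Delta_A = scaled_to_norm(rng.standard_normal((n, n)), eps_A)
    Delta_B = scaled_to_norm(rng.standard_normal((n, p)), eps_B)
    # Random complex matrices standing in for Phi_x(z) and Phi_u(z).
    Phi_x = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    Phi_u = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
    lhs = np.linalg.norm(Delta_A @ Phi_x + Delta_B @ Phi_u, 2)
    stacked = np.vstack([eps_A / np.sqrt(alpha) * Phi_x,
                         eps_B / np.sqrt(1.0 - alpha) * Phi_u])
    rhs = np.linalg.norm(stacked, 2)
    return lhs, rhs

rng = np.random.default_rng(0)
results = [pointwise_bound(n=3, p=2, eps_A=0.1, eps_B=0.2, alpha=a, rng=rng)
           for a in (0.1, 0.3, 0.5, 0.7, 0.9) for _ in range(20)]
print(all(lhs <= rhs + 1e-9 for lhs, rhs in results))
```

The free parameter $\alpha$ trades off the weight placed on the two error terms, which is why the bound is stated for every $\alpha$ in $(0, 1)$ rather than a single value.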

The following corollary is then immediate.

###### Corollary 3.6.

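Let the controller $\mathbf{K}$ and resulting system response $(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)$ be as defined in Lemma 3.4. Then if $H_\alpha(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u) < 1$, the controller $\mathbf{K} = \boldsymbol{\Phi}_u\boldsymbol{\Phi}_x^{-1}$ stabilizes the true system $(A, B)$.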

Applying Proposition 3.5 in conjunction with the bound (3.11), we arrive at the following upper bound to the cost function of the robust LQR problem (1.7), which is independent of the perturbations $(\Delta_A, \Delta_B)$:

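$$\sup_{\substack{\|\Delta_A\|_2 \le \epsilon_A\\ \|\Delta_B\|_2 \le \epsilon_B}} J(A, B, \mathbf{K}) \le \left\|\begin{bmatrix}Q^{1/2} & 0\\ 0 & R^{1/2}\end{bmatrix}\begin{bmatrix}\boldsymbol{\Phi}_x\\ \boldsymbol{\Phi}_u\end{bmatrix}\right\|_{\mathcal{H}_2}\frac{1}{1-H_\alpha(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)} = \frac{J(\hat{A}, \hat{B}, \mathbf{K})}{1-H_\alpha(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)}. \tag{3.14}$$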

The upper bound is only valid when