# Regret Bounds for Robust Adaptive Control of the Linear Quadratic Regulator

We consider adaptive control of the Linear Quadratic Regulator (LQR), where an unknown linear system is controlled subject to quadratic costs. Leveraging recent developments in the estimation of linear systems and in robust controller synthesis, we present the first provably polynomial time algorithm that provides high probability guarantees of sub-linear regret on this problem. We further study the interplay between regret minimization and parameter estimation by proving a lower bound on the expected regret in terms of the exploration schedule used by any algorithm. Finally, we conduct a numerical study comparing our robust adaptive algorithm to other methods from the adaptive LQR literature, and demonstrate the flexibility of our proposed method by extending it to a demand forecasting problem subject to state constraints.

## 1 Introduction

The problem of adaptively controlling an unknown dynamical system has a rich history, with classical asymptotic results of convergence and stability dating back decades [15, 16]. Of late, there has been a renewed interest in the study of a particular instance of such problems, namely the adaptive Linear Quadratic Regulator (LQR), with an emphasis on non-asymptotic guarantees of stability and performance. Initiated by Abbasi-Yadkori and Szepesvári [2], there have since been several works analyzing the regret suffered by various adaptive algorithms on LQR; here the regret incurred by an algorithm is thought of as a measure of deviations in performance from optimality over time. These results can be broadly divided into two categories: those providing high-probability guarantees for a single execution of the algorithm [2, 5, 11, 14], and those providing bounds on the expected Bayesian regret incurred over a family of possible systems [3, 21]. As we discuss in more detail below, these methods all suffer from one or several of the following limitations: restrictive and unverifiable assumptions, limited applicability, and computationally intractable subroutines. In this paper, we provide, to the best of our knowledge, the first polynomial-time algorithm for the adaptive LQR problem that provides high probability guarantees of sub-linear regret, and that does not require unverifiable or unrealistic assumptions.

##### Related Work.

There is a rich body of work on the estimation of linear systems as well as on the robust and adaptive control of unknown systems. We target our discussion to works on non-asymptotic guarantees for the LQR control of an unknown system, broadly divided into three categories.

Offline estimation and control synthesis: In a non-adaptive setting, i.e., when system identification can be done offline prior to controller synthesis and implementation, the first work to provide end-to-end guarantees for the LQR optimal control problem is that of Fiechter [13], who shows that the discounted LQR problem is PAC-learnable. Dean et al. [8] improve on this result, and provide the first end-to-end sample complexity guarantees for the infinite horizon average cost LQR problem.

Optimism in the Face of Uncertainty (OFU): Abbasi-Yadkori and Szepesvári [2], Faradonbeh et al. [11], and Ibrahimi et al. [14] employ the Optimism in the Face of Uncertainty (OFU) principle [7], which optimistically selects model parameters from a confidence set by choosing those that lead to the best closed-loop (infinite horizon) control performance, and then plays the corresponding optimal controller, repeating this process online as the confidence set shrinks. While OFU in the LQR setting has been shown to achieve optimal regret $\tilde{O}(\sqrt{T})$, its implementation requires solving a non-convex optimization problem to precision $\tilde{O}(T^{-1/2})$, for which no provably efficient implementation exists.

Thompson Sampling (TS): To circumvent the computational roadblock of OFU, recent works replace the intractable OFU subroutine with a random draw from the model uncertainty set, resulting in Thompson Sampling (TS) based policies [3, 5, 21]. Abeille and Lazaric [5] show that such a method achieves $\tilde{O}(T^{2/3})$ regret with high probability for scalar systems. However, their proof does not extend to the non-scalar setting. Abbasi-Yadkori and Szepesvári [3] and Ouyang et al. [21] consider expected regret in a Bayesian setting, and provide TS methods which achieve $\tilde{O}(\sqrt{T})$ regret. Although these results are not directly comparable to ours, we remark on the computational challenges of these algorithms. Whereas the proof of Abbasi-Yadkori and Szepesvári [3] was shown to be incorrect [20], Ouyang et al. [21] make the restrictive assumption that there exists a (known) initial compact set $\Theta$ describing the uncertainty in the system parameters, such that for any system $\theta_1 \in \Theta$, the optimal controller for $\theta_1$ is stabilizing when applied to any other system $\theta_2 \in \Theta$. No means of constructing such a set are provided, and there is no known tractable algorithm to verify if a given set satisfies this property. Also, it is implicitly assumed that projecting onto this set can be done efficiently.

##### Contributions.

To develop the first polynomial-time algorithm that provides high probability guarantees of sub-linear regret, we leverage recent results from the estimation of linear systems [23], robust controller synthesis [18, 25], and coarse-ID control [8]. We show that our robust adaptive control algorithm: (i) guarantees stability and near-optimal performance at all times; (ii) achieves a regret up to time $T$ bounded by $\tilde{O}(T^{2/3})$; and (iii) is based on finite-dimensional semidefinite programs of size logarithmic in $T$.

Furthermore, our method estimates the system parameters at rate $\tilde{O}(T^{-1/3})$ in operator norm. Although system parameter identification is not necessary for optimal control performance, an accurate system model is often desirable in practice. Motivated by this, we study the interplay between regret minimization and parameter estimation, and identify fundamental limits connecting the two. We show that the expected regret of our algorithm is lower bounded by $\tilde{\Omega}(T^{2/3})$, proving that our analysis is sharp up to logarithmic factors. Moreover, our lower bound suggests that the estimation rate achievable by any algorithm with $O(\sqrt{T})$ regret is $\Omega(T^{-1/4})$.

Finally, we conduct a numerical study of the adaptive LQR problem, in which we implement our algorithm, and compare its performance to heuristic implementations of OFU and TS based methods. We show on several examples that the regret incurred by our algorithm is comparable to that of the OFU and TS based methods. Furthermore, the infinite horizon cost achieved by our algorithm at any given time on the true system is consistently lower than that attained by OFU and TS based algorithms. Finally, we use a demand forecasting example to show how our algorithm naturally generalizes to incorporate environmental uncertainty and safety constraints.

## 2 Problem Statement and Preliminaries

In this work we consider adaptive control of the following discrete-time linear system

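$$x_{k+1} = A_\star x_k + B_\star u_k + w_k, \tag{2.1}$$

where $x_k \in \mathbb{R}^n$ is the state, $u_k \in \mathbb{R}^p$ is the control input, and $w_k \sim \mathcal{N}(0, \sigma_w^2 I)$ is the process noise. We assume that the state variables are observed exactly and, for simplicity, that $x_0 = 0$. We consider the Linear Quadratic Regulator optimal control problem, given by cost matrices $Q \succeq 0$ and $R \succ 0$,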

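$$J_\star = \min_{u} \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{k=1}^{T} x_k^\top Q x_k + u_k^\top R u_k \right] \quad \text{s.t. dynamics (2.1)},$$

where the minimum is taken over measurable functions $u = \{u_k(\cdot)\}_{k \ge 1}$, with each $u_k$ adapted to the history $x_k, x_{k-1}, \ldots, x_1$, and possible additional randomness independent of future states. Given knowledge of $(A_\star, B_\star)$, the optimal policy is a static state-feedback law $u_k = K_\star x_k$, where $K_\star$ is derived from the solution to a discrete algebraic Riccati equation.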

We are interested in algorithms which operate without knowledge of the true system transition matrices $(A_\star, B_\star)$. We measure the performance of such algorithms via their regret, defined as

$$\mathrm{Regret}(T) := \sum_{k=1}^{T} \left( x_k^\top Q x_k + u_k^\top R u_k - J_\star \right).$$

The regret of any algorithm is lower-bounded by $\Omega(\sqrt{T})$, a bound matched by OFU up to logarithmic factors [11]. However, after each epoch, OFU requires optimizing a non-convex objective to $1/\sqrt{T}$ precision. Instead, our method uses a subroutine based on convex optimization and robust control.
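To make the regret definition concrete, the following minimal sketch evaluates it for a hypothetical scalar system (the values `a`, `b`, `q`, `r`, `sigma_w` are illustrative assumptions, loosely modeled on the experiment in Section 4, and are not taken from the analysis). It solves the scalar Riccati equation by fixed-point iteration, forms the optimal gain, and accumulates the regret of that optimal controller along one simulated trajectory; for a scalar system the optimal average cost is $J_\star = \sigma_w^2 P$.

```python
import random


def scalar_dare(a, b, q, r, tol=1e-12, max_iter=10_000):
    """Fixed-point iteration on the scalar discrete algebraic Riccati equation."""
    p = q
    for _ in range(max_iter):
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p


def simulate_regret(a, b, q, r, gain, j_star, sigma_w, horizon, seed=0):
    """Sum over k = 1..horizon of (q x_k^2 + r u_k^2 - J_star), starting from x_0 = 0."""
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(horizon):
        u_prev = gain * x                                  # u_{k-1} = K x_{k-1}
        x = a * x + b * u_prev + rng.gauss(0.0, sigma_w)   # x_k
        u = gain * x                                       # u_k = K x_k
        total += q * x * x + r * u * u - j_star
    return total


# Hypothetical scalar problem data (assumptions for illustration only).
a, b, q, r, sigma_w = 1.01, 1.0, 10.0, 1.0, 1.0

p = scalar_dare(a, b, q, r)
k_star = -(a * b * p) / (r + b * b * p)   # optimal static feedback gain
j_star = sigma_w ** 2 * p                 # optimal average cost
regret_T = simulate_regret(a, b, q, r, k_star, j_star, sigma_w, horizon=20_000)
```

Even the optimal controller has nonzero regret on a single trajectory, since the realized cost fluctuates around $J_\star$; an adaptive algorithm pays an additional price for not knowing the system.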

### 2.1 Preliminaries: System Level Synthesis

We briefly describe the necessary background on robust control and System Level Synthesis [25] (SLS). These tools were recently used by Dean et al. [8] to provide non-asymptotic bounds for LQR in the offline “estimate-and-then-control” setting. In Appendix A, we expand on these preliminaries.

Consider the dynamics (2.1), and fix a static state-feedback control policy $K$, i.e., let $u_k = K x_k$. Then, the closed loop map from the disturbance process $\{w_0, w_1, \ldots\}$ to the state $x_k$ and control input $u_k$ at time $k$ is given by

$$x_k = \sum_{t=1}^{k} (A_\star + B_\star K)^{k-t} w_{t-1}, \qquad u_k = \sum_{t=1}^{k} K (A_\star + B_\star K)^{k-t} w_{t-1}. \tag{2.2}$$

Letting $\Phi_x(k) := (A_\star + B_\star K)^{k-1}$ and $\Phi_u(k) := K (A_\star + B_\star K)^{k-1}$, we can rewrite Eq. (2.2) as

$$\begin{bmatrix} x_k \\ u_k \end{bmatrix} = \sum_{t=1}^{k} \begin{bmatrix} \Phi_x(k-t+1) \\ \Phi_u(k-t+1) \end{bmatrix} w_{t-1}, \tag{2.3}$$

where $\{\Phi_x(k), \Phi_u(k)\}$ are called the closed loop system response elements induced by the controller $K$. The SLS framework shows that for any elements $\{\Phi_x(k), \Phi_u(k)\}$ constrained to obey

$$\Phi_x(k+1) = A_\star \Phi_x(k) + B_\star \Phi_u(k), \qquad \Phi_x(1) = I, \qquad \forall k \ge 1, \tag{2.4}$$

there exists some controller that achieves the desired system responses (2.3). Theorem A.1 formalizes this observation: the SLS framework therefore allows for any optimal control problem over linear systems to be cast as an optimization problem over elements $\{\Phi_x(k), \Phi_u(k)\}$, constrained to satisfy the affine equations (2.4). Comparing equations (2.2) and (2.3), we see that the former is non-convex in the controller $K$, whereas the latter is affine in the elements $\{\Phi_x(k), \Phi_u(k)\}$, enabling solutions to previously difficult optimal control problems.
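As a sanity check on the system response viewpoint, the sketch below builds the response elements $\Phi_x(k) = (A + BK)^{k-1}$ and $\Phi_u(k) = K\Phi_x(k)$ for a small hypothetical system (the matrices `A`, `B`, `K` are assumptions chosen only so that the closed loop is stable), verifies the affine recursion $\Phi_x(k+1) = A\Phi_x(k) + B\Phi_u(k)$, and confirms that convolving the response with the noise reproduces a simulated state.

```python
import random


def matmul(X, Y):
    return [[sum(X[i][l] * Y[l][j] for l in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def matvec(X, v):
    return [sum(xi * vi for xi, vi in zip(row, v)) for row in X]


def vecadd(u, v):
    return [a + b for a, b in zip(u, v)]


def max_abs_diff(X, Y):
    return max(abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry))


# Hypothetical 2x2 system and stabilizing static gain (illustrative assumptions).
A = [[1.01, 0.01], [0.01, 1.01]]
B = [[1.0, 0.0], [0.0, 1.0]]
K = [[-0.5, 0.0], [0.0, -0.5]]
N = 30

A_cl = matadd(A, matmul(B, K))
phi_x = [[[1.0, 0.0], [0.0, 1.0]]]        # phi_x[j] stores Phi_x(j + 1); Phi_x(1) = I
for _ in range(N - 1):
    phi_x.append(matmul(A_cl, phi_x[-1]))
phi_u = [matmul(K, P) for P in phi_x]     # Phi_u(k) = K Phi_x(k)

# Residual of the affine constraint Phi_x(k+1) = A Phi_x(k) + B Phi_u(k).
constraint_residual = max(
    max_abs_diff(phi_x[k + 1], matadd(matmul(A, phi_x[k]), matmul(B, phi_u[k])))
    for k in range(N - 1)
)

# Simulate x_N from x_0 = 0 under u_k = K x_k, then rebuild it from the response.
rng = random.Random(0)
w = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(N)]
x = [0.0, 0.0]
for k in range(N):
    u = matvec(K, x)
    x = vecadd(vecadd(matvec(A, x), matvec(B, u)), w[k])

x_from_response = [0.0, 0.0]
for t in range(1, N + 1):
    # Term Phi_x(N - t + 1) w_{t-1} of the convolution sum.
    x_from_response = vecadd(x_from_response, matvec(phi_x[N - t], w[t - 1]))

response_error = max(abs(p - q) for p, q in zip(x, x_from_response))
```

The point of the check is that the state depends linearly on the response elements, whereas it depends on `K` through matrix powers.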

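As we work with infinite horizon problems, it is notationally more convenient to work with transfer function representations of the above objects, which can be obtained by taking a $z$-transform of their time-domain representations. The frequency domain variable $z$ can be informally thought of as the time-shift operator, i.e., $z\{x_k, x_{k+1}, \ldots\} = \{x_{k+1}, x_{k+2}, \ldots\}$, allowing for a compact representation of LTI dynamics. We use boldface letters to denote such transfer functions, e.g., $\boldsymbol{\Phi}_x(z) = \sum_{k=1}^{\infty} \Phi_x(k) z^{-k}$. Then, the constraints (2.4) can be rewritten as

$$\begin{bmatrix} zI - A_\star & -B_\star \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} = I,$$

and the corresponding (not necessarily static) control law $u = \mathbf{K} x$ is given by $\mathbf{K} = \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1}$.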

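Although other approaches to optimal controller design exist, we argue now that the SLS parameterization has some appealing properties when applied to the control of uncertain systems. In particular, suppose that rather than having access to the true system transition matrices $(A_\star, B_\star)$, we instead only have access to estimates $(\hat{A}, \hat{B})$. The SLS framework allows us to characterize the system responses achieved by a controller, computed using only the estimates $(\hat{A}, \hat{B})$, on the true system $(A_\star, B_\star)$. Specifically, if we denote $\hat{\boldsymbol{\Delta}} := \Delta_A \boldsymbol{\Phi}_x + \Delta_B \boldsymbol{\Phi}_u$ with $\Delta_A := \hat{A} - A_\star$ and $\Delta_B := \hat{B} - B_\star$, simple algebra shows that

$$\begin{bmatrix} zI - \hat{A} & -\hat{B} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} = I \quad \Longleftrightarrow \quad \begin{bmatrix} zI - A_\star & -B_\star \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} = I + \hat{\boldsymbol{\Delta}}.$$

Theorem A.2 shows that if $(I + \hat{\boldsymbol{\Delta}})^{-1}$ exists, then the controller $\mathbf{K} = \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1}$, computed using only the estimates $(\hat{A}, \hat{B})$, achieves the following response on the true system $(A_\star, B_\star)$:

$$\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} (I + \hat{\boldsymbol{\Delta}})^{-1} \mathbf{w}.$$

Further, if $\mathbf{K}$ stabilizes the system $(\hat{A}, \hat{B})$, and $(I + \hat{\boldsymbol{\Delta}})^{-1}$ is stable (simple sufficient conditions can be derived to ensure this, see [8]), then $\mathbf{K}$ is also stabilizing for the true system. It is this transparency between system uncertainty and controller performance that we exploit in our algorithm.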
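For completeness, the algebra behind this robustness statement can be spelled out as follows. This is a sketch under the sign convention $\Delta_A := \hat{A} - A_\star$, $\Delta_B := \hat{B} - B_\star$ (an assumption chosen to agree with (A.9)), for responses $\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u$ designed on the estimated system.

$$
\begin{aligned}
\begin{bmatrix} zI - A_\star & -B_\star \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix}
&= \begin{bmatrix} zI - \hat{A} & -\hat{B} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix}
 + \begin{bmatrix} \Delta_A & \Delta_B \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} \\
&= I + \hat{\boldsymbol{\Delta}}, \qquad \hat{\boldsymbol{\Delta}} := \Delta_A \boldsymbol{\Phi}_x + \Delta_B \boldsymbol{\Phi}_u .
\end{aligned}
$$

Applying $\mathbf{u} = \mathbf{K}\mathbf{x}$ with $\mathbf{K} = \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1}$ to the true dynamics $z\mathbf{x} = A_\star \mathbf{x} + B_\star \mathbf{u} + \mathbf{w}$ then gives

$$
\begin{aligned}
\mathbf{w} &= \left( zI - A_\star - B_\star \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1} \right) \mathbf{x}
 = \left( (zI - A_\star)\boldsymbol{\Phi}_x - B_\star \boldsymbol{\Phi}_u \right) \boldsymbol{\Phi}_x^{-1} \mathbf{x}
 = (I + \hat{\boldsymbol{\Delta}})\, \boldsymbol{\Phi}_x^{-1} \mathbf{x}, \\
\mathbf{x} &= \boldsymbol{\Phi}_x (I + \hat{\boldsymbol{\Delta}})^{-1} \mathbf{w}, \qquad
\mathbf{u} = \mathbf{K}\mathbf{x} = \boldsymbol{\Phi}_u (I + \hat{\boldsymbol{\Delta}})^{-1} \mathbf{w},
\end{aligned}
$$

whenever the inverse exists.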

We end this discussion with the definition of a function space that we use extensively throughout:

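$$\mathcal{S}(C, \rho) := \left\{ \mathbf{M} = \sum_{k=0}^{\infty} M(k) z^{-k} \;\middle|\; \|M(k)\| \le C \rho^{k},\ k = 0, 1, 2, \ldots \right\}.$$

The space $\mathcal{S}(C, \rho)$ consists of stable transfer functions that satisfy a certain decay rate in the spectral norm of their impulse response elements. We denote the restriction of $\mathcal{S}(C, \rho)$ to the space of $F$-length finite impulse response (FIR) filters by $\mathcal{S}_F(C, \rho)$, i.e., $\mathbf{M} \in \mathcal{S}_F(C, \rho)$ if $\mathbf{M} \in \mathcal{S}(C, \rho)$, and $M(k) = 0$ for all $k > F$. Further note that we write $\mathbf{M} \in \frac{1}{z}\mathcal{S}(C, \rho)$ to mean that $z\mathbf{M} \in \mathcal{S}(C, \rho)$, i.e., that $\mathbf{M} = \sum_{k=1}^{\infty} M(k) z^{-k}$ with $\|M(k)\| \le C \rho^{k-1}$.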

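We equip $\mathcal{S}(C, \rho)$ with the $\mathcal{H}_\infty$ and $\mathcal{H}_2$ norms, which are infinite horizon analogs of the spectral and Frobenius norms of a matrix, respectively: $\|\mathbf{M}\|_{\mathcal{H}_\infty} = \sup_{\|w\|_2 = 1} \|\mathbf{M} w\|_2$ and $\|\mathbf{M}\|_{\mathcal{H}_2} = \sqrt{\sum_{k=0}^{\infty} \|M(k)\|_F^2}$. The $\mathcal{H}_\infty$ and $\mathcal{H}_2$ norms have distinct interpretations. The $\mathcal{H}_\infty$ norm of a system is equal to its $\ell_2 \to \ell_2$ operator norm, and can be used to measure the robustness of a system to unmodelled dynamics [26]. The $\mathcal{H}_2$ norm has a direct interpretation as the energy transferred to the system by a white noise process, and is hence closely related to the LQR optimal control problem. Unsurprisingly, the $\mathcal{H}_2$ norm appears in the objective function of our optimization problem, whereas the $\mathcal{H}_\infty$ norm appears in the constraints to ensure robust stability and performance.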
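The two norms can be illustrated in the scalar (single-input, single-output) case, where the spectral and Frobenius norms of each impulse response element both reduce to an absolute value. The sketch below is only an illustration: the constants `C`, `rho`, `F` and the geometric impulse response are assumptions, and the $\mathcal{H}_\infty$ norm is approximated by gridding the unit circle, which in general yields a lower bound.

```python
import cmath
import math


def h2_norm(impulse):
    """Square root of the sum of squared impulse response coefficients."""
    return math.sqrt(sum(m * m for m in impulse))


def hinf_norm_grid(impulse, n_grid=2048):
    """Largest magnitude of M(e^{i theta}) over a uniform grid of the unit circle."""
    best = 0.0
    for i in range(n_grid):
        z_inv = cmath.exp(-2j * math.pi * i / n_grid)
        value = sum(m * z_inv ** k for k, m in enumerate(impulse))
        best = max(best, abs(value))
    return best


# Hypothetical FIR element with M(k) = C * rho^k for k = 0, ..., F - 1.
C, rho, F = 2.0, 0.5, 60
impulse = [C * rho ** k for k in range(F)]

h2 = h2_norm(impulse)          # close to C / sqrt(1 - rho^2)
hinf = hinf_norm_grid(impulse) # close to C / (1 - rho), attained at theta = 0
```

For this nonnegative impulse response the supremum is attained at $\theta = 0$, which lies on the grid, so the gridded value coincides with the true norm up to rounding.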

## 3 Algorithm and Guarantees

Our proposed robust adaptive control algorithm for LQR is shown in Algorithm 1. We note that while Line 9 of Algorithm 1 is written as an infinite-dimensional optimization problem, because of the FIR nature of the decision variables, it can be equivalently written as a finite-dimensional semidefinite program. We describe this transformation in Section G.3 of the Appendix.

Some remarks on practice are in order. First, in Line 7, only the trajectory data collected during the $i$-th epoch is used for the least squares estimate. Second, the epoch lengths we use grow exponentially in the epoch index. These settings are chosen primarily to simplify the analysis; in practice all the data collected should be used, and it may be preferable to use a more slowly growing epoch schedule. Finally, for storage considerations, instead of performing a batch least squares update of the model, a recursive least squares (RLS) estimator can be used to update the parameters in an online manner.
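The recursive least squares remark can be sketched as follows for a hypothetical scalar system $x_{k+1} = a x_k + b u_k + w_k$. Everything specific here is an assumption for illustration: the true parameters, the stabilizing gain, the unit-variance exploration noise, and the large initial covariance `P`; this is the textbook RLS recursion, not the estimator of Algorithm 1.

```python
import random


def rls_update(theta, P, z, y):
    """One recursive least squares step for the model y ~ theta^T z (2 parameters)."""
    Pz = [P[0][0] * z[0] + P[0][1] * z[1], P[1][0] * z[0] + P[1][1] * z[1]]
    denom = 1.0 + z[0] * Pz[0] + z[1] * Pz[1]
    gain = [Pz[0] / denom, Pz[1] / denom]
    err = y - (theta[0] * z[0] + theta[1] * z[1])      # one-step prediction error
    theta = [theta[0] + gain[0] * err, theta[1] + gain[1] * err]
    # P <- P - gain (z^T P); P is symmetric, so z^T P equals (P z)^T.
    P = [[P[i][j] - gain[i] * Pz[j] for j in range(2)] for i in range(2)]
    return theta, P


# Hypothetical true system and stabilizing feedback gain (illustrative assumptions).
a_true, b_true, gain_k = 1.01, 1.0, -0.5

rng = random.Random(0)
theta = [0.0, 0.0]                    # running estimate of (a, b)
P = [[1e3, 0.0], [0.0, 1e3]]          # large initial covariance: weak prior
x = 0.0
for _ in range(5000):
    u = gain_k * x + rng.gauss(0.0, 1.0)                   # feedback plus exploration noise
    x_next = a_true * x + b_true * u + rng.gauss(0.0, 1.0)
    theta, P = rls_update(theta, P, [x, u], x_next)
    x = x_next

a_hat, b_hat = theta
```

Only `theta` and the small matrix `P` are stored, so memory does not grow with the trajectory length.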

### 3.1 Regret Upper Bounds

Our guarantees for Algorithm 1 are stated in terms of certain system specific constants, which we define here. We let $K_\star$ denote the static feedback solution to the LQR problem for $(A_\star, B_\star, Q, R)$. Next, we define $(C_\star, \rho_\star)$ such that the closed loop system $A_\star + B_\star K_\star$ belongs to $\mathcal{S}(C_\star, \rho_\star)$. Our main assumption is stated as follows.

###### Assumption 3.1.

We are given a controller $K^{(0)}$ that stabilizes the true system $(A_\star, B_\star)$. Furthermore, letting $(\boldsymbol{\Phi}_x^{(0)}, \boldsymbol{\Phi}_u^{(0)})$ denote the response of $K^{(0)}$ on $(A_\star, B_\star)$, we assume that $\boldsymbol{\Phi}_x^{(0)} \in \mathcal{S}(C_x, \rho)$ and $\boldsymbol{\Phi}_u^{(0)} \in \mathcal{S}(C_u, \rho)$, where the constants $C_x, C_u, \rho$ are defined in Algorithm 1.

The requirement of an initial stabilizing controller is not restrictive; Dean et al. [8] provide an offline strategy for finding such a controller. Furthermore, in practice Algorithm 1 can be initialized with no controller, with random inputs applied instead to the system in the first epoch to estimate $(A_\star, B_\star)$ within an initial confidence set for which the synthesis problem becomes feasible.

Our first guarantee is on the rate of estimation of $(A_\star, B_\star)$ as the algorithm progresses through time. This result builds on recent progress [23] for estimation along trajectories of a linear dynamical system. For what follows, the notation $\tilde{O}(\cdot)$ hides absolute constants and polylogarithmic factors.

###### Theorem 3.2.

Fix a $\delta \in (0, 1)$ and suppose that Assumption 3.1 holds. With probability at least $1 - \delta$ the following statement holds. Suppose that $T$ is at an epoch boundary. Let $(\hat{A}(T), \hat{B}(T))$ denote the current estimate of $(A_\star, B_\star)$ computed by Algorithm 1 at the end of time $T$. Then, this estimate satisfies the guarantee

$$\max\left\{ \|\hat{A}(T) - A_\star\|,\ \|\hat{B}(T) - B_\star\| \right\} \le \tilde{O}\left( \frac{C_\star \|K_\star\|}{(1 - \rho_\star)^3} \frac{\sqrt{n + p}}{T^{1/3}} \right).$$

Theorem 3.2 shows that Algorithm 1 achieves a consistent estimate of the true dynamics $(A_\star, B_\star)$, and learns at a rate of $\tilde{O}(T^{-1/3})$. We note that consistency of parameter estimates is not a guarantee provided by OFU or TS based approaches.

Next, we state an upper bound on the regret incurred by Algorithm 1.

###### Theorem 3.3.

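Fix a $\delta \in (0, 1)$ and suppose that Assumption 3.1 holds. With probability at least $1 - \delta$ the following statement holds. For all $T$, Algorithm 1 satisfies

$$\mathrm{Regret}(T) \le \tilde{O}\left( (n + p) \frac{C_\star^4 (1 + \|K_\star\|)^4 (1 + \|B_\star\|)^2 J_\star}{(1 - \rho_\star)^{16}}\, T^{2/3} \right).$$

Here, the notation $\tilde{O}(\cdot)$ also hides lower-order terms.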

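The intuition behind our proof is transparent. We use SLS to show that the cost during epoch $i$ is bounded by $O(T_i (1 + \varepsilon_{i-1})(1 + \sigma_{\eta,i}^2) J_\star)$, where the $(1 + \varepsilon_{i-1})$ factor is the performance degradation incurred by model uncertainty, and the $(1 + \sigma_{\eta,i}^2)$ factor is the additional cost incurred from injecting exploration noise. Hence, the regret incurred during this epoch is $O(T_i (\varepsilon_{i-1} + \sigma_{\eta,i}^2) J_\star)$. We then bound our estimation error by $\varepsilon_i = \tilde{O}(\sigma_{\eta,i}^{-1} T_i^{-1/2})$. Setting $\sigma_{\eta,i}^2 = T_i^{-\alpha}$, we have the per epoch bound $\tilde{O}(T_i^{1-\alpha} + T_i^{1-(1-\alpha)/2})$. Choosing $\alpha = 1/3$ to balance these competing powers of $T_i$ and summing over a logarithmic number of epochs, we obtain a final regret of $\tilde{O}(T^{2/3})$.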
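The balancing step can be written out explicitly. The following is a heuristic sketch only, assuming that an epoch of length $T_i$ with exploration variance $\sigma^2 = T_i^{-\alpha}$ yields estimation error of order $\sigma^{-1} T_i^{-1/2}$ (the usual least-squares scaling), and ignoring constants and logarithmic factors; the symbols $\sigma$ and $\alpha$ are introduced here for illustration.

$$
\begin{aligned}
\varepsilon &\approx \sigma^{-1} T_i^{-1/2} = T_i^{\alpha/2 - 1/2} = T_i^{-(1-\alpha)/2}, \\
\text{per-epoch regret} &\approx T_i \left( \varepsilon + \sigma^2 \right) = T_i^{\,1 - (1-\alpha)/2} + T_i^{\,1 - \alpha}, \\
\text{balance:}\quad \frac{1-\alpha}{2} &= \alpha \ \Longrightarrow\ \alpha = \frac{1}{3}, \qquad \text{per-epoch regret} \approx T_i^{2/3}.
\end{aligned}
$$

With exponentially growing epochs the sum $\sum_i T_i^{2/3}$ is geometric and dominated by its last term, which gives total regret of order $T^{2/3}$, while the estimation error at the end of the last epoch is of order $T^{-1/3}$.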

The main difficulty in the proof is ensuring that the transient behavior of the resulting controllers is uniformly bounded when applied to the true system. Prior works sidestep this issue by assuming that the true dynamics lie within a (known) compact set for which the Heine-Borel theorem asserts the existence of finite constants that capture this behavior. We go a step further and work through the perturbation analysis which allows us to give a regret bound that depends only on simple quantities of the true system $(A_\star, B_\star)$. The full proof is given in the appendix.

Finally, we remark that the dependence on $1/(1 - \rho_\star)$ in our results is an artifact of our perturbation analysis, and we leave sharpening this dependence to future work.

### 3.2 Regret Lower Bounds and Parameter Estimation Rates

We saw that Algorithm 1 achieves $\tilde{O}(T^{2/3})$ regret with high probability. Now we provide a matching algorithmic lower bound on the expected regret, showing that the analysis presented in Section 3.1 is sharp as a function of $T$. Moreover, our lower bound characterizes how much regret must be accrued in order to achieve a specified estimation rate for the system parameters $(A_\star, B_\star)$.

###### Theorem 3.4.

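Let the initial state $x_0$ be distributed according to the steady state distribution of the optimal closed loop system, and let $\{u_k\}_{k \ge 0}$ be any sequence of inputs as in Section 2. Furthermore, let $f$ be any function such that with probability $1 - \delta$ we have

$$\lambda_{\min}\left( \sum_{k=0}^{T-1} \begin{bmatrix} x_k \\ u_k \end{bmatrix} \begin{bmatrix} x_k^\top & u_k^\top \end{bmatrix} \right) \ge f(T). \tag{3.1}$$

Then, there exist positive values $T_0$ and $C_0$ such that for all $T \ge T_0$ we have

$$\sum_{k=0}^{T} \mathbb{E}\left[ x_k^\top Q x_k + u_k^\top R u_k - J_\star \right] \ge \frac{1}{2} (1 - \delta)\, \lambda_{\min}(R) \left( 1 + \sigma_{\min}(K_\star)^2 \right) f(T - T_0) - C_0,$$

where $T_0$ and $C_0$ are functions of the problem parameters, detailed in Appendix E.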

The proof of the estimation error bound in Theorem 3.2 shows that Algorithm 1 satisfies Eq. (3.1) with $f(T) = \tilde{\Omega}(T^{2/3})$. Since the exploration variance used by Algorithm 1 during the $i$-th epoch scales as $T_i^{-1/3}$, we obtain the following corollary, which demonstrates the sharpness of our regret analysis with respect to the scaling of $T$.

###### Corollary 3.5.

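For all sufficiently large $T$, the expected regret of Algorithm 1 satisfies

$$\sum_{k=1}^{T} \mathbb{E}\left[ x_k^\top Q x_k + u_k^\top R u_k - J_\star \right] \ge \tilde{\Omega}\left( \lambda_{\min}(R) \left( 1 + \sigma_{\min}(K_\star)^2 \right) T^{2/3} \right).$$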

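A natural question to ask is how much regret any algorithm must accrue in order to achieve estimation error $\|\hat{A} - A_\star\| \le \varepsilon$ and $\|\hat{B} - B_\star\| \le \varepsilon$. From Theorem 3.2 we know that Algorithm 1 estimates $(A_\star, B_\star)$ at rate $\tilde{O}(T^{-1/3})$. Therefore, in order to achieve $\varepsilon$ estimation error, $T$ must be $\tilde{\Omega}(\varepsilon^{-3})$. Hence, Theorem 3.3 implies that the regret of Algorithm 1 to achieve $\varepsilon$ estimation error is $\tilde{O}(\varepsilon^{-2})$.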

Interestingly, let us consider any other algorithm achieving $O(T^{\alpha})$ regret for some $0 < \alpha < 1$. Then, Theorem 3.4 suggests that the best estimation rate achievable by such an algorithm is $\Omega(T^{-\alpha/2})$, since the minimum eigenvalue condition Eq. (3.1) governs the signal-to-noise ratio. In the case of linear regression with independent data it is known that the minimax estimation rate is lower bounded by the square root of the inverse of the minimum eigenvalue (3.1). We conjecture that the same result holds in our case. Therefore, to achieve $\varepsilon$ estimation error, any algorithm would likely require $\Omega(\varepsilon^{-2})$ regret, showing that Algorithm 1 is optimal up to logarithmic factors in this sense. Finally, we note that while Algorithm 1 estimates $(A_\star, B_\star)$ at a rate $\tilde{O}(T^{-1/3})$, Theorem 3.4 suggests that any algorithm achieving the optimal $O(\sqrt{T})$ regret would estimate $(A_\star, B_\star)$ at a rate $\Omega(T^{-1/4})$.
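The arithmetic behind this trade-off is short enough to write down. The sketch below is heuristic: it takes for granted the conjectured estimation lower bound of order $f(T)^{-1/2}$ and drops constants and logarithmic factors.

$$
\begin{aligned}
\text{regret } O(T^{\alpha}) \ &\Longrightarrow\ f(T) \lesssim T^{\alpha}
 \ \Longrightarrow\ \text{estimation error} \gtrsim f(T)^{-1/2} \gtrsim T^{-\alpha/2}, \\
\text{error} \le \varepsilon \ &\Longrightarrow\ T \gtrsim \varepsilon^{-2/\alpha}
 \ \Longrightarrow\ \text{regret} \approx T^{\alpha} \gtrsim \varepsilon^{-2}.
\end{aligned}
$$

Two instances: $\alpha = 2/3$ gives error of order $T^{-1/3}$, reached after $T \approx \varepsilon^{-3}$ steps, and $\alpha = 1/2$ gives error of order $T^{-1/4}$, reached after $T \approx \varepsilon^{-4}$ steps. In both cases the regret paid to reach accuracy $\varepsilon$ is of order $\varepsilon^{-2}$.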

## 4 Experiments

##### Regret Comparison.

We illustrate the performance of several adaptive schemes empirically. We compare the proposed robust adaptive method with non-Bayesian Thompson sampling (TS) as in Abeille and Lazaric [5] and a heuristic projected gradient descent (PGD) implementation of OFU. As a simple baseline, we use the nominal control method, which synthesizes the optimal infinite-horizon LQR controller for the estimated system and injects noise with the same schedule as the robust approach. Implementation details and computational considerations for all adaptive methods are in Appendix G.

The comparison experiments are carried out on the following LQR problem:

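$$A_\star = \begin{bmatrix} 1.01 & 0.01 & 0 \\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}, \quad B_\star = I, \quad Q = 10 I, \quad R = I, \quad \sigma_w = 1. \tag{4.1}$$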

This system corresponds to a marginally unstable Laplacian system where adjacent nodes are weakly connected; these dynamics were also studied by [4, 8, 24]. The cost is such that input size is penalized relatively less than state. This problem setting is amenable to robust methods due to both the cost ratio and the marginal instability, which are factors that may hurt optimistic methods. In Appendix H.1, we show similar results for an unstable system with large transients.
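The marginal instability of this system can be checked numerically. The sketch below estimates the spectral radius of $A_\star$ from Eq. (4.1) by power iteration, which is a choice made here for illustration and is valid because the matrix is symmetric with a positive dominant eigenvalue; the result can be compared with the closed-form eigenvalues $1.01 + 0.02\cos(j\pi/4)$, $j = 1, 2, 3$, of a symmetric tridiagonal Toeplitz matrix.

```python
import math

# A_star from Eq. (4.1).
A_star = [
    [1.01, 0.01, 0.00],
    [0.01, 1.01, 0.01],
    [0.00, 0.01, 1.01],
]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def spectral_radius_symmetric(M, iters=5000):
    """Power iteration followed by a Rayleigh quotient (symmetric M assumed)."""
    v = [1.0] * len(M)
    for _ in range(iters):
        w = matvec(M, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Mv = matvec(M, v)
    return sum(a * b for a, b in zip(v, Mv))


rho_A = spectral_radius_symmetric(A_star)
rho_closed_form = 1.01 + 0.02 * math.cos(math.pi / 4)   # largest eigenvalue, about 1.024
```

A spectral radius slightly above one means the uncontrolled state grows slowly, so a poor controller is not punished immediately but is punished over long horizons.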

To standardize the initialization of the various adaptive methods, we use an initial rollout in which the input is a stabilizing controller plus Gaussian noise with fixed variance. This trajectory is not counted towards the regret, but the recorded states and inputs are used to initialize parameter estimates. In each experiment, the system starts from $x_0 = 0$ to reduce variance over runs. For all methods, the actual errors $\|\hat{A} - A_\star\|$ and $\|\hat{B} - B_\star\|$ are used rather than bounds or bootstrapped estimates. The effect of this choice on regret is small, as examined empirically in Appendix H.2.

The performance of the various adaptive methods is compared in Figure 1. The median and 90th percentile regret over 500 instances are displayed in Figure 1a, which gives an idea of both typical and worst-case behavior. The regret of the optimal LQR controller for the true system is displayed as a baseline. Overall, the methods have very similar performance. One benefit of robustness is the guaranteed stability and bounded infinite-horizon cost at every point during operation. In Figure 1b, this infinite-horizon LQR cost is plotted for the controllers played during each epoch. This value measures the cost of using each epoch's controller indefinitely, rather than continuing to update its parameters. The robust adaptive method performs relatively better than other adaptive algorithms, indicating that it is more amenable to early stopping, i.e., to turning off the adaptive component of the algorithm and playing the current controller indefinitely.

##### Extension to Uncertain Environment with State Constraints.

The proposed robust adaptive method naturally generalizes beyond the standard LQR problem. We consider a disturbance forecasting example which incorporates environmental uncertainty and safety constraints. Consider a system with known dynamics driven by stochastic disturbances that are now correlated in time. We model the disturbance process as the output of an unknown autonomous LTI system, as illustrated in Figure 1(a). This setting can be interpreted as a demand forecasting problem, where, for example, the system is a server farm and the disturbances represent changes in the amount of incoming jobs. If the dynamics of the correlated disturbance process are known, this knowledge can be used for more cost-effective temperature control.

We let the system with known dynamics be described by the graph Laplacian dynamics as in Eq. (4.1). The disturbance dynamics are unknown and are governed by a stable system transition matrix, resulting in the following dynamics for the full system:

The costs are set to model expensive inputs. The controller synthesis problem in Line 9 of Algorithm 1 is modified to reflect the problem structure, and crucially, we add a constraint on the system response. Further details of the formulation are explained in Appendix H.3. Figure 1(b) illustrates the effect. While the unconstrained synthesis results in trajectories with large state values, the constrained synthesis results in much more moderate behavior.

## 5 Conclusions and Future Work

We presented a polynomial-time algorithm for the adaptive LQR problem that provides high probability guarantees of sub-linear regret. In contrast to other approaches to this problem, our robust adaptive method guarantees stability, robust performance, and parameter estimation. We also explored the interplay between regret minimization and parameter estimation, identifying fundamental limits connecting the two.

Several questions remain to be answered. It is an open question whether a polynomial-time algorithm can achieve a regret of $\tilde{O}(\sqrt{T})$. In our implementation of OFU, we observed that PGD performed quite effectively. Interesting future work is to see if the techniques of Fazel et al. [12] for policy gradient optimization on LQR can be applied to prove convergence of PGD on the OFU subroutine, which would provide an optimal polynomial-time algorithm. Moreover, we observed that OFU and TS methods in practice gave estimates of system parameters that were comparable with our method, which explicitly adds excitation noise. It seems that the switching of control policies at epoch boundaries provides more excitation for system identification than is currently understood by the theory. Furthermore, practical issues that remain to be addressed include satisfying safety constraints and dealing with nonlinear dynamics; in both settings, finite-sample parameter estimation/system identification and adaptive control remain open problems.

#### Acknowledgments

SD is supported by an NSF Graduate Research Fellowship. As part of the RISE lab, HM is generally supported in part by NSF CISE Expeditions Award CCF-1730628, DHS Award HSHQDC-16-3-00083, and gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft, Scotiabank, Splunk and VMware. BR is generously supported in part by NSF award CCF-1359814, ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-17-1-2502, the DARPA Fundamental Limits of Learning (Fun LoL) and Lagrange Programs, and an Amazon AWS AI Research Award.

## Appendix A Background on System Level Synthesis

We begin by defining two function spaces which we use extensively throughout:

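$$\mathcal{RH}_\infty = \left\{ \mathbf{M} : \mathbb{C} \to \mathbb{C}^{n \times p} \;\middle|\; \mathbf{M}(z) \text{ is rational},\ \mathbf{M}(z) \text{ is analytic on } \mathbb{D}^c \right\}, \tag{A.1}$$

$$\mathcal{RH}_\infty(C, \rho) = \left\{ \mathbf{M} \in \mathcal{RH}_\infty \;\middle|\; \|M[k]\| \le C \rho^{k},\ k = 1, 2, \ldots \right\}. \tag{A.2}$$

Note that we use $\mathcal{S}(C, \rho)$ to denote $\mathcal{RH}_\infty(C, \rho)$ in the main body of the text.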

Recall that our main object of interest is the system

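$$x_{k+1} = A x_k + B u_k + w_k,$$

and our goal is to design an LTI feedback control policy $\mathbf{u} = \mathbf{K}\mathbf{x}$ such that the resulting closed loop system is stable. For a given $\mathbf{K}$, we refer to the closed loop transfer functions from $\mathbf{w} \to \mathbf{x}$ and $\mathbf{w} \to \mathbf{u}$ as the system response. Symbolically, we denote these maps as $\boldsymbol{\Phi}_x$ and $\boldsymbol{\Phi}_u$. Simple algebra shows that given $\mathbf{K}$, these maps take on the form

$$\boldsymbol{\Phi}_x = (zI - A - B\mathbf{K})^{-1}, \qquad \boldsymbol{\Phi}_u = \mathbf{K} (zI - A - B\mathbf{K})^{-1}. \tag{A.3}$$

We then have the following theorem parameterizing the set of such stable closed-loop transfer functions that are achievable by a stabilizing controller $\mathbf{K}$.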

###### Theorem A.1 (State-Feedback Parameterization [25]).

The following are true:

• The affine subspace defined by

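$$\begin{bmatrix} zI - A & -B \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} = I, \qquad \boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u \in \frac{1}{z} \mathcal{RH}_\infty \tag{A.4}$$

parameterizes all system responses (A.3) from $\mathbf{w}$ to $(\mathbf{x}, \mathbf{u})$, achievable by an internally stabilizing state-feedback controller $\mathbf{K}$.

• For any transfer matrices $\{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u\}$ satisfying (A.4), the controller $\mathbf{K} = \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1}$ is internally stabilizing and achieves the desired system response (A.3).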

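If $\mathbf{K}$ stabilizes $(A, B)$, then the LQR cost of $\mathbf{K}$ on $(A, B)$ can be written by Parseval's identity as

$$J(A, B, \mathbf{K}; \sigma_w^2 I) := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{k=1}^{T} x_k^\top Q x_k + u_k^\top R u_k \right] = \sigma_w^2 \left\| \begin{bmatrix} Q^{1/2} & 0 \\ 0 & R^{1/2} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2}^2. \tag{A.5}$$

More generally, we will define $J(A, B, \mathbf{K}; \Sigma)$ to be the LQR cost when the process noise is driven by $w_k \sim \mathcal{N}(0, \Sigma)$. When we omit the last argument, we mean $\Sigma = \sigma_w^2 I$, i.e. $J(A, B, \mathbf{K}) = J(A, B, \mathbf{K}; \sigma_w^2 I)$.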
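As a small numerical illustration of the identity (A.5), the sketch below treats a hypothetical scalar system with a static gain (all numerical values are assumptions). It computes the cost once by summing squared impulse response elements, which is the squared $\mathcal{H}_2$ norm, and once from the stationary variance of the closed loop state, $\sigma_w^2 / (1 - (a + bk)^2)$.

```python
def lqr_cost_from_response(a, b, q, r, k, sigma_w, horizon=2000):
    """sigma_w^2 times the squared H2 norm of the weighted system response."""
    c = a + b * k                 # closed loop pole
    phi_x = 1.0                   # Phi_x(1) = 1
    total = 0.0
    for _ in range(horizon):
        phi_u = k * phi_x         # Phi_u(j) = k * Phi_x(j)
        total += q * phi_x ** 2 + r * phi_u ** 2
        phi_x *= c                # Phi_x(j + 1) = c * Phi_x(j)
    return sigma_w ** 2 * total


def lqr_cost_from_stationary_variance(a, b, q, r, k, sigma_w):
    """Average cost computed from the steady state variance of x_k."""
    c = a + b * k
    if abs(c) >= 1.0:
        raise ValueError("the gain k does not stabilize the system")
    state_variance = sigma_w ** 2 / (1.0 - c * c)
    return (q + r * k * k) * state_variance


# Hypothetical scalar problem data and a stabilizing (not optimal) gain.
a, b, q, r, k, sigma_w = 1.01, 1.0, 10.0, 1.0, -0.5, 1.0

cost_h2 = lqr_cost_from_response(a, b, q, r, k, sigma_w)
cost_var = lqr_cost_from_stationary_variance(a, b, q, r, k, sigma_w)
```

The two routes agree because the steady state variance is itself the sum of squared impulse response elements of the closed loop.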

In [8], the authors use SLS to study how uncertainty in the true parameters affects the LQR objective cost. Our analysis relies on these tools, which we briefly describe below.

The starting point for the theory is a characterization of all robustly stabilizing controllers.

###### Theorem A.2 ([18]).

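Suppose that the transfer matrices $\{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u\} \in \frac{1}{z}\mathcal{RH}_\infty$ satisfy

$$\begin{bmatrix} zI - A & -B \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} = I + \boldsymbol{\Delta}. \tag{A.6}$$

Then the controller $\mathbf{K} = \boldsymbol{\Phi}_u \boldsymbol{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$ if and only if $(I + \boldsymbol{\Delta})^{-1} \in \mathcal{RH}_\infty$. Furthermore, the resulting system response is given by

$$\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} = \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} (I + \boldsymbol{\Delta})^{-1} \mathbf{w}. \tag{A.7}$$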
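The response formula (A.7) can be checked numerically in the scalar case. In the sketch below, all numerical values are hypothetical: a static gain `k` is paired with a response designed on estimates `(a_hat, b_hat)`, the mismatch is `d = (a_hat - a) + (b_hat - b) * k`, and the impulse response of $\boldsymbol{\Phi}_x (1 + \boldsymbol{\Delta})^{-1}$ is compared with the closed loop actually obtained on the true system, whose elements are $(a + bk)^{j-1}$.

```python
def series_inverse_one_plus(delta, n):
    """Impulse response of (1 + Delta)^{-1}, where delta[j] multiplies z^{-j} and delta[0] = 0."""
    inv = [1.0] + [0.0] * (n - 1)
    for j in range(1, n):
        inv[j] = -sum(delta[i] * inv[j - i] for i in range(1, j + 1))
    return inv


def convolve(f, g, n):
    """First n coefficients of the product of two power series in z^{-1}."""
    return [sum(f[i] * g[j - i] for i in range(j + 1)) for j in range(n)]


# Hypothetical true system, estimates, and static gain (illustrative assumptions).
a, b = 1.01, 1.0
a_hat, b_hat = 0.95, 1.1
k = -0.6
n = 25

c_hat = a_hat + b_hat * k             # closed loop pole on the estimated system
c_true = a + b * k                    # closed loop pole on the true system
d = (a_hat - a) + (b_hat - b) * k     # scalar version of Delta_A + Delta_B * K

# Coefficient j multiplies z^{-j}; responses start at j = 1.
phi_x = [0.0] + [c_hat ** (j - 1) for j in range(1, n)]
delta = [0.0] + [d * c_hat ** (j - 1) for j in range(1, n)]

achieved = convolve(phi_x, series_inverse_one_plus(delta, n), n)
expected = [0.0] + [c_true ** (j - 1) for j in range(1, n)]
max_error = max(abs(x - y) for x, y in zip(achieved, expected))
```

In transfer function form the check reads $\frac{1}{z - \hat{c}} \big(1 + \frac{d}{z - \hat{c}}\big)^{-1} = \frac{1}{z - \hat{c} + d} = \frac{1}{z - c}$, since $\hat{c} - d = c$.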

This robustness result is used to derive a cost perturbation result for LQR.

###### Lemma A.3 ([8]).

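Let the controller $\mathbf{K}$ stabilize $(\hat{A}, \hat{B})$ and $(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)$ be its corresponding system response on system $(\hat{A}, \hat{B})$. Then if $\mathbf{K}$ stabilizes $(A, B)$, it achieves the following LQR cost

$$J(A, B, \mathbf{K}) = \sigma_w^2 \left\| \begin{bmatrix} Q^{1/2} & 0 \\ 0 & R^{1/2} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} \left( I + \begin{bmatrix} \Delta_A & \Delta_B \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} \right)^{-1} \right\|_{\mathcal{H}_2}^2. \tag{A.8}$$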

Furthermore, letting

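$$\hat{\boldsymbol{\Delta}} := \begin{bmatrix} \Delta_A & \Delta_B \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix}, \tag{A.9}$$

a sufficient condition for $\mathbf{K}$ to stabilize $(A, B)$ is that $\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} < 1$. An upper bound on $\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty}$ is given by, for any $\alpha \in (0, 1)$,

$$\|\hat{\boldsymbol{\Delta}}\|_{\mathcal{H}_\infty} \le \left\| \begin{bmatrix} \frac{\varepsilon_A}{\sqrt{\alpha}} \boldsymbol{\Phi}_x \\ \frac{\varepsilon_B}{\sqrt{1 - \alpha}} \boldsymbol{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_\infty}, \tag{A.10}$$

where we assume that $\|\Delta_A\| \le \varepsilon_A$ and $\|\Delta_B\| \le \varepsilon_B$.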

## Appendix B Synthesis Results

We first study the following infinite-dimensional synthesis problem.

 (B.1)

We will conduct our analysis assuming that this infinite-dimensional problem is solvable. Later on, we will show how to relax this problem to a finite-dimensional one via FIR truncation, and show the minor modifications needed to the analysis for the guarantees to hold.

We now prove a sub-optimality guarantee on the solution to (B.1) which holds for certain choices of $\rho$ and the coefficients $C_x$ and $C_u$. This result also establishes an important technical consideration, which is when the problem (B.1) is feasible.

###### Theorem B.1.

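Let $J_\star$ denote the minimal LQR cost achievable by any controller for the dynamical system with transition matrices $(A_\star, B_\star)$, and let $K_\star$ denote its optimal static feedback controller. Suppose that $\mathfrak{R}_{A_\star + B_\star K_\star} \in \mathcal{RH}_\infty(C_\star, \rho_\star)$ and that (wlog) . Suppose furthermore that $\varepsilon$ is small enough to satisfy the following conditions:

$$\varepsilon (1 + \|K_\star\|) \left\| \mathfrak{R}_{A_\star + B_\star K_\star} \right\|_{\mathcal{H}_\infty} \le 1/5, \qquad \varepsilon (1 + \|K_\star\|) C_\star \le 1 - \rho_\star.$$

Let $(\hat{A}, \hat{B})$ be any estimates of the transition matrices such that $\max\{\|\Delta_A\|, \|\Delta_B\|\} \le \varepsilon$. Then, if $C_x$, $C_u$, and $\rho$ are set as

$$C_x = \frac{O(1)\, C_\star}{1 - \rho_\star}, \qquad C_u = \frac{O(1)\, \|K_\star\| C_\star}{1 - \rho_\star}, \qquad \rho = (1/4)\rho_\star + 3/4,$$

we have that (a) the program (B.1) is feasible, (b) letting $\mathbf{K}$ denote an optimal solution to (B.1), the relative error in the LQR cost is

$$J(A_\star, B_\star, \mathbf{K}) \le \left( 1 + 5 \varepsilon (1 + \|K_\star\|) \left\| \mathfrak{R}_{A_\star + B_\star K_\star} \right\|_{\mathcal{H}_\infty} \right)^2 J_\star, \tag{B.2}$$

and (c) if furthermore $\varepsilon (C_x + C_u) \le 1 - \rho$, the response $\{\hat{\boldsymbol{\Phi}}_x, \hat{\boldsymbol{\Phi}}_u\}$ of $\mathbf{K}$ on the true system $(A_\star, B_\star)$ satisfies

$$\hat{\boldsymbol{\Phi}}_x \in \mathcal{RH}_\infty\left( \frac{O(1)\, C_\star}{(1 - \rho_\star)^2},\ 7/8 + (1/8)\rho_\star \right), \qquad \hat{\boldsymbol{\Phi}}_u \in \mathcal{RH}_\infty\left( \frac{O(1)\, \|K_\star\| C_\star}{(1 - \rho_\star)^2},\ 7/8 + (1/8)\rho_\star \right).$$
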
###### Proof.

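The proof of (a) and (b) is nearly identical to that given in [8], which works by showing that $\boldsymbol{\Phi}_x = \mathfrak{R}_{\hat{A} + \hat{B} K_\star}$ and $\boldsymbol{\Phi}_u = K_\star \mathfrak{R}_{\hat{A} + \hat{B} K_\star}$ is a feasible response which gives the desired sub-optimality guarantee. The only modification is that we need to find constants $C_x, C_u, \rho$ for which $\boldsymbol{\Phi}_x \in \mathcal{RH}_\infty(C_x, \rho)$ and $\boldsymbol{\Phi}_u \in \mathcal{RH}_\infty(C_u, \rho)$. We do this by writing

$$\mathfrak{R}_{\hat{A} + \hat{B} K_\star} = \mathfrak{R}_{A_\star + B_\star K_\star} (I - \boldsymbol{\Delta})^{-1}, \qquad \boldsymbol{\Delta} = (\Delta_A + \Delta_B K_\star)\, \mathfrak{R}_{A_\star + B_\star K_\star}.$$

By the definition of $(C_\star, \rho_\star)$ and our assumptions, we have that

$$\boldsymbol{\Delta} \in \mathcal{RH}_\infty\left( \varepsilon (1 + \|K_\star\|) C_\star,\ \rho_\star \right), \qquad \|\boldsymbol{\Delta}\|_{\mathcal{H}_\infty} < 1.$$

This places us in a position to apply Lemma F.3, from which we conclude that

$$(I - \boldsymbol{\Delta})^{-1} \in \mathcal{RH}_\infty\left( O(1),\ \mathrm{Avg}(\rho_\star, 1) \right).$$

Now applying Lemma F.1 to $\mathfrak{R}_{A_\star + B_\star K_\star} (I - \boldsymbol{\Delta})^{-1}$, we conclude that

$$\mathfrak{R}_{\hat{A} + \hat{B} K_\star} \in \mathcal{RH}_\infty\left( \frac{O(1)\, C_\star}{1 - \rho_\star},\ (1/4)\rho_\star + 3/4 \right).$$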

The claims of (a) and (b) now follow.

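Now for the proof of (c). Let $(\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u)$ be the solution to (B.1). By Theorem A.2, we have that

$$\hat{\boldsymbol{\Phi}}_x = \boldsymbol{\Phi}_x (I + \hat{\boldsymbol{\Delta}})^{-1}, \qquad \hat{\boldsymbol{\Phi}}_u = \boldsymbol{\Phi}_u (I + \hat{\boldsymbol{\Delta}})^{-1}.$$

We know that $\boldsymbol{\Phi}_x \in \mathcal{RH}_\infty(C_x, \rho)$ and $\boldsymbol{\Phi}_u \in \mathcal{RH}_\infty(C_u, \rho)$ by the constraints of the optimization problem (B.1), and furthermore,

$$\hat{\boldsymbol{\Delta}} \in \mathcal{RH}_\infty\left( \varepsilon (C_x + C_u),\ \rho \right).$$

By assumption we have $\varepsilon (C_x + C_u) \le 1 - \rho$, from which we conclude using Lemma F.3 that

$$(I + \hat{\boldsymbol{\Delta}})^{-1} \in \mathcal{RH}_\infty\left( O(1),\ \mathrm{Avg}(\rho, 1) \right).$$

Furthermore, from Lemma F.1, we conclude that

$$\boldsymbol{\Phi}_x (I + \hat{\boldsymbol{\Delta}})^{-1} \in \mathcal{RH}_\infty\left( \frac{C_x}{1 - \rho},\ 3/4 + (1/4)\rho \right), \qquad \boldsymbol{\Phi}_u (I + \hat{\boldsymbol{\Delta}})^{-1} \in \mathcal{RH}_\infty\left( \frac{C_u}{1 - \rho},\ 3/4 + (1/4)\rho \right).$$

The claim now follows by plugging in the values of $C_x$, $C_u$, and $\rho$. ∎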

### B.1 Suboptimality bounds for FIR truncated SLS

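Optimization problem (B.1) is convex but infinite dimensional, and as far as we are aware does not admit an efficient solution. In Algorithm 1, we instead propose solving the following FIR approximation to problem (B.1):

$$
\begin{aligned}
\underset{\gamma \in [0, 1)}{\text{minimize}} \quad & \frac{1}{1 - \gamma} \min_{\boldsymbol{\Phi}_x, \boldsymbol{\Phi}_u, V} \left\| \begin{bmatrix} Q^{1/2} & 0 \\ 0 & R^{1/2} \end{bmatrix} \begin{bmatrix} \boldsymbol{\Phi}_x \\ \boldsymbol{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2} \\
\text{s.t.} \quad & \|V\| \le C_x \rho^{F+1}, \quad \boldsymbol{\Phi}_x \in \frac{1}{z} \mathcal{RH}_\infty^F(C_x, \rho), \quad \boldsymbol{\Phi}_u \in \frac{1}{z} \mathcal{RH}_\infty^F(C_u, \rho),
\end{aligned} \tag{B.3}
$$

where $F$ denotes the FIR truncation length used. This optimization problem can be posed as a finite dimensional semidefinite program (see Section G.3). Let denote the resulting controller. We begin with a lemma identifying conditions under which optimization problem (B.3) is feasible. To ease notation going forward, we let .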

###### Lemma B.2.

Let the assumptions of Theorem B.1 hold, and further assume that

 F0≥log(2Cx)log(1/ρ)−1