1 Introduction
The problem of adaptively controlling an unknown dynamical system has a rich history, with classical asymptotic results of convergence and stability dating back decades [15, 16]. Of late, there has been renewed interest in a particular instance of such problems, namely the adaptive Linear Quadratic Regulator (LQR), with an emphasis on non-asymptotic guarantees of stability and performance. Initiated by Abbasi-Yadkori and Szepesvári [2], there have since been several works analyzing the regret suffered by various adaptive algorithms on LQR; here the regret incurred by an algorithm is thought of as a measure of deviations in performance from optimality over time. These results can be broadly divided into two categories: those providing high-probability guarantees for a single execution of the algorithm [2, 5, 11, 14], and those providing bounds on the expected Bayesian regret incurred over a family of possible systems [3, 21]. As we discuss in more detail below, these methods all suffer from one or several of the following limitations: restrictive and unverifiable assumptions, limited applicability, and computationally intractable subroutines. In this paper, we provide, to the best of our knowledge, the first polynomial-time algorithm for the adaptive LQR problem that provides high-probability guarantees of sublinear regret, and that does not require unverifiable or unrealistic assumptions.
Related Work.
There is a rich body of work on the estimation of linear systems as well as on the robust and adaptive control of unknown systems. We target our discussion to works on non-asymptotic guarantees for the LQR control of an unknown system, broadly divided into three categories.
Offline estimation and control synthesis: In a non-adaptive setting, i.e., when system identification can be done offline prior to controller synthesis and implementation, the first work to provide end-to-end guarantees for the LQR optimal control problem is that of Fiechter [13], who shows that the discounted LQR problem is PAC-learnable. Dean et al. [8] improve on this result, providing the first end-to-end sample complexity guarantees for the infinite horizon average cost LQR problem.
Optimism in the Face of Uncertainty (OFU): Abbasi-Yadkori and Szepesvári [2], Faradonbeh et al. [11], and Ibrahimi et al. [14] employ the Optimism in the Face of Uncertainty (OFU) principle [7], which optimistically selects model parameters from a confidence set by choosing those that lead to the best closed-loop (infinite horizon) control performance, and then plays the corresponding optimal controller, repeating this process online as the confidence set shrinks. While OFU in the LQR setting has been shown to achieve optimal $\tilde{O}(\sqrt{T})$ regret, its implementation requires solving a non-convex optimization problem to high precision, for which no provably efficient algorithm exists.
Thompson Sampling (TS): To circumvent the computational roadblock of OFU, recent works replace the intractable OFU subroutine with a random draw from the model uncertainty set, resulting in Thompson Sampling (TS) based policies [3, 5, 21]. Abeille and Lazaric [5] show that such a method achieves $\tilde{O}(T^{2/3})$ regret with high probability for scalar systems. However, their proof does not extend to the non-scalar setting. Abbasi-Yadkori and Szepesvári [3] and Ouyang et al. [21] consider expected regret in a Bayesian setting, and provide TS methods which achieve $\tilde{O}(\sqrt{T})$ regret. Although not directly comparable to our result, we remark on the computational challenges of these algorithms. Whereas the proof of Abbasi-Yadkori and Szepesvári [3] was shown to be incorrect [20], Ouyang et al. [21] make the restrictive assumption that there exists a (known) initial compact set describing the uncertainty in the system parameters, such that for any system in the set, the optimal controller for that system is stabilizing when applied to any other system in the set. No means of constructing such a set are provided, there is no known tractable algorithm to verify whether a given set satisfies this property, and it is implicitly assumed that projecting onto this set can be done efficiently.
Contributions.
To develop the first polynomial-time algorithm that provides high-probability guarantees of sublinear regret, we leverage recent results on the estimation of linear systems [23], robust controller synthesis [18, 25], and coarse-ID control [8]. We show that our robust adaptive control algorithm: (i) guarantees stability and near-optimal performance at all times; (ii) achieves a regret up to time $T$ bounded by $\tilde{O}(T^{2/3})$; and (iii) is based on finite-dimensional semidefinite programs of size logarithmic in $T$.
Furthermore, our method estimates the system parameters at rate $\tilde{O}(T^{-1/3})$ in operator norm. Although system parameter identification is not necessary for optimal control performance, an accurate system model is often desirable in practice. Motivated by this, we study the interplay between regret minimization and parameter estimation, and identify fundamental limits connecting the two. We show that the expected regret of our algorithm is lower bounded by $\tilde{\Omega}(T^{2/3})$, proving that our analysis is sharp up to logarithmic factors. Moreover, our lower bound suggests that the best estimation rate achievable by any algorithm with $O(T^{2/3})$ regret is $\Omega(T^{-1/3})$.
Finally, we conduct a numerical study of the adaptive LQR problem, in which we implement our algorithm, and compare its performance to heuristic implementations of OFU and TS based methods. We show on several examples that the regret incurred by our algorithm is comparable to that of the OFU and TS based methods. Furthermore, the infinite horizon cost achieved by our algorithm at any given time on the true system is consistently lower than that attained by OFU and TS based algorithms. Finally, we use a demand forecasting example to show how our algorithm naturally generalizes to incorporate environmental uncertainty and safety constraints.
2 Problem Statement and Preliminaries
In this work we consider adaptive control of the following discrete-time linear system
(2.1) $x_{t+1} = A_\star x_t + B_\star u_t + w_t,$
where $x_t \in \mathbb{R}^n$ is the state, $u_t \in \mathbb{R}^d$ is the control input, and $w_t \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_w^2 I_n)$ is the process noise. We assume that the state variables are observed exactly and, for simplicity, that $x_0 = 0$. We consider the Linear Quadratic Regulator optimal control problem, given by positive definite cost matrices $Q$ and $R$:
$$J_\star = \min_{\{u_t\}} \; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[ \sum_{t=1}^{T} x_t^\top Q x_t + u_t^\top R u_t \right],$$
where the minimum is taken over measurable functions of the history, with each $u_t$ adapted to $x_0, \dots, x_t$, $u_0, \dots, u_{t-1}$, and possible additional randomness independent of future states. Given knowledge of $(A_\star, B_\star)$, the optimal policy is a static state-feedback law $u_t = K_\star x_t$, where $K_\star$ is derived from the solution to a discrete algebraic Riccati equation.
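For concreteness, the Riccati-based gain can be computed with standard numerical tools. A minimal sketch using SciPy, with illustrative system matrices (hypothetical values, not prescribed by the analysis):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Illustrative dynamics and costs (hypothetical values).
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
Q = 10 * np.eye(3)
R = np.eye(3)

# Solve the discrete algebraic Riccati equation for P, then form the
# static state-feedback gain so that u_t = K x_t is optimal.
P = solve_discrete_are(A, B, Q, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The optimal closed loop A + B K is Schur stable (spectral radius < 1).
rho = max(abs(np.linalg.eigvals(A + B @ K)))
```

Here the minus sign is folded into the gain, so $u_t = K x_t$ matches the convention above.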
We are interested in algorithms which operate without knowledge of the true system transition matrices $(A_\star, B_\star)$. We measure the performance of such algorithms via their regret, defined as
$$\mathrm{Regret}(T) := \sum_{t=1}^{T} \left( x_t^\top Q x_t + u_t^\top R u_t - J_\star \right).$$
The regret of any algorithm is lower bounded by $\Omega(\sqrt{T})$, a bound matched by OFU up to logarithmic factors [11]. However, after each epoch, OFU requires optimizing a non-convex objective to high precision. Instead, our method uses a subroutine based on convex optimization and robust control.

2.1 Preliminaries: System Level Synthesis
We briefly describe the necessary background on robust control and System Level Synthesis [25] (SLS). These tools were recently used by Dean et al. [8] to provide non-asymptotic bounds for LQR in the offline "estimate-and-then-control" setting. In Appendix A, we expand on these preliminaries.
Consider the dynamics (2.1), and fix a static state-feedback control policy $K$, i.e., let $u_t = K x_t$. Then, the closed loop map from the disturbance process $\{w_0, w_1, \dots\}$ to the state $x_t$ and control input $u_t$ at time $t$ is given by
(2.2) $x_t = \sum_{k=1}^{t} (A_\star + B_\star K)^{k-1} w_{t-k}, \qquad u_t = \sum_{k=1}^{t} K (A_\star + B_\star K)^{k-1} w_{t-k}.$
Letting $\Phi_x(k) := (A_\star + B_\star K)^{k-1}$ and $\Phi_u(k) := K (A_\star + B_\star K)^{k-1}$, we can rewrite Eq. (2.2) as
(2.3) $x_t = \sum_{k=1}^{t} \Phi_x(k) w_{t-k}, \qquad u_t = \sum_{k=1}^{t} \Phi_u(k) w_{t-k},$
where $\{\Phi_x(k), \Phi_u(k)\}$ are called the closed loop system response elements induced by the controller $K$. The SLS framework shows that for any elements $\{\Phi_x(k), \Phi_u(k)\}$ constrained to obey
(2.4) $\Phi_x(1) = I, \qquad \Phi_x(k+1) = A_\star \Phi_x(k) + B_\star \Phi_u(k), \quad k \ge 1,$
there exists some controller that achieves the desired system responses (2.3). Theorem A.1 formalizes this observation: the SLS framework therefore allows any optimal control problem over linear systems to be cast as an optimization problem over the elements $\{\Phi_x(k), \Phi_u(k)\}$, constrained to satisfy the affine equations (2.4). Comparing equations (2.2) and (2.3), we see that the former is non-convex in the controller $K$, whereas the latter is affine in the elements $\{\Phi_x(k), \Phi_u(k)\}$, enabling solutions to previously difficult optimal control problems.
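As a sanity check, the responses induced by any static gain indeed satisfy the affine constraint (2.4); a small numerical verification with hypothetical matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T = 3, 2, 20
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, d))
K = 0.1 * rng.standard_normal((d, n))

# System responses induced by the static feedback u_t = K x_t:
#   Phi_x(k) = (A + B K)^(k-1),  Phi_u(k) = K Phi_x(k),  k = 1, 2, ...
Acl = A + B @ K
Phi_x = [np.linalg.matrix_power(Acl, k - 1) for k in range(1, T + 1)]
Phi_u = [K @ Px for Px in Phi_x]

# Check the affine SLS constraint (2.4):
#   Phi_x(1) = I  and  Phi_x(k+1) = A Phi_x(k) + B Phi_u(k).
assert np.allclose(Phi_x[0], np.eye(n))
for k in range(T - 1):
    assert np.allclose(Phi_x[k + 1], A @ Phi_x[k] + B @ Phi_u[k])
```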
As we work with infinite horizon problems, it is notationally more convenient to work with transfer function representations of the above objects, obtained by taking the $z$-transform of their time-domain representations. The frequency domain variable $z$ can be informally thought of as the time-shift operator, i.e., $z\{x_0, x_1, x_2, \dots\} = \{x_1, x_2, x_3, \dots\}$, allowing for a compact representation of LTI dynamics. We use boldface letters to denote such transfer functions, e.g., $\mathbf{\Phi}_x(z) = \sum_{k=1}^{\infty} \Phi_x(k) z^{-k}$. Then, the constraints (2.4) can be rewritten as
$$\begin{bmatrix} zI - A_\star & -B_\star \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I,$$
and the corresponding (not necessarily static) control law is given by $\mathbf{K} = \mathbf{\Phi}_u \mathbf{\Phi}_x^{-1}$.
Although other approaches to optimal controller design exist, we argue now that the SLS parameterization has appealing properties when applied to the control of uncertain systems. In particular, suppose that rather than having access to the true system transition matrices $(A_\star, B_\star)$, we only have access to estimates $(\hat{A}, \hat{B})$. The SLS framework allows us to characterize the system responses achieved by a controller, computed using only the estimates $(\hat{A}, \hat{B})$, on the true system $(A_\star, B_\star)$. Specifically, if we denote $\Delta_A := \hat{A} - A_\star$ and $\Delta_B := \hat{B} - B_\star$, simple algebra shows that whenever $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ satisfy the constraints (2.4) for the estimated system $(\hat{A}, \hat{B})$,
$$\begin{bmatrix} zI - A_\star & -B_\star \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I + \hat{\mathbf{\Delta}}, \qquad \hat{\mathbf{\Delta}} := \begin{bmatrix} \Delta_A & \Delta_B \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix}.$$
Theorem A.2 shows that if $(I + \hat{\mathbf{\Delta}})^{-1}$ exists, then the controller $\mathbf{K} = \mathbf{\Phi}_u \mathbf{\Phi}_x^{-1}$, computed using only the estimates $(\hat{A}, \hat{B})$, achieves the following response on the true system $(A_\star, B_\star)$:
$$\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} = \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} (I + \hat{\mathbf{\Delta}})^{-1} \mathbf{w}.$$
Further, if $\mathbf{K}$ stabilizes the estimated system $(\hat{A}, \hat{B})$, and $(I + \hat{\mathbf{\Delta}})^{-1}$ is stable (simple sufficient conditions can be derived to ensure this, see [8]), then $\mathbf{K}$ is also stabilizing for the true system. It is this transparency between system uncertainty and controller performance that we exploit in our algorithm.
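The practical consequence is that a controller synthesized from sufficiently accurate estimates remains stabilizing on the true system. A small numerical illustration (certainty-equivalent LQR on perturbed estimates, with hypothetical matrices; this is not the robust synthesis used by our algorithm):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

rng = np.random.default_rng(2)
A_true = np.array([[1.01, 0.01], [0.01, 1.01]])
B_true = np.eye(2)

# Perturbed estimates standing in for (A_hat, B_hat).
eps = 0.02
A_hat = A_true + eps * rng.standard_normal((2, 2))
B_hat = B_true + eps * rng.standard_normal((2, 2))

# Design an LQR gain using only the estimates.
P = solve_discrete_are(A_hat, B_hat, np.eye(2), np.eye(2))
K = -np.linalg.solve(np.eye(2) + B_hat.T @ P @ B_hat, B_hat.T @ P @ A_hat)

# Stable on the nominal model by construction; for small estimation
# error the same gain also stabilizes the true system.
rho_nominal = max(abs(np.linalg.eigvals(A_hat + B_hat @ K)))
rho_true = max(abs(np.linalg.eigvals(A_true + B_true @ K)))
```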
We end this discussion with the definition of a function space that we use extensively throughout. For $C \ge 1$ and $\rho \in (0, 1)$, let $\mathcal{S}(C, \rho)$ denote the set of strictly proper transfer functions $\mathbf{M} = \sum_{k=1}^{\infty} M(k) z^{-k}$ satisfying $\|M(k)\|_2 \le C \rho^k$ for all $k \ge 1$. The space $\mathcal{S}(C, \rho)$ thus consists of stable transfer functions that satisfy a prescribed geometric decay rate in the spectral norm of their impulse response elements. We denote the restriction of $\mathcal{S}(C, \rho)$ to the space of length-$F$ finite impulse response (FIR) filters by $\mathcal{S}_F(C, \rho)$, i.e., $\mathbf{M} \in \mathcal{S}_F(C, \rho)$ if $\mathbf{M} \in \mathcal{S}(C, \rho)$ and $M(k) = 0$ for all $k > F$.
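The decay condition defining this space is easy to check numerically for a concrete closed-loop matrix; a small sketch (the matrix and rate are hypothetical):

```python
import numpy as np

# A stable (Schur) closed-loop matrix: its powers decay geometrically,
# so the induced response lies in S(C, rho) for suitable C and rho.
Acl = np.array([[0.5, 0.4], [0.0, 0.6]])
rho = 0.9  # any rate strictly between the spectral radius and 1 works

norms = [np.linalg.norm(np.linalg.matrix_power(Acl, k), 2) for k in range(50)]
C = max(nk / rho**k for k, nk in enumerate(norms))  # smallest valid constant

# Membership check: ||Acl^k||_2 <= C * rho^k for every k computed.
ok = all(nk <= C * rho**k + 1e-12 for k, nk in enumerate(norms))
```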
We equip $\mathcal{S}(C, \rho)$ with the $\mathcal{H}_\infty$ and $\mathcal{H}_2$ norms, which are infinite horizon analogs of the spectral and Frobenius norms of a matrix, respectively:
$$\|\mathbf{M}\|_{\mathcal{H}_\infty} = \sup_{|z| = 1} \|M(z)\|_2 \quad \text{and} \quad \|\mathbf{M}\|_{\mathcal{H}_2} = \left( \frac{1}{2\pi} \int_0^{2\pi} \|M(e^{j\theta})\|_F^2 \, d\theta \right)^{1/2}.$$
The $\mathcal{H}_\infty$ and $\mathcal{H}_2$ norms have distinct interpretations. The $\mathcal{H}_\infty$ norm of a system is equal to its $\ell_2 \mapsto \ell_2$ operator norm, and can be used to measure the robustness of a system to unmodelled dynamics [26]. The $\mathcal{H}_2$ norm has a direct interpretation as the energy transferred to the system by a white noise process, and is hence closely related to the LQR optimal control problem. Unsurprisingly, the $\mathcal{H}_2$ norm appears in the objective function of our optimization problem, whereas the $\mathcal{H}_\infty$ norm appears in the constraints to ensure robust stability and performance.

3 Algorithm and Guarantees
Our proposed robust adaptive control algorithm for LQR is shown in Algorithm 1. We note that while Line 9 of Algorithm 1 is written as an infinite-dimensional optimization problem, because of the FIR nature of the decision variables it can be equivalently written as a finite-dimensional semidefinite program. We describe this transformation in Section G.3 of the Appendix.
Some remarks on practice are in order. First, in Line 7, only the trajectory data collected during the $i$-th epoch is used for the least squares estimate. Second, the epoch lengths we use grow exponentially in the epoch index. These settings are chosen primarily to simplify the analysis; in practice, all of the data collected should be used, and it may be preferable to use a slower growing epoch schedule. Finally, for storage considerations, instead of performing a batch least squares update of the model, a recursive least squares (RLS) estimator can be used to update the parameters in an online manner.
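Such an RLS update can be sketched as follows; this is a generic textbook recursion for the regression $x_{t+1} \approx [A, B]\,[x_t; u_t]$, not the exact estimator analyzed in this paper, and the system below is a hypothetical one used only to exercise it:

```python
import numpy as np

class RLSEstimator:
    """Recursive least squares for x_{t+1} = Theta @ [x_t; u_t] + w_t."""

    def __init__(self, n, d, lam=1e-3):
        p = n + d
        self.Theta = np.zeros((n, p))   # current estimate of [A, B]
        self.Pinv = lam * np.eye(p)     # regularized Gram matrix

    def update(self, x, u, x_next):
        z = np.concatenate([x, u])      # regressor [x_t; u_t]
        self.Pinv += np.outer(z, z)
        err = x_next - self.Theta @ z
        # Exact recursive form of regularized batch least squares.
        self.Theta += np.outer(err, np.linalg.solve(self.Pinv, z))

# Hypothetical stable system used only for the demonstration.
rng = np.random.default_rng(1)
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
est = RLSEstimator(n=2, d=1)
x = np.zeros(2)
for _ in range(5000):
    u = rng.standard_normal(1)
    x_next = A @ x + B @ u + 0.1 * rng.standard_normal(2)
    est.update(x, u, x_next)
    x = x_next
```

The recursion maintains the regularized Gram matrix explicitly; a production implementation would instead propagate its inverse via the Sherman-Morrison update for constant per-step cost.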
3.1 Regret Upper Bounds
Our guarantees for Algorithm 1 are stated in terms of certain system-specific constants, which we define here. We let $K_\star$ denote the static feedback solution to the LQR problem for $(A_\star, B_\star, Q, R)$. Next, we define $(C_\star, \rho_\star)$ such that the closed loop system response induced by $K_\star$ on $(A_\star, B_\star)$ belongs to $\mathcal{S}(C_\star, \rho_\star)$. Our main assumption is stated as follows.
Assumption 3.1.
We are given a controller $K^{(0)}$ that stabilizes the true system $(A_\star, B_\star)$. Furthermore, letting $(\mathbf{\Phi}_x^{(0)}, \mathbf{\Phi}_u^{(0)})$ denote the response of $K^{(0)}$ on $(A_\star, B_\star)$, we assume that this response satisfies the norm and decay bounds required by Algorithm 1, where the corresponding constants are defined in Algorithm 1.
The requirement of an initial stabilizing controller is not restrictive; Dean et al. [8] provide an offline strategy for finding such a controller. Furthermore, in practice Algorithm 1 can be initialized with no controller, with random inputs applied to the system in the first epoch in order to estimate $(A_\star, B_\star)$ within an initial confidence set for which the synthesis problem becomes feasible.
Our first guarantee concerns the rate of estimation of $(A_\star, B_\star)$ as the algorithm progresses through time. This result builds on recent progress [23] in estimation along trajectories of a linear dynamical system. In what follows, the notation $\tilde{O}(\cdot)$ hides absolute constants and polylogarithmic factors.
Theorem 3.2.
Theorem 3.2 shows that Algorithm 1 achieves a consistent estimate of the true dynamics $(A_\star, B_\star)$, and learns at a rate of $\tilde{O}(T^{-1/3})$. We note that consistency of parameter estimates is not a guarantee provided by OFU or TS based approaches.
Next, we state an upper bound on the regret incurred by Algorithm 1.
Theorem 3.3.
The intuition behind our proof is transparent. We use SLS to show that the cost during epoch $i$ is bounded by roughly $(1 + O(\epsilon_i)) J_\star + O(\sigma_{\eta,i}^2)$, where the first factor captures the performance degradation incurred by model uncertainty $\epsilon_i$, and the second term is the additional cost incurred from injecting exploration noise with variance $\sigma_{\eta,i}^2$. Hence, the regret incurred during this epoch is $\tilde{O}(T_i(\epsilon_i + \sigma_{\eta,i}^2))$. We then bound the estimation error by $\epsilon_i \le \tilde{O}(1/(\sigma_{\eta,i}\sqrt{T_i}))$. Choosing $\sigma_{\eta,i}^2 \propto T_i^{-1/3}$ to balance these competing powers of $T_i$, we obtain the per-epoch bound $\tilde{O}(T_i^{2/3})$. Summing over the logarithmic number of epochs, we obtain a final regret of $\tilde{O}(T^{2/3})$.
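The exploration-exploitation balance above can be checked numerically. Under the stylized per-epoch bound $T(\sigma^2 + \epsilon)$ with $\epsilon \approx 1/(\sigma\sqrt{T})$, the minimizing variance scales as $T^{-1/3}$ and the resulting bound as $T^{2/3}$:

```python
import numpy as np

def epoch_regret(sigma_sq, T):
    # Stylized bound: exploration cost T * sigma^2 plus model-error
    # cost T * eps, with eps ~ 1 / (sigma * sqrt(T)).
    return T * sigma_sq + T / (np.sqrt(sigma_sq) * np.sqrt(T))

Ts = np.array([1e4, 1e6, 1e8])
opt_var, opt_reg = [], []
for T in Ts:
    grid = np.logspace(-6, 0, 4000)
    vals = epoch_regret(grid, T)
    i = np.argmin(vals)
    opt_var.append(grid[i])
    opt_reg.append(vals[i])

# Log-log slopes: the optimal variance scales like T^(-1/3)
# and the resulting regret bound like T^(2/3).
slope_var = np.polyfit(np.log(Ts), np.log(opt_var), 1)[0]
slope_reg = np.polyfit(np.log(Ts), np.log(opt_reg), 1)[0]
```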
The main difficulty in the proof is ensuring that the transient behavior of the resulting controllers is uniformly bounded when applied to the true system. Prior works sidestep this issue by assuming that the true dynamics lie within a (known) compact set, for which the Heine-Borel theorem asserts the existence of finite constants that capture this behavior. We go a step further and work through the perturbation analysis, which allows us to give a regret bound that depends only on simple quantities of the true system $(A_\star, B_\star)$. The full proof is given in the appendix.
Finally, we remark that the dependence on certain system constants in our results is an artifact of our perturbation analysis, and we leave sharpening this dependence to future work.
3.2 Regret Lower Bounds and Parameter Estimation Rates
We saw that Algorithm 1 achieves $\tilde{O}(T^{2/3})$ regret with high probability. Now we provide a matching algorithmic lower bound on the expected regret, showing that the analysis presented in Section 3.1 is sharp as a function of $T$. Moreover, our lower bound characterizes how much regret must be accrued in order to achieve a specified estimation rate for the system parameters $(A_\star, B_\star)$.
Theorem 3.4.
Let the initial state $x_0$ be distributed according to the steady state distribution of the optimal closed loop system, and let $\{u_t\}$ be any sequence of inputs as in Section 2. Furthermore, let $f(T)$ be any function such that with high probability we have
(3.1) $\lambda_{\min}\left( \sum_{t=0}^{T} \begin{bmatrix} x_t \\ u_t \end{bmatrix} \begin{bmatrix} x_t \\ u_t \end{bmatrix}^\top \right) \ge f(T).$
Then, there exist positive values $T_0$ and $c_0$ such that for all $T \ge T_0$ the expected regret is lower bounded in terms of $f(T)$, where $T_0$ and $c_0$ are functions of the system and noise parameters, detailed in Appendix E.
The proof of the estimation error bound in Theorem 3.2 shows that Algorithm 1 satisfies Eq. (3.1) with $f(T) = \tilde{\Omega}(T^{2/3})$. Since the exploration variance used by Algorithm 1 during the $i$-th epoch scales as $T_i^{-1/3}$, we obtain the following corollary, which demonstrates the sharpness of our regret analysis with respect to the scaling of $T$.

Corollary 3.5.
For $T$ sufficiently large, the expected regret of Algorithm 1 satisfies $\mathbb{E}[\mathrm{Regret}(T)] = \tilde{\Omega}(T^{2/3})$.
A natural question to ask is how much regret any algorithm must accrue in order to achieve a specified estimation error $\epsilon$ for $A_\star$ and $B_\star$. From Theorem 3.2 we know that Algorithm 1 estimates $(A_\star, B_\star)$ at rate $\tilde{O}(T^{-1/3})$. Therefore, in order to achieve $\epsilon$ estimation error, $T$ must be $\tilde{\Omega}(\epsilon^{-3})$. Hence, Theorem 3.3 implies that the regret incurred by Algorithm 1 to achieve $\epsilon$ estimation error is $\tilde{O}(\epsilon^{-2})$.
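The exponent bookkeeping in this conversion can be made explicit; a small check, under the stylized relation that $O(T^a)$ regret permits estimation error $O(T^{-a/2})$ at best:

```python
from fractions import Fraction as F

# Regret exponent a means Regret ~ T^a; estimation exponent b means
# error ~ T^(-b). Reaching error eps needs T = eps^(-1/b), so the
# regret paid is T^a = eps^(-a/b).
def regret_exponent_to_reach_error(a, b):
    return F(a) / F(b)

# Algorithm 1: a = 2/3, b = 1/3  ->  regret eps^(-2) to reach error eps.
assert regret_exponent_to_reach_error(F(2, 3), F(1, 3)) == 2
# An algorithm with a = 1/2 would estimate at rate b = a/2 = 1/4,
# and would also pay eps^(-2) regret to reach error eps.
assert regret_exponent_to_reach_error(F(1, 2), F(1, 4)) == 2
```

The invariance of the $\epsilon^{-2}$ cost across regret exponents is exactly what makes the "optimal up to logarithmic factors" claim below plausible.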
Interestingly, consider any other algorithm achieving $O(T^\alpha)$ regret for some $0 < \alpha < 1$. Then, Theorem 3.4 suggests that the best estimation rate achievable by such an algorithm is $O(T^{-\alpha/2})$, since the minimum eigenvalue condition Eq. (3.1) governs the signal-to-noise ratio. In the case of linear regression with independent data, it is known that the minimax estimation rate is lower bounded by the square root of the inverse of the minimum eigenvalue appearing in Eq. (3.1). We conjecture that the same result holds in our case. Therefore, to achieve $\epsilon$ estimation error, any such algorithm would likely require $\Omega(\epsilon^{-2})$ regret, showing that Algorithm 1 is optimal up to logarithmic factors in this sense. Finally, we note that while Algorithm 1 estimates $(A_\star, B_\star)$ at a rate $\tilde{O}(T^{-1/3})$, Theorem 3.4 suggests that any algorithm achieving the optimal $\tilde{O}(\sqrt{T})$ regret would estimate $(A_\star, B_\star)$ at a rate no faster than $\Omega(T^{-1/4})$.

4 Experiments
Regret Comparison.
We illustrate the performance of several adaptive schemes empirically. We compare the proposed robust adaptive method with non-Bayesian Thompson sampling (TS) as in Abeille and Lazaric [5] and a heuristic projected gradient descent (PGD) implementation of OFU. As a simple baseline, we use the nominal control method, which synthesizes the optimal infinite-horizon LQR controller for the estimated system and injects noise with the same schedule as the robust approach. Implementation details and computational considerations for all adaptive methods are in Appendix G.
The comparison experiments are carried out on the following LQR problem:
(4.1) $A_\star = \begin{bmatrix} 1.01 & 0.01 & 0 \\ 0.01 & 1.01 & 0.01 \\ 0 & 0.01 & 1.01 \end{bmatrix}, \qquad B_\star = I, \qquad Q = 10\, I, \qquad R = I.$
This system corresponds to a marginally unstable Laplacian system where adjacent nodes are weakly connected; these dynamics were also studied by [4, 8, 24]. The cost is such that input size is penalized relatively less than state. This problem setting is amenable to robust methods due to both the cost ratio and the marginal instability, which are factors that may hurt optimistic methods. In Appendix H.1, we show similar results for an unstable system with large transients.
To standardize the initialization of the various adaptive methods, we use an initial rollout during which the input is given by a stabilizing controller plus Gaussian noise with fixed variance. This trajectory is not counted towards the regret, but the recorded states and inputs are used to initialize parameter estimates. In each experiment, the system starts from $x_0 = 0$ to reduce variance over runs. For all methods, the actual estimation errors $\epsilon_A = \|\hat{A} - A_\star\|_2$ and $\epsilon_B = \|\hat{B} - B_\star\|_2$ are used rather than bounds or bootstrapped estimates. The effect of this choice on regret is small, as examined empirically in Appendix H.2.
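The initialization described above amounts to an ordinary least squares fit on the rollout data; a minimal sketch (the gain, rollout length, and noise levels here are illustrative, not the settings used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, T0 = 3, 3, 500
A = np.array([[1.01, 0.01, 0.00],
              [0.01, 1.01, 0.01],
              [0.00, 0.01, 1.01]])
B = np.eye(3)
K0 = -0.5 * np.eye(3)  # hypothetical stabilizing initial gain

# Roll out with u_t = K0 x_t + Gaussian excitation.
X = np.zeros((T0 + 1, n))
U = np.zeros((T0, d))
for t in range(T0):
    U[t] = K0 @ X[t] + rng.standard_normal(d)
    X[t + 1] = A @ X[t] + B @ U[t] + rng.standard_normal(n)

# Least squares on x_{t+1} ~ [A, B] [x_t; u_t].
Z = np.hstack([X[:-1], U])
Theta, *_ = np.linalg.lstsq(Z, X[1:], rcond=None)
A_hat, B_hat = Theta.T[:, :n], Theta.T[:, n:]
```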
The performance of the various adaptive methods is compared in Figure 1. The median and 90th percentile regret over 500 instances is displayed in Figure 1a, which gives an idea of both typical and worst-case behavior. The regret of the optimal LQR controller for the true system is displayed as a baseline. Overall, the methods have very similar performance. One benefit of robustness is the guaranteed stability and bounded infinite-horizon cost at every point during operation. In Figure 1b, this infinite-horizon LQR cost is plotted for the controllers played during each epoch. This value measures the cost of using each epoch's controller indefinitely, rather than continuing to update its parameters. The robust adaptive method performs relatively better than the other adaptive algorithms, indicating that it is more amenable to early stopping, i.e., to turning off the adaptive component of the algorithm and playing the current controller indefinitely.
Extension to Uncertain Environment with State Constraints.
The proposed robust adaptive method naturally generalizes beyond the standard LQR problem. We consider a disturbance forecasting example which incorporates environmental uncertainty and safety constraints. Consider a system with known dynamics driven by stochastic disturbances that are now correlated in time. We model the disturbance process as the output of an unknown autonomous LTI system, as illustrated in Figure 1(a). This setting can be interpreted as a demand forecasting problem, where, for example, the system is a server farm and the disturbances represent changes in the amount of incoming jobs. If the dynamics of the correlated disturbance process are known, this knowledge can be used for more cost-effective temperature control.
We let the system with known dynamics be described by the graph Laplacian dynamics as in Eq. (4.1). The disturbance dynamics are unknown and are governed by a stable system transition matrix $A_d$, resulting in the following dynamics for the full system:
$$\begin{bmatrix} x_{t+1} \\ d_{t+1} \end{bmatrix} = \begin{bmatrix} A & I \\ 0 & A_d \end{bmatrix} \begin{bmatrix} x_t \\ d_t \end{bmatrix} + \begin{bmatrix} B \\ 0 \end{bmatrix} u_t + \begin{bmatrix} 0 \\ w_t \end{bmatrix}.$$
The costs are set to model expensive inputs. The controller synthesis problem in Line 9 of Algorithm 1 is modified to reflect the problem structure, and crucially, we add a constraint on the system response $\mathbf{\Phi}_x$. Further details of the formulation are given in Appendix H.3. Figure 1(b) illustrates the effect: while the unconstrained synthesis results in trajectories with large state values, the constrained synthesis results in much more moderate behavior.
5 Conclusions and Future Work
We presented a polynomial-time algorithm for the adaptive LQR problem that provides high-probability guarantees of sublinear regret. In contrast to other approaches to this problem, our robust adaptive method guarantees stability, robust performance, and parameter estimation. We also explored the interplay between regret minimization and parameter estimation, identifying fundamental limits connecting the two.
Several questions remain to be answered. It is an open question whether a polynomial-time algorithm can achieve a regret of $\tilde{O}(\sqrt{T})$. In our implementation of OFU, we observed that PGD performed quite effectively. Interesting future work is to see if the techniques of Fazel et al. [12] for policy gradient optimization on LQR can be applied to prove convergence of PGD on the OFU subroutine, which would provide an optimal polynomial-time algorithm. Moreover, we observed that OFU and TS methods in practice gave estimates of the system parameters that were comparable with our method, which explicitly adds excitation noise. It seems that the switching of control policies at epoch boundaries provides more excitation for system identification than is currently understood by the theory. Furthermore, practical issues that remain to be addressed include satisfying safety constraints and dealing with nonlinear dynamics; in both settings, finite-sample parameter estimation/system identification and adaptive control remain an open problem.
Acknowledgments
SD is supported by an NSF Graduate Research Fellowship. As part of the RISE lab, HM is generally supported in part by NSF CISE Expeditions Award CCF1730628, DHS Award HSHQDC16300083, and gifts from Alibaba, Amazon Web Services, Ant Financial, CapitalOne, Ericsson, GE, Google, Huawei, Intel, IBM, Microsoft, Scotiabank, Splunk and VMware. BR is generously supported in part by NSF award CCF1359814, ONR awards N000141712191, N000141712401, and N000141712502, the DARPA Fundamental Limits of Learning (Fun LoL) and Lagrange Programs, and an Amazon AWS AI Research Award.
References
 Abbasi-Yadkori [2012] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. PhD thesis, University of Alberta, 2012.
 Abbasi-Yadkori and Szepesvári [2011] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret Bounds for the Adaptive Control of Linear Quadratic Systems. In Conference on Learning Theory, 2011.

 Abbasi-Yadkori and Szepesvári [2015] Yasin Abbasi-Yadkori and Csaba Szepesvári. Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm. In Conference on Uncertainty in Artificial Intelligence, 2015.
 Abbasi-Yadkori et al. [2018] Yasin Abbasi-Yadkori, Nevena Lazic, and Csaba Szepesvári. Regret Bounds for Model-Free Linear Quadratic Control. arXiv:1804.06021, 2018.
 Abeille and Lazaric [2017] Marc Abeille and Alessandro Lazaric. Thompson Sampling for Linear-Quadratic Control Problems. In AISTATS, 2017.
 Anderson and Matni [2017] James Anderson and Nikolai Matni. Structured State Space Realizations for SLS Distributed Controllers. In Allerton, 2017.
 Bittanti and Campi [2006] S. Bittanti and M. C. Campi. Adaptive control of linear time invariant systems: the “bet on the best” principle. Communications in Information and Systems, 6(4), 2006.
 Dean et al. [2017] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the Sample Complexity of the Linear Quadratic Regulator. arXiv:1710.01688, 2017.

 Diamond and Boyd [2016] Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83), 2016.
 Dumitrescu [2007] Bogdan Dumitrescu. Positive Trigonometric Polynomials and Signal Processing Applications. 2007.
 Faradonbeh et al. [2017] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite Time Analysis of Optimal Adaptive Policies for Linear-Quadratic Systems. arXiv:1711.07230, 2017.
 Fazel et al. [2018] Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global Convergence of Policy Gradient Methods for Linearized Control Problems. arXiv:1801.05039, 2018.
 Fiechter [1997] ClaudeNicolas Fiechter. PAC Adaptive Control of Linear Systems. In Conference on Learning Theory, 1997.

 Ibrahimi et al. [2012] Morteza Ibrahimi, Adel Javanmard, and Benjamin Van Roy. Efficient Reinforcement Learning for High Dimensional Linear Quadratic Systems. In Neural Information Processing Systems, 2012.
 Ioannou and Sun [1996] Petros A. Ioannou and Jing Sun. Robust Adaptive Control, volume 1. PTR Prentice-Hall, Upper Saddle River, NJ, 1996.
 Krstic et al. [1995] Miroslav Krstic, Ioannis Kanellakopoulos, and Peter V Kokotovic. Nonlinear and adaptive control design. Wiley, 1995.
 Lincoln and Rantzer [2006] Bo Lincoln and Anders Rantzer. Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8):1249–1260, 2006.
 Matni et al. [2017] Nikolai Matni, YuhShyang Wang, and James Anderson. Scalable system level synthesis for virtually localizable systems. In IEEE Conference on Decision and Control, 2017.
 O’Donoghue et al. [2016] Brendan O’Donoghue, Eric Chu, Neal Parikh, and Stephen Boyd. Conic Optimization via Operator Splitting and Homogeneous Self-Dual Embedding. Journal of Optimization Theory and Applications, 169(3), 2016.
 Osband and Van Roy [2016] Ian Osband and Benjamin Van Roy. Posterior Sampling for Reinforcement Learning Without Episodes. arXiv:1608.02731, 2016.
 Ouyang et al. [2017] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learningbased Control of Unknown Linear Systems with Thompson Sampling. arXiv:1709.04047, 2017.
 Rudelson and Vershynin [2011] Mark Rudelson and Roman Vershynin. Hanson-Wright inequality and sub-gaussian concentration. Electronic Communications in Probability, 18(82), 2011.
 Simchowitz et al. [2018] Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning Without Mixing: Towards A Sharp Analysis of Linear System Identification. arXiv:1802.08334, 2018.
 Tu and Recht [2017] Stephen Tu and Benjamin Recht. Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator. arXiv:1712.08642, 2017.
 Wang et al. [2016] YuhShyang Wang, Nikolai Matni, and John C Doyle. A System Level Approach to Controller Synthesis. arXiv:1610.04815, 2016.
 Zhou et al. [1995] K. Zhou, J. C. Doyle, and K. Glover. Robust and Optimal Control. 1995.
Appendix A Background on System Level Synthesis
We begin by defining two function spaces which we use extensively throughout:
(A.1) $\mathcal{RH}_\infty(C, \rho) := \left\{ \mathbf{M} = \sum_{k=0}^{\infty} M(k) z^{-k} \;:\; \|M(k)\|_2 \le C\rho^k, \; k \ge 0 \right\},$
(A.2) $\frac{1}{z}\mathcal{RH}_\infty(C, \rho) := \left\{ \mathbf{M} \;:\; z\mathbf{M} \in \mathcal{RH}_\infty(C, \rho) \right\}.$
Note that we use $\mathcal{S}(C, \rho)$ to denote the strictly proper space $\frac{1}{z}\mathcal{RH}_\infty(C, \rho)$ in the main body of the text.
Recall that our main object of interest is the system
$$x_{t+1} = A x_t + B u_t + w_t,$$
and our goal is to design an LTI feedback control policy $\mathbf{u} = \mathbf{K}\mathbf{x}$ such that the resulting closed loop system is stable. For a given $\mathbf{K}$, we refer to the closed loop transfer functions from $\mathbf{w} \mapsto \mathbf{x}$ and from $\mathbf{w} \mapsto \mathbf{u}$ as the system response. Symbolically, we denote these maps as $\mathbf{\Phi}_x$ and $\mathbf{\Phi}_u$. Simple algebra shows that given $\mathbf{K}$, these maps take on the form
(A.3) $\mathbf{\Phi}_x = (zI - A - B\mathbf{K})^{-1}, \qquad \mathbf{\Phi}_u = \mathbf{K}\mathbf{\Phi}_x.$
We then have the following theorem parameterizing the set of such stable closedloop transfer functions that are achievable by a stabilizing controller .
Theorem A.1 (State-Feedback Parameterization [25]).
The following are true:

The affine subspace defined by
(A.4) $\begin{bmatrix} zI - A & -B \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I, \qquad \mathbf{\Phi}_x, \mathbf{\Phi}_u \in \frac{1}{z}\mathcal{RH}_\infty,$
parameterizes all system responses (A.3) from $\mathbf{w}$ to $(\mathbf{x}, \mathbf{u})$ achievable by an internally stabilizing state-feedback controller $\mathbf{K}$.
If $\mathbf{K}$ stabilizes $(A, B)$, then the LQR cost of $\mathbf{K}$ on $(A, B)$ can be written by Parseval's identity as
(A.5) $J(A, B, \mathbf{K}) = \sigma_w^2 \left\| \begin{bmatrix} Q^{1/2} & 0 \\ 0 & R^{1/2} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2}^2,$
where $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ is the response of $\mathbf{K}$ on $(A, B)$.
More generally, we define $J(A, B, \mathbf{K}; \mathbf{W})$ to be the LQR cost when the process noise is driven through the filter $\mathbf{W}$. When we omit the last argument, we mean $\mathbf{W} = I$, i.e., $J(A, B, \mathbf{K}) = J(A, B, \mathbf{K}; I)$.
In [8], the authors use SLS to study how uncertainty in the true parameters affects the LQR objective cost. Our analysis relies on these tools, which we briefly describe below.
The starting point for the theory is a characterization of all robustly stabilizing controllers.
Theorem A.2 ([18]).
Suppose that the transfer matrices $(\mathbf{\Phi}_x, \mathbf{\Phi}_u) \in \frac{1}{z}\mathcal{RH}_\infty$ satisfy
(A.6) $\begin{bmatrix} zI - \hat{A} & -\hat{B} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I.$
Then the controller $\mathbf{K} = \mathbf{\Phi}_u \mathbf{\Phi}_x^{-1}$ stabilizes the system described by $(A, B)$ if and only if $(I + \hat{\mathbf{\Delta}})^{-1} \in \mathcal{RH}_\infty$, where $\hat{\mathbf{\Delta}} := \begin{bmatrix} \hat{A} - A & \hat{B} - B \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix}$. Furthermore, the resulting system response is given by
(A.7) $\begin{bmatrix} \mathbf{x} \\ \mathbf{u} \end{bmatrix} = \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} (I + \hat{\mathbf{\Delta}})^{-1} \mathbf{w}.$
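The response formula (A.7) can be verified numerically for a static controller by comparing impulse response coefficients; a small check with hypothetical matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
T = 15
A = np.array([[0.7, 0.2], [0.0, 0.6]])
B = np.eye(2)
A_hat = A + 0.05 * rng.standard_normal((2, 2))
B_hat = B + 0.05 * rng.standard_normal((2, 2))
K = -0.3 * np.eye(2)  # hypothetical static controller

DA, DB = A_hat - A, B_hat - B
Acl_hat, Acl = A_hat + B_hat @ K, A + B @ K

# Impulse responses: Phi_x(k+1) = Acl_hat^k on the nominal model,
# Delta_hat(k+1) = (DA + DB K) Acl_hat^k, and Acl^k on the true model.
Phi_hat = [np.linalg.matrix_power(Acl_hat, k) for k in range(T)]
Dlt = [(DA + DB @ K) @ P for P in Phi_hat]
Phi_true = [np.linalg.matrix_power(Acl, k) for k in range(T)]

# (A.7) rearranged: Phi_x = (true response) * (I + Delta_hat), i.e.
# each nominal coefficient equals the convolution sum below.
for k in range(T):
    conv = Phi_true[k] + sum(Phi_true[k - j] @ Dlt[j - 1]
                             for j in range(1, k + 1))
    assert np.allclose(Phi_hat[k], conv)
```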
This robustness result is used to derive a cost perturbation result for LQR.
Lemma A.3 ([8]).
Let the controller stabilize and be its corresponding system response on system . Then if stabilizes , it achieves the following LQR cost
(A.8) 
Furthermore, letting
(A.9) 
a sufficient condition for to stabilize is that . An upper bound on is given by, for any ,
(A.10) 
where we assume that and .
Appendix B Synthesis Results
We first study the following infinite-dimensional synthesis problem:
(B.1) $\min_{\gamma \in [0,1)} \; \frac{1}{1-\gamma} \; \min_{\mathbf{\Phi}_x, \mathbf{\Phi}_u} \; \left\| \begin{bmatrix} Q^{1/2} & 0 \\ 0 & R^{1/2} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_2} \;\; \text{s.t.} \;\; \begin{bmatrix} zI - \hat{A} & -\hat{B} \end{bmatrix} \begin{bmatrix} \mathbf{\Phi}_x \\ \mathbf{\Phi}_u \end{bmatrix} = I, \;\; \left\| \begin{bmatrix} \frac{\epsilon_A}{\sqrt{\alpha}} \mathbf{\Phi}_x \\ \frac{\epsilon_B}{\sqrt{1-\alpha}} \mathbf{\Phi}_u \end{bmatrix} \right\|_{\mathcal{H}_\infty} \le \gamma.$
We will conduct our analysis assuming that this infinite-dimensional problem is solvable. Later on, we will show how to relax this problem to a finite-dimensional one via FIR truncation, and show the minor modifications needed to the analysis for the guarantees to hold.
We now prove a suboptimality guarantee on the solution to (B.1) which holds for certain choices of $\alpha$ and the coefficients $\epsilon_A$ and $\epsilon_B$. This result also establishes an important technical point, namely when the problem (B.1) is feasible.
Theorem B.1.
Let $J_\star$ denote the minimal LQR cost achievable by any controller for the dynamical system with transition matrices $(A_\star, B_\star)$, and let $K_\star$ denote its optimal static feedback controller. Suppose that the closed loop response of $K_\star$ lies in $\mathcal{S}(C_\star, \rho_\star)$ and that (without loss of generality) $C_\star \ge 1$. Suppose furthermore that the estimation error $\epsilon$ is small enough to satisfy the following conditions:
Let $(\hat{A}, \hat{B})$ be any estimates of the transition matrices such that $\max\{\|\hat{A} - A_\star\|_2, \|\hat{B} - B_\star\|_2\} \le \epsilon$. Then, if the parameters $\epsilon_A$, $\epsilon_B$, and $\alpha$ are set as,
we have that (a) the program (B.1) is feasible, (b) letting $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ denote an optimal solution to (B.1), the relative error in the LQR cost is
(B.2) 
and (c) if furthermore $\epsilon$ is sufficiently small, the response of the resulting controller on the true system $(A_\star, B_\star)$ satisfies
Proof.
The proof of (a) and (b) is nearly identical to that given in [8], which works by exhibiting a feasible response that gives the desired suboptimality guarantee. The only modification is that we need to find constants for which the relevant decay bounds hold. We do this by writing
By the definition of and our assumptions, we have that
This places us in a position to apply Lemma F.3, from which we conclude that
Now applying Lemma F.1 to , we conclude that
The claims of (a) and (b) now follow.
Now we turn to the proof of (c). Let $(\mathbf{\Phi}_x, \mathbf{\Phi}_u)$ be the solution to (B.1). We have that
We know that by the constraints of the optimization problem (B.1) and furthermore,
By assumption we have , from which we conclude using Lemma F.3 that
Furthermore, from Lemma F.1, we conclude that
The claim now follows by plugging in the values of the constants above. ∎
b.1 Suboptimality bounds for FIR truncated SLS
Optimization problem (B.1) is convex but infinite-dimensional, and as far as we are aware does not admit an efficient solution. In Algorithm 1, we instead propose solving the following FIR approximation to problem (B.1):
(B.3)  
where $F$ denotes the FIR truncation length used. This optimization problem can be posed as a finite-dimensional semidefinite program (see Section G.3). Let $\mathbf{K}$ denote the resulting controller. We begin with a lemma identifying conditions under which optimization problem (B.3) is feasible.
Lemma B.2.
Let the assumptions of Theorem B.1 hold, and further assume that