Safely Learning to Control the Constrained Linear Quadratic Regulator

by Sarah Dean, et al.
berkeley college

We study the constrained linear quadratic regulator with unknown dynamics, addressing the tension between safety and exploration in data-driven control techniques. We present a framework which allows for system identification through persistent excitation, while maintaining safety by guaranteeing the satisfaction of state and input constraints. This framework involves a novel method for synthesizing robust constraint-satisfying feedback controllers, leveraging newly developed tools from system level synthesis. We connect statistical results with cost sub-optimality bounds to give non-asymptotic guarantees on both estimation and controller performance.






1 Introduction

While data-driven design has considerable potential in contemporary control systems where precise modeling of the dynamics is intractable (e.g., systems with complex contact forces), one of the biggest hurdles to overcome for practical deployment is maintaining safe execution during the learning process.

Motivated by this issue, we study the data-driven design of a controller for the constrained Linear Quadratic Regulator (LQR) problem. In constrained LQR, we design a controller for a (potentially unknown) linear dynamical system that minimizes a given quadratic cost, subject to the additional requirement that both the state and input stay within a specified safe region. This is a problem that has received much attention within the model predictive control (MPC) community.

For the LQR problem with no constraints, a natural method of exploration for learning the dynamics is to excite the system by injecting white noise. When safety is not an issue, this method is effective, and recently Dean et al. [1] provided an end-to-end sample complexity guarantee for this “identify-then-control” scheme. However, this method fails to consider safety or constraint satisfaction.

We directly address the tension between exploration for learning and safety, which are fundamentally at odds. We do this by synthesizing a controller which simultaneously excites and regulates the system; we propose to learn by additively injecting bounded noise to the control inputs computed by a safe controller. By leveraging the recently developed system level synthesis (SLS) framework for control design [2], we give a computationally tractable algorithm which returns a controller that (a) guarantees the closed loop system remains within the specified constraint set and (b) ensures that enough noise can be injected into the system to obtain a statistical guarantee on learning. To the best of our knowledge, our algorithm is the first to simultaneously achieve both objectives. Furthermore, the controller synthesis is solved by a convex optimization problem whose feasibility is a certificate of safety, and no considerations of robust invariant sets are required.

Our second contribution is to provide a sub-optimality bound on control performance for constrained LQR. Using the same SLS framework, we quantify the excess cost incurred by playing a controller designed on the uncertain dynamics obtained from learning, in terms of both the size of the uncertainty sets and a type of constraint robustness margin of the optimal constrained controller for the true system. This allows us to provide the first end-to-end sample complexity guarantee for the control of constrained systems.

1.1 Related Work

Estimation and control of the unconstrained LQR problem has been studied in the non-asymptotic setting [3, 1]. However, the identification schemes rely on pure excitation and system restarts, which is unsuitable in the constrained setting. The online learning literature simultaneously considers learning and control, where strategies are based on optimism in the face of uncertainty (OFU) [4] or Thompson Sampling [5]. These approaches guarantee system estimation only up to optimal closed-loop equivalence, and do not consider safety. Building on a statistical result by Simchowitz et al. [6] which allows for non-asymptotic guarantees on parameter estimation from a single trajectory of a linear system, Dean et al. [7] provide a robust online method that guarantees parameter estimation and stability throughout.

The design of controllers that guarantee robust constraint satisfaction has long been considered in the context of model predictive control [8], including methods that model uncertainty in the dynamics directly [9], or model it as a bounded state disturbance for computational efficiency [10, 11]. Strategies for incorporating estimation of the dynamics include experiment-design inspired costs [12], decoupling learning from constraint satisfaction [13], and set-membership methods rather than parameter estimation [14]. Due to the receding horizon nature of model predictive controllers, this literature relies on set invariance theory for infinite horizon guarantees [15]. Our framework considers the infinite horizon problem directly, and therefore we do not require computation of invariant sets.

Finally, the machine-learning community has begun to consider safety in reinforcement learning, where much work positions itself as being for general dynamical systems in lieu of providing statistical guarantees [16, 17, 18, 19]. Some works assume the existence of an initial safe controller for learning [20], and robust MPC methods have been proposed to modify potentially unsafe learning inputs [21]. Our framework gives an alternative procedure for designing such a controller using coarse system estimates. Most similar to this work is that of Lu et al. [22], who propose a method to allow excitation on top of a safe controller, but consider only finite-time safety and require non-convex optimization to obtain formal guarantees.

2 Problem Setting and Preliminaries

We fix an underlying linear dynamical system , with full state observation, initial condition , sequence of inputs , and disturbance process . The dynamics matrices are unknown. For estimates of the system , define for , and similarly for and .

We assume some prior knowledge is given in the form of initial estimates and uncertainty measures . We note that the initial estimates may be coarse grained, and the goal of the learning procedure will be to refine this uncertainty prior to optimal control design.

2.1 System Level Synthesis

Many approaches to optimal control for systems with constraints involve receding horizon control, where an open loop finite-time trajectory is computed at each timestep; indeed, parameterizing optimal control problems by a state feedback controller generally leads to nonconvex optimization. Instead, we can parametrize the problem in terms of convolution with the closed-loop system response,


where we defined the fixed initial condition. The system level synthesis (SLS) framework shows that for any elements constrained to obey, for all ,

there exists a controller that achieves the desired system responses (2.1). The state-feedback parameterization result in Theorem 1 of Wang et al. [2] formalizes this observation, and therefore any optimal control problem over linear systems can be cast as an optimization problem over system response elements. We use boldface letters to denote transfer functions, e.g. and signals, . The affine constraints can be rewritten as

and the corresponding control law is given by .
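As a rough numerical illustration of this parameterization (a sketch under stated assumptions, not the paper's synthesis procedure), the snippet below takes a static gain K, forms the closed-loop impulse responses Φx(k) = (A + BK)^(k-1) and Φu(k) = K Φx(k), and checks two things: that they satisfy the standard SLS affine recursion Φx(1) = I, Φx(k+1) = A Φx(k) + B Φu(k), and that convolving them with the disturbance reproduces a simulated trajectory. The matrices A, B, K, the horizon, the disturbance bound, and the indexing convention (the initial condition is treated as the disturbance at time -1) are all assumptions chosen for illustration.

```python
import numpy as np

# Illustrative system and gain (assumed values, not taken from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-3.0, -1.2]])
T = 30  # number of simulated steps

A_cl = A + B @ K

# System responses induced by u_k = K x_k:
# Phi_x[k-1] = (A + B K)^(k-1) and Phi_u[k-1] = K Phi_x[k-1] for k = 1..T+1.
Phi_x = [np.linalg.matrix_power(A_cl, k) for k in range(T + 1)]
Phi_u = [K @ P for P in Phi_x]

# Affine SLS constraints: Phi_x(1) = I and Phi_x(k+1) = A Phi_x(k) + B Phi_u(k).
identity_residual = np.abs(Phi_x[0] - np.eye(2)).max()
affine_residual = max(
    np.abs(Phi_x[k + 1] - (A @ Phi_x[k] + B @ Phi_u[k])).max() for k in range(T)
)

# Simulate the closed loop with a bounded random disturbance.
rng = np.random.default_rng(0)
x0 = np.array([1.0, -0.5])
w = rng.uniform(-0.1, 0.1, size=(T, 2))
x = np.zeros((T + 1, 2))
x[0] = x0
for k in range(T):
    x[k + 1] = A @ x[k] + B @ (K @ x[k]) + w[k]

# Reconstruct the states by convolution, treating x0 as the disturbance at time -1.
w_ext = np.vstack([x0, w])  # w_ext[0] = x0 and w_ext[t + 1] = w_t
x_conv = np.zeros_like(x)
for k in range(T + 1):
    # x_k = sum_{j=1}^{k+1} Phi_x(j) w_{k-j}
    x_conv[k] = sum(Phi_x[j - 1] @ w_ext[k - j + 1] for j in range(1, k + 2))

convolution_residual = np.abs(x - x_conv).max()
print(identity_residual, affine_residual, convolution_residual)
```

All three residuals should be at the level of floating-point round-off. In the synthesis problems considered later, the response elements are decision variables rather than being derived from a fixed gain.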

2.2 Notation

In this paper, we restrict our attention to the function space , consisting of (discrete-time) stable matrix-valued transfer functions. We use to denote the set of transfer functions such that . We further use the notation for transfer functions that satisfy a certain decay rate in the spectral norm of their impulse response elements.

When working with transfer functions and signals, denote the coefficient of the term of degree as and . We will also denote

as the block row vector of system response elements of

As is standard, we let denote the -norm of a vector . For a matrix , we let denote its operator norm. We will consider the , , and norms, which are infinite horizon analogs of the Frobenius, spectral, and operator norms of a matrix, respectively: , , and .

Finally, for two numbers , we let (resp. ) denote that there exists an absolute constant such that (resp. ).

2.3 Optimal Control Problem

We now describe the constrained optimal control problem (OCP) that we would want to solve given perfect knowledge of . This formulation acts as our baseline:


This problem is to be interpreted as follows. First, let the set enumerate all inputs that result from linear dynamic stabilizing feedback controllers for of the form . This is made possible by the system level synthesis framework described above. The cost

where the system is in feedback with , is any distribution that satisfies and and is independent across time, i.e., for . On the other hand, the constraints read that for every possible realization satisfying , the trajectory and the inputs coming from the system dynamics in feedback with the law are contained within the state and input constraint polytopes and , respectively.

We note that the OCP given in (2.2) is a convex, but infinite-dimensional problem. It is an idealized baseline to compare our actual solutions to; our sub-optimality guarantees will be with respect to the optimal cost achieved by this idealized problem. This is a desirable baseline, since it optimizes for average case performance but ensures safety for the worst-case behavior, consistent with MPC literature [10, 23]. We remark that an alternative to (2.2) is to replace the worst case constraint behavior with probabilistic chance constraints [24]. We do not work with chance constraints because they are generally difficult to directly enforce on an infinite horizon; arguments around recursive feasibility using robust invariant sets are common in the MPC literature to deal with this issue.

3 Constraint-Satisfying Control

We begin by formulating a method for robustly operating a system while maintaining safety. First, a system level synthesis approach to the constrained LQR problem is described and then modified to be robust to uncertainties in system dynamics. Finally, we discuss a reduction to a tractable finite-dimensional optimization problem.

3.1 A System Level Approach

Using the SLS formulation, we define an optimization problem that solves the OCP (2.2).

Proposition 3.1.

The following convex optimization problem solves OCP (2.2).



with indexing the rows of and .

With , we define the LQR cost on the true system (omitting the constant multiple ) as

We remark that the feasibility of the convex synthesis problem in (3.1) for an initial condition implies that is a member of a robust control invariant set.


Proof. By the state-feedback parameterization result in Theorem 1 of [2], the SLS parametrization encompasses all internally stabilizing state-feedback controllers acting on the true system . Thus, it is necessary only to show that the optimization problem in (3.1) is consistent with that of (2.2) under the system level parametrization. The equivalence between the LQR cost and the system norm is standard and omitted for brevity – see the Appendix of [1] for this reformulation in terms of system responses.

Therefore, it remains to consider the inequality constraints. Because the constraints must be satisfied robustly, it is equivalent to consider

Then considering elements in the second term for ,

Thus the inequality constraint on the function is an equivalent condition. A similar computation holds for the input constraint. ∎
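The step from a constraint that must hold for every disturbance realization to a single deterministic inequality rests on the standard duality between the ℓ∞ and ℓ1 norms. The worked equations below are a sketch of that step. The symbol names are assumptions, since the extracted text does not preserve the original notation: the polytope is taken to be the rows F_j^T x ≤ b_j, the disturbance is taken to obey ‖w_t‖∞ ≤ σ_w, and the trajectory is written as x_k = Φx(k+1) x_0 + Σ_{t=0}^{k-1} Φx(k-t) w_t.

```latex
% Worst case of a linear function over an l_infinity ball
% (c and w are stacked vectors of equal length).
\sup_{\|w\|_\infty \le \sigma_w} c^\top w
  \;=\; \sigma_w \sum_i |c_i|
  \;=\; \sigma_w \|c\|_1,
\qquad \text{attained at } w_i = \sigma_w \,\mathrm{sign}(c_i).

% Apply this with c^\top = F_j^\top [\Phi_x(k) \;\cdots\; \Phi_x(1)]
% and w = (w_0, \dots, w_{k-1}):
F_j^\top x_k \le b_j
  \quad \text{for all } \|w_t\|_\infty \le \sigma_w
\;\iff\;
F_j^\top \Phi_x(k+1)\, x_0
  + \sigma_w \left\| F_j^\top [\Phi_x(k) \;\cdots\; \Phi_x(1)] \right\|_1
  \le b_j .
```

Under these assumptions the robust constraint is convex in the response elements, because the left-hand side is an affine term plus a norm of an affine expression.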

3.2 Robust Control

Further motivation for reformulating the optimal control problem in terms of system responses is the ability to transparently consider uncertainties in the dynamics. Recall that we consider controller synthesis under model errors, where only nominal estimates of the system are known. Then the model mismatch impacts the closed-loop system in a transparent way:

Proposition 3.2.

Define . If , satisfy and , then on the true system, the controller achieves the system response and cost bounded by


Proof. We make use of the robust stability result in Theorem 2 of Matni et al. [25] and note that is a sufficient condition for the existence of the inverse for any induced norm by the small gain theorem. Then by the sub-multiplicativity of the and norms,

Motivated by this result, consider the following robust optimization problem:


where are fixed parameters and

where we define .

Theorem 3.3.

Any controller designed from a feasible solution to the robust control problem (3.2) for any will stabilize the true system. Furthermore, the state and input constraints will be satisfied.


Proof. First, note that


Then by Proposition 3.2, the true system trajectory will be given by

Therefore, the state constraints are satisfied as long as

The first term reduces to as in the non-robust case. Because information about is not known, we resort to a sufficient condition to bound the second term, letting ,

Consequently, a sufficient condition for satisfying state constraints is to have for all ,

Therefore, the constraints on imply that the state constraints are satisfied. Similar logic shows that the constraints on imply that the input constraints are satisfied. ∎

3.3 Finite Dimensional Reduction

To make controller synthesis tractable, we can solve a finite approximation to optimization problem (3.2) wherein we only optimize over the first impulse response elements of and , treating them as finite impulse response (FIR) filters. We show that in this setting, the optimization variables and constraints admit finite-dimensional representations. We first reformulate the constraints. Starting with the affine constraint, we have for


where we will also optimize over , a term which captures the tail of the system responses that we ignore in the synthesis.

Next, considering the system norm constraints, the norm can be reduced to a compact SDP over as in Theorem 5.8 of Dumitrescu [26], described explicitly for this setting in Appendix G.3 of Dean et al. [7]. For the norm, the constraint becomes an operator norm bound,


where the tail variable enters transparently. For , the inequality constraints on and remain. For any , the expression reduces to, for


Therefore, the synthesis problem becomes


This is a finite dimensional SDP. The controller given by can be written in an equivalent state-space realization via Theorem 2 of Anderson et al. [27].

4 Suboptimality Guarantees

How much is control performance degraded by uncertainties about the dynamics? In this section, we derive a sub-optimality bound which answers this question for the constrained LQR problem. First, consider the addition of an outer minimization over and (see footnote 1):

[Footnote 1: The objective is unimodal in individually, and therefore this outer minimization can be achieved by searching over the box . For less computational complexity, the minimization need only be over a single outer variable: . In this case, the sub-optimality bound will retain the same flavor, but the norm distinctions between cost and constraints will be less clear.]


Denote the solution to the true optimal control problem as , then define and . Additionally, define constants related to the optimal system norm and the dynamics uncertainties:

and .

Theorem 4.1.

Define the constraint robustness margins of the optimal constrained controller as

and similarly for . Then, as long as and , we have that the cost achieved by synthesized from the minimizers of (4.1) satisfies

While this result is stated in terms of quantities related to the unknown true system, we note that a similar data-dependent expression could be derived that depends only on the estimated system. We further remark that the condition on constraint robustness margins may be restrictive; for systems operating close to their constraints, our theorem requires near-perfect knowledge before guaranteeing sub-optimality.


Proof. Using Proposition 3.2 along with the norm bounds (3.3) and the constraints in optimization problem (4.1),

Next, we will use the following lemma

Lemma 4.2.

Under the conditions of Theorem 4.1, we have that the following is a feasible solution to (4.1)

where we define .

The proof of Lemma 4.2 follows by checking that the proposed solution satisfies all the constraints and is presented in Appendix 9. Applying Lemma 4.2,

This is true because is the optimal solution to (4.1), so the objective function evaluated at the feasible is an upper bound. Then we have

The second inequality follows from an application of Proposition 3.2 with the roles of the nominal and true systems switched. The final inequality follows from bounding by and noticing that for , where we set . ∎

Here, we briefly remark that a similar sub-optimality bound can be derived for the finite problem in (3.7). In short, controllers synthesized from the optimization problem will satisfy a sub-optimality bound of the form in Theorem 4.1 with an additional factor due to the FIR truncation. The formal statement and proof of this result are deferred to Appendix 9, but we highlight here that the cost penalty incurred due to FIR approximation decays exponentially in the horizon over which the approximation is taken.
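One way to build intuition for why FIR truncation is cheap: the impulse response elements of a stable closed loop decay geometrically, so the tail that the truncation ignores shrinks exponentially with the horizon. The sketch below illustrates this for a static-gain closed loop. The matrices and the gain are assumed values for illustration only, and the snippet does not reproduce the bound from the appendix.

```python
import numpy as np

# Assumed illustrative closed loop (not the paper's synthesized controller).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-3.0, -1.2]])
A_cl = A + B @ K

# Spectral radius of the closed loop; it must be below 1 for stability.
rho = max(abs(np.linalg.eigvals(A_cl)))

# Spectral norm of the impulse response element at lag L, i.e. ||(A + B K)^L||_2.
horizons = [5, 10, 20, 40]
tail_norms = {
    L: np.linalg.norm(np.linalg.matrix_power(A_cl, L), 2) for L in horizons
}

for L in horizons:
    # Compare the actual norm with the geometric rate rho^L.
    print(f"L={L:3d}  norm={tail_norms[L]:.3e}  rho^L={rho ** L:.3e}")
```

The printed norms track rho^L up to a constant factor, which is the qualitative behavior behind an exponentially decaying truncation penalty.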

5 Learning with Control

Finally, we connect the previous results on robust control with system estimation. To show a priori guarantees on statistical learning we adopt control actions that both keep the system safe and provide excitation,


where each is stochastic and -bounded, i.e. . Given a trajectory sequence , we propose to learn the dynamics via least-squares regression on a trajectory of length :


We will prove a statistical rate on the least-squares estimate in terms of the system response and the trajectory length.
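The estimation step can be sketched as follows, assuming a control law of the form u_k = K x_k + η_k with bounded uniform excitation η_k and bounded uniform process noise w_k. The system matrices, the gain K, the noise levels, and the trajectory length are hypothetical values chosen for illustration, and the static gain stands in for the safe SLS controller described in the text.

```python
import numpy as np

# Assumed "true" system and a hand-picked stabilizing gain (illustrative only).
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [1.0]])
K = np.array([[-3.0, -1.2]])
sigma_w, sigma_eta = 0.1, 0.5   # bounds on process noise and excitation
T = 5000                        # trajectory length

rng = np.random.default_rng(0)
n, m = B_true.shape
x = np.zeros((T + 1, n))
u = np.zeros((T, m))
for k in range(T):
    eta = rng.uniform(-sigma_eta, sigma_eta, size=m)  # bounded excitation
    w = rng.uniform(-sigma_w, sigma_w, size=n)        # bounded disturbance
    u[k] = K @ x[k] + eta
    x[k + 1] = A_true @ x[k] + B_true @ u[k] + w

# Least squares on a single trajectory: x_{k+1} ~ [A B] [x_k; u_k].
Z = np.hstack([x[:-1], u])                       # regressors, shape (T, n + m)
Y = x[1:]                                        # targets, shape (T, n)
Theta, *_ = np.linalg.lstsq(Z, Y, rcond=None)    # shape (n + m, n)
A_hat = Theta[:n].T
B_hat = Theta[n:].T

err_A = np.linalg.norm(A_hat - A_true, 2)
err_B = np.linalg.norm(B_hat - B_true, 2)
print(err_A, err_B)
```

Without the excitation term the inputs would be an exact linear function of the states, the regressor matrix Z would be rank deficient, and A and B could not be separated; the injected noise is what makes the regression well posed.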

The bulk of the proof for the statistical rate comes from a general theorem regarding linear-response time series data from Simchowitz et al. [6]. Recently, this proof was adapted by Dean et al. [7] to show a rate of estimation in the setting given by (5.1) when both and the disturbance are Gaussian distributed. We modify the reduction given by Dean et al. to the case when the excitation and disturbance are no longer Gaussian, but instead zero-mean and bounded. We assume that are both zero-mean sequences with independent coordinates and finite fourth moments. In particular, we assume , , , . These assumptions are quickly verified for common distributions such as uniform on a compact interval or over a discrete set of points. The main estimation result is the following.

Theorem 5.1.

Fix a failure probability . Suppose the stochastic disturbance and the input disturbance satisfy the assumptions above. Assume for simplicity that , and that the stabilizing controller achieves a SLS response . Let . Then as long as the trajectory length satisfies the condition:


we have the following bound on the least-squares estimation errors that holds with probability at least ,

The proof of this result is presented in Appendix 8. We remark on the interpretation of statistical learning bounds. A priori guarantees, like the one presented here, depend on quantities related to the underlying true system. As such, they are not directly useful when the system is unknown (see footnote 2), but rather they indicate qualities of systems that make them easier or harder to estimate.

[Footnote 2: Statistical bounds in terms of data-dependent quantities can also be worked out; however, modern methods like bootstrapping generally provide tighter statistical guarantees [28].]

Corollary 5.2.

If the robust control synthesis problem (3.2) is feasible for any , initial system estimates , initial dynamics uncertainties , replaced with (see footnote 3), and replaced with , then the resulting control law with stochastic stabilizes the true system, satisfies state and input constraints, and allows for learning at the rate given in Theorem 5.1.

[Footnote 3: Note that since the quantity would not generally be known, it can be bounded by .]


(Sketch) The proposed control law is equivalent to the system controlled by a deterministic control law plus an enlarged process noise distribution . Therefore, the stability and constraint satisfaction follow from Theorem 3.3. Since the control law is of the form (5.1), the results of Theorem 5.1 hold. ∎

Finally, we connect the sub-optimality result to the statistical learning bound for an end-to-end sample complexity bound on the constrained LQR problem.

Corollary 5.3.

Assume initial feasibility of the learning problem. For simplicity, assume . Then for


the cost achieved by synthesized from (3.2) on the least-squares estimates satisfies with probability at least ,


(Sketch) This result follows by combining the statistical guarantee in Theorem 5.1 with the sub-optimality bound in Theorem 4.1. Note that we use the naïve bound and similarly ; this results in an extra factor of appearing in (5.4). ∎

Notice that this result depends both on the true system and the initial system estimates by way of the learning controller, which affects and constants in the term. The system constraints enter through their effect on , and while they may impact the waiting time, they do not influence the ultimate cost sub-optimality.

6 Numerical Experiments

We demonstrate the utility of this framework on the double integrator example. In this case, the true dynamics are given by

with the constraints as states bounded between and , and inputs bounded between and . We have . Our initial estimate comes from a randomly generated initial perturbation of the true system with . Safe controllers are generated with finite truncation length , and for larger initial conditions, the system is warm-started with a finite-time robust controller with horizon to reduce the initial condition.
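To make the setup concrete, the sketch below computes a worst-case state envelope for a double integrator under a fixed linear gain with bounded process noise and bounded excitation, then checks that sampled trajectories stay inside it. Every number here is an assumption: the extracted text does not preserve the paper's dynamics, bounds, or noise levels, and the hand-picked static gain is a stand-in for the controllers obtained from the robust synthesis problem.

```python
import numpy as np

# Assumed illustrative values; the paper's actual numbers are not recoverable here.
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # discretized double integrator
B = np.array([[0.0], [1.0]])
K = np.array([[-3.0, -1.2]])             # hand-picked stabilizing gain
sigma_w, sigma_eta = 0.1, 0.5            # l_infinity bounds on w_k and eta_k
x0 = np.array([1.0, 0.0])
T = 50
A_cl = A + B @ K


def state_envelope(A_cl, B, x0, sigma_w, sigma_eta, T):
    """Upper bound on |x_k| per coordinate, k = 0..T, over all bounded w, eta."""
    n = A_cl.shape[0]
    bounds = np.zeros((T + 1, n))
    power = np.eye(n)     # holds A_cl^k at the top of iteration k
    tail = np.zeros(n)    # accumulated worst-case effect of w and eta
    for k in range(T + 1):
        bounds[k] = np.abs(power @ x0) + tail
        # Row-wise l1 norms give the worst case over l_infinity-bounded inputs.
        tail = (tail
                + sigma_w * np.abs(power).sum(axis=1)
                + sigma_eta * np.abs(power @ B).sum(axis=1))
        power = A_cl @ power
    return bounds


bounds = state_envelope(A_cl, B, x0, sigma_w, sigma_eta, T)

# Empirical check: sampled trajectories never leave the envelope.
rng = np.random.default_rng(0)
max_abs = np.zeros_like(bounds)
for _ in range(200):
    x = x0.copy()
    max_abs[0] = np.maximum(max_abs[0], np.abs(x))
    for k in range(T):
        eta = rng.uniform(-sigma_eta, sigma_eta, size=1)
        w = rng.uniform(-sigma_w, sigma_w, size=2)
        x = A_cl @ x + B @ eta + w
        max_abs[k + 1] = np.maximum(max_abs[k + 1], np.abs(x))

envelope_holds = bool(np.all(max_abs <= bounds + 1e-9))
print(envelope_holds, bounds.max(axis=0))
```

Comparing `bounds.max(axis=0)` against a candidate state constraint gives a quick check of whether this particular gain and excitation level would be safe. Raising `sigma_eta` widens the envelope, which is the safety versus exploration trade-off in miniature.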

Figure 1: Safe learning trajectories synthesized with coarse initial estimates (a), then robust execution with reduced model errors (b).

Figure 1 displays safe trajectories and input sequences for several example initial conditions. In Figure 1a, the plotted trajectories are used for learning: the controller both regulates and excites the system (), and is robust to initial uncertainties. Figure 1b demonstrates an ability to operate closer to the margin when there is less uncertainty: in this case, there is no added excitation () and the system estimates are better specified (), so larger initial conditions are feasible.

Figure 2: Over time, estimation errors decrease (a). As safety requirements increase, the maximum feasible excitation decreases (b).

Figure 2a displays the decreasing estimation errors over time, demonstrating learning. Shaded areas represent quartiles over trials. Figure 2b displays the trade-off between safety and exploration by showing the largest value of for which the robust synthesis is feasible, given a size for the state constraint set . Here, we leave , and examine a variety of errors in the dynamics estimates. As the uncertainties in the dynamics decrease, higher levels of both safety and exploration are achievable.

7 Discussion

In this paper, we propose a method for learning unknown linear systems while ensuring that they satisfy state and input constraints. By synthesizing a controller that both excites and regulates the system, we address the trade-off between safety and exploration directly. We further derive an end-to-end finite sample bound on the performance of LQR controllers synthesized from collected data.

There are several directions for possible extensions of this work. To mitigate the conservativeness of the robust controller, tighter bounds on the uncertainty in the system response could be derived for structured settings, where more than just the norm of the error is known. To connect this work to experiment design literature, the objective in the synthesis problem (3.2) could be replaced with an exploration inspired cost function for the learning stage.

Alternatively, the constrained LQR problem could be cast in the setting of online learning, where one seeks to minimize cost at all times, including during learning. This would require an analysis of recursive feasibility, to understand the transition that occurs when controllers are updated based on refined system estimates. It would also likely require a direct quantification of performance loss when the robustness margin conditions are not satisfied. Finally, we remark that the exploration vs. safety trade-off is compelling for nonlinear systems.


We thank Francesco Borrelli and the members of the MPC Lab at UC Berkeley for their helpful comments and feedback. SD is supported by an NSF Graduate Research Fellowship under Grant No. DGE 1752814. ST is supported by a Google PhD fellowship. BR is generously supported in part by ONR awards N00014-17-1-2191, N00014-17-1-2401, and N00014-18-1-2833, the DARPA Assured Autonomy (FA8750-18-C-0101) and Lagrange (W911NF-16-1-0552) programs, and an Amazon AWS AI Research Award.



8 Learning Results

First, we prove a simple small-ball result for random variables with finite fourth moments.

Proposition 8.1.

Let be a zero-mean random variable with finite fourth moment, which satisfies the conditions

Let be a fixed scalar and . We have that


Proof. First, we note that we can assume without loss of generality, since we can perform a change of variables . We have that . Therefore, by the Paley-Zygmund inequality and Young’s inequality, we have that

Now define the function for as

Clearly and . Since by Jensen’s inequality we know that , this means . On the other hand,

Assume that (otherwise the claim is trivially true). The only critical points of the function