QRnet: optimal regulator design with LQR-augmented neural networks

In this paper we propose a new computational method for designing optimal regulators for high-dimensional nonlinear systems. The proposed approach leverages physics-informed machine learning to solve high-dimensional Hamilton-Jacobi-Bellman equations arising in optimal feedback control. Concretely, we augment linear quadratic regulators with neural networks to handle nonlinearities. We train the augmented models on data generated without discretizing the state space, enabling application to high-dimensional problems. We use the proposed method to design a candidate optimal regulator for an unstable Burgers' equation, and through this example, demonstrate improved robustness and accuracy compared to existing neural network formulations.



page 11


Adaptive Deep Learning for High Dimensional Hamilton-Jacobi-Bellman Equations

Computing optimal feedback controls for nonlinear systems generally requ...

A Tensor Decomposition Approach for High-Dimensional Hamilton-Jacobi-Bellman Equations

A tensor decomposition approach for the solution of high-dimensional, fu...

Optimal Stopping via Randomized Neural Networks

This paper presents new machine learning approaches to approximate the s...

Gradient-augmented Supervised Learning of Optimal Feedback Laws Using State-dependent Riccati Equations

A supervised learning approach for the solution of large-scale nonlinear...

An adaptive artificial neural network-based generative design method for layout designs

Layout designs are encountered in a variety of fields. For problems with...

High Dimensional Level Set Estimation with Bayesian Neural Network

Level Set Estimation (LSE) is an important problem with applications in ...

Hybrid Neural Network Augmented Physics-based Models for Nonlinear Filtering

In this paper we present a hybrid neural network augmented physics-based...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

While the linear quadratic regulator (LQR) is firmly established as one of the most powerful tools in linear control, the design of optimal regulators for nonlinear systems continues to challenge the control community. The bottleneck in optimal feedback design is the need to solve a Hamilton-Jacobi-Bellman (HJB) partial differential equation (PDE). Due to the well-known “curse-of-dimensionality,” this can be extremely difficult for high-dimensional nonlinear systems.

For this reason, there is an extensive literature on methods for approximating solutions of HJB equations. Some key examples include series expansions [3, 20, 17, 24, 6], level set methods [26], patchy dynamic programming [8, 25], semi-Lagrangian methods [5, 12], method of characteristics and Hopf formula-based algorithms [9, 33]

, tensor-based methods

[11], and polynomial approximation [16]. Unfortunately, many of these methods are limited to moderate dimensions, local solutions, or require the dynamics to have certain special algebraic structure.

In recent years, neural networks (NNs) have gained considerable attention as a promising tool for high-dimensional problems since they can avoid the use of spatial grids. Many NN-based methods represent the solution of the HJB equation – called the value function – with a NN and minimize the residual of the HJB PDE and boundary conditions at randomly sampled collocation points [2, 30, 29]. That is, they solve the HJB PDE in the least-squares sense. [15] propose a method for learning a suboptimal policy and approximate value function locally around some nominal trajectories.

Finally, a number of recent works have demonstrated the potential of data-driven methods for HJB. The core idea is to generate data by solving a number of two-point boundary value problems (BVPs) which describe the characteristics of the value function. These BVPs can be solved independently, without a spatial mesh, and in parallel, thus making the algorithm causality-free. This property allows the method to be applied to high-dimensional problems. Given BVP data, one then constructs a model of the value function based on this data. In [18]

the value function is calculated with sparse grid interpolation, in

[14, 22, 23]supervised learning is used to train a NN model, and in [4] the value function is approximated by sparse polynomial regression. Lastly, [13] consider a related approach which connects forward-backward stochastic differential equations with the HJB equation in stochastic optimal control.

In this paper we propose a physics-informed machine learning method to solve high-dimensional infinite horizon quadratic regularization problems. This method extends the framework introduced in [14, 22, 23] to infinite horizon problems, and we modify the NN value function model to include a quadratic term which comes from the LQR approximation for the linearized dynamics. This mirrors the form of a series expansion, with the NN accounting for all higher order terms.

We contend that the proposed model structure, which we call QRnet, has the following advantages:

  • In common practice, linear and nonlinear parts of the control are often treated separately. On the other hand, the QRnet feedback controller smoothly integrates these components to achieve good performance on large domains while retaining the local robustness of LQR.

  • Model training is LQR-initialized: rather than learning from scratch, the NN builds on the scaffolding of the LQR value function. As we show in Section 4.3, this can reduce sensitivity to variations in the data set and weight initialization.

2 Problem setting

We consider infinite-horizon nonlinear optimal control problems (OCPs) of the form


Here is the state, is the control, and

is a Lipschitz continuous vector field. We consider problems with quadratic running cost,


where is positive semi-definite and is positive definite. We assume that the objective state is a fixed point of the dynamics such that . Due to various sources of disturbance and real-time application requirements, we would like to design an optimal control in closed-loop feedback form, , which can be evaluated online given any measurement of .

2.1 The Hamilton-Jacobi-Bellman equation

To compute the optimal feedback control, we start by defining the value function as the optimal cost-to-go of creftype 1 starting at the point . That is,


It can be shown that the value function is the unique viscosity solution [10] of the steady state Hamilton-Jacobi-Bellman (HJB) PDE,


where we denote . If creftype 4 can be solved (in the viscosity sense), then it provides both necessary and sufficient conditions for optimality.

Given the value function , we define the Hamiltonian


The optimal control satisfies the Hamiltonian minimization condition,


Thus if we can solve creftype 4, the optimal feedback control is obtained online as the solution of creftype 6.

2.2 Linear quadratic control

When the system dynamics are linear, that is if for constant matrices , , then creftype 1 reduces to the classic infinite-horizon LQR problem. It is well-known that in this case the value function is given by


for satisfying the continuous algebraic Riccati equation,


Furthermore, the resulting state feedback controller is linear with constant gain:


For nonlinear systems, a common approach is to linearize the dynamics about . One can then compute a controller based on the linearized dynamics,


This approach has yielded many successful engineering applications, but it is suboptimal and in some cases even fails to stabilize the nonlinear dynamics. For this reason we are interested in modeling the value function for the full nonlinear dynamics.

2.3 Pontryagin’s Minimum Principle

To make use of creftype 6 for general nonlinear systems, we need an efficient way to approximate the value function and its gradient. Like [18, 14, 22, 23, 4], rather than solve the full HJB equation creftype 4 directly, we exploit the fact its characteristics evolve according to a two-point BVP, well-known in optimal control as Pontryagin’s Minimum Principle (PMP, [27]):


Here is called the costate. The two-point BVP creftype 11 provides a necessary condition for optimality, and if we further assume that the solution is optimal, then along the characteristic we have


In general, the BVP creftype 11 admits multiple solutions. So while the characteristics of the value function satisfy creftype 11, there may be other solutions to these equations which are sub-optimal and thus not characteristics. In certain problems the characteristics can also intersect, giving rise to non-smooth value functions and difficulties in applying creftype 12.

Optimality of solutions to creftype 11 can be guaranteed under some convexity conditions (see e.g. [21]). For most dynamical systems it is difficult to verify such conditions globally, but we can guarantee optimality locally around an equilibrium point [20]. Addressing the challenge of global optimality is beyond the scope of the present work, so in this paper we assume that solutions of creftype 11 are optimal. Under this assumption, the relationship between PMP and the value function given in creftype 12 holds everywhere.

Note the proposed method can still be applied even when this assumption cannot be verified. In such cases PMP is the prevailing tool for finding candidate optimal solutions, and from these the proposed method yields a feedback controller which satisfies necessary conditions for optimality.

3 Neural network value function modeling

3.1 Data generation

Like [22, 23, 4], we generate data by solving the two-point BVP creftype 11 for a set of randomly sampled initial conditions. Critically, these BVPs can be solved completely independently without knowledge of nearby solutions. Methods based on this idea are referred to as causality-free [18, 22, 23, 4]. Note that the related method proposed in [14] differs slightly because it does not allow one to choose initial conditions freely.

3.1.1 Generating infinite-horizon data

Unlike these prior works which consider fixed finite time OCPs, in this paper we are interested in infinite-horizon problems. Notice that the infinite-horizon PMP creftype 11 is obtained with the limit of a finite-horizon problem [27]. To reflect this, we solve creftype 11 up to some fixed final time . Next we check if the running cost is smaller than a desired tolerance. If not, we extend the time horizon and – using the previous solution as an initial guess – re-solve the BVP until the running cost is sufficiently small.

Once the running cost is small enough, it follows that the finite-horizon solution approximates the solution of the infinite-horizon problem. Then creftype 12 is satisfied at each point along the trajectory . Aggregating data from all infinite-horizon BVP solutions, we obtain a data set

where and . Note that there is no need to distinguish data from different trajectories as the value function and its gradient are time-independent.

While generating data in this way is efficient because we extract a lot of data from each successful BVP solution, it has the side effect of concentrating a large amount of data near the equilibrium. Meanwhile, we are interested in designing controllers which are effective over large regions of the state space and consequently we need data sets which support learning far from the equilibrium. To this end, instead of including the whole trajectory in the data set, we only take points with for a parameter . is chosen to balance efficiency in data generation with the competing objective of adequately representing the entire state space; in this paper we set .

3.1.2 LQR warm start for reliable BVP solution

In this paper, we solve the two-point BVP using the SciPy [32] implementation of the BVP solver introduced in [19]. This algorithm is highly accurate but also highly sensitive to the initial guess for and : there is no guarantee of convergence with an arbitrary initial guess. Furthermore, convergence is increasingly dependent on good initializations as we increase the length of the time interval to approximate the infinite-horizon problem.

To mitigate this difficulty we simulate the dynamics up to some large final time with an LQR controller creftype 9 to close the loop. This provides a guess for the optimal state trajectory, and a guess for the costate can be obtained with the LQR approximation . While the costate guess is often far from perfect, we find that it is usually close enough to facilitate reliable convergence over large time horizons. We refer to this strategy as LQR warm-start.

3.2 Neural network architecture

We employ a simple and intuitive architecture to model the value function. The main idea is to augment the LQR value function approximation with a NN which accounts for nonlinearities. The LQR value function is computed with respect to the dynamics linearized around , and provides a good local approximation. The NN corrects and extends the approximation throughout the training domain.

As for the NN, we use a standard fully-connected feedforward architecture. We denote the output of the network as . Feedforward NNs approximate complicated nonlinear functions by a composition of simpler functions, namely


where each layer is defined as . Within each layer and

are the weight matrices and bias vectors, respectively, and

denotes a nonlinear activation function applied component-wise to its argument. We keep the output layer linear, so is the identity function.

We combine the raw NN prediction creftype 13 with the LQR value function creftype 7 for the linearized dynamics creftype 10 as


with a trainable parameter . Intuitively, LQR provides a good approximation near . There is small and hence for all . Further away from , we have , thereby increasing the relative importance of the corrective NN. The parameter governs the radius in which this term approximates ; in particular . Notice that the model structure creftype 14 is similar to a series expansion, except that we explicitly reduce the impact of lower order terms away from the linearization point.

Finally, the NN-based feedback control is evaluated by substituting into creftype 6 in place of the gradient of the true value function:


It should be emphasized that the gradient is calculated using automatic differentiation, and is therefore exact and computationally efficient.

3.3 Physics-informed learning

Suppose we have generated a data set as discussed in Section 3.1. This data takes the form of input-output pairs: are the inputs and are the outputs to be modeled. Let denote the collection of model parameters:

In [14, 22, 23], the NN is trained by solving a supervised learning problem,


Here the first term is the usual mean square regression loss,


The second term, multiplied by the scalar weight , is defined as


This serves as a form of physics-informed regularization. The term “physics-informed” is borrowed from [28] which partially inspired the authors’ previous work [22, 23]. By incorporating the prior knowledge that , we maximize the information extracted from the available data and obtain more optimal feedback laws [14, 22, 23].

In this work we impose an additional penalty on deviating from the optimal control:


Minimizing this control penalty term contributes directly toward the ultimate goal of using the NN for optimal feedback by effectively enforcing the Hamiltonian minimization condition creftype 6 on the learned feedback policy. We now arrive at a modified version of creftype 16:


for another scalar weight .

To quantify the accuracy of the model, we generate two data sets from independently drawn initial conditions. During training, the network observes only data points from the training set . The other data set, which we call the validation set , is reserved for evaluating the NN accuracy after training. Good validation performance indicates that the NN generalizes well, i.e. it did not overfit the training data. The ability to empirically measure model accuracy in this way is a key feature of causality-free methods.

4 Application to distributed parameter system

In this section we explore the effectiveness of the proposed algorithm by solving a 64-dimensional OCP arising from a Chebyshev pseudospectral (PS) discretization of a modified Burgers’ equation with a destabilizing reaction term. Stabilization of Burgers’ equation is a common benchmark problem in distributed parameter systems and similar problems have recently been considered in e.g. [16, 6, 22, 23].

Let satisfy the following one-dimensional controlled PDE with Dirichlet boundary conditions:


Here are scalar parameters, , and . The control is designed to stabilize the open-loop unstable origin by solving the PDE-constrained OCP



We define

and consider the case with actuators active on compact supports defined by

We set , , , and .

4.1 Pseudospectral discretization

To solve creftype 22 using our framework, we perform Chebyshev PS collocation to transform creftype 21 into a system of ordinary differential equations (ODEs). Following [31], let

We collocate at the non-boundary nodes , , and set to account for the boundary conditions. We then define

and construct Chebyshev differentiation matrices . Hence the PDE creftype 21 becomes


where , “” denotes elementwise multiplication, and and are the collocated versions of and .

The integral appearing in the cost function is conveniently approximated by Clenshaw-Curtis quadrature [31]. Let , be the non-boundary Clenshaw-Curtis quadrature weights. Then

Now the original OCP creftype 22 can be reformulated as a quadratic cost ODE-constrained problem,


4.2 Optimal and LQR controllers

We apply the Hamiltonian minimization condition creftype 6 to obtain an explicit expression for the optimal control,


For the LQR approximation we linearize the dynamics about and get


The LQR value function and controller are then computed with Eqs. creftypeplural 7 to 9.

4.3 Numerical results

We now present results of our method for solving the OCP creftype 24 with collocation points. The algorithm can handle higher dimensions, but we find that points are enough to resolve the stiff dynamics for all initial conditions tested. We compare the accuracy of QRnet with that of a standard NN trained using supervised learning (comparable to [14, 22, 23]), as well as with a straight LQR approximation.

We consider initial conditions which are sums of sine functions with uniform random coefficients:


and use the following hyperparameters for training:

  • in the NN component we use

    layers each with 32 neurons, and apply the

    activation function to all hidden layers;

  • we set the weights in the loss function

    creftype 20 to and , and optimize using L-BFGS-B [7];

  • we implement the model in TensorFlow 1.11

    [1] and train it on an NVIDIA RTX 2080Ti GPU.

Figure 1:

Relative mean absolute error in estimating the value, relative mean

error in predicting the optimal control, and training time, depending on the number of trajectories seen during training. The bar graph height shows the median over ten trials and error bars indicate the 25th and 75th percentiles.

First we study the sensitivity of the method with respect to variations in the data set and parameter initialization. This is important as collecting data can be time-consuming and NN training is a highly non-convex optimization problem. To this end we vary the number of trajectories in the training set , and for each different data set size, conduct ten trials with different randomly generated training trajectories and NN weight initializations. For validation we build a data set of 400 trajectories totaling data points.

From the results shown in Figure 1

, it is clear that both the plain NN and QRnet vastly outperform LQR for value function reconstruction and optimal control prediction. Furthermore, the distribution of validation errors skews lower for QRnet than the plain NN. This suggests that QRnet is less sensitive to variations in the data set and parameter initialization, and thus enables the use of smaller data sets. The training time for smaller data sets is also shorter. Together, these properties facilitate rapid prototyping over an iterative design process, and combine naturally with the progressive model refinement strategy proposed in


Figure 2:

Difference between controlled and optimal costs, depending on the norm of the initial condition. Shaded areas cover the 15th to 85th percentiles, lines show medians, and symbols pick out the top three outliers for each group of initial conditions.

Next we select NN and QRnet models trained on the same set of 64 trajectories with nearly identical validation accuracy. We compare the performance of the LQR, NN, and QRnet feedback controllers to the open-loop optimal control, across 1200 simulations for initial conditions of different size, . Results are shown in Figure 2. As expected, the performance of LQR steadily degrades away from the origin, while the NN and QRnet controllers do well throughout the training domain. The latter two appear to be largely equivalent, except for a few initial conditions where the NN accumulates significant additional cost while LQR and QRnet do not. This supports the idea that QRnet inherits some local robustness from LQR, yielding a more reliable controller than a plain NN.

Figure 3: Closed-loop dynamics and controls of the Burgers’ system creftype 23 for a random initial condition. The full simulation is over but we show only .

We conclude with a simulation of the dynamics creftype 23 with QRnet feedback, for a typical random initial condition. Results are plotted in Figure 3. There we can see that QRnet stabilizes the system and closely approximates the optimal control, even where LQR deviates from it.

5 Concluding remarks

In this paper we have presented QRnet, an extension of the causality-free physics-informed learning framework [22, 23] to infinite-horizon OCPs. This extension is comprised of two main features: efficient infinite-horizon data generation and a structurally-motivated NN architecture. By way of the Burgers’ benchmark problem, we have illustrated the potential for use in high-dimensional nonlinear systems and the improved robustness and performance the LQR-augmented model architecture.

Much remains to be explored in the development of deep learning approaches for feedback design. For instance, we are interested in applying the proposed method to problems with control constraints and non-quadratic costs with locally quadratic approximations. It will also be useful to study how beneficial the model architecture is, depending on how well the original problem can be approximated by a linear quadratic one. Finally, since the computational framework depends crucially on data generation, in future work we plan to study different strategies for improving the robustness and efficiency of this step.