For the optimal control of nonlinear dynamical systems, it is well-known that open-loop controls are not robust to model uncertainty or disturbances. For slowly evolving processes, it is possible to use model predictive control by recomputing the open-loop optimal solution for a relatively short time horizon in the future. However, for most applications one typically desires an optimal feedback control. Using dynamic programming, the optimal feedback control is computed by solving the (discretized) Hamilton-Jacobi-Bellman (HJB) equation, a partial differential equation (PDE) inspatial dimensions plus time. The size of the discretized problem increases exponentially with , making direct solution infeasible for even moderately high dimensional problems.
For this reason, there is an extensive literature on methods of finding approximate solutions for HJB equations. We have no intention to give a full review of existing results except for a short list of some related publications, [2, 7, 9, 11, 14, 16, 18, 22, 23, 25]
. In addition to suffering from the curse of dimensionality, most existing methods suffer one or more of the following drawbacks: the accuracy of the solution cannot be verified for general systems; the solution is valid only in a small neighborhood of a point; or the system model must have certain special algebraic structure.
In [17, 18], semi-global solutions to HJB equations are computed by constructing sparse discretization of the state space, then solving a two-point boundary value problem (TPBVP) to obtain the solution at each point on the grid. This algorithm is causality-free, i.e. the solution at each grid point is computed without using the value of the solution at other nearby points. Causality-free algorithms have been used to solve various types of PDEs, such as conservation laws , HJ and HJB equations [13, 8], and semilinear parabolic PDEs . The causality-free property is achieved by means of Hopf formula and convex optimization in , backward stochastic differential equations in , and Pontryagin’s minimum principle in 
. Causality-free algorithms are attractive because the computation does not depend on a grid, and hence can be applied to high dimensional problems. Some causality-free methods are slow for online computation, but they are perfectly parallel so can be used to generate large data sets offline. Such data sets can then be used to construct faster solutions such as sparse grid interpolation[17, 18] or, as in this paper, neural networks.
In this paper, we develop a computational method for solving high dimensional HJB equations and generating fully nonlinear optimal feedback controls. Our approach is data-driven and consists of three main steps. In the first step, a small set of open-loop optimal control solutions is generated using a causality-free algorithm, which is a TPBVP derived from Pontryagin’s minimum principle. In the second step, we use the data set to train a neural network (NN) that approximates the value function of optimal control. During training, we estimate the number of samples needed to obtain a good model. Additional samples are chosen in regions where the value function is difficult to learn, and can be obtained quickly with the aid of the NN. In this sense our method involvesadaptive sampling. Lastly, the accuracy of the NN is verified on another data set that is generated using the same causality free algorithm from the first step.
The trained NN is much faster than the causality-free algorithm used to generate data, a property that is critical for real-time computation. As an example, the method is applied to the optimal attitude control of a rigid-body satellite equipped with momentum wheels. The associated HJB equation has spatial dimensions and control inputs. Using this example we demonstrate several advantages and potential capabilities of the method, including solving HJB equations with high dimensions with validated levels of accuracy, computationally efficient NN-based feedback control for real-time applications, and progressive generation of rich data sets. Similar to , the method in this paper is semi-global, i.e. the solution is computed and validated over a given region rather than a local neighborhood around an operating point. Further, we achieve a comparable level of accuracy to that in , but require far fewer data points to do so.
2 A causality-free method for HJB
Consider the optimal control problem
Here is the state, is the control,
is the Lipschitz continuous vector field,is the terminal cost, and is the running cost. For simplicity let the final time be fixed. For a given initial condition , many methods exist to compute the optimal open-loop solution
It is well known, however, that open-loop controls are not robust to model uncertainty or disturbances. For some control problems, typically slowly evolving systems, it is possible to recompute open-loop optimal control online and implement control in MPC framework to mitigate the effect of uncertainty. However, for most engineering control applications one typically desires an optimal feedback control
which solves the optimal control problem
Using dynamic programming, the optimal feedback control (2.2) is computed by solving the following Hamilton-Jacobi-Bellman (HJB) equation,
In this paper we will denote
the partial derivatives with respect to time and space, respectively. The solution, , to (2.5) is the optimal cost of (2.4) and is called the value function. Since eq. (2.5) is a partial differential equation (PDE) in dimensions plus time, the size of the discretized equation increases exponentially with , making even moderately high dimensional problems computationally intractible. This is the so-called “curse of dimensionality.”
In [17, 18], the authors make efforts to overcome this obstacle by avoiding direct discretization and solution of the HJB equation. Instead, they construct a sparse grid of initial conditions and, for each point in the grid, solve the open-loop optimal control problem (2.1) as a TPBVP. They then interpolate the costate and apply Pontryagin’s Maximum Principle (PMP) to obtain the feedback control. However, even using a sparse grid the number of points grows with
where is the state dimension and is the number of points in each dimension. Thus using the sparse grids characteristics method [17, 18], we may have to solve a prohibitively large number of BVPs for higher dimensional problems. In addition, even with the solution available, evaluating the feedback control requires interpolation of the -dimensional costate. This can be expensive for large , making feedback implementation difficult.
We adopt the same TPBVP and its computational algorithm in [17, 18], but instead of sparse grid interpolation we use the data to train a NN to approximate the value function . We now provide a brief overview of the connection between dynamic programming and the TPBVP, as well as techniques to solve it numerically.
Following the standard procedure in optimal control (see e.g. ), first we define the Hamiltonian
where is the costate. The optimal control satisfies
Now we define the value function as
The value function satisfies the HJB equation (2.5), or equivalently
where . If (2.9) can be solved, then the optimal control is computed by substituting
into (2.7) to get
This means that with available, the feedback control is obtained as the solution of a (usually straightforward) optimization problem.
To use (2.11), we need an efficient way to calculate the value function and its gradient. Rather than solve the full HJB equation (2.9) on a grid in the state space, we leverage the fact that its characteristics evolve according to
with two-point split boundary conditions
For any given initial condition , the optimal control and value function along that characteristic are then given by
The necessary condition in the form of the TPBVP (2.12-2.13) is well-known in optimal control theory. However, numerically solving this problem is often very difficult. So far, there is no general algorithm that is reliable and fast enough for real-time applications. However, in our approach the real-time computation is done by a pre-trained NN. Thus we can solve the TPBVP offline to generate data sets to be used to train and validate NN. For this purpose, numerically solving the TPBVP can be manageable although it may require some level of parameter tuning. In this paper, the computational algorithm is based on a four-point Lobatto IIIa discretization. This is a collocation formula which provides a solution that is fifth-order accurate (see Kierzenka and Shampine  for more detail). But the algorithm is highly sensitive to the initial guess: there is no guarantee of convergence with an arbitrary initial guess. Furthermore, convergence is increasingly dependent on a good initial guess as we increase the length of the time interval.
To overcome this difficulty, to generate the initial data set we employ the time-marching trick from [17, 18] in which the solution grows from an initially short time interval to the final time . More specifically, we choose a time sequence
in which is small. For the short time interval , the TPBVP solver converges given most initial guesses near the initial state . Then, the resulting trajectory is rescaled over the longer time interval . The rescaled trajectory is used as the initial guess to find a solution of the TPBVP for . We repeat this process until , at which we obtain the full solution. Still, it is necessary to tune the time sequence to achieve convergence while maintaining acceptable efficiency.
Computing many such solutions becomes expensive, which means that generating the large data sets necessary to train a NN can be difficult. Instead, we begin by generating a small data set and adaptively adding more points during training. The key to doing this efficiently is simulating the system dynamics, using the partially-trained NN to close the loop. This quickly provides good guesses for the optimal state and costate over the entire time interval , so that we can immediately solve (2.12-2.13) for all of . Details are presented in Section 4, and numerical comparisons between this method and the time-marching trick are given in Section 5.3.
3 Neural network approximation of value function
Neural networks have become a popular tool for modeling high dimensional functions, since they are not dependent on discretizing the state space. In this paper, we apply NNs to approximate semi-global solutions of the HJB equation and evaluate the resulting feedback control in real-time. Specifically, we carry out the following steps:
Initial data generation: We compute the value function, , at a random selection of initial conditions chosen by Monte Carlo (MC) or quasi-Monte Carlo (QMC) sampling. Data is generated by solving the TPBVP as discussed in Section 2. In this initial data generation step, we require relatively few data points, since more data can be added later at a low computational cost.
Model training: Given this data set, we train a NN to approximate the value function . Our training process is guided by knowledge of the underlying structure of the problem by encouraging satisfaction of eq. (2.10). In doing so, we regularize the model and make efficient use out of small data sets.
Adaptive data generation: In the initial training phase we only have a small data set, so the NN we obtain is a low-fidelity model that roughly approximates the value function. We now expand the data set by generating data at initial conditions where the gradient is predicted to be large. Such points are likely to be near regions where the value function is steep or complicated, and thus difficult to learn. Computation of additional data is made efficient by good initial guesses obtained from NN-in-the-loop simulations of the system dynamics.
Model refinement and validation: We continue training the model and increasing the size of the data set until we satisfy some convergence criteria. Then, we check the generalization accuracy of the trained NN on a new set of validation data computed at MC sampling points.
Feedback control: We compute the optimal feedback control online by evaluating the gradient of the trained NN and applying PMP. Notably, evaluation of the gradient is extremely cheap even for large , enabling implementation in high dimensional systems.
The crux of the proposed method depends on modeling the value function (2.8) over the domain . We provide details on this process below. In Section 3.1, we review the basic structure of feedforward NNs and describe how we train a NN to model the value function. Then in Section 3.2, we propose a simple way to incorporate information about the costate into the training. Finally in Section 3.3, we demonstrate how to use the trained NN for feedback control. The adaptive data generation scheme is treated separately in Section 4. The proposed method is illustrated in Section 5 by solving the optimal attitude control problem for a rigid body satellite equipped with three pairs of momentum wheels.
3.1 Feedforward neural networks
In this paper we use simple multilayer feedforward NNs. While many more sophisticated architectures have been developed for other applications, we find this basic architecture to be more than adequate for our purposes. Let be the function we wish to approximate and be its NN representation. Feedforward NNs approximate complicated nonlinear functions by a composition of simpler functions, namely
where each layer is just
are the weight matrices and bias vectors, respectively.represents a nonlinear activation function
; popular choices include ReLU, tanh, and other similar functions. The final layer,, is typically linear, so
is the identity function. In this paper, we use tanh as the activation function on all the hidden layers.
We denote by the collection of all parameters, i.e.
where are the inputs, are the outputs to be modeled, and are the indices of the data points. The NN is then trained by solving the nonlinear regression problem,
where is the number of sample data in .
3.2 Physics-informed learning
Motivated by developments in physics-informed deep learning 
, we expect that we can improve on the rudimentary loss function in (3.1) by incorporating information about the underlying physics. Although it is possible to impose the HJB equation (2.9) by means of a residual penalty like  and , the residual must be evaluated over a large number of collocation points and can be rather expensive to compute. Instead, we take the simpler approach of modeling the costate along with the value function.
Specifically, we know that the costate must satisfy (2.10), so we encourage the NN to minimize
is the gradient of the NN representation of the value function with respect to the state. This quantity is calculated using automatic differentiation. In machine learning, automatic differentiation is usually used to compute gradients with respect to the model parameters, but is just as easy to apply to computing gradients with respect to inputs. In addition, the computational graph is pre-compiled so evaluating the gradient is cheap.
where . We now define the physics-informed learning problem,
Here is a scalar weight, the loss with respect to data is
and the physics-informed gradient loss regularization is defined as
As is common practice, we randomly partition the given data set (3.2) into a training set and validation set . During optimization, the loss functions (3.4) and (3.5) are calculated with respect to the training data . We then evaluate the performance of the NN against the validation data , which it did not observe during training. Good validation performance indicates that the NN generalizes well, i.e. it did not overfit the training data.
A NN trained to minimize (3.3) learns not just to fit the value data, but it is rewarded for doing so in a way that respects the underlying structure of the problem. This physics-informed regularization takes the known problem structure into account, so is preferable to the usual or
regularization, which are based on the (heuristic) principle that simpler representations of data are likely to generalize better. Furthermore, by using information about the costate we make the most use out of a potentially small data set.
3.3 Neural network in the closed-loop system
Once the NN is trained, evaluating at new inputs is highly efficient – and since we minimized the gradient loss (3.5) during training, we also expect to approximate the true gradient well. At runtime, whenever the feedback control needs to be computed, we evaluate and then solve (2.11) based on this approximation.
For many problems of interest, the optimization problem (2.11) admits a (semi-)analytic solution. In particular, for the important class of control affine systems with running cost convex in , we can solve it analytically. Suppose that the system dynamics can be written in the form
where and . Further, suppose that the running cost is of the form
for some function and some positive definite weight matrix . Then the Hamiltonian is
Now we apply PMP, which requires
Letting and solving for gives us the optimal control in feedback form:
4 Adaptive sampling and fast BVP solutions
4.1 Motivating progressive batching
Since generating just a single data point requires solving a difficult TPBVP, it is difficult to generate large data sets which adequately represent the value function. This necessitates training with small data sets and a method to generate new data in a smart way. In this paper, effective training with small data sets is accomplished by incorporating information about the costate as discussed in Section 3.2, but also by combining progressive batching with an efficient adaptive sampling technique.
rely on accurate gradient computations, and hence have to use the entire data set. For this reason they are often referred to as batch or full-batch methods. On the other hand, first order methods based on stochastic gradient descent (SGD) use only small subsets, or mini-batches, of the full data set. That is, at each optimization iteration, the loss functions in (3.1) and (3.3) are evaluated only on a subset where . Although second order methods converge much more quickly than first order methods, the necessary gradient calculations are prohibitively expensive for large data sets. Consequently, SGD variants have become the de facto standard for machine learning applications.
However, in the context of deep learning, our NNs are small and data sets smaller. Thus we expect second order methods to be superior for our purposes. With a small initial data set, which we denote by , we find that training a low-fidelity model is very fast using L-BFGS. After this initial round, we want to increase the size of the data set so that it better captures the features of the value function, then continue training the model (with more stringent convergence tolerances) using this larger data set, . We continue this process until some conditions are satisfied.
Our approach is similar to the progressive-batch L-BFGS method proposed in . The primary difference is that the problem addressed in  is a standard machine learning problem, where a massive data set is available from the start. This allows them to increase the sample size every few iterations, and take a completely different sample. In our case this is not possible: we only have a small amount of data, and since data generation is expensive, we would like to use only as much as is needed.
4.2 Convergence test and sample size selection
To start, assume that the internal optimizer (e.g. L-BFGS) converges in optimization round . Let denote the NN parameters at the end of this round. Given convergence of the optimizer, the first order necessary optimality condition holds, so
where is the physics-informed loss defined in eq. (3.3), is the gradient with respect to the NN parameters , and is the costate-augmented data set available in the th training round. Now if is sampled uniformly from the domain of interest , it follows that
where denotes the population average taken over all possible batches sampled uniformly from . The quantity on the right is the gradient evaluated over the entire domain, which by its Monte Carlo structure gives rise to the expectation over and . Note that the quantity on the right is not directly computable; this equality simply says that, in expectation, the parameters satisfy the first order conditions for optimality.
The simplest way to test convergence is to use a validation data set with and check if, for example,
for a small positive parameter . Such tests are standard in machine learning, but for many practical problems it may be too expensive to generate a large-enough validation data set. More importantly, such tests provides no clear guidance in selecting should they fail.
Instead, we propose a general strategy along the lines of  and  which provides information on how to choose the sample size and is applicable to general data-driven optimization problems. The idea is simple: since we already assume (4.1) holds, then to ensure that
is small, it suffices to check if the population variance is relatively small. That is, we require
where is a positive parameter. Evaluating the left hand side is computationally intractible, but it is bounded by the sample variance:111For this approximation to be valid, the data set must be sufficiently large, although in practice we find that a few hundred samples is sufficient. On the other hand, if is too large, it can be inefficient to calculate gradients for a large number of individual samples, so we make a further approximation by evaluating this term over a subset of the data.
We still cannot compute the right side of (4.2) directly, so we approximate it by the sample gradient and arrive at the following condition:
If (4.3) is satisfied, then it is likely that is small. In other words, the solution should satisfy the first order optimality conditions evaluated over the entire domain, so we can stop optimization. On the other hand, when (4.3) does not hold, it still guides us in selecting the next sample size . To see this, suppose that the gradient variance doesn’t change significantly by increasing the size of the data set. That is, if
then the appropriate choice of sample size is such that
We note that the convergence test (4.3) and batch size selection scheme (4.4) we have derived are quite general in the sense that they are not specific to learning solutions to the HJB equation. They can be applied to many data-driven optimization problems where data is difficult to generate but the size of the data set can increase over time. Furthermore, these results enable the use of existing packages for second order and constrained optimization in such problems.
4.3 Application to neural network training and data generation
The sample size selection criterion (4.4) we propose indicates how many data are necessary to satisfy the convergence test (4.3), assuming a uniform sampling from the domain. In practice, since all the data we generate will be new, we can choose to generate new data where it is needed most, hence the term adaptive sampling. This condition for generating new data can be interpreted in many ways. In this paper, we simply sample points where is large. Regions of the value function with large gradients tend to be steep or complicated, and thus may benefit from having more data to learn from. Moreover, this sampling scheme is easy to implement, efficient, and physically motivated.
To implement this, suppose we would like new data points with input locations denoted by
To choose where to put these points, we first randomly sample a large set of candidate initial conditions from . Then a quick pass through the NN can give us the predicted gradient at each candidate point,
We then simply pick the points with largest norm. For each of these points, we then simulate the system dynamics using the partially-trained NN as the closed-loop controller. In most all cases, this yields a solution which is fairly close to the optimal state and costate . By using this solution as an initial guess for the BVP solver, we then quickly obtain a solution to the TPBVP (2.12-2.13) for the full time interval without the need for the time-marching trick described in Section 2.
This algorithm enables us to build up a rich data set and a high-fidelity model of . Moreover, this data set is not constrained to be within a small neighborhood of some nominal trajectory - it will contain points from the entire domain of
, and we can adjust the process as described to generate more data near complicated features of the value function. As we progressively refine the NN model, we can enforce stricter convergence tolerances for the internal optimizer, as well as adjust hyperparameters like the gradient loss weightor the number of terms in the L-BFGS Hessian approximation. As the NN is already partially-trained, fewer iterations should be needed for convergence in each round so we can afford to make each iteration more expensive.
5 Application to rigid body satellite rotation
5.1 Rigid body attitude control
As a practical test of our proposed method, we consider the six-state rigid body model of a satellite studied by Kang and Wilcox [17, 18]. Using the sparse grid characteristics method, they demonstrate the feasibility of interpolating the value function in six dimensions, and use this for model predictive feedback control (MPC) of the nonlinear system. In , this is accomplished even for the underactuated case. We use their successful results as a baseline for evaluating our method.
be an inertial frame of orthonormal vectors and letbe a body frame. The state of the satellite is then written as . Here is the attitude of the satellite represented in Euler angles,
in which , , and are the angles of rotation around , , and , respectively, in the order – see  – and is the angular velocity in the body frame,
The state dynamics are given by
Here are matrix-valued functions defined as
Further, is a combination of the inertia matrices of the momentum wheels and the rigid body without wheels, is the total constant angular momentum of the system, and is a constant matrix where is the number of momentum wheels. To control the system, we apply a torque .
We consider the fully-actuated case where . Following , we set
The optimal control problem is
We sample initial values from the domain
and solve the two-point BVP (2.12) at each initial value , , using a four-stage Lobatto IIIa method . It is important to note that in , this problem is solved only for initial time . Following their lead, we generate data only for and . This means that the system is controlled for the whole time interval by . Consequently, the control is actually implemented as MPC rather than time-dependent optimal control. In other words, at each time when the integrator needs to evaluate the control, instead of computing , we assume and return . This is justifiable because the optimal control problem (5.1) is time-invariant.
5.2 Learning the value function
In this section, we present numerical results of our implementation of a NN for learning the value function of the rigid body rotation problem (5.1). We experiment with changing the number of training data and the weight on the value gradient loss term (3.5), and compare the results to those obtained in . Notably, we demonstrate that we can match the accuracy of the sparse grid characteristics method using a data set which is over 40 times smaller, and at minimal additional computational cost.
We implement a simple feedforward NN in TensorFlow and train it to approximate
. Many different configurations of numbers of hidden layers and neurons work for this problem, but we find that three hidden layers with 64 neurons in each hidden layer sufficient. We optimize using the SciPy interface for the L-BFGS optimizer[15, 6]. For validation, we generate a data set containing data points, and keep this fixed throughout all the tests. As a baseline, the sparse grids characteristic method with grid points achieves a relative mean absolute error (RMAE) of on this data set.
Figure 1 displays the results of a series of tests in which we fix the value gradient loss weight and vary the size of the training data set. In this experiment, we do not do adaptive sampling: training is conducted only for one round using the full data set provided, which is fixed for each different value of . However, we find that we need only a small amount of data to get reasonable results. We highlight that with just 1024 data points, we can match the accuracy of the sparse grid characteristics method which uses points. With 8192 data points, the trained NN is three times as accurate as the sparse grid method. In addition, these NNs need only a couple minutes to train, so minimal computational effort beyond data generation is required.
This level of accuracy for small data sets is only reached by properly tuning . In particular, we are unable reach a level of accuracy comparable to that of the sparse grid method by pure regression (3.1). For this problem, we can improve accuracy by about an order of magnitude by choosing around to . Finding a good choice of is dependent on the problem, which generally dictates the ratio of sizes of the two loss terms (3.4) and (3.5), as well as the size of the data set and other hyperparameters. In addition, results can vary depending on randomly drawn data sets and NN parameter initializations. Even so, since training is fast, it is usually not difficult to find good hyperparameter settings and NN initializations.
5.3 Adaptive sampling
Performing a systematic study of the adaptive sampling and model refinement technique proposed in Section 4 is rather difficult, since a successful implementation depends on various hyperparameter settings (which can change each optimization round) as well as random chance, since data points are chosen in a partially random way and the randomly-initialized NN training problem is highly non-convex. For this reason, in this section we show a few conservative results which we feel demonstrate its potential.
Figure 2 shows the progress of the validation error during training when using adaptive sampling starting from a data set with points. This is the same data set with 512 points used in Section 5.2. We also keep the gradient loss weight , set the convergence tolerance in (4.3) to to compensate for the large gradients, and we use and data points in the following rounds. The data set sizes are selected to be the smallest power of 2 which is larger than the number of points recommended by eq. (4.4). Results are compared to the progress when training on a fixed data set with points (the same as used in Section 5.2). To fully realize the potential of the method, one needs to adjust hyperparameters like , , and internal optimizer parameters in each round. How to do this in an effective and efficient way is beyond the scope of this paper, and remains a topic for future research. Still, even with a naïve implementation, the final accuracy is just as good as that obtained when training on a fixed data set with , even though training started with only points.
Next, we investigate the convergence of the BVP solver with the two strategies discussed in this paper. First, we use the time-marching trick discussed in Section 2. Second, instead of time-marching we take initial guesses generated by simulating the closed-loop dynamics with a NN controller and solve the full TPBVP directly. Results are given in Tables 2 and 2. For these tests, we use 1,000 initial conditions with the largest predicted gradient norm, , picked from a set of 100,000 randomly sampled candidate points. The set of initial conditions is fixed for all tests.
|number of time steps||BVP convergence||mean solution time|
|validation RMAE||BVP convergence||mean solution time|
When we attempt to solve the TPBVP with no time-marching (the line in Table 2), the proportion of convergent solutions is extremely small. This obviates the need for good initial guesses. As shown in Table 2, we reliably obtain solutions for this problem when we use at least time intervals. We note that the initial conditions are purposefully chosen to be difficult – if we simply take random samples from the domain , the proportion of convergent solutions increases considerably. We see from the results presented in Table 2, using initial guesses generated by even a poorly-trained NN ( validation RMAE), the chance of convergence is just as high as when using time intervals for time-marching. Moreover, we find that by training even slightly more accurate NNs we easily get over convergence. We also see that these solutions take less time than equivalently reliable solutions obtained with time-marching. While the difference is insignificant for individual data points, when building large data sets this can result in significant time savings. Together, these results suggest that the proposed adaptive sampling and model refinement framework has the potential to be useful in solving other, more difficult problems.
5.4 Closed-loop performance
Lastly, we show that the trained NN is capable of stabilizing the nonlinear system dynamics. The control is calculated using eq. (3.6) to get
Since and are constant matrices, we pre-compute the product . Hence evaluation of the control requires only a forward pass through the computational graph of the gradient and a matrix multiplication. Recall that we are implementing MPC, so the control is actually computed as for all times .
In Figure 3, we plot two typical closed-loop trajectories starting from randomly sampled initial conditions with large gradient norm, and see that the controller asymptotically stabilizes these trajectories. We also note that short computation time is critical for implementation in real systems, and this is achieved here as each evaluation of the control takes only seconds.
In this paper, we have developed a new approach to solving HJB equations and real-time computational optimal feedback control. Because the method is not grid-dependent, it can be applied even to high dimensional systems. We also emphasize that while our method is fundamentally data-driven, by utilizing the costate data we realize more physically-consistent solutions. As we have demonstrated for the rigid body rotation problem, this enables us to obtain highly accurate results with surprisingly small data sets. Finally, the NN is not trained just in the neighborhood of some nominal trajectory, meaning the feedback control is valid for a large range of dynamic states.
The proposed method is not only a consumer of data, but through adaptive data generation we can also construct large data sets with points anywhere in the (semi-global) domain. In particular, such data points can be generated near complicated or non-smooth regions of the value function to aid in learning. This in turn allows us to train more accurate NN models, or even employ other data-driven methods as desired. As a tangential result of this work, the adaptive data generation scheme is quite general, so we expect that, with some modifications, it can be applied to other interesting problems.
These promising results still leave plenty of room for future development. For instance, in the paper we considered only fixed final time problems. Infinite horizon problems can be solved with a straightforward of this method, but free final time problems require a quite different solution structure. Of special interest are problems with state and control constraints. These impose additional difficulties on both data generation and neural network modeling; addressing these obstacles would allow for effective solutions of a large class of difficult and important problems.
-  Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. http://www.tensorflow.org/, 2015–.
-  E.G. Al’brekht. On the optimal stabilization of nonlinear systems. Journal of Applied Mathematics and Mechanics, 25(5):1254–1266, 1961.
-  Raghu Bollapragada, Dheevatsa Mudigere, Jorge Nocedal, Hao-Jun Michael Shi, and Ping Tak Peter Tang. A progressive batching L-BFGS method for machine learning. arXiv preprint: arXiv:1802.05374, 2018.
-  Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
-  Richard H. Byrd, Gillian M. Chin, Jorge Nocedal, and Yuchen Wu. Sample size selection in optimization methods for machine learning. Mathematical Programming, 134(1):127–155, Aug 2012.
-  Richard H. Byrd, Peihang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16:1190–1208, September 1995.
-  Simone Cacace, Emiliano Cristiani, Maurizio Falcone, and Athena Picarelli. A patchy dynamic programming scheme for a class of Hamilton–Jacobi–Bellman equations. SIAM Journal on Scientific Computing, 34(5):A2625–A2649, 2012.
-  Yat Tin Chow, Jérôme Darbon, Stanley Osher, and Wotao Yin. Algorithm for overcoming the curse of dimensionality for state-dependent Hamilton-Jacobi equations. J. Comput. Phys., 387:376–409, 2019.
-  Jérôme Darbon and Stanley Osher. Algorithms for overcoming the curse of dimensionality for certain Hamilton-Jacobi equations arising in control theory and elsewhere. arXiv preprint: arXiv:1605.01799, 2016.
-  James Diebel. Representing attitude: Euler angles, unit quaternions, and rotation vectors. https://www.astro.rug.nl/software/kapteyn-beta/_downloads/attitude.pdf, 2006.
-  M. Falcone and R. Ferretti. Semi-Lagrangian Approximation Schemes for Linear and Hamilton-Jacobi Equations. Society for Industrial and Applied Mathematics, 2013.
-  J. Han, A. Jentzen, and W. E. Solving high-dimensional partial differential equations using deep learning. arXiv.org, arXiv:1707.02568, July 2017.
-  Yegorov I. and P.M. Dower. Perspectives on characteristics based curse-of-dimensionality-free numerical approaches for solving Hamilton-Jacobi equations. Appl. Math. Optim., 2018.
-  Frank Jiang, Glen Chou, Mo Chen, and Claire J. Tomlin. Using neural networks to compute approximate and guaranteed feasible Hamilton-Jacobi-Bellman PDE solutions. arXiv preprint: arXiv:1611.03158, 2016.
-  Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python. http://www.scipy.org/, 2001–.
-  Wei Kang, P.K. De, and A. Isidori. Flight control in a windshear via nonlinear methods. In Proceedings of the 31st IEEE Conference on Decision and Control, pages 1135–1142, 1992.
-  Wei Kang and Lucas C. Wilcox. A causality free computational method for HJB equations with application to rigid body satellites. In AIAA Guidance, Navigations, and Control Conference. American Institute of Aeronautics and Astronautics, 2015.
-  Wei Kang and Lucas C. Wilcox. Mitigating the curse of dimensionality: Sparse grid characteristics method for optimal feedback control and HJB equations. Computational Optimization and Applications, 68(2):289–315, Nov 2017.
-  Wei Kang and Lucas C. Wilcox. Solving 1D conservation laws using Pontryagin’s minimum principle. Journal of Scientific Computing, 71(1):144–165, 2017.
-  J Kierzenka and Lawrence Shampine. A BVP solver that controls residual and error. European Society of Computational Methods in Sciences and Engineering (ESCMSE) Journal of Numerical Analysis, Industrial and Applied Mathematics, 3:27–41, 01 2008.
-  Daniel Liberzon. Calculus of Variations and Optimal Control Theory: A Concise Introduction. Princeton University Press, Princeton, NJ, USA, 2011.
-  D.L. Lukes. Optimal regulation of nonlinear dynamical systems. SIAM Journal on Control, 7(1):75–100, 1969.
-  Carmeliza Navasca and Arthur. J. Krener. Patchy solutions of Hamilton-Jacobi-Bellman partial differential equations. In A. Chiuso, S. Pinzoni, and A. Ferrante, editors, Modeling, Estimation and Control, volume 364 of Lecture Notes in Control and Information Sciences, pages 251–270. Springer Berlin Heidelberg, 2007.
-  Maziar Raissi, Paris Perdikaris, and George Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378, 11 2018.
-  Yuval Tassa and Tom Erez. Least squares solutions of the HJB equation with neural network value-function approximators. IEEE Transactions on Neural Networks, 18(4):1031–1041, July 2007.