We consider a
-dimensional system of nonlinear first-order ordinary differential equations,
is the Lipschitz continuous vector field andis a given finite-valued probability density function (PDF). This formulation includes uncertainty in parameter space if we augment the state with the set of uncertain parameters and the vector field with . Randomness in the initial condition induces randomness in the future state . We can think of as a random process defined by the forward flow map
with a distribution characterized by the PDF .
Since the vector field is nonlinear, the flow map and thus the state PDF
can be quite complicated. Often, statistical information such as the mean and variance of the stateprovide only poor characterizations of the distribution and the sensitivity of the state to perturbations in the initial conditions or parameters. Thus for many applications it is desirable to obtain a representation of the PDF itself.
Standard techniques for estimating
include kernel density estimation (e.g., Ch. 6), generalized polynomial chaos with flow map composition , probabilistic collocation , PDF methods , and data-driven methods for conditional PDFs 
. Many of these methods suffer from the curse of dimensionality, and even those that scale well to high dimensions are usually very data-hungry.
To fill this gap, we propose a computational method for approximating joint and conditional PDFs for systems where acquiring large amounts of sample data may be prohibitively expensive. Specifically, we model the time-dependent PDF with a neural network (NN), which we train by solving the PDF evolution equation (seeSection 2
) through a combination of least squares and method of characteristics approaches. This semi-supervised learning approach exploits both knowledge of the underlying dynamics and the ability to generate data. Once the NN is trained, it can predict the density atarbitrary points in the computational domain at orders of magnitude more efficiently than numerical integration.
2 The Liouville equation
for the linear operator defined as
Equation creftypecap 2.1
is a partial differential equation (PDE) inspatial dimensions plus time. Solving it directly requires a discretization of space and time which must be extremely fine because the support of the PDF can twist into thin curves within state space. Consequently, solving creftype 2.1 can be challenging even in low dimensions, and possibly intractable when the system is large.
On the other hand, using the method of characteristics one can derive a formal expression for the solution of creftype 2.1:
Here , where is the inverse flow map which maps the state back to the initial condition which generated it. For finite time , this map is the unique inverse of the forward flow map creftype 1.2 under standard smoothness assumptions on the vector field . This representation effectively decouples pointwise solutions creftype 2.1, allowing these to be computed independently. Such causality-free characteristics-based methods have been used to solve various types of PDEs, such as scalar conservation laws , semilinear parabolic PDEs , and Hamilton-Jacobi-Bellman (HJB) equations [22, 6, 14].
To use creftype 2.3, one must first find representations of the forward and inverse flow maps, which are high-dimensional nonlinear functions. Approximating these maps is perhaps more difficult than solving creftype 2.1 itself, and is a subject of ongoing research (see e.g. [12, 13]). Still, creftype 2.3 allows us to evaluate the density along characteristics quite easily. In fact, every time a sample trajectory of creftype 1.1 is computed, we can simultaneously propagate by
which can obtained by rearranging terms in creftypeplural 2.1 to 2.2. Having such density data readily available suggests a data-driven approach, but one which is augmented by knowledge of the underlying physics, i.e. that obeys the Liouville equation creftype 2.1.
3 Deep learning for probability density function approximation
Deep learning offers an efficient way to approximate high-dimensional PDFs. Deep NNs are well-known for their ability to approximate arbitrary high-dimensional nonlinear functions given sufficient training data, and are the focus of much recent research in scientific computing. They are also computationally efficient, allowing the estimation of joint and conditional densities at millions of spatio-temporal coordinates in seconds. This in turn allows computation of statistics like means and covariances in high dimensions through Markov Chain Monte Carlo (MCMC) algorithms.
In this paper, we model the desired PDF with fully-connected feedforward NNs, which we denote by for a parameterization . Thus we have
where each layer , is defined as
for weight matrices
, and nonlinear activation functions. The trainable parameters are . In this paper we use for all hidden layers , and a linear output layer (i.e. is the identity map).
In a naïve application of deep learning, we are given a large set of training data consisting of inputs and outputs , where . Each data point corresponds to a point in the time series of a sample trajectory, of which there are many. To train the NN, we solve a nonlinear regression problem over to match the NN predictions with the training data .
Unfortunately, deep NNs notoriously require enormous quantities of data to train, but we consider the case where data is not necessarily abundant. This situation can arise when numerical integration of a system is expensive, or if we are building a model from real-world measurements of a system with known structure but uncertain parameters. To overcome a relative lack of data, we adapt the physically-motivated machine learning strategies proposed by for solving high-dimensional HJB equations in optimal control, and  and  for other PDEs. We outline this process in the following sections.
We also note that  recently proposed a generative-adversarial strategy to train probabilistic surrogate models of systems like creftype 1.1. This allows one to sample from , whereas our method predicts the density itself.
3.1 Physics-informed learning
In , the authors improve data-efficiency in training by augmenting the regression loss with an additional term which encourages the NN to learn the gradient of the solution. In the context of optimal control, this gradient is computed naturally for each sample trajectory. No such simple relationship exists for the PDF evolution equation. Thus for physically-motivated regularization we turn to the least squares methods proposed by  and . Simply put, we would like to satisfy the Liouville equation creftype 2.1. To this end we define the PDE residual
which of course is everywhere zero if . The partial derivatives and appearing in creftype 3.1 can be calculated using automatic differentiation. This standard feature of machine learning software packages allows efficient computation of exact partial derivatives, and is one of the essential drivers of the success of deep learning.
We minimize the residual over a set of collocation points, . Such collocation points may include the training data as well as randomly sampled points which require no numerical integration to generate. Approximating the PDE at collocation points can alleviate the need for large amounts of training data, as well as generating a representation of the PDF which is guided by the underlying physics creftype 2.1, rather than just regression. The main difference between our method and least squares approaches is that we exploit the ability to generate data along the characteristics of the PDF. We find that this two-pronged strategy is more effective than direct minimization of the PDE residual and boundary conditions.
We now introduce the physics-informed learning problem,
Here is the mean square PDE residual,
For the regression term, , we find that a weighted mean square error on the log PDF is effective:
for some weights . Examples include
for . The choice of appropriate weight depends on the scaling of the problem. In the implementation, we actually have the NN predict the log PDF; then to obtain we exponentiate the NN output. Therefore the NN model naturally preserves positivity of the PDF.
Lastly, is a scalar weight which must be carefully chosen to balance the impact of the PDE residual term creftype 3.3 with the regression loss creftype 3.4, which serves as a boundary condition in the present context. Good choices of are highly problem-dependent, and for some problems like the rigid body problem in Section 4.2, we have found success with increasing over the course of training.
3.2 Adaptive sample size selection
The machine learning problem introduced in Section 3.1 is atypical in that data generation may be expensive for stiff, chaotic, or high-dimensional systems. Data scarcity can be alleviated by use of randomly sampled collocation points, but this is often not enough to obtain good results. In addition, randomly sampled collocation points might completely miss the support of the PDF, which can be twisted into small manifolds in the higher-dimensional space. For example, see the double gyre in  or Figure 2 in this paper. It is worth mentioning that PDFs like this are poorly characterized by simple statistics such as means and variances.
On the other hand, we have the freedom to generate additional data in parallel to training, hence we can treat this as a semi-online learning problem, incorporating additional data as it becomes available. For simplicity, in the current implementation we train the NN in multiple rounds, allowing the optimizer to converge in each round. In between rounds we generate additional data as needed, which we use to continue training the model.
Standard early stopping tests in machine learning rely on generating extra data sets for measuring generalization accuracy, which can be problematic if data is scarce in the first place. Further, such tests do not provide estimates of how much data is needed for convergence. To address these issues, we modify the convergence test and sample size selection scheme developed by  to the present problem.
Let and denote the training data and collocation sets available in the th training round. The method proposed in  works for a single data set, so we apply it to each loss term individually111 We will postpone extending this idea to deal with loss functions with multiple data sets future work, but note that it can actually be advantageous to adjust each data set independently.
We will postpone extending this idea to deal with loss functions with multiple data sets future work, but note that it can actually be advantageous to adjust each data set independently.. Thus we consider to be sufficiently large if
where is a scalar parameter and denotes the sample variance222Computing the variance for large data sets can become expensive, so when necessary we evaluate this term over a subset of the data.,
Similarly for we require
Intuitively, the convergence tests creftypeplural 3.5 to 3.6 check if the sample gradients are good approximations of the “true” gradients taken over infinitely large data sets. Satisfying creftypeplural 3.5 to 3.6 do not indicate if the trained NN is a good model, only that feeding it more data would likely not improve it significantly. The main advantage of these tests is that if they are not satisfied, then they immediately provides estimates for how much data should be generated for the next round, namely
and similarly for ,
In the above, are scalar parameters which prevent the sample size from growing too fast. For details we refer the reader to .
3.3 Transfer learning for extending time horizons
For some problems where the density changes quickly over time or where a longer time horizon is desired, it can be difficult to directly learn the distribution over all . In such cases, we employ a simple transfer learning approach to gradually extend the time horizon learned by the NN. To do this, we pick a monotone increasing sequence of time horizons , where and . We first train the NN to model the density over , then re-train the NN over . We continue in this way until the NN learns the full time horizon, .
3.4 Model validation
The final component of any machine learning algorithm is model validation. While finding theoretical convergence properties remains an active area of research (see e.g. [9, 18, 23]), our data-driven approach naturally allows for empirical testing of model accuracy. To do this, we first construct a validation data set, , of trajectories generated from initial conditions drawn independently of those used for training. We evaluate each trajectory in at the same sequence of time snapshots, , where and . Then for each we estimate the normalized root mean square error (NRMSE) of the model predictions:
Here is the th snapshot along the th trajectory. This simple validation framework provides a characterization of the NN’s performance over the time horizon, which can be compared between target densities of different scale.
We summarize the full training procedure in Algorithm 1.
4.1 Kraichnan-Orszag problem
We consider independently normally distributed initial conditions such that the joint PDFstraddles the “stochastic discontinuity” on the axis:
The Kraichnan-Orszag system is a canonical test problem in uncertainty quantification. We seek to model the time-evolving PDF for , and compare the performance of our method using different optimization and collocation point selection strategies.
, we implement a fully-connected feedforward NN with four hidden layers with 64 neurons each. For the weights in the regression loss termcreftype 3.4, we pick , . We consider three strategies to train the network:
Adam: We train for one round with a fixed data set, . We set and optimize using Adam 
for a pre-fixed number of epochs and with manually tuned learning rates. We randomly sample a mini-batch of collocation points at each iteration. This implementation is similar to the “Deep Galerkin Method” (DGM) proposed by, but in the context of solving creftype 2.1 with trajectory data serving as a boundary condition.
Adaptive L-BFGS: W use the multi-round model refinement scheme outlined in Algorithm 1, optimizing with L-BFGS.
The first two implementations use a data set of 500 sample trajectories evaluated at time snapshots each, for a total of the training data. The adaptive learning implementation starts from a smaller data set, , composed of 250 trajectories from the fixed data set, but we add new data each round. We choose to be the union of and an equally-sized set of uniformly sampled collocation points, also adding more points progressively. Lastly we set , , , and . For all implementations we directly learn the density evolution over .
for each of the implementations tested. From this plot, it is clear that training for a single round with L-BFGS is more effective than with Adam and collocation points randomly selected at each iteration. This was true for all hyperparameter configurations tested with Adam. InFigure 1, we see that the resulting NN is just as accurate as the other training strategies, even though it was trained without full access to a large data set. This suggests that the training strategies discussed in Section 3 can be useful for solving data-scarce problems.
Lastly in Figure 2 we plot the predicted marginal density,
. The NN predicts the joint density on a tensor grid, which we then marginalize by quadrature integration. While dense grid integration is impractical in higher dimensions, here we use it only to visualize the complexity of the PDF captured by the NN.
4.2 Application to quantifying controller robustness
In this section we demonstrate the potential of the proposed method for use in evaluating controllers. We consider a six-dimensional rigid body model of a satellite controlled by reaction wheels. The state is written as , where is the attitude of the satellite represented in Euler angles (roll, pitch, yaw) and denotes the angular velocity in the body frame. In this representation, the state dynamics are
where are nonlinear matrix-valued functions333See  for details.; is the inertia matrix; is the constant angular momentum of the system; and is a constant matrix. The system is controlled by applying a torque . We let
for a random parameter .
To stabilize the system, we design a linear quadratic regulator (LQR) by solving a linearized infinite horizon optimal control problem,
where , are constant matrices obtained by linearizing creftype 4.3 about and using a nominal parameter value . We choose the weights as , , and . It is well-known that the minimizer of creftype 4.4 is a linear state feedback controller, , and that the gain matrix is computed by solving a continuous algebraic Riccati equation (offline).
Using the proposed density propagation framework, we model the evolution of a density of initial states over , subject to the nonlinear dynamics creftype 4.3 under LQR control. The system is augmented with an additional state for which , bringing the total dimension to seven. Suppose that at the states are independently distributed according to
The bimodal Gaussian mixture distribution for is chosen to allow us to simultaneously study the behavior when is near the nominal value , and a scenario where , representing estimation error or actuator inefficiency.
For this problem we train a NN with four hidden layers, each with 128 neurons. The weights in the regression loss term creftype 3.4 are set to , . In this problem the magnitude of the density grows almost exponentially in time, so we implement the transfer learning strategy from Section 3.3 with time horizons,
with corresponding PDE penalty weights,
In the first three time horizons, we train for a single round on a fixed data set, , constructed from 500 sample trajectories. The collocation set includes and an equal number of randomly sampled points, re-sampling every time horizon. For the final time horizon, , we implement the adaptive learning algorithm with , and . With these parameters, the convergence criteria (3.5-3.6) are satisfied after three rounds, observing data from a total of 984 trajectories evaluated at time snapshots each.
In Figure 3 we plot the NRMSE creftype 3.9 of the NN at each time horizon, evaluated on a validation data set of 500 trajectories with time snapshots each. This is compared to a direct solution attempt over the full time horizon, which turns out to be quite inaccurate. Thus we see that the transfer learning approach makes the proposed framework viable for approximating the rapidly growing density.
Figure 4 shows predicted conditional densities of for different values of . In Figure 3(a), we plot the case where , the nominal value for which the control is designed. We can see that the density – which is initially spread out in but tight in – rotates so that converges to zero while spreads out. Then in Figure 3(b) we plot the case where . We can clearly see that the density increases more slowly than when , indicating poorer stability. Still, overall the results are qualitatively similar, showing that the LQR controller has a large domain of attraction in and is fairly robust to changes in , but does a poor job of stabilizing . Similar results are easily obtained for and with the same NN. Finally, we observe that the NN predicts seemingly erroneous mass on the sides of the domain. But this is actually reasonable since and are periodic, and a cursory examination of the data reveals that many trajectories do in fact wrap around the periodic boundaries.
5 Concluding remarks
In this paper we have introduced a new data-driven approach to uncertainty propagation in nonlinear dynamic systems. Leveraging the approximating power of NNs, we model the probability density at any location and time , without discretizing the space-time domain. Numerical results suggest that the proposed method is competitive with other techniques for solving Liouville equations in moderate to high-dimensional systems. Different from most other approaches, our method may be more effective for approximating densities which are spatially complex, but with magnitude that doesn’t vary too drastically over time. Thus we envision our method not as a replacement for existing techniques, but rather as filling a certain niche in uncertainty quantification: modeling densities which are not
characterized well by statistics like means and variances. Finally, we believe that this work represents an important step in developing physics-driven machine learning methods for optimal control in the space of probability distributions.
-  M. Abadi, A. Agarwal, P. Barham, et al., TensorFlow: Large-scale machine learning on heterogeneous systems. http://www.tensorflow.org/, 2015–.
-  C. Brennan and D. Venturi, Data-driven closures for stochastic dynamical systems, J. Comput. Phys., 372 (2018), pp. 281–298.
-  S. Brooks, A. Gelman, G. L. Jones, and X.-L. Meng, eds., Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC, Boca Raton, 2011.
-  R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM J. Sci. Comput., 16 (1995), pp. 1190–1208.
-  H. Cho, D. Venturi, and G. E. Karniadakis, Numerical methods for high-dimensional probability density function equations, J. Comput. Phys., 305 (2016), pp. 817–837.
-  Y. T. Chow, J. Darbon, S. Osher, and W. Yin, Algorithm for overcoming the curse of dimensionality for state-dependent Hamilton-Jacobi equations, J. Comput. Phys., 387 (2019), pp. 376–409.
-  M. Ehrendorfer, The Liouville equation and its potential usefulness for the prediction of forecast skill. Part I: Theory, Monthly Weather Rev., 122 (1994), pp. 703–713.
-  J. Han, A. Jentzen, and W. E, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
-  J. Han and J. Long, Convergence of the deep BSDE method for couplded FBSDEs, arXiv:1811.01165, (2018).
-  W. Kang and L. C. Wilcox, Solving 1D conservation laws using Pontryagin’s minimum principle, J. Sci. Comput., 71(1) (2017), pp. 144–165.
-  D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980, (2014).
-  D. M. Luchtenburg, S. L. Brunton, and C. W. Rowley, Long-time uncertainty propagation using generalized polynomial chaos and flow map composition, J. Comput. Phys., 274 (2014), pp. 783–802.
-  B. Lusch, J. N. Kutz, and S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics, Nature Communications, 9 (2018).
-  T. Nakamura-Zimmerer, Q. Gong, and W. Kang, Adaptive deep learning for high-dimensional Hamilton-Jacobi-Bellman equations, arXiv:1907.05317, (2019).
-  S. A. Orszag and L. Bissonnette, Dynamical properties of truncated Wiener‐Hermite expansions, Phys. Fluids, 10 (1967), pp. 2603–2613.
-  M. Raissi, P. Perdikaris, and G. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, J. Comput. Phys., 378 (2019), pp. 686–707.
-  D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, Inc., Hoboken, 2nd ed., 2015.
-  J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, J. Comput. Phys., 375 (2018), pp. 1339 – 1364.
-  M. Tatang, W. Pan, R. Prinn, and G. McRae, An efficient method for parametric uncertainty analysis of numerical geophysical model, J. Geophys. Res., 102 (1997), pp. 21925–21932.
-  X. Wan and G. E. Karniadakis, An adaptive multi-element generalized polynomial chaos method for stochastic differential equations, J. Comput. Phys., 209 (2005), pp. 617–642.
-  Y. Yang and P. Perdikaris, Adversarial uncertainty quantification in physics-informed neural networks, J. Comput. Phys., 394 (2019), pp. 136–152.
-  I. Yegorov and P. M. Dower, Perspectives on characteristics based curse-of-dimensionality-free numerical approaches for solving Hamilton-Jacobi equations, Appl. Math. Optim., (2018).
-  D. Zhang, L. Lu, L. Guo, and G. E. Karniadakis, Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems, J. Comput. Phys., 397 (2019).