1 Introduction
We consider an $n$-dimensional system of nonlinear first-order ordinary differential equations,
(1.1) $\dot{x}(t) = f\left(t, x(t)\right), \qquad x(0) = x_0 \sim p_0,$
where $f$ is the Lipschitz continuous vector field and $p_0$ is a given finite-valued probability density function (PDF). This formulation includes uncertainty in parameter space if we augment the state with the set of uncertain parameters and the vector field with zero dynamics for those parameters. Randomness in the initial condition $x_0$ induces randomness in the future state $x(t)$. We can think of $x(t)$ as a random process defined by the forward flow map
(1.2) $x(t) = \Phi_t(x_0),$
with a distribution characterized by the PDF $p = p(t, x)$.
Since the vector field $f$ is nonlinear, the flow map and thus the state PDF $p(t, x)$ can be quite complicated. Often, statistical information such as the mean and variance of the state provides only a poor characterization of the distribution and of the sensitivity of the state to perturbations in the initial conditions or parameters. Thus for many applications it is desirable to obtain a representation of the PDF itself.

Standard techniques for estimating $p(t, x)$ include kernel density estimation (e.g. [17], Ch. 6), generalized polynomial chaos with flow map composition [12], probabilistic collocation [19], PDF methods [5], and data-driven methods for conditional PDFs [2]. Many of these methods suffer from the curse of dimensionality, and even those that scale well to high dimensions are usually very data-hungry.
To fill this gap, we propose a computational method for approximating joint and conditional PDFs for systems where acquiring large amounts of sample data may be prohibitively expensive. Specifically, we model the time-dependent PDF with a neural network (NN), which we train by solving the PDF evolution equation (see Section 2) through a combination of least squares and method of characteristics approaches. This semi-supervised learning approach exploits both knowledge of the underlying dynamics and the ability to generate data. Once the NN is trained, it can predict the density at arbitrary points in the computational domain orders of magnitude more efficiently than numerical integration.

2 The Liouville equation
It is well-known (see e.g. [7]) that the PDF $p(t, x)$ evolving according to the nonlinear dynamics (1.1) satisfies the Liouville transport equation,
(2.1) $\dfrac{\partial p}{\partial t}(t, x) + \mathcal{L}[p](t, x) = 0, \qquad p(0, x) = p_0(x),$
for the linear operator $\mathcal{L}$ defined as
(2.2) $\mathcal{L}[p](t, x) := \nabla \cdot \left[ p(t, x)\, f(t, x) \right].$
Equation (2.1) is a partial differential equation (PDE) in $n$ spatial dimensions plus time. Solving it directly requires a discretization of space and time which must be extremely fine, because the support of the PDF can twist into thin curves within the state space. Consequently, solving (2.1) can be challenging even in low dimensions, and possibly intractable when the system is large.

On the other hand, using the method of characteristics one can derive a formal expression for the solution of (2.1):
(2.3) $p(t, x) = p_0\left(\Phi_t^{-1}(x)\right) \exp\left( -\int_0^t \nabla \cdot f\left(s, \Phi_s\left(\Phi_t^{-1}(x)\right)\right) ds \right).$
Here $\Phi_t^{-1}$ is the inverse flow map, which maps the state $x = x(t)$ back to the initial condition $x_0$ which generated it. For finite time $t$, this map is the unique inverse of the forward flow map (1.2) under standard smoothness assumptions on the vector field $f$. This representation effectively decouples pointwise solutions of (2.1), allowing these to be computed independently. Such causality-free, characteristics-based methods have been used to solve various types of PDEs, such as scalar conservation laws [10], semilinear parabolic PDEs [8], and Hamilton-Jacobi-Bellman (HJB) equations [22, 6, 14].
To use (2.3), one must first find representations of the forward and inverse flow maps, which are high-dimensional nonlinear functions. Approximating these maps is perhaps more difficult than solving (2.1) itself, and is a subject of ongoing research (see e.g. [12, 13]). Still, (2.3) allows us to evaluate the density along characteristics quite easily. In fact, every time a sample trajectory of (1.1) is computed, we can simultaneously propagate the density by
(2.4) $\dfrac{d}{dt}\, p\left(t, x(t)\right) = -p\left(t, x(t)\right)\, \nabla \cdot f\left(t, x(t)\right),$
which can be obtained by rearranging terms in (2.1) and (2.2). Having such density data readily available suggests a data-driven approach, but one which is augmented by knowledge of the underlying physics, i.e. that $p(t, x)$ obeys the Liouville equation (2.1).
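Equation (2.4) means the density along a trajectory can be obtained by augmenting the ODE with a single extra state. The sketch below uses a hypothetical one-dimensional field $f(x) = -ax$ (an illustrative assumption, not an example from the paper) so that the log-density propagated via (2.4) can be checked against the known Gaussian solution:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative 1-d field f(x) = -a*x, for which div f = -a. By (2.4),
# d(log p)/dt = -div f = a along every trajectory.
a = 0.5

def augmented_rhs(t, z):
    """z[0] is the state x; z[1] is ell = log p(t, x(t))."""
    x, ell = z
    return [-a * x, a]

x0 = 1.3
logp0 = -0.5 * x0**2 - 0.5 * np.log(2 * np.pi)  # x0 drawn from N(0, 1)
sol = solve_ivp(augmented_rhs, (0.0, 2.0), [x0, logp0], rtol=1e-10, atol=1e-12)
xT, logpT = sol.y[:, -1]

# Analytic check: x(t) = x0*exp(-a*t) is N(0, exp(-2*a*t)) distributed,
# so log p(t, x(t)) = log p0(x0) + a*t.
assert abs(logpT - (logp0 + a * 2.0)) < 1e-6
assert abs(xT - x0 * np.exp(-a * 2.0)) < 1e-6
```

Working with the log-density, as here, also sidesteps underflow when the density varies over many orders of magnitude.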
3 Deep learning for probability density function approximation
Deep learning offers an efficient way to approximate high-dimensional PDFs. Deep NNs are well-known for their ability to approximate arbitrary high-dimensional nonlinear functions given sufficient training data, and are the focus of much recent research in scientific computing. They are also computationally efficient, allowing the estimation of joint and conditional densities at millions of spatiotemporal coordinates in seconds. This in turn allows computation of statistics like means and covariances in high dimensions through Markov Chain Monte Carlo (MCMC) algorithms [3].

In this paper, we model the desired PDF with fully-connected feedforward NNs, which we denote by $p_\theta(t, x)$ for a parameterization $\theta$. Thus we have
$p_\theta(t, x) = g_L \circ g_{L-1} \circ \cdots \circ g_1(t, x),$
where each layer $g_\ell$, $\ell = 1, \dots, L$, is defined as
$g_\ell(z) = \sigma_\ell\left( W_\ell z + b_\ell \right),$
for weight matrices $W_\ell$, bias vectors $b_\ell$, and nonlinear activation functions $\sigma_\ell$. The trainable parameters are $\theta = \left\{ W_\ell, b_\ell \right\}_{\ell=1}^L$. In this paper we use smooth nonlinear activations for all hidden layers $\ell = 1, \dots, L-1$, and a linear output layer (i.e. $\sigma_L$ is the identity map).

In a naïve application of deep learning, we are given a large set of training data consisting of inputs $\left(t^{(i)}, x^{(i)}\right)$ and outputs $p^{(i)} = p\left(t^{(i)}, x^{(i)}\right)$, for $i = 1, \dots, N$. Each data point corresponds to a point in the time series of a sample trajectory, of which there are many. To train the NN, we solve a nonlinear regression problem over $\theta$ to match the NN predictions $p_\theta\left(t^{(i)}, x^{(i)}\right)$ with the training data $p^{(i)}$.
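A minimal NumPy sketch of such a network (two hidden layers here for brevity; the layer sizes, tanh activations, and random initialization are illustrative choices, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(layer_sizes):
    """Random weights and zero biases for a fully-connected network."""
    return [(rng.standard_normal((m, n)) / np.sqrt(n), np.zeros(m))
            for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def log_p_theta(params, z):
    """tanh hidden layers and a linear output; z stacks (t, x)."""
    for W, b in params[:-1]:
        z = np.tanh(W @ z + b)
    W, b = params[-1]
    return W @ z + b  # the network predicts log p(t, x)

params = init_params([4, 64, 64, 1])          # input: time plus a 3-d state
p = np.exp(log_p_theta(params, np.ones(4)))   # exponentiating preserves p > 0
assert p.shape == (1,) and p[0] > 0
```

Predicting the log-density and exponentiating, as discussed in Section 3.1 below, guarantees a positive density by construction.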
Unfortunately, deep NNs notoriously require enormous quantities of data to train, but we consider the case where data is not necessarily abundant. This situation can arise when numerical integration of a system is expensive, or if we are building a model from real-world measurements of a system with known structure but uncertain parameters. To overcome a relative lack of data, we adapt the physically-motivated machine learning strategies proposed by [14] for solving high-dimensional HJB equations in optimal control, and by [16] and [18] for other PDEs. We outline this process in the following sections.

We also note that [21] recently proposed a generative-adversarial strategy to train probabilistic surrogate models of systems like (1.1). This allows one to sample from $p(t, x)$, whereas our method predicts the density itself.
3.1 Physics-informed learning
In [14], the authors improve data-efficiency in training by augmenting the regression loss with an additional term which encourages the NN to learn the gradient of the solution. In the context of optimal control, this gradient is computed naturally for each sample trajectory. No such simple relationship exists for the PDF evolution equation. Thus for physically-motivated regularization we turn to the least squares methods proposed by [16] and [18]. Simply put, we would like $p_\theta$ to satisfy the Liouville equation (2.1). To this end we define the PDE residual
(3.1) $r(t, x; \theta) := \dfrac{\partial p_\theta}{\partial t}(t, x) + \nabla \cdot \left[ p_\theta(t, x)\, f(t, x) \right],$
which of course is everywhere zero if $p_\theta = p$. The partial derivatives appearing in (3.1) can be calculated using automatic differentiation. This standard feature of machine learning software packages allows efficient computation of exact partial derivatives, and is one of the essential drivers of the success of deep learning.
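To make the residual concrete, the sketch below evaluates (3.1) for a toy one-dimensional problem with a known exact solution; central finite differences stand in for the automatic differentiation used in practice, and the field $f(x) = -ax$ is an illustrative assumption, not from the paper:

```python
import numpy as np

# For f(x) = -a*x the exact Liouville solution is p(t, x) = N(0, exp(-2*a*t)),
# so the residual r = p_t + d/dx [p f] must vanish everywhere.
a = 0.5

def p_exact(t, x):
    var = np.exp(-2 * a * t)
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def residual(t, x, h=1e-5):
    p_t = (p_exact(t + h, x) - p_exact(t - h, x)) / (2 * h)
    # d/dx [p(t, x) * f(x)] with f(x) = -a*x:
    flux_x = (p_exact(t, x + h) * (-a * (x + h))
              - p_exact(t, x - h) * (-a * (x - h))) / (2 * h)
    return p_t + flux_x

assert abs(residual(0.7, 0.3)) < 1e-6
```

With a trained network, $p_\theta$ replaces `p_exact` and the derivatives come from the framework's autodiff rather than finite differences.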
We minimize the residual over a set of collocation points, $\mathcal{C}$. Such collocation points may include the training data as well as randomly sampled points, which require no numerical integration to generate. Approximating the PDE at collocation points can alleviate the need for large amounts of training data, and it produces a representation of the PDF which is guided by the underlying physics (2.1), rather than by regression alone. The main difference between our method and pure least squares approaches is that we exploit the ability to generate data along the characteristics of the PDF. We find that this two-pronged strategy is more effective than direct minimization of the PDE residual and boundary conditions.
We now introduce the physics-informed learning problem,
(3.2) $\min_\theta\; L(\theta) := L_{\mathrm{data}}(\theta; \mathcal{D}) + \lambda\, L_{\mathrm{PDE}}(\theta; \mathcal{C}).$
Here $L_{\mathrm{PDE}}$ is the mean square PDE residual,
(3.3) $L_{\mathrm{PDE}}(\theta; \mathcal{C}) = \dfrac{1}{|\mathcal{C}|} \sum_{\left(t^{(j)}, x^{(j)}\right) \in \mathcal{C}} \left[ r\left(t^{(j)}, x^{(j)}; \theta\right) \right]^2.$
For the regression term, $L_{\mathrm{data}}$, we find that a weighted mean square error on the log PDF is effective:
(3.4) $L_{\mathrm{data}}(\theta; \mathcal{D}) = \dfrac{1}{|\mathcal{D}|} \sum_{\left(t^{(i)}, x^{(i)}\right) \in \mathcal{D}} w^{(i)} \left[ \log p_\theta\left(t^{(i)}, x^{(i)}\right) - \log p^{(i)} \right]^2,$
for some weights $w^{(i)} > 0$. The choice of appropriate weights depends on the scaling of the problem. In the implementation, we actually have the NN predict the log PDF; to obtain $p_\theta$ we then exponentiate the NN output. The NN model therefore naturally preserves positivity of the PDF.
Lastly, $\lambda > 0$ is a scalar weight which must be carefully chosen to balance the impact of the PDE residual term (3.3) against the regression loss (3.4), which serves as a boundary condition in the present context. Good choices of $\lambda$ are highly problem-dependent, and for some problems, like the rigid body problem in Section 4.2, we have found success with increasing $\lambda$ over the course of training.
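The structure of the composite loss can be summarized in a few lines (the function and argument names below are illustrative):

```python
import numpy as np

def physics_informed_loss(log_p_pred, log_p_data, weights, residuals, lam):
    """Sketch of the training objective: weighted log-PDF regression on the
    trajectory data plus the mean-square PDE residual at collocation points,
    balanced by the scalar weight lam."""
    L_data = np.mean(weights * (log_p_pred - log_p_data) ** 2)
    L_pde = np.mean(residuals ** 2)
    return L_data + lam * L_pde
```

In practice `residuals` would be evaluated at the collocation points via automatic differentiation, and `lam` may be increased over the course of training as described above.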
3.2 Adaptive sample size selection
The machine learning problem introduced in Section 3.1 is atypical in that data generation may be expensive for stiff, chaotic, or high-dimensional systems. Data scarcity can be alleviated by the use of randomly sampled collocation points, but this is often not enough to obtain good results. In addition, randomly sampled collocation points might completely miss the support of the PDF, which can be twisted into small manifolds in the higher-dimensional space. For example, see the double gyre in [12] or Figure 2 in this paper. It is worth mentioning that PDFs like this are poorly characterized by simple statistics such as means and variances.
On the other hand, we have the freedom to generate additional data in parallel to training, hence we can treat this as a semi-online learning problem, incorporating additional data as it becomes available. For simplicity, in the current implementation we train the NN in multiple rounds, allowing the optimizer to converge in each round. In between rounds we generate additional data as needed, which we use to continue training the model.
Standard early stopping tests in machine learning rely on generating extra data sets for measuring generalization accuracy, which can be problematic if data is scarce in the first place. Further, such tests do not provide estimates of how much data is needed for convergence. To address these issues, we adapt the convergence test and sample size selection scheme developed in [14] to the present problem.
Let $\mathcal{D}_k$ and $\mathcal{C}_k$ denote the training data and collocation sets available in the $k$th training round. The method proposed in [14] works for a single data set, so we apply it to each loss term individually.^1 Thus we consider $\mathcal{D}_k$ to be sufficiently large if
(3.5) $\dfrac{\left\| \operatorname{Var}_{\mathcal{D}_k}\!\left[ \nabla_\theta \ell_{\mathrm{data}} \right] \right\|_1}{|\mathcal{D}_k|} \leq \varepsilon \left\| \nabla_\theta L_{\mathrm{data}}(\theta; \mathcal{D}_k) \right\|_2^2,$
where $\varepsilon > 0$ is a scalar parameter, $\ell_{\mathrm{data}}$ denotes the per-sample regression loss, and $\operatorname{Var}[\cdot]$ denotes the sample variance.^2

^1We postpone extending this idea to deal with loss functions with multiple data sets to future work, but note that it can actually be advantageous to adjust each data set independently.
^2Computing the variance for large data sets can become expensive, so when necessary we evaluate this term over a subset of the data.
Similarly, for $\mathcal{C}_k$ we require
(3.6) $\dfrac{\left\| \operatorname{Var}_{\mathcal{C}_k}\!\left[ \nabla_\theta \ell_{\mathrm{PDE}} \right] \right\|_1}{|\mathcal{C}_k|} \leq \varepsilon \left\| \nabla_\theta L_{\mathrm{PDE}}(\theta; \mathcal{C}_k) \right\|_2^2.$
Intuitively, the convergence tests (3.5) and (3.6) check whether the sample gradients are good approximations of the “true” gradients taken over infinitely large data sets. Satisfying (3.5) and (3.6) does not indicate that the trained NN is a good model, only that feeding it more data would likely not improve it significantly. The main advantage of these tests is that if they are not satisfied, then they immediately provide estimates for how much data should be generated for the next round, namely
(3.7) $|\mathcal{D}_{k+1}| = \min \left\{ \left\lceil \dfrac{\left\| \operatorname{Var}_{\mathcal{D}_k}\!\left[ \nabla_\theta \ell_{\mathrm{data}} \right] \right\|_1}{\varepsilon \left\| \nabla_\theta L_{\mathrm{data}}(\theta; \mathcal{D}_k) \right\|_2^2} \right\rceil,\; C_{\mathcal{D}}\, |\mathcal{D}_k| \right\},$
and similarly for $\mathcal{C}_{k+1}$,
(3.8) $|\mathcal{C}_{k+1}| = \min \left\{ \left\lceil \dfrac{\left\| \operatorname{Var}_{\mathcal{C}_k}\!\left[ \nabla_\theta \ell_{\mathrm{PDE}} \right] \right\|_1}{\varepsilon \left\| \nabla_\theta L_{\mathrm{PDE}}(\theta; \mathcal{C}_k) \right\|_2^2} \right\rceil,\; C_{\mathcal{C}}\, |\mathcal{C}_k| \right\}.$
In the above, $C_{\mathcal{D}}, C_{\mathcal{C}} > 1$ are scalar parameters which prevent the sample size from growing too fast. For details we refer the reader to [14].
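One plausible reading of the test and update rule, sketched with per-sample gradients stacked in rows (the exact norms and scalings follow [14] and are assumptions here):

```python
import numpy as np

def sufficient(sample_grads, eps):
    """Test in the spirit of (3.5): the data set is large enough when the
    per-sample gradient variance, scaled by 1/N, is small relative to the
    squared norm of the mean (full-sample) gradient."""
    N = sample_grads.shape[0]
    g_bar = sample_grads.mean(axis=0)
    var = sample_grads.var(axis=0, ddof=1).sum()
    return var / N <= eps * np.dot(g_bar, g_bar)

def next_sample_size(sample_grads, eps, growth_cap):
    """Update in the spirit of (3.7): estimate the sample size that would
    satisfy the test, capped so the data set cannot grow too fast."""
    N = sample_grads.shape[0]
    g_bar = sample_grads.mean(axis=0)
    var = sample_grads.var(axis=0, ddof=1).sum()
    needed = int(np.ceil(var / (eps * np.dot(g_bar, g_bar))))
    return min(needed, int(growth_cap * N))
```

For example, noisy per-sample gradients with a small mean fail the test and trigger a (capped) increase in the sample size for the next round.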
3.3 Transfer learning for extending time horizons
For some problems where the density changes quickly over time, or where a longer time horizon is desired, it can be difficult to directly learn the distribution over all $t \in [0, t_f]$. In such cases, we employ a simple transfer learning approach to gradually extend the time horizon learned by the NN. To do this, we pick a monotone increasing sequence of time horizons $t_f^{(1)} < t_f^{(2)} < \cdots < t_f^{(K)}$, where $t_f^{(1)} > 0$ and $t_f^{(K)} = t_f$. We first train the NN to model the density over $\left[0, t_f^{(1)}\right]$, then retrain the NN over $\left[0, t_f^{(2)}\right]$. We continue in this way until the NN learns the full time horizon, $[0, t_f]$.
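The procedure reduces to a warm-started training loop; the sketch below uses a stand-in training routine (names are illustrative):

```python
def train_over_horizons(params, horizons, train_round):
    """Train over [0, t_f^(1)] first, then warm-start each longer horizon
    from the previous round's parameters (the transfer step)."""
    for t_f in horizons:
        params = train_round(params, t_f)  # reuses current params
    return params

# Toy usage: the stand-in "training" just records the horizons it saw.
seen = []
final = train_over_horizons(
    params=0,
    horizons=[0.5, 1.0, 2.0],
    train_round=lambda p, t_f: (seen.append(t_f) or p + 1),
)
assert final == 3 and seen == [0.5, 1.0, 2.0]
```

The key design point is that each round starts from the previous round's weights, so the network only has to learn the density on the newly added time interval.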
3.4 Model validation
The final component of any machine learning algorithm is model validation. While establishing theoretical convergence properties remains an active area of research (see e.g. [9, 18, 23]), our data-driven approach naturally allows for empirical testing of model accuracy. To do this, we first construct a validation data set, $\mathcal{V}$, of trajectories generated from initial conditions drawn independently of those used for training. We evaluate each trajectory in $\mathcal{V}$ at the same sequence of time snapshots, $t_0 < t_1 < \cdots < t_{N_t}$, where $t_0 = 0$ and $t_{N_t} = t_f$. Then for each $t_k$ we estimate the normalized root mean square error (NRMSE) of the model predictions:
(3.9) $\mathrm{NRMSE}(t_k) = \dfrac{\sqrt{\dfrac{1}{|\mathcal{V}|} \sum_j \left[ p_\theta\left(t_k, x^{(j)}(t_k)\right) - p^{(j)}_k \right]^2}}{\max_j p^{(j)}_k - \min_j p^{(j)}_k}.$
Here $p^{(j)}_k = p\left(t_k, x^{(j)}(t_k)\right)$ is the density at the $k$th snapshot along the $j$th trajectory. This simple validation framework provides a characterization of the NN's performance over the time horizon, which can be compared between target densities of different scale.
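A sketch of the NRMSE at a single snapshot (normalizing by the spread of the validation densities is an assumption here; any scale-aware normalization serves the same purpose of making errors comparable across densities of different magnitude):

```python
import numpy as np

def nrmse(pred, true):
    """Root mean square prediction error at one time snapshot, normalized
    by the spread of the validation densities at that snapshot."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    return rmse / (true.max() - true.min())

assert nrmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]) == 0.0
assert abs(nrmse([2.0, 2.0], [1.0, 3.0]) - 0.5) < 1e-12
```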
We summarize the full training procedure in Algorithm 1.
4 Results
4.1 Kraichnan-Orszag problem
First we study the Kraichnan-Orszag problem from [15, 20]. The state dynamics are
(4.1) $\dot{y}_1 = y_1 y_3, \qquad \dot{y}_2 = -y_2 y_3, \qquad \dot{y}_3 = -y_1^2 + y_2^2.$
We consider independently normally distributed initial conditions such that the joint PDF $p_0$ straddles the “stochastic discontinuity” on the axis:
(4.2)
The Kraichnan-Orszag system is a canonical test problem in uncertainty quantification. We seek to model the time-evolving PDF over the full time horizon of interest, and compare the performance of our method using different optimization and collocation point selection strategies.
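A useful structural fact, assuming the standard form of the dynamics in (4.1): the Kraichnan-Orszag field is divergence-free, so by (2.4) the density is constant along each trajectory even though the support of the PDF deforms dramatically. The sketch below also verifies the conserved quantity $\|y\|^2$ numerically:

```python
import numpy as np
from scipy.integrate import solve_ivp

def f(t, y):
    """Kraichnan-Orszag dynamics in the commonly used form (assumed here)."""
    y1, y2, y3 = y
    return [y1 * y3, -y2 * y3, -y1**2 + y2**2]

# div f = y3 - y3 + 0 = 0, so by (2.4) p(t, x(t; x0)) = p0(x0) along every
# trajectory; only the support of the PDF deforms. The flow also conserves
# ||y||^2, which gives a cheap accuracy check for the integrator:
y0 = [1.0, 0.2, 0.0]
sol = solve_ivp(f, (0.0, 10.0), y0, rtol=1e-10, atol=1e-12)
yT = sol.y[:, -1]
assert abs(np.sum(yT**2) - np.sum(np.square(y0))) < 1e-6
```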
Using TensorFlow
[1], we implement a fully-connected feedforward NN with four hidden layers of 64 neurons each. For the weights in the regression loss term (3.4), we pick values suited to the scaling of the problem. We consider three strategies to train the network:
Adam: We train for one round with a fixed data set, $\mathcal{D}$. We fix the PDE weight $\lambda$ and optimize using Adam [11] for a prefixed number of epochs and with manually tuned learning rates. We randomly sample a minibatch of collocation points at each iteration. This implementation is similar to the “Deep Galerkin Method” (DGM) proposed by [18], but in the context of solving (2.1) with trajectory data serving as a boundary condition.

L-BFGS: We train for one round on the same fixed data set and collocation points, optimizing with L-BFGS [4].

Adaptive L-BFGS: We use the multi-round model refinement scheme outlined in Algorithm 1, optimizing with L-BFGS.
The first two implementations use a data set of 500 sample trajectories, each evaluated at the same sequence of time snapshots. The adaptive learning implementation starts from a smaller data set, $\mathcal{D}_1$, composed of 250 trajectories from the fixed data set, but we add new data each round. We choose $\mathcal{C}_k$ to be the union of $\mathcal{D}_k$ and an equally-sized set of uniformly sampled collocation points, also adding more points progressively. Lastly, we set the convergence tolerance $\varepsilon$ and the sample size growth parameters $C_{\mathcal{D}}$ and $C_{\mathcal{C}}$ appropriately for this problem. For all implementations we directly learn the density evolution over the full time interval.
To validate the models, we construct a second data set of 500 sample trajectories, each evaluated at the same time snapshots. Figure 1 shows the NRMSE (3.9) evaluated on this validation set for each of the implementations tested. From this plot, it is clear that training for a single round with L-BFGS is more effective than with Adam and collocation points randomly selected at each iteration. This was true for all hyperparameter configurations tested with Adam. In Figure 1 we also see that the adaptively trained NN is just as accurate as the other training strategies, even though it was trained without full access to a large data set. This suggests that the training strategies discussed in Section 3 can be useful for solving data-scarce problems.

Lastly, in Figure 2 we plot a predicted marginal density. The NN predicts the joint density on a tensor grid, which we then marginalize by quadrature integration. While dense grid integration is impractical in higher dimensions, here we use it only to visualize the complexity of the PDF captured by the NN.
4.2 Application to quantifying controller robustness
In this section we demonstrate the potential of the proposed method for use in evaluating controllers. We consider a six-dimensional rigid body model of a satellite controlled by reaction wheels. The state is written as $x = \left(v^\top, \omega^\top\right)^\top$, where $v \in \mathbb{R}^3$ is the attitude of the satellite represented in Euler angles (roll, pitch, yaw) and $\omega \in \mathbb{R}^3$ denotes the angular velocity in the body frame. In this representation, the state dynamics are
(4.3) $\dot{v} = E(v)\, \omega, \qquad J \dot{\omega} = S(\omega)\, R(v)\, h + B u,$
where $E(v)$, $R(v)$, and $S(\omega)$ are nonlinear matrix-valued functions;^3 $J$ is the inertia matrix; $h$ is the constant angular momentum of the system; and $B$ is a constant matrix. The system is controlled by applying a torque $u$. We scale the control input by a random parameter $\kappa$, representing uncertain actuator effectiveness.

^3See [14] for details.
To stabilize the system, we design a linear quadratic regulator (LQR) by solving a linearized infinite horizon optimal control problem,
(4.4) $\min_u \int_0^\infty \left[ x^\top(t)\, Q\, x(t) + u^\top(t)\, R\, u(t) \right] dt, \qquad \text{subject to } \dot{x} = A x + B u,$
where $A$, $B$ are constant matrices obtained by linearizing (4.3) about the origin and using a nominal parameter value. We choose constant weight matrices $Q$ and $R$. It is well-known that the minimizer of (4.4) is a linear state feedback controller, $u(x) = -K x$, and that the gain matrix $K$ is computed by solving a continuous algebraic Riccati equation (offline).
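For illustration, the offline LQR synthesis can be sketched with SciPy's Riccati solver; the double-integrator matrices below are a stand-in, since the satellite's $A$ and $B$ come from linearizing (4.3) and are not reproduced here:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Stand-in linearized system (NOT the satellite model from the paper).
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)   # state weight
R = np.eye(1)   # control weight

P = solve_continuous_are(A, B, Q, R)   # continuous algebraic Riccati equation
K = np.linalg.solve(R, B.T @ P)        # feedback gain: u = -K x

# The LQR closed loop A - B K is Hurwitz (eigenvalues in the left half-plane).
assert np.all(np.linalg.eigvals(A - B @ K).real < 0)
```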
Using the proposed density propagation framework, we model the evolution of a density of initial states over the time horizon of interest, subject to the nonlinear dynamics (4.3) under LQR control. The system is augmented with an additional state $\kappa$ for which $\dot{\kappa} = 0$, bringing the total dimension to seven. Suppose that at $t = 0$ the states are independently distributed according to
(4.5)
The bimodal Gaussian mixture distribution for $\kappa$ is chosen to allow us to simultaneously study the behavior when $\kappa$ is near the nominal value, and a scenario where $\kappa$ deviates substantially from it, representing estimation error or actuator inefficiency.


For this problem we train a NN with four hidden layers, each with 128 neurons. The weights in the regression loss term (3.4) are chosen to suit the scaling of the problem. In this problem the magnitude of the density grows almost exponentially in time, so we implement the transfer learning strategy from Section 3.3 with a sequence of nested time horizons and correspondingly increasing PDE penalty weights $\lambda$. In the first three time horizons, we train for a single round on a fixed data set, $\mathcal{D}$, constructed from 500 sample trajectories. The collocation set includes $\mathcal{D}$ and an equal number of randomly sampled points, resampled for every time horizon. For the final time horizon, we implement the adaptive learning algorithm. With these parameters, the convergence criteria (3.5)-(3.6) are satisfied after three rounds, observing data from a total of 984 trajectories, each evaluated at the same sequence of time snapshots.
In Figure 3 we plot the NRMSE (3.9) of the NN at each time horizon, evaluated on a validation data set of 500 trajectories with the same time snapshots. This is compared to a direct solution attempt over the full time horizon, which turns out to be quite inaccurate. Thus we see that the transfer learning approach makes the proposed framework viable for approximating the rapidly growing density.
Figure 4 shows predicted conditional densities of the state for different values of $\kappa$. In Figure 4(a), we plot the case where $\kappa$ takes the nominal value for which the control is designed. We can see that the density – which is initially spread out in some state components but tight in the others – rotates so that the former converge to zero while the latter spread out. Then in Figure 4(b) we plot an off-nominal case. We can clearly see that the density increases more slowly than in the nominal case, indicating poorer stability. Still, overall the results are qualitatively similar, showing that the LQR controller has a large domain of attraction and is fairly robust to changes in $\kappa$, but does a poor job of stabilizing some components of the state. Similar conditional densities are easily obtained for other state components with the same NN. Finally, we observe that the NN predicts seemingly erroneous mass on the sides of the domain. This is actually reasonable, since the Euler angles are periodic, and a cursory examination of the data reveals that many trajectories do in fact wrap around the periodic boundaries.
5 Concluding remarks
In this paper we have introduced a new data-driven approach to uncertainty propagation in nonlinear dynamical systems. Leveraging the approximating power of NNs, we model the probability density at any location $x$ and time $t$, without discretizing the space-time domain. Numerical results suggest that the proposed method is competitive with other techniques for solving Liouville equations in moderate to high-dimensional systems. Unlike most other approaches, our method may be most effective for approximating densities which are spatially complex, but whose magnitude does not vary too drastically over time. Thus we envision our method not as a replacement for existing techniques, but rather as filling a certain niche in uncertainty quantification: modeling densities which are not characterized well by statistics like means and variances. Finally, we believe that this work represents an important step in developing physics-driven machine learning methods for optimal control in the space of probability distributions.
References
 [1] M. Abadi, A. Agarwal, P. Barham, et al., TensorFlow: Largescale machine learning on heterogeneous systems. http://www.tensorflow.org/, 2015–.
 [2] C. Brennan and D. Venturi, Datadriven closures for stochastic dynamical systems, J. Comput. Phys., 372 (2018), pp. 281–298.
 [3] S. Brooks, A. Gelman, G. L. Jones, and X.L. Meng, eds., Handbook of Markov Chain Monte Carlo, Chapman & Hall/CRC, Boca Raton, 2011.
 [4] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu, A limited memory algorithm for bound constrained optimization, SIAM J. Sci. Comput., 16 (1995), pp. 1190–1208.
 [5] H. Cho, D. Venturi, and G. E. Karniadakis, Numerical methods for highdimensional probability density function equations, J. Comput. Phys., 305 (2016), pp. 817–837.
 [6] Y. T. Chow, J. Darbon, S. Osher, and W. Yin, Algorithm for overcoming the curse of dimensionality for statedependent HamiltonJacobi equations, J. Comput. Phys., 387 (2019), pp. 376–409.
 [7] M. Ehrendorfer, The Liouville equation and its potential usefulness for the prediction of forecast skill. Part I: Theory, Monthly Weather Rev., 122 (1994), pp. 703–713.
 [8] J. Han, A. Jentzen, and W. E, Solving highdimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
 [9] J. Han and J. Long, Convergence of the deep BSDE method for coupled FBSDEs, arXiv:1811.01165, (2018).
 [10] W. Kang and L. C. Wilcox, Solving 1D conservation laws using Pontryagin’s minimum principle, J. Sci. Comput., 71(1) (2017), pp. 144–165.
 [11] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv:1412.6980, (2014).
 [12] D. M. Luchtenburg, S. L. Brunton, and C. W. Rowley, Longtime uncertainty propagation using generalized polynomial chaos and flow map composition, J. Comput. Phys., 274 (2014), pp. 783–802.
 [13] B. Lusch, J. N. Kutz, and S. L. Brunton, Deep learning for universal linear embeddings of nonlinear dynamics, Nature Communications, 9 (2018).
 [14] T. NakamuraZimmerer, Q. Gong, and W. Kang, Adaptive deep learning for highdimensional HamiltonJacobiBellman equations, arXiv:1907.05317, (2019).
 [15] S. A. Orszag and L. Bissonnette, Dynamical properties of truncated Wiener-Hermite expansions, Phys. Fluids, 10 (1967), pp. 2603–2613.
 [16] M. Raissi, P. Perdikaris, and G. Karniadakis, Physicsinformed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, J. Comput. Phys., 378 (2019), pp. 686–707.
 [17] D. W. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization, John Wiley & Sons, Inc., Hoboken, 2nd ed., 2015.
 [18] J. Sirignano and K. Spiliopoulos, DGM: A deep learning algorithm for solving partial differential equations, J. Comput. Phys., 375 (2018), pp. 1339–1364.
 [19] M. Tatang, W. Pan, R. Prinn, and G. McRae, An efficient method for parametric uncertainty analysis of numerical geophysical models, J. Geophys. Res., 102 (1997), pp. 21925–21932.
 [20] X. Wan and G. E. Karniadakis, An adaptive multielement generalized polynomial chaos method for stochastic differential equations, J. Comput. Phys., 209 (2005), pp. 617–642.
 [21] Y. Yang and P. Perdikaris, Adversarial uncertainty quantification in physicsinformed neural networks, J. Comput. Phys., 394 (2019), pp. 136–152.
 [22] I. Yegorov and P. M. Dower, Perspectives on characteristics based curseofdimensionalityfree numerical approaches for solving HamiltonJacobi equations, Appl. Math. Optim., (2018).
 [23] D. Zhang, L. Lu, L. Guo, and G. E. Karniadakis, Quantifying total uncertainty in physicsinformed neural networks for solving forward and inverse stochastic problems, J. Comput. Phys., 397 (2019).