1 Introduction
Despite their immense recent success, many state-of-the-art machine learning techniques (e.g., deep neural nets, convolutional networks, recurrent networks [1, 2, 3, 4], etc.) lack robustness and often fail to provide guarantees of convergence or to quantify the error/uncertainty associated with their predictions. Hence, the ability to quantify predictive uncertainty and learn in a sample-efficient manner is a necessity [5, 6, 7, 8], especially in data-limited domains [9]. Even less well understood is how one can constrain such algorithms to leverage domain-specific knowledge and return predictions that satisfy certain physical principles [10] (e.g., conservation of mass, momentum, etc.). In recent work, Raissi et al. [11, 12, 13, 14] explored this new interface between classical scientific computing and machine learning by revisiting the idea of penalizing the loss function of deep neural networks with differential equation constraints, as first put forth by Psichogios and Ungar [15] and Lagaris et al. [16]. In this work, we revisit these approaches from a probabilistic standpoint [17, 18, 19] and develop an adversarial inference framework [20, 21, 22, 23] to enable the posterior characterization [24, 25] of the uncertainty associated with the model predictions. These uncertainty estimates reflect how observation noise and/or randomness in the system's inputs and outputs are propagated through complex infinite-dimensional dynamical systems described by nonlinear partial differential equations (PDEs).
2 Methods
In Raissi et al. [11, 12, 13, 14], the authors considered constructing deep neural networks that return predictions constrained by PDEs of the form $u_t + \mathcal{N}_x[u] = 0$, where the solution $u(t,x)$ is represented by a deep neural network $u_\theta(t,x)$ parametrized by a set of parameters $\theta$, $x$ is a vector of space coordinates, $t$ is the time coordinate, and $\mathcal{N}_x[\cdot]$ is a nonlinear differential operator. As neural networks are differentiable representations, this construction defines a so-called physics-informed neural network that corresponds to the PDE residual, i.e. $r_\theta(t,x) := \partial_t u_\theta(t,x) + \mathcal{N}_x[u_\theta(t,x)]$. By construction, this network shares the same architecture and parameters with $u_\theta(t,x)$, but it has different activation functions corresponding to the action of the differential operator. The resulting training procedure allows us to recover the shared network parameters $\theta$ using a few scattered observations of $u(t,x)$, namely $\{(t_u^i, x_u^i), u^i\}$, $i = 1, \ldots, N_u$, along with a larger number of collocation points $\{(t_r^i, x_r^i)\}$, $i = 1, \ldots, N_r$, that aim to penalize the PDE residual at a finite set of collocation nodes. This way, the resulting optimization problem can be effectively solved using standard stochastic gradient descent, without necessitating any elaborate constrained optimization techniques, simply by minimizing the composite loss function
$\mathcal{L}(\theta) = \frac{1}{N_u}\sum_{i=1}^{N_u}\left|u_\theta(t_u^i, x_u^i) - u^i\right|^2 + \frac{1}{N_r}\sum_{i=1}^{N_r}\left|r_\theta(t_r^i, x_r^i)\right|^2, \quad (1)$
where the required gradients can be readily obtained using automatic differentiation [26].
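To make this construction concrete, the sketch below shows how a PDE residual and the composite loss of equation 1 can be assembled with automatic differentiation. This is a minimal illustration under assumed conventions (not the authors' code): it uses the Burgers operator $r = u_t + u u_x - \nu u_{xx}$ from Section 3, and the helper names `burgers_residual` and `composite_loss` are hypothetical.

```python
import torch

def burgers_residual(u_fn, t, x, nu=0.01):
    """PDE residual r = u_t + u*u_x - nu*u_xx, computed via autograd."""
    t = t.clone().requires_grad_(True)
    x = x.clone().requires_grad_(True)
    u = u_fn(t, x)
    # First- and second-order derivatives of the network output
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

def composite_loss(u_fn, t_u, x_u, u_obs, t_r, x_r, nu=0.01):
    """Data misfit plus PDE residual penalty, as in equation 1."""
    mse_u = torch.mean((u_fn(t_u, x_u) - u_obs) ** 2)
    mse_r = torch.mean(burgers_residual(u_fn, t_r, x_r, nu) ** 2)
    return mse_u + mse_r
```

Here `u_fn` would be the neural network $u_\theta$; any differentiable callable works, which is convenient for checking the residual against hand-computed derivatives.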
In the proposed work we aim to model uncertainty in such physics-informed models by constructing conditional latent variable models of the form
$u = f_\theta(x, t, z), \quad z \sim p(z), \quad (2)$
and encourage the resulting samples to be constrained by a given physical law according to a likelihood over the PDE residual, e.g. a zero-mean Gaussian of the form
$r_\theta(x, t, z) \sim \mathcal{N}(0, \sigma_r^2). \quad (3)$
This setting encapsulates a wide range of deterministic and stochastic problems, where $u$ is a potentially multivariate field, and $z$ is a collection of latent variables.
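As an illustration of such a conditional latent variable model, a small feed-forward generator taking $(x, t, z)$ as input can be sampled over the latent prior to produce a predictive ensemble. The snippet below is a simplified sketch with assumed layer sizes (smaller than the architecture used in Section 3), and the names `Generator` and `predictive_samples` are hypothetical.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Conditional latent variable model u = f_theta(x, t, z), z ~ p(z)."""
    def __init__(self, z_dim=1, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + z_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))

    def forward(self, x, t, z):
        # Concatenate spatio-temporal inputs with the latent variables
        return self.net(torch.cat([x, t, z], dim=-1))

def predictive_samples(gen, x, t, n_samples=100):
    """Monte Carlo over the latent prior yields a predictive ensemble."""
    zs = torch.randn(n_samples, x.shape[0], 1)   # z ~ N(0, 1)
    with torch.no_grad():
        return torch.stack([gen(x, t, z) for z in zs])
```

The sample mean and standard deviation of the resulting ensemble are the quantities reported later in figure 1.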
Following the recent findings of [27], we will train the generative model by matching the joint distribution of the generated samples $p_\theta(x, t, u)$ with the joint distribution of the observed data $q(x, t, u)$ by minimizing the reverse Kullback-Leibler (KL) divergence, which can be decomposed as
$\mathrm{KL}\left(p_\theta(x,t,u)\,\|\,q(x,t,u)\right) = -\mathcal{H}(p_\theta(x,t,u)) + \mathbb{E}_{p_\theta}\left[-\log q(x,t,u)\right], \quad (4)$
where $\mathcal{H}(p_\theta)$ denotes the entropy of the generative model. This decomposition reveals the interplay between two competing mechanisms that define the model training dynamics. On the one hand, minimizing the negative entropy term encourages the support of $p_\theta(x,t,u)$ to spread to infinity, while the second term penalizes regions in which the supports of $p_\theta$ and $q$ do not overlap. As observed in [27], this introduces a regularization mechanism for mitigating the pathology of mode collapse.
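The decomposition in equation 4 can be checked numerically in a simple closed-form setting. The sketch below (an illustration only, not part of the proposed method) assembles $\mathrm{KL}(p\|q)$ for one-dimensional Gaussians from exactly the two terms of the decomposition: the negative entropy of $p$ and the cross-entropy between $p$ and $q$.

```python
import numpy as np

def gauss_entropy(sigma):
    """Differential entropy H(p) of a 1-D Gaussian with std sigma."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

def gauss_cross_entropy(mu_p, s_p, mu_q, s_q):
    """Cross-entropy -E_p[log q] for 1-D Gaussians p and q."""
    return 0.5 * np.log(2 * np.pi * s_q**2) + (s_p**2 + (mu_p - mu_q)**2) / (2 * s_q**2)

def reverse_kl(mu_p, s_p, mu_q, s_q):
    """KL(p || q) assembled as in equation 4: -H(p) + cross-entropy."""
    return -gauss_entropy(s_p) + gauss_cross_entropy(mu_p, s_p, mu_q, s_q)
```

The result agrees with the standard closed-form Gaussian KL divergence, confirming that the two terms of equation 4 account for the whole divergence.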
As the entropy term is intractable, we can obtain a computable training objective by deriving the following lower bound (see proof in Appendix A)
$\mathcal{H}(p_\theta(u|x,t)) \geq \mathcal{H}(p(z)) + \mathbb{E}_{p(z)q(x,t)}\left[\log q_\phi(z|x,t,f_\theta(x,t,z))\right], \quad (5)$
where the inference model $q_\phi(z|x,t,u)$ plays the role of a variational approximation to the true posterior over the latent variables, and appears naturally via information-theoretic arguments in the derivation of the lower bound [27].
This construction leads to two coupled objectives that define an adversarial game for training the model parameters $(\theta, \phi, \psi)$,
$\mathcal{L}_D(\psi) = \mathbb{E}_{p_\theta(x,t,u)}\left[\log \sigma(T_\psi(x,t,u))\right] + \mathbb{E}_{q(x,t,u)}\left[\log(1 - \sigma(T_\psi(x,t,u)))\right], \quad (6)$
$\mathcal{L}_G(\theta, \phi) = \mathbb{E}_{p(z)q(x,t)}\left[T_\psi(x,t,f_\theta(x,t,z)) - (1-\lambda)\log q_\phi(z|x,t,f_\theta(x,t,z))\right], \quad (7)$
where $T_\psi(x,t,u)$ is a parametrized discriminator used to approximate the KL divergence directly from samples using the density ratio trick [28], and $\sigma(\cdot)$ is the logistic sigmoid function. Moreover, notice how the inference model $q_\phi(z|x,t,u)$ promotes cycle consistency in the latent variables, and serves as an entropic regularization term that allows us to stabilize model training and mitigate the pathology of mode collapse, as controlled by the user-defined parameter $\lambda$ [27]. The final training objective that encourages the generated samples to satisfy a given PDE reads as
$\mathcal{L}(\theta, \phi) = \mathcal{L}_G(\theta, \phi) + \beta\,\frac{1}{N_r}\sum_{i=1}^{N_r}\mathbb{E}_{p(z)}\left[r_\theta(x_r^i, t_r^i, z)^2\right], \quad (8)$
where positive values of $\beta$ can be selected to place more emphasis on penalizing the PDE residual. For $\beta > 0$, the residual loss acts as a regularization term that encourages the generator to produce samples that satisfy the underlying partial differential equation.
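The density ratio trick underlying the adversarial game can be illustrated in closed form: for known densities, the optimal discriminator logit equals the log density ratio, so averaging it over samples from $p$ recovers $\mathrm{KL}(p\|q)$ by Monte Carlo. The snippet below is a self-contained demonstration with Gaussians, not the adversarial training loop itself (no discriminator is trained; we plug in the known optimal logit).

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)   # samples from p = N(0, 1)

# Optimal discriminator logit: T*(x) = log p(x) - log q(x),
# so that sigmoid(T*) = p / (p + q) (the density ratio trick)
T = log_normal_pdf(x, 0.0, 1.0) - log_normal_pdf(x, 1.0, 2.0)

kl_estimate = T.mean()                   # E_p[T*] = KL(p || q)
kl_closed = np.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5
```

In practice $T_\psi$ is only an approximation of this optimal logit, learned by the logistic regression objective of equation 6, which is why the quality of the discriminator directly affects the generator updates.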
3 Results
Here we demonstrate the performance of the proposed methodology through the lens of a canonical problem in transport dynamics modeled by the Burgers equation with appropriate initial and boundary conditions. This equation arises in various areas of applied mathematics, including fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow [29]. In one space dimension the equation reads as
$u_t + u u_x = \nu u_{xx}, \quad (9)$
where $\nu$ is a viscosity parameter, small values of which can lead to solutions developing shock formations that are notoriously hard to resolve by classical numerical methods [29].
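For reference, the smoothing role of the viscosity in equation 9 can be reproduced with a simple explicit finite-difference scheme. The sketch below is a minimal illustrative solver, not the setup used in our experiments: the initial condition $u(0,x) = -\sin(\pi x)$ on $[-1,1]$ with homogeneous Dirichlet boundaries is an assumption, and the viscosity is deliberately chosen large enough for the explicit central-difference scheme to remain stable.

```python
import numpy as np

def solve_burgers(nu=0.1, nx=101, dt=5e-4, t_final=0.5):
    """Explicit finite-difference solve of u_t + u*u_x = nu*u_xx
    on x in [-1, 1], with u(t, -1) = u(t, 1) = 0 and u(0, x) = -sin(pi x)."""
    x = np.linspace(-1.0, 1.0, nx)
    dx = x[1] - x[0]
    u = -np.sin(np.pi * x)
    for _ in range(round(t_final / dt)):
        u_x = np.zeros_like(u)
        u_xx = np.zeros_like(u)
        u_x[1:-1] = (u[2:] - u[:-2]) / (2 * dx)                  # central difference
        u_xx[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2      # second difference
        u = u + dt * (-u * u_x + nu * u_xx)                      # forward Euler step
        u[0] = u[-1] = 0.0                                       # Dirichlet boundaries
    return x, u
```

For much smaller viscosities the solution steepens into a near-discontinuity and a scheme like this one requires far finer grids (or upwinding), which is precisely the regime that is hard for classical methods.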
Here, we construct a probabilistic representation for the unknown solution $u(t,x)$ using a physics-informed deep generative model $u = f_\theta(x,t,z)$, and we introduce parametric mappings corresponding to a generator $f_\theta$, an encoder $q_\phi$, and a discriminator $T_\psi$, all constructed using deep feed-forward neural networks with 4 hidden layers of 50 neurons each. The activation function in all cases is chosen to be a hyperbolic tangent nonlinearity. The prior over the latent variables is chosen to be a one-dimensional isotropic Gaussian, i.e. $p(z) = \mathcal{N}(0, 1)$. We train our probabilistic model using stochastic gradient Adam updates [30] on a dataset comprising $N_u = 200$ input/output pairs (100 points for the initial condition and 50 points for each of the domain boundaries), plus additional collocation points for enforcing the residual of the Burgers equation using the loss of equation 8 with fixed values of $\lambda$ and $\beta$. A systematic study with respect to the model hyper-parameters is provided in Appendix B. Notice that the initial condition here is corrupted by a non-additive, input-dependent noise process, and our goal is to propagate the effect of this uncertainty into the prediction of future system states. Notice how the noise variance is chosen to be larger near the location where the shock forms, therefore amplifying the effect of uncertainty on the shock formation. The results of this experiment are summarized in figure 1, where we report the predicted mean solution, as well as the uncertainty associated with this prediction. We observe that the resulting generative model can effectively capture the uncertainty in the resulting spatio-temporal solution due to the propagation of the input noise process through the complex nonlinear dynamics of the Burgers equation. As expected, the uncertainty concentrates around the shock discontinuity that the solution develops. Although we only plot the first two moments of the solution, we must emphasize that the generative model provides a complete probabilistic characterization of its non-Gaussian statistics.

References

[1] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (pp. 1097–1105).
[2] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436.
[3] Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
[4] Alipanahi, B., Delong, A., Weirauch, M. T., & Frey, B. J. (2015). Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature Biotechnology, 33(8), 831.
[5] Bui, T., Hernández-Lobato, D., Hernández-Lobato, J., Li, Y., & Turner, R. (2016). Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning (pp. 1472–1481).
[6] Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050–1059).
[7] Gal, Y. (2016). Uncertainty in deep learning. PhD thesis, University of Cambridge.
[8] Ghahramani, Z. (2015). Probabilistic machine learning and artificial intelligence. Nature, 521(7553), 452.
[9] Ma, C., Li, Y., & Hernández-Lobato, J. M. (2018). Variational implicit processes. arXiv preprint arXiv:1806.02390.
[10] Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 681–688).
[11] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017). Physics informed deep learning (Part I): Data-driven solutions of nonlinear partial differential equations. arXiv preprint arXiv:1711.10561.
[12] Raissi, M., Perdikaris, P., & Karniadakis, G. E. (2017). Physics informed deep learning (Part II): Data-driven discovery of nonlinear partial differential equations. arXiv preprint arXiv:1711.10566.
[13] Raissi, M., & Karniadakis, G. E. (2018). Hidden physics models: Machine learning of nonlinear partial differential equations. Journal of Computational Physics, 357, 125–141.
[14] Raissi, M. (2018). Deep hidden physics models: Deep learning of nonlinear partial differential equations. arXiv preprint arXiv:1801.06637.
[15] Psichogios, D. C., & Ungar, L. H. (1992). A hybrid neural network-first principles approach to process modeling. AIChE Journal, 38(10), 1499–1511.
[16] Lagaris, I. E., Likas, A., & Fotiadis, D. I. (1998). Artificial neural networks for solving ordinary and partial differential equations. IEEE Transactions on Neural Networks, 9(5), 987–1000.
[17] Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
[18] Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[19] Weber, M., Welling, M., & Perona, P. (2000). Unsupervised learning of models for recognition. In European Conference on Computer Vision (pp. 18–32). Springer, Berlin, Heidelberg.
[20] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … & Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (pp. 2672–2680).
[21] Louizos, C., Ullrich, K., & Welling, M. (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (pp. 3288–3298).
[22] Saatci, Y., & Wilson, A. G. (2017). Bayesian GAN. In Advances in Neural Information Processing Systems (pp. 3622–3631).
[23] Wilson, A. G., Hu, Z., Salakhutdinov, R. R., & Xing, E. P. (2016). Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems (pp. 2586–2594).
[24] Ghahramani, Z. (2001). An introduction to hidden Markov models and Bayesian networks. International Journal of Pattern Recognition and Artificial Intelligence, 15(01), 9–42.
[25] Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., & Murphy, K. (2017). An information-theoretic analysis of deep latent-variable models. arXiv preprint arXiv:1711.00464.
[26] Baydin, A. G., Pearlmutter, B. A., Radul, A. A., & Siskind, J. M. (2017). Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18(153), 1–153.
[27] Li, C., Li, J., Wang, G., & Carin, L. (2018). Learning to sample with adversarially learned likelihood-ratio.
[28] Sugiyama, M., Suzuki, T., & Kanamori, T. (2012). Density Ratio Estimation in Machine Learning. Cambridge University Press.
[29] Basdevant, C., Deville, M., Haldenwang, P., Lacroix, J. M., Ouazzani, J., Peyret, R., … & Patera, A. T. (1986). Spectral and finite difference solutions of the Burgers equation. Computers & Fluids, 14(1), 23–41.
[30] Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[31] Cover, T. M., & Thomas, J. A. (2012). Elements of Information Theory. John Wiley & Sons.
[32] Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256).
Appendix A Proof of the Entropy Lower Bound
Here we follow the derivation of Li et al. [27] to construct a computable lower bound for the entropy $\mathcal{H}(p_\theta(u|x,t))$. To this end, we start by considering random variables $(x, t, u, z)$ under the joint distribution
$p_\theta(x, t, u, z) = p_\theta(u|x, t, z)\, q(x, t)\, p(z),$
where $p_\theta(u|x,t,z) = \delta(u - f_\theta(x,t,z))$, and $\delta(\cdot)$ is the Dirac delta function. The mutual information between $u$ and $z$ satisfies the information-theoretic identity
$\mathcal{I}(u; z) = \mathcal{H}(u) - \mathcal{H}(u|z) = \mathcal{H}(z) - \mathcal{H}(z|u),$
where $\mathcal{H}(u)$, $\mathcal{H}(z)$ are the marginal entropies and $\mathcal{H}(u|z)$, $\mathcal{H}(z|u)$ are the conditional entropies [31]. Since in our setup $x$ and $t$ are deterministic variables independent of $z$, and samples of $u$ are generated by a deterministic function $u = f_\theta(x,t,z)$, it follows that $\mathcal{H}(u|x,t,z) = 0$. We therefore have
$\mathcal{H}(p_\theta(u|x,t)) = \mathcal{H}(p(z)) - \mathcal{H}(z|x,t,u), \quad (10)$
where $\mathcal{H}(p(z))$ is a constant with respect to the model parameters $\theta$.
Now consider a general variational distribution $q_\phi(z|x,t,u)$ parametrized by a set of parameters $\phi$. Then, by the non-negativity of the KL divergence,
$\mathcal{H}(z|x,t,u) = -\mathbb{E}\left[\log p_\theta(z|x,t,u)\right] \leq -\mathbb{E}\left[\log q_\phi(z|x,t,u)\right]. \quad (11)$
Viewing $z$ as a set of latent variables, $q_\phi(z|x,t,u)$ is a variational approximation to the true intractable posterior over the latent variables $p_\theta(z|x,t,u)$. Therefore, if $q_\phi(z|x,t,u)$ is introduced as an auxiliary encoder associated with the generative model $u = f_\theta(x,t,z)$, then we can use equations 10 and 11 to bound the entropy term in equation 4 as
$\mathcal{H}(p_\theta(u|x,t)) \geq \mathcal{H}(p(z)) + \mathbb{E}_{p(z)q(x,t)}\left[\log q_\phi(z|x,t,f_\theta(x,t,z))\right]. \quad (12)$
Appendix B Systematic Studies
Here we provide results from a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions to: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. In all cases we have used the nonlinear Burgers equation as a prototype problem.
b.1 Sensitivity with respect to the neural network initialization
In order to quantify the sensitivity of the proposed methods with respect to the initialization of the neural networks, we have considered a noise-free data set comprising $N_u$ training and $N_r$ collocation points, fixed the architecture of the generator networks to include 4 hidden layers with 50 neurons each and the discriminator networks to include 3 hidden layers with 50 neurons each, and used a hyperbolic tangent activation function. Then we trained an ensemble of 15 cases, all starting from a normal Xavier initialization [32] for all network weights (with a randomized seed) and a zero initialization for all bias parameters. In table 1 we report the relative L2 error between the predicted mean solution and the known exact solution for this problem for all 15 randomized trials, using a set of randomly selected test points. Evidently, our results are robust with respect to the neural network initialization, as in all cases the stochastic gradient descent training procedure converged roughly to the same solution. We can summarize this result by reporting the mean and the standard deviation of the relative L2 error across all trials.

Table 1: Relative L2 error for 15 randomized initialization trials.
4.1e-02  7.9e-02  4.4e-02  4.0e-02  3.8e-02
3.2e-02  5.7e-02  4.7e-02  6.5e-02  4.0e-02
3.5e-02  3.5e-02  6.4e-02  4.0e-02  4.9e-02
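The mean and standard deviation quoted above can be computed directly from the 15 values in table 1:

```python
import numpy as np

# Relative L2 errors from table 1 (15 randomized initialization trials)
errors = np.array([
    4.1e-02, 7.9e-02, 4.4e-02, 4.0e-02, 3.8e-02,
    3.2e-02, 5.7e-02, 4.7e-02, 6.5e-02, 4.0e-02,
    3.5e-02, 3.5e-02, 6.4e-02, 4.0e-02, 4.9e-02,
])
mean_err = errors.mean()   # ensemble mean of the relative error
std_err = errors.std()     # ensemble standard deviation
```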
b.2 Sensitivity with respect to the total number of training and collocation points
In this study our goal is to quantify the sensitivity of our predictions with respect to the total number of training and collocation points, $N_u$ and $N_r$, respectively. As before, we have considered noise-free data sets, and fixed the architecture of the generator networks to include 4 hidden layers with 50 neurons each and the discriminator networks to include 3 hidden layers with 50 neurons each, with a hyperbolic tangent activation function, a normal Xavier initialization [32] for all network weights, and a zero initialization for all network biases. The results of this study are summarized in table 2, indicating that as the number of collocation points is increased, a more accurate prediction is obtained. This observation is in agreement with the original results of Raissi et al. [11] for deterministic physics-informed neural networks, indicating the role of the PDE residual loss as an effective regularization mechanism for training deep generative models in small data regimes.

Table 2: Relative L2 error for different numbers of training points N_u (rows) and collocation points N_r (columns).
N_u \ N_r      10       100      250      500      1000     5000     10000
60        9.3e-01  5.6e-01  4.8e-01  5.0e-02  1.9e-01  5.0e-02  5.1e-02
90        5.8e-01  5.3e-01  3.5e-01  1.5e-01  4.9e-02  1.0e-01  5.8e-02
150       6.7e-01  1.4e-01  3.0e-01  3.6e-02  4.9e-02  1.2e-01  4.7e-02
b.3 Sensitivity with respect to the neural network architecture
In this study we aim to quantify the sensitivity of our predictions with respect to the architecture of the neural networks that parametrize the generator, the discriminator, and the encoder. Here we have fixed the number of noise-free training and collocation points, and we kept the number of layers for the discriminator always one less than the number of layers for the generator (e.g., if the number of layers for the generator is two, then the number of layers for the discriminator is one, etc.). In all cases, we have used a hyperbolic tangent nonlinearity and a normal Xavier initialization [32]. In table 3 we report the relative L2 prediction error for different feed-forward architectures for the generator, discriminator, and encoder (i.e., different numbers of layers and numbers of nodes in each layer). The general trend suggests that as the neural network capacity is increased we obtain more accurate predictions, indicating that our physics-informed constraint on the PDE residual can effectively regularize the training process and safeguard against over-fitting. We denote the number of neurons in each layer by N_n and the number of layers for the generator (encoder) by N_l.

Table 3: Relative L2 error for different numbers of generator (encoder) layers N_l (rows) and neurons per layer N_n (columns).
N_l \ N_n    20       50       100
2       4.2e-01  3.8e-01  5.7e-01
3       6.5e-02  3.5e-02  2.1e-02
4       9.3e-02  4.7e-02  5.4e-02
b.4 Sensitivity with respect to the adversarial training procedure
Finally, we test the sensitivity with respect to the adversarial training process. To this end, we have fixed the number of noise-free training and collocation points, kept the neural network architecture the same as in B.2, and varied the total number of training steps taken for the generator and the discriminator within each stochastic gradient descent iteration. The results of this study are presented in table 4, where we report the relative L2 prediction error. These results reveal the high sensitivity of the training dynamics to the interplay between the generator and discriminator networks, and highlight the well-known peculiarity of adversarial inference procedures, which require careful tuning of the number of generator and discriminator updates to achieve stable performance in practice.
Table 4: Relative L2 error for different numbers of training steps taken for the generator and the discriminator in each iteration.
steps      1        2        5
1     3.5e-01  5.0e-01  1.5e+00
2     4.3e-02  3.2e-01  5.4e-01
5     4.7e-02  2.3e-01  7.0e-01
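The alternating update schedule studied here can be sketched as follows. This toy example is a standard GAN on one-dimensional data, not our physics-informed model; it only illustrates how assumed generator and discriminator step counts (`n_g`, `n_d`) interleave within each stochastic gradient iteration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))  # generator
D = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))  # discriminator (logits)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
n_d, n_g = 1, 2   # discriminator / generator steps per iteration

for it in range(200):
    real = torch.randn(64, 1) * 0.5 + 2.0            # toy "data" distribution
    for _ in range(n_d):                             # n_d discriminator updates
        fake = G(torch.randn(64, 1)).detach()
        loss_d = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    for _ in range(n_g):                             # n_g generator updates
        fake = G(torch.randn(64, 1))
        loss_g = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

Varying `n_d` and `n_g` in such a loop is the knob explored in table 4; an overly strong discriminator or generator destabilizes the game, which is consistent with the sensitivity observed above.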