1 Introduction
Recent advances in machine learning and data analytics have yielded transformative results across diverse scientific disciplines, including image recognition Krizhevsky et al. (2012); LeCun et al. (2015), cognitive science Lake et al. (2015), and genomics Alipanahi et al. (2015). In all aforementioned areas, the volume of data has increased substantially compared to even a decade ago, but analyzing big data is expensive and time-consuming. Data-driven methods, which have been enabled by the availability of sensors, data storage, and computational resources, are taking center stage across many disciplines of science. We now have highly scalable solutions for problems in object detection and recognition, machine translation, text-to-speech conversion, recommender systems, and information retrieval LeCun et al. (2015). All of these solutions attain state-of-the-art performance when trained with large amounts of data. However, more often than not, in laboratory experiments and large-scale simulations aiming to elucidate and predict complex phenomena, a large amount of high-quality, error-free data is prohibitively costly to obtain. In this setting, purely data-driven approaches to machine learning face difficulties when the data is scarce relative to the complexity of the system. The vast majority of state-of-the-art machine learning techniques (e.g., deep neural nets, convolutional networks, recurrent networks, etc. Goodfellow et al. (2016)) lack robustness and fail to provide guarantees of convergence or to quantify the error/uncertainty associated with their predictions. Hence, the ability to learn in a robust and sample-efficient manner is a necessity in these data-limited domains. Even less well understood is how one can constrain such algorithms to leverage domain-specific knowledge and return predictions that satisfy certain physical principles (e.g., conservation of mass, momentum, etc.).
These shortcomings often generate skepticism and disbelief among applied mathematicians and engineers regarding the solid grounding of purely data-driven machine learning approaches. In recent work, Raissi et al. Raissi et al. (2017a, b, 2018); Raissi and Karniadakis (2018); Raissi (2018) set foot exactly at this relatively unexplored interface between applied mathematics and contemporary machine learning by revisiting the idea of penalizing the loss function of deep neural networks using differential equation constraints, as first put forth by Psichogios and Ungar Psichogios and Ungar (1992) and Lagaris et al. Lagaris et al. (1998). This line of work has empirically demonstrated how such physics-informed constraints regularize learning in small data regimes, can lead to the discovery of governing equations and reduced-order models, and can enable the prediction of complex dynamics from incomplete models and incomplete data. Despite a series of impressive results in canonical problems, Raissi et al. Raissi et al. (2017a, b) have also pointed out cases in which the training phase of these algorithms faces severe difficulties for reasons that are currently poorly understood. In the absence of supporting theory on convergence and a posteriori error estimation, this naturally poses the need for scalable algorithms for uncertainty quantification.
A literature review of the current state-of-the-art in uncertainty quantification reveals a subtle dichotomy between different communities. On one hand, researchers in applied mathematics and scientific computing predominantly rely on mathematical models that are rigorously derived from first physical principles. At the dawn of exascale computing, such models have enabled the accurate simulation of increasingly complex phenomena (see, e.g., Perdikaris et al. (2016); Rossinelli et al. (2015)). They have also enabled systematic in silico studies in which the behavior of a system can be probed in a controlled fashion for different conditions, parameter settings, external inputs, etc. Šukys et al. (2017). The latter aim both to elucidate the key mechanisms that govern the behavior of a system and to characterize the robustness of the resulting predictions with respect to epistemic and aleatory uncertainty Oden et al. (2010). However, despite the progress made over the last two decades, the most popular methods for scientific computing under uncertainty, such as polynomial chaos expansions Ghanem and Spanos (1990); Xiu and Karniadakis (2002); Najm (2009), sparse grid quadratures Gerstner and Griebel (1998); Eldred and Burkardt (2009), multilevel/multi-fidelity Monte Carlo sampling Barth et al. (2011); Peherstorfer et al. (2016), proper orthogonal decomposition Berkooz et al. (1993); Le Maître and Knio (2010), and Gaussian process regression models Bilionis and Zabaras (2012); Bilionis et al. (2013); Perdikaris et al. (2016), all face severe limitations in view of the non-Gaussian likelihoods and high-dimensional posterior distributions commonly encountered in realistic applications.
On the other hand, the recent explosive growth of machine learning research has put forth new effective ways of learning and manipulating complex high-dimensional probability distributions. Inference tools like variational autoencoders Kingma and Welling (2013) and generative adversarial networks Goodfellow et al. (2014), formulated on top of flexible building blocks such as feed-forward/convolutional/recurrent neural networks Goodfellow et al. (2016), have introduced highly scalable solutions, albeit for problems where not much prior information is assumed, but instead large amounts of data can be obtained at relatively low cost (e.g., image recognition Krizhevsky et al. (2012), natural language processing LeCun et al. (2015)). In this work, we aim to leverage recent developments in machine learning to put forth a scalable framework for uncertainty propagation in physical systems for which the cost of data acquisition is high and training datasets are typically small, but strong prior information exists by means of known governing laws expressed by partial differential equations. Specifically, we construct a class of probabilistic physics-informed neural networks that enables us to obtain a posterior characterization of the uncertainty associated with their predicted outputs. Moreover, we develop a flexible variational inference framework that allows us to train such models directly from noisy input/output data, and to predict outcomes of nonlinear dynamical systems that are partially observed, with quantified uncertainty. This setting necessitates a departure from the classical deterministic realm of modeling and scientific computation, and, consequently, our main building blocks can no longer be crisp deterministic numbers and governing laws; instead we must operate with probabilistic models.

This paper is structured as follows. In section 2.1 we provide a brief overview of physics-informed neural networks in sync with the recent developments in Raissi et al. (2017a, b); Raissi and Karniadakis (2018); Raissi (2018). In sections 2.2 and 2.3 we outline the proposed probabilistic formulation and the associated variational inference framework. Finally, in section 3 we demonstrate the effectiveness of our approach through a series of examples involving uncertainty propagation in nonlinear conservation laws, and the discovery of constitutive laws for flow through porous media directly from noisy data.
2 Methods
2.1 Physics-informed neural networks
The recent works of Raissi et al. Raissi et al. (2017a, b); Raissi and Karniadakis (2018); Raissi (2018) have demonstrated how classical conservation laws and numerical discretization schemes can be used as structured prior information that can enhance the robustness and efficiency of modern machine learning algorithms, introducing a new class of data-driven solvers, as well as a physics-informed machine learning approach to model discovery. To this end, the authors have considered constructing deep neural networks that return predictions which are constrained by parametrized partial differential equations (PDE) of the form
u_t + N_x[u] = 0,  x ∈ Ω,  t ∈ [0, T],    (1)
where u(t, x) is represented by a deep neural network parametrized by a set of parameters θ, i.e., u_θ(t, x), x is a vector of space coordinates, t is the time coordinate, and N_x[·] is a nonlinear differential operator. As neural networks are differentiable representations, this construction defines a so-called physics-informed neural network that corresponds to the PDE residual, i.e., r_θ(t, x) := u_t + N_x[u]. This new network has the same parameters as the network representing u(t, x), albeit different activation functions due to the action of the differential operator
Raissi et al. (2017a); Psichogios and Ungar (1992); Lagaris et al. (1998). From an implementation perspective, this network can be readily obtained by leveraging recent progress in automatic differentiation Baydin et al. (2015); Abadi et al. (2016). The resulting training procedure allows us to recover the shared network parameters θ using a few scattered observations of u(t, x), namely {(t_u^i, x_u^i), u^i}, i = 1, ..., N_u, along with a larger number of collocation points {(t_r^i, x_r^i)}, i = 1, ..., N_r, that aim to penalize the PDE residual at a finite set of collocation nodes. This simple, yet remarkably effective regularization procedure allows us to introduce the PDE residual as a soft penalty constraint in the likelihood function of the model Raissi et al. (2017a, b), and the resulting optimization problem can be effectively solved using standard stochastic gradient descent without necessitating any elaborate constrained optimization techniques, simply by minimizing the composite loss function
L(θ) = (1/N_u) Σ_{i=1}^{N_u} |u_θ(t_u^i, x_u^i) − u^i|² + (1/N_r) Σ_{i=1}^{N_r} |r_θ(t_r^i, x_r^i)|²,    (2)
where the required gradients can be readily obtained using automatic differentiation Baydin et al. (2015). Finally, as the resulting predictions are encouraged to inherit any physical properties imposed by the PDE constraint (e.g., conservation, invariance, symmetries, etc.), this approach showcases how one can approximately encode physical and domainspecific constraints in modern machine learning algorithms and introduce a new form of regularization for learning from small datasets.
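To make this concrete, the following minimal sketch (our illustration, not the original implementation) evaluates the composite loss of equation 2 for a toy boundary value problem u_xx = −π² sin(πx), whose exact solution is u(x) = sin(πx). In practice u_θ is a deep network and its derivatives are obtained by automatic differentiation; here a hypothetical one-parameter surrogate with closed-form derivatives keeps the example self-contained:

```python
import numpy as np

# Hedged sketch of the composite loss in equation (2) for the toy ODE
# u_xx(x) = -pi^2 sin(pi x), whose exact solution is u(x) = sin(pi x).
# In practice u_theta is a deep network and u_xx is obtained by automatic
# differentiation; here the model u_theta(x) = theta * sin(pi x) has a
# known second derivative, purely to keep the sketch self-contained.

def u_model(theta, x):
    return theta * np.sin(np.pi * x)

def residual(theta, x):
    # r_theta(x) = u_xx + pi^2 sin(pi x), using the closed-form derivative
    u_xx = -theta * np.pi**2 * np.sin(np.pi * x)
    return u_xx + np.pi**2 * np.sin(np.pi * x)

def composite_loss(theta, x_data, u_data, x_colloc):
    data_term = np.mean((u_model(theta, x_data) - u_data) ** 2)
    pde_term = np.mean(residual(theta, x_colloc) ** 2)
    return data_term + pde_term

x_data = np.array([0.25, 0.5])          # a few scattered observations
u_data = np.sin(np.pi * x_data)         # noise-free measurements
x_colloc = np.linspace(0.0, 1.0, 50)    # collocation points for the residual

# The loss vanishes at the true parameter theta = 1 and is positive elsewhere.
print(composite_loss(1.0, x_data, u_data, x_colloc))  # ~0.0
print(composite_loss(0.5, x_data, u_data, x_colloc) > 0.0)
```

The loss vanishes only when both the scattered data and the PDE residual at the collocation points are matched, which is exactly the regularization mechanism described above.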
2.2 Probabilistic physics-informed neural networks
Here we put forth a probabilistic formulation for propagating uncertainty through physics-informed neural networks using latent variable models of the form
u = f_θ(t, x, z),  z ∼ p(z),    (3)
This setting encapsulates a wide range of deterministic and stochastic problems, where u is a potentially multivariate field, and z is a collection of random latent variables. The ability to learn such a model from data is the cornerstone of probabilistic scientific computing and uncertainty quantification in physical systems. Knowledge of the conditional probability p_θ(u | t, x) subject to domain knowledge constraints introduces a regularization mechanism that limits the space of admissible solutions to a manageable size (e.g., in fluid mechanics problems by discarding any non-realistic flow solutions that violate the conservation of mass principle), thus enabling the training of probabilistic deep learning algorithms in small data regimes. Moreover, by providing a complete characterization of uncertainty, it enhances the robustness of our predictions and provides a posteriori error estimates for assessing model inadequacy. The latter can also enable downstream tasks such as the formulation of adaptive data acquisition policies for active learning or Bayesian optimization Shahriari et al. (2016) with domain knowledge constraints. Finally, thanks to the structure encoded by the PDE itself, the resulting latent variables can potentially lead to the extraction of physically relevant and interpretable low-dimensional feature representations, which can subsequently introduce new techniques for nonlinear model order reduction and coarse-graining of complex systems.

2.3 Adversarial inference for joint distribution matching
Following the recent findings of Li et al. (2018), we argue that matching the joint distribution of the generated data, p_θ(t, x, u), with the joint distribution of the observed data, q(t, x, u), by minimizing the reverse Kullback–Leibler divergence KL(p_θ(t, x, u) ‖ q(t, x, u)) is a promising approach to train the generative model presented in equation 3. This also implies that the respective marginal and conditional distributions are encouraged to match. The use of the reverse Kullback–Leibler divergence (in contrast to the maximum likelihood setup) is motivated by examining the following decomposition

KL(p_θ(t, x, u) ‖ q(t, x, u)) = −H(p_θ(t, x, u)) − E_{p_θ(t, x, u)}[log q(t, x, u)],    (4)
where H(p_θ(t, x, u)) denotes the entropy of the generative model. The second term can be further decomposed as

E_{p_θ(t, x, u)}[log q(t, x, u)] = ∫_{S_p ∩ S_q} p_θ(t, x, u) log q(t, x, u) + ∫_{S_p ∩ S_q^c} p_θ(t, x, u) log q(t, x, u),    (5)
where S_p and S_q denote the support of the distributions p_θ(t, x, u) and q(t, x, u), respectively, while S_q^c denotes the complement of S_q. Notice that by minimizing the Kullback–Leibler divergence in equation 4 we introduce a mechanism that tries to balance the effect of two competing objectives. Specifically, maximization of the entropy term encourages p_θ(t, x, u) to spread over its support set as widely as possible, while the second integral term in equation 5 introduces a strong (negative) penalty when the supports of p_θ and q do not overlap. Hence, the support of p_θ is encouraged to spread only up to the point that S_p ⊆ S_q. When S_p is strictly smaller than S_q, the pathological issue of "mode collapse" (commonly encountered in the training of generative adversarial networks Goodfellow et al. (2014)) is manifested Salimans et al. (2016). This issue is present if one seeks to directly minimize the reverse Kullback–Leibler objective in equation 4, as this provides no control on the relative importance of the two terms. As discussed in Li et al. (2018), we may rather minimize −λ H(p_θ(t, x, u)) − E_{p_θ(t, x, u)}[log q(t, x, u)], with the weight λ > 0 allowing control of how much emphasis is placed on mitigating mode collapse. It is then clear that the entropic regularization introduced by λ provides an effective mechanism for controlling and mitigating the effect of mode collapse, and, therefore, potentially enhancing the robustness of adversarial inference procedures for learning p_θ(t, x, u).
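The mode-seeking behavior of the reverse Kullback–Leibler divergence described above can be checked numerically. In this small self-contained experiment (ours, not from the paper), a single-Gaussian model placed on one mode of a bimodal target incurs a far smaller reverse KL than one spread over the low-density region between the modes:

```python
import numpy as np

# Numerical illustration of why the reverse KL, KL(p || q), is "mode
# seeking": a single-mode model p placed on one mode of a bimodal target
# q incurs a much smaller divergence than one centered on the low-density
# region between the modes.

x = np.linspace(-8, 8, 4001)
dx = x[1] - x[0]

def gauss(x, mu, sig):
    return np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

q = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)  # bimodal target

def reverse_kl(mu):
    p = gauss(x, mu, 0.5)
    # grid integration of p log(p/q); epsilon guards log(0)
    return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dx

print(reverse_kl(2.0))   # covers one mode: roughly log 2
print(reverse_kl(0.0))   # sits where q has almost no mass: much larger
```

This is precisely why an entropy term is needed: without it, nothing discourages p_θ from collapsing onto a single high-density region of q.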
Minimization of equation 4 with respect to the generative model parameters θ presents two fundamental difficulties. First, the evaluation of both distributions p_θ(t, x, u) and q(t, x, u) typically involves intractable integrals in high dimensions, and we may only have samples drawn from the two distributions, not their explicit analytical forms. Second, the differential entropy term H(p_θ(t, x, u)) is intractable, as p_θ is not known a priori. In the next sections we revisit the unsupervised formulation put forth in Li et al. (2018) and derive a tractable inference procedure for learning p_θ(t, x, u) from scattered observation pairs of (t, x, u), namely {(t_u^i, x_u^i), u^i}, i = 1, ..., N_u.
2.3.1 Density ratio estimation by probabilistic classification
By definition, the computation of the Kullback–Leibler divergence in equation 4 involves computing an expectation over a log-density ratio, i.e., KL(p_θ ‖ q) = E_{p_θ}[log(p_θ(t, x, u)/q(t, x, u))]. In general, given samples from two distributions, we can approximate their density ratio by constructing a binary classifier that distinguishes between samples from the two distributions. To this end, we assume that data points are drawn from p_θ(t, x, u) and assigned the label y = 1. Similarly, we assume that samples are drawn from q(t, x, u) and assigned the label y = 0. Consequently, we can write these probabilities in a conditional form, namely p_θ(t, x, u) = p(t, x, u | y = 1) and q(t, x, u) = p(t, x, u | y = 0), where p(y = 1 | t, x, u) and p(y = 0 | t, x, u) are the class probabilities predicted by a binary classifier D(t, x, u). Using Bayes' rule, it is then straightforward to show that the density ratio of p_θ and q can be computed as
p_θ(t, x, u)/q(t, x, u) = p(y = 1 | t, x, u)/p(y = 0 | t, x, u) = D(t, x, u)/(1 − D(t, x, u)),    (6)

assuming an equal number of samples is drawn from each class.
This simple procedure suggests that we can harness the power of deep neural network classifiers to obtain accurate estimates of the reverse Kullback–Leibler divergence in equation 4 directly from data, and without the need to assume any specific parametrization for the generative model distribution p_θ(t, x, u).
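The classification trick can be illustrated with two one-dimensional Gaussians, for which the log-density ratio is known in closed form. The sketch below (a toy illustration under our own choice of distributions, not the paper's discriminator) fits a logistic regression by full-batch gradient descent; its logit recovers log(p(x)/q(x)) = x − 1/2:

```python
import numpy as np

# Density ratio estimation by probabilistic classification: samples from
# p = N(1, 1) get label y = 1 and samples from q = N(0, 1) get y = 0.
# The true log ratio log(p(x)/q(x)) = x - 1/2 is linear in x, so a
# logistic regression with features (x, 1) suffices for this toy case.

rng = np.random.default_rng(0)
n = 20000
x = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)])
y = np.concatenate([np.ones(n), np.zeros(n)])

w, b = 0.0, 0.0
for _ in range(2000):  # full-batch gradient descent on the cross-entropy
    prob = 1.0 / (1.0 + np.exp(-(w * x + b)))
    grad_w = np.mean((prob - y) * x)
    grad_b = np.mean(prob - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# The learned logit w*x + b approximates the true log ratio x - 0.5.
print(w, b)  # approximately 1.0 and -0.5
```

In the method above, the same logit (produced by a deep network rather than a linear model) is what enters the reverse Kullback–Leibler estimate.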
2.3.2 Entropic regularization bound
Here we follow the derivation of Li et al. Li et al. (2018) to construct a computable lower bound for the entropy H(p_θ(t, x, u)). To this end, we start by considering the random variables (t, x, u, z) under the joint distribution p_θ(t, x, u, z) = q(t, x) p(z) δ(u − f_θ(t, x, z)), where u = f_θ(t, x, z), and δ is the Dirac delta function. The mutual information I(u; z) between u and z satisfies the information-theoretic identity

I(u; z) = H(u) − H(u | z) = H(z) − H(z | u),

where H(u), H(z) are the marginal entropies and H(u | z), H(z | u) are the conditional entropies Akaike (1998). Since in our setup t and x are deterministic variables independent of z, and samples of u are generated by the deterministic function f_θ(t, x, z), it follows that H(u | z) = 0. We therefore have

H(u) = H(z) − H(z | u),    (7)

where H(z) does not depend on the generative model parameters θ.
Now consider a general variational distribution q_φ(z | t, x, u) parametrized by a set of parameters φ. Then, since the Kullback–Leibler divergence between the true posterior and q_φ is non-negative,

H(z | u) = −E_{p_θ}[log p_θ(z | t, x, u)] ≤ −E_{p_θ}[log q_φ(z | t, x, u)].    (8)

Viewing z as a set of latent variables, q_φ(z | t, x, u) is a variational approximation to the true intractable posterior over the latent variables, p_θ(z | t, x, u). Therefore, if q_φ is introduced as an auxiliary inference model associated with the generative model p_θ, for which u = f_θ(t, x, z) and z ∼ p(z), then we can use equations 7 and 8 to bound the entropy term in equation 4 as

H(u) ≥ H(z) + E_{p_θ}[log q_φ(z | t, x, u)].    (9)
Note that the inference model plays the role of a variational approximation to the true posterior over the latent variables, and appears naturally using information theoretic arguments in the derivation of the lower bound.
2.3.3 Adversarial training objective
By leveraging the density ratio estimation procedure described in section 2.3.1 and the entropy bound derived in section 2.3.2, we can derive the following loss functions for minimizing the reverse KullbackLeibler divergence with entropy regularization
L_D(ψ) = −E_{q(t, x, u)}[log σ(T_ψ(t, x, u))] − E_{q(t, x) p(z)}[log(1 − σ(T_ψ(t, x, f_θ(t, x, z))))],    (10)

L_G(θ, φ) = E_{q(t, x) p(z)}[T_ψ(t, x, f_θ(t, x, z)) + (1 − λ) log q_φ(z | t, x, f_θ(t, x, z))],    (11)
where σ(x) = 1/(1 + e^{−x}) is the logistic sigmoid function and T_ψ(t, x, u) denotes the discriminator network with parameters ψ. Notice how the binary cross-entropy objective of equation 10 aims to progressively improve the ability of the classifier to discriminate between "fake" samples obtained from the generative model p_θ(t, x, u) and "true" samples originating from the observed data distribution q(t, x, u). Simultaneously, the objective of equation 11 aims at improving the ability of the generator to produce increasingly more realistic samples that can "fool" the discriminator. Moreover, the encoder q_φ(z | t, x, u) not only serves as an entropic regularization term that allows us to stabilize model training and mitigate the pathology of mode collapse, but also provides a variational approximation to the true posterior over the latent variables. The way it naturally appears in the objective of equation 11 also encourages the cycle-consistency of the latent variables z; a process that is known to result in disentangled and interpretable low-dimensional representations of the observed data Friedman et al. (2001), which could subsequently be used as good features for nonlinear model order reduction.

In theory, the optimal set of parameters corresponds to the Nash equilibrium of the two-player game defined by the loss functions in equations 10 and 11, for which one can show that the exact model distribution and the exact posterior over the latent variables can be recovered Goodfellow et al. (2014); Pu et al. (2017). In practice, although there is no guarantee that this optimal solution can be attained, the generative model can be trained by alternating between optimizing the two objectives in equations 10 and 11 using stochastic gradient descent as
ψ^{k+1} = ψ^k − ρ^k ∇_ψ L_D(ψ^k),    (12)

(θ^{k+1}, φ^{k+1}) = (θ^k, φ^k) − ρ^k ∇_{(θ, φ)} L_G(θ^k, φ^k),    (13)

where ρ^k denotes a (possibly adaptive) learning rate.
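The alternating update pattern above can be illustrated on a toy convex–concave two-player game (our stand-in for the actual generator/discriminator losses), where alternating gradient descent/ascent converges to the unique Nash equilibrium:

```python
import numpy as np

# Minimal illustration (not the paper's model) of alternating gradient
# updates converging to a Nash equilibrium, for the convex-concave game
# L(theta, psi) = theta*psi + theta^2/2 - psi^2/2, whose unique
# equilibrium is (0, 0). One player descends in theta while the other
# ascends in psi, mirroring the alternating structure of the updates
# in equations (12) and (13).

theta, psi = 1.0, -1.0
lr = 0.1
for _ in range(500):
    # "discriminator" step: ascend on dL/dpsi = theta - psi
    psi = psi + lr * (theta - psi)
    # "generator" step: descend on dL/dtheta = psi + theta
    theta = theta - lr * (psi + theta)

print(theta, psi)  # both close to the equilibrium at 0
```

For a genuine adversarial pair of neural networks the losses are non-convex and no such convergence guarantee exists, which is exactly the practical difficulty discussed in section 2.3.6.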
2.3.4 Adversarial training with physics-informed constraints
In order to learn the physics-informed probabilistic model of equation 3 from data, we can extend the adversarial inference framework presented above by appropriately penalizing the loss function of the generator (see equation 11). The available data correspond to scattered observation pairs {(t_u^i, x_u^i), u^i}, i = 1, ..., N_u, originating from known initial or boundary conditions, or any other (potentially noisy) measurements of u(t, x). In analogy to the deterministic setting put forth in Raissi et al. (2017a) and summarized in section 2.1, by defining r := u_t + N_x[u] we essentially introduce a new conditional probability model p_θ(r | t, x) that shares the same parameters as p_θ(u | t, x), albeit the underlying neural network that serves as its approximation has different activation functions. However, since we would like to encourage every sample produced by the generator to satisfy the PDE constraint, we can simply treat the residual as a deterministic variable, i.e., r_θ(t, x, z) := u_t + N_x[u] with u = f_θ(t, x, z), and enforce the constraint at a finite set of collocation points {(t_r^i, x_r^i)}, i = 1, ..., N_r, by simply minimizing the mean square loss
L_r(θ) = (1/N_r) Σ_{i=1}^{N_r} |r_θ(t_r^i, x_r^i, z^i)|²,  z^i ∼ p(z),    (14)
Then, the resulting adversarial game for training the physics-informed model of equation 3 takes the form of an augmented generator objective (the discriminator loss of equation 10 remains unchanged)

L_G(θ, φ) = E_{q(t, x) p(z)}[T_ψ(t, x, f_θ(t, x, z)) + (1 − λ) log q_φ(z | t, x, f_θ(t, x, z))] + β L_r(θ),    (15)

where larger positive values of β place more emphasis on penalizing the PDE residual. For β > 0, the residual loss acts as a regularization term that approximately enforces the given physical constraint, and, therefore, encourages the generator to produce samples that satisfy the underlying partial differential equation. Also note that this structured approach encourages the encoder to learn a set of spatio-temporal latent variables that are relevant to the underlying physics, possibly opening new directions for probabilistic model order reduction of complex systems.
2.3.5 Predictive distribution
Once the model is trained, we can construct a probabilistic ensemble for the solution by sampling latent variables z from the prior p(z) and passing them through the generator to yield samples that are distributed according to the predictive model distribution p_θ(u | t, x). Note that although the explicit form of this distribution is not known, we can efficiently compute any of its moments via Monte Carlo sampling. The cost of this prediction step is negligible compared to the cost of training the model, as it only involves a single forward pass through the generator function f_θ(t, x, z). Typically, we compute the mean and variance of the predictive distribution at a new test point (t*, x*) as

μ_θ(t*, x*) ≈ (1/N_s) Σ_{i=1}^{N_s} f_θ(t*, x*, z^i),    (16)

σ_θ²(t*, x*) ≈ (1/N_s) Σ_{i=1}^{N_s} [f_θ(t*, x*, z^i) − μ_θ(t*, x*)]²,    (17)

where z^i ∼ p(z), i = 1, ..., N_s, and N_s corresponds to the total number of Monte Carlo samples.
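The Monte Carlo estimators of the predictive moments amount to a few lines of code. Below, a hypothetical trained generator with known moments stands in for f_θ so the estimates can be checked against the exact mean and variance:

```python
import numpy as np

# Monte Carlo estimate of the predictive mean and variance in equations
# (16)-(17), using a stand-in generator f(x, z) = sin(pi x) + 0.1 z with
# z ~ N(0, 1) (a hypothetical trained model, chosen so the exact moments
# are known: mean sin(pi x) and variance 0.01).

rng = np.random.default_rng(0)

def generator(x, z):
    return np.sin(np.pi * x) + 0.1 * z

x_star = 0.25                      # a new test point
z = rng.normal(size=100000)        # N_s samples from the prior p(z)
samples = generator(x_star, z)     # one forward pass per sample

mu = samples.mean()                # equation (16)
var = samples.var()                # equation (17)
print(mu, var)  # close to sin(pi/4) ~ 0.707 and 0.01
```

Higher moments, quantiles, or full histograms of p_θ(u | t, x) follow from the same set of samples at no extra cost.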
2.3.6 Advantages and caveats of adversarial learning
Since their recent introduction Goodfellow et al. (2014); Makhzani et al. (2015); Dumoulin et al. (2016); Mescheder et al. (2017), adversarial learning techniques have provided great flexibility for performing probabilistic computations with arbitrarily complex implicit distributions. Essentially, they have lifted the oversimplified approximations typically used in variational inference (Gaussian approximations, exponential families, etc.) Blei et al. (2017), yielding very general and flexible schemes for statistical inference. However, this flexibility comes at a price, as such methods in practice require very careful tuning in order to achieve stable and accurate performance. To this end, recall the training objective defined in equation 15 that introduces an adversarial game between the generator and discriminator networks Goodfellow et al. (2014). In practice, this minimax optimization problem is solved by alternating stochastic gradient updates between the two competing objectives, and it is highly sensitive to the capacity of the neural networks modeling the generator and discriminator, as well as to the relative frequency with which the parameters of each network are updated within each iteration of stochastic gradient descent. Below we provide a series of empirical observations and lessons learned throughout this study that can enhance the robustness and stability of this training procedure:

Changing the relative number of stochastic gradient updates for the generator and the discriminator is equivalent to changing their neural network architectures. For example, we can reduce the capacity of the discriminator either by performing more stochastic gradient updates for the generator, or by removing one layer from the neural network architecture of the discriminator.

Given enough collocation points for penalizing the PDE residual, we can obtain robust uncertainty estimates together with precise predictions simply by tuning the capacity of the discriminator and generator networks.

Typically, for a fixed generator, we expect the discriminator to need some capacity so that the model training dynamics remain stable. However, we do not want the discriminator to be too powerful, as in that case there will be very little information from the discriminator that can help the generator improve towards producing more realistic samples (this is a common characteristic of adversarial inference procedures Goodfellow et al. (2014)).

For cases with a small number of training data, we should reduce the capacity of the discriminator. This can be achieved either by changing the relative frequency of stochastic gradient updates for the generator and discriminator, or by reducing the capacity of the discriminator neural network architecture.
3 Results
In all examples we have trained the models for 30,000 stochastic gradient descent steps using the Adam optimizer Kingma and Ba (2014) with a fixed learning rate, while fixing a one-to-five ratio for the discriminator versus generator updates, and keeping the entropic regularization parameter λ and the residual penalty parameter β fixed throughout. The proposed algorithms were implemented in Tensorflow v1.10 Abadi et al. (2016), and computations were performed in single-precision arithmetic on a single NVIDIA Tesla P100 GPU card. All data and code accompanying this manuscript will be made available at https://github.com/PredictiveIntelligenceLab/UQPINNs.

3.1 A pedagogical example
Let us illustrate the basic capabilities of the proposed methods through a simple example corresponding to the following nonlinear second-order ordinary differential equation
(18)  
subject to random boundary conditions. For this simple example, the deterministic solution corresponding to noise-free boundary data can be readily obtained in closed form. Given observations of u(x) corresponding to different realizations of the random boundary conditions, our goal is to obtain a probabilistic representation of the solution by training a physics-informed generative model of the form u = f_θ(x, z) that is constrained by equation 18. To this end, we introduce three deterministic mappings parametrized by deep neural networks, namely a generator f_θ(x, z), an encoder q_φ(z | x, u), and a discriminator T_ψ(x, u), corresponding to the functions introduced in section 2.3. By construction, we also obtain a physics-informed neural network corresponding to the deterministic residual of equation 18 that will be used to approximately enforce the differential equation constraint at a set of randomly distributed collocation points. All neural networks were chosen to have two hidden layers with 50 neurons in each layer, and a hyperbolic tangent activation function. Moreover, the dimensionality of the latent variables was set to one, i.e., z ∈ R, and we have assumed an isotropic standard normal prior, namely p(z) = N(0, 1). As the training data for u reflects the uncertainty in the boundary conditions, the role of the latent variables z is to enable the propagation of this uncertainty into the predicted solution obtained through the generative model f_θ(x, z).

Here we have considered two cases corresponding to deterministic and random boundary conditions, namely (i) noise-free boundary data, and (ii) boundary data corrupted by Gaussian uncorrelated noise. In all cases, the training data consists of multiple realizations for each boundary point, along with a set of collocation points for enforcing the residual of equation 18. Our probabilistic predictions for this example are summarized in figures 1 and 2. Specifically, figure 1(a) shows the generative model predictive mean and two standard deviations, plotted against the exact solution of this problem. Note that this case corresponds to deterministic training data for the boundary conditions, hence the exact solution is deterministic, and the prediction error here is measured in the relative L2 norm
error = ‖û − u‖₂ / ‖u‖₂ = [Σ_{i=1}^{N} |û(x_i) − u(x_i)|²]^{1/2} / [Σ_{i=1}^{N} |u(x_i)|²]^{1/2},    (19)
where û denotes the predictive mean and N denotes the total number of equidistant test points in the domain. Moreover, the variance shown in the inset of figure 1(a) serves as an a posteriori error estimate that quantifies the uncertainty associated with the generative model predictions. Figure 1(b) shows the resulting prediction and uncertainty estimates corresponding to random boundary conditions, compared against a reference mean solution obtained numerically using a spectral method with 2,000 Monte Carlo samples. In this case, the predictive uncertainty of the generative model reflects the aggregate total uncertainty due to both randomness in the boundary conditions and the inherent epistemic uncertainty in the neural network approximation. As the generative model can return a complete statistical characterization of the solution by means of its conditional probability density p_θ(u | x), in figure 2 we provide a visual comparison of the one-dimensional marginals between our predictions and the reference Monte Carlo solution at two representative spatial locations.
Albeit simple, this example aims to demonstrate the basic capabilities of the proposed methodology in propagating uncertainty through nonlinear partial differential equations. In contrast to previous approaches to inferring solutions of partial differential equations from data Cockayne et al. (2016); Raissi et al. (2017, 2018); Cockayne et al. (2017), the proposed methodology does not rely on Gaussian assumptions, and it can directly tackle nonlinear problems without any need for linearization.
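For reference, the relative L2 error metric of equation 19 can be computed as follows (the prediction and exact solution below are placeholders, not the ones from this example):

```python
import numpy as np

# The relative L2 error of equation (19), evaluated for an illustrative
# prediction u_pred against a made-up exact solution u(x) = sin(pi x)
# on equidistant test points (both functions are placeholders).

def relative_l2(u_pred, u_exact):
    return np.linalg.norm(u_pred - u_exact, 2) / np.linalg.norm(u_exact, 2)

x = np.linspace(-1.0, 1.0, 1000)        # equidistant test points
u_exact = np.sin(np.pi * x)
u_pred = u_exact + 0.01 * np.cos(np.pi * x)  # prediction with a small error

print(relative_l2(u_pred, u_exact))  # ~0.01
```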
3.2 Burgers equation
In this example we aim to provide a comprehensive systematic study to quantify the robustness of the proposed methods with respect to different parameter choices. We will do so through the lens of a more challenging canonical problem involving the nonlinear time-dependent Burgers equation in one spatial dimension:
u_t + u u_x − ν u_xx = 0,  x ∈ [−1, 1],  t ∈ [0, 1],    (20)
where the viscosity parameter ν is chosen to be small enough to generate a strongly nonlinear response that leads to the development of shock discontinuities in finite time. This is one of the few nonlinear partial differential equations that admits an exact solution through the Cole–Hopf transformation Hopf (1950); a solution that will subsequently be used to test the validity of our predictions.
Here we represent the unknown solution using a physics-informed generative model of the form u = f_θ(t, x, z), and we introduce parametric functions corresponding to a generator f_θ(t, x, z), an encoder q_φ(z | t, x, u), and a discriminator T_ψ(t, x, u), all constructed using deep feed-forward neural networks. The baseline architectures for the generator and the encoder have 4 hidden layers with 50 neurons per layer, while the discriminator network has 3 hidden layers and 50 neurons per layer. The activation function in all cases is chosen to be a hyperbolic tangent nonlinearity. The prior over the latent variables z is again chosen to be a one-dimensional isotropic Gaussian distribution, i.e., p(z) = N(0, 1).

First we consider a baseline scenario, in which we train our probabilistic model using a dataset comprising noise-free input/output pairs – 50 points for the initial condition (see figure 3(a)) and 50 points for each of the domain boundaries – plus an additional set of collocation points for enforcing the residual of the Burgers equation using the loss of equation 14. All data points were randomly selected within the bounds given in equation 20. The result of this experiment is summarized in figure 4, where we report the predicted mean solution, as well as the uncertainty associated with this prediction as quantified by two standard deviations of the generative model. As the training data for this case is noise-free, the solution to this problem is deterministic, and the resulting uncertainty can be viewed as an a posteriori estimate of the neural network approximation error due to the finite number of training data, which is measured in the relative L2 norm. As discussed in Raissi et al. (2017), a higher approximation accuracy can be achieved by training the generative model using a quasi-Newton optimizer (e.g., L-BFGS Liu and Nocedal (1989)); however, here we chose to use stochastic gradient descent with Adam updates Kingma and Ba (2014) in order to highlight the ability of the proposed method to return uncertainty estimates when the model predictions are not perfectly accurate.
Second, we repeat the same test for a more complicated scenario in which the initial condition has been corrupted by non-additive, non-Gaussian noise, as shown in figure 3(b), where the noise variance is larger near the center of the domain, therefore amplifying the effect of uncertainty on the shock formation. Here the neural network architecture, as well as the number and location of training points, have been kept fixed as described above, but the initial condition is now corrupted as
(21) 
The results of this experiment are summarized in figure 5. We observe that the resulting generative model can effectively capture the uncertainty in the spatio-temporal solution due to the propagation of the input noise process through the complex nonlinear dynamics of the Burgers equation. As expected, the uncertainty concentrates around the shock. Although we only plot the first two moments of the solution, we must emphasize that the generative model provides a complete probabilistic characterization of its non-Gaussian statistics.
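The physics-informed constraint underlying both Burgers experiments penalizes the PDE residual r = u_t + u u_x − ν u_xx at the collocation points. The sketch below illustrates this loss; the finite differences, the viscosity value, and the trial function are illustrative assumptions (the actual method computes derivatives exactly via automatic differentiation).

```python
import numpy as np

NU = 0.01 / np.pi  # viscosity; an assumed value matching the standard benchmark

def burgers_residual(u, t, x, h=1e-4):
    """Residual r = u_t + u * u_x - NU * u_xx of a candidate solution u(t, x).
    Derivatives are approximated with central finite differences here."""
    u_t  = (u(t + h, x) - u(t - h, x)) / (2 * h)
    u_x  = (u(t, x + h) - u(t, x - h)) / (2 * h)
    u_xx = (u(t, x + h) - 2 * u(t, x) + u(t, x - h)) / h ** 2
    return u_t + u(t, x) * u_x - NU * u_xx

# The residual loss penalizes r at randomly drawn collocation points:
rng = np.random.default_rng(0)
t_c = rng.uniform(0.0, 1.0, 100)
x_c = rng.uniform(-1.0, 1.0, 100)
u_trial = lambda t, x: np.exp(-t) * np.sin(np.pi * x)  # arbitrary trial function
loss_r = np.mean(burgers_residual(u_trial, t_c, x_c) ** 2)
```

In the full method this residual loss is added to the adversarial objective, so generated samples are pushed toward functions that satisfy the PDE.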
In order to further investigate the performance of the proposed methodology for different parameter settings, we have performed a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions to: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. The results of these systematic studies are provided in Appendix A.
3.3 Discovery of constitutive laws for flow through porous media
In our final example, we aim to demonstrate the ability of the proposed methods to discover unknown constitutive relationships directly from data with quantified uncertainty. To this end, we revisit the Darcy flow example put forth in Tartakovsky et al. (2018), corresponding to a two-dimensional nonlinear diffusion equation with an unknown state-dependent diffusion coefficient K(u),
(22)  ∇ · [K(u) ∇u] = 0,
where a prescribed flux (in m/s) and a Dirichlet condition on u (in m) serve as known boundary conditions. In order to benchmark and validate our model predictions, we consider a realistic dataset generated using the Subsurface Transport Over Multiple Phases (STOMP) code White et al. (1995) with the van Genuchten model Van Genuchten (1980), for which K(u) reads as
(23)  K(u) = K_s s_e(u)^{1/2} [1 − (1 − s_e(u)^{1/m})^m]^2,  s_e(u) = [1 + (α|u|)^n]^{−m},  m = 1 − 1/n,
with the saturated conductivity K_s (in m/s), the van Genuchten parameters α, n, and m, and the domain dimensions (in m) taken as in Tartakovsky et al. (2018).
Our goal is twofold: we aim to construct a physics-informed probabilistic model for the solution u, and simultaneously learn the unknown state-dependent diffusion coefficient K(u) directly from data on u (i.e., we assume no measurements of K). To this end, in addition to the three deterministic mappings corresponding to the generator, encoder, and discriminator described in section 2.3, here we also introduce another neural network for approximating K(u). The parameters of this additional network are essentially inherited by the physics-informed residual network that aims to enforce the residual of equation 22 at the collocation points for any set of latent variables. All neural networks are chosen to have 2 hidden layers with 50 neurons each and a hyperbolic tangent activation function, while the probabilistic model for u assumes a two-dimensional latent space with an isotropic Gaussian prior, i.e., z ~ N(0, I).
By construction, our probabilistic model for u can return predictions of the unknown solution with quantified uncertainty. We can then use this model to propagate uncertainty to our predictions of K via Monte Carlo sampling. Specifically, once the model is trained end-to-end, we can easily generate samples of u from the generator and propagate them through the K-network to obtain samples of K. Essentially, this results in an implicit generative model that can fully characterize the uncertainty in our predictions of the unknown state-dependent diffusion coefficient.
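The Monte Carlo propagation step just described can be sketched as follows; the two lambdas stand in for the trained generator and K-network and are purely hypothetical.

```python
import numpy as np

def propagate_K(generator_u, K_net, x, n_samples=1000, latent_dim=2, seed=0):
    """Monte Carlo uncertainty propagation: draw z ~ N(0, I), map through
    the generator to samples of u(x), then through the K-network to samples
    of K(u(x)), and summarize by the predictive mean and standard deviation."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, latent_dim))
    u_samples = generator_u(x, z)   # one sample of u(x) per latent draw
    K_samples = K_net(u_samples)    # corresponding samples of K
    return K_samples.mean(), K_samples.std()

# Toy stand-ins for the trained networks (for illustration only):
gen  = lambda x, z: np.sin(x) + 0.05 * z[:, 0]
K_fn = lambda u: np.exp(u)  # a positive, state-dependent coefficient
mean_K, std_K = propagate_K(gen, K_fn, x=0.5)
```

Because every latent draw is pushed through both networks, the resulting samples of K carry the full (generally non-Gaussian) predictive distribution, not just its first two moments.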
Here we have also considered two cases, corresponding to noise-free training data and to noisy data corrupted by uncorrelated Gaussian noise. In the noise-free case we used scattered measurements of the unknown solution – 200 inside the domain and 100 on each one of the four boundaries – and a set of randomly selected collocation points inside the domain for penalizing the residual of equation 22. For the noisy case we chose a larger set of scattered measurements of the unknown solution – 1,000 inside the domain and 100 on each one of the four boundaries – while keeping the same number of collocation points. Figure 6 summarizes the results for both cases by showing the predictive mean and two standard deviations of the corresponding generative model for K(u), against the reference (deterministic) solution obtained from the STOMP code White et al. (1995) with the van Genuchten model Van Genuchten (1980) (see equation 23). Evidently, the generative model is able to recover a sensible prediction for the unknown state-dependent diffusion coefficient with quantified uncertainty, even when the training data on u is heavily corrupted by noise. Moreover, notice that K implicitly depends on the spatial coordinates through u. In figure 7 we present the resulting prediction for the solution corresponding to the noise-free case, against the reference solution, as well as their point-wise absolute error.
Again, we must emphasize that these predictions are obtained without ever observing any data on K, and that they are accompanied by quantitative estimates that jointly characterize the uncertainty due to noise in the training data for u and the underlying approximation error of the neural networks. This theme of consistently inferring correlated continuous quantities of interest from a small set of measurements by leveraging the underlying laws of physics is a prime example of the capabilities that physics-informed machine learning has to offer.
4 Conclusions
We presented a class of probabilistic physics-informed neural networks that are capable of approximating arbitrary conditional probability densities, while being constrained to generate samples that approximately satisfy given partial differential equations. Moreover, we derived a flexible regularized adversarial inference framework that enables the end-to-end training of such models directly from noisy and incomplete measurements. Uncertainty in the system inputs and/or outputs is captured through a set of latent variables that are relevant to the underlying physics, and could open new directions for probabilistic model-order reduction of complex systems. These developments allow us to perform probabilistic computations for uncertain systems, train deep generative models in small-data regimes, handle complex noise processes, and seamlessly carry out uncertainty propagation studies for physical systems without the need for repeated evaluations of experiments and numerical simulations.
Although the proposed adversarial inference framework provides great flexibility for performing probabilistic computations and approximating arbitrarily complex and high-dimensional probability distributions, it relies on carefully tuning the interplay between the generator and discriminator networks. This is a known limitation of adversarial algorithms and, although several works have led to improvements Tolstikhin et al. (2017); Arjovsky et al. (2017), it still largely remains an open research problem. An alternative path for enhancing the robustness of the inference procedure, without compromising its ability to handle complex probability distributions, comes through the use of invertible transformations and flow-based generative models Rezende and Mohamed (2015); Kingma and Dhariwal (2018). Future work will examine the application of such models in the context of physics-informed neural networks, with the goal of robustifying the proposed methods and scaling them to more realistic systems.
Acknowledgements
This work received support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116) and the Defense Advanced Research Projects Agency under the Physics of Artificial Intelligence program. We would also like to thank Dr. Alexandre Tartakovsky from the Pacific Northwest National Laboratory for providing the Darcy flow dataset.
Appendix A Sensitivity studies
Here we provide results from a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions to: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. In all cases we have used the nonlinear Burgers equation defined in section 3.2 as a prototype problem.
A.1 Sensitivity with respect to the neural network initialization
In order to quantify the sensitivity of the proposed methods with respect to the initialization of the neural networks, we have considered a noise-free dataset with fixed numbers of training and collocation points, and fixed the architectures to 4 hidden layers with 50 neurons each for the generator networks, 3 hidden layers with 50 neurons each for the discriminator networks, and a hyperbolic tangent activation function. We then trained an ensemble of 15 cases, all starting from a normal Xavier initialization (Glorot and Bengio, 2010) for all network weights (with a randomized seed) and a zero initialization for all bias parameters. In table 1 we report the relative error between the predicted mean solution and the known exact solution for all 15 randomized trials, evaluated on a set of randomly selected test points. Evidently, our results are robust with respect to the neural network initialization, as in all cases the stochastic gradient descent training procedure converged to roughly the same solution. We can summarize this result by reporting the mean and the standard deviation of the relative error across the ensemble.
Table 1: Relative error between the predicted mean solution and the exact solution for each of the 15 randomized trials.
4.1e-02 | 7.9e-02 | 4.4e-02 | 4.0e-02 | 3.8e-02
3.2e-02 | 5.7e-02 | 4.7e-02 | 6.5e-02 | 4.0e-02
3.5e-02 | 3.5e-02 | 6.4e-02 | 4.0e-02 | 4.9e-02
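The mean and standard deviation of the relative error can be computed directly from the fifteen trial values in table 1; a short sketch:

```python
import numpy as np

# Relative errors of the 15 randomized trials reported in table 1.
errors = np.array([4.1e-02, 7.9e-02, 4.4e-02, 4.0e-02, 3.8e-02,
                   3.2e-02, 5.7e-02, 4.7e-02, 6.5e-02, 4.0e-02,
                   3.5e-02, 3.5e-02, 6.4e-02, 4.0e-02, 4.9e-02])

mean_err = errors.mean()  # ensemble mean of the relative error
std_err = errors.std()    # ensemble standard deviation
```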
A.2 Sensitivity with respect to the total number of training and collocation points
In this study our goal is to quantify the sensitivity of our predictions with respect to the total number of training and collocation points. As before, we have considered noise-free datasets, and fixed the architectures to 4 hidden layers with 50 neurons each for the generator networks and 3 hidden layers with 50 neurons each for the discriminator networks, a hyperbolic tangent activation function, and a normal Xavier initialization (Glorot and Bengio, 2010) for all network weights with zero initialization for all network biases. The results of this study are summarized in table 2, indicating that, as the number of collocation points is increased, a more accurate prediction is obtained. This observation is in agreement with the original results of Raissi et al. (Raissi et al., 2017a, b) for deterministic physics-informed neural networks, indicating the role of the residual loss as an effective regularization mechanism for training deep generative models in small-data regimes.
Table 2: Relative error for different numbers of training points (rows) and collocation points (columns).
      | 10      | 100     | 250     | 500     | 1000    | 5000    | 10000
  60  | 9.3e-01 | 5.6e-01 | 4.8e-01 | 5.0e-02 | 1.9e-01 | 5.0e-02 | 5.1e-02
  90  | 5.8e-01 | 5.3e-01 | 3.5e-01 | 1.5e-01 | 4.9e-02 | 1.0e-01 | 5.8e-02
 150  | 6.7e-01 | 1.4e-01 | 3.0e-01 | 3.6e-02 | 4.9e-02 | 1.2e-01 | 4.7e-02
A.3 Sensitivity with respect to the neural network architecture
In this study we aim to quantify the sensitivity of our predictions with respect to the architecture of the neural networks that parametrize the generator, the discriminator, and the encoder. Here we have fixed the number of noise-free training data and collocation points, and we kept the number of layers for the discriminator to always be one less than the number of layers for the generator (e.g., if the number of layers for the generator is two, then the number of layers for the discriminator is one, etc.). In all cases, we have used a hyperbolic tangent nonlinearity and a normal Xavier initialization (Glorot and Bengio, 2010). In table 3 we report the relative prediction error for different feedforward architectures for the generator, discriminator, and encoder, where rows correspond to the number of layers of the generator (and encoder) and columns to the number of neurons in each layer. The general trend suggests that as the neural network capacity is increased we obtain more accurate predictions, indicating that our physics-informed constraint on the PDE residual can effectively regularize the training process and safeguard against overfitting.
Table 3: Relative error for different numbers of generator layers (rows) and neurons per layer (columns).
     | 20      | 50      | 100
  2  | 4.2e-01 | 3.8e-01 | 5.7e-01
  3  | 6.5e-02 | 3.5e-02 | 2.1e-02
  4  | 9.3e-02 | 4.7e-02 | 5.4e-02
A.4 Sensitivity with respect to the adversarial training procedure
Finally, we test the sensitivity with respect to the adversarial training procedure. To this end, we have fixed the number of noise-free training data and collocation points, kept the neural network architecture the same as in A.2, and varied the number of training steps taken for the generator and for the discriminator within each stochastic gradient descent iteration. The results of this study are presented in table 4, where we report the relative prediction error. These results reveal the high sensitivity of the training dynamics to the interplay between the generator and discriminator networks, and pinpoint the well-known peculiarity of adversarial inference procedures, which require careful tuning of the number of generator and discriminator updates for achieving stable performance in practice.
Table 4: Relative error for different numbers of generator and discriminator training steps in each epoch.
     | 1       | 2       | 5
  1  | 3.5e-01 | 5.0e-01 | 1.5e+00
  2  | 4.3e-02 | 3.2e-01 | 5.4e-01
  5  | 4.7e-02 | 2.3e-01 | 7.0e-01
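The alternating update schedule varied in this study can be sketched as follows; `gen_step` and `disc_step` are hypothetical stand-ins for single Adam updates of the generator and discriminator losses.

```python
def adversarial_training(gen_step, disc_step, T_g, T_d, n_iters):
    """Run n_iters stochastic gradient iterations, performing T_d
    discriminator updates followed by T_g generator updates in each one,
    as varied in the sensitivity study of table 4."""
    for _ in range(n_iters):
        for _ in range(T_d):
            disc_step()  # update the discriminator parameters
        for _ in range(T_g):
            gen_step()   # update the generator (and encoder) parameters

# Count the updates for, e.g., two generator steps per discriminator step:
counts = {"gen": 0, "disc": 0}
adversarial_training(lambda: counts.update(gen=counts["gen"] + 1),
                     lambda: counts.update(disc=counts["disc"] + 1),
                     T_g=2, T_d=1, n_iters=100)
# counts is now {"gen": 200, "disc": 100}
```

As table 4 shows, the prediction error varies by more than an order of magnitude across such schedules, which is why this tuning knob matters in practice.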
References
 Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
 LeCun et al. (2015) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (2015) 436–444.
 Lake et al. (2015) B. M. Lake, R. Salakhutdinov, J. B. Tenenbaum, Human-level concept learning through probabilistic program induction, Science 350 (2015) 1332–1338.
 Alipanahi et al. (2015) B. Alipanahi, A. Delong, M. T. Weirauch, B. J. Frey, Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning, Nature biotechnology 33 (2015) 831–838.
 Goodfellow et al. (2016) I. Goodfellow, Y. Bengio, A. Courville, Y. Bengio, Deep learning, volume 1, MIT press Cambridge, 2016.
 Raissi et al. (2017a) M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part i): Datadriven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561 (2017a).
 Raissi et al. (2017b) M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part ii): Datadriven discovery of nonlinear partial differential equations, arXiv preprint arXiv:1711.10566 (2017b).
 Raissi et al. (2018) M. Raissi, P. Perdikaris, G. E. Karniadakis, Multistep neural networks for datadriven discovery of nonlinear dynamical systems, arXiv preprint arXiv:1801.01236 (2018).
 Raissi and Karniadakis (2018) M. Raissi, G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics 357 (2018) 125–141.
 Raissi (2018) M. Raissi, Deep hidden physics models: Deep learning of nonlinear partial differential equations, arXiv preprint arXiv:1801.06637 (2018).
 Psichogios and Ungar (1992) D. C. Psichogios, L. H. Ungar, A hybrid neural networkfirst principles approach to process modeling, AIChE Journal 38 (1992) 1499–1511.
 Lagaris et al. (1998) I. E. Lagaris, A. Likas, D. I. Fotiadis, Artificial neural networks for solving ordinary and partial differential equations, IEEE Transactions on Neural Networks 9 (1998) 987–1000.
 Perdikaris et al. (2016) P. Perdikaris, L. Grinberg, G. E. Karniadakis, Multiscale modeling and simulation of brain blood flow, Phys. Fluids 28 (2016) 021304.
 Rossinelli et al. (2015) D. Rossinelli, Y.H. Tang, K. Lykov, D. Alexeev, M. Bernaschi, P. Hadjidoukas, M. Bisson, W. Joubert, C. Conti, G. Karniadakis, et al., The insilico labonachip: petascale and highthroughput simulations of microfluidics at cell resolution, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ACM, p. 2.
 Šukys et al. (2017) J. Šukys, U. Rasthofer, F. Wermelinger, P. Hadjidoukas, P. Koumoutsakos, Optimal fidelity multilevel monte carlo for quantification of uncertainty in simulations of cloud cavitation collapse, arXiv preprint arXiv:1705.04374 (2017).
 Oden et al. (2010) J. T. Oden, R. Moser, O. Ghattas, Computer predictions with quantified uncertainty, part ii, SIAM News 43 (2010) 1–4.
 Ghanem and Spanos (1990) R. Ghanem, P. D. Spanos, Polynomial chaos in stochastic finite elements, Journal of Applied Mechanics 57 (1990) 197–202.
 Xiu and Karniadakis (2002) D. Xiu, G. E. Karniadakis, The wiener–askey polynomial chaos for stochastic differential equations, SIAM journal on scientific computing 24 (2002) 619–644.
 Najm (2009) H. N. Najm, Uncertainty quantification and polynomial chaos techniques in computational fluid dynamics, Annual review of fluid mechanics 41 (2009) 35–52.
 Gerstner and Griebel (1998) T. Gerstner, M. Griebel, Numerical integration using sparse grids, Numerical algorithms 18 (1998) 209.
 Eldred and Burkardt (2009) M. Eldred, J. Burkardt, Comparison of nonintrusive polynomial chaos and stochastic collocation methods for uncertainty quantification, in: 47th AIAA Aerospace Sciences Meeting including The New Horizons Forum and Aerospace Exposition, p. 976.
 Barth et al. (2011) A. Barth, C. Schwab, N. Zollinger, Multilevel monte carlo finite element method for elliptic pdes with stochastic coefficients, Numerische Mathematik 119 (2011) 123–161.
 Peherstorfer et al. (2016) B. Peherstorfer, K. Willcox, M. Gunzburger, Optimal model management for multifidelity monte carlo estimation, SIAM Journal on Scientific Computing 38 (2016) A3163–A3194.
 Berkooz et al. (1993) G. Berkooz, P. Holmes, J. L. Lumley, The proper orthogonal decomposition in the analysis of turbulent flows, Annual review of fluid mechanics 25 (1993) 539–575.
 Le Maître and Knio (2010) O. Le Maître, O. M. Knio, Spectral methods for uncertainty quantification: with applications to computational fluid dynamics, Springer Science & Business Media, 2010.
 Bilionis and Zabaras (2012) I. Bilionis, N. Zabaras, Multioutput local Gaussian process regression: Applications to uncertainty quantification, Journal of Computational Physics 231 (2012) 5718–5746.
 Bilionis et al. (2013) I. Bilionis, N. Zabaras, B. A. Konomi, G. Lin, Multioutput separable Gaussian process: Towards an efficient, fully Bayesian paradigm for uncertainty quantification, Journal of Computational Physics 241 (2013) 212–239.
 Perdikaris et al. (2016) P. Perdikaris, D. Venturi, G. E. Karniadakis, Multifidelity information fusion algorithms for highdimensional systems and massive data sets, SIAM J. Sci. Comput. 38 (2016) B521–B538.
 Kingma and Welling (2013) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
 Goodfellow et al. (2014) I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, in: Advances in neural information processing systems, pp. 2672–2680.
 Baydin et al. (2015) A. G. Baydin, B. A. Pearlmutter, A. A. Radul, J. M. Siskind, Automatic differentiation in machine learning: a survey, arXiv preprint arXiv:1502.05767 (2015).
 Abadi et al. (2016) M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., Tensorflow: A system for largescale machine learning., in: OSDI, volume 16, pp. 265–283.
 Shahriari et al. (2016) B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. De Freitas, Taking the human out of the loop: A review of Bayesian optimization, Proceedings of the IEEE 104 (2016) 148–175.
 Li et al. (2018) C. Li, J. Li, G. Wang, L. Carin, Learning to sample with adversarially learned likelihoodratio (2018).
 Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, Improved techniques for training gans, in: Advances in Neural Information Processing Systems, pp. 2234–2242.
 Akaike (1998) H. Akaike, Information theory and an extension of the maximum likelihood principle, in: Selected papers of hirotugu akaike, Springer, 1998, pp. 199–213.
 Friedman et al. (2001) J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, volume 1, Springer series in statistics New York, NY, USA:, 2001.
 Pu et al. (2017) Y. Pu, L. Chen, S. Dai, W. Wang, C. Li, L. Carin, Symmetric variational autoencoder and connections to adversarial learning, arXiv preprint arXiv:1709.01846 (2017).
 Makhzani et al. (2015) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, B. Frey, Adversarial autoencoders, arXiv preprint arXiv:1511.05644 (2015).
 Dumoulin et al. (2016) V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, A. Courville, Adversarially learned inference, arXiv preprint arXiv:1606.00704 (2016).
 Mescheder et al. (2017) L. Mescheder, S. Nowozin, A. Geiger, Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks, arXiv preprint arXiv:1701.04722 (2017).
 Blei et al. (2017) D. M. Blei, A. Kucukelbir, J. D. McAuliffe, Variational inference: A review for statisticians, Journal of the American Statistical Association 112 (2017) 859–877.
 Kingma and Ba (2014) D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
 Cockayne et al. (2016) J. Cockayne, C. Oates, T. Sullivan, M. Girolami, Probabilistic meshless methods for partial differential equations and Bayesian inverse problems, arXiv preprint arXiv:1605.07811 (2016).
 Raissi et al. (2017) M. Raissi, P. Perdikaris, G. E. Karniadakis, Inferring solutions of differential equations using noisy multifidelity data, Journal of Computational Physics 335 (2017) 736–746.
 Raissi et al. (2018) M. Raissi, P. Perdikaris, G. E. Karniadakis, Numerical Gaussian processes for timedependent and nonlinear partial differential equations, SIAM Journal on Scientific Computing 40 (2018) A172–A198.
 Cockayne et al. (2017) J. Cockayne, C. Oates, T. Sullivan, M. Girolami, Bayesian probabilistic numerical methods, arXiv preprint arXiv:1702.03673 (2017).
 Hopf (1950) E. Hopf, The partial differential equation u_t + u u_x = μ u_xx, Communications on Pure and Applied mathematics 3 (1950) 201–230.
 Raissi et al. (2017) M. Raissi, P. Perdikaris, G. E. Karniadakis, Physics informed deep learning (part i): Datadriven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561 (2017).
 Liu and Nocedal (1989) D. C. Liu, J. Nocedal, On the limited memory bfgs method for large scale optimization, Mathematical programming 45 (1989) 503–528.
 Tartakovsky et al. (2018) A. M. Tartakovsky, C. O. Marrero, D. Tartakovsky, D. BarajasSolano, Learning parameters and constitutive relationships with physics informed deep neural networks, arXiv preprint arXiv:1808.03398 (2018).
 White et al. (1995) M. White, M. Oostrom, R. Lenhard, Modeling fluid flow and transport in variably saturated porous media with the STOMP simulator. 1. nonvolatile threephase model description, Advances in Water Resources 18 (1995) 353–364.
 Van Genuchten (1980) M. T. Van Genuchten, A closed-form equation for predicting the hydraulic conductivity of unsaturated soils, Soil science society of America journal 44 (1980) 892–898.
 Tolstikhin et al. (2017) I. Tolstikhin, O. Bousquet, S. Gelly, B. Schoelkopf, Wasserstein autoencoders, arXiv preprint arXiv:1711.01558 (2017).
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, L. Bottou, Wasserstein GAN, arXiv preprint arXiv:1701.07875 (2017).
 Rezende and Mohamed (2015) D. J. Rezende, S. Mohamed, Variational inference with normalizing flows, arXiv preprint arXiv:1505.05770 (2015).
 Kingma and Dhariwal (2018) D. P. Kingma, P. Dhariwal, Glow: Generative flow with invertible 1x1 convolutions, arXiv preprint arXiv:1807.03039 (2018).
 Glorot and Bengio (2010) X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256.