Adversarial Uncertainty Quantification in Physics-Informed Neural Networks

11/09/2018 ∙ by Yibo Yang, et al. ∙ 6

We present a deep learning framework for quantifying and propagating uncertainty in systems governed by non-linear differential equations using physics-informed neural networks. Specifically, we employ latent variable models to construct probabilistic representations for the system states, and put forth an adversarial inference procedure for training them on data, while constraining their predictions to satisfy given physical laws expressed by partial differential equations. Such physics-informed constraints provide a regularization mechanism for effectively training deep generative models as surrogates of physical systems in which the cost of data acquisition is high, and training data-sets are typically small. This provides a flexible framework for characterizing uncertainty in the outputs of physical systems due to randomness in their inputs or noise in their observations that entirely bypasses the need for repeatedly sampling expensive experiments or numerical simulators. We demonstrate the effectiveness of our approach through a series of examples involving uncertainty propagation in non-linear conservation laws, and the discovery of constitutive laws for flow through porous media directly from noisy data.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 19

page 20

page 23

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Recent advances in machine learning and data analytics have yielded transformative results across diverse scientific disciplines, including image recognition

Krizhevsky et al. (2012)

, natural language processing

LeCun et al. (2015), cognitive science Lake et al. (2015), and genomics Alipanahi et al. (2015). In all aforementioned areas, the volume of data has increased substantially compared to even a decade ago, but analyzing big data is expensive and time-consuming. Data-driven methods, which have been enabled by the availability of sensors, data storage, and computational resources, are taking center stage across many disciplines of science. We now have highly scalable solutions for problems in object detection and recognition, machine translation, text-to-speech conversion, recommender systems, and information retrieval LeCun et al. (2015). All of these solutions attain state-of-the-art performance when trained with large amounts of data.

However, more often than not, in laboratory experiments and large-scale simulations aiming to elucidate and predict complex phenomena, a large number of quality and error-free data is prohibitively costly to obtain. Under this setting, purely data-driven approaches for machine learning present difficulties when the data is scarce relative to the complexity of the system. The vast majority of state-of-the art machine learning techniques (e.g., deep neural nets, convolutional networks, recurrent networks, etc. Goodfellow et al. (2016)) are lacking robustness and fail to provide any guarantees of convergence or quantify the error/uncertainty associated with their predictions. Hence, the ability to learn in a robust and sample-efficient manner is a necessity in these data-limited domains. Even less well understood is how one can constrain such algorithms to leverage domain-specific knowledge and return predictions that satisfy certain physical principles (e.g., conservation of mass, momentum, etc.).

These shortcomings often generate skepticism and disbelief among applied mathematicians and engineers regarding the solid grounding of purely data-driven machine learning approaches. In recent work, Raissi et. al. Raissi et al. (2017a, b, 2018); Raissi and Karniadakis (2018); Raissi (2018)

set foot exactly at this relatively unexplored interface between applied mathematics and contemporary machine learning by revisiting the idea of penalizing the loss function of deep neural networks using differential equation constraints, as first put forth by Psichogios and Ungar

Psichogios and Ungar (1992) and Lagaris et. al. Lagaris et al. (1998). This line of work has empirically demonstrated how such physics-informed constraints regularize learning in small data regimes, can lead to the discovery of governing equations and reduced-order models, as well as enable the prediction of complex dynamics from incomplete models and incomplete data. Despite a series of impressive results in canonical problems, Raissi et. al. Raissi et al. (2017a, b)

have also pointed out cases in which the training phase of these algorithms faces severe difficulties for reasons that are currently poorly understood. In lack of supporting theory on convergence and a-posteriori error estimation, this naturally poses the need for for scalable algorithms for uncertainty quantification.

A literature review of the current state-of-the-art in uncertainty quantification reveals a subtle dichotomy between different communities. On one hand, researchers in applied mathematics and scientific computing predominately rely on mathematical models that are rigorously derived from first physical principles. At the dawn of exascale computing, such models have enabled the accurate simulation of increasingly more complex phenomena (see for e.g., Perdikaris et al. (2016); Rossinelli et al. (2015)). They have also enabled in-silico systematic studies in which the behavior of a system can be probed in a controlled fashion for different conditions, parameter settings, external inputs, etc. Šukys et al. (2017). The latter aims to both elucidate the key mechanisms that govern the behavior of a system, but also characterize the robustness of the resulting predictions with respect to epistemic and aleatory uncertainty Oden et al. (2010). However, despite the fact that much progress has been made over the last two decades, the most popular methods for scientific computing under uncertainty, such as polynomial chaos expansions Ghanem and Spanos (1990); Xiu and Karniadakis (2002); Najm (2009), sparse grid quadratures Gerstner and Griebel (1998); Eldred and Burkardt (2009), multi-level/multi-fidelity Monte Carlo sampling Barth et al. (2011); Peherstorfer et al. (2016), proper orthogonal decomposition Berkooz et al. (1993); Le Maître and Knio (2010), and Gaussian process regression models Bilionis and Zabaras (2012); Bilionis et al. (2013); Perdikaris et al. (2016), all face severe limitations in view of the non-Gaussian likelihoods and high-dimensional posterior distributions commonly encountered in realistic applications.

On the other hand, the recent explosive growth of machine learning research has put forth new effective ways of learning and manipulating complex high-dimensional probability distributions. Inference tools like variational auto-encoders

Kingma and Welling (2013) and generative adversarial networks Goodfellow et al. (2014)

, formulated on top of flexible building blocks such as feed-forward/convolutional/recurrent neural networks

Goodfellow et al. (2016) have introduced highly scalable solutions, albeit for problems where not much prior information is assumed, but instead, large amounts of data can be obtained at relatively low cost (e.g., image recognition Krizhevsky et al. (2012), natural language processing LeCun et al. (2015)). In this work, we aim to leverage recent developments in machine learning to put forth a scalable framework for uncertainty propagation in physical systems for which the cost of data acquisition is high and training data-sets are typically small, but strong prior information exists by means of known governing laws expressed by partial differential equations. Specifically, we construct a class of probabilistic physics-informed neural networks that enables us to obtain a posterior characterization of the uncertainty associated with their predicted outputs. Moreover, we will develop a flexible variational inference framework that will allow us to train such models directly from noisy input/output data, and predict outcomes of non-linear dynamical systems that are partially observed with quantified uncertainty. This setting necessitates a departure from the classical deterministic realm of modeling and scientific computation, and, consequently, our main building blocks can no longer be crisp deterministic numbers and governing laws, but instead we must operate with probabilistic models.

This paper is structured as follows. In section 2.1 we provide a brief overview of physics-informed neural networks in sync with the recent developments in Raissi et al. (2017a, b); Raissi and Karniadakis (2018); Raissi (2018). In sections 2.2 and 2.3 we provide an outline of the proposed probabilistic formulation and the proposed variational inference framework. Finally, in section 3 we will demonstrate the effectiveness of our approach through a series of examples involving uncertainty propagation in non-linear conservation laws, and the discovery of constitutive laws for flow through porous media directly from noisy data.

2 Methods

2.1 Physics-informed neural networks

The recent works of Raissi et. al. Raissi et al. (2017a, b); Raissi and Karniadakis (2018); Raissi (2018) have demonstrated how classical conservation laws and numerical discretization schemes can be used as structured prior information that can enhance the robustness and efficiency of modern machine learning algorithms, introducing a new class of data-driven solvers, as well as a physics-informed machine learning approach to model discovery. To this end, the authors have considered constructing deep neural networks that return predictions which are constrained by parametrized partial differential equations (PDE) of the form

(1)

where is represented by a deep neural network parametrized by a set of parameters , i.e. ,

is a vector of space coordinates,

is the time coordinate, and is a nonlinear differential operator. As neural networks are differentiable representations, this construction defines a so-called physics informed neural network that corresponds to the PDE residual, i.e. . This new network has the same parameters as the network representing

, albeit different activation functions due to the action of the differential operator

Raissi et al. (2017a); Psichogios and Ungar (1992); Lagaris et al. (1998). From an implementation perspective, this network can be readily obtained by leveraging recent progress in automatic differentiation Baydin et al. (2015); Abadi et al. (2016). The resulting training procedure allows us to recover the shared network parameters using a few scattered observations of , namely , , along with a larger number of collocation points , , that aim to penalize the PDE residual at a finite set of collocation nodes. This simple, yet remarkably effective regularization procedure allows us to introduce the PDE residual as a soft penalty constraint penalty in the likelihood function of the model Raissi et al. (2017a, b)

, and the resulting optimization problem can be effectively solved using standard stochastic gradient descent without necessitating any elaborate constrained optimization techniques, simply by minimizing the composite loss function

(2)

where the required gradients can be readily obtained using automatic differentiation Baydin et al. (2015). Finally, as the resulting predictions are encouraged to inherit any physical properties imposed by the PDE constraint (e.g., conservation, invariance, symmetries, etc.), this approach showcases how one can approximately encode physical and domain-specific constraints in modern machine learning algorithms and introduce a new form of regularization for learning from small data-sets.

2.2 Probabilistic physics-informed neural networks

Here we put forth a probabilistic formulation for propagating uncertainty through physics-informed neural networks using latent variable models of the form

(3)

This setting encapsulates a wide range of deterministic and stochastic problems, where is a potentially multi-variate field, and is a collection of random latent variables. The ability to learn such a model from data is the cornerstone of probabilistic scientific computing and uncertainty quantification in physical systems. Knowledge of the conditional probability subject to domain knowledge constraints introduces a regularization mechanism that limits the space of admissible solutions to a manageable size (e.g., in fluid mechanics problems by discarding any non-realistic flow solutions that violate the conservation of mass principle), thus enables training of probabilistic deep learning algorithms in small data

regimes. Moreover, by providing a complete characterization of uncertainty, it enhances the robustness of our predictions, and provides a-posteriori error estimates for assessing model inadequacy. The latter, can also enable downstream tasks such as the formulation of adaptive data acquisition policies for active learning or Bayesian optimization

Shahriari et al. (2016) with domain knowledge constraints. Finally, thanks to the structure encoded by the PDE itself, the resulting latent variables can potentially lead to the extraction of physically relevant and interpretable low-dimensional feature representations, which can subsequently introduce new techniques for nonlinear model order reduction and coarse-graining of complex systems.

2.3 Adversarial inference for joint distribution matching

Following the recent findings of Li et al. (2018)

we argue that matching the joint distribution of the generated data

with the joint distribution of the observed data

by minimizing the reverse Kullback-Leibler divergence

is a promising approach to train the generative model presented in equation 3. This also implies that the respective marginal and conditional distributions are also encouraged to match. The use of the reverse Kullback-Leibler divergence (in contrast to the maximum likelihood setup) is motivated by examining the following decomposition

(4)

where denotes the entropy of the generative model. The second term can be further decomposed as

(5)

where and denote the support of the distributions and , respectively, while denotes the complement of . Notice that by minimizing the Kullback-Leibler divergence in equation 4 we introduce a mechanism that is trying to balance the effect of two competing objectives. Specifically, maximization of the entropy term encourages to spread over its support set as wide, while the second integral term in equation 5 introduces a strong (negative) penalty when the support of and do not overlap. Hence, the support of is encouraged to spread only up to the point that , implying that . When the pathological issue of “mode-collapse” (commonly encountered in the training of generative adversarial networks Goodfellow et al. (2014)) is manifested Salimans et al. (2016). This issue is present if one seeks to directly minimize the reverse Kullback-Leibler objective in equation 4 as this provides no control on the relative importance of the two terms. As discussed in Li et al. (2018), we may rather minimize , with to allow for control of how much emphasis is placed on mitigating mode collapse. It is then clear that the entropic regularization introduced by provides an effective mechanism for controlling and mitigating the effect of mode collapse, and, therefore, potentially enhancing the robustness adversarial inference procedures for learning .

Minimization of equation 4 with respect to the generative model parameters presents two fundamental difficulties. First, the evaluation of both distributions and typically involves intractable integrals in high dimensions, and we may only have samples drawn from the two distributions, not their explicit analytical forms. Second, the differential entropy term is intractable as is not known a-priori. In the next sections we revisit the unsupervised formulation put forth in Li et al. (2018) and derive a tractable inference procedure for learning from scattered observation pairs of , namely , .

2.3.1 Density ratio estimation by probabilistic classification

By definition, the computation of the Kullback-Leibler divergence in equation 4 involves computing an expectation over a log-density ratio, i.e.

In general, given samples from two distributions, we can approximate their density ratio by constructing a binary classifier that distinguishes between samples from the two distributions. To this end, we assume that

data points are drawn from and are assigned a label . Similarly, we assume that samples are drawn from and assigned label . Consequently, we can write these probabilities in a conditional form, namely

where and are the class probabilities predicted by a binary classifier . Using Bayes rule, it is then straightforward to show that the density ratio of and can be computed as

(6)

This simple procedure suggests that we can harness the power of deep neural network classifiers to obtain accurate estimates of the reverse Kullback-Leibler divergence in equation 4 directly from data and without the need to assume any specific parametrization for the generative model distribution .

2.3.2 Entropic regularization bound

Here we follow the derivation of Li et. al Li et al. (2018) to construct a computable lower bound for the entropy

. To this end, we start by considering random variables

under the joint distribution

where , and is the Dirac delta function. The mutual information between and satisfies the information theoretic identity

where , are the marginal entropies and , are the conditional entropies Akaike (1998). Since in our setup and are deterministic variables independent of , and samples of are generated by a deterministic function , it follows that . We therefore have

(7)

where does not depend on the generative model parameters .

Now consider a general variational distribution parametrized by a set of parameters . Then,

(8)

Viewing as a set of latent variables, then is a variational approximation to the true intractable posterior over the latent variables . Therefore, if is introduced as an auxiliary inference model associated with the generative model , for which and , then we can use equations 7 and 8 to bound the entropy term in equation 4 as

(9)

Note that the inference model plays the role of a variational approximation to the true posterior over the latent variables, and appears naturally using information theoretic arguments in the derivation of the lower bound.

2.3.3 Adversarial training objective

By leveraging the density ratio estimation procedure described in section 2.3.1 and the entropy bound derived in section 2.3.2, we can derive the following loss functions for minimizing the reverse Kullback-Leibler divergence with entropy regularization

(10)
(11)

where

is the logistic sigmoid function. Notice how the binary cross-entropy objective of equation

10 aims to progressively improve the ability of the classifier to discriminate between “fake” samples obtained from the generative model and “true” samples originating from the observed data distribution . Simultaneously, the objective of equation 11 aims at improving the ability of the generator to generate increasingly more realistic samples that can “fool” the discriminator . Moreover, the encoder not only serves as an entropic regularization term than allows us to stabilize model training and mitigate the pathology of mode collapse, but also provides a variational approximation to true posterior over the latent variables. The way it naturally appears in the objective of equation 11 also encourages the cycle-consistency of the latent variables ; a process that is known to result in disentangled and interpretable low-dimensional representations of the observed data Friedman et al. (2001), which could be subsequently used as good features for nonlinear model order reduction.

In theory, the optimal set of parameters correspond to the Nash equilibrium of the two player game defined by the loss functions in equations 10,11, for which one can show that the exact model distribution and the exact posterior over the latent variables can be recovered Goodfellow et al. (2014); Pu et al. (2017). In practice, although there is no guarantee that this optimal solution can be attained, the generative model can be trained by alternating between optimizing the two objectives in equations 10,11 using stochastic gradient descent as

(12)
(13)

2.3.4 Adversarial training with physics-informed constrains

In order to learn the physics-informed probabilistic model of equation 3 from data we can extend the adversarial inference framework presented above by appropriately penalizing the loss function of the generator (see equation 11). The available data correspond to scattered observation pairs , , originating from known initial or boundary conditions, or any other (potentially noisy) measurements of . In analogy to the deterministic setting put for in Raissi et al. (2017a) and summarized in section 2.1, by defining we essentially introduce a new conditional probability model that shares the same parameters as , albeit the underlying neural network that serves as its approximation has different activation functions. However, since we would like to encourage every sample produced by the generator to satisfy the PDE constraint, we can simply treat the residual as a deterministic variable, i.e, , and enforce the constraint at a finite set of collocation points by simply minimizing the mean square loss

(14)

Then, the resulting adversarial game for training the physics-informed model of equation 3 takes the form

(15)

where positive values of can be selected to place more emphasis on penalizing the PDE residual. For , the residual loss acts as a regularization term that approximately enforces the given physical constraint, and, therefore, encourages the generator to produce samples that satisfy the underlying partial differential equation. Also note that this structured approach also encourages the encoder to learn a set of spatio-temporal latent variables that are relevant to the underlying physics, possibly opening a new directions for probabilistic model order reduction of complex systems.

2.3.5 Predictive distribution

Once the model is trained we can construct a probabilistic ensemble for the solution by sampling latent variables from the prior and passing them through the generator to yield samples that are distributed according to the predictive model distribution

. Note that although the explicit form of this distribution is not known, we can efficiently compute any of its moments via Monte Carlo sampling. The cost of this prediction step is negligible compared to the cost of training the model, as it only involves a single forward pass through the generator function

. Typically, we compute the mean and variance of the predictive distribution at a new test point

as

(16)
(17)

where , , and corresponds to the total number of Monte Carlo samples.

2.3.6 Advantages and caveats of adversarial learning

Since their recent introduction Goodfellow et al. (2014); Makhzani et al. (2015); Dumoulin et al. (2016); Mescheder et al. (2017), adversarial learning techniques have provided great flexibility for performing probabilistic computations with arbitrarily complex implicit distributions. Essentially, they have lifted the over-simplified approximations typically used in variational inference (Gaussian approximations, exponential families, etc.) Blei et al. (2017), yielding very general and flexible schemes for statistical inference. However, this flexibility comes at a price, as such methods in practice require very careful tuning in order to achieve stable and accurate performance. To this end, recall the training objective defined in equation 15 that introduces an adversarial game between the generator and discriminator networks Goodfellow et al. (2014). In practice, this mini-max optimization problem is solved by alternating stochastic gradient updates between the two competing objectives, and it is highly sensitive on the capacity of the neural networks modeling the generator and discriminator, as well as the relative frequency with which the parameters of each network are updated within each iteration of stochastic gradient descent. To this end, we provide a series of empirical observations and lessons we learned throughout this study that can enhance the robustness and stability of this training procedure:

  • Changing the relative number of stochastic gradient updates for the generator and the discriminator is equivalent to changing their neural network architecture. For example, we can reduce the capacity of discriminator by either performing more stochastic gradient updates for the generator, or remove one layer in the neural network architecture of the discriminator.

  • Given enough collocation points for penalizing the PDE residual, we can obtain robust uncertainty estimates together with precise predictions simply by tuning the capacity of discriminator and generator networks.

  • Typically, by fixing the generator, we expect the discriminator to have some capacity so that the model training dynamics remain stable. But, we do not want the discriminator to be very powerful as in that case there will by very little information from the discriminator that can help the generator to improve towards producing more realistic samples (this a common characteristic of adversarial inference procedures Goodfellow et al. (2014)).

  • For cases with a small number of training data we should reduce the capacity of the discriminator. This can be achieved by either changing the relative frequency of stochastic gradient updates for the generator and discriminator, or by reducing the capacity of the discriminator neural network architecture.

3 Results

In all examples we have trained the models for 30,000 stochastic gradient descent steps using the Adam optimizer Kingma and Ba (2014) with a learning rate of , while fixing a one-to-five ratio for the discriminator versus generator updates. Moreover, we have fixed the entropic regularization and the residual penalty parameters to and

, respectively. The proposed algorithms were implemented in Tensorflow v1.10

Abadi et al. (2016), and computations were performed in single precision arithmetic on a single NVIDIA Tesla P100 GPU card. All data and code accompanying this manuscript will be made available at https://github.com/PredictiveIntelligenceLab/UQPINNs.

3.1 A pedagogical example

Let us illustrate the basic capabilities of the proposed methods through a simple example corresponding to the following nonlinear second-order ordinary differential equation

(18)

subject to random boundary conditions . For this simple example, the deterministic solution corresponding to can be readily obtained as . Given observations of corresponding to different realizations of the random boundary conditions our goal is to obtain a probabilistic representation of the solution by training a physics-informed generative model of the form , that is constrained by equation 18. To this end, we introduce three deterministic mappings parametrized by deep neural networks, namely , , and corresponding to the generator, encoder, and discriminator functions introduced in section 2.3. By construction, we also obtain a physics-informed neural network corresponding to the deterministic residual of equation 18 that will be used to approximately enforce the differential equation constraint at a set of randomly distributed collocation points

. All neural networks were chosen to have two hidden layers with 50 neurons in each layer, and a hyperbolic tangent activation function. Moreover, the dimensionality of the latent variables was set to one, i.e.

, and we have assumed an isotropic standard normal prior, namely . As the training data for reflects the uncertainty in the boundary conditions, the role of the latent variables is to enable the propagation of this uncertainty into the predicted solution obtained through the generative model .

Figure 1: A pedagogical example:

(a) Mean and two standard deviations of

against the exact solution for deterministic boundary data. (b) Mean and two standard deviations of against the reference Monte Carlo solution for random boundary data corresponding to Gaussian uncorrelated noise.
Figure 2: A pedagogical example: Predicted marginal densities against the reference Monte Carlo solution. (a) . (b) .

Here we have considered two cases corresponding to deterministic and random boundary conditions, namely (i) (i.e., noise-free data), and (ii) (i.e., Gaussian uncorrelated noise). In all cases, the training data consists of realizations for each boundary point, , and a total of collocation points for enforcing the residual of equation 18. Our probabilistic predictions for this example are summarized in figures 1 and 2. Specifically, figure 1(a) shows the generative model predictive mean and two standard deviations, plotted against the exact solution of this problem. Note that this case corresponds to deterministic training data for the boundary conditions, hence the exact solution is deterministic, and the prediction error here is measured as in the relative norm

(19)

where denotes the total number of equidistant test points in the interval . Moreover, the variance shown in the inset of figure 1(a) serves as an a-posteriori error estimate that quantifies the uncertainty associated with the generative model predictions. Figure 1(b) shows the resulting prediction and uncertainty estimates corresponding to random boundary conditions, compared against a reference mean solution obtained numerically using a spectral method with 2,000 Monte Carlo samples. In this case, the predictive uncertainty of the generative model reflects the aggregate total uncertainty due to both randomness in the boundary conditions and the inherent epistemic uncertainty in the neural network approximation. As the generative model can return a complete statistical characterization of the solution by means of its conditional probability density , in figure 2 we provide a visual comparison of the one-dimensional marginals between our predictions and the reference Monte Carlo solution corresponding to the spatial locations and .

Albeit simple, this example aims to demonstrate the basic capabilities of the proposed methodology in propagating uncertainty through non-linear partial differential equations. In contrast to previous approaches to inferring solutions of partial differential equations from data Cockayne et al. (2016); Raissi et al. (2017, 2018); Cockayne et al. (2017), the proposed methodology does not rely on Gaussian assumptions, and it can directly tackle nonlinear problems without any need for linearization.

3.2 Burgers equation

In this example we aim to provide a comprehensive systematic study to quantify the robustness of the proposed methods with respect to different parameter choices. We will do so through the lens of a more challenging canonical problem involving the non-linear time-dependent Burgers equation in one spatial dimension:

(20)

where the viscosity parameter is chosen as in order to generate a strongly nonlinear response that leads to the development of shock discontinuities in finite time. This is one of the few nonlinear partial differential equations that admits an exact solution through the Cole-Hopf transformation Hopf (1950); a solution that will be subsequently used to test the validity of our predictions.

Here we represent the unknown solution using a physics-informed generative model of the form , and we will introduce parametric functions corresponding to a generator , an encoder , and a discriminator

all constructed using deep feed-forward neural networks. The baseline architectures for the generator and the encoder have 4 hidden layers with 50 neurons per layers, while the discriminator network has 3 hidden layers and 50 neurons per layer. The activation function in all cases is chosen to be a hyperbolic tangent non-linearity. The prior over the latent variables

is chosen again to be a one-dimensional isotropic Gaussian distribution, i.e.

, .

First we consider a baseline scenario, in which we train our probabilistic model using a data-set comprising of noisy-free input/output pairs for – 50 points for the initial condition (see figure 3(a)) and 50 points for each of the domain boundaries – plus an additional collocation points for enforcing the residual of the Burgers equation using the loss of equation 14. All data points were randomly selected within the bounds given in equation 20. The result of this experiment is summarized in 4 where we report the predicted mean solution, as well as the uncertainty associated with this prediction as quantified by two standard deviations of the generative model . As the training data for this case is noise-free, the solution to this problem is deterministic, and the resulting uncertainty captured in can be viewed as an a-posteriori error estimate of the neural network approximation error due to the finite number of training data, which is measured as in the relative norm. As discussed in Raissi et al. (2017), a higher approximation accuracy can be achieved by training the generative model using a quasi-Newton optimizer (e.g. L-BFGS Liu and Nocedal (1989)), however here we chose to use stochastic gradient descent using Adam updates Kingma and Ba (2014) in order to highlight the ability of the proposed method to return uncertainty estimates when the model predictions are not perfectly accurate.

Figure 3: Burgers equation: (a) Exact initial condition and noise-free training data (50 points). (b) Training data corresponding to a single realization of the non-additive noise corruption process (100 points, generated by equation 21).
Figure 4: Burgers equation with noise-free data: Top: Mean of , along with the location of the noisy training data , . Middle: Prediction and predictive uncertainty at , and . Bottom: Variance of .

Second, we repeat the same test for a more complicated scenario in which the initial condition has been corrupted by non-additive, non-Gaussian noise as shown in figure 3(b), where the noise variance is larger around , therefore amplifying the effect of uncertainty on the shock formation. Here the neural network architecture as well as the number and location of training points have been kept fixed as described above, but the initial condition is now corrupted as

(21)

The results of this experiment are summarized in figure 5. We observe that the resulting generative model can effectively capture the uncertainty in the resulting spatio-temporal solution due to the propagation of the input noise process through the complex non-linear dynamics of the Burgers equation. As expected, the uncertainty concentrates around the shock. Although we only plot the first two moments of the solution, we must emphasize that the generative model provides a complete probabilistic characterization of its non-Gaussian statistics.

Figure 5: Burgers equation with noisy data: Top: Mean of , along with the location of the training data , . Middle: Prediction and predictive uncertainty at , and . Bottom: Variance of .

In order to further investigate the performance of the proposed methodology for different parameter settings, we have performed a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions on: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. The results of these systematic studies are provided in A.

3.3 Discovery of constitutive laws for flow through porous media

In our final example, we aim to demonstrate the ability of the proposed methods to discover unknown constitutive relationships directly from data with quantified uncertainty. To this end, we revisit the Darcy flow example put forth in Tartakovsky et al. (2018) corresponding to a two-dimensional nonlinear diffusion equation with an unknown state-dependent diffusion coefficient

(22)

where m/s and m are known boundary conditions. In order to benchmark and validate our model predictions we consider a realistic data-set generated using the Subsurface Transport Over Multiple Phases (STOMP) code White et al. (1995) with the van Genuchten model Van Genuchten (1980) for which reads as

(23)

with the following parameter values: m/s, , , , m and m.

Our goal is twofold: we aim to construct a physics-informed probabilistic model for , and simultaneously learn the unknown state-dependent diffusion coefficient directly from data on (i.e., we assume no measurements of ). To this end, in addition to the three deterministic mappings , , and corresponding to the generator, encoder, and discriminator described in section 2.3, here we also introduce another neural network for approximating . The parameters of are essentially inherited by the physics-informed residual network that aims to enforce the residual of equation 22 at the collocation points for any set of latent variables . All neural networks are chosen to have 2 hidden layers with 50 neurons per each, and a hyperbolic tangent activation function, while the probabilistic model for assumes a two dimensional latent space with an isotropic Gaussian prior, i.e., .

By construction, our probabilistic model for can return predictions of the unknown solution with quantified uncertainty. We can then use this model to propagate uncertainty in our predictions of via Monte Carlo sampling. Specifically, once the model is trained end-to-end, we can easily generate samples of from and propagate them through to obtain a samples for . Essentially this results in an implicit generative model which can fully characterize uncertainty in our predictions of the unknown state-dependent diffusion coefficient.

Here we also have considered two cases corresponding to noise-free training data and noisy data corrupted by Gaussian uncorrelated noise. In the noise-free case we used scattered measurements of the unknown solution – 200 inside the domain and 100 on each one of the four boundaries – and total number of randomly selected collocation points inside the domain for penalizing the residual of equation 22. For the noisy case we chose scattered measurements of the unknown solution – 1,000 inside the domain and 100 on each one of the four boundaries – while still keeping collocation points. Figure 6 summarizes the results for both cases by showing the predictive mean and two standard deviations of the corresponding generative model , against the reference (deterministic) solution obtained from the Subsurface Transport Over Multiple Phases (STOMP) code White et al. (1995) with the van Genuchten model Van Genuchten (1980) (see equation 23). Evidently, the generative model is able to recover a sensible prediction for the unknown state-dependent diffusion coefficient with quantitative uncertainty, even when the training data on is heavily corrupted by noise. Moreover, notice that implicitly depends on the spatial coordinates . In figure 7 we present the resulting prediction for corresponding to the noise-free case, against the reference solution, as well as their point-wise absolute error.

Again we must emphasize that these predictions are obtained without ever observing any data on , while they are accompanied by quantitative estimates that jointly characterize the uncertainty due to noise in the training data for , and the underlying approximation error of the neural networks. This theme of consistently inferring correlated continuous quantities of interest from a small set of measurements by leveraging the underlying laws of physics is a great example of the exciting capabilities that physics informed machine learning has to offer.

Figure 6: Prediction with quantified uncertainty of unknown state-dependent diffusion coefficient compared against the reference solution obtained from the Subsurface Transport Over Multiple Phases (STOMP) code White et al. (1995) with the van Genuchten model Van Genuchten (1980) (see equation 23). (a) Noise-free training data for . (b) Noisy training data for with noise level of .
Figure 7: Prediction of unknown state-dependent diffusion coefficient . (a) Reference solution obtained from the Subsurface Transport Over Multiple Phases (STOMP) code White et al. (1995) with the van Genuchten model Van Genuchten (1980) (see equation 23). (b) Predictive mean of the generative model trained on noise-free data for . (c) Absolute point-wise prediction error.

4 Conclusions

We presented a class of probabilistic physics-informed neural networks that are capable of approximating arbitrary conditional probability densities, while being constrained to generate samples that approximately satisfy given partial differential equations. Moreover, we have derived a flexible regularized adversarial inference framework that enables the end-to-end training of such models directly from noisy and incomplete measurements. Uncertainty in the system inputs and/or outputs is captured through a set of latent variables that are relevant to the underlying physics, and could possibly open new directions for probabilistic model-order reduction of complex systems. These developments allow us to perform probabilistic computations for uncertain systems, train deep generative models in small data regimes, handle complex noise processes, and seamlessly carry out uncertainty propagation studies for physical systems without the need for repeated evaluation of experiments and numerical simulations.

Although the proposed adversarial inference framework provides great flexibility for performing probabilistic computations and approximating arbitrarily complex and high-dimensional probability distributions, it relies on carefully tuning the interplay between the generator and discriminator networks. This is a known limitation of adversarial algorithms, and, although several works have led to improvements Tolstikhin et al. (2017); Arjovsky et al. (2017), it still largely remains an open research problem. An alternative path for enhancing the robustness of the inference procedure, while not compromising its ability to handle complex probability distributions, comes through the use of invertible transformations and flow-based generative models Rezende and Mohamed (2015); Kingma and Dhariwal (2018). Future work will examine the applications of such models in the context of physics-informed neural networks with the goal of robustifying the proposed methods and scaling them to more realistic systems.

Acknowledgements

This work received support from the US Department of Energy under the Advanced Scientific Computing Research program (grant DE-SC0019116) and the Defense Advanced Research Projects Agency under the Physics of Artificial Intelligence program. We would also like to thank Dr. Alexandre Tartakovsky from the Pacific Nortwest National Laboratory for providing the Darcy flow data-set.

Appendix A Sensitivity studies

Here we provide results on a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions on: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. In all cases we have used the non-linear Burgers defined in section 3.2 as a prototype problem.

a.1 Sensitivity with respect to the neural network initialization

In order to quantify the sensitivity of the proposed methods with respect to the initialization of the neural networks, we have considered a noise-free data set comprising of and training and collocation points, respectively, and fixed the architecture for generator neural networks to include 4 hidden layers with 50 neurons each and discriminator neural networks to include 3 hidden layers with 50 neurons each , and a hyperbolic tangent activation function. Then we have trained an ensemble of 15 cases all starting from a normal Xavier initialization (Glorot and Bengio, 2010) for all network weights (with a randomized seed), and a zero initialization for all bias parameters. In table 1 we report the relative error between the predicted mean solution and the known exact solution for this problem for all 15 randomized trials using at set of randomly selected test points. Evidently, our results are robust with respect to the the neural network initialization as in all cases the stochastic gradient descent training procedure converged roughly to the same solution. We can summarize this result by reporting the mean and the standard deviation of the relative error as

Relative error
4.1e-02 7.9e-02 4.4e-02 4.0e-02 3.8e-02
3.2e-02 5.7e-02 4.7e-02 6.5e-02 4.0e-02
3.5e-02 3.5e-02 6.4e-02 4.0e-02 4.9e-02
Table 1: Relative prediction error for different neural network initializations using a randomized seed.

a.2 Sensitivity with respect to the total number of training and collocation points

In this study our goal is to quantify the sensitivity of our predictions with respect to the total number of training and collocation points and , respectively. As before, we have considered noise-free data sets, and fixed the architecture for generator neural networks to include 4 hidden layers with 50 neurons each and discriminator neural networks to include 3 hidden layers with 50 neurons each, a hyperbolic tangent activation function, and a normal Xavier initialization (Glorot and Bengio, 2010) for all network weights and zero initialization for all network biases. The results of this study are summarized in table 2, indicating that as the number of collocation points are increased, a more accurate prediction is obtained. This observation is in agreement with the original results of Raissi et. al. (Raissi et al., 2017a, b) for deterministic physics-informed neural networks, indicating the role of the residual loss as an effective regularization mechanism for training deep generative models in small data regimes.

10 100 250 500 1000 5000 10000
60 9.3e-01 5.6e-01 4.8e-01 5.0e-02 1.9e-01 5.0e-02 5.1e-02
90 5.8e-01 5.3e-01 3.5e-01 1.5e-01 4.9e-02 1.0e-01 5.8e-02
150 6.7e-01 1.4e-01 3.0e-01 3.6e-02 4.9e-02 1.2e-01 4.7e-02
Table 2: Relative prediction error for different number of training and collocation points and , respectively.

a.3 Sensitivity with respect to the neural network architecture

In this study we aim to quantify the sensitivity of our predictions with respect to the architecture of the neural networks that parametrize the generator, the discriminator, and the encoder. Here we have fixed the number of noise-free training data to and , and we kept the number of layers for discriminator to always be one less than the number of layers for generator (e.g., if the number of layers for generator is two then the number of layers for discriminator is one, etc.). In all cases, we have used a hyperbolic tangent non-linearity and a normal Xavier initialization (Glorot and Bengio, 2010). In table 3 we report the relative prediction error for different feed-forward architectures for the generator, discriminator, and encoder (i.e., different number of layers and number of nodes in each layer). The general trend suggests that as the neural network capacity is increased we obtain more accurate predictions, indicating that our physics-informed constraint on the PDE residual can effectively regularize the training process and safe-guard against over-fitting. We note number of neurons in each layer as and number of layers for generator (encoder) as .

20 50 100
2 4.2e-01 3.8e-01 5.7e-01
3 6.5e-02 3.5e-02 2.1e-02
4 9.3e-02 4.7e-02 5.4e-02
Table 3: Relative prediction error for different feed-forward architectures for the generator, encoder, and the discriminator. The total number of layers of the latter was always chosen to be one less than the number of layers for generator.

a.4 Sensitivity with respect to the adversarial training procedure

Finally, we test the sensitivity with respect to the adversarial training process. To this end, we have fixed the number of noise-free training data to and , and the neural network architecture to be the same as A.2, and we vary the total number of training steps for the generator and the discriminator within each stochastic gradient descent iteration. The results of this study are presented in table 4 where we report the relative prediction error. These results reveal the high sensitivity of the training dynamics on the interplay between the generator and discriminator networks, and pinpoint on the well known peculiarity of adversarial inference procedures which require a careful tuning of and for achieving stable performance in practice.

1 2 5
1 3.5e-01 5.0e-01 1.5e+00
2 4.3e-02 3.2e-01 5.4e-01
5 4.7e-02 2.3e-01 7.0e-01
Table 4: Relative

error with different number of training for generator and discriminator in each epoch.

References