Physics-informed deep generative models

12/09/2018 · by Yibo Yang, et al. · University of Pennsylvania

We consider the application of deep generative models in propagating uncertainty through complex physical systems. Specifically, we put forth an implicit variational inference formulation that constrains the generative model output to satisfy given physical laws expressed by partial differential equations. Such physics-informed constraints provide a regularization mechanism for effectively training deep probabilistic models for modeling physical systems in which the cost of data acquisition is high and training data-sets are typically small. This provides a scalable framework for characterizing uncertainty in the outputs of physical systems due to randomness in their inputs or noise in their observations. We demonstrate the effectiveness of our approach through a canonical example in transport dynamics.


1 Introduction

Despite their immense recent success, many state-of-the-art machine learning techniques (e.g., deep neural nets, convolutional networks, recurrent networks [1, 2, 3, 4], etc.) lack robustness and often fail to provide any guarantees of convergence or to quantify the error/uncertainty associated with their predictions. Hence, the ability to quantify predictive uncertainty and learn in a sample-efficient manner is a necessity [5, 6, 7, 8], especially in data-limited domains [9]. Even less well understood is how one can constrain such algorithms to leverage domain-specific knowledge and return predictions that satisfy certain physical principles [10] (e.g., conservation of mass, momentum, etc.). In recent work, Raissi et al. [11, 12, 13, 14] explored this new interface between classical scientific computing and machine learning by revisiting the idea of penalizing the loss function of deep neural networks with differential equation constraints, as first put forth by Psichogios and Ungar [15] and Lagaris et al. [16]. In this work, we revisit these approaches from a probabilistic standpoint [17, 18, 19] and develop an adversarial inference framework [20, 21, 22, 23] to enable the posterior characterization [24, 25] of the uncertainty associated with the model predictions. These uncertainty estimates reflect how observation noise and/or randomness in the system's inputs and outputs are propagated through complex infinite-dimensional dynamical systems described by non-linear partial differential equations (PDEs).

2 Methods

In Raissi et al. [11, 12, 13, 14], the authors considered constructing deep neural networks that return predictions constrained by PDEs of the form $u_t + \mathcal{N}_x[u] = 0$, where the solution $u(x,t)$ is represented by a deep neural network parametrized by a set of parameters $\theta$, i.e. $u_\theta(x,t)$, $x$ is a vector of space coordinates, $t$ is the time coordinate, and $\mathcal{N}_x$ is a nonlinear differential operator. As neural networks are differentiable representations, this construction defines a so-called physics-informed neural network that corresponds to the PDE residual, i.e. $f_\theta := u_t + \mathcal{N}_x[u_\theta]$. By construction, this network shares the same architecture and parameters with $u_\theta(x,t)$, but it has different activation functions corresponding to the action of the differential operator. The resulting training procedure allows us to recover the shared network parameters $\theta$ using a few scattered observations of $u(x,t)$, namely $\{(x_u^i, t_u^i), u^i\}_{i=1}^{N_u}$, along with a larger number of collocation points $\{(x_r^i, t_r^i)\}_{i=1}^{N_r}$ that aim to penalize the PDE residual at a finite set of collocation nodes. This way, the resulting optimization problem can be effectively solved using standard stochastic gradient descent, without necessitating any elaborate constrained optimization techniques, simply by minimizing the composite loss function

$\mathcal{L}(\theta) = \frac{1}{N_u}\sum_{i=1}^{N_u}\left[u_\theta(x_u^i, t_u^i) - u^i\right]^2 + \frac{1}{N_r}\sum_{i=1}^{N_r}\left[f_\theta(x_r^i, t_r^i)\right]^2, \qquad (1)$

where the required gradients can be readily obtained using automatic differentiation [26].
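For illustration, a minimal sketch (in Python/JAX) of how the residual network and the composite loss of equation 1 could be assembled with automatic differentiation is given below, instantiated for the Burgers operator used later in Section 3; the network size, the viscosity value, and all function names are assumptions made for exposition rather than the implementation used in this work.

import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    # Fully-connected network parameters; sizes = [n_in, hidden..., n_out].
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) * jnp.sqrt(1.0 / m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, inputs):
    h = inputs
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).squeeze()

def u_net(params, x, t):
    # Surrogate u_theta(x, t) for the latent PDE solution.
    return mlp(params, jnp.array([x, t]))

NU = 0.01 / jnp.pi  # illustrative viscosity, not the value used in the paper

def f_net(params, x, t):
    # Physics-informed network f_theta = u_t + u u_x - nu u_xx (Burgers residual),
    # obtained from u_theta by automatic differentiation.
    u = u_net(params, x, t)
    u_t = jax.grad(u_net, argnums=2)(params, x, t)
    u_x = jax.grad(u_net, argnums=1)(params, x, t)
    u_xx = jax.grad(jax.grad(u_net, argnums=1), argnums=1)(params, x, t)
    return u_t + u * u_x - NU * u_xx

def composite_loss(params, data, colloc):
    # Equation 1: data misfit at the N_u observations plus the mean-squared
    # PDE residual at the N_r collocation points.
    x_u, t_u, u_obs = data
    x_r, t_r = colloc
    u_pred = jax.vmap(u_net, in_axes=(None, 0, 0))(params, x_u, t_u)
    f_pred = jax.vmap(f_net, in_axes=(None, 0, 0))(params, x_r, t_r)
    return jnp.mean((u_pred - u_obs) ** 2) + jnp.mean(f_pred ** 2)

params = init_mlp(jax.random.PRNGKey(0), [2, 50, 50, 50, 50, 1])
x_u = jnp.array([-1.0, 0.0, 1.0]); t_u = jnp.zeros(3); u_obs = -jnp.sin(jnp.pi * x_u)  # toy data
x_r = jnp.linspace(-1.0, 1.0, 64); t_r = jnp.full(64, 0.5)                             # collocation points
grad_fn = jax.grad(composite_loss)
grads = grad_fn(params, (x_u, t_u, u_obs), (x_r, t_r))

Standard stochastic gradient descent (or Adam) updates can then be driven by grad_fn evaluated on mini-batches of observations and collocation points.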

In the proposed work we aim to model uncertainty in such physics-informed models by constructing conditional latent variable models of the form

$p_\theta(u|x,t) = \int p_\theta(u|x,t,z)\, p(z)\, dz, \qquad (2)$

and encourage the resulting samples to be constrained by a given physical law according to a likelihood

$p\!\left(f \,|\, x, t\right) = \mathcal{N}\!\left(f \,\big|\, 0, \sigma_f^2\right), \qquad f := u_t + \mathcal{N}_x[u], \qquad (3)$

where $f$ denotes the PDE residual of the generated samples. This setting encapsulates a wide range of deterministic and stochastic problems, where $u$ is a potentially multi-variate field and $z$ is a collection of latent variables.
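To make the role of the latent variables in equation 2 concrete, the short sketch below draws Monte Carlo samples from an implicit conditional model $u = u_\theta(x,t,z)$, $z \sim p(z)$, and summarizes them; the toy generator is a hypothetical stand-in for a trained network.

import jax
import jax.numpy as jnp

def sample_u(key, generator, x, t, n_samples=500):
    # Monte Carlo draws from p_theta(u|x,t) = E_{p(z)}[p_theta(u|x,t,z)], with p(z) = N(0,1).
    z = jax.random.normal(key, (n_samples,))
    return jax.vmap(lambda zi: generator(x, t, zi))(z)

# Stand-in generator u_theta(x, t, z); a trained network would replace this.
toy_generator = lambda x, t, z: jnp.sin(jnp.pi * x) * jnp.exp(-t) + 0.1 * z

samples = sample_u(jax.random.PRNGKey(1), toy_generator, x=0.3, t=0.5)
print(samples.mean(), samples.std())  # predictive mean and spread at (x, t)

The PDE residual of each such sample, computed as in the previous sketch, can then be scored under a residual likelihood of the form of equation 3.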

Following the recent findings of [27], we will train the generative model by matching the joint distribution of the generated samples $p_\theta(u,x,t)$ with the joint distribution of the observed data $q(u,x,t)$ by minimizing the reverse Kullback-Leibler (KL) divergence, which can be decomposed as

$\mathrm{KL}\!\left[p_\theta(u,x,t)\,\|\,q(u,x,t)\right] = -\,\mathbb{H}\!\left[p_\theta(u,x,t)\right] - \mathbb{E}_{p_\theta(u,x,t)}\!\left[\log q(u,x,t)\right], \qquad (4)$

where $\mathbb{H}$ denotes the entropy of the generative model. This decomposition reveals the interplay between two competing mechanisms that define the model training dynamics. On one hand, minimizing the negative entropy term is encouraging the support of $p_\theta(u,x,t)$ to spread to infinity, while the second term will penalize regions for which the supports of $p_\theta(u,x,t)$ and $q(u,x,t)$ do not overlap. As observed in [27], this introduces a regularization mechanism for mitigating the pathology of mode collapse.
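For reference, the decomposition in equation 4 is simply the definition of the KL divergence written out term by term, stated here in the notation used above:

$\mathrm{KL}\!\left[p_\theta(u,x,t)\,\|\,q(u,x,t)\right] = \mathbb{E}_{p_\theta}\!\left[\log p_\theta(u,x,t) - \log q(u,x,t)\right] = -\,\mathbb{H}\!\left[p_\theta(u,x,t)\right] - \mathbb{E}_{p_\theta}\!\left[\log q(u,x,t)\right].$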

As the entropy term is intractable, we can obtain a computable training objective by deriving the following lower bound (see proof in Appendix A)

$\mathbb{H}\!\left[p_\theta(u|x,t)\right] \;\geq\; \mathbb{H}\!\left[p(z)\right] + \mathbb{E}_{p_\theta(u,z|x,t)}\!\left[\log q_\phi(z|x,t,u)\right], \qquad (5)$

where the inference model $q_\phi(z|x,t,u)$ plays the role of a variational approximation to the true posterior over the latent variables, and appears naturally using information theoretic arguments in the derivation of the lower bound [27].

This construction leads to two coupled objectives that define an adversarial game for training the model parameters

$\min_{\theta,\phi}\; \mathcal{L}_G(\theta,\phi) = \mathbb{E}_{q(x,t)\,p(z)}\!\left[\, T_\psi\!\left(x,t,u_\theta(x,t,z)\right) + (1-\lambda)\,\log q_\phi\!\left(z\,|\,x,t,u_\theta(x,t,z)\right) \right], \qquad (6)$

$\max_{\psi}\; \mathcal{L}_D(\psi) = \mathbb{E}_{q(u,x,t)}\!\left[\log \sigma\!\left(T_\psi(x,t,u)\right)\right] + \mathbb{E}_{q(x,t)\,p(z)}\!\left[\log\!\left(1 - \sigma\!\left(T_\psi\!\left(x,t,u_\theta(x,t,z)\right)\right)\right)\right], \qquad (7)$

where $T_\psi(x,t,u)$ is a parametrized discriminator used to approximate the KL divergence directly from samples using the density ratio trick [28], and $\sigma$ is the logistic sigmoid function. Moreover, notice how the inference model $q_\phi(z|x,t,u)$ promotes cycle consistency in the latent variables and serves as an entropic regularization term that allows us to stabilize model training and mitigate the pathology of mode collapse, as controlled by the user-defined parameter $\lambda$ [27]. The final training objective that encourages the generated samples to satisfy a given PDE reads as

$\mathcal{L}(\theta,\phi) = \mathcal{L}_G(\theta,\phi) + \frac{\beta}{N_r}\sum_{i=1}^{N_r} \mathbb{E}_{p(z)}\!\left[\, f_\theta\!\left(x_r^i, t_r^i, z\right)^2 \right], \qquad (8)$

where $f_\theta(x,t,z)$ denotes the PDE residual of a generated sample and positive values of $\beta$ can be selected to place more emphasis on penalizing the PDE residual. In this regime, the residual loss acts as a regularization term that encourages the generator to produce samples that satisfy the underlying partial differential equation.
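As a concrete (if simplified) illustration of this adversarial game, the sketch below assembles discriminator and generator objectives of the general form of equations 6-8 as reconstructed above, together with a Burgers residual penalty; the entropy weighting, the viscosity, the network sizes, and all default hyper-parameter values are illustrative assumptions and should not be read as the settings used in this work.

import jax
import jax.numpy as jnp

def init_mlp(key, sizes):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) * jnp.sqrt(1.0 / m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def mlp(params, inputs):
    h = inputs
    for W, b in params[:-1]:
        h = jnp.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

def generator(theta, x, t, z):
    # Implicit model u_theta(x, t, z).
    return mlp(theta, jnp.array([x, t, z])).squeeze()

def discriminator(psi, x, t, u):
    # Scalar logit T_psi(x, t, u) used for the density ratio trick.
    return mlp(psi, jnp.array([x, t, u])).squeeze()

def encoder_logprob(phi, x, t, u, z):
    # log q_phi(z | x, t, u) for a Gaussian encoder with learned mean and log-std.
    out = mlp(phi, jnp.array([x, t, u]))
    mu, log_s = out[0], out[1]
    return -0.5 * ((z - mu) / jnp.exp(log_s)) ** 2 - log_s - 0.5 * jnp.log(2.0 * jnp.pi)

NU = 0.01 / jnp.pi  # illustrative viscosity

def residual(theta, x, t, z):
    # Burgers residual u_t + u u_x - nu u_xx of a generated sample.
    u = generator(theta, x, t, z)
    u_t = jax.grad(generator, argnums=2)(theta, x, t, z)
    u_x = jax.grad(generator, argnums=1)(theta, x, t, z)
    u_xx = jax.grad(jax.grad(generator, argnums=1), argnums=1)(theta, x, t, z)
    return u_t + u * u_x - NU * u_xx

def d_loss(psi, theta, key, data):
    # Discriminator objective (cf. equation 7), negated so it can be minimized:
    # binary discrimination between observed and generated (x, t, u) triples.
    x, t, u = data
    z = jax.random.normal(key, x.shape)
    u_gen = jax.vmap(generator, (None, 0, 0, 0))(theta, x, t, z)
    logit_real = jax.vmap(discriminator, (None, 0, 0, 0))(psi, x, t, u)
    logit_fake = jax.vmap(discriminator, (None, 0, 0, 0))(psi, x, t, u_gen)
    return -jnp.mean(jax.nn.log_sigmoid(logit_real)) - jnp.mean(jax.nn.log_sigmoid(-logit_fake))

def g_loss(theta, phi, psi, key, data_xt, colloc, lam=0.5, beta=1.0):
    # Generator objective (cf. equations 6 and 8): discriminator score plus the
    # entropic/cycle-consistency term, plus the beta-weighted PDE residual penalty
    # (lam and beta are illustrative values).
    x, t = data_xt
    x_r, t_r = colloc
    k1, k2 = jax.random.split(key)
    z = jax.random.normal(k1, x.shape)
    u_gen = jax.vmap(generator, (None, 0, 0, 0))(theta, x, t, z)
    t_fake = jax.vmap(discriminator, (None, 0, 0, 0))(psi, x, t, u_gen)
    log_q = jax.vmap(encoder_logprob, (None, 0, 0, 0, 0))(phi, x, t, u_gen, z)
    z_r = jax.random.normal(k2, x_r.shape)
    f = jax.vmap(residual, (None, 0, 0, 0))(theta, x_r, t_r, z_r)
    return jnp.mean(t_fake + (1.0 - lam) * log_q) + beta * jnp.mean(f ** 2)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
theta = init_mlp(k1, [3, 50, 50, 50, 50, 1])  # generator
phi = init_mlp(k2, [3, 50, 50, 50, 50, 2])    # encoder (outputs mean and log-std)
psi = init_mlp(k3, [3, 50, 50, 50, 1])        # discriminator
x = jnp.linspace(-1.0, 1.0, 32); t = jnp.zeros(32); u = -jnp.sin(jnp.pi * x)  # toy data
x_r = jnp.linspace(-1.0, 1.0, 64); t_r = jnp.full(64, 0.5)                    # collocation points
print(d_loss(psi, theta, k4, (x, t, u)), g_loss(theta, phi, psi, k4, (x, t), (x_r, t_r)))

Alternating gradient updates of $(\theta, \phi)$ and $\psi$ on these two objectives then implement the adversarial game.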

3 Results

Here we demonstrate the performance of the proposed methodology through the lens of a canonical problem in transport dynamics modeled by the Burgers equation with appropriate initial and boundary conditions. This equation arises in various areas of applied mathematics, including fluid mechanics, nonlinear acoustics, gas dynamics, and traffic flow [29]. In one space dimension the equation reads as

$u_t + u\,u_x = \nu\, u_{xx}, \qquad (9)$

where $\nu$ is a viscosity parameter, small values of which can lead to solutions developing shock formations that are notoriously hard to resolve by classical numerical methods [29].

Here, we construct a probabilistic representation for the unknown solution $u(x,t)$ using a physics-informed deep generative model $p_\theta(u|x,t)$, and we introduce parametric mappings corresponding to a generator $u_\theta(x,t,z)$, an encoder $q_\phi(z|x,t,u)$, and a discriminator $T_\psi(x,t,u)$, all constructed using deep feed-forward neural networks with 4 hidden layers of 50 neurons each. The activation function in all cases is chosen to be a hyperbolic tangent non-linearity. The prior over the latent variables is chosen to be a one-dimensional isotropic Gaussian distribution, i.e. $p(z) = \mathcal{N}(0,1)$. We train our probabilistic model using stochastic gradient Adam updates [30] on a data-set comprising $N_u = 200$ input/output pairs for $u(x,t)$ – 100 points for the initial condition and 50 points for each of the two domain boundaries – plus an additional set of $N_r$ collocation points for enforcing the residual of the Burgers equation using the loss of equation 8. A systematic study with respect to the model hyper-parameters is provided in Appendix B. Notice that the initial condition here is corrupted by a non-additive noise process, and our goal is to propagate the effect of this uncertainty into the prediction of future system states. Notice also how the noise variance is chosen to be larger near the spatial location where the shock later forms, therefore amplifying the effect of uncertainty on the shock formation.
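A minimal sketch of such alternating stochastic gradient Adam updates is shown below (using optax); here g_loss and d_loss stand in for the objectives of equations 6-8 and are replaced by trivial placeholders so that the snippet is self-contained, and the learning rate and per-iteration step counts (studied in Appendix B.4) are illustrative values rather than those used in this work.

import jax
import jax.numpy as jnp
import optax

# Placeholder parameters and objectives; in the actual model these would be the
# MLP parameters and the losses of equations 6-8.
theta = {"w": jnp.zeros(3)}  # generator + encoder parameters (stand-in)
psi = {"w": jnp.ones(3)}     # discriminator parameters (stand-in)
g_loss = lambda theta, psi: jnp.sum((theta["w"] - 1.0) ** 2) - jnp.sum(psi["w"] ** 2)
d_loss = lambda psi, theta: jnp.sum((psi["w"] - theta["w"]) ** 2)

opt_g, opt_d = optax.adam(1e-4), optax.adam(1e-4)  # illustrative learning rate
state_g, state_d = opt_g.init(theta), opt_d.init(psi)
n_g, n_d = 1, 1  # generator/discriminator sub-steps per iteration (cf. Appendix B.4)

for it in range(1000):
    for _ in range(n_d):  # discriminator update(s)
        grads = jax.grad(d_loss)(psi, theta)
        updates, state_d = opt_d.update(grads, state_d)
        psi = optax.apply_updates(psi, updates)
    for _ in range(n_g):  # generator/encoder update(s)
        grads = jax.grad(g_loss)(theta, psi)
        updates, state_g = opt_g.update(grads, state_g)
        theta = optax.apply_updates(theta, updates)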

The results of this experiment are summarized in figure 1, where we report the predicted mean solution, as well as the uncertainty associated with this prediction. We observe that the resulting generative model can effectively capture the uncertainty in the resulting spatio-temporal solution due to the propagation of the input noise process through the complex non-linear dynamics of the Burgers equation. As expected, the uncertainty concentrates around the shock discontinuity that the solution develops as time evolves. Although we only plot the first two moments of the solution, we must emphasize that the generative model $p_\theta(u|x,t)$ provides a complete probabilistic characterization of its non-Gaussian statistics.

Figure 1: Top: Mean and variance of the predicted solution $u(x,t)$, along with the locations of the noisy training data. Bottom: Noisy data for the initial condition, and the resulting prediction and predictive uncertainty at representative later times.

References

Appendix A Proof of the Entropy Lower Bound

Here we follow the derivation of Li [27] to construct a computable lower bound for the entropy term $\mathbb{H}\!\left[p_\theta(u|x,t)\right]$. To this end, we start by considering the random variables $(u, z)$ under the joint distribution

$p_\theta(u, z \,|\, x, t) = p(z)\, \delta\!\left(u - u_\theta(x,t,z)\right),$

where $u = u_\theta(x,t,z)$, and $\delta(\cdot)$ is the Dirac delta function. The mutual information between $u$ and $z$ satisfies the information theoretic identity

$I(u; z) = \mathbb{H}(u) - \mathbb{H}(u|z) = \mathbb{H}(z) - \mathbb{H}(z|u),$

where $\mathbb{H}(u)$, $\mathbb{H}(z)$ are the marginal entropies and $\mathbb{H}(u|z)$, $\mathbb{H}(z|u)$ are the conditional entropies [31]. Since in our setup $x$ and $t$ are deterministic variables independent of $z$, and samples of $u$ are generated by a deterministic function $u = u_\theta(x,t,z)$, it follows that $\mathbb{H}(u|z) = 0$. We therefore have

$\mathbb{H}(u) = \mathbb{H}(z) - \mathbb{H}(z|u), \qquad (10)$

where $\mathbb{H}(z)$ is a constant with respect to the model parameters $\theta$.

Now consider a general variational distribution $q_\phi(z|x,t,u)$ parametrized by a set of parameters $\phi$. Then, by the non-negativity of the KL divergence,

$\mathbb{H}(z|u) \;\leq\; -\,\mathbb{E}_{p_\theta(u,z|x,t)}\!\left[\log q_\phi(z|x,t,u)\right]. \qquad (11)$

Viewing $z$ as a set of latent variables, $q_\phi(z|x,t,u)$ is a variational approximation to the true intractable posterior over the latent variables, $p_\theta(z|x,t,u)$. Therefore, if $q_\phi(z|x,t,u)$ is introduced as an auxiliary encoder associated with the generative model $p_\theta(u|x,t,z)$, then we can use equations 10 and 11 to bound the entropy term in equation 4 as

$\mathbb{H}\!\left[p_\theta(u|x,t)\right] \;\geq\; \mathbb{H}\!\left[p(z)\right] + \mathbb{E}_{p_\theta(u,z|x,t)}\!\left[\log q_\phi(z|x,t,u)\right]. \qquad (12)$

Appendix B Systematic Studies

Here we provide results on a series of comprehensive systematic studies that aim to quantify the sensitivity of the resulting predictions with respect to: (i) the neural network initialization, (ii) the total number of training and collocation points, (iii) the neural network architecture, and (iv) the adversarial training procedure. In all cases we have used the non-linear Burgers equation as a prototype problem.

B.1 Sensitivity with respect to the neural network initialization

In order to quantify the sensitivity of the proposed methods with respect to the initialization of the neural networks, we have considered a noise-free data set of $N_u$ training and $N_r$ collocation points, and fixed the architecture of the generator neural networks to 4 hidden layers with 50 neurons each and of the discriminator neural networks to 3 hidden layers with 50 neurons each, with a hyperbolic tangent activation function. Then we have trained an ensemble of 15 cases, all starting from a normal Xavier initialization [32] for all network weights (with a randomized seed) and a zero initialization for all bias parameters. In table 1 we report the relative error between the predicted mean solution and the known exact solution for this problem for all 15 randomized trials, computed on a set of randomly selected test points. Evidently, our results are robust with respect to the neural network initialization, as in all cases the stochastic gradient descent training procedure converged to roughly the same solution. We can summarize this result by reporting the mean and the standard deviation of the relative error across the 15 trials (see the short computation following table 1).

Relative error
4.1e-02 7.9e-02 4.4e-02 4.0e-02 3.8e-02
3.2e-02 5.7e-02 4.7e-02 6.5e-02 4.0e-02
3.5e-02 3.5e-02 6.4e-02 4.0e-02 4.9e-02
Table 1: Relative error of the predicted mean solution for 15 different random initialization seeds (noise-free data).
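The summary statistics mentioned above can be recovered directly from the fifteen values reported in table 1; the short computation below copies those values verbatim and prints their mean and standard deviation (approximately 4.7e-02 and 1.3e-02).

import jax.numpy as jnp

# Relative errors for the 15 randomized initializations, copied from table 1.
errors = jnp.array([4.1e-02, 7.9e-02, 4.4e-02, 4.0e-02, 3.8e-02,
                    3.2e-02, 5.7e-02, 4.7e-02, 6.5e-02, 4.0e-02,
                    3.5e-02, 3.5e-02, 6.4e-02, 4.0e-02, 4.9e-02])
print(errors.mean(), errors.std())  # ~4.7e-02 and ~1.3e-02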

B.2 Sensitivity with respect to the total number of training and collocation points

In this study our goal is to quantify the sensitivity of our predictions with respect to the total number of training and collocation points, $N_u$ and $N_r$, respectively. As before, we have considered noise-free data sets, and fixed the architecture of the generator neural networks to 4 hidden layers with 50 neurons each and of the discriminator neural networks to 3 hidden layers with 50 neurons each, with a hyperbolic tangent activation function, a normal Xavier initialization [32] for all network weights, and a zero initialization for all network biases. The results of this study are summarized in table 2, indicating that as the number of collocation points is increased, a more accurate prediction is obtained. This observation is in agreement with the original results of Raissi et al. [11] for deterministic physics-informed neural networks, indicating the role of the PDE residual loss as an effective regularization mechanism for training deep generative models in small data regimes.

N_u \ N_r   10        100       250       500       1000      5000      10000
60          9.3e-01   5.6e-01   4.8e-01   5.0e-02   1.9e-01   5.0e-02   5.1e-02
90          5.8e-01   5.3e-01   3.5e-01   1.5e-01   4.9e-02   1.0e-01   5.8e-02
150         6.7e-01   1.4e-01   3.0e-01   3.6e-02   4.9e-02   1.2e-01   4.7e-02
Table 2: Relative prediction error for different numbers of training points N_u (rows) and collocation points N_r (columns).

B.3 Sensitivity with respect to the neural network architecture

In this study we aim to quantify the sensitivity of our predictions with respect to the architecture of the neural networks that parametrize the generator, the discriminator, and the encoder. Here we have fixed the number of noise-free training and collocation points, and we kept the number of layers for the discriminator to always be one less than the number of layers for the generator (e.g., if the generator has two layers then the discriminator has one, etc.). In all cases, we have used a hyperbolic tangent non-linearity and a normal Xavier initialization [32]. In table 3 we report the relative prediction error for different feed-forward architectures for the generator, discriminator, and encoder (i.e., different numbers of layers and of neurons per layer). The general trend suggests that as the neural network capacity is increased we obtain more accurate predictions, indicating that our physics-informed constraint on the PDE residual can effectively regularize the training process and safeguard against over-fitting. In table 3, the columns correspond to the number of neurons in each layer and the rows to the number of layers of the generator (and encoder).

Layers \ Neurons   20        50        100
2                  4.2e-01   3.8e-01   5.7e-01
3                  6.5e-02   3.5e-02   2.1e-02
4                  9.3e-02   4.7e-02   5.4e-02
Table 3: Relative prediction error for different feed-forward architectures for the generator, encoder, and discriminator. The total number of layers of the discriminator was always chosen to be one less than the number of layers of the generator.

B.4 Sensitivity with respect to the adversarial training procedure

Finally, we test the sensitivity with respect to the adversarial training process. To this end, we have fixed the number of noise-free training and collocation points and the neural network architecture to be the same as in B.2, and we vary the number of training steps taken for the generator and for the discriminator within each stochastic gradient descent iteration. The results of this study are presented in table 4, where we report the relative prediction error. These results reveal the high sensitivity of the training dynamics to the interplay between the generator and discriminator networks, and highlight the well-known peculiarity of adversarial inference procedures, which require a careful tuning of the number of generator and discriminator updates for achieving stable performance in practice.

1 2 5
1 3.5e-01 5.0e-01 1.5e+00
2 4.3e-02 3.2e-01 5.4e-01
5 4.7e-02 2.3e-01 7.0e-01
Table 4: Relative error for different numbers of generator and discriminator training steps in each epoch.