Bayesian Deep Convolutional Encoder-Decoder Networks for Surrogate Modeling and Uncertainty Quantification

01/21/2018 · by Yinhao Zhu et al. · University of Notre Dame

We are interested in the development of surrogate models for uncertainty quantification and propagation in problems governed by stochastic PDEs, using a deep convolutional encoder-decoder network in a similar fashion to approaches considered in deep learning for image-to-image regression tasks. Since standard neural networks are data-intensive and cannot provide predictive uncertainty, we propose a Bayesian approach to convolutional neural nets. A recently introduced variational gradient descent algorithm based on Stein's method is scaled to deep convolutional networks to perform approximate Bayesian inference on millions of uncertain network parameters. This approach achieves state-of-the-art performance in terms of predictive accuracy and uncertainty quantification in comparison to other approaches in Bayesian neural networks, as well as techniques that include Gaussian processes and ensemble methods, even when the training data size is relatively small. To evaluate the performance of this approach, we consider standard uncertainty quantification benchmark problems, including flow in heterogeneous media defined in terms of limited data-driven permeability realizations. The performance of the surrogate model developed is very good even though there is no underlying structure shared between the input (permeability) and output (flow/pressure) fields, as is often the case in the image-to-image regression models used in computer vision problems. Studies are performed with an underlying stochastic input dimensionality up to 4,225, where most other uncertainty quantification methods fail. Uncertainty propagation tasks are considered and the predictive output Bayesian statistics are compared to those obtained with Monte Carlo estimates.


1 Introduction

Uncertainty in complex systems arises from model error and model parametrization, unknown or incomplete material properties, boundary conditions or forcing terms, and others. Uncertainty propagation takes place by reformulating the problem of interest as a system of stochastic partial differential equations (SPDEs). The solution of such problems often relies on the solution of the deterministic problem at a finite number of realizations of the random input using Monte Carlo sampling or collocation methods. Considering the computational cost of solving complex multiscale/multiphysics deterministic problems, one often relies on Bayesian surrogate models that are trained with only a small number of deterministic solution runs while at the same time being capable of capturing the epistemic uncertainty introduced by the limited training data bilionis2016bayesian.

For realistic problems in science and engineering, we only have access to a limited number of deterministic simulation runs. Vanilla Monte Carlo for uncertainty propagation is thus hopeless. The dominant approach is to train a surrogate model using the limited simulation-based training data, and then perform prediction and uncertainty propagation tasks using the surrogate instead of solving the actual PDEs. Unfortunately, most existing surrogate models have difficulty scaling to high-dimensional problems, such as the ones based on Gaussian processes (GP) bilionis2013multi ; bilionis2012multi or generalized polynomial chaos expansions (gPC) xiu2002wiener. High dimensionality often arises from the discretization of properties with small correlation lengths (e.g. permeability in heterogeneous media flows), or from randomly distributed sources or force input fields with multiple scales torquato2013random.

To alleviate the curse of stochastic input dimensionality, we usually assume that the given input data lie on an embedded non-linear manifold within the higher-dimensional space. This intrinsic dimensionality is captured by dimensionality reduction techniques van2009dimensionality, such as the Karhunen-Loève expansion (KLE), t-SNE maaten2008visualizing, auto-encoders hinton2006reducing, probabilistic methods like variational auto-encoders kingma2013auto, Gaussian process latent variable models (GP-LVM) lawrence2004gaussian, and many more. Most dimensionality reduction models are unsupervised learning problems that do not explicitly take the regression task into account. Thus the classical approach to uncertainty quantification is to first reduce the dimensionality of the input to obtain a low-dimensional latent representation of the input field, and then to build a regression model from this latent representation to the output. This approach is certainly not efficient, in particular when the map from the latent representation to the physical space of the input data is not available. In bilionis2013multi, KLE was used for dimensionality reduction of the permeability field and a GP regression model was then trained as an independent task. In grigo2017bayesian, this approach was taken one step further: the probabilistic mappings from input to latent space and from latent space to output were both modeled by generalized linear models and trained simultaneously end-to-end using stochastic variational inference, instead of performing the unsupervised and supervised/regression tasks separately.

One of the essential emerging approaches for handling high-dimensional data is to learn the latent input representation automatically, by supervision with the output in regression tasks. This is the central idea of deep neural networks bengio2009learning, especially convolutional neural networks (CNNs) krizhevsky2012imagenet, which stack (deeper) layers of linear convolutions with nonlinear activations to automatically extract multi-scale features or concepts from high-dimensional input zeiler2014visualizing, thus alleviating hand-crafted feature engineering, such as searching for the right set of basis functions or relying on expert knowledge.

The general perception about using deep neural networks lecun2015deep ; hinton2006fast in the context of surrogate modeling is that physical problems in uncertainty quantification (UQ) are not big-data problems and are thus not suitable for deep learning approaches. However, we argue otherwise, in the sense that each simulation run generates a large amount of data which potentially reveals the essential characteristics of the underlying system. In addition, even for relatively small datasets, deep neural networks show a unique generalization property zhang2016understanding ; dziugaite2017computing. They are typically over-parameterized models (hundreds or thousands of times more parameters than training data), yet they do not overfit, i.e. the test error does not grow as the number of network parameters increases. Deep learning has been explored as a competitive methodology across fields such as fluid mechanics kutz2017deep ; DBLP:journals/corr/abs-1711-04315, hydrology marccais2017prospective, bioinformatics min2017deep, high energy physics baldi2014searching, and others.

This unique generalization behavior makes it possible to use deep neural networks for surrogate modeling. They are capable of capturing the complex nonlinear mapping between high-dimensional input and output due to their expressiveness raghu2016expressive, while using only a small number of simulation runs as training data. In addition, there has been a resurgence of interest in placing deep neural networks within a formal Bayesian framework. Bayesian deep learning mackay1992practical ; neal2012bayesian ; gal2016dropout ; blundell2015weight ; liu2016stein ; kingma2015variational ; hernandez2015probabilistic ; louizos2017multiplicative ; louizos2017bayesian enables the network to express the uncertainty in its predictions when trained with a small number of data. Bayesian neural networks quantify the predictive uncertainty by treating the network parameters as random variables and performing Bayesian inference on those uncertain parameters conditioned on the limited observations.

In this work, we mainly consider surrogate modeling of physical systems governed by stochastic partial differential equations with high-dimensional stochastic input, such as flow in random porous media CHAN2018493. The spatially discretized stochastic input field and the corresponding output fields are high-dimensional. We adopt an end-to-end image-to-image regression approach for this challenging surrogate modeling problem. More specifically, a fully convolutional encoder-decoder network is designed to capture the complex mapping directly from the high-dimensional input field to the output fields without using any explicit intermediate dimensionality reduction method. To make the model more parameter-efficient and compact, we use DenseNet to build the feature extractor within the encoder and decoder paths huang2016densely. Intuitively, the encoder network extracts the multi-scale features from the input data, which are used by the decoder network to reconstruct the output fields. In analogy with problems in computer vision, we treat the input-output map as an image-to-image map. To account for the limited training data and endow the network with uncertainty estimates, we further treat the convolutional encoder-decoder network as Bayesian and scale a recently proposed approximate inference method, Stein Variational Gradient Descent, to modern deep convolutional networks. We will show that the methodology can learn a Bayesian surrogate for problems of low intrinsic dimensionality, achieving promising results on both predictive accuracy and uncertainty estimates with remarkably few training data. More importantly, we develop a surrogate for the case where the stochastic input dimensionality reaches 4,225, again using limited training data. We also show that these uncertainty estimates are well-calibrated using a reliability diagram. In view of these results, we believe that Bayesian neural networks are strong candidates for surrogate modeling and uncertainty propagation in high-dimensional problems with limited training data.

The remainder of the paper is organized as follows: In Section 2, we present the problem setup for surrogate modeling with high-dimensional input and the proposed approach of treating it as an image regression problem. We then introduce CNNs and the encoder-decoder network used in our model. In Section 3, we present the Bayesian formulation of neural networks and a non-parametric variational method for the underlying challenging approximate inference task. In Section 4, we provide implementation details and present experiments on a porous media flow problem. We finally conclude and discuss various unexplored research directions in Section 5.

2 Methodology

2.1 Surrogate Modeling as Image-to-Image Regression

The physical systems considered here are modeled by stochastic PDEs (SPDEs) with solutions $y(s, x(s, \omega))$, i.e. the model response at the spatial location $s \in \mathcal{S}$ for one realization $x(s, \omega)$ of the random input field, where $\mathcal{S}$ is the index set (spatial domain) and $\omega$ an element of the sample space $\Omega$. The formulation allows for multiple input channels, even though our interest here is in one input property represented as a vector random field. This random field appears in the coefficients of the SPDEs and is used to model material properties, such as the permeability or porosity fields in geological media flows. We assume the computer simulation for the physical system is performed over a given set of $n_s$ spatial grid locations (e.g. mesh nodes in finite element methods). In this case, the random field is discretized over the fixed grid and is thus equivalent to a high-dimensional random vector, denoted as $\mathbf{x}$. The corresponding response is solved over the same grid and can thus be represented as a vector $\mathbf{y}$.

With the discretization described above, and assuming for simplicity fixed boundary and initial conditions and source terms as appropriate, we consider the computer simulation as a black-box mapping of the form:

$$\mathbf{y} = \boldsymbol{\eta}(\mathbf{x}). \qquad (1)$$

In order to tackle the limitations of using the deterministic, computationally expensive simulator for uncertainty propagation, a surrogate model $f(\cdot, \boldsymbol{\theta})$ is trained using limited simulation data $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ to approximate the 'ground-truth' simulation-induced function $\boldsymbol{\eta}$, where $\boldsymbol{\theta}$ are the model parameters and $N$ is the number of simulation runs (number of training simulation-based data).

Let us consider that the PDEs governing the physical system described above are solved over a 2D regular grid of size $H \times W$, where $H$ and $W$ denote the number of grid points along the two axes of the spatial domain (height and width), so that $n_s = H W$. It is then very natural to organize the simulation data as an image dataset $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathbb{R}^{d_x \times H \times W}$ is one input field realization and $\mathbf{y}_i \in \mathbb{R}^{d_y \times H \times W}$ contains the simulated steady-state output fields discretized over the same grid. Here $d_x$ and $d_y$ are the number of dimensions of the input and the output at one location. These are treated herein as the number of channels in the input and output images, similar to the RGB channels in natural images. It is easy to generalize to a 3D spatial domain by adding an extra depth axis to the images, e.g. $\mathbf{x}_i \in \mathbb{R}^{d_x \times D \times H \times W}$ and $\mathbf{y}_i \in \mathbb{R}^{d_y \times D \times H \times W}$.
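The sketch below makes this data layout concrete; the array names and sizes are illustrative placeholders of our own (not from the paper), following the (N, channels, height, width) convention used by PyTorch.

```python
import numpy as np

# Hypothetical layout of a simulation dataset as image tensors.
N, H, W = 64, 65, 65        # number of runs and grid size (placeholder values)
d_x, d_y = 1, 3             # input channels (e.g. log-permeability), output channels (e.g. p, u_1, u_2)

x = np.zeros((N, d_x, H, W), dtype=np.float32)   # input field realizations
y = np.zeros((N, d_y, H, W), dtype=np.float32)   # simulated output fields

# A 3D problem would simply add a depth axis: (N, d_x, D, H, W) and (N, d_y, D, H, W).
```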

Therefore, we transform the surrogate modeling problem to an image-to-image regression problem, with the regression function as

$$\boldsymbol{\eta}: \mathbb{R}^{d_x \times H \times W} \rightarrow \mathbb{R}^{d_y \times H \times W}. \qquad (2)$$

In contrast to an image classification problem, which requires image-wise prediction, the image regression problem is concerned with pixel-wise predictions, e.g. predicting the depth of each pixel in an image or, in our physical problem, predicting the output fields at each grid point. Such problems have been intensively studied within the computer vision community by leveraging the rapid recent progress of convolutional neural networks (CNNs), such as AlexNet krizhevsky2012imagenet, VGG simonyan2014very, Inception szegedy2016rethinking, ResNet he2016deep, DenseNet huang2016densely, and many more. A common model design pattern for semantic segmentation ronneberger2015u or depth regression eigen2014depth is the encoder-decoder architecture. The intuition behind regression between two high-dimensional objects is to go through a coarse-refine process, i.e. to reduce the spatial dimension of the input image to high-level coarse features using an encoder, and then to recover the spatial dimension by refining the coarse features through a decoder. One of the characteristics shared by those vision tasks is that the input and output images share an underlying structure, or they are different renderings of the same underlying structure pix2pix2016. For our surrogate modeling tasks, however, the input and output images appear to be quite different, due to the complex physical influences (defined by PDEs) of the random input field, forcing terms and boundary conditions on the system response. This is, after all, the reason for pursuing a trained surrogate model that avoids the repeated solution of the PDEs for different input realizations. Surprisingly, as we will discuss later in this paper, the encoder-decoder network still works very well.

The training data for the surrogate model of interest here include the realizations of the random input field and the corresponding multi-output fields obtained from simulation. Of interest is to address problems with limited training data sets, considering the high computational cost of each simulation run. However, note that key UQ tasks include not only the ability to predict the system response and our confidence in it for input realizations (testing dataset) consistent with the given training data, but also the computation of the statistics of the response induced by the random input. Both of these tasks require the availability of a large number of input realizations (e.g. a sizable testing dataset and an even larger set of input samples for the Monte Carlo calculation of the output statistics). Generating more input realizations using only the training dataset amounts to solving a generative modeling problem. There has been significant progress in recent years on this topic, such as generative adversarial networks (GANs) goodfellow2014generative and their ever-growing variants, variational auto-encoders (VAEs) kingma2013auto, autoregressive models like PixelCNN van2016conditional and PixelRNN oord2016pixel, and others. However, note that in this work our focus is on the image-to-image mapping and its performance on uncertainty quantification tasks. We will thus assume that enough input samples are provided both for testing and for the output statistics calculation, even though only a small dataset will be used for training. In our examples in Section 4, synthetic log-permeability datasets are generated by sampling a Gaussian random field with an exponential kernel. The output for each permeability sample is generated using a deterministic simulator.

2.2 Dense Convolutional Encoder-Decoder Networks

In this subsection, we briefly introduce a state-of-the-art CNN architecture called DenseNet huang2016densely and fully convolutional encoder-decoder networks developed in computer vision, and then present how to utilize these advances to build our baseline network for surrogate modeling in uncertainty quantification.

2.2.1 Densely Connected Convolutional Networks

DenseNet huang2016densely is a recently proposed CNN architecture which extends the ideas of ResNet he2016deep and Highway Networks srivastava2015training to create dense connections between all layers, so as to improve the information (gradient) flow through the network for better parameter efficiency.

Let $\mathbf{z}_\ell$ be the output of the $\ell$-th layer. Traditionally, CNNs pass the output of one layer only to the input of the next layer, i.e. $\mathbf{z}_\ell = h_\ell(\mathbf{z}_{\ell-1})$, where $h_\ell$ denotes the nonlinear function of the $\ell$-th hidden layer. In current CNNs, $h_\ell$ is commonly defined as a composition of Batch Normalization ioffe2015batch (BatchNorm), Rectified Linear Unit glorot2011deep (ReLU) and Convolution (Conv) or transposed convolution (ConvT) 2016arXiv160502688short. ResNets he2016deep create an additional identity mapping that bypasses the nonlinear layer, i.e. $\mathbf{z}_\ell = h_\ell(\mathbf{z}_{\ell-1}) + \mathbf{z}_{\ell-1}$. In this way, the nonlinear layer only needs to learn a residual function, which facilitates the training of deeper networks.

DenseNets huang2016densely introduce connections from any layer to all subsequent layers, i.e. $\mathbf{z}_\ell = h_\ell\big([\mathbf{z}_{\ell-1}, \mathbf{z}_{\ell-2}, \ldots, \mathbf{z}_0]\big)$. To put it another way, the input features of one layer are concatenated to the output features of this layer, and this serves as the input features to the next layer. Assume that the input image has $K_0$ channels and that each layer outputs $K$ feature maps; then the $\ell$-th layer has an input with $K_0 + (\ell - 1) K$ feature maps, i.e. the number of feature maps in DenseNet grows linearly with the depth. $K$ is here referred to as the growth rate.

For image regression based on encoder-decoder networks, downsampling and upsampling are required to change the size of the feature maps, which makes concatenation of feature maps infeasible across resolutions. Dense blocks and transition layers are introduced to solve this problem and to modularize the network design. A dense block contains multiple densely connected layers whose input and output feature maps are of the same size. It has two design parameters, namely the number of layers $L$ within the block and the growth rate $K$ of each layer. An illustration of the dense block is given in Fig. 1.

(a) Dense block.
(b) The second layer.
Figure 1: (a) A dense block containing $L$ layers with growth rate $K$. (b) The second layer of the dense block, with output feature maps $\mathbf{z}_2$. Notice that the input to the third layer is the concatenation of the output and input features of the second layer. As is often the case, each layer is composed of Batch Normalization ioffe2015batch (BatchNorm), Rectified Linear Unit glorot2011deep (ReLU) and Convolution (Conv). The convolution kernel has kernel size 3, stride 1 and zero padding 1, which keeps the size of the feature maps the same as that of the input.
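As a concrete illustration, the following PyTorch sketch implements a dense block of this form. The class names and layer sizes are our own choices for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Sequential):
    """BatchNorm -> ReLU -> 3x3 Conv (stride 1, padding 1), which preserves the spatial size."""
    def __init__(self, in_features, growth_rate):
        super().__init__(
            nn.BatchNorm2d(in_features),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_features, growth_rate, kernel_size=3, stride=1, padding=1, bias=False),
        )

class DenseBlock(nn.Module):
    """L densely connected layers; the input of each layer is the concatenation of all
    preceding feature maps, so the channel count grows by the growth rate K per layer."""
    def __init__(self, in_features, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_features + i * growth_rate, growth_rate) for i in range(num_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # append new features to the running stack
        return x

# Example: 48 input feature maps, growth rate 16, 3 layers -> 48 + 3 * 16 = 96 output maps.
block = DenseBlock(in_features=48, growth_rate=16, num_layers=3)
out = block(torch.randn(1, 48, 16, 16))
assert out.shape[1] == 96
```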

Transition layers are used to change the size of the feature maps and to reduce their number between dense blocks. More specifically, the encoding layer typically halves the size of the feature maps, while the decoding layer doubles the feature map size. Both layers reduce the number of feature maps. This is illustrated in Fig. 2.

(a) Encoding layer.
(b) Decoding layer.
Figure 2: Both the (a) encoding layer and (b) decoding layer contain two convolutions. The first convolution reduces the number of feature maps while keeping their size unchanged (a 1x1 kernel with stride 1 and no padding); the second convolution changes the size of the feature maps but not their number (a 3x3 kernel with stride 2). The main difference between (a) and (b) is the type of the second convolution layer, which is Conv and ConvT respectively, for downsampling and upsampling. Note that no pooling is used at transition layers in order to maintain location information. The colored feature maps used here are independent from the feature maps with the same color shown in other figures.

2.2.2 Fully Convolutional Networks

The fully convolutional networks (FCNs) long2015fully are extensions of CNNs for pixel-wise prediction, e.g. semantic segmentation. FCNs replace the fully connected layers in CNNs with convolution layers, add upsampling layers at the end to recover the input spatial resolution, and introduce skip connections between feature maps in the downsampling and upsampling paths to recover finer information lost in the downsampling path. Most of the recent work focuses on improving the upsampling path and increasing the connectivity within and between the upsampling and downsampling paths. U-nets ronneberger2015u extend the upsampling path to be symmetric to the downsampling path and add skip connections between feature maps of each size in the downsampling and upsampling paths. Within SegNets badrinarayanan2015segnet, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. Fully convolutional DenseNets jegou2017one extend DenseNets to FCNs, and are closest to our network design, with several differences. We keep all the feature maps of a dense block concatenated before passing them to the transition layers, while they keep only the output feature maps of the last convolution layer within the dense block; the feature-map explosion problem is instead addressed by the first convolution layer within the transition layer. Besides that, we do not use skip connections between the encoding and decoding paths because of the weak correspondence between the input and output images. We also do not use max-pooling for the encoding layers; instead we use convolution with stride 2.

2.3 Network architecture: DenseED

We follow the fully convolutional network (FCN) approach long2015fully for image segmentation without using any fully connected layers, and an encoder-decoder architecture similar to U-net ronneberger2015u and SegNet badrinarayanan2015segnet but without the concatenation of feature maps between the encoder and decoder paths. Furthermore, we adapt the DenseNet huang2016densely structure into the encoder and decoder networks. After extensive hyperparameter and architecture search, we arrived at a baseline dense convolutional encoder-decoder network, called DenseED, similar to the network proposed in jegou2017one but with the noticeable differences stated above, as shown in Fig. 3.

Figure 3: Network architecture: DenseED.

In the encoding path, the input field realizations are fed into the first convolution layer with a large kernel of size 7, stride 2 and zero padding 2 (see Table 1). The extracted feature maps are then passed through an alternating cascade of dense blocks and encoding layers as introduced in Figs. 1 and 2. The dense block after the last encoding layer outputs the high-level coarse feature maps extracted from the input, shown in purple at the right end of the network in Fig. 3, which are subsequently fed into the decoder path. The decoding network consists of an alternation of dense blocks and decoding layers, with the last decoding layer directly producing the prediction of the output fields.

2.4 Network architecture engineering and hyperparameter search

Network architecture engineering and hyperparameter search are among the main challenges and sources of innovation in deep learning; they are mostly problem-specific and empirical. The general network architecture was introduced in Section 2.2 and is based on recent developments in neural network design for image segmentation. For our image regression problem, the main design considerations include the following:

  • Downsampling layers: convolution or pooling;

  • Upsampling layers: bilinear upsampling or transposed convolution;

  • Smallest spatial dimensions of feature maps: this is determined by the number of downsampling layers;

  • Whether or not to add skip connections between the encoding and decoding paths;

  • Convolution kernel parameters: kernel size $k$, stride $s$, zero padding $p$;

  • Number of layers and growth rate within each dense block;

  • Regularizations: weight decay, batch normalization, dropout, etc;

  • Optimizer: stochastic gradient descent algorithms and their variants, such as Adam, Adagrad, RMSprop, and others;

  • Training hyperparameters: batch size, learning rate and its scheduler.

The details of the architecture search and hyperparameter selection for the particular problem considered are presented in Appendix A, where we also report various experiments using DenseED for surrogate modeling with limited training data. No overfitting was observed in our calculations and the obtained results were quite good. This is an intriguing and active research topic in the deep learning community zhang2016understanding.

In our non-Bayesian calculations, we have considered an $L_1$- or $L_2$-regularized MSE training loss function. Given an input image $\mathbf{x}_i$, a target image $\mathbf{y}_i$ and the prediction $f(\mathbf{x}_i, \boldsymbol{\theta})$, the regularized MSE loss is

$$L(\boldsymbol{\theta}) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{n} \,\big\lVert f(\mathbf{x}_i, \boldsymbol{\theta}) - \mathbf{y}_i \big\rVert_2^2 + \alpha\, \Omega(\boldsymbol{\theta}), \qquad (3)$$

where the penalty function is $\Omega(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_2^2$ for $L_2$ regularization and $\Omega(\boldsymbol{\theta}) = \lVert \boldsymbol{\theta} \rVert_1$ for $L_1$ regularization, and $n$ is the number of pixels in all channels of one output image. $\boldsymbol{\theta}$ denotes all the parameters in the network; for our case, this includes the kernel weights in all the convolution and transposed convolution layers (no bias is used in the convolutional kernels) and the scale and shift parameters in all the batch normalization layers. See Section 4.3 for an example of the fully convolutional encoder-decoder network used for the Darcy flow problem. Note that $L_2$ regularization is implemented in PyTorch optimizers by specifying the weight decay, which plays the role of the regularization coefficient $\alpha$ in Eq. (3).
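A minimal sketch of this training step is given below, assuming the loss of Eq. (3) with $L_2$ regularization; the stand-in model, data shapes and hyperparameter values are placeholders, not the settings reported in Section 4.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(1, 3, kernel_size=3, padding=1)   # stand-in for the DenseED network
criterion = nn.MSELoss()                             # pixel-wise MSE over all output channels
# L2 regularization enters through the optimizer's weight_decay argument, which
# (up to a constant factor) adds alpha * ||theta||_2^2 to the objective of Eq. (3).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=5e-5)

x = torch.randn(8, 1, 65, 65)    # batch of input field realizations
y = torch.randn(8, 3, 65, 65)    # corresponding simulated output fields

optimizer.zero_grad()
loss = criterion(model(x), y)    # MSE part of Eq. (3); weight decay is handled by the optimizer
loss.backward()
optimizer.step()
```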

The network architecture selected for the non-Bayesian model is the same as that used for our Bayesian model introduced next.

In the encoder-decoder network, the batch normalization layer used after each convolutional layer can also be considered an effective regularizer (see https://openreview.net/forum?id=BJlrSmbAZ&noteId=BJlrSmbAZ). It is commonly adopted nowadays in deep convolutional networks (e.g. https://github.com/pytorch/vision/tree/master/torchvision/models), replacing dropout (https://www.reddit.com/r/MachineLearning/comments/5l3f1c/d_what_happened_to_dropout/).

3 Bayesian Neural Networks

Consider a deterministic neural net with input $\mathbf{x}$, output $\mathbf{y}$, and all parameters $\mathbf{w}$ including the weights and biases (see http://pytorch.org/docs/master/nn.html?ht=conv2d#torch.nn.functional.conv2d). Bayesian neural networks (BNNs) treat the parameters as random variables instead of deterministic unknowns, to account for the epistemic uncertainty induced by the lack of training data. Besides that, additive noise is usually introduced to model the aleatoric uncertainty, which cannot be reduced by having more observations, and also to give the probabilistic model an explicit likelihood depending on the noise distribution, i.e.

$$\mathbf{y} = f(\mathbf{x}, \mathbf{w}) + \mathbf{n}, \qquad (4)$$

where $f(\mathbf{x}, \mathbf{w})$ is the output of the neural network with the uncertain parameters $\mathbf{w}$, and $\mathbf{n}$ is the additive noise.

3.1 Sparsity inducing prior on weights

Given the large number of 'un-interpretable' parameters in a deep neural net, there are not many choices of priors. However, the demand for compression han2015deep ; louizos2017bayesian of the neural net for lower memory and computation cost calls for sparsity-promoting priors. We assume a fully factorized Gaussian prior with zero mean and Gamma-distributed precision $\alpha$ on the parameters $\mathbf{w}$:

$$p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1} \mathbf{I}), \qquad p(\alpha) = \mathrm{Gamma}(\alpha \mid a_0, b_0). \qquad (5)$$

This results in a Student's t-prior for $\mathbf{w}$, which has heavy tails and more mass close to zero.

3.2 Additive Noise Model

Additive noise can be considered in the following forms:

  • Output-wise: a single noise scale shared by all output pixels;

  • Channel-wise: one noise scale per output channel/field;

  • Pixel-wise: a distinct noise scale for each output pixel.

In the first two cases the noise scales are scalars, while in the pixel-wise case the scale is a field with the same dimension as the output, combined with the noise through the element-wise product operator. In this work, we have considered both Gaussian noise and Laplacian noise (see http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/VELDHUIZEN/node11.html). In the numerical results discussed in Section 4, we concentrate on the output-wise and channel-wise cases above. We treat the noise precision $\beta$ as a random variable with a conjugate Gamma prior. For the generated training data, we assume a priori that the noise variance (also known as the nugget gramacy2012cases) is very small; this assumption provides a good initial guess for the prior hyperparameters.

We can also model the noise as varying with the input (pixel-wise noise model), resulting in a heteroscedastic noise model; again the noise can be Gaussian or Laplacian. Heteroscedastic noise nix1994estimating ; le2005heteroscedastic ; kendall2017uncertainties can be implemented by extending the output of the neural net as

$$f(\mathbf{x}, \mathbf{w}) = \big[f_{\mu}(\mathbf{x}, \mathbf{w}),\; f_{\sigma^2}(\mathbf{x}, \mathbf{w})\big], \qquad (6)$$

so that the output of the system becomes $\mathbf{y} = f_{\mu}(\mathbf{x}, \mathbf{w}) + \boldsymbol{\sigma}(\mathbf{x}) \odot \mathbf{n}$, with the pixel-wise variance $\boldsymbol{\sigma}^2(\mathbf{x})$ given by the second part of the output. The pixel-wise case may help capture large variations, for example near discontinuous regions of the output. In practice, we apply a softplus transformation to the second part of the output of the neural net to enforce the positive variance constraint, i.e. $\boldsymbol{\sigma}^2 = \log\big(1 + \exp(f_{\sigma^2})\big) + \epsilon$, where a small $\epsilon$ is added for numerical stability.
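A minimal sketch of this output split and softplus transform is shown below; the helper name and the epsilon value are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def split_heteroscedastic_output(net_out, eps=1e-6):
    """Split a 2*d_y-channel network output into a mean field and a strictly
    positive variance field via a softplus transform, as described above."""
    d_y = net_out.shape[1] // 2
    mean = net_out[:, :d_y]                          # f_mu(x, w)
    variance = F.softplus(net_out[:, d_y:]) + eps    # softplus enforces positivity; eps for stability
    return mean, variance
```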

3.3 Stein Variational Gradient Descent (SVGD)

Approximate inference for Bayesian deep neural networks is a daunting task because of the large number of uncertain parameters, e.g. tens or hundreds of millions in modern deep networks. In our surrogate problem, the task is to find a high-dimensional posterior distribution over millions of random variables using at most hundreds or thousands of training data. As reviewed in Section 1, most variational inference methods blei2017variational restrict the approximate posterior to a certain parametric variational family, while sampling-based methods are slow and difficult to converge. Here we adopt a recently proposed non-parametric variational inference method called Stein Variational Gradient Descent (SVGD) liu2016stein ; liu2017stein, which is similar to standard gradient descent while maintaining the efficiency of particle methods.

For a prescribed probabilistic model with likelihood function $p(\mathcal{D} \mid \boldsymbol{\theta})$ and prior $p_0(\boldsymbol{\theta})$, we are interested in Bayesian inference of the uncertain parameters $\boldsymbol{\theta}$, i.e. in finding the posterior distribution $p(\boldsymbol{\theta} \mid \mathcal{D})$, where $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{N}$ denotes the i.i.d. observations (training data). For the BNN with homoscedastic Gaussian noise, $\boldsymbol{\theta} = (\mathbf{w}, \beta)$. Variational inference aims to approximate the target posterior distribution with a variational distribution $q(\boldsymbol{\theta})$ that lies in a restricted set of distributions $\mathcal{Q}$ by minimizing the KL divergence between the two, i.e.

$$q^{*} = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q \,\Vert\, p(\cdot \mid \mathcal{D})\big) = \arg\min_{q \in \mathcal{Q}} \Big\{ \mathbb{E}_{q}\big[\log q(\boldsymbol{\theta})\big] - \mathbb{E}_{q}\big[\log \tilde{p}(\boldsymbol{\theta})\big] + \log Z \Big\},$$

where $\tilde{p}(\boldsymbol{\theta}) = p(\mathcal{D} \mid \boldsymbol{\theta})\, p_0(\boldsymbol{\theta})$ is the unnormalized posterior and $Z$ is the normalization constant or model evidence, which is usually computationally intractable but can be ignored when we optimize the KL divergence.

The variational family considered here is a set of distributions obtained by smooth transforms of an initial tractable distribution (e.g. the prior) represented in terms of particles. The transform applied to each particle takes the following form:

$$T(\boldsymbol{\theta}) = \boldsymbol{\theta} + \epsilon\, \boldsymbol{\phi}(\boldsymbol{\theta}), \qquad (7)$$

where $\epsilon$ is the step size and $\boldsymbol{\phi}$ is the perturbation direction within a function space $\mathcal{F}$. When $\epsilon$ is small, $T$ transforms the initial density $q$ into the pushforward density $q_{[\epsilon \boldsymbol{\phi}]}$. Instead of using a parametric form for the variational posterior, a particle approximation is used, i.e. a set of particles $\{\boldsymbol{\theta}^{i}\}_{i=1}^{S}$ with its empirical measure. We would like this empirical measure to weakly converge to the measure of the true posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$. We apply the transform $T$ to those particles and consider the corresponding pushforward measure.

The problem is to find the direction that maximally decreases the KL divergence between the variational approximation and the target distribution, i.e. to solve the following functional optimization problem:

$$\boldsymbol{\phi}^{*} = \arg\max_{\boldsymbol{\phi} \in \mathcal{F}} \Big\{ -\frac{d}{d\epsilon} \mathrm{KL}\big(q_{[\epsilon\boldsymbol{\phi}]} \,\Vert\, p(\cdot \mid \mathcal{D})\big) \Big|_{\epsilon=0} \Big\}. \qquad (8)$$

It turns out that liu2016stein

$$-\frac{d}{d\epsilon} \mathrm{KL}\big(q_{[\epsilon\boldsymbol{\phi}]} \,\Vert\, p(\cdot \mid \mathcal{D})\big) \Big|_{\epsilon=0} = \mathbb{E}_{\boldsymbol{\theta} \sim q}\big[\mathrm{trace}\big(\mathcal{A}_{p}\, \boldsymbol{\phi}(\boldsymbol{\theta})\big)\big], \qquad (9)$$

where $\mathcal{A}_{p}$ is called the Stein operator associated with the distribution $p$,

$$\mathcal{A}_{p}\, \boldsymbol{\phi}(\boldsymbol{\theta}) = \boldsymbol{\phi}(\boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{\theta} \mid \mathcal{D})^{T} + \nabla_{\boldsymbol{\theta}} \boldsymbol{\phi}(\boldsymbol{\theta}).$$

The expectation in Eq. (9) evaluates the difference between $p$ and $q$, and its maximum is defined as the Stein discrepancy,

$$\mathbb{S}(q, p) = \max_{\boldsymbol{\phi} \in \mathcal{F}}\; \mathbb{E}_{\boldsymbol{\theta} \sim q}\big[\mathrm{trace}\big(\mathcal{A}_{p}\, \boldsymbol{\phi}(\boldsymbol{\theta})\big)\big]. \qquad (10)$$

It has been shown liu2016kernelized that when the functional space $\mathcal{F}$ is chosen to be the unit ball of a product reproducing kernel Hilbert space with positive kernel $k(\cdot, \cdot)$, the maximal direction to perturb (attaining the Stein discrepancy) has a closed-form solution,

$$\boldsymbol{\phi}^{*}(\boldsymbol{\theta}) \propto \mathbb{E}_{\boldsymbol{\theta}' \sim q}\big[k(\boldsymbol{\theta}', \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}'} \log p(\boldsymbol{\theta}' \mid \mathcal{D}) + \nabla_{\boldsymbol{\theta}'} k(\boldsymbol{\theta}', \boldsymbol{\theta})\big].$$

Thus we have the following algorithm to transform an initial distribution to the target posterior $p(\boldsymbol{\theta} \mid \mathcal{D})$.

Input: A set of initial particles $\{\boldsymbol{\theta}^{i}_{0}\}_{i=1}^{S}$, score function $\nabla_{\boldsymbol{\theta}} \log \tilde{p}(\boldsymbol{\theta})$, kernel $k(\cdot, \cdot)$, step-size scheme $\{\epsilon_{t}\}$
Result: A set of particles $\{\boldsymbol{\theta}^{i}\}_{i=1}^{S}$ that approximate the target posterior
for iteration $t$ do
       $\boldsymbol{\theta}^{i}_{t+1} \leftarrow \boldsymbol{\theta}^{i}_{t} + \epsilon_{t}\, \hat{\boldsymbol{\phi}}^{*}(\boldsymbol{\theta}^{i}_{t})$, with $\hat{\boldsymbol{\phi}}^{*}(\boldsymbol{\theta}) = \frac{1}{S} \sum_{j=1}^{S} \big[ k(\boldsymbol{\theta}^{j}_{t}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}^{j}_{t}} \log \tilde{p}(\boldsymbol{\theta}^{j}_{t}) + \nabla_{\boldsymbol{\theta}^{j}_{t}} k(\boldsymbol{\theta}^{j}_{t}, \boldsymbol{\theta}) \big]$;

end for
Algorithm 1 Bayesian inference by Stein Variational Gradient Descent.

This is a one-line algorithm, in which the particles are pushed towards the high-posterior region by the kernel-smoothed gradient term $k(\boldsymbol{\theta}^{j}, \boldsymbol{\theta})\, \nabla_{\boldsymbol{\theta}^{j}} \log \tilde{p}(\boldsymbol{\theta}^{j})$, while a degree of diversity is maintained by the repulsive force term $\nabla_{\boldsymbol{\theta}^{j}} k(\boldsymbol{\theta}^{j}, \boldsymbol{\theta})$. When the number of particles reduces to one, the algorithm reduces to the MAP estimate of the posterior. This algorithm is implemented in PyTorch with GPU acceleration and scaled to our deep convolutional encoder-decoder network DenseED.
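To make the update concrete, the following is a minimal, self-contained sketch of one SVGD step acting on a matrix of S particles (one flattened parameter vector per row). It is our own illustration, assuming an RBF kernel with a median-heuristic bandwidth, and is not the authors' released implementation.

```python
import torch

def rbf_kernel(theta):
    """RBF kernel matrix over particles and the summed kernel gradients,
    with the median heuristic for the bandwidth."""
    sq_dist = torch.cdist(theta, theta) ** 2                     # (S, S) squared pairwise distances
    h = sq_dist.median() / (2.0 * torch.log(torch.tensor(theta.shape[0] + 1.0)))
    K = torch.exp(-sq_dist / (2.0 * h))
    # sum over j of grad_{theta_j} k(theta_j, theta_i), for each particle i
    grad_K = -(K.unsqueeze(2) * (theta.unsqueeze(1) - theta.unsqueeze(0))).sum(dim=0) / h
    return K, grad_K

def svgd_step(theta, grad_log_p, step_size=1e-3):
    """One SVGD update: theta is (S, d); grad_log_p holds the score
    grad_theta log p~(theta) evaluated at each particle, same shape."""
    S = theta.shape[0]
    K, grad_K = rbf_kernel(theta)
    phi = (K @ grad_log_p + grad_K) / S      # kernel-smoothed gradient + repulsive term
    return theta + step_size * phi
```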

Here we use a toy example to illustrate the idea of SVGD. We start with particles drawn from a Normal distribution and transport the particles iteratively to a target Gaussian mixture distribution with Algorithm 1.

Figure 4: Example of transporting particles from an initial Normal distribution to a Gaussian mixture with SVGD; panels (a)-(e) show the particle density at increasing iteration counts. The green dashed curve represents the target Gaussian mixture.

3.3.1 Implementation and Training

  • We use $S$ samples of $\boldsymbol{\theta}$ to approximate the empirical measure of the posterior. They are initialized and stored as $S$ deterministic DenseED neural networks.

  • At each step, for each model, the gradient of its joint likelihood (or unnormalized posterior) is computed by the automatic differentiation tool in PyTorch. We first compute the joint likelihood by feeding the data forward, and then back-propagate to compute its gradient. Note that this gradient is stored in the PyTorch modules associated with the weights of each network.

  • Then we proceed to compute the kernel matrix and its gradient, as well as the kernel-weighted gradient of the joint likelihood. For this to happen, we need to vectorize (extract) the parameters and the computed gradients from each neural network. The optimal perturbation direction is then computed as the sum of the two terms, as shown in the SVGD algorithm (a minimal sketch of this bookkeeping is given after this list).

  • After the perturbation direction is computed, we send it back to each neural network as the gradient w.r.t. its parameters (overwriting the previously computed gradients), and update the parameters locally in each neural network using PyTorch's optimization library, such as Adam or SGD.

  • With one iteration complete, the algorithm repeats the above steps until convergence in the parameters is achieved.
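One way to realize the vectorization and scatter-back steps above is sketched below; this is our own glue code for illustration, not the paper's implementation. `models` is assumed to be a list of S DenseED replicas whose gradients have already been populated by back-propagation, and `phi` the (S, d) SVGD direction computed, e.g., with the `svgd_step` sketch of Section 3.3.

```python
import torch
from torch.nn.utils import parameters_to_vector

def gather_particles_and_grads(models):
    """Flatten each replica's parameters and gradients into rows of (S, d) matrices."""
    theta = torch.stack([parameters_to_vector(m.parameters()) for m in models])
    grads = torch.stack([torch.cat([p.grad.reshape(-1) for p in m.parameters()]) for m in models])
    return theta, grads

def scatter_update_direction(model, direction):
    """Overwrite each parameter's .grad with its slice of the SVGD direction, so that a
    standard optimizer (e.g. Adam) can apply the update locally. The sign is flipped
    because optimizers descend while SVGD ascends the (unnormalized) posterior."""
    pointer = 0
    for p in model.parameters():
        n = p.numel()
        p.grad = -direction[pointer:pointer + n].view_as(p).clone()
        pointer += n

# Usage: for each model m and row of phi, call scatter_update_direction(m, row),
# then step that model's own optimizer to apply the particle update.
```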

3.4 Uncertainty Quantification

Of interest to the classical UQ problem is the computation of the posterior predictive distribution (predict the system response for a test input) as well as the computation of the output response averaged over the input probability distribution. In particular, we are interested in computing the following:

  • Predictive uncertainty at a test input $\mathbf{x}$: the predictive distribution $p(\mathbf{y} \mid \mathbf{x}, \mathcal{D})$, and in particular its moments $\mathbb{E}[\mathbf{y} \mid \mathbf{x}, \mathcal{D}]$ and $\mathrm{Var}[\mathbf{y} \mid \mathbf{x}, \mathcal{D}]$.

  • Uncertainty propagated to the system response by integrating over the input distribution: the statistics of the output over the random input, and in particular the output mean and variance fields. One can further compute the statistics of these conditional output statistics, e.g. the mean and variance (over the surrogate's posterior) of the conditional predictive mean and of the conditional predictive variance.

We can use Monte Carlo to approximate the moments of the predictive distribution

$$p(\mathbf{y} \mid \mathbf{x}, \mathcal{D}) = \int p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w}, \beta \mid \mathcal{D})\, d\mathbf{w}\, d\beta, \qquad (11)$$

with mean (by the law of total expectation)

$$\mathbb{E}[\mathbf{y} \mid \mathbf{x}, \mathcal{D}] = \mathbb{E}_{p(\mathbf{w} \mid \mathcal{D})}\big[f(\mathbf{x}, \mathbf{w})\big] \approx \frac{1}{S} \sum_{i=1}^{S} f(\mathbf{x}, \mathbf{w}^{i}). \qquad (12)$$

Note that the SVGD algorithm provides a sample representation of the joint posterior of all parameters $(\mathbf{w}, \beta)$. To obtain the samples of the marginal posterior $p(\mathbf{w} \mid \mathcal{D})$ as needed above, one simply uses the samples corresponding to $\mathbf{w}$.

The predictive covariance can also be easily calculated using the law of total variance. The variance and expectation below are w.r.t. the posterior of the parameters. We can show the following:

$$\mathrm{Cov}[\mathbf{y} \mid \mathbf{x}, \mathcal{D}] = \mathbb{E}\big[\beta^{-1}\big]\, \mathbf{I} + \mathrm{Cov}\big[f(\mathbf{x}, \mathbf{w})\big], \qquad (13)$$

where the first term is the noise covariance averaged over the posterior of $\beta$. The predictive variance is the diagonal of the predictive covariance:

$$\mathrm{Var}[\mathbf{y} \mid \mathbf{x}, \mathcal{D}] = \mathbb{E}\big[\beta^{-1}\big]\, \mathbf{1} + \mathbb{E}\big[f^{2}(\mathbf{x}, \mathbf{w})\big] - \big(\mathbb{E}\big[f(\mathbf{x}, \mathbf{w})\big]\big)^{2}, \qquad (14)$$

where $\mathbf{1}$ is a vector of ones with the same dimension as $\mathbf{y}$, and the square is applied element-wise to the vectors.
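The sketch below shows how these moments can be estimated from the SVGD samples; function and variable names are ours, and the shapes assume a single test input in batch form.

```python
import torch

def predictive_moments(models, noise_precisions, x):
    """Monte Carlo estimates of the predictive mean and variance at a test input x
    from S posterior samples (one network per sample) and the corresponding noise
    precisions beta, following Eqs. (12) and (14) (law of total variance)."""
    with torch.no_grad():
        preds = torch.stack([m(x) for m in models])      # (S, 1, d_y, H, W) for a batch of one
    mean = preds.mean(dim=0)                             # approximates E[f(x, w)]
    noise_var = (1.0 / torch.as_tensor(noise_precisions)).mean()   # approximates E[beta^-1]
    var = noise_var + preds.pow(2).mean(dim=0) - mean.pow(2)       # total predictive variance
    return mean, var
```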

The above computation is the prediction at a specific input $\mathbf{x}$. We would also like to compute the average prediction over the distribution of the uncertain input. We first compute the output statistics given one realization of the uncertain surrogate parameters $(\mathbf{w}, \beta)$, using input samples $\{\mathbf{x}_{j}\}_{j=1}^{M}$ drawn from the input distribution. The conditional predictive mean is

$$\mathbb{E}[\mathbf{y} \mid \mathbf{w}, \beta] \approx \frac{1}{M} \sum_{j=1}^{M} f(\mathbf{x}_{j}, \mathbf{w}), \qquad (15)$$

and the conditional predictive covariance is

$$\mathrm{Cov}[\mathbf{y} \mid \mathbf{w}, \beta] \approx \beta^{-1} \mathbf{I} + \frac{1}{M} \sum_{j=1}^{M} f(\mathbf{x}_{j}, \mathbf{w})\, f(\mathbf{x}_{j}, \mathbf{w})^{T} - \mathbb{E}[\mathbf{y} \mid \mathbf{w}, \beta]\; \mathbb{E}[\mathbf{y} \mid \mathbf{w}, \beta]^{T}. \qquad (16)$$

Also, the conditional predictive variance (at each spatial location) is

$$\mathrm{Var}[\mathbf{y} \mid \mathbf{w}, \beta] \approx \beta^{-1} \mathbf{1} + \frac{1}{M} \sum_{j=1}^{M} f^{2}(\mathbf{x}_{j}, \mathbf{w}) - \big(\mathbb{E}[\mathbf{y} \mid \mathbf{w}, \beta]\big)^{2}, \qquad (17)$$

where $\mathbf{1}$ is a vector of ones with the same dimension as $\mathbf{y}$, and the square operator is applied element-wise to the vectors. Then we can further compute the statistics of the above conditional statistics due to the uncertainty in the surrogate, i.e. over the posterior samples $\{(\mathbf{w}^{i}, \beta^{i})\}_{i=1}^{S}$: the sample means and sample variances, in each output dimension, of the conditional predictive mean and of the conditional predictive variance.

4 Numerical Implementation and Results

We study the two-dimensional, single-phase, steady-state flow through a random permeability field, following the case study in bilionis2013multi. Consider the random permeability field $K(\mathbf{s})$ on the unit square spatial domain $\mathcal{S} = [0, 1]^2$; the pressure field $p(\mathbf{s})$ and velocity field $\mathbf{u}(\mathbf{s})$ of the fluid through the porous medium are governed by Darcy's law:

$$\mathbf{u}(\mathbf{s}) = -K(\mathbf{s})\, \nabla p(\mathbf{s}), \qquad \nabla \cdot \mathbf{u}(\mathbf{s}) = f(\mathbf{s}), \qquad \mathbf{s} \in \mathcal{S}, \qquad (18)$$

where the source term $f(\mathbf{s})$ is used to model an injection well at the bottom-left corner of $\mathcal{S}$ and a production well at the top-right corner. We also enforce a no-flux boundary condition $\mathbf{u} \cdot \hat{\mathbf{n}} = 0$, where $\hat{\mathbf{n}}$ denotes the unit normal vector to the boundary, and an integral constraint on the pressure to ensure the uniqueness of the solution, as in bilionis2013multi. More specifically, the source term is

$$f(\mathbf{s}) = \begin{cases} r, & \mathbf{s} \in [0, w]^2, \\ -r, & \mathbf{s} \in [1 - w, 1]^2, \\ 0, & \text{otherwise}, \end{cases} \qquad (19)$$

where $r$ is the rate of the wells and $w$ is their size. The input log-permeability field is restricted in this work to be a Gaussian random field, i.e.

$$\log K(\mathbf{s}) \sim \mathcal{GP}\big(m, k(\cdot, \cdot)\big), \qquad (20)$$

where $m$ is the constant mean and the covariance function is specified in the following form, using the $L_1$ norm in the exponent instead of the $L_2$ norm used in bilionis2013multi, i.e.

$$k(\mathbf{s}, \mathbf{s}') = \exp\Big(-\frac{\lVert \mathbf{s} - \mathbf{s}' \rVert_1}{l}\Big), \qquad (21)$$

where $l$ is the length scale.

4.1 Datasets

The Gaussian random field rasmussen2006gaussian with exponential kernel corresponds, in the one-dimensional case, to the Ornstein-Uhlenbeck process, which is mean-square continuous but not mean-square differentiable. Thus, when we discretize the field over a grid, the field value jumps (varies strongly) as we move from pixel to pixel, and the field does not become smoother when we use a finer grid over a fixed spatial domain. This high variability creates a significant challenge for data-driven models to capture, i.e. the intrinsic dimensionality of the discretized random field equals the total number of pixels, e.g. 4,225 for the 65 x 65 grid that serves as our reference grid in the calculations below. In contrast, a common assumption for natural images is that the underlying dimensionality is actually small (a few hundred) despite their complex appearance. To evaluate the generality and effectiveness of the methodology, we use the KLE to control the intrinsic dimensionality of the permeability dataset. We evaluate our model using datasets of increasing intrinsic dimensionality, namely 50, 500, and 4,225 KLE terms (called KLE50, KLE500, and KLE4225, respectively). Notice that when the number of KLE terms is 4,225, the permeability field is effectively sampled directly from the exponential Gaussian field without any dimensionality reduction. The intrinsic dimensionality of each dataset is hidden from our model, i.e. our model does not build a map from the KLE terms to the system output; instead, it models an end-to-end mapping from input fields to output fields. In fact, we will show that one specific network architecture works well for all three datasets obtained from the different intrinsic input data dimensions.

We consider solving the Darcy flow equation (18) over the unit square domain with a fixed 65 x 65 grid and fixed values of the length scale $l$, the kernel mean $m$, and the rate $r$ and size $w$ of the source. The ratio of the cumulative sum of the eigenvalues (in decreasing order) to their total sum is shown in Fig. 5.

Figure 5: KLE eigenvalue profile: ratio of the cumulative sum of the eigenvalues (in decreasing order) to their total sum.

The Darcy flow equation is solved using a mixed finite element formulation implemented in FEniCS AlnaesBlechta2015a with third-order Raviart-Thomas elements for the velocity and fourth-order discontinuous elements for the pressure. A sample input permeability field and the computed output pressure and velocity fields for the three datasets are shown in Fig. 6.

Figure 6: Sample permeability fields obtained from the (a) KLE50, (b) KLE500, and (c) KLE4225 (no dimensionality reduction) datasets, and the corresponding velocity components and pressure obtained from the simulator. All figures are shown as pixel images (imshow) to reveal the high variability of the input and output fields.

When the available data are limited, it is common practice to use cross-validation to evaluate the model. Since our dataset is synthetically generated, we have access to any number of training and test data, limited only by the computational cost of solving the Darcy flow equations. Our data consist of four sets: the training set, the validation set, the test set, and the uncertainty propagation set. The training set is sampled using the simplest design-of-experiments method, Latin hypercube sampling. More specifically, the KLE for the log-permeability field is

$$\log K(\mathbf{s}) = m + \sum_{i=1}^{d} \sqrt{\lambda_{i}}\, z_{i}\, \phi_{i}(\mathbf{s}), \qquad (22)$$

where $\lambda_{i}$ and $\phi_{i}(\mathbf{s})$ are the eigenvalues and eigenfunctions of the exponential covariance function of the Gaussian field specified in Eqs. (20) and (21), the $z_{i}$'s are i.i.d. standard Normal, and $d$ is the number of KLE coefficients maintained in the expansion. The maximum number that can be used is finite and equal to the number of grid points used in the discretization of the field over the unit square. We first use a Latin hypercube design to sample uniformly from the hypercube $[0, 1]^{d}$, and then obtain the coefficients as $z_{i} = \Phi^{-1}(u_{i})$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. For each of the KLE50, KLE500 and KLE4225 cases, several training sets of increasing size are generated.

The log-permeability fields in the other three sets are constructed directly from coefficients $z_{i}$ sampled from the standard normal. The validation and test sets each contain a fixed number of input permeability fields, and the dataset for uncertainty propagation contains a much larger number of realizations. All datasets are organized as images, as discussed in Section 2.1.
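A minimal sketch of this sampling procedure is given below; it assumes the eigenpairs of the discretized exponential covariance are available, and all names are illustrative.

```python
import numpy as np

def sample_log_permeability(eigvals, eigvecs, mean=0.0, n_samples=1, seed=None):
    """Draw log-permeability realizations via the truncated KLE of Eq. (22).
    eigvals: (d,) leading eigenvalues; eigvecs: (n_grid, d) eigenvectors of the
    covariance matrix of k(s, s') = exp(-||s - s'||_1 / l) evaluated on the grid."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, len(eigvals)))       # i.i.d. standard normal coefficients
    return mean + (z * np.sqrt(eigvals)) @ eigvecs.T          # (n_samples, n_grid)

# For the training set, the paper instead draws the coefficients by mapping a Latin
# hypercube design through the standard normal inverse CDF (e.g. scipy.stats.norm.ppf).
```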

4.2 Evaluation Metrics

Several metrics are used to evaluate the trained models on the test data. In particular, we consider the following:

Coefficient of determination ($R^2$-score):

$$R^{2} = 1 - \frac{\sum_{i=1}^{N_{\mathrm{test}}} \lVert \hat{\mathbf{y}}_{i} - \mathbf{y}_{i} \rVert_{2}^{2}}{\sum_{i=1}^{N_{\mathrm{test}}} \lVert \mathbf{y}_{i} - \bar{\mathbf{y}} \rVert_{2}^{2}}, \qquad (23)$$

where $\hat{\mathbf{y}}_{i}$ is the output mean of the Bayesian surrogate, i.e. $\mathbb{E}[\mathbf{y}_{i} \mid \mathbf{x}_{i}, \mathcal{D}]$ as in Eq. (12), or the predictive output of the non-Bayesian surrogate, i.e. $f(\mathbf{x}_{i}, \boldsymbol{\theta})$; $\mathbf{y}_{i}$ is the test target, $\bar{\mathbf{y}}$ is the mean of the test targets, and $N_{\mathrm{test}}$ is the total number of test data. This metric enables the comparison between different datasets since the error is normalized, with a score closer to 1 corresponding to better regression. It is the only metric used for evaluating the non-Bayesian surrogate; the following metrics are additional metrics for evaluating the Bayesian surrogate. Note that this metric is also used for tracking the performance of the training process, and is thus evaluated for both the training and test data sets.
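For reference, a straightforward implementation of Eq. (23) could look as follows (names are ours):

```python
import numpy as np

def r2_score(y_pred, y_true):
    """Coefficient of determination over a test set, Eq. (23); inputs have
    shape (N_test, ...) and are flattened over all output pixels/channels."""
    y_pred = y_pred.reshape(len(y_true), -1)
    y_true = y_true.reshape(len(y_true), -1)
    ss_res = np.sum((y_pred - y_true) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot
```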

Root Mean Squared Error (RMSE):

This is a common metric for regression that is used in our experiments for monitoring the convergence of training.

Mean Negative Log-Probability (MNLP):

This metric evaluates the likelihood of the observed data under the predictive distribution. It is used to assess the quality of the predictive model.

Predictive uncertainty and Propagated Uncertainty: These metrics were introduced in Section 3.4.

Estimated Distributions: These include histograms or kernel density estimates for the output fields at certain locations of the physical domain.

Reliability Diagram: Given a trained Bayesian surrogate and a test data set, we can compute the central $p\%$ predictive interval for each test data point based on the Gaussian quantiles, using the predictive mean and variance lakshminarayanan2016simple. We then compute the frequency of the test targets that fall within this predictive interval. For a well-calibrated regression model, the observed frequency should be close to $p\%$. The reliability diagram is the plot of the observed frequency with respect to $p\%$; thus a well-calibrated model should have a reliability diagram close to the diagonal.
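A minimal sketch of how such a diagram can be computed from Gaussian predictive moments is shown below (the function name and the grid of levels are our own choices):

```python
import numpy as np
from scipy.stats import norm

def reliability_curve(y_true, pred_mean, pred_var, levels=np.linspace(0.05, 0.95, 19)):
    """For each nominal probability p, compute the observed frequency of test targets
    falling inside the central p Gaussian predictive interval."""
    std = np.sqrt(pred_var)
    observed = []
    for p in levels:
        z = norm.ppf(0.5 + p / 2.0)                 # half-width of the interval in std units
        inside = np.abs(y_true - pred_mean) <= z * std
        observed.append(inside.mean())
    return levels, np.array(observed)

# A well-calibrated model yields observed frequencies close to the nominal levels,
# i.e. a curve close to the diagonal of the reliability diagram.
```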

4.3 Non-Bayesian Surrogate Model

The hyperparameters to search over include the parameters that determine the network architecture and the ones that specify the training process, both of which affect model performance. We use the Hyperband li2016hyperband algorithm to optimize those hyperparameters, with a constraint on the total number of model parameters. The details of these experiments are given in Appendix A. The network configuration with the highest $R^2$-score found by Hyperband is shown in Fig. 7, with more details provided in Table 1. This configuration is referred to as DenseED-c16. The second to fourth columns of Table 1 show the number of output feature maps, the spatial resolution of the output feature maps, and the number of parameters of each layer in the network.

Figure 7: DenseED-c16 with dense blocks of (3, 6, 3) layers, growth rate K = 16, and 48 initial feature maps after the first convolution layer (yellow in the figure). There are in total 19 (transposed) convolution layers and 241,164 parameters in the network. The number of network parameters is optimized based on the generalization error analysis reported in Appendix B. The network contains two down-sampling layers, thus the smallest spatial dimension (code dimension) of the feature maps (purple in the figure) is 16 x 16, hence the name DenseED-c16. The first convolution kernel is of size 7 (k7s2p2). The last transposed conv kernel in the decoding layer is of size 5, and the number of its output feature maps is 3 (corresponding to the three output fields). For the decoding layers, the output padding is chosen so that the output resolution matches the target grid. The other conv kernels in the dense blocks and the encoding/decoding layers are as described in Section 2.2.
Layers                    Feature maps   Resolution   Number of parameters
Input                     1              65 x 65      -
Convolution k7s2p2        48             32 x 32      2,352
Dense Block (1) K16L3     96             32 x 32      28,032
Encoding Layer            48             16 x 16      25,632
Dense Block (2) K16L6     144            16 x 16      77,088
Decoding Layer (1)        72             32 x 32      57,456
Dense Block (3) K16L3     120            32 x 32      38,544
Decoding Layer (2)        3              65 x 65      12,060
Table 1: DenseED-c16 architecture for the Darcy flow dataset.

The network DenseED-c16 is trained with Adam kingma2014adam, a variant of stochastic gradient descent, with the loss function being the $L_2$-regularized MSE of Eq. (3), which is implemented as weight decay in modern neural net frameworks such as PyTorch and TensorFlow. Other loss functions may achieve better results, such as a smoothed $L_1$ loss or a conditional GAN loss pix2pix2016; this requires further investigation to be considered in future publications. The initial learning rate, the weight decay (regularization of the weights) and the batch size are kept fixed, and we use a learning rate scheduler which reduces the learning rate on plateaus of the root MSE. The model is trained for a fixed number of epochs with the datasets introduced in Section 4.1.

Training the deterministic neural networks with the $L_2$-regularized MSE is equivalent to finding the maximum a posteriori (MAP) estimate of the uncertain parameters in a Bayesian neural network whose prior is an independent normal. The typical training process is shown in Fig. 8.

Figure 8: Typical training process of DenseED-c16.

We train the network with different numbers of training data for KLE50, KLE500, and KLE4225. The validation $R^2$-score is shown in Fig. 9, which shows that, for the same number of training data, the $R^2$-score is closer to 1 when the intrinsic dimensionality is smaller, and that the score increases with more training data at the same dimensionality. Note that the score is high even with reasonably small training sets for all three cases, whose dimensionality ranges from 50 to 4,225. This shows the effectiveness of the network DenseED-c16 for both low-dimensional and high-dimensional problems.

Figure 9: Test $R^2$-scores for the non-Bayesian surrogate.

The prediction of the output fields is easily obtained at test time by feeding the test input permeability field into the trained network, i.e. $\hat{\mathbf{y}} = f(\mathbf{x}, \boldsymbol{\theta})$. We show the predictions for the test inputs shown in Fig. 6 using DenseED-c16 trained on the three datasets (KLE50, KLE500, KLE4225) in Figs. 10, 11, and 12, respectively. The predictions are quite good even for the KLE4225 case, where both the input and output fields vary rapidly in certain regions of the domain.

Figure 10: Prediction for the input realization shown in Fig. 6(a) from the KLE50 dataset using DenseED-c16 trained with two training-set sizes, shown in subfigures (a) and (b). In both subfigures, the first row shows the three test target fields (simulation output), i.e. the pressure and the two velocity components; the second row shows the corresponding model predictions; the third row shows the error.
Figure 11: Same as Fig. 10 for the input realization shown in Fig. 6(b) from the KLE500 dataset.
Figure 12: Same as Fig. 10 for the input realization shown in Fig. 6(c) from the KLE4225 dataset.

4.4 Bayesian Surrogate Model

For all the experiments we only consider the homoscedastic noise model for the Bayesian neural networks, i.e. output-wise Gaussian noise with a Gamma prior on its precision $\beta$, and the Student's t-prior of Section 3.1 on the network parameters $\mathbf{w}$.

The set of all uncertain parameters is denoted as $\boldsymbol{\theta} = (\mathbf{w}, \beta)$. We apply SVGD to the Bayesian neural network with $S$ samples from the posterior, i.e. a set of $S$ deterministic DenseED model parameters and noise precisions. In the implementation, this corresponds to $S$ different initializations of the deterministic DenseED and of the noise precision (a scalar). We update the parameters of the $S$ DenseEDs and the corresponding noise precisions using SVGD as in Algorithm 1. The kernel is chosen to be the RBF kernel $k(\boldsymbol{\theta}, \boldsymbol{\theta}') = \exp\big(-\lVert \boldsymbol{\theta} - \boldsymbol{\theta}' \rVert_{2}^{2} / h\big)$, with the median heuristic for the choice of the kernel bandwidth $h$, based on the median of the pairwise distances between the current samples. We use a modest number of samples of $\boldsymbol{\theta}$ to approximate the empirical measure of the posterior. For large numbers of training data, the unnormalized posterior is evaluated using mini-batches of training data. We observe that, even for small training sets, using a smaller batch size helps to obtain lower training and test errors, at the cost of longer training times. We use Adam kingma2014adam, instead of vanilla stochastic gradient descent, to update the particles with the SVGD direction, using separate learning rates for the network parameters and the noise precision, and a learning rate scheduler that decreases the learning rate when the training RMSE is on a plateau.

The algorithm is implemented in PyTorch and runs on a single NVIDIA GeForce GTX GPU across the considered training data sizes. The training time depends heavily on the training mini-batch size, which is kept fixed for all cases. Potential ways to speed up the training process significantly include increasing the mini-batch size or implementing SVGD in parallel using multiple GPUs. The Python source code will become available upon publication at https://github.com/bmmi/bayesnn.

We report next the $R^2$-score, computed similarly to the non-Bayesian case, except that the predictive output mean is used for the comparison with the test target. The scores are shown in Fig. 13. We can see that the Bayesian surrogate improves the $R^2$-score significantly over the non-Bayesian version.

Figure 13: $R^2$-score comparison for the non-Bayesian and Bayesian models.

We also report the MNLP for the test data in Fig. 14; this is a proper scoring rule that is commonly used to assess the quality of the predictive uncertainty quinonero2006evaluating.

Figure 14: MNLP.

The $R^2$-score gives us a general estimate of how well the regression performs. For a given input permeability field, the Bayesian neural network predicts the mean of the corresponding output fields and also gives an uncertainty estimate, represented as the predictive variance at each spatial location; this is unavailable for deterministic models and is desirable when the training data are limited. In Figs. 15, 16, and 17, we show predictions for the test inputs shown in Fig. 6 with training data from KLE50, KLE500, and KLE4225, respectively. We can see that the predictive accuracy improves as the size of the training dataset increases, while the predictive uncertainty drops.

Figure 15: Prediction for the input realization shown in Fig. 6(a) from the KLE50 dataset, for two training-set sizes ((a) and (b)). The rows from top to bottom show the simulation output fields (ground truth), the predictive mean, the error between the two, and two standard deviations of the predictive output distribution per pixel. The three columns from left to right correspond to the pressure field and the two velocity components, respectively.
Figure 16: Same as Fig. 15 for the input realization shown in Fig. 6(b) from the KLE500 dataset.
Figure 17: Same as Fig. 15 for the input realization shown in Fig. 6(c) from the KLE4225 dataset.

We also performed uncertainty propagation by feeding the trained Bayesian surrogate with input realizations sampled from the Gaussian field and calculating the output statistics as in Section 3.4. In Figs. 18, 19 and 20, we show the uncertainty propagation results and compare them with Monte Carlo estimates computed on the uncertainty propagation (UP) dataset, for the Bayesian surrogate trained with the KLE50, KLE500, and KLE4225 datasets.

Figure 18: Uncertainty propagation for KLE50, for two training-set sizes. In (a) and (b), from top to bottom, we show the Monte Carlo output mean, the predictive output mean, the error between the two, and two standard deviations of the conditional predictive mean. The three columns from left to right correspond to the pressure field and the two velocity components, respectively. In (c) and (d), from top to bottom, we show the Monte Carlo output variance, the predictive output variance, the error between the two, and two standard deviations of the conditional predictive variance. The columns are the same as in (a) and (b).
Figure 19: Uncertainty propagation for KLE500; panels (a)-(d) are as in Fig. 18.
Figure 20: Uncertainty propagation for KLE4225; panels (a)-(d) are as in Fig. 18.
Figure 21: Distribution estimates for the pressure and the two velocity components (from left to right) at the first location considered, with the Bayesian surrogate trained on the KLE dataset for two training-set sizes ((a) and (b)).
Figure 22: Distribution estimates for the pressure and the two velocity components (from left to right) at the second location considered, with the Bayesian surrogate trained on the KLE dataset for two training-set sizes ((a) and (b)).

We show estimates of the distribution of the pressure and the two velocity components at two locations on the unit square, for different training data sizes of the KLE dataset, in Figs. 21 and 22, respectively. The PDFs are obtained by kernel density estimation using the predictive mean. We can see that the density estimate is close to the Monte Carlo result even when the training dataset is small, and becomes closer as the training dataset grows. From Fig. 22, we observe that the predictions for the velocity fields are better than for the pressure field, especially at locations away from the diagonal of the unit square domain; this is generally the case for our current network architecture, in which the three output fields are treated identically.

In order to assess the quality of the computed uncertainty, we adopt the reliability diagram, which expresses the discrepancy between the nominal predictive probability and the observed frequency of test targets falling within the corresponding predictive interval. The diagrams are shown in Fig. 23. Overall, our models are well-calibrated since the curves are quite close to the ideal diagonal, especially for the largest training dataset. In general, the model tends to be over-confident (small predictive uncertainty) when the training dataset is small, and gradually becomes more prudent (larger predictive uncertainty) as the training dataset grows. The main reason for this observation is that the predictive uncertainty is dominated by the variation seen in the training data, which is small when little data is observed. The initial learning rates and their scheduling scheme for the network parameters and the noise precision may also play a role here, since the uncertainty is determined by the optimum that the stochastic optimization reaches.

(a) KLE50
(b) KLE500
(c) KLE4225