# Bayesian Sparse learning with preconditioned stochastic gradient MCMC and its applications

In this work, we propose a Bayesian type sparse deep learning algorithm. The algorithm utilizes a set of spike-and-slab priors for the parameters in the deep neural network. The hierarchical Bayesian mixture will be trained using an adaptive empirical method. That is, one will alternatively sample from the posterior using preconditioned stochastic gradient Langevin Dynamics (PSGLD), and optimize the latent variables via stochastic approximation. The sparsity of the network is achieved while optimizing the hyperparameters with adaptive searching and penalizing. A popular SG-MCMC approach is Stochastic gradient Langevin dynamics (SGLD). However, considering the complex geometry in the model parameter space in non-convex learning, updating parameters using a universal step size in each component as in SGLD may cause slow mixing. To address this issue, we apply a computationally manageable preconditioner in the updating rule, which provides a step-size parameter to adapt to local geometric properties. Moreover, by smoothly optimizing the hyperparameter in the preconditioning matrix, our proposed algorithm ensures a decreasing bias, which is introduced by ignoring the correction term in preconditioned SGLD. According to the existing theoretical framework, we show that the proposed algorithm can asymptotically converge to the correct distribution with a controllable bias under mild conditions. Numerical tests are performed on both synthetic regression problems and learning the solutions of elliptic PDE, which demonstrate the accuracy and efficiency of present work.

## Authors

• 7 publications
• 8 publications
• 1 publication
• ### An Adaptive Empirical Bayesian Method for Sparse Deep Learning

We propose a novel adaptive empirical Bayesian (AEB) method for sparse d...
10/23/2019 ∙ by Wei Deng, et al. ∙ 0

Bayesian approaches have been successfully integrated into training deep...
10/03/2020 ∙ by Yating Wang, et al. ∙ 0

Stochastic gradient Hamiltonian Monte Carlo (SGHMC) is an efficient meth...
02/29/2020 ∙ by Ruqi Zhang, et al. ∙ 0

• ### ADAMT: A Stochastic Optimization with Trend Correction Scheme

01/17/2020 ∙ by Bingxin Zhou, et al. ∙ 0

• ### Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring

In this paper we address the following question: Can we approximately sa...
06/27/2012 ∙ by Sungjin Ahn, et al. ∙ 0

• ### On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators

Recent advances in Bayesian learning with large-scale data have witnesse...
10/21/2016 ∙ by Changyou Chen, et al. ∙ 0

• ### Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks

Effective training of deep neural networks suffers from two main issues....
12/23/2015 ∙ by Chunyuan Li, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Deep neural networks have attracted extensive attention in recent times. Due to their powerful potential in approximating high-dimensional nonlinear maps, and universal approximation property to represent a rich class of functions, DNNs have been successfully employed in problems from various research areas. However, effectively training DNN is still challenging due to the difficulty of escaping local minima in non-convex optimization and avoiding overfitting in over-parameterized networks.

Bayesian learning is appealing because of its ability to capture uncertainties in the model parameter, and MCMC sampling helps to address the overfitting issue. There has been extensive work bringing the Bayesian methods to the context of DNN optimization. The stochastic gradient Langevin dynamics (SGLD) sgld is first proposed and becomes a popular approach in the family of stochastic gradient MCMC algorithmschen2014stochastic ; ma2015complete ; PSGLD . SGLD is the first-order Euler discretization of Langevin diffusion with stationary distribution on Euclidian space. It can be viewed as adding some noise to a standard stochastic gradient optimization algorithm. Since it resembles SGD, SGLD inherits the advantage of SGD where the gradients are stochastically approximated using mini-batches. This makes MCMC scalable and provides a seamless transition between stochastic optimization and posterior sampling. It was shown that samples from SGLD will converge to samples from the true posterior distribution with annealed step size sg-mcmc-convergence ; sgld .

In DNN, the underlying models may have complicated geometric properties and possess non-isotropic target density functions dauphin2014identifying ; PSGLD ; chen2014stochastic . When the components of parameters have different curvature, generating samples using a universal step size for every model parameter may cause slow mixing and can be inefficient. In the optimization literature, there are many approaches to accelerate the gradient descent, such as preconditioning and Newton’s method dauphin2015equilibrated ; zhang2011quasi ; byrd2016stochastic ; bordes2009sgd

. However, naively borrowing this idea and using a preconditioning matrix in SGLD fails to produce a proper MCMC scheme, the Markov chain does not target the underlying posterior except for a few cases

PSGLD ; simsekli2016stochastic

. Considering that a Langevin diffusion with invariant measure can be directly defined on a Riemannian manifold, and the expected Fisher information is one typical choice for the Riemannian metric tensor

Riemann_LD_HMC , SGRLD is proposed SGRLD

. Built-up from Riemannian Langevin dynamics, SGRLD is a discretization of the Riemannian Langevin dynamics and the gradients are approximated stochastically. It incorporates local curvature information in parameter updating scheme, such that constant step size is adequate along with all directions. However, the full expected Fisher information is usually intractable. A more computationally efficient preconditioner is needed to approximate second-order Hessian information. Preconditioned SGLD adopts the same preconditioner as introduced in RMSprop

tieleman2012lecture as discussed in PSGLD which reduces the computational and storage cost. One can update the preconditioner sequentially taking into account the current gradient and previous preconditioning matrix. The preconditioner is in a diagonal form and can handle scale differences in the target density. However, the algorithm in PSGLD introduces a permanent bias on the MSE due to ignoring a correction term in the updating equation.

On the other hand, DNN models are usually over parameterized and require extensive storage capacity as well as a lot of computational power. The over specified models may also lead to bad generalization and large prediction variance. Enforcing sparsity in the network is necessary. In

sgld-sa , the authors propose an adaptive empirical Bayesian method for sparse learning. The idea is to incorporate an adaptive empirical Bayesian model selection techniques with SG-MCMC sampling algorithm (SGLD-SA). In SGLD-SA algorithmsgld-sa

, one adopts a spike-and-slab prior and obtains a Bayesian mixture DNN model. The model parameters are sampled from the adaptive hierarchical posterior using SG-MCMC, and the hyperparameters in the priors are optimized via stochastic approximation adaptively. The algorithm automatically searches and penalizes the low probability parameters, and identifies promising sparse high posterior probability models

EMVS . One can also apply a pruning strategy to cut off model parameters with small magnitudes to further enforce sparsity in the network molchanov2016pruning ; lin2017runtime . The performance of the sparse approach is demonstrated with numerous examples, and the method is also shown to be robust in adversarial attacks. Theoretically, the authors show that the proposed algorithm can asymptotically converge to the correct distribution.

In support of the advantages and considering the issues of the above-mentioned methods, we incorporate the preconditioned SGLD methods with sparse learning. We will apply the proposed method to learn solutions of partial differential equations with heterogeneous coefficients. Numerous approaches have been proposed to numerically solve ODEs and PDEs with deep neural networks, for example, parametric PDE Ying_paraPDE , ODE systems driven by data NeuralODE ; QinXiu2018dataODE , time-dependent multiscale problems wang2018NLMCDNN ; wang2020reduced and physical informed DNN (PINN1 ; PINN2 ; zhang2019quantifying ; PCDL_nz ). Moreover, various types of network architectures are constructed to achieve efficient learning based on existing fast numerical solvers. These approaches include designing multigrid neural networks Fan2018MNNH ; he2019mgnet , constructing multiscale models wang2018Gmsfem ; wang2018NLMCDNN ; wang_multiphase , learning surrogate reduced-order models by deep convolution networks deepconv_Zabaras ; E_deepRitz ; cheung2020deep and so on.

This work attempts to design an efficient sparse deep learning algorithm, and apply it to learn the solutions elliptic PDE with heterogeneous coefficients. Numerical simulations for these problems are challenging since it naturally contains heterogeneities from various scales as well as uncertainties. Based on model reduction idea, for example, generalized multiscale finite element methods (GMsFEM) GMsFEM13 ; MixedGMsFEM ; OnlineStokes , the authors wang_multiphase

design appropriate sparse DNN structure to learn the map from the heterogeneous permeability to velocity fields in Darcy’s flow. The idea is to apply locally connected/convolutional layers which can be an analogy to the upscaling and downscaling procedures in multiscale methods. However, the network is still over parameterized. In particular, the last decoding step joins neurons representing features on the coarser level to the neurons representing the fine-scale solutions and is realized by a fully connected layer. Due to the large degrees of freedom in the fine grid solution, the number of parameters in the network will be very large and result in inefficient training. Our main contribution is to bring together preconditioned SGLD and stochastic approximation to achieve efficient and sparse learning. We propose an adaptive empirical Bayesian algorithm, where the neural network parameters are sampled from a Bayesian mixture model using PSGLD method, and the latent variables are smoothly optimized during stochastic approximation. PSGLD incorporates local curvature information in parameter updating scheme, thus it is suitable to deal with our problem which possesses multiscale nature. More importantly, we will sequentially update the preconditioning matrix under the framework of stochastic approximation, such that the bias introduced by ignoring the correction term in the sampling approaches to zero asymptotically. We theoretically show the convergence of the proposed algorithm and demonstrate its performance in several numerical experiments.

The paper is organized as follows. In Section 2, we review some basic ideas in SGLD, SGRLD. In Section 3, the sparse adaptive empirical Bayesian approach is reviewed. Our main algorithm which combines preconditioned SGLD with sparse learning is explored in Section 4. Its convergence is discussed in Section 5. Applying the proposed method to a large-p-small-n regression problem, and to learn solutions of elliptic problems with heterogeneous coefficients, its performances are presented in Section 6. A conclusion is made in the last Section 7.

## 2 Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Riemann Langevin dynamics(SGRLD)

Throughout the paper, we denote by the model parameters with as a prior distribution, and the entire dataset, where is an input-output pair for the model. Let be the likelihood, the posterior is then . SGLD combines the idea from stochastic gradient algorithms and posterior Bayesian sampling using Langevin dynamics. The loss gradient is approximated efficiently use mini-batches of data in SGLD, and the uncertainties in the model parameter can be captured through Bayesian learning to avoid overfitting. The model parameters update as follows:

 βk+1=βk+ϵk∇β~L(βk)+N(0,2ϵkτ−1)

where for a subset of data points

 ∇β~L(β)=∇βlogp(β)+Nnn∑i=1∇βlogp(dki|β)

is the stochastic gradient computed using a mini-batch, which is used to approximate the true gradient .

However, if the components of the model parameter

possess different scales, the invariant probability distribution for the Langevin equation is not isotropic, using standard Euclidian distance may lead to slow mixing. Stochastic Gradient Riemann Langevin Dynamics (SGRLD)

SGRLD is a generalization of SGLD on a Riemannian manifold. In this case, consider the probability models on a Riemann manifold with some metric tensor , the parameter updates can be guided using the geometric information of this manifold as follows:

 βk+1=βk+ϵk[G(βk)∇β~L(βk)+Γ(βk)]+N(0,2ϵkτ−1G(βk)) (1)

where is an additional drift term and . The he expected Fisher information can be used as a natural metric tensor, however it is intractable in many cases. One can choose a more practical metric tensor and use it as a preconditioning matrix.

## 3 SGLD with stochastic approximation (SGLD-SA)

In order to achieve sparse learning in DNN, in sgld-sa , the authors propose an adaptive empirical Bayesian method. It assumes that the weight parameters , the -th neuron in the -th layer, follows spike-and-slab Gaussian Laplace prior

 π(βlj|σ2,γlj)=(1−γlj)Lp(0,σv0)+γljN(0,σ2v1)

where

are the latent binary variable selection indicators,

is the Laplace distribution, and

is the Normal distribution. The error variance

follows an inverse gamma prior . The prior for

follows a Bernoulli distribution,

, which incorporate uncertainty regarding which variables need be included in the model. Here, , and follows where , are some positive constants..

Let be the -th mini-batch of the dataset. The likelihood for a regression problem can be rewritten as

 π(dm|β,σ2)=1(2πσ2)n/2exp{−∑xmi∈dm(xmi−F(xmi;β))22σ2}

where denotes a map describing the input-output relationship from to .

Then, the posterior follows

 π(β,σ2,δ,γ|dm)∝π(dm|β,σ2)Nnπ(β|σ2,γ)π(σ2|γ)π(γ|δ)π(δ) (2)

Now treat as “missing data”. At iteration , instead of sampling from true posterior with respect to the whole dataset , one needs to sample from with respect to a mini-batch

 Q(β,σ,δ|βk,σk,δk)=EB[Eγ|D[logπ(β,σ,δ,γ)|B]]

and it can be separated as

 Q(β,σ,δ|βk,σk,δk)=Q1(β,σ|βk,σk,δk)+Q2(δ|βk,σk,δk)+C

where

 Q1(β,σ|βk,σk,δk) =Nnlogπ(dm|β)−∑l∈LD∑j∈plβ2lj2σ20−p+ν+22log(σ2)− ∑l∈LS∑j∈pl{|βlj|σE[1v0(1−γlj)]+β2lj2σ2E[1v1γlj]}−νλ2σ2
 Q2(δ|βk,σk,δk)=∑l∈LS∑j∈pllog(δl1−δl)E[γlj]+(a−1)log(δl)+(pl+b−1)log(1−δl)

where denotes sparse layers, and denotes non-sparse layers.

The adaptive empirical Bayesian algorithm samples from and iteratively optimize with respect to via stochastic approximation as in Algorithm 1.

We note that the update formula of latent variables are motivated by EM approach to Bayesian variable selection (EMVS) rovckova2014emvs . In Algorithm 1, , is the step size in updating latent variables, and , is the positive root to the following quadratic formula:

 {N+∑l∈Lspl+ν}σ2+{||∑l∈Lsκlk,0∘βlk+1||1}σ +{Nn∑xmi∈dm(ymi−F(xmi;β))2+||∑l∈Lsκlk,1∘βlk+1||22+νλ}=0

where denotes the point-wise product, and

are the vector

and norm correspondingly.

## 4 Preconditioned SGLD with stochastic approximation (PSGLD-SA)

As seen in Section 2, all model parameters are updated using the same learning rate

, this may cause slow mixing if the loss function has very different scales in different directions, and a small enough learning rate is required to avoid divergence in the largest positive curvature direction.

Here, we will introduce a preconditioning matrix to guide the updating directions during sampling. In gradient descent algorithms, the optimization can be improved using the second order information, i.e. the inverse of the Hessian matrix, as the preconditioning matrix. However, it is too computationally expensive to store and invert the full Hessian during the training. An efficient approximation is to use the same preconditioner as in RMSprop tieleman2012lecture . The idea is to scale the gradient using a moving average of its recent norm in each iteration, so that one can adapt the step size separately for each weight. By keep a moving average for each weight parameter from the previous step, one can control the changes among adjacent mini batches. We propose a sequentially updated preconditioner using the stochastic approximation idea as follows

 G(βk) =diag−1(η+√V(βk)) (3) V(βk) =αkV(βk−1)+(1−αk)g(βk)∘g(βk) (4)

where is a regularization constant, and , . Importantly, we note that the weight parameter is a sequence approaching as the time step increases, which is different from the constant in PSGLD . The change in the parameters will then be

 △βk=ϵk(G(βk)g(βk)+Γ(βk))+N(0,2ϵkτ−1G12(βk)) (5)

where .

We note that in PSGLD , is ignored in practice, and is a constant. This produces a permanent bias on the MSE. To address this issue, we let gradually approach during the adaptive optimization of the latent variables, then the bias mentioned before will decrease. To be specific, we have

 ∣∣ ∣∣K∑k=1Γi(βk)∣∣ ∣∣ =∣∣ ∣∣T∑t=1(1−αk)V−32(βk)g(βk)∂g(βk)∂β∣∣ ∣∣ =∣∣ ∣∣K∑k=1(1−αk)g(βk)[αk−1V(βk−1)+(1−αk−1)g(βk−1)2]−32∂g(βk)∂β∣∣ ∣∣ ≲∣∣ ∣ ∣∣K∑k=1(1−αk)g(βk)α32k−1V(βk−1)32∂g(βk)∂β∣∣ ∣ ∣∣ ≲∣∣ ∣ ∣∣K∑k=1(1−αk)g(βk)α321V32(βk−1)∂g(βk)∂β∣∣ ∣ ∣∣

Then we have

 ∣∣ ∣∣K∑k=1Γi(βk)∣∣ ∣∣≲M∣∣ ∣ ∣∣K∑k=1(1−αk)α321∣∣ ∣ ∣∣ (6)

due to the assumption that the derivative of the gradients are bounded, for some constant .

Typically, let be in the form of for some , and constants , we can see that the bias introduced on the MSE will approach as .

Thus, our proposed adaptive preconditioned SGLD samples and optimizes as in Algorithm2.

## 5 Convergence results

Now, we will discuss the weak convergence of our proposed algorithm PSGLD-SA. First, we will take a look at the hyperparameters. Denote by all the hyperparameters . The stochastic approximation attempts to get the optimal based on the asymptotically target distribution . Define , where represents a function to obtain optimal given current model parameters . Denote by its mean field function . SA aims to solve the fixed point equation , which is to find the root of the equation . As described in Algorithm 2, in each iteration, we first sample using precontioned SGLD based on , then update the latent variables using

 θk+1=θk+ωk+1H(θk,βk+1),

where the map is motivated by EMVS. However, we only use a small set of data of samples instead of the full set in the computation of obtaining optimal latent variables. This will result a bias at each step. That is, we actually use with

 ~H(β,θ)=H(β,θ)+Δ(n,θi,βi+1), (7)

and we assume for some constant .

Following a similar proof in sgld-sa , under suitable assumptions, the adaptive empirical Bayesian method for sparse approximation algorithm has the following convergence results. The details of the proof are in Appendix A.

###### Theorem 1.

For a sufficiently large , there exists a constant such that

 E[||θk−θ∗||2]=O(λωk+supi≥k0E||Δ(n,θi,βi+1)||).

Next, we present a weak convergence result of the model parameters.

###### Corollary 1.

Under Assumptions 2 in sg-mcmc-convergence , the bias and MSE of PSGLD-SA for steps with decreasing step size is bounded, the distribution of converges weakly to the target posterior with a controllable bias, as and .

###### Proof.

With geometric information for probability models, the Langevin diffusion on the manifold is described by

 dβ(t)=G(β(t))∇βL(β(t),θ∗)+Γ(β(t))+G12(β(t))dBt (8)

where is the Brownian motion.

Denote by the generator for (8), then

 L=[G(βk)∇βL(βk,θ∗)+Γ(βk)]⋅∇β+2G12(β)G12(βk)T:∇βk∇Tβ (9)

The generator is associated with the backward Kolmogorov equation

 E[ϕ(βk)]=etLϕ(β0)

In PSGLD-SA, one will sample from the adaptive hierarchical posterior using (3) (5), and gradually optimize the latent variables through stochastic approximation.

Write the local generator of our proposed algorithm as

 ~Lk=[G(βk)~gk]⋅∇β+2G12(β)G12(βk)T:∇βk∇Tβ (10)

where , and

 ΔVk=[G(βk)(∇βL(βk,θ∗)−~gk)+Γ(βk)]⋅∇β.

Thus

 ~gk=∇βL(βk)+ξk+O(k−γ+supi≥k0E||Δ(n,θi,βi+1)||)

where is a random vector denoting the difference between the true gradient and stochastic gradient, and is the bias term generated by SA.

Given a test function of interest, let be the posterior average of under the invariant measure of the SDE (8). Let be the numerical samples, and define , where . Let be a functional which solves the Poisson equation

 Lψ(βk)=ϕ(βk)−¯ϕ.

Following a similar proof as in sg-mcmc-convergence , one can obtain the following results. The bias of PSGLD-SA is

 |E^ϕ−¯ϕ|≤1SK|Eψ(βK)−ψ(β0)|+K∑k=1ϵkSKE||ΔVkψ(βk−1)||+CK∑k=1ϵ2k

Formally, we note that in the above bound for the bias, the term is important. It is related to the bias introduced by stochastic approximation and ignoring . By Assumptions 2 in sg-mcmc-convergence on the smootheness and boundedness on the functional , and the boundedness of the preconditioner, it is easy to see that the bias introduced by stochastic approximation can be decomposed into (1) the the term in the bias, which approaches 0 as , and (2) which is a controllable bias. The bias introduced by ignoring can be bounded by according (6), which goes to 0 as .

The MSE of PSGLD-SA can be bounded by

 E(^ϕ−¯ϕ)2≤C(K∑k=11S2K+K∑k=1ϵ2kS2KE||ΔVkψ(βk−1)||2+(∑Kk=1ϵk)2S2K)

which converges as long as is bounded.

Thus we conclude that, as and , the distribution of converges weakly to the target posterior with a controllable bias. The bias is expected to decrease if we enlarge the mini-batch size to approximate the gradient.

## 6 Numerical examples

### 6.1 Small n large p problem

We first test on a linear regression problem, where the model parameters

, and predictors . We take a dataset with observations and predictors. , for .

For the first test (section 6.1 test 1), we use with to simulate predictor values . The responses , and . The hyperparameters used for SGLD-SA are: , , , , , , . The hyperparameters used for PSGLD-SA are: , , , , , , , . The learning rate is , and the step size to update latent variables is . The performance of SGLD-SA, PSGLD and PSGLD-SA are compared and presented in Figure 1. It shows that both SGLD-SA and PSGLD-SA work similarly well for this setting. The variance in model parameters and are similar, and they can be quantified correctly in both settings. However, without stochastic approximation, vanilla PSGLD cannot capture the uncertainties propoerly. However, Figure 1 (c)-(d) show that preconditioned methods converge a little bit faster according to the testing curves.

In the second test (section 6.1 test 2), we change the predictors described in test 1 a little bit. That is, we multiply on the first column of the predictor (), to create different scales in the predictor. The response values

and true regression coefficients are set to be similar to before. In this case, the uncertainties in the posterior estimation

and will have different scales. We also compare the performance of SGLD-SA PSGLD, and PSGLD-SA and present them in Figure 2. In this example, we see that PSGLD-SA outperforms PSGLD and SGLD-SA the standard approach obviously.

### 6.2 Elliptic problem with heterogeneous coefficients

Next, we apply the proposed approaches to solve the elliptic problem with heterogeneous coefficients. The mixed formulation of the elliptic problem reads:

 κ−1u+∇p =0inΩ div(u) =finΩ u⋅n =uNonΓN p =pDonΓD

where represents the heterogeneous permeability field which can be generated using Karhunen-Loeve expansion. is a constant source term, is a squared computational domain , and . The boundary conditions are at and , at , and at .

Specifically, the permeability fields can be constructed as follows:

 κH(x;μ)=κ0+p∑j=1μj√ξjΦj(x)

where is a constant permeability, denotes the mean of the random field. corresponds to a random contribution obtained from Karhunen-Loeve expansion, and describes the uncertainty in the permeability field. are random numbers drawn from i.i.d . are the eigen-pairs obtained from a Gaussian covariance kernel:

 Cov(xi,yi;xj,yj)=σexp(|xi−xj|2l2x−|yi−yj|2l2y)

where we choose , and in our example.

In the discretized system, we use element for the velocity space , and piecewise constant element for the pressure solution space .

 a(u,v)+b(v,p) =∫ΓΩpDv⋅nds for all v∈Vh b(u,q) =−(f,q) for all q∈Qh

where , and .

The discrete system has the following matrix formulation

 [Ah(κ)BThBh0][uhph]=[GD−F] (11)

However, due to the multiscale nature of , a sufficiently fine mesh is required to resolve all scale properties. Thus the fine matrix has a large size, leading to some difficulties in solving the linear system. To overcome these, one can develop a reduced order model as a surrogate. Numerous mixed multiscale methods have been explored MixedGMsFEM ; Arbogast_mixed_MS_11 ; ae07 . For example, in MixedGMsFEM , one aims to construct velocity multiscale basis in each local coarse region, and use the piecewise constant on coarse grid to approximate the pressure. Typically, let be the dimension of the multiscale velocity space, and denote by the matrix assembled using multiscale velocity basis in every row, then maps from to , where is the fine degrees of freedom for the velocity. Similarly, denote by the matrix containing coarse grid piecewise constant basis for pressure which maps from to . Then one can rewrite the system 11 in the following form

 [AHBTHBH0][uHpH]=[Ru00Rp][Ah(κ)BThBh0][RTu00RTp][uHpH]=[0−FH] (12)

where can be viewed as an encoder which maps from fine grid to coarse grid (upscaling), and acts as an decoder which maps from coarse grid to fine grid (downscaling).

The coarse grid solver reveals its efficiency when we need to solve flow problem with varying source or boundary conditions, while with a fixed permeability field. However, in a lot of applications, it is more interesting to solve for the velocity given different . When the permeability fields vary, one needs to reconstruct the multiscale basis (reconstruct the matrix ) in the above mentioned multiscale method framework, which is not practical.

As discussed in wang_multiphase , we will construct an encoding-decoding type of network to approximate the relationship between the permeability fields and fine grid velocity solution . That is, . The proposed network structure is in analogy to the coarse-scale solver but will take permeability fields as input without constructing a set of the multiscale bases for each case.

The idea is to first apply a few convolution layers to extract features from the input permeability with size , and then project the extracted features on a coarser mesh by employing an average pooling layer. The intermediate output is then flattened and is linked to neurons with a fully connected layer. This procedure is in analogy to upscaling. We will then reshape the hidden coarse grid features to an image with size . A few locally connected layers or convolution layers are followed to mimic the coarse grid solver.

After that, the resulting hidden features are flattened again and are fully connected with neurons in the next layer, where is the degrees of freedom for multiscale velocity space. It is natural to represent the coarse grid velocity using a vector since the degrees of freedom are not located at coarse grid centers, which makes it not obvious to reshape it as a square image. Finally, we decode the coarse level features using a densely connected layer, and we obtain fine grid velocity output with dimension . The network architecture is illustrated in Figure 3.

However, the last downscaling layer is still fully connected. Due to the large degrees of freedom for the velocity solution, the last fully connected layer contributes very large numbers of trainable weights. Here, we would like to use our proposed sparse learning method to tackle this difficulty.

The training and testing data can be generated by solving the equations with a mixed finite element method on the fine grid for various permeability fields. An illustrations of the permeability fields for and corresponding their corresponding solutions are presented in 4. We can see that when becomes larger, the velocity solutions exhibit many more scale features.

We generate samples pairs , and randomly pick samples to train the network, and take the rest of the samples for validation. The size of an input permeability is , an output velocity solution vector is . The network first uses 2 convolution layers with window size , and and channels respectively. Then, an average pooling layer with pool size is followed by a flatten layer and then a fully connected layer with nodes. This part of the network can be viewed as an encoder. Then, a reshaping layer, another two convolution layers, a flatten layer, and a dense layer with neurons are used to mimic the coarse grid solver. In the end, a dense layer is used as a decoder. The total number of parameters is in the entire network, and the layers we choose to perform sparse learning contains parameters.

The numerical results using SGLD, PSGLD, SGLD-SA and PSGLD-SA are presented in Table 1. Denote by and where . The mean relative errors and among 300 testing samples are shown. We can see that, the results using PSGLD-SA outperforms SGLD-SA in all three cases when . Some predicted results for different values of in KLE are presented in Figures 6, 7 and 8. We can actually see that, the predictions using SGLD-SA is very poor, but the results using PSGLD-SA are very similar to the ground truth.

In this example, we choose the sparse rate to be and . By choosing appropriate hyperparameters, we can achieve similar accuracy for the dense cases and sparse cases as shown in Table 1. This indicates enforcing sparsity using our method can maintain accuracy while reducing storage/computational cost. But if the sparse rate is too large, we find it’s hard to get comparable results since over sparse network may not be sufficient to represent the properties of the target map of interest. On the other hand, compare PSGLD and SGLD, we notice that applying preconditioners can provide better results. The learning curves are presented in 5. It shows that PSGLD-SA converges faster than SGLD-SA or vanilla PSGLD in all three cases.

### 6.3 Elliptic problem with channelized media

Last, we employ the proposed algorithm to predict the solution of an elliptic problem with channelized media. The problem setup is the same as in section 6.2. However, the background permeability fields are images of channelized media. The image size for our problem is , which are patches cropped from the channelized media in laloy2018training . An illustration of the permeability data and corresponding solutions are presented in Figure 9. We generate data pairs and randomly split them into and for training and testing purposes respectively.

We set the sparse rate to be , , and . The performance of our proposed method PSGLD-SA compared with vanilla SGLD, vanilla PSGLD, and SGLD-SA is shown in Table 2. We see that PSGLD-SA outperforms SGLD-SA in all three sparse cases. PSGLD-SA also results in more accurate results compared with vanilla PSGLD. The test learning curves are presented in Figure 10. It shows that preconditioning helps to improve convergence speed. With stochastic approximation, PSGLD-SA provides better results compared with PSGLD. Some samples are presented in Figure 11. It is clear that predicted velocity solutions using PSGLD-SA captures the heterogeneities in the underlying problem, and look very close to the true solutions. However, the predicted solutions obtained from SGLD-SA are not reliable.

## 7 Conclusion

We proposed a Bayesian sparse learning algorithm, where the model parameters are adaptively trained from a Bayesian mixture deep neural network, and the latent variables are smoothly learned through optimization. The Bayesian hierarchical model adopts SSGL priors, and samples are generated from the posterior using preconditioned Stochastic gradient descent Markov Chain Monte Carlo (PSGLD). PSGLD incorporates local curvature information in parameter updating scheme, such that constant step size is adequate and slow mixing can be avoided. Due to the diagonal form of the preconditioning matrix, PSGLD needs less computational and storage cost compared to SGRLD. Moreover, we apply stochastic approximation techniques in the sequentially updated preconditioning matrix, the bias on the MSE introduced due to ignoring a correction term will approach zero. The convergence of the proposed algorithm is discussed. Numerical simulations are performed to learn the solutions of elliptic PDE with heterogeneous coefficients. Sparse learning with preconditioned SGLD sampling algorithm is shown to be helpful to accelerate the learning process and the trained sparse models which can be used as computational efficient surrogates for solving the underlying PDE. The algorithm can also be extended to solve other heterogeneous problems, and applied to the multi-fidelity framework. Moreover, we may construct appropriate network structure and enforce sparsity according to physical information, such that we can interpret the sparse network obtained physically.

## Acknowledgement

We gratefully acknowledge the support from the National Science Foundation (DMS-1555072, DMS-1736364, CMMI-1634832, and CMMI-1560834), and Brookhaven National Laboratory Subcontract 382247. The authors would also like to acknowledge the support from NVIDIA Corporation for the donation of the Titan Xp GPU used for this research.

## References

• (1) J. Aarnes and Y. Efendiev, Mixed multiscale finite element for stochastic porous media flows, SIAM J. Sci. Comput., 30 (5) (2008), pp. 2319–2339.
• (2) T. Arbogast, Homogenization-based mixed multiscale finite elements for problems with anisotropy, Multiscale Model. Simul., 9 (2011), pp. 624–653.
• (3) A. Bordes, L. Bottou, and P. Gallinari, Sgd-qn: Careful quasi-newton stochastic gradient descent

, Journal of Machine Learning Research, 10 (2009), pp. 1737–1754.

• (4) R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, A stochastic quasi-newton method for large-scale optimization, SIAM Journal on Optimization, 26 (2016), pp. 1008–1031.
• (5) C. Chen, N. Ding, and L. Carin, On the convergence of stochastic gradient mcmc algorithms with high-order integrators., In Advances in Neural Information Processing Systems, (2015), pp. 2278–2286.
• (6) R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, Neural ordinary differential equations, arXiv preprint arXiv:1806.07366, (2018).
• (7) T. Chen, E. Fox, and C. Guestrin, Stochastic gradient hamiltonian monte carlo, in International conference on machine learning, 2014, pp. 1683–1691.
• (8) S. W. Cheung, E. T. Chung, Y. Efendiev, E. Gildin, Y. Wang, and J. Zhang, Deep global model reduction learning in porous media flow simulation, Computational Geosciences, 24 (2020), pp. 261–274.
• (9) E. Chung, Y. Efendiev, and C. Lee, Mixed generalized multiscale finite element methods and applications, SIAM Multicale Model. Simul., 13 (2014), pp. 338–366.
• (10) E. Chung, Y. Efendiev, W. T. Leung, M. Vasilyeva, and Y. Wang, Online adaptive local multiscale model reduction for heterogeneous problems in perforated domains, Applicable Analysis, 96 (2017), pp. 2002–2031.
• (11) Y. Dauphin, H. De Vries, and Y. Bengio, Equilibrated adaptive learning rates for non-convex optimization, in Advances in neural information processing systems, 2015, pp. 1504–1512.
• (12) Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in neural information processing systems, 2014, pp. 2933–2941.
• (13) W. Deng, X. Zhang, F. Liang, and G. Lin, An adaptive empirical bayesian method for sparse deep learning., In Advances in Neural Information Processing Systems, (2019), pp. 5564–5574.
• (14) Y. Efendiev, J. Galvis, and T. Hou, Generalized multiscale finite element methods (gmsfem), Journal of Computational Physics, 251 (2013), pp. 116–135.
• (15) Y. Fan, L. Lin, L. Ying, and L. Zepeda-Núnez, A multiscale neural network based on hierarchical matrices, arXiv preprint arXiv:1807.01883, (2018).
• (16) M. Girolami and B. Calderhead, Riemann manifold langevin and hamiltonian monte carlo methods., Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2011), pp. 123–214.
• (17) J. He and J. Xu,

Mgnet: A unified framework of multigrid and convolutional neural network

, Science china mathematics, 62 (2019), pp. 1331–1354.
• (18) Y. Khoo, J. Lu, and L. Ying, Solving parametric pde problems with artificial neural networks, arXiv:1707.03351, (2017).
• (19) E. Laloy, R. Hérault, D. Jacques, and N. Linde, Training-image based geostatistical inversion using a spatial generative adversarial neural network, Water Resources Research, 54 (2018), pp. 381–406.
• (20) C. Li, C. Chen, D. Carlson, and L. Carin, Preconditioned stochastic gradient langevin dynamics for deep neural networks.

, In Thirtieth AAAI Conference on Artificial Intelligence, (2016).

• (21) J. Lin, Y. Rao, J. Lu, and J. Zhou, Runtime neural pruning, in Advances in Neural Information Processing Systems, 2017, pp. 2181–2191.
• (22) Y.-A. Ma, T. Chen, and E. Fox, A complete recipe for stochastic gradient mcmc, in Advances in Neural Information Processing Systems, 2015, pp. 2917–2925.
• (23) P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, (2016).
• (24) S. Patterson and Y. W. Teh., Stochastic gradient riemannian langevin dynamics on the probability simplex., In Advances in neural information processing systems, (2013), pp. 3102–3110.
• (25) T. Qin, K. Wu, and D. Xiu, Data driven governing equations approximation using deep neural networks, arXiv preprint arXiv:1811.05537, (2018).
• (26) M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part i): Data-driven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).
• (27) M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part ii): Data-driven discovery of nonlinear partial differential equations, arXiv preprint arXiv:1711.10566, (2017).
• (28) V. Ročková and E. I. George, Emvs: The em approach to bayesian variable selection, Journal of the American Statistical Association, 109 (2014), pp. 828–846.
• (29) V. Ročková and E. I. George, Emvs: The em approach to bayesian variable selection., Journal of the American Statistical Association, 109 (2014), pp. 828–846.
• (30) U. Simsekli, R. Badeau, T. Cemgil, and G. Richard, Stochastic quasi-newton langevin monte carlo, 2016.
• (31) T. Tieleman and G. Hinton, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning, 4 (2012), pp. 26–31.
• (32) M. Wang, S. W. Cheung, E. T. Chung, Y. Efendiev, W. T. Leung, and Y. Wang, Prediction of discretization of gmsfem using deep learning, arXiv preprint arXiv:1810.12245, (2018).
• (33) M. Wang, S. W. Cheung, W. T. Leung, E. T. Chung, Y. Efendiev, and M. Wheeler, Reduced-order deep learning for flow dynamics. the interplay between deep learning and model reduction, Journal of Computational Physics, 401 (2020), p. 108939.
• (34) Y. Wang, S. W. Cheung, E. T. Chung, Y. Efendiev, and M. Wang, Deep multiscale model learning, Journal of Computational Physics, 406 (2020), p. 109071.
• (35) Y. Wang and G. Lin, Efficient deep learning techniques for multiphase flow simulation in heterogeneous porousc media., Journal of Computational Physics, 401 (2020), p. 108968.
• (36) E. Weinan and B. Yu, The deep ritz method: A deep learning-based numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018), pp. 1–12.
• (37) M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient langevin dynamicsn, In Proceedings of the 28th international conference on machine learning (ICML-11), (2011), pp. 681–688.
• (38) D. Zhang, L. Lu, L. Guo, and G. E. Karniadakis, Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems, Journal of Computational Physics, 397 (2019), p. 108850.
• (39) Y. Zhang and C. A. Sutton, Quasi-newton methods for markov chain monte carlo, in Advances in Neural Information Processing Systems, 2011, pp. 2393–2401.
• (40) Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447.
• (41) Y. Zhu, N. Zabaras, P.-S. Koutsourelakis, and P. Perdikaris, Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data, Journal of Computational Physics, 394 (2019), pp. 56–81.