1 Introduction
Deep neural networks have attracted extensive attention in recent times. Due to their powerful potential in approximating highdimensional nonlinear maps, and universal approximation property to represent a rich class of functions, DNNs have been successfully employed in problems from various research areas. However, effectively training DNN is still challenging due to the difficulty of escaping local minima in nonconvex optimization and avoiding overfitting in overparameterized networks.
Bayesian learning is appealing because of its ability to capture uncertainties in the model parameter, and MCMC sampling helps to address the overfitting issue. There has been extensive work bringing the Bayesian methods to the context of DNN optimization. The stochastic gradient Langevin dynamics (SGLD) sgld is first proposed and becomes a popular approach in the family of stochastic gradient MCMC algorithmschen2014stochastic ; ma2015complete ; PSGLD . SGLD is the firstorder Euler discretization of Langevin diffusion with stationary distribution on Euclidian space. It can be viewed as adding some noise to a standard stochastic gradient optimization algorithm. Since it resembles SGD, SGLD inherits the advantage of SGD where the gradients are stochastically approximated using minibatches. This makes MCMC scalable and provides a seamless transition between stochastic optimization and posterior sampling. It was shown that samples from SGLD will converge to samples from the true posterior distribution with annealed step size sgmcmcconvergence ; sgld .
In DNN, the underlying models may have complicated geometric properties and possess nonisotropic target density functions dauphin2014identifying ; PSGLD ; chen2014stochastic . When the components of parameters have different curvature, generating samples using a universal step size for every model parameter may cause slow mixing and can be inefficient. In the optimization literature, there are many approaches to accelerate the gradient descent, such as preconditioning and Newton’s method dauphin2015equilibrated ; zhang2011quasi ; byrd2016stochastic ; bordes2009sgd
. However, naively borrowing this idea and using a preconditioning matrix in SGLD fails to produce a proper MCMC scheme, the Markov chain does not target the underlying posterior except for a few cases
PSGLD ; simsekli2016stochastic. Considering that a Langevin diffusion with invariant measure can be directly defined on a Riemannian manifold, and the expected Fisher information is one typical choice for the Riemannian metric tensor
Riemann_LD_HMC , SGRLD is proposed SGRLD. Builtup from Riemannian Langevin dynamics, SGRLD is a discretization of the Riemannian Langevin dynamics and the gradients are approximated stochastically. It incorporates local curvature information in parameter updating scheme, such that constant step size is adequate along with all directions. However, the full expected Fisher information is usually intractable. A more computationally efficient preconditioner is needed to approximate secondorder Hessian information. Preconditioned SGLD adopts the same preconditioner as introduced in RMSprop
tieleman2012lecture as discussed in PSGLD which reduces the computational and storage cost. One can update the preconditioner sequentially taking into account the current gradient and previous preconditioning matrix. The preconditioner is in a diagonal form and can handle scale differences in the target density. However, the algorithm in PSGLD introduces a permanent bias on the MSE due to ignoring a correction term in the updating equation.On the other hand, DNN models are usually over parameterized and require extensive storage capacity as well as a lot of computational power. The over specified models may also lead to bad generalization and large prediction variance. Enforcing sparsity in the network is necessary. In
sgldsa , the authors propose an adaptive empirical Bayesian method for sparse learning. The idea is to incorporate an adaptive empirical Bayesian model selection techniques with SGMCMC sampling algorithm (SGLDSA). In SGLDSA algorithmsgldsa, one adopts a spikeandslab prior and obtains a Bayesian mixture DNN model. The model parameters are sampled from the adaptive hierarchical posterior using SGMCMC, and the hyperparameters in the priors are optimized via stochastic approximation adaptively. The algorithm automatically searches and penalizes the low probability parameters, and identifies promising sparse high posterior probability models
EMVS . One can also apply a pruning strategy to cut off model parameters with small magnitudes to further enforce sparsity in the network molchanov2016pruning ; lin2017runtime . The performance of the sparse approach is demonstrated with numerous examples, and the method is also shown to be robust in adversarial attacks. Theoretically, the authors show that the proposed algorithm can asymptotically converge to the correct distribution.In support of the advantages and considering the issues of the abovementioned methods, we incorporate the preconditioned SGLD methods with sparse learning. We will apply the proposed method to learn solutions of partial differential equations with heterogeneous coefficients. Numerous approaches have been proposed to numerically solve ODEs and PDEs with deep neural networks, for example, parametric PDE Ying_paraPDE , ODE systems driven by data NeuralODE ; QinXiu2018dataODE , timedependent multiscale problems wang2018NLMCDNN ; wang2020reduced and physical informed DNN (PINN1 ; PINN2 ; zhang2019quantifying ; PCDL_nz ). Moreover, various types of network architectures are constructed to achieve efficient learning based on existing fast numerical solvers. These approaches include designing multigrid neural networks Fan2018MNNH ; he2019mgnet , constructing multiscale models wang2018Gmsfem ; wang2018NLMCDNN ; wang_multiphase , learning surrogate reducedorder models by deep convolution networks deepconv_Zabaras ; E_deepRitz ; cheung2020deep and so on.
This work attempts to design an efficient sparse deep learning algorithm, and apply it to learn the solutions elliptic PDE with heterogeneous coefficients. Numerical simulations for these problems are challenging since it naturally contains heterogeneities from various scales as well as uncertainties. Based on model reduction idea, for example, generalized multiscale finite element methods (GMsFEM) GMsFEM13 ; MixedGMsFEM ; OnlineStokes , the authors wang_multiphase
design appropriate sparse DNN structure to learn the map from the heterogeneous permeability to velocity fields in Darcy’s flow. The idea is to apply locally connected/convolutional layers which can be an analogy to the upscaling and downscaling procedures in multiscale methods. However, the network is still over parameterized. In particular, the last decoding step joins neurons representing features on the coarser level to the neurons representing the finescale solutions and is realized by a fully connected layer. Due to the large degrees of freedom in the fine grid solution, the number of parameters in the network will be very large and result in inefficient training. Our main contribution is to bring together preconditioned SGLD and stochastic approximation to achieve efficient and sparse learning. We propose an adaptive empirical Bayesian algorithm, where the neural network parameters are sampled from a Bayesian mixture model using PSGLD method, and the latent variables are smoothly optimized during stochastic approximation. PSGLD incorporates local curvature information in parameter updating scheme, thus it is suitable to deal with our problem which possesses multiscale nature. More importantly, we will sequentially update the preconditioning matrix under the framework of stochastic approximation, such that the bias introduced by ignoring the correction term in the sampling approaches to zero asymptotically. We theoretically show the convergence of the proposed algorithm and demonstrate its performance in several numerical experiments.
The paper is organized as follows. In Section 2, we review some basic ideas in SGLD, SGRLD. In Section 3, the sparse adaptive empirical Bayesian approach is reviewed. Our main algorithm which combines preconditioned SGLD with sparse learning is explored in Section 4. Its convergence is discussed in Section 5. Applying the proposed method to a largepsmalln regression problem, and to learn solutions of elliptic problems with heterogeneous coefficients, its performances are presented in Section 6. A conclusion is made in the last Section 7.
2 Stochastic gradient Langevin dynamics (SGLD) and stochastic gradient Riemann Langevin dynamics(SGRLD)
Throughout the paper, we denote by the model parameters with as a prior distribution, and the entire dataset, where is an inputoutput pair for the model. Let be the likelihood, the posterior is then . SGLD combines the idea from stochastic gradient algorithms and posterior Bayesian sampling using Langevin dynamics. The loss gradient is approximated efficiently use minibatches of data in SGLD, and the uncertainties in the model parameter can be captured through Bayesian learning to avoid overfitting. The model parameters update as follows:
where for a subset of data points
is the stochastic gradient computed using a minibatch, which is used to approximate the true gradient .
However, if the components of the model parameter
possess different scales, the invariant probability distribution for the Langevin equation is not isotropic, using standard Euclidian distance may lead to slow mixing. Stochastic Gradient Riemann Langevin Dynamics (SGRLD)
SGRLD is a generalization of SGLD on a Riemannian manifold. In this case, consider the probability models on a Riemann manifold with some metric tensor , the parameter updates can be guided using the geometric information of this manifold as follows:(1) 
where is an additional drift term and . The he expected Fisher information can be used as a natural metric tensor, however it is intractable in many cases. One can choose a more practical metric tensor and use it as a preconditioning matrix.
3 SGLD with stochastic approximation (SGLDSA)
In order to achieve sparse learning in DNN, in sgldsa , the authors propose an adaptive empirical Bayesian method. It assumes that the weight parameters , the th neuron in the th layer, follows spikeandslab Gaussian Laplace prior
where
are the latent binary variable selection indicators,
is the Laplace distribution, andis the Normal distribution. The error variance
follows an inverse gamma prior . The prior forfollows a Bernoulli distribution,
, which incorporate uncertainty regarding which variables need be included in the model. Here, , and follows where , are some positive constants..Let be the th minibatch of the dataset. The likelihood for a regression problem can be rewritten as
where denotes a map describing the inputoutput relationship from to .
Then, the posterior follows
(2) 
Now treat as “missing data”. At iteration , instead of sampling from true posterior with respect to the whole dataset , one needs to sample from with respect to a minibatch
and it can be separated as
where
where denotes sparse layers, and denotes nonsparse layers.
The adaptive empirical Bayesian algorithm samples from and iteratively optimize with respect to via stochastic approximation as in Algorithm 1.
We note that the update formula of latent variables are motivated by EM approach to Bayesian variable selection (EMVS) rovckova2014emvs . In Algorithm 1, , is the step size in updating latent variables, and , is the positive root to the following quadratic formula:
4 Preconditioned SGLD with stochastic approximation (PSGLDSA)
As seen in Section 2, all model parameters are updated using the same learning rate
, this may cause slow mixing if the loss function has very different scales in different directions, and a small enough learning rate is required to avoid divergence in the largest positive curvature direction.
Here, we will introduce a preconditioning matrix to guide the updating directions during sampling. In gradient descent algorithms, the optimization can be improved using the second order information, i.e. the inverse of the Hessian matrix, as the preconditioning matrix. However, it is too computationally expensive to store and invert the full Hessian during the training. An efficient approximation is to use the same preconditioner as in RMSprop tieleman2012lecture . The idea is to scale the gradient using a moving average of its recent norm in each iteration, so that one can adapt the step size separately for each weight. By keep a moving average for each weight parameter from the previous step, one can control the changes among adjacent mini batches. We propose a sequentially updated preconditioner using the stochastic approximation idea as follows
(3)  
(4) 
where is a regularization constant, and , . Importantly, we note that the weight parameter is a sequence approaching as the time step increases, which is different from the constant in PSGLD . The change in the parameters will then be
(5) 
where .
We note that in PSGLD , is ignored in practice, and is a constant. This produces a permanent bias on the MSE. To address this issue, we let gradually approach during the adaptive optimization of the latent variables, then the bias mentioned before will decrease. To be specific, we have
Then we have
(6) 
due to the assumption that the derivative of the gradients are bounded, for some constant .
Typically, let be in the form of for some , and constants , we can see that the bias introduced on the MSE will approach as .
Thus, our proposed adaptive preconditioned SGLD samples and optimizes as in Algorithm2.
5 Convergence results
Now, we will discuss the weak convergence of our proposed algorithm PSGLDSA. First, we will take a look at the hyperparameters. Denote by all the hyperparameters . The stochastic approximation attempts to get the optimal based on the asymptotically target distribution . Define , where represents a function to obtain optimal given current model parameters . Denote by its mean field function . SA aims to solve the fixed point equation , which is to find the root of the equation . As described in Algorithm 2, in each iteration, we first sample using precontioned SGLD based on , then update the latent variables using
where the map is motivated by EMVS. However, we only use a small set of data of samples instead of the full set in the computation of obtaining optimal latent variables. This will result a bias at each step. That is, we actually use with
(7) 
and we assume for some constant .
Following a similar proof in sgldsa , under suitable assumptions, the adaptive empirical Bayesian method for sparse approximation algorithm has the following convergence results. The details of the proof are in Appendix A.
Theorem 1.
For a sufficiently large , there exists a constant such that
Next, we present a weak convergence result of the model parameters.
Corollary 1.
Under Assumptions 2 in sgmcmcconvergence , the bias and MSE of PSGLDSA for steps with decreasing step size is bounded, the distribution of converges weakly to the target posterior with a controllable bias, as and .
Proof.
With geometric information for probability models, the Langevin diffusion on the manifold is described by
(8) 
where is the Brownian motion.
Denote by the generator for (8), then
(9) 
The generator is associated with the backward Kolmogorov equation
In PSGLDSA, one will sample from the adaptive hierarchical posterior using (3) (5), and gradually optimize the latent variables through stochastic approximation.
Write the local generator of our proposed algorithm as
(10) 
where , and
Thus
where is a random vector denoting the difference between the true gradient and stochastic gradient, and is the bias term generated by SA.
Given a test function of interest, let be the posterior average of under the invariant measure of the SDE (8). Let be the numerical samples, and define , where . Let be a functional which solves the Poisson equation
Following a similar proof as in sgmcmcconvergence , one can obtain the following results. The bias of PSGLDSA is
Formally, we note that in the above bound for the bias, the term is important. It is related to the bias introduced by stochastic approximation and ignoring . By Assumptions 2 in sgmcmcconvergence on the smootheness and boundedness on the functional , and the boundedness of the preconditioner, it is easy to see that the bias introduced by stochastic approximation can be decomposed into (1) the the term in the bias, which approaches 0 as , and (2) which is a controllable bias. The bias introduced by ignoring can be bounded by according (6), which goes to 0 as .
The MSE of PSGLDSA can be bounded by
which converges as long as is bounded.
Thus we conclude that, as and , the distribution of converges weakly to the target posterior with a controllable bias. The bias is expected to decrease if we enlarge the minibatch size to approximate the gradient.
∎
6 Numerical examples
6.1 Small large problem
We first test on a linear regression problem, where the model parameters
, and predictors . We take a dataset with observations and predictors. , for .For the first test (section 6.1 test 1), we use with to simulate predictor values . The responses , and . The hyperparameters used for SGLDSA are: , , , , , , . The hyperparameters used for PSGLDSA are: , , , , , , , . The learning rate is , and the step size to update latent variables is . The performance of SGLDSA, PSGLD and PSGLDSA are compared and presented in Figure 1. It shows that both SGLDSA and PSGLDSA work similarly well for this setting. The variance in model parameters and are similar, and they can be quantified correctly in both settings. However, without stochastic approximation, vanilla PSGLD cannot capture the uncertainties propoerly. However, Figure 1 (c)(d) show that preconditioned methods converge a little bit faster according to the testing curves.
In the second test (section 6.1 test 2), we change the predictors described in test 1 a little bit. That is, we multiply on the first column of the predictor (), to create different scales in the predictor. The response values
and true regression coefficients are set to be similar to before. In this case, the uncertainties in the posterior estimation
and will have different scales. We also compare the performance of SGLDSA PSGLD, and PSGLDSA and present them in Figure 2. In this example, we see that PSGLDSA outperforms PSGLD and SGLDSA the standard approach obviously.6.2 Elliptic problem with heterogeneous coefficients
Next, we apply the proposed approaches to solve the elliptic problem with heterogeneous coefficients. The mixed formulation of the elliptic problem reads:
where represents the heterogeneous permeability field which can be generated using KarhunenLoeve expansion. is a constant source term, is a squared computational domain , and . The boundary conditions are at and , at , and at .
Specifically, the permeability fields can be constructed as follows:
where is a constant permeability, denotes the mean of the random field. corresponds to a random contribution obtained from KarhunenLoeve expansion, and describes the uncertainty in the permeability field. are random numbers drawn from i.i.d . are the eigenpairs obtained from a Gaussian covariance kernel:
where we choose , and in our example.
In the discretized system, we use element for the velocity space , and piecewise constant element for the pressure solution space .
where , and .
The discrete system has the following matrix formulation
(11) 
However, due to the multiscale nature of , a sufficiently fine mesh is required to resolve all scale properties. Thus the fine matrix has a large size, leading to some difficulties in solving the linear system. To overcome these, one can develop a reduced order model as a surrogate. Numerous mixed multiscale methods have been explored MixedGMsFEM ; Arbogast_mixed_MS_11 ; ae07 . For example, in MixedGMsFEM , one aims to construct velocity multiscale basis in each local coarse region, and use the piecewise constant on coarse grid to approximate the pressure. Typically, let be the dimension of the multiscale velocity space, and denote by the matrix assembled using multiscale velocity basis in every row, then maps from to , where is the fine degrees of freedom for the velocity. Similarly, denote by the matrix containing coarse grid piecewise constant basis for pressure which maps from to . Then one can rewrite the system 11 in the following form
(12) 
where can be viewed as an encoder which maps from fine grid to coarse grid (upscaling), and acts as an decoder which maps from coarse grid to fine grid (downscaling).
The coarse grid solver reveals its efficiency when we need to solve flow problem with varying source or boundary conditions, while with a fixed permeability field. However, in a lot of applications, it is more interesting to solve for the velocity given different . When the permeability fields vary, one needs to reconstruct the multiscale basis (reconstruct the matrix ) in the above mentioned multiscale method framework, which is not practical.
As discussed in wang_multiphase , we will construct an encodingdecoding type of network to approximate the relationship between the permeability fields and fine grid velocity solution . That is, . The proposed network structure is in analogy to the coarsescale solver but will take permeability fields as input without constructing a set of the multiscale bases for each case.
The idea is to first apply a few convolution layers to extract features from the input permeability with size , and then project the extracted features on a coarser mesh by employing an average pooling layer. The intermediate output is then flattened and is linked to neurons with a fully connected layer. This procedure is in analogy to upscaling. We will then reshape the hidden coarse grid features to an image with size . A few locally connected layers or convolution layers are followed to mimic the coarse grid solver.
After that, the resulting hidden features are flattened again and are fully connected with neurons in the next layer, where is the degrees of freedom for multiscale velocity space. It is natural to represent the coarse grid velocity using a vector since the degrees of freedom are not located at coarse grid centers, which makes it not obvious to reshape it as a square image. Finally, we decode the coarse level features using a densely connected layer, and we obtain fine grid velocity output with dimension . The network architecture is illustrated in Figure 3.
However, the last downscaling layer is still fully connected. Due to the large degrees of freedom for the velocity solution, the last fully connected layer contributes very large numbers of trainable weights. Here, we would like to use our proposed sparse learning method to tackle this difficulty.
The training and testing data can be generated by solving the equations with a mixed finite element method on the fine grid for various permeability fields. An illustrations of the permeability fields for and corresponding their corresponding solutions are presented in 4. We can see that when becomes larger, the velocity solutions exhibit many more scale features.



We generate samples pairs , and randomly pick samples to train the network, and take the rest of the samples for validation. The size of an input permeability is , an output velocity solution vector is . The network first uses 2 convolution layers with window size , and and channels respectively. Then, an average pooling layer with pool size is followed by a flatten layer and then a fully connected layer with nodes. This part of the network can be viewed as an encoder. Then, a reshaping layer, another two convolution layers, a flatten layer, and a dense layer with neurons are used to mimic the coarse grid solver. In the end, a dense layer is used as a decoder. The total number of parameters is in the entire network, and the layers we choose to perform sparse learning contains parameters.
The numerical results using SGLD, PSGLD, SGLDSA and PSGLDSA are presented in Table 1. Denote by and where . The mean relative errors and among 300 testing samples are shown. We can see that, the results using PSGLDSA outperforms SGLDSA in all three cases when . Some predicted results for different values of in KLE are presented in Figures 6, 7 and 8. We can actually see that, the predictions using SGLDSA is very poor, but the results using PSGLDSA are very similar to the ground truth.
In this example, we choose the sparse rate to be and . By choosing appropriate hyperparameters, we can achieve similar accuracy for the dense cases and sparse cases as shown in Table 1. This indicates enforcing sparsity using our method can maintain accuracy while reducing storage/computational cost. But if the sparse rate is too large, we find it’s hard to get comparable results since over sparse network may not be sufficient to represent the properties of the target map of interest. On the other hand, compare PSGLD and SGLD, we notice that applying preconditioners can provide better results. The learning curves are presented in 5. It shows that PSGLDSA converges faster than SGLDSA or vanilla PSGLD in all three cases.
Dense  
PSGLD (/ % )  SGLD(/ % )  
KLE32  0.75/0.57  2.37 /2.17 
KLE64  0.82/0.63  2.38 /2.25 
KLE128  2.13 /1.93  2.90 /2.60 
PSGLDSA (/ % )  SGLDSA (/ % )  
Sparse rate 50%  
KLE32  0.59/0.56  2.67 /2.35 
KLE64  0.78 /0.58  2.68 /2.41 
KLE128  1.60 /1.31  3.47 /3.00 
Sparse rate 70%  
KLE32  0.58/ 0.51  2.28 /2.10 
KLE64  0.76 /0.61  2.40 /2.97 
KLE128  1.79 /1.60  3.51/3.02 
6.3 Elliptic problem with channelized media
Last, we employ the proposed algorithm to predict the solution of an elliptic problem with channelized media. The problem setup is the same as in section 6.2. However, the background permeability fields are images of channelized media. The image size for our problem is , which are patches cropped from the channelized media in laloy2018training . An illustration of the permeability data and corresponding solutions are presented in Figure 9. We generate data pairs and randomly split them into and for training and testing purposes respectively.


We set the sparse rate to be , , and . The performance of our proposed method PSGLDSA compared with vanilla SGLD, vanilla PSGLD, and SGLDSA is shown in Table 2. We see that PSGLDSA outperforms SGLDSA in all three sparse cases. PSGLDSA also results in more accurate results compared with vanilla PSGLD. The test learning curves are presented in Figure 10. It shows that preconditioning helps to improve convergence speed. With stochastic approximation, PSGLDSA provides better results compared with PSGLD. Some samples are presented in Figure 11. It is clear that predicted velocity solutions using PSGLDSA captures the heterogeneities in the underlying problem, and look very close to the true solutions. However, the predicted solutions obtained from SGLDSA are not reliable.
PSGLD (/ % )  SGLD (/ % )  

Dense  3.31/2.73  6.39/ 4.94 
PSGLDSA (/ % )  SGLDSA (/ % )  
Sparse rate 30%  2.73/2.13  3.94/ 3.12 
Sparse rate 50%  2.75/2.16  3.91 /3.05 
Sparse rate 70%  2.71/1.95  4.46/3.57 
7 Conclusion
We proposed a Bayesian sparse learning algorithm, where the model parameters are adaptively trained from a Bayesian mixture deep neural network, and the latent variables are smoothly learned through optimization. The Bayesian hierarchical model adopts SSGL priors, and samples are generated from the posterior using preconditioned Stochastic gradient descent Markov Chain Monte Carlo (PSGLD). PSGLD incorporates local curvature information in parameter updating scheme, such that constant step size is adequate and slow mixing can be avoided. Due to the diagonal form of the preconditioning matrix, PSGLD needs less computational and storage cost compared to SGRLD. Moreover, we apply stochastic approximation techniques in the sequentially updated preconditioning matrix, the bias on the MSE introduced due to ignoring a correction term will approach zero. The convergence of the proposed algorithm is discussed. Numerical simulations are performed to learn the solutions of elliptic PDE with heterogeneous coefficients. Sparse learning with preconditioned SGLD sampling algorithm is shown to be helpful to accelerate the learning process and the trained sparse models which can be used as computational efficient surrogates for solving the underlying PDE. The algorithm can also be extended to solve other heterogeneous problems, and applied to the multifidelity framework. Moreover, we may construct appropriate network structure and enforce sparsity according to physical information, such that we can interpret the sparse network obtained physically.
Acknowledgement
We gratefully acknowledge the support from the National Science Foundation (DMS1555072, DMS1736364, CMMI1634832, and CMMI1560834), and Brookhaven National Laboratory Subcontract 382247. The authors would also like to acknowledge the support from NVIDIA Corporation for the donation of the Titan Xp GPU used for this research.
References
 (1) J. Aarnes and Y. Efendiev, Mixed multiscale finite element for stochastic porous media flows, SIAM J. Sci. Comput., 30 (5) (2008), pp. 2319–2339.
 (2) T. Arbogast, Homogenizationbased mixed multiscale finite elements for problems with anisotropy, Multiscale Model. Simul., 9 (2011), pp. 624–653.

(3)
A. Bordes, L. Bottou, and P. Gallinari, Sgdqn: Careful quasinewton
stochastic gradient descent
, Journal of Machine Learning Research, 10 (2009), pp. 1737–1754.
 (4) R. H. Byrd, S. L. Hansen, J. Nocedal, and Y. Singer, A stochastic quasinewton method for largescale optimization, SIAM Journal on Optimization, 26 (2016), pp. 1008–1031.
 (5) C. Chen, N. Ding, and L. Carin, On the convergence of stochastic gradient mcmc algorithms with highorder integrators., In Advances in Neural Information Processing Systems, (2015), pp. 2278–2286.
 (6) R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, Neural ordinary differential equations, arXiv preprint arXiv:1806.07366, (2018).
 (7) T. Chen, E. Fox, and C. Guestrin, Stochastic gradient hamiltonian monte carlo, in International conference on machine learning, 2014, pp. 1683–1691.
 (8) S. W. Cheung, E. T. Chung, Y. Efendiev, E. Gildin, Y. Wang, and J. Zhang, Deep global model reduction learning in porous media flow simulation, Computational Geosciences, 24 (2020), pp. 261–274.
 (9) E. Chung, Y. Efendiev, and C. Lee, Mixed generalized multiscale finite element methods and applications, SIAM Multicale Model. Simul., 13 (2014), pp. 338–366.
 (10) E. Chung, Y. Efendiev, W. T. Leung, M. Vasilyeva, and Y. Wang, Online adaptive local multiscale model reduction for heterogeneous problems in perforated domains, Applicable Analysis, 96 (2017), pp. 2002–2031.
 (11) Y. Dauphin, H. De Vries, and Y. Bengio, Equilibrated adaptive learning rates for nonconvex optimization, in Advances in neural information processing systems, 2015, pp. 1504–1512.
 (12) Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, Identifying and attacking the saddle point problem in highdimensional nonconvex optimization, in Advances in neural information processing systems, 2014, pp. 2933–2941.
 (13) W. Deng, X. Zhang, F. Liang, and G. Lin, An adaptive empirical bayesian method for sparse deep learning., In Advances in Neural Information Processing Systems, (2019), pp. 5564–5574.
 (14) Y. Efendiev, J. Galvis, and T. Hou, Generalized multiscale finite element methods (gmsfem), Journal of Computational Physics, 251 (2013), pp. 116–135.
 (15) Y. Fan, L. Lin, L. Ying, and L. ZepedaNúnez, A multiscale neural network based on hierarchical matrices, arXiv preprint arXiv:1807.01883, (2018).
 (16) M. Girolami and B. Calderhead, Riemann manifold langevin and hamiltonian monte carlo methods., Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73 (2011), pp. 123–214.

(17)
J. He and J. Xu,
Mgnet: A unified framework of multigrid and convolutional neural network
, Science china mathematics, 62 (2019), pp. 1331–1354.  (18) Y. Khoo, J. Lu, and L. Ying, Solving parametric pde problems with artificial neural networks, arXiv:1707.03351, (2017).
 (19) E. Laloy, R. Hérault, D. Jacques, and N. Linde, Trainingimage based geostatistical inversion using a spatial generative adversarial neural network, Water Resources Research, 54 (2018), pp. 381–406.

(20)
C. Li, C. Chen, D. Carlson, and L. Carin, Preconditioned stochastic
gradient langevin dynamics for deep neural networks.
, In Thirtieth AAAI Conference on Artificial Intelligence, (2016).
 (21) J. Lin, Y. Rao, J. Lu, and J. Zhou, Runtime neural pruning, in Advances in Neural Information Processing Systems, 2017, pp. 2181–2191.
 (22) Y.A. Ma, T. Chen, and E. Fox, A complete recipe for stochastic gradient mcmc, in Advances in Neural Information Processing Systems, 2015, pp. 2917–2925.
 (23) P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, (2016).
 (24) S. Patterson and Y. W. Teh., Stochastic gradient riemannian langevin dynamics on the probability simplex., In Advances in neural information processing systems, (2013), pp. 3102–3110.
 (25) T. Qin, K. Wu, and D. Xiu, Data driven governing equations approximation using deep neural networks, arXiv preprint arXiv:1811.05537, (2018).
 (26) M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part i): Datadriven solutions of nonlinear partial differential equations, arXiv preprint arXiv:1711.10561, (2017).
 (27) M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics informed deep learning (part ii): Datadriven discovery of nonlinear partial differential equations, arXiv preprint arXiv:1711.10566, (2017).
 (28) V. Ročková and E. I. George, Emvs: The em approach to bayesian variable selection, Journal of the American Statistical Association, 109 (2014), pp. 828–846.
 (29) V. Ročková and E. I. George, Emvs: The em approach to bayesian variable selection., Journal of the American Statistical Association, 109 (2014), pp. 828–846.
 (30) U. Simsekli, R. Badeau, T. Cemgil, and G. Richard, Stochastic quasinewton langevin monte carlo, 2016.
 (31) T. Tieleman and G. Hinton, Lecture 6.5rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning, 4 (2012), pp. 26–31.
 (32) M. Wang, S. W. Cheung, E. T. Chung, Y. Efendiev, W. T. Leung, and Y. Wang, Prediction of discretization of gmsfem using deep learning, arXiv preprint arXiv:1810.12245, (2018).
 (33) M. Wang, S. W. Cheung, W. T. Leung, E. T. Chung, Y. Efendiev, and M. Wheeler, Reducedorder deep learning for flow dynamics. the interplay between deep learning and model reduction, Journal of Computational Physics, 401 (2020), p. 108939.
 (34) Y. Wang, S. W. Cheung, E. T. Chung, Y. Efendiev, and M. Wang, Deep multiscale model learning, Journal of Computational Physics, 406 (2020), p. 109071.
 (35) Y. Wang and G. Lin, Efficient deep learning techniques for multiphase flow simulation in heterogeneous porousc media., Journal of Computational Physics, 401 (2020), p. 108968.
 (36) E. Weinan and B. Yu, The deep ritz method: A deep learningbased numerical algorithm for solving variational problems, Communications in Mathematics and Statistics, 6 (2018), pp. 1–12.
 (37) M. Welling and Y. W. Teh, Bayesian learning via stochastic gradient langevin dynamicsn, In Proceedings of the 28th international conference on machine learning (ICML11), (2011), pp. 681–688.
 (38) D. Zhang, L. Lu, L. Guo, and G. E. Karniadakis, Quantifying total uncertainty in physicsinformed neural networks for solving forward and inverse stochastic problems, Journal of Computational Physics, 397 (2019), p. 108850.
 (39) Y. Zhang and C. A. Sutton, Quasinewton methods for markov chain monte carlo, in Advances in Neural Information Processing Systems, 2011, pp. 2393–2401.
 (40) Y. Zhu and N. Zabaras, Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification, Journal of Computational Physics, 366 (2018), pp. 415–447.
 (41) Y. Zhu, N. Zabaras, P.S. Koutsourelakis, and P. Perdikaris, Physicsconstrained deep learning for highdimensional surrogate modeling and uncertainty quantification without labeled data, Journal of Computational Physics, 394 (2019), pp. 56–81.
Comments
There are no comments yet.