Solving partial differential equations (PDEs) is among the most ubiquitous tools for simulating complicated phenomena in applied sciences and engineering. Classical numerical methods include the finite difference method LeVeque (2007), the finite element method (FEM) Elman et al. (2014), the discontinuous Galerkin method Cockburn et al. (2000), and the spectral method Shen et al. (2011); these methods are typically designed for low-dimensional PDEs and are well understood in terms of stability and accuracy. However, there are high-dimensional PDEs, such as the Schrödinger equation in the quantum many-body problem Dirac (1981), the Hamilton-Jacobi-Bellman equation in stochastic optimal control Bardi and Capuzzo-Dolcetta (2008), and the nonlinear Black-Scholes equation for pricing financial derivatives Hull (2009), that are far beyond the capability of classical numerical methods due to the curse of dimensionality: the number of unknowns grows exponentially as the dimension increases.
Only very recently have deep-learning-based methods been developed to solve these high-dimensional PDEs; see E et al. (2017); Giuseppe and Matthias (2017); E and Yu (2018); Han et al. (2018); Raissi (2018); Sirignano and Spiliopoulos (2018); Hutzenthaler et al. (2019); Raissi et al. (2019); Beck et al. (2019); Cervera (2019); Fan et al. (2019); Khoo et al. (2019); Beck et al. (2020); Wang et al. (2020); Zang et al. (2020); Discacciati et al. (2020) for examples. Typically, a deep-learning method for solving PDEs has three main ingredients (stages): (1) modeling: the loss (objective) function to be optimized; (2) architecture: the deep neural network (DNN) used for function approximation; (3) optimization: finding the set of parameters in the DNN that minimizes the loss function. By design, the number of parameters in DNNs grows at most polynomially in the dimension, and the possibly high-dimensional integrals in the loss function are approximated by the Monte Carlo method. Therefore, by design, deep learning overcomes the curse of dimensionality. In practice, deep learning performs well for the Schrödinger equation Giuseppe and Matthias (2017); Han et al. (2019), the Hamilton-Jacobi-Bellman equation Han et al. (2018); E et al. (2017), and the nonlinear Black-Scholes equation Beck et al. (2019); Cervera (2019).
Typically, deep learning solves a PDE in the following way. For the given PDE, the loss function is modeled as the equation residual in the least-squares sense, as in the deep Galerkin method (DGM) Sirignano and Spiliopoulos (2018), or as the variational form if one exists E and Yu (2018). ResNet He et al. (2015) is often used as the network architecture, since it has been shown to mitigate the notorious problem of vanishing/exploding gradients. Afterwards, a stochastic gradient descent method is used to find the set of parameters in the ResNet that minimizes the loss function; the ResNet with the optimized parameters then gives an approximation of the PDE solution.
In this work, we propose a deep mixed residual method (MIM) for solving high-order PDEs. In the modeling stage, by rewriting a given PDE into a first-order system, we obtain a larger problem in the sense that both the PDE solution and its high-order derivatives are unknown functions to be approximated. This has analogs in classical numerical methods, such as the local discontinuous Galerkin method Cockburn et al. (2000) and the mixed finite element method Boffi et al. (2013). Compared with DGM, MIM has two additional degrees of freedom:
In the loss function stage, one can include different high-order derivatives in the set of unknown functions. Take the biharmonic equation as an example: the set of unknown functions can include the PDE solution and its derivatives up to the third order, or only the PDE solution and its second-order derivatives, and both choices have analogs in the discontinuous Galerkin method Yan and Shu (2002); Cockburn et al. (2009). We then write the loss function as the sum of the equation residuals in the least-squares sense, very much in the same spirit as the least-squares finite element method Bochev and Gunzburger (2015).
In the architecture stage, one can choose the number of networks used to approximate the set of unknown functions. In one case, one DNN is used to approximate the PDE solution and other DNNs are used to approximate its high-order derivatives; in the other case, the PDE solution and its derivatives share nearly the same DNN.
These two degrees of freedom allow MIM to produce better approximations than DGM in all our examples, including the Poisson equation, the Monge-Ampère equation, the biharmonic equation, and the Korteweg-de Vries (KdV) equation. In particular, MIM provides better approximations not only for the high-order derivatives but also for the PDE solution itself. It is worth mentioning that the use of the mixed residual in deep learning was first introduced for surrogate modeling and uncertainty quantification of a second-order elliptic equation Zhu et al. (2019) and was later adopted in a deep domain decomposition method Li et al. (2019).
2 Deep mixed residual method
In this section, we introduce MIM and discuss its differences from DGM in terms of the loss function and the neural network architecture.
2.1 Loss function
Consider a potentially time-dependent nonlinear PDE over a bounded domain Ω ⊂ ℝᵈ:

  ∂_t u + 𝒩[u] = f  in Ω × (0, T],
  u = g  on ∂Ω × (0, T],
  u(·, 0) = h  in Ω,   (1)

where 𝒩 is a (possibly nonlinear) differential operator and ∂Ω denotes the boundary of Ω. In DGM, the loss function is defined as the PDE residual in the least-squares sense

  L(u) = ‖∂_t u + 𝒩[u] − f‖²_{2, Ω×(0,T]} + λ₁ ‖u(·, 0) − h‖²_{2, Ω} + λ₂ ‖u − g‖²_{2, ∂Ω×(0,T]},   (2)

where λ₁ and λ₂ are penalty parameters given a priori. The three terms in (2) measure how well the approximate solution satisfies the PDE, the initial condition, and the boundary condition, respectively.
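For intuition, the least-squares loss can be estimated by Monte Carlo sampling. Below is a minimal NumPy sketch (our own illustration, not the paper's code) for the one-dimensional Poisson problem −u″ = f on (0, 1) with u(0) = u(1) = 0; the manufactured solution sin(πx) is an assumption for illustration, and finite differences stand in for automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Manufactured problem: exact solution u(x) = sin(pi x), so f(x) = pi^2 sin(pi x).
u_exact = lambda x: np.sin(np.pi * x)
f = lambda x: np.pi**2 * np.sin(np.pi * x)

def dgm_loss(u, n_interior=4096, lam=1.0, h=1e-4):
    """Monte Carlo estimate of ||-u'' - f||^2 on (0,1) plus a boundary penalty."""
    x = rng.uniform(h, 1.0 - h, n_interior)            # interior sample points
    u_xx = (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2   # centered 2nd difference
    interior = np.mean((-u_xx - f(x)) ** 2)            # equation residual term
    boundary = u(0.0) ** 2 + u(1.0) ** 2               # u = 0 on the boundary
    return interior + lam * boundary

loss_exact = dgm_loss(u_exact)                 # near zero for the exact solution
loss_wrong = dgm_loss(lambda x: x * (1 - x))   # nonzero for a wrong trial function
```

The exact solution drives all three residual terms to (nearly) zero, while any other trial function leaves a nonzero residual, which is what gradient descent exploits.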
In the absence of temporal derivatives, (1) reduces to

  𝒩[u] = f in Ω,  u = g on ∂Ω,

and the corresponding loss function in DGM becomes

  L(u) = ‖𝒩[u] − f‖²_{2, Ω} + λ ‖u − g‖²_{2, ∂Ω}.   (3)
Table 1 lists four PDEs with their corresponding loss functions in DGM, and Table 2 lists different boundary conditions, the initial condition, and their contributions to the loss functions in DGM and MIM. More boundary conditions can be treated in this way; interested readers may refer to Chen et al. (2020) for details.
[Table 1: Equation | Explicit form | Loss function]
[Table 2: Condition | Explicit form | Contribution to the loss function]
In MIM, we first rewrite high-order derivatives into low-order ones using auxiliary variables. For notational convenience, the auxiliary variables represent

  p = ∇u,  q = ∇·p,   (4)

so that q = Δu. For the KdV equation, we have q_i = ∂p_i/∂x_i (i = 1, …, d) instead of the second formula in (4). With these auxiliary variables, we define loss functions for the four types of PDEs in Table 3. Since one can include only a subset of the high-order derivatives in the set of unknown functions, there is more than one possible loss function in MIM. For the biharmonic equation, there are two commonly used sets of auxiliary variables in the local discontinuous Galerkin method and the weak Galerkin finite element method: one with all high-order derivatives Yan and Shu (2002) and the other with only part of them Cockburn et al. (2009); Mu et al. (2015). Correspondingly, we obtain two versions of MIM, depending on whether all or only part of the high-order derivatives are introduced as unknowns. In Section 2.2, we will discuss how to equip different loss functions with different DNNs; in short, either a single DNN is used to approximate the PDE solution and all of its derivatives, or multiple DNNs are used. In Section 3, the different loss functions listed in Table 1, Table 2, and Table 3 will be tested and discussed. By default, all penalty parameters are set to the same fixed value.
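To make the rewriting concrete, here is a NumPy sketch (our illustration, with a manufactured solution) of the mixed residual for the one-dimensional Poisson problem −u″ = f: introducing p = u′ as an unknown and penalizing both p − u′ = 0 and −p′ − f = 0 leaves only first-order derivatives in the loss:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.pi**2 * np.sin(np.pi * x)   # so the exact solution is sin(pi x)

def mixed_residual(u, p, n=4096, h=1e-5):
    """Monte Carlo estimate of ||p - u'||^2 + ||-p' - f||^2 on (0, 1)."""
    x = rng.uniform(h, 1.0 - h, n)
    u_x = (u(x + h) - u(x - h)) / (2.0 * h)   # first derivative of u
    p_x = (p(x + h) - p(x - h)) / (2.0 * h)   # first derivative of p
    return np.mean((p(x) - u_x) ** 2) + np.mean((-p_x - f(x)) ** 2)

# The exact pair (u, p = u') gives a near-zero mixed residual ...
r_exact = mixed_residual(lambda x: np.sin(np.pi * x),
                         lambda x: np.pi * np.cos(np.pi * x))
# ... while an inconsistent pair does not.
r_wrong = mixed_residual(lambda x: np.sin(np.pi * x),
                         lambda x: np.zeros_like(x))
```

Note that both terms must vanish simultaneously: the second term alone would accept any p with −p′ = f, and the first term ties p back to the gradient of u.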
[Table 3: Equation | Explicit form | Loss function]
2.2 Neural network architecture
ResNet He et al. (2015) is used to approximate the PDE solution and its high-order derivatives. It consists of blocks of the following form:

  s_{k+1} = σ(W_{k,2} σ(W_{k,1} s_k + b_{k,1}) + b_{k,2}) + s_k,  k = 0, 1, …, n − 1.   (5)

Here s₀ is the network input, W_{k,1}, W_{k,2} ∈ ℝ^{m×m}, and b_{k,1}, b_{k,2} ∈ ℝᵐ. n is the depth of the network, m is the width of the network, and σ is the (scalar) activation function. Explicit formulas of the activation functions used in this work are given in Table 4. The last term s_k on the right-hand side of (5) is the shortcut (residual) connection; see Figure 1 for a demonstration. Such a structure helps to avoid the notorious problem of vanishing/exploding gradients He et al. (2016).
Since the input x is in ℝᵈ rather than ℝᵐ, we pad x by a zero vector to obtain the network input; a linear transform could be used as well without much difference. Meanwhile, the ResNet has m outputs, which cannot be directly used for the PDE solution and its derivatives employed in the loss function. Therefore, a linear transform is applied to the ResNet output to map it to a suitable dimension. Let θ be the whole set of parameters, which includes the parameters in the ResNet ({W_{k,i}, b_{k,i}}) and the parameters in the final linear transform. Note that the output dimension in MIM depends on both the PDE problem and the choice of mixed residual loss. We illustrate the network structures for the biharmonic equation as an example in Figure 2.
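The padding, residual blocks, and final linear transform can be sketched as follows (NumPy, our own illustration; the dimensions and randomly initialized weights are placeholders, since in practice the parameters are trained):

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, n_blocks, n_out = 2, 16, 4, 3   # input dim, width, depth, outputs (e.g. u and p)
sigma = np.tanh                        # a smooth scalar activation

# Two weight matrices and two bias vectors per residual block, plus a final linear map.
params = [(rng.standard_normal((m, m)) / np.sqrt(m), rng.standard_normal(m),
           rng.standard_normal((m, m)) / np.sqrt(m), rng.standard_normal(m))
          for _ in range(n_blocks)]
W_out = rng.standard_normal((n_out, m)) / np.sqrt(m)

def resnet(x):
    """Zero-pad x from R^d to R^m, apply the residual blocks, then a linear
    transform to the output dimension required by the mixed residual loss."""
    s = np.concatenate([x, np.zeros(m - d)])          # zero padding of the input
    for W1, b1, W2, b2 in params:
        s = sigma(W2 @ sigma(W1 @ s + b1) + b2) + s   # shortcut connection "+ s"
    return W_out @ s                                   # e.g. (u, p_1, ..., p_d)

y = resnet(np.array([0.3, 0.7]))
```

With d = 2 and n_out = 3, the single network outputs the solution value and the two components of its gradient at once; changing n_out is all that distinguishes the DGM-style and MIM-style output layers here.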
From Figure 2, we see that DGM has only one output (the solution itself), while the two MIM formulations have additional outputs, one for each auxiliary derivative. In Figure 3, we illustrate the network structures of MIM for the Poisson equation. In the multi-network variant, two DNNs are used: one to approximate the solution and the other to approximate its derivatives. It is clear from Figure 2 that the network structures in DGM and the single-network MIM differ only in the output layer, and thus they have comparable numbers of parameters to be optimized. To be precise, we calculate the numbers of parameters in Table 5, from which one can see that the numbers of parameters in DGM and the single-network MIM are close. For the multi-network MIM, the number of parameters is nearly doubled for the Poisson equation, the Monge-Ampère equation, and the biharmonic equation with only part of the derivatives, tripled for the KdV equation, and quadrupled for the biharmonic equation with all derivatives. In Section 3, from the numerical results, we observe a better performance of MIM for all four equations, not only for the derivatives of the PDE solution but also for the solution itself.
[Table 5: Method | Equation | Size of the parameter set]
2.3 Stochastic Gradient Descent
For completeness, we also briefly introduce the stochastic gradient descent method. For the loss function defined in (3), we generate two sets of points uniformly distributed over Ω and ∂Ω: {x_i}_{i=1}^{N₁} in Ω and {y_j}_{j=1}^{N₂} on ∂Ω, and approximate the loss by its Monte Carlo estimate

  L_N(θ) = (|Ω| / N₁) Σ_{i=1}^{N₁} (𝒩[u](x_i; θ) − f(x_i))² + λ (|∂Ω| / N₂) Σ_{j=1}^{N₂} (u(y_j; θ) − g(y_j))².

The parameters are then updated by

  θ ← θ − η ∇_θ L_N(θ),

where η is the learning rate. |Ω| and |∂Ω| are the measures of Ω and ∂Ω, respectively, and u(x; θ) is the DNN approximation of the PDE solution parameterized by θ. The sampling points {x_i} and {y_j} are regenerated at each iteration. In the implementation, we use the ADAM optimizer Kingma and Ba (2014) and automatic differentiation Paszke et al. (2017) for derivatives in PyTorch.
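A stripped-down version of this training loop, with a hypothetical one-parameter trial family and a hand-computed gradient in place of ADAM and automatic differentiation (our illustration, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.pi**2 * np.sin(np.pi * x)   # exact solution: u(x) = sin(pi x)

# One-parameter trial family u(x; a) = a * sin(pi x); the optimum is a = 1.
a, lr = 0.0, 1e-3
for step in range(500):
    x = rng.uniform(0.0, 1.0, 128)               # fresh sample points each iteration
    s = np.sin(np.pi * x)
    resid = a * np.pi**2 * s - f(x)              # -u_xx - f, with u_xx = -a*pi^2*sin
    grad = np.mean(2.0 * resid * np.pi**2 * s)   # d(loss)/da computed by hand
    a -= lr * grad                               # plain SGD step (ADAM in the paper)
```

Resampling the collocation points at every step is what makes the descent stochastic; the parameter a converges to 1 regardless of the particular samples drawn.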
3 Numerical results
In this section, we show numerical results of MIM for the four types of equations. We use the relative errors of the solution and its derivatives, defined in Table 6, for comparison. In all figures, relative errors are plotted in logarithmic scale.
3.1 Poisson Equation
Consider the Poisson equation in Ω subject to the homogeneous Neumann boundary condition ∂u/∂n = 0 on ∂Ω. (7)
Since both the solution u and its gradient p = ∇u are explicitly available, one more advantage of MIM is the enforcement of boundary conditions. For (7), we multiply the DNN output for p by a function that vanishes on ∂Ω so that the Neumann boundary condition is satisfied automatically; see Figure 3. DGM only has u as its unknown function, and thus it is unclear how the exact Neumann boundary condition can be imposed. Therefore, for the DNNs in Figure 3, the loss function in MIM simplifies to the sum of the equation residual and the residual of p = ∇u over Ω, with no boundary penalty term.
We emphasize that the Dirichlet boundary condition can be imposed exactly in DGM Berg and Nyström (2018), in which case no penalty term is needed. For the Neumann, mixed, and Robin boundary conditions, however, it is difficult to build a DNN representation that satisfies the boundary condition exactly. Building a DNN approximation that satisfies the exact boundary condition has several advantages Chen et al. (2020): 1) it eases the training process by avoiding unnecessary divergence; 2) it improves the approximation accuracy; 3) it saves execution time. In MIM, by contrast, we have direct access to both the solution and its gradient, so all of these boundary conditions can in principle be imposed exactly. This will be presented in a subsequent work Lyu et al. (2020).
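As a one-dimensional illustration of this construction (our sketch, with a small random network standing in for the trained DNN): multiplying the network output for p = u′ by x(1 − x), which vanishes at the endpoints of (0, 1), enforces the homogeneous Neumann condition u′(0) = u′(1) = 0 exactly:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((8, 1))
b = rng.standard_normal(8)
w = rng.standard_normal(8)

def p_raw(x):
    """A small random network standing in for the DNN output approximating u'."""
    return np.tanh(np.outer(x, W[:, 0]) + b) @ w

def p(x):
    """Multiplying by x(1 - x), which vanishes on the boundary of (0, 1),
    makes p = u' satisfy u'(0) = u'(1) = 0 for ANY parameter values."""
    return x * (1.0 - x) * p_raw(x)

vals = p(np.array([0.0, 0.5, 1.0]))   # exactly zero at both endpoints
```

Because the constraint holds for any parameters, the optimizer never has to trade boundary accuracy against interior accuracy, which is exactly why the boundary penalty term can be dropped.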
For (7), the average errors of the solution and its gradient over the final iterations are recorded in Table 7. The network depth and the activation function are fixed, and the network width grows with the problem dimension. Time is recorded as the average CPU time per iteration. It is not surprising that the single-network MIM costs less time than DGM, since the DNN approximation in MIM satisfies the Neumann boundary condition automatically and both methods have similar network structures. It is surprising that the multi-network MIM also costs less time than DGM, since its number of parameters is about twice that of DGM. In terms of execution time, both MIM variants are faster than DGM.
[Table 7: d | Method | Relative error | Time (s)]
Figure 4 and Figure 5 plot the training processes of DGM and MIM in terms of the relative errors for the solution and its gradient. Generally speaking, in terms of approximation error, both MIM variants outperform DGM, as expected. Therefore, MIM provides a better strategy than DGM. MIM provides better approximations in terms of relative errors for both the solution and its gradient: the improvement over DGM ranges from several times up to about one order of magnitude, and the two MIM variants can differ by up to an order of magnitude as well. Moreover, a dimensional dependence is observed for both quantities: the higher the dimension, the better the approximation.
Table 8 records the approximation errors of MIM and DGM for different activation functions and network depths. MIM provides better approximations for both the solution and its gradient. It is not surprising that ReLU is unsuitable for DGM, due to the high-order derivatives in its loss, but is suitable for MIM, since only first-order derivatives are present in MIM.
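This behavior of the activation functions can be checked directly: the second derivative of ReLU vanishes away from the kink, so a loss containing Δu receives no signal through a ReLU network, while ReQU (x ↦ max(x, 0)²) has a piecewise-constant, nonzero second derivative. A finite-difference check (our illustration):

```python
import numpy as np

relu = lambda x: np.maximum(x, 0.0)
requ = lambda x: np.maximum(x, 0.0) ** 2

def second_diff(g, x, h=1e-3):
    """Centered second difference, a stand-in for automatic differentiation."""
    return (g(x + h) - 2.0 * g(x) + g(x - h)) / h**2

x = np.linspace(-2.0, 2.0, 101)
x = x[np.abs(x) > 1e-2]            # stay away from the kink at the origin
relu_dd = second_diff(relu, x)     # zero everywhere: no signal for Delta u
requ_dd = second_diff(requ, x)     # equals 2 for x > 0, 0 for x < 0
```

A DGM loss built from relu_dd is flat in the parameters almost everywhere, whereas MIM only ever differentiates the network once, so ReLU remains usable.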
[Table 8: Relative error]
3.2 Monge-Ampère equation
Consider the nonlinear Monge-Ampère equation

  det(∇²u) = f in Ω,  u = g on ∂Ω,   (10)

where ∇²u denotes the Hessian of u. The loss function in DGM is the least-squares residual of (10) together with the boundary penalty, while the loss function in MIM introduces p = ∇u as an auxiliary variable and penalizes the residuals of p = ∇u and det(∇p) = f, respectively.
For (10), the Dirichlet boundary condition could be enforced exactly in both DGM and MIM. For comparison purposes, we instead keep the penalty term in both DGM and MIM, although imposing exact boundary conditions is always encouraged in practice.
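As a sanity check of this mixed formulation (our illustration): for the convex function u(x, y) = (x² + y²)/2 in two dimensions, det(∇²u) = 1, so with f = 1 the residual det(∇p) − f evaluated by finite differences is nearly zero:

```python
import numpy as np

rng = np.random.default_rng(5)
u = lambda x, y: 0.5 * (x**2 + y**2)   # convex; det of the Hessian equals 1

def ma_residual(u, n=256, h=1e-3):
    """Mean-squared residual det(grad p) - f with p = grad u, f = 1."""
    x, y = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
    # Hessian entries via second differences of u (i.e., first differences of p).
    uxx = (u(x + h, y) - 2 * u(x, y) + u(x - h, y)) / h**2
    uyy = (u(x, y + h) - 2 * u(x, y) + u(x, y - h)) / h**2
    uxy = (u(x + h, y + h) - u(x + h, y - h)
           - u(x - h, y + h) + u(x - h, y - h)) / (4 * h**2)
    det = uxx * uyy - uxy**2
    return np.mean((det - 1.0) ** 2)

r = ma_residual(u)   # near zero for the exact solution
```

The nonlinearity lives entirely in the pointwise determinant of ∇p, so the mixed loss stays a least-squares functional of first derivatives of the unknowns.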
In this example, we fix the network depth and the activation function. Relative errors over the final iterations with respect to the network width in different dimensions are recorded in Table 9, and Figure 6 plots the errors in terms of the network width for different dimensions. The advantage of MIM is obvious from these results.
[Table 9: d | Relative error]
3.3 Biharmonic equation
Consider the biharmonic equation

  Δ²u = f in Ω,

subject to appropriate boundary conditions, with a prescribed exact solution over Ω. The loss function in DGM is the least-squares residual of the equation together with the boundary penalties. The loss function of the first MIM formulation introduces auxiliary variables for all high-order derivatives (the gradient, the Laplacian, and the gradient of the Laplacian), while that of the second formulation uses only part of them; each penalizes the corresponding residuals of the resulting system in the least-squares sense.
Again, the exact boundary condition can be enforced in MIM but not in DGM. For comparison purposes, we use penalty terms in both methods.
The network width is chosen according to the dimension. Table 10 records the averaged errors over the last 1000 iterations.
[Table 10: d | Method | Relative error | Time (s)]
Relative errors for the solution and its derivatives in terms of the iteration number are plotted in Figure 7.
Generally speaking, MIM provides better approximations than DGM for the solution and all of its derivatives. Between the two MIM formulations, the one with only part of the high-order derivatives achieves slightly better accuracy, even though the other has more outputs. These results are of interest since they are connected with results for the local discontinuous Galerkin method, where the formulation with a subset of derivatives also shows better numerical performance Yan and Shu (2002); Cockburn et al. (2009). We point out that MIM has the advantage that the exact boundary condition can be enforced, although we use penalty terms for this example.
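The two splittings can be compared on a one-dimensional toy problem u⁗ = f (our manufactured example with u = sin(πx)): the full set of auxiliary variables tracks (u, p = u′, q = p′, r = q′) with four first-order residuals, while the partial set tracks only (u, q = u″) with two second-order residuals:

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.pi**4 * np.sin(np.pi * x)   # u'''' = f with u = sin(pi x)

def d(g, x, h=1e-4):    # centered first difference
    return (g(x + h) - g(x - h)) / (2 * h)

def dd(g, x, h=1e-4):   # centered second difference
    return (g(x + h) - 2 * g(x) + g(x - h)) / h**2

u = lambda x: np.sin(np.pi * x)
p = lambda x: np.pi * np.cos(np.pi * x)        # u'
q = lambda x: -np.pi**2 * np.sin(np.pi * x)    # u''
r = lambda x: -np.pi**3 * np.cos(np.pi * x)    # u'''

x = rng.uniform(0.01, 0.99, 2048)
# Full splitting: four first-order residuals.
full = (np.mean((d(u, x) - p(x))**2) + np.mean((d(p, x) - q(x))**2)
        + np.mean((d(q, x) - r(x))**2) + np.mean((d(r, x) - f(x))**2))
# Partial splitting: two residuals, q = u'' and q'' = f.
part = np.mean((dd(u, x) - q(x))**2) + np.mean((dd(q, x) - f(x))**2)
```

Both residuals vanish for the exact tuple; the trade-off in practice is between more unknowns with lower-order derivatives (full) and fewer unknowns with higher-order derivatives (partial).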
3.4 KdV equation
Consider a time-dependent linear KdV-type equation

  ∂_t u + Σ_{i=1}^{d} ∂³u/∂x_i³ = f,

defined over Ω × (0, T], with a prescribed exact solution. We first rewrite it into the first-order system

  p = ∇u,  q_i = ∂p_i/∂x_i (i = 1, …, d),  ∂_t u + Σ_{i=1}^{d} ∂q_i/∂x_i = f,

where the derivatives are taken along the standard basis directions {e_i}_{i=1}^{d} of ℝᵈ. The loss function in DGM is the least-squares residual of the equation together with the initial and boundary penalties, and the loss function in MIM is the sum of the least-squares residuals of the first-order system together with the same penalties.
Relative errors of the solution and its derivatives are recorded in Table 11. Again, as in the previous examples, MIM provides better results than DGM, especially for the ReQU activation function. No obvious improvement of one MIM formulation over the other is observed.
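The first-order rewriting also applies in the time-dependent case; a one-dimensional sanity check with the manufactured solution u(x, t) = sin(x − t), for which u_t + u_xxx = −2cos(x − t) =: f (our illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
u = lambda x, t: np.sin(x - t)
p = lambda x, t: np.cos(x - t)            # p = u_x
q = lambda x, t: -np.sin(x - t)           # q = p_x = u_xx
f = lambda x, t: -2.0 * np.cos(x - t)     # u_t + u_xxx for this solution

x = rng.uniform(0.0, 2.0 * np.pi, 1024)
t = rng.uniform(0.0, 1.0, 1024)
h = 1e-4
u_t = (u(x, t + h) - u(x, t - h)) / (2 * h)
q_x = (q(x + h, t) - q(x - h, t)) / (2 * h)
resid = np.mean((u_t + q_x - f(x, t)) ** 2)   # mixed residual of the rewrite
```

Only the time derivative of u and first space derivatives of the auxiliary variables appear, so the third-order operator never acts on a network directly.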
[Table 11: Method | Relative error]
4 Conclusion and Discussion
Motivated by classical numerical methods such as local discontinuous Galerkin method, mixed finite element method, and least-squares finite element method, we develop a deep mixed residual method to solve high-order PDEs in this paper. The deep mixed residual method inherits several advantages of classical numerical methods:
Flexibility for the choice of loss function;
Larger solution space with flexible choice of deep neural networks;
Enforcement of exact boundary conditions;
Better approximations of high-order derivatives at almost the same cost.
Meanwhile, the deep mixed residual method also provides a better approximation for the PDE solution itself. These features make deep mixed residual method suitable for solving high-order PDEs in high dimensions.
The boundary condition is another important issue when solving PDEs by DNNs. Enforcing exact boundary conditions not only makes the training process easier but also improves the approximation accuracy; see Berg and Nyström (2018); Chen et al. (2020) for examples. The deep mixed residual method has the potential to impose exact boundary conditions such as the Neumann, mixed, and Robin boundary conditions, none of which can be enforced exactly in the deep Galerkin method. This will be investigated in a subsequent work Lyu et al. (2020).
So far, only experience from classical numerical methods at a basic level has been transferred into the deep mixed residual method, and we have already seen clear advantages. To further improve the method, we need to transfer experience from classical numerical analysis at a deeper level. For example, in the least-squares finite element method, the choice of solution space relies heavily on the choice of residual in order to maximize performance Bochev and Gunzburger (2015). Many other connections exist in the discontinuous Galerkin method Cockburn et al. (2000) and the mixed finite element method Boffi et al. (2013). For example, since only first-order derivatives appear in the deep mixed residual method, ReLU works well for all time-independent equations we have tested, although it does not work well for the KdV equation; this deserves a theoretical understanding in the language of linear finite elements He et al. (2018). Another possible connection is to use the weak formulation of the mixed residual instead of the least-squares loss, as done in deep learning by Zang et al. (2020) and in the discontinuous Galerkin method by Cockburn et al. (2000). Realizing these connections in the deep mixed residual method will allow for a systematic way to understand and improve deep learning for solving PDEs.
Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2018YFB0204404) and the National Natural Science Foundation of China via grant 11971021. We thank Qifeng Liao and Xiang Zhou for helpful discussions.
References

- Optimal control and viscosity solutions of Hamilton-Jacobi-Bellman equations. Springer Science & Business Media.
- Machine learning approximation algorithms for high-dimensional fully nonlinear partial differential equations and second-order backward stochastic differential equations. Journal of Nonlinear Science 29 (4), pp. 1563–1619.
- Overcoming the curse of dimensionality in the numerical approximation of high-dimensional semilinear elliptic partial differential equations. arXiv preprint arXiv:2003.00596.
- A unified deep artificial neural network approach to partial differential equations in complex geometries. Neurocomputing 317, pp. 28–41.
- Least Squares Finite Element Methods. Springer, Berlin, Heidelberg.
- Mixed Finite Element Methods and Applications. Springer, Berlin, Heidelberg.
- Solution of the Black-Scholes equation using artificial neural networks. Journal of Physics: Conference Series 1221, pp. 012044.
- A comprehensive study of boundary conditions when solving PDEs by DNNs. arXiv preprint arXiv:2005.04554.
- A hybridizable and superconvergent discontinuous Galerkin method for biharmonic problems. Journal of Scientific Computing 40 (1), pp. 141–187.
- Discontinuous Galerkin Methods: Theory, Computation and Applications. Springer-Verlag Berlin Heidelberg.
- The Principles of Quantum Mechanics. Oxford University Press.
- Controlling oscillations in high-order discontinuous Galerkin schemes using artificial viscosity tuned by neural networks. Journal of Computational Physics 409, pp. 109304.
- Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations. Communications in Mathematics and Statistics 5 (4), pp. 349–380.
- The Deep Ritz Method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics 6 (1), pp. 1–12.
- Finite Elements and Fast Iterative Solvers: with Applications in Incompressible Fluid Dynamics. Oxford University Press.
- A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation 17 (4), pp. 1189–1213.
- Solving the quantum many-body problem with artificial neural networks. Science 355 (6325), pp. 602–606.
- Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences of the United States of America 115 (34), pp. 8505–8510.
- Solving many-electron Schrödinger equation using deep neural networks. Journal of Computational Physics 399, pp. 108929.
- ReLU deep neural networks and linear finite elements. arXiv preprint arXiv:1807.03973.
- Deep residual learning for image recognition. CoRR abs/1512.03385.
- Deep residual learning for image recognition. pp. 770–778.
- Options, Futures and Other Derivatives. Upper Saddle River, NJ: Prentice Hall.
- A proof that rectified deep neural networks overcome the curse of dimensionality in the numerical approximation of semilinear heat equations. arXiv preprint arXiv:1901.10854.
- Solving for high-dimensional committor functions using artificial neural networks. Research in the Mathematical Sciences 6, pp. 1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Finite Difference Methods for Ordinary and Partial Differential Equations: Steady-State and Time-Dependent Problems. Society for Industrial and Applied Mathematics.
- D3M: a deep domain decomposition method for partial differential equations. IEEE Access 8, pp. 5283–5294.
- Enforcing exact boundary and initial conditions in the deep mixed residual method. In preparation.
- A weak Galerkin finite element method with polynomial reduction. Journal of Computational and Applied Mathematics 285, pp. 45–58.
- Automatic differentiation in PyTorch.
- Physics-informed neural networks: a deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics 378, pp. 686–707.
- Deep hidden physics models: deep learning of nonlinear partial differential equations. Journal of Machine Learning Research 19 (1), pp. 932–955.
- Spectral Methods: Algorithms, Analysis and Applications. Vol. 41, Springer Science & Business Media.
- DGM: a deep learning algorithm for solving partial differential equations. Journal of Computational Physics 375, pp. 1339–1364.
- Deep multiscale model learning. Journal of Computational Physics 406, pp. 109071.
- Local discontinuous Galerkin methods for partial differential equations with higher order derivatives. Journal of Scientific Computing 17 (1), pp. 27–47.
- Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics 411, pp. 109409.
- Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics 394, pp. 56–81.