# Structure Probing Neural Network Deflation

Deep learning is a powerful tool for solving nonlinear differential equations, but usually only the solution corresponding to the flattest local minimizer can be found, due to the implicit regularization of stochastic gradient descent. This paper proposes Structure Probing Neural Network Deflation (SP-NND) to make deep learning capable of identifying the multiple solutions that are ubiquitous and important in nonlinear physical models. First, we introduce deflation operators built with known solutions so that known solutions are no longer local minimizers of the optimization energy landscape. Second, to facilitate convergence to the desired local minimizer, a structure probing technique is proposed to obtain an initial guess close to it. Together with neural network structures carefully designed in this paper, the new regularized optimization can converge to new solutions efficiently. Due to the mesh-free nature of deep learning, SP-NND is capable of solving high-dimensional problems on complicated domains with multiple solutions, while existing methods focus on merely one- or two-dimensional regular domains and are more expensive than SP-NND in operation counts. Numerical experiments also demonstrate that SP-NND can find more solutions than existing methods.


## 1 Introduction

### 1.1 Problem statement

Nonlinear differential equations are ubiquitous in various important physical models such as fluid dynamics, plasma physics, solid mechanics, and quantum field theory [Ferris1997, Chyan2000, Davis2001, Kilbas2006], as well as chemical and biological models [Williams1982, Clark2003]. Solving nonlinear differential equations is especially challenging when multiple distinct solutions must be found. The nonlinearity of the differential equation may cause traditional iterative solvers to stop at a spurious solution if the initial guess is not close to a physically meaningful solution. When multiple distinct solutions are of interest, a naive strategy is to try as many different initial guesses as possible so that iterative solvers return as many distinct solutions as possible. However, most initial guesses lead to either spurious or repeated solutions, making this approach time-consuming and inefficient unless a priori estimates of the solutions are available.

Recently, neural network-based optimization has become a powerful tool for solving nonlinear differential equations, especially in high-dimensional spaces [Berg2018, Sirignano2018, Han2018, Cai2019, Zang2020, Huang2019, Gu2020]. As a form of nonlinear parametrization through compositions of simple functions [IanYoshuaAaron2016], deep neural networks (DNNs) can efficiently approximate various useful classes of functions without the curse of dimensionality [barron1993, Hadrien, bandlimit, poggio2017, 2019arXiv190609477Y, 2020arXiv200103040L, MO] and achieve exponential approximation rates [yarotsky2017, bandlimit, 2020arXiv200103040L, DBLP:journals/corr/LiangS16, DBLP:journals/corr/abs-1807-00297, Opschoor2019]. Therefore, applying DNNs to parametrize the solution space of differential equations (including boundary value problems, initial value problems, and eigenvalue problems) and seeking a solution via energy minimization from a variational formulation has become a popular choice, e.g., the residual method [Berg2018, Sirignano2018] as a special case of the variational formulation, the Ritz method [EYu2018], and the Nitsche method [Liao2019].

However, neural network-based optimization usually finds only the smoothest solution, the one with the fastest decay in the frequency domain, due to the implicit regularization of network structures and of the stochastic gradient descent (SGD) used to solve the minimization problem, no matter how the initial guess is randomly selected. It was shown through the frequency principle of neural networks [DBLP:journals/corr/abs-1905-10264, DBLP:journals/corr/abs-1905-07777, DBLP:journals/corr/abs-1906-09235] and the neural tangent kernel [cao2020towards] that neural networks have an implicit bias towards functions that decay fast in the Fourier domain, and that the gradient descent method tends to fit a low-frequency function better than a high-frequency one. Through analysis of the optimization energy landscape of SGD, it was shown that SGD with small batches tends to converge to the flattest minimum [Neyshabur2017, Lei2018, xiaowu]. Therefore, designing an efficient algorithm for neural network-based optimization that finds as many distinct solutions as possible is a challenging problem.

To tackle this challenging problem and find as many distinct solutions as possible, we propose structure probing neural network deflation (SP-NND) in this paper. The key idea of NND is to introduce deflation operators built with known solutions to regularize the deep learning optimization, making known solutions no longer local minimizers of the optimization energy landscape while preserving unknown solutions as local minimizers. In particular, we introduce a deflation functional mapping known solutions to infinity. Multiplying the original optimization loss function by this deflation functional removes the known solutions from consideration, and unknown solutions can be found by optimizing the regularized loss function via SGD. Furthermore, to facilitate the convergence of SGD, we propose special network structures incorporating the boundary conditions of differential equations to simplify the optimization loss function. Finally, a novel structure probing (SP) algorithm is proposed to initialize the NND optimization, making it more capable of identifying distinct solutions with desired structures.

As a general framework, NND can be applied to all neural network-based optimization methods for differential equations. In this paper, we take the boundary value problem (BVP) and the residual method (RM) as a running example without loss of generality; the generalization to other problems and methods is similar. Consider the BVP

 (1) Du(x) = f(u(x), x) in Ω,  Bu(x) = g(x) on ∂Ω,

where D is a differential operator that can be nonlinear, f can be a nonlinear function in u, Ω is a bounded domain in R^d, and B characterizes the boundary condition. Other types of problems, such as initial value problems, can also be formulated as a BVP as discussed in [Gu2020]. Then RM seeks a solution as a neural network u(x;θ) with a parameter set θ via the following optimization problem

 (2) min_θ L_RM := ∥Du(x;θ) − f(u,x)∥²_{L²(Ω)} + λ∥Bu(x;θ) − g(x)∥²_{L²(∂Ω)},

where L_RM is the loss function measuring the L² norms of the differential equation residual Du(x;θ) − f(u,x) and the boundary residual Bu(x;θ) − g(x), and λ > 0 is a regularization parameter.

As we shall see, SP-NND enjoys four main advantages compared to traditional methods not based on deep learning:

• Numerical examples show that SP-NND can identify more solutions than other existing methods, e.g., see Test Case 5 in Section 5.

• As a neural network-based method, SP-NND can be applied to solve high-dimensional nonlinear differential equations with multiple solutions, while existing methods are only applicable to low-dimensional problems. For example, Test Case 6 in Section 5 solves a high-dimensional Yamabe's equation.

• SP-NND can be applied to problems with complex domains due to the flexibility of neural network parametrization, e.g., see Test Cases 5 & 6 in Section 5.

• As we shall discuss in Section 3.4, SP-NND admits a lower computational complexity per iteration than existing methods such as the original deflation method in [Farrell2015], when the comparison is made at the same degrees of freedom for each method.

### 1.2 Related work

The deflation technique can be traced back to the last century for identifying distinct roots of scalar polynomials [Wilkinson1963]. This technique was extended to find roots of systems of nonlinear algebraic equations by Brown and Gearhart in [Brown1971], where deflation matrices constructed from old roots transform the residual of a system of nonlinear algebraic equations so that iterative methods applied to the new residual converge only to a new root. In [Farrell2015], Farrell et al. extended the theoretical framework of Brown and Gearhart [Brown1971] to the case of infinite-dimensional Banach spaces with new classes of deflation operators, enabling the Newton–Kantorovich iteration to converge to several distinct solutions of nonlinear differential equations even from the same initial guess.

Another well-established approach to finding distinct solutions of differential equations is based on numerical continuation [AG2003, AG1993, CLP1975, C1979], the basic idea of which is to transform the known solutions of a simple start system gradually into the desired solutions of a difficult target system. For example, [Morgan1989] proposed coefficient-parameter polynomial continuation for computing all geometrically isolated solutions of polynomial systems. [Hao2014] put forward a bootstrapping approach for computing multiple solutions of differential equations using a homotopy continuation method, with domain decomposition to speed up the computation. For more examples of homotopy-based methods and theory, the reader is referred to [Liao2012].

The third kind of method for identifying distinct solutions of nonlinear systems is the numerical integration of the Davidenko differential equation associated with the original nonlinear problem [B1972, D1953]. The basic idea is to introduce an artificial time parameter such that solving the original nonlinear equation is equivalent to finding a steady-state solution of a time-dependent nonlinear equation that provides a gradient flow. The gradient flow forms an ordinary differential equation whose solution converges, as time goes to infinity, to a solution of the original problem. This method is in fact a broad framework containing Newton's method as a special example.

### 1.3 Organization

This paper is organized as follows. In Section 2, we review the fully connected feed-forward neural network, introduce the formulation of RM for BVPs, and design special network structures for four types of boundary conditions. In Section 3, the detailed formulation and implementation of NND are presented. In Section 4, the SP initialization is introduced. Various numerical experiments are provided in Section 5 to verify the efficiency of SP-NND. Finally, we conclude this paper in Section 6.

## 2 Network-based Methods for Differential Equations

In this section, we introduce the network-based residual method based on fully connected feed-forward neural networks and (2) for solving the BVP (1). Moreover, special network structures for common boundary conditions are introduced to simplify the loss function in (2) and facilitate convergence to the desired PDE solution. Vectors are written in bold font to distinguish them from scalars in our presentation.

### 2.1 Fully connected feed-forward neural network (FNN)

FNNs are among the most popular DNNs and are widely applied in network-based methods for differential equations. Mathematically speaking, for a fixed nonlinear activation function σ, an FNN is the composition of L simple nonlinear functions, called hidden layer functions, in the following formulation:

 ϕ(x;θ) := aᵀ h_L ∘ h_{L−1} ∘ ⋯ ∘ h_1(x) for x ∈ R^d,

where a ∈ R^{N_L}; h_ℓ(x) := σ(W_ℓ x + b_ℓ) with W_ℓ ∈ R^{N_ℓ × N_{ℓ−1}} and b_ℓ ∈ R^{N_ℓ} for ℓ = 1, …, L. The activation function σ can be chosen as a rectified linear unit (ReLU) max{x, 0}, a power of it, a hyperbolic tangent function tanh(x), etc. With an abuse of notation, σ(v) means that σ is applied entry-wise to the vector v to obtain another vector of the same size. N_ℓ is the width of the ℓ-th layer and L is the depth of the FNN. θ denotes the set of all parameters in a, W_ℓ, and b_ℓ that determine the underlying neural network. Other kinds of neural networks are also suitable in our proposed methods, but we adopt FNNs for simplicity.

### 2.2 Residual method

The RM is a least-squares optimization approach for solving general differential equations. Specifically, let u(x;θ) be a neural network approximating the solution of BVP (1); then the RM is formulated as

 (3) min_θ L_RM(θ) := ∥Du(x;θ) − f(x)∥²_{L²(Ω)} + λ∥Bu(x;θ) − g(x)∥²_{L²(∂Ω)},

where L_RM is the loss function measuring the weighted magnitude of the differential equation residual Du(x;θ) − f(x) and the boundary residual Bu(x;θ) − g(x) in the sense of the L²-norm with a weight parameter λ > 0.

The goal of (3) is to find an appropriate set of parameters θ such that the network u(x;θ) minimizes the loss L_RM(θ). If the loss is minimized to zero with some θ*, then u(x;θ*) satisfies Du = f in Ω and Bu = g on ∂Ω, implying that u(x;θ*) is exactly a solution of (1). If the loss is minimized to a nonzero but small positive number, u(x;θ*) is close to the true solution as long as (1) is well-posed (e.g., the elliptic PDE with Neumann boundary condition; see Theorem 4.1 in [Gu2020]).

The optimization and mesh-free setting of RM with neural networks admit several advantageous features that have led to its great success and popularity, including but not limited to: 1) the capacity to solve high-dimensional problems; 2) the flexibility to solve equations of various forms on complicated problem domains; 3) the simple and high-performance implementation with automatic differentiation in existing open-source software.
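As a minimal illustration of the residual loss (not the paper's implementation), the following sketch estimates the RM loss for the toy equation u″ = f on (0, 1), with the second derivative approximated by central finite differences; all names and the sample count are our choices:

```python
import random

def rm_loss(u, f, samples, h=1e-4):
    """Monte Carlo estimate of ||u'' - f||^2_{L2} via central differences."""
    total = 0.0
    for x in samples:
        upp = (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2  # approximate u''(x)
        total += (upp - f(x))**2
    return total / len(samples)

random.seed(0)
xs = [random.uniform(0.01, 0.99) for _ in range(100)]  # interior sample points

u = lambda x: x * x      # candidate solution, u'' = 2
f = lambda x: 2.0        # right-hand side
loss = rm_loss(u, f, xs) # near zero, since this u solves u'' = f
```

In practice the derivative Du would be computed by automatic differentiation on the network rather than by finite differences, and the boundary term of (3) would be added with weight λ.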

### 2.3 Special network structures for boundary conditions

In numerical implementation, the RM loss function in (3) relies heavily on the selection of a suitable weight parameter λ and a suitable initial guess. If λ is not appropriate, it may be difficult to identify a reasonably good minimizer of (3). For instance, if we solve (3) by SGD from a generic random initial guess, SGD might converge to a local minimizer corresponding to a solution network close to a trivial function that fits the boundary condition but not the equation, far away from the desired solution, especially when the differential operator is highly nonlinear or λ is too large. This undesired local minimizer arises because the boundary residual overwhelms the equation residual in the loss function.

The issue just above motivates us to design special networks that satisfy the boundary condition automatically, so that the RM loss function simplifies from (3) to

 (4) min_θ L_RM(θ) := ∥Du(x;θ) − f(x)∥²_{L²(Ω)}.

As reported in the numerical section, our tests show that this simplification helps SGD converge to desired solutions rather than spurious ones. The design of these special neural networks depends on the type of boundary condition. We discuss four common types of boundary conditions, taking one-dimensional problems defined on an interval (a, b) as an example; network structures for more complicated boundary conditions in high-dimensional domains can be constructed similarly. In what follows, denote by û(x;θ) a generic deep neural network with trainable parameters θ. We will augment û with several specially designed functions to obtain a final network u(x;θ) that satisfies the boundary condition automatically.

Case 1. Dirichlet boundary condition u(a) = a0, u(b) = b0.

In this case, we introduce two special functions h1(x) and l1(x) to augment û(x;θ) to obtain the final network u(x;θ):

 (5) u(x;θ) = h1(x)û(x;θ) + l1(x).

Here h1 and l1 are chosen such that u(x;θ) automatically satisfies the Dirichlet boundary condition no matter what û(x;θ) is. Then u(x;θ) is used to approximate the true solution of the differential equation and is trained through (4).

For this purpose, l1 is taken as a lifting function satisfying the given Dirichlet boundary condition, i.e., l1(a) = a0 and l1(b) = b0; h1 is taken as a special function satisfying the homogeneous Dirichlet boundary condition, i.e., h1(a) = 0 and h1(b) = 0. A straightforward choice for l1 is the linear function

 l1(x) = (b0 − a0)(x − a)/(b − a) + a0.

For h1, we can set it as a (possibly fractional) polynomial with roots a and b, namely,

 h1(x) = (x − a)^{pa}(x − b)^{pb},

with pa, pb > 0. To obtain an accurate approximation, pa and pb should be chosen consistently with the orders of the true solution at a and b, so that no singularity is brought into the network structure. For regular solutions, we take pa = pb = 1; for singular solutions, pa and pb would take fractional values. For instance, in the case of a fractional Laplace equation on the interval (a, b) with boundary conditions u(a) = u(b) = 0, the true solution behaves like a fractional power of the distance to the boundary times a smooth function [Acosta2016, Dyda2017]. Then in the construction of h1, it is reasonable to choose fractional pa and pb matching that power.
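The construction (5) can be sketched as follows for the regular case pa = pb = 1; the wrapper name `dirichlet_net` is our choice:

```python
import math

def dirichlet_net(uhat, a, b, a0, b0, pa=1.0, pb=1.0):
    """Wrap a generic network `uhat` so the result satisfies u(a)=a0, u(b)=b0.

    u(x) = h1(x) * uhat(x) + l1(x), where h1 vanishes at both endpoints and
    l1 is a linear lifting matching the boundary data (regular case pa=pb=1)."""
    l1 = lambda x: (b0 - a0) * (x - a) / (b - a) + a0
    h1 = lambda x: (x - a)**pa * (x - b)**pb
    return lambda x: h1(x) * uhat(x) + l1(x)

# Any uhat yields the prescribed boundary values.
u = dirichlet_net(lambda x: math.sin(5.0 * x), a=0.0, b=1.0, a0=2.0, b0=-1.0)
```

Because h1 vanishes at both endpoints, the boundary conditions hold exactly by construction, regardless of the inner network's values.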

Case 2. One-sided condition u(a) = a0, u′(a) = a1.

Similarly to Case 1, the special network can be constructed by u(x;θ) = h2(x)û(x;θ) + l2(x), where the lifting function l2 is given by

 l2(x) = a1(x − a) + a0,

and h2 is set as

 (6) h2(x) = (x − a)^{pa},

with pa > 1. Such an h2 guarantees that h2 and its first derivative both vanish at x = a.

Case 3. Mixed boundary condition u′(a) = a0, u(b) = b0.

In this case, the special network is constructed by u(x;θ) = ũ(x;θ) + l3(x) with a lifting function l3 chosen as a linear function satisfying the mixed boundary conditions, e.g.,

 l3(x) = a0x + b0 − a0b,

and ũ(x;θ) satisfying the homogeneous mixed boundary conditions ũ′(a) = 0 and ũ(b) = 0. In the construction of ũ, it is inappropriate to naively take ũ(x;θ) = (x − a)^{pa}(x − b)^{pb}û(x;θ), following the approaches in the preceding two cases, because such a ũ satisfies the redundant condition ũ(a) = 0. Instead, we assume

 (7) ũ(x;θ) = (x − a)^{pa}û(x;θ) + c,

where pa > 1 and c is a network-related constant to be determined. Clearly, (7) implies ũ′(a) = 0, whereas c has not been specified. Next, the constraint ũ(b) = 0 gives c = −(b − a)^{pa}û(b;θ). Therefore, the special network for mixed boundary conditions can be constructed via

 (8) u(x;θ) = (x − a)^{pa}û(x;θ) − (b − a)^{pa}û(b;θ) + l3(x).
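A minimal sketch of (8), assuming pa = 2 (any pa > 1 works); the helper name `mixed_bc_net` is ours:

```python
import math

def mixed_bc_net(uhat, a, b, a0, b0, pa=2.0):
    """Special network for the mixed condition u'(a)=a0, u(b)=b0,
    sketched with pa=2 so the (x-a)^pa factor has zero slope at x=a."""
    l3 = lambda x: a0 * x + b0 - a0 * b
    return lambda x: (x - a)**pa * uhat(x) - (b - a)**pa * uhat(b) + l3(x)

u = mixed_bc_net(lambda x: math.cos(3.0 * x), a=0.0, b=1.0, a0=0.5, b0=2.0)

# u(b)=b0 holds exactly; u'(a)=a0 is checked here by a central difference.
h = 1e-6
uprime_a = (u(0.0 + h) - u(0.0 - h)) / (2.0 * h)
```

The subtraction of the constant (b − a)^pa û(b;θ) enforces u(b) = b0 without constraining the inner network, while the (x − a)^pa factor keeps the slope at x = a equal to that of the lifting l3.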

Case 4. Neumann boundary condition u′(a) = a0, u′(b) = b0.

Similarly to Case 3, we construct the network by u(x;θ) = ũ(x;θ) + l4(x) with a lifting function l4 satisfying the Neumann boundary condition, given by

 l4(x) = (b0 − a0)/(2(b − a)) · (x − a)² + a0x.

And ũ(x;θ), satisfying the homogeneous Neumann boundary conditions ũ′(a) = ũ′(b) = 0, is assumed to be

 (9) ũ(x;θ) = (x − a)^{pa}ǔ(x;θ̌) + c1,

where pa > 1, ǔ(x;θ̌) is an intermediate network to be determined later, and c1 is a network parameter to be trained together with θ̌. It is easy to check that ũ′(a) = 0. Next, by the constraint ũ′(b) = 0, we have

 pa ǔ(b;θ̌) + (b − a)ǔ′(b;θ̌) = 0,

which can be reformulated as

 (exp(pa x/(b − a)) ǔ(x;θ̌))′ |_{x=b} = 0.

Therefore, we have

 (10) exp(pa x/(b − a)) ǔ(x;θ̌) = (x − b)^{pb}û(x;θ̂) + c2,

where pb > 1 and c2 is another network parameter to be trained together with θ̂. Finally, by combining (9) and (10), we obtain the following special network satisfying the given Neumann condition:

 (11) u(x;θ) = exp(pa x/(a − b))(x − a)^{pa}((x − b)^{pb}û(x;θ̂) + c2) + c1 + l4(x),

where θ collects θ̂, c1, and c2.

## 3 Neural Network Deflation

In this section, we present the general formulation, the detailed implementation, and the computational complexity of NND. As we shall see, our method is easy to implement on high-dimensional and complex domains, with a lower computational cost per iteration than traditional deflation methods.

### 3.1 Formulation

A nonlinear BVP (1) might have multiple distinct solutions, each of which is a local minimizer of the corresponding network-based optimization, say

 (12) min_θ L(u(x;θ)),

where L is a generic loss function for solving differential equations; one example is the residual loss in (4), but other loss functions can also be used. However, due to the implicit regularization of SGD, only local minimizers in flat energy basins are likely to be found. Hence, no matter how the SGD is initialized and how the hyper-parameters are chosen, usually only a few solutions can be found by minimizing (12) directly.

Neural network deflation (NND) is therefore introduced in this paper. We multiply the original loss function by a prefactor from deflation operators [Farrell2015] to modify the energy landscape using known solutions. The prefactor excludes known solutions as local minimizers while preserving unknown solutions as local minimizers. Specifically, let u_1, …, u_K (K ≥ 1) be the known solutions of the BVP (1), which we call the deflated solutions. Then NND is formulated as the following optimization problem,

 (13) min_θ L_NND := (Σ_{k=1}^{K} 1/∥u(x;θ) − u_k(x)∥^{pk}_{L²(Ω)} + α) L(u(x;θ)),

where pk > 0 is the deflation power associated with the k-th deflated solution and α > 0 is a shift constant. Note that a nonzero α is used to prevent u(x;θ) from approaching infinity during the training process, as discussed in [Farrell2015]. The modified loss function (13) has a new energy landscape in which the known solutions take infinite values and are no longer local minimizers, while unknown solutions remain local minimizers [Farrell2015]. Consequently, new solutions can be found through the regularized loss function via SGD.
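The deflated loss can be sketched as follows; the base loss here is a constant placeholder (our simplification) so that only the effect of the deflation factor is visible:

```python
import math

def l2_discrete(vals):
    """Discrete L2 norm: (1/Np * sum |v|^2)^(1/2)."""
    return math.sqrt(sum(v * v for v in vals) / len(vals))

def nnd_loss(u, known, base_loss, samples, powers, alpha):
    """Deflated loss in the spirit of (13): the base loss is multiplied by
    sum_k 1/||u - u_k||^{p_k}_{L2} + alpha, with norms estimated on samples."""
    factor = alpha
    for uk, pk in zip(known, powers):
        dist = l2_discrete([u(x) - uk(x) for x in samples])
        factor += 1.0 / dist**pk
    return factor * base_loss(u, samples)

base = lambda u, xs: 1.0          # constant placeholder residual
xs = [i / 10.0 for i in range(1, 10)]
known = [lambda x: 0.0]           # one deflated solution u_1 = 0
near = nnd_loss(lambda x: 0.01, known, base, xs, powers=[2.0], alpha=1.0)
far = nnd_loss(lambda x: 1.0, known, base, xs, powers=[2.0], alpha=1.0)
# `near` blows up (the candidate sits on the deflated solution); `far` ~ alpha + 1.
```

This illustrates why gradient descent on the deflated loss is pushed away from known solutions: the loss grows without bound as the candidate approaches any u_k.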

### 3.2 Deflation with a varying shift

The original deflation operator introduced in [Farrell2015] fixes the shift α in (13) as a constant. In this paper, we propose a new variant of deflation operators with a shift that varies along the SGD iterations. Note that when α is equal or close to zero, the deflation term dominates the loss, and hence gradient descent tends to converge to functions far away from the known solutions. When α is moderately large, the original loss function dominates, and the gradient descent process tends to converge to a solution with a smaller residual. Therefore, α in this paper can be set as a monotonically increasing function of the SGD iteration. In the early stage, α is chosen close to zero so that the current solution is pushed away from the known solutions; during this stage, a large learning rate is preferable. In the later stage, when the current solution is roughly stable, α is set to be large and a small learning rate is used to obtain a small residual loss.

Practically, we increase α exponentially with a linearly increasing power over time. The shift in the n-th SGD iteration, denoted by α_n, is given by

 (14) α_n = 10^{p0 + n(p1 − p0)/N_I},

where p0 and p1 are two prescribed powers with p0 < p1, and N_I is the total number of iterations.
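The schedule (14) is a one-liner; the powers and iteration count below are illustrative:

```python
def shift(n, p0, p1, n_iters):
    """Varying shift (14): alpha_n = 10^(p0 + n*(p1 - p0)/N_I),
    increasing monotonically from 10^p0 at n=0 to 10^p1 at n=N_I."""
    return 10.0 ** (p0 + n * (p1 - p0) / n_iters)

# Example schedule: alpha grows from 1e-2 toward 1e3 over 1000 iterations.
alphas = [shift(n, -2.0, 3.0, 1000) for n in range(0, 1001, 100)]
```

The same exponential-in-a-linear-power form is reused later for the learning rate schedule (20), with decreasing instead of increasing powers.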

### 3.3 Discretization

The continuous loss functions in (4) and (13) can be approximately evaluated by stochastic sampling. The L²-norm can be interpreted as an expectation of a random function with a random variable uniformly distributed in a certain domain. Hence, the expectation can be approximated by sampling several times and computing the average function value. Let us take ∥u(x)∥_{L²(Ω)} as an example. We generate N_p random samples x_i, i = 1, …, N_p, uniformly distributed in Ω. Denote X = {x_i}_{i=1}^{N_p}; then the norm is evaluated as the discrete L²-norm ∥u(x)∥_{L²(X)} via

 (15) ∥u(x)∥_{L²(X)} := (1/N_p Σ_{x_i∈X} |u(x_i)|²)^{1/2}.
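A sketch of the Monte Carlo estimator (15); the sample size and test function are our choices:

```python
import math
import random

def discrete_l2(u, samples):
    """Discrete L2 norm (15): (1/Np * sum_i |u(x_i)|^2)^(1/2)."""
    return math.sqrt(sum(u(x)**2 for x in samples) / len(samples))

random.seed(1)
xs = [random.uniform(0.0, 1.0) for _ in range(20000)]

# For u(x) = x on (0,1), the (normalized) continuous L2 norm is 1/sqrt(3).
est = discrete_l2(lambda x: x, xs)
```

Note that (15) estimates the mean of |u|² over the domain, so it matches ∥u∥_{L²(Ω)} up to the constant factor |Ω|^{1/2}; on the unit interval the two coincide.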

The discretization technique above is applied to discretize the L²-norms in all loss functions in this paper. In the n-th iteration of gradient descent for minimizing the NND optimization problem in (13), assuming that the shift is set to α_n and the set of random samples is denoted by X_n, the discrete NND loss function is calculated by

 L̂⁽ⁿ⁾_NND(θ) := (Σ_{k=1}^{K} 1/∥u(x;θ) − u_k(x)∥^{pk}_{L²(X_n)} + α_n) L̂(u(x;θ)),

where L̂(u(x;θ)) is a discrete approximation to L(u(x;θ)) using the same set of samples, e.g.,

 L̂(u(x;θ)) = ∥Du(x;θ) − f(x)∥²_{L²(X_n)}

when the RM loss in (4) is applied. Then the network parameter is updated by

 θ ← θ − τ_n ∇_θ L̂⁽ⁿ⁾_NND(θ),

where τ_n is the learning rate in the n-th iteration. In our implementation, the sample set X_n is renewed in every iteration.

### 3.4 Computational Complexity

Let us estimate the computational complexity of the SGD algorithm for the NND optimization (13) with the RM loss function (4). Recall that N_p denotes the number of random samples in each iteration. Assume that the FNN has L layers and N neurons in each hidden layer. Note that evaluating the FNN or computing its derivative with respect to its parameters or input via forward or backward propagation takes O(LN²) floating-point operations (FLOPs) for each sample. Moreover, as in most existing approaches, we assume f in the BVP can be evaluated with O(1) FLOPs for a single sample. Therefore, L_RM in (4) and its derivative can be calculated with O(N_p LN²) FLOPs using the discrete L²-norm in (15), if the differential operator is evaluated through finite difference approximation. Similarly, assuming the number of known solutions is K and the known solutions are stored as neural networks of width N and depth L, the deflation factor and its gradient with respect to θ can also be calculated with O(K N_p LN²) FLOPs. Finally, the total complexity in each gradient descent iteration of the NND optimization is O((K + 1) N_p LN²).

In existing methods [Farrell2015, Croci2015, Adler2016], a given nonlinear differential equation is discretized via traditional discretization techniques, e.g. the finite difference method (FDM) and finite element method (FEM), resulting in a nonlinear system of algebraic equations. The solutions of the system of algebraic equations provide numerical solutions to the original nonlinear differential equation. By multiplying different deflation terms to the nonlinear system of algebraic equations, existing methods are able to identify distinct solutions via solving the deflated system by Newton’s iteration. The number of algebraic equations derived by FDM is exactly the number of grid points; and the number of equations derived by FEM is exactly the number of trial functions in the Galerkin formulation.

Now we compare NND with the existing deflation methods in [Farrell2015, Croci2015, Adler2016] in terms of computational complexity, under the assumption that the degrees of freedom of these methods are equal, i.e., the number of grid points or trial functions in existing methods equals the number of parameters in NND, which guarantees that these methods have almost the same accuracy in finding a solution. Denote the degrees of freedom by W; by the above discussion, W = O(LN²). Therefore, the total computational complexity in each iteration of NND is O((K + 1) N_p W), where N_p is usually chosen as a hyper-parameter much smaller than W. In existing methods, the Jacobian matrix in each Newton iteration is a low-rank matrix plus a sparse matrix of size W by W. Typically, each iteration of Newton's method requires solving a linear system with this Jacobian matrix, whose cost is superlinear in W in general. If a good preconditioner or a sparse direct solver for inverting the Jacobian matrix exists, the operation count may be reduced. Consequently, the total complexity in each iteration of existing methods is more expensive than that of NND, with the gap depending on the performance of preconditioners.

## 4 Structure Probing Initialization

The initialization of parameters plays a critical role in training neural networks and has a significant impact on the ultimate performance. In the training of a general FNN, network parameters are usually randomly initialized using zero-mean normal distributions. One popular technique is the Xavier initialization [Glorot2010]: for each layer ℓ, the weight matrix W_ℓ is chosen randomly from a normal distribution with mean 0 and variance 2/(N_{ℓ−1} + N_ℓ); the bias vector b_ℓ is initialized to zero. As a variant of Xavier initialization, the He initialization [He2015] takes a slightly different variance for W_ℓ, e.g., 2/N_{ℓ−1}. In general, randomly initialized FNNs have a smooth function configuration, and hence their Fourier transform coefficients decay quickly.

The least-squares optimization problem, either for regression or for solving linear partial differential equations, with over-parameterized FNNs and random initialization admits global convergence by gradient descent at a linear convergence rate [DBLP:journals/corr/abs-1806-07572, Du2018, Zhu2019, chen1, LuoYang2020]. However, the speed of convergence depends on the spectrum of the target function. The training of a randomly initialized DNN tends to first capture the low-frequency components of a target solution quickly; the high-frequency fitting error cannot be improved significantly until the low-frequency error has been eliminated, which is referred to as the F-principle [Xu2019]. Related works on the learning behavior of DNNs in the frequency domain are investigated in [DBLP:journals/corr/abs-1906-09235, DBLP:journals/corr/abs-1905-07777, DBLP:journals/corr/abs-1905-10264, cao2020towards]. In the case of nonlinear differential equations with multiple solutions, these theoretical works imply that deep learning-based solvers converge to solutions dominated by low frequencies unless the DNN is initialized near a solution with high-frequency components.

The discussion just above motivates us to propose the structure probing initialization, which helps the training converge to multiple structured solutions. The structure probing initialization incorporates desired structures into the initialization and training of DNNs. For example, to obtain oscillatory solutions of a differential equation, we initialize the DNN with high-frequency components so that the initialization is closer to the desired oscillatory solution. During the optimization process, the magnitudes of these high-frequency components are optimized to fit the desired solution. One choice to probe an oscillatory solution is to take a linear combination of structure probing functions ξ_j(x) with various frequencies, e.g., planewaves with randomly selected frequencies. Then the following network with a set of random parameters can serve as an oscillatory initial guess:

 (16) u_J(x;θ_J) = u(x;θ) + Σ_{j=1}^{J} c_j ξ_j(x),

where θ_J = {θ, c_1, …, c_J} is trainable after initialization. In the initialization, the coefficients c_j can be set as random numbers or as manually determined hyper-parameters with large magnitudes. Instead of planewaves, radial basis functions are also a popular structure in the solution of differential equations; in that case, we can choose ξ_j to be radial basis functions centered at various locations, for example. The idea of structure probing initialization is not limited to these two types of structures and is application dependent.
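A sketch of the oscillatory initial guess (16), assuming cosine planewaves as the probing functions (the planewave family is left generic in the text; the cosine form, the frequency range, and the name `sp_init` are our choices):

```python
import math
import random

def sp_init(u_base, freqs, coeffs):
    """Structure probing initial guess (16): the base network plus a linear
    combination of oscillatory probing functions xi_j(x) = cos(omega_j * x)."""
    return lambda x: u_base(x) + sum(
        c * math.cos(w * x) for c, w in zip(coeffs, freqs))

random.seed(2)
freqs = [random.uniform(10.0, 50.0) for _ in range(4)]  # random high frequencies
coeffs = [1.0, -0.5, 0.8, 0.3]                          # large-magnitude c_j
u0 = sp_init(lambda x: 0.0, freqs, coeffs)              # oscillatory initializer
```

In training, the coefficients c_j would become trainable parameters alongside θ, so the optimizer can rescale the injected high-frequency content to fit the target solution.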

The above paragraph has sketched out the main idea of the structure probing initialization. Now we are ready to discuss its special cases, in which the structure probing network in (16) must satisfy the boundary condition of the BVP (1), which is important for the convergence of deep learning-based solvers as discussed in Section 2.3. For this purpose, we first construct a special network u(x;θ) satisfying the boundary condition by the approaches described in Section 2.3. Next, the structure probing functions ξ_j are specifically chosen to satisfy the corresponding homogeneous boundary condition for each j. As an example, let us take the one-dimensional mixed boundary condition on (a, b):

 (17) u′(a) = a0, u(b) = b0

for any constants a0 and b0. Then a feasible choice is any ξ_j satisfying ξ_j′(a) = 0 and ξ_j(b) = 0, e.g., ξ_j(x) = cos((j − 1/2)π(x − a)/(b − a)). Finally, it is easy to check that u_J satisfies (17).
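The compatibility of such probing functions with the mixed condition (17) can be checked numerically; the cosine family below is one feasible choice under our assumptions, not necessarily the paper's exact formula:

```python
import math

def xi(j, x, a=0.0, b=1.0):
    """A probing function compatible with the mixed condition (17):
    xi_j(x) = cos((j - 1/2) * pi * (x - a) / (b - a)),
    which satisfies xi_j'(a) = 0 and xi_j(b) = 0 (our illustrative choice)."""
    return math.cos((j - 0.5) * math.pi * (x - a) / (b - a))

# Verify the homogeneous conditions numerically for several j:
# |xi_j(b)| should vanish, and a central difference at x=a should vanish too.
h = 1e-6
checks = [(abs(xi(j, 1.0)), abs((xi(j, h) - xi(j, -h)) / (2 * h)))
          for j in (1, 2, 3)]
```

Adding such ξ_j to a special network from Section 2.3 leaves the boundary condition (17) intact, since each ξ_j contributes zero to both u′(a) and u(b).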

## 5 Numerical Examples

In this section, several numerical examples are provided to show the performance of SP-NND in solving the BVP (1). We choose the loss function of RM as the general loss function L in (13); then the optimization problem of SP-NND is formulated as

 (18) min_θ L_NND(θ) := (Σ_{k=1}^{K} 1/∥u(x;θ) − u_k(x)∥^{pk}_{L²(Ω)} + α) ∥Du(x;θ) − f(x)∥²_{L²(Ω)},

where u(x;θ) is the neural network approximating the solution to be determined. Remark that the optimization problem can also be formulated with other optimization-based methods instead of least squares.

To verify the effectiveness of special networks that satisfy boundary conditions automatically, we use NND without the special network for boundary conditions as a comparison, where the loss function of NND becomes

 (19) $\min_\theta\; L_{\mathrm{NND}}(\theta) := \Big(\sum_{k=1}^{K}\frac{1}{\|u(x;\theta)-u_k(x)\|_{L^2(\Omega)}^{p_k}}+\alpha\Big)\Big(\|\mathcal{D}u(x;\theta)-f(x)\|_{L^2(\Omega)}^{2}+\lambda\,\|\mathcal{B}u(x;\theta)-g(x)\|_{L^2(\partial\Omega)}^{2}\Big),$

where $\mathcal{B}$ specifies the boundary condition in (1) and $\lambda>0$ is the weight of the boundary penalty.

The overall setting for all examples is summarized as follows.

• Environment.

The experiments are performed in a Python 3.7 environment. We utilize the PyTorch library for neural network implementation and the CUDA 10.0 toolkit for GPU-based parallel computing. The one-dimensional examples (Test Cases 1-4) can be run on a laptop, and the high-dimensional examples (Test Cases 5-6) are run on a scientific workstation.

• Optimizer. In all examples, the optimization problems are solved by the adam subroutine from the PyTorch library with default hyper-parameters. This subroutine implements the Adam algorithm in [Kingma2014].

• Learning rate. In each example, the learning rate is set to decay exponentially with linearly decreasing powers. Specifically, the learning rate in the $n$-th iteration is set as

 (20) $\tau_n = 10^{\,q_0 + n(q_1-q_0)/N_I},$

where $q_0$ and $q_1$ are the initial and final powers, respectively, and $N_I$ denotes the total number of iterations.

• Network setting. In each example, we construct a special network that satisfies the given boundary condition as discussed in Section 2.3. The special network involves a generic FNN, denoted by $\hat{u}$. In all examples, we set the depth of $\hat{u}$ to a fixed number, while the width depends on the individual example. Unless specified otherwise, all weights and biases of $\hat{u}$ are randomly initialized, and the same activation function is used for $\hat{u}$ throughout.

• Varying shifts in deflation operators. In the one-dimensional examples (Test Cases 1-4), constant shifts are sufficient to find all solutions. In the high-dimensional examples (Test Cases 5-6), varying shifts help to find more distinct solutions. In these examples, we set varying shifts according to (14).
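Two pieces of the setup above can be made concrete in a short NumPy sketch: the learning-rate schedule (20) and a generic FNN $\hat{u}$. The fan-in initialization and tanh activation here are illustrative; the paper's exact choices may differ.

```python
import numpy as np

def learning_rate(n, q0, q1, n_iter):
    """Schedule (20): tau_n = 10^(q0 + n*(q1 - q0)/n_iter); the power
    falls linearly from q0 to q1, so the rate decays exponentially."""
    return 10.0 ** (q0 + n * (q1 - q0) / n_iter)

def init_fnn(widths, rng):
    """Random weights/biases for a generic FNN; `widths` lists layer
    sizes, e.g. [1, 50, 50, 50, 1]."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(widths[:-1], widths[1:])]

def fnn(params, x):
    """Forward pass with tanh on hidden layers and a linear output."""
    h = np.atleast_2d(x).T if np.ndim(x) <= 1 else x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return (h @ W + b).ravel()

# Example: decay from 1e-2 to 1e-4 over 10000 iterations.
rates = [learning_rate(n, -2.0, -4.0, 10000) for n in (0, 5000, 10000)]
```

In the paper's setting the forward pass would run under PyTorch autograd so that the differential operators in the loss can be evaluated; the sketch above only fixes the architecture and schedule.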

We also summarize the numerical examples of this section in Table 1 below, which helps the reader to better understand how the extensive numerical examples demonstrate the advantages of the new ideas in this paper: 1) neural network deflation (NND); 2) structure probing initialization (SP); 3) special networks for boundary conditions (BC); 4) varying shifts in deflation operators (VS).

In each example, the parameters needed to obtain each solution are listed in a table next to the example. In these tables, we record the width of $\hat{u}$, the batch size, the number of iterations $N_I$, and the range of learning rates (i.e., the powers $q_0$ and $q_1$ in (20)). In each iteration of the optimization, the random samples are renewed. The value of the shift for each solution found by SP-NND is listed in the table as a constant for a fixed shift or as an interval for a varying shift.

### 5.1 Numerical tests in one dimension

First of all, we provide four numerical tests for one-dimensional problems. These tests show that the proposed NND works as well as existing methods [Farrell2015, Hao2014].

Test Case 1. We consider the second-order Painlevé equation [HOLMES1984, Fornberg2011, Noonburg1995] that seeks $u$ satisfying

 (21) $\frac{d^2u}{dx^2} = 100u^2 - 1000x \quad \text{in } \Omega=(0,1),$ (22) $u(0)=0,\quad u(1)=\sqrt{10}.$

It has been shown in [Hastings1989] that the Painlevé equation (21)-(22) has exactly two solutions, denoted here by $u_1$ and $u_2$.

In our experiments, we take the following special network

 (23) $u(x;\theta) = x(x-1)\,\hat{u}(x;\theta) + \sqrt{10}\,x$

that automatically satisfies the boundary conditions (22), and use the parameters in Table 2. The initial guess of $\hat{u}$ is random, as mentioned previously. The first solution $u_1$ can easily be found by the RM using (4), and the second solution $u_2$ is found by NND with $u_1$ as the deflation source. Other parameters associated with these solutions are listed in Table 2. Figure 1 visualizes the identified solutions $u_1$ and $u_2$, whose configurations agree with those found in [Farrell2015].
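A quick NumPy check that the special network (23) meets the boundary conditions (22) for any inner network $\hat{u}$ (the stand-in for $\hat{u}$ below is arbitrary):

```python
import numpy as np

def special_net_23(x, u_hat):
    """u(x) = x(x-1) * u_hat(x) + sqrt(10) * x, so u(0) = 0 and
    u(1) = sqrt(10) hold regardless of the inner network u_hat."""
    return x * (x - 1.0) * u_hat(x) + np.sqrt(10.0) * x

# Any stand-in for u_hat leaves the boundary values untouched:
u_hat = lambda t: np.cos(3.0 * t) + 2.0
```

This is why no boundary penalty term is needed when training with (23).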

To verify the effectiveness of special networks that satisfy the boundary conditions (22), we use NND without the special network for boundary conditions as a comparison. Hence, the loss function of NND is given by (19), with the solution network being a generic FNN of the same structure as in (23). To show that the results of (19) are largely independent of the boundary penalty weight, two different weights are used, and the corresponding solutions are denoted by $\tilde{u}_1$ and $\tilde{u}_2$, respectively. As listed in Table 2, the other parameters used to identify these two solutions are the same as those used to identify $u_2$, for a fair comparison. It is clear that these two solutions do not satisfy the boundary condition at the endpoint $x=1$ (see Figure 1). This verifies the importance of using special networks that satisfy the boundary conditions automatically.

Test Case 2. We consider a fourth-order nonlinear BVP that seeks $u$ such that

 (24) $\frac{d^4u}{dx^4} = \beta x(1+u^2) \quad \text{in } \Omega=(0,1),$ (25) $u(0)=u'(1)=u''(1)=0,\quad u''(0)-u''(\gamma)=0,$

where $\beta$ and $\gamma\in(0,1)$ are two given constants. Graef et al. [Graef2003_1, Graef2003_2] have proven that the problem (24)-(25) has at least two positive solutions for suitable choices of $\beta$ and $\gamma$.

The three-point boundary condition (25) is more complicated than usual. Accordingly, we construct the following special network for it,

 (26) $u(x;\theta) = (x-1)^3\,\hat{u}(x;\theta) + \hat{u}(0;\theta) + c_\gamma\, x(x-1)^3,$

where

 (27) $c_\gamma = \frac{1}{-12\gamma^2+18\gamma}\left(\frac{d^2}{dx^2}\Big((x-1)^3\hat{u}(x;\theta)\Big)\Big|_{x=\gamma} - \frac{d^2}{dx^2}\Big((x-1)^3\hat{u}(x;\theta)\Big)\Big|_{x=0}\right).$

It can be verified that (26) indeed satisfies the boundary condition (25) independent of $\hat{u}(x;\theta)$.
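This verification can also be done numerically. The sketch below builds (26)-(27) with finite-difference second derivatives and an arbitrary stand-in for $\hat{u}$:

```python
import numpy as np

def second_deriv(f, x, h=1e-4):
    """Central finite-difference approximation of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

def special_net_26(x, u_hat, gamma):
    """Special network (26): (x-1)^3 u_hat(x) + u_hat(0) + c * x(x-1)^3,
    with c chosen as in (27) to enforce u''(0) - u''(gamma) = 0."""
    g = lambda t: (t - 1.0) ** 3 * u_hat(t)
    c = (second_deriv(g, gamma) - second_deriv(g, 0.0)) \
        / (-12.0 * gamma ** 2 + 18.0 * gamma)
    return g(x) + u_hat(0.0) + c * x * (x - 1.0) ** 3
```

For any smooth `u_hat`, the quantities $u(0)$, $u'(1)$, $u''(1)$, and $u''(0)-u''(\gamma)$ all vanish up to finite-difference error, since every term of (26) except $\hat{u}(0;\theta)$ carries a factor $(x-1)^3$ or is fixed by $c_\gamma$.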

In our experiment, we find the first solution, denoted by $u_1$, by applying the RM (4). With $u_1$ as the deflation source, we find the second solution, denoted by $u_2$, by using NND (18). The parameters and solutions are shown in Table 3 and Figure 2.

Similarly, we test NND without a special network for boundary conditions as a comparison under the same setting as Test Case 1. We find two solutions, denoted by $\tilde{u}_1$ and $\tilde{u}_2$, corresponding to the two boundary penalty weights (see Figure 2). It is clear that both solutions are spurious since their configurations do not take the prescribed boundary values in (25) (see Figure 2), which confirms the effectiveness of using special networks for boundary conditions.

Test Case 3. We consider the fourth-order nonlinear equation describing the steady laminar flow of a viscous incompressible fluid in a porous channel [Xu2010]. For simplicity, we consider the one-dimensional problem that seeks $u$ such that

 (28) $\frac{d^4u}{dx^4} + \gamma\Big(x\frac{d^3u}{dx^3} + 3\frac{d^2u}{dx^2}\Big) + R\Big(u\frac{d^3u}{dx^3} - \frac{du}{dx}\frac{d^2u}{dx^2}\Big) = 0,\quad 0<x<1,$ (29) $u(0)=u''(0)=0,\quad u(1)=1,\quad u'(1)=0,$

where $R$ is the cross-flow Reynolds number and $\gamma$ is a physical constant related to the wall expansion ratio. Xu et al. [Xu2010] have proven that the problem (28)-(29) admits multiple solutions by analytic approaches. Three solutions were found by the homotopy analysis method (HAM) in [Liao2012] for a particular setting of $R$ and $\gamma$.

In our experiments, we take the same $R$ and $\gamma$ as in [Liao2012]. The special network for the boundary condition (29) is chosen as

 (30) $u(x;\theta,\hat{c}) = x(x-1)^2\big(x^2\hat{u}(x;\theta) + \hat{c}\big)e^{2x} + \sin(\pi x/2),$

where $\hat{c}$ is a scalar network parameter to be trained together with $\theta$; it is initialized along with the other network parameters as mentioned above. Firstly, one solution $u_1$ is found by the RM (4). Next, the second solution $u_2$ is obtained by NND (18) with $u_1$ as the deflation source. Moreover, the third solution $u_3$ is obtained by NND (18) with deflation sources $u_1$ and $u_2$. The corresponding parameters are shown in Table 4. The three found solutions and their first derivatives are plotted in Figure 3; they are the same solutions found in [Liao2012].

Also, a comparison test is performed by NND (19) with the same setting as above, except for using a generic solution network without the special structure for boundary conditions. We find two solutions, denoted by $\tilde{u}_1$ and $\tilde{u}_2$, using two different boundary penalty weights. Neither of them takes the prescribed boundary value 0 at $x=0$ or 1 at $x=1$, and hence they are spurious solutions (see Figure 3).

Test Case 4. We consider the following second-order problem that seeks $u$ such that

 (31) $\frac{d^2u}{dx^2} = f(u),\quad 0<x<1,$ (32) $u'(0)=0,\quad u(1)=0,$

where $f(u)$ is a polynomial in $u$. The existence of multiple solutions to the problem (31) has been studied by the bootstrapping method [Hao2014].

First, we take a right-hand side $f(u)$ of the problem (31) for which it is shown in [Hao2014] that there are two solutions. The special network for the boundary condition (32) is given by

 (33) $u(x;\theta) = x^2\hat{u}(x;\theta) - \hat{u}(1;\theta).$

The first solution $u_1$ is found by the RM (4), and the second solution $u_2$ is found by NND (18) with $u_1$ as the deflation source. Similarly to the preceding cases, we perform a comparison test without the special network structure for boundary conditions; two spurious solutions $\tilde{u}_1$ and $\tilde{u}_2$, corresponding to two boundary penalty weights, are found by NND (19). The parameters for all these solutions are shown in Table 5, and all solutions are plotted in Figure 4.
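As with the earlier cases, one can verify numerically that (33) enforces the boundary condition (32) for any inner network; the stand-in for $\hat{u}$ below is arbitrary:

```python
import numpy as np

def special_net_33(x, u_hat):
    """u(x) = x^2 * u_hat(x) - u_hat(1): then u(1) = 0 exactly, and
    u'(x) = 2x*u_hat(x) + x^2*u_hat'(x) vanishes at x = 0."""
    return x ** 2 * u_hat(x) - u_hat(1.0)

u_hat = lambda t: np.sin(2.0 * t) + 1.0
```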

Second, we repeat the test with another choice of $f(u)$, for which [Hao2014] has proved that there exist eight solutions in total. Note that $u\equiv 0$ is a trivial solution. In this case, we start from NND (18) with the special network (33) and the trivial solution as the deflation source to find the first solution $u_1$, which is quite close to $u\equiv 0$. We would like to emphasize that it is sufficient to use NND without the structure probing initialization to identify $u_1$ and