1. Introduction
Approximation theory plays a key role in scientific computing, including in the design of numerical PDE solvers. This theory prescribes a certain form of ansatz to approximate a solution to the PDE, allowing derivation of an algebraic problem whose solution yields the coefficients in the ansatz. Various methods are used to fine-tune the process of translation to an algebraic problem, but the accuracy of the computed solution is essentially determined by the underlying approximation theory. New approximation methods have the potential to produce new strategies for the numerical solution of PDEs.
During the past decade, driven by some remarkable successes in machine learning, neural networks (NNs) have become popular in many contexts. They are extremely powerful in such areas as computer vision, natural language processing, and games [52, 42]. What kinds of functions are well approximated by NNs, and what are the advantages of using NNs in place of more traditional approximation methods? Some studies [10, 51, 27] have revealed that NNs can represent functions in high-dimensional spaces very well. Unlike traditional approximation techniques, the number of NN coefficients needed to represent such functions does not increase exponentially with the dimension; in some sense, they overcome the “curse of dimensionality.” This fact opens up many possibilities in scientific computing, where the discretization of high-dimensional problems often plays a crucial role. One example is problems from uncertainty quantification, where many random variables are needed to represent a random field, with each random variable essentially adding an extra dimension to the PDE [69, 70, 41, 9]. Techniques that exploit intrinsic low-dimensional structures can be deployed on the resulting high-dimensional problem [38, 14, 11, 23, 45]. Another example comes from PDE problems in which the medium contains structures at multiple scales or is highly oscillatory, so that traditional discretization techniques require a large number of grid points to achieve a prescribed error tolerance. Efficient algorithms must then find ways to handle or compress the many degrees of freedom.
Despite the high dimensionality in these examples, successful algorithms have been developed, albeit specific to certain classes of problems. Given the rise of NN approximations and their advantages in high-dimensional regimes, it is reasonable to investigate whether NN-based strategies can be developed that may even outperform classical strategies. In this paper, we develop an approach that utilizes a two-layer NN to solve multiscale elliptic PDEs. We test our strategy on two nonlinear problems of this type.
The use of NNs in numerical PDE solvers is no longer a new idea. Two approaches that have been developed are to use NNs to approximate the solutions ([25, 29, 67, 66, 71, 58, 53, 13]) or the solution map ([36, 35, 59, 37, 54, 68]). Because approximation theory for NNs is complicated and unconventional, rigorous numerical analysis is challenging, though solid evidence has been presented of the computational efficacy of these approaches.
The remainder of our paper is organized as follows. In Section 2 we formulate the multiscale PDE problem to be studied and give an overview of our domain decomposition strategy and the general specification of the Schwarz algorithm. In Section 3, we discuss our NN-based approach in detail and justify its use in this setting. We then present our reduced-order Schwarz method based on two-layer neural networks. Numerical evidence is reported in Section 4: two comprehensive numerical experiments for the semilinear elliptic equation and the Laplace equation are discussed, and the efficiency of the methods is evaluated. We make some concluding remarks in Section 5.
2. Domain Decomposition and the Schwarz method for multiscale elliptic PDEs
We start by reviewing some key concepts. Section 2.1 describes nonlinear multiscale elliptic PDEs and discusses the homogenization limit for a highly oscillatory medium. Section 2.2 outlines the domain decomposition framework and the Schwarz iteration strategy.
2.1. Nonlinear elliptic equation with multiscale medium
Consider the following general class of nonlinear elliptic PDEs with Dirichlet boundary conditions:
(2.1) 
where is a domain in dimensional space, represents the small scale, and (where denotes the space of real symmetric matrices) is a smooth function. To ensure ellipticity, we require for all that
for all nonnegative semidefinite .
This class of problems has fundamental importance in modern science and engineering, in such areas as synthesis of composite materials, discovery of geological structures, and design of aerospace structures. The primary computational challenges behind all these problems lie in the complicated interplay between the nonlinearity and the extremely high number of degrees of freedom necessitated by the smallest scale. We assume that for an appropriately chosen boundary condition , the PDE (2.1) has a unique viscosity solution . For details on the theory of fully nonlinear elliptic equations, see for example, [15, 49].
To achieve a desired level of numerical error, classical numerical methods require refined discretization strategies with a mesh width , leading to at least degrees of freedom in the discretized problem. The resulting numerical cost is prohibitive when is small. The homogenization limit of (2.1) as can be specified under additional assumptions, such as when the medium is pseudo-periodic. Let
(2.2) 
for some that is periodic in the last argument with period . We have the following theorem.
Theorem 2.1 ([34], Theorem 3.3).
Suppose that the nonlinear function is uniformly elliptic and is nondecreasing. Let be pseudo-periodic as defined in (2.2). The solution to (2.1) converges uniformly as to the unique viscosity solution of the following equation
(2.3) 
where the homogenized nonlinear function is defined as follows: For a fixed set of , there exists a unique real number for which the following cell problem has a unique viscosity solution for some :
(2.4) 
(where is the period in the last argument of ). We set .
This result can be viewed as an extension of a linear homogenization result [5]. Although the medium is highly oscillatory for small , the solution approaches that of a certain limiting equation with a one-scale structure as . In practice, the form of the limit is typically unknown, but this observation has led to the exploration of numerical homogenization algorithms, in which one seeks to capture the limit numerically without resolving the fine scale . We view this problem as one of manifold reduction. The solution can be “compressed” significantly; its “information” is stored mostly in , which can be computed from (2.4) using mesh width , in contrast to the required to solve (2.1). In other words, the dimensional solution manifold can potentially be compressed into an dimensional solution manifold, up to a small homogenization error that vanishes as .
Remark 1.
Due to the popularity of the elliptic multiscale problem, the literature is rich. For linear elliptic PDEs, many influential methods have been developed, including the multiscale finite element method (MsFEM) [46, 33, 47], the heterogeneous multiscale method (HMM) [24, 3, 28], the generalized finite element method [8, 7], localization methods [60], methods based on random SVD [17, 16, 18, 19], and many others [2, 1, 62, 63, 61, 12, 43]. Many of these methods adopt an offline-online strategy. In the offline stage, local bases that encode the small-scale information and approximate the local solution manifold (space) with few degrees of freedom are constructed. In the online stage, the offline bases are used to compute global solutions on coarse grids, thus reducing online computation requirements drastically compared with naive approaches. For nonlinear problems, there is less prior work, and almost all methods can be seen as extensions of classical methods [32, 21, 31, 28, 4, 30, 44, 2, 57]. On a nonlinear solution manifold there is no counterpart of a linear basis, so most classical solvers construct local basis functions iteratively, which accounts for a large amount of overhead time. One strategy that avoids repeated online computation of local bases is to adopt an idea from manifold learning [20] based on preparing a dictionary for each local patch in the offline stage to approximate the local solution manifold. The major computational issue for classical multiscale solvers is thus greatly alleviated: repeated basis computation is reduced to basis searching on the manifold. However, since the method is locally linear, its efficacy depends on the amount of nonlinearity in the underlying PDE. Thus far, the approach is difficult to generalize to fully nonlinear elliptic PDEs, and a more universal methodology is needed to approximate the nonlinear solution map.
2.2. Domain decomposition and Schwarz iteration
A popular framework for solving elliptic PDEs is domain decomposition, where the problem is decomposed and solved separately in different subdomains, with boundary conditions chosen iteratively to ensure regularity of the solution across the full domain. This approach is naturally parallelizable, with potential savings in memory and computational cost. It essentially translates the inversion of a large matrix into a composition of inversions of many smaller matrices. The many variants of domain decomposition include the Schwarz iteration strategy that we adopt in this paper. This strategy makes use of a partition-of-unity function that resolves the mismatch between two solutions in adjacent subdomains. We briefly review the method here.
For simplicity we describe the case of and assume throughout the paper that for some . The approach partitions the domain into multiple overlapping subdomains, also called patches. It starts with an initial guess of the solution on the boundaries of all subdomains, and solves the Dirichlet problem on each patch. The computed solutions then serve as the boundary conditions for neighboring patches, for purposes of computing the next iteration. The entire process is repeated until convergence.
In the current setting, the overlapping rectangular patches are defined as follows:
(2.5) 
where is a multi-index and is the collection of the indices
We plot the setup in Figure 2.1. For each patch we define the associated partition-of-unity function , which has and
(2.6) 
We set to be the boundary of patch and denote by the collection of indices of the neighbors of . In this 2D case, we have
(2.7) 
Naturally, indices that are out of range, which correspond to patches adjacent to the boundary , are omitted from .
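As a concrete illustration of this geometric setup, the following minimal sketch builds overlapping rectangular patches on the unit square and their neighbor lists. The patch count and overlap width are illustrative values, and only edge-adjacent neighbors are collected; (2.5) and (2.7) give the exact definitions used in the paper.

```python
import itertools

def build_patches(n_per_dim=4, overlap=0.1):
    """Overlapping rectangular patches covering the unit square, in the spirit
    of (2.5): start from a uniform non-overlapping partition and enlarge each
    rectangle by `overlap` on the sides that do not touch the outer boundary."""
    h = 1.0 / n_per_dim
    idx = list(itertools.product(range(n_per_dim), repeat=2))
    idx_set = set(idx)
    patches, neighbors = {}, {}
    for (i, j) in idx:
        patches[(i, j)] = (max(0.0, i * h - overlap), min(1.0, (i + 1) * h + overlap),
                           max(0.0, j * h - overlap), min(1.0, (j + 1) * h + overlap))
        # Edge-adjacent neighbor indices; out-of-range indices are omitted.
        neighbors[(i, j)] = [(i + di, j + dj)
                             for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]
                             if (i + di, j + dj) in idx_set]
    return patches, neighbors

patches, neighbors = build_patches()
print(patches[(0, 0)], neighbors[(0, 0)])
```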
In the framework of domain decomposition, the full-domain problem is decomposed into multiple smaller problems supported on the subdomains. Define the local Dirichlet problem on patch by:
(2.8) 
For this local problem, we define the following operators:

is the solution operator that maps local boundary condition to the local solution :
Denoting by the number of grid points on the boundary and the number of grid points on the subdomain , then maps to .

denotes the restriction (or trace-taking) operator that restricts the solution within to its part that overlaps with the boundary of , for all . That is,
Denoting by the number of grid points in , then maps to .

is the composition of and . It is a boundarytoboundary operator that maps the local boundary condition to the restricted solution :
maps to .

denotes the collection of all segments of boundary conditions that is computed from the full-domain boundary condition :
(2.9) Letting , maps to .
The Schwarz procedure starts by making a guess of the boundary condition on each . At the th iteration, (2.8) is solved on each subdomain (possibly in parallel), and these solutions are used to define new boundary conditions for the neighboring subdomains , . The boundary conditions for at iteration are thus:
(2.10) 
Note that the physical full-domain boundary condition is imposed on the points in . Each iteration of the Schwarz procedure can be viewed as an application of the map . The procedure concludes by patching together the local solutions from the subdomains. The overall algorithm is summarized in Algorithm 1.
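As a concrete illustration of the structure of Algorithm 1, the following minimal sketch (a toy example of our own, not the solver used in the experiments, and without the partition-of-unity combination of (2.10)) runs a classical overlapping Schwarz iteration for the one-dimensional Poisson problem with two overlapping patches. Each local Dirichlet solve plays the role of the local solution operator, and reading the local solution off at the neighboring patch boundary plays the role of the boundary-to-boundary map.

```python
import numpy as np

def local_poisson_solve(a, b, ua, ub, f, n=200):
    """Solve -u'' = f on (a, b) with Dirichlet data u(a)=ua, u(b)=ub
    by a standard second-order finite-difference scheme."""
    x = np.linspace(a, b, n + 2)
    h = x[1] - x[0]
    # Tridiagonal system for the interior unknowns.
    A = (np.diag(2.0 * np.ones(n)) - np.diag(np.ones(n - 1), 1)
         - np.diag(np.ones(n - 1), -1)) / h**2
    rhs = f(x[1:-1])
    rhs[0] += ua / h**2
    rhs[-1] += ub / h**2
    u = np.empty(n + 2)
    u[0], u[-1] = ua, ub
    u[1:-1] = np.linalg.solve(A, rhs)
    return x, u

f = lambda x: np.ones_like(x)          # source term
g_left, g_right = 0.0, 0.0             # physical boundary conditions
# Two overlapping patches: (0, 0.6) and (0.4, 1.0).
bc_at_06, bc_at_04 = 0.0, 0.0          # initial guesses on interior patch boundaries

for it in range(50):
    # Local solves with current boundary guesses (could be done in parallel).
    x1, u1 = local_poisson_solve(0.0, 0.6, g_left, bc_at_06, f)
    x2, u2 = local_poisson_solve(0.4, 1.0, bc_at_04, g_right, f)
    # Boundary-to-boundary step: restrict each local solution to the
    # neighboring patch's boundary point.
    new_bc_at_04 = np.interp(0.4, x1, u1)
    new_bc_at_06 = np.interp(0.6, x2, u2)
    if max(abs(new_bc_at_04 - bc_at_04), abs(new_bc_at_06 - bc_at_06)) < 1e-10:
        break
    bc_at_04, bc_at_06 = new_bc_at_04, new_bc_at_06

print(f"converged after {it} iterations")
```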
The convergence of the classical Schwarz iteration is guaranteed for fully nonlinear elliptic equations; see, for example, [56, 55, 39]. Since the computation of the local solutions can be expensive due to the nonlinearity and the oscillation of the medium at the small scale , the major computational cost of the Schwarz iteration comes from the repeated evaluation of the boundary-to-boundary map , which requires the solution of an elliptic PDE on each subdomain.
3. Reduced order Schwarz method based on neural networks
The major numerical expense in the Schwarz iteration comes from the local PDE solves, one per subdomain per iteration. However, except at the final step where we assemble the global solution, our interest is not in the local solutions per se: it is in the boundary-to-boundary maps that share information between adjacent subdomains on each Schwarz iteration. If we can implement these maps directly, we can eliminate the need for local PDE solves. To this end, we propose an offline-online procedure. In the offline stage, we implement the boundary-to-boundary maps, and in the online stage, we call these maps repeatedly in the Schwarz framework. This approach is summarized in Algorithm 2. In this description, we replace the boundary-to-boundary map by a surrogate , a neural network parametrized by weights , whose values are found by an offline training process.
Since the online stage is self-explanatory, we focus on the offline stage and study how to obtain the approximation to .
3.1. Two observations
A rigorous approach to preparing the boundary-to-boundary map in the offline stage is not straightforward. In the case of linear PDEs, it amounts to computing all Green’s functions in the local subdomains and restricting them to the adjacent subdomain boundaries to form the map; see [19]. When the PDEs are nonlinear, there would seem to be no alternative to solving the local PDEs with all possible configurations of the boundary conditions, applying the appropriate restrictions, and storing the results. At the discrete level, would be represented as a high-dimensional function mapping to . To achieve a specified accuracy, both and need to scale as . For brute-force training, at least local PDE solves need to be performed to compute the required approximation to . This is a large amount of computation, and it offsets whatever gains accrue in the online stage from efficient deployment of the approximation to .
To be cost-effective, a method of the form of Algorithm 2 must exploit additional properties intrinsic to and to the scheme for approximating this mapping. The first such property is a direct consequence of homogenization. As argued in Section 2.1, the solution of the effective equation (2.3) can preserve the ground truth well, with the effective equation independent of . Therefore, the map , though presented as a mapping from to , is intrinsically of low dimension and can be compressed. To visualize this relation, we plot the relative singular values of the boundary-to-boundary operator of a linear multiscale elliptic equation (see (4.4)) in Figure 3.1. With the system being of intrinsically low dimension, we expect that a compression mechanism can be deployed. Even though the data itself is represented in high dimension, the number of parameters in the compressed representation should not grow too rapidly with the order of discretization. We seek an approximation strategy that can overcome the “curse of dimensionality.” These considerations lead us to the use of neural networks (NNs). NNs, unlike other approximation techniques, are powerful in learning functions supported on high-dimensional spaces; the number of parameters that must be tuned to fit data in a high-dimensional space depends only weakly on the dimension of the data.
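The low-dimensionality plotted in Figure 3.1 can be checked numerically for any matrix representation of the discrete operator. The sketch below uses a synthetic, approximately low-rank matrix as a stand-in, since assembling the true operator is problem-specific.

```python
import numpy as np

def relative_singular_values(M):
    """Singular values of M normalized by the largest one, as plotted
    (for the true discrete boundary-to-boundary operator) in Figure 3.1."""
    s = np.linalg.svd(M, compute_uv=False)
    return s / s[0]

# Illustration on a synthetic, approximately rank-10 matrix standing in for
# the discrete operator; the normalized spectrum drops sharply after ~10 modes.
rng = np.random.default_rng(0)
M = rng.standard_normal((120, 10)) @ rng.standard_normal((10, 120)) \
    + 1e-6 * rng.standard_normal((120, 120))
print(relative_singular_values(M)[:15])
```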
Consider a fully connected feedforward neural network (FCNN) representing a function . A two-layer FCNN with hidden-layer width takes the form
(3.1)
where , are weight matrices and , are biases. The activation function is applied componentwise to its argument. (The ReLU activation function is especially popular.) This two-layer FCNN can already represent high-dimensional functions, as illustrated by the sketch below.
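A minimal PyTorch sketch of such a two-layer (one hidden layer) ReLU network of the form (3.1), with illustrative input, hidden, and output dimensions, is:

```python
import torch
import torch.nn as nn

class TwoLayerFCNN(nn.Module):
    """u(x) ~ W2 @ relu(W1 @ x + b1) + b2, as in (3.1)."""
    def __init__(self, dim_in, width, dim_out):
        super().__init__()
        self.hidden = nn.Linear(dim_in, width)   # W1, b1
        self.output = nn.Linear(width, dim_out)  # W2, b2

    def forward(self, x):
        return self.output(torch.relu(self.hidden(x)))

# Example: a scalar-valued function on a 64-dimensional input, hidden width 128.
net = TwoLayerFCNN(dim_in=64, width=128, dim_out=1)
y = net(torch.randn(10, 64))  # batch of 10 inputs
```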
A fundamental approximation result [51, 26, 10] is captured in the following theorem.
Theorem 3.1 (Barron's Theorem).
Let be a bounded domain. Suppose a generic function satisfies
(3.2)
where is the Fourier transform of the zero extension of to . Then there exists a two-layer ReLU neural network with hidden-layer neurons such that
(3.3) 
A natural high-dimensional extension of this result is as follows.
Corollary 3.1.
Let be a bounded domain. Suppose a generic function is such that satisfies (3.2). Then there exists a two-layer ReLU neural network with hidden-layer neurons such that
(3.4) 
where .
A nice feature of this result is that the approximation error depends only weakly on the dimension of the problem, making NNs a good fit for our purposes. In our setting, it is the high-dimensional operator that needs to be learned. Theorem 3.1 suggests that if an FCNN is used as the representation, the number of neurons required will not depend strongly on this dimension.
3.2. Offline training and the full algorithm
The two observations above suggest that using a neural-network approximation for the boundary-to-boundary operator can reduce computation costs and memory significantly. Following (3.1), we define the NN approximation to as follows:
(3.5) 
Here denotes all learnable parameters, with weight matrices and biases . The number of neurons is a tunable parameter that captures the intrinsic dimension of . Theorem 3.1 and the homogenizability of the elliptic equation suggest that can be chosen to satisfy a prescribed approximation error while being independent of both and , and thus of the small scale .
Given a fixed NN architecture and a data set, the identification of the optimal amounts to minimizing a loss function that measures the misfit between the data and the prediction. One needs to prepare a set of input data and corresponding outputs
(3.6)
where solves (2.8). The loss function to be minimized is
(3.7) 
where evaluates the mismatch between its first and second arguments. (This measure could be defined using the norm and/or the norm.) Gradient-based algorithms for minimizing (3.7) have the general form
(3.8) 
where is the learning rate and is based on all gradients seen so far. For example, for the Adam optimizer [50], the function is a normalized exponentially decaying average of gradients:
(3.9) 
for some parameter . The sign means needs to be normalized so that .
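The following sketch isolates the first-moment part of such an update. The actual Adam optimizer also rescales by a second-moment estimate of the gradients; the decay parameter `beta` and the learning rate are illustrative values only.

```python
import numpy as np

def first_moment_step(theta, grad, m, t, lr=1e-3, beta=0.9):
    """One update of the form (3.8): theta <- theta - lr * B, where B is a
    normalized (bias-corrected) exponentially decaying average of gradients,
    as in (3.9). Only the first-moment part of Adam is shown."""
    m = beta * m + (1.0 - beta) * grad   # decaying average of past gradients
    b = m / (1.0 - beta ** t)            # normalization (bias correction)
    return theta - lr * b, m

# Usage on a toy quadratic loss L(theta) = ||theta||^2 / 2, whose gradient is theta.
theta, m = np.array([1.0, -2.0]), np.zeros(2)
for t in range(1, 101):
    theta, m = first_moment_step(theta, theta, m, t)
```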
Like many optimization processes, the training and tuning of this NN depend on some prior knowledge. We propose a mechanism for selecting training data that represent well the information in . We also initialize the weights according to a reduced linear problem. These mechanisms are described in the following two subsections; their effectiveness in numerical testing is demonstrated in Section 4.
3.2.1. Generating training data
To learn the parameters in the NN approximation to the boundary-to-boundary map, one needs to provide a training set of examples of the map. We generate such examples by adding a boundary margin of width to each interior patch to obtain an enlarged patch , as shown in Figure 3.2. Samples are generated by choosing Dirichlet conditions for the enlarged patch, solving the equation, and defining the map in terms of restrictions of both input and output conditions to the appropriate boundaries.
Specifically, following [20], we generate i.i.d. samples of the boundary conditions for the enlarged patch,¹ and solve the following equations for :
¹ The distribution of the samples is uniform in angle and follows a power law in the radius. Letting , we write with uniformly distributed on the unit sphere and distributed in according to the density function . To measure the discrete norm, we employ the formula , where we denote , and is the step size.
(3.10) 
The boundary-to-boundary map maps each element of to the corresponding element of , where
(3.11) 
This pair of sets — input set and output set — serves as the training data.
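A sketch of the boundary-condition sampling step is given below. The power-law exponent `alpha` and radius bound `r_max` are illustrative placeholders (the exact density is the one specified in the footnote above), and the local solves of (3.10) that produce the corresponding outputs are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_boundary_conditions(n_samples, dim, r_max=1.0, alpha=2.0):
    """Draw i.i.d. boundary-condition vectors g = r * v with v uniform on the
    unit sphere in R^dim and the radius r following a power-law density
    proportional to r**alpha on (0, r_max] (alpha and r_max are illustrative)."""
    v = rng.standard_normal((n_samples, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # uniform direction on the sphere
    u = rng.random(n_samples)
    r = r_max * u ** (1.0 / (alpha + 1.0))          # inverse-CDF sampling of the radius
    return r[:, None] * v

# Inputs for the local solves (3.10); the dimension is the number of boundary grid points.
G = sample_boundary_conditions(n_samples=1000, dim=64)
```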
3.2.2. Initialization
The training problem of minimizing in (3.7) to obtain the NN approximate operator is nonconvex, so a good initialization can improve the performance of a gradient-based optimization scheme significantly. We can make use of knowledge about the PDE to obtain good starting points. Our strategy is to assign initial weights and biases for the neural network using a linearization of the fully nonlinear elliptic equation (2.1). Denoting by the boundary-to-boundary operator of a linearized version of , to be made specific below for the numerical examples in Section 4, we initialize in a way that approximately captures . The linear boundary-to-boundary operator has a matrix representation. Denoting by the approximate rank (up to a preset error tolerance), we can write
(3.12) 
where and have orthonormal columns and is diagonal. As argued in [19], because the underlying equation is homogenizable, this rank is much less than and is independent of and .
To initialize the iteration for , we compare (3.5) with the form of (3.12). This suggests the following settings of the parameters in (3.5): and
(3.13)  
Note that . These values are used as the starting point for the iteration (3.8).
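A sketch of an initialization of this kind is given below. It assumes the positive/negative mode-pair structure observed later in Figure 4.5, using the identity ReLU(z) - ReLU(-z) = z to make the freshly initialized two-layer network reproduce the truncated SVD (3.12); the exact assignment used in the paper is the one in (3.13).

```python
import numpy as np
import torch
import torch.nn as nn

def svd_initialized_surrogate(Lambda_lin, tol=1e-6):
    """Two-layer ReLU network whose initial weights reproduce the truncated
    SVD of a linearized boundary-to-boundary matrix: each retained singular
    mode occupies a +/- pair of hidden neurons."""
    U, s, Vt = np.linalg.svd(Lambda_lin, full_matrices=False)
    r = int(np.sum(s / s[0] > tol))                         # retained rank
    Us = U[:, :r] * np.sqrt(s[:r])                          # U * sqrt(Sigma)
    Vs = np.sqrt(s[:r])[:, None] * Vt[:r]                   # sqrt(Sigma) * V^T
    net = nn.Sequential(nn.Linear(Lambda_lin.shape[1], 2 * r),
                        nn.ReLU(),
                        nn.Linear(2 * r, Lambda_lin.shape[0]))
    with torch.no_grad():
        net[0].weight.copy_(torch.tensor(np.vstack([Vs, -Vs]), dtype=torch.float32))
        net[0].bias.zero_()
        net[2].weight.copy_(torch.tensor(np.hstack([Us, -Us]), dtype=torch.float32))
        net[2].bias.zero_()
    return net

# Sanity check on a random matrix: before any training, the initialized
# network reproduces the (truncated) linear map.
A = np.random.default_rng(0).standard_normal((40, 60))
net = svd_initialized_surrogate(A, tol=0.0)
g = torch.randn(5, 60)
assert torch.allclose(net(g), g @ torch.tensor(A.T, dtype=torch.float32), atol=1e-3)
```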
We summarize our offline training method in Algorithm 3. Integration into the full algorithm yields the reduced-order neural-network-based Schwarz iteration method.
4. Numerical results
We present numerical examples using our proposed method to solve a multiscale semilinear elliptic equation and a multiscale Laplace equation. In both examples, we use the domain . To form the partitioning, is divided into equal non-overlapping rectangles, and each rectangle is then enlarged by on the sides that do not intersect , to create overlap. We thus have
The loss function is defined as in (3.7), with parameter . For training to obtain , we use PyTorch [64]. For both examples, each neural network is trained for 5,000 epochs using shuffled mini-batch gradient descent with a batch size of 5% of the training set size. The Adam optimizer is used with default settings, and the learning rate decays by a factor of 0.9 every 200 epochs.
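For reference, a PyTorch sketch of this training configuration is given below. The surrogate model, the training pairs, and the plain mean-squared misfit are placeholders (the actual loss is (3.7) with the mismatch measure (4.3)), while the optimizer, learning-rate schedule, and batch settings follow the values stated above.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholders: `model` stands in for the surrogate network and (G, Y) for the
# training pairs (3.6).
model = torch.nn.Sequential(torch.nn.Linear(40, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, 40))
G, Y = torch.randn(200, 40), torch.randn(200, 40)
batch_size = max(1, len(G) // 20)                   # 5% of the training set
loader = DataLoader(TensorDataset(G, Y), batch_size=batch_size, shuffle=True)

optimizer = torch.optim.Adam(model.parameters())    # default Adam settings
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.9)

for epoch in range(5000):
    for g_batch, y_batch in loader:                  # shuffled mini-batches
        optimizer.zero_grad()
        loss = torch.mean((model(g_batch) - y_batch) ** 2)
        loss.backward()
        optimizer.step()
    scheduler.step()                                 # decay by 0.9 every 200 epochs
```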
4.1. Semilinear elliptic equations
The first example is the semilinear elliptic equation
(4.1) 
with oscillatory medium defined by
(4.2) 
with . The medium is plotted in Figure 4.1.
The reference solution and the local PDE solves are computed using a standard finite-volume scheme on a uniform grid with mesh size , and Newton's method is used to solve the resulting algebraic problem. For our domain decomposition approach, we set to define the patches , with boundary margins to form . The input and output dimensions of are thus .
To obtain the training data, each patch is further enlarged to a buffered patch by adding a margin of to . On each patch , samples are generated with random boundary conditions defined by and . To train the NN, we use the loss function (3.7) with
(4.3)  
where is the discrete version of the derivative operator with step size . The second term measures mismatch in the derivative.
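A sketch of a misfit with this structure is given below; the weight `lam` on the derivative term and the normalization are illustrative placeholders, and (4.3) gives the exact form used.

```python
import torch

def misfit(pred, target, h, lam=1.0):
    """Mean-squared mismatch of the boundary values plus mismatch of their
    first finite differences (step size h), mirroring the structure of (4.3)."""
    d_pred = (pred[..., 1:] - pred[..., :-1]) / h      # discrete derivative operator
    d_target = (target[..., 1:] - target[..., :-1]) / h
    return torch.mean((pred - target) ** 2) + lam * torch.mean((d_pred - d_target) ** 2)
```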
To initialize the neural networks, we take to be the boundary-to-boundary operator of the following linear elliptic equation
(4.4) 
We truncate the rank representation of at rank to preserve all singular values larger than a tolerance .
4.1.1. Offline training
We show the improvements in the offline training process due to the two strategies described in Section 3.2: the use of enlarged (buffered) patches to generate training data, and initialization using the SVD of a matrix representation of a linearized equation. Figure 4.2 plots the training loss (4.3) associated with for the patch against the number of training epochs, in four different settings: SVD initialization with buffered training data, SVD initialization without buffered data, and the corresponding two settings with random initialization. The same NN model is used in all four settings. It is immediate that the training error decays much faster when the buffer zone is adopted, and that SVD initialization gives a much smaller error than random initialization.
To assess the generalization performance of the resulting trained NNs, we generate a test data set of 1,000 samples from the same distribution as the buffered training data, for the same patch . Since the NNs trained using non-buffered data produce larger errors, we test only the NNs trained with buffered data. The test errors (4.3) during the training process for the different models are plotted in Figure 4.3. Again, the use of buffered data along with SVD initialization yields the best performance.
To further illustrate generalization performance, we plot the predicted outputs for two typical examples from the test set in Figure 4.4. For comparison, we also plot the outputs produced by the randomly initialized neural network and by the linear operator . The low-rank SVD-initialized neural network has the best performance among all the initialization methods.
We note too that the neural network models initialized by the SVD of linear PDEs tend to be more interpretable. Figure 4.5 shows the final weight matrices for models initialized by different methods. The SVD-initialized model yields weight matrices with recognizable structure: the parameters for higher modes are near zero, and only the top modes in the positive and negative halves are nontrivial. By comparison, the trained weight matrices obtained with randomly initialized parameters show no discernible pattern or structure.
4.1.2. Online phase: Schwarz iteration
We show results obtained by using the NN approximation of the boundary-to-boundary map inside the Schwarz iteration. Table 1 shows the boundary conditions used for the three problems we tested. (The same medium (4.2) is used in all cases.) We use for the tolerance in Algorithm 2, and use the full-accuracy local solvers as in the generation of the training data set. In Figure 4.6, we plot the ground-truth solutions for the different boundary conditions and the absolute error of the solutions obtained by the neural-network-based Schwarz iteration. (Note that the scaling of the axis in the latter differs from the former.) The relative errors obtained for the four variants of NN approximation, along with the linear approximation to the boundary-to-boundary map, can be found in Tables 2 and 3. Note that the smallest errors are attained by the variant that uses SVD initialization and buffered patches. To demonstrate the efficiency of our method, we compare the CPU time of the neural-network-based Schwarz method and the classical Schwarz method, using the same tolerance for the latter. The NNs used for this test were trained with SVD initialization on data generated with the buffer zone. Since the NN-produced local boundary-to-boundary map is only an approximation to the ground truth, for a fair comparison we also run the reference local solver with a relaxed accuracy requirement. The CPU times, numbers of iterations, and errors can be found in Table 4. In all three test cases, the NN approximation executes faster than the conventional local solution technique as a means of implementing the boundary-to-boundary map, while producing errors of the same order.
Table 1. Boundary conditions for the three test problems.
No.  Boundary condition
1
2
3
Table 2. Relative errors for Problems 1 and 2.
Problem Number             1                         2
Relative Error
SVD-NN                     0.0013  0.0028  0.0029    0.0010  0.0010  0.0016
SVD-NN (No buffer zone)    0.0042  0.0091  0.0078    0.0029  0.0030  0.0039
Rand-NN                    0.0425  0.0445  0.0907    0.0379  0.0188  0.0370
Rand-NN (No buffer zone)   0.0882  0.0965  0.1555    0.0773  0.0400  0.0629
Linear                     0.0606  0.0644  0.1066    0.0505  0.0252  0.0415
Table 3. Relative errors for Problem 3.
Problem Number             3
Relative Error
SVD-NN                     0.0035  0.0059  0.0058
SVD-NN (No buffer zone)    0.0235  0.0341  0.0346
Rand-NN                    0.1029  0.1293  0.1333
Rand-NN (No buffer zone)   0.1739  0.2277  0.2078
Linear                     0.0614  0.0729  0.0776
Table 4. CPU time, iteration count, and error for the NN-based and classical Schwarz methods.
Problem Number   1                   2                   3
Method           NN      Classical   NN      Classical   NN      Classical
CPU time         12.4    17.2        13.9    18.5        13.4    19.5
Iterations       30      29          30      29          34      35
Error            0.0035  0.0024      0.0012  0.0007      0.0062  0.0022