
Multigrid-augmented deep learning preconditioners for the Helmholtz equation


Abstract

In this paper, we present a data-driven approach to iteratively solve the discrete heterogeneous Helmholtz equation at high wavenumbers. In our approach, we combine classical iterative solvers with convolutional neural networks (CNNs) to form a preconditioner which is applied within a Krylov solver. For the preconditioner, we use a CNN of type U-Net that operates in conjunction with multigrid ingredients. Two types of preconditioners are proposed: 1) U-Net as a coarse grid solver, and 2) U-Net as a deflation operator with shifted Laplacian V-cycles. Following our training scheme and data-augmentation, our CNN preconditioner can generalize over residuals and a relatively general set of wave slowness models. On top of that, we also offer an encoder-solver framework where an “encoder” network generalizes over the medium and sends context vectors to another “solver” network, which generalizes over the right-hand-sides. We show that this option is more robust and efficient than the stand-alone variant. Lastly, we also offer a mini-retraining procedure, to improve the solver after the model is known. This option is beneficial when solving multiple right-hand-sides, like in inverse problems. We demonstrate the efficiency and generalization abilities of our approach on a variety of 2D problems.

Helmholtz equation, Shifted Laplace preconditioner, Iterative solution methods, Multigrid, Deep Learning, U-Net, Convolutional Neural Networks.

68T07, 65N55, 65N22

1 Introduction

The efficient numerical solution of the Helmholtz equation with a high and spatially-dependent wavenumber is a difficult task. The linear system that results from the discretization of the Helmholtz equation involves a large, complex-valued, sparse, and indefinite matrix. As the wavenumber grows, the linear system needs to be larger, and the solution becomes more oscillatory and spatially complex. Many authors have contributed to the development of various iterative methods for solving the problem, e.g., [37, 31, 13, 16, 43, 9, 46, 12, 35, 40], yet it remains expensive and difficult to solve.

One of the most common methods to solve the heterogeneous Helmholtz equation is the Shifted Laplacian (SL) multigrid method [12, 47], which is based on a multigrid preconditioner for a damped version of the equation. That is, a complex shift is added to the matrix (and to its eigenvalues) to form a preconditioner that is efficiently solved by standard multigrid components. The preconditioning is typically applied within a Krylov solver, such as (flexible) GMRES [42]. However, the shifted Laplacian preconditioner is not scalable due to the high shift that is required for multigrid to solve the shifted system, and further investigation on the topic is needed. The work in [44] provides an analysis of the SL solver, showing that an increase of the wavenumber leads to a large increase in the number of iterations and the cost required for the solution. Indeed, the number of iterations in SL increases linearly with the wavenumber.

The numerical methods mentioned above are in some sense “classical”. In recent years, data-driven methods have revolutionized many fields, and in particular, Deep Learning (DL) approaches have changed the landscape of fields like computer vision, natural language processing and bioinformatics [28, 27]. Similarly, solving partial differential equations (PDEs) with a deep neural network is a natural idea, and has recently been considered in various forms, which are roughly divided into two classes, as we present below. Another related approach is the optimization of PDE solvers [17].

One common class of DL-for-PDEs approaches, often referred to as physics-informed neural networks (PINNs), includes both PDE solvers [38, 39, 30, 18] and PDE discovery methods [3]. The general idea is to implement a neural network that implicitly represents the PDE solution. That is, the solution is approximated by the neural network as an analytical function at a given point in space. This function is obtained using stochastic optimization for learning the network’s weights as an unsupervised learning task, given a large training set of points. The loss function that is minimized forces the network to satisfy both the system of equations and the initial and/or boundary conditions, using automatic differentiation to get the PDE derivatives. Therefore, the learning process essentially minimizes the residual, which is computed by differentiating the neural network function to get the terms of the PDE analytically. Following the learning process, the PDE coefficients and boundary conditions are assumed to be embedded into the neural network weights. This approach can be applied to different types of PDEs. As PINNs are evaluated at a single point, their training is in principle mesh-free, but it requires the training data set and the architecture to be large enough to capture the complexity of the solution and the problem. A nice property of PINNs is that they can also solve inverse problems, as suggested, for example, in [2] for the weighted Poisson equation and its inverse problem, using an extended unsupervised loss function that includes the residual norm.

A different family of approaches involves solvers in which DL is used to approximate the solution of the PDE in an end-to-end fashion given the problem properties (coefficients, initial conditions or sources). That is, the network needs to generalize over some of the PDE properties, which is the more common scenario in computer vision applications. For example, in [20] the authors propose 'PauliNet', a wavefunction ansatz based on a neural network that achieves nearly exact solutions of the electronic Schrödinger equation. Similarly, some works employ Convolutional Neural Networks (CNNs), which are typically used for applications involving data on regular grids, like images and videos, or discretized PDEs on a structured grid. For example, the authors of [14, 23] suggested CNNs to solve 1D and 2D parametric elliptic PDEs. Khoo and Ying [24] introduced a CNN named 'SwitchNet' for approximating forward and inverse maps arising from the time-harmonic wave equation, i.e., the Helmholtz equation in the frequency domain. In these problems, the local information in the input has a global impact on the output; hence, SwitchNet is based on convolution layers and operators of reshaping, switching, flattening, and point-wise matrix multiplication, with which the network performs knowledge-based actions and mimics the butterfly Helmholtz preconditioner [29]. The network is trained to solve the problem for plane-wave sources and generalizes to 2-4 Gaussian scatterers in a constant medium. Similarly, [48] employs a 3D CNN to predict near and far scattering fields. However, no neural network has been presented in the literature that effectively solves the Helmholtz equation at high wavenumber for rather general heterogeneous models and right-hand-sides.

The main principle of our work is merging data-driven approaches, such as DL, together with classical methods. We wish to exploit the strengths of the classical solvers and to complement them with DL, covering for their weaknesses. One DL accelerator adopting this approach, named CFDNet [34], has been suggested for fluid simulations. The authors present a coupled framework using both simulation and deep learning for accelerating the convergence of Reynolds averaged Navier-Stokes simulations. They show that their model performs well and also generalizes to examples not seen in the training. We note that while DL models are considered to be computationally heavy, the emergence of graphics processing units (GPUs) makes DL efficient and beneficial, both in terms of computational cost and in terms of parallelism. Furthermore, compression methods can accelerate the inference of DL methods by a significant factor [8], using a variety of approaches like pruning (sparsification), quantization, knowledge distillation and more, as the community aims to make DL feasible for low-power edge devices.

In order to solve the Helmholtz equation for a heterogeneous medium and a high wavenumber in a single application of a neural network, one would have to train a huge network to be able to capture all the complexity in the solution: a lot of oscillatory waves, reflections and interference. That is the case even for a specific type of right-hand-side. In the context of PINNs, the high variability of the solution may require the training to include a huge amount of points, which, together with the fact that we need a large network, makes the training expensive at solve time. To be effective, PINNs assume that the PDE solution is relatively simple, and that the network is light and can be trained with a relatively small data set of samples. As an example, the work of [33] considered the time-domain wave equation, and limited the solution to point sources and a given heterogeneous medium (i.e., the network does not generalize over the medium; it is fixed). While the first arrival wave is indeed well predicted, it seems that reflections are not recovered well by the NN, compared to the true solution. These reflections are important to capture for geophysical applications, e.g., Full Waveform Inversion [32]. Lastly, we note that NNs can easily approximate simple functions or solve problems to reasonable-but-low accuracy using a relatively low number of parameters. However, when high complexity and accuracy are needed, the typical size of the neural networks that are required grows tremendously. For example, to improve classification accuracy by only 2%, architectures need to almost double their size [19].

To summarize, given a problem we cannot be sure that a given architecture is sufficient to accurately represent the solution, and in hard cases huge architectures are needed. This brings us to the conclusion that in order to accurately solve complicated problems, the network has to be used iteratively, building and improving the solution over several steps, like a preconditioner. This requires, however, the ability to work on general right-hand-sides (residuals), which are multivariate grid-shaped inputs. Furthermore, if we wish to generalize over the medium as an input of the network, we have to use a network that consumes images, i.e., a CNN, as a full heterogeneous medium is an image. We note that generating training data for our task and training the CNN model are very expensive, and we assume they are done once, off-line, to be used multiple times later. However, the preconditioner is fast and efficient at solution time once the model is trained. That is also a reason why the generalization over the slowness model is important.

For the reasons stated above, in this work we utilize CNNs as preconditioners for the discretized Helmholtz equation. We propose to use a CNN together with a classical multigrid approach, the Shifted Laplacian, which is efficient at removing only part of the error. We build on the similarity between a V-cycle and a U-Net [41], which is a multiscale CNN used for image-to-image mappings such as semantic segmentation, image denoising, etc. Indeed, a geometric V-cycle can be naturally implemented using DL frameworks, suggesting that a U-Net architecture, like a V-cycle, can be used as a preconditioner for solving PDEs on a regular grid. Also, a V-cycle can be used to solve the equation for any medium and right-hand-side, meaning that it generalizes over the problem. This suggests that a U-Net, or at least a variant of it, can be trained to generalize over both a right-hand-side and a medium, and act as a preconditioner. In this work we consider a standard U-Net, but note that other variants are worthy of consideration.

To precondition the Helmholtz equation, we use a U-Net together with a classical approach, to complement the network and stabilize the solution process. Our U-Net architecture begins with a strided convolution, initially projecting the fine level input to a coarse grid. We examine two preconditioners: the first is a U-Net that acts as a non-linear coarse grid correction and is applied with Jacobi iterations for pre- and post-smoothing. In the second formulation we apply an alternating U-Net and SL multigrid V-cycle, which provides more powerful smoothing and somewhat resembles the deflation preconditioner of [44, 9]. Both preconditioners are applied within a Krylov subspace method, flexible GMRES (FGMRES). We define our U-Net to generalize over a random right-hand-side and a random piece-wise smooth slowness model. We also suggest two upgrades to the scheme above: (1) an encoder-solver framework and (2) a mini-retrain approach. Both upgrades are detailed later, and can improve the standard U-Net tremendously.

2 Preliminaries and Background

2.1 Problem formulation

The heterogeneous Helmholtz problem is given by

(2.1)    −Δu(x) − ω²κ²(x)(1 − γi) u(x) = g(x),    x ∈ Ω.

The unknown u represents the pressure wave function in the frequency domain, ω denotes the angular frequency, Δ is the Laplacian operator, and κ(x) is the heterogeneous wave slowness model, i.e., the inverse of the velocity. The source term is the right-hand-side (RHS) of the system, and is denoted by g. The parameter γ ≥ 0 indicates the fraction of attenuation (or damping) in the medium. We focus here on the hardest case for iterative methods, which is γ = 0, but it is also possible to consider a varying attenuation. As boundary conditions, we consider an absorbing layer [12, 11], but Sommerfeld, PML [45] or [36] can be viable options as well.

Discretization

The second order finite difference discretization of the aforementioned problem (2.1) on a uniform mesh of width h in both the x and y directions yields the stencil

(2.2)    (1/h²) [ 0 −1 0 ; −1 4 −1 ; 0 −1 0 ]  −  ω² κ_{i,j}² (1 − γi),

where boldface letters like κ denote discrete vectors (grid functions). This leads to a system of linear equations

(2.3)    A_h u = g.

When discretizing the Helmholtz equation, we must obey the rule of thumb that at least 10 grid nodes per wavelength are used, which leads to the bound ωκh ≤ 2π/10 ≈ 0.63. This typically requires a very fine mesh for high wavenumbers, significantly increasing the number of unknowns and making the system (2.3) huge and ill-conditioned, in addition to being indefinite and complex-valued due to the boundary conditions. Moreover, the number of eigenvalues of the matrix in (2.2) with negative real part increases as the wavenumber grows. Hence, solutions for two and especially three dimensional problems are challenging and require creative iterative solvers.
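To make the discretization concrete, the following is a minimal matrix-free sketch, in Julia (the language used for our code, see Sec. 4), of applying the stencil (2.2) with zero values outside the grid; the function name, the scalar attenuation argument, and the boundary treatment are illustrative simplifications rather than the exact implementation.

```julia
# Matrix-free application of the 5-point Helmholtz stencil (2.2) on a uniform
# grid with mesh width h. `x` is a (possibly complex) grid function, `kappa`
# is the slowness model on the same grid, and `gamma` is the attenuation
# (a scalar here; in the paper it is a grid function encoding the absorbing
# layer). Values outside the grid are treated as zero.
function helmholtz_apply(x::AbstractMatrix, kappa::AbstractMatrix, omega, gamma, h)
    n, m = size(x)
    Ax = similar(x, ComplexF64)
    for j in 1:m, i in 1:n
        xc = x[i, j]
        xl = i > 1 ? x[i-1, j] : zero(xc)
        xr = i < n ? x[i+1, j] : zero(xc)
        xd = j > 1 ? x[i, j-1] : zero(xc)
        xu = j < m ? x[i, j+1] : zero(xc)
        lap = (4xc - xl - xr - xd - xu) / h^2       # (1/h^2)[0 -1 0; -1 4 -1; 0 -1 0]
        Ax[i, j] = lap - omega^2 * kappa[i, j]^2 * (1 - gamma * im) * xc
    end
    return Ax
end
```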

2.2 Shifted Laplacian multigrid

The common approach to solving (2.3) is to use a preconditioner within an iterative Krylov subspace method. The SL operator

(2.4)    −Δu(x) − ω²κ²(x)(α − βi) u(x)

is used to accelerate the convergence of a Krylov subspace method for solving (2.3). The preconditioning matrix M_h is obtained from the discretization of (2.4), in the same way we discretize (2.1). The solution is computed from the (right) preconditioned system

(2.5)    A_h M_h⁻¹ v = g,    u = M_h⁻¹ v,

where A_h and M_h are the discrete Helmholtz and shifted Laplacian matrices, respectively. In this paper we focus on the pair α = 1 and β = 0.5, which in [12] is shown to lead to a good compromise between approximating (2.3) and our ability to solve the shifted system using multigrid tools. The SL multigrid method is very consistent and robust for non-uniform models. However, it is considered slow and computationally expensive, especially for large wavenumbers.

Geometric multigrid

Generally, multigrid methods are used to iteratively solve discretized PDEs, like the linear system in (2.3), defined on a fine grid using a hierarchy of grids. In a nutshell, in a two level setting, we project the fine level problem onto a coarser grid, solve the problem on that coarser grid, and then use the coarse solution to correct the solution on the original fine grid. This is repeated recursively to form a V-cycle, which is applied iteratively to solve the problem.

More explicitly, multigrid methods are based on two complementary processes: relaxation and coarse grid correction. The relaxation is obtained by a standard method like Jacobi or Gauss-Seidel, which is only effective at reducing part of the error. (In the case of the Helmholtz system, such relaxation methods do not converge due to the indefiniteness of A_h, yet they are effective at smoothing the error using 1-2 iterations only.) The remaining error, called “algebraically smooth”, is typically spanned by the eigenvectors of A_h corresponding to small eigenvalues (in magnitude), i.e., vectors e such that

(2.6)    A_h e ≈ 0.

To reduce such errors, multigrid methods use a coarse grid correction, where the error for some iterate x^(k) is estimated by solving a coarser system and interpolating its solution:

    x^(k+1) = x^(k) + P A_H⁻¹ R (g − A_h x^(k)).

Here, the matrix A_H is an approximation of A_h on a coarser mesh Ω_H, obtained for width H = 2h. To interpolate the solution from coarse to fine grids we choose the bi-linear interpolation operator P, which is suitable for the problem because the Laplacian operator in (2.4) is homogeneous. The restriction operator R, dubbed “full-weighting”, projects the fine-level residual to the coarse grid. Since the coarse problem is still too large to solve directly, the process of relaxation and coarse grid correction is applied recursively, resulting in the V-cycle algorithm given in Alg. 1.

Relax ν₁ times on A_h x = b with x as an initial guess;
if deepest level then
       Solve the system A_h x = b directly;
else
       r_H ← R (b − A_h x);
       e_H ← V-cycle(A_H, 0, r_H);
       x ← x + P e_H;
end if
Relax ν₂ times on A_h x = b with x as an initial guess;
Algorithm 1 Multilevel V-cycle: x ← V-cycle(A_h, x, b)

In Alg. 1 one has to choose the number of levels in the V-cycle. Unlike in other scenarios, the smooth error modes of the Helmholtz operator are oscillatory at high wavenumber, and cannot be represented well on very coarse grids. Hence, the performance of the solver deteriorates as more levels are used. For example, the results in [7] show that the best performance is achieved using three levels only, and the authors suggest using GMRES as the coarsest grid solver.

2.3 Machine and deep learning

The basic purpose of machine learning is to study an approximation of an unknown function y = f(x), using a generic model function f̂(x; θ) that has unknown parameters θ. The true function can be, for example, the category of an image x, and hopefully the model function is rich enough to approximate f. The parameters θ, called the weights of the model, are learned using a set of samples {x_i} from a distribution of reasonable inputs (e.g., natural images, spoken words, etc.) such that f̂(x_i; θ) ≈ f(x_i).

In supervised learning, the data set contains pairs {(x_i, y_i)}, so that y_i is the desired result for the model to provide given x_i. To this end, a loss function is defined, and the weights of the model function are estimated by minimizing it. For example, a supervised loss function for predicting the solution of a linear system like (2.3) can be based on the mean error

(2.7)    loss(θ) = (1/m) Σ_{i=1}^{m} ‖ f̂(g_i; θ) − u_i ‖²₂,

where the training set contains m pairs (g_i, u_i) satisfying (2.3).

Deep learning (DL) [15] is a sub-field of machine learning that is concerned with the collection of model functions belonging to the class of neural networks. A typical neural network architecture includes an input layer, an output layer and several hidden layers consisting of affine linear transformations and pointwise nonlinear activation functions. The basic layer reads

(2.8)    x_{l+1} = σ(W_l x_l + b_l),

where σ is the non-linear activation function, and W_l, b_l are the matrix and bias parameters of the l-th layer, respectively. The weights θ in (2.7) typically include all the parameters of all the layers. The process of optimizing neural networks, i.e., minimizing (2.7), consists of the forward pass, in which layers like (2.8) are applied one after the other, and the backward pass, in which the gradient ∇_θ loss(θ) is computed for applying iterative steps of stochastic gradient descent to minimize the loss. Deep neural networks usually include many hidden layers, which results in a very flexible function representation.
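As a small illustrative example (not taken from our implementation), a layer of the form (2.8) can be written in Flux as follows; the 16-to-16 sizes and the eLU activation are arbitrary choices.

```julia
using Flux

# One layer of the form (2.8): an affine map followed by a pointwise
# nonlinearity. Sizes and activation are illustrative only.
layer  = Dense(16 => 16, elu)          # x_{l+1} = elu.(W * x_l .+ b)
x_next = layer(randn(Float32, 16))     # forward pass on a random input
```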

A convolutional neural network (CNN) [27] is a type of a deep neural network, which is based on convolution kernels in (2.8), leading to a shared-weight architecture to handle data of large images and videos. CNNs are among the most effective methods for dealing with high-dimensional grid-like data, and learning spatial features. They are the leading methods for computer vision tasks like image classification, object detection, and semantic segmentation.

2.4 Geometric multigrid using CNN modules

In our work, both the CNN and the multigrid preconditioner are used interchangeably. We implement the multigrid method using CNN components, in order to enable a good integration of the V-cycle with the CNN in a unified framework within the preconditioner component, using a GPU back-end. The idea is based on the close connection between U-Nets and multigrid V-cycles.

The Helmholtz operator needed for the residual computation and the Jacobi relaxation is applied using a convolution operator and an element-wise vector multiplication with the mass term, according to the discretized Helmholtz operator (2.2) or the discrete shifted Laplacian (2.4). The high order discretization operators in [47] can also be obtained similarly.

The geometric multigrid transfer operators, the restriction (coarsening of the mesh) and the prolongation (interpolation), are realized by convolution operators with the fixed kernels

(2.9)    K_R = (1/16) [ 1 2 1 ; 2 4 2 ; 1 2 1 ],    K_P = (1/4) [ 1 2 1 ; 2 4 2 ; 1 2 1 ],

that correspond to the full-weighting and bi-linear operators, respectively. Both kernels are applied with a stride of 2. For the restriction, this results in a coarse mesh Ω_H. For the prolongation, we use a transposed strided convolution with K_P, so that the mesh is refined. The solution of the coarsest mesh problem is obtained using GMRES with a diagonal Jacobi preconditioner [7]. In addition, all these convolutions are applied with zero padding of size 1, which is consistent with the Dirichlet boundary condition (the ABC is obtained through the mass term). Neumann BCs can be obtained by using reflection or replication padding in CNN frameworks, which result in first and second order Neumann BCs, respectively.
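The following sketch shows how such fixed-kernel transfer operators can be expressed as strided convolutions in Flux; the grid sizes are illustrative, and the real solver applies these operators to complex-valued, multi-channel grids.

```julia
using Flux

# Full-weighting restriction and bi-linear prolongation as strided
# convolutions with fixed (non-learned) kernels, cf. Eq. (2.9).
w = Float32[1 2 1; 2 4 2; 1 2 1]
R = Conv(reshape(w ./ 16, 3, 3, 1, 1), false; stride=2, pad=1)          # fine -> coarse
P = ConvTranspose(reshape(w ./ 4, 3, 3, 1, 1), false; stride=2, pad=1)  # coarse -> fine

r_fine   = randn(Float32, 129, 129, 1, 1)    # a residual as a WHCN array (1 channel, batch 1)
r_coarse = R(r_fine)                         # 65 x 65 coarse grid
e_fine   = P(randn(Float32, 65, 65, 1, 1))   # refined back to 129 x 129
```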

3 Multigrid-augmented deep learning preconditioners

We present a CNN in a U-Net architecture to accelerate the convergence of the Helmholtz solution, utilizing the GPU capabilities and parallel computation of DL frameworks. In order for the computational complexity to be cost-effective, we define the architecture to be relatively small by DL standards (see Fig. 1 and details later), and yet it provides a significant acceleration of the solution. Our DL-based preconditioner is applied within the FGMRES Krylov method [42]. The flexible variant is important as our preconditioner is non-linear, hence also non-stationary.

Figure 1: Illustration of our U-Net architecture. The purple layers denote the input and the output of the network, while the orange layers are hidden layers. The red arrows indicate a downsampling convolution operator, followed by a ResNet step, while the green arrows indicate the upsampling convolution operator. The blue arrows indicate feature map concatenations.

3.1 A U-Net architecture

The U-Net architecture [41] is a CNN for image-to-image tasks such as semantic segmentation, where each pixel is classified. It consists of two rather symmetric parts: contraction and expansion. As described in [41], the purpose of the contraction part is to capture the significant contexts and features, and the purpose of the expansion part is to enable precise localization. In the contraction part, the feature maps are coarsened and the number of channels is increased to get the context vectors. In the expansion part of the U-Net, the number of feature channels is reduced, but the resolution of the channels is increased. This allows the network to propagate context information from the lower resolution layers to higher resolution layers. Since the contraction and expansion paths are more or less symmetric to each other, this yields a U-shaped architecture.

In our case, given a slowness model and a right-hand-side sampled on a grid (viewed as images), we wish to predict a solution for (2.3) on that grid (an image). Furthermore, when solving the Helmholtz equation, it is important to allow a point source to spread to the entire mesh, as each right-hand-side can be viewed as multiple point-sources. For this reason, a suitable network structure for this task can be a U-Net, since, like a multigrid cycle, it has a multiscale structure. That is, restriction and prolongation layers are performed, allowing each point to affect a wider neighborhood of grid cells (or pixels).

Our U-Net, presented in Fig. 1, receives a complex valued right-hand-side (residual), the slowness model κ², and the attenuation model γ (which includes the absorbing BC, but can also include a heterogeneous attenuation). The non-linearity of the network architecture allows us to learn the complex interactions between these input “channels”, and minimize the error loss function w.r.t. the weights. In a nutshell, on its way down the hierarchy, the architecture interchangeably applies down-sampling convolutions and ResNet convolution blocks, which are similar in some sense to Jacobi smoothing steps. Each step applies convolutions and a non-linear activation function. Then, on the way up the hierarchy it applies up-sampling convolutions. The down- and up-sampling are applied using strided convolutions, similarly to (2.9) but with learned weights.

Specifically, on the way down the U-Net, we start with a down-sampling operator, which is performed using a strided convolution (a restriction) layer

(3.1)    x_{l+1} = BN(σ(K↓_l x_l + b_l)),

where K↓_l denotes the strided convolution operator, which maps the input channels to a larger number of output channels while coarsening the feature maps by a factor of 2 in each dimension (stride 2). We use a kernel of 5 by 5, so that the network can learn rather high-order transfer operators. σ denotes an element-wise exponential linear unit (eLU) activation function. We chose a smooth activation function to allow for spatially smooth feature maps, which are necessary because our solution is expected to be rather smooth. The batch normalization layers are expressed by BN, whose role is to stabilize the optimization of the network [22].

In between each restriction, while going down the hierarchy, we apply ResNet blocks. The structure of our ResNet block is described as

(3.2)    x_{l+1} = x_l + K_{l,2} BN(σ(K_{l,1} x_l + b_l)).

The weights K_{l,1} and K_{l,2} are two convolution operators with 3-by-3 sized kernels. ResNet layers are explicitly reformulated as an addition of residual functions with respect to the inputs, instead of learning direct mappings like the standard neural network step in (2.8) or (3.1). In [19], the authors provide empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from increased depth. In the “coarsest” level of our U-Net, before the prolongation starts, 3 ResNet layers are performed one after the other.

When we go up the hierarchy, we apply the up-sampling convolutions (like prolongations) to refine the feature maps and decrease the channel space. Such layers are given by

(3.3)    x_{l+1} = [ BN(σ(K↑_l x_l + b_l)), x̄_l ],

where K↑_l denotes a strided transposed convolution operator with a 5-by-5 kernel, which maps the incoming channels to a smaller number of channels. The brackets [·, ·] express a concatenation of the current prolonged feature maps and the corresponding feature maps x̄_l from the restriction phase. This is marked by the blue arrows in Fig. 1, and is similar in spirit to the multigrid coarse grid correction.
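A rough Flux sketch of the building blocks of (3.1)-(3.3) is given below; the channel counts are left as arguments, and the 3-by-3 ResNet kernel size and the exact ordering of batch normalization and activation are assumptions.

```julia
using Flux

# Down-sampling layer, cf. (3.1): 5x5 strided convolution, eLU, batch norm.
down(cin, cout) = Chain(Conv((5, 5), cin => cout, elu; stride=2, pad=2),
                        BatchNorm(cout))

# ResNet block, cf. (3.2): two convolutions with an additive skip connection.
resblock(c) = SkipConnection(Chain(Conv((3, 3), c => c, elu; pad=1),
                                   BatchNorm(c),
                                   Conv((3, 3), c => c; pad=1)),
                             +)

# Up-sampling layer, cf. (3.3): 5x5 strided transposed convolution, eLU,
# batch norm; the result is then concatenated with the feature maps saved
# from the contraction path.
up(cin, cout) = Chain(ConvTranspose((5, 5), cin => cout, elu; stride=2, pad=2),
                      BatchNorm(cout))
skip_cat(x_up, x_saved) = cat(x_up, x_saved; dims=3)   # WHCN layout: channels are dim 3
```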

3.2 U-Net as a preconditioner accelerator

The U-Net is used to accelerate the solution of the Helmholtz equation. As noted before, since the medium can be very heterogeneous, the solution, even for a point source or a plane wave, can be highly complex at high wavenumbers due to reflections and interference. Therefore, we do not pursue a solution via a single application of a network, but aim for a network that is applied several times in the form of a preconditioner. This ensures that the discrete equation is indeed satisfied to any desired accuracy.

In this work we aim for the U-Net to complement either the Jacobi smoother or the shifted Laplacian multigrid method. Like a relaxation, the SL is efficient at handling error modes with rather large absolute eigenvalues, as the shift does not perturb those eigenvalues by a significant factor. Since SL or a relaxation are efficient at smoothing (i.e., reducing residuals), we aim for the U-Net to focus on the error directly, accompanied by classical smoothing. That is, we wish to help the U-Net so it can focus on error modes that are not treated well by smoothing or SL. This guides our training procedure, as we describe later on. Furthermore, the smoothing is important since we apply the preconditioner inside a Krylov method, which is sensitive to high residuals. Below we list the two preconditioning schemes that we propose.

U-Net as a coarse grid solver with Jacobi smoothing

Our first version of the preconditioner applies one pre- and one post- Jacobi relaxation with a U-Net in between. Since the U-Net starts and ends with strided convolutions (like a restriction and a prolongation), this ends up being similar to a two-level cycle. Alg. 2 summarizes the process, where the matrix A denotes the discrete Helmholtz operator, D = diag(A), and w is a damping factor.

Algorithm: preconditioner x = M⁻¹(r):
  1. Perform a Jacobi step: x ← w D⁻¹ r (starting from x = 0).

  2. Compute the residual r̃ = r − A x, and apply a U-Net: x ← x + UNet(r̃, κ, γ).

  3. Perform a Jacobi step: x ← x + w D⁻¹ (r − A x).

Algorithm 2 U-Net as a coarse grid solver with Jacobi smoothing.
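A minimal sketch of Alg. 2 follows, reusing the helmholtz_apply sketch from Sec. 2.1; the damped-Jacobi helper and the unet argument (any callable mapping a residual and the models to an approximate error) are illustrative stand-ins for the actual implementation.

```julia
# One damped Jacobi relaxation for A x = b, where D = diag(A) follows the
# stencil (2.2): D = 4/h^2 - omega^2 * kappa^2 * (1 - gamma*im).
function jacobi_step(x, b, kappa, omega, gamma, h; w=0.8)
    D = 4 / h^2 .- omega^2 .* kappa .^ 2 .* (1 - gamma * im)
    return x .+ w .* (b .- helmholtz_apply(x, kappa, omega, gamma, h)) ./ D
end

# Alg. 2: Jacobi pre-smoothing, U-Net correction on the residual, Jacobi
# post-smoothing. The preconditioner maps a residual r to an approximate error.
function precond_unet_jacobi(r, kappa, omega, gamma, h, unet)
    x   = jacobi_step(zero(r), r, kappa, omega, gamma, h)     # step 1 (zero initial guess)
    res = r .- helmholtz_apply(x, kappa, omega, gamma, h)     # step 2: current residual
    x   = x .+ unet(res, kappa, gamma)                        #         U-Net correction
    return jacobi_step(x, r, kappa, omega, gamma, h)          # step 3
end
```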

U-Net as a deflation operator

A slightly more evolved option is to apply the U-Net, and afterwards to smooth its output using an SL V-cycle. More explicitly:

(3.4)    x = V-cycle(M_h, UNet(r, κ, γ), r),

i.e., the U-Net output serves as the initial guess for a shifted Laplacian V-cycle applied with the residual r as the right-hand-side. This option somewhat resembles the concept of a deflation operator studied in [43], only there it is without the non-linearities, and with a coarse grid solution instead of a network. There, the solution is projected using an interpolation operator, and a coarse grid solution is obtained, in addition to SL cycles. Here, the operator corresponding to the restriction is the strided convolution layer at the entrance to the network, and the last U-Net layer is a transposed convolution, corresponding to the prolongation that returns the mesh to its original dimension.
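As a sketch, assuming a callable vcycle(e0, r) that applies one SL V-cycle (Alg. 1) to the shifted system with initial guess e0 and right-hand-side r, the preconditioner of Eq. (3.4) is a one-liner:

```julia
# Eq. (3.4): use the U-Net output as an initial guess, then smooth it with an
# SL V-cycle. `unet` and `vcycle` are assumed callables (hypothetical names).
precond_unet_vcycle(r, kappa, gamma, unet, vcycle) = vcycle(unet(r, kappa, gamma), r)
```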

3.3 Encoder-Solver: an encoder network as a preconditioner setup

So far, we introduced a U-Net that receives a residual and a slowness model, and returns an approximate solution. That is, our U-Net generalizes over both the residual and the slowness. This setup is quite challenging compared to the setups in the previous works [24, 33, 48], where the right-hand-side and slowness are represented by very few parameters, and the solution is obtained by a single forward pass, and not as a preconditioner. Obviously, the less we have to generalize over, the easier it will be for the network to approximate the solution well. This is also verified in our experiments, where we show that the same network performs better when the slowness model is known during training and during the Krylov solution at test time (see Sec. 4.2).

Having the observation above in mind, we wish to exploit the following: when we solve an equation over several iterations, the slowness model remains the same and is known during those iterations; the U-Net is applied with the same model over and over again, with different residuals. So, in the grand scheme we need to generalize over the slowness model, but we need not do it at every step. To this end, we propose to split the generalization over the residual and the slowness model into two networks: an “encoder” and a “solver”. The encoder network receives only the slowness model as input, and prepares feature vectors that include the slowness model information. Those vectors are computed only once at the beginning of the solution process, and are fed into the solver network at every iteration at solution time. The solver network is applied at every iteration, with the readily available feature vectors inside it. This way we can have a large network for the slowness model, which will be applied only once, while the solver network for the residual can be cheaper. This idea resembles a preconditioner setup that is applied once before the solution process, and is adequate for any right-hand-side.

Figure 2: Illustration of our Encoder-Solver scheme. The encoder U-Net gets the slowness model and outputs a hierarchy of feature maps. These are fed into the solver U-Net (gray arrows), which receives the residual and the slowness model as input, and outputs an approximation of the corresponding error.

Figure 2 illustrates our proposed encoder-solver architecture. The encoder network is also built in a U-Net structure and is almost identical to the network we described in Sec. 3.1. The outputs of the encoder network, the context for the solver network, are hierarchical feature vectors which we concatenate in each layer at the restriction and the prolongation phases of the solver network. The concatenations in the restriction and the prolongation steps can be described as

(3.5)    x_{l+1} = BN(σ(K↓_l [x_l, z_l] + b_l))    and    x_{l+1} = [ BN(σ(K↑_l [x_l, z_l] + b_l)), x̄_l ],

respectively, where z_l is the feature vector from the corresponding phase in the encoder network.

3.4 Training and data augmentation

In order for the U-Net to be efficient as a preconditioner at solve time, it must practice, or learn, on residual vectors that are as close as possible to the residuals encountered in FGMRES at solve time. Let

(3.6)    e_net = UNet(r, κ, γ; θ)

be the forward application of the network for a given right-hand-side (residual) r and slowness model κ. As before, θ denotes the collection of network weights (convolutions, biases, etc.) that we wish to learn. Since we wish to complement smoothing processes, in our supervised learning setup we aim to minimize the error with respect to multiple residuals and models. That is, given training data triplets {(r_i, κ_i, e_i)}_{i=1}^{m}, where A(κ_i) e_i = r_i, we minimize

(3.7)    min_θ (1/m) Σ_{i=1}^{m} ‖ UNet(r_i, κ_i, γ; θ) − e_i ‖²₂,

which is the mean ℓ₂ error norm, also known as the mean squared error (MSE). The training phase is performed by a stochastic gradient descent optimizer, where in each epoch we sweep through all the training examples in batches and in varying order. Specifically, we use the ADAM optimizer [25] with a variable step size.
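Schematically, one such training step in Flux looks as follows; the U-Net model and the mini-batched data of inputs and target errors (Sec. 3.4) are assumed to exist, and the default Adam step size and the epoch count stand in for the actual schedule described in Sec. 4.

```julia
using Flux

# Schematic training loop: ADAM with an MSE loss, cf. Eq. (3.7).
opt_state = Flux.setup(Adam(), model)              # optimizer state for the U-Net
for epoch in 1:100                                 # epoch count is illustrative
    for (x, e) in data                             # mini-batches of inputs and target errors
        grads = Flux.gradient(m -> Flux.Losses.mse(m(x), e), model)
        Flux.update!(opt_state, model, grads[1])
    end
end
```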

For supervised learning, we have to provide the network with the corresponding error for each residual vector and slowness model. To obtain such triplets, in some cases one needs to invest a lot of time and resources in preparing the training set, i.e., solve the PDE many times. However, since we have a linear system here, we can simply generate a random solution and easily compute the residual. But we wish to create a data set that is as close as possible to the situation in the real solution using FGMRES, and if we simply take the FGMRES residual vectors, we will not have a corresponding error without solving the system. To obtain that, we need to generate data with rather smooth vectors at varying “smoothness levels”, since FGMRES and V-cycles are smoothing operations. Hence, we propose the following procedure, for which, in our experience, the training error predicts the final solution’s convergence factor quite accurately. First, we generate a random normally distributed solution x* and compute g = A x*. Then, we apply a random number k of FGMRES iterations with an SL V-cycle as a preconditioner, starting from a zero vector, to get x̃:

(3.8)    x̃ = FGMRES_k(A, g; x₀ = 0, SL V-cycle preconditioner).

Following that, we compute the residual r = g − A x̃ and let r be the network input. The true error that we wish the network to recover is given by

(3.9)    e = x* − x̃.

Beyond the procedure above, we insert the normalized residual r/‖r‖ as the input to the network, and scale the corresponding error target by the same factor. This causes the network’s input and output to be of the same order of magnitude, and because the system is linear, this change has no effect on the accuracy. (The magnitudes of the input and output are important because of how certain default parameters are chosen in DL frameworks, e.g., the initialization of weights and biases and the scales of activation functions. For this reason, the input is usually normalized, and batch normalization is applied throughout the network.) Furthermore, to avoid over-fitting and achieve better generalization, we augmented the training examples using linear combinations of the errors and residuals to expand the set. This summarizes the training data generation, given the slowness models. The training set and the validation set are manufactured in the same way, and we saw no difference in the loss between these two.
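The following sketch summarizes the generation of one training pair according to (3.8)-(3.9); it reuses the helmholtz_apply sketch from Sec. 2.1, the fgmres_sl argument is a hypothetical helper for the preconditioned FGMRES run, and the range of smoothing iterations is an illustrative choice.

```julia
using LinearAlgebra: norm

# Generate one (residual, error) training pair, cf. Eqs. (3.8)-(3.9).
# `fgmres_sl(g, kappa, gamma; iters)` is a hypothetical helper running `iters`
# FGMRES iterations preconditioned by an SL V-cycle, from a zero initial guess.
function training_pair(kappa, omega, gamma, h, n, fgmres_sl)
    xtrue = randn(ComplexF64, n, n)                             # random "true" solution
    g     = helmholtz_apply(xtrue, kappa, omega, gamma, h)      # its right-hand-side
    xk    = fgmres_sl(g, kappa, gamma; iters=rand(0:10))        # Eq. (3.8); range is illustrative
    r     = g .- helmholtz_apply(xk, kappa, omega, gamma, h)    # residual: the network input
    e     = xtrue .- xk                                         # Eq. (3.9): the target error
    s     = norm(r)
    return r ./ s, e ./ s                                       # normalized as in the text
end
```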

Overall, our network has 4 input channels of the size of the grid. These are the real and imaginary parts of the residual r, the real (squared) slowness model κ², and the real attenuation model γ. Although we do not generalize over γ, we found it slightly beneficial to include it as an input, as it allows the network to have a different behaviour at the grid's boundaries.

Remark

As an alternative to the error loss in (3.7), one may use the residual norm ‖A·UNet(r, κ, γ; θ) − r‖, so that the training can be unsupervised [2]. This is a viable option, but it can be a bit misleading, since high errors can in fact have low residuals: these are the “algebraically smooth” errors. Since the remaining error modes can be eliminated with V-cycles or Jacobi iterations, we wish to let the network focus on those smooth modes and not on the whole spectrum. We note that, empirically, augmenting the error norms in (3.7) with residual norms generally yields inferior performance compared to focusing on the errors only during training.

3.4.1 Slowness models

Solving the Helmholtz equation for a uniform slowness model is not difficult, as it can be performed using a Fourier transform. When trying to solve this equation over a non-uniform medium, the difficulty increases significantly, especially if there are sharp jumps or large ratios between the smallest and largest slowness. Here, we want the network to succeed on different types of models, smooth and non-smooth, that faithfully represent reality. In certain applications (e.g., geophysical or medical imaging) one may have slowness models of a certain typical distribution coming from the application domain. E.g., in geophysical imaging, one typically has a layered model with higher velocity values deeper in the earth. In machine learning tasks, if we wish for an algorithm to generalize over an input, it is best if we let the algorithm learn on inputs from the same distribution as in the real life application. That is a large part of the advantage of data-driven approaches. Usually, however, to obtain good generalization we need a large data set of inputs: a collection of slowness models in our case.

Figure 3: Four examples of models generated from the CIFAR10 image collection. Each model is a random image from the set, enlarged to a grid of 128 by 128 pixels, smoothed using a Gaussian kernel and normalized to the range [0.25, 1].

Here, we use a rather general slowness model collection to demonstrate our method. The models that we use for the U-Net training are created from the large set of natural images CIFAR-10 [26]. This data set is usually used as an image classification benchmark and contains 50K images of 32 by 32 pixels. We use this data set and manipulate the images to become slowness models, in order to have a significant number of training models to learn from. In particular, we use a training set of randomly sampled images from this collection. To transform the natural images into slowness models, we first enlarge them to the grid size using bi-linear interpolation, then smooth them with a single application of a Gaussian smoothing kernel, and normalize the values to the range [0.25, 1]. Fig. 3 shows four different models produced from the CIFAR-10 image collection. Fig. 4 presents how the smoothness level of the model κ affects the difficulty of the solution. It is evident that as the medium is less smooth, the solution is more complex and is harder to model by a neural network. Similarly, the range of slowness values in the model also affects the complexity of the solution. Fig. 5 illustrates how the solution is again more complex as the contrast in the slowness model is larger.
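A sketch of this model-generation pipeline, assuming the Images.jl package, is given below; the smoothing width is an illustrative value.

```julia
using Images   # assumes Images.jl; imresize and imfilter come from its re-exports

# Turn a CIFAR-10 image into a slowness model: enlarge to the grid size with
# (bi-)linear interpolation, apply one Gaussian smoothing, normalize to [0.25, 1].
function image_to_slowness(img, n)
    m = imresize(Float64.(Gray.(img)), (n, n))     # grayscale, enlarged to n x n
    m = imfilter(m, Kernel.gaussian(1))            # one pass of Gaussian smoothing
    lo, hi = extrema(m)
    return 0.25 .+ 0.75 .* (m .- lo) ./ (hi - lo)  # values in [0.25, 1]
end
```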

Figure 4: Illustration of smoothness levels of κ, and their effect on the full solution of the Helmholtz equation for a point source. In the upper row we show the model for each test. The leftmost, (a), is an image from the CIFAR10 data-set, enlarged to the grid, and the two inner figures, (b) and (c), are the same model after a Gaussian smoothing with kernel sizes of 5 and 10, respectively. The rightmost, (d), is the constant homogeneous model. In the bottom row we show the full solution for the corresponding model in the first row.
Figure 5: Illustration of the effect of the value range of κ on the full solution of the Helmholtz equation for the point source problem. The leftmost, (a), is a full solution for κ normalized between 0 and 1; (b), (c) and (d) show the solutions for other normalization ranges of κ.

3.5 CNNs vs. PINNs and a proposed mini re-training

In the approach of PINNs, one approximates the true solution as a point-wise neural network that receives a location x in a mesh-free manner, and possibly another simple property such as a source location, or a plane-wave direction for the right-hand-side. The PDE is then solved by the optimization process for learning the neural network's weights, and any underlying heterogeneous medium or boundary conditions are assumed to be incorporated into the learned weights through the loss minimization. However, this way, we are quite limited in generalizing over the model coefficients, and must have an architecture that is powerful enough to approximate the solution well with some learned weights. The upside is that the optimization hopefully leads to the best approximation possible for the specific solution given the architecture. As the learning is in fact the solution of the PDE, we also need to assume that the optimization is not expensive and does not require forward-backward passes over too many points.

In some sense, if we define a CNN with 1-by-1 convolutions only, and feed it with the point values on a grid or a mesh, then we get multiple separate instances of the point-wise network in each forward pass, one for every point on the grid. A real CNN will also incorporate spatial connections between neighboring grid points, hence it is a more powerful model. Similarly, in cases where a spatially adaptive mesh is more suitable than a regular grid, a graph neural network can replace the role of the CNN for the same tasks [10]. So, if we were to define our CNN with the coordinates as input, and consider training at solve time, we would get a CNN version of PINNs. The true difference, however, is that here we try to generalize over grid-sized inputs, like the residual r and the slowness model κ, which is the way CNNs are used in computer vision, for example.

With the comparison above in mind, our next proposed approach aims to further improve the solver U-Net by a mini retraining procedure at solve time for the given slowness model. That is, given a slowness model we apply a few training iterations over a randomly generated data set of residuals to improve the U-Net weights. This is similar to the process in PINNs, and is motivated by our relatively better performance for known models compared to the case where we generalize over the models. This option fits cases where we need to solve multiple linear systems with the same slowness model but for many right-hand-sides. Such a scenario arises in inverse problems, where the medium needs to be estimated. There, the slowness model evolves iteratively and does not change much between consecutive iterations of the inverse solution. Hence, the model retraining can gradually improve the forward solver network along with the iterations, like in the PINNs forward/inverse solver in [2]. We note that here our retraining phase uses the error-based supervised training procedure described earlier, and not the residual-based training in the common approaches of PINNs.

4 Numerical Results

In this section we evaluate the performance of the preconditioners together with FGMRES for solving the Helmholtz equation for random right-hand-sides. In particular, we use the block restarted FGMRES(10) described in [6], with a subspace of size 10, solving 10 different right-hand-sides together (a block size of 10). We consider this to be the preconditioner test, and later discuss and demonstrate the influence of the number of right-hand-sides and the Krylov subspace size. In addition, we present the relative error norm plot throughout the training period, which shows the value of the loss function (3.7) vs. the epoch number. Ideally, the asymptotic value of the relative training error should indicate the real convergence factor at solve time using FGMRES. It is indeed approximately so, thanks to the varying smoothing levels that we apply to our training residuals in Eq. (3.8). Using this plot, we can also estimate over-fitting and adjust the hyper-parameters (learning rate and mini-batch size, for example). Our code is written in the Julia language [4], utilizing the Flux DL framework [21].

Unless stated otherwise, the test cases that we present are for (2.1) defined on a square 2D domain, discretized on a nodal regular grid, with a wave frequency chosen such that there are about 12.5 grid points per wavelength. We use an absorbing boundary layer of width 10 to prevent reflections from all the model boundaries. We compare three preconditioners: the SL V-cycle, the U-Net with Jacobi smoothing of Alg. 2, and the U-Net followed by a V-cycle as in Eq. (3.4). We evaluate the methods as preconditioners by their convergence factor, calculated as

(4.1)    ρ = (‖r_I‖ / ‖r_0‖)^{1/I},

where r_i is the residual at the i-th iteration and I is the number of iterations required to reach the prescribed reduction of the residual norm. For the SL V-cycles we use the shift of Sec. 2.2, 3 levels, 1 pre- and 1 post- damped Jacobi smoothing with a damping factor of 0.8, and a coarsest grid solution by GMRES(10) with the diagonal Jacobi preconditioner, as suggested in [7]. Under these settings, FGMRES with the SL V-cycle preconditioner alone requires a large number of V-cycles to reach convergence, and its convergence factor deteriorates further on larger grids and at higher frequencies. The advantage of SL is that it is quite consistent and robust for uniform, smooth and non-smooth models: the difference in the convergence factors between models is small. Note that the convergence factor according to (4.1) is typically lower than the asymptotic convergence factor.
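For completeness, computing (4.1) from a recorded history of residual norms is straightforward:

```julia
# Convergence factor of Eq. (4.1): the geometric-mean residual reduction per
# iteration, given the sequence of residual norms r_0, r_1, ..., r_I.
conv_factor(resnorms::AbstractVector{<:Real}) =
    (resnorms[end] / resnorms[1])^(1 / (length(resnorms) - 1))
```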

The training process for all the U-Nets is performed by the ADAM optimizer [25], and includes 80-120 epochs. The initial mini-batch size is 20. Along the epochs we reduce the learning rate and increase the mini-batch size. The schedule of these changes, as well as the number of epochs we apply, varies a bit between the different experiments and is adjusted by cross validation on the test set. In the majority of the cases, after 48 epochs we divide the initial learning rate by 10 and double the mini-batch size. The subsequent changes of the learning rate and mini-batch size occur after every half of the previous interval, that is, at epochs 48, 72, 84, 90, etc., until the learning rate essentially reaches zero. These settings are used as the default parameters for all the experiments that we show.

4.1 Performance of a homogeneous model

Figure 6: Experiments for a homogeneous slowness model. (a) the training plot. (b) the residual drop history of the FGMRES method with the three preconditioners (the SL V-cycle, Alg. 2, and Eq. (3.4)), together with their convergence factors. (c) shows the real part of the result of a single U-Net operation on a point source, compared to the result of a single V-cycle for the same input in (d).

In the homogeneous test case, we solve the system (2.3) for a uniform model, known at training. That is, the training is performed on random residual and error vectors only, as described earlier. Fig. 6 describes the results of the proposed method for this case. The average relative error, i.e., the final value of the loss function (3.7) on the training and test sets, is low at the end of the training process. The FGMRES method with the U-Net-based preconditioner of Eq. (3.4) converges within 10 iterations, with a much better convergence factor than using only V-cycles. Our cheaper preconditioner, which uses only our U-Net and Jacobi steps as described in Alg. 2, converges within 12 iterations. These are very favorable convergence factors given how poorly V-cycles behave for a homogeneous model. On the other hand, homogeneous problems can be efficiently solved using the fast Fourier transform.

4.2 Performance on a single heterogeneous model

In this section we examine our method for a single heterogeneous slowness model, known during the training phase. In this scenario, the network can learn weights specific to that model. As shown in Fig. 7, for the easier model normalization we obtained convergence factors that are close to the performance on the uniform model, while for the more difficult normalization the convergence factors are somewhat worse. Furthermore, in plot (c) of Fig. 6 and in the bottom plots (c) and (e) of Fig. 7, we show the result of a single iteration of Eq. (3.4) on a centralized point source, a problem that is not included in the training set. It is evident that the U-Net has indeed learned to produce a global solution, and the integration with the V-cycle is intended to “clean” or smooth the solution and ensure convergence.

Figure 7: Experiments for a known heterogeneous slowness model, comparing two normalization ranges of κ. (a) shows the convergence in the training phase and (b) shows the results of the three preconditioner tests. The bottom figures describe the real part of the result of a single U-Net operation: (c) for one normalization range, compared to the real part of the full solution in (d), and (e) for the other range, compared to the full solution in (f).

4.3 Generalization on the slowness model

As we showed in the previous section, our U-Net can learn to solve (2.3) quite efficiently for a given heterogeneous slowness model. However, training the U-Net is too expensive to perform at solve time. In a more applicable approach for the real world, we wish the neural network to work on, and generalize over, models it has not seen in the training phase. Therefore, we now present the ability of the proposed method to generalize to different models. In addition, we also present the upgraded versions of our approach: the encoder-solver dual networks and the mini re-training approach.

In the experiments in this section, the training phase was performed on a set of residual-error pairs produced as described earlier in Sec. 3.4, with slowness models randomly generated from a collection of images from the CIFAR10 data-set. In the preconditioner test, we compare the performance of our methods to that of a V(1,1)-cycle, which is also the cycle used within the preconditioner of Eq. (3.4). We will examine the capabilities of the preconditioner using

  1. A stand-alone U-Net with smoothing/deflation.

  2. The encoder-solver framework.

  3. All options of the U-Nets after a modest “mini retraining” phase at solve time.

Table 1 and Figure 8 summarize and compare the results for all three cases, and in the next subsections we discuss each case separately. For all options we examine how the solver performs in the setting of its training, i.e., on the grid size, frequency and slowness normalization used for training, but we also examine the performance on larger grids and frequencies, and on a different normalization range of κ. In particular, we solve (2.3) for random right-hand-sides on larger grids with correspondingly higher frequencies, using a network trained on residuals and slowness models of the original size only. Table 1 presents the results for the encoder-solver setting. Regardless of the specific preconditioner, we see that if the networks are trained on a single model, they exhibit the best performance. However, they fail to generalize to slightly different cases, even for the easier normalization of the same model (row (a) vs. row (b) in the single model column). If we train the networks to generalize over slowness models, they can also generalize over the problem size and slowness contrast. All cases are improved using mini-retraining.

              Single model    General model    Re-trained
(a)
Generalization for cases not seen in training:
(b)           fail
(c)           fail
(d)           fail
Table 1: Performance comparison of the U-Net-based preconditioners. ρ is the convergence factor and I is the number of iterations required to reach the prescribed reduction of the residual norm. Row (a) summarizes the numerical results of the method on new examples equal in difficulty to the training set. Rows (b), (c) and (d) describe the performance of the same U-Nets on problems of a different size or with a different level of contrast in the slowness model. The “single model” column demonstrates the performance when the U-Net is trained using a single heterogeneous model. In the “general model” column the network is trained on multiple slowness models. The “re-trained” column describes the performance after mini-retraining.
Figure 8: The convergence history of the solutions of (2.3) for random RHSs. First row: the stand-alone U-Net for problems of increasing size, (a)-(c). Second row: the encoder-solver U-Nets for the same experiments. Solid and dotted lines mark the preconditioners without and with the mini-retraining phase at solve time, respectively.

4.3.1 A stand-alone U-Net

At the end of the training phase, the average relative error of the stand-alone U-Net levels off (the training plot is not shown). Due to the way the samples are produced and the limited network size, this value is consistent for both the training and the validation sets, i.e., there is no over-fitting. The preconditioner test of the U-Net is described in the top left plot of Fig. 8. The preconditioner of Eq. (3.4) achieves convergence within a moderate number of iterations, while the cheaper preconditioner of Alg. 2 achieves a somewhat worse convergence factor. When examining the network's ability to generalize to sizes larger than the training size, the Alg. 2 preconditioner does not improve on the V-cycle performance, while the preconditioner of Eq. (3.4) reaches better convergence factors than the V-cycle for both of the larger sizes. Next, we show that with little effort, the U-Nets can do better than that.

4.3.2 An encoder-solver pair of networks

Taking advantage of the fact that the model does not change throughout the preconditioner test, we now use the encoder-solver framework, where another neural network pre-processes the slowness model and generates context vectors for the solver U-Net. We train both networks as a single component, and in the preconditioner test we run the encoder network once and use the same feature vectors received from it throughout the iterations, executing only the solver network many times.

The average relative error of the training and validation sets at the end of the training phase is lower than in the stand-alone case. In the preconditioner test phase, both U-Net-based preconditioners reached convergence within fewer iterations and with better convergence factors than their stand-alone counterparts. Plot (d) of Fig. 8 demonstrates the results of the preconditioner test on inputs of the training size, consistent with the training set. The other two plots, (e) and (f), describe the generalization capability for larger-sized inputs and frequencies.

Beyond the improvement in the preconditioner test of  and especially of the more cost-effective , the ability of the preconditioner to generalize to other sizes also improved significantly with the addition of the encoder network. Problems of -size converge within 62 iterations, compared to 391 using the V-cycle only. For the  grids (with ), the number of iterations is reduced significantly: more than 600 iterations are required with , compared to 141 with .

4.3.3 Mini-retraining towards a single model network at solve time

Due to the significant gap in performance when generalizing over the slowness model (as opposed to knowing the model), we now add a mini-retraining phase to tune the network weights to a specific model. The experiment consists of three phases: training to learn the solver for different slowness models, mini-retraining to tune the network to a specific model, and the preconditioner test.

At this point, there is a trade-off between the cost of the retraining process and the success of the preconditioner test. The more resources we invest in the retraining phase, the closer the result will be to the single-model result in Table 1. We present the enhancement obtained by a re-training phase that includes 30 epochs on 100 different pairs produced by a single slowness model . The re-training for problems of -size was done using 30 epochs on 200 pairs, while for problems of size  we performed a re-training of 30 epochs on a set of 300 samples. These amounts of data samples and epochs are just a fraction of the corresponding amounts of a full neural network training. Still, the retraining is quite expensive and suits the case of multiple right-hand-sides only. To reach the single-model performance, a large set is required, as re-training on a set that is too small may over-fit the input pairs and not just the model. The dotted lines in the graphs of Fig. 8 describe the effect of re-training on the preconditioner test.
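As a minimal illustration, a mini-retraining loop of this kind could look as follows in Julia/Flux (assuming a recent Flux version with the explicit-gradient API); the network, the data-generation routine, the loss, and the optimizer settings are placeholders rather than the exact configuration used in our experiments.

    using Flux

    # Placeholder for the pre-trained solver network (in practice, the encoder-solver U-Nets).
    solver = Chain(Conv((3, 3), 2 => 8, relu; pad=1), Conv((3, 3), 8 => 2; pad=1))

    # A few hundred (residual, error) pairs produced from the *known* slowness model.
    make_pair() = (rand(Float32, 64, 64, 2, 1), rand(Float32, 64, 64, 2, 1))   # dummy data
    data = [make_pair() for _ in 1:300]

    loss(model, r, e) = Flux.mse(model(r), e)
    opt_state = Flux.setup(Adam(1.0f-4), solver)

    # 30 epochs over 300 pairs: a small fraction of a full training run.
    for epoch in 1:30
        for (r, e) in data
            grads = Flux.gradient(m -> loss(m, r, e), solver)
            Flux.update!(opt_state, solver, grads[1])
        end
    end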

In plot (d) of Fig. 8, the encoder-solver network reached a good network model in its training, and hence performs well and is only slightly improved by retraining, saving only 5-6 iterations in each case (20%). In contrast, when re-training for larger problems, the network learns both the model and the form of the solution at the new problem size. The dotted lines in plots (e) and (f) show a significant improvement from the mini-retraining process. For -size, the  preconditioner converges within  iterations compared to  before the re-training. The  preconditioner converges within  iterations instead of . For  problems, the improvement is even more significant, as we obtained convergence in 75 iterations compared to 141 iterations (almost a factor of two) without the re-training process for learning the model and adjusting to the -size. Note that the SL preconditioner for  converges in more than  iterations, and also achieves a significant improvement following the re-training.

Plots (a), (b) and (c) in Fig. 8 show the results of the same experiment with a stand-alone U-Net. We see that the improvement is even more significant than in the encoder-solver case, probably because the starting point of the retraining phase is worse in the stand-alone experiment than in the encoder-solver case, and hence there is more room for improvement.

4.3.4 The influence of the Krylov subspace and block sizes

Figure 9: The influence of the Krylov subspace and block sizes. On the left we see the influence of the block size (convergence for a number of right-hand-sides solved together by block FGMRES(10)). On the right we see the influence of the subspace size on the convergence factor.

In this set of experiments we examine the influence of both the subspace size and the number of RHSs solved together (block size) in the FGMRES method. The experiment corresponds to the one shown in Fig. 8(e). Fig. 9 summarizes the results. Clearly, whether through multiple RHSs or through a larger subspace, increasing the effective Krylov subspace can improve the performance of the solver significantly. That is in contrast to the SL V-cycle, which does not improve by the same magnitude. This indicates that the U-Net-based preconditioners have more potential than shown earlier; in addition, we typically did not reach the training-error drop in the preconditioner test, indicating that FGMRES stabilizes the solution on the one hand, but also limits the preconditioners due to high residuals.

4.3.5 GPU computation time

In this set of experiments, we compare the running and convergence times of the different preconditioners. To have a fair comparison, it is necessary to run the various preconditioners on the same processing unit. Since the U-Net is tailored to run efficiently on a GPU, we chose to run all the code on a GPU as well and compare the running times on this platform, which is quite common these days. In addition to the GPU implementation of the multigrid cycle described in Sec. 2.4, we also implemented the FGMRES method using the implementation of linear layers from the Julia Flux library. This function was used both in the external loop that compares the performance of the preconditioners and in the coarse-grid solver of the V-cycle method. Since multiplication of 64-bit matrices and vectors on our GPU is significantly slower than in 32-bit, we performed all the computations and timings in 32-bit. In our experiments, the convergence rate was not affected by the transition from CPU to GPU or from 64-bit to 32-bit, except for the minimum attainable residual norm, which is limited to only  in 32-bit. This behavior is expected.

The FGMRES function that we use here supports the block structure in the most naive way, which is less efficient than the full block-FGMRES used earlier in the CPU-based implementation. That is because the block-FGMRES method includes QR decompositions, which are highly inefficient and time-consuming on a GPU. As a result, as we increase the block size (number of right-hand sides), the efficiency drops slightly, but the computations better utilize the GPU resources. In our experiments we measured the running times of the algorithms on blocks of 10 RHSs, and timings were averaged across 10 runs. The 32-bit GPU implementation was roughly 3-4 times faster than the CPU-based multicore implementation. Our experiments were conducted using an NVIDIA RTX 3090 GPU operated through the Julia Flux framework.
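For illustration, the following sketch (assuming the Flux and CUDA.jl packages on a CUDA-capable GPU; depending on package versions, additional GPU bindings may be required, and the network here is a placeholder, not our U-Net) shows the measurement pattern: 32-bit data, a block of 10 RHSs, and timings averaged over 10 runs after a warm-up call.

    using Flux, CUDA, Statistics

    # Placeholder network moved to the GPU; the real experiments use the trained U-Nets.
    unet = Chain(Conv((3, 3), 2 => 8, relu; pad=1), Conv((3, 3), 8 => 2; pad=1)) |> gpu
    r = CUDA.rand(Float32, 256, 256, 2, 10)      # a 32-bit block of 10 right-hand-side residuals

    unet(r)                                      # warm-up (triggers compilation)
    times = [(@elapsed CUDA.@sync unet(r)) for _ in 1:10]
    println("mean time per block of 10 RHSs: ", mean(times), " s")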

Figure 10: Convergence history relative to running times. The x-axis indicates the average time per RHS (in seconds or milliseconds) it took to reach each relative residual norm value. These experiments were measured on a GPU with 32-bit data for problems of size (a) , (b)  and (c) .

Fig. 10 describes the residual convergence relative to the solution time. These runs correspond to the ones presented in the second row of Fig. 8 in terms of the problem setup (grid size, slowness models) for the various preconditioners, including those based on the re-training process. In addition, for each preconditioner we indicate the convergence factor (Eq. (4.1)) and the number of iterations to convergence.

In these experiments, the  preconditioner achieves better results than the  preconditioner for both -grids and -grids, even though it performs a larger number of iterations, since the runtime of each of its iterations is significantly smaller. Both  and  achieve better results than the V-cycle, even for -grids and even without the re-training. For -size problems, the  preconditioner after 250 milliseconds (per RHS) and the  preconditioner after 600 milliseconds reduced the residual norm by a factor of , while the V-cycle preconditioner reached the same result after more than 2.5 seconds. For -grids, the two proposed preconditioners,  and , converge in less than 1.5 seconds, while the V-cycle converges after almost 10 seconds. For the larger grids of size ,  converged within 4 seconds,  converged within 2.8 seconds, and the V-cycle needed more than 25 seconds.

4.3.6 Out of distribution models

Having examined the ability of the U-Nets to solve problems produced using the CIFAR-10 training set of slowness models, in this experiment we present the ability of the CNN to adapt and generalize to realistic geophysical models that were not included in the training set. This aspect is important for the usability of the pre-trained CNN on new slowness models which are not available at training time. We do note that if one has a specific application in mind, it is better to use training data that are representative of that application. Here, we demonstrate a case where a new slowness model is out of the training distribution, as may inevitably happen in real life.

Fig. 11 describes the comparison between the performance of the encoder-solver U-Nets and the V-cycle for problems produced by three different geophysical slowness models that are commonly used in the literature: the SEG salt and Overthrust models [1], and the Marmousi model [5]. These models include many sharp edges and are significantly different from the rather smooth models that we produced to train the CNN. We again compare the convergence history of the residual norm with respect to the iterations of block-FGMRES for each of the preconditioners. As before, the combination of U-Net and V-cycle () is more effective than the V-cycle alone even without the re-training phase, i.e., a CNN trained on other types of models of size  is effective for different geophysical models and problem sizes not seen during training.

Figure 11: Performance for geophysical models. First row: (a) the SEG/EAGE salt-dome model (), (b) the Overthrust velocity model () and (c) the Marmousi velocity model (). Second row: the convergence history of block-FGMRES on 10 Helmholtz problems produced by the respective slowness model in the row above. The dotted lines describe the results of a 30-epoch re-training process for the specific model on a training set of 300 RHSs.

For each of the models we performed a re-training of 30 epochs on a set of 300 pairs produced using the specific slowness model, and marked the resulting performance with dotted lines in the plots. First, for all three models the re-training process makes the preconditioner more effective than the V-cycle. In particular, for the first two models, the SEG/EAGE salt-dome model and the Overthrust velocity model, re-training saves about 100 iterations, depending on the desired threshold of the residual norm. The cost of the retraining, 30 (epochs) × 300 (training-set size) × 3 (U-Net operation + gradient calculation + the encoder's gradient), amounts to about 27,000 U-Net applications. Divided by the gain of 100 iterations per RHS, the retraining in these settings is beneficial if we need to solve about 270 RHSs or more. Such a number is small compared to the number of RHSs that need to be solved as part of the solution of an inverse problem with many sources. Note that the slowness models here are significantly different from the ones used for training, and we expect a lighter retraining phase if the grid size and training-set distribution better match the target problems.
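In equation form, the break-even point implied by these counts is

\[
\underbrace{30}_{\text{epochs}} \times \underbrace{300}_{\text{samples}} \times \underbrace{3}_{\text{network applications per sample}} = 27{,}000
\quad\Longrightarrow\quad
\frac{27{,}000}{\approx 100 \ \text{iterations saved per RHS}} \approx 270 \ \text{RHSs}.
\]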

5 Conclusions

In this work we presented a new approach for preconditioning the heterogeneous Helmholtz equation. Our approach is based on a combination of CNN and multigrid components, using a U-Net CNN as a coarse-grid solver, or as a deflation operator together with SL V-cycles. The idea is that the CNN is trained to reduce the error, while the smoothing or V-cycles are responsible for stabilizing the solution process in a Krylov solver.

Since we target a preconditioner, our CNN is trained to generalize over the right-hand-sides (residuals) and over the slowness models. To help with the generalization, we propose an encoder-solver framework, where two CNNs are used: one prepares context vectors and generalizes over the slowness, and the other receives the context vectors and generalizes over the right-hand-sides. This encoder-solver framework improves upon the stand-alone U-Net in all cases. In addition, inspired by PINNs, we offer a mini-retraining phase over the slowness model at solve time, since the method is more effective when the slowness model is known. This consistently yields a further improvement in performance. We show that while our U-Net may require more FLOPs than traditional methods, it can be applied efficiently on GPU hardware and yields favorable running times. Furthermore, in many cases CNN models can be significantly compressed in a variety of ways without harming their performance, making the preconditioner even more beneficial.

In this work we used the U-Net architecture "as is". However, better architectures can probably be designed specifically for the Helmholtz equation. Furthermore, since problems typically involve a more specific right-hand-side (e.g., a point source or a plane wave), the first iteration can be performed with a different network that is trained for such cases, and our framework can then follow to improve the solution. Another interesting direction is to train iterative U-Nets directly (like recurrent neural networks), removing the need for FGMRES to guide the solution.

Lastly, in this paper we demonstrated 2D problems only, so that the number of waves in the domain is high and realistic. Performing the same in 3D will require training 3D networks at the same resolution, which is beyond the abilities of our hardware at this point. Small 3D problems are solved by V-cycles quite efficiently, and hence are less relevant. However, the approach is indeed relevant for HPC centers solving large 3D problems, because the training is done offline.

References

  • [1] F. Aminzadeh, B. Jean, and T. Kunz, 3-D salt and overthrust models, Society of Exploration Geophysicists, 1997.
  • [2] L. Bar and N. Sochen, Strong solutions for pde-based tomography by unsupervised learning, SIAM Journal on Imaging Sciences, 14 (2021), pp. 128–155.
  • [3] Y. Bar-Sinai, S. Hoyer, J. Hickey, and M. P. Brenner, Learning data-driven discretizations for partial differential equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15344–15349.
  • [4] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM Review, 59 (2017), pp. 65–98, https://doi.org/10.1137/141000671.
  • [5] A. Brougois, M. Bourget, P. Lailly, M. Poulet, P. Ricarte, and R. Versteeg, Marmousi, model and data, in EAEG workshop-practical aspects of seismic data inversion, European Association of Geoscientists & Engineers, 1990, pp. cp–108.
  • [6] H. Calandra, S. Gratton, J. Langou, X. Pinel, and X. Vasseur, Flexible variants of block restarted gmres methods with application to geophysics, SIAM Journal on Scientific Computing, 34 (2012), pp. A714–A736.
  • [7] H. Calandra, S. Gratton, X. Pinel, and X. Vasseur, An improved two-grid preconditioner for the solution of three-dimensional Helmholtz problems in heterogeneous media, Numerical Linear Algebra with Applications, 20 (2013), pp. 663–688.
  • [8] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Sig. Proc. Mag., 35 (2018), pp. 126–136.
  • [9] V. Dwarka and C. Vuik, Scalable convergence using two-level deflation preconditioning for the Helmholtz equation, SIAM J. on Sci. Comp., 42 (2020), pp. A901–A928.
  • [10] M. Eliasof and E. Treister, Diffgcn: Graph convolutional networks via differential operators and algebraic multigrid pooling, Advances in neural information processing systems (NeurIPS), (2020).
  • [11] B. Engquist and A. Majda, Radiation boundary conditions for acoustic and elastic wave calculations, Communications on pure and applied mathematics, 32 (1979), pp. 313–357.
  • [12] Y. A. Erlangga, C. W. Oosterlee, and C. Vuik, A novel multigrid based preconditioner for heterogeneous Helmholtz problems, SIAM Journal on Scientific Computing, 27 (2006), pp. 1471–1492.
  • [13] M. J. Gander and H. Zhang, A class of iterative solvers for the Helmholtz equation: Factorizations, sweeping preconditioners, source transfer, single layer potentials, polarized traces, and optimized Schwarz methods, Siam Review, 61 (2019), pp. 3–76.
  • [14] M. Geist, P. Petersen, M. Raslan, R. Schneider, and G. Kutyniok, Numerical solution of the parametric diffusion equation by deep neural networks, Journal of Scientific Computing, 88 (2021), pp. 1–37.
  • [15] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.
  • [16] I. G. Graham, E. A. Spence, and J. Zou, Domain decomposition with local impedance conditions for the Helmholtz equation with absorption, SIAM Journal on Numerical Analysis, 58 (2020), pp. 2515–2543.
  • [17] D. Greenfeld, M. Galun, R. Basri, I. Yavneh, and R. Kimmel, Learning to optimize multigrid pde solvers, in International Conference on Machine Learning, PMLR, 2019, pp. 2415–2423.
  • [18] J. Han, A. Jentzen, and E. Weinan, Solving high-dimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [20] J. Hermann, Z. Schätzle, and F. Noé, Deep-neural-network solution of the electronic Schrödinger equation, Nature Chemistry, 12 (2020), pp. 891–897.
  • [21] M. Innes, Flux: Elegant machine learning with Julia, Journal of Open Source Software, (2018).
  • [22] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International conference on machine learning, PMLR, 2015, pp. 448–456.
  • [23] Y. Khoo, J. Lu, and L. Ying, Solving parametric pde problems with artificial neural networks, European Journal of Applied Mathematics, 32 (2021), pp. 421–435.
  • [24] Y. Khoo and L. Ying, Switchnet: a neural network model for forward and inverse scattering problems, SIAM J. on Sci. Comp., 41 (2019), pp. A3182–A3201.
  • [25] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, in International Conference for Learning Representations (ICLR), 2015.
  • [26] A. Krizhevsky, Learning multiple layers of features from tiny images, tech. report, 2009.
  • [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [28] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, nature, 521 (2015), pp. 436–444.
  • [29] Y. Li, H. Yang, E. R. Martin, K. L. Ho, and L. Ying, Butterfly factorization, Multiscale Modeling & Simulation, 13 (2015), pp. 714–732.
  • [30] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208–228.
  • [31] S. Luo, J. Qian, and R. Burridge, Fast huygens sweeping methods for Helmholtz equations in inhomogeneous media in the high frequency regime, Journal of Computational Physics, 270 (2014), pp. 378–401.
  • [32] L. Métivier, R. Brossier, S. Operto, and J. Virieux, Full waveform inversion and the truncated newton method, SIAM Review, 59 (2017), pp. 153–195.
  • [33] B. Moseley, A. Markham, and T. Nissen-Meyer, Solving the wave equation with physics-informed deep learning, arXiv preprint arXiv:2006.11894, (2020).
  • [34] O. Obiols-Sales, A. Vishnu, N. Malaya, and A. Chandramowliswharan, CFDNet: a deep learning-based accelerator for fluid simulations, Proceedings of the 34th ACM International Conference on Supercomputing, (2020), https://doi.org/10.1145/3392717.3392772.
  • [35] L. N. Olson and J. B. Schroder, Smoothed aggregation for Helmholtz problems, Numerical Linear Algebra with Applications, 17 (2010), pp. 361–386.
  • [36] S. Papadimitropoulos and D. Givoli, The double absorbing boundary method for the Helmholtz equation, Applied Numerical Mathematics, (2021).
  • [37] J. Poulson, B. Engquist, S. Li, and L. Ying, A parallel sweeping preconditioner for heterogeneous 3D Helmholtz equations, SIAM J. on Sci. Comp., 35 (2013), pp. C194–C212.
  • [38] M. Raissi and G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics, 357 (2018), pp. 125–141.
  • [39] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
  • [40] B. Reps and T. Weinzierl, Complex additive geometric multilevel solvers for Helmholtz equations on spacetrees, ACM Trans. Math. Softw., 44 (2017).
  • [41] O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, 2015, https://arxiv.org/abs/1505.04597.
  • [42] Y. Saad, A flexible inner-outer preconditioned gmres algorithm, SIAM J. on Sci. Comp., 14 (1993), pp. 461–469.
  • [43] A. Sheikh, D. Lahaye, L. Garcia Ramos, R. Nabben, and C. Vuik, Accelerating the shifted Laplace preconditioner for the Helmholtz equation by multilevel deflation, Journal of Computational Physics, 322 (2016), pp. 473–490.
  • [44] A. H. Sheikh, D. Lahaye, and C. Vuik, On the convergence of shifted Laplace preconditioner combined with multilevel deflation, Numerical Linear Algebra with Applications, 20 (2013), pp. 645–662.
  • [45] I. Singer and E. Turkel, A perfectly matched layer for the Helmholtz equation in a semi-infinite strip, J. Comput. Phys., 201 (2004), pp. 439–465.
  • [46] E. Treister and E. Haber, A multigrid solver to the Helmholtz equation with a point source based on travel time and amplitude, Numerical Linear Algebra with Applications, 26 (2019), p. e2206.
  • [47] N. Umetani, S. P. MacLachlan, and C. W. Oosterlee, A multigrid-based shifted Laplacian preconditioner for a fourth-order Helmholtz discretization, Num. Lin. Alg. with App., 16 (2009), pp. 603–626.
  • [48] P. R. Wiecha and O. L. Muskens, Deep learning meets nanophotonics: a generalized accurate predictor for near fields and far fields of arbitrary 3d nanostructures, Nano letters, 20 (2019), pp. 329–338.