Abstract
In this paper, we present a datadriven approach to iteratively solve the discrete heterogeneous Helmholtz equation at high wavenumbers. In our approach, we combine classical iterative solvers with convolutional neural networks (CNNs) to form a preconditioner which is applied within a Krylov solver. For the preconditioner, we use a CNN of type UNet that operates in conjunction with multigrid ingredients. Two types of preconditioners are proposed 1) UNet as a coarse grid solver, and 2) UNet as a deflation operator with shifted Laplacian Vcycles. Following our training scheme and dataaugmentation, our CNN preconditioner can generalize over residuals and a relatively general set of wave slowness models. On top of that, we also offer an encodersolver framework where an “encoder” network generalizes over the medium and sends context vectors to another “solver” network, which generalizes over the righthandsides. We show that this option is more robust and efficient than the standalone variant. Lastly, we also offer a miniretraining procedure, to improve the solver after the model is known. This option is beneficial when solving multiple righthandsides, like in inverse problems. We demonstrate the efficiency and generalization abilities of our approach on a variety of 2D problems.
elmholtz equation, Shifted Laplace preconditioner, Iterative solution methods, Multigrid, Deep Learning, UNet, Convolutional Neural Networks.
68T07, 65N55, 65N22
1 Introduction
The efficient numerical solution of the Helmholtz equation with a high and spatiallydependent wavenumber is a difficult task. The linear system that results from the discretization of the Helmholtz equation involves a large, complexvalued, sparse, and indefinite matrix. As the wavenumber is higher, the linear system needs to be larger, and the solution becomes more oscillatory and spatially complex. Many authors have contributed to the development of various iterative methods for solving the problem, e.g., [37, 31, 13, 16, 43, 9, 46, 12, 35, 40], yet it remains expensive and difficult to solve.
One of the most common methods to solve the heterogeneous Helmholtz equation is the Shifted Laplacian (SL) multigrid method [12, 47]
, which is based on a multigrid preconditioner for a damped version of the equation. That is, a complex shift is added to the matrix (and to its eigenvalues) to form a preconditioner that is efficiently solved by standard multigrid components. The preconditioning is typically applied with a Krylov solver, such as (flexible) GMRES
[42]. However, the shifted Laplacian preconditioner is not scalable due to the high shift that is required for multigrid to solve the shifted system, and further investigation on the topic is needed. The work in [44] provides analysis of the SL solver, showing that the increase of the wavenumber leads to a large increase in the number of iterations and the cost required for the solution. Indeed, the number of iterations in SL increases linearly with the wavenumber.The numerical methods mentioned above are in some sense “classical”. In recent years, datadriven methods have revolutionized many fields, and in particular, Deep Learning (DL) approaches have changed the landscape of fields like computer vision, natural language processing and bioinformatics
[28, 27]. Similarly, solving partial differential equations (PDEs) with a deep neural network is a natural idea, and has been recently considered in various forms, which are roughly divided into two, as we present below. Another related approach is optimization of PDE solvers
[17].One common class of DLforPDEs approaches, which are often referred to as physicsinformed neural networks (PINNs), includes both PDE solvers [38, 39, 30, 18], and PDE discovery methods [3]
. The general idea is to implement a neural network that implicitly represents the PDE solution. That is, the solution is approximated by the neural network as an analytical function at a given point in space. This function is obtained using stochastic optimization for learning the network’s weights as an unsupervised learning task, given a large training set of points. The loss function that is minimized forces the network to meet both the system of equations and the initial and/or boundary conditions, using automatic differentiation to get the PDE derivatives. Therefore, the learning process essentially minimizes the residual, which is computed by differentiating the neural network function to get the terms of the PDE analytically. Following the learning process, the PDE coefficients and boundary conditions are assumed to be embedded into the neural network weights. This approach can be applied to different types of PDEs. As PINNs are evaluated at a single point, their training is in principle meshfree, but requires the training data set and the architecture to be large enough to capture the complexity of the solution and the problem. A nice property of PINNs is that they can also solve inverse problems, as suggested, for example, in
[2] for the weighted Poisson equation and its inverse problem, using extended unsupervised loss function that includes the residual in norm.A different family of approaches involves solvers in which DL is used to approximate the solution of the PDE in an endtoend fashion given the problem properties (coefficients, initial conditions or sources). That is, the network needs to generalize over some of the PDE properties, which is the more common scenario in computer vision applications. For example, in [20] the authors propose ’PauliNet’, a wavefunction ansatz based on a neural network that achieves nearly exact solutions of the electronic Schrödinger equation. Similarly, some works employ Convolutional Neural Networks (CNNs) which are typically used for applications involving data on regular grids, like images and videos, or, discretized PDEs on a structured grid. For example, the authors of [14, 23] suggested CNNs to solve 1D and 2D parametric elliptic PDEs. Khoo and Ying [24] introduced a CNN named ’SwitchNet’, for approximating forward and inverse maps arising from the timeharmonic wave equation, which is the Helmholtz equation in time domain. In these problems, the local information in the input has a global impact on the output, hence, SwitchNet is based on convolution layers and operators of reshaping, switching, flattening, and pointwise matrix multiplication, with which the network performs knowledgebased actions and mimics the butterfly Helmholtz preconditioner [29]. The network is trained to solve the problem for planewave sources and generalize for 24 Gaussian scatterers in a constant medium. Similarly, [48] employs a 3D CNN to predict near and far scattering fields. However, no neural network has been presented in the literature that effectively solves the Helmholtz equation at high wavenumber for rather general heterogeneous models and righthandsides.
The main principle of our work is merging datadriven approaches, such as DL, together with classical methods. We wish to exploit the strengths of the classical solvers and to complement them with DL, covering for their weaknesses. One DL accelerator adopting this approach, named CFDNet [34], has been suggested for fluid simulations. The authors present a coupled framework using both simulation and deep learning, for accelerating the convergence of Reynolds averaged NavierStokes simulations. They show that their model performs well and also generalizes for examples not seen in the training. We note that while DL models are considered to be computationally heavy, the emergence of graphics processing units (GPUs) makes DL efficient and beneficial, both in terms of computational cost and in terms of parallelism. Furthermore, compression methods can really accelerate the inference of DL methods by a significant factor [8], using a variety of approaches like pruning (sparsification), quantization, knowledge distillation and more, as the community aims to make DL feasible for lowpower edge devices.
In order to solve the Helmholtz equation for a heterogeneous medium and a high wavenumber in a single application of a neural network, one would have to train a huge network to be able capture all the complexity in the solution—a lot of oscillatory waves, reflections and interference. That is even for a certain type of righthandsides. In the context of PINNs, the high variability of the solution may require the training to include a huge amount of points, which, together with the fact that we need a large network, makes the training expensive at solve time. To be effective, PINNs assume that the PDE solution is relatively simple, and that the network is light and can be trained with a relatively small data set of samples. As example, the work of [33] considered the timedomain wave equation, and limited the solution to point sources, and a given heterogeneous medium (i.e., the network does not generalize over the medium—it is fixed). While the first arrival wave is indeed well predicted, it seems that reflections are not recovered well by the NN, compared to the true solution. These reflections are important to capture for geophysical applications, e.g., Full Waveform Inversion [32]. Lastly, we note that NNs can easily approximate simple functions or solve problems to reasonablebutlow accuracy using a relatively low number of parameters. However, when high complexity and accuracy are needed, the typical size of the neural networks that are required grow tremendously. For example, to improve classification accuracy by 2% only, architectures need to almost double their size [19].
To summarize, given a problem we cannot be sure that the architecture is sufficient to accurately represent the solution, and in hard cases, huge architectures are needed. This brings us to the conclusion that in order to accurately solve complicated problems, the network has to be used iteratively to build and improve the solution in several steps, like a preconditioner. This requires, however, the ability to work on general righthandsides (residuals) which are multivariate gridshaped inputs. Furthermore, if we wish to generalize on the medium as an input of the network, we have to use a network that consumes images—a CNN—as a full heterogeneous medium is an image. We note that generating training data for our task and training the CNN model are very expensive, and we assume they are done once offline to be used multiple times later. However, the preconditioner is fast and efficient in the solution time once the model is trained. That is also a reason it is important to have the slowness model generalization.
For the reasons stated above, in this work we utilize CNNs as preconditioners for the discretized Helmholtz equation. We propose to use a CNN together with a classical multigrid approach—the Shifted Laplacian—which is efficient at removing only part of the error. We build on the similarity between a Vcycle and a UNet [41] which is a multiscale CNN used for imagetoimage mappings such as semantic segmentation, image denoising etc. Indeed, a geometric Vcycle can be naturally implemented using DL frameworks, suggesting that a UNet architecture, like a Vcycle, can be used as a preconditioner for solving PDEs on a regular grid. Also, a Vcycle can be used to solve the equation for any medium and righthandside, meaning that it generalizes on the problem. This suggests that a UNet or at least a variant of it can be trained to generalize on both a righthandsize and a medium, and act as a preconditioner. In this work we consider a standard UNet, but note that other variants are worthy of consideration.
To precondition the Helmholtz equation, we use a UNet together with a classical approach, to complement the network and stabilize the solution process. Our UNet architecture begins with a strided convolution, initially projecting the fine level input to a coarse grid. We examine two preconditioners: one is a UNet that acts as a nonlinear coarse grid correction and is applied with a Jacobi iteration for preand postsmoothing. In the second formulation we apply alternating UNet and SL multigrid Vcycle, which provides a more powerful smoothing and somewhat resembles the deflation preconditioner of
[44, 9]. Both preconditioners are applied within a Krylov subspace method—flexible GMRES (FGMRES). We define our UNet to generalize over a random righthandside and a random piecewise smooth slowness model. We also suggest two upgrades to the scheme above: (1) an encodersolver framework and (2) a miniretrain approach. Both upgrades are detailed later, and can improve the standard UNet tremendously.2 Preliminaries and Background
2.1 Problem formulation
The heterogeneous Helmholtz problem is give by
(2.1) 
The unknown
represents the pressure wave function in the frequency domain,
denotes the angular frequency, is the Laplacian operator and is the heterogeneous wave slowness model  the inverse of its velocity. The source term is the righthandside (RHS) of the system, and is denoted by . The parameter indicates the fraction of attenuation (or damping) in the medium and . We focus here on the hardest case for iterative methods, which is , but it is also possible to consider a varying attenuation. As boundary conditions, we consider an absorbing layer [12, 11], but Sommerfeld, PML [45] or [36] can be viable options as well.Discretization
The second order finite difference discretization of the aforementioned problem (2.1) on a uniform mesh of width in both and directions yields the stencil
(2.2) 
where boldface letters like denote discrete vectors. This leads to a system of linear equations
(2.3) 
When discretizing the Helmholtz equation, we must obey the rule of thumb that at least 10 grid nodes per wavelength are used, which leads to the bound . This typically requires a very fine mesh for high wavenumbers, significantly increasing the number of unknowns and making the system (2.3) huge, illconditioned, in addition to being indefinite, and complexvalued due to the boundary conditions. Moreover, the number of eigenvalues of the matrix in (2.2) with negative real part increases as the wavenumber is higher. Hence, solutions for two and especially three dimensional problems are challenging and require creative iterative solvers.
2.2 Shifted Laplacian multigrid
The common approach to solving (2.3) is to use preconditioner within an iterative Krylov subspace method. The SL operator
(2.4) 
is used to accelerate the convergence of a Krylov subspace method for solving (2.3). The preconditioning matrix is obtained from the discretization of (2.4), in the same way we are discretizing (2.1). The solution is computed from the (right) preconditioned system
(2.5) 
where and are the discrete Helmholtz and shifted Laplacian matrices, respectively. In this paper we focus on the pair and , which in [12] is shown to lead to a good compromise between approximating (2.3) and our ability to solve the shifted system using multigrid tools. The SL multigrid method is very consistent and robust for nonuniform models. However, it is considered slow and computationally expensive, especially for large wavenumbers.
Geometric multigrid
Generally, multigrid methods are used to iteratively solve discretized PDEs, like the linear system in (2.3), defined on a fine grid using a hierarchy of grids. In a nutshell, in a two level setting, we project the fine level problem onto a coarser grid, solve the problem on that coarser grid, and then use the coarse solution to correct the solution on the original fine grid. This is repeated recursively to form a Vcycle, which is applied iteratively to solve the problem.
More explicitly, multigrid methods are based on two complementary processes: relaxation and coarse grid correction. The relaxation is obtained by a standard method like Jacobi or GaussSeidel, which is only effective at reducing part of the error^{1}^{1}1In the case of the Helmholtz system, such relaxation methods do not converge due to the indefiniteness of , yet they are effective at smoothing the error using 12 iterations only.
. The remaining error, called “algebraically smooth”, is typically spanned by the eigenvectors of
corresponding to small eigenvalues (in magnitude), i.e., vectors s.t.(2.6) 
To reduce such errors, multigrid methods use a coarse grid correction, where the error for some iterate
is estimated by solving a coarser system and interpolating its solution:
Here, the matrix is an approximation of on a coarser mesh , obtained for width . To interpolate the solution from coarse to fine grids we choose the bilinear interpolation operator which is suitable for the problem because the Laplacian operator in (2.4) is homogeneous. The restriction operator , dubbed “fullweighting”, projects the finelevel residual to the coarse grid. Since the coarse problem is still too large to solve, the process of relaxation and coarse grid correction is applied recursively resulting in the Vcycle algorithm given in Alg. 1.
In Alg. (1) one has to choose the number of levels in the Vcycle. Unlike other scenarios, the smooth error modes of the Helmholtz operator are oscillatory at high wavenumber, and cannot be represented well on very coarse grids. Hence, the performance of the solver deteriorates as we use more levels. For example, the results in [7] show that the best performance is achieved using three levels only, and the authors suggest to use GMRES as a coarsest grid solver.
2.3 Machine and deep learning
The basic purpose of machine learning is to study an approximation of an unknown function
, using a generic model function that has unknown parameters . The true function can be, for example, the category of an image , and hopefully the function is rich enough to approximate . The parameters , called the weights of the model, are learned using a set of samples from a distribution of reasonable inputs (e.g., natural images, spoken words, etc.) such that .In supervised learning, the data set contains pairs
so that is the desired result for the model to provide given . To this end, a loss function is defined, and the weights of the model function are estimated by minimizing it. For example, a supervised loss function for predicting the solution of a linear system like (2.3), can be based on the error mean(2.7) 
where the training set contains pairs of satisfying (2.3).
Deep learning (DL) [15]
is a subfield of machine learning that is concerned with the collection of model functions belonging to the class of neural networks. A typical neural network architecture includes an input layer, an output layer and a several hidden layers consisting of affine linear transformations and pointwise nonlinear activation functions. The basic layer reads
(2.8) 
where is the nonlinear activation function, are the matrix and bias parameters for the th layer, respectively. The weights in (2.7) typically include all the parameters for all the layers. The process of optimizing neural networks—minimizing (2.7)—consists of the forward pass in which the layers like (2.8) are performed one after the other, and the backward pass in which the gradient
is computed for applying iterative steps of stochastic gradient descent to minimize the loss. Deep neural networks usually include many hidden layers, that result in a very flexible function representation.
A convolutional neural network (CNN) [27] is a type of a deep neural network, which is based on convolution kernels in (2.8), leading to a sharedweight architecture to handle data of large images and videos. CNNs are among the most effective methods for dealing with highdimensional gridlike data, and learning spatial features. They are the leading methods for computer vision tasks like image classification, object detection, and semantic segmentation.
2.4 Geometric multigrid using CNN modules
In our work, both the CNN and the multigrid preconditioner are used interchangeably. We implement the multigrid method using CNN components, in order to enable a good integration of the Vcycle with the CNN in a unified framework within the preconditioner component, using a GPU backend. The idea is based on the close connection between UNets and multigrid Vcycles.
The Helmholtz operator needed for the residual calculation and the Jacobi relaxation is executed using a convolution operator and an elementwise vector multiplication with the mass term according to the discretized Helmholtz operator (2.2) or the discrete (2.4). The high order discretization operators in [47] can also be obtained similarly.
The geometric multigrid transfer operators, the restriction (coarsening of the mesh) and prolongation (interpolation), are realized by convolution operators with the fixed kernels
(2.9) 
that correspond to the fullweighting and bilinear operators, respectively. Both kernels are applied with a stride of 2. For the restriction, this results in a coarse mesh . For the prolongation, we use a transposed strided convolution with , so that the mesh is refined. The solution for the coarsest mesh problem is done using GMRES with diagonal Jacobi preconditioner [7]
. In addition, all these convolutions are applied with zero padding of size 1, which is consistent with the Dirichlet boundary condition (the ABC is obtained through the mass term). Neumann BC can be obtained by using reflection or replication padding in CNN frameworks, which result in first and second order Neumann BC, respectively.
3 Multigridaugmented deep learning preconditioners
We present a CNN in a UNet architecture to accelerate the convergence of the Helmholtz solution, utilizing the GPU capabilities and parallel computation of DL frameworks. In order for the computational complexity to be costeffective we define the architecture to be relatively small in DL standards (see Fig. 1 and details later), and yet demonstrate a significant acceleration to the solution. Our DLbased preconditioner is applied within the FGMRES Krylov method [42]. The flexible variant is important as our preconditioner is nonlinear, hence also nonstationary.
3.1 A UNet architecture
The UNet architecture [41]
is a CNN for imagetoimage tasks such as semantic segmentation, where each pixel is classified. It consists of two rather symmetric parts—contraction and expansion. As described in
[41], the purpose of the contraction part is to capture the significant contexts and features, and the purpose of the expansion part is to enables precise localization. In the contraction part, the feature maps are coarsened and the number of channels is increased to get the context vectors. In the expansion part of the UNet, the number of feature channels is reduced, but the resolution of the channels is increased. This allows the network to propagate context information from the lower resolution layers to higher resolution layers. Since the contraction and expansion paths are more or less symmetric to each other, this yields a Ushaped architecture.In our case, given a slowness model and a righthandside sampled on a grid (viewed as images), we wish to predict a solution for (2.3) on that grid (an image). Furthermore, when solving the Helmholtz equation, it is important to allow a point source to spread to the entire mesh, as each righthandside can be viewed as multiple pointsources. For this reason, a suitable network structure for this task can be a UNet, since, like a multigrid cycle, it has a multiscale structure. That is, restriction and prolongation layers are performed, allowing each point to affect a wider neighborhood of grid cells (or pixels).
Our UNet, presented in Fig. 1, receives a complex valued righthandside, the slowness model , and the attenuation model (includes the absorbing BC, but can include a heterogeneous attenuation). The nonlinearity of the network architecture allows us to learn the complex interactions between these input “channels”, and minimize the error loss function w.r.t to the weights. In a nutshell, on its way down the hierarchy, the architecture interchangeably applies downsampling convolutions and ResNet convolution blocks, which are similar in some sense to the Jacobi smoothing steps. Each step applies convolutions and a nonlinear activation function. Then, on the way up the hierarchy it applies upsampling convolutions. The down and upsampling are applied using strided convolutions, similarly to (2.9) with learned weights.
Specifically, on the way down the UNet, we start with a downsampling operator, which is performed using a strided convolution (a restriction) layer
(3.1) 
where denotes the strided convolution operator, which maps channels to channels while coarsening the channels by a factor of 2 at each dimension (). We use a kernel of 5 by 5, so that the network can learn rather highorder transfer operators.
denotes an elementwise exponential linear unit (eLU) activation function. We chose a smooth activation function to allow for spatially smooth feature maps which are necessary because our solution is expected to be rather smooth. The batch normalization layers are expressed by
, whose role is to stabilize the optimization of network [22].In between each restriction, while going down the hierarchy, we apply ResNet blocks. The structure of our ResNet block is described as
(3.2) 
The weights and are two convolution operators with sized kernels. ResNet layers are explicitly reformulated as an addition of residual functions with respect to the inputs, instead of learning direct mappings like the standard neural network step in (2.8) or (3.1). In [19], the authors provide empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from increased depth. In the “coarsest” level of our UNet, before the prolongation starts, 3 ResNet layers are performed one after the other.
When we go up the hierarchy, we apply the upsampling convolutions (like prolongations) to refine the channels and decrease the channel space. Such layers are given by
(3.3) 
where denotes a strided transposed convolution operator with a 5by5 kernel, which maps channels to channels. The function expresses a concatenation of the current prolonged feature maps and the corresponding feature maps from the restriction phase . This is marked in arrows in Fig. 1, and is similar in spirit to the multigrid’s coarse grid correction.
3.2 UNet as a preconditioner accelerator
The UNet is used to accelerate the solution of the Helmholtz equation. As noted before, since the medium can be very heterogeneous, the solution, even for a point source or a plane wave, can be highly complex at high wavenumber due to reflections and interference. Therefore, we do not pursue a solution via a single application of a network, but aim for a network to be applied several times in the form of a preconditioner. This ensures users that the discrete equation is indeed satisfied to any desired accuracy.
In this work we aim for the UNet to complement either the Jacobi smoother or the shifted Laplacian multigrid method. Like a relaxation, the SL is efficient at handling error modes with rather large absolute eigenvalues, as the shift does not deviate those eigenvalues by a significant factor. Since SL or a relaxation are efficient at smoothing (=reducing residuals), we aim that the UNet will focus on the error directly, and it will be accompanied by classical smoothing. That is, we wish to help the UNet so it can focus on error modes that are not treated well by smoothing or SL. This guides our training procedure, as we describe later on. Furthermore, the smoothing is important since we apply the preconditioner inside a Krylov method, which is sensitive to high residuals. Below we list two preconditioning schemes that we propose.
UNet as a coarse grid solver with Jacobi smoothing
Our first version of the preconditioner applies one pre and post Jacobi relaxations with a UNet in between. Since the UNet starts and ends with strided convolutions (like a restriction and prolongation), this ends up being similar to a two level cycle. Alg. 2 summarizes the process, where the matrix denotes the discrete Helmholtz equation.
UNet as a deflation operator
A slightly more evolved option would be to apply UNet, and afterwords to smooth the output using a SL Vcycle. More explicitly:
(3.4) 
This option somewhat resembles the concept of deflation operator studied in [43], only without the nonlinearities, and a coarse grid solution instead of a network. There, the solution is projected using an interpolation operator , and a coarse grid solution is obtained. That is in addition to SL cycles. Here, the operator corresponding to is the strided convolution layer at the entrance to the network, and the last UNet layer is a transposed convolution corresponding to the matrix that returns the mesh to its original dimension.
3.3 EncoderSolver: an encoder network as a preconditioner setup
So far, we introduced a UNet that receives a residual and a slowness model, and returns an approximate solution. That is, our UNet generalizes on both the residual and the slowness. This setup is quite challenging compared to the setup in the previous [24, 33, 48], where the right hand side and slowness are represented by very few parameters, and the solution is obtained by a single forward pass, and not as a preconditioner. Obviously, the less we have to generalize on, the easier it will be for the network to approximate the solution well. This is also verified in our experiments, where we show that the same network performs better when the slowness model is known during training and during the Krylov solution at test time (see Sec 4.2).
Having the observation above in mind, we wish to exploit the following: when we solve an equation using several iterations, the slowness model remains the same and is known during those iterations—the UNet is applied with the same model over and over again, with different residuals. So, in the grand scheme we need to generalize on the slowness model, but we need not do it at every step. To this end, we propose to split the generalization over the residual and slowness model into two “encoder” and “solver” networks. The encoder network receives only the slowness model as input, and prepares feature vectors that include the slowness model information. Those vectors are computed only once at the beginning of the solution process, and are fed into the solver network at every iteration in the solution time. The solver network is applied at every iteration, with the readily available feature vectors inside it. This way we can have a large network for the slowness model, that will be applied only once, and the solver network for the residual can be cheaper. This idea resembles a preconditioner setup that is applied once before the solution process, and is adequate for any right hand side.
Figure 2 illustrates our proposed encodersolver architecture. The encoder network is also built in a UNet structure and is almost identical to the network we described in Sec 3.1. The outputs of the encoder network, the context for the solver network, are hierarchical feature vectors which we concatenate in each layer at the restriction and the prolongation phases of the solver network. The concatenation in the restriction and the prolongation steps can be described as
(3.5) 
respectively, where is the feature vector from the corresponding phase in the encoder network.
3.4 Training and data augmentation
In order for the UNet to be efficient at solve time as a preconditioner, it must practice or learn on residual vectors that are as close to the solve time FGMRES residuals as possible. Let
(3.6) 
be the forward application of the network for a given righthandsize and slowness model . As before, denote the collection of network weights (convolutions, biases, etc.) that we wish to learn. Since we wish to complement smoothing processes, in our supervised learning setup we aim to minimize the error with respect to multiple residuals and models. That is, given training data triplets , where , we minimize
(3.7) 
which is the mean
error norm, also known as the mean squared error (MSE). The training phase is performed by a stochastic gradient descent optimizer, where in each epoch we sweep through all the training examples in batches and varying order. Specifically, we use the ADAM optimizer
[25] with a variable step size.For supervised learning, we have to provide the network with the corresponding error solution for each residual vector and slowness model. To have such triplets, in some cases one needs to invest a lot of time and resources in preparing the trainingset, i.e., solve the PDE many times. However, since we have a linear system here, we can simply generate a random
, and easily compute the residual. But, we wish to create a dataset that is as close as possible to the situation in the real solution using FGMRES, and if we simply take the FGMRES residual vectors, we will not have a corresponding error without solving the system. To obtain that we need to generate data with rather smooth vectors in varying “smoothness levels”, since FGMRES and Vcycles are smoothing operations. Hence, we propose the following procedure for which, in our experience, the training error predicts the final solution’s convergence factor quite accurately. First, we generate a random normally distributed
and compute . Then, we apply a random number of FGMRES iterations with SL Vcycle as preconditioner, starting from a zero vector, to get :(3.8) 
Following that, we compute a residual and let be the network input. The true error that we wish the network to recover is given by
(3.9) 
Beyond the procedure above, we insert as input to the network. This causes the network’s input and output to be of the same order of magnitude^{2}^{2}2The magnitude of the input and output are important because of how certain default parameters are chosen in DL frameworks—e.g., initialization of weights and biases and scales of activation functions. For this reason, the input is usually normalized, and batch normalization is applied throughout the network., and because the system is linear, this change has no effect on the accuracy. Furthermore, to avoid overfitting and achieve better generalization, we augmented the training examples using a linear combination of the errors and residuals to expand the set. This summarizes the training data generation, given slowness models. The training set and the validation set are manufactured in the same way, and we saw no difference in the loss between these two.
Overall, our network has 4 input channels of the size of the grid. These are the real and imaginary parts of the residuals , the real (squared) slowness model , and the real attenuation model . Although we do not generalize of , we found it slightly beneficial to include it as input, as it allows the network to have a different behaviour at the grid’s boundaries.
Remark
As an alternative to the error loss in (3.7), one may use the residual norm , so that the training can be unsupervised [2]. This is a viable option, but it can be a bit misleading, since high errors can in fact have low residuals—these are “algebraically smooth errors”. Since we can eliminate those errors with Vcycles or Jacobi iterations, we wish to let the network focus on those smooth modes and not on the whole spectrum. We note that empirically, our experience in augmenting the error norms in (3.7) with residual norms generally yields inferior performance than focusing on the errors only during training.
3.4.1 Slowness models
Solving the Helmholtz equation for a uniform slowness model is not difficult, as it can be performed using a Fourier transform. When trying to solve this equation over a nonuniform media, the difficulty increases significantly, especially if there are sharp jumps or large ratios between the smallest and highest slowness. Here, we want the network to succeed in different types of models, smooth and nonsmooth, that faithfully represent reality. In certain applications (e.g., geophysical or medical imaging) one may have slowness models of a certain typical distribution coming from the application domain. E.g., in geophysical imaging, one typically has a layered model with higher velocity values deeper in the earth. In machine learning tasks, if we wish for an algorithm to generalize on an input, it is best if we let the algorithm learn on inputs from the same distribution as in the real life application. That is a large part of the advantage of datadriven approaches. Usually, however, to obtain a good generalization we need a large data set of inputs—a collection of slowness models in our case.
Here, we use a rather general slowness model collection to demonstrate our method. The models that we use for the UNet training, are created from a large set of natural images CIFAR10
[26]. This data set is usually used as image classification benchmark and contain 50K images of resolution pixels. We use this data set and manipulate the images to become slowness models in order to have a significant amount of training models to learn from. In particular, we use a training set of random sample images. To transform the natural images into slowness models, we first enlarge them to the grid size using a bilinear interpolation, then smooth them with a single application of a Gaussian smoothing kernel, and normalize the values to the range of . Fig. 3 shows four different models produced from the CIFAR10 image collection. Fig. 4 presents how the smoothness level of the model, , affects the difficulty of the solution. It is evident that as the medium is less smooth, the solution is more complex and is harder to model by a neural network. Similarly, the range of slowness values in the model also affects the the complexity of the solution. Fig. 5 illustrates how the solution is again more complex as the contrast in the slowness model is larger.(a)  (b)  (c)  (d) 
(a)  (b)  (c)  (d) 
3.5 CNNs vs. PINNs and a proposed mini retraining
In the approach of PINNs, one approximates the true solution as a pointwise neural network that receives in a meshfree manner, and possibly another simple property such as a source location, or a planewave direction for the righthandside. The PDE is then solved by the optimization process for learning the neural network’s weights, and any underlying heterogeneous medium or boundary conditions are assumed to be incorporated into the learned weights through the loss minimization. However, this way, we are quite limited in generalizing over the model coefficients, and must have a powerful enough architecture for that is able to approximate the solution well enough with some learned weights. The upside is that the optimization hopefully leads to the best approximation possible for the specific solution given the architecture. As the learning is in fact the solution of the PDE, we also need to assume that the optimization is not expensive and does not require the forwardbackward passes over too many points.
In some sense, if we define a CNN with convolutions only, and feed it with the function on a grid or a mesh, then we get multiple separate instances of in each forward pass for all the points on the grid. A real CNN will also incorporate spatial connections between neighboring (grid) points, hence it is a more powerful model. Similarly, in cases where a spatially adaptive mesh is more suitable than a regular grid, a graph neural network can replace the role of the CNN for the same tasks [10]. So  if we would define our CNN with as input, and consider training in solve time, we get a CNN version of PINNs. The true difference, however, is that here, we try to generalize over gridsized values, like and , which is the way CNNs are used in computer vision, for example.
With the comparison above in mind, our next proposed approach aims to further improve the solver UNet for the slowness model by a mini retraining procedure at solve time for the given slowness model. That is, given a slowness model we apply a few training iterations over a randomly generated data set of residuals to improve the UNet weights. This is similar to the process obtained in PINNs, and is motivated by our relatively better performance for known models compared to the case where we generalize over the models. This option would fit cases where we need to solve multiple linear systems with the same slowness model but for many righthandsides. Such a scenario fits the case of inverse problems, where the medium needs to be estimated. There, the slowness model evolves iteratively and does not change a lot between consecutive iterations of the inverse solution. Hence, the model retraining can gradually improve the forward solver network with the iterations, like in the PINNs forward/inverse solver in [2]. We note that here our retraining phase uses the errorbased supervised training procedure described earlier, and not the residualbased training in the common approaches of PINNs.
4 Numerical Results
In this section we evaluate the performance of the preconditioner together with FGMRES for solving the Helmholtz equation for random righthandsides. In particular, we use a block restarted FGMRES(10) described in [6], with a subspace of size 10, solving 10 different righthandsides together (a block size of 10). We consider it to be the preconditioner test, and later discuss and demonstrate the influence of the number of righthandsides and Krylov subspace size. In addition, we present the relative error norm plot throughout the training period, which shows the value of the loss function (3.7) vs. the epoch number. Ideally, the asymptotic value of the relative training error should indicate the real convergence factor at solve time using FGMRES. It is indeed approximately so thanks to the varying smoothing levels that we apply for our training residuals in Eq. (3.8). Using this plot, we can also estimate overfitting and adjust the hyperparameters (learning rate and minibatch size, for examples). Our code is written in the Julia language [4], utilizing the Flux DL framework [21].
Unless stated otherwise, the test cases that we present are for (2.1) defined on a 2D domain of , discretized on a nodal regular grid of cells, and a wave frequency defined by , which leads to about 12.5 grid points per wavelength. We use an absorbing boundary layer of width 10 to prevent reflection from all the model boundaries. We compare three preconditioners: The SL Vcycle , in Alg. 2 and for Eq. (3.4). We evaluate the methods as preconditioners by their convergence factor, calculated as
(4.1) 
where is the number of iterations required to reduce the residual norm by a factor of . For the SL Vcycles we use a shift of , 3 levels, 1 pre and post damped Jacobi smoothing with a damping factor of 0.8, and a coarsest grid solution of GMRES(10) with the Jacobi diagonal preconditioner as suggested in [7]. Under these settings, the convergence factor of FGMRES using the SL Vcycle preconditioner for a model is , and more than Vcycles are required to reach convergence. The average convergence factor of for a grid of cells, and frequency , is . The advantage of SL is that is quite consistent and robust for uniform, smooth and nonsmooth models—the difference in the convergence factors is only, depending on the model. Note that the convergence factor according to (4.1) is typically lower than the asymptotic convergence factor.
The training process for all the UNets is performed by the ADAM optimizer [25], and includes 80120 epochs. The initial learning rate is and the initial minibatch size is 20. Along the epochs we reduce the learning rate and increase the minibatch size. The schedule of these changes as well as the amount of epochs we apply vary a bit between the different experiments and are adjusted by cross validation on the test set. In the majority of the cases, after epochs we divide the initial learning rate by 10 and double the minibatch size. The next changes of the learning rate and minibatch size occur after every epochs. That is, the changes occur at epocs 48, 72, 84, 90 etc, until the learning rate essentially reaches zero. These settings can be used as default parameters for all the experiments that we show.
4.1 Performance of a homogeneous model


In the homogeneous test case, we solve the system (2.3) for a uniform model, known at training. That is, the training is performed on random residual and error vectors only as described earlier. Fig. 6 describes the results of the proposed method for this case. The average relative error, i.e. the minimum estimate value of the loss function (3.7), of the training and test sets at the end of the training process is . The FGMRES method with UNetbased preconditioner as described in Eq. (3.4) converges within 10 iterations with convergence factor compared to using only Vcycles. Our cheaper preconditioner, which uses only our UNet and Jacobi steps as described in Alg. 2, converges within 12 iterations with convergence factor . These are very favorable convergence factors given how poorly Vcycles behave for a homogeneous model. On the other hand, homogeneous problems can be efficiently solved using fast Fourier transform.
4.2 Performance on a single heterogeneous model
In this section we examine our method for a single heterogeneous slowness model, known during the training phase. In this scenario, the network can learn weights specific to that model. As shown in Fig. 7, for a model normalization of we obtained convergence factors of and , which are close to the performance on the uniform model. For the more difficult case of , we obtain , . Furthermore, in plot (c) of Fig. 6 and in the bottom plots (c) and (e) of Fig. 7, we show the results of a single iteration of Eq. (3.4) on a centralized point source, a problem that is not included in the training set. It is evident that UNet has indeed learned to produce a global solution, and the integration with the Vcycle is intended to “clean” or smooth the solution and ensure convergence.
(a)  (b)  
(c)  (d)  (e)  (f) 
4.3 Generalization on the slowness model
As we showed in the previous section, our UNet can learn to solve (2.3) quite efficiently for a given heterogeneous slowness model. However, the training of the UNet is quite expensive to apply at solve time. In a more applicable approach for the realistic world, we wish the neural network to work and generalize on models it had not seen in the training phase. Therefore, we will now present the ability of the proposed method to generalize for different models. In addition, we will also present the upgraded versions of our approach: the encodersolver dual networks and the mini retraining approach.
In the experiments in this section, the training phase was performed on a set of pairs of produced as described earlier in Sec. 3.4, and were randomly generated from a collection of the images in the CIFAR10 dataset. In the preconditioner test, we compare the performance of our methods to that of V(1,1)cycle, which is also used for . We will examine the capabilities of the preconditioner using

A standalone UNet with smoothing/deflation.

The encodersolver framework.

All options of the UNets after a modest “mini retraining” phase at solve time.
Table 1 and Figure 8 summarize and compare the results for all three cases, and in the next subsections we discuss each case separately. For all options we examine how the solver performs in the case of its training setting. That is, we train the network on a grid of for , but also examine the performance on larger grids and frequencies, and for . In particular, we solve (2.3) for random righthandsides on grids of size and with , respectively , using a network trained on residuals and slowness models of size of only. Table 1 presents the results for the encodersolver setting. Regardless of the specific preconditioner, we see that if the networks are trained on a single model, they exhibit the best performance. However, they fail to generalize on slightly different cases—even for the easier normalization of the same model (row (a) vs row (b) in the single model column). If we train the networks to generalize over slowness models, it can also generalize on the problem size and slowness contrast. All cases are improved using miniretraining.
Single model  General model  Retrained  
(a)  
Generalization for cases not seen in training  
(b)  fail  
(c)  fail  
(d)  fail  
(a)  (b)  (c) 
(d)  (e)  (f) 
4.3.1 A standalone UNet
At the end of the training phase, the average relative error value for the standalone UNet is (not shown). Due to the way the samples are produced and the limited network size, this value is consistent for both the training and the validation sets, i.e., there is no overfitting. The preconditioner test of the UNet is described in the top left plot in Fig. 8. The preconditioner achieves the convergence within iterations and a convergence factor of . The preconditioner achieves a convergence factor of . When examining the network’s ability to generalize to larger sizes than the training size, does not improve on the Vcycle performance, while reaches a convergence factor of for size and for size, compared to the Vcycle achieving and , respectively. Next, we show that with little effort, the UNets can do better than that.
4.3.2 An encodersolver pair of networks
Taking advantage of the fact that the model does not change throughout the preconditioner test, we now use the encodersolver framework, where another neural network preprocess the slowness model and generate context vectors for the solver UNet. We train both networks as a single component, and in the preconditioner test we run the encoder network once and use the same feature vectors we have received from it throughout the iterations, only executing the solver network many times.
The average relative error of the training and validation sets at the end of the training phase is . In the preconditioner test phase, the preconditioner reached convergence within iterations, and a convergence factor of , while achieved . Plot (d) of Fig. 8 demonstrates the results of the preconditioner test on grid inputs, consistent with the training set. The other two plots in (e) and (f) describe the generalization capability for largersized inputs and frequencies.
Beyond the improvement in the preconditioner test of and especially the more costeffective , the ability of the preconditioner to generalize on other sizes also improved significantly with the addition of the encoder network. Problems of size converge within 62 iterations, compared to 391 using Vcycle only. For the grids (with ), the amount of iterations is reduced significantly—more than 600 iterations are required with compared to 141 with .
4.3.3 Miniretraining towards a single model network at solve time
Due to the significant gap in the performance when generalizing over the slowness model (as opposed to knowing the model), we now add a minitraining phase to tune the network weights to a specific model. The experiment consists of three phases: training to learn the solver for different slowness models, miniretraining to tune the network to a specific model, and the preconditioner test.
At this point, there is a tradeoff between the cost of the retraining process and the success of the preconditioner test. The more resources we invest in the retraining phase, the closer the result will be to a single model result in Table 1. We present the enhancement of a retraining phase that includes 30 epochs on 100 different pairs produced by a single slowness model . The retraining for problems of size, was done using 30 epochs on 200 pairs while for problems from size of we performed a retraining that includes 30 epochs on a set of 300 samples. These amounts of data samples and epochs are just a fraction of the corresponding amounts of a full neural network training. Still, the retraining is quite expensive and suits the case of multiple righthandsides only. To reach the single model performance, a large set is required as retraining on a too small set may overfit the input pairs and not just the model. The dotted lines in Fig. 8 graphs describe the effect of retraining on the preconditioner test.
In plot (d) of Fig. 8, the encodersolver network reached a good network model in its training, and hence performs well, and is only slightly improved by retaining, saving only 56 iterations in each case (20%). In contrast, when retraining for larger problems, the network learns both the model and the form of solution to the new problem size. The dotted lines in plots (e) and (f), show a significant improvement of the mini retraining process. For size, the preconditioner converges within iterations compared to before the retraining. The preconditioner converges within iterations instead of . For problems, the improvement is even more significant, as we got convergence in 75 iterations compared to 141 iterations (almost twice) without the retraining process for learning the model and adjusting to size. Note that SL preconditioner for converges in more than iterations, and also achieved a significant improvement following the retraining.
Plots (a), (b) and (c) in Fig. 8
show the results of the same experiment with a standalone UNet. We see that the improvement is even more significant than in the encodersolver case, probably because the starting point of the retraining phase is worse in the standalone experiment than in the encodersolver case, and hence there is more room for improvement.
4.3.4 The influence of the Krylov subspace and block sizes
(a)  (b) 

In this set of experiments we examine the influence of both the subspace size and number of RHSs solved together (block size) that the FGMRES method uses. The experiment corresponds to the experiment shown in Fig 8 (e). Fig. 9 summerizes the results. It is obvious that whether by multiple RHSs or by subspace size, increasing the effective Krylov subspace can improve the performance of the solver significantly. That is in contrast to the SL Vcycle that is not improved by the same magnitude. This shows that there is more potential to the UNet based preconditioners than shown earlier, since in addition, we typically did not reach the training error drop in the preconditioner test, indicating that FGMRES stabilizes the solution on the one hand, but also slows the preconditioners due to high residuals.
4.3.5 GPU computation time
In this set of experiments, we compare the running and convergence times of the different preconditioners. To have a fair comparison it is necessary to run the various preconditioners on the same processor unit. Since the UNet is tailored to run efficiently on a GPU, we chose to have all the code running on a GPU as well, and compare the running times on this platform, which is quite common these days. In addition to GPU implementation of the multigrid cycle described in Sec. 2.4, we also implemented the FGMRES method using the implementation of linear layers from the Julia Flux library. This function was used both in the external loop that compares performance of the preconditioners and in the coarsegrid solver of the Vcycle method. Since multiplication of 64bit matrices and vectors on our GPU is significantly slower than those in 32bit, we performed all the computations and timings in 32bit. In our experiments, the convergence rate was not affected by the transition from CPU to GPU or from 64 to 32 bit, except to the minimum value of the residual norm which is limited to only in 32bit. This behavior is reasonable.
The FGMRES function that we use here supports block structure in the most naive way, which is less efficient than the full blockFGMRES used earlier in the CPU based implementation. That is because the blockFGMRES method includes QR decompositions which are highly inefficient on a GPU and are time consuming. As a result, as we increase the block size (number of right hand sides) the efficiency slightly drops but the computations better utilize the GPU resources. In our experiments we measured the times of running the algorithms on blocks of 10 RHSs. In addition, timings were averaged across 10 runs. The 32bit GPU implementation was roughly 34 times faster than the CPUbased multicore implementation. Our experiments were conducted using an NVIDIA RTX 3090 GPU operated by the Julia Flux framework.
(a)  (b)  (c) 

Fig. 10 describes the residual convergence relative to the solution time. These runs correspond to the ones presented in the second row of Fig. 8 in terms of the problem setup (grid size, slowness models) for the various preconditioners, including those based on the retrain process. In addition, for each preconditioner we indicate the convergence factor (Eq. (4.1)), and the amount of iterations for the convergence.
In these experiments, the preconditioner for grids as well as for grids achieves better results than the preconditioner even though it performs a larger number of iterations, since the runtimes of each iteration are significantly smaller. Both, and , achieve better results than the Vcycle, even for grids and even without the retraining. For size problems, the preconditioner after 250 milliseconds (per RHS) and the preconditioner after 600 milliseconds, reduced the residual norm by a factor of , while the Vcyclepreconditioner reached the same result after more than 2.5 seconds. For grids the two proposed preconditioners, and , convergence in less than 1.5 seconds while Vcycle converges after almost 10 seconds. For larger grids with size of , converged within 4 seconds, converged within 2.8 seconds, and Vcycle needed more than 25 seconds.
4.3.6 Out of distribution models
Having examined the ability of the UNets to solve problems produced using the CIFAR10 training set of slowness models, in this experiment we present the ability of the CNN to adapt and generalize for realistic geophysical models that were not included in the training set. This aspect is important for the usability of the pretrained CNN for new slowness models which are not available at training time. We do note that if one has an application in mind, it is better to use training data that are representative of that application. Here, we demonstrate a case where a new slowness model is out of the training distribution, as may inevitably happen in real life.
Fig. 11 describes the comparison between the performance of the encodersolver UNets and Vcycle for problems produced by 3 different geophysical slowness models that are commonly used in the literature—the SEG salt and Overthrust models [1], and the Marmousi model [5]. These models include many sharp edges and are significantly different than the rather smooth models that we produced to train the CNN. We again compare the convergence history of the residual norm with respect to the iterations of the blockFGMRES for each of the preconditioners. As before, the combination of UNet and Vcycle () is more effective than the Vcycle alone even without the retraining phase, i.e. a CNN trained on other types of models of size , effective for different geophysical models and problem sizes not seen during training.
(a)  (b)  (c) 
(d)  (e)  (f) 
For each of the models we did a retraining that included 30 epochs on a set of 300 pairs produced using the specific slowness model, and marked their performance with dotted lines in the plots. First, for the three models the retraining process has made the preconditioner more effective than the Vcycle. In particular, for the first two models, the SEG/EAGE saltdome model and the Overthrust velocity model, retraining saved about 100 iterations, depending on the desired threshold of the residual norm. The cost of the retraining: 30 (epochs) 300 (trainset size) 3 (UNet operation gradient calculation + the encoder’s gradient) amounts to about 27,000 UNet applications. Divided by the gain of 100 iterations per RHS, the retraining in these settings will be beneficial if we need to solve about 270 RHSs. Such a number is small compared to the number of RHSs that need to be solved as part of a solution of an inverse problem with many sources. Note that the slowness models here are significantly different than the ones used for training, and we expect a lighter retraining phase if the gridsize and training set distribution better match the target problems.
5 Conclusions
In this work we presented a new approach for preconditioning the heterogenouos Helmholtz equation. Our approach is based on a combination of CNN and multigrid components, using a UNet CNN as a coarse grid solver, or as a deflation operator together with SL Vcycles. The idea is that the CNN is trained to reduce the error, and the smoothing or Vcycles are responsible for stabilizing the solution process in a Krylov solver.
Since we target a preconditioner, our CNN is trained to generalize over the righthandsides (residuals), and over the slowness models. To help with the generalization, we propose to use an encodersolver framework, where two CNNs are used. One prepares context vectors and generalizes over the slowness, and the other receives the context vectors and generalize over the righthandsides. This encodersolver framework indeed upgrades the standalone UNet in all cases. In addition, inspired by PINNs, we offer a miniretraining phase over the slowness model at solve time, since the method is more effective when the slowness model is known. This consistently shows yet another upgrade to the performance. We show that while our UNet may require more FLOPs than traditional methods, it can be applied efficiently on GPU hardware, and yield favorable running times. Furthermore, for many cases CNN models can be significantly compressed by a variety of ways without harming their performance, making the preconditioner more beneficial.
In this work we used the UNet architecture “as is”. However, better architectures can probably be defined specifically for the Helmholtz equation. Furthermore, since problems typically include a more specific righthandside (e.g., point source or a plane wave), the first iteration can be performed with a different network that is trained for such cases, and our framework can then follow to improve the solution. Another interesting direction can be to train iterative UNets directly (like Recurrent Neural Networks), removing the need of FGMRES to guide the solution.
Lastly, in this paper we demonstrated 2D problems only, so that the number of waves in the domain is high and realistic. Performing the same in 3D will require the training of 3D networks in the same resolution, which is beyond the abilities of our hardware at this point. Small 3D problems are solved by Vcycles quite efficiently, hence are less relevant. However, it is indeed relevant for HPC centers solving large 3D problems, because the training is done offline.
References
 [1] F. Aminzadeh, B. Jean, and T. Kunz, 3D salt and overthrust models, Society of Exploration Geophysicists, 1997.
 [2] L. Bar and N. Sochen, Strong solutions for pdebased tomography by unsupervised learning, SIAM Journal on Imaging Sciences, 14 (2021), pp. 128–155.
 [3] Y. BarSinai, S. Hoyer, J. Hickey, and M. P. Brenner, Learning datadriven discretizations for partial differential equations, Proceedings of the National Academy of Sciences, 116 (2019), pp. 15344–15349.
 [4] J. Bezanson, A. Edelman, S. Karpinski, and V. B. Shah, Julia: A fresh approach to numerical computing, SIAM Review, 59 (2017), pp. 65–98, https://doi.org/10.1137/141000671.
 [5] A. Brougois, M. Bourget, P. Lailly, M. Poulet, P. Ricarte, and R. Versteeg, Marmousi, model and data, in EAEG workshoppractical aspects of seismic data inversion, European Association of Geoscientists & Engineers, 1990, pp. cp–108.
 [6] H. Calandra, S. Gratton, J. Langou, X. Pinel, and X. Vasseur, Flexible variants of block restarted gmres methods with application to geophysics, SIAM Journal on Scientific Computing, 34 (2012), pp. A714–A736.
 [7] H. Calandra, S. Gratton, X. Pinel, and X. Vasseur, An improved twogrid preconditioner for the solution of threedimensional Helmholtz problems in heterogeneous media, Numerical Linear Algebra with Applications, 20 (2013), pp. 663–688.
 [8] Y. Cheng, D. Wang, P. Zhou, and T. Zhang, Model compression and acceleration for deep neural networks: The principles, progress, and challenges, IEEE Sig. Proc. Mag., 35 (2018), pp. 126–136.
 [9] V. Dwarka and C. Vuik, Scalable convergence using twolevel deflation preconditioning for the Helmholtz equation, SIAM J. on Sci. Comp., 42 (2020), pp. A901–A928.
 [10] M. Eliasof and E. Treister, Diffgcn: Graph convolutional networks via differential operators and algebraic multigrid pooling, Advances in neural information processing systems (NeurIPS), (2020).
 [11] B. Engquist and A. Majda, Radiation boundary conditions for acoustic and elastic wave calculations, Communications on pure and applied mathematics, 32 (1979), pp. 313–357.
 [12] V. C. Erlangga YA, Oosterlee CW, A novel multigrid based preconditioner for heterogeneous Helmholtz problems, SIAM Journal on Scientific Computing, 27 (2006), pp. 1471–1492.
 [13] M. J. Gander and H. Zhang, A class of iterative solvers for the Helmholtz equation: Factorizations, sweeping preconditioners, source transfer, single layer potentials, polarized traces, and optimized Schwarz methods, Siam Review, 61 (2019), pp. 3–76.
 [14] M. Geist, P. Petersen, M. Raslan, R. Schneider, and G. Kutyniok, Numerical solution of the parametric diffusion equation by deep neural networks, Journal of Scientific Computing, 88 (2021), pp. 1–37.
 [15] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.
 [16] I. G. Graham, E. A. Spence, and J. Zou, Domain decomposition with local impedance conditions for the Helmholtz equation with absorption, SIAM Journal on Numerical Analysis, 58 (2020), pp. 2515–2543.
 [17] D. Greenfeld, M. Galun, R. Basri, I. Yavneh, and R. Kimmel, Learning to optimize multigrid pde solvers, in International Conference on Machine Learning, PMLR, 2019, pp. 2415–2423.
 [18] J. Han, A. Jentzen, and E. Weinan, Solving highdimensional partial differential equations using deep learning, Proceedings of the National Academy of Sciences, 115 (2018), pp. 8505–8510.

[19]
K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for
image recognition
, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [20] N. F. Hermann Jan, Schätzle Zeno, Deepneuralnetwork solution of the electronic Schrödinger equation, Nature Chemistry, 12 (2020), pp. 891––897.

[21]
M. Innes, Flux: Elegant machine learning with julia
, Journal of Open Source Software, (2018).
 [22] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in International conference on machine learning, PMLR, 2015, pp. 448–456.
 [23] Y. Khoo, J. Lu, and L. Ying, Solving parametric pde problems with artificial neural networks, European Journal of Applied Mathematics, 32 (2021), pp. 421–435.
 [24] Y. Khoo and L. Ying, Switchnet: a neural network model for forward and inverse scattering problems, SIAM J. on Sci. Comp., 41 (2019), pp. A3182–A3201.
 [25] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, in International Conference for Learning Representations (ICLR), 2015.
 [26] A. Krizhevsky, Learning multiple layers of features from tiny images, tech. report, 2009.
 [27] A. Krizhevsky, I. Sutskever, and G. E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [28] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, nature, 521 (2015), pp. 436–444.
 [29] Y. Li, H. Yang, E. R. Martin, K. L. Ho, and L. Ying, Butterfly factorization, Multiscale Modeling & Simulation, 13 (2015), pp. 714–732.
 [30] L. Lu, X. Meng, Z. Mao, and G. E. Karniadakis, Deepxde: A deep learning library for solving differential equations, SIAM Review, 63 (2021), pp. 208–228.
 [31] S. Luo, J. Qian, and R. Burridge, Fast huygens sweeping methods for Helmholtz equations in inhomogeneous media in the high frequency regime, Journal of Computational Physics, 270 (2014), pp. 378–401.
 [32] L. Métivier, R. Brossier, S. Operto, and J. Virieux, Full waveform inversion and the truncated newton method, SIAM Review, 59 (2017), pp. 153–195.
 [33] B. Moseley, A. Markham, and T. NissenMeyer, Solving the wave equation with physicsinformed deep learning, arXiv preprint arXiv:2006.11894, (2020).
 [34] O. ObiolsSales, A. Vishnu, N. Malaya, and A. Chandramowliswharan, CFDNet: a deep learningbased accelerator for fluid simulations, Proceedings of the 34th ACM International Conference on Supercomputing, (2020), https://doi.org/10.1145/3392717.3392772.
 [35] L. N. Olson and J. B. Schroder, Smoothed aggregation for Helmholtz problems, Numerical Linear Algebra with Applications, 17 (2010), pp. 361–386.
 [36] S. Papadimitropoulos and D. Givoli, The double absorbing boundary method for the Helmholtz equation, Applied Numerical Mathematics, (2021).
 [37] J. Poulson, B. Engquist, S. Li, and L. Ying, A parallel sweeping preconditioner for heterogeneous 3D Helmholtz equations, SIAM J. on Sci. Comp., 35 (2013), pp. C194–C212.
 [38] M. Raissi and G. E. Karniadakis, Hidden physics models: Machine learning of nonlinear partial differential equations, Journal of Computational Physics, 357 (2018), pp. 125–141.
 [39] M. Raissi, P. Perdikaris, and G. E. Karniadakis, Physicsinformed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations, Journal of Computational Physics, 378 (2019), pp. 686–707.
 [40] B. Reps and T. Weinzierl, Complex additive geometric multilevel solvers for Helmholtz equations on spacetrees, ACM Trans. Math. Softw., 44 (2017).
 [41] O. Ronneberger, P. Fischer, and T. Brox, Unet: Convolutional networks for biomedical image segmentation, 2015, https://arxiv.org/abs/1505.04597.
 [42] Y. Saad, A flexible innerouter preconditioned gmres algorithm, SIAM J. on Sci. Comp., 14 (1993), pp. 461–469.
 [43] A. Sheikh, D. Lahaye, L. Garcia Ramos, R. Nabben, and C. Vuik, Accelerating the shifted laplace preconditioner for the helmholtz equation by multilevel deflation, Journal of Computational Physics, 322 (2016), pp. 473–490.
 [44] A. H. Sheikh, D. Lahaye, and C. Vuik, On the convergence of shifted laplace preconditioner combined with multilevel deflation, Numerical Linear Algebra with Applications, 20 (2013), pp. 645–662.
 [45] I. Singer and E. Turkel, A perfectly matched layer for the Helmholtz equation in a semiinfinite strip, J. Comput. Phys., 201 (2004), pp. 439–465.
 [46] E. Treister and E. Haber, A multigrid solver to the Helmholtz equation with a point source based on travel time and amplitude, Numerical Linear Algebra with Applications, 26 (2019), p. e2206.
 [47] N. Umetani, S. P. MacLachlan, and C. W. Oosterlee, A multigridbased shifted Laplacian preconditioner for a fourthorder Helmholtz discretization, Num. Lin. Alg. with App., 16 (2009), pp. 603–626.
 [48] P. R. Wiecha and O. L. Muskens, Deep learning meets nanophotonics: a generalized accurate predictor for near fields and far fields of arbitrary 3d nanostructures, Nano letters, 20 (2019), pp. 329–338.