1. Introduction
In this paper, we study the forward and inverse scattering problems via the use of artificial neural networks (NNs). In order to simplify the discussion, we focus on time-harmonic acoustic scattering in two-dimensional space. The inhomogeneous media scattering problem with a fixed frequency $\omega$ is modeled by the Helmholtz operator
(1) $Lu := \left(-\Delta - \frac{\omega^2}{c(x)^2}\right)u,$
where $c(x)$ is the velocity field. In many settings, there exists a known background velocity field $c_0(x)$ such that $c(x)$ is identical to $c_0(x)$ except in a compact domain $\Omega$. By introducing the scatterer $\eta(x)$, compactly supported in $\Omega$,
(2) $\eta(x) := \frac{\omega^2}{c(x)^2} - \frac{\omega^2}{c_0(x)^2},$
we can equivalently work with $\eta(x)$ instead of $c(x)$. Note that in this definition $\eta(x)$ scales quadratically with the frequency $\omega$. However, as $\omega$ is assumed to be fixed throughout this paper, this scaling does not affect the discussion below.
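To make (2) concrete, here is a short numpy sketch; all numerical values (frequency, grid size, and the Gaussian inclusion) are illustrative choices, not taken from the paper:

```python
import numpy as np

omega = 16.0  # fixed angular frequency (illustrative value)
n = 64        # grid points per dimension
xs = np.linspace(-0.5, 0.5, n)
X, Y = np.meshgrid(xs, xs, indexing="ij")

c0 = np.ones((n, n))                             # homogeneous background velocity
c = c0 - 0.1 * np.exp(-(X**2 + Y**2) / 0.02)     # slow Gaussian inclusion near the origin

# Scatterer from (2): eta = omega^2/c^2 - omega^2/c0^2; it vanishes
# wherever c coincides with c0, so it is (numerically) compactly supported.
eta = omega**2 / c**2 - omega**2 / c0**2
```

Since the inclusion slows the velocity down, $\eta$ is positive near the origin and negligible near the boundary of the unit square.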
In many real-world applications, $\eta(x)$ is unknown. The task of the inverse problem is to recover $\eta$ based on some observation data $d$. The observation data $d$ is often a quantity derived from the Green's function of the Helmholtz operator $L$ and, therefore, it depends closely on $\eta$. This paper is an exploratory attempt at constructing efficient approximations to the forward map $\eta \to d$ and the inverse map $d \to \eta$ using modern tools from machine learning and artificial intelligence. Such approximations are highly useful for the numerical solution of scattering problems: an efficient map $\eta \to d$ provides an alternative to expensive partial differential equation (PDE) solvers for the Helmholtz equation; an efficient map $d \to \eta$ is even more valuable, as it allows us to solve the inverse problem of determining the scatterer from the scattered field without going through the usual iterative process.

In the last several years, deep neural networks have become the go-to method in computer vision, image processing, speech recognition, and many other machine learning applications
[18, 28, 12, 9]. More recently, methods based on NNs have also been applied to solving PDEs. Based on the way that the NN is used, these methods can be roughly separated into two categories. For the methods in the first category [16, 27, 3, 10, 14, 6], instead of specifying the solution space via the choice of a basis (as in the finite element method or the Fourier spectral method), an NN is used to represent the solution. An optimization problem, for example a variational formulation, is then solved in order to obtain the parameters of the NN and hence the solution to the PDE. Similar to the use of NNs for regression and classification purposes, the methods in the second category, such as [23, 11, 13, 8], use an NN to learn the map from the coefficients of the PDE to its solution. As in machine learning, the architecture design of an NN for solving PDEs usually requires incorporating knowledge from the PDE domain, so that the NN architecture is able to capture the behavior of the solution process. Despite the abundance of work on using NNs for solving PDEs, none of the above-mentioned methods addresses the wave equation.

This paper takes a deep learning approach to learn both the forward and inverse maps. For the Helmholtz operator (
1), we propose an NN architecture for determining the forward and inverse maps between the scatterer $\eta(x)$ and the observation data $d$ generated from the scatterer. Although this task looks similar to computer vision problems such as image segmentation, denoising, and super-resolution, where a map between two images has to be determined, the nature of the map in our problem is much more complicated. In many image processing tasks, the value of a pixel at the output generally depends only on a neighborhood of that pixel at the input layer. For the scattering problems, however, the input and output are often defined on different domains and, due to wave propagation, each location of the scatterer can influence every point of the scattered field. Therefore, the connectivity in the NN has to be wired in a nonlocal fashion, rendering typical NNs with local connectivity insufficient. This leads to the development of the proposed
SwitchNet. The key idea is the inclusion of a novel low-complexity switch layer that sends information between all pairs of sites effectively, following the ideas of butterfly factorizations [21]. The same factorization was used earlier in the architecture proposed in [19], but the network weights there are hard-coded and not trainable.

The paper is organized as follows. In Section 2, we discuss some preliminary results concerning the Helmholtz equation. In Section 3, we study the so-called far field pattern of the scattering problem, where the sources and receivers can be regarded as placed at infinity. We propose SwitchNet to determine the maps between the far field scattering pattern and the scatterer. In Section 4, we turn to the setting of a seismic imaging problem. In this problem, the sources and receivers are at a finite distance, yet still well-separated from the scatterer.
2. Preliminaries
The discussion of this paper focuses on the two-dimensional case. Here, we summarize the mathematical tools and notations used in this paper. As mentioned above, the scatterer $\eta(x)$ is compactly supported in a domain $\Omega$, whose diameter is of $O(1)$. For example, one can think of $\Omega$ as the unit square centered at the origin. In (1), the Helmholtz operator is defined on the whole space with the radiative (Sommerfeld) boundary condition [5] specified at infinity. Since the scatterer field $\eta(x)$ is localized in $\Omega$, it is convenient to truncate the computational domain to $\Omega$ by imposing the perfectly matched layer [1] that approximates the radiative boundary condition.
In a typical numerical solution of the Helmholtz operator, $\Omega$ is discretized by a Cartesian grid at the rate of a few points per wavelength. As a result, the number of grid points per dimension is proportional to the frequency $\omega$. We simply use $X$ to denote the set of discretization points of this grid. The Laplacian in the Helmholtz operator is typically discretized with local numerical schemes, such as the finite difference method [17]. Via this discretization, we can consider the scatterer field $\eta(x)$, discretized at the points $x \in X$, as a vector in $\mathbb{R}^N$ with $N = |X|$, and the Helmholtz operator $L$ as a matrix in $\mathbb{C}^{N \times N}$.

Using the background velocity field $c_0(x)$, we first introduce the background Helmholtz operator $L_0 := -\Delta - \omega^2/c_0(x)^2$. With the help of $\eta(x)$, one can write $L$ in a perturbative way as
(3) $L = L_0 - \eta,$
where $\eta$ is viewed as a perturbation. By introducing the background Green's function
(4) $G_0 := L_0^{-1},$
one can write down a formal expansion for the Green's function $G := L^{-1}$ of the $\eta$-dependent Helmholtz operator $L$:
(5) $G = (L_0 - \eta)^{-1}$
(6) $\quad = (I - G_0\eta)^{-1} G_0$
(7) $\quad = G_0 + G_0\eta G_0 + G_0\eta G_0\eta G_0 + \cdots$
(8) $\quad =: G_0 + G_1 + G_2 + \cdots,$
which is valid when the scatterer field $\eta$ is sufficiently small. The last line of the above equation serves as the definition of the successive terms of the expansion ($G_1$, $G_2$, and so on). As $G_0$ can be computed from the knowledge of the background velocity field $c_0(x)$, most data gathering processes (with appropriate post-processing) focus on the difference $G - G_0$ instead of $G$ itself.
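On the discrete level, the expansion (5)-(8) is simply a Neumann series for the matrix inverse. A minimal sketch with random stand-ins for $L_0$ and $\eta$ (the sizes and scalings below are illustrative, chosen so that the series converges):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
L0 = np.eye(N) + 0.02 * rng.standard_normal((N, N))  # stand-in for the background operator
eta = np.diag(0.01 * rng.standard_normal(N))         # small diagonal perturbation

G0 = np.linalg.inv(L0)                               # background Green's function (4)
G = np.linalg.inv(L0 - eta)                          # full Green's function

# Partial sums of the formal expansion (5)-(8): G ~ G0 + G0 eta G0 + ...
approx = G0.copy()
term = G0.copy()
for _ in range(5):
    term = term @ eta @ G0                           # next term G_k = G_{k-1} eta G0
    approx = approx + term

err = np.linalg.norm(G - approx) / np.linalg.norm(G)
```

Because $\|G_0\eta\|$ is small here, a handful of terms already reproduces $G$ to high accuracy, mirroring the smallness assumption on $\eta$ in the text.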
A usual experimental setup consists of a set of sources $S$ and a set of receivers $R$. The data gathering process usually involves three steps: (1) impose an external force or incoming wave field via some sources, (2) solve for the scattered field either computationally or physically, (3) gather the data with receivers at specific locations or directions. The second step is modeled by the difference of the Green's functions $G - G_0$, as mentioned above. As for the other steps, it is convenient at this point to model the first step with a source-dependent operator $\Pi_S$ and the third one with a receiver-dependent operator $\Pi_R$. We shall see later how these operators are defined in more concrete settings. By putting these components together, one can write the observation data abstractly as
(9) $d = \Pi_R (G - G_0) \Pi_S.$
In this paper, we focus on two scenarios: far field pattern and seismic imaging. We start with far field pattern first to motivate and introduce SwitchNet. We then move on to the seismic case by focusing on the main differences.
3. SwitchNet for far field pattern
3.1. Problem setup
In this section, we consider the problem of determining the map from the scatterer $\eta(x)$ to the far field scattering pattern, along with its inverse map. Without loss of generality, we assume that the diameter of the domain $\Omega$ is of $O(1)$ after appropriate rescaling. The background velocity $c_0(x)$ is assumed to be identically 1, since far field pattern experiments are mostly performed in free space.
In this problem, both the sources and the receivers are indexed by a set of unit directions in $\mathbb{R}^2$. The source associated with a unit direction $s$ is an incoming plane wave $e^{i\omega s \cdot x}$ pointing in direction $s$. It is well known that the scattered wave field, denoted by $u^s(x)$, at a large distance $\rho = |x|$ takes the following form [5]
$u^s(x) = \frac{e^{i\omega \rho}}{\rho^{1/2}} \left( u_\infty(\hat{x}, s) + o(1) \right), \quad \hat{x} = x/\rho,$
where the far field pattern $u_\infty(\cdot, s)$ is defined on the unit circle $S^1$. The receiver at direction $r$ simply records the quantity $u_\infty(r, s)$ for each $s$. The set of observation data is then defined to be
$d(r, s) := u_\infty(r, s), \quad r \in R,\ s \in S.$
Figure 1 provides an illustration of this experimental setup. Henceforth, we assume that both $S$ and $R$ are chosen to be a set of uniformly distributed directions on $S^1$. Their common size, denoted by $M$, typically scales linearly with the frequency $\omega$.

This data gathering process can be put into the framework of (9). First, one can think of the source prescription as a limiting process that produces in the limit the incoming plane wave $e^{i\omega s \cdot x}$. The source can be considered to be located at the point $-\rho s$ for the direction $s$, with the distance $\rho$ going to infinity. In order to compensate for the geometric spreading of the wave field and also the phase shift, the source magnitude is assumed to scale like $\rho^{1/2} e^{-i\omega\rho}$ as $\rho$ goes to infinity. Under this setup, we have, up to a constant factor,
(10) $\lim_{\rho \to \infty} \rho^{1/2} e^{-i\omega\rho}\, G_0(x, -\rho s) = e^{i\omega s \cdot x}.$
Similarly, one can regard the receiver prescription as a limiting process as well. The receiver is located at the point $\rho r$ for a fixed unit direction $r$, with $\rho$ going to infinity. Again, in order to compensate for the geometric spreading and the phase shift, one scales the received signal with $\rho^{1/2} e^{-i\omega\rho}$. As a result, we have, up to a constant factor,
(14) $\lim_{\rho \to \infty} \rho^{1/2} e^{-i\omega\rho}\, G_0(\rho r, y) = e^{-i\omega r \cdot y}.$
In this limiting setting, one redefines the observation data as
(18) $d(r, s) := \lim_{\rho \to \infty} \rho\, e^{-2i\omega\rho}\, (G - G_0)(\rho r, -\rho s).$
Now taking the two limits (10) and (14) into consideration and expanding $G - G_0$ as in (5)-(8), one arrives at the following representation of the observation data for $r \in R$ and $s \in S$:
(19) $d(r, s) = \int_{\Omega \times \Omega} e^{-i\omega r \cdot x} \left[ \eta \left( I + G_0\eta + G_0\eta G_0\eta + \cdots \right) \right](x, y)\, e^{i\omega s \cdot y}\, dx\, dy,$
where $\eta$ is viewed as the diagonal kernel $\eta(x)\delta(x - y)$.
3.2. Low-rank property
The intuition behind the proposed NN architecture comes from examining (19) when $\eta$ (or, equivalently, the perturbation of the velocity field) is small. In such a situation, we simply retain the term that is linear in $\eta$. Using the fact that the linear term $\eta(x)\delta(x-y)$ collapses the double integral, (19) becomes
$d(r, s) \approx \int_{\Omega} e^{i\omega (s - r) \cdot x}\, \eta(x)\, dx$
for $r \in R$ and $s \in S$. This linear map takes $\eta(x)$ defined on the Cartesian grid $X$ to $d(r, s)$ defined on yet another Cartesian grid $R \times S$ (in the angular parameterization). Recalling that both $R$ and $S$ are of size $M$ and working with a vectorized $d \in \mathbb{C}^{M^2}$, we can write the above equation compactly as
(20) $d = A\eta,$
where the element of the matrix $A \in \mathbb{C}^{M^2 \times N}$ at row $(r, s)$ and column $x$ is given by
(21) $A((r, s), x) = e^{i\omega (s - r) \cdot x}.$
The following theorem concerning the matrix $A$ plays a key role in the design of our NN. Let us first partition $\Omega$ uniformly into Cartesian squares of side length equal to $1/\sqrt{\omega}$. Here we assume that $\sqrt{\omega}$ is an integer. Note that, since the diameter of $\Omega$ is of $O(1)$, the number of such squares is $O(\omega)$. This naturally partitions the set of grid points $X$ into subgroups depending on which square each point belongs to. We shall denote these subgroups by $\{X_{\mathcal{B}}\}$, one for each square $\mathcal{B}$. Similarly, we also partition $R \times S$ uniformly (in the angular parameterization) into squares of side length equal to $1/\sqrt{\omega}$, again assumed to yield an integer number of squares, also $O(\omega)$ in total. This further partitions the set $R \times S$ into subgroups depending on which square each direction pair belongs to. We shall denote these subgroups by $\{(R \times S)_{\mathcal{A}}\}$, one for each square $\mathcal{A}$. Fig. 2 illustrates such a partition.
Theorem 1.
For any square $\mathcal{A}$ of the $(r, s)$ domain and any square $\mathcal{B}$ of the $x$ domain, the submatrix
(22) $A_{\mathcal{A}\mathcal{B}} := A\left( (R \times S)_{\mathcal{A}}, X_{\mathcal{B}} \right)$
is numerically low-rank.
Proof.
The proof of this theorem follows the same line of argument as in [2, 29, 20], and below we outline the key idea. Denote the center of $\mathcal{A}$ by $(r_0, s_0)$ and the center of $\mathcal{B}$ by $x_0$. For each $(r, s) \in (R \times S)_{\mathcal{A}}$ and $x \in X_{\mathcal{B}}$, we write
(23) $(s - r) \cdot x = (s - s_0 - r + r_0) \cdot (x - x_0) + (s_0 - r_0) \cdot x + (s - r) \cdot x_0 - (s_0 - r_0) \cdot x_0.$
Note that, for fixed $\mathcal{A}$ and $\mathcal{B}$, each of the last three terms is either a constant or depends only on $x$ or only on $(r, s)$. As a result, $A_{\mathcal{A}\mathcal{B}}$ is numerically low-rank if and only if the matrix with entries $e^{i\omega (s - s_0 - r + r_0) \cdot (x - x_0)}$ is so. Such a low-rank property can be derived from the conditions on the side lengths of $\mathcal{A}$ and $\mathcal{B}$. More precisely, since $(r, s)$ resides in the square $\mathcal{A}$ of side length $1/\sqrt{\omega}$ with center $(r_0, s_0)$, then
(24) $|s - s_0 - r + r_0| = O(1/\sqrt{\omega}).$
Similarly, as $x$ resides in the square $\mathcal{B}$ of side length $1/\sqrt{\omega}$ with center $x_0$, then
(25) $|x - x_0| = O(1/\sqrt{\omega}).$
Multiplying these two estimates results in the estimate
(26) $\left| \omega (s - s_0 - r + r_0) \cdot (x - x_0) \right| = O(1)$
for the phase of $e^{i\omega (s - s_0 - r + r_0) \cdot (x - x_0)}$. Therefore,
(27) $e^{i\omega (s - s_0 - r + r_0) \cdot (x - x_0)}$
for $(r, s) \in (R \times S)_{\mathcal{A}}$ and $x \in X_{\mathcal{B}}$ is non-oscillatory and hence can be approximated effectively by applying, for example, Chebyshev interpolation in both the $x$ and $(r, s)$ variables. Since the degree of the Chebyshev polynomials only increases polylogarithmically with respect to the desired accuracy, (27) is numerically low-rank by construction. This proves that the submatrix defined in (22) is also numerically low-rank. ∎

3.3. Matrix factorization
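The low-rank claim of Theorem 1 is easy to check numerically. The sketch below forms the submatrix (22) for one pair of squares of side length $1/\sqrt{\omega}$ and inspects its singular values; the frequency, window placement, and sampling density are illustrative:

```python
import numpy as np

omega = 64.0
h = 1.0 / np.sqrt(omega)   # side length 1/sqrt(omega) of each square

# Grid points x in one spatial square B, and direction pairs (r, s) in one
# square A of the angular (r, s) domain (angular window of width h for each).
pts = np.linspace(0.0, h, 12)
x = np.stack(np.meshgrid(pts, pts, indexing="ij"), axis=-1).reshape(-1, 2)
ar, As = np.meshgrid(pts, np.pi + pts, indexing="ij")   # all (r, s) angle pairs
r = np.stack([np.cos(ar), np.sin(ar)], axis=-1).reshape(-1, 2)
s = np.stack([np.cos(As), np.sin(As)], axis=-1).reshape(-1, 2)

# Submatrix (22) with entries A((r,s), x) = exp(i*omega*(s - r).x) from (21)
block = np.exp(1j * omega * (s - r) @ x.T)   # 144 x 144

sv = np.linalg.svd(block, compute_uv=False)
numrank = int(np.sum(sv > 1e-6 * sv[0]))     # numerical rank at tolerance 1e-6
```

The numerical rank stays a small fraction of the full dimension 144, in line with the phase estimate (26): across the block, the non-separable part of the phase varies by only $O(1)$ radians.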
In this subsection, we show that Theorem 1 guarantees a low-complexity factorization of the matrix $A$. Let the row and column indices of $A$ be partitioned into the index sets $\{(R \times S)_{\mathcal{A}_a}\}_{a=1}^{P}$ and $\{X_{\mathcal{B}_b}\}_{b=1}^{P}$, respectively, as in Theorem 1. To simplify the presentation, we assume that the two partitions consist of the same number $P$ of squares, that each row group is of size $M^2/P$, and that each column group is of size $N/P$.
Since each submatrix
$A_{ab} := A_{\mathcal{A}_a \mathcal{B}_b}$
is numerically low-rank, assume that
(28) $A_{ab} \approx U_{ab} V_{ab}^*,$
where $U_{ab} \in \mathbb{C}^{(M^2/P) \times t}$ and $V_{ab} \in \mathbb{C}^{(N/P) \times t}$. Here $t$ can be taken to be the maximum of the numerical ranks of all submatrices $A_{ab}$. Theorem 1 implies that $t$ is a small constant.
By applying (28) to each block, $A$ can be approximated by
(29) $A \approx \begin{bmatrix} U_{11}V_{11}^* & \cdots & U_{1P}V_{1P}^* \\ \vdots & & \vdots \\ U_{P1}V_{P1}^* & \cdots & U_{PP}V_{PP}^* \end{bmatrix}.$
The next step is to write (29) in a factorized form. First, introduce the block-diagonal matrix $U := \mathrm{diag}(U_1, \ldots, U_P)$ with
(30) $U_a := [\, U_{a1}\ U_{a2}\ \cdots\ U_{aP} \,] \in \mathbb{C}^{(M^2/P) \times Pt},$
and define in addition $V := \mathrm{diag}(V_1, \ldots, V_P)$ with
(31) $V_b := [\, V_{1b}\ V_{2b}\ \cdots\ V_{Pb} \,] \in \mathbb{C}^{(N/P) \times Pt}.$
In addition, introduce the switch matrix
(32) $Z := \begin{bmatrix} Z_{11} & \cdots & Z_{1P} \\ \vdots & & \vdots \\ Z_{P1} & \cdots & Z_{PP} \end{bmatrix} \in \mathbb{R}^{P^2 t \times P^2 t},$
where each submatrix $Z_{ab} \in \mathbb{R}^{Pt \times Pt}$ is itself a block matrix with blocks of size $t \times t$, defined to be zero everywhere except being the identity matrix at the $(b, a)$-th block. In order to understand the NN architecture discussed in the sections below, it is imperative to understand the meaning of $Z$. Let us assume for simplicity that $t = 1$. Then, for an arbitrary vector $z = [z_{11}, z_{12}, \ldots, z_{1P}, z_{21}, \ldots, z_{PP}]^T$, $Z$ essentially performs a "switch" that shuffles $z$ by transposing the two block indices:
(33) $(Zz)_{ab} = z_{ba}.$
With the above definitions for $U$, $Z$, and $V$, the approximation in (29) can be written compactly as
(34) $A \approx U Z V^*.$
Notice that, although $A$ has $M^2 N$ entries, using the factorization (34), $A$ can be stored with $O(Pt(M^2 + N))$ entries. In this paper, $M^2$ and $N$ are typically on the same order, $O(\omega^2)$, while $P = O(\omega)$. Therefore, instead of $O(\omega^4)$ entries, one only needs $O(\omega^3)$ entries to parameterize the map approximately using (34). Such a factorization is also used in [21] for the compression of Fourier integral operators.
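The compression (28)-(29) can be sketched in a few lines of numpy: factor each block by a truncated SVD and compare the storage against the dense matrix. The block sizes and rank below are illustrative, and the random test matrix is built to be exactly block-rank-$t$ so that the reconstruction is exact:

```python
import numpy as np

rng = np.random.default_rng(1)
P, m, n, t = 4, 32, 32, 3   # P x P blocks of size m x n, target rank t

# A random matrix whose blocks are (exactly) rank-t, standing in for A
blocks = [[rng.standard_normal((m, t)) @ rng.standard_normal((t, n))
           for _ in range(P)] for _ in range(P)]
A = np.block(blocks)

# Truncated SVD of each block, as in (28): A_ab ~ U_ab V_ab^*
U, V = {}, {}
for a in range(P):
    for b in range(P):
        u, sg, vh = np.linalg.svd(blocks[a][b])
        U[a, b] = u[:, :t] * sg[:t]          # absorb singular values into U_ab
        V[a, b] = vh[:t, :].conj().T

# Reassemble the block approximation (29) and count storage
approx = np.block([[U[a, b] @ V[a, b].conj().T for b in range(P)]
                   for a in range(P)])
dense_entries = A.size
factored_entries = P * P * t * (m + n)       # entries in all U_ab and V_ab
```

Here the factored storage $P^2 t (m + n)$ is already several times smaller than the dense $P^2 m n$, and the gap widens as the blocks grow for fixed rank $t$.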
We would like to comment on another property that may lead to a further reduction in the number of parameters used for approximating $A$. Let us focus on two submatrices $A_{ab}$ and $A_{ab'}$ of $A$ that share the same row group. For the two squares $\mathcal{B}_b$ and $\mathcal{B}_{b'}$, with centers $x_0$ and $x_0'$ respectively, let $\delta := x_0' - x_0$. For $(r, s) \in (R \times S)_{\mathcal{A}_a}$ and $x \in \mathcal{B}_b$, we have
(35) $A((r, s), x + \delta) = e^{i\omega (s - r) \cdot (x + \delta)}$
(36) $\quad = D_\delta(r, s)\, A((r, s), x),$
where
(37) $D_\delta(r, s) := e^{i\omega (s - r) \cdot \delta}$
is a diagonal scaling of the rows. Therefore, the low-rank factorizations of $A_{ab}$ and $A_{ab'}$ are solely determined by each other. This implies that it is possible to construct low-rank factorizations for $A_{ab}$ and $A_{ab'}$:
(38) $A_{ab} \approx U_{ab} V_{ab}^*, \quad A_{ab'} \approx (D_\delta U_{ab}) V_{ab'}^*,$
such that $V_{ab'} = V_{ab}$ after identifying $\mathcal{B}_b$ and $\mathcal{B}_{b'}$ via the translation by $\delta$. Since this is true for all possible pairs $b$ and $b'$, one can pick the low-rank factorizations so that, for each fixed $a$, the factor $V_{ab}$ is shared among all column blocks $b$.
3.4. Neural networks
Based on the low-rank property of $A$ established in Section 3.2 and its low-complexity factorization in Section 3.3, we propose new NN architectures for representing the inverse map $d \to \eta$ and the forward map $\eta \to d$.
3.4.1. NN for the inverse map $d \to \eta$.
As pointed out earlier, $d \approx A\eta$ when $\eta$ is sufficiently small. The usual filtered back-projection algorithm [25] solves the inverse problem via
(40) $\eta \approx (A^*A + \epsilon I)^{-1} A^* d,$
where $\epsilon$ is the regularization parameter. In the far field pattern problem, $(A^*A + \epsilon I)^{-1}$ can be understood as a deconvolution operator. To see this, a direct calculation reveals that
(41) $(A^*A)(x, y) = \sum_{(r, s) \in R \times S} e^{i\omega (s - r) \cdot (y - x)}$
for $x, y \in X$. (41) shows that $A^*A$ is a translation-invariant convolution operator. Therefore, the operator $(A^*A + \epsilon I)^{-1}$, as a regularized inverse of $A^*A$, simply performs a deconvolution. In summary, the above discussion shows that, in order to obtain $\eta$ from the scattering pattern $d$ in the regime of small $\eta$, one simply needs to apply sequentially to $d$

the operator $A^*$,

a translation-invariant filter that performs the deconvolution $(A^*A + \epsilon I)^{-1}$.
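In matrix form, these two steps amount to a regularized least-squares solve. Below is a small dense sketch on a 1D caricature of (20)-(21); all sizes are illustrative, and the data is generated with the linearized model itself:

```python
import numpy as np

omega, n, M = 16.0, 16, 12

# 1D caricature of (20)-(21): grid x in [-1/2, 1/2]; the "directions" r, s
# are reduced to scalars so that A has entries exp(i*omega*(s - r)*x).
x = np.linspace(-0.5, 0.5, n)
r = np.linspace(-1.0, 1.0, M)
s = np.linspace(-1.0, 1.0, M)
R, S = np.meshgrid(r, s, indexing="ij")
A = np.exp(1j * omega * (S - R).reshape(-1, 1) * x)   # (M*M) x n matrix

eta = np.exp(-((x - 0.1) ** 2) / 0.01)                # a smooth scatterer
d = A @ eta                                           # linearized data, as in (20)

# Filtered back-projection (40): apply A^*, then the regularized inverse
eps = 1e-6
eta_rec = np.linalg.solve(A.conj().T @ A + eps * np.eye(n), A.conj().T @ d)
rel_err = np.linalg.norm(eta_rec.real - eta) / np.linalg.norm(eta)
```

Since $d$ lies exactly in the range of $A$ here, the only reconstruction error is the small bias introduced by the regularization parameter $\epsilon$.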
Although these two steps might be sufficient when $\eta$ is small, a nonlinear solution is needed when $\eta$ is not. For this purpose, we propose a nonlinear neural network, SwitchNet, for the inverse map. There are two key ingredients in the design of SwitchNet.

The first key step is the inclusion of a Switch layer that sends local information globally, as depicted in Figure 4. The structure of the Switch layer is designed to mimic the matrix-vector multiplication with the factorized operator in (29). However, unlike the fixed coefficients in (29), the Switch layer, as an NN layer, allows for tunable coefficients and learns the right values from the training data. This gives the architecture a great deal of flexibility.

The second key step is to replace the linear deconvolution in the back-projection algorithm with a few convolution (Conv) layers. This enriches the architecture with nonlinear capabilities for approximating the nonlinear inverse map.
The pseudo-code for SwitchNet is summarized in Algorithm 1. The input $d$ is an $M \times M$ matrix, while the output $\eta$ is an $n \times n$ matrix, where $n$ is the number of grid points per dimension (so $N = n^2$). The first three steps of Algorithm 1 mimic the application of the operator $A^*$. The Switch layer does most of the work, while the Vect and Square layers are simply operations that reshape the data to the correct matrix form at the beginning and at the end of the Switch layer. In particular, Vect groups the entries of the 2D field according to the squares defined by the partition, and Square does the opposite. The remaining lines of Algorithm 1 simply apply the Conv layers with a prescribed window size and channel number.
These basic building blocks of SwitchNet are detailed in the following subsection. We also take the opportunity to include the details of the pointwise multiplication (PM) layer that will be used later on.
3.4.2. Layers for SwitchNet
In this section we provide the details of the layers that are used in SwitchNet. Henceforth, we assume that the entries of a tensor are enumerated in the Python convention, i.e., going through the dimensions from the last one to the first. One operation that will be used often is a reshape, in which a tensor is changed to a different shape with the same number of entries and with the enumeration order of the entries kept unchanged.

Vectorize layer. $v = \mathrm{Vect}[\eta]$ with input $\eta \in \mathbb{R}^{n \times n}$. Henceforth we assume that $\sqrt{P}$ is an integer and divides $n$, and we write $m := n/\sqrt{P}$. This operation partitions $\eta$ into $P$ square subblocks of equal size $m \times m$. Each subblock is then vectorized, and the vectorized subblocks are stacked together as a vector $v \in \mathbb{R}^{n^2}$. Intuitively, these operations cluster the nearby entries of each subblock together. The details of the Vect layer are given in the following:

Reshape the $n \times n$ matrix $\eta$ to a $\sqrt{P} \times m \times \sqrt{P} \times m$ tensor.

Swap the second and the third dimensions to get a $\sqrt{P} \times \sqrt{P} \times m \times m$ tensor.

Reshape the result to an $n^2$ vector and set it to $v$.
Square layer. $\eta = \mathrm{Square}[v]$ with input $v \in \mathbb{R}^{n^2}$, where $m = n/\sqrt{P}$ is an integer. The output is an $n \times n$ matrix. Essentially the adjoint operation of the Vect layer, this layer fills each square subblock of the matrix with a contiguous segment of the entries of $v$. The details are given as follows:

Reshape the $n^2$ vector $v$ to a $\sqrt{P} \times \sqrt{P} \times m \times m$ tensor.

Swap the second and the third dimensions to get a $\sqrt{P} \times m \times \sqrt{P} \times m$ tensor.

Reshape the result to an $n \times n$ matrix and set it to $\eta$.
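Both reshuffles can be written directly as numpy reshapes and transposes, and Square undoes Vect exactly. A sketch with illustrative sizes ($n = 8$, $\sqrt{P} = 4$):

```python
import numpy as np

def vect(eta, sqrtP):
    """Group the entries of an n x n field block-by-block into a vector."""
    n = eta.shape[0]
    m = n // sqrtP
    t = eta.reshape(sqrtP, m, sqrtP, m)          # split both axes into blocks
    return t.transpose(0, 2, 1, 3).reshape(-1)   # swap dims 2 and 3, then flatten

def square(v, n, sqrtP):
    """Adjoint reshuffle: refill the n x n field from the block-ordered vector."""
    m = n // sqrtP
    t = v.reshape(sqrtP, sqrtP, m, m)
    return t.transpose(0, 2, 1, 3).reshape(n, n)

eta = np.arange(64.0).reshape(8, 8)
v = vect(eta, sqrtP=4)   # entries of each 2 x 2 block are now contiguous
```

After Vect, the four entries of the top-left $2 \times 2$ block of the field occupy the first four slots of $v$, which is exactly the clustering the Switch layer relies on.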
Switch layer. $w = \mathrm{Switch}[v]$ with input $v \in \mathbb{R}^{n^2}$. It is assumed that the sizes of the input and the output are integer multiples of $P$, so that each can be split into $P$ blocks. This layer consists of the following steps.

Apply the block-diagonal matrix $V^*$ of (31) to $v$:
$z = V^* v \in \mathbb{R}^{P^2 t}.$

Reshape $z$ to be a $P \times P \times t$ tensor. Here we follow the Python convention of going through the dimensions from the last one to the first one. Then a permutation is applied to swap the first two dimensions to obtain a tensor of size $P \times P \times t$. Finally, the result is reshaped to a vector $z'$, again going through the dimensions from the last to the first. This step implements the switch matrix $Z$ of (32).

Apply the block-diagonal matrix $U$ of (30) to $z'$:
$w = U z'.$

Here the nonzero entries of $U$ and $V$ are the trainable parameters. The Switch layer is illustrated in Figure 4.
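The middle permutation step, i.e., the action of the switch matrix in (33), is itself a reshape-transpose-reshape. A sketch with $t = 1$ and $P = 3$ (illustrative sizes):

```python
import numpy as np

def switch_permute(z, P, t):
    """Shuffle z_(a,b) -> z_(b,a): reshape to a P x P x t tensor, swap the
    first two dimensions, and flatten back, as in the Switch layer."""
    return z.reshape(P, P, t).transpose(1, 0, 2).reshape(-1)

P, t = 3, 1
z = np.arange(P * P * t)          # [z_11, z_12, z_13, z_21, ..., z_33]
w = switch_permute(z, P, t)       # block indices transposed
```

Applying the shuffle twice returns the original vector, reflecting that $Z$ is a permutation matrix with $Z^2 = I$.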
Convolution layer. $w = \mathrm{Conv}[v]$ with input $v \in \mathbb{R}^{n \times n \times \alpha}$. Here $\alpha$ and $\alpha'$ denote the input and output channel numbers, and $k$ denotes the window size. In this paper we only use the convolution layer with stride 1 and with zero-padding:
(43) $w(i, j, c') = \sum_{|i'|, |j'| \le (k-1)/2} \sum_{c=1}^{\alpha} W(i', j', c, c')\, v(i + i', j + j', c) + b(c'),$
with $w \in \mathbb{R}^{n \times n \times \alpha'}$. Here the window size $k$ is assumed to be odd in the presentation. Both the weights $W$ and the biases $b$ are the trainable parameters.

Pointwise multiplication layer. $w = \mathrm{PM}[v]$ with input $v \in \mathbb{R}^{n \times n}$. It is defined as
(44) $w(i, j) = W(i, j)\, v(i, j) + b(i, j).$
Both $W$ and $b$ are trainable parameters.
We remark that, among these layers, the Switch layer has the most parameters. If the input and output to the Switch layer both have size $O(N)$, the number of parameters is $O(NPt)$, where $P$ is the number of squares that partition the input field and $t$ is the rank of the low-rank approximation.
3.4.3. NN for the forward map $\eta \to d$.
We move on to discuss the parameterization of the forward map $\eta \to d$. The proposal is based on the simple observation that the inverse of the inverse map is the forward map.
More precisely, we simply reverse the architecture of the inverse map proposed in Algorithm 1. This results in the NN presented in Algorithm 2. The basic architecture of this NN involves applying a few Conv layers first, followed by a Switch layer that mimics $A$.
We would also like to mention yet another possibility for parameterizing the forward map $\eta \to d$, via a recurrent neural network [24]. Let
(45) $H_k := \eta (G_0 \eta)^k, \quad k = 0, 1, 2, \ldots$
One can leverage the following recursion
(46) $H_k = \eta G_0 H_{k-1}, \quad H_0 = \eta,$
to approximate the kernel $\eta (I - G_0\eta)^{-1}$ appearing in (19) by treating each $H_k$ as an image and using a recurrent neural network. At the $k$-th level of the recurrent neural network, it takes $H_{k-1}$ and $\eta$ as inputs and outputs $H_k$. More specifically, in order to go from $H_{k-1}$ to $H_k$, one first applies $G_0$ to each column of the image $H_{k-1}$, and then each row of the resulting image is reweighted by the diagonal matrix $\eta$. Stopping at the $K$-th level for a sufficiently large $K$, the kernel can be approximated by
(47) $\eta (I - G_0\eta)^{-1} \approx \sum_{k=0}^{K} H_k.$
Once holding such an approximation, we plug it into (19):
(48) $d(r, s) \approx \int_{\Omega \times \Omega} e^{-i\omega r \cdot x} \left[ \sum_{k=0}^{K} H_k \right](x, y)\, e^{i\omega s \cdot y}\, dx\, dy.$
This shows that the map from $\eta$ to $d$ can be realized by applying a matrix product with $e^{-i\omega r \cdot x}$ first in the $x$ dimension, followed by a matrix product with $e^{i\omega s \cdot y}$ in the $y$ dimension. If we view applying the Green's function $G_0$ as applying a convolution layer in an NN, the above discussion shows that the forward map can be obtained by first applying a recurrent NN followed by a convolutional NN. The main drawback of this approach is the large memory requirement (i.e., $O(N^2)$ per level) for storing each individual $H_k$. In addition, the use of a recurrent NN may lead to difficulty in training [26] due to the issue of exploding or vanishing gradients. Moreover, since the weights for parameterizing $G_0$ are shared over multiple layers in the recurrent NN, one might not be able to use back-propagation efficiently, which may lead to a longer training time. These are the main reasons why we decided to adopt the approach in Algorithm 2.
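The recursion (46) is straightforward to realize with matrices. A small sketch with random stand-ins for $G_0$ and $\eta$ (sizes and scalings are illustrative, chosen so that the series converges), confirming that the partial sums approach the kernel $\eta(I - G_0\eta)^{-1}$:

```python
import numpy as np

rng = np.random.default_rng(3)
N, K = 40, 12
G0 = 0.05 * rng.standard_normal((N, N))    # stand-in for the background Green's function
eta = np.diag(0.2 * rng.standard_normal(N))  # diagonal scatterer

# Recursion (46): H_k = eta G0 H_{k-1}, starting from H_0 = eta.
# Each step applies G0 to the columns of the previous image and then
# reweights the rows by the diagonal matrix eta.
H = eta.copy()
total = H.copy()
for _ in range(K):
    H = eta @ (G0 @ H)
    total = total + H

exact = eta @ np.linalg.inv(np.eye(N) - G0 @ eta)
err = np.linalg.norm(total - exact) / np.linalg.norm(exact)
```

Each level only needs the previous image and $\eta$, which is exactly the structure a recurrent NN unrolls, but each image costs $O(N^2)$ memory.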
3.5. Numerical results
In this section, we present numerical results of SwitchNet for the far field pattern problem at a fixed frequency $\omega$. The scatterer field $\eta(x)$ supported in $\Omega$ is assumed to be a mixture of Gaussians
(49) $\eta(x) = \sum_{j=1}^{g} \exp\left( -\frac{|x - c_j|^2}{2\sigma^2} \right),$
where $g$ is the number of Gaussians, $c_j$ are their centers, and $\sigma$ is a fixed width. When preparing the training and testing examples, the centers of the Gaussians are chosen to be uniformly distributed within $\Omega$. The number of Gaussians $g$ in the mixture is set to vary between 2 and 4. In the numerical experiments, the domain $\Omega$ is discretized by an $n \times n$ Cartesian grid $X$. To discretize the source and receiver directions, we set both $S$ and $R$ to be a set of 80 equally spaced unit directions on $S^1$. Therefore, in this example, $M = 80$.
In Algorithm 1, the parameters are the rank $t$ of the low-rank approximation, the number of squares $P$, and the window size, channel number, and number of the convolution layers, resulting in about 3100K parameters in total. The parameters for Algorithm 2 are chosen analogously, with a total of about 4200K parameters. Note that, for both algorithms, the number of parameters is significantly smaller than that of a fully connected NN with the same input and output sizes.
SwitchNet is trained with the ADAM optimizer [15] in Keras [4] with a step size of 0.002 and a mini-batch size of 200. The optimization is run for 2500 epochs. Both the training and testing data sets are obtained by numerically solving the forward scattering problem with an accurate finite difference scheme with a perfectly matched layer. In the experiment, 12.5K pairs of $(\eta, d)$ are used for training, and another 12.5K pairs are reserved for testing. The errors are reported using the mean relative errors
(50) $\epsilon_d = \frac{1}{K_{\mathrm{test}}} \sum_{i} \frac{\| d_i^{\mathrm{NN}} - d_i \|_F}{\| d_i \|_F}, \quad \epsilon_\eta = \frac{1}{K_{\mathrm{test}}} \sum_{i} \frac{\| \eta_i^{\mathrm{NN}} - \eta_i \|_F}{\| \eta_i \|_F},$
where $d_i^{\mathrm{NN}}$ and $d_i$ denote the predicted and ground truth scattering patterns, respectively, for the $i$-th testing sample, $\eta_i^{\mathrm{NN}}$ and $\eta_i$ denote the predicted and ground truth scatterer fields, and $K_{\mathrm{test}}$ is the number of testing samples. Here $\| \cdot \|_F$ is the Frobenius norm.
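The metric (50) is the mean relative Frobenius-norm error over the test set; below is a minimal helper (the sample-first array layout is an assumption of this sketch, not prescribed by the paper):

```python
import numpy as np

def mean_relative_error(pred, truth):
    """Mean over samples of ||pred_i - truth_i||_F / ||truth_i||_F, as in (50)."""
    num = np.linalg.norm(pred - truth, axis=(1, 2))   # per-sample Frobenius norms
    den = np.linalg.norm(truth, axis=(1, 2))
    return float(np.mean(num / den))

truth = np.ones((3, 4, 4))
pred = truth * 1.1   # every sample uniformly 10% off
```

For the uniform 10% perturbation above, the metric evaluates to 0.1 for any sample and field size.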
Table 1 summarizes the test errors for Gaussian mixtures with different choices of the number of Gaussians $g$. For the purpose of illustration, we show the predicted $d$ and $\eta$ by SwitchNet along with the ground truth in Figure 5 for one typical test sample.
$g$ | Forward map | Inverse map
2 | 9.4e-03 | 1.2e-02
3 | 4.0e-02 | 1.4e-02
4 | 4.8e-02 | 2.4e-02
4. SwitchNet for seismic imaging
4.1. Problem setup
This section considers a two-dimensional model problem for seismic imaging. The scatterer $\eta(x)$ is again assumed to be supported in a domain $\Omega$ with an $O(1)$ diameter, after appropriate rescaling. $\Omega$ is discretized with a Cartesian grid $X$ at the rate of at least a few points per wavelength. Compared to the source and receiver configurations in Section 3.1, the experimental setup here is simpler. One can regard both $S$ and $R$ as equal to a set of uniformly sampled points $\{p_i\}$ along a horizontal line near the top surface of the domain. The support of $\eta(x)$ is at a certain distance below the top surface, so that it is well-separated from the sources and the receivers (see Figure 6 for an illustration of this configuration).
The source and receiver operators in (9) take a particularly simple form. For the sources, the operator $\Pi_S$ simply places a point source at each source location:
$(\Pi_S)(x, s) = \delta(x - p_s), \quad p_s \in S.$
Similarly, for the receivers, the operator $\Pi_R$ is given by sampling at the receiver locations:
$(\Pi_R u)(r) = u(p_r), \quad p_r \in R.$
After plugging these two formulas back into (9), one arrives at the following representation of the observation data for $r \in R$ and $s \in S$:
$d(r, s) = (G - G_0)(p_r, p_s).$
4.2. Low-rank property
Following the approach taken in Section 3.2, we start with the linear approximation under the assumption that $\eta$ is weak. Since $G - G_0 \approx G_0 \eta G_0$, the first order approximation is
(51) $d(r, s) \approx \int_{\Omega} G_0(p_r, x)\, \eta(x)\, G_0(x, p_s)\, dx.$
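On the grid, the linearization (51) is a product of three matrices. A sketch with a hypothetical free-space-like kernel standing in for $G_0$ and a 1D slice of the scatterer region (all numerical values are illustrative):

```python
import numpy as np

omega, n, M = 16.0, 20, 8
# Grid points of (a 1D slice of) the scatterer region, at depth 0.5 below
# the surface line that carries the sources and receivers.
x = np.stack([np.linspace(-0.5, 0.5, n), -0.5 * np.ones(n)], axis=-1)
p = np.stack([np.linspace(-0.5, 0.5, M), np.zeros(M)], axis=-1)

def g0(a, b):
    """Hypothetical free-space-like stand-in for the background Green's function."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(1j * omega * dist) / np.sqrt(dist)   # depth separation keeps dist > 0

eta = np.exp(-np.linspace(-0.5, 0.5, n) ** 2 / 0.02)   # scatterer on the slice
# Linearized data (51): d(r, s) ~ sum over x of G0(p_r, x) eta(x) G0(x, p_s)
d = g0(p, x) @ np.diag(eta) @ g0(x, p)                 # M x M data matrix
```

Because the same Green's function kernel connects the sources and the receivers to the scatterer, the resulting data matrix is symmetric in $r$ and $s$.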