1 Introduction
High frequency wave scattering in inhomogeneous media, arising from various application such as seismic waves, geophysical problems, poses great computational challenges due to the highly oscillatory natures of the solutions as well as the variance of random media properties. To model the high frequency waves in a deterministic media, high order methods such as spectral methods for the differential equations or wideband fsat multipole methods for the integral equations are used. Sparse grid methods have been used to reduce the cost of simulation over the high dimensional random spaces, however, these methods still suffer the curse of dimensionality.
Deep neural network (DNNs) have shown great potential in approximating high dimensional functions compared with traditional approximations based on Lagrangian interpolation or spectral methods. Recently, it has been shown in
xu18 ; xu19that some common NNs, including fully connected and convolution neural network (CNN) with tanh and ReLU activation functions, demonstrated a frequency dependent convergence behavior, namely, the DNNs during the training are able to approximate the low frequency components of the targeted functions first before than the higher frequency components, named as the FPrinciple of DNNs
xu18 . The stalling of DNN convergence in the later stage of training could be mostly coming from learning the high frequency components of the data. This Fprinciple behavior of DNN is just the opposite to that of the traditional multigrid method (MGM) bandt77 in approximating the solutions of PDEs where the convergence occurs first in the higher frequency end of the spectrum, due to the smoothing operator employed in the MGM. The MGM takes advantage of this fast high frequency error reduction in the smoothing iteration cycles and restrict the original solution of a fine grid to a coarse grid, then continuing the smoothing iteration on the coarse grid to reduce the higher end frequency spectrum in the context of the coarse grid. This downward restriction can be continued until errors over all frequency are reduced by a small number of iteration on each level of the coarse grids.Our object is to develop DNNs based numerical methods to handle high frequency functions and solutions of high frequency wave equation in inhomogeneous media, whose solution will be oscillatory and high dimensional functions due to the random coefficients. For this purpose, we will have to develop new classes of DNNs with capability of representing highly oscillatory solutions in the physical spatial variables, at the same time, addressing the challenging of representing high random space dimensions.
In this paper, we will consider the following Helmhotlz equations PDE in an inhomogeneous media
(1) 
with appropriate boundary condition on for a bounded domain plus a radiation condition at for an exterior scattering problem.
Our goal is to hopefully find a DNN parameterized by approximation to
(2) 
To improve the capability of usual DNNs for learning highly oscillatory functions in the physical variable , we propose phase shift DNNs with wideband learning capabilities in error reductions in the approximation for all frequencies of the targeted function by making use of the faster convergence in the low frequencies of the DNN during its training. We approximate the target function with a series of functions with disjoint frequency spectrum through a partition of the unit (POU) in the space. To learn each of these function of specific frequency range, we employ a phase shift in the space to translate its frequency to the range
, then the phase shifted function now only has low frequency content and can be learned by common DNNs with a small number of epoches of training. This series of DNNs with phase shifts will be named phase shift deep neural network (PhaseDNN).
To achieve this wideband approximation of a function, we can implement the PhaseDNN in a parallel manner where each range of frequency of the training data, after a proper phase shift, is learned independently. This approach can take advantage of parallel implementation, however, frequency extraction of the original training data have to be done using convolution with a frequency selection kernel numerically, which could become expensive or not accurate for scattered training data. Alternatively, we can also implement the PhaseDNN in a nonparallel manner where data from all range of frequencies are learned together with phase shifts included in the PhaseDNN, which will be called coupled PhaseDNN. Although, the coupled PhaseDNN lacks the parallelism of the parallel PhaseDNN, it avoids the costly convolution used in the parallel PhaseDNN to extract the frequency component from the original training data. This feature will be shown to be important when higher dimensional data are involved in the training. Due to this property, the coupled PhaseDNN will be used to solve the high frequency wave problems where we will seek solutions in a space of PhaseDNN by mininizing the residuals of the differential equation.
The rest of the paper will be organized as follows. In section 2, we will review the fast low frequency convergence property of neural network with tanh activation function. Section 3 will present the parallel version phase shift deep neural network  PhaseDNN. Based on the properties of the PhaseDNN, a coupled PhaseDNN is introduced in Section 4 to reduce the cost of learning in training the DNN for approximations. Then, the coupled PhaseDNN is used to solve the differential equation for wave propagation. Section 5 contains various numerical results of the PhaseDNN for approximations and solutions of wave problems by differential equations. Finally, a conclusion and discussions will be given in Section 6.
2 Deep Neural Network and frequency dependent training of DNN
A deep neuron network (DNN) is a sequential alternative composition of linear functions and nonlinear activation functions. Given
, let is a linear function with the form , where , . The nonlinear activation function . By applying this function componentwisely, we can extend activation function to naturally. A DNN with layers can be expressed as a compact form(3)  
with . Equivalently, we can also express it explicitly:
(4) 
Here, are linear functions. This DNN is also said to have hidden layers. The th layer has neurons.
In the following text, we only consider DNN with one dimensional input layer and one dimensional output layer. Namely, , . That is, . The activation function is chosen to be .
In this paper, we consider the problem of approximating a function by DNN through training. Given a function
(This choice is only to ensure the existence of Fourier transform and loss function for analysis reason.), we are going to minimize
(5) 
We consider this problem in frequency space. We define Fourier transform and its inverse of a function by
(6) 
Let , . By Paseval’s equality,
(7) 
Let denotes all the parameters in DNN. Namely,
Here, is the number of the parameters. The Fprinciple states the relative changing rate of for different frequency . We cite the following result from xu19
Theorem 1.
Given a function and a DNN function defined by (3), we scale by , where is a small parameter. Suppose for all , and . For any frequency and such that , and , there exists a such that when ,
holds for all , where , is the Lebesgue measure and , are two constants.
3 A parallel phase shift DNN (PhaseDNN)
The Fprinciple and estimation of
in Theorem 1 suggest that when we apply a gradientbased training method to loss function, the low frequency part of loss function converges faster than the high frequency part. So when we train a DNN to approximate , there exists positive frequency number and an integer such that after steps of training, the Fourier transform of training function should be a good approximation of for , where are parameters obtained after steps of training.
Frequency selection kernel
Following phaseShift , in order to speed up the learning of the higher frequency content of the target function , we will employ a phase shift technique to translate higher frequency spectrum to the frequency range of . Such a shift in frequency is a simple phase factor multiplication on the training data in the physical space, which can be easily implemented.
For a given , let us assume that
Before we can carry out this scheme, we construct a mesh for the interval by
(8) 
Then, we introduce a POU for the domain associated with the mesh as
The inverse Fourier transform of , indicated by is
(11) 
For a smoother function whose inverse Fourier transform, which will act as a frequency selection kernel in space, has a faster decay condition, we could use BSplines defined by the following recursive convolutions. Namely,
(12) 
It can be easily shown that
and the inverse Fourier transform of is
In most cases, we will just use the cubic BSpine. We choose
(13) 
and . Therefore,
which defines the frequency selection kernel
and
From the basic property of Bspline functions, we know that
Now, with the POU in (9), we can decompose the target function in the Fourier space as follows,
(15) 
Equation (15) will give a corresponding decomposition in space
(16) 
where
The decomposition (16) involves function whose frequency spectrum is limited to therefore, a simple phase shift could translate its spectrum to , then it could be learned quickly by a DNN with training epoches.

A parallel phase shift DNN (PhaseDNN) algorithm
Our goal is to learn function based on its training data
(17) 
In order to learn for its frequency information for all frequencies uniformly we will apply the decomposition (16) to
(18) 
where
(19) 
The training data for will be computed by (19), which can be implemented in the data space through the following convolution
(20) 
where is chosen such that the kernel function
Now as the support of is , then will have its frequency spectrum in therefore its inverse Fourier transform denoted as
(21) 
can be learned quickly by a DNN
(22) 
in epoches of training. Moreover, we know that
(23) 
which provides the training data for . Equation (23) shows that once is learned, is also learned by removing the phase factor.
(24) 
Now all have been learned, we will have an approximation to over all frequency range as follows
(25) 
Remark 1.
It should be noted that during the construction of PhaseDNN, only the original set of training data will be needed, but it will be modified to account for the frequency shifts, implemented simply by multiplying the training data by an phase factor on the residuals from the previous neural network approximations. We also allude the way PhaseDNN achieve approximation over increasing range of frequency with each additional network to that of wavelet approximations, which generate better and better resolution of approximation by adding wavelet subspaces produced by dialation of mother wavelet to account for higher frequency dau92 cai96 . In the case of PhaseDNN though phase shifts are used to cover higher frequencies. The substraction between the target function provided by the training data and the lower frequency network approximation is similar to the strategy employed in construction of multiresolution interpolation for PDE solutionscai96 .
Remark 2.
(Parallelism of PhaseDNN) It is clear the learning for each part of residual function can be done in parallel, the PhaseDNN is a version of domain decomposition method in space.
Remark 3.
(Frequency Adaptivity of PhaseDNN) The frequencyselected function requires the evaluation of the convolution (20), which implicitly assumes sufficient data information is available in the neighborhood of . In case such an assumption is not valid, we basically do not have enough data to extract the frequency information for the target function. In all practical sense of learning, we should assume the target function do not have the information of this specific frequency, and there is no need to learn this function near and we could simply set for , which is what could happen for a constant (or low) frequency function . As this process is done locally in space and frequency, the PhaseDNN will be able to achieve frequency adaptivity commonly associated with wavelet approximation of signals and images.
Adaptivity in frequency and local radial Basis fitting A preanalysis of the training data could be done to investigate the frequency content of the data, so only those frequencies will be considered when phase shifts are used for error reduction. Firstly, we group the training data into
clusters (disjoint or overlapping). Secondly, in a sphere centered at the center of a cluster, a local radial basis function fitting of training data
falling in the sphere will be carried out wendland04 . Finally, we apply an 1D discrete Fourier transform over the fitting function along each coordinate direction within the cluster to get the frequency content for the data in the cluster. Once this analysis is done for each cluster, we will have a collection of frequencies and those with significant magnitude will be used as candidates for the phase shifts in the PhaseDNN.Remark 4.
(Selection of and and preprocessing training data) In principle, any nonzero frequency threshold can be used in the implementation of PhaseDNN, which will dictate the size of to achieve sufficient convergence of the underlying DNN over the frequency range . Therefore, a careful analysis and numerical experiment of the specific low frequency convergence of the DNN will be needed.
As is selected to train the initial network in PhaseDNN to learn low frequent content of the data, a lowpass filtering on the training data could be used to select appropriate size of needed for achieve convergence within frequency.
4 A coupled PhaseDNN
4.1 Approximations
In the last section, we use the frequency selection kernal to decompose the training data into different frequency components, each of them after being phaseshifted will be represented by a small DNN. This method can be implemented in parallel, however, has to use convolution to construct the training data for each small DNN. The convolution is equivalent to a matrix multiplication and the matrix requires a storage of , is the number of samples. As a result, this convolution process strongly restricts the performance of phase DNN for higher dimensions.
To avoid this problem, based on the construction of the parallel PhaseDNN (25), we would like to consider a coupled weighted phaseshifted DNNs as an ansate for a new DNN, termed as coupled PhaseDNN,
(26) 
to approximate , where are relatively small DNNs, are frequencies we are particularly interested in. Namely, we are going to minimize
(27) 
This method is similar to expansion with Fourier modes of selected frequency with variable coefficients by DNNs. As is a real function, it is equivalent to use real ”Fourier” series rather than complex ”Fourier” series. Namely, we will consider the following sine and cosine expansions
(28) 
to approximate , where are DNNs while will always be included.
4.2 Solving differential equations
We will apply the coupled PhaseDNN to solve differential equation for the interior wave problems. Eqn.(28) is a good ansatz for solving differential equation (DE).
The test problem is
(29) 
where the right hand side source function, representing external source, is taken to be
(30) 
When and , , the unique solution of the equation is
(31) 
Loss function In finding the coupled PhaseDNN solution to the DE, we use the residual of the DE to define the loss function as follows
(32) 
5 Numerical results
5.1 Numerical result of parallel PhaseDNN for approximations
In this section, we will present some preliminary numerical results to demonstrate the capability of PhaseDNN to learn high frequency content of target functions. In practice, we could sweep over all frequency ranges. For the test function we have some rough idea of the the range of frequencies in the data, only a few frequency intervals are selected for the phase shift.
We choose a target function defined in by
(33) 
Because the frequencies of this function is well separated, we need not to sweep all the frequencies in . Instead, we choose , and use the following functions
to collect the frequency information in the corresponding frequency intervals and shift the center of the interval to the origin by a phase factor. For each
, we construct two DNNs to learn its real part and imaginary part, separately. Every DNN have 4 hidden layers and each layer has 40 neurons. Namely, the DNN has the structure 1404040401. The training data is obtained by 10,000 samples from the uniform distribution on
and the testing data is 500 uniformly distributed samples. We train these DNNs by Adam optimizer with training rate 0.0002 with 10 epochs of training for each DNN. The result is shown in Fig. 1.The detail of this training is shown in Fig. 2.
These figures clearly shows that phase DNN can capture the various high frequencies, from low frequency , to high frequency quite well. The training error and time of phase DNN are collected in table 1
Convolution  Training  Training  Test  
time(s)  Time(s)  Error  Error  
34.43  169.82      
33.74  185.40      
33.22  199.48      
9.70  214.39      
9.78  230.46      
33.21  247.34      
33.38  263.89      
32.68  279.39      
Total  220.14  1790.17  0.01508  0.06831 
It is shown that taking the advantage of vectorization, the convolution step costs about 20% time of training. Because in different interval,
can be trained in parallel, PhaseDNN is ideal to take advantage of parallel computing architectures.In comparison with one single DNN, we have used a 24 hidden layers with 80 neurons per hidden layer with the same amount of training data, the loss is still around 100 after more than 5000 epoch of training taking over 22 hours nonstop running on the same workstation.
5.2 Numerical results of coupled PhaseDNN
5.2.1 1D Problem
We will use test the coupled PhaseDNN on our previous test problems. Namely, we are going to learn
(34) 
by Eqn. (28). The frequencies are selected to be as before. For each and , we set it as a 1404040401 DNN.
The training parameters are as follow: we use 10000 random samples and train 40 epochs with batchsize 50. The optimizer is Adam with learning rate 0.002. The testing data is 10000 evenly distributed samples. The fitting result is shown in the left panel in Fig.3.
The total training time is less than 10 seconds. The average training error and testing error are both . The maximum error is 0.26. The error is concentrated in the neighborhood of 0, where the derivative of is discontinuous. Out of this neighborhood, the maximum error is 0.08.
We can also sweep the frequency space by choosing the frequencies to be 250:10:250, training parameters are the same as before. The fitting result and error is shown in Fig.4. The average training and testing error are reduced to .
For comparison, we can also use a fully connected DNN with similar scale as Eqn.(28) to learn . This DNN has 4 hidden layers with 360 neuron in each layer. This DNN was trained with the same parameter as before. The result is shown in the right panel of Fig.3. It is clear that almost nothing was learned by this fully connected DNN for this high frequency data.
In fact, 10000 samples are too many for this example, it turns out that 1000 samples can lead to a good approximation with Eqn. (28). Even with 500 samples, which is a much too small data set for the frequency 203, We can still get a ‘not too bad’ result. The testing is also done with 10000 samples. These results are shown in Fig.5.
Discontinuous functions and frequency sweep
Next we consider discontinuous functions and we replace the in Eqn.(34) by square wave function with same frequency and learn it by (28) with and . These two DNNs are trained with 10000 samples and 100 epochs. The results are shown in Fig.6.
It can be seen that although the sweeping strategy performs better than selecting, it is less accurate than the case. The average error is , maximum error is 6.
5.2.2 2DProblem
We use Eqn. (28) to learn 2D and 3D problems. For 2D test, the the function is used, where is defined by
(35) 
The function is the in (34) without the component. In this test, we choose . Training setting are samples and 80 epochs with batchsize 100. Testing data is . The result is shown in Fig.7.
The result is quite good even for the highest frequency domain. In this example, the highest frequency is 137. With more data, we can learn a function of even higher frequency.
5.2.3 3DProblem
The test problem for 3D is , where
(36) 
The selected frequency is . Training uses random samples and 50 epoches with batchsize 1000. Because we cannot plot a 4D picture, we choose hypersurface and as test data. Each is chosen to be 1202020201 DNN. The whole network is trained 40 epochs with batchsize 1000. This training cost about 5 hours. Results are shown in Fig.8.
Remark 5.
Advantages: The main advantage of this approach is we don’t need to treat convolution. This makes it possible to deal with large data set. More data can reveal higher frequency and higher dimensional. In our previous selecting process, the convolution may introduce some numerical error. Now Eqn.(28) has no such kind of numerical error. This is another advantage.
Disadvantage: Comparing with our previous method, the main disadvantage of Eqn.(28) is it cannot be parallelized. We must choose all the before training, then build DNN according to these . If the we choose are not enough, we can learn the residual by some more , but cannot use these to train the original separately.
Remark 6.
The number of data. Basically,the data set must be big enough so that it can reveal all the frequencies. That means we still need data. For each direction, samples must reveal the highest frequency of this direction. This means, even though DNN has the advantage that the unknowns increase linearly w.r.t the number of dimension, we still need to pay exponentially computing cost, at least for computing loss function. In our 3D test, 150 random samples for each direction is merely enough for frequency 32, but the total number is a very big data set.
The advantage that number of unknowns of DNN increase linearly also need more discussion. The traditional DNN can be viewed as a sum of a series of DNNs. Fprinciple tells the main part of a single DNN in Fourier space concentrates in a neighborhood of 0. If we agree with the small parameter assumption, one cannot approximate a function who has a support far away from 0. Our approach Eqn.(28) uses phase shifting technique. The main part of in Fourier space lies in the neighborhood of each . It seems we build a mesh in Fourier space w.r.t implicitly. Like in my 3D test, the mesh is .
It can been that our approach cannot overcome curse of dimensionality either, since generally, the number of also increase exponentially. In our 3D test problem, there are 108 different , which corresponds to 216 small DNNs. The whole DNN has a width over 4000. It is a very wide and shallow DNN. The number of parameters is very large. Combining a large number of data, the whole training costs 5 hours.
However, we only need to apply the coupled PhaseDNN for the 3 physical dimension of the solution while the other dimensions from the random coefficients can be handled by normal DNN.
5.3 Coupled PhaseDNN for solving Helmholtz equations in inhomogeneous media
5.3.1 Helmholtz equation with constant wave numbers
We set , each to be 1404040401 DNN. The whole is trained with 10000 evenly distributed samples, 100 epochs and batchsize 100. We choose four special cases: , , and . The result is shown in Fig.9.
The training costs about 5 minute. The maximum error is . The results are quite accurate. For comparison, a single fully connected DNN with similar scale as cannot solve the equation at all. The training result after 1500 epochs for and are shown in Fig.10, showing the nonconvergence of the usual fully connected DNN.
5.3.2 Helmholtz Equation with variable wave numbers
Next we are going to solve the following Helmholtz equations with variable wave numbers
(37) 
where , are constants.
As there is no explicit exact solution to this equation, the numerical solution obtained by finite difference method will be used as the reference solution.
We first choose , , and in Eqn.(37), which corresponds to a low frequency external wave source and a low wave number with small background media inhomogeneity. In the coupled phaseDNN, we choose . Other training parameters are set to be similar as in the constant coefficient case. The numerical result of coupled phaseDNN and reference solution is shown in Fig11 and the absolute error is in the order of .
Next, We choose , , and in Eqn.(37), which corresponds to a high frequency external wave source and a low wave number with small background media inhomogeneity. In coupled phaseDNN, we choose . Other training parameters are set to be similar as in the constant coefficient case. The numerical result of coupled phaseDNN and reference solution is shown in Fig12 and the absolute error is in the order of .
We further choose , , and , which corresponds to a high frequency external wave source and a high wave number with larger background media inhomogeneity. is chosen to be and . The result is shown in Fig.13 and the absolute error is in the order of .
It can be seen that coupled phaseDNN can handle both small and large wave numbers and low and high frequency external sources. The absolute error are all in the order of for the test cases.
6 Conclusion
In this paper, we have proposed a phase shift DNN to learn high frequency information by using frequency shifts to convert the high frequency learning to low frequency learning. As shown by various numerical tests, this approach increases dramatically the capability of the DNN as a viable tool for solving high frequency wave differential equations in inhomogeneous media.
For future work, we will further develop the PhaseDNN to handle the high dimensionality issues from possibly random inhomogeneous media.
Acknowledgement
This work was supported by US Army Research Office (Grant No. W911NF1710368).
References
 (1) Wei Cai, Xiaoguang Li, Lizuo Liu, PhaseDNN – A Parallel Phase Shift Deep Neural Network for Adaptive Wideband Learning, arxiv 1905.01389, May 3, 2019.

(2)
Weinan, E., and Bing Yu. ”The Deep Ritz method: A deep learningbased numerical algorithm for solving variational problems.” Communications in Mathematics and Statistics 6.1 (2018): 112.
 (3) Xu, ZhiQin John, Understanding training and generalization in deep learning by Fourier analysis, arXiv:1808.04295, November, 2018.
 (4) Xu, ZhiQin John, Yaoyu Zhang, Tao Luo, Yanyang Xiao, and Zheng Ma. Frequency Principle: Fourier Analysis Sheds Light on Deep Neural Networks. arXiv preprint arXiv:1901.06523 (2019).
 (5) Brandt, Achi. Multilevel adaptive solutions to boundaryvalue problems. Mathematics of computation 31.138 (1977): 333390.
 (6) Wei Cai, Xiaoguang Li, Lizuo Liu, PhaseDNN  A Parallel Phase Shift Deep Neural Network for Adaptive Wideband Learning, arxiv 1905.01389, May 3, 2019.
 (7) Daubechies, Ingrid. Ten lectures on wavelets. Vol. 61. Siam, 1992.
 (8) Cai, Wei, and Jianzhong Wang. Adaptive multiresolution collocation methods for initialboundary value problems of nonlinear PDEs. SIAM Journal on Numerical Analysis 33, no. 3 (1996): 937970.
 (9) Wendland, Holger. Scattered data approximation. Vol. 17. Cambridge university press, 2004.