1 Introduction
Deep convolutional neural networks (CNNs) have been widely applied to solving PDEs as well as inverse problems in signal processing. In both applications, spectral methods, namely methods built around forward and backward Fourier transform operators, serve as a traditional solution. Spectral methods have been a classical tool for solving elliptic PDEs. For image inverse problems, primarily image restoration tasks like denoising and deblurring, a large class of PDE methods consider the nonlinear diffusion process [1, 2], which is connected to wavelet frame methods [3, 4]. The involved operator is elliptic, typically the Laplace operator. Apart from the rich prior information in these problems, the conventional end-to-end deep CNNs, like U-Net [5] and Pix2pix [6], consist of convolutional layers which have fully trainable local filters and are densely connected across channels. The enlarged model capacity and flexibility of CNNs improve the performance in many end-to-end tasks. However, such fully data-driven models may give a superior performance on one pair of training and testing datasets yet encounter difficulty when transferred to another dataset, essentially due to overfitting of a trained model with a large amount of flexibility [7, 8]. Also, as indicated in [9, 10, 11, 12, 13], the dense channel connections can be heavily pruned post-training without loss of prediction accuracy.

This motivates the design of structured CNNs which balance model capacity against overfitting by incorporating prior knowledge of PDEs and signal inverse problems into the deep CNN models. The superiority of structured CNNs over ordinary CNNs has been shown both for PDE solvers [14] and for image inverse problems [15]. Several works have borrowed ideas from numerical analysis in deep models: [16] introduces ButterflyNet (BNet) based on the butterfly algorithm for fast computation of Fourier integral operators [17, 18, 19, 20, 21]; [22] proposes to use a switching layer with sparse connections in a shallow neural network, also inspired by the butterfly algorithm, to solve wave-equation-based inverse scattering problems; [14] and [23] introduce hierarchical matrices into deep networks to compute nonlinear PDE solutions; [24] proposes a neural network based on the nonstandard form [25] and applies it to approximate nonlinear maps in PDE computation. Problem-prior-informed structured CNNs have become an emerging tool for a large class of end-to-end tasks in solving PDEs and inverse problems.
This paper introduces a new Butterfly network architecture, which we call BNet2. A main observation is that, as far as the Fourier transform operator is concerned, the middle layer in the Butterfly algorithm can be removed while preserving the approximation ability of BNet. This leads to the proposed model, which inherits the same approximation guarantee as BNet while making the network architecture much simpler and in line with the conventional CNN.
We also investigate the Fourier transform (FT) initialization. FT initialization adopts the interpolative construction in Butterfly factorization [26] to initialize BNet2. Since BNet2 is a conventional CNN with sparsified channel connections, FT initialization can also be applied to CNN to realize a linear FT. We find experimentally that both BNet2 and CNN are sensitive to initialization on the problems we test, and FT-initialized networks outperform their randomly initialized counterparts in our settings. The trained network from FT initialization also demonstrates better stability when the statistics of the testing dataset shift away from those of the training set.

In summary, the contributions of the paper include: (1) We introduce BNet2, a simplified structured CNN based on the Butterfly algorithm, which removes the middle layer and later layers in BNet and thus is in line with the conventional CNN architecture; (2) FT-initialized BNet2 and CNN inherit the theoretical exponential convergence of BNet in approximating the FT operator, and the approximation can be further improved after training on data; (3) Applications to end-to-end solvers for linear and nonlinear PDEs and to inverse problems in signal processing are numerically tested, and under the same initialization BNet2 with fewer parameters achieves accuracy similar to CNN; (4) FT initialization for both BNet2 and CNN serves as an initialization recipe for a large class of CNNs in many applications, and we show that FT-initialized BNet2 and CNN outperform randomly initialized CNN on all tasks included in this paper.
2 ButterflyNet2
The structure of BNet2 inherits part of the design of BNet and extends it to all layers. More precisely, the network structure before the middle layer in BNet is extended to all layers. After removing the middle layer and the second half of the layers in BNet, the structure of BNet2 is exactly a CNN with sparse channel connections. For completeness, we first recall the CNN under our own notations and then introduce BNet2. Towards the end of this section, the numbers of parameters for both CNN and BNet2 are derived and compared.
Before introducing network structures, we first familiarize ourselves with the notations used throughout this paper. The bracket notation $[n]$ denotes the set of nonnegative integers smaller than $n$, i.e., $[n] = \{0, 1, \dots, n-1\}$. Further, a contiguous subset of $[n]$ is denoted as $[n]_p^j$, where $p$ is a divisor of $n$ denoting the total number of equal partitions and $j$, indexed from zero, denotes the $j$-th partition, i.e., $[n]_p^j = \{jn/p, jn/p+1, \dots, (j+1)n/p - 1\}$. While describing the network structure, $x$ and $y$ denote the input and output vectors with length $N_x$ and $N_y$ respectively, i.e., $x \in \mathbb{R}^{N_x}$ and $y \in \mathbb{R}^{N_y}$. $h$, $W$, and $b$ denote hidden variables, multiplicative weights, and biases respectively. For example, $h^{(\ell)}$ is the hidden variable at the $\ell$-th layer with $n_\ell$ spatial degrees of freedom (DOFs) and $c_\ell$ channels; $W^{(\ell)}$ is the multiplicative weights at the $\ell$-th layer with $k$ being the kernel size and $c_\ell$, $c_{\ell+1}$ being the in- and out-channel sizes; $b^{(\ell)}$ denotes the bias at the $\ell$-th layer acting on $c_{\ell+1}$ channels. The activation function is denoted as $\sigma$, which is ReLU in this paper by default.
2.1 CNN Revisit
A CNN can be precisely described using the notations defined above. For an $L$-layer CNN, we define the feedforward network as follows.

Layer 0: The first-layer hidden variable with $c_1$ channels is generated by applying a 1D convolutional layer with kernel size $k$ and stride $k$, followed by an activation function, to the input vector $x$, i.e.,
\[
h^{(1)}_{j, c'} = \sigma\Big( \sum_{s=0}^{k-1} W^{(0)}_{s, c'}\, x_{jk+s} + b^{(0)}_{c'} \Big) \qquad (1)
\]
for $j \in [N_x/k]$ and $c' \in [c_1]$.

Layer $\ell$ ($0 < \ell < L$): The connection between the $\ell$-th layer and the $(\ell+1)$-th layer hidden variables is a 1D convolutional layer with kernel size 2, stride 2, $c_\ell$ in-channels and $c_{\ell+1}$ out-channels, followed by an activation function, i.e.,
\[
h^{(\ell+1)}_{j, c'} = \sigma\Big( \sum_{c'' \in [c_\ell]} \sum_{s=0}^{1} W^{(\ell)}_{s, c'', c'}\, h^{(\ell)}_{2j+s, c''} + b^{(\ell)}_{c'} \Big) \qquad (2)
\]
for $j \in [n_{\ell+1}]$ and $c' \in [c_{\ell+1}]$. The first and second summations in (2) denote the summation over in-channels and the spatial convolution, respectively.

Layer $L$: The last layer mainly serves as a reshaping from the channel direction to the spatial direction, which links the $L$-th layer hidden variables with the output $y$ via a fully connected layer, i.e.,
\[
y_i = \sum_{j \in [n_L]} \sum_{c'' \in [c_L]} W^{(L)}_{j, c'', i}\, h^{(L)}_{j, c''} \qquad (3)
\]
for $i \in [N_y]$. If $y$ is not the final output, then a bias and an activation function can be added.
In the above description, readers who are familiar with CNN may find a few irregular places. We will address these irregular places in Remark 2.2 after the introduction of BNet2.
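To make the non-overlapping structure concrete, a convolution whose kernel size equals its stride reduces to a reshape followed by a dense contraction over the kernel window and the in-channels. A minimal numpy sketch (shapes, names, and sizes are ours for illustration, not from the paper):

```python
import numpy as np

def nonoverlap_conv1d(h, W, b):
    """Non-overlapping 1D convolution: kernel size k equals the stride.

    h : (n, c_in)         input with n spatial points and c_in channels
    W : (k, c_in, c_out)  multiplicative weights
    b : (c_out,)          bias
    returns a (n // k, c_out) array after the ReLU activation
    """
    k, c_in, c_out = W.shape
    n = h.shape[0]
    windows = h.reshape(n // k, k, c_in)             # split into n//k windows
    out = np.einsum('jks,ksc->jc', windows, W) + b   # contract window + in-channels
    return np.maximum(out, 0.0)                      # ReLU

x = np.random.randn(16, 1)     # N_x = 16, one input channel
W0 = np.random.randn(4, 1, 8)  # kernel size 4, 8 output channels
h1 = nonoverlap_conv1d(x, W0, np.zeros(8))
print(h1.shape)  # (4, 8)
```

The reshape is exact precisely because the windows do not overlap, which is what makes the strided formulation above equivalent to a standard convolution followed by subsampling.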
CNN is the most successful neural network in practice, especially in the areas of signal processing and image processing. The convolutional structure, without doubt, contributes most to this success. Another contributor is the increasing channel number. In practice, one usually doubles the channel number until reaching a fixed number and then keeps it constant till the end. Continually doubling the channel number usually improves the performance of the CNN, but has two drawbacks. First, large channel numbers lead to a large parameter count, which in turn leads to overfitting. Second, both training and evaluation become computationally expensive.
2.2 ButterflyNet2
BNet2, in contrast to CNN, keeps the identical convolutional structure while still allowing continually doubled channel numbers. Of the two drawbacks mentioned above, BNet2 overcomes the second one and partially overcomes the first one.
The Layer 0 in BNet2 is identical to that in CNN. Hence, we only define the other layers in the feedforward network as follows.

Layer $\ell$ ($0 < \ell < L$): The in-channels are equally partitioned into $2^\ell$ parts. For each part, a 1D convolutional layer with kernel size 2, stride 2, in-channel size $r$ and out-channel size $2r$ is applied. Writing each channel index as a pair (partition index, index within the partition), the connection between the $\ell$-th layer and the $(\ell+1)$-th layer hidden variables obeys
\[
h^{(\ell+1)}_{j, (2\xi+t, k)} = \sigma\Big( \sum_{k' \in [r]} \sum_{s=0}^{1} W^{(\ell)}_{s, (\xi, k'), (2\xi+t, k)}\, h^{(\ell)}_{2j+s, (\xi, k')} + b^{(\ell)}_{(2\xi+t, k)} \Big) \qquad (4)
\]
for $j \in [n_{\ell+1}]$, $\xi \in [2^\ell]$, $t \in \{0, 1\}$, and $k \in [r]$.

Layer $L$: The in-channels are equally partitioned into $2^L$ parts. For each part, a 1D convolutional layer with kernel size 1, in-channel size $r$ and out-channel size $N_y/2^L$ is applied. The last layer links the $L$-th layer hidden variables with the output $y$, i.e.,
\[
y_{\xi N_y/2^L + i} = \sum_{k' \in [r]} W^{(L)}_{(\xi, k'), i}\, h^{(L)}_{0, (\xi, k')} \qquad (5)
\]
for $\xi \in [2^L]$ and $i \in [N_y/2^L]$. If $y$ is not used as the final output directly, then a bias term and an activation function can be added.
Remark 2.2. This remark addresses two irregular places in the CNN and BNet2 described above, compared with the conventional CNN. First, all convolutions are performed in a non-overlapping way, i.e., the kernel size equals the stride. A regular convolutional layer together with a pooling layer can be adopted to replace the non-overlapping convolutional layer in both CNN and BNet2. Second, except for Layer 0 and Layer $L$, all kernel sizes are 2, which can be generalized to any other constant for both CNN and BNet2. We adopt the present form to simplify the notations in Section 3, the FT initialization.
In (4), the in-channel index and the out-channel index of BNet2 are linked through the auxiliary partition index $\xi$, whereas in the CNN the in-channel index and the out-channel index are independent (see (2)). Figure 1 (b) illustrates the connectivity of in- and out-channels in CNN and in BNet2 at the 2nd layer. Further, Figure 1 (a) shows the overall structure of CNN and BNet2. If we fill part of the multiplicative weights in CNN with those of BNet2 according to (4) and set the remaining multiplicative weights to zero, then the CNN recovers BNet2. Hence, any BNet2 can be represented by a CNN, and the approximation power of CNN is guaranteed to exceed that of BNet2. Surprisingly, according to our numerical experiments, the extra approximation power does not much improve the training and testing accuracy in any example we have tested.
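The zero-filling embedding of a BNet2 layer into a dense CNN layer can be checked directly in a few lines. A hedged numpy sketch (group count, channel sizes, and function names are illustrative, not the paper's code):

```python
import numpy as np

def bnet2_layer(h, W_blocks):
    """One sparse layer: in-channels split into len(W_blocks) groups of r;
    each group is convolved independently (kernel 2, stride 2, r -> 2r)."""
    g = len(W_blocks)
    r = h.shape[1] // g
    outs = []
    for xi in range(g):
        win = h[:, xi*r:(xi+1)*r].reshape(-1, 2, r)
        outs.append(np.einsum('jks,kso->jo', win, W_blocks[xi]))
    return np.maximum(np.concatenate(outs, axis=1), 0.0)

def dense_equivalent(W_blocks):
    """Embed the block-diagonal BNet2 weights into a dense CNN kernel."""
    g = len(W_blocks)
    r = W_blocks[0].shape[1]
    W = np.zeros((2, g * r, g * 2 * r))
    for xi in range(g):
        W[:, xi*r:(xi+1)*r, xi*2*r:(xi+1)*2*r] = W_blocks[xi]
    return W

h = np.random.randn(8, 4)                         # g = 2 groups of r = 2 channels
blocks = [np.random.randn(2, 2, 4) for _ in range(2)]
out_sparse = bnet2_layer(h, blocks)
win = h.reshape(-1, 2, 4)
out_dense = np.maximum(np.einsum('jks,kso->jo', win, dense_equivalent(blocks)), 0.0)
print(np.allclose(out_sparse, out_dense))  # True
```

The dense kernel is mostly zeros, which is exactly the sparsity pattern shown in Figure 1 (b).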
2.3 Parameter Counts
Parameter counts are explicit for both CNN and BNet2. The numbers of biases are identical for the two networks: $c_1$ for Layer 0, $c_{\ell+1}$ for Layer $\ell$, and $N_y$ for Layer $L$ if a bias is added there. Hence the overall number of biases is
\[
\#b = \sum_{\ell=0}^{L-1} c_{\ell+1} + N_y. \qquad (6)
\]
The total numbers of multiplicative parameters are very different for CNN and BNet2. For CNN, the parameter count is $k c_1$ for Layer 0, $2 c_\ell c_{\ell+1}$ for Layer $\ell$, and $n_L c_L N_y$ for Layer $L$. Hence the overall number of multiplicative parameters for CNN is
\[
\#W^{\mathrm{CNN}} = k c_1 + \sum_{\ell=1}^{L-1} 2 c_\ell c_{\ell+1} + n_L c_L N_y. \qquad (7)
\]
For BNet2, the in-channels at Layer $\ell$ are partitioned into $2^\ell$ blocks of size $r$, so the parameter count is $k c_1$ for Layer 0, $2^\ell \cdot 2r \cdot 2r$ for Layer $\ell$, and $r N_y$ for Layer $L$. Hence the overall number of multiplicative parameters for BNet2 is
\[
\#W^{\mathrm{BNet2}} = k c_1 + \sum_{\ell=1}^{L-1} 2^{\ell+2} r^2 + r N_y. \qquad (8)
\]
If we assume $c_\ell = 2^\ell r$, which corresponds to doubling the channel number till the end, then we have
\[
\#W^{\mathrm{CNN}} = O(4^L r^2), \qquad \#W^{\mathrm{BNet2}} = O(2^L r^2). \qquad (9)
\]
Let us consider another regime, i.e., the channel number is doubled first and then fixed to a constant $c$, which can be viewed as an analog of doubling the channel number first and then keeping it constant. Under this regime, the total numbers of parameters can be compared as
\[
\#W^{\mathrm{CNN}} = O(L c^2), \qquad \#W^{\mathrm{BNet2}} = O(c^2), \qquad (10)
\]
where both leading terms come from the middle layers. Hence, in both regimes of hyperparameter settings, BNet2 has a lower-order number of parameters compared with CNN. If the performance in terms of training and testing accuracy remains similar, BNet2 is then much preferred over the CNN.
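Under the doubling-channel model used above (kernel size 2 in the middle layers, $c_\ell = 2^\ell r$; these are our modeling assumptions, not the paper's exact formulas), the middle-layer counts can be tallied directly:

```python
def cnn_mid_params(L, r):
    # dense channel mixing: 2 * c_l * c_{l+1} with c_l = 2**l * r (assumed model)
    return sum(2 * (2**l * r) * (2**(l + 1) * r) for l in range(1, L))

def bnet2_mid_params(L, r):
    # 2**l independent blocks per layer, each a (kernel 2, r -> 2r) kernel
    return sum(2**l * 2 * r * (2 * r) for l in range(1, L))

L_depth, r = 10, 4
ratio = cnn_mid_params(L_depth, r) // bnet2_mid_params(L_depth, r)
print(ratio)  # the gap grows like 2**L, matching the O(4^L) vs O(2^L) comparison
```

For $L = 10$ the dense middle layers already carry hundreds of times more weights than the block-sparse ones, which is the quantitative content of the comparison above.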
3 Fourier Transform Initialization
A good initialization is crucial in training CNNs, especially highly structured networks like BNet2. It is known that CNNs with random initialization achieve remarkable results in practical image processing tasks, as shown in [27]. However, for synthetic signal data as in Section 4, in which a high-accuracy prediction is achievable by a CNN with some set of parameters, we notice that a CNN with random initialization and the ADAM stochastic gradient descent optimizer is not able to converge to that set of parameters.
In this section, we aim to initialize both BNet2 and CNN to fulfill the discrete FT operator, which is defined as
\[
\hat{f}_\xi = \sum_{j \in [N]} e^{-2\pi i \xi j / N} f_j \qquad (11)
\]
for $j \in [N]$ and $\xi \in [K]$, where $N$ denotes the number of discretization points and $K$ denotes the frequency window size. The discrete FT is the traditional computational tool for signal processing and image processing. Almost all related traditional algorithms involve either the FT directly or the Laplace operator, which can be realized via two FTs; see [28, 29]. Hence, if we can initialize a neural network as such a traditional algorithm involving the discrete FT, the training of the neural network can be viewed as refining the traditional algorithm and making it data-adaptive. In other words, a neural network solving image processing and signal processing tasks can then be guaranteed to perform no worse than the traditional algorithm it starts from, a property that is otherwise only widely accepted in practice. This section is composed of two parts: preliminaries and initialization. We first introduce the neural network realization of complex arithmetic, Chebyshev interpolation, and the FT approximation in the preliminary part. Then the initialization for both CNN and BNet2 is introduced in detail.
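The operator in (11) is simply a dense matrix, and for integer frequencies it matches the leading rows of the FFT. A quick numpy check (the sizes are illustrative):

```python
import numpy as np

N, K = 64, 16                    # discretization points and frequency window
f = np.random.randn(N)

# dense discrete FT matrix restricted to the first K integer frequencies
freqs = np.arange(K)[:, None]
grid = np.arange(N)[None, :]
F = np.exp(-2j * np.pi * freqs * grid / N)
u = F @ f

# agrees with the first K outputs of the FFT
print(np.allclose(u, np.fft.fft(f)[:K]))  # True
```

The dense matrix costs $O(NK)$ to apply; the butterfly construction below factorizes it into the sparse layered form realized by BNet2.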
3.1 Preliminary
The Fourier transform is a linear operator with complex coefficients. In order to realize complex multiplication and addition via a nonlinear neural network, we first represent a complex number by four real numbers, i.e., a complex number $z$ is represented as
\[
\mathbf{z} = \big( \sigma(\Re z),\ \sigma(-\Re z),\ \sigma(\Im z),\ \sigma(-\Im z) \big)^\top, \qquad (12)
\]
where $\Re z$ and $\Im z$ denote the real and imaginary parts of $z$, and $\sigma(a) = \max(a, 0)$ for any $a \in \mathbb{R}$. The vector form $\mathbf{z}$ contains at most two nonzeros. Complex addition is then vector addition directly, while complex multiplication must be handled carefully. Let $w$ and $z$ be two complex numbers. The multiplication $wz$ is produced as the activation function acting on a matrix-vector multiplication, i.e.,
\[
\mathbf{wz} = \sigma\left( \begin{pmatrix} \Re w & -\Re w & -\Im w & \Im w \\ -\Re w & \Re w & \Im w & -\Im w \\ \Im w & -\Im w & \Re w & -\Re w \\ -\Im w & \Im w & -\Re w & \Re w \end{pmatrix} \mathbf{z} \right). \qquad (13)
\]
In the initialization, all prefixed weights play the role of $w$ instead of $z$. In order to simplify the description below, we define an extensive assign operator $\mathcal{A}$ such that the 4-by-4 matrix in (13) obeys $M = \mathcal{A}(w)$. Without loss of generality, (13) can be extended to complex matrix-vector products, and the notation $\mathcal{A}$ is adapted accordingly as well.
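The representation (12)-(13) can be verified numerically: applying ReLU to $\mathcal{A}(w)\mathbf{z}$ reproduces the four-real representation of $wz$. A small self-contained sketch:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

def rep(z):
    """Represent a complex number as four nonnegative reals, as in (12)."""
    return relu(np.array([z.real, -z.real, z.imag, -z.imag]))

def assign(w):
    """Assign operator A(w): the 4x4 real matrix realizing multiplication by w."""
    a, b = w.real, w.imag
    return np.array([[ a, -a, -b,  b],
                     [-a,  a,  b, -b],
                     [ b, -b,  a, -a],
                     [-b,  b, -a,  a]])

w, z = 1.5 - 0.5j, -2.0 + 3.0j
v = relu(assign(w) @ rep(z))        # network realization of w * z
print(np.allclose(v, rep(w * z)))   # True
```

Note that only half of the rows of $\mathcal{A}(w)\mathbf{z}$ are nonnegative, so the ReLU keeps exactly the entries needed to re-enter the representation (12); this is why the sign-split encoding costs a factor of four in width but no extra depth.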
Another important tool is the Lagrange polynomial on Chebyshev points. The Chebyshev points of order $r$ on $[-1, 1]$ are defined as
\[
t_k = \cos\Big( \frac{2k+1}{2r} \pi \Big), \qquad k \in [r]. \qquad (14)
\]
The associated Lagrange polynomial at $t_k$ is
\[
\mathcal{L}_k(x) = \prod_{p \in [r],\, p \neq k} \frac{x - t_p}{t_k - t_p}. \qquad (15)
\]
If the interval is recentered at $c$ and scaled by $w/2$, then the transformed Chebyshev points obey $g_k = c + \frac{w}{2} t_k$, and the corresponding Lagrange polynomial at $g_k$ satisfies
\[
\mathcal{L}_k\Big( c + \frac{w}{2} x \Big) = \prod_{p \in [r],\, p \neq k} \frac{x - t_p}{t_k - t_p}, \qquad (16)
\]
where the right-hand side is independent of the transformation of the interval.
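The rapid convergence of Chebyshev interpolation for the Fourier kernel on a short interval, which underlies the construction below, can be checked with a few lines (the interval, frequency, and node count are illustrative):

```python
import numpy as np

def cheb_points(r):
    """Chebyshev points of order r on [-1, 1] (first-kind nodes assumed)."""
    return np.cos((2 * np.arange(r) + 1) * np.pi / (2 * r))

def lagrange(k, pts, x):
    """k-th Lagrange basis polynomial on the nodes pts, evaluated at x."""
    out = np.ones_like(x)
    for p, t in enumerate(pts):
        if p != k:
            out = out * (x - t) / (pts[k] - t)
    return out

r, xi = 10, 1.0
c, w = 0.5, 0.25                        # interval center and width
g = c + (w / 2) * cheb_points(r)        # transformed Chebyshev points
x = np.linspace(c - w / 2, c + w / 2, 101)
target = np.exp(2j * np.pi * xi * x)
approx = sum(np.exp(2j * np.pi * xi * g[k]) * lagrange(k, g, x) for k in range(r))
err = np.max(np.abs(approx - target))
print(err < 1e-6)  # True: the interpolation error is tiny for modest r
```

Shrinking the interval width $w$ or increasing $r$ drives the error down exponentially, which is the mechanism behind the exponential convergence claims in Section 3.2 and Section 4.1.1.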
Recall the Chebyshev interpolation representation of the FT, Theorem 2.1 in [16]. We include part of that theorem here in our notation for completeness.

Theorem 3.1 (Theorem 2.1 in [16]). Let $w_A$ and $w_B$ be two parameters such that $w_A w_B \leq 1$. For any $r \geq 1$, let $A$ and $B$ denote two connected subdomains of $[0, K)$ and $[0, 1)$ with length $w_A$ and $w_B$ respectively. Then for any $\xi \in A$ and $x \in B$, there exists a Chebyshev interpolation representation of the FT kernel,
\[
e^{-2\pi i \xi x} \approx \sum_{k \in [r]} e^{-2\pi i \xi_c (x - g_k)}\, \mathcal{L}_k(x)\, e^{-2\pi i \xi g_k}, \qquad (17)
\]
where $\xi_c$ is the center of $A$ and $g_k$ are the Chebyshev points on $B$. Obviously, part of the approximation, $e^{-2\pi i \xi_c (x - g_k)} \mathcal{L}_k(x)$, admits the convolutional structure across all $B$. This part will be called the interpolation part in the following. It is the key to initializing CNN and BNet2 as an FT.
3.2 Fourier Transform Initialization for CNN and BNet2
Since all weights fit perfectly into the structure of BNet2, we only introduce the initialization of BNet2 in detail. Assume the input is a function discretized on a uniform grid of $[0, 1)$ with $N$ points and the output is the discrete FT of the input at $K$ frequencies. Throughout all layers, the bias terms are initialized with zero. In the description below, we focus on the initialization of the multiplicative weights. Without loss of generality, we further assume the sizes are compatible, i.e., the spatial DOFs reduce to one at Layer $L$.

Layer 0: For $\ell = 0$, we consider $A = [0, K)$ and the intervals $B_j$ of length $k/N$ for $j \in [N/k]$, which satisfies the condition in Theorem 3.1. An index pair $(\tau, k')$, with $\tau$ being the index of the bipartition of $A$ and $k'$ being the index of the Chebyshev points, can be reindexed as a single channel index; hence we abuse the channel index as $c' = (\tau, k')$. Then, fixing $c'$, the interpolation part is the same for all $B_j$, which is naturally a non-overlapping convolution. Hence we set
\[
W^{(0)}_{s, (\tau, k')} = \mathcal{A}\Big( e^{-2\pi i \xi_c^{(\tau)} (x_s - g_{k'})}\, \mathcal{L}_{k'}(x_s) \Big) \qquad (18)
\]
for $s \in [k]$ and $k' \in [r]$, where $\xi_c^{(\tau)}$ is the center of the $\tau$-th half of $A$, $x_s$ is the $s$-th grid point in $B_0$, and $g_{k'}$ is the Chebyshev point on $B_0$. Then, after applying the 1D convolutional layer as in (1), the first hidden variable represents the input vector interpolated to the Chebyshev points on each $B_j$ with respect to each half of $A$. The following layers recursively apply Theorem 3.1 to the remaining part.

Layer $\ell$ ($0 < \ell < L$): At the current layer we consider the frequency subdomains $A_\xi$ for $\xi \in [2^{\ell+1}]$ and the spatial intervals $B_j$ for $j \in [n_{\ell+1}]$. The hidden variable $h^{(\ell)}$ represents the input interpolated to the Chebyshev points on the previous intervals with respect to $A_{\xi'}$, where $\xi' \in [2^\ell]$ and $k'$ is the index of Chebyshev points. Each $A_{2\xi'+t}$ is a subinterval of $A_{\xi'}$, and each current $B_j$ covers two adjacent previous intervals. For a fixed pair of channel indices, the interpolation part is the same for each spatial index $j$. The convolution kernel, hence, is defined as
\[
W^{(\ell)}_{s, (\xi', k'), (2\xi'+t, k)} = \mathcal{A}\Big( e^{-2\pi i \xi_c^{(2\xi'+t)} (g^{s}_{k'} - g_{k})}\, \mathcal{L}_{k}(g^{s}_{k'}) \Big), \qquad (19)
\]
where $s \in \{0, 1\}$, $t \in \{0, 1\}$, $g^{s}_{k'}$ and $g_{k}$ are Chebyshev points on the $s$-th previous subinterval and on the current $B_j$ respectively, and $\xi_c^{(2\xi'+t)}$ is the center of $A_{2\xi'+t}$.

Layer $L$: This layer concerns $A_\xi$ for $\xi \in [2^L]$ and $B = [0, 1)$. All previous layers take care of the interpolation part, and the current layer applies the FT operator on each $A_\xi$. The hidden variable $h^{(L)}$ represents the input interpolated to the Chebyshev points on $B$ with respect to $A_\xi$, where $\xi \in [2^L]$ and $k'$ is the index of Chebyshev points. Define the output channel index as an index pair $(\xi, i)$, where $\xi$ is the index of $A_\xi$ and $i$ is the index of the uniform frequency points $\xi_i \in A_\xi$. Then the multiplicative weights are initialized as
\[
W^{(L)}_{(\xi, k'), i} = \mathcal{A}\Big( e^{-2\pi i \xi_i g_{k'}} \Big). \qquad (20)
\]
Since BNet2 can be viewed as a CNN with many zero weights, such an initialization can be used to initialize a CNN as well: we set the weights as above and set the rest of the weights to zero, and the CNN is then initialized as an FT.
As mentioned in Remark 2.2, a few irregular places in the current CNN and BNet2 description can be modified to match the conventional CNN, and the FT initialization can be updated accordingly. First, when the non-overlapping convolutions are replaced by regular convolutional layers with pooling, we can enlarge the kernel size and embed zeros to eliminate the impact of the overlapping part. Second, when the kernel sizes are a constant different from 2, the generalization of the initialization is feasible as long as the bipartition is modified to a multi-partition.
The approximation power of FT-initialized CNN and BNet2 can be analyzed in a way analogous to that in [16], with a similar result:

Theorem 3.2. Let $N$ and $K$ denote the sizes of the input and output respectively, and let the depth $L$ and channel parameter $r$ satisfy the compatibility conditions of the construction above. Then there exists a BNet2/CNN, $\mathcal{N}$, approximating the discrete FT operator $\mathcal{F}$ such that for any bounded input vector $f$, the error satisfies
\[
\big\| \mathcal{N}(f) - \mathcal{F} f \big\|_2 \leq C\, \Lambda_r^{L} \Big( \frac{\pi e}{4r} \Big)^{r} \| f \|_2, \qquad (21)
\]
where $C$ is a constant depending only on $N$ and $K$, and $\Lambda_r \leq \frac{2}{\pi} \ln r + 1$ is the Lebesgue constant of interpolation at $r$ Chebyshev points.
The proof of Theorem 3.2 is composed of layer-by-layer estimations on the multiplicative weight matrices, analogous to those in [16]; hence we omit the proof. Theorem 3.2 is validated numerically below.

4 Numerical Results
This section presents numerical experiments to demonstrate the approximation power of CNN and BNet2 and to compare FT initialization against random initialization. Thus, four different settings, CNN with random initialization (CNN-rand), CNN with FT initialization (CNN-FT)¹, BNet2 with random initialization (BNet2-rand), and BNet2 with FT initialization (BNet2-FT), are tested on three different sets of problems: (1) approximation of the FT operator; (2) energy and solution maps of elliptic equations; (3) 1D signal deblurring and denoising tasks. ¹The Layer $L$ is often combined with feature layers; hence, for both CNNs, Layer $L$ as in BNet2 is adopted.
4.1 Approximation of Fourier Transform Operator
This section repeats the experiments of the original BNet [16] on BNet2, namely approximation power before training, approximation power after training, transfer learning capability, and robustness to adversarial attack.
4.1.1 Approximation Power Before Training
The first experiment aims to validate the exponential decay of the approximation error of BNet2 as either the depth or the number of Chebyshev points increases. We construct and initialize a BNet2 to approximate a discrete FT operator with input length $N = 16384$ and output length $K$, representing integer frequencies. The approximation power is measured under relative operator norms, i.e., $\|\mathcal{N} - \mathcal{F}\| / \|\mathcal{F}\|$, where $\mathcal{N}$ and $\mathcal{F}$ denote the BNet2 and the FT operator respectively.
Table 1 ($N = 16384$): relative approximation errors in three norms for two settings; each half-row lists the varied parameter followed by the errors.

6 | 3.48e-02 | 5.25e-02 | 6.30e-02 || 8 | 3.80e-02 | 7.26e-02 | 6.94e-02
7 | 2.18e-03 | 4.18e-03 | 6.36e-03 || 9 | 2.39e-03 | 6.05e-03 | 6.95e-03
8 | 1.37e-04 | 2.84e-04 | 5.30e-04 || 10 | 1.54e-04 | 4.31e-04 | 5.73e-04
9 | 8.96e-06 | 1.79e-05 | 4.08e-05 || 11 | 1.05e-05 | 2.89e-05 | 4.37e-05
10 | 6.41e-07 | 1.16e-06 | 3.11e-06 || 12 | 7.64e-07 | 1.86e-06 | 3.30e-06
In Table 1, the number of Chebyshev points is varied for two choices of the remaining hyperparameters, with $N = 16384$ fixed. All errors with respect to the different norms decay exponentially as the number of Chebyshev points increases. The decay rates for the two settings remain similar, while the prefactor is slightly larger for the larger setting.

In the table in Figure 2, we calculate the logarithms of the rates of convergence under different norms. The table shows that the convergence rates measured under different norms stay similar for any fixed setting, and that the convergence rate decreases as the depth increases.

All of the above convergence behaviors agree with the analysis in [16], and all rates we obtained are better than the corresponding theoretical ones. In summary, when approximating the FT operator using an FT-initialized BNet2, the approximation error decays exponentially as the number of Chebyshev points increases, and the rate of convergence decreases as the depth increases.
4.1.2 Approximation Power After Training
The second numerical experiment aims to demonstrate the approximation power of the four networks in approximating the FT operator after training.
Each data point used in this section is generated as follows. We first generate an array of random complex numbers with real and imaginary parts drawn uniformly; the zero frequency is a random real number. Second, we apply a Gaussian mask centered at a low frequency to the array. The array is then conjugate-symmetrized to be a frequency vector, and the inverse discrete FT is applied to obtain the real input vector. A normalizing constant is chosen such that the two-norm of the resulting vector is close to one.

In this experiment, the input length, output length, level number $L$, and channel parameter $r$ are fixed across all four networks. All networks are trained under the infinite-data setting, i.e., the training data is randomly generated on the fly. The ADAM optimizer with a fixed batch size and an exponentially decaying learning rate is adopted. For FT-initialized networks, the maximum number of training steps is 10,000, whereas for random initialization we train for 20,000 steps. The reported relative error in the vector two-norm is calculated on a held-out testing data set. Default values are used for any unspecified hyperparameters.
Network | Parameters | Pre-Train Rel Err | Test Rel Err
BNet2-FT | 9252 | 1.84e-3 | 1.33e-5
BNet2-rand | | 1.38e+0 | 1.34e-2
CNN-FT | 49572 | 1.84e-3 | 9.29e-6
CNN-rand | | 5.09e+0 | 7.41e-2
Table 2 shows the pre-training and testing relative errors for BNet2-FT, BNet2-rand, CNN-FT, and CNN-rand. Comparing the results, BNet2 and CNN have similar performance under both initializations, while BNet2 has less than a fifth of the parameters of CNN. Hence the extra coefficients in CNN do not improve the approximation to the FT operator. On the other hand, FT initialization leads, after training, to accuracy far better than that of BNet2-rand and CNN-rand; training the FT-initialized networks gains an extra two digits of accuracy for both BNet2-FT and CNN-FT. We conjecture that the local minimum found through training from the FT initialization sits in a narrow and deep well of the energy landscape, such that random initialization with stochastic gradient descent is not able to find it efficiently.
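The data pipeline of this section (uniform random spectrum, Gaussian mask, conjugate symmetrization, inverse FFT) can be sketched as follows; the mask parameters and the normalization convention are illustrative, since the paper's exact constants are not reproduced here:

```python
import numpy as np

def random_lowfreq_signal(n, mean=0.0, std=2.0, seed=0):
    """Random real signal with a Gaussian-masked low-frequency spectrum
    (illustrative re-creation of the data pipeline; constants are ours)."""
    rng = np.random.default_rng(seed)
    half = n // 2
    spec = rng.uniform(-1, 1, half) + 1j * rng.uniform(-1, 1, half)
    spec[0] = rng.uniform(-1, 1)                 # zero frequency kept real
    mask = np.exp(-0.5 * ((np.arange(half) - mean) / std) ** 2)
    spec = spec * mask
    # conjugate-symmetrize so that the inverse FFT is real
    full = np.zeros(n, dtype=complex)
    full[:half] = spec
    full[half + 1:] = np.conj(spec[1:][::-1])
    f = np.fft.ifft(full).real
    return f / np.linalg.norm(f)                 # normalize to unit two-norm

f = random_lowfreq_signal(128)
print(np.isrealobj(f), abs(np.linalg.norm(f) - 1.0) < 1e-12)
```

Shifting the mask `mean` toward higher frequencies is exactly how the transfer-learning testing sets of Section 4.1.3 differ from the training set.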
4.1.3 Transfer Learning Capability
This numerical experiment compares the transfer learning capability of the four networks. The training and testing data are generated in the same way as in Section 4.1.2 with different choices of the Gaussian mask. We have three training sets: a low-frequency training set, a high-frequency training set, and a mixture training set (no Gaussian mask). A sequence of testing sets is generated with the center of the Gaussian mask sweeping across the frequency window.

The networks used here have the same structure and hyperparameters as in Section 4.1.2 with a smaller channel parameter. Each experiment is repeated several times; the mean and standard deviation of the error in the two-norm are reported below.
As shown in Figure 3, under both initializations, BNet2 and CNN have similar accuracy, especially on testing sets away from the training set. Taking the FT initialization before training as a reference, we also notice that even if randomly initialized networks can reach the accuracy of the reference on some testing sets, they lose accuracy on transferred testing sets. On the other hand, FT-initialized networks after training maintain better accuracy than the reference on all testing sets. In terms of the stability of results after training, BNet2-FT and CNN-FT are much more stable than BNet2-rand and CNN-rand, which is due to the randomness in the initializers. This phenomenon also emphasizes the advantage of FT initialization in stability and repeatability.
4.2 Energy and Solution Map of Elliptic PDEs
This section focuses on the elliptic PDE of the following form,
\[
-\frac{d}{dx}\Big( a(x) \frac{d}{dx} u(x) \Big) + \varepsilon\, u^3(x) = f(x) \qquad (22)
\]
with periodic boundary conditions, where $a(x)$ denotes the coefficient and $\varepsilon$ denotes the strength of the nonlinearity. Such equations appear in a wide range of physical models governed by Laplace's equation, the Stokes equation, etc. Equation (22) is discretized on a uniform grid with $N$ points.
4.2.1 Energy of Laplace Operator
In this section, we aim to construct an approximation of the energy functional of the 1D Poisson equation, i.e., (22) with $a \equiv 1$ and $\varepsilon = 0$. The energy functional of the Poisson equation is defined as the negative inner product of $f$ and $u$, which can also be approximated by a quadratic form of the leading low-frequency Fourier components of $f$. Hence, we adopt BNet2-FT, BNet2-rand, CNN-FT, and CNN-rand with an extra square layer, which is called the task-dependent layer.

In this numerical example, the input has the same distribution as that in Section 4.1.2. All other hyperparameters of the networks and the training setting are also identical to those in Section 4.1.2.
Network | Parameters | Pre-Train Rel Err | Test Rel Err
BNet2-FT | 9268 | 2.11e-3 | 8.10e-6
BNet2-rand | | 7.97e-1 | 4.62e-3
CNN-FT | 49588 | 2.11e-3 | 4.79e-6
CNN-rand | | 5.53e-1 | 6.21e-3
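The statement that the Poisson energy equals a quadratic form of the Fourier components of $f$ (which the extra square layer mimics) can be checked spectrally; the source term below is an arbitrary smooth zero-mean example of our own choosing:

```python
import numpy as np

N = 256
x = np.arange(N) / N
f = np.cos(2 * np.pi * x) + 0.5 * np.sin(6 * np.pi * x)  # example source term

fh = np.fft.fft(f) / N                     # Fourier coefficients of f
k = np.fft.fftfreq(N, d=1.0 / N)           # integer wave numbers
uh = np.zeros_like(fh)
uh[k != 0] = fh[k != 0] / (2 * np.pi * k[k != 0]) ** 2
u = np.fft.ifft(uh * N).real               # zero-mean solution of -u'' = f

# energy as an inner product vs. a quadratic form of Fourier components
E_inner = -np.mean(f * u)
E_quad = -np.sum(np.abs(fh[k != 0]) ** 2 / (2 * np.pi * k[k != 0]) ** 2)
print(np.allclose(E_inner, E_quad))  # True, by Parseval's identity
```

Since the energy is a sum of squared (scaled) Fourier coefficients, an FT-initialized network followed by a square layer starts out already close to the target map.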
4.2.2 Endtoend Linear Elliptic PDE Solver
In this section, we aim to represent the end-to-end solution map of a linear elliptic PDE by U-Net-like architectures. The linear elliptic PDE is (22) with $\varepsilon = 0$ and the high-contrast piecewise-constant coefficient $a$ given in (23), defined at the uniform grid points in $[0, 1)$.

It is well known that the inverse of the linear constant-coefficient Laplace operator can be represented as $\mathcal{F}^{-1} D \mathcal{F}$, where $\mathcal{F}$ denotes the FT and $D$ is a diagonal operator. Therefore, we design our network in the same spirit. Our network contains three parts: a BNet2/CNN, a fully connected dense layer with bias terms and activation function, and another BNet2/CNN. Since the input $f$ is a real function, we apply odd symmetry to it and initialize the first BNet2/CNN as an FT so that the first part of the network serves as a sine transform. Then we initialize the middle dense layer according to the diagonal operator $D$, and initialize the third part to be an inverse sine transform, so that the overall network is an approximation of the inverse operator. Both BNet2s/CNNs are constructed with the same number of layers and the same channel parameter.

Each training and testing data point is generated as follows. We first generate a random array whose first entry is fixed to be 0 to incorporate the periodic boundary condition and whose remaining entries are uniformly sampled. Then an inverse discrete sine transform is applied to obtain the input vector. The reference solution is calculated through a traditional spectral method on a finer grid. The training and testing data sets are generated once and kept fixed. Other settings are the same as in Section 4.1.2.
Network | Parameters | Linear PDE (Pre-Train / Test Rel Err) | Nonlinear PDE (Pre-Train / Test Rel Err)
BNet2-FT | 17856 | 5.16e-2 / 4.86e-3 | 3.48e+0 / 2.02e-2
BNet2-rand | | 9.75e+0 / 4.43e-2 | 4.37e+2 / 1.00e+0
CNN-FT | 82368 | 5.16e-2 / 3.96e-3 | 3.48e+0 / 1.52e-2
CNN-rand | | 3.53e+0 / 2.03e-2 | 5.65e+2 / 1.00e+0
Table 4 and Figure 4 show that in this end-to-end task, CNN performs slightly better than BNet2 at the cost of more than four times as many parameters. Training from the FT initialization in both cases provides one more digit of accuracy. Figure 4 further shows that BNet2-FT and CNN-FT significantly outperform BNet2-rand and CNN-rand near the sharply changing areas of the solution.
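The $\mathcal{F}^{-1} D \mathcal{F}$ structure that the three-part network mimics can be sanity-checked on a constant-coefficient problem (a toy example of our own; the paper's high-contrast coefficient and sine-transform variant are not reproduced here):

```python
import numpy as np

N = 128
x = np.arange(N) / N
f = np.sin(2 * np.pi * x) + 0.3 * np.cos(8 * np.pi * x)  # zero-mean source

# three-part structure mirrored by the network: FT -> diagonal -> inverse FT
fh = np.fft.fft(f)
k = np.fft.fftfreq(N, d=1.0 / N)
D = np.where(k == 0, 0.0, 1.0 / np.where(k == 0, 1.0, (2 * np.pi * k) ** 2))
u = np.fft.ifft(D * fh).real

# residual check: the spectral second derivative of u recovers f
upp = np.fft.ifft(-(2 * np.pi * k) ** 2 * np.fft.fft(u)).real
print(np.allclose(-upp, f))  # True, since f has zero mean
```

Replacing the exact diagonal $D$ by a trainable dense layer is what lets the network correct for the variable, high-contrast coefficient that the constant-coefficient formula cannot capture.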
4.2.3 Endtoend Nonlinear Elliptic PDE Solver
In this section, we focus on the highly nonlinear elliptic PDE, i.e., (22) with nonzero $\varepsilon$. Reference solutions for nonlinear PDEs are in general difficult and expensive to obtain. Hence, in this section, we apply the solve-train framework proposed in [30] to avoid explicitly solving the nonlinear PDE.
Denoting the nonlinear PDE as an operator $\mathcal{L}$ acting on $u$, i.e., $\mathcal{L}(u) = f$, our loss function here is defined as
\[
\mathrm{Loss} = \frac{1}{|T|} \sum_{f \in T} \big\| \mathcal{L}\big( \mathcal{N}(f) \big) - f \big\|_2^2, \qquad (24)
\]
where $\mathcal{N}$ denotes the neural network and $T$ the training set. The reported relative error is calculated on the testing set $S$ as
\[
\mathrm{Err} = \frac{1}{|S|} \sum_{f \in S} \frac{\big\| \mathcal{L}\big( \mathcal{N}(f) \big) - f \big\|_2}{\| f \|_2}. \qquad (25)
\]
The same networks and other related settings as in Section 4.2.2 are used here.
Table 4 shows that under the solve-train framework, randomly initialized networks are not able to converge to a meaningful result, whereas FT-initialized networks find a representation of the solution map with two digits of accuracy. Partially, this is due to the extra condition number of $\mathcal{L}$ introduced by the solve-train framework in training. Comparing BNet2-FT with CNN-FT, we find conclusions similar to before, i.e., CNN-FT achieves slightly better accuracy at a higher cost in the number of parameters.
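The solve-train loss (24) only queries the forward operator, never a reference solution. A toy sketch with a stand-in operator (the cubic nonlinearity and all constants are our assumptions) shows that an exact solver attains zero residual in the linear case:

```python
import numpy as np

def apply_L(u, eps=0.0):
    """Stand-in PDE operator L(u) = -u'' + eps * u**3 on a periodic grid.
    (The cubic term and constants are our assumptions, not the paper's.)"""
    n = u.size
    k = np.fft.fftfreq(n, d=1.0 / n)
    upp = np.fft.ifft(-(2 * np.pi * k) ** 2 * np.fft.fft(u)).real
    return -upp + eps * u ** 3

def spectral_solver(f):
    """Exact inverse of -d^2/dx^2, playing the role of the trained network."""
    n = f.size
    k = np.fft.fftfreq(n, d=1.0 / n)
    d = np.where(k == 0, 0.0, 1.0 / np.where(k == 0, 1.0, (2 * np.pi * k) ** 2))
    return np.fft.ifft(d * np.fft.fft(f)).real

# loss (24): penalize the PDE residual, so no reference solutions are needed
x = np.arange(128) / 128
fs = [np.sin(2 * np.pi * x), np.cos(4 * np.pi * x)]
loss = np.mean([np.sum((apply_L(spectral_solver(f)) - f) ** 2) for f in fs])
print(loss < 1e-12)  # True: an exact solver has zero residual in the linear case
```

For nonzero `eps`, the same loss remains computable while an exact solver is no longer available, which is precisely the situation solve-train is designed for.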
4.3 Denoising and Deblurring of 1D Signals
In this section, we apply the networks to denoising and deblurring tasks in signal processing. The network structure used in this experiment is a U-Net-like concatenation of two networks, i.e., two BNet2-FTs, two BNet2-rands, two CNN-FTs, or two CNN-rands. Such a structure with FT initialization reproduces a low-pass filter.
The low-frequency true signal is generated as the input vector in Section 4.1.2. Two polluted signals are generated by adding Gaussian noise and by convolving with a Gaussian kernel, respectively; both pollutions introduce a substantial mean relative error.
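The two pollution models can be sketched as follows (the noise level and blur width are illustrative placeholders for the unspecified constants, and the clean signal is a stand-in):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 128
x = np.arange(n) / n
f = np.sin(2 * np.pi * x)                    # stand-in low-frequency signal

# denoising input: additive Gaussian noise
f_noise = f + rng.normal(0.0, 0.1, n)

# deblurring input: periodic convolution with a normalized Gaussian kernel
t = np.minimum(x, 1 - x)                     # periodic distance to 0
kernel = np.exp(-0.5 * (t / 0.02) ** 2)
kernel /= kernel.sum()
f_blur = np.fft.ifft(np.fft.fft(f) * np.fft.fft(kernel)).real

rel = lambda g: np.linalg.norm(g - f) / np.linalg.norm(f)
print(rel(f_noise) > 0 and rel(f_blur) > 0)  # True
```

Note the asymmetry between the two tasks: noise spreads energy across all frequencies, whereas blur only attenuates high frequencies, which is consistent with deblurring benefiting more from the FT initialization in Table 5.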
Regarding the U-Net structure, the first part maps the input signal to half of the frequency window; its output is then conjugate-symmetrized to the full frequency window, and the second part maps back to the signal domain. Both parts use the same number of layers, and the other hyperparameters are the same as those in Section 4.1.2. All relative errors are measured in the two-norm.

Network | Parameters | Denoising: Pre-Train / Test Rel Err | Deblurring: Pre-Train / Test Rel Err
BNet2-FT | 19,392 | 9.56e-2 / 7.54e-3 | 1.64e-1 / 8.02e-4
BNet2-rand | | 1.07e+0 / 1.52e-2 | 1.02e+0 / 1.07e-2
CNN-FT | 83,904 | 9.56e-2 / 7.74e-3 | 1.64e-1 / 8.19e-4
CNN-rand | | 1.23e+0 / 1.28e-2 | 1.05e+0 / 9.95e-3
Table 5 lists the relative errors of the four networks, and Figure 5 shows their performance on an example signal. We observe that, for both tasks, the FT-initialized networks achieve better accuracy than their randomly initialized counterparts. Under the same initialization, BNet2 achieves accuracy similar to CNN with far fewer parameters. Comparing the two tasks, we notice that the improvement of FT initialization over random initialization is more significant for deblurring than for denoising. For denoising, as we enlarge the additive noise level, BNet2-rand and CNN-rand perform almost as well as BNet2-FT and CNN-FT. For deblurring, however, we always observe a significant improvement from FT initialization.
5 Conclusion and Discussion
This paper proposes BNet2, a new kind of structured CNN based on Butterfly Algorithm, together with Fourier transform initialization for CNNs. With sparse acrosschannel connections and assuming the number of layers is , the total number of trainable parameters in BNet2 is , in contrast to in the conventional CNN, for being the length of input and being the number of channels. Also, approximation accuracy of the Fourier transform initialized networks to the Fourier kernel is proved to exponentially converge as the number of layers increases.
The new BNet2 is tested in multiple experiments, including approximation of the Fourier transform operator, transfer-learning capability when the data distribution shifts from training to testing, linear and nonlinear elliptic PDEs, and denoising and deblurring tasks on 1D signals. These experiments demonstrate the efficiency of BNet2 and the advantage of Fourier transform initialization over random initialization. We have found that BNet2, with far fewer parameters, attains training and testing performance similar to its CNN counterpart under the same initialization. Moreover, training Fourier-transform-initialized networks achieves better accuracy than training randomly initialized ones, and maintains that accuracy better across different testing settings.
The work can be extended in several directions. First, a 2D version of BNet2 can be constructed and compared with mainstream CNNs on solving PDEs, image processing tasks, and inverse problems in 2D. Second, one may consider, both in theory and in experiments, the case where the input data contain noise. More experiments concerning local and global adversarial attacks can be conducted as well.
Acknowledgement
The authors thank Bin Dong and Yiping Lu for discussions on image inverse problems. The work of YL is supported in part by the National Science Foundation via grants DMS-1454939 and ACI-1450280. XC is partially supported by NSF (DMS-1818945 and DMS-1820827), NIH, and the Alfred P. Sloan Foundation.
References
 [1] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(7):629–639, 1990.
 [2] Chourmouzios Tsiotsios and Maria Petrou. On the choice of the parameters for anisotropic diffusion in image processing. Pattern Recognition, 46(5):1369–1381, 2013.
 [3] Jian-Feng Cai, Bin Dong, Stanley Osher, and Zuowei Shen. Image restoration: total variation, wavelet frames, and beyond. Journal of the American Mathematical Society, 25(4):1033–1089, 2012.
 [4] Bin Dong, Qingtang Jiang, and Zuowei Shen. Image restoration: wavelet frame shrinkage, nonlinear evolution PDEs, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
 [5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

 [6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
 [7] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2017.
 [8] Abdelrahman Abdelhamed, Stephen Lin, and Michael S. Brown. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1692–1700, 2018.
 [9] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pages 1800–1807. Institute of Electrical and Electronics Engineers Inc., Nov 2017.
 [10] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration, 2015.
 [11] Franck Mamalet and Christophe Garcia. Simplifying ConvNets for fast learning. In Lect. Notes Comput. Sci., volume 7553 LNCS, pages 58–65, 2012.
 [12] Min Wang, Baoyuan Liu, and Hassan Foroosh. Design of efficient convolutional layers using single intrachannel convolution, topological subdivisioning and spatial "bottleneck" structure, 2017.
 [13] Gaihua Wang, Guoliang Yuan, Tao Li, and Meng Lv. A multiscale learning network with depthwise separable convolutions. IPSJ Trans. Comput. Vis. Appl., 10(11), Dec 2018.
 [14] Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núñez. A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019.
 [15] Davis Gilton, Greg Ongie, and Rebecca Willett. Neumann networks for inverse problems in imaging. arXiv preprint arXiv:1901.03707, 2019.
 [16] Yingzhou Li, Xiuyuan Cheng, and Jianfeng Lu. Butterfly-Net: Optimal function representation based on convolutional neural networks, Aug 2019.
 [17] Lexing Ying. Sparse Fourier transform via butterfly algorithm. SIAM J. Sci. Comput., 31(3):1678–1694, jan 2009.
 [18] Emmanuel J. Candès, Laurent Demanet, and Lexing Ying. A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Model. Simul., 7(4):1727–1750, jan 2009.
 [19] Laurent Demanet, Matthew Ferrara, Nicholas Maxwell, Jack Poulson, and Lexing Ying. A butterfly algorithm for synthetic aperture radar imaging. SIAM Journal on Imaging Sciences, 5(1):203–243, jan 2012.
 [20] Yingzhou Li, Haizhao Yang, Eileen R. Martin, Kenneth L. Ho, and Lexing Ying. Butterfly factorization. Multiscale Model. Simul., 13(2):714–732, jan 2015.
 [21] Yingzhou Li, Haizhao Yang, and Lexing Ying. A multiscale butterfly algorithm for multidimensional Fourier integral operators. Multiscale Model. Simul., 13(2):1–18, jan 2015.
 [22] Yuehaw Khoo and Lexing Ying. Switchnet: A neural network model for forward and inverse scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
 [23] Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núñez. A multiscale neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):21, 2019.
 [24] Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. BCR-Net: A neural network based on the nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019.
 [25] Gregory Beylkin, Ronald Coifman, and Vladimir Rokhlin. Fast wavelet transforms and numerical algorithms I. Communications on Pure and Applied Mathematics, 44(2):141–183, 1991.
 [26] Yingzhou Li and Haizhao Yang. Interpolative butterfly factorization. SIAM Journal on Scientific Computing, 39(2):A503–A531, 2017.
 [27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 1097–1105. Curran Associates, Inc., 2012.
 [28] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one. Multiscale Model. Simul., 4(2):490–530, 2005.
 [29] Tony Chan and Jianhong Shen. Image Processing and Analysis: Variational, PDE, Wavelet, and Stochastic Methods. Society for Industrial and Applied Mathematics, 2005.
 [30] Yingzhou Li, Jianfeng Lu, and Anqi Mao. Variational training of neural network approximations of solution maps for physical models, 2019.