Butterfly-Net2: Simplified Butterfly-Net and Fourier Transform Initialization

12/09/2019 ∙ by Zhongshu Xu, et al. ∙ Duke University

Structured CNNs designed using prior information about the problem can improve efficiency over conventional CNNs in various tasks, such as solving PDEs and inverse problems in signal processing. This paper introduces BNet2, a simplified Butterfly-Net that is in line with the conventional CNN. Moreover, a Fourier transform initialization is proposed for both BNet2 and CNN, with guaranteed approximation power to represent the Fourier transform operator. Experimentally, BNet2 and the Fourier transform initialization strategy are tested on various tasks, including approximating the Fourier transform operator, end-to-end solvers of linear and nonlinear PDEs in 1D, and denoising and deblurring of 1D signals. On all tasks, under the same initialization, BNet2 achieves accuracy similar to CNN but has fewer parameters. Fourier transform initialized BNet2 and CNN consistently improve the training and testing accuracy over the randomly initialized CNN.




1 Introduction

Deep convolutional neural networks (CNNs) have been widely applied to solving PDEs as well as inverse problems in signal processing. In both applications, spectral methods, which involve forward and backward Fourier transform operators, serve as a traditional solution. Spectral methods are a classical tool for solving elliptic PDEs. For image inverse problems, primarily image restoration tasks like denoising and deblurring, a large class of PDE methods consider the nonlinear diffusion process [1, 2], which is connected to wavelet frame methods [3, 4]. The operator involved is elliptic, typically the Laplace operator.

Regardless of the rich prior information in these problems, conventional end-to-end deep CNNs, like U-Net [5] and Pix2pix [6], consist of convolutional layers which have fully trainable local filters and are densely connected across channels. The enlarged model capacity and flexibility of CNNs improve the performance in many end-to-end tasks. However, such fully data-driven models may perform very well on one set of training and testing datasets but encounter difficulty when transferred to another dataset, essentially due to overfitting of the trained model, which has a large amount of flexibility [7, 8]. Also, as indicated in [9, 10, 11, 12, 13], the dense channel connections can be heavily pruned in the post-training process without loss of prediction accuracy.

This motivates the design of structured CNNs, which balance model capacity against over-fitting by incorporating prior knowledge of PDEs and signal inverse problems into deep CNN models. The superiority of structured CNNs over ordinary CNNs has been shown both for PDE solvers [14] and for image inverse problems [15]. Several works have borrowed ideas from numerical analysis in deep models: [16] introduces Butterfly-Net (BNet) based on the butterfly algorithm for fast computation of Fourier integral operators [17, 18, 19, 20, 21]; [22] proposes to use a switching layer with sparse connections in a shallow neural network, also inspired by the butterfly algorithm, to solve wave equation based inverse scattering problems; [14] and [23] introduce hierarchical matrices into deep networks to compute nonlinear PDE solutions; [24] proposes a neural network based on the nonstandard form [25] and applies it to approximate nonlinear maps in PDE computation. Problem-prior informed structured CNNs have become an emerging tool for a large class of end-to-end tasks in solving PDEs and inverse problems.

This paper introduces a new Butterfly network architecture, which we call BNet2. The main observation is that, as far as the Fourier transform operator is concerned, the middle layer in the Butterfly algorithm can be removed while preserving the approximation ability of BNet. This leads to the proposed model, which inherits the same approximation guarantee as BNet, while the network architecture is much simplified and in line with the conventional CNN.

We also investigate the Fourier transform (FT) initialization. FT initialization adopts the interpolative construction in Butterfly factorization as in [26] to initialize BNet2. Since BNet2 is now a conventional CNN with sparsified channel connections, FT initialization can also be applied to CNN to realize a linear FT. We experimentally find that both BNet2 and CNN are sensitive to initialization in the problems we test, and FT initialized networks outperform their randomly initialized counterparts in our settings. The network trained from FT initialization also demonstrates better stability when the statistics of the testing dataset shift away from those of the training set.

In summary, the contributions of the paper include: (1) We introduce BNet2, a simplified structured CNN based on the Butterfly algorithm, which removes the middle layer and later layers in BNet and thus is in line with the conventional CNN architecture; (2) FT initialized BNet2 and CNN inherit the theoretical exponential convergence of BNet in approximating the FT operator, and the approximation can be further improved after training on data; (3) Applications to end-to-end solvers for linear and nonlinear PDEs and to inverse problems of signal processing are numerically tested, and under the same initialization BNet2 with fewer parameters achieves accuracy similar to CNN; (4) FT initialization for both BNet2 and CNN serves as an initialization recipe for a large class of CNNs in many applications, and we show that FT initialized BNet2 and CNN outperform randomly initialized CNN on all tasks included in this paper.

2 Butterfly-Net2

The structure of BNet2 inherits part of the design of BNet and extends it to all layers. More precisely, the network structure before the middle layer in BNet is extended to all layers. After removing the middle layer and the second half of the layers in BNet, the structure of BNet2 is exactly a CNN with sparse channel connections. For completeness, we first recall the CNN in our own notation and then introduce BNet2. Towards the end of this section, the numbers of parameters for both CNN and BNet2 are derived and compared.

Before introducing the network structures, we first introduce the notation used throughout this paper. The bracket notation [n] denotes the set of nonnegative integers smaller than n, i.e., [n] = {0, 1, ..., n-1}. Further, a contiguous subset of [n] is denoted as , where is a divisor of n denoting the total number of equal partitions and , indexed from zero, denotes the -th partition, i.e., . While describing the network structure, and denote the input and output vector with length and respectively, i.e., and . , , and denote hidden variables, multiplicative weights and biases respectively. For example, is the hidden variable at -th layer with spatial degrees of freedom (DOFs) and channels; is the multiplicative weights at -th layer with being the kernel size, and being the in- and out-channel sizes; denotes the bias at -th layer acting on channels. The activation function is denoted as , which is ReLU in this paper by default.
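The index-set notation above can be illustrated with a minimal Python sketch. The function names `bracket` and `partition` are hypothetical helpers introduced only for this illustration; they follow the verbal definitions (nonnegative integers below n, and the i-th of k equal contiguous parts, indexed from zero).

```python
def bracket(n):
    """Return the set {0, 1, ..., n-1} as a sorted list."""
    return list(range(n))


def partition(n, k, i):
    """Return the i-th of k equal contiguous parts of bracket(n).

    k must divide n and i is indexed from zero.
    """
    if n % k != 0:
        raise ValueError("k must be a divisor of n")
    if not 0 <= i < k:
        raise ValueError("i must satisfy 0 <= i < k")
    size = n // k
    return list(range(i * size, (i + 1) * size))


if __name__ == "__main__":
    print(bracket(4))          # all indices below 4
    print(partition(8, 4, 1))  # second of four equal parts of bracket(8)
```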

2.1 CNN Revisit

A CNN can be precisely described using the notation defined above. For a layer CNN, we define the feedforward network as follows.

  • Layer 0: The first layer hidden variable with channels is generated by applying a 1D convolutional layer with kernel size and stride , followed by an activation function, to the input vector , i.e.,


    for and .

  • Layer : The connection between the -th layer and the -th layer hidden variables is a 1D convolutional layer with kernel size , stride size , in-channels and out-channels followed by an activation function, i.e.,


    for and . The first and second summations in (2) denote the summation over in-channels and the spatial convolution respectively.

  • Layer : The last layer mainly serves as a reshaping from the channel direction to the spatial direction, which links the -th layer hidden variables with the output via a fully connected layer, i.e.,


    for . If is not the final output, then the bias and activation function can be added.
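The middle layers described above can be sketched in NumPy as a non-overlapping 1D convolution (stride equal to kernel size) followed by ReLU. This is only an illustrative sketch: the function name `nonoverlap_conv1d`, the array layout (spatial index first, channel index second), and the shapes in the usage example are assumptions made here, not the paper's exact indexing.

```python
import numpy as np


def nonoverlap_conv1d(z, w, b):
    """Non-overlapping 1D convolution (stride == kernel size) followed by ReLU.

    z: hidden variable, shape (length, c_in)
    w: multiplicative weights, shape (kernel, c_in, c_out)
    b: biases, shape (c_out,)
    Returns an array of shape (length // kernel, c_out).
    """
    kernel, c_in, _ = w.shape
    length = z.shape[0]
    if length % kernel != 0 or z.shape[1] != c_in:
        raise ValueError("incompatible shapes")
    # Split the spatial axis into consecutive, disjoint blocks of `kernel` points.
    blocks = z.reshape(length // kernel, kernel, c_in)
    # Sum over the spatial offset k inside a block and over the in-channels c.
    out = np.einsum("jkc,kcd->jd", blocks, w) + b
    return np.maximum(out, 0.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal((8, 3))
    w = rng.standard_normal((2, 3, 6))
    b = np.zeros(6)
    print(nonoverlap_conv1d(z, w, b).shape)  # spatial length halves, channels double
```

Each such layer halves the spatial length while the channel count grows, which is the pattern the layer-by-layer description relies on.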

In the above description, readers who are familiar with CNNs may find a few irregularities. We will address them in Remark 2.2 after the introduction of BNet2.

CNNs are among the most successful neural networks in practice, especially in signal processing and image processing. The convolutional structure is the main contributor to this success. Another contributor is the increasing number of channels. In practice, people usually double the channel numbers until reaching a fixed number and then stick to it till the end. Continually doubling the channel numbers usually improves the performance of the CNN, but has two drawbacks. First, large channel numbers lead to a large number of parameters, which in turn leads to overfitting. The second drawback is the expensive computational cost in both training and evaluation.

Figure 1: The comparison between CNN and BNet2: (a) network structure; (b) channel connectivity.

2.2 Butterfly-Net2

BNet2 has the same convolutional structure as CNN and also allows continually doubling the channel numbers. Of the two drawbacks mentioned above, BNet2 overcomes the second one and partially overcomes the first one.

Layer 0 in BNet2 is identical to that in CNN. Hence, we only define the other layers of the feedforward network as follows.

  • Layer : The in-channels are equally partitioned into parts. For each part, a 1D convolutional layer with kernel size , stride , in-channel size and out-channel size is applied. The connection between the -th layer and the -th layer hidden variables obeys,


    for , , and .

  • Layer : The in-channels are equally partitioned into parts. For each part, a 1D convolutional layer with kernel size , in-channel size and out-channel size is applied. The last layer links the -th layer hidden variables with the output , i.e.,


    for and . If is not used as the final output directly, then the bias term and activation function can be added.

Remark 2.2. This remark addresses two irregularities of the CNN and BNet2 described above compared with the conventional CNN. First, all convolutions are performed in a non-overlapping way, i.e., the kernel size equals the stride size. A regular convolutional layer with a pooling layer can be adopted to replace the non-overlapping convolutional layer in both CNN and BNet2. Second, except for Layer 0 and Layer , all kernel sizes are , which can be generalized to other constants for both CNN and BNet2. We adopt this presentation to simplify the notation in Section 3 on FT initialization.

In (4), the in-channel index and the out-channel index of BNet2 are linked through the auxiliary index , whereas in the CNN, the in-channel index and the out-channel index are independent (see (2)). Figure 1 (b) illustrates the connectivity of and in CNN and in BNet2 at the -nd layer. Further, Figure 1 (a) shows the overall structure of CNN and BNet2. If we set part of the multiplicative weights in CNN to those in BNet2 according to (4) and set the remaining multiplicative weights to zero, then CNN recovers BNet2. Hence, any BNet2 can be represented by a CNN, and the approximation power of CNN is guaranteed to be at least that of BNet2. Surprisingly, according to our numerical experiments, the extra approximation power does not improve the training and testing accuracy much in any of the examples we have tested.
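The statement that a CNN recovers BNet2 by zero-filling can be checked with a small NumPy sketch. The sketch assumes a block-diagonal channel ordering (group g reads the g-th block of in-channels and writes the g-th block of out-channels); the paper's exact index linking through the auxiliary index may order channels differently. All function names are hypothetical, and bias and activation are omitted since they are identical in both networks.

```python
import numpy as np


def dense_conv(z, w):
    """Non-overlapping 1D convolution with dense channel connections.

    z: (length, c_in), w: (kernel, c_in, c_out); stride equals kernel size.
    """
    kernel, c_in, _ = w.shape
    blocks = z.reshape(-1, kernel, c_in)
    return np.einsum("jkc,kcd->jd", blocks, w)


def grouped_conv(z, w_groups):
    """Sparse-channel convolution: each group only sees its own in-channels.

    w_groups: (groups, kernel, c_in_per_group, c_out_per_group).
    """
    groups, _, cin_g, _ = w_groups.shape
    outs = [dense_conv(z[:, g * cin_g:(g + 1) * cin_g], w_groups[g])
            for g in range(groups)]
    return np.concatenate(outs, axis=1)


def embed_in_dense(w_groups):
    """Place the group weights on the block diagonal of a dense weight tensor."""
    groups, kernel, cin_g, cout_g = w_groups.shape
    w = np.zeros((kernel, groups * cin_g, groups * cout_g))
    for g in range(groups):
        w[:, g * cin_g:(g + 1) * cin_g, g * cout_g:(g + 1) * cout_g] = w_groups[g]
    return w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_groups = rng.standard_normal((4, 2, 3, 6))
    z = rng.standard_normal((8, 12))
    same = np.allclose(grouped_conv(z, w_groups), dense_conv(z, embed_in_dense(w_groups)))
    print(same)
```

In this layout only one out of every `groups` dense weights is nonzero, which is where the parameter saving comes from.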

2.3 Parameter Counts

Parameter counts are explicit for both CNN and BNet2. The number of biases is identical for the two networks: it is for Layer 0, for Layer , and for Layer . Hence the overall number of biases is


The total numbers of multiplicative parameters are very different for CNN and BNet2. For CNN, the parameter count is for Layer 0, for Layer , and for Layer . Hence the overall number of multiplicative parameters for CNN is


For BNet2, in contrast, the parameter count is for Layer 0, for Layer , and for Layer . Hence the overall number of multiplicative parameters for BNet2 is


If we assume and , which corresponds to doubling channel number till the end, then we have


Let us consider another regime, i.e., and , which can be viewed as an analog of doubling the channel number first and then being fixed to a constant . Under this regime, the total numbers of parameters can be compared as,


where both come from term. Hence, in both regimes of hyper parameter settings, BNet2 has a lower-order number of parameters than CNN. If the training and testing accuracy remains similar, BNet2 is then preferable to the CNN.
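The gap between the two counts can be illustrated numerically. The sketch below makes explicit assumptions that are not taken from the source formulas: middle layer l maps 2^(l-1) r in-channels to 2^l r out-channels with kernel size 2, the sparse variant splits the channels of layer l into 2^(l-1) independent groups, and Layer 0 and the last layer are left out. The function names are hypothetical.

```python
def conv_weight_count(kernel, c_in, c_out, groups=1):
    """Multiplicative weights of a 1D convolution with `groups` independent channel groups."""
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("groups must divide both channel counts")
    return groups * kernel * (c_in // groups) * (c_out // groups)


def middle_layer_counts(num_layers, r, kernel=2):
    """Total middle-layer weights under the assumed channel-doubling schedule.

    Returns (dense_count, grouped_count) summed over layers 1 .. num_layers - 1.
    """
    dense = 0
    grouped = 0
    for layer in range(1, num_layers):
        c_in = 2 ** (layer - 1) * r
        c_out = 2 ** layer * r
        dense += conv_weight_count(kernel, c_in, c_out)
        grouped += conv_weight_count(kernel, c_in, c_out, groups=2 ** (layer - 1))
    return dense, grouped


if __name__ == "__main__":
    for depth in (3, 5, 7):
        print(depth, middle_layer_counts(depth, r=4))
```

Under these assumptions the dense count grows like the square of the final channel number, while the grouped count grows only linearly in it.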

3 Fourier Transform Initialization

A good initialization is crucial in training CNNs, especially in training highly structured neural networks like BNet2. It is known that CNNs with random initialization achieve remarkable results in practical image processing tasks, as shown in [27]. However, for synthetic signal data as in Section 4, for which a CNN with a suitable set of parameters can achieve high-accuracy prediction, we notice that a CNN with random initialization and the ADAM stochastic gradient descent optimizer is not able to converge to that CNN.

In this section, we aim to initialize both BNet2 and CNN to realize the discrete FT operator, which is defined as


for and , where denotes the number of discretization points and denotes the frequency window size. The discrete FT is the traditional computational tool for signal processing and image processing. Almost all related traditional algorithms involve either the FT directly or the Laplace operator, which can be realized via two FTs; see [28, 29]. Hence, if we can initialize a neural network as such a traditional algorithm involving the discrete FT, the training of the neural network can be viewed as refining the traditional algorithm and making it data adaptive. In other words, neural networks solving image processing and signal processing tasks can then be guaranteed to outperform traditional algorithms, which so far has only been widely accepted in practice. This section is composed of two parts: preliminaries and initialization. We first introduce the neural network realization of complex numbers, Chebyshev interpolation, and FT approximation in the preliminary part. Then the initialization for both CNN and BNet2 is introduced in detail.
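As a concrete reference point, a discrete FT restricted to a frequency window can be written as a dense matrix. The sketch below assumes the common convention with kernel exp(-2πi k t / N), no normalization factor, and a window made of the lowest nonnegative frequencies; the source's exact convention is not reproduced here, and `windowed_dft_matrix` is a hypothetical name.

```python
import numpy as np


def windowed_dft_matrix(n_points, n_freqs):
    """Matrix F with F[k, t] = exp(-2*pi*i*k*t / n_points).

    k runs over the first n_freqs nonnegative frequencies and t over the
    n_points grid indices (assumed convention, unnormalized).
    """
    k = np.arange(n_freqs)[:, None]
    t = np.arange(n_points)[None, :]
    return np.exp(-2j * np.pi * k * t / n_points)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    F = windowed_dft_matrix(16, 4)
    # Under this convention the result matches the first entries of NumPy's FFT.
    print(np.allclose(F @ x, np.fft.fft(x)[:4]))
```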

3.1 Preliminary

The Fourier transform is a linear operator with complex coefficients. In order to realize complex number multiplication and addition via a nonlinear neural network, we first represent a complex number as four real numbers, i.e., a complex number is represented as


where and for any . The vector form of contains at most two nonzeros. The complex number addition is the vector addition directly, while the complex number multiplication must be handled carefully. Let be two complex numbers. The multiplication is produced as the activation function acting on a matrix vector multiplication, i.e.,


In the initialization, all prefixed weights are in the role of instead of . In order to simplify the description below, we define an extensive assign operator as such that the 4 by 4 matrix in (13) then obeys . Without loss of generality, (13) can be extended to complex matrix-vector product and notation is adapted accordingly as well.
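The four-real-number representation and the ReLU-based multiplication can be sketched as follows. The ordering of the four components (positive and negative parts of the real part, then of the imaginary part) and the resulting 4 by 4 matrix layout are assumptions chosen to be consistent with the description above; the helper names are hypothetical.

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


def encode(z):
    """Represent z as (Re+, Re-, Im+, Im-); at most two entries are nonzero."""
    return relu(np.array([z.real, -z.real, z.imag, -z.imag]))


def decode(v):
    """Recover the complex number from its four-component representation."""
    return complex(v[0] - v[1], v[2] - v[3])


def weight_matrix(a):
    """4x4 real matrix that realizes multiplication by the fixed complex weight a."""
    ar, ai = a.real, a.imag
    return np.array([
        [ar, -ar, -ai, ai],    # +Re(a*b)
        [-ar, ar, ai, -ai],    # -Re(a*b)
        [ai, -ai, ar, -ar],    # +Im(a*b)
        [-ai, ai, -ar, ar],    # -Im(a*b)
    ])


def multiply(a, b_vec):
    """ReLU applied to a matrix-vector product yields the encoding of a*b."""
    return relu(weight_matrix(a) @ b_vec)


if __name__ == "__main__":
    a, b = complex(1.5, -2.0), complex(-0.5, 3.0)
    print(decode(multiply(a, encode(b))), a * b)
```

Because the output is again in the encoded form, such layers can be chained, and complex addition is plain vector addition of the encodings.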

Another important tool is the Lagrange polynomial on Chebyshev points. The Chebyshev points of order on are defined as,


The associated Lagrange polynomial at is


If the interval is re-centered at and scaled by , then the transformed Chebyshev points obey and the corresponding Lagrange polynomial at is


where is independent of the transformation of the interval.
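The invariance of the Lagrange polynomials under re-centering and scaling can be verified numerically. The sketch assumes Chebyshev extrema points of the form center + (width/2) cos(iπ/(r-1)); the source's exact formula for the points is not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def chebyshev_points(r, center=0.0, width=1.0):
    """r Chebyshev extrema points on the interval with the given center and width (r >= 2)."""
    i = np.arange(r)
    return center + 0.5 * width * np.cos(i * np.pi / (r - 1))


def lagrange_basis(nodes, x):
    """Matrix B with B[k, m] = L_k(x[m]), the k-th Lagrange polynomial at x[m]."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    basis = np.ones((len(nodes), len(x)))
    for k, zk in enumerate(nodes):
        for j, zj in enumerate(nodes):
            if j != k:
                basis[k] *= (x - zj) / (zk - zj)
    return basis


if __name__ == "__main__":
    x = np.linspace(-0.5, 0.5, 11)
    ref = lagrange_basis(chebyshev_points(6), x)
    # Re-center at 3.0 and scale by 0.25: the basis values are unchanged.
    moved = lagrange_basis(chebyshev_points(6, 3.0, 0.25), 3.0 + 0.25 * x)
    print(np.allclose(ref, moved))
```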

Recall the Chebyshev interpolation representation of FT as Theorem 2.1 in [16]. We include part of that theorem here with our notation for completeness.

Theorem 3.1 (Theorem 2.1 in [16]). Let and be two parameters such that . For any , let and denote two connected subdomains of and with length and respectively. Then for any and , there exists a Chebyshev interpolation representation of the FT operator,


where is the center of , are the Chebyshev points on . Obviously, part of the approximation, , admits the convolutional structure across all . This part will be called the interpolation part in the following. It is the key to initializing CNN and BNet2 as an FT.
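The mechanism behind such a representation can be illustrated numerically: on a pair of frequency and time intervals whose widths multiply to about one, the kernel exp(-2πi ξ t) is well approximated by interpolating in t at a few Chebyshev points after factoring out the phase at the frequency center. The sketch below is an illustration under assumed conventions (kernel sign, Chebyshev extrema points, interpolation in the time variable), not a transcription of the theorem; all names are hypothetical.

```python
import numpy as np


def cheb_points(r, center, width):
    """r Chebyshev extrema points on the interval with given center and width."""
    return center + 0.5 * width * np.cos(np.arange(r) * np.pi / (r - 1))


def lagrange_basis(nodes, x):
    """Matrix B with B[k, m] = L_k(x[m])."""
    basis = np.ones((len(nodes), len(x)))
    for k, zk in enumerate(nodes):
        for j, zj in enumerate(nodes):
            if j != k:
                basis[k] *= (x - zj) / (zk - zj)
    return basis


def interpolated_kernel(xi, t, xi_center, t_center, t_width, r):
    """Approximate exp(-2*pi*i*xi*t) for t in an interval using r Chebyshev points."""
    nodes = cheb_points(r, t_center, t_width)
    # Interpolation part: depends on the frequency only through the center xi_center.
    interp = np.exp(-2j * np.pi * xi_center * (t[None, :] - nodes[:, None]))
    interp = interp * lagrange_basis(nodes, t)
    # Remaining part: the FT kernel evaluated at the Chebyshev points.
    kernel_at_nodes = np.exp(-2j * np.pi * xi[:, None] * nodes[None, :])
    return kernel_at_nodes @ interp


if __name__ == "__main__":
    xi = np.linspace(8.0, 12.0, 17)   # frequency interval of width 4
    t = np.linspace(0.25, 0.5, 33)    # time interval of width 1/4
    exact = np.exp(-2j * np.pi * xi[:, None] * t[None, :])
    for r in (4, 6, 8):
        approx = interpolated_kernel(xi, t, 10.0, 0.375, 0.25, r)
        print(r, np.abs(approx - exact).max())
```

The printed errors shrink rapidly as the number of Chebyshev points grows, which is the behavior the initialization builds on.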

3.2 Fourier Transform Initialization for CNN and BNet2

Since all weights fit perfectly into the structure of BNet2, we will only introduce the initialization of BNet2 in detail. Assume the input is a function discretized on a uniform grid of with points and the output is the discrete FT of the input at frequency . Throughout all layers, the bias terms are initialized with zero. In the description below, we focus on the initialization of the multiplicative weights. Without loss of generality, we further assume .

  • Layer 0: For , we consider and for and , which satisfies the condition in Theorem 3.1. An index pair for being the index of and being the index of the Chebyshev points can be reindexed as . Hence we abuse the channel index as . Then fixing , the interpolation part is the same for all , which is naturally a non-overlapping convolution. Hence we set


    for , , and is the Chebyshev point on or . Then after applying the 1D convolutional layer as (1), the first hidden variable represents the input vector interpolated to the Chebyshev points on with respect to . The following layers recursively apply Theorem 3.1 to the remaining part.

  • Layer : We consider and for and at the current layer. The hidden variable represents the input interpolated to the Chebyshev points on with respect to , where and is the index of Chebyshev points. is a subinterval of and covers and . For a fixed , the interpolation part is the same for each index . The convolution kernel, hence, is defined as,


    where , , and are Chebyshev points on and respectively, and .

  • Layer : This layer concerns and for . All previous layers take care of the interpolation part, and the current layer applies the FT operator on each . The hidden variable represents the input interpolated to the Chebyshev points on with respect to , where and is the index of Chebyshev points. Define the channel index as an index pair , where is the index of and is the index for uniform points . Then the multiplicative weights are initialized as,


Since BNet2 can be viewed as a CNN with many zero weights, such an initialization can be used to initialize CNN as well. When we set the weights as above and set the remaining weights to zero, the CNN is then initialized as an FT.

As mentioned in Remark 2.2, a few irregularities in the current CNN and BNet2 description can be modified to match the conventional CNN. The FT initialization can be updated accordingly. First, when convolutions are performed in an overlapping way without a pooling layer, we can enlarge the kernel size and embed zeros to eliminate the impact of the overlapping part. Second, when the kernel sizes are a constant different from , the generalization of the initialization is feasible as long as the bipartition is modified to a multi-partition.

The approximation power of FT initialized CNN and BNet2 can be analyzed analogously to [16], and the result is similar as well:

Theorem 3.2. Let and denote the size of the input and output respectively. The depth and channel parameter satisfy . Then there exists a BNet2/CNN, , approximating the discrete FT operator such that for any bounded input vector , the error satisfies,


where is a constant depending only on and , for .

The proof of Theorem 3.2 is composed of layer-by-layer estimates on the multiplicative weight matrices, analogous to that in [16]. Hence we omit the proof. Later, Theorem 3.2 is validated numerically.

4 Numerical Results

This section presents numerical experiments to demonstrate the approximation power of CNN and BNet2, and to compare FT initialization with random initialization. Four different settings, CNN with random initialization (CNN-rand), CNN with FT initialization (CNN-FT) [footnote 1], BNet2 with random initialization (BNet2-rand), and BNet2 with FT initialization (BNet2-FT), are tested on three different sets of problems: (1) approximation of FT operator; (2) energy and solution maps of elliptic equations; (3) 1D signals de-blurring and de-noising tasks.

Footnote 1: The Layer is often combined with feature layers. Hence for both CNNs, Layer as in BNet2 is adopted.

4.1 Approximation of Fourier Transform Operator

This section repeats the experiments of the original BNet [16] on BNet2, namely approximation power before training, approximation power after training, transfer learning capability, and robustness to adversarial attack.

4.1.1 Approximation Power Before Training

The first experiment aims to validate the exponential decay of the approximation error of the BNet2 as either the depth increases or the number of Chebyshev points increases. We construct and initialize a BNet2 to approximate a discrete FT operator, which has length of input and length of output representing integer frequency on . The approximation power is measured under the relative operator -norm, i.e., , where and denote BNet2 and FT operator respectively.
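A relative operator-norm error of this kind can be computed directly when both operators are available as dense matrices. The sketch assumes the three norms are the induced 1-, 2-, and infinity-norms; the choice of norms and the function name `relative_operator_errors` are assumptions for illustration.

```python
import numpy as np


def relative_operator_errors(approx, exact):
    """Relative error of `approx` against `exact` in the induced 1-, 2-, and inf-norms."""
    errors = {}
    for p in (1, 2, np.inf):
        diff_norm = np.linalg.norm(approx - exact, ord=p)
        errors[p] = diff_norm / np.linalg.norm(exact, ord=p)
    return errors


if __name__ == "__main__":
    n = 8
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    exact = np.exp(-2j * np.pi * k * t / n)  # dense discrete FT matrix
    approx = exact * (1.0 + 1e-3)            # artificial perturbation
    print(relative_operator_errors(approx, exact))
```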

         L    relative errors (three norms)     L    relative errors (three norms)
16384    6    3.48e-02  5.25e-02  6.30e-02      8    3.80e-02  7.26e-02  6.94e-02
         7    2.18e-03  4.18e-03  6.36e-03      9    2.39e-03  6.05e-03  6.95e-03
         8    1.37e-04  2.84e-04  5.30e-04      10   1.54e-04  4.31e-04  5.73e-04
         9    8.96e-06  1.79e-05  4.08e-05      11   1.05e-05  2.89e-05  4.37e-05
         10   6.41e-07  1.16e-06  3.11e-06      12   7.64e-07  1.86e-06  3.30e-06
Table 1: Relative error of BNet2 before training with in approximating FT operator.

In Table 1, we fix the number of Chebyshev points to be and vary for two choices of . All errors with respect to different norms decay exponentially as increases. The decay rates for different s remain similar, while the prefactor is slightly larger for larger .

3    -2.98  -2.89  -2.71    -3.00  -2.82  -2.72    -3.03  -2.83  -2.74
4    -3.92  -3.78  -3.47    -3.92  -3.85  -3.60    -3.90  -3.82  -3.68
5    -4.78  -4.71  -4.45    -4.93  -4.75  -4.62    -4.78  -4.65  -4.52
6    -5.70  -5.54  -5.30    -5.70  -5.62  -5.33    -5.71  -5.66  -5.34

Figure 2: (Left plot) Exponential convergence rate when = 64, = 2. (Right table) Convergence rate of BNet2 before training in approximating FT operator for s. , , and are the logarithms of convergence rates under different norms, .

In the table in Figure 2, we calculate the logarithms of rates of convergence for different s and s under different norms. The table shows that for all choices of the convergence rates measured under different norms stay similar for any fixed . And the convergence rate decreases as increases.

All of the above convergence behaviors agree with the analysis in [16], and all rates we obtained are better than the corresponding theoretical ones. In summary, when approximating the FT operator using FT initialized BNet2, the approximation error decays exponentially as increases and the rate of convergence decreases as increases.

4.1.2 Approximation Power After Training

The second numerical experiment aims to demonstrate the approximation power of the four networks in approximating FT operator after training.

Each data point used in this section is generated as follows. We first generate an array of random complex numbers with each component being uniformly random in . The zero frequency is a random real number. Second, we apply a Gaussian mask with mean and standard deviation to the array. The array is then complex symmetrized to be a frequency vector, and the inverse discrete FT is applied to obtain the real input vector. The constant is chosen such that the two norm of the output vector is close to .
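The data generation steps can be sketched in NumPy as follows. The sampling range [-1, 1], the unnormalized Gaussian mask, the absence of the final scaling constant, and all parameter values in the usage example are assumptions for illustration; `sample_input` is a hypothetical name.

```python
import numpy as np


def sample_input(n_points, n_freqs, mean, std, rng):
    """Draw one real input vector together with its masked low-frequency coefficients.

    Requires 2 * n_freqs <= n_points so that mirrored frequencies do not collide.
    """
    if 2 * n_freqs > n_points:
        raise ValueError("need 2 * n_freqs <= n_points")
    # Random complex coefficients; real and imaginary parts uniform in [-1, 1] (assumed range).
    coeffs = rng.uniform(-1, 1, n_freqs) + 1j * rng.uniform(-1, 1, n_freqs)
    coeffs[0] = coeffs[0].real  # the zero frequency is a real number
    # Gaussian mask over the frequency index.
    mask = np.exp(-0.5 * ((np.arange(n_freqs) - mean) / std) ** 2)
    coeffs = coeffs * mask
    # Complex (Hermitian) symmetrization so that the inverse FT is real.
    spectrum = np.zeros(n_points, dtype=complex)
    spectrum[:n_freqs] = coeffs
    idx = np.arange(1, n_freqs)
    spectrum[n_points - idx] = np.conj(coeffs[idx])
    signal = np.fft.ifft(spectrum)
    return signal.real, coeffs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, coeffs = sample_input(128, 16, mean=0.0, std=4.0, rng=rng)
    print(x.shape, np.allclose(np.fft.fft(x)[:16], coeffs))
```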

In this experiment, we have input length being , output length being , level number being , channel parameter being . All networks are trained under the infinite data setting, i.e., the training data is randomly generated on the fly. The ADAM optimizer with batch size and exponentially decaying learning rate is adopted. For FT initialized networks, the maximum number of training steps is 10,000, whereas for random initialization we train for 20,000 steps. The reported relative error in vector two norm is calculated on a testing data set of size . Default values are used for any unspecified hyper parameters.

Network       Parameters   Pre-Train Rel Err   Test Rel Err
BNet2-FT      9252         1.84e-3             1.33e-5
BNet2-rand    9252         1.38e+0             1.34e-2
CNN-FT        49572        1.84e-3             9.29e-6
CNN-rand      49572        5.09e+0             7.41e-2
Table 2: Relative errors of networks in approximating FT operator before and after training.

Table 2 shows the pre-training and testing relative errors for BNet2-FT, BNet2-rand, CNN-FT, and CNN-rand. BNet2 and CNN perform similarly under both initializations, while BNet2 has only about one fifth as many parameters as CNN (9252 versus 49572). Hence the extra coefficients in CNN do not improve the approximation to the FT operator. On the other hand, FT initialization, even before training, achieves better accuracy than BNet2-rand and CNN-rand do after training. Training the FT initialized networks gains two extra digits of accuracy for both BNet2-FT and CNN-FT. We conjecture that the local minimum found by training from the FT initialization lies in a narrow and deep well of the energy landscape, such that random initialization with stochastic gradient descent is not able to find it efficiently.

4.1.3 Transfer Learning Capability

This numerical experiment compares the transfer learning capability of the four networks. The training and testing data are generated in the same way as in Section 4.1.2 with different choices of and . We have three training sets: low frequency training set ( and ), high frequency training set ( and ) and mixture training set (no Gaussian mask). A sequence of testing sets of size are generated with and .

The networks used here have the same structure and hyper-parameters as in Sec 4.1.2 while the channel parameter . Each experiment is repeated times. Then the mean and standard deviation of the error in two norm are reported below.

Figure 3: Transfer learning results of the four networks trained on three different training sets: (a) low frequency training set; (b) high frequency training set; (c) mixture training set. The horizontal axis represents testing sets with different . The testing results on the mixture testing set are plotted as the isolated error bars. Each error bar represents the mean and standard deviation across repeated experiments. The horizontal dashed lines indicate the testing error before training.

As shown in Figure 3, for both initializations, BNet2 and CNN have similar accuracy, especially on testing sets away from the training set. Taking the FT initialization before training as a reference, we also notice that even if randomly initialized networks can reach the accuracy of the reference on some testing sets, they lose accuracy on transferred testing sets. On the other hand, FT initialized networks after training maintain accuracy better than that of the reference on all testing sets. In terms of stability after training, BNet2-FT and CNN-FT are much more stable than BNet2-rand and CNN-rand; the instability of the latter is due to the randomness in the initializers. This phenomenon also emphasizes the advantage of FT initialization in stability and repeatability.

4.2 Energy and Solution Map of Elliptic PDEs

This section focuses on the elliptic PDE of the following form,


with periodic boundary condition, where denotes coefficients and denotes the strength of nonlinearity. Such an equation appears in a wide range of physical models governed by Laplace’s equation, Stokes equation, etc. Equation (22) is discretized on a uniform grid with points.

4.2.1 Energy of Laplace Operator

In this section, we aim to construct an approximation of the energy functional of 1D Poisson’s equations, i.e., and . The energy functional of Poisson’s equation is defined as the negative inner product of and , which can also be approximated by a quadratic form of the leading low-frequency Fourier components. Hence, here we adopt BNet2-FT, BNet2-rand, CNN-FT, and CNN-rand with an extra square layer, which is called the task-dependent layer.

In this numerical example, the input has the same distribution as that in Section 4.1.2. All other hyper parameters of networks and the training setting are also identical to that in Section 4.1.2.

Network       Parameters   Pre-Train Rel Err   Test Rel Err
BNet2-FT      9268         2.11e-3             8.10e-6
BNet2-rand    9268         7.97e-1             4.62e-3
CNN-FT        49588        2.11e-3             4.79e-6
CNN-rand      49588        5.53e-1             6.21e-3
Table 3: Training results for networks in representing the energy of the Laplace operator.

Table 3 shows the results for the energy of 1D Laplace operators, which have properties similar to those in Table 2. Hence all conclusions in Section 4.1.2 apply here.

4.2.2 End-to-end Linear Elliptic PDE Solver

In this section, we aim to represent the end-to-end solution map of linear elliptic PDEs by U-Nets. The linear elliptic PDE is (22) with high contrast coefficient as,


for being the uniform point in and .

It is well known that the inverse of the linear constant coefficient Laplace operator can be represented by , where denotes the FT and is a diagonal operator. Therefore, we design our network in the same spirit. Our network contains three parts: a BNet2/CNN with input length and output length , a fully connected dense layer with bias terms and activation function, and another BNet2/CNN with input length and output length . Since the input is a real function, we apply odd symmetry to it and initialize the first BNet2/CNN as an FT, so that the first part of the network serves as a sine transform. Then we initialize according to , and initialize the third part to be an inverse sine transform so that the overall network is an approximation of the inverse operator. In this example, we set , , and . Both BNet2s/CNNs are constructed with layers and channel parameter .
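The link between odd symmetry and the sine transform can be checked with a short NumPy sketch: the FT of the odd extension of a real vector is purely imaginary, and its imaginary part is a sine transform up to a factor. The particular extension used here (length doubled, zero inserted at the midpoint) is one standard choice and an assumption; the paper's exact symmetrization is not reproduced.

```python
import numpy as np


def sine_transform_via_fft(x):
    """Return s[k] = sum_j x[j] * sin(pi * j * k / n) for k in 0..n-1, computed by FFT.

    x is a real vector of length n with x[0] assumed to be zero.
    """
    n = len(x)
    # Odd extension of length 2n: ext[2n - j] = -ext[j] for j = 1..n-1.
    ext = np.concatenate([x, [0.0], -x[:0:-1]])
    spectrum = np.fft.fft(ext)
    # For an odd real input, the spectrum equals -2i times the sine sums.
    return -0.5 * spectrum.imag[:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 16
    x = rng.standard_normal(n)
    x[0] = 0.0
    j = np.arange(n)
    direct = np.array([np.sum(x * np.sin(np.pi * j * k / n)) for k in range(n)])
    print(np.allclose(sine_transform_via_fft(x), direct))
```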

Each training and testing data point is generated as follows. We first generate an array of length with random numbers. The first entry is fixed to be 0 to incorporate the periodic boundary condition, whereas the following entries are uniformly sampled from . Then an inverse discrete sine transform is applied to obtain the input vector. The reference solution is calculated through traditional spectral methods on a finer grid of nodes. The training and testing data sets contain and points respectively. Other settings are the same as in Section 4.1.2.

                           Linear PDE                           Nonlinear PDE
Network       Parameters   Pre-Train Rel Err   Test Rel Err     Pre-Train Rel Err   Test Rel Err
BNet2-FT      17856        5.16e-2             4.86e-3          3.48e+0             2.02e-2
BNet2-rand    17856        9.75e+0             4.43e-2          4.37e+2             1.00e+0
CNN-FT        82368        5.16e-2             3.96e-3          3.48e+0             1.52e-2
CNN-rand      82368        3.53e+0             2.03e-2          5.65e+2             1.00e+0
Table 4: Relative errors in approximating the solution map of the linear and nonlinear elliptic PDE.
Figure 4: The left figure shows an example solution and the output from four networks for the linear elliptic PDE. The right figure is a zoom-in of the green box in the left.

Table 4 and Figure 4 show that in this end-to-end task, CNN performs slightly better than BNet2 at the cost of about 4.6 times as many parameters (82368 versus 17856). Training from FT initialization in both cases provides one more digit of accuracy. Figure 4 further shows that BNet2-FT and CNN-FT significantly outperform BNet2-rand and CNN-rand near sharp changing areas in .

4.2.3 End-to-end Nonlinear Elliptic PDE Solver

In this section, we focus on a highly nonlinear elliptic PDE (22) with and . The reference solution for nonlinear PDEs in general is difficult and expensive to obtain. Hence, in this section, we apply the solve-train framework proposed in [30] to avoid explicitly solving the nonlinear PDE.

Denoting the nonlinear PDE as an operator acting on , i.e., , our loss function here is defined as


where denotes the used neural network. The reported relative error is calculated on testing data as follows,


The same networks and other related settings as in Section 4.2.2 are used here.
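The idea of training on the PDE residual instead of on reference solutions can be sketched as follows. Everything in this block is an assumption made for illustration: the toy operator `A(u) = -u'' + u**3` with zero Dirichlet boundary values stands in for PDE (22), and the normalization of the loss and of the relative residual need not match the one used in the experiments.

```python
import numpy as np

def apply_operator(u, h):
    """Toy nonlinear elliptic operator A(u) = -u'' + u**3 (assumed stand-in)."""
    padded = np.pad(u, 1)  # zero boundary values on both sides
    lap = (padded[:-2] - 2.0 * padded[1:-1] + padded[2:]) / h**2
    return -lap + u**3

def solve_train_loss(net, f_batch, h):
    """Mean squared residual ||A(net(f)) - f||^2; no reference solution is needed."""
    return np.mean([np.sum((apply_operator(net(f), h) - f) ** 2) for f in f_batch])

def residual_rel_err(net, f_batch, h):
    """Mean relative residual ||A(net(f)) - f|| / ||f|| over a batch."""
    return np.mean([
        np.linalg.norm(apply_operator(net(f), h) - f) / np.linalg.norm(f)
        for f in f_batch
    ])

n = 32
h = 1.0 / (n + 1)
x = h * np.arange(1, n + 1)
u_true = np.sin(np.pi * x)
f = apply_operator(u_true, h)          # right-hand side consistent with u_true

exact_net = lambda rhs: u_true         # hypothetical "network" returning the solution
zero_net = lambda rhs: np.zeros_like(rhs)

loss_exact = solve_train_loss(exact_net, [f], h)
rel_zero = residual_rel_err(zero_net, [f], h)
```

Under this assumed metric, a map that returns the exact solution has zero loss, while a map that returns the zero vector has a relative residual of exactly 1, because the toy operator sends zero to zero.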

Table 4 shows that under the solve-train framework, randomly initialized networks are not able to converge to a meaningful result, whereas FT initialized networks find a representation for the solution map with digits of accuracy. Partially, this is due to the extra condition number of introduced by the solve-train framework in training. Comparing BNet2-FT with CNN-FT, we reach similar conclusions as before, i.e., CNN-FT achieves slightly better accuracy at a higher cost in the number of parameters.

4.3 Denoising and Deblurring of 1D Signals

In this section, we apply the networks to denoising and deblurring tasks in signal processing. The network structure used in this experiment is a U-Net, which concatenates two networks, i.e., two BNet2-FT, two BNet2-rand, two CNN-FT or two CNN-rand. Such a structure with FT initialization reproduces a low-pass filter.

The low-frequency true signal is generated in the same way as the input vector in Section 4.1.2. Two polluted signals, and , are generated by adding Gaussian noise with standard deviation and convolving with a Gaussian with standard deviation , respectively. The mean relative errors of and are and respectively.
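A minimal sketch of how such polluted signals could be produced is given below. The signal length `n`, the frequency range of the clean signal, and the two standard deviations `sigma_noise` and `sigma_blur` are assumed placeholder values, and the blur is implemented here as a periodic (circular) convolution through the FFT, which is one possible choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128                                    # assumed signal length
t = np.arange(n)

# Low-frequency clean signal: random combination of a few sine modes (assumed form).
k = np.arange(1, 8)
clean = np.sin(2.0 * np.pi * np.outer(t, k) / n) @ rng.uniform(-1.0, 1.0, size=k.size)

# Denoising input: additive Gaussian noise.
sigma_noise = 0.05                         # assumed noise level
noisy = clean + rng.normal(0.0, sigma_noise, size=n)

# Deblurring input: periodic convolution with a normalized Gaussian kernel.
sigma_blur = 2.0                           # assumed width, in grid points
dist = np.minimum(t, n - t)                # periodic distance to index 0
kernel = np.exp(-0.5 * (dist / sigma_blur) ** 2)
kernel /= kernel.sum()
blurred = np.real(np.fft.ifft(np.fft.fft(clean) * np.fft.fft(kernel)))

def rel_err(approx, ref):
    """Relative error in the 2-norm."""
    return np.linalg.norm(approx - ref) / np.linalg.norm(ref)

print(rel_err(noisy, clean), rel_err(blurred, clean))
```

The two printed numbers play the role of the relative errors of the polluted signals before any network is applied.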

Regarding the U-Net structure, the first part has input length and output length in representing frequency domain . After that, the output of the first part is complex symmetrized to frequency domain . Then the second part has input length , output length . In both parts, we adopt layers. The other hyperparameters are the same as those in Section 4.1.2. All relative errors are measured in the 2-norm.
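The two-stage structure described above (forward transform to low frequencies, complex symmetrization, inverse transform) acts as a low-pass filter when both stages are exact Fourier transforms. The following sketch illustrates this with dense matrices; it does not reproduce the BNet2 or CNN layers, and the function name `lowpass_two_stage` and the sizes used are assumptions.

```python
import numpy as np

def lowpass_two_stage(signal, n_freq):
    """Low-pass filter a real signal through forward DFT, symmetrization, inverse DFT.

    Assumes n_freq <= len(signal) // 2 so that retained frequencies do not alias.
    """
    n = signal.size
    t = np.arange(n)
    k_pos = np.arange(n_freq)
    # Stage 1: forward transform evaluated only at frequencies 0, ..., n_freq - 1.
    forward = np.exp(-2j * np.pi * np.outer(k_pos, t) / n)
    coef_pos = forward @ signal
    # Complex symmetrization: for a real signal, the coefficient at -k is conj(coefficient at k).
    k_full = np.concatenate([-k_pos[:0:-1], k_pos])
    coef_full = np.concatenate([np.conj(coef_pos[:0:-1]), coef_pos])
    # Stage 2: inverse transform restricted to the retained frequencies.
    inverse = np.exp(2j * np.pi * np.outer(t, k_full) / n) / n
    return np.real(inverse @ coef_full)

n = 64
t = np.arange(n)
low = np.cos(2.0 * np.pi * 3 * t / n)          # kept: frequency 3 < n_freq
high = 0.5 * np.sin(2.0 * np.pi * 20 * t / n)  # removed: frequency 20 >= n_freq
filtered = lowpass_two_stage(low + high, n_freq=8)
print(np.max(np.abs(filtered - low)))
```

In this example the high-frequency component is removed and the low-frequency component is recovered up to rounding error.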

                        Denoising                        Deblurring
Network     Parameters  Pre-Train Rel Err  Test Rel Err  Pre-Train Rel Err  Test Rel Err
BNet2-FT    19,392      9.56e-2            7.54e-3       1.64e-1            8.02e-4
BNet2-rand              1.07e+0            1.52e-2       1.02e+0            1.07e-2
CNN-FT      83,904      9.56e-2            7.74e-3       1.64e-1            8.19e-4
CNN-rand                1.23e+0            1.28e-2       1.05e+0            9.95e-3
Table 5: Relative error of denoising and deblurring of 1D signals.
(a) Example of signal denoising
(b) Example of signal deblurring
Figure 5: (a) and (b) show an example of denoising and deblurring respectively. The right figures are zoom-ins of the green boxes in the left figures.

Table 5 lists all relative errors of the four networks and Figure 5 shows the performance of the four networks on an example signal. We observe that, for both tasks, the FT initialized networks have better accuracy than their randomly initialized counterparts. Under the same initialization, BNet2 achieves similar accuracy to CNN with far fewer parameters. Comparing the two tasks, we notice that the improvement of FT initialization over random initialization is more significant on the deblurring task than on the denoising task. For denoising tasks, as we enlarge the additive noise level, BNet2-rand and CNN-rand perform almost as well as BNet2-FT and CNN-FT. However, for deblurring tasks, we always observe significant improvement from FT initialization.

5 Conclusion and Discussion

This paper proposes BNet2, a new kind of structured CNN based on the butterfly algorithm, together with Fourier transform initialization for CNNs. With sparse across-channel connections and assuming the number of layers is , the total number of trainable parameters in BNet2 is , in contrast to in the conventional CNN, for being the length of input and being the number of channels. Also, the approximation accuracy of the Fourier transform initialized networks to the Fourier kernel is proved to converge exponentially as the number of layers increases.

BNet2 is tested in multiple experiments: approximation of the Fourier transform operator, including the transfer learning capability when the data distribution shifts from training to testing; end-to-end solvers of linear and nonlinear elliptic PDEs; and denoising and deblurring of 1D signals. These experiments demonstrate the efficiency of BNet2 and the advantage of Fourier transform initialization over random initialization. Through these experiments, we have found that BNet2, typically with only parameters, has similar training and testing performance to its CNN counterpart under the same initialization. On the other hand, training Fourier transform initialized networks achieves better accuracy than training randomly initialized ones and maintains accuracy better in different testing settings.

The work can be extended in several directions. First, a 2D version of BNet2 can be constructed and compared with mainstream CNNs in solving PDEs, image processing tasks and inverse problems in 2D. Second, one may consider the case where the input data contain noise, both in theory and in experiments. More experiments concerning local and global adversarial attacks can be done as well.


The authors thank Bin Dong and Yiping Lu for discussions on image inverse problems. The work of YL is supported in part by National Science Foundation via grants DMS-1454939 and ACI-1450280. XC is partially supported by NSF (DMS-1818945 and DMS-1820827), NIH and the Alfred P. Sloan Foundation.


  • [1] Pietro Perona and Jitendra Malik. Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on pattern analysis and machine intelligence, 12(7):629–639, 1990.
  • [2] Chourmouzios Tsiotsios and Maria Petrou. On the choice of the parameters for anisotropic diffusion in image processing. Pattern recognition, 46(5):1369–1381, 2013.
  • [3] Jian-Feng Cai, Bin Dong, Stanley Osher, and Zuowei Shen. Image restoration: total variation, wavelet frames, and beyond. Journal of the American Mathematical Society, 25(4):1033–1089, 2012.
  • [4] Bin Dong, Qingtang Jiang, and Zuowei Shen. Image restoration: Wavelet frame shrinkage, nonlinear evolution pdes, and beyond. Multiscale Modeling & Simulation, 15(1):606–660, 2017.
  • [5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [6] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [7] Tobias Plotz and Stefan Roth. Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1586–1595, 2017.
  • [8] Abdelrahman Abdelhamed, Stephen Lin, and Michael S Brown. A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1692–1700, 2018.
  • [9] François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proc. - 30th IEEE Conf. Comput. Vis. Pattern Recognition, CVPR 2017, pages 1800–1807. Institute of Electrical and Electronics Engineers Inc., 2017.
  • [10] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration, 2015.
  • [11] Franck Mamalet and Christophe Garcia. Simplifying ConvNets for fast learning. In Lect. Notes Comput. Sci., volume 7553 LNCS, pages 58–65, 2012.
  • [12] Min Wang, Baoyuan Liu, and Hassan Foroosh. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial "bottleneck" structure, 2017.
  • [13] Gaihua Wang, Guoliang Yuan, Tao Li, and Meng Lv. An multi-scale learning network with depthwise separable convolutions. IPSJ Trans. Comput. Vis. Appl., 10(11), dec 2018.
  • [14] Yuwei Fan, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical matrices. Multiscale Modeling & Simulation, 17(4):1189–1213, 2019.
  • [15] Davis Gilton, Greg Ongie, and Rebecca Willett. Neumann networks for inverse problems in imaging. arXiv preprint arXiv:1901.03707, 2019.
  • [16] Yingzhou Li, Xiuyuan Cheng, and Jianfeng Lu. Butterfly-Net: Optimal function representation based on convolutional neural networks, 2019.
  • [17] Lexing Ying. Sparse Fourier transform via butterfly algorithm. SIAM J. Sci. Comput., 31(3):1678–1694, 2009.
  • [18] Emmanuel J. Candès, Laurent Demanet, and Lexing Ying. A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Model. Simul., 7(4):1727–1750, 2009.
  • [19] Laurent Demanet, Matthew Ferrara, Nicholas Maxwell, Jack Poulson, and Lexing Ying. A butterfly algorithm for synthetic aperture radar imaging. SIAM Journal on Imaging Sciences, 5(1):203–243, 2012.
  • [20] Yingzhou Li, Haizhao Yang, Eileen R. Martin, Kenneth L. Ho, and Lexing Ying. Butterfly factorization. Multiscale Model. Simul., 13(2):714–732, 2015.
  • [21] Yingzhou Li, Haizhao Yang, and Lexing Ying. A multiscale butterfly algorithm for multidimensional Fourier integral operators. Multiscale Model. Simul., 13(2):1–18, 2015.
  • [22] Yuehaw Khoo and Lexing Ying. Switchnet: A neural network model for forward and inverse scattering problems. SIAM Journal on Scientific Computing, 41(5):A3182–A3201, 2019.
  • [23] Yuwei Fan, Jordi Feliu-Faba, Lin Lin, Lexing Ying, and Leonardo Zepeda-Núnez. A multiscale neural network based on hierarchical nested bases. Research in the Mathematical Sciences, 6(2):21, 2019.
  • [24] Yuwei Fan, Cindy Orozco Bohorquez, and Lexing Ying. Bcr-net: A neural network based on the nonstandard wavelet form. Journal of Computational Physics, 384:1–15, 2019.
  • [25] Gregory Beylkin, Ronald Coifman, and Vladimir Rokhlin. Fast wavelet transforms and numerical algorithms i. Communications on pure and applied mathematics, 44(2):141–183, 1991.
  • [26] Yingzhou Li and Haizhao Yang. Interpolative butterfly factorization. SIAM Journal on Scientific Computing, 39(2):A503–A531, 2017.
  • [27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In F Pereira, C J C Burges, L Bottou, and K Q Weinberger, editors, Adv. Neural Inf. Process. Syst. 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [28] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A review of image denoising algorithms, with a new one. Multiscale Model. Simul., 4(2):490–530, 2005.
  • [29] Tony Chan and Jianlong Shen. Image processing and analysis variational, PDE, wavelet, and stochastic methods. Society for Industrial and Applied Mathematics, 2005.
  • [30] Yingzhou Li, Jianfeng Lu, and Anqi Mao. Variational training of neural network approximations of solution maps for physical models, 2019.