# Butterfly-Net: Optimal Function Representation Based on Convolutional Neural Networks

Deep networks, especially Convolutional Neural Networks (CNNs), have been successfully applied in various areas of machine learning as well as to challenging problems in other scientific and engineering fields. This paper introduces Butterfly-Net, a low-complexity CNN with structured hard-coded weights and sparse across-channel connections, which aims at an optimal hierarchical function representation of the input signal. Theoretical analysis of the approximation power of Butterfly-Net to the Fourier representation of input data shows that the error decays exponentially as the depth increases. Due to the ability of Butterfly-Net to approximate Fourier and local Fourier transforms, the result can be used for approximation upper bound for CNNs in a large class of problems. The analysis results are validated in numerical experiments on the approximation of a 1D Fourier kernel and of solving a 2D Poisson's equation.

## Authors

• 15 publications
• 24 publications
• 63 publications
• ### Approximation and Non-parametric Estimation of ResNet-type Convolutional Neural Networks

Convolutional neural networks (CNNs) have been shown to achieve optimal ...
03/24/2019 ∙ by Kenta Oono, et al. ∙ 0

• ### Butterfly-Net2: Simplified Butterfly-Net and Fourier Transform Initialization

Structured CNN designed using the prior information of problems potentia...
12/09/2019 ∙ by Zhongshu Xu, et al. ∙ 0

• ### Fast Fourier Transformation for Optimizing Convolutional Neural Networks in Object Recognition

This paper proposes to use Fast Fourier Transformation-based U-Net (a re...
10/08/2020 ∙ by Varsha Nair, et al. ∙ 134

• ### Non-asymptotic Excess Risk Bounds for Classification with Deep Convolutional Neural Networks

In this paper, we consider the problem of binary classification with a c...
05/01/2021 ∙ by Guohao Shen, et al. ∙ 0

• ### On the unreasonable effectiveness of CNNs

Deep learning methods using convolutional neural networks (CNN) have bee...
07/29/2020 ∙ by Andreas Hauptmann, et al. ∙ 0

• ### Optimal Approximation with Sparse Neural Networks and Applications

We use deep sparsely connected neural networks to measure the complexity...
08/14/2021 ∙ by Khay Boon Hong, et al. ∙ 0

• ### Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice

Fast Fourier transform, Wavelets, and other well-known transforms in sig...
07/17/2020 ∙ by Nir Ailon, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Deep network is a central tool in machine learning and data analysis nowadays Bengio et al. (2015)

. In particular, Convolutional Neural Network (CNN) has been proved to be a powerful tool in image recognition and representation. Deep learning has also emerged to be successfully applied in solving PDEs

Berg and Nyström (2017); Khoo et al. (2018); Long et al. (2017) and physics problems Behler and Parrinello (2007); E et al. (2017); Schneider et al. (2017); Zhang et al. (2018), showing the potential of becoming a tool of great use for computational mathematics and physics as well. Given the wide application of PDE and wavelet based methods in image and signal processing Cai et al. (2012); Chan and Shen (2005); Mallat (2008), an understanding of CNN’s ability to approximate differential and integral operators will lead to an explanation of CNN’s success in these fields, as well as possible improved network architectures.

The remarkable performance of deep networks across various fields relies on their ability to accurately represent functions of high-dimensional input data. Approximation analysis has been a central topic to the understanding of the neural networks. The classical theory developed in 80’s and early 90’s Barron (1993); Cybenko (1989); Hornik et al. (1989) approximates a target function by a linear combination of sigmoids, which is equivalent to a fully connected neural network with one hidden layer. While universal approximation theorems were established for such shallow networks, the research interest in neural networks only revived in 00’s after observing the successful applications of deep networks, particular the superior performance of CNNs in image and audio signal processing.

Motivated by the empirical success, the approximation advantage of deep networks over shallow ones has been theoretically analyzed in several places. However, most results assume stacked fully connected layers and do not apply to CNN which has specific geometrical constraints: (1) the convolutional scheme, namely local-supported filters with weight sharing, and (2) the hierarchical multi-scale architecture. The approximation power of deep networks with hierarchical geometrically-constrained structure has been studied recently Cohen et al. (2016); Mhaskar et al. (2016); Mhaskar and Poggio (2016), yet the network architecture differ from the standard CNN. We review the related literature more below.

The current paper proposes a specific architecture of convolutional neural networks based on the Butterfly scheme originally developed for the fast computation of special function transforms Michielssen and Boag (1996); O’Neil et al. (2010); Ying (2009) and Fourier integral operators Candès et al. (2007, 2009); Li and Yang (2017); Li et al. (2015a, b). Butterfly

algorithm provides a hierarchical structure with locally low-rank interpolation of kernel functions, and can be applied to solve many PDE problems. In terms of computational complexity, the scheme is near optimal for Fourier kernels and a large class of Fourier integral operators. The proposed

Butterfly-Net explicitly hard-codes the stacked convolutional layers, which collectively computes the Fourier coefficients of the input signal with guaranteed numerical accuracy. Unlike regular CNN which is fully connected across the unordered channels, the channels in the Butterfly-Net has a clear correspondence with the frequency band, namely the position in the spectral representation of the signal, and meanwhile, the inter-channel weights are sparsely connected. Butterfly-Net is much lighter than the usual CNN: the model complexity (in terms of number of parameters in the parametrization) is and computational complexity is , where is the length of the discrete input signal. The approximation error of Butterfly-Net is proved to exponentially decay as the network depth increases, also is numerically validated in Sec. 4. Due to the efficient approximation of Fourier kernels, Butterfly-Net thus possesses all approximation properties of the Fourier representation of input signals, which is particularly useful for solving PDEs and (local) Fourier-based algorithms in image and signal processing. Our theoretical result provides an approximation upper bound of the CNNs which are trained in practice, an exemplar case of which is shown in Sec. 4.

The contribution of our work is summarized as follows: A low-complexity CNN, motivated by the Butterfly algorithm, is proposed with the following properties.

• [leftmargin=*]

• The network has structured hard-coded weights and sparse across-channel connections. The channel indexing is in order of the frequency band, and the intermediate representations in the network have the interpretation of local Fourier transforms.

• The networks gives a near-optimal hierarchical representation of the Fourier kernel, with model complexity and computational complexity .

• The approximation accuracy to the Fourier Kernel is theoretically proved to be exponentially decay as the depth increases. Combined with fully-connected layers, the approximation analysis of Butterfly-Net provides an approximation upper bound of CNNs in a large class of problems in PDEs and image and signal processing.

Before we explain all these in more detail in the rest of the paper, we review some related works.

Classical approximation results of neural networks. Universal approximation theorems for fully-connected neural networks with one hidden layer were established in Cybenko (1989); Hornik et al. (1989) showing that such network can approximate a target function with arbitrary accuracy if the hidden layer is allowed to be wide enough. In theory, the family of target functions can include all measurable functions Hornik et al. (1989)

, when exponentially many hidden neurons are used.

Gallant and White (1988) proposed “Fourier network”, proving universal approximation to squared-integrable functions by firstly constructing a Fourier series approximation of the target function in a hard-coded way. These theorems are firstly proved for one-dimensional input, and when generalizing to the multivariate case the complexity grows exponentially.

Using the Fourier representation of the target function supported on a sphere in , Barron (1993) showed that the mean squared error of the approximation, integrated with arbitrary data distribution on the sphere, decays as when hidden nodes are used in the single hidden layer. The results for shallow networks are limited, and the approximation power of depth in neural networks has been advocated in several recent works, see below. Besides, while the connection to Fourier analysis was leveraged, at least in Gallant and White (1988) and Barron (1993), it is different from the hierarchical function representation scheme as what we consider here.

Approximation power of deep networks.

The expressive power of deep networks has drawn many research interests after 00’s. The approximation power of multi-layer Restricted Boltzmann Machines (RBM) was studied in

Le Roux and Bengio (2008), which showed that RBMs are universal approximators of discrete distributions and more hidden layers improves the approximation power. Relating to the classical approximation results in harmonic analysis, Bölcskei et al. (2017) derived lower bounds for the uniform approximation of square-integrable functions, and proved the asymptotic optimality of the sparsely connected deep networks as a universal approximator. However, the network complexity also grows exponentially when the input dimension increases.

The advantage of deep architecture over shallow ones has been studied in several works. Delalleau and Bengio (2011) identified a deep sum-product network which can only be approximated by an exponentially larger number of shallow ones. The exponential growth of linear regions as the number of layers increases was studied in Montufar et al. (2014); Telgarsky (2016). Eldan and Shamir (2016) constructed a concrete target function which distinguishes three and two-layer networks. Liang and Srikant (2016)

showed that shallow networks require exponentially more neurons than deep networks to obtain a given approximation error for a large class of functions. The advantage of deep ReLU networks over the standard single-layer ones was analyzed in

Yarotsky (2017) in the context of approximation in Sobolev spaces. The above works address deep networks with fully-connected layers, instead of having geometrically-constrained constructions like CNNs.

Deep networks with these constraints are relatively less analyzed. The approximation power of a hierarchical binary tree network was studied in Mhaskar et al. (2016); Mhaskar and Poggio (2016) which supports the potential advantage of deep CNNs. Cohen et al. (2016)

used convolutional arithmetic circuits to show the equivalence between a deep network and a hierarchical Tucker decomposition of tensors, and proved the advantage of depth in function approximation. The networks being studied differ from the regular CNNs widely used in the typical real world applications.

## 2 Butterfly-Net

Butterfly-Net is a novel structured CNN which requires far less number of parameters to represent functions that can be expressed in the frequency domain. The essential building block of Butterfly-Net is the interpolation convolutional layer, illustrated in Fig. 2 and described in Sec. 2.1, which by itself is also interesting as it gives another way of interpreting channel mixing. Sec. 2.2 assembles interpolation convolutional layers together and form the Butterfly-Net. For simplicity of the notation and indexing, we will limit the discussion to 1D signals, while the extension to 2D and higher dimensional data is straightforward by using a tensor product construction.

### 2.1 Interpolation convolutional layer

To introduce the interpolation convolutional operation, we first introduce an equivalent formula of the usual convolutional layer by “channel unfolding”, and then illustrate the layer through a hierarchical interpolation example.

Let be the input data with length and channels. Assume the 1D convolutional layer maps input channels to output channels and the convolution filter is of size , hence the parameters can be written as where , , and . The output data, under these notations, is written as

 g(j,k2)=∑1≤i≤w,1≤k1≤c1Wk2,i,k1f(i+s(j−1),k1), (1)

where is the stride size. In many problems, it is more convenient to unfold the channel index into the data length, i.e., , , and . Hence one site of

with all channels can be represented as the matrix vector product,

 g[(j−1)c2+(1:c2)]=W[1:c2,1:wc1]f[sc1(j−1)+(1:wc1)]. (2)

Without considering the weight sharing in the convolution layer, all channel direction can be unfolded into the data dimension and the convolution is modified as a block convolution. Such an unfolded convolutional layer will be called the interpolation convolutional layer.

The representation of interpolation convolutional layer is motivated by the observation that function interpolation (source transfer) can be naturally represented as multi-channel convolution. In this setting, unfolding channels is natural. Let be a uniform partition of , i.e., , and denote the -th discretization point in for . We further assume that the locations of relative to are the same for all .

The input data will be viewed as the function value of evaluated at the points , i.e., . Let be the interpolation points on for , with the Lagrange basis polynomial given by

 \calLk2(x)=∏p≠k2x−zBjpzBjk2−zBjp. (3)

The equivalent source of at is then defined as,

 g(zBjk2)=g(j,k2)=c1∑k1=1\calLk2(xBjk1)f(xBjk1)=c1∑k1=1∏p≠k2xBjk1−zBjpzBjk2−zBjpf(xBjk1). (4)

We note that the product of the fractions in (4) depends only on the relative distance and is thus independent of thanks to our assumption on the interpolation points. Therefore, we could denote , and thus the source transfer formula (4) can be interpreted as convolution (1) with the stride size being the same as the filter size, i.e., . In this representation, the two channel indices and denote the original points and interpolation points within each interval . Therefore, unfolding the channel index of both and leads to the natural ordering of the index of points on .

For a CNN with multiple convolutional layers, the unfolding of the channel index could be done recursively. Fig. 2 (a) illustrates 1D interpolation convolutional layers whereas Fig. 2 (b) shows its unfolded representation. Gray zones in both figures indicate instances of the data dependency between layers.

Fig. 2 (b) can also be understood in the view of function interpolation. The domain is first divided into four parts and the first layer interpolates the input function within each part to its three interpolation points. The layer afterwards merges two adjacent parts into one and interpolates the function defined on the previous 6 grid points to the new 3 interpolation points on the merged part.

### 2.2 Butterfly-Net Architecture

Let be the input data viewed as signal in time (extension to higher dimension signals is carried out through tensor product construction). Time-frequency analysis usually splits the signal into different modes according to frequency range, e.g., high-, medium-, low-frequency modes. Most importantly, once the signal is decomposed into different modes, they will be analyzed separately and will not be mixed again. This motivates us to propose non-mixing channels in our Butterfly-Net, since the channel has a correspondence with frequency modes in our setting. Assume that the input vector is of length and the output is a feature vector of length . Let denote the number of major layers in the Butterfly-Net, denote the size mixing channels. Without loss of generality, we assume that is an even integer such that .

We propose the Butterfly-Net architecture as follows (see Fig. 1). The butterfly shape of the network, which is motivated by the hierarchical structure of the Butterfly scheme, results from the complementary low-rank structure of time-frequency analysis. It plays an essential role in the approximation power of the Butterfly-Net, as will be shown in the theoretical analysis.

• [leftmargin=*]

• Preparation Layer. Assume is multiple of , i.e., and the input vector is . A 1D convolution layer with filter size , stride and output channel is added with ReLU activation. The output vector of the preparation layer is denoted as , where is the index of data and is the index of channel. Both second and third indices correspond to channels, the former denotes the mixing channel whereas the later denotes the non-mixing channel (with a single channel for the preparation layer).

• Layer . The input vector of Layer is for , , and . For each of the non-mixing channel , two 1D convolution layer with filter size , stride and output channel is added with ReLU activation. The output vector for each is then denoted as and . Hence, the output vector of Layer is denoted as where is the index of data, is the index of mixing channel, and is the index of non-mixing channel. The output vector matches the input vector at next layer.

• Layer M. The middle layer, Layer M, is a special layer of local operations. Denote the input and output vector of Layer M is and resp. At this level, for each and , , where is a matrix of size .

• Layer . The input vector of Layer is for , , and . For each of the non-mixing channel , two 1D convolution layers with filter size , stride and output channel is added and then the ReLU activations are applied. The output vector for each is then denoted as and . Hence, the output vector of Layer is denoted as where is the index of data, is the index of non-mixing channel, and is the index of mixing channel.

• Feature Layer. The input vector is where and . A 1D convolution layer interpolates the back into output vector of length .

The output of the feature layers can be used as efficient low-dimensional representation of the input function; the approximation power of the Butterfly-Net and its applications will be discussed in Sec. 3.

Fig. 1 demonstrates an example of the Butterfly-Net with input vector being partitioned into 16 parts. We adopts the unfolded representation of the mixing channel as in Fig. 2 (b), and the channel direction only contains non-mixing channels. A matrix representation of the Butterfly-Net is given in Supp. A to facilitate the theoretical analysis of approximation power.

The main advantage of the proposed Butterfly-Net is the reduction of model size and computational complexity. As can be seen from a simple calculation, the overall number of parameters involved is , where is the size of input, and is the size of mixing channels. Such a parametrization is near optimal for many well-known kernels, for instance, discrete Fourier kernel, discrete smooth Fourier integral operator, etc., since it gives almost linear scaling algorithms up to logarithmic factors. And the overall computational cost for a evaluation of the Butterfly-Net is .

The extension of the Butterfly-Net to dimensional input signals, for , is straightforward. Under the same architecture, we can simply replace the index by multi dimensional index, e.g., . This means instead of 1D mixing channels and 1D non-mixing channels, now we have D mixing channels and D non-mixing channels. The filter size and stride size should also be adapted to dimension. The overall number of parameters in this case is , where is the total size of the dimensional input. If filters are assumed to maintain tensor product structure, then both the number of parameters and computational cost can be reduced.

## 3 Analysis of Approximation Power

In this section, we analyze the approximation power of the Butterfly-Net on a specific kernel, discrete Fourier kernel, whose matrix entry is defined as where and

and () respectively. The analysis result shows that though Butterfly-Net by construction has very low complexity as the number of parameters is on the order of the input data size, it exhibits full approximation power in terms of function representations. The generalization to a wider arrange of function representations is possible and will be commented in the sequel.

Let denote the size of the input and denote the length of the domain of the output in the Butterfly-Net. and are two parameters such that . is the depth of the Butterfly-Net and is the size of mixing channels. There exists a parametrized Butterfly-Net, , approximating the discrete Fourier kernel such that for any bounded input vector , the error of the output of the Butterfly-Net satisfies that for any

 \norm\calKf−\calB(f)p≤m1−1pCr,K⎛⎜ ⎜⎝√r(2πlnr+1)2r−2⎞⎟ ⎟⎠L\normfp, (5)

where , and is a constant depending only on and .

The proof of Theorem 3 is constructive. We first fill the Butterfly-Net with a specific set of parameters based on the complementary low-rank property of the discrete Fourier kernel (see Theorem B.1 in Supp. B for a precise statement of the complementary low-rank property). Once the parameters of Butterfly-Net are constructed in this way, which means entries in the matrix representation of the Butterfly-Net are explicitly known, -norm and -norm of each matrix can be bounded. Combined with the low-rank approximation error at different levels, we derive the -norm and -norm upper bound for the Butterfly-Net matrix representation. Applying Riese-Thorin interpolation theorem, we reach to the conclusion of Theorem 3 for general index .

Previously, in the context of fast algorithms, Kunis and Melzer (2012) analyzed the approximations of a simplified Butterfly scheme and Demanet et al. (2012) analyzed general Butterfly

scheme under different error measures on the input and output. While as a side product of our proof, we also obtain the error estimate of the matrix approximation of the general

Butterfly schemes in terms of matrix norms. Supp. B provides the detailed proof of the theorem.

For a problem with fixed input and output size, we can tune two parameters and to reach desired accuracy. As increases, which corresponds to increase the width of each layer, two parts of the error bound, the explicit power and the constant , both decrease. The base of the power decays exponentially as the increase of whereas the constant decays as . Interestingly, when increases, which corresponds to increase the depth of the Butterfly-Net, the error bound decays exponentially in when as can be verified by a simple calculation. Hence we conclude that the error of the Butterfly-Net decays exponentially in for any .

Generalization of the approximation result. While Theorem 3 only considers the Fourier kernel, the result in it can be much generalized by a careful investigation of the proof. First the domain of the discrete Fourier kernel, interval and , can be scaled and shifted. Theorem 3 holds as long as the product of the length of two intervals remain . Thus local Fourier transforms and wavelets type multi-resolution analysis can be easily accommodated. Further, through a parallel reading of the proof of Theorem 3 and Candès et al. (2009); Li and Yang (2017), we can show that similar theorem can be provided for smooth Fourier integral operators (FIOs), i.e., with smooth satisfying homogeneity condition of degree 1. In image and signal processing, Laplace operator, which corresponds to discrete Fourier kernel, is often used for diffusion related process. The usage of Butterfly-Net further enables processing with elliptic operators, whose symbols can be written as smooth FIOs Demanet and Ying (2011), so that to obtain better performance in image and signal processing. The wide applicability of the Butterfly-Net could provide insights of the success of the CNN in image and signal processing.

Butterfly-Net can also be used as a module in a larger network, serving to efficiently approximate a discrete Fourier kernel or Fourier integral operator by a low-complexity CNN sub-network. Many layers can be added before/after Butterfly-Net. For example, in order to estimate the energy functional of a Poisson’s equation, which can be approximated by a quadratic form of the low-frequency Fourier components of the input function, we can add a fully connected layer on the top of the Butterfly-Net to approximate the quadratic form thanks to the power of the universal approximation theorem Barron (1993); Cybenko (1989); Hornik et al. (1989). In this case, we can claim that the Butterfly-Net with a fully connected layer provides more efficient representation than plain fully connected layers. For many other functionals involving the Fourier components, the Butterfly-Net with fully connected layer is capable to represent them. The proof is simply a combined usage of Theorem 3 and the universal approximation theorem of shallow networks.

## 4 Numerical Results

We present two numerical results to demonstrate the approximation power of the Butterfly-Net and its application to energy functional. The first numerical experiment shows that the approximation error of an initialized Butterfly-Net decays exponentially as the increases of the network depth , which verifies the conclusion of Theorem 3. In the second experiment, we construct a CNN with two convolutional layers and a fully connect layer interlacing with ReLU layers, which is a over parametrized version of Butterfly-Net plus fully connected layers. The neural network well approximates the energy functional of a 2D Poisson’s equation.

Butterfly-Net for discrete Fourier transform. The first numerical result aims to verify the exponential decay of the approximation error of the Butterfly-Net as the depth increases. We construct a Butterfly-Net to approximate the discrete Fourier kernel with fixed number of mixing channels, , which is sufficient large for the decay factor in Theorem 3 being smaller than one. The Butterfly-Net is filled with the parameters in the constructive proof of Theorem 3. The input vector in this example is of size whereas different sizes of the output vector are tested. The output vector represents integer frequency of the input function in the frequency domain . The approximation error of the Butterfly-Net is measured against the dense discrete Fourier kernel matrix, and relative matrix -norm error is reported, , where denotes the Butterfly-Net matrix.

Tab. 1 shows for both choices of , the relative approximation errors measured in matrix 1-norm, 2-norm, and -norm decay exponentially as increases. The decay factors for different remain similar, while the prefactor is much larger for large . All of these observations agree with the error bound in Theorem 3.

Butterfly-Net for Discrete Laplace Operator. The second numerical example aims to construct an approximation of the energy functional of 2D Poisson’s equation, as explained in the Section 3 about the general function representation power of the Butterfly-Net. Here we use a regular CNN which can be viewed as Butterfly-Net with all channels mixed. We aims to obtain the energy of Poisson’s equation with periodic boundary condition, where is the Laplace operator, is the input function, and is the solution function. The energy functional of Poisson’s equation is defined as the negative inner product of and , which can also be approximated by a quadratic form of the leading low-frequency Fourier components, which can be rewritten as,

 \calE(f)=−⟨f,u⟩≈∑k∈[−\nicefracK2,\nicefracK2−1)21\absk2\absˆfk2, (6)

where is the -th Fourier component of . If the input function is a linear combination of the Fourier components with , equality is achieved in (6). In this numerical example, we assume the domain of , , is discredited by a uniform grid with 64 points on each dimension. is a smooth periodic random function. It is generated from the Fourier interpolation of a fully random function defined on grid. The reference energy is calculated via the discrete Laplace operator. random instances of is used as training data whereas random instances are used as testing data. Detailed parameters of the network and training parameters can be found in Supp. C.

Fig. 3

shows the decay of relative mean square error (MSE) of the network, which is the mean square error divided by the square of the averaged energy values. The loss of training data flatted after about 300 epochs and the loss of testing data stop improving after 150 epochs. The CNN is able to achieve 3 digits accuracy of the relative MSE, that is about 97% accuracy of the value of the energy.

## 5 Conclusion and Future Works

A low-complexity CNN with structured hard-coded weights and sparse across-channel connections is proposed, motivated by the Butterfly scheme. The hierarchical functional representation by Butterfly-Net is optimal in the sense that the model complexity is and the computational complexity is . The approximation accuracy to the Fourier kernel is proved to exponentially converge as the depths of the Butterfly-Net increases, which provides an approximation upper bound for CNNs in a large class of problems in scientific computing and image and signal processing.

The work can be extended in several directions. First, more applications of the Butterfly-Net should be explored such as those in image analysis and to compare the performance with other CNN architectures. Second, our current theoretical analysis does not address the case when the input data contains noise. In particular, adding rectified layers in Butterfly-Net can be interpreted as a thresholding denoising operation applied to the intermediate representations; a statistical analysis is desired. Finally, the training of the proposed Butterfly-Net with large data set should be addressed, in particular, the convergence of the training and the analysis of generalization of the trained network.

## References

• Barron (1993) Barron, A. R. (1993).

Universal approximation bounds for superpositions of a sigmoidal function.

IEEE Transactions on Information theory, 39(3):930–945.
• Behler and Parrinello (2007) Behler, J. and Parrinello, M. (2007). Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical review letters, 98(14):146401.
• Bengio et al. (2015) Bengio, Y., Goodfellow, I. J., and Courville, A. (2015). Deep learning. Nature, 521(7553):436–444.
• Berg and Nyström (2017) Berg, J. and Nyström, K. (2017). A unified deep artificial neural network approach to partial differential equations in complex geometries. arXiv preprint arXiv:1711.06464.
• Bölcskei et al. (2017) Bölcskei, H., Grohs, P., Kutyniok, G., and Petersen, P. (2017). Optimal approximation with sparsely connected deep neural networks. arXiv preprint arXiv:1705.01714.
• Cai et al. (2012) Cai, J.-F., Dong, B., Osher, S., and Shen, Z. (2012). Image restoration: total variation, wavelet frames, and beyond. Journal of the American Mathematical Society, 25(4):1033–1089.
• Candès et al. (2007) Candès, E. J., Demanet, L., and Ying, L. (2007). Fast computation of Fourier integral operators. SIAM J. Sci. Comput., 29(6):2464–2493.
• Candès et al. (2009) Candès, E. J., Demanet, L., and Ying, L. (2009). A fast butterfly algorithm for the computation of Fourier integral operators. Multiscale Model. Simul., 7(4):1727–1750.
• Chan and Shen (2005) Chan, T. F. and Shen, J. J. (2005). Image processing and analysis: variational, PDE, wavelet, and stochastic methods, volume 94. SIAM.
• Cohen et al. (2016) Cohen, N., Sharir, O., and Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory, pages 698–728.
• Cybenko (1989) Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314.
• Delalleau and Bengio (2011) Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In Advances in Neural Information Processing Systems, pages 666–674.
• Demanet et al. (2012) Demanet, L., Ferrara, M., Maxwell, N., Poulson, J., and Ying, L. (2012). A butterfly algorithm for synthetic aperture radar imaging. SIAM Journal on Imaging Sciences, 5(1):203–243.
• Demanet and Ying (2011) Demanet, L. and Ying, L. (2011). Discrete symbol calculus. SIAM Rev., 53(1):71–104.
• E et al. (2017) E, W., Han, J., and Jentzen, A. (2017).

Deep learning-based numerical methods for high-dimensional parabolic partial differential equations and backward stochastic differential equations.

Communications in Mathematics and Statistics, 5(4):349–380.
• Eldan and Shamir (2016) Eldan, R. and Shamir, O. (2016). The power of depth for feedforward neural networks. In Conference on Learning Theory, pages 907–940.
• Gallant and White (1988) Gallant, A. R. and White, H. (1988). There exists a neural network that does not make avoidable mistakes. In Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego, CA, I.
• Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366.
• Khoo et al. (2018) Khoo, Y., Lu, J., and Ying, L. (2018). Solving for high dimensional committor functions using artificial neural networks. arXiv preprint arXiv:1802.10275.
• Kunis and Melzer (2012) Kunis, S. and Melzer, I. (2012). A stable and accurate butterfly sparse Fourier transform. SIAM J. Numer. Anal., 50(3):1777–1800.
• Le Roux and Bengio (2008) Le Roux, N. and Bengio, Y. (2008).

Representational power of restricted Boltzmann machines and deep belief networks.

Neural computation, 20(6):1631–1649.
• Li and Yang (2017) Li, Y. and Yang, H. (2017). Interpolative butterfly factorization. SIAM J. Sci. Comput., 39(2):A503–A531.
• Li et al. (2015a) Li, Y., Yang, H., Martin, E. R., Ho, K. L., and Ying, L. (2015a). Butterfly factorization. Multiscale Model. Simul., 13(2):714–732.
• Li et al. (2015b) Li, Y., Yang, H., and Ying, L. (2015b). A multiscale butterfly algorithm for multidimensional Fourier integral operators. Multiscale Model. Simul., 13(2):1–18.
• Liang and Srikant (2016) Liang, S. and Srikant, R. (2016). Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161.
• Long et al. (2017) Long, Z., Lu, Y., Ma, X., and Dong, B. (2017). PDE-Net: Learning PDEs from data. arXiv preprint arXiv:1710.09668.
• Mallat (2008) Mallat, S. (2008). A wavelet tour of signal processing: the sparse way. Academic press.
• Mhaskar et al. (2016) Mhaskar, H., Liao, Q., and Poggio, T. (2016). Learning functions: when is deep better than shallow. arXiv preprint arXiv:1603.00988.
• Mhaskar and Poggio (2016) Mhaskar, H. N. and Poggio, T. (2016). Deep vs. shallow networks: An approximation theory perspective. Analysis and Applications, 14(06):829–848.
• Michielssen and Boag (1996) Michielssen, E. and Boag, A. (1996). A multilevel matrix decomposition algorithm for analyzing scattering from large structures. IEEE Trans. Antennas Propag., 44(8):1086–1093.
• Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Advances in neural information processing systems, pages 2924–2932.
• O’Neil et al. (2010) O’Neil, M., Woolfe, F., and Rokhlin, V. (2010). An algorithm for the rapid evaluation of special function transforms. Appl. Comput. Harmon. Anal., 28(2):203–226.
• Rivlin (1990) Rivlin, T. J. (1990). Chebyshev polynomials: from approximation theory to algebra and number theory. Wiley-Interscience, 2nd edition.
• Schneider et al. (2017) Schneider, E., Dai, L., Topper, R. Q., Drechsel-Grau, C., and Tuckerman, M. E. (2017). Stochastic neural network approach for learning high-dimensional free energy surfaces. Physical review letters, 119(15):150601.
• Telgarsky (2016) Telgarsky, M. (2016). Benefits of depth in neural networks. arXiv preprint arXiv:1602.04485.
• Yarotsky (2017) Yarotsky, D. (2017). Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114.
• Ying (2009) Ying, L. (2009). Sparse Fourier transform via butterfly algorithm. SIAM J. Sci. Comput., 31(3):1678–1694.
• Zhang et al. (2018) Zhang, L., Han, J., Wang, H., Car, R., and E, W. (2018). DeePCG: constructing coarse-grained models via deep neural networks. arXiv preprint arXiv:1802.08549.

## Appendix A Matrix Representation of Butterfly-Net

We first show the matrix representation of the interpolation convolutional layer and the matrix representation of the Butterfly-Net simply stacks the interpolation convolutional layer together with an extra middle level local connection. Fig. 4 (a) represents (2) when both and are vectorized. If we permute the row ordering of the matrix, we would result blocks of convolution matrix with the number of blocks being the size of the output channels. Fig. 4 (b) assumes that and the matrix is further simplified to be a block diagonal matrix. According to the figures, we note that when , the transpose of the matrix is also a representation of a interpolation convolutional layer with replaced by .

Since the CNN with mixing channel can already be represented as Fig. 4, Butterfly-Net simply stack interpolation convolutional layer together and drives non-mixing channel, which is equivalent to stack the matrix row-wise. We would explain the matrix representation for each step of the Butterfly-Net.

• [leftmargin=*]

• Preparation Layer. The matrix representation is Fig. 4 (b) with diagonal blocks and each block is of size . The matrix is denoted as .

• Layer . Let the convolution kernel from and to is denoted as . Similarly the convolution kernel from and to is denoted as . Assemble the above small matrices together to get

 H(ℓ)i=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝W(ℓ)2i−1⋱W(ℓ)2i−1\cline1−3W(ℓ)2i⋱W(ℓ)2i⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠,and H(ℓ)=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝H(ℓ)1H(ℓ)2⋱H(ℓ)2L−ℓ⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

Each block and of the top and bottom part of is of form Fig. 4 (b).

• Layer M. The matrix representation of layer M is

 M=⎛⎜ ⎜⎝M1⋱M2\nicefracL2⎞⎟ ⎟⎠,and Mj=⎛⎜ ⎜ ⎜⎝Wj,1⋱Wj,2\nicefracL2⎞⎟ ⎟ ⎟⎠.
• Layer . The matrix representation here, denoted as has exactly the same structure as that of .

• Feature Layer. The matrix representation of this layer is a block diagonal matrix, denoted as .

Based on the matrix representation of each level of the Butterfly-Net, we could write down the overall matrix representation together with ReLU activation layer as,

 g=\calB(f)=UF[G(L)F[⋯G(\nicefracL2+1)F[MF[H(\nicefracL2)F[⋯H(1)F[Vf]]]]]], (7)

where are defined in Supp. and denotes the operation of adding the bias and apply the ReLU activation.

## Appendix B Proof of Theorem 3

This section first derives a low-rank interpolation of the discrete Fourier kernel, which can be understood as a convolution with mixing and non-mixing channels. Then in Sec. B.2, we fill the matrix representation of Butterfly-Net with the interpolation representation. This implies that the discrete Fourier transform (DFT) matrix can be approximately written as Butterfly-Net. Later in Sec. B.3, we prove Theorem 3

### b.1 Low-rank approximation of Fourier kernel

It is well-known that DFT matrix has orthonormal rows and hence, is a full rank matrix. In order to obtain the local convolutional property, we first define the level hierarchical partition of and , respectively. Let be the domain on level . On level , the domain is evenly partitioned into and . We conduct the partition recursively, i.e., is evenly partitioned into and . In the end, the partition on level is and each is of length . Similarly, we conduct the hierarchical bipartition on , denoted them as , and is of length . As we will see, the partition of domain will be used in a complementary way, will be paired with for all choices of and , such that the Fourier kernel restricted to and permits a low-rank approximation, see Theorem B.1. Theorem B.1 actually depends only on the length of the domains of and , but does not depend on the location.

We first give a brief introduction of the Chebyshev interpolation with Chebyshev points. The Chebyshev grid of order on is defined as,

 {zi=12cos((i−1)πr)}ri=1. (8)

The points Chebyshev interpolation of any function on is defined as,

 Πrf(x)=r∑k=1f(zk)\calLk(x), (9)

where is the Lagrange polynomial as in (3).

One important property is that for any fixed , depends only on the relative location of and and is independent of the length and location of the interval. This is essential for convolutional representation.

Several earlier works Candès et al. (2007, 2009); Li et al. (2015b) proved the Chebyshev interpolation representation for Fourier integral kernel, which is a generalized Fourier kernel. Theorem B.1 is a special case of these earlier work but with more precise estimation of the prefactors.

Let and be two parameters such that . For any and , there exists an Chebyshev interpolation representation of the Fourier kernel,

 \abse−2πıξ⋅t−r∑k=1e−2πıξ⋅tke−2πıξ0⋅(t−tk)\calLk(t)≤(2+2πlnr)(2πeK4r2L)r, (10)

and

 \abse−2πıξ⋅t−r∑k=1e−2πı(ξ−ξk)⋅t0\calLk(ξ)e−2πıξk⋅t≤(2+2πlnr)(2πeK4r2L)r, (11)

where and are the centers of and respectively, and are Chebyshev points on and respectively.

Let and be the space spanned by the monomials . The projection operator mapping into its Lagrange interpolation on the Chebyshev grid obeys,

 \normf−Πrf∞≤(2+2πlnr)infg∈\calPr\normf−g∞. (12)

The proof of Lemma B.1 can be found in Rivlin (1990).

###### Proof of Theorem b.1.

At level , we assume and . According to the definition of domain partition, at level , we have , which implies , where denote the function of length.

The Fourier kernel can be factorized as,

 \func\calKξ,t=e−2πı(ξ⋅t−ξ0⋅t−ξ⋅t0+ξ0⋅t0)⋅e−2πıξ0⋅t⋅e−2πıξ⋅t0⋅e2πıξ0⋅t0=e−2πıR(ξ,t)⋅e−2πıξ0⋅t⋅e−2πıξ⋅t0⋅e2πıξ0⋅t0, (13)

where .

Next, we show the -term truncation error of the first term in the second line of (13). Based on the power expansion of , i.e.,

 e−2πıR(ξ,t)=∞∑k=0(−2πıR(ξ,t))kk!, (14)

the -term truncation error can be bounded as,

 δ=\abse−2πıR(ξ,t)−r∑k=0(−2πıR(ξ,t))k)k!=\abs∞∑k=r+1(−2πıR(ξ,t))kk!≤∞∑k=r+11k!(2πK4⋅2L)k≤∞∑k=r+1ekkk(2πK4⋅2L)k≤∞∑k=r+1(2πeK4r2L)k≤(2πeK4r2L)r, (15)

where the last inequality uses . We also notice that, for any fixed , . Applying Lemma B.1, we obtain,

 \norme−2πıR(ξ,t)−r∑k=1e−2πıR(ξ,tk)\calLk(t)≤(2+2πlnr)δ. (16)

By substituting the explicit expression of , we obtain one of the conclusion,

 \norme−2πıξ⋅t−r∑k=1e−2πıξ⋅tke−2πıξ0⋅(t−tk)\calLk(t)≤(2+2πlnr)(2πeK4r2L)r, (17)

for any and .

Similarly, for any fixed , we have . Hence the second conclusion can be obtained through the same procedure. ∎

### b.2 Forward Butterfly-Net

Theorem B.1 provides a low-rank approximation of the Fourier kernel restricted in and . When , we adopt (10) and depends only on relative difference . Summing over all and all , is a convolution kernel with filter size being equal to the stride size. When , we adopt (11) and depends only on relative difference . Summing over all and all , is a block diagonal matrix as the transpose of Fig. 4 (b) and hence is a convolution. We assume the input data is evaluated at equally distributed on and the output data is provided at equally distributed on . Further, we denote and as the centers of and respectively; and as the Chebyshev points of and respectively; and and as the data points in and respectively.

• [leftmargin=*]

• Preparation Layer. Let the diagonal block of be . Then is a matrix of size with entry , where and are row and column index respectively.

• Layer . is a matrix of size with entry and , where is the row index, and and are column index. The former generates the left submatrix of , whereas the later generates the right half submatrix. is of the same format with being replaced by .

• Layer M. is a matrix of size with entry .

• Layer . is a matrix of size with entry and , where is the row index, and is column index. The former generates the left submatrix of , whereas the later generates the right half submatrix. is of the same format with being replaced by .

• Feature Layer. Let the diagonal block of be . Then is a matrix of size with entry , where and are row and column index respectively.

### b.3 Proof

Let be Chebyshev points and be the Lagrange polynomial of order . For any , the Lebesgue constant is bounded as

 Λr=max−1≤x≤1r∑i=1\abs\calLi(x)≤2πlnr+1.

Lemma B.3 is a standard result of Chebyshev interpolation and the proof can be found in Rivlin (1990).

Let be Chebyshev points and be the Lagrange polynomial of order . For any and ,

 max−1≤x≤1\abs\calLi(x)≤2πlnr+1.

Corollary B.3 is an immediate result of Lemma B.3.

Let be the block diagonal matrix defined in the preparation layer, then

where is the number of Chebyshev points and .

###### Proof.

is a block diagonal matrix with block for . are the same with entry . By the definition of matrix 1-norm, we have

 \normV1=\normV11≤maxt∈BL1∑tBL1k\abse−2πıξA000⋅(t−tBL1k)\calLk(t)≤maxt∈BL1∑tBL1k\abs\calLk(t)≤2πlnr+1. (18)

By the definition of matrix -norm, we have

 \normV∞=\normV1∞≤∑t∈BL1∑tBL1k\abse−2πıξA000⋅(t−tBL1k)\calLk(t)≤m(2πlnr+