In the last decade, deep learning algorithms have achieved unprecedented success and state-of-the-art results in various machine learning and artificial intelligence tasks, most notably image recognition, speech recognition, text analysis and Natural Language Processing. Deep Neural Networks (DNNs) are general in the sense of their mechanism for learning features of the data. Nevertheless, in numerous cases, results obtained with DNNs outperformed previous state-of-the-art methods, which often required significant domain knowledge, manifested in hand-crafted features.
Despite the great success of DNNs in many practical applications, the theoretical framework of DNNs is still lacking; along with some decades-old well-known results, developing aspects of such theoretical framework are the focus of much recent academic attention. In particular, some interesting topics are (1) specification of the network topology (i.e., depth, layer sizes), given a target function, in order to obtain certain approximation properties, (2) estimating the amount of training data needed in order to generalize to test data with high accuracy, and also (3) development of training algorithms with performance guarantees.
1.1 The contribution of this work
In this manuscript we discuss the first topic. Specifically, we prove a formal version of the following result:
Theorem 1.1 (informal version).
Let $\Gamma \subset \mathbb{R}^m$ be a smooth $d$-dimensional manifold, let $f \in L^2(\Gamma)$, and let $\delta > 0$ be an approximation level. Then there exists a depth-4 sparsely-connected neural network with $N$ units, where $N = N(\delta, \Gamma, f, m)$, computing a function $f_N$ such that $\|f - f_N\|_2^2 \le \delta$.
The number $N$ depends on the complexity of $f$, in terms of its wavelet representation, on the curvature and dimension of the manifold $\Gamma$, and only weakly on the ambient dimension $m$, thus taking advantage of the possibility that $d \ll m$, which seems to be realistic in many practical applications. Moreover, we specify the exact topology of such a network, and show how it depends on the curvature of $\Gamma$, the complexity of $f$, and the dimensions $d$ and $m$. Lastly, for two classes of functions we also provide approximation error rates: an $L^2$ error rate for functions with sparse wavelet expansion, and a point-wise error rate for functions in $C^2$:
if $f$ has wavelet coefficients in $\ell^1$, then there exists a depth-4 network and a constant $c_f$ so that $\|f - f_N\|_2^2 \le \frac{c_f}{N}$;
if $f \in C^2$ and has bounded Hessian, then there exists a depth-4 network so that $\|f - f_N\|_\infty = O\!\left(N^{-2/d}\right)$.
1.2 The structure of this manuscript
The structure of this manuscript is as follows: in Section 2 we review some of the fundamental theoretical results in neural network analysis, as well as some of the recent theoretical developments. In Section 3 we give a quick technical review of the mathematical methods and results that are used in our construction. In Section 4 we describe our main result, namely the construction of deep neural nets for approximating functions on smooth manifolds. In Section 5 we specify the size of the network needed to learn a function $f$, in view of the construction of the previous section. Section 6 concludes this manuscript.
$\Gamma$ denotes a $d$-dimensional manifold in $\mathbb{R}^m$. $\{(U_i, \phi_i)\}$ denotes an atlas for $\Gamma$. Tangent hyper-planes to $\Gamma$ are denoted by $H_i$. $f$ and variants of it stand for the function to be approximated. $\varphi, \psi$ are scaling (aka "father") and wavelet (aka "mother") functions, respectively. The wavelet terms are indexed by scale $k$ and offset $b$. The support of a function $f$ is denoted by $\mathrm{supp}(f)$.
2 Related work
There is a huge body of theoretical work in neural network research. In this section, we review some classical theoretical results on neural network theory, and discuss several recent theoretical works.
A classical result due to Cybenko, Hornik and others states that Artificial Neural Networks (ANNs) with a single hidden layer of sigmoidal functions can approximate arbitrarily closely any compactly supported continuous function. This result is known as the "Universal Approximation Property". It does not relate, however, the number of hidden units to the approximation accuracy; moreover, the hidden layer might contain a very large number of units. Several works propose extensions of the universal approximation property (see, for example, [9, 8] for a regularization perspective, also using radial basis activation functions, and for a characterization of the activation functions that achieve the universal approximation property).
The first work to discuss the approximation error rate was done by Barron, who showed that given a function $f$ with bounded first moment of the magnitude of its Fourier transform,
$$C_f = \int_{\mathbb{R}^m} |\omega| \, |\hat f(\omega)| \, d\omega < \infty,$$
there exists a neural net with a single hidden layer of $N$ sigmoid units, so that the output $f_N$ of the network satisfies
$$\|f - f_N\|_2^2 \le \frac{c_f}{N},$$
where $c_f$ is proportional to $C_f$. We note that the moment condition above gets more restrictive when the ambient dimension $m$ is large, and that the constant $c_f$ might scale with $m$. The dependence on $m$ is improved in later works; in particular, the constant is improved to be polynomial in $m$. For $k$ times differentiable functions, Mhaskar constructs a network with a single hidden layer of $N$ sigmoid units (with weights that do not depend on the target function) that achieves an approximation error rate
$$O\!\left(N^{-k/m}\right),$$
which is known to be optimal. This rate is also achieved (point-wise) in this manuscript, however, with respect to the dimension $d$ of the manifold, instead of the ambient dimension $m$, which might be a significant difference when $d \ll m$.
During the 1990s, a popular direction in neural network research was to construct neural networks in which the hidden units compute wavelet functions (see, for example, the works of Pati and Krishnaprasad, of Zhang and Benveniste, and of Zhao et al.). These works, however, do not give any specification of network architecture to obtain desired approximation properties.
Several interesting recent theoretical results consider the representation properties of neural nets. Eldan and Shamir construct a radial function that is efficiently expressible by a 3-layer net, while requiring exponentially many units to be represented accurately by shallower nets. Montufar et al. show that DNNs can represent more complex functions than shallow networks with the same number of units can, where complexity is defined as the number of linear regions of the function. Tishby and Zaslavsky propose to evaluate the representations obtained by deep networks via the information bottleneck principle, which is a trade-off between compression of the input representation and predictive ability of the output function; however, they do not provide any theoretical results.
A recent work by Chui and Mhaskar, brought to our attention as a personal communication, constructs a network with similar functionality to the network we construct in this manuscript. In their network the lower layers map the data to local coordinates on the manifold and the upper ones approximate a target function on each chart, however using B-splines rather than wavelets.
3.1 Compact manifolds in $\mathbb{R}^m$
In this section we review the concepts of smooth manifolds, atlases and partition of unity, which will all play important roles in our construction.
Let $\Gamma \subset \mathbb{R}^m$ be a compact $d$-dimensional manifold. We further assume that $\Gamma$ is smooth, and that there exists $r > 0$ so that for all $x \in \Gamma$, $\Gamma \cap B_r(x)$ is diffeomorphic to a disc, with a map that is close to the identity.
A chart for $\Gamma$ is a pair $(U, \phi)$ such that $U \subseteq \Gamma$ is open and
$\phi: U \to \Omega$, where $\phi$ is a homeomorphism and $\Omega$ is an open subset of a Euclidean space.
One way to think of a chart is as a tangent plane at some point $x \in U \subseteq \Gamma$, such that the plane defines a Euclidean coordinate system on $U$ via the map $\phi$.
An atlas for $\Gamma$ is a collection $\{(U_i, \phi_i)\}_{i \in I}$ of charts such that $\bigcup_{i \in I} U_i = \Gamma$.
Let $\Gamma$ be a smooth manifold. A partition of unity of $\Gamma$ w.r.t. an open cover $\{U_i\}$ is a family $\{\eta_i\}$ of nonnegative smooth functions such that for every $x \in \Gamma$, $\sum_i \eta_i(x) = 1$, and for every $i$, $\mathrm{supp}(\eta_i) \subseteq U_i$.
(Proposition in Tu's An Introduction to Manifolds) Let $\Gamma$ be a compact manifold and $\{U_i\}$ be an open cover of $\Gamma$. Then there exists a partition of unity $\{\eta_i\}$ such that each $\eta_i$ is in $C^\infty$, has compact support, and $\mathrm{supp}(\eta_i) \subseteq U_i$.
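The standard way such a partition of unity is produced is by normalizing compactly supported bump functions, one per cover element. The following sketch (our illustration, not taken from the cited proposition; the interval, centers, and radius are hypothetical choices) demonstrates the normalization trick in one dimension:

```python
import math

# Illustration: on [0, 1], covered by overlapping intervals U_i = (c_i - r, c_i + r),
# a partition of unity is built by normalizing compactly supported smooth bumps.

def bump(u):
    """Smooth bump: positive on (-1, 1), vanishing with all derivatives outside."""
    return math.exp(-1.0 / (1.0 - u * u)) if abs(u) < 1 else 0.0

centers, r = [0.0, 0.25, 0.5, 0.75, 1.0], 0.2

def eta(i, x):
    # supp(eta_i) is contained in U_i, and sum_i eta_i(x) = 1 wherever x is covered
    total = sum(bump((x - c) / r) for c in centers)
    return bump((x - centers[i]) / r) / total

x = 0.37
assert abs(sum(eta(i, x) for i in range(len(centers))) - 1.0) < 1e-12
```

The normalization is well defined precisely because the $U_i$ cover the domain, so the denominator is positive everywhere.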
3.2 Harmonic analysis on spaces of homogeneous type
3.2.1 Construction of wavelet frames
In this section we cite several standard results, mostly from the book of Deng and Han, showing how to construct a wavelet frame of $L^2(\mathbb{R}^d)$, and discuss some of its properties.
(Definition in Deng and Han)
A space of homogeneous type is a set $X$ together with a measure $\mu$ and a quasi-metric $d$ (satisfying the triangle inequality up to a constant $A$) such that for every $x \in X$ and $r > 0$ there exists a constant $A'$ such that
$$\mu\big(B(x, 2r)\big) \le A' \mu\big(B(x, r)\big),$$
i.e., the measure is doubling.
In this manuscript, we are interested in constructing a wavelet frame of $L^2(\mathbb{R}^d)$, which, equipped with the Lebesgue measure and the Euclidean metric, is a space of homogeneous type.
By standard wavelet terminology, we denote
The kernels need to satisfy appropriate regularity and decay conditions; this is discussed in great generality in Chapter 3 of Deng and Han.
These functions are called dual elements, and they also form a wavelet frame of $L^2(\mathbb{R}^d)$.
3.3 Approximation of functions with sparse wavelet coefficients
In this section we cite a result of Barron, Cohen, Dahmen and DeVore regarding the approximation of functions which have a sparse representation with respect to a dictionary, using finite linear combinations of dictionary elements.
Let $f$ be a function in some Hilbert space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle$ and norm $\|\cdot\|$, and let $\mathcal{D}$ be a dictionary, i.e., any family of functions with unit norm. Assume that $f$ can be represented as a linear combination of elements in $\mathcal{D}$ with absolutely summable coefficients, and denote the sum of absolute values of the coefficients in the expansion of $f$ by $\|f\|_{\mathcal{L}_1}$.
In the work of Barron, Cohen, Dahmen and DeVore, it is shown that such functions can be approximated using $N$ dictionary terms with squared error proportional to $1/N$. As a bonus, we also get a greedy algorithm (though not always practical) for selecting the corresponding dictionary terms: the Orthogonal Greedy Algorithm (OGA). OGA is a greedy algorithm that at the $k$'th iteration computes the residual
$$r_k = f - f_k,$$
finds the dictionary element that is most correlated with it,
$$g_{k+1} = \arg\max_{g \in \mathcal{D}} |\langle r_k, g \rangle|,$$
and defines a new approximation
$$f_{k+1} = P_{k+1} f,$$
where $P_{k+1}$ is the orthogonal projection operator onto $\mathrm{span}\{g_1, \dots, g_{k+1}\}$.
(Theorem 2.1 of Barron, Cohen, Dahmen and DeVore) The error of the OGA after $N$ iterations satisfies
$$\|f - f_N\| \le \|f\|_{\mathcal{L}_1} \, N^{-1/2}.$$
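The two steps of the iteration, greedy atom selection followed by orthogonal projection, can be sketched as follows. This is our minimal illustration over a finite dictionary in $\mathbb{R}^n$ (the projection is implemented via Gram-Schmidt; the toy atoms are hypothetical), not the authors' code:

```python
# Minimal sketch of the Orthogonal Greedy Algorithm (OGA) on a finite
# dictionary in R^n: pick the atom most correlated with the residual,
# then project f onto the span of all atoms selected so far.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def oga(f, dictionary, steps):
    basis = []                       # orthonormalized selected atoms
    approx = [0.0] * len(f)
    for _ in range(steps):
        r = [a - b for a, b in zip(f, approx)]             # residual r_k = f - f_k
        g = list(max(dictionary, key=lambda d: abs(dot(r, d))))  # most correlated atom
        for q in basis:                                    # Gram-Schmidt step
            c = dot(g, q)
            g = [a - c * b for a, b in zip(g, q)]
        norm = dot(g, g) ** 0.5
        if norm > 1e-12:
            basis.append([a / norm for a in g])
        # f_{k+1} = orthogonal projection of f onto span of selected atoms
        approx = [0.0] * len(f)
        for q in basis:
            c = dot(f, q)
            approx = [a + c * b for a, b in zip(approx, q)]
    return approx

# Toy example: f lies in the span of two of the three unit-norm atoms.
atoms = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
f = [3.0, -2.0, 0.0]
f2 = oga(f, atoms, 2)
assert max(abs(a - b) for a, b in zip(f, f2)) < 1e-9
```

The projection step is what distinguishes OGA from plain matching pursuit: previously selected atoms never need to be revisited.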
Clearly, for $L^2(\mathbb{R}^d)$ we can choose the dictionary to be the wavelet frame constructed above.
Let $\{\psi_{k,b}\}$ be a wavelet frame that satisfies the regularity conditions of Deng and Han. Then if a function $f$ is in $\mathcal{L}_1$ with respect to $\{\psi_{k,b}\}$, it is also in $\mathcal{L}_1$ with respect to any other wavelet frame that satisfies the same regularity conditions. In other words, having expansion coefficients in $\ell^1$ does not depend on the specific choice of wavelets (as long as the regularity conditions are satisfied). The idea behind the proof of this claim is explained in Appendix A.
Deng and Han also give a way to check whether a function has sparse coefficients without actually calculating the coefficients: one can determine whether $f \in \mathcal{L}_1$ without explicitly computing its wavelet coefficients, but rather by convolving $f$ with non-shifted wavelet terms at all scales.
4 Approximating functions on manifolds using deep neural nets
In this section we describe in detail the steps in our construction of deep networks, which are designed to approximate functions on smooth manifolds. The main steps in our construction are the following:
We construct a frame of $L^2(\mathbb{R}^d)$ in which the frame elements can be constructed from rectified linear units (see Section 4.1).
Given a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we construct an atlas for $\Gamma$ by covering it with open balls (see Section 4.2).
We use the open cover to obtain a partition of unity of $\Gamma$ and consequently represent any function on $\Gamma$ as a sum of compactly supported functions defined on the charts (see Section 4.3).
We show how to extend the wavelet terms in the wavelet expansion, which are defined on $\mathbb{R}^d$, to $\mathbb{R}^m$, in a way that depends on the curvature of the manifold (see Section 4.4).
4.1 Constructing a wavelet frame from rectifier units
In this section we show how Rectified Linear Units (ReLU) can be used to obtain a wavelet frame of $L^2(\mathbb{R}^d)$. The construction of wavelets from rectifiers is fairly simple, and we refer to results from Section 3.2 to show that they form a frame of $L^2(\mathbb{R}^d)$.
The rectifier activation function is defined on $\mathbb{R}$ as
$$\mathrm{ReLU}(x) = \max(0, x);$$
we define a trapezoid-shaped function $t: \mathbb{R} \to \mathbb{R}$ by
$$t(x) = \mathrm{ReLU}(x+3) - \mathrm{ReLU}(x+1) - \mathrm{ReLU}(x-1) + \mathrm{ReLU}(x-3),$$
which has height 2 on $[-1, 1]$ and support $[-3, 3]$.
We then define the scaling function $\varphi: \mathbb{R}^d \to \mathbb{R}$ by
$$\varphi(x) = C_d \, \mathrm{ReLU}\!\left(\sum_{i=1}^{d} t(x_i) - 2(d-1)\right),$$
where the constant $C_d$ is such that
$$\int_{\mathbb{R}^d} \varphi \, d\mu = 1.$$
Following the construction in Section 3.2, we define
$$\varphi_{k,b}(x) = 2^{kd/2}\,\varphi\!\left(2^k x - b\right).$$
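The construction is concrete enough to evaluate directly. The following sketch (our illustration; the normalizing constant is left unspecified, and the trapezoid breakpoints $\pm 1, \pm 3$ follow the definition above) computes the trapezoid and the $d$-dimensional scaling function from ReLUs:

```python
# Sketch of the rectifier-based scaling function: a trapezoid t is a sum of
# four ReLUs, and phi thresholds the sum of coordinate-wise trapezoids.

def relu(x):
    return max(0.0, x)

def trapezoid(x):
    # height 2 on [-1, 1], linear decay to 0 at |x| = 3
    return relu(x + 3) - relu(x + 1) - relu(x - 1) + relu(x - 3)

def phi(x, C_d=1.0):
    # scaling function on R^d; C_d is the normalizing constant (value omitted here)
    d = len(x)
    return C_d * relu(sum(trapezoid(xi) for xi in x) - 2 * (d - 1))

assert trapezoid(0.0) == 2.0 and trapezoid(2.0) == 1.0 and trapezoid(3.0) == 0.0
# for d = 2: phi is positive only where all coordinate trapezoids are high enough
assert phi([0.0, 0.0]) == 2.0 and phi([4.0, 0.0]) == 0.0
```

The subtraction of $2(d-1)$ ensures that $\varphi$ vanishes unless every coordinate's trapezoid is near its maximal height, which is what localizes $\varphi$ around the origin.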
The family $\{\varphi_k\}$ is a family of averaging kernels.
The proof is given in Appendix B. Next we define the ("mother") wavelet as
$$\psi(x) = \varphi(x) - 2^{-d}\varphi\!\left(x/2\right),$$
with the corresponding wavelet terms $\psi_{k,b}(x) = 2^{kd/2}\,\psi(2^k x - b)$.
Figure 1 shows the construction of $\varphi$ and $\psi$.
We can see that
With the above construction, $\varphi$ can be computed using a network with $4d$ rectifier units in the first layer and a single unit in the second layer. Hence every wavelet term $\psi_{k,b}$ can be computed using $8d$ rectifier units in the first layer, 2 rectifier units in the second layer and a single linear unit in the third layer. From this, a sum of $N$ wavelet terms can be computed using a network with $8dN$ rectifiers in the first layer, $2N$ rectifiers in the second layer and a single linear unit in the third layer.
From Theorem 3.7 and the above construction we then get the following lemma:
The family $\{\psi_{k,b}\}$ is a frame of $L^2(\mathbb{R}^d)$.
Next, the following lemma uses properties of the above frame to obtain point-wise error bounds in the approximation of compactly supported functions $f \in C^2$.
Let $f \in C^2(\mathbb{R}^d)$ be compactly supported, twice differentiable, and let $\|\nabla^2 f\|$ be bounded. Then for every $K$ there exists a combination $f_K$ of wavelet terms up to scale $K$ so that for every $x \in \mathbb{R}^d$,
$$|f(x) - f_K(x)| = O\!\left(2^{-2K}\right).$$
The proof is given in Appendix C.
4.2 Creating an atlas
In this section we specify the number of charts needed to obtain an atlas for a compact $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$.
For our purpose here we are interested in a small atlas. We would like the size of such an atlas to depend on the curvature of $\Gamma$: the lower the curvature, the fewer charts we will need for $\Gamma$.
Following the notation of Section 3.1, let $r > 0$ be such that for all $x \in \Gamma$, $\Gamma \cap B_r(x)$ is diffeomorphic to a disc, with a map that is close to the identity. We then cover $\Gamma$ with balls of radius $r/2$. The number $C_\Gamma$ of such balls that are required to cover $\Gamma$ scales like
$$C_\Gamma \sim \frac{SA(\Gamma)}{r^d}\, T,$$
where $SA(\Gamma)$ is the surface area of $\Gamma$, and $T$ is the thickness of the covering (which corresponds to by how much the balls need to overlap).
The thickness $T$ scales with $d$, however rather slowly: by a result of Conway and Sloane, there exist coverings whose thickness grows roughly like $d \log d$. For example, in $\mathbb{R}^{24}$ there exists a covering (by the Leech lattice) with thickness of about 7.9.
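To make the scaling concrete, the following sketch evaluates a covering-number estimate of the shape described in the text. The exact constants and the $d \log d$ thickness proxy are our assumptions for illustration, not a bound proved in this manuscript:

```python
import math

# Hedged illustration: number of radius-(r/2) balls needed to cover a
# d-dimensional manifold of surface area SA, scaling like (SA / r^d) * T,
# where the covering thickness T grows only mildly with d.

def num_charts(surface_area, r, d):
    thickness = d * math.log(d) if d > 1 else 1.0   # rough Conway-Sloane-type proxy
    return math.ceil(surface_area / (r ** d) * thickness)

# Unit circle (d = 1, surface "area" = circumference 2*pi), covered at scale r = 0.5:
assert num_charts(2 * math.pi, 0.5, 1) == math.ceil(4 * math.pi)
```

The point of the estimate is that the chart count is governed by $d$ and the curvature scale $r$, while the ambient dimension $m$ does not appear at all.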
A covering of $\Gamma$ by such a collection of balls $\{B_i\}_{i=1}^{C_\Gamma}$ defines an open cover of $\Gamma$ by
$$U_i \equiv B_i \cap \Gamma.$$
Let $H_i$ denote the hyperplane tangent to $\Gamma$ at the center of $B_i$. We can now define an atlas by $\{(U_i, \phi_i)\}_{i=1}^{C_\Gamma}$, where $\phi_i$ is the orthogonal projection from $U_i$ onto $H_i$.
The above construction is sketched in Figure 2.
Let $\pi_i$ be the extension of $\phi_i$ to $\mathbb{R}^m$, i.e., the orthogonal projection of $\mathbb{R}^m$ onto $H_i$. The above construction has two important properties, summarized in Lemma 4.7.
For every ,
and for every such that
4.3 Representing a function on manifold as a sum of functions in
Let $\Gamma$ be a compact $d$-dimensional manifold in $\mathbb{R}^m$, let $f \in L^2(\Gamma)$, let $\{(U_i, \phi_i)\}_{i=1}^{C_\Gamma}$ be an atlas obtained by the covering in Section 4.2, and let $\pi_i$ be the extension of $\phi_i$ to $\mathbb{R}^m$.
$\{U_i\}$ is an open cover of $\Gamma$, hence by Theorem 3.4 there exists a corresponding partition of unity, i.e., a family of compactly supported functions $\{\eta_i\}$ such that
$$\sum_i \eta_i = 1 \text{ on } \Gamma, \qquad \mathrm{supp}(\eta_i) \subseteq U_i.$$
Let $f_i$ be defined by
$$f_i = f \eta_i,$$
and observe that $\sum_i f_i = f$. The image $\phi_i(U_i)$ lies in the tangent hyperplane $H_i$, i.e., in a $d$-dimensional hyperplane which is isomorphic to $\mathbb{R}^d$. We define $\hat f_i$ on $\mathbb{R}^d$ as
$$\hat f_i(x) = f_i\!\left(\phi_i^{-1}(x)\right) \text{ for } x \in \phi_i(U_i), \qquad \hat f_i(x) = 0 \text{ otherwise},$$
and observe that $\hat f_i$ is compactly supported. This construction gives the following lemma.
For all $x \in \Gamma$,
$$\sum_i \hat f_i(\phi_i(x)) = f(x).$$
4.4 Extending the wavelet terms in the approximation of to
Assume that $\hat f_i \in L^2(\mathbb{R}^d)$ and let
$$\hat f_i = \sum_{k,b} \alpha_{k,b} \psi_{k,b}$$
be its wavelet expansion, where each $\psi_{k,b}$ is defined on $\mathbb{R}^d$.
We now show how to extend each $\psi_{k,b}$ to $\mathbb{R}^m$. Let us assume (for now) that the coordinate system is such that the first $d$ coordinates are the local coordinates (i.e., the coordinates on $H_i$) and the remaining $m - d$ coordinates are of the directions which are orthogonal to $H_i$.
Intuitively, we would like to extend the wavelet terms on $\mathbb{R}^d$ to $\mathbb{R}^m$ so that they remain constant until they "hit" the manifold, and then die off before they "hit" the manifold again. By Lemma 4.7 it therefore suffices to extend each $\psi_{k,b}$ to $\mathbb{R}^m$ so that in each of the $m - d$ orthogonal directions the extension is constant near $H_i$ and has support contained in a tube of bounded width around $H_i$.
Recall that as in Equation (19), the scaling function $\varphi$ was defined on $\mathbb{R}^d$ by
$$\varphi(x) = C_d \, \mathrm{ReLU}\!\left(\sum_{i=1}^{d} t(x_i) - 2(d-1)\right).$$
We extend $\varphi$ to $\mathbb{R}^m$ by
$$\varphi(x) = C_d \, \mathrm{ReLU}\!\left(\sum_{i=1}^{d} t(x_i) + \sum_{j=d+1}^{m} t_r(x_j) - 2(m-1)\right),$$
where $t_r$ is a trapezoid function which is supported on $[-2r, 2r]$, whose top (small) base is between $-r$ and $r$, and which has height 2. This definition gives $\varphi$ a constant height for distance up to $r$ from $H_i$, and then a linear decay, until it vanishes at distance $2r$. By construction we then obtain the following lemma.
For every chart $i$ and every $x \in \Gamma$ that is sufficiently far from $U_i$ (in the sense of Lemma 4.7), $x$ is outside the support of every wavelet term corresponding to the $i$'th chart.
Finally, in order for this construction to work for all charts simultaneously, the input of the network can first be mapped to $\mathbb{R}^{mC_\Gamma}$ by a linear transformation so that each of the $C_\Gamma$ blocks of $m$ coordinates gives the local coordinates on $H_i$ in the first $d$ coordinates, and the coordinates on the orthogonal subspace in the remaining $m - d$ coordinates. These maps are essentially the orthogonal projections $\pi_i$.
5 Specifying the required size of the network
In the construction of Section 4, we approximate a function $f \in L^2(\Gamma)$ using a depth-4 network, where the first layer computes the local coordinates in every chart in the atlas, the second layer computes the rectifier terms that sum to trapezoids, the third layer computes scaling functions of the form $\varphi_{k,b}$ for various $k$ and $b$, and the fourth layer consists of a single node which computes the weighted sum of the wavelet terms over all charts, where each wavelet term is associated with the $i$'th chart. This network is sketched in Figure 3.
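The four-layer computation above can be paraphrased as the following schematic forward pass. This is our structural sketch with toy placeholder weights, not the constructed network itself: `proj` stands for the chart projection of layer 1, each `term` stands for a ReLU-built wavelet term (layers 2-3), and the accumulation is the single linear output unit of layer 4:

```python
# Schematic forward pass of the depth-4 construction:
# layer 1 (linear): per-chart local coordinates;
# layers 2-3 (ReLU): wavelet terms built from trapezoids and scaling functions;
# layer 4 (linear): a single unit summing weighted wavelet terms over all charts.

def forward(x, charts, wavelet_terms):
    out = 0.0
    for proj, terms in zip(charts, wavelet_terms):
        # layer 1: local coordinates on chart i (rows of proj = projection matrix)
        u = [sum(w * xi for w, xi in zip(row, x)) for row in proj]
        for coeff, term in terms:
            out += coeff * term(u)   # layer 4 accumulates coeff * psi(u)
    return out

# Toy usage: one identity "chart" in R^1 and one linear placeholder "wavelet term".
y = forward([2.0], [[[1.0]]], [[(3.0, lambda u: u[0])]])
assert y == 6.0
```

The structure makes the unit counting in Section 5 transparent: layer 1 grows with $m C_\Gamma$, while layers 2-4 grow with the number of wavelet terms per chart.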
From this construction, we obtain the following theorem, which is the main result of this work:
Let $\Gamma$ be a $d$-dimensional manifold in $\mathbb{R}^m$, and let $f \in L^2(\Gamma)$. Let $\{(U_i, \phi_i)\}_{i=1}^{C_\Gamma}$ be an atlas of size $C_\Gamma$ for $\Gamma$, as in Section 4.2. Then $f$ can be approximated using a 4-layer network with $mC_\Gamma$ linear units in the first hidden layer, $\sum_{i=1}^{C_\Gamma}\big(8dN_i + 4(m-d)\big)$ rectifier units in the second hidden layer, $2\sum_{i=1}^{C_\Gamma} N_i$ rectifier units in the third layer and a single linear unit in the fourth (output) layer, where $N_i$ is the number of wavelet terms that are used for approximating $f$ on the $i$'th chart.
As in Section 4.3, we construct functions $\hat f_i$ on $\mathbb{R}^d$ as in Equation (39), which, by Lemma 4.8, have the property that for every $x \in \Gamma$, $\sum_i \hat f_i(\phi_i(x)) = f(x)$. The fact that $\hat f_i$ is compactly supported means that its wavelet approximation converges to zero outside $\phi_i(U_i)$. Together with Lemma 4.9, we then get that an approximation of $f$ is obtained by summing up the approximations of all the $\hat f_i$'s.
The first layer of the network will consist of $mC_\Gamma$ linear units and will compute the map as in the last paragraph of Section 4.4, i.e., linearly transform the input to $C_\Gamma$ blocks, each of dimension $m$, so that in each block the first $d$ coordinates are with respect to the tangent hyperplane $H_i$ (i.e., will give the representation $\phi_i(x)$) and the remaining $m - d$ coordinates are with respect to directions orthogonal to $H_i$.
For each $i$, we approximate each $\hat f_i$ to some desired approximation level using $N_i$ wavelet terms. By Remark 4.3, these can be computed using $8dN_i$ rectifiers in the second layer, $2N_i$ rectifiers in the third layer and a single unit in the fourth layer. By Remark 4.10, on every chart the wavelet terms in all scales and shifts can be extended to $\mathbb{R}^m$ using (the same) $4(m-d)$ rectifiers in the second layer.
Putting this together, we get that to approximate $f$ one needs a 4-layer network with $mC_\Gamma$ linear units in the first hidden layer, $\sum_{i=1}^{C_\Gamma}\big(8dN_i + 4(m-d)\big)$ rectifier units in the second hidden layer, $2\sum_{i=1}^{C_\Gamma} N_i$ rectifier units in the third layer and a single linear unit in the fourth (output) layer. ∎
For a sufficiently small radius $r$ in the sense of Section 3.1, the desired properties of $f$ (i.e., being in $L^2$, and possibly having sparse coefficients or being twice differentiable) imply similar properties of the $\hat f_i$.
We observe that the dependence on the dimension $m$ of the ambient space in the first and second layers is through $C_\Gamma$, which depends on the curvature of the manifold. The number $N_i$ of wavelet terms in the $i$'th chart affects the number of units in the second layer only through the dimension $d$ of the manifold, not through $m$. The sizes of the third and fourth layers do not depend on $m$ at all.
Finally, assuming regularity conditions on the $\hat f_i$ allows us to bound the number of wavelet terms needed for the approximation of $f$. In particular, we consider two specific cases: $\hat f_i \in \mathcal{L}_1$, and $\hat f_i \in C^2$ with bounded second derivative.
If $\hat f_i \in \mathcal{L}_1$ (i.e., has expansion coefficients in $\ell^1$), then by Theorem 3.10, $\hat f_i$ can be approximated by a combination of $N_i$ wavelet terms so that
$$\|\hat f_i - f_{N_i}\| \le \|\hat f_i\|_{\mathcal{L}_1}\, N_i^{-1/2}.$$
Consequently, denoting the output of the net by $f_N$, where $N = \sum_i N_i$, we obtain a squared error
$$\|f - f_N\|^2 = O\!\left(\frac{1}{N}\right)$$
using $c_1 N + c_2$ units, where $c_1 = c_1(d)$ and $c_2 = c_2(m, C_\Gamma)$.
If for each ’s is twice differentiable and is bounded, then by Lemma 4.5 can be approximated by using all terms up to scale so that for every
Observe that the grid spacing in the ’th level is . Therefore, since is compactly supported, there are terms in the ’th level. Altogether, on the ’th chart there are terms in levels less than . Writing , we get a point-wise error rate of using units, where and .
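The algebra linking the scale-$K$ error to the term count $N$ can be restated explicitly (notation as above; constants suppressed):

```latex
|\hat f_i(x) - f_K(x)| = O\!\left(2^{-2K}\right),
\qquad
N_i = \#\{\text{terms up to scale } K\} = O\!\left(2^{Kd}\right).
```

Solving the second relation for $2^{K}$ and substituting into the first gives
$2^{-2K} = \big(2^{Kd}\big)^{-2/d} = O\!\left(N_i^{-2/d}\right)$, which is the stated point-wise rate of $O\!\left(N^{-2/d}\right)$.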
The unit count in Theorem 5.1 and Corollaries 5.4 and 5.5 is overly pessimistic, in the sense that we assume that the sets of wavelet terms in the expansions of $\hat f_i$ and $\hat f_j$ do not intersect, where $i, j$ are chart indices. A tighter bound can be obtained if we allow wavelet functions to be shared across different charts, in which case the term $\sum_i N_i$ in Theorem 5.1 can be replaced by the total number of distinct wavelet terms that are used on all charts, hence decreasing the constant $c_1$. In particular, in Corollary 5.5 we use all terms up to the $K$'th scale on each chart, and the constant $c_1$ decreases accordingly.
The linear units in the first layer can be simulated using ReLU units with large positive biases, adjusting the biases of the units in the second layer accordingly. Hence the first layer can consist of ReLU units instead of linear units.
The construction presented in this manuscript can be divided into two main parts: analytical and topological. In the analytical part, we constructed a wavelet frame of $L^2(\mathbb{R}^d)$, where the wavelets are computed from rectified linear units. In the topological part, given training data on a $d$-dimensional manifold $\Gamma \subset \mathbb{R}^m$, we constructed an atlas and represented any function on $\Gamma$ as a sum of functions that are defined on the charts. We then used rectifier units to extend the wavelet approximation of these functions from $\mathbb{R}^d$ to the ambient space $\mathbb{R}^m$. This construction allows us to state the size of a depth-4 neural net given a function $f$ to be approximated on the manifold $\Gamma$. We show how the specified size depends on the complexity of the function (manifested in the number of wavelet terms in its approximation) and the curvature of the manifold (manifested in the size of the atlas). In particular, we take advantage of the fact that $d$ can possibly be much smaller than $m$ to construct a network whose size depends more strongly on $d$. In addition, we also obtain a squared error rate in the approximation of functions with sparse wavelet expansion, and a point-wise error rate for twice differentiable functions.
The network architecture and corresponding weights presented in this manuscript are hand-crafted to achieve the approximation properties stated above. However, it is reasonable to assume that such a network is unlikely to be the result of a standard training process. Hence, we see the importance of the results presented in this manuscript in describing the theoretical approximation capability of neural nets, not in describing the trained nets which are used in practice.
Several extensions of this work can be considered. First, a more efficient wavelet representation can be obtained on each chart if one allows its wavelets to be non-isotropic (that is, to scale differently in each dimension) and not necessarily axis-aligned, but rather to correspond to the level sets of the function being approximated. When the function is relatively constant in certain directions, the wavelet terms can be "stretched" in these directions. This can be done using, for example, curvelets.
Second, we conjecture that in the representation obtained as an output of convolutional and pooling layers, the data concentrates near a collection of low-dimensional manifolds embedded in a high-dimensional space, which is our starting point in the current manuscript. We think that this is a result of the application of the same filters to all data points. Assuming our conjecture is true, one can apply our construction to the output of convolutional layers, and thereby obtain a network topology which is similar to that of standard convolutional networks, namely fully connected layers on top of convolutional ones. This will make our arguments here applicable to cases where the data in its initial representation does not concentrate near a low-dimensional manifold, but its hidden representation does.
Finally, we remark that the choice of using rectifier units to construct our wavelet frame is convenient, but somewhat arbitrary. Similar wavelet frames can be constructed by any function (or combination of functions) that can be used to construct "bump" functions, i.e., functions which are localized and have fast decay. For example, general sigmoid functions $\sigma: \mathbb{R} \to \mathbb{R}$, which are monotonic and satisfy
$$\lim_{x \to -\infty} \sigma(x) = 0, \qquad \lim_{x \to \infty} \sigma(x) = 1,$$
can be used to construct a frame in a similar way, by computing "smooth" trapezoids. Recall also that by Remark 3.11, any two such frames are equivalent.
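A smooth trapezoid of the kind alluded to above can be sketched as follows. This is our illustration (the logistic function, the sharpness $s$, and the shoulder points $\pm a$ are our choices, not specified by the manuscript):

```python
import math

# Hedged sketch: a "smooth trapezoid" built from a general sigmoid in place of
# ReLU differences; analogous to the piecewise-linear t(x) used above.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smooth_trapezoid(x, a=2.0, s=10.0):
    # approximately 1 on (-a, a), decaying smoothly (not compactly) to 0 outside
    return sigmoid(s * (x + a)) - sigmoid(s * (x - a))

assert abs(smooth_trapezoid(0.0) - 1.0) < 1e-6
assert smooth_trapezoid(10.0) < 1e-6
```

Unlike the ReLU trapezoid, this version is not compactly supported, only rapidly decaying, which is why the fast-decay condition (rather than compact support) is what the frame construction actually requires.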
The authors thank Stefan Steinerberger and Roy Lederman for their help, and Andrew Barron, Ed Bosch, Mark Tygert and Yann LeCun for their comments. Alexander Cloninger is supported by NSF Award No. DMS-1402254.
-  Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. Information Theory, IEEE Transactions on, 39(3):930–945, 1993.
-  Andrew R Barron, Albert Cohen, Wolfgang Dahmen, and Ronald A DeVore. Approximation and learning by greedy algorithms. The annals of statistics, pages 64–94, 2008.
-  Charles K. Chui and H. N. Mhaskar. Deep nets and manifold learning. Personal communication, 2015.
-  John Horton Conway, Neil James Alexander Sloane, Etsuko Bannai, J Leech, SP Norton, AM Odlyzko, RA Parker, L Queen, and BB Venkov. Sphere packings, lattices and groups, volume 3. Springer-Verlag New York, 1993.
-  George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4):303–314, 1989.
-  Donggao Deng and Yongsheng Han. Harmonic analysis on spaces of homogeneous type. Number 1966. Springer Science & Business Media, 2009.
-  Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965, 2015.
-  Federico Girosi, Marshall B Jones, and Tomaso Poggio. Regularization theory and neural networks architectures. Neural computation, 7(2):219–269, 1995.
-  Federico Girosi and Tomaso Poggio. Networks and the best approximation property. Biological cybernetics, 63(3):169–176, 1990.
-  Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257, 1991.
-  Vera Kurková and Marcello Sanguineti. Comparison of worst case errors in linear and neural network approximation. IEEE Transactions on Information Theory, 48(1):264–275, 2002.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
-  Moshe Leshno, Vladimir Ya Lin, Allan Pinkus, and Shimon Schocken. Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural networks, 6(6):861–867, 1993.
-  Loring W. Tu. An Introduction to Manifolds. Springer, 2008.
-  HN Mhaskar. Neural networks for optimal approximation of smooth and analytic functions. Neural Computation, 8(1):164–177, 1996.
-  Hrushikesh Narhar Mhaskar. On the tractability of multivariate integration and approximation by neural networks. Journal of Complexity, 20(4):561–590, 2004.
-  Guido F Montufar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932, 2014.
-  Yagyensh C Pati and Perinkulam S Krishnaprasad. Analysis and synthesis of feedforward neural networks using discrete affine wavelet transformations. Neural Networks, IEEE Transactions on, 4(1):73–85, 1993.
-  Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406, 2015.
-  Qinghua Zhang and Albert Benveniste. Wavelet networks. Neural Networks, IEEE Transactions on, 3(6):889–898, 1992.
-  Jinsong Zhao, Bingzhen Chen, and Jingzhu Shen. Multidimensional non-orthogonal wavelet-sigmoid basis function neural network for dynamic process fault diagnosis. Computers & chemical engineering, 23(1):83–92, 1998.
Appendix A Equivalence of representations in different wavelet frames
Consider two frames $\{\psi_{k,b}\}$ and $\{\theta_{k',b'}\}$. Any element of the first frame can be represented as a linear combination of the elements of the second, with coefficients given by inner products against the dual elements.
Observe that when the scales $k$ and $k'$ are comparable, the inner product is of large magnitude only for a small number of terms. When $k \ll k'$ or $k \gg k'$, the inner product is between a peaked function which integrates to zero and a flat function, and hence has small magnitude. This idea is formalized in a more general form in Deng and Han.
Appendix B Proof of Lemma 4.1
In order to show that the family in Equation (21) is a valid family of averaging kernel functions, we need to verify that the conditions of Deng and Han are satisfied. Here the relevant normalizing quantity is the volume of the smallest Euclidean ball which contains the two points in question. Our goal is to show that there exist constants $c$, $\varepsilon$ and $C$ such that the required size, smoothness and cancellation bounds hold.
WLOG we can assume , and let be arbitrary positive number. It can be easily verified that there exists a constant such that
∎
: Since depends only on and is symmetric about the origin, it suffices to prove only . We want to show that if then
WLOG ; we will prove for every . Let be arbitrary positive number, and let . By the mean value theorem we get
As in the proof of condition , it can be easily verified that there exists a constant such that
We then get
∎
: Since depends only on and is symmetric about the origin, it suffices to prove only .
By Equation (19)
and consequently for every and
: we want to show if and then