Technological innovations in data mining bring massive data to diverse areas of modern scientific research. Deep learning [15, 2] is recognized as a state-of-the-art scheme for taking advantage of massive data, owing to its unreasonably effective empirical performance. Theoretical verification of this effectiveness has been a hot topic in statistics and machine learning in recent years.
One of the most important reasons for the success of deep learning is the utilization of deep nets, a.k.a. neural networks with more than one hidden layer. In the classical neural network approximation literature, deep nets were shown to outperform shallow nets, i.e., neural networks with one hidden layer, in terms of providing localized approximation and breaking through some lower bounds for shallow nets approximation. Besides these classical assertions, the recent focus [18, 12, 43, 35, 26] of deep nets approximation is to exhibit various functions that are expressible by deep nets but cannot be approximated by shallow nets with a similar number of neurons. All these results present theoretical verifications of the necessity of deep nets from the approximation theory viewpoint.
Since deep nets can approximate more functions than shallow nets, the capacity of deep nets seems to be larger than that of shallow nets with a similar number of neurons. This argument was recently verified under some specified complexity measurements such as the number of linear regions, Betti numbers, the number of monomials, and so on. The large capacity of deep nets inevitably comes with the downside of an increased overfitting risk, according to the bias-variance trade-off principle. For example, deep nets with finitely many neurons were proved to be capable of approximating arbitrary continuous functions within arbitrary accuracy, but the pseudo-dimension of such deep nets is infinite, which usually leads to extremely large variance in the learning process. Thus, the necessity of deep nets established in the approximation theory community cannot be used directly to explain the feasibility of deep nets in machine learning.
In this paper, we study the learning performance of empirical risk minimization (ERM) on some specified deep nets. Our analysis starts with the localized approximation property as well as the sparse approximation ability of deep nets to show their expressive power. We then conduct a refined estimate for the covering number of deep nets, which is closely connected to learning theory, to measure the capacity. The result shows that, although deep nets possess localized and sparse approximation properties while shallow nets do not, their capacities measured by the covering number are similar, provided the two kinds of nets have a comparable number of neurons. As a consequence, we derive almost optimal learning rates for the proposed ERM algorithms on deep nets when the so-called regression function is Lipschitz continuous. Furthermore, we prove that deep nets can reflect the sparse property of the regression function by breaking through the established almost optimal learning rates. All these results show that learning schemes based on deep nets can learn more (complicated) functions than those based on shallow nets.
The rest of this paper is organized as follows. In the next section, we present some results on the expressivity and capacity of deep nets. These properties are utilized in Section III to show the outperformance of deep nets in machine learning. In Section IV, we present some related work and comparisons. In the last section, we draw a brief conclusion of our work.
II Expressivity and Capacity
Expressivity of deep nets usually means that deep nets can represent some functions that cannot be approximated by shallow nets with a similar number of neurons. Generally speaking, expressivity implies a large capacity of deep nets. In this section, we first show the expressivity of deep nets in terms of localized and sparse approximation, and then prove that the capacity measured by the covering number is not essentially enlarged when the number of hidden layers increases.
II-A Localized approximation for deep nets
be the set of shallow nets with activation function and neurons. Denote by the set of deep nets with two hidden layers
where . The aim of this subsection is to show the outperformance of over to verify the necessity of depth in providing localized approximation.
The localized approximation of a neural network  shows that if the target function is modified only on a small subset of the Euclidean space, then only a few neurons, rather than the entire network, need to be retrained. As shown in Figure 1, a neural network with localized approximation should recognize the location of the input in a small region.
Mathematically speaking, localized approximation means that for arbitrary hypercube and arbitrary , it is capable of finding a neural network such that where is the input space and denotes the indicator function of the set , i.e., when and when .
Let and be the Heaviside function, i.e., when and when . It can be found in [4, Theorem 5] (see also [7, 38]) that cannot provide localized approximation, implying that functions in with a finite number of neurons cannot capture the position information of the input. However, in the following, we will construct a deep net in with some activation function and totally neurons to recognize the location of the input.
be a sigmoidal function, i.e.,
Then, for arbitrary , there exists a depending only on and such that
Let . Denote by the cubic partition of with centers and side length
, where we write arbitrary vectoras and . Then, for and arbitrary , we construct a deep net by
In the following proposition proved in Appendix A, we show that deep nets possess totally different property from shallow nets in localized approximation.
(a) For arbitrary , there holds
(b) For arbitrary there holds .
If we set , Proposition 1 shows that is an indicator function for , and consequently provides localized approximation. Furthermore, as , it follows from Proposition 1 that can recognize the location of in an arbitrarily small region. In the prominent paper , the localized approximation property of deep nets with two hidden layers and sigmoidal activation functions was established in a weaker sense. The difference between Proposition 1 and the results in  is that we adopt the Heaviside activation function in the first hidden layer to guarantee the equivalence of and . In the second hidden layer, it will be shown in Section II-C that some smoothness assumptions should be imposed on the activation function to derive a tight bound of the covering number. Thus, we do not recommend the use of the Heaviside activation in that layer. In short, we require different activation functions in different hidden layers to show the excellent expressivity and small capacity of deep nets.
Compared with shallow nets in , the constructed deep net introduces a second hidden layer that acts as a judge to discriminate the location of inputs. Figure 2 below numerically exhibits the localized approximation of with , being the center of the yellow zone in Figure 1 and being the logistic function, i.e., . As shown in Figure 2, we can construct a deep net that controls a small region of the input space but is independent of other regions. Thus, if the target function changes only on a small region, then it is sufficient to tune a few neurons, rather than retrain the entire network. Since data locality abounds in sparse coding, statistical physics and image processing, the localized approximation makes deep nets effective and efficient in the related applications.
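To make the role of the second hidden layer concrete, the following Python sketch builds a single localized unit. It is an illustration under stated assumptions rather than a transcription of the construction in (2): the exact argument of the outer neuron, the names `local_unit`, `center`, `n` and `K`, and the numerical values are hypothetical choices made for this example. The first hidden layer uses two Heaviside neurons per coordinate to test whether the input lies between two opposite faces of a cube, and one logistic neuron in the second hidden layer fires only when all tests pass.

```python
import math

def heaviside(t):
    """Heaviside step: 1 for t >= 0, 0 otherwise."""
    return 1.0 if t >= 0 else 0.0

def logistic(t):
    """Numerically stable logistic function 1 / (1 + exp(-t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def local_unit(x, center, n, K):
    """Two-hidden-layer unit that is close to 1 on the cube with the given
    center and side length 1/n, and close to 0 elsewhere.

    First hidden layer: 2*d Heaviside neurons, two per coordinate, each
    checking one face of the cube.
    Second hidden layer: one logistic neuron whose argument is +K when all
    2*d checks pass and at most -K otherwise (assumed form, see lead-in).
    """
    d = len(x)
    half = 1.0 / (2.0 * n)
    passed = sum(
        heaviside(half + xi - ci) + heaviside(half - xi + ci)
        for xi, ci in zip(x, center)
    )
    return logistic(2.0 * K * (passed - 2 * d + 0.5))

center = (0.25, 0.75)   # hypothetical cube center in [0, 1]^2
n, K = 2, 20.0          # side length 1/2, sharpness parameter K

inside = local_unit((0.3, 0.6), center, n, K)    # point in the cube
outside = local_unit((0.8, 0.6), center, n, K)   # point outside the cube
print(inside, outside)
```

In this sketch the unit uses 2d + 1 neurons in total, and increasing `K` pushes the two output values closer to 1 and 0, which mirrors the role of the sharpness parameter discussed above.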
II-B Sparse approximation for deep nets
The localized approximation property of deep nets shows their power to recognize functions defined on small regions. A direct consequence is that deep nets can reflect the sparse property of the target functions in the spatial domain. In this part, based on the localized approximation property established in Proposition 1, we focus on developing a deep net with the sparse approximation property in the spatial domain.
Sparseness in the spatial domain means that the response of some actions happens only on several small regions in the input space, just as sparse coding purports to show. As shown in Figure 3, sparseness studied in this paper means that the response (or function) vanishes in a large number of regions and requires neural networks to recognize where the response does not vanish.
Mathematically speaking, denote by the cubic partitions of with center and side length . For with , define
It is easy to see that contains arbitrary regions consisting of at most sub-cubes (such as the yellow zones in Figure 3 with ). We then say is a sparse subset of of sparseness . For some function defined on , if the support of is , we then say that is -sparse in partitions.
As discussed above, the sparseness depends on the localized approximation property. We can thus construct a deep net to embody the sparseness with the help of the deep net constructed in (2). For arbitrary and with , define
where is the cubic partition defined in the previous subsection. Obviously, we have which possesses neurons. In the following Proposition 2, we will show that can embody the sparseness of the target function by exhibiting a fast approximation rate which breaks through the bottleneck of shallow nets.
For this purpose, we should first introduce some a-priori information on the target function. We say a function is -Lipschitz if satisfies
where and denotes the Euclidean norm of . Denote by the family of -Lipschitz functions satisfying (6). The Lipschitz property describes the smoothness information of and has been adopted in a vast literature [7, 28, 38, 22, 9] to quantify the approximation ability of neural networks. Denote by the set of all which are -sparse in partitions. It is easy to check that quantifies both the smoothness information and the sparseness in the spatial domain of the target function.
Then, we introduce the support set of . Note that the number of neurons of controls the side length of the cubic partition , while is supported on cubes in . Since is fixed, we need to tune such that the constructed deep net can recognize each with . Under this circumstance, we take and for each , define
The set corresponds to the family of cubes where does not vanish. Since each can be recognized by neuron of as given in Proposition 1, actually describes the support of . With these preparations, we exhibit in the following proposition that possesses the sparse approximation ability. The proof will be presented in Appendix A.
It can be derived from (8) with and that the deep net constructed in (5) satisfies the well-known Jackson-type inequality for multivariate functions. This property shows that in approximating Lipschitz functions, deep nets perform no worse than shallow nets. If additional sparseness information is available, i.e. with , by setting , (9) illustrates that for every ,
implying the sparseness of in the spatial domain. It should be highlighted that for each the cardinality of , denoted by , satisfies
Therefore, there are at least
neurons satisfying (9), which is large when is small with respect to . The aforementioned sparse approximation ability reduces the complexity of deep nets in approximating sparse functions, which makes deep-net-based learning break through some limitations of shallow-net-based learning, as shown in Section III.
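The interplay between sparseness and localized units can be illustrated numerically. The Python sketch below assumes that the sparse deep net is a sum of localized units, one per cube of the partition, each weighted by the value of the target function at the cube center, so that cubes on which the target vanishes contribute nothing. This form, the helper names (`build_sparse_net`, `tent`, `local_unit`) and all numerical values are assumptions made for the example, not a transcription of (5).

```python
import math
from itertools import product

def heaviside(t):
    return 1.0 if t >= 0 else 0.0

def logistic(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def local_unit(x, center, n, K):
    """Localized unit: close to 1 on the cube around `center`, else close to 0."""
    half = 1.0 / (2.0 * n)
    passed = sum(
        heaviside(half + xi - ci) + heaviside(half - xi + ci)
        for xi, ci in zip(x, center)
    )
    return logistic(2.0 * K * (passed - 2 * len(x) + 0.5))

def build_sparse_net(f, n, K, d):
    """Sum of localized units weighted by f at the cube centers.

    Units whose weight is zero are dropped, so the net only keeps the cubes
    on which the target does not vanish at the center.
    """
    centers = [
        tuple((2 * k + 1) / (2.0 * n) for k in idx)
        for idx in product(range(n), repeat=d)
    ]
    active = [(c, f(c)) for c in centers if f(c) != 0.0]

    def net(x):
        return sum(w * local_unit(x, c, n, K) for c, w in active)

    return net, len(active), len(centers)

def tent(x):
    """Hypothetical 1-Lipschitz target supported on the cube [0, 1/4]^2."""
    return max(0.0, 0.125 - max(abs(x[0] - 0.125), abs(x[1] - 0.125)))

net, active_cubes, total_cubes = build_sparse_net(tent, n=16, K=20.0, d=2)

# Evaluate on points that avoid the cube boundaries.
points = [((i + 0.3) / 40, (j + 0.3) / 40) for i in range(40) for j in range(40)]
max_err = max(abs(net(p) - tent(p)) for p in points)
print(active_cubes, total_cubes, max_err)
```

In this toy setting only the units attached to the support of the target carry nonzero weights, and the error on each cube is governed by the Lipschitz constant times the cube size, which is the mechanism behind the sparse approximation discussed above.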
II-C Covering number of deep nets
Proposition 1 and Proposition 2 show the expressive power of deep nets. In this subsection, we exhibit that the capacity of deep nets, measured by the well-known covering number, is similar to that of shallow nets, implying that deep nets can approximate more functions than shallow nets without bringing additional costs.
Let be a Banach space and be a compact set in . Denote by the covering number  of under the metric of , which is the number of elements in least -net of . If , the space of continuous functions, we denote for brevity. The estimate of covering number of shallow nets is a classical research topic in approximation and learning theory [32, 17, 14, 30, 31]. Our purpose is to present a refined estimate for the covering number of deep nets to show whether there are additional costs required by deep nets to embody the localized and sparse approximation.
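As a numerical illustration of the covering number, the Python sketch below builds a greedy net for a small, finite family of one-neuron logistic functions and counts its elements. Everything here is an assumption made for the example: the family, the parameter bound `R`, the grid used to approximate the uniform norm, and the greedy selection rule. The size of a greedy net only gives an upper bound on the covering number of the sampled family and says nothing rigorous about the full function class.

```python
import math

def logistic(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def sup_dist(u, v):
    """Uniform distance between two functions sampled on the same grid."""
    return max(abs(a - b) for a, b in zip(u, v))

def greedy_net(functions, eps):
    """Pick centers one by one until every function is within eps of a center."""
    centers = []
    for f in functions:
        if all(sup_dist(f, c) > eps for c in centers):
            centers.append(f)
    return centers

R, steps = 3.0, 13                       # hypothetical parameter bound and grid
params = [-R + 2 * R * i / (steps - 1) for i in range(steps)]
xs = [i / 50 for i in range(51)]         # grid on [0, 1] for the uniform norm

# Family of functions x -> logistic(w * x + b) with |w|, |b| <= R.
family = [
    tuple(logistic(w * x + b) for x in xs)
    for w in params
    for b in params
]

for eps in (0.5, 0.2, 0.1, 0.05):
    print(eps, len(greedy_net(family, eps)))
```

Every function in the family is either chosen as a center or lies within `eps` of an earlier center, so the returned list is a valid net of the sampled family at scale `eps`.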
where . Define be the family of such deep nets whose parameters are bounded, i.e.,
where and are positive numbers. We can see for sufficiently large and . To present the covering number of , we need the following smoothness assumption on .
is a non-decreasing sigmoidal function satisfying
Assumption 1 has already been adopted in [17, Theorem 5.1] and [32, Lemma 2] to quantify the covering number of some shallow nets. It should be mentioned that there are numerous functions satisfying Assumption 1, including the widely used functions presented in Figure 4. With these preparations, we present a tight estimate for the covering number of in the following proposition, whose proof will be given in Appendix B.
with denoting some norm including the uniform norm and satisfying Assumption 1 was derived. It is obvious that is a shallow net with only one neuron. Based on this interesting result, [14, Chap.16] and  presented a tight estimate for as
If , , and are not very large, i.e., do not grow exponentially with respect to , then it follows from Proposition 3 that
which is the same as (13). Comparing with , we find that adding a layer with bounded parameters does not enlarge the covering number. Thus, Proposition 3 together with Proposition 1 yields that deep nets can approximate more functions than shallow nets without increasing the covering number of shallow nets. Proposition 3 and Proposition 2 show that deep nets can approximate sparse functions better than shallow nets at the same price.
III Learning Rate Analysis
In this section, we present the ERM algorithm on deep nets and provide its near optimal learning rates in learning Lipschitz functions and sparse functions in the framework of learning theory .
III-A Algorithm and assumptions
In learning theory , samples are assumed to be drawn independently according to
, a Borel probability measure on with and for some . The primary objective is the regression function defined by
which minimizes the generalization error
where denotes the conditional distribution at induced by . Let be the marginal distribution of on and be the Hilbert space of square integrable functions on . Then for arbitrary , there holds 
We are devoted to deriving learning rates for the following ERM algorithm
where is the set of deep nets defined by (11). Before presenting the main results, we should introduce some assumptions.
We assume .
Assumption 2 is the -Lipschitz continuous condition for the regression function, which is standard in learning theory [14, 16, 30, 10, 23, 26]. To show the advantage of deep nets learning, we should add the sparseness assumption on .
We assume .
and computer vision.
There exists some constant such that .
Then it can be found in [14, Theorem 3.2] that
where is a constant depending only on , , , and .
Let , and , where satisfies
III-B Learning rate analysis
Since almost everywhere, we have . It is natural for us to project an output function onto the interval by the projection operator
Thus, the estimate studied in this paper is .
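The two steps above, empirical risk minimization followed by projection, can be sketched in a few lines of Python. The sketch makes strong simplifying assumptions that are not part of the paper: the hypothesis class is a tiny finite set of hypothetical candidate functions rather than the set of deep nets with bounded parameters, the loss is the empirical squared error, and `M` stands for an assumed bound on the absolute value of the outputs, with the projection taken to be clipping to the interval from `-M` to `M`.

```python
def project(value, M):
    """Clip a real value to the interval [-M, M]."""
    return max(-M, min(M, value))

def empirical_risk(f, sample):
    """Mean squared error of f on a list of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in sample) / len(sample)

def erm(candidates, sample):
    """Return the candidate with the smallest empirical risk."""
    return min(candidates, key=lambda f: empirical_risk(f, sample))

M = 2.0                                              # assumed output bound
sample = [(x, 2.0 * x) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

# Hypothetical finite hypothesis class: x -> a * x for a few slopes a.
candidates = [lambda x, a=a: a * x for a in (0.0, 1.0, 2.0, 3.0)]

f_hat = erm(candidates, sample)

def estimator(x):
    """Projected ERM output."""
    return project(f_hat(x), M)

print(estimator(0.5), project(3.0, M), project(-5.0, M))
```

The projection never increases the squared error against outputs that already lie in the interval, which is why it is natural to apply it to the ERM output before measuring the learning rate.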
The main results of this paper are the following two learning rate estimates. In the first one, we present the learning rate for algorithm (16) when the smoothness information of the regression function is given.
From Theorem 1, we can derive the following corollary, which states the near optimality of the derived learning rate for .
The proofs of Theorem 1 and Corollary 1 will be postponed to Appendix C. It is shown in Theorem 1 and Corollary 1 that implementing ERM on can reach the near optimal learning rates (up to a logarithmic factor) provided , and are not very large. In fact, neglecting the solvability of algorithm (16), we can set , and . Due to (18), the concrete value of depends on . Taking the logistic function for example, we can set . Theorem 1 and Corollary 1 yield that for some easy learning tasks (exploring only the smoothness information of ), deep nets perform no worse than shallow nets and can reach the almost optimal learning rates for all learning schemes.
In the following theorem, we show that for some difficult learning tasks (exploring sparseness and smoothness information of ), deep nets learning can break through the bottleneck of shallow nets learning by establishing a learning rate much faster than (17).
Similarly, we can obtain the following corollary, which exhibits the derived learning rate in expectation.
Theorem 2 and Corollary 2, whose proofs will be given in Appendix C, show that if the additional sparseness information is imposed, then ERM based on deep nets can break through the optimal learning rates in (17) for shallow nets. In detail, if is 1-sparse in partitions, then we can take to be the logistic function and , and to get a learning rate of order This shows the advantage of deep nets in learning sparse functions.
IV Related Work and Discussions
Stimulated by the great success of deep learning in applications, understanding deep learning as well as its theoretical verification has become a hot topic in approximation and statistical learning theory. Roughly speaking, the studies of deep net approximation can be divided into two categories: deducing the limitations of shallow nets and pursuing the advantages of deep nets.
Limitations of the approximation capabilities of shallow nets were first proposed in  in terms of their incapability of localized approximation. Five years later,  described their limitations by providing lower bounds for the approximation of smooth functions in the minimax sense, which was recently highlighted by  via showing that there exists a probability measure under which all smooth functions cannot be approximated well by shallow nets with high confidence. In , Bengio et al. also pointed out the limitations of some shallow nets in terms of the so-called “curse of dimensionality”. In some recent interesting papers [20, 20, 21], limitations of shallow nets were presented in terms of establishing lower bounds for approximating functions with different variation restrictions.
Studying advantages of deep nets is also a classical topic in neural network approximation. It dates back to 1994, when Chui et al.  deduced the localized approximation property of deep nets, which is far beyond the capability of shallow nets . Recently, more and more advantages of deep nets were theoretically verified in the approximation theory community. In particular,  showed the power of depth of neural networks in approximating hierarchical functions;  demonstrated that deep nets can improve the approximation capability of shallow nets when the data are located on a manifold;  presented the necessity of deep nets in physical problems which possess symmetry, locality or sparsity;  exhibited the outperformance of deep nets in approximating radial functions, and so on. Compared with these results, we focus on showing the good performance of deep nets in approximating sparse functions in the spatial domain and also study the cost of the approximation, just as Propositions 2 and 3 exhibit.
In the learning theory community, learning rates for ERM on shallow nets with certain activation functions were studied in . Under Assumption 2,  derived a near optimal learning rate of order . The novelty of our Theorem 1 is that we focus on learning rates of ERM on deep nets rather than shallow nets, since the deep nets studied in this paper can provide localized approximation. Our result together with  demonstrates that deep nets can learn more functions (such as the indicator function) than shallow nets without sacrificing the generalization capability of shallow nets. Moreover, since deep nets possess the sparse approximation property, it is stated in Theorem 2 that if additional a-priori information is given, then deep nets can break through the optimal learning rate for shallow nets, showing the power of depth in neural network learning. Learning rates for shallow nets equipped with a so-called complexity penalization strategy were presented in [14, Chapter 16]. However, only a variance estimate rather than the learning rate was established in . More importantly, their algorithms and network architectures are different from those in our paper.
In the recent work , a neural network with two hidden layers was developed for the learning purpose and the optimal learning rates of order were presented. It should be noticed that the main idea of the construction in  is the local average argument rather than any optimization strategy such as (16). Furthermore, the network architecture in  is a hybrid of a feed-forward neural network (second hidden layer) and radial basis function networks (first hidden layer). The constructed network in the present paper is a standard deep net possessing the same network architecture in both hidden layers.
In our previous work , we constructed a deep net with three hidden layers when is in a dimensional sub-manifold and provided a learning rate of order . The construction in  was based on the local average argument . The main difference between the present paper and  is that we use an optimization strategy to determine the parameters of deep nets rather than constructing them directly. In particular, the main tool in this paper is a refined estimate for the covering number.
Another related work is , which provided an error analysis of a complexity regularization scheme whose hypothesis space consists of the deep nets with two hidden layers proposed in . They derived a learning rate of under Assumption 2, which is the same as the rate in Theorem 1 up to a logarithmic factor. Neglecting the algorithmic factor, the main novelty of our work is that our analysis combines the expressivity (localized approximation) and generalization capability, while the result in  concerns only the generalization capability. We refer the readers to [5, 7] for some advantages of localized approximation and sparse approximation in the spatial domain.
To finalize the discussion, we mention that the present paper only compares deep nets with two hidden layers with shallow nets and demonstrates the advantage of the former architecture from the approximation and learning theory viewpoints. As far as the optimal learning rate is concerned, to theoretically demonstrate the power of depth, more restrictions on the regression function should be imposed. For example, shallow nets are capable of exploring the smoothness information , deep nets with two hidden layers can tackle both sparseness and smoothness information (Theorem 2 in this paper), and deep nets with more hidden layers succeed in handling sparseness information, smoothness information and manifold features of the input space (combining Theorem 2 in this paper with Theorem 1 in ). In a word, deep nets with more hidden layers can embody more information for the learning task. It is interesting to study the power of depth along this line and determine which information can (or cannot) be explored by deepening the networks.
In this paper, we analyzed the expressivity and generalization of deep nets. Our results showed that without essentially enlarging the capacity of shallow nets, deep nets possess excellent expressive power in terms of providing localized approximation and sparse approximation. Consequently, we proved that for some difficult learning tasks (exploring both sparsity and smoothness), deep nets could break through the optimal learning rates established for shallow nets. All these results showed the power of depth from the learning theory viewpoint.
When , there exists an such that If then
The above assertions together with the definition of yield
This finishes the proof of part (a). We turn to prove assertion (b) in Proposition 1. Since , for all , there holds . Thus, for all , there holds
It follows from the definition of that
Hence, (1) implies
Since is non-decreasing, we have for all . The proof of Proposition 1 is finished. ∎
Since , for each , there exists a such that . Here, if lies on the boundary of some , we denote by an arbitrary but fixed satisfying . Then, it follows from (5) that
Appendix B: Proof of Proposition 3
The aim of this appendix is to prove Proposition 3. Our main idea is to decouple different hidden layers by using Assumption 1 and the definition of the covering number. For this purpose, we need the following five lemmas. The first two can be found in [14, Lemma 16.3] and [14, Theorem 9.5], respectively. The third one can be easily deduced from [14, Lemma 9.2], [14, Theorem 9.4] with and the fact . The last two are well known, and we present their proofs for the sake of completeness.
Let be a family of real functions and let be a fixed nondecreasing function. Define the class . Then