Deep learning is recognized as a state-of-the-art scheme in artificial intelligence and machine learning and has recently triggered enormous research activity. Deep neural networks (deep nets for short) are believed to be capable of discovering deep features of data that are important but cannot be found by shallow neural networks (shallow nets for short). At the same time, deep learning raises a series of challenges concerning efficient computation, algorithmic solvability, robustness, interpretability, and so on. A direct consequence of these challenges is that users hesitate to apply deep learning to high-risk learning tasks such as clinical diagnosis and financial investment, since it is not clear whether deep nets perform essentially better than the schemes at hand. It is therefore urgent and crucial to provide theoretical guidance on the question "when do deep nets perform better than shallow nets?"
Generally speaking, there are three steps in studying the above problem. The first step is to associate specific real-world applications with certain data features. For example, images are assumed to possess local similarity, and earthquake forecasting is related to rotation-invariant features [48]. The second step is to connect these data features with a-priori information that can be mathematically reflected by specific properties of functions. In particular, local similarity usually corresponds to piecewise smooth functions, rotation-invariance generally corresponds to radial functions, and sparseness of the receptive field frequently corresponds to sparseness in the spatial domain. The last step is to establish the outperformance of deep nets in approximating or learning these application-related functions. In fact, the outperformance of deep nets has been rigorously verified in approximating piecewise smooth functions, rotation-invariant functions, and sparse functions, which coincides with the empirical evidence on image classification, earthquake prediction, and computer vision.
With the rapid development of deep net approximation theory, numerous features have been proved to be realizable by deep nets [5, 25, 33, 36, 38, 45] with far fewer neurons than shallow nets. In contrast to these encouraging results, studies in learning theory showed that, to realize these features, the capacities of deep nets are much larger than those of shallow nets with a comparable number of free parameters. In particular, under certain capacity measurements, such as the number of linear regions, Betti numbers, and the number of monomials, it was proved that the capacity of deep nets increases exponentially with respect to the depth but only polynomially with respect to the width. An extreme case is that there exist deep nets with two hidden layers whose capacity, measured by the pseudo-dimension, is infinite [28, 29]. The large capacity of deep nets inevitably makes deep net learners sensitive to noise and requires a large amount of computation to find a good estimator.
In a nutshell, previous studies on the advantages of deep nets showed that deep nets are capable of realizing various application-related data features, but that this realization may require additional capacity costs. The first purpose of our study is to figure out whether the large capacity of deep nets is necessary to realize data features. Our study is based on two interesting observations from the literature [3, 25, 33, 36, 38, 45, 49]. One is that the number of layers needed by deep nets to realize various data features is small; its order is at most the logarithm of the number of free parameters. The other is that the magnitude of the free parameters is relatively small, at most a polynomial in the number of free parameters. With these two findings, we adopt the well-known covering number [51, 52] to measure the capacity of deep nets with a controllable number of layers and magnitude of weights, and present a refined estimate of the covering number of deep nets. In particular, we prove that the covering number of deep nets with controllable depth and magnitude of weights is similar to that of shallow nets with comparable free parameters. This finding, together with existing results in approximation theory, shows that, to realize various features such as sparseness, hierarchy, rotation-invariance, and manifold structures, deep nets improve the performance of shallow nets without bringing additional capacity costs.
As is well known, the advantages of deep nets in realizing some special features do not mean that deep nets are always better than shallow nets. Our second purpose is to examine the necessity of depth in realizing some simple data features. After building a close relation between approximation rates and covering number estimates, we prove that if only the smoothness feature is explored, then, up to a logarithmic factor, the approximation rates of shallow nets and of deep nets with controllable depth and magnitude of weights are asymptotically identical. Combining the above two statements, we present rigorous theoretical verification that deep nets are necessary in a large number of applications involving complex data features, in the sense that deep nets realize these features without any additional capacity costs, but that depth is not always beneficial.
The rest of the paper is organized as follows. In the next section, after reviewing some advantages of deep nets in approximation, we present a refined covering number estimate for deep nets. In Section III, we give a lower bound for deep net approximation to show the limitations of deep nets in realizing simple features. In the last section, we draw the conclusions of this paper.
II Advantages of Deep Nets in Realizing Features
In this section, we study the advantages of deep nets in approximating classes of functions with complex features. After introducing some mathematical concepts associated with deep nets, we review some important results in approximation theory showing that deep nets can realize certain application-related features that cannot be realized by shallow nets with comparable free parameters. Then, we present a refined covering number estimate for deep nets to show that deepening networks in some special ways does not enlarge the capacity of shallow nets.
II-A Deep nets with fixed structures
The great progress of deep learning is built on deepening neural networks with structures. Deep nets with different structures have been proved to be universal; see [53, 54] for deep convolutional nets, and related results for deep nets with tree structures and for deep fully-connected neural networks.
Let $d\in\mathbb{N}$ and $x=(x^{(1)},\dots,x^{(d)})\in\mathbb{R}^d$. Let $L\in\mathbb{N}$ and $d_1,\dots,d_L\in\mathbb{N}$ with $d_0=d$. Assume $\sigma_k$, $k=1,\dots,L$, to be univariate nonlinear functions. For a vector $\vec{t}=(t_1,\dots,t_{d_k})^T$, define $\sigma_k(\vec{t})=(\sigma_k(t_1),\dots,\sigma_k(t_{d_k}))^T$. Deep nets with depth $L$ and width $d_k$ in the $k$-th hidden layer can be mathematically represented as
$$h(x)=\vec{a}\cdot h_L(x),\qquad h_k(x)=\sigma_k(W_k h_{k-1}(x)+\vec{b}_k),\quad k=1,\dots,L,\eqno(1)$$
where $h_0(x)=x$, $\vec{a}\in\mathbb{R}^{d_L}$, $\vec{b}_k\in\mathbb{R}^{d_k}$, and $W_k$ is a $d_k\times d_{k-1}$ matrix. Denote by $\mathcal{H}_{d_1,\dots,d_L}$ the set of all these deep nets. When $L=1$, the function defined by (1) is the classical shallow net.
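As a concrete illustration of this layer-by-layer composition (not part of the formal development), the forward pass can be sketched in Python; the widths, the random weights, and the logistic activation below are arbitrary illustrative choices.

```python
import numpy as np

def logistic(t):
    """Logistic activation, applied componentwise."""
    return 1.0 / (1.0 + np.exp(-t))

def deep_net(x, weights, biases, a, activation=logistic):
    """Forward pass: h_k = sigma(W_k h_{k-1} + b_k), output a . h_L."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights, biases):
        h = activation(W @ h + b)
    return float(a @ h)

# A toy depth-2 net on R^3 with hidden widths 4 and 2 (values arbitrary).
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
biases = [rng.standard_normal(4), rng.standard_normal(2)]
a = rng.standard_normal(2)
y = deep_net([0.1, -0.2, 0.3], weights, biases, a)

# With a single hidden layer, the same code yields a classical shallow net.
y_shallow = deep_net([0.1, -0.2, 0.3], weights[:1], biases[:1], rng.standard_normal(4))
```

Since the logistic outputs lie in $(0,1)$, the output of the toy net is bounded by the 1-norm of the outer weights, in line with the boundedness considerations below.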
The structure of deep nets can be reflected by the structures of the weight matrices and the parameter vectors in each layer. For example, deep convolutional neural networks correspond to Toeplitz-type weight matrices, and deep nets with tree structures usually correspond to extremely sparse weight matrices. Throughout this paper, a deep net with a specific structure refers to a deep net in which the structures of all weight matrices and parameter vectors are specified. Figure 1 shows two structures for deep nets.
Although deep fully-connected neural networks possess better approximation abilities than other networks, the number of free parameters of this type of network is huge when the width and depth are large. A recent focus in deep net approximation is to pursue the approximation ability of deep nets with fixed structures. Up till now, numerous theoretical results [39, 54, 6, 38] have shown that the approximation ability of deep fully-connected neural networks can be maintained by deep nets with certain special structures and far fewer free parameters.
In this paper, we are interested in deep nets with fixed structures. For each hidden layer, we assume that the structure of the deep net is fixed, with a prescribed number of free parameters in the weight matrix, a prescribed number of free thresholds in the bias vector, and a prescribed number of free parameters in the output weights; the total number of free parameters of the deep net is then the sum of these quantities. Throughout the paper, we say a weight matrix has a given number of free parameters if it is generated in one of the following three ways. The first way is that a subset of the entries of the matrix can be determined freely, while the remaining entries are fixed, as in the weight matrices of deep nets with tree structures. The second way is that the weight matrix is generated entirely by a small number of free parameters, as in the Toeplitz-type weight matrices of deep convolutional neural networks. The third way combines the two ways above: part of the weight matrix is fixed, while the remaining part is generated by free parameters. Denote by the corresponding symbol the set of all deep nets with the given number of hidden layers, fixed structure, and free parameters. Denote further
the set of such deep nets whose weights and thresholds are uniformly bounded by some positive number, which may depend on the depth, the widths, and the number of free parameters. We aim at studying the approximation ability and capacity of this set.
It should be mentioned that the boundedness assumption in (II-A) is necessary. In fact, without such an assumption, it was proved in [28, 13] that for an arbitrary accuracy and an arbitrary continuous function, a deep net with two hidden layers and finitely many free parameters can produce an approximant within that accuracy. This implies that the capacity of deep nets with two hidden layers and finitely many free parameters is comparable with that of the whole space of continuous functions, showing its extremely large capacity. Therefore, to further control the capacity of deep nets, the boundedness assumption has been employed in a large body of literature [14, 25, 38].
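To make the three generation mechanisms for structured weight matrices concrete, the following Python sketch builds a Toeplitz-type matrix from a small filter (the second way) and a sparsity-masked matrix (the first way); the function names and sizes are illustrative assumptions, not constructions from the literature.

```python
import numpy as np

def toeplitz_weight(filt, width):
    """Second way: the whole matrix is generated by the free parameters
    in `filt`, as in convolutional layers (Toeplitz-type structure)."""
    filt = np.asarray(filt, dtype=float)
    s = filt.size
    W = np.zeros((width, width + s - 1))
    for i in range(width):
        W[i, i:i + s] = filt  # the filter slides one position per row
    return W

def masked_weight(free_values, mask, fixed=0.0):
    """First way: entries where `mask` is True are free parameters,
    the remaining entries are fixed constants (e.g., tree structures)."""
    W = np.full(mask.shape, fixed, dtype=float)
    W[mask] = free_values
    return W

# 3 free parameters generate a 4x6 Toeplitz-type matrix
# (versus 24 entries for a fully free 4x6 matrix).
W_conv = toeplitz_weight([1.0, -2.0, 0.5], width=4)

# 4 free parameters in a 2x4 tree-like sparsity pattern.
mask = np.array([[True, True, False, False],
                 [False, False, True, True]])
W_tree = masked_weight([0.7, -0.3, 0.4, 0.1], mask)
```

The third way would combine both helpers: fix part of the matrix and generate the remainder from free parameters.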
II-B A fast review of realizing data features by deep nets
In approximation and learning theory, data features are usually formulated as a-priori information on the corresponding functions: the target function for approximation, the regression function for regression, and the Bayes decision function for classification. Studying the advantages of deep nets in approximating functions with different a-priori information is a classical topic. It dates back to 1994, when the localized approximation property of deep nets, which is far beyond the capability of shallow nets, was deduced.
The localized approximation property of a neural network shows that if the target function is modified only on a small subset of the Euclidean space, then only a few neurons, rather than the entire network, need to be retrained. We refer to [3, Def. 2.1] for a formal definition of localized approximation. Since localized approximation is an important stepping-stone in approximating piecewise smooth functions and functions that are sparse in the spatial domain, deep nets perform much better than shallow nets in related applications such as image processing and computer vision. The following proposition, which can be found in [3, Theorem 2.3], shows the localized approximation property of deep nets.
Suppose that $\sigma$ is a bounded measurable function with the sigmoidal property $\lim_{t\to\infty}\sigma(t)=1$ and $\lim_{t\to-\infty}\sigma(t)=0$. Then a deep net with two hidden layers, finitely many neurons, and activation function $\sigma$ provides localized approximation.
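The mechanism behind localized approximation can be illustrated numerically: a difference of two steep sigmoid ridges approximates the indicator of an interval, so modifying the target on a small region only involves the few neurons supported there. The following is a schematic one-dimensional sketch, not the exact construction of [3]; the sharpness parameter `K` is an illustrative choice.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, a, b, K=200.0):
    """Approximate indicator of [a, b] as a difference of two steep
    logistic ridges; K controls the sharpness. A schematic version of
    the localized-approximation idea, not the construction of [3]."""
    return logistic(K * (x - a)) - logistic(K * (x - b))

x = np.linspace(0.0, 1.0, 1001)
ind = bump(x, 0.3, 0.6)   # close to 1 on (0.3, 0.6), close to 0 outside
```

Increasing `K` sharpens the transition at the endpoints, so the approximate indicator can be localized to arbitrarily small regions with just two neurons per ridge.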
Rotation-invariance is another popular data feature, which abounds in statistical physics, earthquake early warning, and image rendering. Mathematically, the rotation-invariance property corresponds to a radial function, which is by definition a function whose value at each point depends only on the distance between that point and the origin. In the nice papers [15, 16], shallow nets were proved to be incapable of embodying rotation-invariance features. To show the power of depth in approximating radial functions, we present the definition of smooth radial functions as follows.
Let $c_0>0$ and $r=s+\beta$ with $s\in\mathbb{N}_0$ and $0<\beta\le 1$. We say a univariate function $g$ is $(r,c_0)$-Lipschitz continuous if $g$ is $s$-times differentiable and its $s$-th derivative satisfies the Lipschitz condition
$$|g^{(s)}(t)-g^{(s)}(t')|\le c_0|t-t'|^{\beta}.$$
Denote by $\mathrm{Lip}^{(r,c_0)}$ the set of all $(r,c_0)$-Lipschitz continuous univariate functions, and by $\mathrm{Lip}^{(\diamond,r,c_0)}$ the set of radial functions $f(x)=g(\|x\|_2^2)$ with $g\in\mathrm{Lip}^{(r,c_0)}$.
The following proposition shows that deep nets can realize rotation-invariance and smoothness features of target functions simultaneously.
Suppose $\sigma$ is the logistic function, i.e., $\sigma(t)=1/(1+e^{-t})$. Then, for an arbitrary radial target function from the class defined above, there exists a deep net achieving the stated approximation accuracy. Furthermore, for an arbitrary shallow net, there always exists a radial function in the class that it cannot approximate at this rate, where the constants involved are independent of the numbers of free parameters.
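A common intuition for why depth helps with radial targets is that a first block of layers can approximate the squared norm, after which a univariate block approximates the profile. The following Python sketch realizes this two-stage structure; it is an illustrative stand-in for the cited constructions (using ReLU units instead of the logistic activation, and an arbitrary cosine profile).

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def pw_square(t, m=64, T=2.0):
    """Approximate u -> u^2 on [-T, T] by a one-hidden-layer ReLU block:
    piecewise-linear interpolation at m+1 equally spaced knots."""
    k = np.linspace(-T, T, m + 1)
    h = k[1] - k[0]
    slopes = (k[1:] ** 2 - k[:-1] ** 2) / h      # slope on each segment
    out = k[0] ** 2 + slopes[0] * relu(t - k[0])
    for i in range(1, m):
        # slope changes by (slopes[i] - slopes[i-1]) at knot k[i]
        out = out + (slopes[i] - slopes[i - 1]) * relu(t - k[i])
    return out

def radial_net(x, g):
    """Two-stage (deep) sketch for a radial target f(x) = g(||x||^2):
    stage 1 approximates each coordinate square with ReLU units,
    stage 2 applies the univariate profile g."""
    sq = sum(pw_square(xi) for xi in x)          # approximates ||x||^2
    return g(sq)

x = np.array([0.6, -0.8])                        # ||x||^2 = 1.0
val = radial_net(x, g=np.cos)                    # target cos(||x||^2)
```

The key point is that the squared norm is computed once by the first block and shared by the second, whereas a shallow net has no such intermediate representation.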
Numerous learning problems in computer vision, gene analysis, and speech processing involve high-dimensional data. These data are often governed by far fewer variables, producing manifold-structure features in a high-dimensional ambient space. A large number of theoretical studies [5, 45, 50] have revealed that shallow nets can hardly realize smoothness and manifold-structure features simultaneously. Conversely, deep nets, as studied in [45, 5], are capable of reflecting these features, as shown by the following proposition.
Suppose the data lie on a smooth low-dimensional compact manifold (without boundary) embedded in the ambient Euclidean space. If $\sigma$ is the ReLU activation function, i.e., $\sigma(t)=\max\{t,0\}$, and the target function is defined on the manifold and twice differentiable, then there exists a deep net achieving the stated approximation accuracy, where the constant involved is independent of the number of free parameters.
The previous studies showed that, compared with shallow nets, deep nets equipped with fewer parameters suffice to approximate functions with complex features to the same accuracy. In Table I, we list some literature studying the advantages of deep nets in realizing data features.
|Reference||Data feature||Activation||Depth|
|[3, 5]||Localized approximation||Sigmoidal||2|
|[22, 40]||Sparse (frequency)||Analytic|| |
II-C Covering number estimates
In the above subsection, we reviewed some results on the advantages of deep nets in realizing data features. However, these results alone do not mean that deep nets are better than shallow nets, since we do not know what price is paid for such advantages in approximation. In this subsection, we use the covering number, which is widely used in learning theory [23, 43, 44, 51, 52], to measure the capacity of deep nets with bounded parameters, and then unify the comparison within the same framework to show the outperformance of deep nets.
Let $\mathcal{B}$ be a Banach space and $V$ a subset of $\mathcal{B}$. Denote by $\mathcal{N}(\varepsilon,V,\mathcal{B})$ the $\varepsilon$-covering number of $V$ under the metric of $\mathcal{B}$, which is the minimal number of elements in an $\varepsilon$-net of $V$. When the metric is the uniform metric, we write $\mathcal{N}(\varepsilon,V)$ for brevity. Our purpose is a tight bound for the covering numbers of the class of deep nets with bounded parameters. To this end, we need the following assumption.
For arbitrary $t,t'\in\mathbb{R}$ and every $k=1,\dots,L$, assume
$$|\sigma_k(t)-\sigma_k(t')|\le C_\sigma|t-t'|$$
for some constant $C_\sigma>0$.
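As a quick numerical sanity check of this Lipschitz-type assumption (illustrative only), one can estimate the largest difference quotient of a candidate activation function on a fine grid; for the logistic function this proxy should not exceed $1/4$, and for the hyperbolic tangent and the ReLU it should not exceed $1$.

```python
import numpy as np

def lipschitz_ratio(f, lo=-10.0, hi=10.0, n=20001):
    """Largest difference quotient of f on a fine grid: a numerical proxy
    for the Lipschitz constant required by the assumption above."""
    t = np.linspace(lo, hi, n)
    v = f(t)
    return float(np.max(np.abs(np.diff(v) / np.diff(t))))

logistic = lambda t: 1.0 / (1.0 + np.exp(-t))
relu = lambda t: np.maximum(t, 0.0)

ratios = {
    "logistic": lipschitz_ratio(logistic),  # sup |sigma'| = 1/4
    "tanh": lipschitz_ratio(np.tanh),       # sup |sigma'| = 1
    "relu": lipschitz_ratio(relu),          # slope 1 for t > 0
}
```

The grid-based quotient is only a lower bound on the true Lipschitz constant, but for these smooth or piecewise-linear functions it is already sharp up to the grid resolution.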
This Lipschitz-type condition allows us to quantify covering numbers of neural networks with different structures. We can see that almost all widely used activation functions, such as the logistic function, the hyperbolic tangent sigmoidal function, the arctan sigmoidal function, the Gompertz function, the ReLU, and the Gaussian function, satisfy Assumption 1. With this assumption, we present our first main result in the following theorem, whose proof can be found in Appendix A.
where the constant is independent of the number of free parameters and the covering radius. From Theorem 1, we can derive
for some constant independent of the depth, the number of free parameters, or the covering radius. Comparing (15) with (14), we find that, up to a logarithmic factor, deep nets do not essentially enlarge the capacity of shallow nets, provided that they have the same number of free parameters and the depth is at most of the order of the logarithm of the number of free parameters. Noting that the depths of the deep nets in Table I all satisfy this constraint, Theorem 1 shows that, to realize the various data features presented in Table I, deep nets can improve the performance of shallow nets without imposing additional capacity costs. Therefore, Theorem 1, together with Table I, explains why deep nets perform much better than shallow nets in some complex learning tasks such as image processing and computer vision.
Recently, tight VC-dimension bounds for piecewise-linear neural networks were presented. In particular, it was proved that the VC-dimension of such networks is of the order of the number of free parameters times the depth, up to a logarithmic factor. Using the standard approach in [9, Chap. 9], we can derive
provided that the activation functions are piecewise linear, where the constant is independent of the number of free parameters or the covering radius. Compared with (17), there is an additional logarithmic factor in our analysis. The reason is that we focus on all activation functions satisfying Assumption 1 rather than only piecewise-linear activation functions. It should also be mentioned that similar covering number estimates for deep nets with tree structures have been studied in [6, 14, 25]. We highlight that different structures require essentially different approaches. In fact, thanks to the tree structure, the approach in [6, 14, 25] simply decouples the layers by using the boundedness and Lipschitz properties of the activation functions. However, in estimating the covering number of deep nets with an arbitrarily fixed structure, we need a novel matrix-vector transformation technique, as presented in Appendix A.
III Necessity of the Depth
Previous studies showed that, to realize some complex data features, deep nets can improve the performance of shallow nets without additional capacity costs. In this section, we proceed in a different direction and prove that, to realize some simple data features, deep nets are not essentially better than shallow nets.
III-A Limitations of deep net approximation
Smoothness, or regularity, is a widely used feature that has been adopted in a vast literature [3, 4, 15, 16, 28, 29, 49]. To present the approximation result, we first introduce the following definition.
Let $c_0>0$ and $r=s+\beta$ with $s\in\mathbb{N}_0$ and $0<\beta\le 1$. We say a function $f$ of $d$ variables is $(r,c_0)$-smooth if $f$ is $s$-times differentiable and, for every multi-index $(\alpha_1,\dots,\alpha_d)$ with $\alpha_1+\dots+\alpha_d=s$, its $s$-th partial derivative satisfies the Lipschitz condition
$$\Big|\frac{\partial^s f}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}(x)-\frac{\partial^s f}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}(x')\Big|\le c_0\|x-x'\|_2^{\beta},$$
where $\|x\|_2$ denotes the Euclidean norm of $x$. Denote by $\mathrm{Lip}^{(r,c_0)}_d$ the set of all $(r,c_0)$-smooth functions of $d$ variables.
Approximating smooth functions is a classical topic in neural network approximation. It is well known that the approximation rate can be as fast as $n^{-r/d}$ for networks with $n$ free parameters when the target is $(r,c_0)$-smooth in $d$ variables. In particular, the Jackson-type error estimate
has been established for shallow nets with analytic activation functions, where the left-hand side of (19) denotes the deviation of the class of smooth functions from the class of networks in the uniform norm. Similar results have been derived for deep nets with two hidden layers and a sigmoidal activation function. Recently, an improved error estimate (20) was derived
for deep nets with ReLU activation functions and suitably chosen depth. We would like to point out that, for shallow nets with ReLU activation functions, estimate (20) holds only for small smoothness indices, which is also considered the approximation bottleneck of shallow nets; deepening the network was shown to overcome this bottleneck. However, it should be mentioned from (19) that for activation functions other than the ReLU, such a bottleneck does not exist. Thus, that work indeed provides a nice analysis of the necessity of deepening ReLU nets, but its results cannot illustrate the necessity of depth in general.
In the following theorem, which will be proved in Appendix C, we show that deep nets cannot be essentially better than shallow nets in realizing the smoothness feature.
Let the smoothness index $r$ and the dimension $d$ be fixed. Then the deviation of the class of $(r,c_0)$-smooth functions from deep nets with controllable depth and bounded weights admits a lower bound matching the rate for shallow nets up to a logarithmic factor, where the constant involved depends only on $r$ and $d$. Consequently, when the depth is not too large, deep nets cannot essentially improve the approximation rate if one only considers the smoothness feature. When the depth is very large, it follows from Theorem 1 that additional capacity costs are required for deep nets to improve the approximation ability of shallow nets. In other words, the smoothness feature alone is not sufficient to judge whether the depth of neural networks is necessary.
III-B Remarks and discussions
Limitations of the approximation capabilities of shallow nets were first studied in terms of lower bounds for approximating smooth functions in the minimax sense. Recently, it was highlighted that there exists a probability measure under which all smooth functions cannot be approximated well by shallow nets with high confidence. In another two interesting papers [18, 19], limitations of shallow nets were presented in terms of lower bounds for approximating functions with certain variation restrictions. However, from these results, it is still not clear whether the depth of neural networks is necessary if only the smoothness information is given.
Theorem 2 goes further along this direction and presents a negative answer: to realize smoothness features, deep nets perform almost the same as shallow nets. This result verifies the common consensus that deep learning outperforms shallow learning in some "difficult" learning tasks, but not always. Moreover, our result also implies that whether deep nets can help to improve the performance of existing learning schemes depends on which data features we are exploring. Combining our work with [35, 36, 14, 38, 33, 5, 6, 25], we illustrate the comparison between shallow and deep nets in Figure 2.
In this paper, we studied the advantages and limitations of deep nets in realizing different data features. Our results showed that, in realizing some complex data features such as rotation-invariance, manifold structures, hierarchical structures, and sparseness, deep nets can improve the performance of shallow nets without additional capacity costs. We also exhibited that, for some simple data features such as smoothness, deep nets perform essentially the same as shallow nets.
Appendix A: Proof of Theorem 1
For each hidden layer, let the admissible weight matrices be those with the fixed structure and the prescribed number of free parameters, and similarly for the admissible threshold vectors. For an input point, define the layer outputs iteratively as in (1). The following lemma is devoted to a uniform bound on the functional vectors produced in each layer. In our analysis, we always assume that the activation functions satisfy Assumption 1 with uniform constants, and that the weights and thresholds obey the boundedness assumption of Section II-A.
For each layer and each deep net in the class under consideration, if the activation function satisfies (13), then there holds
Our second lemma aims at bounding the covering numbers of sets of matrices and vectors with finitely many free parameters.
For arbitrary $\varepsilon>0$ and each layer, we have
where $\|W\|_1$ denotes the 1-norm of the matrix $W$.
An arbitrary matrix can be rewritten as a vector by stacking its entries. Without loss of generality, we assume that the first entries of this vector are the free parameters. Take an $\varepsilon$-net of the set of free-parameter vectors; that is, for each admissible vector, there is an element of the net such that
Then, for the matrices corresponding to an admissible parameter vector and to its nearest element of the net, respectively, there holds
If the remaining entries are fixed constants, the corresponding difference vanishes. If the weight matrix is generated in one of the other two ways, so that some of the remaining entries share the same values as some of the free entries, then we have
Both cases yield
Hence, $\varepsilon$-nets of the sets of free-parameter vectors constitute an $\varepsilon$-net of the set of weight matrices, which implies
where $|\cdot|$ denotes the cardinality of a set. This completes the first estimate. The second estimate can be derived using a similar approach. With these, we complete the proof of Lemma 2. ∎
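The mechanism of Lemma 2 can be illustrated numerically: when every entry of a structured matrix equals one of the free parameters, an $\varepsilon/2$-net of the parameter range induces an $\varepsilon/2$-net of the matrices in the entrywise norm. The sketch below uses a Toeplitz-type structure as an illustrative example (the sizes and ranges are arbitrary choices).

```python
import numpy as np

def toeplitz_from_filter(filt, width):
    """Toeplitz-type matrix generated entirely by the free parameters."""
    s = len(filt)
    W = np.zeros((width, width + s - 1))
    for i in range(width):
        W[i, i:i + s] = filt
    return W

B, eps, width = 1.0, 0.05, 4
net1d = np.arange(-B, B + eps / 2, eps)   # grid of spacing eps: an eps/2-net of [-B, B]

rng = np.random.default_rng(1)
worst = 0.0
for _ in range(200):
    p = rng.uniform(-B, B, size=2)                    # free parameters
    q = net1d[np.abs(net1d[:, None] - p).argmin(0)]   # nearest net point per parameter
    diff = np.abs(toeplitz_from_filter(p, width) - toeplitz_from_filter(q, width)).max()
    worst = max(worst, diff)
# Since every matrix entry is one of the free parameters, covering the
# parameters within eps/2 covers the matrices within eps/2 entrywise.
```

The number of net points grows like $(2B/\varepsilon)$ per free parameter, which is the source of the exponent in the covering number bound of the lemma.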
Based on the previous lemmas, we can derive the following iterative estimates for the covering numbers associated with the affine mappings between layers.
If the activation functions satisfy Assumption 1 for each layer, then the iterative covering number estimate holds layer by layer, with constants determined by the Lipschitz constants in Assumption 1, the bound on the weights and thresholds, and the widths of the layers.