Deep Neural Networks Learn Non-Smooth Functions Effectively

We theoretically discuss why deep neural networks (DNNs) perform better than other models in some cases by investigating statistical properties of DNNs for non-smooth functions. While DNNs have empirically shown higher performance than other standard methods, understanding the mechanism is still a challenging problem. From the viewpoint of statistical theory, it is known that many standard methods attain optimal convergence rates, and thus it has been difficult to find theoretical advantages of DNNs. This paper fills this gap by considering learning of a certain class of non-smooth functions, which was not covered by previous theory. We derive convergence rates of estimators by DNNs with a ReLU activation, and show that the estimators by DNNs are almost optimal for estimating the non-smooth functions, while some popular models do not attain the optimal rate. In addition, our theoretical results provide guidelines for selecting an appropriate number of layers and edges of DNNs. We provide numerical experiments to support the theoretical results.


1. Introduction

Deep neural networks (DNNs) have shown outstanding performance on various tasks of data analysis (Schmidhuber, 2015; LeCun et al., 2015). Enjoying flexible modeling by a multi-layer structure and many elaborate computational and optimization techniques, DNNs empirically achieve higher accuracy than many other machine learning methods such as kernel methods (Hinton et al., 2006; Le et al., 2011; Kingma and Ba, 2014). Hence, DNNs are employed in many successful applications, such as image analysis (He et al., 2016), medical data analysis (Fakoor et al., 2013), natural language processing (Collobert and Weston, 2008), and others.

Despite such outstanding performance of DNNs, little is known about why DNNs outperform the other methods. Without sufficient understanding, practical use of DNNs could be inefficient or unreliable. To reveal the mechanism, numerous studies have investigated theoretical properties of neural networks from various aspects: in approximation theory, the expressive power of neural networks has been analyzed (Cybenko, 1989; Barron, 1993; Bengio and Delalleau, 2011; Montufar et al., 2014; Yarotsky, 2017; Petersen and Voigtlaender, 2017); statistics and learning theory have elucidated generalization errors (Barron, 1994; Neyshabur et al., 2015; Schmidt-Hieber, 2017; Zhang et al., 2017; Suzuki, 2018); and optimization theory has discussed the landscape of the objective function and the dynamics of learning (Baldi and Hornik, 1989; Fukumizu and Amari, 2000; Dauphin et al., 2014; Kawaguchi, 2016; Soudry and Carmon, 2016).

One limitation in the existing statistical analysis of DNNs is the smoothness assumption on the data generating process, which is one of the reasons it is difficult to reveal the advantage of DNNs. In the statistical theory, it is assumed that data are generated from smooth (i.e., differentiable) functions; namely, data are given as

 Yi = f(Xi) + ξi,  ξi ∼ N(0, σ²),

where f is a β-times differentiable function with D-dimensional input. With this setting, however, not only DNNs but also other popular methods (kernel methods, Gaussian processes, series methods, and so on) achieve generalization errors bounded as

 O(n−2β/(2β+D)),

which is known to be optimal in the minimax sense (Stone, 1982; Tsybakov, 2009; Giné and Nickl, 2015). Hence, as long as we employ the smoothness assumption, it is not possible to show theoretical evidence for the empirical advantage of DNNs.

This paper considers learning of non-smooth functions as the data generating process to overcome this difficulty. We prove that DNNs certainly have a theoretical advantage in the non-smooth setting. Specifically, we discuss a nonparametric regression problem with a class of piecewise smooth functions, which are non-smooth on the boundaries of pieces in their domains. Then, we derive convergence rates of least square and Bayes estimators by DNNs with a ReLU activation as

 O(max{n−2β/(2β+D),n−α/(α+D−1)}),

up to log factors (Theorems 1, 2, and Corollary 1). Here, β and α denote degrees of smoothness of the piecewise smooth functions, and D is the dimensionality of inputs. We also prove that this convergence rate by DNNs is optimal in the minimax sense (Theorem 3). In addition, we show that some other popular methods, such as kernel methods and orthogonal series methods with certain bases, cannot estimate the piecewise smooth functions with the optimal convergence rate (Propositions 1 and 2). Notably, in contrast to these models, our result shows that DNNs with a ReLU activation achieve the optimal rate in estimating non-smooth functions, although such DNNs realize continuous functions. We provide some numerical results supporting our results.

Contributions of this paper are as follows:

• We derive the convergence rates of the estimators by DNNs for the class of piecewise smooth functions. Our convergence results are more general than those of existing studies, since the class is regarded as a generalization of smooth functions.

• We prove that DNNs theoretically outperform other standard methods for data from non-smooth generating processes, as a consequence of the proved convergence rates.

• We provide a practical guideline on the structure of DNNs; namely, we show the number of layers and parameters of DNNs needed to achieve the optimal convergence rate. It is shown in Table 1.

All of the proofs are deferred to the supplementary material.

1.1. Notation

We use m and n for natural numbers. The j-th element of a vector x is denoted by xj, and ∥x∥p is the p-norm. vec(·) is a vectorization operator for matrices. For m ∈ N, [m] := {1, …, m} is the set of positive integers no more than m. For a measure P on ID and a function f : ID → R, ∥f∥Lp(P) denotes the Lp(P) norm. ⊗ denotes a tensor product, and ⊗j∈[J] xj := x1 ⊗ ⋯ ⊗ xJ for a sequence {xj}j∈[J].

2. Regression with Non-Smooth Functions

We formulate a regression problem in which the function generating the data is non-smooth. Firstly, we give a brief outline of the regression problem, and secondly, we introduce a class of non-smooth functions.

2.1. Regression Problem

In this paper, we use the D-dimensional cube ID := [0, 1]^D for the domain of data. Suppose we have a set of n observations (Xi, Yi), i ∈ [n], which are independently and identically distributed according to the data generating process

 Yi=f∗(Xi)+ξi, (1)

where f∗ : ID → R is an unknown true function and ξi is independent Gaussian noise with mean 0 and variance σ² > 0 for i ∈ [n]. We assume that the marginal distribution of X on ID has a positive and bounded density function PX(x).

The goal of the regression problem is to estimate f∗ from the set of observations Dn := {(Xi, Yi)}i∈[n]. The performance of an estimator ˆf is measured by the L²(PX) norm:

 ∥ˆf−f∗∥2L2(PX)=EX[(ˆf(X)−f∗(X))2].

There are various methods to estimate f∗, and their statistical properties are extensively investigated (for a summary, see Wasserman (2006) and Tsybakov (2009)).

A classification problem can also be analyzed through the regression framework. For instance, consider a Q-class classification problem with covariates Xi and labels Zi ∈ [Q] for i ∈ [n]. To describe the classification problem, we consider a Q-dimensional vector-valued function f∗ = (f∗1, …, f∗Q) and a generative model for Zi as

 Zi = argmaxq∈[Q] f∗q(Xi).

Here, estimating f∗ can solve the classification problem (for a summary, see Steinwart and Christmann (2008)).

2.2. Piecewise Smooth Functions

To describe non-smoothness of functions, we introduce a notion of piecewise smooth functions, which have a support divided into several pieces and are smooth only within each of the pieces. On the boundaries of the pieces, piecewise smooth functions are non-smooth, i.e., non-differentiable and even discontinuous. Figure 1 shows an example of a piecewise smooth function.

As preparation, we introduce notions of (i) smooth functions and (ii) pieces in supports. Afterwards, we combine them and provide the notion of (iii) piecewise smooth functions.

(i). Smooth Functions

We introduce the Hölder space to describe smooth functions. With a parameter β > 0, the Hölder norm of f : ID → R is defined as

 ∥f∥Hβ := max_{|a|≤⌊β⌋} sup_{x∈ID} |∂a f(x)| + max_{|a|=⌊β⌋} sup_{x,x′∈ID, x≠x′} |∂a f(x) − ∂a f(x′)| / |x − x′|^{β−⌊β⌋},

where a denotes a multi-index of differentiation and ∂a denotes the corresponding partial derivative. Then, the Hölder space on ID is defined as

 Hβ := {f : ID → R | ∥f∥Hβ < ∞}.

Intuitively, Hβ contains functions that are ⌊β⌋-times differentiable and whose ⌊β⌋-th derivatives are (β − ⌊β⌋)-Hölder continuous.

The Hölder space is popularly used for representing smooth functions, and many statistical methods can effectively estimate functions in the Hölder space (for a summary, see Giné and Nickl (2015)).

(ii). Pieces in Supports

To describe pieces in supports, we introduce an extended notion of the boundary fragment class developed by Dudley (1974) and Mammen et al. (1999).

Preliminarily, we consider the sphere S^{D−1} in R^D whose center is the origin. With j ∈ [J], let Sj be sets in S^{D−1} such that ∪j∈[J] Sj = S^{D−1}.

We provide a notion of boundaries of a piece in ID using S^{D−1}. Let B^{D−1} be an open ball in R^{D−1}, and Fj : B̄^{D−1} → Sj be a surjective function for j ∈ [J]. With a parameter α > 0, let Gα,J be the set of boundaries, defined by

 Gα,J := {g = (g1, …, gD) | g injective, gd : S^{D−1} → I, gd∘Fj ∈ Hα(B̄^{D−1}), j ∈ [J], d ∈ [D]},

where Hα(B̄^{D−1}) denotes the Hölder space of smooth functions on B̄^{D−1}. Intuitively, a boundary g is α-times differentiable except at the frontier points of the Sj.

Given g ∈ Gα,J as the boundary of a piece, we define Int(g) as the interior of g (a detailed definition is provided by Dudley (1974)). At last, we define Rα,J as a set of pieces in ID such as

 Rα,J := {Int(g) : g ∈ Gα,J}.

Figure 2 shows a brief example.

We mention that Rα,J can describe a wide range of pieces (Dudley, 1974): for example, a subfamily of Rα,J is dense in the set of all convex sets in ID.

(iii). Piecewise Smooth Functions

Using Hβ and Rα,J, we define piecewise smooth functions. Let M ∈ N be a number of pieces of the support ID. With a piece R ⊂ ID, let 1R be the indicator function such that

 1R(x) = 1 if x ∈ R, and 1R(x) = 0 if x ∉ R.

We define the set of piecewise smooth functions as

 FM,J,α,β := {∑_{m=1}^M fm ⊗ 1Rm : fm ∈ Hβ, Rm ∈ Rα,J}.

Since fm ⊗ 1Rm takes the value of fm only when x ∈ Rm, the notion of FM,J,α,β expresses a combination of smooth functions fm on each piece Rm. Hence, functions in FM,J,α,β are non-smooth (and even discontinuous) on the boundaries of the Rm. Obviously, Hβ ⊂ FM,J,α,β with M = 1 and R1 = ID, hence the notion of piecewise smooth functions describes a strictly wider class of functions.
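As a concrete illustration, a member of this class can be evaluated by summing smooth functions masked by indicators; the pieces and smooth parts below are hypothetical choices for this sketch, not objects from the paper:

```python
import numpy as np

def indicator(R, x):
    """1_R(x): 1 if x lies in the piece R (given as a predicate), else 0."""
    return 1.0 if R(x) else 0.0

# Two hypothetical pieces of I^2 = [0, 1]^2, split by the line x1 + x2 = 1.
R1 = lambda x: x[0] + x[1] <= 1.0
R2 = lambda x: x[0] + x[1] > 1.0

# Smooth functions assigned to each piece (both lie in a Hoelder class).
f1 = lambda x: np.sin(np.pi * x[0])
f2 = lambda x: 1.0 + x[1] ** 2

def f_piecewise(x):
    """f(x) = f1(x) 1_{R1}(x) + f2(x) 1_{R2}(x): smooth inside each piece,
    discontinuous across the boundary x1 + x2 = 1."""
    return f1(x) * indicator(R1, x) + f2(x) * indicator(R2, x)
```

Evaluating f_piecewise on both sides of the boundary exhibits the jump that makes the function non-smooth.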

3. Analysis for Estimation by DNNs

In this section, we provide estimators for the regression problem by DNNs and derive their theoretical properties. Firstly, we define a statistical model by DNNs. Afterwards, we investigate two estimators by DNNs: a least square estimator and a Bayes estimator.

3.1. Models by Deep Neural Networks

Let L be the number of layers in DNNs. For ℓ ∈ [L+1], let Dℓ be the dimensionality of the variables in the ℓ-th layer. For brevity, we set D_{L+1} = 1, i.e., the output is one-dimensional. We define Aℓ ∈ R^{Dℓ+1×Dℓ} and bℓ ∈ R^{Dℓ+1} to be the matrix and vector parameters that give the transform of the ℓ-th layer. The architecture of a DNN is the set of pairs (Aℓ, bℓ):

 Θ := ((A1, b1), …, (AL, bL)).

We define |Θ| to be the number of layers in Θ, ∥Θ∥0 as the number of non-zero elements in Θ, and ∥Θ∥∞ as the largest absolute value of the parameters in Θ.

For an activation function η applied element-wise in each layer, this paper considers the ReLU activation η(x) = max{x, 0}.

The model of neural networks with architecture Θ and activation η is the function Gη[Θ] : ID → R, which is defined inductively as

 Gη[Θ](x) = x^{(L+1)},

with

 x^{(1)} := x,  x^{(ℓ+1)} := η(Aℓ x^{(ℓ)} + bℓ), for ℓ ∈ [L],

where L = |Θ| is the number of layers. The set of model functions by DNNs is thus given by

 FNN,η(S, B, L′) := {Gη[Θ] : ID → R | ∥Θ∥0 ≤ S, ∥Θ∥∞ ≤ B, |Θ| ≤ L′},

with S > 0, B > 0, and L′ ∈ N. Here, ∥Θ∥0 ≤ S bounds the number of non-zero parameters of DNNs by S, namely, the number of edges of an architecture in the networks; this also describes the sparseness of DNNs. B is a bound on the scale of the parameters.
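The inductive definition of Gη[Θ] transcribes directly into code; the layer widths and random parameter values below are illustrative, not taken from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def G(theta, x, eta=relu):
    """Evaluate G_eta[Theta](x) for Theta = ((A_1, b_1), ..., (A_L, b_L)):
    x^(1) = x and x^(l+1) = eta(A_l x^(l) + b_l) for l in [L]."""
    h = np.asarray(x, dtype=float)
    for A, b in theta:
        h = eta(A @ h + b)
    return h

def norm0(theta):
    """||Theta||_0: the number of non-zero parameters (the sparsity S)."""
    return sum(np.count_nonzero(A) + np.count_nonzero(b) for A, b in theta)

# A tiny illustrative architecture: D = 2 inputs, one hidden layer of
# width 3, scalar output.
rng = np.random.default_rng(0)
theta = [(rng.standard_normal((3, 2)), rng.standard_normal(3)),
         (rng.standard_normal((1, 3)), rng.standard_normal(1))]
y = G(theta, [0.2, 0.7])
```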

3.2. Least Square Estimator

Using the model of DNNs, we define a least square estimator by empirical risk minimization. Using the observations Dn, we consider the minimization problem with respect to the parameters of DNNs,

 ˆfL ∈ argmin_{f∈FNN,η(S,B,L)} (1/n) ∑i∈[n] (Yi − f(Xi))², (2)

and use ˆfL as an estimator of f∗.

Note that the problem (2) has at least one minimizer since the parameter set is compact and η is continuous. If necessary, we can add a regularization term to the problem (2); it is not difficult to extend our results to an estimator with regularization. Furthermore, we can apply early stopping techniques, since they play a role of regularization (LeCun et al., 2015). However, for simplicity, we confine the arguments of this paper to the least square case.
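A minimal sketch of the empirical risk minimization (2) for a one-hidden-layer ReLU network on synthetic one-dimensional data; the network width, learning rate, and plain gradient descent are choices of this sketch, not prescriptions of the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from a non-smooth target: a step function plus noise.
n = 200
X = rng.uniform(0.0, 1.0, size=n)
Y = (X > 0.5).astype(float) + 0.05 * rng.standard_normal(n)

# One-hidden-layer ReLU network f(x) = w2 . relu(w1 * x + b1) + b2.
H = 16
w1 = rng.standard_normal(H)
b1 = rng.standard_normal(H)
w2 = 0.1 * rng.standard_normal(H)
b2 = 0.0

def predict(x):
    return np.maximum(np.outer(x, w1) + b1, 0.0) @ w2 + b2

def risk():
    """Empirical squared risk (1/n) sum_i (Y_i - f(X_i))^2."""
    return float(np.mean((Y - predict(X)) ** 2))

risk0 = risk()
lr = 0.05
for _ in range(2000):
    Z = np.outer(X, w1) + b1            # pre-activations, shape (n, H)
    A = np.maximum(Z, 0.0)              # hidden activations
    r = A @ w2 + b2 - Y                 # residuals, shape (n,)
    back = np.outer(r, w2) * (Z > 0)    # gradient flowing through the ReLU
    w2 -= lr * (2.0 / n) * (A.T @ r)
    b2 -= lr * 2.0 * r.mean()
    w1 -= lr * (2.0 / n) * (back.T @ X)
    b1 -= lr * 2.0 * back.mean(axis=0)
```

The empirical risk after training is far below its initial value, i.e., the network fits the step-shaped target despite its discontinuity.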

We investigate the convergence properties of ˆfL with a ReLU activation.

Theorem 1.

Suppose f∗ ∈ FM,J,α,β. Then, there exist constants c, c′, CL > 0 and S, B, L satisfying

1. ,

2. ,

3. ,

such that ˆfL provides

 ∥ˆfL−f∗∥2L2(PX)≤CLmax{n−2β/(2β+D),n−α/(α+D−1)}logn, (3)

with probability at least 1 − c′.

The proof of Theorem 1 combines set estimation theory (Dudley, 1974; Mammen and Tsybakov, 1995), approximation theory for DNNs (Yarotsky, 2017; Petersen and Voigtlaender, 2017), and applications of empirical process techniques (Koltchinskii, 2006; Giné and Nickl, 2015; Suzuki, 2018).

The convergence rate in Theorem 1 is interpreted as follows. The first term, n−2β/(2β+D), describes the effect of estimating fm ∈ Hβ for m ∈ [M]; this rate corresponds to the minimax optimal rate for estimating smooth functions in Hβ (for a summary, see Tsybakov (2009)). The second term, n−α/(α+D−1), reveals the effect of estimating 1Rm for Rm ∈ Rα,J through estimating the boundaries of the Rm; the same rate appears in the problem of estimating sets with smooth boundaries (Mammen and Tsybakov, 1995).

We remark that a larger number of layers L decreases the parameter bound B. Considering the result of Bartlett (1998), which shows that large values of parameters worsen the performance of DNNs, this suggests that a deep structure can avoid the performance loss caused by large parameters.

We also mention that our theoretical result is independent of the non-convex optimization problem. Suppose an optimization method fails to obtain the minimizer, i.e., we obtain a solution ˇf whose empirical risk exceeds that of ˆfL by an error Δ > 0. Then, the error of ˇf is evaluated as

 Ef∗[∥ˇf−f∗∥2L2(PX)]≤CLmax{n−2β/(2β+D),n−α/(α+D−1)}logn+Δ,

since we can evaluate the estimation error and the optimization error independently. Here, Ef∗[·] denotes the expectation with respect to the true distribution of Dn. Thus, combining results on the magnitude of Δ (e.g. Kawaguchi (2016)), we can evaluate the error in the case of non-convex optimization.

3.3. Bayes Estimator

We define a Bayes estimator for DNNs, which can avoid the non-convexity problem in optimization. Fix an architecture Θ with given S, B, and L. Then, a prior distribution for f is defined through providing distributions for the parameters contained in Θ. Let Π(A)ℓ and Π(b)ℓ be distributions of Aℓ and bℓ as

 Aℓ∼Π(A)ℓ,  bℓ∼Π(b)ℓ

for ℓ ∈ [L]. We set Π(A)ℓ and Π(b)ℓ such that each of the non-zero parameters of Θ is distributed on [−B, B], and the other parameters degenerate at 0. Using these distributions, we define a prior distribution ΠΘ on Θ by

 ΠΘ:=⨂ℓ∈[L]Π(A)ℓ⊗Π(b)ℓ.

Then, a prior distribution Πf for f is defined by

 Πf(f):=ΠΘ(Θ:Gη[Θ]=f).

We consider the posterior distribution for f. Since the noise ξi in (1) is Gaussian with variance σ², the posterior distribution given the dataset Dn is

 dΠf(f|Dn)=exp(−∑i∈[n](Yi−f(Xi))2/σ2)dΠf(f)∫exp(−∑i∈[n](Yi−f′(Xi))2/σ2)dΠf(f′).

Note that we do not discuss computational issues of the Bayesian approach, since our main focus is the theoretical aspect. For the computational problems, see Hernández-Lobato and Adams (2015) and others.
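To make the posterior concrete, the following toy sketch approximates a posterior mean by weighting prior draws with the Gaussian likelihood (self-normalized importance sampling); the scalar model family and the uniform prior are placeholders for this sketch, not the paper's DNN prior:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model family f_theta(x) = theta * x with a uniform prior on [-B, B].
n, B, sigma = 50, 2.0, 0.3
X = rng.uniform(0.0, 1.0, size=n)
Y = 1.5 * X + sigma * rng.standard_normal(n)      # true parameter: 1.5

# Weight prior draws by the Gaussian likelihood, mirroring
# dPi(f|Dn) proportional to exp(-sum_i (Y_i - f(X_i))^2 / sigma^2) dPi(f).
thetas = rng.uniform(-B, B, size=20000)
loglik = np.array([-np.sum((Y - t * X) ** 2) / sigma ** 2 for t in thetas])
w = np.exp(loglik - loglik.max())                 # stabilized weights
w /= w.sum()

theta_bayes = float(np.sum(w * thetas))           # approximate posterior mean
```

The weighted average concentrates near the true parameter, mirroring how the posterior contracts around f∗ as described by Theorem 2.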

We provide a theoretical analysis of the rate of contraction of the posterior distribution. As in the least square estimator case, we consider a ReLU activation function.

Theorem 2.

Suppose f∗ ∈ FM,J,α,β. Then, there exist constants c2, CB > 0 and an architecture satisfying the following conditions:

1. ,

2. ,

3. ,

and a prior distribution , such that the posterior distribution provides

 Ef∗[Πf(f:∥f−f∗∥2L2(PX)≥rCB×max{n−2β/(2β+D),n−α/(α+D−1)}logn|Dn)] ≤exp(−r2c2max{nD/(2β+D),n(D−1)/(α+D−1)}), (4)

for all r > 0.

For the proof of Theorem 2, we additionally apply results from the statistical analysis of Bayesian nonparametrics (van der Vaart and van Zanten, 2008, 2011).

Based on the result, we define a Bayes estimator as

 ˆfB:=∫fdΠf(f|Dn),

by the Bochner integral in L²(PX). Then, we obtain the convergence rate of ˆfB by the following corollary.

Corollary 1.

In the same setting as Theorem 2, consider the Bayes estimator ˆfB. Then, we have

 Ef∗[∥ˆfB−f∗∥2L2(PX)]≤CBmax{n−2β/(2β+D),n−α/(α+D−1)}logn.

This result states that the Bayes estimator can achieve the same convergence rate as the least square estimator of Theorem 1. Since the Bayes estimator does not use optimization, we can avoid the non-convex optimization problem, although the computation of the posterior and its mean is not straightforward.

4. Discussion: Why Do DNNs Work Better?

We discuss why DNNs work better than some other popular methods. Firstly, we show that the convergence rates of DNNs in Theorems 1 and 2 are optimal for estimating functions in the piecewise smooth function class. Secondly, we provide additional evidence that other methods are not suitable for piecewise smooth functions. Finally, we add some discussion.

4.1. Optimality of the DNN Estimators

We show the optimality of the convergence rates of the DNN estimators in Theorem 1 and Corollary 1. To this end, we employ the theory of minimax optimal rates from mathematical statistics (Giné and Nickl, 2015). The theory derives a lower bound on the convergence rate of arbitrary estimators, and thus yields a theoretical limitation on convergence rates.

The result of the minimax optimal rate for the class of piecewise smooth functions is shown in the following theorem.

Theorem 3.

Let ¯f be an arbitrary estimator of f∗ ∈ FM,J,α,β. Then, there exists a constant Cmm > 0 such that

 inf¯fsupf∗∈FM,J,α,βEf∗[∥¯f−f∗∥2L2(PX)]≥Cmmmax{n−2β/(2β+D),n−α/(α+D−1)}.

The proof of Theorem 3 employs techniques of the minimax theory developed by Yang and Barron (1999) and Raskutti et al. (2012).

The convergence rates of the estimators by DNNs are thus optimal in the minimax sense, since the rates in Theorems 1 and 2 correspond to the lower bound of Theorem 3 up to a log factor. In other words, for estimating FM,J,α,β, no other method can achieve a better convergence rate than the estimators by DNNs.

4.2. Inefficiency of Other Methods

We consider kernel methods and orthogonal series methods as representatives of other standard methods, and show that these methods are not optimal for estimating piecewise smooth functions.

Kernel methods are popular for estimating functions in the field of machine learning (Rasmussen and Williams, 2006; Steinwart and Christmann, 2008). It is also well known that the theoretical aspects of kernel methods are equivalent to those of Gaussian process regression (van der Vaart and van Zanten, 2008). An estimator by the kernel method is defined as

 ˆfK ∈ argmin_{f∈Hk} (1/n) ∑i∈[n] (Yi − f(Xi))² + λ∥f∥²Hk,

where k is a kernel function, Hk is the reproducing kernel Hilbert space given by k with its norm ∥·∥Hk, and λ > 0 is a regularization coefficient as a hyper-parameter. Here, we consider two standard kernel functions: the Gaussian kernel and the polynomial kernel. In the Gaussian kernel case, it is known that such estimators are optimal when f∗ is smooth (Steinwart and Christmann, 2008). We provide a theoretical result about ˆfK for estimating non-smooth functions.

Proposition 1.

Fix M, J, α, and β arbitrarily. Let ˆfK be the kernel estimator with the Gaussian kernel or the polynomial kernel. Then, there exist f∗ ∈ FM,J,α,β and a constant CK > 0 such that

 Ef∗[∥ˆfK−f∗∥2L2(PX)]→CK,

as n → ∞.

Since these kernel functions are not appropriate for expressing the non-smooth structure of f∗, the set of functions generated by them does not contain some f∗ ∈ FM,J,α,β. Although the Gaussian kernel is a universal kernel, i.e., the RKHS of the Gaussian kernel is dense in the class of continuous functions, some f∗ ∈ FM,J,α,β have a discontinuous structure, hence kernel methods with these kernels cannot estimate f∗ consistently. Similar properties hold for other smooth kernel functions.
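For reference, a kernel ridge regressor with the Gaussian kernel can be sketched with the textbook closed-form solution; the bandwidth, regularization value, and test points are illustrative assumptions:

```python
import numpy as np

def gauss_kernel(X1, X2, bandwidth=0.2):
    """Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2))."""
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def kernel_ridge_fit(X, Y, lam=1e-3):
    """Solve (K + n lam I) alpha = Y; predict f(x) = sum_i alpha_i k(x, X_i)."""
    n = len(Y)
    K = gauss_kernel(X, X)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), Y)
    return lambda Xnew: gauss_kernel(Xnew, X) @ alpha

# A discontinuous target: every function in the Gaussian RKHS is smooth,
# so the fitted function smears the jump at x = 0.5.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 1.0, size=(100, 1))
Y = (X[:, 0] > 0.5).astype(float)
f_hat = kernel_ridge_fit(X, Y)
```

Away from the boundary the fit is accurate, but near x = 0.5 the prediction transitions smoothly instead of jumping, which is the mechanism behind Proposition 1.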

Orthogonal series methods, also known as Fourier methods, estimate functions using an orthonormal basis. They are among the most fundamental methods for nonparametric regression (for an introduction, see Section 1.7 in Tsybakov (2009)). Let {ϕj}j∈N be orthonormal basis functions in L²(ID). An estimator of f∗ by the orthogonal series method is defined as

 ˆfS(x):=∑j∈[J]ˆγjϕj(x),

where J ∈ N is a hyper-parameter and ˆγj is a coefficient calculated as ˆγj := n^{−1} ∑i∈[n] Yi ϕj(Xi). When the true function is smooth, i.e., f∗ ∈ Hβ, ˆfS is known to be optimal in the minimax sense (Tsybakov, 2009). About estimation of f∗ ∈ FM,J,α,β, we obtain the following proposition.

Proposition 2.

Fix M, J, α, and β arbitrarily. Let ˆfF be the estimator by the orthogonal series method, and suppose {ϕj} is the trigonometric basis or the Fourier basis. Then, for sufficiently large n, there exist f∗ ∈ FM,J,α,β, a constant CF > 0, and a parameter κ satisfying

 −κ>max{−2β/(2β+D),−α/(α+D−1)},

such that

 Ef∗[∥ˆfF−f∗∥2L2(PX)]>CFn−κ.

Proposition 2 shows that ˆfF can estimate f∗ consistently, since the orthogonal basis of L²(ID) can represent all square-integrable functions. Its convergence rate is, however, strictly worse than the optimal rate. Intuitively, the method requires many basis functions to express the non-smooth structure of f∗, and a large number of bases increases the variance of the estimator, hence it loses efficiency.
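A one-dimensional sketch of the series estimator with the trigonometric basis, using the coefficient formula ˆγj = n^{−1} ∑i Yi ϕj(Xi) from above; the basis ordering and the truncation level J are choices of this sketch:

```python
import numpy as np

def trig_basis(j, x):
    """Trigonometric basis on [0, 1]: phi_0 = 1, then cos/sin pairs."""
    if j == 0:
        return np.ones_like(x)
    k = (j + 1) // 2
    if j % 2 == 1:
        return np.sqrt(2.0) * np.cos(2.0 * np.pi * k * x)
    return np.sqrt(2.0) * np.sin(2.0 * np.pi * k * x)

def series_fit(X, Y, J):
    """Series estimator: gamma_j = n^{-1} sum_i Y_i phi_j(X_i)."""
    gammas = [np.mean(Y * trig_basis(j, X)) for j in range(J)]
    return lambda x: sum(g * trig_basis(j, x) for j, g in enumerate(gammas))

rng = np.random.default_rng(4)
n = 2000
X = rng.uniform(0.0, 1.0, size=n)
Y = (X > 0.5).astype(float)          # discontinuous target, no noise
f_hat = series_fit(X, Y, J=9)
```

With only J = 9 basis functions the fit misses the jump by a visible margin (Gibbs-type oscillation), illustrating why non-smooth targets force the method to use many bases and hence incur large variance.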

4.3. Interpretation on Our Result

According to the results, the estimators by DNNs have a theoretical advantage over the others for estimating f∗ ∈ FM,J,α,β, since the estimators by DNNs achieve the optimal convergence rate while the others do not.

We provide an intuition on why DNNs are optimal while the others are not. The most notable fact is that DNNs can realize non-smooth functions with a small number of parameters, due to the activation functions and the multi-layer structure. A combination of two ReLU functions can approximate a step function, and composing such step functions with other parts of the network can easily express smooth functions restricted to pieces. In contrast, even though the other methods have the universal approximation property, they require a larger number of parameters to represent non-smooth structures. By statistical theory, a larger number of parameters increases the variance of estimators and worsens the performance, hence the other methods lose optimality.
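The first observation can be checked numerically: the difference of two ReLUs with slope 1/ε is a ramp of width ε that approximates the step indicator 1{x > 0}; the width ε below is an arbitrary illustrative value:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def soft_step(x, eps=1e-3):
    """relu(x/eps) - relu(x/eps - 1): equals 0 for x <= 0, 1 for x >= eps,
    and ramps linearly in between -- a step built from two ReLU units."""
    return relu(x / eps) - relu(x / eps - 1.0)

x = np.array([-0.5, -1e-6, 0.0005, 0.01, 0.5])
y = soft_step(x)   # approaches the indicator 1{x > 0} as eps shrinks
```

Only two parameters per jump are needed, in contrast to the many basis functions the series method requires for the same discontinuity.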

Regarding the inefficiency of the other methods, we do not claim that every statistical method except DNNs fails to attain optimality for estimating piecewise smooth functions. Our argument concerns the advantage of DNNs against commonly used methods, such as orthogonal series methods and kernel methods. There may exist other models that achieve optimality as DNNs do, and finding them is an interesting direction for future work.

Estimation using non-smooth kernels or basis functions is also an interesting direction. While some studies have investigated such situations (van Eeden, 1985; Wu and Chu, 1993a, b; Wolpert et al., 2011; Imaizumi et al., 2018), these works focus on different settings such as density estimation or univariate data analysis, hence they do not fit the problems discussed here.

5. Experiments

We carry out simple experiments to support our theoretical results.

5.1. Non-smooth Realization by DNNs

We show how the estimators by DNNs can estimate non-smooth functions. To this end, we consider the following data generating process with a piecewise smooth function. Let D = 2, ξi be independent Gaussian noise, and Xi be uniform random variables on I². Then, we generate n pairs of (Xi, Yi) from (1) with a true function f∗ given by the piecewise smooth function

 f∗(x) = 1R1(x)(0.2+x21+0.1x2)+1R2(x)(0.7+0.01|4x1+10x2−9|1.5), (5)

with sets R1 and R2 that partition I². A plot of f∗ in Figure 4 shows its non-smooth structure.
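The true function (5) can be coded directly. Since the exact definitions of R1 and R2 and the noise scale were lost in this copy, the partition below (by the line 4x1 + 10x2 = 9, matching the boundary term in (5)) and the value of σ are assumptions of this sketch:

```python
import numpy as np

def f_star(x1, x2):
    """Piecewise smooth true function of (5) on I^2 = [0, 1]^2.
    Assumption: the pieces are split by the line 4*x1 + 10*x2 = 9."""
    in_R1 = 4.0 * x1 + 10.0 * x2 < 9.0
    smooth1 = 0.2 + x1 ** 2 + 0.1 * x2
    smooth2 = 0.7 + 0.01 * np.abs(4.0 * x1 + 10.0 * x2 - 9.0) ** 1.5
    return np.where(in_R1, smooth1, smooth2)

def generate(n, sigma=0.1, seed=0):
    """Draw (X_i, Y_i) from model (1); sigma is an assumed noise scale."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    Y = f_star(X[:, 0], X[:, 1]) + sigma * rng.standard_normal(n)
    return X, Y

X, Y = generate(500)
```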

For the estimation by DNNs, we employ the least square estimator (2). For the architecture, we fix the number of layers and the dimensionality of each layer, and use a ReLU activation. To mitigate the effect of the non-convex optimization problem, we employ multiple initial points generated from a Gaussian distribution with an adjusted mean, and we employ Adam (Kingma and Ba, 2014) for optimization.

We generate data with a sample size n and obtain the least square estimator ˆfL of f∗. Then, we plot in Figure 4 the estimate that minimizes the error among the trials with different initial points. We can observe that ˆfL succeeds in approximating the non-smooth structure of f∗.

5.2. Comparison with the Other Methods

We compare the performance of the estimator by DNNs, the orthogonal series method, and the kernel methods. For the estimator by DNNs, we inherit the setting of Section 5.1. For the kernel methods, we employ estimators with the Gaussian kernel and the polynomial kernel; the bandwidth of the Gaussian kernel, the degree of the polynomial kernel, and the regularization coefficients are selected from candidate grids. For the orthogonal series method, we employ the trigonometric basis, a variation of the Fourier basis. All of the hyper-parameters are selected by cross-validation.

We generate data from the process (1) with the function (5), varying the sample size n, and measure the expected loss of each method. In Figure 5, we report the mean and standard deviation of the logarithm of the loss over the replications. The estimator by DNNs consistently outperforms the other estimators. The other methods cannot estimate the non-smooth structure of f∗, although some of them have the universal approximation property.

6. Conclusion and Future Work

In this paper, we have derived theoretical results that explain why DNNs outperform other methods. To this end, we considered a regression problem in which the true function is piecewise smooth. We focused on the least square and Bayes estimators and derived their convergence rates. Notably, we showed that the rates are optimal in the minimax sense. Furthermore, we proved that the commonly used orthogonal series methods and kernel methods are inefficient for estimating piecewise smooth functions; hence, the estimators by DNNs work better than these methods for non-smooth functions. We also provided a guideline for selecting the number of layers and parameters of DNNs based on the theoretical results.

Investigating architecture selection for DNNs remains future work. While our results show the existence of an architecture of DNNs that achieves the optimal rate, we did not discuss how to learn the optimal architecture from data effectively. Practically and theoretically, this is an important problem for analyzing the mechanism of DNNs.

References

• Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009) Neural network learning: Theoretical foundations, Cambridge University Press.
• Baldi and Hornik (1989) Baldi, P. and Hornik, K. (1989) Neural networks and principal component analysis: Learning from examples without local minima, Neural Networks, 2, 53–58.
• Barron (1993) Barron, A. R. (1993) Universal approximation bounds for superpositions of a sigmoidal function, IEEE Transactions on Information Theory, 39, 930–945.
• Barron (1994) Barron, A. R. (1994) Approximation and estimation bounds for artificial neural networks, Machine learning, 14, 115–133.
• Bartlett (1998) Bartlett, P. L. (1998) The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network, IEEE transactions on Information Theory, 44, 525–536.
• Bengio and Delalleau (2011) Bengio, Y. and Delalleau, O. (2011) On the expressive power of deep architectures, in Algorithmic Learning Theory, Springer, pp. 18–36.
• Berlinet and Thomas-Agnan (2011) Berlinet, A. and Thomas-Agnan, C. (2011) Reproducing kernel Hilbert spaces in probability and statistics, Springer Science & Business Media.
• Collobert and Weston (2008) Collobert, R. and Weston, J. (2008) A unified architecture for natural language processing: Deep neural networks with multitask learning, in Proceedings of the 25th international conference on Machine learning, ACM, pp. 160–167.
• Cover and Thomas (2012) Cover, T. M. and Thomas, J. A. (2012) Elements of information theory, John Wiley & Sons.
• Cybenko (1989) Cybenko, G. (1989) Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems, 2, 303–314.
• Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S. and Bengio, Y. (2014) Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, in Advances in neural information processing systems, pp. 2933–2941.
• Dudley (1974) Dudley, R. M. (1974) Metric entropy of some classes of sets with differentiable boundaries, Journal of Approximation Theory, 10, 227–236.
• Fakoor et al. (2013) Fakoor, R., Ladhak, F., Nazi, A. and Huber, M. (2013) Using deep learning to enhance cancer diagnosis and classification, in Proceedings of the International Conference on Machine Learning.
• Fukumizu and Amari (2000) Fukumizu, K. and Amari, S.-i. (2000) Local minima and plateaus in hierarchical structures of multilayer perceptrons, Neural Networks, 13, 317–327.
• Giné and Nickl (2015) Giné, E. and Nickl, R. (2015) Mathematical foundations of infinite-dimensional statistical models, vol. 40, Cambridge University Press.
• He et al. (2016) He, K., Zhang, X., Ren, S. and Sun, J. (2016) Deep residual learning for image recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
• Hernández-Lobato and Adams (2015) Hernández-Lobato, J. M. and Adams, R. (2015) Probabilistic backpropagation for scalable learning of Bayesian neural networks, in International Conference on Machine Learning, pp. 1861–1869.
• Hinton et al. (2006) Hinton, G. E., Osindero, S. and Teh, Y.-W. (2006) A fast learning algorithm for deep belief nets, Neural computation, 18, 1527–1554.
• Imaizumi et al. (2018) Imaizumi, M., Maehara, T. and Yoshida, Y. (2018) Statistically efficient estimation for non-smooth probability densities, in Artificial Intelligence and Statistics.
• Kawaguchi (2016) Kawaguchi, K. (2016) Deep learning without poor local minima, in Advances in Neural Information Processing Systems, pp. 586–594.
• Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014) Adam: A method for stochastic optimization., CoRR, abs/1412.6980.
• Koltchinskii (2006) Koltchinskii, V. (2006) Local rademacher complexities and oracle inequalities in risk minimization, The Annals of Statistics, 34, 2593–2656.
• Le et al. (2011) Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B. and Ng, A. Y. (2011) On optimization methods for deep learning, in Proceedings of the 28th International Conference on International Conference on Machine Learning, Omnipress, pp. 265–272.
• LeCun et al. (2015) LeCun, Y., Bengio, Y. and Hinton, G. (2015) Deep learning, Nature, 521, 436–444.
• Mammen and Tsybakov (1995) Mammen, E. and Tsybakov, A. (1995) Asymptotical minimax recovery of sets with smooth boundaries, The Annals of Statistics, 23, 502–524.
• Mammen et al. (1999) Mammen, E., Tsybakov, A. B. et al. (1999) Smooth discrimination analysis, The Annals of Statistics, 27, 1808–1829.
• Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K. and Bengio, Y. (2014) On the number of linear regions of deep neural networks, in Advances in neural information processing systems, pp. 2924–2932.
• Neyshabur et al. (2015) Neyshabur, B., Tomioka, R. and Srebro, N. (2015) Norm-based capacity control in neural networks, in Conference on Learning Theory, pp. 1376–1401.
• Petersen and Voigtlaender (2017) Petersen, P. and Voigtlaender, F. (2017) Optimal approximation of piecewise smooth functions using deep ReLU neural networks, arXiv preprint arXiv:1709.05289.
• Raskutti et al. (2012) Raskutti, G., Wainwright, M. J. and Yu, B. (2012) Minimax-optimal rates for sparse additive models over kernel classes via convex programming, Journal of Machine Learning Research, 13, 389–427.
• Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. (2006) Gaussian processes for machine learning, vol. 1, MIT press Cambridge.
• Schmidhuber (2015) Schmidhuber, J. (2015) Deep learning in neural networks: An overview, Neural networks, 61, 85–117.
• Schmidt-Hieber (2017) Schmidt-Hieber, J. (2017) Nonparametric regression using deep neural networks with ReLU activation function, arXiv preprint arXiv:1708.06633.
• Soudry and Carmon (2016) Soudry, D. and Carmon, Y. (2016) No bad local minima: Data independent training error guarantees for multilayer neural networks, arXiv preprint arXiv:1605.08361.
• Steinwart and Christmann (2008) Steinwart, I. and Christmann, A. (2008) Support vector machines, Springer Science & Business Media.
• Stone (1982) Stone, C. (1982) Optimal global rates of convergence for nonparametric regression, The Annals of Statistics, 10, 1040–1053.
• Suzuki (2018) Suzuki, T. (2018) Fast generalization error bound of deep learning from a kernel perspective, in Artificial Intelligence and Statistics.
• Tsybakov (2009) Tsybakov, A. B. (2009) Introduction to nonparametric estimation, Springer.
• van der Vaart and van Zanten (2011) van der Vaart, A. and van Zanten, H. (2011) Information rates of nonparametric Gaussian process methods, Journal of Machine Learning Research, 12, 2095–2119.
• van der Vaart and van Zanten (2008) van der Vaart, A. and van Zanten, J. (2008) Rates of contraction of posterior distributions based on Gaussian process priors, The Annals of Statistics, 36, 1435–1463.
• van der Vaart and Wellner (1996) van der Vaart, A. and Wellner, J. (1996) Weak Convergence and Empirical Processes: With Applications to Statistics, Springer Science & Business Media.
• van Eeden (1985) van Eeden, C. (1985) Mean integrated squared error of kernel estimators when the density and its derivative are not necessarily continuous, Annals of the Institute of Statistical Mathematics, 37, 461–472.
• Wasserman (2006) Wasserman, L. A. (2006) All of nonparametric statistics: with 52 illustrations, Springer.
• Wolpert et al. (2011) Wolpert, R. L., Clyde, M. A. and Tu, C. (2011) Stochastic expansions using continuous dictionaries: Lévy adaptive regression kernels, The Annals of Statistics, 39, 1916–1962.
• Wu and Chu (1993a) Wu, J. and Chu, C. (1993a) Kernel-type estimators of jump points and values of a regression function, The Annals of Statistics, 21, 1545–1566.
• Wu and Chu (1993b) Wu, J. and Chu, C. (1993b) Nonparametric function estimation and bandwidth selection for discontinuous regression functions, Statistica Sinica, pp. 557–576.
• Yang and Barron (1999) Yang, Y. and Barron, A. (1999) Information-theoretic determination of minimax rates of convergence, The Annals of Statistics, 27, 1564–1599.
• Yarotsky (2017) Yarotsky, D. (2017) Error bounds for approximations with deep ReLU networks, Neural Networks, 94, 103–114.
• Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B. and Vinyals, O. (2017) Understanding deep learning requires rethinking generalization, in ICLR.

Appendix A Proof of Theorem 1

We provide additional notation. Let $\lambda$ denote the Lebesgue measure. For a function $f$, $\|f\|_\infty := \sup_x |f(x)|$ is the supremum norm, and $\|f\|_{L^2}$ is the $L^2$-norm with respect to the Lebesgue measure.

With the set of observations $\{(X_i, Y_i)\}_{i \in [n]}$, let $\|\cdot\|_n$ denote the empirical norm, defined as

 $$\|f\|_n = \Bigl( n^{-1} \sum_{i=1}^{n} f(X_i)^2 \Bigr)^{1/2}.$$

Also, we define the empirical norms of the random variables as

 $$\|Y\|_n := \Bigl( n^{-1} \sum_{i \in [n]} Y_i^2 \Bigr)^{1/2} \quad \text{and} \quad \|\xi\|_n := \Bigl( n^{-1} \sum_{i \in [n]} \xi_i^2 \Bigr)^{1/2}.$$
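These empirical norms are simple root-mean-square quantities. As a purely illustrative sketch (not part of the proof), the following Python snippet evaluates $\|f\|_n$ on a random design and shows it concentrating around the population $L^2(P_X)$ norm; the function name `empirical_norm` is ours.

```python
import numpy as np

def empirical_norm(f, X):
    # ||f||_n = ( n^{-1} * sum_i f(X_i)^2 )^{1/2}
    return np.sqrt(np.mean(f(X) ** 2))

rng = np.random.default_rng(0)

# Design points X_i drawn i.i.d. from P_X = Uniform[0, 2*pi]
X = rng.uniform(0.0, 2.0 * np.pi, size=200_000)

# For f = cos, the population norm ||f||_{L2(P_X)} equals sqrt(1/2),
# and the empirical norm approaches it by the law of large numbers.
print(empirical_norm(np.cos, X))  # close to 0.7071
```

The same helper computes $\|Y\|_n$ and $\|\xi\|_n$ when applied to the observation and noise vectors directly.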

With a set $F$, a radius $\epsilon > 0$, and a norm $\|\cdot\|$, we introduce the covering number

 $$N(\epsilon, F, \|\cdot\|) := \inf \Bigl\{ N \,\Bigm|\, \exists \{f_j\}_{j \in [N]} \text{ s.t. } \min_{j \in [N]} \|f - f_j\| \le \epsilon, \ \forall f \in F \Bigr\}.$$
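As a concrete illustration of this definition (not used in the proof), consider covering the unit interval under the absolute-value metric: $N(\epsilon, [0,1], |\cdot|) = \lceil 1/(2\epsilon) \rceil$, attained by centering balls of radius $\epsilon$ at odd multiples of $\epsilon$. A short Python sketch, with helper names of our choosing:

```python
import numpy as np

def covering_centers(eps):
    # A minimal eps-net of [0,1] under |.|: N(eps) = ceil(1/(2*eps)) balls
    # of radius eps, centered at eps, 3*eps, 5*eps, ... (clipped to 1).
    n_balls = int(np.ceil(1.0 / (2.0 * eps)))
    return np.minimum((2.0 * np.arange(n_balls) + 1.0) * eps, 1.0)

def covering_radius(centers, grid):
    # Largest distance from a grid point to its nearest center.
    return np.max(np.min(np.abs(grid[:, None] - centers[None, :]), axis=1))

grid = np.linspace(0.0, 1.0, 1001)
for eps in (0.05, 0.10, 0.15):
    centers = covering_centers(eps)
    print(eps, len(centers), covering_radius(centers, grid))
```

Every grid point lies within `eps` of some center, so the printed radius never exceeds `eps`, confirming the net is valid with exactly $\lceil 1/(2\epsilon) \rceil$ balls.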

By the definition of the least squares estimator (2), we obtain the following basic inequality:

 $$\|Y - \hat{f}_L\|_n^2 \le \|Y - f\|_n^2,$$

for all $f$ in the model class over which (2) is minimized. Since $Y_i = f^*(X_i) + \xi_i$, we obtain

 $$\|f^* + \xi - \hat{f}_L\|_n^2 \le \|f^* + \xi - f\|_n^2.$$

Expanding both sides and rearranging yields

 $$\|f^* - \hat{f}_L\|_n^2 \le \|f^* - f\|_n^2 + \frac{2}{n} \sum_{i=1}^{n} \xi_i \bigl( \hat{f}_L(X_i) - f(X_i) \bigr). \tag{6}$$

In the following, we fix $f$ and evaluate each of the terms on the RHS of (6). In the first subsection, we provide a result for approximating $f^*$ by DNNs. In the second subsection, we evaluate the variance of the term involving the noise $\xi$. In the last subsection, we combine the results and derive an overall convergence rate.
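As a numerical sanity check of the basic inequality and the decomposition (6) (purely illustrative, with a toy linear-in-parameters class standing in for the DNN class), the least squares fit beats every competitor from the same class in empirical risk, and (6) follows as an algebraic identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.uniform(-1.0, 1.0, size=n)
f_star = np.abs(X)                    # a non-smooth regression function
xi = 0.1 * rng.standard_normal(n)     # noise
Y = f_star + xi

# Least squares over a toy class: cubic polynomials (stand-in for DNNs).
design = np.vander(X, 4)              # columns: x^3, x^2, x, 1
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
f_hat = design @ coef

def sq_norm_n(v):
    # Squared empirical norm ||v||_n^2 = n^{-1} sum_i v_i^2.
    return np.mean(v ** 2)

# Basic inequality: f_hat minimizes empirical risk over the class,
# so it beats any competitor f drawn from the same class.
f_comp = design @ rng.standard_normal(4)
assert sq_norm_n(Y - f_hat) <= sq_norm_n(Y - f_comp) + 1e-9

# Decomposition (6): bias term plus the noise cross term.
lhs = sq_norm_n(f_star - f_hat)
rhs = sq_norm_n(f_star - f_comp) + 2.0 / n * np.sum(xi * (f_hat - f_comp))
assert lhs <= rhs + 1e-9
```

The second assertion mirrors (6) exactly: expanding $\|f^* + \xi - \hat{f}_L\|_n^2 \le \|f^* + \xi - f\|_n^2$ and cancelling $\|\xi\|_n^2$ leaves the cross term $\frac{2}{n}\sum_i \xi_i (\hat{f}_L(X_i) - f(X_i))$.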

A.1. Approximating piecewise functions by DNNs

The purpose of this part is to bound the value

 $$\|f - f^*\|_{L^2(P_X)},$$

with a properly selected $f$. To this end, we show the existence of a network $f$ attaining a small approximation error, with properly selected numbers of layers and non-zero parameters. Our proof of this part is obtained by extending techniques by Yarotsky (2017) and Petersen and Voigtlaender (2017).

To approximate $f^*$, we consider neural networks $G_\eta[\Theta_{f,m}]$ and $G_\eta[\Theta_{r,m}]$ for $m \in [M]$, whose numbers of layers and non-zero parameters will be specified later. We also consider a network $G_\eta[\Theta_3]$ which approximates the multiplications and the summation, i.e. $G_\eta[\Theta_3](x_1, \ldots, x_M, y_1, \ldots, y_M) \approx \sum_{m \in [M]} x_m y_m$.
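The multiplication network is not constructed explicitly here, but the idea behind such networks, following Yarotsky (2017), can be sketched in a few lines: ReLU networks approximate $x^2$ on $[0,1]$ by subtracting composed "sawtooth" functions, and products via the identity $xy = 2\bigl(\tfrac{x+y}{2}\bigr)^2 - \tfrac{x^2}{2} - \tfrac{y^2}{2}$. The NumPy snippet below is an illustrative sketch of this construction, not the network used in the proof; the function names are ours.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tooth(x):
    # Sawtooth g(x) = 2x on [0, 1/2] and 2(1-x) on [1/2, 1], via ReLUs.
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sq_approx(x, m):
    # Yarotsky's ReLU approximation of x^2 on [0, 1]:
    # f_m(x) = x - sum_{s=1}^m g^{(s)}(x) / 4^s, with error <= 2^{-2m-2}.
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, m + 1):
        g = tooth(g)            # s-fold composition g^{(s)}
        out = out - g / 4.0 ** s
    return out

def mult_approx(x, y, m):
    # xy = 2*((x+y)/2)^2 - x^2/2 - y^2/2, each square realized by a ReLU net.
    return (2.0 * sq_approx((x + y) / 2.0, m)
            - 0.5 * sq_approx(x, m)
            - 0.5 * sq_approx(y, m))
```

With `m = 6` the squaring error on $[0,1]$ is at most $2^{-14} \approx 6 \times 10^{-5}$, and the product error is at most three times that, illustrating how a fixed-depth gadget trades layers for accuracy.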

We evaluate the distance between $f^*$ and the combined neural network as

 $$\begin{aligned}
 & \bigl\| f^* - G_\eta[\Theta_3](G_\eta[\Theta_{f,1}](\cdot), \ldots, G_\eta[\Theta_{f,M}](\cdot), G_\eta[\Theta_{r,1}](\cdot), \ldots, G_\eta[\Theta_{r,M}](\cdot)) \bigr\|_{L^2} \\
 &= \Bigl\| \sum_{m \in [M]} f^*_m 1_{R^*_m} - G_\eta[\Theta_3](G_\eta[\Theta_{f,1}](\cdot), \ldots, G_\eta[\Theta_{f,M}](\cdot), G_\eta[\Theta_{r,1}](\cdot), \ldots, G_\eta[\Theta_{r,M}](\cdot)) \Bigr\|_{L^2} \\
 &\le \Bigl\| \sum_{m \in [M]} f^*_m \otimes 1_{R^*_m} - \sum_{m \in [M]} G_\eta[\Theta_{f,m}] \otimes G_\eta[\Theta_{r,m}] \Bigr\|_{L^2} \\
 &\quad + \Bigl\| \sum_{m \in [M]} G_\eta[\Theta_{f,m}] \otimes G_\eta[\Theta_{r,m}] - G_\eta[\Theta_3](G_\eta[\Theta_{f,1}](\cdot), \ldots, G_\eta[\Theta_{f,M}](\cdot), G_\eta[\Theta_{r,1}](\cdot), \ldots, G_\eta[\Theta_{r,M}](\cdot)) \Bigr\|_{L^2} \\
 &\le \sum_{m \in [M]} \bigl\| f^*_m \otimes 1_{R^*_m} - G_\eta[\Theta_{f,m}] \otimes G_\eta[\Theta_{r,m}] \bigr\|_{L^2} \\
 &\quad + \Bigl\| \sum_{m \in [M]} G_\eta[\Theta_{f,m}] \otimes G_\eta[\Theta_{r,m}] - G_\eta[\Theta_3](G_\eta[\Theta_{f,1}](\cdot), \ldots, G_\eta[\Theta_{f,M}](\cdot), G_\eta[\Theta_{r,1}](\cdot), \ldots, G_\eta[\Theta_{r,M}](\cdot)) \Bigr\|_{L^2} \\
 &\le \sum_{m \in [M]} \bigl\| (f^*_m - G_\eta[\Theta_{f,m}]) \otimes G_\eta[\Theta_{r,m}] \bigr\|_{L^2} + \sum_{m \in [M]} \bigl\| f^*_m \otimes (1_{R^*_m} - G_\eta[\Theta_{r,m}]) \bigr\|_{L^2} \\
 &\quad + \Bigl\| \sum_{m \in [M]} G_\eta[\Theta_{f,m}] \otimes G_\eta[\Theta_{r,m}] - G_\eta[\Theta_3](G_\eta[\Theta_{f,1}](\cdot), \ldots, G_\eta[\Theta_{f,M}](\cdot), G_\eta[\Theta_{r,1}](\cdot), \ldots, G_\eta[\Theta_{r,M}](\cdot)) \Bigr\|_{L^2}
 \end{aligned}$$