1 Introduction
Deep learning has shown great success in applications such as computer vision, natural language processing, and many other areas related to pattern recognition. Several high-performance methods have been developed, revealing that deep learning possesses great potential. Despite the development of practical methodologies, however, its theoretical understanding is not yet satisfactory. A wide range of researchers, including theoreticians and practitioners, are expecting a deeper understanding of deep learning.
Among the theories of deep learning, a well developed topic is its expressive power. It has been theoretically shown that the expressive power of deep neural networks grows exponentially with the number of layers. For example, Montufar et al. (2014) showed that the number of polyhedral regions created by a deep neural network can grow exponentially as the number of layers increases. Bianchini and Scarselli (2014) showed that the Betti numbers of the level set of a function created by a deep neural network grow exponentially with the number of layers. Other studies reached similar conclusions using different notions such as tensor rank and extrinsic curvature (Cohen et al., 2016; Cohen and Shashua, 2016; Poole et al., 2016).

Another important issue in neural network theory is the universal approximation capability. It is well known that 3-layer neural networks have this ability, and thus deep neural networks do as well (Cybenko, 1989; Hornik, 1991; Sonoda and Murata, 2015). When we discuss the universal approximation capability, the target function to be approximated is arbitrary, and the theory is highly nonparametric in nature.
Once the expressive power and the universal approximation capability of deep neural networks are understood, the next natural theoretical question concerns the generalization error. The generalization ability is typically analyzed by evaluating the Rademacher complexity. Bartlett (1998) studied 3-layer neural networks and characterized their Rademacher complexity using the norm of the weights. Koltchinskii and Panchenko (2002) studied deep neural networks and derived their Rademacher complexity under norm constraints. More recently, Neyshabur et al. (2015) analyzed the Rademacher complexity based on a more general norm, and Sun et al. (2015) derived a generalization error bound under a large margin assumption. As a whole, the studies listed above derived $O(1/\sqrt{n})$ convergence of the generalization error, where $n$ is the sample size. One concern in this line of analyses is that, although the rate $O(1/\sqrt{n})$ is minimax optimal, faster convergence rates should be achievable under additional assumptions such as strong convexity of the loss function. Indeed, in a regular parametric model, we have $O(1/n)$ convergence of the generalization error (Hartigan et al., 1998). Moreover, generalization error bounds have mainly been given for finite dimensional models. As we have observed, deep neural networks possess exponential expressive power and the universal approximation capability, which are highly nonparametric characterizations. This means that the theories have been developed separately in two regimes: the finite dimensional parametric regime and the infinite dimensional nonparametric regime. Therefore, theories that connect these two regimes are needed to comprehensively understand the statistical performance of deep learning.

In this paper, we consider both empirical risk minimization and Bayesian deep learning and analyze the generalization error using the terminology of kernel methods. Consequently, (i) we derive a learning rate faster than $O(1/\sqrt{n})$,
and (ii) we connect the finite dimensional regime and the infinite dimensional regime through the theory of kernel methods. Empirical risk minimization is a typical approach to learning a deep neural network model. It is usually performed by applying stochastic gradient descent with the back-propagation technique (Widrow and Hoff, 1960; Amari, 1967; Rumelhart et al., 1986). To avoid over-fitting, techniques such as regularization and dropout are employed (Srivastava et al., 2014). Although practical techniques for empirical risk minimization have been studied extensively, there is still much room for improvement in the analysis of its generalization error. Bayesian deep learning has recently been attracting more attention, mainly because it can deal with estimation uncertainty in a natural way. Examples of Bayesian deep learning research include probabilistic backpropagation
(Hernandez-Lobato and Adams, 2015), Bayesian dark knowledge (Balan et al., 2015), weight uncertainty by Bayesian backpropagation (Blundell et al., 2015), and dropout as a Bayesian approximation (Gal and Ghahramani, 2016). To obtain a sharper generalization error bound, we utilize the so-called local Rademacher complexity technique for the empirical risk minimization method (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006), and, for the Bayesian method, we employ the theoretical techniques developed to analyze nonparametric Bayes methods (Ghosal et al., 2000; van der Vaart and van Zanten, 2008, 2011). These analyses are quite advantageous compared with the typical Rademacher complexity analysis because we can obtain a convergence rate between $O(1/\sqrt{n})$ and $O(1/n)$, which is faster than the $O(1/\sqrt{n})$ rate of the standard Rademacher complexity analysis. As for the second contribution, we first introduce an integral form of the deep neural network, as was done in the research on the universal approximation capability of 3-layer neural networks (Sonoda and Murata, 2015). This allows us to treat a nonparametric model of the deep neural network as a natural extension of the usual finite dimensional models. Afterward, we define a reproducing kernel Hilbert space (RKHS) corresponding to each layer, as in Bach (2017, 2015). By doing so, we can borrow the terminology developed for kernel methods into the analysis of deep learning. In particular, we define the degree of freedom of the RKHS as a measure of its complexity (Caponnetto and de Vito, 2007; Bach, 2015), and based on that, we evaluate how large a finite dimensional model should be to approximate the original infinite dimensional model with a specified precision. These theoretical developments reveal a bias-variance trade-off: a trade-off between the size of the finite dimensional model approximating the nonparametric model and the variance of the estimator. We show that, by balancing this trade-off, a fast convergence rate is derived. In particular, the optimal learning rate of the kernel method is reproduced from our deep learning analysis, owing to the fact that the kernel method can be seen as a 3-layer neural network with an infinite dimensional internal layer. A remarkable property of the derived generalization error bound is that the error is characterized by the complexities of the RKHSs expressed through their degrees of freedom. Moreover, the notion of the degree of freedom gives a practical implication about how to determine the width of the internal layers. The obtained generalization error bound is summarized in Table 1.
Table 1: Summary of the obtained generalization error bounds.

| Setting | Error bound |
|---|---|
| General setting | Theorem 2 and Corollary 1 |
| Finite dimensional model, where the $\ell$-th internal layer has a finite true width $m_\ell^*$ | Eq. (11) |
| Polynomial decay of the eigenvalues of the kernel function on the $\ell$-th layer, with decay rate $s_\ell$ | Eq. (14) |
2 Integral representation of deep neural network
Here we give the problem setting and the model that we consider in this paper. Suppose that input-output observations $D_n = \{(x_i, y_i)\}_{i=1}^n$ are independently and identically generated from the regression model
$$y_i = f^{\mathrm{o}}(x_i) + \xi_i \quad (i = 1, \dots, n),$$
where $(\xi_i)_{i=1}^n$ is an i.i.d. sequence of Gaussian noises with mean 0 and variance $\sigma^2$, and $(x_i)_{i=1}^n$ is generated independently and identically from a distribution $P_X$ with a compact support in $\mathbb{R}^{d_x}$. The purpose of the deep learning problem we consider in this paper is to estimate the true function $f^{\mathrm{o}}$ from the observations $D_n$.
To analyze the generalization ability of deep learning, we specify a function class in which the true function is included, and, by doing so, we characterize the “complexity” of the true function in a correct way.
In order to give better intuition, we first start from the simplest model, the 3-layer neural network. Let $\eta : \mathbb{R} \to \mathbb{R}$ be a nonlinear activation function such as ReLU (Nair and Hinton, 2010; Glorot et al., 2011), $\eta(u) = \max(u, 0)$; for a $d$-dimensional vector $u$, $\eta(u)$ denotes the element-wise application $(\eta(u_1), \dots, \eta(u_d))^\top$. The 3-layer neural network model is represented by
$$f(x) = \sum_{j=1}^{m} a_j\, \eta(w_j^\top x + b_j),$$
where we denote by $m$ the number of nodes in the internal layer, and $a = (a_j)_{j=1}^m \in \mathbb{R}^m$, $W = (w_1, \dots, w_m)^\top \in \mathbb{R}^{m \times d_x}$, and $b = (b_j)_{j=1}^m \in \mathbb{R}^m$. It is known that this model is a universal approximator, and it is important to consider its integral form
$$f(x) = \int_{\mathcal{T}} h_2(\tau)\, \eta\big(h_1(\tau)^\top x + b_1(\tau)\big)\, \mathrm{d}Q(\tau), \qquad \text{(1)}$$
where $\tau \in \mathcal{T}$ is a hidden parameter, $h_1(\tau)$ (together with the output weight $h_2(\tau)$) is a function version of the weight matrix, and $b_1(\tau)$ is the bias term. This integral form appears in many places in analyses of the capacity of neural networks. In particular, through the ridgelet analysis, it has been shown that there exists an integral form corresponding to any $f$
that has an integrable Fourier transform, for an appropriately chosen activation function such as ReLU (Sonoda and Murata, 2015).
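To make the connection between the integral form (1) and the usual finite network concrete, the following sketch discretizes the integral by Monte Carlo sampling of the hidden parameter, so that the number of sampled hidden parameters plays the role of the width of the internal layer. The Gaussian feature distribution, the particular weight and bias functions, and the use of the hidden parameter itself as the inner weight are illustrative assumptions, not part of the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda u: np.maximum(u, 0.0)

d = 3                                                 # input dimension (illustrative)
h = lambda w: np.exp(-0.5 * np.sum(w ** 2, axis=-1))  # assumed output-weight function
b = lambda w: 0.1 * w[..., 0]                         # assumed bias function

def f_discretized(x, m):
    """Finite 3-layer network obtained by sampling m hidden parameters from Q."""
    w = rng.standard_normal((m, d))                   # Q taken to be N(0, I) for illustration
    return np.mean(h(w) * relu(w @ x + b(w)))         # Monte Carlo approximation of Eq. (1)

x = np.array([0.5, -0.2, 1.0])
for m in (10, 100, 10_000, 1_000_000):
    print(m, f_discretized(x, m))                     # fluctuates for small m, stabilizes as m grows
```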
Motivated by the integral form of the 3-layer neural network, we consider a more general representation for deeper neural networks. To do so, we define a feature space on the $\ell$-th layer. The feature space is a probability space $(\mathcal{T}_\ell, \mathcal{B}_\ell, Q_\ell)$,
where $\mathcal{T}_\ell$ is a Polish space, $\mathcal{B}_\ell$ is its Borel algebra, and $Q_\ell$ is a probability measure on $(\mathcal{T}_\ell, \mathcal{B}_\ell)$. This is introduced to represent a general (possibly continuous) set of features as well as a discrete set of features. For example, if the $\ell$-th internal layer is endowed with an $m_\ell$-dimensional finite feature space, then $\mathcal{T}_\ell = \{1, \dots, m_\ell\}$. On the other hand, the integral form (1) corresponds to a continuous feature space in the second layer. Now the input is a $d_x$-dimensional real vector, and thus we may set $\mathcal{T}_0 = \{1, \dots, d_x\}$. Since the output is one dimensional, the output layer is just a singleton $\mathcal{T}_L = \{1\}$. Based on these feature spaces, our integral form of the deep neural network is constructed by stacking the map on the $\ell$-th layer given as
$$F_\ell[g](\tau) = \int_{\mathcal{T}_{\ell-1}} h_\ell(\tau, w)\, \eta(g(w))\, \mathrm{d}Q_{\ell-1}(w) + b_\ell(\tau) \quad (\tau \in \mathcal{T}_\ell), \qquad \text{(2a)}$$
for $g \in L_2(Q_{\ell-1})$, where $h_\ell(\tau, w)$ corresponds to the weight of the feature $w$ for the output $\tau$, and $h_\ell \in L_2(Q_\ell \times Q_{\ell-1})$ and $b_\ell \in L_2(Q_\ell)$ for all $\ell$. (Note that, for $g \in L_2(Q_{\ell-1})$, the map $F_\ell[g]$ is also square integrable with respect to $Q_\ell$ if $\eta$ is Lipschitz continuous, by the Cauchy-Schwarz inequality.) Specifically, the first and the last layers are represented as
$$F_1[x](\tau) = h_1(\tau)^\top x + b_1(\tau) \quad (\tau \in \mathcal{T}_1), \qquad \text{(2b)}$$
$$F_L[g] = \int_{\mathcal{T}_{L-1}} h_L(w)\, \eta(g(w))\, \mathrm{d}Q_{L-1}(w) + b_L, \qquad \text{(2c)}$$
where we wrote $h_1(\tau)^\top x$ to indicate $\sum_{j=1}^{d_x} h_1(\tau, j)\, x_j$ for simplicity because $\mathcal{T}_0 = \{1, \dots, d_x\}$. Then the true function is given as
$$f^{\mathrm{o}}(x) = F_L \circ F_{L-1} \circ \cdots \circ F_1(x). \qquad \text{(3)}$$
Since the shallow 3-layer neural network is a universal approximator, so is our generalized deep neural network model (3). It is known that a deep neural network tends to give a more efficient representation of a function than a shallow network. Indeed, Eldan and Shamir (2016) gave an example of a function that the 3-layer neural network cannot approximate within a given precision unless its width is exponential in the input dimension, but that the 4-layer neural network can approximate with widths of polynomial order (see Safran and Shamir (2016) for other examples). In other words, each layer of a deep neural network can be much "simpler" than that of a shallow network (a more rigorous definition of the complexity of each layer will be given in the next section). Therefore, it is quite important to consider the integral representation of a deep neural network rather than a 3-layer network.
The integral representation is natural also from a practical point of view. Indeed, it is well known that a deep neural network learns simple patterns in the early layers and gradually extracts more complicated features in the higher layers. The learned features are often effectively continuous. For example, in computer vision tasks, the second layer typically extracts gradients at various angles (Krizhevsky et al., 2012). The angle is a continuous variable, and thus the feature space should be continuous to cover all angles. On the other hand, real networks discretize the feature space because of the limitation of computational resources. Our theory introduced in the next section offers a measure to evaluate this discretization error.
3 Finite approximation of the integral form
The integral form is a convenient way to describe the true function. However, it is not directly usable for estimation. When we estimate the function, we need to discretize the integrals into finite sums because of the limitation of computational resources, as we do in practice. In other words, we consider the usual finite-sum deep learning model as an approximation of the integral form. However, the discrete approximation induces an approximation error, and here we give an upper bound on it. Naturally, there arises a notion of bias-variance trade-off: as the complexity of the finite model increases, the "bias" (approximation error) decreases, but the "variance" of finding the best parameter in the model increases. Afterwards, we will bound the variance of estimating the finite approximation in Section 4.3. Combining these two notions, it is possible to quantify the bias-variance trade-off and find the best strategy to minimize the entire generalization error.
The approximation error analysis of the deep neural network can be carried out effectively by utilizing notions from the kernel method. Here we construct an RKHS for each layer in a way analogous to Bach (2015, 2017), who studied shallow learning and the kernel quadrature rule. Let the output of the $\ell$-th layer be $F^{\mathrm{o}}_\ell(x) := F_\ell \circ F_{\ell-1} \circ \cdots \circ F_1(x)$. We define a reproducing kernel Hilbert space (RKHS) corresponding to the $\ell$-th layer ($\ell = 2, \dots, L$) by introducing its associated kernel function $k_\ell : \mathbb{R}^{d_x} \times \mathbb{R}^{d_x} \to \mathbb{R}$. We define the positive definite kernel $k_\ell$ as
$$k_\ell(x, x') = \int_{\mathcal{T}_{\ell-1}} \eta\big(F^{\mathrm{o}}_{\ell-1}(x)(\tau)\big)\, \eta\big(F^{\mathrm{o}}_{\ell-1}(x')(\tau)\big)\, \mathrm{d}Q_{\ell-1}(\tau).$$
It is easy to check that $k_\ell$ is actually symmetric and positive definite. It is known that there exists a unique RKHS $\mathcal{H}_\ell$ corresponding to the kernel $k_\ell$ (Aronszajn, 1950). A close investigation of the RKHSs arising from several examples of shallow networks has been given in Bach (2017).
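In practice the kernel $k_\ell$ is approximated by replacing the integral over the feature space with a finite sum over sampled features, which is exactly the kernel induced by a finite-width layer. The sketch below is a minimal illustration in which a randomly drawn affine layer stands in for $F^{\mathrm{o}}_{\ell-1}$; the width, the Gaussian weights, and ReLU are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda u: np.maximum(u, 0.0)

n, d, m = 50, 3, 2000            # sample size, input dimension, number of sampled features
X = rng.standard_normal((n, d))  # inputs x_1, ..., x_n
W = rng.standard_normal((m, d))  # stands in for the previous layer's features (illustrative)
bias = rng.standard_normal(m)

A = relu(X @ W.T + bias)         # eta of the previous layer's output on the sampled features
K = A @ A.T / m                  # k_l(x_i, x_j) ~ (1/m) sum over features of eta(...) * eta(...)

# symmetric and positive semi-definite, as claimed for k_l
print(np.allclose(K, K.T), float(np.min(np.linalg.eigvalsh(K))) >= -1e-8)
```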
Under this setting, all arguments concerning the $\ell$-th layer can be carried out through the theory of kernel methods. Importantly, for every $\tau \in \mathcal{T}_\ell$, there exists $f_\tau \in \mathcal{H}_\ell$ such that
$$f_\tau(x) = \int_{\mathcal{T}_{\ell-1}} h_\ell(\tau, w)\, \eta\big(F^{\mathrm{o}}_{\ell-1}(x)(w)\big)\, \mathrm{d}Q_{\ell-1}(w).$$
Moreover, the norms of $f_\tau$ and $h_\ell(\tau, \cdot)$ are connected as
$$\|f_\tau\|_{\mathcal{H}_\ell} \le \|h_\ell(\tau, \cdot)\|_{L_2(Q_{\ell-1})} \qquad \text{(4)}$$
(Bach, 2015, 2017). Therefore, the function $f_\tau$, which represents the magnitude of the feature $\tau$ for the input $x$, is included in the RKHS $\mathcal{H}_\ell$, and its RKHS norm is controlled by the norm of the internal layer weight $h_\ell(\tau, \cdot)$ because of Eq. (4).
To derive the approximation error, we need to evaluate the "complexity" of the RKHS. Basically, the complexity of the $\ell$-th layer RKHS is controlled by the behavior of the eigenvalues of the kernel. To formally state this notion, we introduce the integral operator $T_\ell : L_2(P_X) \to L_2(P_X)$ associated with the kernel $k_\ell$, defined as
$$(T_\ell f)(x) = \int k_\ell(x, x')\, f(x')\, \mathrm{d}P_X(x').$$
If the kernel function admits an orthogonal decomposition
$$k_\ell(x, x') = \sum_{j=1}^{\infty} \mu_j^{(\ell)}\, \phi_j^{(\ell)}(x)\, \phi_j^{(\ell)}(x')$$
in $L_2(P_X)$, where $(\mu_j^{(\ell)})_{j \ge 1}$ is the sequence of eigenvalues ordered decreasingly and $(\phi_j^{(\ell)})_{j \ge 1}$ forms an orthonormal system in $L_2(P_X)$, then for $f = \sum_j \alpha_j \phi_j^{(\ell)}$ the integral operation is expressed as $T_\ell f = \sum_j \mu_j^{(\ell)} \alpha_j \phi_j^{(\ell)}$ (see Steinwart and Christmann (2008); Steinwart and Scovel (2012) for more details). Therefore, each eigenvalue $\mu_j^{(\ell)}$ plays a role like a "filter" for the corresponding component $\phi_j^{(\ell)}$. Here it is known that, for every $f \in \mathcal{H}_\ell$, there exists $g \in L_2(P_X)$ such that $f = T_\ell^{1/2} g$ and $\|f\|_{\mathcal{H}_\ell} = \|g\|_{L_2(P_X)}$ (Caponnetto and de Vito, 2007; Steinwart et al., 2009). Combining this with Eq. (4), we have that each $f_\tau$ can be written as $f_\tau = T_\ell^{1/2} g_\tau$ with $\|g_\tau\|_{L_2(P_X)} \le \|h_\ell(\tau, \cdot)\|_{L_2(Q_{\ell-1})}$.
Based on the integral operator $T_\ell$, we define the degree of freedom of the RKHS as
$$N_\ell(\lambda) := \mathrm{Tr}\big[T_\ell (T_\ell + \lambda I)^{-1}\big] \qquad \text{(5)}$$
for $\lambda > 0$. The degree of freedom can be represented as $N_\ell(\lambda) = \sum_{j=1}^{\infty} \frac{\mu_j^{(\ell)}}{\mu_j^{(\ell)} + \lambda}$ by using the eigenvalues of the kernel.
Now, we assume that the true function satisfies a norm condition as follows.
Assumption 1
For each $\ell$, $h_\ell$ and $b_\ell$ satisfy
$$\|h_\ell(\tau, \cdot)\|_{L_2(Q_{\ell-1})} \le R \quad \text{and} \quad |b_\ell(\tau)| \le R_b \quad (\forall \tau \in \mathcal{T}_\ell)$$
for some constants $R, R_b > 0$.

By Eq. (4), the first condition is interpreted as the requirement that $f_\tau \in \mathcal{H}_\ell$ and $\|f_\tau\|_{\mathcal{H}_\ell} \le R$. This means that the feature map in each internal layer is well regulated by the RKHS norm.
Moreover, we also assume that the activation function is scale invariant.
Assumption 2
We assume the following conditions on the activation function $\eta$.

-

$\eta$ is scale invariant: $\eta(a x) = a\, \eta(x)$ for all $a > 0$ and $x \in \mathbb{R}^d$ (for arbitrary dimension $d$).

-

$\eta$ is 1-Lipschitz continuous: $|\eta(x) - \eta(x')| \le |x - x'|$ for all $x, x' \in \mathbb{R}$.
The first assumption, on scale invariance, is essential for deriving tight error bounds. The second ensures that a deviation in each layer does not affect the output too much. The most important example of an activation function that satisfies these conditions is the ReLU activation. Another one is the identity map $\eta(x) = x$.
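Both conditions of Assumption 2 are easy to verify numerically for ReLU; the following lines are only a sanity check of the stated properties, not part of the analysis.

```python
import numpy as np

rng = np.random.default_rng(2)
relu = lambda u: np.maximum(u, 0.0)

x, y = rng.standard_normal(1000), rng.standard_normal(1000)
a = rng.uniform(0.0, 5.0, size=1000)                                        # nonnegative scales

scale_invariant = np.allclose(relu(a * x), a * relu(x))                     # eta(a x) = a eta(x), a >= 0
one_lipschitz = np.all(np.abs(relu(x) - relu(y)) <= np.abs(x - y) + 1e-12)  # 1-Lipschitz
print(scale_invariant, one_lipschitz)
```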
Finally we assume that the input distribution has a compact support.
Assumption 3
The support of $P_X$ is compact, and it is bounded as $\|x\|_\infty \le D_x$ for all $x$ in the support, for a constant $D_x > 0$.
We consider a finite dimensional approximation given as follows: let $m_\ell$ be the number of nodes in the $\ell$-th internal layer (we set the dimensions of the output and input layers to $m_L = 1$ and $m_0 = d_x$), and consider a model
$$f(x) = W^{(L)} \eta\Big( W^{(L-1)} \eta\big( \cdots \eta\big( W^{(1)} x + b^{(1)} \big) \cdots \big) + b^{(L-1)} \Big) + b^{(L)},$$
where $W^{(\ell)} \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$ and $b^{(\ell)} \in \mathbb{R}^{m_\ell}$.
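The finite dimensional model is just the ordinary deep network with widths $m_1, \dots, m_{L-1}$. A minimal forward pass is sketched below, with randomly initialized parameters standing in for the estimated $\widehat{W}^{(\ell)}$ and $\widehat{b}^{(\ell)}$; the widths and the scaling of the initialization are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
relu = lambda u: np.maximum(u, 0.0)

widths = [3, 64, 64, 1]    # [d_x, m_1, m_2, m_L]; the internal widths are illustrative
params = [(rng.standard_normal((m_out, m_in)) / np.sqrt(m_in), rng.standard_normal(m_out))
          for m_in, m_out in zip(widths[:-1], widths[1:])]

def f_hat(x, params):
    """Finite approximation: alternating affine maps and ReLU, no activation on the output."""
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:
            h = relu(h)
    return h

print(f_hat(np.array([0.5, -0.2, 1.0]), params))
```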
Theorem 1 (Finite approximation error bound of the nonparametric model)
For any and , suppose that
then there exist and such that, by letting ,
(6a)
(6b)
and
(7)
(8)
The proof is given in Appendix A. This theorem is proven by borrowing the theoretical technique recently developed for the kernel quadrature rule (Bach, 2015). We also employ some techniques analogous to the analysis of low rank tensor estimation (Suzuki, 2015; Kanagawa et al., 2016; Suzuki et al., 2016). Intuitively, the degree of freedom is the intrinsic dimensionality of the $\ell$-th layer required to achieve a given approximation error. Indeed, we show in the proof that, under the stated condition, the $\ell$-th layer is approximated with the prescribed precision by a number of nodes proportional to its degree of freedom. The error bound (7) indicates that the total approximation error of the whole network is essentially obtained by summing up the approximation errors of the individual layers, where the additional factor is a Lipschitz constant accounting for error propagation through the layers.
We would like to emphasize that the approximation error bound (7) and the norm bounds (6) on $\widehat{W}^{(\ell)}$ and $\widehat{b}^{(\ell)}$ are independent of the dimensions $m_\ell$ of the internal layers. This is due to the scale invariance property of the activation function, and it is quite beneficial for deriving a tight generalization error bound. Indeed, without the scale invariance, we would only have much looser bounds that depend on the dimensions and could be huge for small $\lambda$. This would support the practical success of the ReLU activation.
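Theorem 1 suggests choosing the width of each internal layer in proportion to the degree of freedom at the target precision. The helper below turns given eigenvalue sequences into candidate widths; the constant factor and the logarithmic correction are placeholders rather than the exact constants of the theorem, so this is only a rule-of-thumb sketch.

```python
import numpy as np

def degree_of_freedom(mu, lam):
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(mu / (mu + lam)))

def suggested_widths(spectra, lambdas, c=5.0):
    """Width per layer ~ c * N_l(lambda_l) * log N_l(lambda_l); c and the log factor are placeholders."""
    widths = []
    for mu, lam in zip(spectra, lambdas):
        n_dof = degree_of_freedom(mu, lam)
        widths.append(int(np.ceil(c * n_dof * max(np.log(n_dof), 1.0))))
    return widths

spectra = [1.0 / np.arange(1, 5001) ** 2,      # a layer with a fast-decaying spectrum
           1.0 / np.arange(1, 5001) ** 1.2]    # a layer with a slowly decaying spectrum
print(suggested_widths(spectra, lambdas=[1e-3, 1e-3]))   # the slower layer needs a larger width
```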
Let the norm bounds shown in Theorem 1 be
Recall that Theorem 1 gives an upper bound of the infinity norm of , that is, where
Let the set of finite dimensional functions with the norm constraint (6) be
Then, we can show that the infinity norm of is also uniformly bounded, as stated in the following lemma.
Lemma 1
For all , it holds that
4 Generalization error bounds
In this section, we define two estimators in the finite dimensional model introduced in the last section: the empirical risk minimizer and the Bayes estimator. The generalization error bounds for both estimators are derived. We also give some examples in which the generalization error is analyzed in detail.
4.1 Notations
Before we state the generalization error bounds, we prepare some notation. We define two quantities associated with the finite dimensional model: the first is the finite approximation error given in Theorem 1, and, roughly speaking, the second corresponds to the amount of deviation of the estimators within the finite dimensional model.
4.2 Empirical risk minimization
In this section, we define the empirical risk minimizer and investigate its generalization error. Let the empirical risk minimizer over the finite dimensional model with the norm constraint (6) be
$$\widehat{f} := \operatorname*{arg\,min}_{f} \ \frac{1}{n} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2.$$
Note that there exists at least one minimizer because the parameter set corresponding to the model is compact and the empirical risk is a continuous function of the parameters. The estimator need not necessarily be the exact minimizer; it could be an approximate minimizer. We, however, assume it is the exact minimizer for theoretical simplicity. In practice, the empirical risk minimizer is obtained by using the back-propagation technique. The regularization of the norms of the weight matrices and the bias terms is implemented by using $\ell_2$-regularization and drop-out techniques.
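In practice the minimizer is searched for by gradient-based optimization. The following self-contained sketch runs plain full-batch gradient descent with an $\ell_2$ penalty on a small two-hidden-layer ReLU network for synthetic data; the architecture, step size, and penalty strength are illustrative, and no claim is made that this returns the exact constrained minimizer analyzed in this section.

```python
import numpy as np

rng = np.random.default_rng(4)
relu = lambda u: np.maximum(u, 0.0)

# synthetic regression data: y = f(x) + Gaussian noise
n, d = 200, 3
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(n)

m1, m2, lr, reg = 32, 32, 1e-2, 1e-3
W1 = rng.standard_normal((m1, d)) / np.sqrt(d); b1 = np.zeros(m1)
W2 = rng.standard_normal((m2, m1)) / np.sqrt(m1); b2 = np.zeros(m2)
w3 = rng.standard_normal(m2) / np.sqrt(m2); b3 = 0.0

for step in range(2000):
    # forward pass
    z1 = X @ W1.T + b1; a1 = relu(z1)
    z2 = a1 @ W2.T + b2; a2 = relu(z2)
    pred = a2 @ w3 + b3
    resid = pred - y                      # residuals of the squared loss
    # backward pass (the l2 penalty enters only through its gradient)
    g_pred = 2.0 * resid / n
    g_w3 = a2.T @ g_pred + 2 * reg * w3; g_b3 = g_pred.sum()
    g_z2 = np.outer(g_pred, w3) * (z2 > 0)
    g_W2 = g_z2.T @ a1 + 2 * reg * W2; g_b2 = g_z2.sum(axis=0)
    g_z1 = (g_z2 @ W2) * (z1 > 0)
    g_W1 = g_z1.T @ X + 2 * reg * W1; g_b1 = g_z1.sum(axis=0)
    # gradient descent update
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2
    w3 -= lr * g_w3; b3 -= lr * g_b3

print("final training MSE:", float(np.mean(resid ** 2)))
```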
The generalization error of the empirical risk minimizer is bounded as in the following theorem.
Theorem 2
For any and , suppose that
(9)
Then, there exist universal constants such that, for any and ,
with probability for every and .
The proof is given in Appendix C. This theorem can be shown by evaluating the covering number of the model and applying the local Rademacher complexity technique (Mendelson, 2002; Bartlett et al., 2005; Koltchinskii, 2006; Giné and Koltchinskii, 2006).
It is easily checked that the third term on the right-hand side is smaller than the first two terms; therefore, the generalization error bound can be simply evaluated as
Based on a rough evaluation
and the constraint (9), we can observe a bias-variance trade-off in the generalization error bound: as the approximation error decreases, the required width of the internal layers increases by the condition (9), and thus the deviation within the finite dimensional model increases. In other words, if we want to construct a finite dimensional model that approximates the true function well, then a more complicated model is required, and we must pay a larger variance for the estimator. A key notion for the bias-variance trade-off is the degree of freedom, which expresses the "complexity" of the RKHS in each layer. The degree of freedom of a complicated RKHS grows faster than that of a simpler one as $\lambda$ goes to 0. This is also informative in practice, because the degree of freedom gives good guidance for determining the width of each layer: if the degree of freedom is small compared with the sample size, then we may increase the width of the layer. An estimate of the degree of freedom can be computed from the trained network by forming the Gram matrix corresponding to the kernel induced by the trained network (where the kernel is defined by a finite sum instead of the integral form) and using the eigenvalues of the Gram matrix.
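The recipe just described can be written down directly: form the Gram matrix of the kernel induced by a trained layer on the training inputs, take its eigenvalues, and plug them into the degree-of-freedom formula. In the sketch below a random affine layer merely stands in for a trained one, and the choice of $\lambda$ values is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
relu = lambda u: np.maximum(u, 0.0)

n, d, m = 100, 3, 256
X = rng.standard_normal((n, d))                               # training inputs
W = rng.standard_normal((m, d)); b = rng.standard_normal(m)   # stands in for a trained layer

A = relu(X @ W.T + b)                    # layer outputs on the data, shape (n, m)
K = A @ A.T / m                          # Gram matrix of the induced (finite-sum) kernel
mu = np.linalg.eigvalsh(K)[::-1] / n     # empirical eigenvalues of the integral operator

def empirical_dof(mu, lam):
    return float(np.sum(mu / (mu + lam)))

for lam in (1e-1, 1e-2, 1e-3):
    # if this is small compared with the sample size, the width of the layer may be increased
    print(lam, empirical_dof(mu, lam))
```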
To obtain the best generalization error bound, these quantities should be tuned to balance the bias and variance terms (and accordingly the widths of the internal layers should also be fine-tuned). Examples of the best achievable generalization error are shown in Section 4.4.
4.3 Bayes estimator
In this section, we formulate a Bayes estimator and derive its generalization error. To define the Bayes estimator, we just need to specify the prior distribution. Let $B_r$ denote the ball of radius $r$ in the Euclidean space, and consider the uniform distribution on that ball. Since Theorem 1 ensures that the norms of $\widehat{W}^{(\ell)}$ and $\widehat{b}^{(\ell)}$ are bounded above by the quantities appearing in Eq. (6), it is natural to employ a prior distribution whose support is the set of parameters with norms not greater than those bounds. Based on this observation, we employ uniform distributions on balls with the radii indicated above as the prior distribution. In practice, the Gaussian distribution is also employed instead of the uniform distribution. However, the Gaussian distribution does not give a good tail probability bound for the infinity norm of the deep neural network model, which is crucial for developing the generalization error bound. For this reason, we decided to analyze the uniform prior distribution.
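Sampling from the uniform distribution on a Euclidean ball is straightforward (a Gaussian direction combined with a radially rescaled norm), so the prior above is easy to simulate. In the sketch below the radii and the way the parameters are grouped into balls are placeholders for the norm bounds of Theorem 1.

```python
import numpy as np

rng = np.random.default_rng(6)

def uniform_ball(shape, radius):
    """Draw one sample uniformly from the ball of the given radius in R^(prod(shape))."""
    g = rng.standard_normal(shape)
    direction = g / np.linalg.norm(g)
    r = radius * rng.uniform() ** (1.0 / g.size)   # inverse-CDF sampling of the radial part
    return r * direction

# one draw of the network parameters from the prior (radii are placeholders)
W_prior = [uniform_ball((64, 3), radius=5.0), uniform_ball((1, 64), radius=5.0)]
b_prior = [uniform_ball((64,), radius=1.0), uniform_ball((1,), radius=1.0)]
print([float(np.linalg.norm(p)) for p in W_prior + b_prior])  # norms never exceed the radii
```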
The prior distribution on the parameters induces a distribution of the function in the space of continuous functions endowed with the Borel algebra corresponding to the $L_\infty$-norm. We denote the induced distribution by $\Pi$. Using this prior, the posterior distribution is defined via the Bayes principle:
Since the purpose of this paper is to give a theoretical analysis of the generalization error, we do not pursue the computational issues of Bayesian deep learning. See, for example, Hernandez-Lobato and Adams (2015); Blundell et al. (2015) for practical algorithms.
The following theorem shows how fast the Bayes posterior contracts around the true function.
Theorem 3
Fix arbitrary and , and suppose that the condition (9) is satisfied. Then, for all , the posterior tail probability can be bounded as
The proof is given in Appendix B. The proof is accomplished by using techniques for nonparametric Bayes methods (Ghosal et al., 2000; van der Vaart and van Zanten, 2008, 2011). Roughly speaking, this theorem indicates that the posterior distribution concentrates within a certain distance of the true function $f^{\mathrm{o}}$. The tail probability is sub-Gaussian, and thus the posterior mass outside this distance from the true function decreases rapidly. Here we again observe a bias-variance trade-off between the approximation error and the deviation term, which can be understood in essentially the same way as for the empirical risk minimization.
From the posterior contraction rate, we can derive the generalization error bound of the posterior mean.
Corollary 1
Under the same setting as in Theorem 3, there exists a universal constant such that the generalization error of the posterior mean is bounded as
Therefore, for sufficiently large $n$ (which is the regime of our interest), the generalization error is simply bounded as
4.4 Examples
Here, we give some examples of the generalization error bound. We have seen that both the empirical risk minimizer and the Bayes estimator have a simplified generalization error bound of the form
by supposing that the remaining quantities are of constant order. We evaluate the bound under the best choice of the parameters, balancing the bias-variance trade-off.
One way to balance the terms is to set so that
where the log-factor and
are dropped for simplicity. Since the inequality of arithmetic and geometric means gives
, we may set the parameters to satisfy
(10)
Considering this relation and the constraint (Eq. (9)), we can estimate the best width that minimizes the upper bound of the generalization error.
4.4.1 Finite dimensional internal layer
If all the RKHSs are finite dimensional, say $m_\ell^*$-dimensional, then $N_\ell(\lambda) \le m_\ell^*$ for all $\lambda > 0$. Therefore, by choosing the widths and the regularization parameters accordingly, the generalization error bound is obtained as
(11)
where we omitted factors that do not depend on the sample size $n$. Note that, although the infinity-norm bound appears in the analysis, this convergence rate is independent of the Lipschitz constant up to a logarithmic order and depends solely on the number of parameters. Moreover, the convergence rate is $O(1/n)$ (up to a logarithmic factor) in terms of the sample size $n$. This is much faster than the existing bounds that utilize the Rademacher complexity, because those bounds are $O(1/\sqrt{n})$. This result matches more precise arguments for finite dimensional 3-layer neural networks based on asymptotic expansions (Fukumizu, 1999; Watanabe, 2001), which also showed that the generalization error of the 3-layer neural network can be evaluated as $O(1/n)$.
4.4.2 Polynomial decreasing rate of eigenvalues
We assume that the eigenvalues decay at a polynomial rate:
$$\mu_j^{(\ell)} \le a_\ell\, j^{-\frac{1}{s_\ell}} \quad (j = 1, 2, \dots) \qquad \text{(12)}$$
for a positive real $a_\ell$ and $0 < s_\ell < 1$. This is a standard assumption in the analysis of kernel methods (Caponnetto and de Vito, 2007; Steinwart and Christmann, 2008), and it is known that this assumption is equivalent to the usual covering number assumption (Steinwart et al., 2009). For small $s_\ell$, the decay rate is fast, and it is easy to approximate the kernel by another one corresponding to a finite dimensional subspace. Therefore, small $s_\ell$ corresponds to a simple model and large $s_\ell$ corresponds to a complicated model. In this setting, the degree of freedom is evaluated as
$$N_\ell(\lambda) \lesssim \lambda^{-s_\ell}. \qquad \text{(13)}$$
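Before the derivation (given just below), a quick numerical check of the scaling (13): for an eigenvalue sequence decaying as $j^{-1/s}$, the ratio $N(\lambda)/\lambda^{-s}$ stays roughly constant as $\lambda \to 0$. The truncation of the sequence is only an artifact of the finite computation.

```python
import numpy as np

def degree_of_freedom(mu, lam):
    return float(np.sum(mu / (mu + lam)))

s = 0.5
mu = np.arange(1.0, 2_000_001.0) ** (-1.0 / s)     # mu_j = j^(-1/s), truncated for computation
for lam in (1e-2, 1e-3, 1e-4):
    print(lam, degree_of_freedom(mu, lam) / lam ** (-s))   # approximately constant ratio
```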
The bound (13) can be shown as follows: for any positive integer $J$, the degree of freedom can be bounded as
$$N_\ell(\lambda) = \sum_{j=1}^{\infty} \frac{\mu_j^{(\ell)}}{\mu_j^{(\ell)} + \lambda} \le J + \frac{1}{\lambda} \sum_{j > J} \mu_j^{(\ell)} \le J + \frac{a_\ell}{\lambda} \cdot \frac{s_\ell}{1 - s_\ell}\, J^{1 - \frac{1}{s_\ell}}.$$
Letting $J \asymp \lambda^{-s_\ell}$ balance the first and the second terms, we obtain the evaluation (13). Hence, according to Eq. (10), the choice of the regularization parameters that balances these terms
gives the optimal rate, and we obtain the generalization error bound as
(14)
where we omitted factors that do not depend on the sample size $n$. This indicates that the complexity of the RKHSs directly affects the convergence rate. As expected, if the RKHSs are simple (that is,