We aim to provide a theoretical justification for the enormous success of deep neural networks (DNNs) in real-world applications (He et al., 2016; Collobert et al., 2011; Goodfellow et al., 2016). In particular, our paper focuses on the generalization performance of a general class of DNNs. The generalization bound is a powerful tool to characterize the predictive performance of a class of learning models for unseen data. Early studies investigate the generalization ability of shallow neural networks with no more than one hidden layer (Bartlett, 1998; Anthony and Bartlett, 2009). More recently, studies on the generalization bounds of deep neural networks have received increasing attention (Dinh et al., 2017; Bartlett et al., 2017; Golowich et al., 2017; Neyshabur et al., 2015, 2017). Two major questions in the analysis of generalization bounds are of interest to us:
(Q1) Can we establish tighter generalization error bounds for deep neural networks in terms of the network dimensions and structure of the weight matrices?
(Q2) Can we develop generalization bounds for neural networks with special architectures?
For (Q1), Neyshabur et al. (2015); Bartlett et al. (2017); Neyshabur et al. (2017); Golowich et al. (2017) have established results that characterize the generalization bounds in terms of the depth and width of networks and norms of rank- weight matrices. For example, Neyshabur et al. (2015) provide an exponential bound on based on the Frobenius norm , where is the weight matrix of the -th layer; Bartlett et al. (2017); Neyshabur et al. (2017) provide a polynomial bound on and based on (spectral norm) and (sum of the Euclidean norms for all rows of ). Golowich et al. (2017) provide a nearly size-independent bound based on . Nevertheless, generalization bounds that depend on norms other than the spectral norms of the weight matrices may be too loose. Specifically, () is in general () times larger than . Given training data points and suppose for ease of discussion, Bartlett et al. (2017) and Neyshabur et al. (2017) demonstrate generalization error bounds as , and Golowich et al. (2017) achieve a bound , where represents the rate by ignoring logarithmic factors. In comparison, we show a tighter margin-based generalization error bound as , which is significantly smaller than existing results and achieved based on a new Lipschitz analysis for DNNs in terms of both the input and weight matrices. Moreover, we establish that the margin is proportional to the product of norms for ReLU activation due to its homogeneity, so the product of (spectral) norms does not lead to a vacuous bound in margin-based results.
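The gap between the spectral norm and the other matrix norms mentioned above can be checked numerically. The following is a minimal sketch, assuming a randomly generated low-rank matrix; the width `p` and rank `r` are illustrative values, not settings from the paper. It uses two standard inequalities for a rank-`r` matrix with `p` rows: the Frobenius norm is at most `sqrt(r)` times the spectral norm, and the sum of row norms is at most `sqrt(p * r)` times the spectral norm.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 64, 8  # hypothetical width and rank, chosen only for illustration

# Rank-r matrix built as a product of two thin Gaussian factors.
W = rng.standard_normal((p, r)) @ rng.standard_normal((r, p))

spectral = np.linalg.norm(W, 2)              # largest singular value
frobenius = np.linalg.norm(W, "fro")         # root of the sum of squared entries
norm_2_1 = np.linalg.norm(W, axis=1).sum()   # sum of Euclidean norms of the rows

ratio_fro = frobenius / spectral
ratio_21 = norm_2_1 / spectral
print(f"Frobenius / spectral: {ratio_fro:.2f} (at most sqrt(r) = {np.sqrt(r):.2f})")
print(f"(2,1)-norm / spectral: {ratio_21:.2f} (at most sqrt(p*r) = {np.sqrt(p * r):.2f})")
```

When such ratios are multiplied across layers, a bound stated in terms of Frobenius or (2,1) norms can grow much faster with depth than one stated in terms of spectral norms.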
We notice that some recent results characterize the generalization bound in more structured ways, e.g., by considering specific error-resilience parameters (Arora et al., 2018), which can achieve empirically tighter generalization bounds than existing ones based on the norms of weight matrices. However, it is not clear how the weight matrices explicitly control these parameters, which makes the results less interpretable. We summarize the existing norm-based generalization bounds and our results in Table 1, as well as the results when for more explicit comparison in terms of the network sizes (i.e., depth and width). Further numerical comparison is provided in Section 5, including the result with bounded output.
For (Q2), we consider several widely used architectures, including convolutional neural networks (CNNs) (Krizhevsky et al., 2012), residual networks (ResNets) (He et al., 2016), and hyperspherical networks (SphereNets) (Liu et al., 2017b). By taking the structures of their weight matrices into consideration, we provide tight characterizations of their resulting capacities. In particular, we consider orthogonal filters and normalized weight matrices, which show good performance in both optimization and generalization (Mishkin and Matas, 2015; Xie et al., 2017). This is closely related to normalization frameworks, e.g., batch normalization (Ioffe and Szegedy, 2015) and layer normalization (Ba et al., 2016), which have achieved great empirical performance (Liu et al., 2017a; He et al., 2016). Take CNNs as an example. By incorporating the orthogonal structure of convolutional filters, we achieve
and Golowich et al. (2017) achieve
( in CNNs), where is the filter size that satisfies and
is stride size that is usually of the same order with; see Section 4.1 for details. Here we achieve stronger results in terms of both depth and width for CNNs, where our bound only depends on rather than . Some recent work obtains results that are free of the linear dependence on the weight matrix norms by considering networks with bounded outputs (Zhou and Feng, 2018). We can achieve similar results using bounded loss functions, as discussed in Section 3.2, but do not restrict ourselves to this scenario in general. An analogous improvement is also attained for ResNets and SphereNets. In addition, we consider some widely used operations for width expansion and reduction, e.g., padding and pooling, and show that they do not increase the generalization bound.
|Generalization Bound||Original Results|
|Neyshabur et al. (2015)|
|Bartlett et al. (2017)|
|Neyshabur et al. (2017)|
|Golowich et al. (2017)|
Our tighter bounds result in potentially stronger expressive power, hence higher training/testing accuracy for the DNNs. In particular, when achieving the same order of generalization errors, we allow the choice of a larger parameter space with deeper/wider networks and larger matrix spectral norms. We further show numerically that a larger parameter space can lead to better empirical performance. Quantitative analysis of the expressive power of DNNs is of great interest on its own, which includes (but is not limited to) studying how well DNNs can approximate general classes of functions and distributions (Cybenko, 1989; Hornik et al., 1989; Funahashi, 1989; Barron, 1993, 1994; Lee et al., 2017; Petersen and Voigtlaender, 2017; Hanin and Sellke, 2017), and quantifying the computational hardness of learning neural networks; see e.g., Shamir (2016); Eldan and Shamir (2016); Song et al. (2017). We defer our investigation in this direction to future efforts.
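Notation. Given an integer , we define . We use the standard notations $O(\cdot)$, $\Theta(\cdot)$, and $\Omega(\cdot)$ to denote limiting behaviors ignoring constants, and $\tilde{O}(\cdot)$, $\tilde{\Theta}(\cdot)$, and $\tilde{\Omega}(\cdot)$ to further ignore logarithmic factors.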
We provide a brief description of the DNNs. Given an input , the output of a -layer network is defined as , where
with an entry-wise activation function. We specify
as the rectified linear unit (ReLU) activation (Nair and Hinton, 2010) for ease of discussion. The extension to more general activations, e.g., Lipschitz continuous functions, is straightforward. Then we denote DNNs with bounded weight matrices and ranks as
where is an input, and are real positive constants. We will specify the norm and the corresponding upper bounds , e.g., and , or and , when necessary.
Given a loss function , we denote a class of loss functions measuring the discrepancy between a DNN’s output and the corresponding observation for a given input as
where the sets of bounded inputs and the corresponding observations are
Then the empirical Rademacher complexity (ERC) of given and is
is the set of vectors containing only the entries $+1$ and $-1$, and is a vector with Rademacher entries, i.e., $+1$ or $-1$ with equal probabilities.
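To make the ERC concrete, the following is a minimal Monte Carlo sketch for a finite function class. It assumes the common 1/m normalization and represents each function only by its outputs on the sample; the function name `empirical_rademacher` and the random classes are hypothetical and serve only as an illustration, not as the paper's procedure.

```python
import numpy as np

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps[ max_f (1/m) * sum_i eps_i * f(x_i) ].

    outputs: array of shape (n_functions, m); row j holds f_j(x_1), ..., f_j(x_m).
    """
    rng = np.random.default_rng(seed)
    _, m = outputs.shape
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))  # Rademacher vectors
    correlations = eps @ outputs.T / m                # shape (n_draws, n_functions)
    return correlations.max(axis=1).mean()            # max over the class, then average

rng = np.random.default_rng(1)
m = 200
small_class = rng.uniform(-1, 1, size=(5, m))
# A larger class that contains the small one.
large_class = np.vstack([small_class, rng.uniform(-1, 1, size=(495, m))])

erc_small = empirical_rademacher(small_class)
erc_large = empirical_rademacher(large_class)
print(f"ERC estimate, 5 functions:   {erc_small:.4f}")
print(f"ERC estimate, 500 functions: {erc_large:.4f}")
```

Because both estimates reuse the same Rademacher draws and the larger class contains the smaller one, the second estimate can never be below the first, which mirrors the monotonicity of the ERC in the function class.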
Take classification as an example. For multi-class classification, suppose is the number of classes. Consider with bounded outputs, namely the ramp risk. Specifically, for an input belonging to class , we denote . For a given real value for the margin , the class of ramp risk functions with the margin is
where is -Lipschitz continuous, defined as
For convenience, we denote as (or ) in the rest of the paper.
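As an illustration of the ramp risk, the sketch below assumes the standard ramp form used in margin-based analyses (loss 1 for a non-positive margin, 0 once the margin reaches the threshold, linear in between), with the margin taken as the true-class score minus the largest competing score. The function names and the example scores are hypothetical.

```python
import numpy as np

def margin(scores, label):
    """Score of the true class minus the largest score among the other classes."""
    return scores[label] - np.delete(scores, label).max()

def ramp_loss(scores, label, gamma):
    """Ramp loss: 1 for margin <= 0, 0 for margin >= gamma, linear in between.

    As a function of the margin it is (1 / gamma)-Lipschitz and bounded in [0, 1].
    """
    return float(np.clip(1.0 - margin(scores, label) / gamma, 0.0, 1.0))

scores = np.array([2.0, 0.5, 1.5])  # hypothetical network outputs for 3 classes

loss_a = ramp_loss(scores, label=0, gamma=1.0)   # margin 0.5, inside the ramp
loss_b = ramp_loss(scores, label=0, gamma=0.25)  # margin 0.5 exceeds gamma
loss_c = ramp_loss(scores, label=1, gamma=1.0)   # negative margin, misclassified
print(loss_a, loss_b, loss_c)  # 0.5 0.0 1.0
```

A larger margin threshold makes the loss smoother (smaller Lipschitz constant) but penalizes more correctly classified points, which is the tradeoff the margin-based bound captures.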
Then the generalization error bound (Bartlett et al., 2017) states the following. Given any real and , with probability at least , we have that for any , the generalization error for classification is upper bounded in terms of the ERC as
The right hand side (R.H.S.) of (4) is viewed as a guaranteed error bound for the gap between the testing and the empirical training performance. Since the ERC is generally the dominating term in (4), a small is desired for DNNs given the loss function . Analogous results hold for regression tasks; see e.g., Kearns and Vazirani (1994); Mohri et al. (2012) for details.
3 Generalization Error Bound for DNNs
We introduce some additional notations first. Given any two layers and input , we denote as the Jacobian from layer to layer , i.e., . For convenience, we denote when and denote when . Next, we denote as an upper bound of the norm of Jacobian for input over the parameter, i.e., .
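The distinction between the Jacobian norm and the product of layer-wise spectral norms can be seen in a small experiment. The sketch below assumes a bias-free network with ReLU applied at every layer and evaluates the input-output Jacobian at one input and one set of randomly drawn weights; it does not take the supremum over the parameter set that the definition above calls for, and the sizes are illustrative.

```python
import numpy as np

def relu_jacobian(weights, x):
    """Jacobian of x -> relu(W_D ... relu(W_1 x)) evaluated at x."""
    jac = np.eye(x.shape[0])
    h = x
    for W in weights:
        pre = W @ h
        active = (pre > 0).astype(float)   # entry-wise derivative of ReLU
        jac = (active[:, None] * W) @ jac  # chain rule: diag(active) @ W @ jac
        h = np.maximum(pre, 0.0)
    return jac

rng = np.random.default_rng(0)
depth, width = 6, 32  # hypothetical network size
weights = [rng.standard_normal((width, width)) / np.sqrt(width) for _ in range(depth)]
x = rng.standard_normal(width)

jac_norm = np.linalg.norm(relu_jacobian(weights, x), 2)
prod_norm = np.prod([np.linalg.norm(W, 2) for W in weights])
print(f"Jacobian spectral norm:    {jac_norm:.3f}")
print(f"product of spectral norms: {prod_norm:.3f}")
```

The Jacobian norm is always bounded by the product of spectral norms, since each factor `diag(active) @ W` has spectral norm at most that of `W`; the inactive units typically make it considerably smaller.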
3.1 A Tighter ERC Bound for DNNs
We first provide the ERC bound for the class of DNNs defined in (1) and Lipschitz loss functions in the following theorem. The proof is provided in Appendix A.
Theorem 3.1. Let be a -Lipschitz loss function and be the class of DNNs defined in (1), , for all , , and . Then we have
Note that depends on the norm of Jacobian, which is significantly smaller than the product of matrix norms that is exponential in
in general. For example, when we obtain the network from stochastic gradient descent using randomly initialized weights, then. Empirical distributions of and are provided in Appendix 5.4, where ’s are constants that are orders of magnitude smaller than . A further experiment in Appendix 5.3 shows that has a dependence slower than some low-degree poly(depth), rather than exponential in the depth as in . Thus, can be considered as a constant almost independent of in practice. Even in the worst case that (this almost never happens in practice), our bound is still tighter than existing spectral-norm-based bounds (Bartlett et al., 2017; Neyshabur et al., 2017) by an order of . Also note that is a quantity (including ) depending only on the training dataset, which is due to the fact that the ERC depends only on the training dataset.
For convenience, we treat as a constant here. We achieve
in Theorem 3.1, which is tighter than existing results based on the network sizes and norms of weight matrices, as shown in Table 1. In particular, Neyshabur et al. (2015) show an exponential dependence on , i.e.,
. This has a tighter dependence on network sizes. Nevertheless, is generally times larger than , which results in an exponential dependence
compared with the bound based on the spectral norm. Moreover, is linear in unless the stable ranks across all layers are close to 1 (rather than almost independent of as in Theorem 3.1 as shown in Appendix 5.4). In addition, it has
dependence rather than
except when . Note that our bound is based on a novel characterization of Lipschitz properties of DNNs, which may be of independent interest from the learning theory point of view. We refer to Appendix A for details.
We also remark that when achieving the same order of generalization errors, we allow the choices of larger dimensions () and spectral norms of weight matrices, which lead to stronger expressive power for DNNs. For example, when achieving the same bound with in spectral norm based results (e.g., in ours) and in Frobenius norm based results (e.g., in Golowich et al. (2017)), they only have in Frobenius norm based results. The latter results in a much smaller space for eligible weight matrices as is of order in general (i.e., for some constant ), which may lead to weaker expressive ability of DNNs. We also demonstrate numerically in Section 5.1 that when norms of weight matrices are constrained to be very small, both training and testing performance degrade significantly. A quantitative analysis of the tradeoff between expressive ability and generalization for DNNs is deferred to future work.
3.2 A Spectral Norm Free ERC Bound
When, in addition, the loss function is bounded, the ERC bound is free of the linear dependence on the spectral norm, as stated in the following corollary. The proof is provided in Appendix B.
Corollary 3.2. In addition to the conditions in Theorem 3.1, suppose is bounded, i.e., . Then the ERC satisfies
where . The boundedness of holds for certain loss functions, e.g., the ramp risk defined in (3). When is constant (e.g., for the ramp risk) and , we have that the ERC reduces to . This is close to the VC dimension of DNNs, which can be significantly tighter than existing norm-based bounds in general. Similar norm-free results hold for the architectures discussed in Section 4 using the argument for Corollary 3.2, which we omit due to space limitations. Moreover, our bound (5) is also tighter than recent results that are free of linear dependence on (Zhou and Feng, 2018; Arora et al., 2018). Specifically, Zhou and Feng (2018) show that the generalization bound for CNNs is , which results in a bound larger than (5) by . Arora et al. (2018) derive a bound for a compressed network in terms of some error-resilience parameters, which is since the cushion parameter therein is of the order . A further numerical comparison is provided in Section 5.2.
3.3 The Impact of Margin vs. Product of Norms
For norm-based generalization bounds of DNNs, one may be concerned that the product of norms can be too large to lead to a non-vacuous result, since it is exponential in the depth . However, one should note that the generalization bound also depends on the margin value , which can be an arbitrary value. Here we give an example showing that for deep ReLU networks, the product of norms in our generalization bound can scale with the margin value , and does not reflect the generalization error.
We consider two classes of DNNs and with ReLU activation, where
We then consider the following one-to-one correspondence relationship between and . For any function
there exists a corresponding such that
We then recall that the tightest generalization bound of w.r.t. is essentially given by
where is the optimal margin value, i.e.,
Since the ReLU activation is homogeneous, we have
Then we can rewrite (6) as:
Then we take , and obtain
which is exactly the tightest generalization bound of .
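The homogeneity argument of this subsection can be verified numerically. The sketch below assumes a bias-free network with ReLU on the hidden layers and a linear output layer, which is an illustrative choice; the layer sizes, the input, and the label are hypothetical. Rescaling every layer to unit spectral norm divides the classification margin by exactly the product of the original spectral norms, so the ratio of product of norms to margin is unchanged.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free network: ReLU on hidden layers, linear output layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

def margin(scores, label):
    """True-class score minus the largest competing score."""
    return scores[label] - np.delete(scores, label).max()

rng = np.random.default_rng(0)
weights = [
    rng.standard_normal((16, 10)),
    rng.standard_normal((16, 16)),
    rng.standard_normal((4, 16)),
]
x = rng.standard_normal(10)
label = 2  # hypothetical class label

# Rescale each layer to unit spectral norm.
scales = [np.linalg.norm(W, 2) for W in weights]
unit_weights = [W / s for W, s in zip(weights, scales)]

margin_orig = margin(relu_net(weights, x), label)
margin_unit = margin(relu_net(unit_weights, x), label)
prod_scales = np.prod(scales)
print(f"margin of original network:              {margin_orig:.6f}")
print(f"product of norms x margin of unit network: {prod_scales * margin_unit:.6f}")
```

The two printed values agree up to floating-point error, which is the scaling relation that keeps a large product of norms from making a margin-based bound vacuous.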
4 Exploring Network Structures
The generic result in Section 3 does not explicitly highlight the potential impact of specific network structures. In this section, we consider several popular architectures of DNNs, including convolutional neural networks (CNNs) (Krizhevsky et al., 2012), residual networks (ResNets) (He et al., 2016), and hyperspherical networks (SphereNets) (Liu et al., 2017b), and provide sharp characterizations of the corresponding generalization bounds. In particular, we consider orthogonal filters and normalized weight matrices, which have shown good performance in both optimization and generalization (Mishkin and Matas, 2015; Huang et al., 2017). Such constraints can be enforced using regularization on filters and weight matrices, which is very efficient to implement in practice. This is also closely related to normalization approaches, e.g., batch normalization (Ioffe and Szegedy, 2015) and layer normalization (Ba et al., 2016), which have achieved tremendous empirical success.
4.1 CNNs with Orthogonal Filters
CNNs are one of the most powerful architectures in deep learning, especially in tasks related to images and videos (Goodfellow et al., 2016). We consider a tight characterization of the generalization bound for CNNs by generating the weight matrices using unit-norm orthogonal filters, which have shown great empirical performance (Huang et al., 2017; Xie et al., 2017). Specifically, we generate the weight matrices using a circulant approach, as follows. For the convolutional operation at the -th layer, we have channels of convolution filters, each of which is generated from a -dimensional feature using a stride size . Suppose that divides both and , i.e., and are integers; then we have . This is equivalent to fixing the weight matrix at the -th layer to be generated as in (7), where for all , each is formed in a circulant-like way using a vector with unit norm for all as
When the stride size , corresponds to a standard circulant matrix (Davis, 2012). The following corollary establishes that when are orthogonal vectors with unit Euclidean norms, the generalization bound depends only on and , which are independent of the width . The proof is provided in Appendix C.
Corollary 4.1. Let be a -Lipschitz and bounded loss function, i.e., , and be the class of CNNs defined in (1). Suppose the weight matrices in CNNs are formed as in (7) and (8) with , , and divides both and for all , where satisfies for all and with for all . Denote . Then the ERC satisfies
Since in our setting, the ERC for CNNs is proportional to instead of . For the orthogonal filters considered in Corollary 4.1, we have and , which lead to the bounds of CNNs in existing results in Table 2. In practice, one usually has , which exhibits a significant improvement over existing results, i.e., . Even without the orthogonal constraint on filters, the rank in CNNs is usually of the same order as the width , which also makes the existing bound undesirable. On the other hand, it is common in practice that for some small constant in CNNs, in which case we have resulting from .
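The circulant-like construction of Section 4.1 can be sketched as follows. This is one plausible strided layout, assumed for illustration: each filter is zero-padded to the input width and cyclically shifted by multiples of the stride, and the blocks for different filters are stacked vertically. The exact arrangement in (7) and (8) may differ, and the sizes `p`, `k`, `stride`, and `n_filters` are hypothetical.

```python
import numpy as np

def strided_circulant(filt, p, stride):
    """Stack cyclic shifts (by `stride`) of a zero-padded filter into a (p // stride, p) matrix."""
    padded = np.zeros(p)
    padded[: filt.shape[0]] = filt
    return np.stack([np.roll(padded, j * stride) for j in range(p // stride)])

rng = np.random.default_rng(0)
p, k, stride, n_filters = 24, 6, 2, 3  # stride divides both k and p

# Orthonormal filters: columns of Q from a reduced QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((k, n_filters)))
filters = Q.T  # each row is a unit-norm filter of length k, rows mutually orthogonal

W = np.vstack([strided_circulant(f, p, stride) for f in filters])
spec = np.linalg.norm(W, 2)
limit = np.sqrt(k / stride)
print(f"weight matrix shape: {W.shape}")
print(f"spectral norm: {spec:.4f}, sqrt(k / stride): {limit:.4f}")
```

For this layout the spectral norm never exceeds `sqrt(k / stride)`, whatever the width `p`; this follows from block-diagonalizing the matrix with a discrete Fourier transform over the shifts. With a stride equal to the filter size, the rows are orthonormal and the spectral norm is exactly 1.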
|Neyshabur et al. (2015)|
|Bartlett et al. (2017)|
|Neyshabur et al. (2017)|
|Golowich et al. (2017)|
We consider vector input in Corollary 4.1. For matrix inputs, e.g., images, similar results hold by considering vectorized input and permuting columns of . Specifically, suppose and are integers for ease of discussion. Consider the input as a dimensional vector obtained by vectorizing a input matrix. When the 2-dimensional (matrix) convolutional filters are of size , we form the rows of each by concatenating vectors padded with 0’s, each of which is a concatenation of one row of the filter of size with some zeros as follows:
Correspondingly, the stride size is on average and we have if for all ; see Appendix E for details. This is equivalent to permuting the columns of generated as in (8) by vectorizing the matrix filters in order to validate the convolution of the filters with all patches of the matrix input.
A more practical scenario for CNNs is when a network has a few fully connected layers after the convolutional layers. Suppose we have convolutional layers and fully connected layers. From the analysis in Corollary 4.1, when for convolutional layers and for fully connected layers, we have that the overall ERC satisfies
4.2 ResNets with Structured Weight Matrices
Residual networks (ResNets) (He et al., 2016) are among the most powerful architectures and allow the training of very deep networks. We denote the class of ResNets with bounded weight matrices , as
Given an input , the output of a -layer ResNet is defined as , where . For any two layers and input , we denote as the Jacobian from layer to layer , i.e., , and as an upper bound of the norm of Jacobian for input over the parameter, i.e., . We then provide an upper bound of the ERC for ResNets in the following corollary. The proof is provided in Appendix D. Let be a -Lipschitz and bounded loss function, i.e., , and be the ResNets defined in (9) with and for all , , and