Deep (neural) networks are being profitably applied in a large and growing number of areas, from signal processing to computer vision and artificial intelligence. The expressive power of these networks has been demonstrated both in theory and in practice cybenko1989approximation ; barron1994approximation ; telgarsky2015representation ; yarotsky2017error ; hanin2017approximating ; daubechies2019nonlinear . In fact, it has been shown that deep networks can even perfectly fit pure noise zhang2016understanding . Surprisingly, highly overparameterized deep networks, where the number of network parameters exceeds the number of training data points, can be trained for a range of different classification and regression tasks and perform extremely well on unobserved data. Understanding why these networks generalize so well has been the subject of great interest in recent years. To date, however, classical approaches to bounding the generalization error have failed to provide much insight into deep networks.
The ability of overparameterized deep networks to overfit noise while generalizing well suggests the existence of some kind of (explicit or implicit) regularization in the learning process. In order to both understand and improve regularization in training deep networks, one key question to address is: What is the correct measure to evaluate the complexity of a deep network? As we discuss in detail below, both classical and recent measures have come up short on insights. In this paper, we take a different tack. Let
represent the mapping from the input to output of a deep network constructed using piecewise-affine activations (e.g., ReLU, leaky ReLU, absolute value). When the activations are also convex, this mapping can be written as a composition of Max Affine Spline Operators (MASOs) and, using the framework provided in balestriero2018mad ; balestriero2018spline , we can write it as the continuous, piecewise affine operator
Below, we propose a new complexity measure for based on the matrices.
The main intuition behind our measure can be described as follows. We aim to quantify how far the mapping is from a locally linear mapping on the data. Motivated by the concept of Hessian eigenmaps introduced by Donoho and Grimes donoho2003hessian for manifold learning, we propose the tangent Hessian norm integral as a new complexity measure for deep networks.
Two main features of our measure distinguish it from other proposed regularization penalties: 1) Distance from a linear mapping: Most regularization penalties proposed in the literature focus on the behavior of the mapping on regions where it is linear. For example, Tikhonov regularization on the weights of the network bounds the Lipschitz constant of the mapping in individual regions. In contrast, our Hessian measure quantifies how much the mapping differs from an affine mapping over the entire input space. 2) Local geometrical structure of the input data: In most applications, for example when the input data consists of images, the training data points lie on a lower-dimensional manifold. We can exploit the local geometrical structure of the data to evaluate the mapping as a function of the manifold local coordinates.
Our main contributions can be summarized as follows:
[C1] A new complexity measure for deep networks. In Section 2, we propose and justify the tangent Hessian norm integral as a new complexity measure for deep networks. In Section 4, we present two methods to compute the measure efficiently.
[C2] Understanding the role of deep network parameters on complexity. In Section 3, we study the growth in complexity of functions generated by the units (neurons) in each layer of a deep network. This provides an upper bound on the complexity of the network output in terms of network parameters.
[C3] Data augmentation as implicit Hessian regularization. In Section 5 we study data augmentation and show that using this technique while training a deep network decreases the Hessian complexity measure. Hence, we can consider this technique as an implicit regularization method with the tangent Hessian norm integral as the penalty.
More broadly, our Hessian complexity measure can open up new directions in understanding how optimization methods such as stochastic gradient descent act as implicit regularizers in training deep networks. In addition, it can inspire new training frameworks for deep networks that are robust to a range of adversarial attacks. Proofs of all of our results appear in the Appendix.
Related work. There is a growing literature studying the generalization properties of overparameterized neural networks bartlett2002rademacher ; bartlett2017spectrally ; arora2018stronger ; dziugaite2017computing ; neyshabur2017exploring . In these papers, the authors obtained sharper bounds than naïve parameter counting by using the stability of the deep network around the weights achieved after training. Recently, it has been shown that by having sufficient overparameterization, the weights of a trained network will be close to its random initialization allen2018learning ; arora2019fine ; cao2019generalization . Using this fact, these papers have achieved better generalization bounds for deep networks. In hanin2019complexity , the authors observe that the expected number of linear regions in a trained deep network grows polynomially (and not exponentially) with the number of units. All of the above works suggest that, in order to understand generalization in deep nets, one must measure the complexity of the deep network mapping.
Recently, in savarese2019infinite , the authors show that for one-dimensional, one-hidden-layer ReLU networks, an ℓ2 penalty on the weights (Tikhonov regularization) is equivalent to a penalty on the integral of the absolute value of the second derivative of the network output. Moreover, in belkin2019two ; belkin2018reconciling ; hastie2019surprises the authors show that in some linear and nonlinear inference problems, properly regularized overparameterized models can generalize well. This again indicates the importance of complexity regularization for understanding generalization in deep networks.
2 Formulating the Hessian-Based Complexity Measure
Network with smooth activations. Let
be the prediction function of a deep network whose nonlinear activation functions are smooth functions. For regression, we can take this function to be the mapping from the input to the output of the network. For classification, we can take it to be the mapping from the input to one of the inputs of the final softmax operation. We assume that the training data lies close to a -dimensional smooth manifold . This assumption has been studied in an extensive literature on unsupervised learning, e.g. tenenbaum2000global ; belkin2003laplacian ; donoho2003hessian , and holds at least approximately for many practical datasets, including images.
For , inspired by the Hessian eigenmaps approach of Donoho and Grimes donoho2003hessian , we propose the following complexity measure ,
where is the Hessian of at in the coordinates of -dimensional affine space tangent to manifold at .
From donoho2003hessian we know that measures the average curviness of over the manifold and that is zero if and only if is an affine function on . In the simplistic case of one-dimensional data and one hidden layer networks, savarese2019infinite have related to the sum of the squared Frobenius norms of the weight matrices.
While the manifold assumption is highly recommended for exploiting the data’s geometrical structure in computing the complexity measure (2), it is not essential. We can instead take the manifold to be the entire input space to obtain
This measure is easier to compute, but it does not exploit the geometrical structure of the data and therefore might not be as revealing as (2).
Network with continuous, piecewise affine activations. Our focus in this paper is on a complexity measure for deep networks constructed using piecewise affine activations (e.g., ReLU, leaky ReLU, absolute value). In this case, is piecewise affine and thus not continuously differentiable. Therefore, the Hessian is not well defined in its usual form.
Note that a network with continuous, piecewise affine activations partitions the input space
based on the activation pattern of the network’s units (neurons). We call these partitions the vector quantization (VQ) regions of the network. Note that inside one VQ region, the network mapping is simply an affine mapping. As a result, it can be written as a continuous, piecewise affine operator as in (1). Note that the parameters in (1) are in fact functions of the network activation pattern and therefore of the VQ region containing the input. However, for the sake of brevity, we will use the simplified notation in (1).
We can now define our complexity measure for a network with continuous, piecewise affine activations. Let . For not on the boundaries of VQ partitions and an arbitrary unit vector, we define as
if is on the -dimensional affine space tangent to the data manifold at and , otherwise. Note that is a (weak) gradient of at , and therefore this definition agrees with the finite element definition of the Hessian. For smooth and recovers the Hessian milne2000calculus ; jordan1965calculus . Thus, for a network with continuous, piecewise affine activations, we define
where is uniform over the unit sphere. Comparing with (2), this definition is consistent with the definition of the distributional derivative for piecewise constant functions and can be seen as measuring the changes in the local slopes of the piecewise affine spline realized by the network.
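As a rough numerical illustration of averaging over directions drawn uniformly from the unit sphere, the sketch below uses a smooth quadratic test function, chosen as an assumption because its Hessian is known exactly; it is not a network from the paper. It forms finite differences of the gradient along random unit directions and checks the standard identity that the mean of the squared norm of the Hessian applied to a uniform unit direction equals the squared Frobenius norm divided by the dimension. The helper names, the step size `eps`, and the use of a squared norm are illustrative assumptions; the exact exponent and normalization in the definition above are not recoverable from this text.

```python
import numpy as np

def random_unit_vectors(n, dim, rng):
    """Draw n directions uniformly on the unit sphere by normalizing Gaussians."""
    g = rng.standard_normal((n, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def mean_directional_energy(grad, x, eps, n_dirs, rng):
    """Average squared norm of the gradient finite difference along random directions."""
    directions = random_unit_vectors(n_dirs, x.size, rng)
    g0 = grad(x)
    diffs = np.stack([(grad(x + eps * u) - g0) / eps for u in directions])
    return float(np.mean(np.sum(diffs ** 2, axis=1)))

rng = np.random.default_rng(0)
dim = 5
B = rng.standard_normal((dim, dim))
Q = (B + B.T) / 2                      # symmetric Hessian of f(x) = 0.5 * x^T Q x

def grad(x):
    return Q @ x                       # exact gradient of the quadratic test function

x0 = rng.standard_normal(dim)
estimate = dim * mean_directional_energy(grad, x0, eps=1e-3, n_dirs=20000, rng=rng)
exact = float(np.sum(Q ** 2))          # squared Frobenius norm of the Hessian
print(estimate, exact)
```

For a piecewise affine network the same sampling loop could be applied to differences of the local slopes rather than to a smooth gradient, which is the situation the definition above is designed for.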
It is worthwhile to compare the Hessian complexity measure with the notion of “number of VQ regions” hanin2019complexity in a ReLU network. We believe that the Hessian measure provides a more useful quantification of the network output complexity, because it explicitly takes into account the changes in the output function across the VQ regions. For instance, consider the analysis of infinitely wide networks, which have been used to help understand the convergence properties and performance of deep networks lee2017deep ; mei2019mean ; arora2019exact . The number of VQ regions can be infinite in such networks; however, the Hessian measure remains bounded as long as the network weight matrices have rows with bounded norm. We will discuss this in more detail in Section 3.
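The contrast between region counts and slope changes can be seen in a toy one-dimensional setting. The sketch below is only an illustration under stated assumptions: a one-hidden-layer ReLU network with hidden weights bounded by one, a mean-field-style 1/n output scaling, and the total absolute slope change used as a stand-in for the complexity in one dimension. None of these choices, nor the function names, are taken from the paper.

```python
import numpy as np

def one_layer_relu(x, w, b, v):
    """Scalar-input network: sum_i v_i * relu(w_i * x + b_i)."""
    return np.maximum(np.outer(x, w) + b, 0.0) @ v

def regions_and_variation(n, rng):
    """Return (number of linear regions, total absolute slope change) for width n."""
    w = rng.uniform(0.5, 1.0, n)               # hidden weights, bounded by 1
    kinks = np.sort(rng.uniform(-1.0, 1.0, n)) # where each unit switches on
    b = -w * kinks
    v = np.full(n, 1.0 / n)                    # assumed 1/n output scaling
    grid = np.concatenate([[-2.0], kinks, [2.0]])
    y = one_layer_relu(grid, w, b, v)
    slopes = np.diff(y) / np.diff(grid)
    return n + 1, float(np.sum(np.abs(np.diff(slopes))))

rng = np.random.default_rng(0)
results = [regions_and_variation(n, rng) for n in (10, 100, 1000)]
for regions, variation in results:
    print(regions, variation)
```

In this construction the number of regions grows with the width while the total slope change stays below one, since each unit contributes a slope jump equal to the product of its input and output weights.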
When a network has more than one hidden layer, it is not straightforward to obtain an explicit formula for or in terms of the network parameters (weight matrices and biases). However, it is possible to efficiently approximate the Hessian measure (see Section 4).
3 Hessian Complexity Growth Through a Deep Network
In this section, we study the outputs of different units (neurons) in a ReLU deep network as functions of the input in order to shed light on how the Hessian complexity increases through the network. The scalar ReLU activation function is applied elementwise to an input vector to create a thresholded output. A ReLU network intersperses affine transformations with ReLU thresholding. We call the weights and the biases of the network.
We focus on the case of a network with layers that processes a scalar, one-dimensional input . For , let be the number of units in the -th layer of the network. In this case, the output of the -th unit in the -th layer of the network, , is a continuous, piecewise affine function of the input that we denote by . Such a function can be written as
where is the number of linear pieces of and are the spline break points (or knots). We can compute
for as in (6) via
In order to understand the complexity of the output functions of different units of the network, i.e., for and , we must compute the complexity of a unit output function in terms of the complexities of its input functions. In a ReLU network, each unit performs two operations on its input functions: a linear combination of the input functions followed by ReLU thresholding. The following lemma studies the linear combination process and finds that is a seminorm on the space of one-dimensional, continuous, piecewise affine functions .
Let be piecewise affine functions for and let . If , then
The following theorem bounds the complexity of in terms of the complexity of and its Lipschitz constant.
Let be a piecewise affine function as in (6) with Lipschitz constant . Then
Theorem 3 implies that, if the functions generated in the network have bounded Lipschitz constant , which is the case when the weight matrices have bounded norm, then each ReLU nonlinearity unit adds at most to the complexity regardless of the layer in which the unit is located. This suggests that, in the case of one-dimensional networks with weight matrices of bounded norm, the maximum complexity of the output is a linear function of the number of units in the network.
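A small numerical check can make the one-dimensional discussion concrete. The sketch below assumes that the complexity of a scalar piecewise affine function is the sum of the absolute slope changes at its knots; this form is an assumption, in keeping with the second-derivative description given earlier, and the function name `slope_variation` is hypothetical. It evaluates a tent function, an affine function, a linear combination of the two, and the ReLU of the tent function on a set of knots that contains every break point.

```python
import numpy as np

def slope_variation(xs, ys):
    """Sum of absolute slope changes of the piecewise-linear interpolant of (xs, ys)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    slopes = np.diff(ys) / np.diff(xs)
    return float(np.sum(np.abs(np.diff(slopes))))

# The knots include every break point of the functions below, so the values are exact.
knots = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
f = 1.0 - np.abs(knots)        # tent function with Lipschitz constant 1
g = 0.5 * knots                # affine function, no slope change
relu_f = np.maximum(f, 0.0)    # ReLU thresholding of the tent

c_f = slope_variation(knots, f)
c_g = slope_variation(knots, g)
c_relu_f = slope_variation(knots, relu_f)
c_mix = slope_variation(knots, 2.0 * f - 3.0 * g)
print(c_f, c_g, c_relu_f, c_mix)
```

In this example the tent function has slope variation 2, thresholding raises it to 4, and the linear combination stays within the weighted sum of the individual values, consistent with the seminorm property and with an increase per ReLU on the order of the Lipschitz constant.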
Define the complexity of the -th layer of the network as
We can bound the complexity of the layers for networks with bounded weight matrices.
Let with rows be the weight matrix of the -th layer. If the output functions of the units in the -th layer have Lipschitz constant , then
Using the fact that , we have the following.
If the weight matrices of all layers have rows with norm less than one, then the complexity of the network output function is at most , where is the number of layers.
4 Computing the Hessian Complexity Measure
In this section, we discuss two efficient methods to approximate in practical deep networks.
Finite Differences Method. First note that we can find the -dimensional subspace tangent to the data manifold in a neighborhood around data point as the -dimensional principal subspace of defined by , where is the set of nearest neighbors of and
In this method, we estimate the complexity measure using its Monte Carlo approximation based on the training data .
When the network has smooth activations, we have
We can also apply the Monte Carlo method to estimate . If is chosen uniformly at random on the unit sphere, then for we have
For smooth , we have
When the network has continuous, piecewise affine activations, Monte Carlo approximation of based on the training data yields
For such a network, using (4) yields
as our approximation.
In a ReLU network with layers, for a given , can be computed very efficiently via
where is the weight matrix of the -th layer and is a diagonal matrix with if the output of the -th ReLU unit in the -th layer is nonzero (aka “active”) with as input and if the ReLU output is zero. This enables us to use as a regularization penalty in training real networks.
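The product of weight matrices and diagonal activation matrices described above can be sketched in a few lines. The code below is a minimal illustration under assumptions that are not stated in the text: a fully connected network, ReLU on every layer except a final linear layer, and randomly drawn weights. The function name `relu_net_slope` is hypothetical. It accumulates the local slope matrix during the forward pass and compares it with a small finite difference of the output.

```python
import numpy as np

def relu_net_slope(weights, biases, x):
    """Return the output and the local slope matrix of a ReLU network at x.

    Assumes ReLU after every layer except the last, which is linear.
    """
    h = np.asarray(x, dtype=float)
    slope = np.eye(h.size)
    for W, b in zip(weights[:-1], biases[:-1]):
        z = W @ h + b
        active = (z > 0).astype(float)        # diagonal of the activation matrix
        h = active * z                        # ReLU output
        slope = (active[:, None] * W) @ slope # multiply by (diag(active) W)
    out = weights[-1] @ h + biases[-1]
    return out, weights[-1] @ slope

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]                          # illustrative layer widths
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])
out, A = relu_net_slope(weights, biases, x)

# Inside one VQ region the network is affine, so a tiny step changes the output by A @ step.
step = 1e-6 * rng.standard_normal(sizes[0])
out_shifted, _ = relu_net_slope(weights, biases, x + step)
print(np.max(np.abs((out_shifted - out) - A @ step)))
```

Because the slope matrix only needs the activation pattern from a forward pass, differences of slope matrices at nearby inputs are cheap to form, which is what makes a penalty of this kind practical during training.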
Let be the manifold neighborhood of that contains and its nearest neighbors , . Taking as a partition of unity on with , we can write as
Let be the vector of samples of over
We use the approximation
where is the tangent Hessian operator on . is a positive semidefinite matrix whose null space contains the constant and linear functions on the tangent space at ; it can be constructed in the following way wang2012geometric . Let be top right singular vectors of and construct the matrix
where is the Hadamard product of the vectors . By performing Gram-Schmidt orthogonalization on the columns of , we obtain , which forms an orthonormal basis for quadratic functions on . Then
is the tangent Hessian operator on .
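The construction just described can be sketched numerically as follows. This is a minimal illustration in the spirit of the Hessian eigenmaps recipe, not the paper's exact operator: the normalization, the use of a QR factorization in place of explicit Gram-Schmidt, and the function name `tangent_hessian_operator` are assumptions. It estimates tangent coordinates from the top right singular vectors of the centered neighborhood, builds constant, linear, and pairwise-product columns, orthonormalizes them, and keeps the quadratic part.

```python
import numpy as np

def tangent_hessian_operator(neighbors, d):
    """Build a (k, k) quadratic-form matrix from k neighborhood points of shape (k, D).

    Requires k >= 1 + d + d * (d + 1) / 2 points in general position.
    """
    centered = neighbors - neighbors.mean(axis=0)
    # The top-d right singular vectors span the estimated tangent space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:d].T                     # (k, d) tangent coordinates
    # Columns: constant, linear terms, then products coords_i * coords_j with i <= j.
    products = [coords[:, i] * coords[:, j] for i in range(d) for j in range(i, d)]
    basis = np.column_stack([np.ones(len(coords)), coords] + products)
    q, _ = np.linalg.qr(basis)                       # orthonormalization (Gram-Schmidt up to sign)
    q_quad = q[:, 1 + d:]                            # orthonormal quadratic directions
    return q_quad @ q_quad.T, coords

rng = np.random.default_rng(0)
s, t = rng.uniform(-1.0, 1.0, (2, 30))
patch = np.column_stack([s, t, 0.1 * s ** 2])        # gently curved 2-D patch in 3-D
H, coords = tangent_hessian_operator(patch, d=2)

quadratic = coords[:, 0] ** 2                        # samples of a quadratic function
print(np.abs(H @ np.ones(30)).max(), np.abs(H @ coords).max(), quadratic @ H @ quadratic)
```

The resulting matrix is positive semidefinite and annihilates samples of constant and linear functions of the tangent coordinates, while samples of a quadratic function give a strictly positive quadratic form.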
For , this approximation can be further simplified to
with and when or are not in .
Thanks to the simplicity of the quadratic form , we can study the smoothness properties of
by analyzing the spectrum and eigenspaces of the Hessian operator.
5 Data Augmentation Effects Implicit Hessian Complexity Regularization
Data augmentation wong2016understanding ; perez2017effectiveness is an oft-used, yet poorly understood, heuristic applied in learning the parameters of deep networks. The data augmentation procedure augments the set of training data points to training data points by applying transformations to the training data such that they continue to lie on the data manifold . Example transformations applied to images include translation, rotation, color changes, etc. In such cases, is the vector difference , where is the translated/rotated image. In this section, we analyze the effect of data augmentation on our Hessian complexity measure and furthermore show that it acts as an implicit regularizer during learning. Consider training a deep network with continuous, piecewise affine activations given the original training dataset by minimizing the loss
Here, the per-sample loss is any convex loss function. After data augmentation, the loss can be written as
The following result establishes the relationship between data augmentation and Hessian complexity.
Consider a deep network with continuous, piecewise affine activations and thus prediction function as in (1). Assume that , and has Lipschitz constant , i.e., , for all . Further, assume that the loss function has Lipschitz constant . Then, for and small enough , in (29) can be approximated by
From (30), we can note the close relationship between data augmentation and the Hessian complexity measure. Indeed, the second term on the right-hand side of the inequality is very similar in form to the Hessian complexity measure in (19). This suggests that adding a Hessian complexity penalty term as a regularizer to the loss should decrease the resulting of the network. Moreover, in the experimental validation in Figure 1 and Table 1, we observe that the converse is also true: data augmentation also decreases the Hessian complexity.
[Figure 1: plots of the complexity measure on training data and on test data versus learning epochs, with panels for CIFAR10 | ResNet, SVHN | ResNet, CIFAR100 | ResNet, and CIFAR100 | CNN. Image not recovered.]
We can make a few additional observations from the results in Table 1. 1) Impact of network architecture on complexity: We observe that the convolutional network (CNN) results in smaller than the residual network (ResNet). This sheds light on the advantages of a convolutional architecture for image classification. We also observe that the measured using the training data is almost zero for CNNs trained on all four datasets. This suggests that the prediction function is almost linear within of the training data. However, the complexity is significantly higher when measured using the test data. This is an interesting property of purely convolutional networks that begs investigation. 2) Impact of overparameterization on complexity: Surprisingly, the Large ResNet (with more parameters) results in smaller than the smaller ResNet. This might be a result of the implicit regularization that results from training overparameterized networks via stochastic gradient descent (SGD) arora2019fine ; allen2018learning . 3) Impact of dataset on complexity: As expected, training the same network for a more complex task (e.g., classification with CIFAR100 vs. CIFAR10) results in larger .
| Network (dataset) | Complexity on training data | Complexity on test data | Test accuracy (%) |
| Large ResNet (MNIST) | 0.061 | 0.079 | 99.5 |
| Large ResNet+DA (MNIST) | 0.019 | 0.025 | 99.5 |
In this paper, we have introduced a new Hessian-based measure for the complexity of a deep network and its prediction. An attractive property of our measure compared to previously proposed measures such as the number of linear regions (VQ partitions) is that it captures the amount by which the network’s output changes not just locally but across the entire input space. Further, our measure explicitly exploits the geometrical structure of the training data. We have demonstrated a direct link between the heuristic of data augmentation and an implicit Hessian complexity penalty during learning. There are many potential applications for our new measure, including new ways to study generalization and optimization in deep networks and new, more powerful regularization penalties.
- (ADH19a) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang, On exact computation with an infinitely wide neural net, arXiv preprint arXiv:1904.11955 (2019).
- (ADH19b) Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang, Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks, arXiv preprint arXiv:1901.08584 (2019).
- (AGNZ18) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang, Stronger generalization bounds for deep nets via a compression approach, arXiv preprint arXiv:1802.05296 (2018).
- (AZLL18) Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang, Learning and generalization in overparameterized neural networks, going beyond two layers, arXiv preprint arXiv:1811.04918 (2018).
- (Bar94) Andrew R Barron, Approximation and estimation bounds for artificial neural networks, Machine learning 14 (1994), no. 1, 115–133.
- (BB18a) Randall Balestriero and Richard Baraniuk, Mad max: Affine spline insights into deep learning, arXiv preprint arXiv:1805.06576 (2018).
- (BB18b) Randall Balestriero and Richard G. Baraniuk, A spline theory of deep networks, International Conference on Machine Learning, 2018, pp. 383–392.
- (BFT17) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky, Spectrally-normalized margin bounds for neural networks, Advances in Neural Information Processing Systems, 2017, pp. 6240–6249.
- (BHMM18) Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal, Reconciling modern machine learning and the bias-variance trade-off, arXiv preprint arXiv:1812.11118 (2018).
- (BHX19) Mikhail Belkin, Daniel Hsu, and Ji Xu, Two models of double descent for weak features, arXiv preprint arXiv:1903.07571 (2019).
- (BM02) Peter L Bartlett and Shahar Mendelson, Rademacher and gaussian complexities: Risk bounds and structural results, Journal of Machine Learning Research 3 (2002), no. Nov, 463–482.
- (BN03) Mikhail Belkin and Partha Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (2003), no. 6, 1373–1396.
- (CG19) Yuan Cao and Quanquan Gu, A generalization theory of gradient descent for learning over-parameterized deep relu networks, arXiv preprint arXiv:1902.01384 (2019).
- (Cyb89) George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of control, signals and systems 2 (1989), no. 4, 303–314.
- (DDF19) Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova, Nonlinear approximation and (deep) relu networks, arXiv preprint arXiv:1905.02199 (2019).
- (DG03) David L Donoho and Carrie Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences 100 (2003), no. 10, 5591–5596.
- (DGR18) Tri Dao, Albert Gu, Alexander J Ratner, Virginia Smith, Christopher De Sa, and Christopher Ré, A kernel theory of modern data augmentation, arXiv preprint arXiv:1803.06084 (2018).
- (DR17) Gintare Karolina Dziugaite and Daniel M Roy, Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data, arXiv preprint arXiv:1703.11008 (2017).
- (HMRT19) Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, arXiv preprint arXiv:1903.08560 (2019).
- (HR19) Boris Hanin and David Rolnick, Complexity of linear regions in deep networks, arXiv preprint arXiv:1901.09021 (2019).
- (HS17) Boris Hanin and Mark Sellke, Approximating continuous functions by relu nets of minimal width, arXiv preprint arXiv:1710.11278 (2017).
- (JJ65) Charles Jordan and Károly Jordán, Calculus of finite differences, vol. 33, American Mathematical Soc., 1965.
- (LBN17) Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein, Deep neural networks as gaussian processes, arXiv preprint arXiv:1711.00165 (2017).
- (MMM19) Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Mean-field theory of two-layer neural networks: dimension-free bounds and kernel limit, arXiv preprint arXiv:1902.06015 (2019).
- (MT00) Louis Melville Milne-Thomson, The calculus of finite differences, American Mathematical Soc., 2000.
- (NBMS17) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro, Exploring generalization in deep learning, Advances in Neural Information Processing Systems, 2017, pp. 5947–5956.
- (PW17) Luis Perez and Jason Wang, The effectiveness of data augmentation in image classification using deep learning, arXiv preprint arXiv:1712.04621 (2017).
- (RMV11) Salah Rifai, Grégoire Mesnil, Pascal Vincent, Xavier Muller, Yoshua Bengio, Yann Dauphin, and Xavier Glorot, Higher order contractive auto-encoder, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2011, pp. 645–660.
- (SESS19) Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro, How do infinite width bounded norm networks look in function space?, arXiv preprint arXiv:1902.05040 (2019).
- (TDSL00) Joshua B Tenenbaum, Vin De Silva, and John C Langford, A global geometric framework for nonlinear dimensionality reduction, science 290 (2000), no. 5500, 2319–2323.
- (Tel15) Matus Telgarsky, Representation benefits of deep feedforward networks, arXiv preprint arXiv:1509.08101 (2015).
- (Wan12) Jianzhong Wang, Geometric structure of high-dimensional data and dimensionality reduction, Springer, 2012.
- (WGSM16) Sebastien C Wong, Adam Gatt, Victor Stamatescu, and Mark D McDonnell, Understanding data augmentation for classification: When to warp?, 2016 international conference on digital image computing: techniques and applications (DICTA), IEEE, 2016, pp. 1–6.
- (Yar17) Dmitry Yarotsky, Error bounds for approximations with deep relu networks, Neural Networks 94 (2017), 103–114.
- (ZBH16) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, Understanding deep learning requires rethinking generalization, arXiv preprint arXiv:1611.03530 (2016).
- (ZK16) Sergey Zagoruyko and Nikos Komodakis, Wide residual networks, arXiv preprint arXiv:1605.07146 (2016).
Appendix A Proofs
A.1 Proof of Lemma 3
A.2 Proof of Theorem 3
Let be as in (6). If at the most negative point where crosses zero, the function changes from negative to positive, we call this root . Also, if at the most positive point where crosses zero, it changes from positive to negative, we call this root . Let all other zero crossings of be , and let and be the set of points where changes from positive to negative, and from negative to positive, respectively. Note that we have
which completes the proof.
A.3 Proof of Theorem 3
A.4 Proof of Theorem 5
Let . Using (1), we have
Therefore, for , the first order approximation of around we obtain
Under the conditions of the theorem, we obtain
Summing up over and yields the following bound for , the first order approximation of for small ,
which completes the proof.
Appendix B Experimental Details
All experiments used the following parameters: a batch size of 16 and the Adam optimizer with the learning rate scheduled at 0.005 (initial), 0.0015 (epoch 100) and 0.001 (epoch 150). The default training/test split was used for all datasets. The validation set consists of 15% of the training set sampled randomly.
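The learning rate schedule stated above can be written as a small piecewise-constant function. This is a sketch under the assumption that epochs are counted from zero and that each new rate takes effect at the start of the named epoch; the function name `learning_rate` is hypothetical and not tied to any particular training library.

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate: 0.005, then 0.0015 from epoch 100, then 0.001 from epoch 150."""
    if epoch < 100:
        return 0.005
    if epoch < 150:
        return 0.0015
    return 0.001

# Rates at a few representative epochs.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 99, 100, 149, 150, 199)}
print(schedule)
```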
B.1 CNN Architecture
Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
Pool2D(2x2)
Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
Pool2D(2x2)
Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
Conv2D(Number Filters=192, size=1x1, Leakiness=0.01)
Conv2D(Number Filters=Number Classes, size=1x1, Leakiness=0.01)
GlobalPool2D(pool_type='AVG')
B.2 ResNet and Large ResNet Architectures
The ResNets follow the original architecture zagoruyko2016wide with depth , width for the ResNet and depth , width for the Large ResNet.