
A Hessian Based Complexity Measure for Deep Networks

05/28/2019
by   Hamid Javadi, et al.

Deep (neural) networks have been applied productively in a wide range of supervised and unsupervised learning tasks. Unlike classical machine learning algorithms, deep networks typically operate in the overparameterized regime, where the number of parameters is larger than the number of training data points. Consequently, understanding the generalization properties and role of (explicit or implicit) regularization in these networks is of great importance. Inspired by the seminal work of Donoho and Grimes in manifold learning, we develop a new measure for the complexity of the function generated by a deep network based on the integral of the norm of the tangent Hessian. This complexity measure can be used to quantify the irregularity of the function a deep network fits to training data or as a regularization penalty for deep network learning. Indeed, we show that the oft-used heuristic of data augmentation imposes an implicit Hessian regularization during learning. We demonstrate the utility of our new complexity measure through a range of learning experiments.


1 Introduction

Deep (neural) networks are being profitably applied in a large and growing number of areas, from signal processing to computer vision and artificial intelligence. The expressive power of these networks has been demonstrated both in theory and practice cybenko1989approximation ; barron1994approximation ; telgarsky2015representation ; yarotsky2017error ; hanin2017approximating ; daubechies2019nonlinear . In fact, it has been shown that deep networks can even perfectly fit pure noise zhang2016understanding . Surprisingly, highly overparameterized deep networks – where the number of network parameters exceeds the number of training data points – can be trained for a range of different classification and regression tasks and perform extremely well on unobserved data. Understanding why these networks generalize so well has been the subject of great interest in recent years. But, to date, classical approaches to bounding the generalization error have failed to provide much insight into deep networks.

The ability of overparameterized deep networks to overfit noise while generalizing well suggests the existence of some kind of (explicit or implicit) regularization in the learning process. In order to both understand and improve regularization in training deep networks, one key question to address is: What is the correct measure to evaluate the complexity of a deep network? As we discuss in detail below, both classical and recent measures have come up short on insights. In this paper, we take a different tack. Let $f$ represent the mapping from the input to the output of a deep network constructed using piecewise-affine activations (e.g., ReLU, leaky ReLU, absolute value). When the activations are also convex, $f$ can be written as a composition of Max Affine Spline Operators (MASOs), and, using the framework provided in balestriero2018mad ; balestriero2018spline , we can write $f$ as the continuous, piecewise affine operator

$$f(x) = A_x x + b_x. \tag{1}$$

Below, we propose a new complexity measure for $f$ based on the matrices $A_x$.

The main intuition behind our measure can be described as follows. We aim to quantify how far the mapping $f$ is from a locally linear mapping on the data. Motivated by the concept of Hessian eigenmaps introduced by Donoho and Grimes donoho2003hessian for manifold learning, we propose the tangent Hessian norm integral as a new complexity measure for deep networks.

Two main features of our measure distinguish it from other proposed regularization penalties: 1) Distance from a linear mapping: Most regularization penalties proposed in the literature focus on the behavior of the mapping on regions where it is linear. For example, Tikhonov regularization on the weights of the network bounds the Lipschitz constant of the mapping in individual regions. In contrast, our Hessian measure quantifies how much the mapping differs from an affine mapping over the entire input space. 2) Local geometrical structure of the input data: In most applications, for example when the input data consists of images, the training data points lie on a lower-dimensional manifold of dimension $d$. We can exploit the local geometrical structure of the data to evaluate the mapping as a function of the manifold's local coordinates.

Our main contributions can be summarized as follows:

[C1] A new complexity measure for deep networks. In Section 2, we propose and justify the tangent Hessian norm integral as a new complexity measure for deep networks.

In Section 4, we present two methods to compute the measure efficiently.

[C2] Understanding the role of deep network parameters on complexity. In Section 3, we study the growth in complexity of the functions generated by the units (neurons) in each layer of a deep network. This provides an upper bound on the complexity of the network output in terms of the network parameters.

[C3] Data augmentation as implicit Hessian regularization. In Section 5, we study data augmentation and show that using this technique while training a deep network decreases the Hessian complexity measure. Hence, we can consider this technique as an implicit regularization method with the tangent Hessian norm integral as the penalty.

More broadly, our Hessian complexity measure can open up new directions in understanding the role of optimization methods such as stochastic gradient descent as a form of implicit regularization in training deep networks. In addition, it can inspire new training frameworks for deep networks that are robust to a range of adversarial attacks. Proofs of all of our results appear in the Appendix.

Related work.  There is a growing literature studying the generalization properties of overparameterized neural networks bartlett2002rademacher ; bartlett2017spectrally ; arora2018stronger ; dziugaite2017computing ; neyshabur2017exploring . In these papers, the authors obtained sharper bounds than naïve parameter counting by using the stability of the deep network around the weights achieved after training. Recently, it has been shown that by having sufficient overparameterization, the weights of a trained network will be close to its random initialization allen2018learning ; arora2019fine ; cao2019generalization . Using this fact, these papers have achieved better generalization bounds for deep networks. In hanin2019complexity , the authors observe that the expected number of linear regions in a trained deep network grows polynomially (and not exponentially) with the number of units. All of the above works suggest that, in order to understand generalization in deep nets, one must measure the complexity of the deep network mapping.

Recently, in savarese2019infinite , the authors show that for one-dimensional, one-hidden-layer ReLU networks, an $\ell_2$ penalty on the weights (Tikhonov regularization) is equivalent to a penalty on the integral of the absolute value of the second derivative of the network output. Moreover, in belkin2019two ; belkin2018reconciling ; hastie2019surprises the authors show that in some linear and nonlinear inference problems, properly regularized overparameterized models can generalize well. This again indicates the importance of complexity regularization for understanding generalization in deep networks.

In rifai2011higher , the authors propose an auto-encoder whose regularization penalizes the change in the encoder Jacobian. Also, in dao2018kernel , the effects of data augmentation have been studied by modeling data augmentation as a kernel.

2 Formulating the Hessian Based Complexity Measure

Network with smooth activations.  Let $f: \mathbb{R}^D \to \mathbb{R}$ be the prediction function of a deep network whose nonlinear activation functions are smooth. For regression, we can take $f$ as the mapping from the input to the output of the network. For classification, we can take $f$ as the mapping from the input to one of the inputs of the final softmax operation. We assume that the training data lies close to a $d$-dimensional smooth manifold $\mathcal{M} \subset \mathbb{R}^D$. This assumption has been studied in an extensive literature on unsupervised learning, e.g., tenenbaum2000global ; belkin2003laplacian ; donoho2003hessian , and holds at least approximately for many practical datasets, including images.

For $x \in \mathcal{M}$, inspired by the Hessian eigenmaps approach of Donoho and Grimes donoho2003hessian , we propose the following complexity measure $C_{\mathcal{M}}(f)$,

$$C_{\mathcal{M}}(f) = \int_{\mathcal{M}} \big\| H_f(x) \big\|_F \, d\mu(x), \tag{2}$$

where $H_f(x)$ is the Hessian of $f$ at $x$ in the coordinates of the $d$-dimensional affine space tangent to the manifold $\mathcal{M}$ at $x$, $\|\cdot\|_F$ denotes the Frobenius norm, and $\mu$ denotes the distribution of the data on $\mathcal{M}$.

From donoho2003hessian we know that $C_{\mathcal{M}}(f)$ measures the average curviness of $f$ over the manifold and that it is zero if and only if $f$ is an affine function on $\mathcal{M}$. In the simple case of one-dimensional data and one-hidden-layer networks, savarese2019infinite related this measure to the sum of the squared Frobenius norms of the weight matrices.

While the manifold assumption is valuable for exploiting the data's geometrical structure in computing the complexity measure (2), it is not essential. We can take $d = D$ and let the integration domain be the entire input space to obtain

(3)

This measure is easier to compute, but it does not exploit the geometrical structure of the data and therefore might not be as revealing as (2).

Network with continuous, piecewise affine activations. Our focus in this paper is on a complexity measure for deep networks constructed using piecewise affine activations (e.g., ReLU, leaky ReLU, absolute value). In this case, $f$ is piecewise affine and thus not continuously differentiable. Therefore, the Hessian is not well defined in its usual form.

Note that a network with continuous, piecewise affine activations partitions the input space based on the activation pattern of the network's units (neurons). We call these partitions the vector quantization (VQ) regions of the network. Inside one VQ region, $f$ is simply an affine mapping. As a result, $f$ can be written as a continuous, piecewise affine operator as in (1). Note that in (1), $A_x$ and $b_x$ are in fact functions of the network activation pattern and therefore of the VQ region containing $x$. However, for the sake of brevity, we use the simplified notation in (1).

We can now define our complexity measure for a network with continuous, piecewise affine activations. Let $\delta > 0$. For $x$ not on the boundaries of the VQ partitions and an arbitrary unit vector $u$, we define the finite-difference Hessian along $u$ as

(4)

if $u$ lies in the $d$-dimensional affine space tangent to the data manifold at $x$, and zero otherwise. Note that $A_x$ is a (weak) gradient of $f$ at $x$, and therefore this definition agrees with the finite element definition of the Hessian. For smooth $f$ and $\delta \to 0$, it recovers the Hessian milne2000calculus ; jordan1965calculus . Thus, for a network with continuous, piecewise affine activations, we define

(5)

where $u$ is uniform over the unit sphere. Comparing with (2), this definition is consistent with the definition of the distributional derivative for piecewise constant functions and can be seen as measuring the changes in the local slopes of the piecewise affine spline realized by the network.

It is worthwhile to compare the Hessian complexity measure with the notion of “number of VQ regions” hanin2019complexity in a ReLU network. We believe that the Hessian measure provides a more useful quantification of the network output complexity, because it explicitly takes into account the changes in the output function across the VQ regions. For instance, consider the analysis of infinitely wide networks, which have been used to help understand the convergence properties and performance of deep networks lee2017deep ; mei2019mean ; arora2019exact . The number of VQ regions can be infinite in such networks; however, the Hessian measure remains bounded as long as the network weight matrices have rows with bounded norm. We will discuss this in more detail in Section 3.

When a network has more than one hidden layer, it is not straightforward to obtain an explicit formula for the measures (2) or (5) in terms of the network parameters (weight matrices and biases). However, it is possible to efficiently approximate the Hessian measure (see Section 4).

3 Hessian Complexity Growth Through a Deep Network

In this section, we study the outputs of different units (neurons) in a ReLU deep network as functions of the input in order to shed light on how the Hessian complexity increases through the network. The scalar ReLU activation function $\max(\cdot, 0)$ is applied elementwise to an input vector to create a thresholded output. A ReLU network intersperses affine transformations with ReLU thresholding. We call $W_\ell$ the weights and $b_\ell$ the biases of the network.

We focus on the case of a network with $L$ layers that processes a scalar, one-dimensional input $t$. For $\ell = 1, \dots, L$, let $n_\ell$ be the number of units in the $\ell$-th layer of the network. In this case, the output of the $j$-th unit in the $\ell$-th layer of the network is a continuous, piecewise affine function of the input, which we denote by $g_{\ell,j}(t)$. Such a function $g$ can be written as

(6)

where $K$ is the number of linear pieces of $g$ and $t_1 < \dots < t_{K-1}$ are the spline break points (or knots). We can compute the one-dimensional complexity measure

(7)

for $g$ as in (6) via

(8)
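
As a concrete illustration, the quantity in (8) can be computed directly from the list of consecutive slopes. The sketch below is a minimal NumPy rendering under the assumption (consistent with the distributional definition in (5)) that the one-dimensional measure reduces to the sum of absolute slope changes at the knots; the exact form of (8) is elided in this version of the text.

    import numpy as np

    def piecewise_affine_complexity(slopes):
        """One-dimensional complexity of a continuous, piecewise affine function:
        its (distributional) second derivative is a sum of point masses at the
        knots, so the measure reduces to the total slope variation."""
        slopes = np.asarray(slopes, dtype=float)
        return np.abs(np.diff(slopes)).sum()

    # ReLU(t) = max(t, 0) has slopes (0, 1), so its complexity is 1.
    print(piecewise_affine_complexity([0.0, 1.0]))   # 1.0
    # A "tent" function with slopes (1, -1) has complexity 2.
    print(piecewise_affine_complexity([1.0, -1.0]))  # 2.0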

In order to understand the complexity of the output functions of different units of the network, i.e., $C(g_{\ell,j})$ for $\ell = 1, \dots, L$ and $j = 1, \dots, n_\ell$, we must compute the complexity of a unit output function in terms of the complexities of its input functions. In a ReLU network, each unit performs two operations on its input functions: a linear combination of the input functions followed by ReLU thresholding. The following lemma studies the linear combination step and finds that $C(\cdot)$ is a seminorm on the space of one-dimensional, continuous, piecewise affine functions.

Let be piecewise affine functions for and let . If , then

(9)

The following theorem bounds the complexity of the thresholded function $\max(g, 0)$ in terms of the complexity of $g$ and its Lipschitz constant.

Let $g$ be a piecewise affine function as in (6) with Lipschitz constant $L_g$. Then

(10)

Theorem 3 implies that, if the functions generated in the network have bounded Lipschitz constant, which is the case when the weight matrices have bounded norm, then each ReLU nonlinearity adds at most a constant proportional to that Lipschitz bound to the complexity, regardless of the layer in which the unit is located. This suggests that, in the case of one-dimensional networks with weight matrices of bounded norm, the maximum complexity of the output is a linear function of the number of units in the network.

Define the complexity of the $\ell$-th layer of the network as

(11)

We can bound the complexity of the layers for networks with bounded weight matrices.

Let $W_\ell$, with rows $w_{\ell,1}, \dots, w_{\ell,n_\ell}$, be the weight matrix of the $\ell$-th layer. If the output functions of the units feeding into the $\ell$-th layer have Lipschitz constant $L$, then

(12)

Using the fact that , we have the following.

If the weight matrices of all layers have rows with norm less than one, then the complexity of the network output function is at most , where is the number of layers.
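
To get a feel for this corollary, the rough numerical check below (not an experiment from the paper; the widths, input range, and sampling grid are arbitrary choices) estimates the output complexity of a random one-dimensional ReLU network with row-normalized weights by densely sampling the input and summing the absolute slope changes, and compares it with the total number of hidden units; the exact constant in the bound is elided in this version of the text.

    import numpy as np

    rng = np.random.default_rng(0)

    def random_relu_net(widths):
        """Random 1-D -> 1-D ReLU network; each weight row is rescaled to unit
        norm so that every unit is 1-Lipschitz in its inputs (the corollary's
        bounded-row-norm setting)."""
        layers, d_in = [], 1
        for d_out in widths + [1]:
            W = rng.standard_normal((d_out, d_in))
            W /= np.linalg.norm(W, axis=1, keepdims=True)
            b = 0.1 * rng.standard_normal(d_out)
            layers.append((W, b))
            d_in = d_out
        return layers

    def forward(layers, ts):
        h = ts[None, :]                              # shape (1, n_points)
        for W, b in layers[:-1]:
            h = np.maximum(W @ h + b[:, None], 0.0)  # hidden ReLU layers
        W, b = layers[-1]
        return (W @ h + b[:, None])[0]               # final linear layer

    widths = [20, 20, 20]
    net = random_relu_net(widths)
    ts = np.linspace(-5.0, 5.0, 200001)
    slopes = np.diff(forward(net, ts)) / np.diff(ts)
    complexity = np.abs(np.diff(slopes)).sum()       # total slope variation
    print(f"estimated C(f) = {complexity:.2f}, hidden units = {sum(widths)}")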

4 Computing the Hessian Complexity Measure

In this section, we discuss two efficient methods to approximate the Hessian complexity measure in practical deep networks.

Finite Differences Method.  First, note that we can find the $d$-dimensional subspace tangent to the data manifold in a neighborhood around a data point $x_i$ as the $d$-dimensional principal subspace of the local covariance matrix defined in (13), where $N_i$ denotes the set of nearest neighbors of $x_i$:

(13)
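
One possible implementation of this local-PCA step is sketched below; the neighborhood size k and the manifold dimension d are free parameters here, not values prescribed by the text.

    import numpy as np

    def tangent_basis(X, i, k=10, d=2):
        """Estimate an orthonormal basis for the d-dimensional tangent subspace
        at X[i]: center its k nearest neighbors and take the top d right
        singular vectors of the centered neighborhood matrix (local PCA)."""
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.argsort(dists)[1:k + 1]            # k nearest neighbors of x_i
        Z = X[nbrs] - X[nbrs].mean(axis=0)           # centered neighborhood
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        return Vt[:d].T                              # D x d, columns span the tangent space

    X = np.random.randn(500, 3)                      # toy data in R^3
    U = tangent_basis(X, i=0, k=10, d=2)
    print(U.shape)                                   # (3, 2)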

In this method, we estimate the complexity measure using its Monte Carlo approximation based on the training data $\{x_i\}_{i=1}^n$.

When the network has smooth activations, we have

(14)

We can also apply the Monte Carlo method to estimate . If is chosen uniformly at random on the unit sphere, then for we have

(15)

For smooth , we have

(16)

Therefore, choosing $u$ uniformly at random on the unit sphere of the $d$-dimensional subspace tangent to the manifold at $x_i$ and plugging (15) and (16) into (14) yields the following approximation for small $\delta$:

(17)

When the network has continuous, piecewise affine activations, the Monte Carlo approximation of (5) based on the training data yields

(18)

For such a network, using (4) yields

(19)

as our approximation.
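
A PyTorch sketch of this estimator is given below. Here `f` is assumed to map a batch of inputs to one scalar per input (e.g., a single logit), `tangent_bases` holds a D x d orthonormal tangent basis per data point (e.g., from the local-PCA step above), and the step size `delta` and the number of sampled directions per point are choices of this sketch rather than values prescribed by (19).

    import torch

    def grad_f(f, x):
        """Gradient of the scalar network output f at a single input x."""
        x = x.clone().requires_grad_(True)
        y = f(x.unsqueeze(0)).squeeze()
        (g,) = torch.autograd.grad(y, x)
        return g

    def hessian_measure(f, X, tangent_bases, delta=1e-2, n_dirs=5):
        """Monte Carlo finite-difference estimate of the tangent-Hessian measure:
        average, over data points and random tangent directions u,
        || grad f(x + delta*u) - grad f(x) || / delta."""
        total, count = 0.0, 0
        for x, U in zip(X, tangent_bases):           # U: D x d orthonormal basis
            g0 = grad_f(f, x)
            for _ in range(n_dirs):
                v = torch.randn(U.shape[1])
                u = U @ (v / v.norm())               # unit direction in the tangent space
                g1 = grad_f(f, x + delta * u)
                total += (g1 - g0).norm().item() / delta
                count += 1
        return total / count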

In a ReLU network with $L$ layers, for a given input $x$, the slope matrix $A_x$ in (1) can be computed very efficiently via

$$A_x = W_L \, D_{L-1}(x) \, W_{L-1} \cdots D_1(x) \, W_1, \tag{20}$$

where $W_\ell$ is the weight matrix of the $\ell$-th layer and $D_\ell(x)$ is a diagonal matrix with $[D_\ell(x)]_{jj} = 1$ if the output of the $j$-th ReLU unit in the $\ell$-th layer is nonzero (aka "active") with $x$ as input and $[D_\ell(x)]_{jj} = 0$ if the ReLU output is zero. This enables us to use (19) as a regularization penalty in training real networks.
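
For a plain fully connected ReLU network, the product in (20) can also be assembled explicitly, without automatic differentiation. The NumPy sketch below assumes the network is given as a list of (W, b) pairs with the final layer linear; differences of this Jacobian at x and at x + delta*u then give the finite-difference term in (19) directly.

    import numpy as np

    def region_jacobian(layers, x):
        """Exact input-output Jacobian A_x of a fully connected ReLU network at x:
        A_x = W_L D_{L-1} W_{L-1} ... D_1 W_1, where D_l is the diagonal 0/1 mask
        recording which ReLU units are active at x."""
        A = np.eye(len(x))
        h = x
        for W, b in layers[:-1]:
            pre = W @ h + b
            D = np.diag((pre > 0).astype(float))     # activation pattern at x
            A = D @ W @ A
            h = np.maximum(pre, 0.0)
        W_last, _ = layers[-1]
        return W_last @ A                            # rows = gradients of each output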

Discretization Method. Given the manifold dimension $d$, we can use the discretization technique of donoho2003hessian to write the penalty in (2) as a quadratic form in the values of $f$ at the data points.

Let $\mathcal{N}_i$ be the manifold neighborhood of $x_i$ that contains $x_i$ and its nearest neighbors. Taking a partition of unity on $\mathcal{M}$ subordinate to the neighborhoods $\{\mathcal{N}_i\}$, we can write (2) as

(21)

Let $f_{\mathcal{N}_i}$ denote the vector of samples of $f$ over $\mathcal{N}_i$,

(22)

We use the approximation

(23)

where $H_i$ is the tangent Hessian operator on $\mathcal{N}_i$. The associated quadratic form is positive semidefinite, and its null space contains the constant and linear functions on the tangent space at $x_i$; the operator can be constructed in the following way wang2012geometric . Let $u_1, \dots, u_d$ be the top $d$ right singular vectors of the centered neighborhood matrix and construct the matrix

(24)

where $u_a \circ u_b$ denotes the Hadamard product of the vectors $u_a$ and $u_b$. By performing Gram-Schmidt orthogonalization on the columns of this matrix, we obtain an orthonormal basis for the quadratic functions on $\mathcal{N}_i$. Then

(25)

is the tangent Hessian operator on $\mathcal{N}_i$.
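
The following sketch carries out this construction for a single neighborhood, in the spirit of the Hessian-eigenmaps recipe; the neighborhood size, the manifold dimension d, and the omission of the partition-of-unity weights are simplifications of this sketch rather than details taken from the text.

    import numpy as np

    def local_hessian_operator(Xi, d):
        """Given the k x D matrix Xi of points in one manifold neighborhood,
        build a local tangent-Hessian estimator H (shape d(d+1)/2 x k) so that
        || H @ f_Ni ||^2 approximates the integrated squared tangent Hessian of
        f over the neighborhood (f_Ni = values of f at the neighborhood points)."""
        k = Xi.shape[0]
        Z = Xi - Xi.mean(axis=0)
        _, _, Vt = np.linalg.svd(Z, full_matrices=False)
        U = Z @ Vt[:d].T                             # local tangent coordinates, k x d
        # Design matrix: [constant, linear terms, quadratic (Hadamard) terms].
        cols = [np.ones(k)] + [U[:, a] for a in range(d)]
        cols += [U[:, a] * U[:, b] for a in range(d) for b in range(a, d)]
        M = np.column_stack(cols)                    # requires k >= 1 + d + d(d+1)/2
        Q, _ = np.linalg.qr(M)                       # Gram-Schmidt orthonormalization
        return Q[:, 1 + d:].T                        # rows spanning the quadratic part

    # Accumulating H_i.T @ H_i over all neighborhoods gives the positive
    # semidefinite matrix of the quadratic form described above.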

For , this approximation can be further simplified to

(26)

where

(27)

with and when or are not in .

Thanks to the simplicity of the quadratic form, we can study the smoothness properties of $f$ by analyzing the spectrum and eigenspaces of the Hessian operator.

5 Data Augmentation Effects Implicit Hessian Complexity Regularization

Data augmentation wong2016understanding ; perez2017effectiveness is an oft-used, yet poorly understood, heuristic applied in learning the parameters of deep networks. The data augmentation procedure enlarges the set of training data points by applying transformations to the training data such that the transformed points continue to lie on the data manifold $\mathcal{M}$. Example transformations applied to images include translation, rotation, color changes, etc. In such cases, the augmentation perturbation is the vector difference between the transformed (translated/rotated) image and the original image. In this section, we analyze the effect of data augmentation on our Hessian complexity measure and show that it acts as an implicit regularizer during learning. Consider training a deep network with continuous, piecewise affine activations given the original training dataset by minimizing the loss

(28)

where the per-example loss is any convex loss function. After data augmentation, the loss can be written as

(29)

The following result establishes the relationship between data augmentation and Hessian complexity.

Consider a deep network with continuous, piecewise affine activations and thus a prediction function as in (1). Assume that the prediction function has Lipschitz constant $L_f$, i.e., $\|f(x) - f(x')\| \le L_f \|x - x'\|$ for all $x, x'$, and that the loss function has Lipschitz constant $L_\ell$. Then, for small enough augmentation perturbations, the augmented loss in (29) can be approximated by

(30)

From (30), we can note the close relationship between data augmentation and the Hessian complexity measure. Indeed, the second term on the right-hand side of the inequality is very similar in form to the Hessian complexity measure in (19). This suggests that adding a Hessian complexity penalty term as a regularizer to the loss should decrease the resulting augmented loss of the network. Moreover, in the experimental validation in Figure 1 and Table 1, we observe that the converse is also true: data augmentation also decreases the Hessian complexity.
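
This connection suggests training with an explicit Hessian penalty instead of (or in addition to) data augmentation. The PyTorch sketch below is one way to do so and is not the paper's prescription: it uses the gradient of the summed outputs as a cheap surrogate for the full Jacobian, and the step size `delta`, the penalty weight `lam`, and the per-example tangent directions `tangent_u` are assumptions of this sketch.

    import torch

    def hessian_penalty(model, x, tangent_u, delta=1e-2):
        """Finite-difference surrogate for the tangent-Hessian measure on a batch:
        penalize the change of the input gradient of the (summed) network output
        along a unit tangent direction attached to each input."""
        x0 = x.detach().clone().requires_grad_(True)
        x1 = (x + delta * tangent_u).detach().requires_grad_(True)
        (g0,) = torch.autograd.grad(model(x0).sum(), x0, create_graph=True)
        (g1,) = torch.autograd.grad(model(x1).sum(), x1, create_graph=True)
        return ((g1 - g0).flatten(1).norm(dim=1) / delta).mean()

    # Usage inside a training step (lam is a user-chosen penalty weight):
    #   loss = criterion(model(x), y) + lam * hessian_penalty(model, x, tangent_u)
    #   loss.backward(); optimizer.step()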

[Figure 1 panels: the Hessian measure on training data (left) and on test data (right) versus learning epochs, for ResNet on CIFAR10, SVHN, and CIFAR100, and for a CNN on CIFAR100.]

Figure 1: Experimental validation that data augmentation reduces the Hessian complexity measure (19) in a classification task with the ResNet and CNN architectures on the CIFAR10, SVHN, and CIFAR100 datasets. The blue/black curves correspond to experiments with/without data augmentation. (See the Appendix for the experimental details.)

We can make a few additional observations from the results in Table 1. 1) Impact of network architecture on complexity: We observe that the convolutional network (CNN) results in a smaller Hessian measure than the residual network (ResNet). This sheds light on the advantages of a convolutional architecture for image classification. We also observe that the measure evaluated on the training data is almost zero for CNNs trained on all four datasets. This suggests that the prediction function is almost linear in the vicinity of the training data. However, the complexity is significantly higher when measured using the test data. This is an interesting property of purely convolutional networks that warrants further investigation. 2) Impact of overparameterization on complexity: Surprisingly, the Large ResNet (with more parameters) results in a smaller Hessian measure than the smaller ResNet. This might be a result of the implicit regularization that arises from training overparameterized networks via stochastic gradient descent (SGD) arora2019fine ; allen2018learning . 3) Impact of dataset on complexity: As expected, training the same network for a more complex task (e.g., classification on CIFAR100 vs. CIFAR10) results in a larger Hessian measure.

Network (dataset)    Hessian measure on training data    Hessian measure on test data    Test accuracy (%)
CNN (MNIST) 1.57e-09 0.036 99.6
CNN+DA (MNIST) 1.56e-09 0.074 99.6
ResNet (MNIST) 0.086 0.20 99.4
ResNet+DA (MNIST) 0.021 0.016 99.4
Large ResNet (MNIST) 0.061 0.079 99.5
Large ResNet+DA (MNIST) 0.019 0.025 99.5
ResNet (CIFAR10) 0.43 0.50 84.9
ResNet+DA (CIFAR10) 0.10 0.12 91.0
CNN (CIFAR10) 1.54e-09 0.10 87.4
CNN+DA (CIFAR10) 2.58e-09 0.11 91.7
ResNet (SVHN) 0.08 0.09 93.9
ResNet+DA (SVHN) 0.01 0.01 94.1
CNN (SVHN) 8.3e-10 0.022 95.6
CNN+DA (SVHN) 6.32e-10 0.019 95.5
ResNet (CIFAR100) 4.32 4.39 49.3
ResNet+DA (CIFAR100) 1.08 1.14 64.5
CNN (CIFAR100) 6.3e-08 1.51 61.1
CNN+DA (CIFAR100) 7.0e-08 1.40 68.3
Table 1: Additional experimental validation that data augmentation reduces the Hessian complexity measure (19) in a classification task with a range of deep networks and datasets. We tabulate the converged values of the Hessian measure on the training and test data without and with data augmentation (denoted by DA). The fourth and sixth sets of rows summarize the experiments in Figure 1. Generally speaking, learning with data augmentation reduces the Hessian complexity. (See the Appendix for the experimental details.)

6 Discussion

In this paper we have introduced a new Hessian-based measure for the complexity of a deep network and its prediction. An attractive property of our measure compared to previously proposed measures such as the number of linear regions (VQ partitions) is that it captures the amount by which the network’s output changes not just locally but across the entire input space. Further, our measure explicitly exploits the geometrical structure of the training data. We have demonstrated a direct link between the heuristic of data augmentation and an implicit Hessian complexity penalty during learning. There are many potential applications for our new measure, including new ways to study generalization and optimization in deep networks and new more powerful regularization penalties.

References

Appendix A Proofs

A.1 Proof of Lemma 3

Let be as in (6). We have

(31)

Further, let

(32)

Let , . We have

(33)
(34)
(35)
(36)

Combining this with (31), the first inequality in (9) is proved. The second inequality is the result of applying the Cauchy-Schwarz inequality.

A.2 Proof of Theorem 3

Let be as in (6). If at the most negative point where crosses zero, the function changes from negative to positive, we call this root . Also, if at the most positive point where crosses zero, it changes from positive to negative, we call this root . Let all other zero crossings of be , and let and be the set of points where changes from positive to negative, and from negative to positive, respectively. Note that we have

(37)
(38)
(39)

and that

(40)

Therefore,

(41)
(42)
(43)
(44)

Therefore,

(45)

which completes the proof.

A.3 Proof of Theorem 3

The proof follows from Lemma 3 and Theorem 3. Note that under the theorem’s assumptions, all pre-activation functions in the -th layer have Lipschitz constant of at most . Further, by Lemma 3, their complexity is bounded by . Applying Theorem 3, the proof is complete.

A.4 Proof of Theorem 5

Let . Using (1), we have

(46)
(47)

Therefore, for small perturbations, using the first-order approximation of the prediction function around each training point, we obtain

(48)

Under the conditions of the theorem, we obtain

(49)

Summing up over the training points and the augmentation transformations yields the following bound on the first-order approximation of the augmented loss for small perturbations,

(50)
(51)

which completes the proof.

Appendix B Experimental Details

All experiments used the following parameters: batch size of 16, Adam optimizer with the learning rate scheduled at 0.005 (initial), 0.0015 (epoch 100), and 0.001 (epoch 150). The default training/test split was used for all datasets. The validation set consists of 15% of the training set, sampled randomly.
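
In PyTorch terms, the stated training configuration corresponds roughly to the sketch below; the tiny placeholder model and random data stand in for the real architectures and datasets, and the total number of epochs is not stated in the text.

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Placeholder model and data; the experiments use the architectures below.
    model = torch.nn.Linear(32, 10)
    train_set = TensorDataset(torch.randn(256, 32), torch.randint(0, 10, (256,)))

    train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
    criterion = torch.nn.CrossEntropyLoss()

    def lr_at_epoch(epoch):
        """Schedule from the text: 0.005 initially, 0.0015 from epoch 100,
        0.001 from epoch 150."""
        return 0.001 if epoch >= 150 else (0.0015 if epoch >= 100 else 0.005)

    for epoch in range(160):                         # total epoch count is a placeholder
        for group in optimizer.param_groups:
            group["lr"] = lr_at_epoch(epoch)
        for xb, yb in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            optimizer.step()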

B.1 CNN Architecture

    Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
    Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
    Conv2D(Number Filters=96, size=3x3, Leakiness=0.01)
    Pool2D(2x2)
    Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
    Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
    Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
    Pool2D(2x2)
    Conv2D(Number Filters=192, size=3x3, Leakiness=0.01)
    Conv2D(Number Filters=192, size=1x1, Leakiness=0.01)
    Conv2D(Number Filters=Number Classes, size=1x1, Leakiness=0.01)
    GlobalPool2D(pool_type='AVG')
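
For concreteness, one possible PyTorch rendering of this listing is given below; whether Pool2D is max or average pooling, the padding scheme, and the absence of batch normalization are not specified above and are assumptions of this sketch.

    import torch.nn as nn

    def conv_block(c_in, c_out, k):
        # Same-style padding is assumed; the listing only gives filter counts,
        # kernel sizes, and the leaky-ReLU slope.
        return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2),
                             nn.LeakyReLU(0.01))

    def make_cnn(in_channels=3, num_classes=10):
        return nn.Sequential(
            conv_block(in_channels, 96, 3), conv_block(96, 96, 3), conv_block(96, 96, 3),
            nn.MaxPool2d(2),
            conv_block(96, 192, 3), conv_block(192, 192, 3), conv_block(192, 192, 3),
            nn.MaxPool2d(2),
            conv_block(192, 192, 3), conv_block(192, 192, 1),
            conv_block(192, num_classes, 1),
            nn.AdaptiveAvgPool2d(1),                 # GlobalPool2D('AVG')
            nn.Flatten(),
        )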

B.2 ResNet and Large ResNet Architectures

The ResNets follow the original architecture zagoruyko2016wide with depth , width for the ResNet and depth , width for the Large ResNet.