A Scale Invariant Flatness Measure for Deep Network Minima

02/06/2019 · by Akshay Rangamani, et al.

It has been empirically observed that the flatness of minima obtained from training deep networks seems to correlate with better generalization. However, for deep networks with positively homogeneous activations, most measures of sharpness/flatness are not invariant to rescalings of the network parameters that correspond to the same function. This means that the measure of flatness/sharpness can be made arbitrarily small or large through rescaling, rendering the quantitative measures meaningless. In this paper we show that for deep networks with positively homogeneous activations, these rescalings constitute equivalence relations, and that these equivalence relations induce a quotient manifold structure in the parameter space. Using this manifold structure and an appropriate metric, we propose a Hessian-based measure for flatness that is invariant to rescaling. We use this new measure to confirm the proposition that Large-Batch SGD minima are indeed sharper than Small-Batch SGD minima.


1 Introduction

In the past few years, deep learning [16] has had empirical successes in several domains such as object detection and recognition [15, 22], machine translation [26, 12], and speech recognition [7, 23]. However, there is still a gap between theoretical bounds on the performance of deep networks and the performance of these networks in practice. Deep networks tend to be highly overparameterized, which means the hypothesis space is very large.

However, optimization techniques such as stochastic gradient descent (SGD) are able to find solutions that generalize well, even if the number of training samples we have is far smaller than the number of parameters of the network we are training. This suggests that the solutions we are able to retrieve have certain desirable properties that are related to generalization.

Several empirical studies [3, 13] observe that the generalization ability of a deep network model is related to the spectrum of the Hessian matrix of the training loss at the solution obtained during training. It is also noted that solutions with smaller Hessian spectral norm tend to generalize better. These are popularly known as Flat Minima, which have been studied since 1995 [8, 9].

The flat minima heuristic is also related to a more formal framework for generalization – the PAC-Bayesian analysis of the generalization behavior of deep networks. PAC-Bayes bounds [6] are concerned with analyzing the behavior of solutions drawn from a posterior distribution rather than, for instance, the particular solution obtained from empirical risk minimization. One posterior distribution for which the bounds are valid is the distribution of perturbations about the original solution obtained from empirical risk minimization. Neyshabur et al. relate the generalization of this distribution to the sharpness of the minima obtained [19]. More recently, Wang et al. [27] provide formal connections between perturbation bounds and the Hessian of the empirical loss function, and then propose a generalization metric that is related to the Hessian.

A number of quantitative definitions of flatness have been proposed both recently [3, 13] as well as in the early literature [9]. These authors formalize the notions of “flat” or “wide” minima by either measuring the size of the connected region that is within $\epsilon$ of the value of the loss function at the minimum, or by finding the difference between the maximum value of the loss function and the minimum value within an $\epsilon$-radius ball of the minimum. Note that the second notion of flatness is closely related to the spectral norm of the Hessian of the loss function at the minimum.

Definition 1.

If $B_2(\epsilon, \theta^*)$ is the Euclidean ball of radius $\epsilon$ centered at a local minimum $\theta^*$ of a loss function $L$, then the $\epsilon$-sharpness of the minimum is defined as:

$$\max_{\theta \in B_2(\epsilon, \theta^*)} \big( L(\theta) - L(\theta^*) \big).$$

By performing a second order Taylor expansion of $L$ around $\theta^*$, we can relate the $\epsilon$-sharpness to the spectral norm of the Hessian of $L$ at $\theta^*$ as follows:

$$\max_{\theta \in B_2(\epsilon, \theta^*)} \big( L(\theta) - L(\theta^*) \big) \approx \frac{\epsilon^2}{2}\, \big\|\nabla^2 L(\theta^*)\big\|_2.$$

However, [5] show that deep networks with positively homogeneous layer activations (like the common ReLU activation, $\sigma(x) = \max(0, x)$) can be rescaled to make their $\epsilon$-sharpness arbitrarily small or large: a simple transformation yields parameters that implement the same neural network function but have widely different sharpness measures [5]. To formalize this, we consider a 2-layer neural network with parameters $(\theta_1, \theta_2)$, where the network is given by $y = \theta_2\, \sigma(\theta_1 x)$. We can transform the parameters of the network by $\alpha > 0$ in the following manner: $T_\alpha(\theta_1, \theta_2) = (\alpha\, \theta_1, \alpha^{-1}\theta_2)$. We notice that for positively homogeneous activations, the networks parameterized by $(\theta_1, \theta_2)$ and $T_\alpha(\theta_1, \theta_2)$ implement the same function.
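As a quick numerical illustration of this rescaling, the following sketch (with arbitrary layer sizes; the function name `two_layer_relu` is ours, not the authors') checks that $(\theta_1, \theta_2)$ and $T_\alpha(\theta_1, \theta_2)$ implement the same function, while the layer norms, which enter Euclidean Hessian-based sharpness measures, change drastically:

```python
import numpy as np

rng = np.random.default_rng(0)
theta1 = rng.standard_normal((16, 8))    # first-layer weights
theta2 = rng.standard_normal((4, 16))    # second-layer weights
x = rng.standard_normal((8, 32))         # a batch of inputs


def two_layer_relu(t1, t2, x):
    # y = theta2 * sigma(theta1 x), with sigma the (positively homogeneous) ReLU
    return t2 @ np.maximum(t1 @ x, 0.0)


alpha = 100.0
y = two_layer_relu(theta1, theta2, x)
y_alpha = two_layer_relu(alpha * theta1, theta2 / alpha, x)
print(np.max(np.abs(y - y_alpha)))                             # ~1e-13: same function
print(np.linalg.norm(theta1), np.linalg.norm(alpha * theta1))  # layer norms differ by alpha
```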

Theorem 1.1.

(Theorem 4 in [5]) For a one hidden layer rectified neural network of the form $y = \theta_2\, \sigma(\theta_1 x)$, where $\theta = (\theta_1, \theta_2)$ is a minimum of $L$ such that $\theta_1 \neq 0$ and $\theta_2 \neq 0$, for any real number $M > 0$, we can find a number $\alpha > 0$ such that $\big\|\nabla^2 L\big(T_\alpha(\theta)\big)\big\|_2 \geq M$.

This tells us that Hessian-based measures like $\epsilon$-sharpness are not very meaningful, since we can transform the parameters of the network to make the measure as large or as small as we want. This is also the case for other generalization metrics related to the Hessian, such as the one proposed in [27].

In this paper, we propose an alternative measure for quantifying the sharpness/flatness of minima of empirical loss functions. This measure is based on a quotient manifold of the network parameters and is invariant to rescalings of the form described above. We then use our sharpness measure to test whether the minima obtained from large-batch training are sharper than those obtained from small-batch training.

The rest of the paper is organized as follows. In section 2, we formalize the rescaling that can change the sharpness of minima without changing the function, and show that the relation described by rescaling is indeed an equivalence relation, which in turn induces a manifold structure on the space of deep network parameters. In section 3, we describe an algorithm analogous to the power method that can be used to estimate the spectral norm of the Riemannian Hessian, which in turn can be employed as a measure of sharpness of deep network minima. In section 4, we present several experimental results of applying our measure to small-batch vs. large-batch training of various deep networks. Our results confirm that the geometric landscape of the loss function at small-batch minima is indeed flatter than that at large-batch minima.

2 Characterizing a Quotient Manifold of Deep Network Parameters

Let us define a neural network as a function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$ which takes an $n$-dimensional input and outputs an $m$-dimensional vector, which could be a vector of class labels or a continuous measurement, depending on the task. We consider neural networks which consist of a series of nonlinear transformations, represented as

$$f_\theta(x) = \sigma_K\Big(W_K\, \sigma_{K-1}\big(W_{K-1} \cdots \sigma_1(W_1 x)\big)\Big).$$

Here $W_k$ is a linear transformation, and $\sigma_k$ is a positively homogeneous nonlinear function, usually applied pointwise to a vector. Each combination of a linear and nonlinear transformation is referred to as a “layer”, and the linear transformation is referred to as the parameters or weights of the layer. Even if $W_k$ has a matrix/convolutional structure, we will be concerned only with the vectorized version, $w_k = \mathrm{vec}(W_k)$, which we will use interchangeably with $W_k$. First, we consider networks without bias vectors in each layer. We will extend our manifold construction to networks with bias at the end of this section.

Usually we have access to samples $\{(x_i, y_i)\}_{i=1}^N$ from a distribution $\mathcal{D}$, and the way we train neural networks is by minimizing a certain (convex) loss function $\ell$ that captures the distance between the network outputs and the target labels:

$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big).$$

Due to the positive homogeneity we can scale the weights of the neural network appropriately to represent the same function with a different set of weights. This means there is a whole set of local minima that correspond to the same function, but are located at different points in the parameter space.

Proposition 2.1.

Let $\theta = (w_1, \ldots, w_K)$ be the parameters of a neural network with $K$ layers, and $\lambda = (\lambda_1, \ldots, \lambda_K)$ be a set of multipliers. Here $\lambda_k > 0$ for all $k$ and $\prod_{k=1}^K \lambda_k = 1$. We can transform the layer weights by $\lambda$ in the following manner: $T_\lambda(\theta) = (\lambda_1 w_1, \ldots, \lambda_K w_K)$. We introduce a relation $\sim$ on the parameter space: $\theta_a \sim \theta_b$ if $\exists\, \lambda$ such that $\theta_b = T_\lambda(\theta_a)$ and $\prod_{k=1}^K \lambda_k = 1$. Then, the relation $\sim$ is an equivalence relation.

This equivalence relation is of interest to us because if $\theta_a \sim \theta_b$, then $f_{\theta_a}(x) = f_{\theta_b}(x)$ for all inputs $x$. Denote $\mathcal{E}_k$ as the Euclidean vector space containing $w_k$ and $\mathcal{M} = \mathcal{E}_1 \times \cdots \times \mathcal{E}_K$ the product manifold that covers the entire parameter space. We can use the equivalence relation defined in Proposition 2.1 to obtain a quotient manifold $\mathcal{M}/\!\sim$ induced by the equivalence relation $\sim$.

Proposition 2.2.

The set $\mathcal{M}/\!\sim$ obtained by mapping all points within an equivalence class to a single point in the set has a quotient manifold structure, making $\mathcal{M}/\!\sim$ a differentiable quotient manifold.

Due to space limitations, we defer the proofs of Propositions 2.1 and 2.2 to the Appendix.

Let $\pi : \mathcal{M} \to \mathcal{M}/\!\sim$ denote the mapping between the Euclidean parameter space and the quotient manifold. Given a point $\theta \in \mathcal{M}$, $\pi^{-1}(\pi(\theta))$ is the equivalence class of $\theta$ and is also an embedded manifold of $\mathcal{M}$. We have

$$\dim\big(\mathcal{M}/\!\sim\big) = \dim(\mathcal{M}) - \dim\big(\pi^{-1}(\pi(\theta))\big).$$

In order to impart a Riemannian structure to our quotient manifold, we need to define a metric on that is invariant within an equivalence class.

Proposition 2.3.

Let $\xi_\theta$ and $\eta_\theta$ be two tangent vectors at a point $\theta \in \mathcal{M}$. The Riemannian metric defined by:

$$g_\theta(\xi_\theta, \eta_\theta) = \sum_{k=1}^{K} \frac{\langle \xi_{w_k}, \eta_{w_k} \rangle}{\langle w_k, w_k \rangle}$$

is invariant within an equivalence class, and hence induces a metric for $\mathcal{M}/\!\sim$, $g_{\pi(\theta)}(\xi_{\pi(\theta)}, \eta_{\pi(\theta)}) = g_\theta(\xi_\theta, \eta_\theta)$. Here $\langle \cdot, \cdot \rangle$ is the Euclidean inner product and $\xi_{w_k}, \eta_{w_k}$ are the components of the tangent vectors corresponding to layer $k$ at a point $\theta$.

Proof.

Let $\theta_a, \theta_b$ belong to the equivalence class $\pi^{-1}(\pi(\theta))$, and $\xi_{\theta_a}, \eta_{\theta_a}$ be tangent vectors in $T_{\theta_a}\mathcal{M}$. Since $\theta_a$ and $\theta_b$ are in the same equivalence class, $\exists\, \lambda$ such that $\theta_b = T_\lambda(\theta_a)$ with $\prod_k \lambda_k = 1$. Using arguments similar to those presented in Example 3.5.4 in [2], the corresponding tangent vectors in $T_{\theta_b}\mathcal{M}$ are related by the same scaling factor between $\theta_a$ and $\theta_b$: $\xi_{w_{b,k}} = \lambda_k\, \xi_{w_{a,k}}$ and $\eta_{w_{b,k}} = \lambda_k\, \eta_{w_{a,k}}$.

Thus,

$$g_{\theta_b}(\xi_{\theta_b}, \eta_{\theta_b}) = \sum_{k=1}^{K} \frac{\langle \lambda_k \xi_{w_{a,k}}, \lambda_k \eta_{w_{a,k}} \rangle}{\langle \lambda_k w_{a,k}, \lambda_k w_{a,k} \rangle} = \sum_{k=1}^{K} \frac{\langle \xi_{w_{a,k}}, \eta_{w_{a,k}} \rangle}{\langle w_{a,k}, w_{a,k} \rangle} = g_{\theta_a}(\xi_{\theta_a}, \eta_{\theta_a}),$$

which completes the proof. ∎

One invariant property of the equivalence class is the product of the norms of all the layers. That is, if $\theta_a \sim \theta_b$, then $\prod_{k=1}^K \|w_{a,k}\| = \prod_{k=1}^K \|w_{b,k}\|$. For calculational convenience, we can replace the product by a sum by applying the log operator, which gives $\sum_{k=1}^K \log \|w_{a,k}\| = \sum_{k=1}^K \log \|w_{b,k}\|$.
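Both facts can be checked numerically. The sketch below (arbitrary layer shapes; the helper name `metric` is ours) draws a random rescaling with $\prod_k \lambda_k = 1$ and verifies that the metric of Proposition 2.3 and the product of layer norms are unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
shapes = [(20, 10), (20, 20), (5, 20)]
weights = [rng.standard_normal(s) for s in shapes]
xi = [rng.standard_normal(s) for s in shapes]
eta = [rng.standard_normal(s) for s in shapes]


def metric(ws, xs, es):
    # g_theta(xi, eta) = sum_k <xi_k, eta_k> / <w_k, w_k>
    return sum(np.vdot(x, e) / np.vdot(w, w) for w, x, e in zip(ws, xs, es))


lam = rng.uniform(0.5, 2.0, size=len(weights))
lam[-1] = 1.0 / np.prod(lam[:-1])                 # enforce prod_k lambda_k = 1

w_s = [l * w for l, w in zip(lam, weights)]
xi_s = [l * v for l, v in zip(lam, xi)]           # tangent vectors scale the same way
eta_s = [l * v for l, v in zip(lam, eta)]

print(metric(weights, xi, eta), metric(w_s, xi_s, eta_s))          # equal
print(np.prod([np.linalg.norm(w) for w in weights]),
      np.prod([np.linalg.norm(w) for w in w_s]))                   # equal
```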

Lemma 2.4.

The tangent space of $\pi^{-1}(\pi(\theta))$ at $\theta$ is $\big\{(\delta_1 w_1, \ldots, \delta_K w_K)\big\}$ with $\sum_{k=1}^{K} \delta_k = 0$.

Proof.

Consider the curves $\gamma(t) = \big(\lambda_1(t) w_1, \ldots, \lambda_K(t) w_K\big)$ with $\prod_{k=1}^K \lambda_k(t) = 1$ and $\lambda_k(0) = 1$. We have

$$\sum_{k=1}^{K} \log \lambda_k(t) = 0.$$

Taking the derivative on both sides with respect to $t$ at $t = 0$ gives

$$\sum_{k=1}^{K} \dot{\lambda}_k(0) = 0.$$

It is clear that $\dot{\gamma}(0) = (\delta_1 w_1, \ldots, \delta_K w_K)$ with $\delta_k = \dot{\lambda}_k(0)$ and $\sum_k \delta_k = 0$ satisfies the above equation. Therefore the tangent space of $\pi^{-1}(\pi(\theta))$ at $\theta$ contains all tangent vectors $(\delta_1 w_1, \ldots, \delta_K w_K)$ with $\sum_k \delta_k = 0$. ∎

The tangent space to the embedded submanifold $\pi^{-1}(\pi(\theta))$ of $\mathcal{M}$ is usually referred to as the vertical tangent space ($\mathcal{V}_\theta$) of the quotient manifold $\mathcal{M}/\!\sim$. The orthogonal complement of the vertical space in the tangent space $T_\theta\mathcal{M}$ is referred to as the horizontal space $\mathcal{H}_\theta$. We note that all smooth curves $\gamma(t)$ such that $\gamma(0) = \theta$ and $\dot{\gamma}(t) \in \mathcal{V}_{\gamma(t)}$ lie within the equivalence class $\pi^{-1}(\pi(\theta))$.

2.1 Deep Networks with Biases

A deep neural network with biases is a function $f_\theta : \mathbb{R}^n \to \mathbb{R}^m$ which takes an $n$-dimensional input and outputs an $m$-dimensional vector through a series of nonlinear transformations, and can be represented as

$$f_\theta(x) = \sigma_K\Big(W_K\, \sigma_{K-1}\big(W_{K-1} \cdots \sigma_1(W_1 x + b_1) \cdots + b_{K-1}\big) + b_K\Big).$$

Here, $b_k$ are the bias parameters for each layer. Once again, due to the positive homogeneity of the nonlinear functions $\sigma_k$, we can rescale the weights and biases of the network to obtain a different set of weights and biases that implement the same function.

Suppose we have $\lambda = (\lambda_1, \ldots, \lambda_K)$, $\lambda_k > 0$, such that $\prod_{k=1}^K \lambda_k = 1$. Consider the following transformation:

$$T_\lambda(\theta) = \Big(\lambda_1 w_1,\; \Big(\textstyle\prod_{j=1}^{1}\lambda_j\Big) b_1,\; \ldots,\; \lambda_K w_K,\; \Big(\textstyle\prod_{j=1}^{K}\lambda_j\Big) b_K\Big),$$

i.e., the weights of layer $k$ are scaled by $\lambda_k$ and the biases of layer $k$ are scaled by the cumulative product $\prod_{j=1}^{k} \lambda_j$. Now, if $\theta_b = T_\lambda(\theta_a)$, then $f_{\theta_a}(x) = f_{\theta_b}(x)$ for all $x$. Let us denote $\mathcal{E}_k$ as the Euclidean space for the weights and biases of each layer. The product space $\mathcal{M} = \mathcal{E}_1 \times \cdots \times \mathcal{E}_K$ is the entire space of parameters for the neural networks with biases. Using arguments similar to Propositions 2.1 and 2.2, we can see that this new transformation also introduces an equivalence relation $\sim$ on $\mathcal{M}$ and that $\mathcal{M}/\!\sim$ admits a quotient manifold structure. We modify Proposition 2.3 slightly to get a new metric for the tangent space of $\mathcal{M}/\!\sim$.
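A quick numerical check of this transformation (assuming the cumulative-product scaling of the biases written above; the layer sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(2)
Ws = [rng.standard_normal(s) for s in [(16, 8), (16, 16), (4, 16)]]
bs = [rng.standard_normal(n) for n in [16, 16, 4]]
x = rng.standard_normal(8)


def forward(Ws, bs, x):
    h = x
    for W, b in zip(Ws, bs):
        h = np.maximum(W @ h + b, 0.0)    # positively homogeneous activation
    return h


lam = np.array([2.0, 5.0, 0.1])           # prod_k lambda_k = 1
cum = np.cumprod(lam)                     # cumulative products prod_{j<=k} lambda_j
Ws2 = [l * W for l, W in zip(lam, Ws)]
bs2 = [c * b for c, b in zip(cum, bs)]

print(np.max(np.abs(forward(Ws, bs, x) - forward(Ws2, bs2, x))))   # ~0: same function
```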

Proposition 2.5.

Since $\mathcal{M}$ is a Euclidean space, its tangent space at $\theta$ is also $\mathcal{M}$. Let $\xi_\theta$ and $\eta_\theta$ be two tangent vectors at a point $\theta \in \mathcal{M}$. The Riemannian metric defined by:

$$g_\theta(\xi_\theta, \eta_\theta) = \sum_{k=1}^{K} \left( \frac{\langle \xi_{w_k}, \eta_{w_k} \rangle}{\langle w_k, w_k \rangle} + \frac{\langle \xi_{b_k}, \eta_{b_k} \rangle}{\langle b_k, b_k \rangle} \right)$$

is invariant within an equivalence class and hence induces a metric for $\mathcal{M}/\!\sim$, $g_{\pi(\theta)}(\xi_{\pi(\theta)}, \eta_{\pi(\theta)}) = g_\theta(\xi_\theta, \eta_\theta)$. Here, $\langle \cdot, \cdot \rangle$ is the usual Euclidean inner product and $\xi_{w_k}, \eta_{w_k}, \xi_{b_k}, \eta_{b_k}$ are the weight and bias components of the tangent vectors at a point $\theta$.

Let us now introduce a new invariant property of the equivalence class for network parameters with biases. First, for a point $\theta$ in the space of parameters, we know that under the transformation $T_\lambda$ the weights scale as $w_k \mapsto \lambda_k w_k$ and the biases scale as $b_k \mapsto \big(\prod_{j=1}^{k} \lambda_j\big) b_k$. For each layer, let us define $\gamma_k$ as follows

$$\gamma_k(\theta) = \log \|b_k\| - \sum_{j=1}^{k} \log \|w_j\|.$$

We then have that $\gamma_k(\theta_a) = \gamma_k(\theta_b)$ for each $k$ if $\theta_a \sim \theta_b$, which is the invariant property of the equivalence class. We can also get a description of the tangent space of $\pi^{-1}(\pi(\theta))$ from the following lemma.

Lemma 2.6.

The tangent space of $\pi^{-1}(\pi(\theta))$ at $\theta$ is $\big\{(\delta_1 w_1, \epsilon_1 b_1, \ldots, \delta_K w_K, \epsilon_K b_K)\big\}$ with $\sum_{k=1}^{K} \delta_k = 0$, $\epsilon_k = \sum_{j=1}^{k} \delta_j$.

Proof.

Consider the curves $\gamma(t) = \big(\lambda_1(t) w_1, c_1(t) b_1, \ldots, \lambda_K(t) w_K, c_K(t) b_K\big)$ with $c_k(t) = \prod_{j=1}^{k} \lambda_j(t)$, $\prod_{k=1}^K \lambda_k(t) = 1$, and $\lambda_k(0) = 1$. We have

$$\sum_{k=1}^{K} \log \lambda_k(t) = 0, \qquad \log c_k(t) = \sum_{j=1}^{k} \log \lambda_j(t).$$

Taking the derivative on both sides with respect to $t$ at $t = 0$ gives

$$\sum_{k=1}^{K} \dot{\lambda}_k(0) = 0, \qquad \dot{c}_k(0) = \sum_{j=1}^{k} \dot{\lambda}_j(0).$$

It is clear that $\dot{\gamma}(0) = (\delta_1 w_1, \epsilon_1 b_1, \ldots, \delta_K w_K, \epsilon_K b_K)$ with $\delta_k = \dot{\lambda}_k(0)$ and $\epsilon_k = \dot{c}_k(0)$ satisfies the above equations. Therefore the tangent space at $\theta$ contains all tangent vectors of this form with $\sum_k \delta_k = 0$ and $\epsilon_k = \sum_{j=1}^{k} \delta_j$. ∎

3 Measuring the Spectral Norm of the Riemannian Hessian

In the previous section, we introduced a quotient manifold structure that captures the rescaling that is natural to the space of parameters of neural networks with positively homogeneous activations. Now, similar to how the spectral norm of the Euclidean Hessian is used as a measure of sharpness, we can use the Taylor expansion of real-valued functions on a manifold to give us an analogous measure of sharpness using the spectral norm of the Riemannian Hessian.

In this section, we will use normal symbols to denote points, functions, and gradients on the quotient manifold $\mathcal{M}/\!\sim$, and overlines to denote their lifted representations in the total manifold $\mathcal{M}$ (which is a vector space). If $\xi_x$ is a tangent vector in $T_x(\mathcal{M}/\!\sim)$, then $\bar{\xi}_{\bar{x}}$ denotes the representation in $\mathcal{M}$ of the horizontal projection of $\xi_x$. The definition of the Riemannian Hessian as per [2] is as follows.

Definition 2.

For a real-valued function $f$ on a Riemannian manifold $\mathcal{M}$, the Riemannian Hessian $\mathrm{Hess}\, f(x)$ is the linear mapping of $T_x\mathcal{M}$ onto itself, defined by

$$\mathrm{Hess}\, f(x)[\xi_x] = \nabla_{\xi_x}\, \mathrm{grad}\, f$$

for all $\xi_x \in T_x\mathcal{M}$, where $\nabla$ is a Riemannian connection defined on $\mathcal{M}$.

To see how the Riemannian Hessian is related to the flatness/sharpness of the function around a minimum $x^*$, we consider a retraction $R_x : T_x\mathcal{M} \to \mathcal{M}$, which maps points in the tangent space to points on the manifold. For example, in a Euclidean space, $R_x(\xi) = x + \xi$ is a retraction. The flatness/sharpness of a function around a minimum is defined (similar to Definition 1) using the value of the function in a “neighborhood” of the minimum. To formalize what we mean by an $\epsilon$-neighborhood of $x^*$, it is the set of points that can be reached through a retraction using tangent vectors of norm at most $\epsilon$:

$$\mathcal{B}_\epsilon(x^*) = \big\{ R_{x^*}(\xi) : \xi \in T_{x^*}\mathcal{M},\; \|\xi\|_g \leq \epsilon \big\},$$

where $\|\cdot\|_g$ is the norm induced by the Riemannian metric $g$. This gives us the following flatness/sharpness measure:

$$\max_{\|\xi\|_g \leq \epsilon} \big( f(R_{x^*}(\xi)) - f(x^*) \big).$$

Using the fact that $T_{x^*}\mathcal{M}$ is a vector space, and that $f \circ R_{x^*}$ is a function on a vector space that admits a Taylor expansion, we get the following approximation for $f(R_{x^*}(\xi))$ when $\|\xi\|_g \leq \epsilon$ and $\epsilon$ is small:

$$f(R_{x^*}(\xi)) \approx f(x^*) + g_{x^*}\big(\mathrm{grad}\, f(x^*), \xi\big) + \tfrac{1}{2}\, g_{x^*}\big(\mathrm{Hess}\, f(x^*)[\xi], \xi\big).$$

Using the approximation, recognizing that at a minimum $\mathrm{grad}\, f(x^*) = 0$, and using a Cauchy-Schwarz argument, we can bound the flatness/sharpness measure by the spectral norm of the Riemannian Hessian. We define it similarly to the spectral norm of a linear map in Euclidean space.
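Spelling out the bound referred to above (a short derivation, using only the second order approximation, $\mathrm{grad}\, f(x^*) = 0$ at the minimum, and Cauchy-Schwarz):

$$\max_{\|\xi\|_g \leq \epsilon} \big( f(R_{x^*}(\xi)) - f(x^*) \big) \;\approx\; \max_{\|\xi\|_g \leq \epsilon} \tfrac{1}{2}\, g_{x^*}\big(\mathrm{Hess}\, f(x^*)[\xi], \xi\big) \;\leq\; \tfrac{1}{2}\, \big\|\mathrm{Hess}\, f(x^*)\big\|_2\, \epsilon^2.$$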

Definition 3.

The spectral norm of the Riemannian Hessian of a function $f$ at $x$ is defined as

$$\big\|\mathrm{Hess}\, f(x)\big\|_2 := \max_{\xi \in T_x\mathcal{M},\, \|\xi\|_g = 1} \big| g_x\big(\mathrm{Hess}\, f(x)[\xi], \xi\big) \big|.$$

With the definition of the spectral norm of the Riemannian Hessian, we now would like to be able to compute it for any function defined on a manifold. To achieve this, we present a Riemannian Power Method in Algorithm 1.

1: procedure RiemannianPM($\mathrm{Hess}\, f(x)$, $T_x\mathcal{M}$)
2:     Initialize $v_0$ randomly in $T_x\mathcal{M}$
3:     while not converged do

   (We use the relative change in the eigenvector as a stopping criterion)

4:         $v_{t+1} \leftarrow \mathrm{Hess}\, f(x)[v_t]$
5:         $v_{t+1} \leftarrow v_{t+1} / \|v_{t+1}\|_g$
6:         $t \leftarrow t + 1$
7:     return $v_t$
Algorithm 1 Riemannian Power Method
Remark 1.

Using Proposition 5.5.2 from [2], we have that the Riemannian Hessian on the quotient manifold can be computed through the horizontal lift of the Hessian of the lifted function $\bar{f}$ on the total manifold $\mathcal{M}$. Since we are only interested in computing the tangent vector in $T_x(\mathcal{M}/\!\sim)$ that corresponds to the maximum eigenvalue of the linear map $\mathrm{Hess}\, f(x)$ at the minimum, let us set $\xi$ to be a constant vector field, equal to $\xi_x$ at all points on the manifold.

We use a connection $\nabla$ similar to the one defined in Theorem 3.4 of [1]. Now, from the definition of the Riemannian gradient of a function (equation 3.31 of [2]), we have that:

$$g_x\big(\mathrm{grad}\, f(x), \xi_x\big) = \mathrm{D} f(x)[\xi_x].$$

Considering $h(x) = g_x\big(\mathrm{grad}\, f(x), \xi_x\big)$ as another function on the manifold, we have that the Hessian vector product can be computed as:

$$\mathrm{Hess}\, f(x)[\xi_x] = \nabla_{\xi_x}\, \mathrm{grad}\, f = \mathrm{grad}\, h(x).$$

Remark 2.

While we have specified the definition of the spectral norm of the Riemannian Hessian and the algorithm to compute it in general for any Riemannian manifold, we recall that we are dealing with a quotient manifold of neural network parameters. In order to implement our algorithms on a computer, we use the lifted representations of points and tangent vectors in the total manifold $\mathcal{M}$. The lifted representation of a point $x \in \mathcal{M}/\!\sim$ is the parameter vector $\bar{x} = \theta \in \mathcal{M}$. The lifted representation of a tangent vector $\xi_x$ is the projection of its representation in $\mathcal{M}$ onto the horizontal space $\mathcal{H}_{\bar{x}}$, i.e., $\bar{\xi}_{\bar{x}} = P^{h}_{\bar{x}}(\xi)$.

However, since the neural network loss functions that we would like to estimate the Hessian spectral norm for are constant within an equivalence class, their gradients are always zero along tangent directions within the vertical tangent space, which means the gradient lies in the horizontal space, and the projection of the lifted representation is unnecessary in practice. This is also true for the Hessian-vector product, which is computed as the gradient of the inner product between the gradient and the tangent vector.

For the sake of completeness, the Riemannian gradient is computed as

$$\mathrm{grad}\, f(\theta) = G_\theta^{-1}\, \nabla f(\theta).$$

In the above equation, $\nabla f(\theta)$ is the Euclidean gradient of the function $f$, and $G_\theta^{-1}$ is the inverse of the matrix representation of the metric at $\theta$. The Euclidean gradient is easily computed using backpropagation. With the metric of Proposition 2.3, the inverse metric is block diagonal and is given by

$$G_\theta^{-1} = \mathrm{diag}\big( \|w_1\|^2 I, \ldots, \|w_K\|^2 I \big).$$
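To make the computation concrete, here is a minimal PyTorch sketch of Algorithm 1 under the metric of Proposition 2.3. The helper names (`metric`, `riemannian_hvp`, `riemannian_power_method`) and the exact placement of the inverse metric $G_\theta^{-1}$ are our interpretation of Remarks 1 and 2, not the authors' released implementation.

```python
import torch


def metric(params, xi, eta):
    # g_theta(xi, eta) = sum_k <xi_k, eta_k> / <w_k, w_k>; theta is fixed here,
    # so the layer norms are treated as constants (detached).
    return sum((x * e).sum() / (w * w).sum().detach()
               for w, x, e in zip(params, xi, eta))


def riemannian_hvp(loss, params, v):
    # Euclidean Hessian-vector product via the gradient of <grad f, v>,
    # then lifted with the inverse metric G^{-1} = diag(||w_k||^2 I).
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gv = sum((g * x).sum() for g, x in zip(grads, v))
    hvp = torch.autograd.grad(gv, params, retain_graph=True)
    return [h * (w * w).sum().detach() for h, w in zip(hvp, params)]


def riemannian_power_method(loss, params, n_iter=200, tol=1e-6):
    # Estimates the top eigenvalue of the Riemannian Hessian at `params`
    # (a list of tensors with requires_grad=True from which `loss` was built).
    v = [torch.randn_like(w) for w in params]
    nrm = metric(params, v, v).sqrt()
    v = [x / nrm for x in v]
    eig = None
    for _ in range(n_iter):
        hv = riemannian_hvp(loss, params, v)
        eig = metric(params, hv, v).item()          # Rayleigh quotient estimate
        nrm = metric(params, hv, hv).sqrt()
        v_new = [h / nrm for h in hv]
        diff = [a - b for a, b in zip(v_new, v)]
        rel_change = metric(params, diff, diff).sqrt()
        v = v_new
        if rel_change < tol:                        # stopping criterion from Algorithm 1
            break
    return eig, v
```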

3.1 Simulations

To validate Algorithm 1, we consider two deep network architectures described in Table 1. For each architecture, we generate a synthetic dataset whose samples belong to one of 10 different classes, with randomly generated class labels. For each network, we consider the softmax cross-entropy as the loss function.

Network Architecture
[FC(), FC(), FC()]
[conv(), conv(), FC(), FC(), FC()]
Table 1: Network Architectures for Simulations

We compute the spectral norms of the Hessians of their losses at different points within the equivalence class by considering $T_\lambda(\theta)$ for different settings of $\lambda$. Let $\rho_1$ be the spectral norm computed at $\theta$, and $\rho_2$ be the spectral norm computed at $T_\lambda(\theta)$. We define the relative difference between the two measurements as follows:

$$\text{Relative Difference} = \frac{|\rho_1 - \rho_2|}{\rho_1}.$$

Results for the first architecture in Table 1 are reported in Table 2, whereas results for the second are reported in Table 3.

Relative Difference
Table 2: Relative Difference in Spectral Norms for the first architecture in Table 1 under different transformations
Relative Difference
Table 3: Relative Difference in Spectral Norms for the second architecture in Table 1 under different transformations

In Figure 1, we can observe how our power method based algorithm converges for one of the networks in Table 1. From the tables, we notice that the spectral norm we compute using the eigenvectors from Algorithm 1 is invariant to transformations within the equivalence class; that is, the values of the Relative Difference are small. We can substitute the spectral norm that we compute on the manifold into Definition 1 in order to come up with a measure of flatness that is invariant to rescaling.

Figure 1: Convergence of Algorithm 1 on a synthetic dataset for one of the networks in Table 1

4 Large Batch vs Small Batch Training

One context in which the flatness of deep network minima has been suggested to correlate with better generalization is large-batch vs. small-batch training of neural networks. In an empirical study, [13] observe that small-batch gradient methods with 32-512 samples per batch tend to converge to flatter minima than large-batch methods, which use batch sizes on the order of thousands of samples. However, since [5] have shown that measures of flatness can be gamed by rescaling the network appropriately, we cannot trust the current quantitative measures to compare the sharpness of small-batch vs. large-batch minima. Instead, we use the spectral norm of the Riemannian Hessian as a measure of sharpness and compare small-batch gradient based methods to large-batch methods.

4.1 Datasets and Network Architectures

Similar to [13], we consider two datasets – MNIST [17] and CIFAR-10 [14] – with two different network architectures for each dataset.

For MNIST, we used a fully connected deep network (MNIST-FC) with several hidden layers of equal width. In addition to this, we used a convolutional network based on the LeNet architecture [17]. This network has two convolutional-pooling layers, followed by two fully connected layers before the final output layer with 10 neurons.

For CIFAR-10, we considered a shallow convolutional network with an AlexNet-type architecture [15] and a deep convolutional network with a VGG16-type architecture [25].

In order to test our measure, we did not use layers which are not positively homogeneous like Local Response Normalization. Even though Batch Normalization layers are compatible with our manifold structure (if we consider the trained BN layer parameters as part of the network parameters), we did not use them in order to keep the experiments simple.

4.2 Results

Our goal in this set of experiments is not to achieve state of the art performance on these datasets. Instead, we are interested in characterizing and contrasting the solutions obtained using small-batch vs. large-batch gradient based methods. For each network architecture and dataset, we trained the network to 100% training accuracy using SGD or Adam, resulting in very small training cross-entropy loss values. (All code used to run the experiments can be found at https://github.com/akshay-r/scale-invariant-flatness.) For MNIST we used batch sizes of 256 and 5000 samples for the small-batch and large-batch training respectively, while for CIFAR-10, we used batch sizes of 256 and 2000. The MNIST networks were trained using Adam while the CIFAR-10 networks were trained using SGD. Different learning rates were used for small-batch and large-batch training. In the case of both MNIST and CIFAR-10, we computed our flatness measure on the empirical loss over the training set at the minima obtained through the training process. Due to memory constraints in the case of CIFAR-10, we limit ourselves to a subset of the training examples instead of the entire training set for computing the flatness measure.

Five different repetitions of these experiments were conducted, from different random initializations. We first generate parametric line plots along different random directions for AlexNet and VGG16. These plots are shown in Figure 4. These plots are layer normalized [18], which means that the random directions chosen are scaled according to the norms of the layers of the trained networks. More precisely, if the minimum obtained from training AlexNet/VGG is $\theta^*$, we generate a random direction $\hat{d}$, and plot the loss along the curve $\theta^* + t\, d$ as $t$ varies. Here $d$ is given by:

$$d_k = \frac{\|\theta^*_k\|}{\|\hat{d}_k\|}\, \hat{d}_k,$$

where the index $k$ runs over the layers of the network.

From the plots, we see that the large-batch curves lie above the small-batch curves, indicating that the large-batch minima are sharper than their small-batch counterparts.
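A sketch of how such a layer-normalized direction can be generated (NumPy, names ours; `theta_star` stands for the list of trained layer parameters and `loss_fn` for the training loss, both hypothetical placeholders here):

```python
import numpy as np


def layer_normalized_direction(theta_star, rng):
    # Random Gaussian direction with each layer rescaled to the norm of the
    # corresponding trained layer (cf. [18]).
    d = [rng.standard_normal(w.shape) for w in theta_star]
    return [dk * (np.linalg.norm(wk) / np.linalg.norm(dk))
            for dk, wk in zip(d, theta_star)]


# Loss along the curve theta(t) = theta_star + t * d:
# rng = np.random.default_rng(0)
# d = layer_normalized_direction(theta_star, rng)
# ts = np.linspace(-1.0, 1.0, 51)
# losses = [loss_fn([w + t * dk for w, dk in zip(theta_star, d)]) for t in ts]
```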

Figure 4: Parametric line plots for convolutional networks trained on CIFAR-10. (a) AlexNet, (b) VGG16.

Now, in order to quantify the sharpness and see how it correlates with generalization, we report the test accuracy and spectral norm of the Hessian at minima for each of the four networks trained on their respective datasets in Table 4. We observe that the estimated spectral norms for the large-batch minima are orders of magnitude larger than those of the small-batch minima for every network and dataset. This also correlates with test accuracy, with the small-batch minima having better generalization abilities.

We see that the difference in the spectral norm is 3-4 orders of magnitude. However, the same effect is not observed in the parametric line plots in Figure 4. This can be attributed to the fact that the spectral norm is only indicative of the sharpness along one particular direction or subspace in a very high dimensional parameter space. The parametric line plots are plotted along random directions, and thus we should not expect that the difference in sharpness will be of the same order of magnitude along all or even most random directions.

Batch Size Test Accuracy Spectral Norm
MNIST / Fully-Connected
256
5000
MNIST / LeNet
256
5000
CIFAR-10 / AlexNet
256
2000
CIFAR-10 / VGG16
256
2000
Table 4: Test Accuracy and Spectral Norm of Hessian at Minima for different trained networks. MNIST-FC and LeNet are trained and tested on MNIST, while AlexNet and VGG16 are trained and tested on CIFAR-10.

5 Related Work

In this paper, we have proposed a Hessian based measure for the sharpness of minima, which follows pioneering works in [9] and [13] in attempting to measure the sharpness/flatness of deep network minima. As we noted in section 1, flatter minima are believed to be robust to perturbations of the neural network parameters. [21] connect generalization to the sensitivity of the network to perturbations of the inputs. In a recent work, [27] obtain a measure of generalization that is also related to the Hessian at the minima, but still have not resolved the rescaling issue that results in arbitrarily large or small Hessian spectra for the same neural network function.

Riemannian approaches to training neural networks have mostly focused on batch normalization [4, 10]. Since batch norm layers are invariant to scalings of the linear layers that precede them, a common approach is to restrict the weights of the linear layers to the manifold of weight matrices with unit norm, or an oblique manifold [11], or the Stiefel manifold. To the best of our knowledge, we are the first to propose a quotient manifold of neural network parameters and successfully employ it to resolve the question of how to accurately measure the Hessian of the loss function at minima.

6 Conclusion and Future Work

In this paper, we observe that natural rescalings of neural networks with positively homogeneous activations induce an equivalence relation in the parameter space which in turn leads to a quotient manifold structure in the parameter space. We provide theoretical justification for these claims and then adopt the manifold structure to propose a Hessian based sharpness measure for deep network minima. We provide an algorithm to compute this measure and apply this technique to compare minima obtained using large-batch and small-batch gradient based methods.

We believe this quotient manifold view of the parameter space of neural networks can have implications for training deep networks as well. While balanced training procedures like weight normalization [24] and Path-SGD [20] have been explored in the past, we would like to study how an optimization procedure on this manifold will compare to those approaches.

As demonstrated in [27], properties of the Hessian at the minima are also related to generalization of deep networks. Our framework provides a principled path to estimate properties of the Hessian such that they are invariant to rescaling of deep networks.

Finally, our framework can also be extended to nodewise rescalings of neural network parameters, as defined in [20]. For example, consider a neural network with 2 hidden layers with parameters $(W_1, W_2, W_3)$, represented as the function $f(x) = W_3\, \sigma\big(W_2\, \sigma(W_1 x)\big)$. For positive definite diagonal matrices $D_1$ and $D_2$ (satisfying an appropriate normalization constraint), the network with parameters $(D_1 W_1,\; D_2 W_2 D_1^{-1},\; W_3 D_2^{-1})$ implements the same function as the network with parameters $(W_1, W_2, W_3)$. For nodewise rescalings of this nature, one can work with an invariant metric on the parameter space analogous to the one in Proposition 2.3.

Using this metric in our framework will yield a flatness measure that is invariant to nodewise rescaling as well.
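For illustration, a sketch of such a nodewise rescaling for a 2-hidden-layer ReLU network (sizes and names ours), verifying that the function is unchanged:

```python
import numpy as np

rng = np.random.default_rng(3)
W1, W2, W3 = (rng.standard_normal(s) for s in [(32, 10), (32, 32), (5, 32)])
x = rng.standard_normal(10)
relu = lambda z: np.maximum(z, 0.0)


def net(A1, A2, A3):
    return A3 @ relu(A2 @ relu(A1 @ x))


D1 = np.diag(rng.uniform(0.5, 2.0, 32))   # positive diagonal rescalings per hidden node
D2 = np.diag(rng.uniform(0.5, 2.0, 32))

out_a = net(W1, W2, W3)
out_b = net(D1 @ W1, D2 @ W2 @ np.linalg.inv(D1), W3 @ np.linalg.inv(D2))
print(np.max(np.abs(out_a - out_b)))      # ~0: same function
```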

Acknowledgements.  We thank Rui Wu and Daniel Park for providing helpful comments on the paper.

Appendix A Missing Proofs from Section 2

We retain the same notation from the main paper.

Proposition A.1.

Let $\theta = (w_1, \ldots, w_K)$ be the parameters of a neural network with $K$ layers, and $\lambda = (\lambda_1, \ldots, \lambda_K)$ be a set of multipliers. Here $\lambda_k > 0$ for all $k$ and $\prod_{k=1}^K \lambda_k = 1$. We can transform the layer weights by $\lambda$ in the following manner: $T_\lambda(\theta) = (\lambda_1 w_1, \ldots, \lambda_K w_K)$. We introduce a relation $\sim$ on the parameter space: $\theta_a \sim \theta_b$ if $\exists\, \lambda$ such that $\theta_b = T_\lambda(\theta_a)$ and $\prod_{k=1}^K \lambda_k = 1$. The relation $\sim$ is an equivalence relation.

Proof.
  1. It is self-evident that $\theta \sim \theta$, with $\lambda = (1, \ldots, 1)$.

  2. If $\theta_a \sim \theta_b$, then $\exists\, \lambda$ such that $\theta_b = T_\lambda(\theta_a)$. Set $\mu = (1/\lambda_1, \ldots, 1/\lambda_K)$; then $\prod_k \mu_k = 1$ and $\theta_a = T_\mu(\theta_b)$. Also, $\mu_k > 0$ for all $k$, which means $\theta_b \sim \theta_a$.

  3. Let $\theta_a \sim \theta_b$, and $\theta_b \sim \theta_c$. This means $\exists\, \lambda$ such that $\theta_b = T_\lambda(\theta_a)$, and $\exists\, \mu$ such that $\theta_c = T_\mu(\theta_b)$. Let $\nu = (\lambda_1 \mu_1, \ldots, \lambda_K \mu_K)$. We see that $\prod_k \nu_k = \big(\prod_k \lambda_k\big)\big(\prod_k \mu_k\big) = 1$, and $\theta_c = T_\nu(\theta_a)$. Since $\prod_k \nu_k = 1$, we have that $\theta_a \sim \theta_c$.

Hence $\sim$ is an equivalence relation. ∎

Proposition A.2.

The set $\mathcal{M}/\!\sim$ obtained by mapping all points within an equivalence class to a single point in the set has a quotient manifold structure, making $\mathcal{M}/\!\sim$ a differentiable quotient manifold.

Proof.

In order to prove that $\mathcal{M}/\!\sim$ is a manifold, we need to show that:

  1. The graph of the relation, $\mathrm{gr}(\sim) = \{(\theta_a, \theta_b) \in \mathcal{M} \times \mathcal{M} : \theta_a \sim \theta_b\}$, is an embedded submanifold of $\mathcal{M} \times \mathcal{M}$.

  2. The projection $\pi_1 : \mathrm{gr}(\sim) \to \mathcal{M}$, $\pi_1(\theta_a, \theta_b) = \theta_a$, is a submersion.

  3. $\mathrm{gr}(\sim)$ is a closed subset of $\mathcal{M} \times \mathcal{M}$.


First, we look at a point $(\theta_a, \theta_b) \in \mathrm{gr}(\sim)$. This means $\exists\, \lambda$, $\prod_k \lambda_k = 1$, such that $\theta_b = T_\lambda(\theta_a)$. For every such $\lambda$ we can define the map $\theta \mapsto (\theta, T_\lambda(\theta))$, which is smooth and an injection from $\mathcal{M}$ to $\mathrm{gr}(\sim)$, and satisfies $\pi_1(\theta, T_\lambda(\theta)) = \theta$. Since $\pi_1$ composed with this map is the identity on $\mathcal{M}$, we see that $\mathrm{D}\pi_1$ is surjective, where $\mathrm{D}\pi_1$ is the Jacobian of $\pi_1$. This means that $\pi_1$ is a submersion, proving point 2.

Next we will prove point 3. For this, we define a continuous function under which the preimage of a closed set is $\mathrm{gr}(\sim)$. Since the preimage of a closed set under a continuous function is a closed set, we have that $\mathrm{gr}(\sim)$ is a closed subset of $\mathcal{M} \times \mathcal{M}$.

Finally we will prove 1, by defining a submersion from to . Suppose there is a smooth function from to