## 1 Introduction

Neural networks (NNs) have become the state-of-the-art machine learning approach in many applications. Their superior performance is often attributed to their ability to automatically learn suitable features from data. In supervised learning, these features are learned implicitly by minimizing the empirical error

$$\mathcal{E}_{\mathrm{emp}}(f_w) = \frac{1}{m}\sum_{i=1}^{m} \ell\big(f_w(x_i), y_i\big)$$

for a training set $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ drawn iid according to a target distribution $\mathcal{D}$, and a loss function $\ell$. Here, $f_w$ denotes the function represented by a neural network with parameters $w$. It is an open question why minimizing the empirical error during deep neural network training leads to good generalization, even though in many cases the number of network parameters is higher than the number of training examples. That is, why deep neural networks have a low generalization error

$$\mathcal{E}_{\mathrm{gen}}(f_w) = \mathcal{E}(f_w) - \mathcal{E}_{\mathrm{emp}}(f_w), \tag{1}$$

which is the difference between the expected error $\mathcal{E}(f_w) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(f_w(x), y)\big]$ on the target distribution and the empirical error on a finite dataset sampled from $\mathcal{D}$.

It has been proposed that good generalization correlates with flat minima of the non-convex loss surface [flatMinima, simplifyingByFlat], and this correlation has been empirically validated [keskarLarge, sensitivityGeneralization, identifyingGenProperties]. However, as [dinhSharp] remarked, current flatness measures, which are based only on the Hessian of the loss function, cannot theoretically be related to generalization: for deep neural networks with ReLU activation functions, a linear reparameterization of one layer,

$$w_l \mapsto \lambda\, w_l$$

for $\lambda > 0$, can lead to the same network function by simultaneously multiplying another layer by the inverse of $\lambda$, $w_{l'} \mapsto \lambda^{-1} w_{l'}$. Since the reparameterized network represents the same function, its generalization performance remains unchanged. However, this linear reparameterization changes all common flatness measures based on the Hessian of the loss. This constitutes an issue in relating flatness of the loss curve to generalization. We propose a novel flatness measure that is invariant under layer-wise reparameterization through multiplication with $\lambda$. We empirically show that it also correlates strongly with good generalization performance.

## 2 Measures of Flatness of the Loss Curve

Consider a function $f(x) = \psi(w \cdot \phi(x))$, where $f$ is the composition of a twice differentiable function $\psi$ and a matrix product with a matrix $w$, whereas $\phi$ can be considered as a feature extractor. For a loss function $\ell$ we let $H(w)$ denote the Hessian of the empirical error on a training set, considered as a function of $w$, and $\lambda_{\max}(H(w))$ the largest eigenvalue of $H(w)$.

###### Definition 1.

Let $f(x) = \psi(w \cdot \phi(x))$ be a model with an arbitrary twice differentiable function $\psi$ applied to the matrix product of parameters $w$ and the image of $x$ under a (feature) function $\phi$. Then

$$\kappa^{\phi}(w) = \|w\|_2^2 \cdot \lambda_{\max}\big(H(w)\big)$$

shall denote a flatness measure of the loss curve. (Note that small values of $\kappa^{\phi}(w)$ indicate flatness and high values indicate sharpness.)

#### Linear regression with squared loss

In the case of linear regression, $f(x) = \langle w, x \rangle$ (i.e., $\psi = \mathrm{id}$ and $\phi = \mathrm{id}$), and the squared loss function $\ell(\hat{y}, y) = (\hat{y} - y)^2$, we can easily compute the second derivatives with respect to $w$ to be $\frac{\partial^2 \mathcal{E}_{\mathrm{emp}}}{\partial w_i \partial w_j} = \frac{2}{m}\sum_{k=1}^{m} x_{k,i}\, x_{k,j}$, and the Hessian is independent of the parameters $w$. In this case, $\lambda_{\max}(H(w)) = c$ with a constant $c$ depending only on the data, and the measure $\kappa^{\phi}(w) = c\,\|w\|_2^2$ reduces to (a constant multiple of) the well-known Tikhonov (ridge) regression penalty.
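As a sanity check, this closed-form computation can be reproduced numerically. The sketch below (variable names are illustrative, not from any library) builds the Hessian of the empirical squared loss for a random linear-regression problem and confirms that the measure is the largest Hessian eigenvalue times the ridge penalty $\|w\|_2^2$:

```python
import numpy as np

# Minimal sketch of the flatness measure kappa = ||w||^2 * lambda_max(H)
# from Definition 1, in the linear-regression / squared-loss setting.

rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=m)

# Hessian of the empirical loss (1/m) * sum_k (<w, x_k> - y_k)^2:
# H_ij = (2/m) * sum_k x_ki * x_kj  -- independent of w.
H = 2.0 / m * X.T @ X
lam_max = np.linalg.eigvalsh(H)[-1]   # eigvalsh sorts ascending

w = rng.normal(size=d)
kappa = np.dot(w, w) * lam_max

# kappa is exactly lambda_max(H) times the Tikhonov penalty ||w||^2,
# so as a function of w it is a constant multiple of the ridge penalty.
assert np.isclose(kappa, lam_max * np.dot(w, w))
```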

#### Layers of Neural Networks

We consider neural network functions

$$f_w(x) = w_L\, \sigma\big(w_{L-1}\, \sigma(\cdots\, \sigma(w_1 x))\big) \tag{2}$$

of a neural network of $L$ layers with nonlinear activation function $\sigma$. We hide a possible non-linearity at the output by integrating it into the loss function chosen for neural network training. By letting $\phi_l(x)$ denote the output of the composition of the first $l-1$ layers and $\psi_l$ the composition of the activation function of the $l$-th layer together with the rest of the layers, we can write, for each layer $l$, $f_w(x) = \psi_l(w_l\, \phi_l(x))$. Then, for each layer of the neural network, we obtain a measure of flatness at parameters $w$,

$$\kappa^{l}(w) = \|w_l\|_2^2 \cdot \lambda_{\max}\big(H_l(w)\big),$$

with $\lambda_{\max}(H_l(w))$ the largest eigenvalue of the Hessian of the loss with respect to the parameters $w_l$ of the $l$-th layer.
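For the last layer under squared loss (with no output non-linearity), the layer-wise Hessian has a closed form as the scaled Gram matrix of the features, which makes the measure easy to compute. A minimal sketch with a randomly initialized two-layer ReLU network (all names are illustrative):

```python
import numpy as np

# Layer-wise flatness for the LAST layer of a small ReLU network under
# squared loss. With f(x) = <w2, relu(W1 x)>, the Hessian w.r.t. w2 is
# (2/m) * Phi^T Phi, where Phi stacks the features phi(x) = relu(W1 x).

rng = np.random.default_rng(1)
m, d, h = 100, 4, 6
X = rng.normal(size=(m, d))

W1 = rng.normal(size=(h, d))
w2 = rng.normal(size=h)

Phi = np.maximum(X @ W1.T, 0.0)      # features phi_L(x) = relu(W1 x)
H_last = 2.0 / m * Phi.T @ Phi       # Hessian of (1/m) sum (<w2,phi>-y)^2
kappa_last = np.dot(w2, w2) * np.linalg.eigvalsh(H_last)[-1]
```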

###### Theorem 2.

Let $f_w$ denote a neural network function parameterized by weights $w_l$ of the $l$-th layer. Suppose there are positive numbers $\lambda_1, \ldots, \lambda_L$ such that $f_{\tilde{w}} = f_w$ for all $x$, where $\tilde{w} = (\lambda_1 w_1, \ldots, \lambda_L w_L)$. Then, with $\kappa^{l}(w) = \|w_l\|_2^2 \cdot \lambda_{\max}(H_l(w))$ and $\kappa^{l}(\tilde{w}) = \|\tilde{w}_l\|_2^2 \cdot \lambda_{\max}(H_l(\tilde{w}))$, we have $\kappa^{l}(\tilde{w}) = \kappa^{l}(w)$ for every layer $l$.

We provide a proof in Appendix A.
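The invariance can also be checked numerically on a small ReLU network: scaling one layer by $\lambda$ and the next by $\lambda^{-1}$ leaves the network function unchanged, changes the Hessian-only sharpness $\lambda_{\max}(H)$, but leaves the proposed measure invariant. A sketch in a two-layer squared-loss setting (names illustrative, not the paper's code):

```python
import numpy as np

# Numeric check of the Theorem 2 invariance for f(x) = <w2, relu(W1 x)>.

rng = np.random.default_rng(2)
m, d, h = 100, 4, 6
X = rng.normal(size=(m, d))

W1 = rng.normal(size=(h, d))
w2 = rng.normal(size=h)

def kappa_and_sharpness(W1, w2):
    Phi = np.maximum(X @ W1.T, 0.0)    # last-layer features
    H = 2.0 / m * Phi.T @ Phi          # Hessian w.r.t. w2 (squared loss)
    lam_max = np.linalg.eigvalsh(H)[-1]
    return np.dot(w2, w2) * lam_max, lam_max

lam = 10.0
k0, s0 = kappa_and_sharpness(W1, w2)
k1, s1 = kappa_and_sharpness(lam * W1, w2 / lam)  # reparameterized network

# The network function is unchanged by the reparameterization ...
assert np.allclose(np.maximum(X @ (lam * W1).T, 0.0) @ (w2 / lam),
                   np.maximum(X @ W1.T, 0.0) @ w2)
assert not np.isclose(s0, s1)   # ... Hessian-only sharpness changes ...
assert np.isclose(k0, k1)       # ... but kappa is invariant.
```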

#### An Averaging Alternative

Experimental work [hessianEigenvalueDensity] suggests that the spectrum of the Hessian has many small values and only a few large outliers. We therefore consider the trace as an average of the spectrum to obtain

$$\kappa^{l}_{\mathrm{Tr}}(w) = \|w_l\|_2^2 \cdot \mathrm{Tr}\big(H_l(w)\big)$$

as a measure of flatness. The same arguments as those used to prove Theorem 2 also show this measure to be invariant with respect to the same layer-wise reparameterizations.

## 3 Empirical Evaluation

We empirically validate the practical usefulness of the proposed flatness measure by showing a strong correlation with the generalization error at local minima of the loss surface. For measuring the generalization error, we employ a Monte Carlo approximation of the target distribution given by the test dataset and measure the difference between the loss value on this approximation and the empirical error. In order to track the correlation of the flatness measure with the generalization error at local minima, sufficiently different minima must be reached by training. The most popular technique is to train the model with small and large batch sizes [scaleInvariantMeasure, keskarLarge, sensitivityGeneralization, identifyingGenProperties], which we also employ.

A neural network (LeNet5 [lecun2015lenet]) is trained on CIFAR10 multiple times until convergence with various training setups. This way, we obtain network configurations in multiple local minima. In particular, four different initialization schemes were considered (Xavier normal, Kaiming uniform, a uniform scheme, and a normal scheme), with four different mini-batch sizes and learning rates scaled correspondingly to keep the ratio between batch size and learning rate constant, for the standard SGD optimizer. Each of the setups was run several times with different random initializations. Here the generalization error is the difference between the summed error values on the test samples multiplied by five (since the size of the training set is five times larger) and the summed error values on the training examples. Figure 1 shows the approximated generalization error with respect to the flatness measure (for both $\kappa^{l}$ and $\kappa^{l}_{\mathrm{Tr}}$, with $l$ corresponding to the last hidden layer) for all network configurations. The correlation is significant for both measures, and it is stronger for $\kappa^{l}_{\mathrm{Tr}}$. This indicates that taking into account the full spectrum of the Hessian is beneficial.

To investigate the invariance of the proposed measure to reparameterization, we apply the reparameterization discussed in Sec. 2 to all networks using random factors $\lambda$. The impact of the reparameterization on the proposed flatness measures in comparison to the traditional ones is shown in Figure 2. While the proposed flatness measures are unaffected, the measures purely based on the Hessian show only a very weak correlation with the generalization error after the modifications.

Additional experiments conducted on the MNIST dataset are described in Appendix C.

In contrast to existing measures of flatness, our proposed measure is invariant to layer-wise reparameterizations of ReLU networks. However, we note that other reparameterizations are possible: using positive homogeneity, we can multiply all incoming weights of a single neuron by a positive number $\lambda$ and all outgoing weights of the same neuron by $\lambda^{-1}$. Our proposed measures of flatness $\kappa^{l}$ and $\kappa^{l}_{\mathrm{Tr}}$ are in general not invariant to such reparameterizations. We define further flatness measures that are invariant to these reparameterizations as well in Appendix B.

Taken together, we proposed a novel and practically useful flatness measure that strongly correlates with the generalization error while being invariant to reparameterization.

## References

## Appendix A Proof of Theorem 2

In this section, we discuss the proof of Theorem 2. Before starting with the formal proof, we present the idea in a simplified setting to separate the essential insight from the more involved notation in the setting of neural networks.

Let $f, g$ denote twice differentiable functions such that $g(w) = f(\lambda w)$ for a fixed $\lambda > 0$ and all $w$. Later, $w$ will correspond to the weights of a specific layer of the neural network, and the functions $f$ and $g$ will correspond respectively to the loss of the neural network before and after reparameterization of possibly all layers of the network. We show that

$$\|\lambda^{-1} w\|_2^2 \cdot \lambda_{\max}\big(H_g(\lambda^{-1} w)\big) = \|w\|_2^2 \cdot \lambda_{\max}\big(H_f(w)\big).$$

Indeed, the second derivative of $g$ at $w$ with respect to coordinates $w_i, w_j$ is given by the differential quotient as

$$\frac{\partial^2 g}{\partial w_i\, \partial w_j}(w) = \lambda^2\, \frac{\partial^2 f}{\partial w_i\, \partial w_j}(\lambda w).$$

Since this holds for all combinations of coordinates, we see that $H_g(w) = \lambda^2 H_f(\lambda w)$ for the Hessians of $f$ and $g$, and hence

$$\|\lambda^{-1} w\|_2^2 \cdot \lambda_{\max}\big(H_g(\lambda^{-1} w)\big) = \lambda^{-2}\, \|w\|_2^2 \cdot \lambda^{2}\, \lambda_{\max}\big(H_f(w)\big) = \|w\|_2^2 \cdot \lambda_{\max}\big(H_f(w)\big).$$

#### Formal Proof of Theorem 2

We are given a neural network function $f_w$ parameterized by weights $w_l$ of the $l$-th layer and positive numbers $\lambda_1, \ldots, \lambda_L$ such that $f_{\tilde{w}} = f_w$ for all $x$, where $\tilde{w} = (\lambda_1 w_1, \ldots, \lambda_L w_L)$. With $\kappa^{l}$ defined by $\kappa^{l}(w) = \|w_l\|_2^2 \cdot \lambda_{\max}(H_l(w))$, we aim to show that

$$\kappa^{l}(\tilde{w}) = \kappa^{l}(w),$$

where $\kappa^{l}(w)$ is the product of the squared norm of the vectorized weight matrix $w_l$ with the maximal eigenvalue of the Hessian $H_l(w)$ of the empirical error at $w$ with respect to the parameters $w_l$.

Let $g(v)$ denote the loss as a function of the parameters $v$ of the $l$-th layer before reparameterization. Further, we let $\tilde{g}(v)$ denote the loss as a function of the parameters of the $l$-th layer after reparameterization. We define a linear function $h$ by $h(v) = \lambda_l^{-1} v$. By assumption, we have that $\tilde{g}(v) = g(h(v)) = g(\lambda_l^{-1} v)$ for all $v$. By the chain rule, we compute for any variable $v_i$ of $v$,

$$\frac{\partial \tilde{g}}{\partial v_i}(v) = \lambda_l^{-1}\, \frac{\partial g}{\partial v_i}\big(\lambda_l^{-1} v\big).$$

Similarly, for second derivatives, we get for all $i, j$

$$\frac{\partial^2 \tilde{g}}{\partial v_i\, \partial v_j}(v) = \lambda_l^{-2}\, \frac{\partial^2 g}{\partial v_i\, \partial v_j}\big(\lambda_l^{-1} v\big).$$

Consequently, the Hessian $H_l(w)$ of the empirical error before reparameterization and the Hessian $\tilde{H}_l(\tilde{w})$ after reparameterization satisfy $\tilde{H}_l(\tilde{w}) = \lambda_l^{-2}\, H_l(w)$ and also $\lambda_{\max}(\tilde{H}_l(\tilde{w})) = \lambda_l^{-2}\, \lambda_{\max}(H_l(w))$. Therefore,

$$\kappa^{l}(\tilde{w}) = \|\lambda_l w_l\|_2^2 \cdot \lambda_l^{-2}\, \lambda_{\max}\big(H_l(w)\big) = \|w_l\|_2^2 \cdot \lambda_{\max}\big(H_l(w)\big) = \kappa^{l}(w). \qquad \square$$

## Appendix B Additional Measures of Flatness

We present additional measures of flatness that we considered during our study. The original motivation to study additional measures was the observation that there are other possible reparameterizations of a fully connected ReLU network besides the multiplication of whole layers by positive scalars: using positive homogeneity, we can multiply all incoming weights of a single neuron by a positive number $\lambda$ and all outgoing weights of the same neuron by $\lambda^{-1}$. Our previous measures of flatness $\kappa^{l}$ and $\kappa^{l}_{\mathrm{Tr}}$ are in general not invariant under the latter reparameterizations. We define for each layer $l$ and neuron $s$ in that layer a flatness measure by

$$\kappa^{l,s}(w) = \big\langle w_l^{s},\; H_{l,s}(w)\, w_l^{s} \big\rangle,$$

where $w_l^{s}$ denotes the $s$-th column of the weight matrix $w_l$ and $H_{l,s}(w)$ the Hessian of the empirical error with respect to the weights in $w_l^{s}$.

For each $l$ and $s$, this measure is invariant under all linear reparameterizations that do not change the network function. The proof of the following theorem is given in Section B.1.

###### Theorem 3.

Let $f_w$ denote a neural network function parameterized by weights $w_l$ of the $l$-th layer. Suppose there are positive numbers $\lambda_{l,i,j}$ such that the weights $\tilde{w}$ obtained from multiplying the weight at matrix position $(i,j)$ in layer $l$ by $\lambda_{l,i,j}$ satisfy $f_{\tilde{w}} = f_w$ for all $x$. Then $\kappa^{l,s}(\tilde{w}) = \kappa^{l,s}(w)$ for each $l$ and $s$.

We define a measure of flatness for a full layer by combinations of the measures of flatness for each individual neuron,

$$\kappa^{l}_{\max}(w) = \max_{s}\, \kappa^{l,s}(w) \qquad \text{and} \qquad \kappa^{l}_{\Sigma}(w) = \sum_{s} \kappa^{l,s}(w).$$

Since each of the individual expressions is invariant under all linear reparameterizations, so are the maximum and the sum.

| Notation | Definition | One value per | Invariance |
|---|---|---|---|
| $\kappa^{l}(w)$ | $\|w_l\|_2^2\, \lambda_{\max}(H_l(w))$ | layer | layer-wise mult. by pos. scalar |
| $\kappa^{l}_{\mathrm{Tr}}(w)$ | $\|w_l\|_2^2\, \mathrm{Tr}(H_l(w))$ | layer | layer-wise mult. by pos. scalar |
| $\max_l \kappa^{l}(w)$ | maximum of $\kappa^{l}$ over all layers | network | layer-wise mult. by pos. scalar |
| $\sum_l \kappa^{l}(w)$ | sum of $\kappa^{l}$ over all layers | network | layer-wise mult. by pos. scalar |
| $\max_l \kappa^{l}_{\mathrm{Tr}}(w)$ | maximum of $\kappa^{l}_{\mathrm{Tr}}$ over all layers | network | layer-wise mult. by pos. scalar |
| $\sum_l \kappa^{l}_{\mathrm{Tr}}(w)$ | sum of $\kappa^{l}_{\mathrm{Tr}}$ over all layers | network | layer-wise mult. by pos. scalar |
| $\kappa^{l,s}(w)$ | $\langle w_l^{s}, H_{l,s}(w)\, w_l^{s}\rangle$ | neuron | all linear reparameterizations |
| $\kappa^{l}_{\max}(w)$ | $\max_s \kappa^{l,s}(w)$ | layer | all linear reparameterizations |
| $\kappa^{l}_{\Sigma}(w)$ | $\sum_s \kappa^{l,s}(w)$ | layer | all linear reparameterizations |
| $\kappa_{\max}(w)$ | $\max_l \kappa^{l}_{\max}(w)$ | network | all linear reparameterizations |
| $\kappa_{\Sigma}(w)$ | $\sum_l \kappa^{l}_{\Sigma}(w)$ | network | all linear reparameterizations |

#### One Value for all Layers

It is clear that a low value of $\kappa^{l}$ for a specific layer alone cannot explain good performance. We therefore consider simple common bounds given by combinations of the individual terms $\kappa^{l}$, e.g., by taking the maximum over all layers, $\max_l \kappa^{l}(w)$, or the sum $\sum_l \kappa^{l}(w)$. Since each of the individual expressions is invariant under linear reparameterizations of full layers, so are the maximum and the sum.

Finally, we define the neuron-based network-wise measures $\kappa_{\max}(w) = \max_l \kappa^{l}_{\max}(w)$ and $\kappa_{\Sigma}(w) = \sum_l \kappa^{l}_{\Sigma}(w)$.

Table 1 summarizes all our measures of flatness, specifying whether each measure is defined per network, layer, or neuron, and whether it is invariant under layer-wise multiplication by a positive scalar (as considered in Theorem 2) or under all linear reparameterizations (as considered in Theorem 3).

### B.1 Proof of Theorem 3

As in Appendix A, we first present the idea in a simplified setting.

For the proof of Theorem 3 we need to consider the case where we multiply coordinates by different scalars. Let $f, g$ denote twice differentiable functions such that $g(w) = f(\Lambda w)$ for a fixed diagonal matrix $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ with $\lambda_i > 0$ and all $w$. In the formal proof, the coordinates will correspond to the outgoing weights of a specific neuron, while again $f$ and $g$ correspond to the loss before and after reparameterization of possibly all weights of the neural network. Then

$$\big\langle w,\; H_g(w)\, w \big\rangle = \big\langle \Lambda w,\; H_f(\Lambda w)\, \Lambda w \big\rangle$$

for all $w$.

Indeed, the second derivative of $g$ at $w$ with respect to coordinates $w_i, w_j$ is given by the differential quotient as

$$\frac{\partial^2 g}{\partial w_i\, \partial w_j}(w) = \lambda_i \lambda_j\, \frac{\partial^2 f}{\partial w_i\, \partial w_j}(\Lambda w).$$

From the calculation above, we also see that

$$H_g(w) = \Lambda\, H_f(\Lambda w)\, \Lambda.$$

It follows that

$$\big\langle w,\; H_g(w)\, w \big\rangle = \big\langle w,\; \Lambda\, H_f(\Lambda w)\, \Lambda\, w \big\rangle = \big\langle \Lambda w,\; H_f(\Lambda w)\, \Lambda w \big\rangle.$$
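This per-coordinate invariance of the quadratic form can be verified numerically with finite-difference Hessians; the function $f$ below is an arbitrary smooth test function, not the paper's loss, and all names are illustrative:

```python
import numpy as np

# Check: if g(w) = f(lam * w) elementwise, then the quadratic forms agree,
# <w, H_g(w) w> = <lam*w, H_f(lam*w) (lam*w)>. Hessians are estimated by
# central finite differences.

def hessian_fd(func, w, h=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            H[i, j] = (func(w + h*e_i + h*e_j) - func(w + h*e_i - h*e_j)
                       - func(w - h*e_i + h*e_j) + func(w - h*e_i - h*e_j)) / (4*h*h)
    return H

f = lambda v: np.sin(v[0] * v[1]) + v[0]**2 * v[1]   # arbitrary smooth f
lam = np.array([3.0, 0.25])                          # per-coordinate scalars
g = lambda v: f(lam * v)

w = np.array([0.4, 0.7])
q_g = w @ hessian_fd(g, w) @ w
q_f = (lam * w) @ hessian_fd(f, lam * w) @ (lam * w)
assert np.isclose(q_g, q_f, rtol=1e-4)
```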

#### Formal Proof of Theorem 3

We are given a neural network function $f_w$ parameterized by weights $w_l$ of the $l$-th layer and positive numbers $\lambda_{l,i,j}$ such that the weights $\tilde{w}$ obtained from multiplying the weight at matrix position $(i,j)$ in layer $l$ by $\lambda_{l,i,j}$ satisfy $f_{\tilde{w}} = f_w$ for all $x$. We aim to show that

$$\kappa^{l,s}(\tilde{w}) = \kappa^{l,s}(w)$$

for each $l$ and $s$, where $\kappa^{l,s}(w) = \langle w_l^{s}, H_{l,s}(w)\, w_l^{s} \rangle$, $w_l^{s}$ denotes the $s$-th column of the weight matrix at the $l$-th layer, and $H_{l,s}(w)$ denotes the Hessian of the empirical error with respect to the weight parameters in $w_l^{s}$.

The proof is very similar to the proof of Theorem 2, only this time we have to take the different scaling factors into account. For a fixed layer $l$, we denote by $w_l^{s}$ the $s$-th column of $w_l$ and by $\tilde{w}_l^{s}$ the $s$-th column after reparameterization, with entries $\tilde{w}_l^{s}[i] = \lambda_{l,i,s}\, w_l^{s}[i]$.

Let $g(v)$ denote the loss as a function of the parameters of the $s$-th column in the $l$-th layer before reparameterization, and let $\tilde{g}(v)$ denote the loss as a function of the parameters of the $s$-th column in the $l$-th layer after reparameterization.

We define a linear function $h$ by

$$h(v) = \big(\lambda_{l,1,s}^{-1}\, v_1,\; \ldots,\; \lambda_{l,n,s}^{-1}\, v_n\big).$$

By assumption, we have that $\tilde{g}(v) = g(h(v))$ for all $v$. By the chain rule, we compute for any variable $v_i$ of $v$,

$$\frac{\partial \tilde{g}}{\partial v_i}(v) = \lambda_{l,i,s}^{-1}\, \frac{\partial g}{\partial v_i}\big(h(v)\big).$$

Similarly, for second derivatives, we get for all $i, j$,

$$\frac{\partial^2 \tilde{g}}{\partial v_i\, \partial v_j}(v) = \lambda_{l,i,s}^{-1}\, \lambda_{l,j,s}^{-1}\, \frac{\partial^2 g}{\partial v_i\, \partial v_j}\big(h(v)\big).$$

Consequently, the Hessian $H_{l,s}(w)$ of the empirical error before reparameterization and the Hessian $\tilde{H}_{l,s}(\tilde{w})$ after reparameterization satisfy, at position $(i,j)$ of the Hessian matrix,

$$\tilde{H}_{l,s}(\tilde{w})[i,j] = \lambda_{l,i,s}^{-1}\, \lambda_{l,j,s}^{-1}\, H_{l,s}(w)[i,j].$$

Therefore,

$$\kappa^{l,s}(\tilde{w}) = \big\langle \tilde{w}_l^{s},\; \tilde{H}_{l,s}(\tilde{w})\, \tilde{w}_l^{s} \big\rangle = \sum_{i,j} \big(\lambda_{l,i,s}\, w_l^{s}[i]\big)\, \lambda_{l,i,s}^{-1}\, \lambda_{l,j,s}^{-1}\, H_{l,s}(w)[i,j]\, \big(\lambda_{l,j,s}\, w_l^{s}[j]\big) = \kappa^{l,s}(w). \qquad \square$$

## Appendix C Additional Experiments

In addition to the evaluation on the CIFAR10 dataset with the LeNet5 network, we also conducted experiments on the MNIST dataset. For learning with this data, we employed a custom fully connected network with ReLU activations containing four hidden layers. The output layer has ten neurons with softmax activation. The networks were trained until convergence on the MNIST training dataset; moreover, configurations that did not reach a sufficiently small training error were filtered out. All networks were initialized according to the Xavier normal scheme with a random seed. To obtain different convergence minima, the batch size was varied, with the learning rate changed correspondingly to keep the ratio constant. All configurations were trained with SGD. Figure 3 shows the correlation between the layer-wise flatness measure based on the trace of the Hessian for the corresponding layer and the generalization error (the difference between the normalized test error and the train error); the values are calculated for all four hidden layers (the trace is not normalized). The observed correlation is strong and varies only slightly between layers; nevertheless, it is hard to identify the most influential layer for identifying generalization properties.

We also calculated the neuron-wise flatness measures described in Appendix B for these network configurations. In Figure 4 we depict the correlation between $\kappa^{l}_{\max}$ and the generalization error for each of the layers, and in Figure 5 the correlation between $\kappa^{l}_{\Sigma}$ and the generalization error. The observed correlation is again significant, but compared to the previous measure it may differ considerably depending on the layer.

The network-wise flatness measures can be based on both the layer-wise and the neuron-wise measures as defined in Appendix B. We computed the network-wise measures and depict them in Figure 6. It is interesting to note that each of the network-wise measures has a larger correlation with the generalization loss than the original neuron-wise and layer-wise measures.

### c.1 Proof of Equation (LABEL:eq:bound)

First note that for a vector $v$ and a symmetric matrix $A$,

$$\big\langle v,\; A\, v \big\rangle \leq \|v\|_2^2\; \lambda_{\max}(A). \tag{3}$$

From (LABEL:eq:calculation1) and (LABEL:eq:calculation2) we get

$$\kappa^{l,s}(w) = \big\langle w_l^{s},\; H_{l,s}(w)\, w_l^{s} \big\rangle \leq \|w_l^{s}\|_2^2\; \lambda_{\max}\big(H_{l,s}(w)\big),$$

where we used identity (3), which holds for any symmetric matrix $A$.

## Appendix D Additional properties of feature robustness

### d.1 Relation to noise injection at the feature layer

Feature robustness is related to noise injection in the layer under consideration. By defining a probability measure $\mu$ on matrices $A$ of norm at most one, we can take expectations over matrices. For each sample $x$, an expectation over such matrices induces an expectation over a probability distribution of perturbed feature vectors $z = \phi(x) + \delta A\, \phi(x)$. We find the induced probability distribution $\nu_x$ from the measure $\mu$, defined by $\nu_x(B) = \mu\big(\{A \mid \phi(x) + \delta A\, \phi(x) \in B\}\big)$ for a measurable subset $B$. Then the expected change in loss under $\mu$ equals the expected change in loss under $\nu_x$. The latter is robustness to noise injection according to the noise distribution $\nu_x$ for sample $x$ at the feature layer defined by $\phi$.

### d.2 Adversarial examples

#### Large changes of loss (adversarial examples) can be hidden in the mean in the definition of feature robustness.

We have seen that flatness of the loss curve with respect to some weights is related to the mean change in loss value when perturbing all data points into directions $A\, \phi(x)$ for some matrix $A$. For a common bound over different directions governed by the matrix $A$, we restrict ourselves to matrices of bounded norm. One may therefore wonder what freedom of perturbing individual points we have.

At first, note that for each fixed sample $x$ and each direction $v$ there is a matrix $A$ such that $A\, \phi(x) = v$, so each direction for each datapoint can be considered within a bound as above. We gain little insight into the change of loss for this perturbation, however, since a large change of the loss may go missing in the mean change of loss over all data points considered in the same bound.

The bound involving $\|A\|$ from above does not directly allow checking the change of the loss when perturbing the samples independently into arbitrary directions. For example, suppose we have two samples close to each other and we are interested in the change of loss when perturbing them into directions orthogonal to each other. Specifically, suppose our dataset contains the points $(1, 0)$ and $(1, \epsilon)$ for some small $\epsilon > 0$, and we aim to check how the loss changes when perturbing $(1,0)$ into direction $(\delta, 0)$ and $(1, \epsilon)$ orthogonally into direction $(0, \delta)$. To allow for this simultaneous change, our matrix $A$ has to satisfy $A\,(1,0)^{T} = (\delta, 0)^{T}$ and $A\,(1,\epsilon)^{T} = (0, \delta)^{T}$, i.e., it has to be of the form

$$A = \begin{pmatrix} \delta & -\delta/\epsilon \\ 0 & \delta/\epsilon \end{pmatrix}.$$

Then $\|A\| \geq \delta/\epsilon$. Hence, our desired alterations of the input necessarily lead to a large matrix norm, and the attainable bound becomes almost vacuous for small $\epsilon$.
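The norm blow-up can be made concrete numerically; the points $(1,0)$ and $(1,\epsilon)$ and the values of $\epsilon$ and $\delta$ below are illustrative choices consistent with the construction above:

```python
import numpy as np

# Sketch of the two-sample example: solve A @ [x1 x2] = [v1 v2] for the
# matrix A mapping each sample to its desired orthogonal perturbation.
# The spectral norm of A grows like delta/eps as the samples get closer.

def perturbation_matrix(eps, delta):
    X = np.array([[1.0, 1.0],        # columns: x1 = (1,0), x2 = (1,eps)
                  [0.0, eps]])
    V = np.array([[delta, 0.0],      # columns: v1 = (delta,0), v2 = (0,delta)
                  [0.0, delta]])
    return V @ np.linalg.inv(X)

A = perturbation_matrix(eps=1e-3, delta=0.1)
print(np.linalg.norm(A, 2))          # on the order of delta/eps = 100
```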

### d.3 Convolutional Layers

Feature robustness is not restricted to fully connected neural networks. In this section, we briefly consider convolutional layers. Since convolution is a linear function, there is a matrix $T_w$ such that the convolution of a filter $w$ with an input $x$ can be written as $w * x = T_w\, x$, and likewise a matrix $T_x$ such that $w * x = T_x\, w$. Using linearity, we can therefore ask about changes $w \mapsto w + \delta A w$ for some matrix $A$, just as in the fully connected case. We assume that the convolutional layer is dimensionality-reducing and that the matrix $T_w$ has full rank, so that there is a matrix $T_w^{+}$ with $T_w T_w^{+} = I$.¹

¹This holds, for example, for a convolutional filter with stride one and without padding, as in this case $T_w$ has a triangular Toeplitz submatrix that is invertible for a generic filter.

As a consequence, similar considerations of flatness and feature robustness can be made for convolutional layers.