Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. Hochreiter & Schmidhuber (1997); Keskar et al. (2017), is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Furthermore, if we allow ourselves to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.
Deep learning techniques have been very successful in several domains, like object recognition in images (e.g. Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016), machine translation (e.g. Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Wu et al., 2016; Gehring et al., 2016) and speech recognition (e.g. Graves et al., 2013; Hannun et al., 2014; Chorowski et al., 2015; Chan et al., 2016; Collobert et al., 2016). Several arguments have been brought forward to justify these empirical results. From a representational point of view, it has been argued that deep networks can efficiently approximate certain functions (e.g. Montufar et al., 2014; Raghu et al., 2016). Other works (e.g. Dauphin et al., 2014; Choromanska et al., 2015) have looked at the structure of the error surface to analyze how trainable these models are. Finally, another point of discussion is how well these models can generalize (Nesterov & Vial, 2008; Keskar et al., 2017; Zhang et al., 2017). These correspond, respectively, to low approximation, optimization and estimation error as described by Bottou (2010).
Our work focuses on the analysis of the estimation error. In particular, different approaches have been used to look at the question of why stochastic gradient descent results in solutions that generalize well (Bottou & LeCun, 2005; Bottou & Bousquet, 2008). For example, Duchi et al. (2011); Nesterov & Vial (2008); Hardt et al. (2016); Bottou et al. (2016); Gonen & Shalev-Shwartz (2017) rely on the concept of stochastic approximation or uniform stability (Bousquet & Elisseeff, 2002). Another conjecture that was recently explored (Keskar et al., 2017), but that can be traced back to Hochreiter & Schmidhuber (1997), relies on the geometry of the loss function around a given solution. It argues that flat minima, for some definition of flatness, lead to better generalization. Our work focuses on this particular conjecture, arguing that there are critical issues when applying the concept of flat minima to deep neural networks, which require rethinking what flatness actually means.
While the concept of flat minima is not well defined, having slightly different meanings in different works, the intuition is relatively simple. If one imagines the error as a one-dimensional curve, a minimum is flat if there is a wide region around it with roughly the same error, otherwise the minimum is sharp. When moving to higher dimensional spaces, defining flatness becomes more complicated. In Hochreiter & Schmidhuber (1997) it is defined as the size of the connected region around the minimum where the training loss is relatively similar. Chaudhari et al. (2017) rely, in contrast, on the curvature of the second order structure around the minimum, while Keskar et al. (2017) look at the maximum loss in a bounded neighbourhood of the minimum. All these works rely on the fact that flatness results in robustness to low precision arithmetic or noise in the parameter space, which, using a minimum description length-based argument, suggests a low expected overfitting.
However, several common architectures and parametrizations in deep learning are already at odds with this conjecture, requiring at least some degree of refinement in the statements made. In particular, we show how the geometry of the associated parameter space can alter the ranking between prediction functions when considering several measures of flatness/sharpness. We believe the reason for this contradiction stems from the Bayesian arguments about KL-divergence made to justify the generalization ability of flat minima (Hinton & Van Camp, 1993). Indeed, Kullback-Leibler divergence is invariant to change of parameters whereas the notion of "flatness" is not. The demonstrations of Hochreiter & Schmidhuber (1997) are approximately based on a Gibbs formalism and rely on strong assumptions and approximations that can compromise the applicability of the argument, including the assumption of a discrete function space.
For conciseness, we will restrict ourselves to supervised scalar output problems, but several conclusions in this paper can apply to other problems as well. We will consider a function $f$ that takes as input an element $x$ from an input space $\mathcal{X}$ and outputs a scalar $y$. We will denote by $f_\theta$ the prediction function. This prediction function will be parametrized by a parameter vector $\theta$ in a parameter space $\Theta$. Often, this prediction function will be over-parametrized and two parameters $(\theta, \theta') \in \Theta^2$ that yield the same prediction function everywhere, $\forall x \in \mathcal{X},\ f_\theta(x) = f_{\theta'}(x)$, are called observationally equivalent. The model is trained to minimize a continuous loss function $L$ which takes as argument the prediction function $f_\theta$. We will often think of the loss $L$ as a function of $\theta$ and adopt the notation $L(\theta)$.
The notion of flatness/sharpness of a minimum is relative, therefore we will discuss metrics that can be used to compare the relative flatness between two minima. In this section we will formalize three definitions of flatness used in the literature.
Hochreiter & Schmidhuber (1997) defines a flat minimum as "a large connected region in weight space where the error remains approximately constant". We interpret this formulation as follows:
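Given $\epsilon > 0$, a minimum $\theta$, and a loss $L$, we define $C(L, \theta, \epsilon)$ as the largest (using inclusion as the partial order over the subsets of $\Theta$) connected set containing $\theta$ such that $\forall \theta' \in C(L, \theta, \epsilon),\ L(\theta') < L(\theta) + \epsilon$. The $\epsilon$-flatness will be defined as the volume of $C(L, \theta, \epsilon)$. We will call this measure the volume $\epsilon$-flatness.
In Figure 1, $C(L, \theta, \epsilon)$ will be the purple line at the top of the red area if the height is $\epsilon$ and its volume will simply be the length of the purple line.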
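To make the volume-based definition concrete, the following minimal sketch estimates it for a one-dimensional loss, where the "volume" is just the length of the connected interval around the minimum on which the loss stays below its minimal value plus a tolerance. The function name `volume_eps_flatness_1d`, the grid walk, and the two quadratic losses are illustrative assumptions, not part of the original definitions.

```python
def volume_eps_flatness_1d(loss, theta_min, eps, step=1e-3, max_steps=1_000_000):
    """Length of the connected set around theta_min where loss < loss(theta_min) + eps.

    Walks left and right from the minimum on a grid of spacing `step` until the
    threshold is crossed, so the result is accurate to about 2 * step.
    """
    threshold = loss(theta_min) + eps
    length = 0.0
    for direction in (-1.0, 1.0):
        k = 1
        while k <= max_steps and loss(theta_min + direction * k * step) < threshold:
            k += 1
        length += (k - 1) * step
    return length


# Two hypothetical losses sharing the same minimum at 0.
flat = lambda t: 0.1 * t * t     # wide basin
sharp = lambda t: 10.0 * t * t   # narrow basin

flat_len = volume_eps_flatness_1d(flat, 0.0, eps=0.1)
sharp_len = volume_eps_flatness_1d(sharp, 0.0, eps=0.1)
print(flat_len, sharp_len)
```

Under this measure the wide basin has an interval of length close to 2 and the narrow one close to 0.2, which matches the one-dimensional intuition of a flat versus a sharp minimum.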
Flatness can also be defined using the local curvature of the loss function around the minimum if it is a critical point (in this paper, we will often assume that this is the case when dealing with Hessian-based measures, so that they are well-defined). Chaudhari et al. (2017); Keskar et al. (2017) suggest that this information is encoded in the eigenvalues of the Hessian. However, in order to compare how flat one minimum is relative to another, the eigenvalues need to be reduced to a single number. Here we consider the spectral norm and trace of the Hessian, two typical measurements of the eigenvalues of a matrix.
Additionally, Keskar et al. (2017) define the notion of $\epsilon$-sharpness. In order to make proofs more readable, we will slightly modify their definition. However, because of norm equivalence in finite dimensional space, our results will transfer to the original definition in full space as well. Our modified definition is the following:
Let $B_2(\epsilon, \theta)$ be a Euclidean ball centered on a minimum $\theta$ with radius $\epsilon$. Then, for a non-negative valued loss function $L$, the $\epsilon$-sharpness will be defined as proportional to
$$\frac{\max_{\theta' \in B_2(\epsilon, \theta)} \big(L(\theta') - L(\theta)\big)}{1 + L(\theta)}.$$
In Figure 1, if the width of the red area is $2\epsilon$ then the height of the red area is $\max_{\theta' \in B_2(\epsilon, \theta)} \big(L(\theta') - L(\theta)\big)$.
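$\epsilon$-sharpness can be related to the spectral norm $\|(\nabla^2 L)(\theta)\|_2$ of the Hessian $(\nabla^2 L)(\theta)$. Indeed, a second-order Taylor expansion of $L$ around a critical point minimum $\theta$ is written
$$L(\theta') = L(\theta) + \frac{1}{2} (\theta' - \theta)^T (\nabla^2 L)(\theta) (\theta' - \theta) + o(\|\theta' - \theta\|_2^2).$$
In this second order approximation, the $\epsilon$-sharpness at $\theta$ would be
$$\frac{\|(\nabla^2 L)(\theta)\|_2 \, \epsilon^2}{2 \big(1 + L(\theta)\big)}.$$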
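As a sanity check of the relation between ball-based sharpness and the largest Hessian eigenvalue, the sketch below evaluates the sharpness of a quadratic loss by brute force on a polar grid and compares it with the second-order prediction (largest eigenvalue times the squared radius, divided by twice one plus the loss at the minimum). The quadratic loss, its diagonal Hessian entries `h1`, `h2`, the offset `c`, and the helper `eps_sharpness_2d` are assumptions chosen for illustration.

```python
import math


def eps_sharpness_2d(loss, theta, eps, n_angles=3600, n_radii=50):
    """Approximate max over the Euclidean ball of radius eps around theta of
    (L(theta') - L(theta)), divided by (1 + L(theta)), using a polar grid."""
    base = loss(theta)
    best = 0.0
    for i in range(1, n_radii + 1):
        r = eps * i / n_radii
        for j in range(n_angles):
            a = 2.0 * math.pi * j / n_angles
            cand = (theta[0] + r * math.cos(a), theta[1] + r * math.sin(a))
            best = max(best, loss(cand) - base)
    return best / (1.0 + base)


# Hypothetical quadratic loss with Hessian diag(h1, h2) and minimal value c at the origin.
h1, h2, c = 4.0, 1.0, 0.5
loss = lambda t: c + 0.5 * (h1 * t[0] ** 2 + h2 * t[1] ** 2)

eps = 0.1
measured = eps_sharpness_2d(loss, (0.0, 0.0), eps)
predicted = max(h1, h2) * eps ** 2 / (2.0 * (1.0 + c))
print(measured, predicted)
```

For an exactly quadratic loss the two numbers agree up to floating-point error; for a general loss the agreement only holds to second order in the radius.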
Before moving forward to our results, in this section we first introduce the notation used in the rest of the paper. Most of our results, for clarity, will be on the deep rectified feedforward networks with a linear output layer that we describe below, though they can easily be extended to other architectures (e.g. convolutional, etc.).
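Given $K$ weight matrices $(\theta_k)_{k \leq K}$ with $n_k = \dim(\mathrm{vec}(\theta_k))$ and $n = \sum_{k=1}^{K} n_k$, the output $y$ of a deep rectified feedforward network with a linear output layer is:
$$y = \phi_{rect}\big(\phi_{rect}(\cdots \phi_{rect}(x \cdot \theta_1) \cdots) \cdot \theta_{K-1}\big) \cdot \theta_K,$$
where
$x$ is the input to the model, a high-dimensional vector;
$\phi_{rect}$ is the rectified elementwise activation function, i.e. the positive part $(z_i)_i \mapsto (\max(z_i, 0))_i$;
$\mathrm{vec}$ reshapes a matrix into a vector.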
Note that in our definition we excluded the bias terms, usually found in any neural architecture. This is done mainly for convenience, to simplify the rendition of our arguments. However, the arguments can be extended to the case that includes biases (see the appendix). Another choice is that of the linear output layer. Having an output activation function does not affect our argument either: since the loss is a function of the output activation, it can be rephrased as a function of the linear pre-activation.
Deep rectifier models have certain properties that allow us, in Section 4, to arbitrarily manipulate the flatness of a minimum.
An important topic for optimization of neural networks is understanding the non-Euclidean geometry of the parameter space as imposed by the neural architecture (see, for example, Amari, 1998). In principle, when we take a step in parameter space what we expect to control is the change in the behavior of the model (i.e. the mapping of the input $x$ to the output $y$). We are not interested in the parameters per se, but rather only in the mapping they represent.
If one defines a measure for the change in the behavior of the model, which can be done under some assumptions, then it can be used to define, at any point in the parameter space, a metric that says what is the equivalent change in the parameters for a unit of change in the behavior of the model. As it turns out, for neural networks, this metric is not constant over $\Theta$. Intuitively, the metric is related to the curvature, and since neural networks can be highly non-linear, the curvature will not be constant. See Amari (1998); Pascanu & Bengio (2014) for more details. Coming back to the concept of flatness or sharpness of a minimum, this metric should define the flatness.
However, the geometry of the parameter space is more complicated. Regardless of the measure chosen to compare two instantiations of a neural network, because of the structure of the model, it also exhibits a large number of symmetric configurations that result in exactly the same behavior. Because the rectifier activation has the non-negative homogeneity property, as we will see shortly, one can construct a continuum of points that lead to the same behavior, hence the metric is singular. This means that one can exploit these directions in which the model stays unchanged to shape the neighbourhood around a minimum in such a way that, by most definitions of flatness, this property can be controlled. See Figure 2 for a visual depiction, where the flatness (given here as the distance between the different level curves) can be changed by moving along the curve.
Let us redefine, for convenience, the non-negative homogeneity property (Neyshabur et al., 2015; Lafond et al., 2016) below. Note that besides this property, the reason for studying the rectified linear activation is its widespread adoption (Krizhevsky et al., 2012; Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016).
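A function $\phi$ is non-negative homogeneous if
$$\forall (z, \alpha) \in \mathbb{R} \times \mathbb{R}^+,\ \phi(\alpha z) = \alpha \phi(z).$$
The rectified function $\phi_{rect}(x) = \max(x, 0)$ is non-negative homogeneous.
This follows directly from the constraint $\alpha > 0$: in that case $\alpha x > 0$ if and only if $x > 0$, so $\max(\alpha x, 0) = \alpha \max(x, 0)$.
For a deep rectified neural network it means that:
$$\phi_{rect}\big(x \cdot (\alpha \theta_1)\big) \cdot \theta_2 = \phi_{rect}(x \cdot \theta_1) \cdot (\alpha \theta_2),$$
meaning that for this one (hidden) layer neural network, the parameters $(\alpha \theta_1, \theta_2)$ are observationally equivalent to $(\theta_1, \alpha \theta_2)$. This observational equivalence similarly holds for convolutional layers.
Given this non-negative homogeneity, if $(\theta_1, \theta_2) \neq (0, 0)$ then $\{(\alpha \theta_1, \alpha^{-1} \theta_2),\ \alpha > 0\}$ is an infinite set of observationally equivalent parameters, inducing a strong non-identifiability in this learning scenario. Other models like deep linear networks (Saxe et al., 2013), leaky rectifiers (He et al., 2015) or maxout networks (Goodfellow et al., 2013) also have this non-negative homogeneity property.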
In what follows we will rely on such transformations, in particular we will rely on the following definition:
For a single hidden layer rectifier feedforward network we define the family of transformations
$$T_\alpha : (\theta_1, \theta_2) \mapsto (\alpha \theta_1, \alpha^{-1} \theta_2),$$
which we refer to as an $\alpha$-scale transformation.
Note that an $\alpha$-scale transformation will not affect the generalization, as the behavior of the function is identical. Also, while the transformation is only defined for a single layer rectified feedforward network, it can trivially be extended to any architecture having a single rectified network as a submodule, e.g. a deep rectified feedforward network. For simplicity and readability we will rely on this definition.
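The observational equivalence induced by this rescaling can be checked numerically. The sketch below builds a tiny one-hidden-layer rectifier network without biases, multiplies the first layer by a positive factor while dividing the second layer by the same factor, and records the largest change in output. The weights, inputs and helper names (`forward`, `t_alpha`) are arbitrary illustrative choices.

```python
def relu(z):
    return [max(v, 0.0) for v in z]


def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def forward(x, theta1, theta2):
    """y = relu(x . theta1) . theta2 for a one-hidden-layer network without biases."""
    return matvec(relu(matvec(x, theta1)), theta2)


def scale(W, a):
    return [[a * w for w in row] for row in W]


def t_alpha(theta1, theta2, alpha):
    """Scale transformation: (theta1, theta2) -> (alpha * theta1, theta2 / alpha)."""
    assert alpha > 0
    return scale(theta1, alpha), scale(theta2, 1.0 / alpha)


theta1 = [[1.0, -2.0, 0.5], [0.3, 0.8, -1.5]]   # 2 inputs, 3 hidden units
theta2 = [[2.0], [-1.0], [0.7]]                 # 3 hidden units, 1 output
inputs = [[1.0, 2.0], [-0.5, 0.25], [3.0, -1.0]]

max_dev = 0.0
for alpha in (0.01, 0.5, 4.0, 100.0):
    s1, s2 = t_alpha(theta1, theta2, alpha)
    for x in inputs:
        y = forward(x, theta1, theta2)[0]
        y_scaled = forward(x, s1, s2)[0]
        max_dev = max(max_dev, abs(y - y_scaled))
print(max_dev)
```

The outputs coincide up to floating-point rounding for every positive factor, even though the rescaled parameters can lie arbitrarily far from the original ones in parameter space.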
In this section we exploit the resulting strong non-identifiability to showcase a few shortcomings of some definitions of flatness. Although an $\alpha$-scale transformation does not affect the function represented, it allows us to significantly decrease several measures of flatness. For another definition of flatness, $\alpha$-scale transformations show that all minima are equally flat.
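For a one-hidden layer rectified neural network of the form
$$y = \phi_{rect}(x \cdot \theta_1) \cdot \theta_2,$$
and a minimum $\theta = (\theta_1, \theta_2)$, such that $\theta_1 \neq 0$ and $\theta_2 \neq 0$, $\forall \epsilon > 0$, $C(L, \theta, \epsilon)$ has an infinite volume.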
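We will not consider the solution $\theta$ where any of the weight matrices $\theta_1, \theta_2$ is zero, $\theta_1 = 0$ or $\theta_2 = 0$, as it results in a constant function which we will assume to give poor training performance. For $\alpha > 0$, the $\alpha$-scale transformation $T_\alpha : (\theta_1, \theta_2) \mapsto (\alpha \theta_1, \alpha^{-1} \theta_2)$ has Jacobian determinant $\alpha^{n_1 - n_2}$, where once again $n_1 = \dim(\mathrm{vec}(\theta_1))$ and $n_2 = \dim(\mathrm{vec}(\theta_2))$. Note that the Jacobian determinant of this linear transformation is the change in the volume induced by $T_\alpha$, and that $T_\alpha \circ T_\beta = T_{\alpha \beta}$. We show below that there is a connected region containing $\theta$ with infinite volume and where the error remains approximately constant.
We will first introduce a small region with approximately constant error around $\theta$ with non-zero volume. Given $\epsilon > 0$ and if we consider the loss function continuous with respect to the parameter, $C(L, \theta, \epsilon)$ is an open set containing $\theta$. Since we also have $\theta_1 \neq 0$ and $\theta_2 \neq 0$, let $r > 0$ be such that the $\mathcal{L}_\infty$ ball $B_\infty(r, \theta)$ is in $C(L, \theta, \epsilon)$ and has empty intersection with $\{\theta',\ \theta'_1 = 0\}$. Let $v = (2r)^{n_1 + n_2} > 0$ be the volume of $B_\infty(r, \theta)$.
Since the Jacobian determinant of $T_\alpha$ is the multiplicative change of volume induced by $T_\alpha$, the volume of $T_\alpha\big(B_\infty(r, \theta)\big)$ is $v \alpha^{n_1 - n_2}$. If $n_1 \neq n_2$, we can arbitrarily grow the volume of $T_\alpha\big(B_\infty(r, \theta)\big)$, with error within an $\epsilon$-interval of $L(\theta)$, by having $\alpha$ tend to $+\infty$ if $n_1 > n_2$ or to $0$ otherwise.
If $n_1 = n_2$, $\forall \alpha' > 0$, $T_{\alpha'}\big(B_\infty(r, \theta)\big)$ has volume $v$. Let $C' = \bigcup_{\alpha' > 0} T_{\alpha'}\big(B_\infty(r, \theta)\big)$. $C'$ is a connected region where the error remains approximately constant, i.e. within an $\epsilon$-interval of $L(\theta)$.
Let $\alpha = 2 \frac{\|\theta_1\|_\infty + r}{\|\theta_1\|_\infty - r}$. Since
$$B_\infty(r, \theta) = B_\infty(r, \theta_1) \times B_\infty(r, \theta_2),$$
where $\times$ is the Cartesian set product, we have
$$T_\alpha\big(B_\infty(r, \theta)\big) = B_\infty(\alpha r, \alpha \theta_1) \times B_\infty(\alpha^{-1} r, \alpha^{-1} \theta_2).$$
Therefore, $T_\alpha\big(B_\infty(r, \theta)\big) \cap B_\infty(r, \theta) = \emptyset$ (see Figure 3).
Similarly, $B_\infty(r, \theta), T_\alpha\big(B_\infty(r, \theta)\big), T_\alpha^2\big(B_\infty(r, \theta)\big), \ldots$ are disjoint and have volume $v$. We have also $T_\alpha^k\big(B_\infty(r, \theta)\big) = T_{\alpha^k}\big(B_\infty(r, \theta)\big) \subset C'$. The volume of $C'$ is then lower bounded by $0 < v + v + v + \cdots$ and is therefore infinite. $C(L, \theta, \epsilon)$ has then infinite volume too, making the volume $\epsilon$-flatness of $\theta$ infinite. ∎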
This theorem can generalize to rectified neural networks in general with a similar proof. Given that every minimum has an infinitely large region (volume-wise) in which the error remains approximately constant, every minimum would be infinitely flat according to the volume $\epsilon$-flatness. Since all minima are equally flat, it is not possible to use volume $\epsilon$-flatness to gauge the generalization property of a minimum.
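The counting argument behind the infinite volume can be illustrated in the simplest case of two scalar weights, one per layer, so that both layers have the same dimension. The sketch below maps a small axis-aligned box around a parameter through repeated rescalings (first weight multiplied, second weight divided by the same factor) and checks that the images keep the same area and never overlap. The concrete values of `theta1`, `theta2`, `r` and the helper names are illustrative assumptions; the scaling factor follows the choice $2(\|\theta_1\|_\infty + r)/(\|\theta_1\|_\infty - r)$ used in the proof.

```python
def box(theta1, theta2, r, alpha_k):
    """Image of the box [theta1 - r, theta1 + r] x [theta2 - r, theta2 + r]
    under (t1, t2) -> (alpha_k * t1, t2 / alpha_k), for scalar theta1, theta2 > r > 0.
    Returned as ((lo1, hi1), (lo2, hi2))."""
    return ((alpha_k * (theta1 - r), alpha_k * (theta1 + r)),
            ((theta2 - r) / alpha_k, (theta2 + r) / alpha_k))


def volume(b):
    (lo1, hi1), (lo2, hi2) = b
    return (hi1 - lo1) * (hi2 - lo2)


theta1, theta2, r = 1.0, 1.0, 0.5
alpha = 2.0 * (abs(theta1) + r) / (abs(theta1) - r)   # equals 6 for these values

boxes = [box(theta1, theta2, r, alpha ** k) for k in range(6)]
volumes = [volume(b) for b in boxes]
# Gap along the first coordinate between consecutive boxes: positive means disjoint.
gaps = [boxes[k + 1][0][0] - boxes[k][0][1] for k in range(5)]
print(volumes, gaps, sum(volumes))
```

Each additional rescaled box contributes the same area, so the total area of the region where the loss is unchanged grows without bound as more boxes are added.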
The non-Euclidean geometry of the parameter space, coupled with the manifolds of observationally equal behavior of the model, allows one to move from one region of the parameter space to another, changing the curvature of the model without actually changing the function. This approach has been used with success to improve optimization, by moving from a region of high curvature to a region of well behaved curvature (e.g. Desjardins et al., 2015; Salimans & Kingma, 2016). In this section we look at two widely used measures of the Hessian, the spectral radius and trace, showing that either of these values can be manipulated without actually changing the behavior of the function. If the flatness of a minimum is defined by any of these quantities, then it could also be easily manipulated.
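The gradient and Hessian of the loss $L$ with respect to $\theta$ can be modified by $T_\alpha$. Indeed, since
$$L(\theta_1, \theta_2) = L(\alpha \theta_1, \alpha^{-1} \theta_2),$$
we have then by differentiation
$$(\nabla L)(\theta_1, \theta_2) = (\nabla L)(\alpha \theta_1, \alpha^{-1} \theta_2) \begin{bmatrix} \alpha I_{n_1} & 0 \\ 0 & \alpha^{-1} I_{n_2} \end{bmatrix}$$
$$\Leftrightarrow (\nabla L)(\alpha \theta_1, \alpha^{-1} \theta_2) = (\nabla L)(\theta_1, \theta_2) \begin{bmatrix} \alpha^{-1} I_{n_1} & 0 \\ 0 & \alpha I_{n_2} \end{bmatrix}$$
and
$$(\nabla^2 L)(\alpha \theta_1, \alpha^{-1} \theta_2) = \begin{bmatrix} \alpha^{-1} I_{n_1} & 0 \\ 0 & \alpha I_{n_2} \end{bmatrix} (\nabla^2 L)(\theta_1, \theta_2) \begin{bmatrix} \alpha^{-1} I_{n_1} & 0 \\ 0 & \alpha I_{n_2} \end{bmatrix}.$$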
Through these transformations we can easily find, for any critical point which is a minimum with non-zero Hessian, an observationally equivalent parameter whose Hessian has an arbitrarily large spectral norm.
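For a one-hidden layer rectified neural network of the form
$$y = \phi_{rect}(x \cdot \theta_1) \cdot \theta_2,$$
and critical point $\theta = (\theta_1, \theta_2)$ being a minimum for $L$, such that $(\nabla^2 L)(\theta) \neq 0$, $\forall M > 0,\ \exists \alpha > 0,\ \|(\nabla^2 L)(T_\alpha(\theta))\|_2 \geq M$, where $\|(\nabla^2 L)(T_\alpha(\theta))\|_2$ is the spectral norm of $(\nabla^2 L)(T_\alpha(\theta))$.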
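The trace of a symmetric matrix is the sum of its eigenvalues and a real symmetric matrix can be diagonalized in $\mathbb{R}$, therefore if the Hessian is non-zero, there is one non-zero positive diagonal element. Without loss of generality, we will assume that this non-zero element of value $\gamma > 0$ corresponds to an element in $\theta_1$. Therefore the Frobenius norm $\|(\nabla^2 L)(T_\alpha(\theta))\|_F$ of
$$(\nabla^2 L)(\alpha \theta_1, \alpha^{-1} \theta_2) = \begin{bmatrix} \alpha^{-1} I_{n_1} & 0 \\ 0 & \alpha I_{n_2} \end{bmatrix} (\nabla^2 L)(\theta_1, \theta_2) \begin{bmatrix} \alpha^{-1} I_{n_1} & 0 \\ 0 & \alpha I_{n_2} \end{bmatrix}$$
is lower bounded by $\alpha^{-2} \gamma$.
Since all norms are equivalent in finite dimension, there exists a constant $r > 0$ such that $r \|A\|_F \leq \|A\|_2$ for all symmetric matrices $A$. So by picking $\alpha < \sqrt{\frac{r \gamma}{M}}$, we are guaranteed that $\|(\nabla^2 L)(T_\alpha(\theta))\|_2 \geq M$. ∎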
Any minimum with non-zero Hessian will be observationally equivalent to a minimum whose Hessian has an arbitrarily large spectral norm. Therefore for any minimum in the loss function, if there exists another minimum that generalizes better then there exists another minimum that generalizes better and is also sharper according to the spectral norm of the Hessian. The spectral norm of critical points’ Hessian becomes as a result less relevant as a measure of potential generalization error. Moreover, since the spectral norm lower bounds the trace for a positive semi-definite symmetric matrix, the same conclusion can be drawn for the trace.
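A small numerical experiment makes this manipulation of the Hessian tangible. The sketch below uses a toy network with one scalar weight per layer and a single training example, estimates the Hessian of the loss by central finite differences at several rescaled copies of the same minimum, and reports its spectral norm. The loss, the training example and the helper names are assumptions made for illustration; for this toy loss the Hessian at the rescaled minimum $(\alpha, 1/\alpha)$ works out to $[[\alpha^{-2}, 1], [1, \alpha^{2}]]$, whose spectral norm is $\alpha^{2} + \alpha^{-2}$.

```python
import math


def loss(t1, t2):
    """Squared error of y = relu(x * t1) * t2 on the single example (x, target) = (1, 1)."""
    return 0.5 * (max(t1, 0.0) * t2 - 1.0) ** 2


def hessian(f, t1, t2, h=1e-4):
    """Central finite-difference Hessian entries (f11, f12, f22) of a two-parameter function."""
    f11 = (f(t1 + h, t2) - 2.0 * f(t1, t2) + f(t1 - h, t2)) / h ** 2
    f22 = (f(t1, t2 + h) - 2.0 * f(t1, t2) + f(t1, t2 - h)) / h ** 2
    f12 = (f(t1 + h, t2 + h) - f(t1 + h, t2 - h)
           - f(t1 - h, t2 + h) + f(t1 - h, t2 - h)) / (4.0 * h ** 2)
    return f11, f12, f22


def spectral_norm(f11, f12, f22):
    """Largest absolute eigenvalue of the symmetric matrix [[f11, f12], [f12, f22]]."""
    mean, diff = 0.5 * (f11 + f22), 0.5 * (f11 - f22)
    root = math.sqrt(diff ** 2 + f12 ** 2)
    return max(abs(mean + root), abs(mean - root))


norms = {}
for alpha in (1.0, 0.5, 0.1):
    t1, t2 = alpha * 1.0, 1.0 / alpha   # rescaled copy of the minimum (1, 1)
    assert loss(t1, t2) < 1e-20         # still a zero-loss minimum
    norms[alpha] = spectral_norm(*hessian(loss, t1, t2))
print(norms)
```

All three parameter settings implement exactly the same prediction function, yet the measured spectral norm grows from about 2 to about 100 as the scaling factor shrinks from 1 to 0.1.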
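However, some notions of sharpness might take into account the entire eigenspectrum of the Hessian as opposed to its largest eigenvalue; for instance, Chaudhari et al. (2017) describe the notion of wide valleys, allowing the presence of very few large eigenvalues. We can generalize the transformations between observationally equivalent parameters to deeper neural networks with $K - 1$ hidden layers: for $\alpha_k > 0$, $T_\alpha : (\theta_k)_{k \leq K} \mapsto (\alpha_k \theta_k)_{k \leq K}$ with $\prod_{k=1}^{K} \alpha_k = 1$. If we define
$$D_\alpha = \begin{bmatrix} \alpha_1^{-1} I_{n_1} & 0 & \cdots & 0 \\ 0 & \alpha_2^{-1} I_{n_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \alpha_K^{-1} I_{n_K} \end{bmatrix},$$
then the first and second derivatives at $T_\alpha(\theta)$ will be
$$(\nabla L)(T_\alpha(\theta)) = (\nabla L)(\theta) D_\alpha,$$
$$(\nabla^2 L)(T_\alpha(\theta)) = D_\alpha (\nabla^2 L)(\theta) D_\alpha.$$
We will show to which extent one can increase several eigenvalues of $(\nabla^2 L)(T_\alpha(\theta))$ by varying $\alpha$.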
For each $n \times n$ matrix $A$, we define the vector $\lambda(A)$ of sorted singular values of $A$ with their multiplicity, $\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_n(A)$.
If $A$ is symmetric positive semi-definite, $\lambda(A)$ is also the vector of its sorted eigenvalues.
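For a $(K - 1)$-hidden layer rectified neural network of the form
$$y = \phi_{rect}\big(\phi_{rect}(\cdots \phi_{rect}(x \cdot \theta_1) \cdots) \cdot \theta_{K-1}\big) \cdot \theta_K,$$
and critical point $\theta = (\theta_k)_{k \leq K}$ being a minimum for $L$, such that $(\nabla^2 L)(\theta)$ has rank $r = \mathrm{rank}\big((\nabla^2 L)(\theta)\big)$, $\forall M > 0,\ \exists \alpha > 0$ such that $\big(r - \min_{k \leq K}(n_k)\big)$ eigenvalues of $(\nabla^2 L)(T_\alpha(\theta))$ are greater than $M$.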
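For simplicity, we will note $\sqrt{A}$ the principal square root of a symmetric positive semi-definite matrix $A$. The eigenvalues of $\sqrt{A}$ are the square roots of the eigenvalues of $A$ and are its singular values. By definition, the singular values of $\sqrt{(\nabla^2 L)(\theta)}\, D_\alpha$ are the square roots of the eigenvalues of $D_\alpha (\nabla^2 L)(\theta) D_\alpha$. Without loss of generality, we consider $\min_{k \leq K}(n_k) = n_K$ and choose $\forall k < K,\ \alpha_k = \beta^{-1}$ and $\alpha_K = \beta^{K-1}$. Since $D_\alpha$ and $\sqrt{(\nabla^2 L)(\theta)}$ are positive symmetric semi-definite matrices, we can apply the multiplicative Horn inequalities (Klyachko, 2000) on singular values of the product $\sqrt{(\nabla^2 L)(\theta)}\, D_\alpha$:
$$\forall i \leq n,\ j \leq (n - n_K),\quad \lambda_{i+j-n}\big((\nabla^2 L)(T_\alpha(\theta))\big) \geq \lambda_i\big((\nabla^2 L)(\theta)\big) \beta^2.$$
By choosing $\beta > \sqrt{\frac{M}{\lambda_r((\nabla^2 L)(\theta))}}$, since we have $\forall i \leq r,\ \lambda_i\big((\nabla^2 L)(\theta)\big) \geq \lambda_r\big((\nabla^2 L)(\theta)\big) > 0$, we can conclude that
$$\forall i \leq (r - n_K),\quad \lambda_i\big((\nabla^2 L)(T_\alpha(\theta))\big) \geq \lambda_{i + n_K}\big((\nabla^2 L)(\theta)\big) \beta^2 \geq \lambda_r\big((\nabla^2 L)(\theta)\big) \beta^2 > M.$$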
It means that there exists an observationally equivalent parameter with at least $\big(r - \min_{k \leq K}(n_k)\big)$ arbitrarily large eigenvalues. Since Sagun et al. (2016) seem to suggest that rank deficiency in the Hessian is due to over-parametrization of the model, one could conjecture that $\big(r - \min_{k \leq K}(n_k)\big)$ can be high for thin and deep neural networks, resulting in a majority of large eigenvalues. Therefore, it would still be possible to obtain an equivalent parameter with large Hessian eigenvalues, i.e. sharp in multiple directions.
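We have redefined, for $\epsilon > 0$, the $\epsilon$-sharpness of Keskar et al. (2017) as follows:
$$\frac{\max_{\theta' \in B_2(\epsilon, \theta)} \big(L(\theta') - L(\theta)\big)}{1 + L(\theta)},$$
where $B_2(\epsilon, \theta)$ is the Euclidean ball of radius $\epsilon$ centered on $\theta$. This modification will demonstrate more clearly the issues of that metric as a measure of probable generalization. If we use $K = 2$ and $(\theta_1, \theta_2)$ corresponding to a non-constant function, i.e. $\theta_1 \neq 0$ and $\theta_2 \neq 0$, then we can define $\alpha = \frac{\epsilon}{\|\theta_1\|_2}$. We will now consider the observationally equivalent parameter $T_\alpha(\theta_1, \theta_2) = \big(\epsilon \frac{\theta_1}{\|\theta_1\|_2}, \alpha^{-1} \theta_2\big)$. Given that $\|\alpha \theta_1\|_2 = \epsilon$, we have that $(0, \alpha^{-1} \theta_2) \in B_2(\epsilon, T_\alpha(\theta))$, making the maximum loss in this neighborhood at least as high as the best constant-valued function, incurring relatively high sharpness. Figure 4 provides a visualization of the proof.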
For rectified neural networks, every minimum is observationally equivalent to a minimum that generalizes as well but with high $\epsilon$-sharpness. This also applies when using the full-space $\epsilon$-sharpness used by Keskar et al. (2017). We can prove this similarly using the equivalence of norms in finite dimensional vector spaces and the fact that for $c > 0$, $\epsilon > 0$, $\epsilon \leq \epsilon (c + 1)$ (see Keskar et al. (2017)). We have not been able to show a similar problem with the random subspace $\epsilon$-sharpness used by Keskar et al. (2017), i.e. a restriction of the maximization to a random subspace, which could relate to the notion of wide valleys described by Chaudhari et al. (2017).
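The construction above can be reproduced on a toy problem. The sketch below takes a network with one scalar weight per layer that fits a small data set exactly, rescales it so that the first weight has norm equal to the ball radius, and compares the ball-based sharpness before and after rescaling using a brute-force polar grid. The data set, the radius, and the helper names are hypothetical choices; the grid search is only a rough estimate of the maximum over the ball.

```python
import math

data = [(0.5, 1.0), (1.0, 2.0), (2.0, 4.0)]   # hypothetical (x, target) pairs with target = 2x


def loss(t1, t2):
    """Mean squared error of y = relu(x * t1) * t2 over the toy data set."""
    return sum((max(x * t1, 0.0) * t2 - y) ** 2 for x, y in data) / len(data)


def eps_sharpness(t1, t2, eps, n_angles=720, n_radii=20):
    """Polar-grid estimate of the max over the Euclidean ball of radius eps of
    (L - L(theta)), divided by (1 + L(theta))."""
    base = loss(t1, t2)
    best = 0.0
    for i in range(1, n_radii + 1):
        r = eps * i / n_radii
        for j in range(n_angles):
            a = 2.0 * math.pi * j / n_angles
            best = max(best, loss(t1 + r * math.cos(a), t2 + r * math.sin(a)) - base)
    return best / (1.0 + base)


eps = 0.1
theta = (1.0, 2.0)                                    # zero-loss minimum
alpha = eps / abs(theta[0])
theta_sharp = (alpha * theta[0], theta[1] / alpha)    # same function, first weight of norm eps

original = eps_sharpness(theta[0], theta[1], eps)
rescaled = eps_sharpness(theta_sharp[0], theta_sharp[1], eps)
zero_function_loss = loss(0.0, theta_sharp[1])        # loss of the constant-zero predictor
print(original, rescaled, zero_function_loss)
```

After rescaling, the ball contains a parameter whose first weight is zero, so the estimated sharpness is at least the loss of the constant-zero predictor, far above the value measured at the original parameter.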
By exploiting the non-Euclidean geometry and non-identifiability of rectified neural networks, we were able to demonstrate some of the limits of using typical definitions of a minimum’s flatness as the core explanation for generalization.
In Section 4 we explored the case of a fixed parametrization, that of deep rectifier models. In this section we demonstrate a simple observation. If we are allowed to change the parametrization of some function $f$, we can obtain arbitrarily different geometries without affecting how the function evaluates on unseen data. The same holds for reparametrization of the input space. The implication is that the correlation between the geometry of the parameter space (and hence the error surface) and the behavior of a given function is meaningless if not preconditioned on the specific parametrization of the model.
One thing that needs to be considered when relating flatness of minima to their probable generalization is that the choice of parametrization and its associated geometry are arbitrary. Since we are interested in finding a prediction function in a given family of functions, no reparametrization of this family should influence generalization of any of these functions. Given a bijection $g$ onto $\Theta$, we can define the new transformed parameter $\eta = g^{-1}(\theta)$. Since $\theta$ and $\eta$ represent in different spaces the same prediction function, they should generalize equally well.
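Let us call $L_\eta = L \circ g$ the loss function with respect to the new parameter $\eta$. We generalize the derivation of Subsection 4.2:
$$L_\eta(\eta) = L(g(\eta))$$
$$\Rightarrow (\nabla L_\eta)(\eta) = (\nabla L)(g(\eta)) \, (\nabla g)(\eta)$$
$$\Rightarrow (\nabla^2 L_\eta)(\eta) = (\nabla g)(\eta)^T (\nabla^2 L)(g(\eta)) \, (\nabla g)(\eta) + (\nabla L)(g(\eta)) \, (\nabla^2 g)(\eta).$$
At a differentiable critical point, we have by definition $(\nabla L)(g(\eta)) = 0$, therefore the transformed Hessian at a critical point becomes
$$(\nabla^2 L_\eta)(\eta) = (\nabla g)(\eta)^T (\nabla^2 L)(g(\eta)) \, (\nabla g)(\eta).$$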
This means that by reparametrizing the problem we can modify to a large extent the geometry of the loss function so as to have sharp minima of $L$ in $\theta$ correspond to flat minima of $L_\eta$ in $\eta = g^{-1}(\theta)$ and conversely. Figure 5 illustrates that point in one dimension. Several practical (Dinh et al., 2014; Rezende & Mohamed, 2015; Kingma et al., 2016; Dinh et al., 2016) and theoretical works (Hyvärinen & Pajunen, 1999) show how powerful bijections can be. We can also note that the formula for the transformed Hessian at a critical point also applies if $g$ is not invertible; $g$ would just need to be surjective over $\Theta$ in order to cover exactly the same family of prediction functions $\{f_\theta,\ \theta \in \Theta\} = \{f_{g(\eta)},\ \eta \in g^{-1}(\Theta)\}$.
We show in the appendix bijections that allow us to perturb the relative flatness between a finite number of minima.
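The effect of a reparametrization on curvature at a minimum can be checked in one dimension. In the sketch below, a quadratic loss with curvature 2 at its minimum is composed with a smooth bijection whose derivative at the preimage of the minimum equals a chosen constant, and the curvature of the composed loss is estimated by finite differences. The specific bijection (a shifted, scaled hyperbolic sine), the loss and the helper names are illustrative assumptions; the expected curvature at the minimum is the original curvature multiplied by the squared derivative of the bijection.

```python
import math


def second_derivative(f, x, h=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2


loss = lambda theta: (theta - 1.0) ** 2   # minimum at theta = 1, curvature 2


def reparametrize(c):
    """Bijection g(eta) = 1 + c * sinh(eta), so that g(0) = 1 and g'(0) = c."""
    return lambda eta: 1.0 + c * math.sinh(eta)


curvatures = {}
for c in (0.1, 1.0, 10.0):
    g = reparametrize(c)
    loss_eta = lambda eta, g=g: loss(g(eta))   # loss expressed in the new parameter
    curvatures[c] = second_derivative(loss_eta, 0.0)
print(curvatures)
```

The same minimum of the same function appears with curvature of roughly 0.02, 2 or 200 depending only on the chosen parametrization.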
Instances of commonly used reparametrization are batch normalization (Ioffe & Szegedy, 2015), or the virtual batch normalization variant (Salimans et al., 2016), and weight normalization (Badrinarayanan et al., 2015; Salimans & Kingma, 2016; Arpit et al., 2016). Im et al. (2016) have plotted how the loss function landscape was affected by batch normalization. However, we will focus on weight normalization reparametrization as the analysis will be simpler, but the intuition with batch normalization will be similar. Weight normalization reparametrizes a nonzero weight $w$ as $w = s \frac{v}{\|v\|_2}$ with the new parameter being the scale $s$ and the unnormalized weight $v \neq 0$.
Since we can observe that $w$ is invariant to scaling of $v$, reasoning similar to Section 3 can be applied with the simpler transformations $T'_\alpha : v \mapsto \alpha v$ for $\alpha > 0$. Moreover, since this transformation is a simpler isotropic scaling, the conclusion that we can draw can be actually more powerful with respect to $v$:
every minimum has infinite volume $\epsilon$-flatness;
every minimum is observationally equivalent to an infinitely sharp minimum and to an infinitely flat minimum when considering nonzero eigenvalues of the Hessian;
every minimum is observationally equivalent to a minimum with arbitrarily low full-space and random subspace $\epsilon$-sharpness and a minimum with high full-space $\epsilon$-sharpness.
This further weakens the link between the flatness of a minimum and the generalization property of the associated prediction function when a specific parameter space has not been specified and explained beforehand.
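The scale invariance of weight normalization, and its effect on curvature with respect to the unnormalized weight, can be seen on a two-dimensional toy problem. The sketch below fixes the scale parameter, rescales the unnormalized weight by several positive factors, and estimates the curvature of the loss along a direction orthogonal to that weight. The squared-distance loss, the target vector, and the helper names are illustrative assumptions; for this setup the curvature is expected to scale as the inverse square of the factor.

```python
import math


def weight(s, v):
    """Weight normalization: w = s * v / ||v||_2."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [s * vi / norm for vi in v]


def loss_v(v, s=5.0, target=(3.0, 4.0)):
    """Hypothetical loss ||w(s, v) - target||^2, viewed as a function of v with s fixed."""
    w = weight(s, v)
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def curvature_along(f, v, u, h=1e-4):
    """Second directional derivative of f at v along the unit vector u (central differences)."""
    plus = [vi + h * ui for vi, ui in zip(v, u)]
    minus = [vi - h * ui for vi, ui in zip(v, u)]
    return (f(plus) - 2.0 * f(v) + f(minus)) / h ** 2


v_min = [3.0, 4.0]      # any positive multiple of the target minimizes the loss in v
tangent = [-0.8, 0.6]   # unit vector orthogonal to v_min

curvatures = {}
for alpha in (0.1, 1.0, 10.0):
    scaled = [alpha * vi for vi in v_min]
    assert loss_v(scaled) < 1e-20   # same normalized weight, same (zero) loss
    curvatures[alpha] = curvature_along(loss_v, scaled, tangent)
print(curvatures)
```

Shrinking the unnormalized weight makes the same minimum look much sharper, and growing it makes it look much flatter, while the effective weight and hence the prediction function stay fixed.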
As we conclude that the notion of flatness for a minimum in the loss function by itself is not sufficient to determine its generalization ability in the general case, we can choose to focus instead on properties of the prediction function. Motivated by some work in adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2015) for deep neural networks, one could decide on its generalization property by analyzing the gradient of the prediction function on examples. Intuitively, if the gradient is small on typical points from the distribution or has a small Lipschitz constant, then a small change in the input should not incur a large change in the prediction.
But this infinitesimal reasoning is once again very dependent on the local geometry of the input space. For an invertible preprocessing $\xi^{-1}$, e.g. feature standardization, whitening or gaussianization (Chen & Gopinath, 2001), we will call $f_\xi = f \circ \xi$ the prediction function on the preprocessed input $u = \xi^{-1}(x)$. We can reproduce the derivation in Section 5 to obtain
$$\frac{\partial f_\xi}{\partial u^T} = \frac{\partial f}{\partial x^T} \frac{\partial \xi}{\partial u^T}.$$
As we can significantly alter the relative magnitude of the gradient at each point, analyzing the amplitude of the gradient of the prediction function might prove problematic if the choice of the input space has not been explained beforehand. This remark applies in applications involving images, sound or other signals with invariances (Larsen et al., 2015). For example, Theis et al. (2016) show for images how a small drift of one to four pixels can incur a large difference in terms of $\mathcal{L}_2$ norm.
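The dependence of input gradients on the choice of input space can be shown with a linear predictor and a per-feature rescaling. In the sketch below the same predictor is differentiated once with respect to the raw input and once with respect to a rescaled input; the weights `w`, the scales `sigma`, the evaluation point and the helper `grad_fd` are hypothetical values chosen for illustration.

```python
def grad_fd(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar function of a list."""
    g = []
    for i in range(len(x)):
        up = x[:i] + [x[i] + h] + x[i + 1:]
        down = x[:i] + [x[i] - h] + x[i + 1:]
        g.append((f(up) - f(down)) / (2.0 * h))
    return g


w = [2.0, -1.0]
f = lambda x: w[0] * x[0] + w[1] * x[1]              # hypothetical linear predictor

sigma = [100.0, 0.01]                                 # hypothetical per-feature scales
xi = lambda u: [sigma[0] * u[0], sigma[1] * u[1]]     # maps the rescaled input u back to x
f_xi = lambda u: f(xi(u))                             # same predictor, seen from u-space

x = [50.0, 0.02]
u = [x[0] / sigma[0], x[1] / sigma[1]]

grad_x = grad_fd(f, x)      # gradient with respect to the raw input
grad_u = grad_fd(f_xi, u)   # gradient with respect to the rescaled input
print(grad_x, grad_u)
```

The predictions are identical at corresponding points, but the gradient components change by the per-feature scale, so which feature appears most sensitive depends on the chosen input representation.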
It has been observed empirically that minima found by standard deep learning algorithms that generalize well tend to be flatter than minima that do not generalize well (Chaudhari et al., 2017; Keskar et al., 2017). However, when following several definitions of flatness, we have shown that the conclusion that flat minima should generalize better than sharp ones cannot be applied as is without further context. Previously used definitions fail to account for the complex geometry of some commonly used deep architectures. In particular, the non-identifiability of the model induced by symmetries allows one to alter the flatness of a minimum without affecting the function it represents. Additionally, the whole geometry of the error surface with respect to the parameters can be changed arbitrarily under different parametrizations. In the spirit of Swirszcz et al. (2016), our work indicates that more care is needed to define flatness to avoid degeneracies of the geometry of the model under study. Also, such a concept cannot be divorced from the particular parametrization of the model or input space.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1724–1734. ACL, 2014. ISBN 978-1-937284-96-1. URL http://aclweb.org/anthology/D/D14/D14-1179.pdf.
Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649. IEEE, 2013.
Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034, 2015.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Proceedings of the sixth annual conference on Computational learning theory, pp. 5–13. ACM, 1993.
Nonlinear independent component analysis: Existence and uniqueness results. Neural Networks, 12(3):429–439, 1999.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.