Beyond their remarkable representation and memorization ability, deep neural networks empirically perform well in out-of-sample prediction. This intriguing out-of-sample generalization property poses two fundamental theoretical questions:
What are the complexity notions that control the generalization aspects of neural networks?
Why does stochastic gradient descent, or other variants, find parameters with small complexity?
In this paper we approach the generalization question for deep neural networks from a geometric invariance vantage point. The motivation behind invariance is twofold: (1) The specific parametrization of the neural network is arbitrary and should not impact its generalization power. As pointed out in (Neyshabur et al., 2015a), for example, there are many continuous operations on the parameters of ReLU nets that will result in exactly the same prediction, and thus generalization can only depend on the equivalence class obtained by identifying parameters under these transformations. (2) Although flatness of the loss function has been linked to generalization (Hochreiter and Schmidhuber, 1997), existing definitions of flatness are invariant neither to nodewise re-scalings of ReLU nets nor to general coordinate transformations of the parameter space (Dinh et al., 2017), which calls into question their utility for describing generalization.
It is thus natural to argue for a purely geometric characterization of generalization that is invariant under the aforementioned transformations and additionally resolves the conflict between flat minima and the requirement of invariance. Information geometry is concerned with the study of geometric invariances arising in the space of probability distributions, so we will leverage it to motivate a particular geometric notion of complexity: the Fisher-Rao norm. From an algorithmic point of view the steepest descent induced by this geometry is precisely the natural gradient (Amari, 1998). From the generalization viewpoint, the Fisher-Rao norm naturally incorporates distributional aspects of the data and harmoniously unites elements of flatness and norm which have been argued to be crucial for explaining generalization (Neyshabur et al., 2017).
Statistical learning theory equips us with many tools to analyze out-of-sample performance. The Vapnik-Chervonenkis dimension is one possible complexity notion, yet it may be too large to explain generalization in over-parametrized models, since it scales with the size (dimension) of the network. In contrast, under additional distributional assumptions of a margin, Perceptron (a one-layer network) enjoys a dimension-free error guarantee, with an $\ell_2$ norm playing the role of “capacity”. These observations (going back to the 1960s) have led to the theory of large-margin classifiers, applied to kernel methods, boosting, and neural networks (Anthony and Bartlett, 1999). In particular, the analysis of Koltchinskii and Panchenko (2002) combines the empirical margin distribution (quantifying how well the data can be separated) and the Rademacher complexity of a restricted subset of functions. This in turn raises the capacity control question: what is a good notion of the restrictive subset of parameter space for neural networks? Norm-based capacity control provides a possible answer and is being actively studied for deep networks (Krogh and Hertz, 1992; Neyshabur et al., 2015b, a; Bartlett et al., 2017; Neyshabur et al., 2017), yet the invariances are not always reflected in these capacity notions. In general, it is very difficult to answer the question of which capacity measure is superior. Nevertheless, we will show that our proposed Fisher-Rao norm serves as an umbrella for the previously considered norm-based capacity measures, and it appears to shed light on possible answers to the above question.
Much of the difficulty in analyzing neural networks stems from their unwieldy recursive definition interleaved with nonlinear maps. In analyzing the Fisher-Rao norm, we proved an identity for the partial derivatives of the neural network that appears to open the door to some of the geometric analysis. In particular, we prove that any stationary point of the empirical objective with hinge loss that perfectly separates the data must also have a large margin. Such an automatic large-margin property of stationary points may link the algorithmic facet of the problem with the generalization property. The same identity gives us a handle on the Fisher-Rao norm and allows us to prove a number of facts about it. Since we expect that the identity may be useful in deep network analysis, we start by stating this result and its implications in the next section. In Section 3 we introduce the Fisher-Rao norm and establish through norm-comparison inequalities that it serves as an umbrella for existing norm-based measures of capacity. Using these norm-comparison inequalities we bound the generalization error of various geometrically distinct subsets of the Fisher-Rao ball and provide a rigorous proof of generalization for deep linear networks. Extensive numerical experiments are performed in Section 5 demonstrating the superior properties of the Fisher-Rao norm.
2 Geometry of Deep Rectified Networks
The function class realized by the feedforward neural network architecture of depth with coordinate-wise activation functions is defined as the set of functions ( and ) (Footnote 1: It is possible to generalize the above architecture to include linear pre-processing operations such as zero-padding and average pooling.) with
where the parameter vector () and
For simplicity of calculations, we have set all bias terms to zero. (Footnote 2: In practice, we found that setting the bias to zero does not significantly impact results on image classification tasks such as MNIST and CIFAR-10.) We also assume throughout the paper that
for all the activation functions, which includes ReLU, “leaky” ReLU, and linear activations as special cases.
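The named special cases share a simple property that can be checked directly. The sketch below assumes that the condition referred to above is the identity σ(z) = σ'(z)·z (an assumption made here for illustration, consistent with ReLU, leaky ReLU, and linear maps being special cases); the function names and the leaky slope `alpha` are hypothetical.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention: the derivative at z = 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.1):
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.1):
    return 1.0 if z > 0 else alpha

def linear(z):
    return z

def linear_grad(z):
    return 1.0

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

def satisfies_identity(act, grad, points, tol=1e-12):
    """Return True if act(z) == grad(z) * z at every test point."""
    return all(abs(act(z) - grad(z) * z) <= tol for z in points)

points = [-3.0, -0.5, 0.0, 0.7, 2.0]
results = {
    "relu": satisfies_identity(relu, relu_grad, points),
    "leaky_relu": satisfies_identity(leaky_relu, leaky_relu_grad, points),
    "linear": satisfies_identity(linear, linear_grad, points),
    # tanh is included only as a contrast: it is not piecewise linear through 0.
    "tanh": satisfies_identity(math.tanh, tanh_grad, points),
}
print(results)
```

Under this assumption, the piecewise-linear activations pass the check while a saturating activation such as tanh does not.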
To make the exposition of the structural results concise, we define the following intermediate functions in the definition (2.1). The output value of the -th layer hidden node is denoted as , and the corresponding input value as , with . By definition, , and the final output . For any , the subscript denotes the -th coordinate of the vector.
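For readers who want a concrete notation to hold on to, one way to write the architecture and the intermediate quantities described above is sketched below. All symbols are assumptions introduced for illustration ($L$ for the depth, $W^t$ for the weight matrix between layers $t$ and $t+1$, $O^t$ and $N^t$ for the output and input values of layer $t$); they are not a transcription of definition (2.1).

```latex
\begin{aligned}
f_\theta(x) &= \sigma_{L+1}\bigl(\sigma_L(\cdots \sigma_2(\sigma_1(x^\top W^0)\, W^1)\, W^2 \cdots)\, W^L\bigr), \\
O^0(x) &= x^\top, \qquad N^{t+1}(x) = O^t(x)\, W^t, \qquad O^{t}(x) = \sigma_{t}\bigl(N^{t}(x)\bigr), \qquad O^{L+1}(x) = f_\theta(x).
\end{aligned}
```

In this notation there are $L+1$ weight matrices, and the absence of bias terms makes $f_\theta$ a composition of linear maps and coordinate-wise activations only.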
Given a loss function , the statistical learning problem can be phrased as optimizing the unobserved population loss:
based on i.i.d. samples
from the unknown joint distribution. The unregularized empirical objective function is denoted by
We first establish the following structural result for neural networks. It will be clear in the later sections that the lemma is motivated by the study of the Fisher-Rao norm, formally defined in Eqn. (3.1) below, and information geometry. For the moment, however, let us provide a different viewpoint. For linear functions, we clearly have that . Remarkably, a direct analogue of this simple statement holds for neural networks, even if over-parametrized.
Lemma 2.1 (Structure in Gradient).
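A numerical sanity check may help build intuition for this kind of gradient structure. The sketch below assumes the scalar-output identity takes the form ⟨∇_θ f_θ(x), θ⟩ = (L+1)·f_θ(x), where L+1 counts the weight matrices; this is Euler's theorem for a function that is positively homogeneous of degree L+1 in θ, which holds for bias-free ReLU networks. The helper names and layer sizes are hypothetical, and the output unit is taken to be linear.

```python
import random

def forward(weights, x):
    """Bias-free ReLU network with a linear output layer; returns a list."""
    out = list(x)
    for t, W in enumerate(weights):
        # Row vector times matrix: out has length len(W), result has length len(W[0]).
        out = [sum(out[i] * W[i][j] for i in range(len(out))) for j in range(len(W[0]))]
        if t < len(weights) - 1:
            out = [max(0.0, z) for z in out]
    return out

def scale(weights, c):
    """Multiply every weight by the scalar c."""
    return [[[c * w for w in row] for row in W] for W in weights]

def derivative_along_theta(weights, x, h=1e-4):
    """Central difference of c -> f_{c * theta}(x) at c = 1, i.e. <grad f, theta>."""
    up = forward(scale(weights, 1.0 + h), x)
    down = forward(scale(weights, 1.0 - h), x)
    return [(a - b) / (2.0 * h) for a, b in zip(up, down)]

random.seed(0)
sizes = [4, 5, 3, 1]  # input dimension, two hidden layers, one output unit
weights = [
    [[random.gauss(0.0, 1.0) for _ in range(sizes[t + 1])] for _ in range(sizes[t])]
    for t in range(len(sizes) - 1)
]
x = [random.gauss(0.0, 1.0) for _ in range(sizes[0])]

lhs = derivative_along_theta(weights, x)[0]
rhs = len(weights) * forward(weights, x)[0]  # (L + 1) * f_theta(x)
print(lhs, rhs)
```

Positive rescaling of all weights leaves every activation pattern unchanged, so the finite difference here does not cross any kink of the ReLU.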
Lemma 2.1 reveals the structural constraints in the gradients of rectified networks. In particular, even though the gradients lie in an over-parametrized high-dimensional space, many equality constraints are induced by the network architecture. Before we unveil the surprising connection between Lemma 2.1 and the proposed Fisher-Rao norm, let us take a look at a few immediate corollaries of this result. The first corollary establishes a large-margin property of stationary points that separate the data.
Corollary 2.1 (Large Margin Stationary Points).
Consider the binary classification problem with , and a neural network where the output layer has only one unit. Choose the hinge loss . If a certain parameter satisfies two properties
is a stationary point for in the sense ;
separates the data in the sense that for all ,
then it must be that is a large margin solution: for all ,
The same result holds for the population criteria , in which case is stated as , and the conclusion is .
Observe that if , and if . Using Eqn. (2.6) when the output layer has only one unit, we find
For a stationary point , we have , which implies the LHS of the above equation is 0. Now recall that the second condition that separates the data implies for any point in the data set. In this case, the RHS equals zero if and only if . ∎
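The argument can be written out explicitly as a sketch. We assume here that the scalar-output form of Eqn. (2.6) reads $\langle \nabla_\theta \ell, \theta \rangle = (L+1)\, \frac{\partial \ell}{\partial f_\theta}\, f_\theta$, with $L+1$ the number of weight layers, and we write $\hat{L}$ for the empirical objective over $N$ samples; these symbols and the constant are assumptions made for illustration.

```latex
\begin{aligned}
\ell(f, y) &= \max\{0,\, 1 - y f\}, \qquad
\frac{\partial \ell}{\partial f} =
\begin{cases} -y, & y f < 1, \\ 0, & y f > 1, \end{cases} \\
\langle \nabla_\theta \hat{L}(\theta), \theta \rangle
&= \frac{L+1}{N} \sum_{i=1}^{N} \frac{\partial \ell(f_\theta(x_i), y_i)}{\partial f_\theta}\, f_\theta(x_i)
= -\frac{L+1}{N} \sum_{i:\; y_i f_\theta(x_i) < 1} y_i f_\theta(x_i).
\end{aligned}
```

At a stationary point the left-hand side vanishes. If every $y_i f_\theta(x_i)$ is strictly positive, each summand on the right is strictly positive, so the index set must be empty, that is, $y_i f_\theta(x_i) \ge 1$ for all $i$.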
Granted, the above corollary can be proved from first principles without the use of Lemma 2.1, but the proof reveals a quantitative statement about stationary points along arbitrary directions .
In the second corollary, we consider linear networks.
Corollary 2.2 (Stationary Points for Deep Linear Networks).
Consider linear neural networks with and square loss function. Then all stationary points that satisfy
must also satisfy
where , and are the data matrices.
The proof follows from applying Lemma 2.1
which means . ∎
This simple corollary is not quite asserting that all stationary points are global optima, since global optima satisfy , while we only proved that the stationary points satisfy .
3 Fisher-Rao Norm and Geometry
In this section, we propose a new notion of complexity of neural networks that can be motivated by geometrical invariance considerations, specifically the Fisher-Rao metric of information geometry. We postpone this motivation to Section 3.3 and instead start with the definition and some properties. A detailed comparison with the known norm-based capacity measures, together with the generalization results, is deferred to Section 4.
3.1 An analytical formula
The Fisher-Rao norm for a parameter is defined as the following quadratic form
The underlying distribution for the expectation in the above definition has been left ambiguous because it will be useful to specialize to different distributions depending on the context. Even though we call the above quantity the “Fisher-Rao norm,” it should be noted that it does not satisfy the triangle inequality. The following Theorem unveils a surprising identity for the Fisher-Rao norm.
Theorem 3.1 (Fisher-Rao norm).
The proof of the Theorem relies mainly on the geometric Lemma 2.1 that describes the gradient structure of multi-layer rectified networks.
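The relationship between the quadratic-form definition and a closed-form expression can be probed numerically. The sketch below makes two assumptions for illustration: that the Fisher-Rao norm is E⟨θ, ∇_θ ℓ⟩² (the quadratic form of the expected outer product of loss gradients), and that the closed form for a single output unit is (L+1)²·E[(∂ℓ/∂f · f)²]. It uses the squared loss, an empirical average over a few random samples, and hypothetical helper names.

```python
import random

def forward(weights, x):
    """Scalar output of a bias-free ReLU network with a linear output unit."""
    out = list(x)
    for t, W in enumerate(weights):
        out = [sum(out[i] * W[i][j] for i in range(len(out))) for j in range(len(W[0]))]
        if t < len(weights) - 1:
            out = [max(0.0, z) for z in out]
    return out[0]

def loss(weights, x, y):
    return 0.5 * (forward(weights, x) - y) ** 2

def theta_dot_grad(weights, x, y, h=1e-5):
    """<theta, grad_theta loss>, with the gradient taken by central differences."""
    total = 0.0
    for W in weights:
        for row in W:
            for j, w in enumerate(row):
                row[j] = w + h
                up = loss(weights, x, y)
                row[j] = w - h
                down = loss(weights, x, y)
                row[j] = w  # restore the original weight
                total += w * (up - down) / (2.0 * h)
    return total

random.seed(1)
sizes = [3, 4, 1]
weights = [
    [[random.gauss(0.0, 1.0) for _ in range(sizes[t + 1])] for _ in range(sizes[t])]
    for t in range(len(sizes) - 1)
]
data = [
    ([random.gauss(0.0, 1.0) for _ in range(sizes[0])], random.choice((-1.0, 1.0)))
    for _ in range(6)
]

n_layers = len(weights)  # plays the role of L + 1
fr_from_definition = sum(theta_dot_grad(weights, x, y) ** 2 for x, y in data) / len(data)
fr_from_formula = n_layers ** 2 * sum(
    ((forward(weights, x) - y) * forward(weights, x)) ** 2 for x, y in data
) / len(data)
print(fr_from_definition, fr_from_formula)
```

The second expression needs only the network outputs and the derivative of the loss with respect to the output, which is what makes a closed form of this type convenient to work with.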
In the case when the output layer has only one node, Theorem 3.1 reduces to the simple formula
Proof of Theorem 3.1.
Using the definition of the Fisher-Rao norm,
By Lemma 2.1,
Combining the above equalities, we obtain
Before illustrating how the explicit formula in Theorem 3.1 can be viewed as a unified “umbrella” for many of the known norm-based capacity measures, let us point out one simple invariance property of the Fisher-Rao norm, which follows as a direct consequence of Thm. 3.1. This property is not satisfied by the $\ell_2$ norm, spectral norm, path norm, or group norm.
Corollary 3.1 (Invariance).
If there are two parameters such that they are equivalent, in the sense that , then their Fisher-Rao norms are equal, i.e.,
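A nodewise rescaling gives a concrete instance of two equivalent parameters. In the sketch below, the incoming weights of one hidden node are multiplied by c > 0 and its outgoing weights divided by c; the network function is unchanged, so any quantity that depends on θ only through f_θ and the loss is unchanged, while the squared ℓ2 norm of the weights is not. The closed form used for the Fisher-Rao norm, (L+1)²·E[(∂ℓ/∂f · f)²] with squared loss, is an assumption made for illustration, as are the helper names.

```python
import random

def forward(weights, x):
    """Scalar output of a bias-free ReLU network with a linear output unit."""
    out = list(x)
    for t, W in enumerate(weights):
        out = [sum(out[i] * W[i][j] for i in range(len(out))) for j in range(len(W[0]))]
        if t < len(weights) - 1:
            out = [max(0.0, z) for z in out]
    return out[0]

def rescale_node(weights, layer, node, c):
    """Scale column `node` of weights[layer] by c and row `node` of weights[layer + 1] by 1/c."""
    new = [[row[:] for row in W] for W in weights]
    for row in new[layer]:
        row[node] *= c
    new[layer + 1][node] = [w / c for w in new[layer + 1][node]]
    return new

def squared_l2(weights):
    return sum(w * w for W in weights for row in W for w in row)

def fisher_rao_sq(weights, data):
    """Assumed closed form with squared loss: (L+1)^2 * mean(((f - y) * f)^2)."""
    n_layers = len(weights)
    return n_layers ** 2 * sum(
        ((forward(weights, x) - y) * forward(weights, x)) ** 2 for x, y in data
    ) / len(data)

random.seed(2)
sizes = [3, 4, 1]
theta = [
    [[random.gauss(0.0, 1.0) for _ in range(sizes[t + 1])] for _ in range(sizes[t])]
    for t in range(len(sizes) - 1)
]
data = [
    ([random.gauss(0.0, 1.0) for _ in range(sizes[0])], random.choice((-1.0, 1.0)))
    for _ in range(5)
]

theta_rescaled = rescale_node(theta, layer=0, node=1, c=10.0)
same_function = all(
    abs(forward(theta, x) - forward(theta_rescaled, x)) <= 1e-9 for x, _ in data
)
print(same_function)
print(squared_l2(theta), squared_l2(theta_rescaled))
print(fisher_rao_sq(theta, data), fisher_rao_sq(theta_rescaled, data))
```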
3.2 Norms and geometry
In this section we will employ Theorem 3.1 to reveal the relationship among different norms and their corresponding geometries. Norm-based capacity control is an active field of research for understanding why deep learning generalizes well, including $\ell_2$ norm (weight decay) in (Krogh and Hertz, 1992; Krizhevsky et al., 2012), path norm in (Neyshabur et al., 2015a), group-norm in (Neyshabur et al., 2015b), and spectral norm in (Bartlett et al., 2017). All these norms are closely related to the Fisher-Rao norm, despite the fact that they capture distinct inductive biases and different geometries.
For simplicity, we will showcase the derivation with the absolute loss function and when the output layer has only one node. The argument can be readily adapted to the general setting. We will show that the Fisher-Rao norm serves as a lower bound for all the norms considered in the literature, with some pre-factor whose meaning will be clear in Section 4.1. In addition, the Fisher-Rao norm enjoys an interesting umbrella property: by considering a more constrained geometry (motivated from algebraic norm comparison inequalities) the Fisher-Rao norm motivates new norm-based capacity control methods.
The main theorem we will prove is informally stated as follows.
Theorem 3.2 (Norm comparison, informal).
The detailed proof of the above theorem will be the main focus of Section 4.1. Here we will give a sketch on how the results are proved.
Lemma 3.1 (Matrix form).
where , for . In addition, is a diagonal matrix with diagonal elements being either or .
Proof of Lemma 3.1.
Since , we have . Proof is completed via induction. ∎
For the absolute loss, one has and therefore Theorem 3.1 simplifies to,
where . The norm comparison results are thus established through a careful decomposition of the data-dependent vector , in distinct ways according to the comparing norm/geometry.
3.3 Motivation and invariance
In this section, we will provide the original intuition and motivation for our proposed Fisher-Rao norm from the viewpoint of geometric invariance.
Information geometry and the Fisher-Rao metric
Information geometry provides a window into geometric invariances when we adopt a generative framework where the data generating process belongs to the parametric family indexed by the parameters of the neural network architecture. The Fisher-Rao metric on is defined in terms of a local inner product for each value of as follows. For each define the corresponding tangent vectors , . Then for all and we define the local inner product
where . The above inner product extends to a Riemannian metric on the space of positive densities called the Fisher-Rao metric. (Footnote 3: Bauer et al. (2016) showed that it is essentially the unique metric that is invariant under the diffeomorphism group of .) The relationship between the Fisher-Rao metric and the Fisher information matrix in statistics literature follows from the identity,
Notice that the Fisher information matrix induces a semi-inner product, unlike the Fisher-Rao metric, which is non-degenerate. (Footnote 4: The null space of is mapped to the origin under .) If we make the additional modeling assumption that then the Fisher information becomes,
If we now identify our loss function as then the Fisher-Rao metric coincides with the Fisher-Rao norm when . In fact, our Fisher-Rao norm encompasses the Fisher-Rao metric and generalizes it to the case when the model is misspecified.
Having identified the geometric origin of Fisher-Rao norm, let us study the implications for generalization of flat minima. Dinh et al. (2017) argued by way of counter-example that the existing measures of flatness are inadequate for explaining the generalization capability of multi-layer neural networks. Specifically, by utilizing the invariance property of multi-layer rectified networks under non-negative nodewise rescalings, they proved that the Hessian eigenvalues of the loss function can be made arbitrarily large, thereby weakening the connection between flat minima and generalization. They also identified a more general problem which afflicts Hessian-based measures of generalization for any network architecture and activation function: the Hessian is sensitive to network parametrization whereas generalization should be invariant under general coordinate transformations. Our proposal can be motivated from the following fact (Footnote 5: Set and recall the fact that Fisher information can be viewed as variance as well as the curvature), which relates flatness to geometry (under appropriate regularity conditions)
In other words, the Fisher-Rao norm evades the node-wise rescaling issue because it is exactly invariant under linear re-parametrizations. The Fisher-Rao norm moreover possesses an “infinitesimal invariance” property under non-linear coordinate transformations, which can be seen by passing to the infinitesimal form where non-linear coordinate invariance is realized exactly by the following infinitesimal line element,
Comparing with the above line element reveals the geometric interpretation of the Fisher-Rao norm as the approximate geodesic distance from the origin. It is important to realize that our definition of flatness (3.9) differs from that of Dinh et al. (2017), who employed the Hessian loss . Unlike the Fisher-Rao norm, the norm induced by the Hessian loss does not enjoy the infinitesimal invariance property (it only holds at critical points).
There exists a close relationship between the Fisher-Rao norm and the natural gradient. In particular, the natural gradient descent is simply the steepest descent direction induced by the Fisher-Rao geometry of . Indeed, the natural gradient can be expressed as a semi-norm-penalized iterative optimization scheme as follows,
We remark that the positive semi-definite matrix changes with different . We emphasize an “invariance” property of natural gradient under re-parametrization and an “approximate invariance” property under over-parametrization, which is not satisfied for the classic gradient descent. The formal statement and its proof are deferred to Lemma 6.1 in Section 6.2. The invariance property is desirable: in multi-layer ReLU networks, there are many equivalent re-parametrizations of the problem, such as nodewise rescalings, which may slow down the optimization process. The advantage of natural gradient is also illustrated empirically in Section 5.5.
4 Capacity Control and Generalization
In this section, we discuss in full detail the questions of geometry, capacity measures, and generalization. First, let us define empirical Rademacher complexity for the parameter space , conditioned on data , as
are i.i.d. Rademacher random variables.
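To make the quantity concrete, the sketch below estimates an empirical Rademacher complexity by Monte Carlo for a finite set of candidate parameters. It assumes the standard form E_ε sup_θ (1/N) Σ_i ε_i f_θ(X_i), which may differ from the definition above in normalization; the function name and the toy output vectors are hypothetical.

```python
import random

def empirical_rademacher(outputs, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_eps max_k (1/N) sum_i eps_i * outputs[k][i].

    outputs[k][i] holds the scalar prediction of the k-th candidate
    parameter on the i-th sample, so the supremum is over a finite set.
    """
    rng = random.Random(seed)
    n = len(outputs[0])
    total = 0.0
    for _ in range(n_draws):
        eps = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(e * o for e, o in zip(eps, row)) / n for row in outputs)
    return total / n_draws

f1 = [0.5, -1.0, 2.0]
f2 = [1.0, 1.0, -0.5]
neg_f1 = [-v for v in f1]

r_small = empirical_rademacher([f1, neg_f1])
r_big = empirical_rademacher([f1, neg_f1, f2])
print(r_small, r_big)
```

Because the same seed produces the same sign draws, enlarging the candidate set can only increase the estimate, which mirrors the monotonicity of the complexity in the size of the parameter set.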
4.1 Norm Comparison
Let us collect some definitions before stating each norm comparison result. For a vector , the vector norm is denoted , . For a matrix , denotes the spectral norm; denotes the matrix induced norm, for ; denotes the matrix group norm, for .
4.1.1 Spectral norm.
Definition 3 (Spectral norm).
Define the following “spectral norm” ball:
We have the following norm comparison Lemma.
Lemma 4.1 (Spectral norm).
Spectral norm as a capacity control has been considered in (Bartlett et al., 2017). Lemma 4.1 shows that spectral norm serves as a more stringent constraint than Fisher-Rao norm. Let us provide an explanation of the pre-factor here. Define the set of parameters induced by the Fisher-Rao norm geometry
From Lemma 4.1, if the expectation is over the empirical measure , then, because , we obtain
From Theorem 1.1 in (Bartlett et al., 2017), we know that a subset of the characterized by the spectral norm enjoys the following upper bound on Rademacher complexity under mild conditions: for any
Plugging in , we have,
Interestingly, the additional factor in Theorem 1.1 in (Bartlett et al., 2017) exactly cancels with our pre-factor in the norm comparison. The above calculations show that a subset of , induced by the spectral norm geometry, has good generalization error.
4.1.2 Group norm.
Definition 4 (Group norm).
Define the following “group norm” ball, for
where . Here denotes the matrix induced norm.
Lemma 4.2 (Group norm).
It holds that
Group norm as a capacity measure has been considered in (Neyshabur et al., 2015b). Lemma 4.2 shows that group norm serves as a more stringent constraint than Fisher-Rao norm. Again, let us provide an explanation of the pre-factor here.
Note that for all
From Lemma 4.2, if the expectation is over the empirical measure , we know that in the case when for all ,
By Theorem 1 in (Neyshabur et al., 2015b), we know that a subset of (different from the subset induced by spectral geometry) characterized by the group norm, satisfies the following upper bound on the Rademacher complexity, for any
Plugging in , we have
Once again, we point out that the intriguing combinatorial factor in Theorem 1 of Neyshabur et al. (2015b) exactly cancels with our pre-factor in the norm comparison. The above calculations show that another subset of , induced by the group norm geometry, has good generalization error (without additional factors).
4.1.3 Path norm.
Definition 5 (Path norm).
Define the following “path norm” ball, for
where , with index set . Here is a notation for all the paths (from input to output) of the weights .
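A path norm of this kind can be computed without enumerating paths. The sketch below assumes the path-q norm is the q-th root of the sum, over all input-to-output paths, of the product of |weight|^q along the path (our reading of the definition above, stated as an assumption); it compares a layer-by-layer recursion with brute-force enumeration. The function names and the toy weights are hypothetical.

```python
import itertools

def path_norm(weights, q=1.0):
    """Layer-by-layer recursion: acc[j] sums prod |w|^q over all paths ending at node j."""
    acc = [1.0] * len(weights[0])
    for W in weights:
        acc = [
            sum(acc[i] * abs(W[i][j]) ** q for i in range(len(W)))
            for j in range(len(W[0]))
        ]
    return sum(acc) ** (1.0 / q)

def path_norm_bruteforce(weights, q=1.0):
    """Explicit enumeration of every index tuple (i_0, i_1, ..., i_{L+1})."""
    dims = [len(weights[0])] + [len(W[0]) for W in weights]
    total = 0.0
    for idx in itertools.product(*(range(d) for d in dims)):
        prod = 1.0
        for t, W in enumerate(weights):
            prod *= abs(W[idx[t]][idx[t + 1]]) ** q
        total += prod
    return total ** (1.0 / q)

# One input, two hidden nodes, one output: two paths with products 1*3 and 2*4.
toy = [[[1.0, -2.0]], [[3.0], [4.0]]]
print(path_norm(toy, q=1.0))
print(path_norm(toy, q=2.0))
```

The recursion costs one pass over the weights, whereas the number of paths grows as the product of the layer widths.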
Lemma 4.3 (Path- norm).
The following inequality holds for any ,
Path norm has been investigated in (Neyshabur et al., 2015a), where the definition is
Again, let us provide an intuitive explanation for our pre-factor
here for the case . Due to Lemma 4.3, when the expectation is over empirical measure,
By Corollary 7 in (Neyshabur et al., 2015b), we know that for any , the Rademacher complexity of path- norm ball satisfies
Plugging in , we find that the subset of Fisher-Rao norm ball induced by path- norm geometry, satisfies
Once again, the additional factor appearing in the Rademacher complexity bound in (Neyshabur et al., 2015b), cancels with our pre-factor in the norm comparison.
4.1.4 Matrix induced norm.
Definition 6 (Induced norm).
Define the following “matrix induced norm” ball, for , as
Lemma 4.4 (Matrix induced norm).
For any , the following inequality holds
Remark that may contain dependence on when . This motivates us to consider the following generalization of matrix induced norm, where the norm for each can be different.
Definition 7 (Chain of induced norm).
Define the following “chain of induced norm” ball, for a chain of