1 Introduction
Beyond their remarkable representation and memorization ability, deep neural networks empirically perform well in out-of-sample prediction. This intriguing out-of-sample generalization property poses two fundamental theoretical questions:

What are the complexity notions that control the generalization aspects of neural networks?

Why does stochastic gradient descent, or other variants, find parameters with small complexity?
In this paper we approach the generalization question for deep neural networks from a geometric invariance vantage point. The motivation behind invariance is twofold: (1) The specific parametrization of the neural network is arbitrary and should not impact its generalization power. As pointed out in (Neyshabur et al., 2015a), for example, there are many continuous operations on the parameters of ReLU nets that will result in exactly the same prediction, and thus generalization can only depend on the equivalence class obtained by identifying parameters under these transformations. (2) Although flatness of the loss function has been linked to generalization (Hochreiter and Schmidhuber, 1997), existing definitions of flatness are invariant neither to node-wise rescalings of ReLU nets nor to general coordinate transformations (Dinh et al., 2017) of the parameter space, which calls into question their utility for describing generalization.
It is thus natural to argue for a purely geometric characterization of generalization that is invariant under the aforementioned transformations and additionally resolves the conflict between flat minima and the requirement of invariance. Information geometry is concerned with the study of geometric invariances arising in the space of probability distributions, so we will leverage it to motivate a particular geometric notion of complexity: the Fisher-Rao norm. From an algorithmic point of view, the steepest descent induced by this geometry is precisely the natural gradient (Amari, 1998). From the generalization viewpoint, the Fisher-Rao norm naturally incorporates distributional aspects of the data and harmoniously unites elements of flatness and norm which have been argued to be crucial for explaining generalization (Neyshabur et al., 2017).
Statistical learning theory equips us with many tools to analyze out-of-sample performance. The Vapnik-Chervonenkis dimension is one possible complexity notion, yet it may be too large to explain generalization in overparametrized models, since it scales with the size (dimension) of the network. In contrast, under additional distributional assumptions of a margin, the Perceptron (a one-layer network) enjoys a dimension-free error guarantee, with an $\ell_2$ norm playing the role of “capacity”. These observations (going back to the 1960s) have led to the theory of large-margin classifiers, applied to kernel methods, boosting, and neural networks (Anthony and Bartlett, 1999). In particular, the analysis of Koltchinskii and Panchenko (2002) combines the empirical margin distribution (quantifying how well the data can be separated) and the Rademacher complexity of a restricted subset of functions. This in turn raises the capacity control question: what is a good notion of the restrictive subset of parameter space for neural networks? Norm-based capacity control provides a possible answer and is being actively studied for deep networks (Krogh and Hertz, 1992; Neyshabur et al., 2015b, a; Bartlett et al., 2017; Neyshabur et al., 2017), yet the invariances are not always reflected in these capacity notions. In general, it is very difficult to answer the question of which capacity measure is superior. Nevertheless, we will show that our proposed Fisher-Rao norm serves as an umbrella for the previously considered norm-based capacity measures, and it appears to shed light on possible answers to the above question.
Much of the difficulty in analyzing neural networks stems from their unwieldy recursive definition interleaved with nonlinear maps. In analyzing the Fisher-Rao norm, we prove an identity for the partial derivatives of the neural network that appears to open the door to some of the geometric analysis. In particular, we prove that any stationary point of the empirical objective with hinge loss that perfectly separates the data must also have a large margin. Such an automatic large-margin property of stationary points may link the algorithmic facet of the problem with the generalization property. The same identity gives us a handle on the Fisher-Rao norm and allows us to prove a number of facts about it. Since we expect that the identity may be useful in deep network analysis, we start by stating this result and its implications in the next section.
In Section 3 we introduce the Fisher-Rao norm and establish through norm-comparison inequalities that it serves as an umbrella for existing norm-based measures of capacity. Using these norm-comparison inequalities we bound the generalization error of various geometrically distinct subsets of the Fisher-Rao ball and provide a rigorous proof of generalization for deep linear networks. Extensive numerical experiments are performed in Section 5 demonstrating the superior properties of the Fisher-Rao norm.
2 Geometry of Deep Rectified Networks
Definition 1.
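The function class $\mathcal{H}_L$ realized by the feedforward neural network architecture of depth $L$ with coordinate-wise activation functions $\sigma_l : \mathbb{R} \to \mathbb{R}$ is defined as the set of functions $f_\theta : \mathcal{X} \to \mathcal{Y}$ ($\mathcal{X} \subseteq \mathbb{R}^p$ and $\mathcal{Y} \subseteq \mathbb{R}^K$) with
$$f_\theta(x) = \sigma_{L+1}(\sigma_L(\ldots \sigma_2(\sigma_1(x^T W^0) W^1) W^2 \ldots) W^L), \tag{2.1}$$
where the parameter vector $\theta \in \Theta_L \subseteq \mathbb{R}^d$ ($d = p k_1 + \sum_{i=1}^{L-1} k_i k_{i+1} + k_L K$) and
$$\Theta_L = \{ W^0 \in \mathbb{R}^{p \times k_1}, W^1 \in \mathbb{R}^{k_1 \times k_2}, \ldots, W^{L-1} \in \mathbb{R}^{k_{L-1} \times k_L}, W^L \in \mathbb{R}^{k_L \times K} \}.$$
(Footnote 1: It is possible to generalize the above architecture to include linear pre-processing operations such as zero-padding and average pooling.)
For simplicity of calculations, we have set all bias terms to zero. (Footnote 2: In practice, we found that setting the bias to zero does not significantly impact results on image classification tasks such as MNIST and CIFAR-10.) We also assume throughout the paper that
$$\sigma(z) = \sigma'(z)\, z \tag{2.2}$$
for all the activation functions, which includes ReLU $\sigma(z) = \max\{0, z\}$, “leaky” ReLU $\sigma(z) = \max\{\alpha z, z\}$, and linear activations as special cases.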
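To make the exposition of the structural results concise, we define the following intermediate functions in the definition (2.1). The output value of the $t$-th layer hidden node is denoted as $O^t(x) \in \mathbb{R}^{k_t}$, and the corresponding input value as $N^t(x) \in \mathbb{R}^{k_t}$, with $O^t(x) = \sigma_t(N^t(x))$. By definition, $O^0(x) = x^T \in \mathbb{R}^p$, and the final output $O^{L+1}(x) = f_\theta(x) \in \mathbb{R}^K$. For any $N^t_i, O^t_i$, the subscript $i$ denotes the $i$-th coordinate of the vector.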
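Given a loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$, the statistical learning problem can be phrased as optimizing the unobserved population loss:
$$L(\theta) := \mathbb{E}_{(X,Y) \sim \mathbb{P}}\, \ell(f_\theta(X), Y), \tag{2.3}$$
based on i.i.d. samples $\{(X_i, Y_i)\}_{i=1}^{N}$ from the unknown joint distribution $\mathbb{P}$. The unregularized empirical objective function is denoted by
$$\hat{L}(\theta) := \hat{\mathbb{E}}\, \ell(f_\theta(X), Y) = \frac{1}{N} \sum_{i=1}^{N} \ell(f_\theta(X_i), Y_i). \tag{2.4}$$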
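We first establish the following structural result for neural networks. It will be clear in the later sections that the lemma is motivated by the study of the Fisher-Rao norm, formally defined in Eqn. (3.1) below, and information geometry. For the moment, however, let us provide a different viewpoint. For linear functions $f_W(x) = \langle W, x \rangle$, we clearly have that $\langle \partial f_W / \partial W, W \rangle = f_W(x)$. Remarkably, a direct analogue of this simple statement holds for neural networks, even if overparametrized.
Lemma 2.1 (Structure in Gradient).
Given a single data input $x \in \mathbb{R}^p$, consider the feedforward neural network in Definition 1 with activations satisfying (2.2). Then for any $0 \le t \le s \le L$, one has the identity
$$\sum_{i \in [k_t],\, j \in [k_{t+1}]} \frac{\partial O^{s+1}}{\partial W^t_{ij}} W^t_{ij} = O^{s+1}(x). \tag{2.5}$$
In addition, it holds that
$$\sum_{t=0}^{L} \sum_{i \in [k_t],\, j \in [k_{t+1}]} \frac{\partial O^{L+1}}{\partial W^t_{ij}} W^t_{ij} = (L+1)\, O^{L+1}(x). \tag{2.6}$$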
Lemma 2.1 reveals the structural constraints in the gradients of rectified networks. In particular, even though the gradients lie in an overparametrized high-dimensional space, many equality constraints are induced by the network architecture. Before we unveil the surprising connection between Lemma 2.1 and the proposed Fisher-Rao norm, let us take a look at a few immediate corollaries of this result. The first corollary establishes a large-margin property of stationary points that separate the data.
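The gradient structure discussed above can be checked numerically. The following is a minimal sketch, assuming a bias-free ReLU network with a linear scalar output; the layer widths, random seed, and helper names (`relu_net`, `grad_dot_theta`) are hypothetical choices made for illustration. It estimates the inner product of the gradient with the parameter vector as a directional derivative along the parameters, and compares it with the number of weight matrices times the network output.

```python
import numpy as np

def relu_net(weights, x):
    """Bias-free network: ReLU on hidden layers, linear output layer."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)
    return (h @ weights[-1]).item()   # scalar output

def grad_dot_theta(weights, x, eps=1e-6):
    """Central-difference estimate of <grad_theta f, theta>.

    The inner product of the gradient with theta is the directional
    derivative of f along theta, i.e. d/da f(a * theta) at a = 1.
    """
    up = [W * (1.0 + eps) for W in weights]
    down = [W * (1.0 - eps) for W in weights]
    return (relu_net(up, x) - relu_net(down, x)) / (2.0 * eps)

rng = np.random.default_rng(0)
sizes = [5, 7, 6, 4, 1]   # hypothetical widths: 5 inputs, 3 hidden layers, 1 output
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(5)

lhs = grad_dot_theta(weights, x)
rhs = len(weights) * relu_net(weights, x)   # (number of weight matrices) * f(x)
print(lhs, rhs)
```

The two printed values should agree up to finite-difference error. The check works because scaling every weight matrix by a positive constant leaves the ReLU activation pattern unchanged, so the output is positively homogeneous in the parameters.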
Corollary 2.1 (Large Margin Stationary Points).
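Consider the binary classification problem with $\mathcal{Y} = \{-1, +1\}$, and a neural network where the output layer has only one unit. Choose the hinge loss $\ell(f, y) = \max\{0, 1 - yf\}$. If a certain parameter $\theta$ satisfies two properties

$\theta$ is a stationary point for $\hat{L}(\theta)$ in the sense $\nabla_\theta \hat{L}(\theta) = 0$;

$\theta$ separates the data in the sense that $Y_i f_\theta(X_i) > 0$ for all $i \in [N]$,
then it must be that $\theta$ is a large margin solution: for all $i \in [N]$,
$$Y_i f_\theta(X_i) \ge 1.$$
The same result holds for the population criteria $L(\theta)$, in which case the second property is stated as $\mathbb{P}(Y f_\theta(X) > 0) = 1$, and the conclusion is $\mathbb{P}(Y f_\theta(X) \ge 1) = 1$.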
Proof.
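Observe that $\partial \ell(f, y) / \partial f = -y$ if $yf < 1$, and $\partial \ell(f, y) / \partial f = 0$ if $yf \ge 1$. Using Eqn. (2.6) when the output layer has only one unit, we find
$$\langle \nabla_\theta \hat{L}(\theta), \theta \rangle = (L+1)\, \hat{\mathbb{E}}\left[ \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)} f_\theta(X) \right] = (L+1)\, \hat{\mathbb{E}}\left[ -Y f_\theta(X)\, 1_{Y f_\theta(X) < 1} \right].$$
For a stationary point $\theta$, we have $\nabla_\theta \hat{L}(\theta) = 0$, which implies the LHS of the above equation is 0. Now recall that the second condition, that $\theta$ separates the data, implies $-Y f_\theta(X) < 0$ for any point in the data set. In this case, the RHS equals zero if and only if $Y f_\theta(X) \ge 1$. ∎
Granted, the above corollary can be proved from first principles without the use of Lemma 2.1, but the proof reveals a quantitative statement about stationary points along arbitrary directions $\theta$.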
In the second corollary, we consider linear networks.
Corollary 2.2 (Stationary Points for Deep Linear Networks).
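Consider linear neural networks with $\sigma(x) = x$ and square loss function. Then all stationary points $\theta = \{W^0, W^1, \ldots, W^L\}$ that satisfy
$$\nabla_\theta \hat{L}(\theta) = \nabla_\theta \hat{\mathbb{E}}\left[ \tfrac{1}{2} (f_\theta(X) - Y)^2 \right] = 0$$
must also satisfy
$$\langle w(\theta),\, X^T X w(\theta) - X^T Y \rangle = 0,$$
where $w(\theta) = \prod_{t=0}^{L} W^t \in \mathbb{R}^p$, and $X \in \mathbb{R}^{N \times p}$, $Y \in \mathbb{R}^N$ are the data matrices.
Proof.
For a linear network, $f_\theta(x) = x^T w(\theta)$, so Eqn. (2.6) gives $0 = \langle \nabla_\theta \hat{L}(\theta), \theta \rangle = (L+1)\, \hat{\mathbb{E}}[(f_\theta(X) - Y) f_\theta(X)] = \frac{L+1}{N} \langle w(\theta),\, X^T X w(\theta) - X^T Y \rangle$. ∎
Remark 2.1.
This simple Lemma is not quite asserting that all stationary points are global optima, since global optima satisfy $X^T X w(\theta) - X^T Y = 0$, while we only proved that the stationary points satisfy $\langle w(\theta),\, X^T X w(\theta) - X^T Y \rangle = 0$.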
3 Fisher-Rao Norm and Geometry
In this section, we propose a new notion of complexity of neural networks that can be motivated by geometrical invariance considerations, specifically the Fisher-Rao metric of information geometry. We postpone this motivation to Section 3.3 and instead start with the definition and some properties. Detailed comparison with the known norm-based capacity measures and generalization results are delayed to Section 4.
3.1 An analytical formula
Definition 2.
The Fisher-Rao norm for a parameter $\theta$ is defined as the following quadratic form
$$\|\theta\|_{\mathrm{fr}}^2 := \langle \theta, I(\theta)\, \theta \rangle, \quad \text{where } I(\theta) = \mathbb{E}\left[ \nabla_\theta \ell(f_\theta(X), Y) \otimes \nabla_\theta \ell(f_\theta(X), Y) \right]. \tag{3.1}$$
The underlying distribution for the expectation in the above definition has been left ambiguous because it will be useful to specialize to different distributions depending on the context. Even though we call the above quantity the “Fisher-Rao norm,” it should be noted that it does not satisfy the triangle inequality. The following Theorem unveils a surprising identity for the Fisher-Rao norm.
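Theorem 3.1 (Fisher-Rao norm).
Assume the loss function $\ell(\cdot, \cdot)$ is smooth in the first argument. The following identity holds for a feedforward neural network (Definition 1) with $L$ hidden layers and activations satisfying (2.2):
$$\|\theta\|_{\mathrm{fr}}^2 = (L+1)^2\, \mathbb{E}\left[ \left\langle \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)}, f_\theta(X) \right\rangle^2 \right]. \tag{3.2}$$
The proof of the Theorem relies mainly on the geometric Lemma 2.1 that describes the gradient structure of multilayer rectified networks.
Remark 3.1.
In the case when the output layer has only one node, Theorem 3.1 reduces to the simple formula
$$\|\theta\|_{\mathrm{fr}}^2 = (L+1)^2\, \mathbb{E}\left[ \left( \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)} \right)^2 f_\theta(X)^2 \right]. \tag{3.3}$$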
Proof of Theorem 3.1.
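Using the definition of the Fisher-Rao norm,
$$\|\theta\|_{\mathrm{fr}}^2 = \mathbb{E}\left[ \langle \theta, \nabla_\theta \ell(f_\theta(X), Y) \rangle^2 \right].$$
By Lemma 2.1,
$$\langle \theta, \nabla_\theta \ell(f_\theta(X), Y) \rangle = \left\langle \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)}, \sum_{t=0}^{L} \sum_{i,j} \frac{\partial f_\theta(X)}{\partial W^t_{ij}} W^t_{ij} \right\rangle = (L+1) \left\langle \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)}, f_\theta(X) \right\rangle.$$
Combining the above equalities, we obtain
$$\|\theta\|_{\mathrm{fr}}^2 = (L+1)^2\, \mathbb{E}\left[ \left\langle \frac{\partial \ell(f_\theta(X), Y)}{\partial f_\theta(X)}, f_\theta(X) \right\rangle^2 \right].$$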
∎
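The identity behind Theorem 3.1 can be sanity-checked on a small example. The sketch below assumes that the quadratic form is built from the empirical mean of per-sample outer products of loss gradients, and uses the squared loss $\frac{1}{2}(f - y)^2$ with a bias-free ReLU network and a single linear output; the widths, the data, and the helper `output_and_grad` are hypothetical. It forms the matrix explicitly through backpropagation and compares the quadratic form against the closed-form expression that only involves the network output.

```python
import numpy as np

def output_and_grad(weights, x):
    """Scalar output f(x) of a bias-free ReLU net and its flattened weight gradient."""
    acts, masks, h = [x], [], x
    for W in weights[:-1]:
        z = h @ W
        masks.append((z > 0).astype(float))
        h = np.maximum(z, 0.0)
        acts.append(h)
    f = (h @ weights[-1]).item()
    grads, delta = [None] * len(weights), np.ones(1)   # delta = df/d(pre-activation)
    for t in range(len(weights) - 1, -1, -1):
        grads[t] = np.outer(acts[t], delta)
        if t > 0:
            delta = (weights[t] @ delta) * masks[t - 1]
    return f, np.concatenate([g.ravel() for g in grads])

rng = np.random.default_rng(1)
sizes = [3, 4, 3, 1]   # hypothetical widths
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
theta = np.concatenate([W.ravel() for W in weights])
X, Y = rng.standard_normal((20, 3)), rng.standard_normal(20)

fisher = np.zeros((theta.size, theta.size))
formula_terms = []
for x, y in zip(X, Y):
    f, g = output_and_grad(weights, x)
    g_loss = (f - y) * g                       # gradient of 0.5 * (f - y)^2
    fisher += np.outer(g_loss, g_loss) / len(X)
    formula_terms.append(((f - y) * f) ** 2)   # (dl/df * f)^2

fr_quadratic = theta @ fisher @ theta
fr_formula = len(weights) ** 2 * np.mean(formula_terms)
print(fr_quadratic, fr_formula)
```

Here `len(weights)` plays the role of $L+1$. The closed-form side needs only forward passes, which is what makes the formula convenient in practice.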
Before illustrating how the explicit formula in Theorem 3.1 can be viewed as a unified “umbrella” for many of the known norm-based capacity measures, let us point out one simple invariance property of the Fisher-Rao norm, which follows as a direct consequence of Thm. 3.1. This property is not satisfied for the $\ell_2$ norm, spectral norm, path norm, or group norm.
Corollary 3.1 (Invariance).
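If there are two parameters $\theta_1, \theta_2 \in \Theta_L$ such that they are equivalent, in the sense that $f_{\theta_1} = f_{\theta_2}$, then their Fisher-Rao norms are equal, i.e., $\|\theta_1\|_{\mathrm{fr}} = \|\theta_2\|_{\mathrm{fr}}$.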
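A node-wise rescaling gives a concrete instance of this invariance. The following sketch assumes a one-hidden-layer bias-free ReLU network with squared loss and takes the Fisher-Rao quadratic form to be the empirical mean of squared inner products between the loss gradient and the parameters; the scale vector `c` and all sizes are hypothetical. It rescales incoming weights of each hidden node by a positive constant and outgoing weights by its inverse, then compares the Fisher-Rao quantity and the Euclidean norm before and after.

```python
import numpy as np

def fisher_rao_sq(W0, w1, X, Y):
    """theta^T I(theta) theta for f(x) = relu(x W0) . w1 with squared loss,
    using explicit per-sample gradients and I(theta) = mean of grad outer grad."""
    total = 0.0
    for x, y in zip(X, Y):
        z = x @ W0
        h = np.maximum(z, 0.0)
        f = h @ w1
        grad_W0 = np.outer(x, (z > 0) * w1)           # df/dW0
        grad_w1 = h                                    # df/dw1
        inner = np.sum(grad_W0 * W0) + grad_w1 @ w1    # <grad f, theta>
        total += ((f - y) * inner) ** 2                # <grad loss, theta>^2
    return total / len(X)

rng = np.random.default_rng(2)
W0, w1 = rng.standard_normal((3, 4)), rng.standard_normal(4)
X, Y = rng.standard_normal((30, 3)), rng.standard_normal(30)

c = np.array([10.0, 0.1, 3.0, 0.5])     # positive per-node scales (hypothetical values)
W0_scaled, w1_scaled = W0 * c, w1 / c   # scale incoming weights, undo on outgoing weights

fr_a = fisher_rao_sq(W0, w1, X, Y)
fr_b = fisher_rao_sq(W0_scaled, w1_scaled, X, Y)
l2_a = np.sqrt(np.sum(W0**2) + np.sum(w1**2))
l2_b = np.sqrt(np.sum(W0_scaled**2) + np.sum(w1_scaled**2))
print(fr_a, fr_b, l2_a, l2_b)
```

The two Fisher-Rao values coincide because the rescaled parameters realize the same function, while the Euclidean norm changes substantially.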
3.2 Norms and geometry
In this section we will employ Theorem 3.1 to reveal the relationship among different norms and their corresponding geometries. Norm-based capacity control is an active field of research for understanding why deep learning generalizes well, including $\ell_2$ norm (weight decay) in (Krogh and Hertz, 1992; Krizhevsky et al., 2012), path norm in (Neyshabur et al., 2015a), group-norm in (Neyshabur et al., 2015b), and spectral norm in (Bartlett et al., 2017). All these norms are closely related to the Fisher-Rao norm, despite the fact that they capture distinct inductive biases and different geometries.
For simplicity, we will showcase the derivation with the absolute loss function and when the output layer has only one node ($K = 1$). The argument can be readily adapted to the general setting. We will show that the Fisher-Rao norm serves as a lower bound for all the norms considered in the literature, with some prefactor whose meaning will be clear in Section 4.1. In addition, the Fisher-Rao norm enjoys an interesting umbrella property: by considering a more constrained geometry (motivated from algebraic norm comparison inequalities) the Fisher-Rao norm motivates new norm-based capacity control methods.
The main theorem we will prove is informally stated as follows.
Theorem 3.2 (Norm comparison, informal).
The detailed proof of the above theorem will be the main focus of Section 4.1. Here we will give a sketch of how the results are proved.
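Lemma 3.1 (Matrix form).
$$f_\theta(x) = x^T W^0 D^1(x) W^1 D^2(x) \cdots D^L(x) W^L D^{L+1}(x), \tag{3.4}$$
where $D^t(x) = \mathrm{diag}[\sigma'(N^t(x))] \in \mathbb{R}^{k_t \times k_t}$, for $0 < t \le L+1$. In addition, $D^t(x)$ is a diagonal matrix with diagonal elements being either $0$ or $1$.
Proof of Lemma 3.1.
Since $O^0(x) W^0 = x^T W^0 = N^1(x)$, we have $N^1(x) D^1(x) = N^1(x)\, \mathrm{diag}(\sigma'(N^1(x))) = O^1(x)$. The proof is completed via induction. ∎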
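The matrix form can be illustrated directly for ReLU activations. This is a minimal sketch, assuming a bias-free network with a linear output layer (so the last diagonal factor is the identity); the widths and seed are hypothetical. It runs an ordinary forward pass while recording each hidden layer's 0/1 activation pattern, then recomputes the output as a plain product of weight matrices interleaved with diagonal 0/1 matrices.

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 6, 5, 2]   # hypothetical widths
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
x = rng.standard_normal(4)

# Standard forward pass, recording the 0/1 activation pattern of each hidden layer.
h, patterns = x, []
for W in weights[:-1]:
    z = h @ W
    patterns.append((z > 0).astype(float))
    h = np.maximum(z, 0.0)
forward_output = h @ weights[-1]

# Same output written as a matrix product with diagonal 0/1 matrices in between.
product = x @ weights[0]
for d, W in zip(patterns, weights[1:]):
    product = product @ np.diag(d) @ W

print(forward_output, product)
```

Once the activation pattern is fixed, the network acts on the input as a linear map, which is the observation the induction in the proof relies on.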
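For the absolute loss, one has $(\partial \ell(f_\theta(X), Y) / \partial f_\theta(X))^2 = 1$ and therefore Theorem 3.1 simplifies to
$$\|\theta\|_{\mathrm{fr}}^2 = (L+1)^2\, \mathbb{E}_{X \sim \mathbb{P}}\left[ v(\theta, X)^T X X^T v(\theta, X) \right], \tag{3.5}$$
where $v(\theta, x) := W^0 D^1(x) W^1 D^2(x) \cdots D^L(x) W^L D^{L+1}(x) \in \mathbb{R}^p$. The norm comparison results are thus established through a careful decomposition of the data-dependent vector $v(\theta, X)$, in distinct ways according to the comparing norm/geometry.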
3.3 Motivation and invariance
In this section, we will provide the original intuition and motivation for our proposed Fisher-Rao norm from the viewpoint of geometric invariance.
Information geometry and the Fisher-Rao metric
Information geometry provides a window into geometric invariances when we adopt a generative framework where the data generating process belongs to the parametric family indexed by the parameters of the neural network architecture. The Fisher-Rao metric on is defined in terms of a local inner product for each value of as follows. For each define the corresponding tangent vectors , . Then for all and we define the local inner product
(3.6) 
where . The above inner product extends to a Riemannian metric on the space of positive densities called the Fisher-Rao metric. (Footnote 3: Bauer et al. (2016) showed that it is essentially the unique metric that is invariant under the diffeomorphism group of .) The relationship between the Fisher-Rao metric and the Fisher information matrix in statistics literature follows from the identity,
(3.7) 
Notice that the Fisher information matrix induces a semi-inner product unlike the Fisher-Rao metric which is non-degenerate. (Footnote 4: The null space of is mapped to the origin under .) If we make the additional modeling assumption that then the Fisher information becomes,
(3.8) 
If we now identify our loss function as then the Fisher-Rao metric coincides with the Fisher-Rao norm when . In fact, our Fisher-Rao norm encompasses the Fisher-Rao metric and generalizes it to the case when the model is misspecified .
Flatness
Having identified the geometric origin of the Fisher-Rao norm, let us study the implications for generalization of flat minima. Dinh et al. (2017) argued by way of counter-example that the existing measures of flatness are inadequate for explaining the generalization capability of multilayer neural networks. Specifically, by utilizing the invariance property of multilayer rectified networks under non-negative node-wise rescalings, they proved that the Hessian eigenvalues of the loss function can be made arbitrarily large, thereby weakening the connection between flat minima and generalization. They also identified a more general problem which afflicts Hessian-based measures of generalization for any network architecture and activation function: the Hessian is sensitive to network parametrization whereas generalization should be invariant under general coordinate transformations. Our proposal can be motivated from the following fact, which relates flatness to geometry (under appropriate regularity conditions). (Footnote 5: Set $\ell(f_\theta(x), y) = -\log p_\theta(y \mid x)$ and recall the fact that Fisher information can be viewed as variance as well as the curvature.)
(3.9)
In other words, the Fisher-Rao norm evades the node-wise rescaling issue because it is exactly invariant under linear reparametrizations. The Fisher-Rao norm moreover possesses an “infinitesimal invariance” property under nonlinear coordinate transformations, which can be seen by passing to the infinitesimal form where nonlinear coordinate invariance is realized exactly by the following infinitesimal line element,
(3.10) 
Comparing with the above line element reveals the geometric interpretation of the Fisher-Rao norm as the approximate geodesic distance from the origin. It is important to realize that our definition of flatness (3.9) differs from that of Dinh et al. (2017), who employed the Hessian loss . Unlike the Fisher-Rao norm, the norm induced by the Hessian loss does not enjoy the infinitesimal invariance property (it only holds at critical points).
Natural gradient
There exists a close relationship between the Fisher-Rao norm and the natural gradient. In particular, natural gradient descent is simply the steepest descent direction induced by the Fisher-Rao geometry of . Indeed, the natural gradient can be expressed as a seminorm-penalized iterative optimization scheme as follows,
(3.11) 
We remark that the positive semidefinite matrix changes with different . We emphasize an “invariance” property of natural gradient under reparametrization and an “approximate invariance” property under overparametrization, which is not satisfied for the classic gradient descent. The formal statement and its proof are deferred to Lemma 6.1 in Section 6.2. The invariance property is desirable: in multilayer ReLU networks, there are many equivalent reparametrizations of the problem, such as node-wise rescalings, which may slow down the optimization process. The advantage of natural gradient is also illustrated empirically in Section 5.5.
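The reparametrization invariance of the natural gradient can be seen in the simplest possible setting. The sketch below is not the scheme of Eqn. (3.11); it assumes the textbook update $\theta \leftarrow \theta - \eta\, I(\theta)^{-1} \nabla \hat{L}(\theta)$ on a linear model with unit-variance Gaussian noise, for which the Fisher matrix is the second-moment matrix of the inputs. The matrix `A`, the step size, and the data are hypothetical. One step is taken in the original coordinates and one in linearly transformed coordinates, and the results are compared after mapping back.

```python
import numpy as np

rng = np.random.default_rng(4)
N, d, eta = 50, 3, 0.1
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)
theta = rng.standard_normal(d)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])      # invertible, non-orthogonal change of coordinates
phi = np.linalg.solve(A, theta)      # theta = A @ phi

def grad_and_fisher(features, params):
    """Gradient of the half mean squared loss and Fisher matrix E[x x^T] of a linear model."""
    residual = features @ params - y
    return features.T @ residual / N, features.T @ features / N

# One step in the theta coordinates.
g, F = grad_and_fisher(X, theta)
theta_natural = theta - eta * np.linalg.solve(F, g)
theta_plain = theta - eta * g

# One step in the phi coordinates (features become X @ A), mapped back through A.
g_phi, F_phi = grad_and_fisher(X @ A, phi)
back_natural = A @ (phi - eta * np.linalg.solve(F_phi, g_phi))
back_plain = A @ (phi - eta * g_phi)

print(np.allclose(theta_natural, back_natural), np.allclose(theta_plain, back_plain))
```

The natural gradient iterates agree in both coordinate systems, whereas plain gradient descent produces different iterates because its update depends on the chosen parametrization.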
4 Capacity Control and Generalization
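In this section, we discuss in full detail the questions of geometry, capacity measures, and generalization. First, let us define the empirical Rademacher complexity for the parameter space $\Theta$, conditioned on data $\{X_i, i \in [N]\}$, as
$$\mathcal{R}_N(\Theta) = \mathbb{E}_\epsilon \sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^{N} \epsilon_i f_\theta(X_i), \tag{4.1}$$
where $\epsilon_i$, $i = 1, 2, \ldots, N$, are i.i.d. Rademacher random variables.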
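To make the definition concrete, the empirical Rademacher complexity can be estimated by Monte Carlo when the supremum is available in closed form. The following sketch assumes the standard definition (expected supremum of the sign-weighted empirical average) and uses a linear class with an $\ell_2$-bounded weight vector rather than a deep network, purely because its supremum is explicit; the radius `r`, sample sizes, and number of draws are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p, r = 50, 3, 2.0                     # sample size, input dimension, l2 radius
X = rng.standard_normal((N, p))

# For the linear class {x -> <w, x> : ||w||_2 <= r}, the supremum over w is explicit:
# sup_w (1/N) sum_i eps_i <w, X_i> = r * ||(1/N) sum_i eps_i X_i||_2.
num_draws = 2000
eps = rng.choice([-1.0, 1.0], size=(num_draws, N))   # i.i.d. Rademacher signs
sups = r * np.linalg.norm(eps @ X / N, axis=1)
rademacher_estimate = sups.mean()

# Classical upper bound from Jensen's inequality: r * sqrt(sum_i ||X_i||^2) / N.
jensen_bound = r * np.sqrt(np.sum(X**2)) / N
print(rademacher_estimate, jensen_bound)
```

For deep networks the supremum has no closed form, which is why the norm-based bounds discussed in this section are needed.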
4.1 Norm Comparison
Let us collect some definitions before stating each norm comparison result. For a vector , the vector norm is denoted , . For a matrix , denotes the spectral norm; denotes the matrix induced norm, for ; denotes the matrix group norm, for .
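For reference, some of these norms can be computed as follows. This is a sketch under an assumed convention: the group norm is taken to be the $\ell_p$ norm of each column followed by the $\ell_q$ norm across columns, which may differ from the convention intended above by a transpose. The helper `group_norm` is hypothetical, and general induced $p \to q$ norms are omitted since they are not computable in closed form for arbitrary $p, q$.

```python
import numpy as np

def group_norm(M, p, q):
    """(sum_j (sum_i |M_ij|^p)^(q/p))^(1/q): l_p over each column, then l_q across columns."""
    col = np.sum(np.abs(M) ** p, axis=0) ** (1.0 / p)
    return np.sum(col ** q) ** (1.0 / q)

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 3))

spectral = np.linalg.norm(M, 2)          # largest singular value
frobenius = group_norm(M, 2, 2)          # group norm with p = q = 2
entrywise_l1 = group_norm(M, 1, 1)       # group norm with p = q = 1
vector_l1 = np.linalg.norm(M[:, 0], 1)   # vector l_p norm of the first column, p = 1
print(spectral, frobenius, entrywise_l1, vector_l1)
```

With $p = q = 2$ the group norm reduces to the Frobenius norm, which always dominates the spectral norm.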
4.1.1 Spectral norm.
Definition 3 (Spectral norm).
Define the following “spectral norm” ball:
(4.2) 
We have the following norm comparison Lemma.
Lemma 4.1 (Spectral norm).
Remark 4.1.
Spectral norm as a capacity control has been considered in (Bartlett et al., 2017). Lemma 4.1 shows that spectral norm serves as a more stringent constraint than Fisher-Rao norm. Let us provide an explanation of the prefactor here. Define the set of parameters induced by the Fisher-Rao norm geometry
From Lemma 4.1, if the expectation is over the empirical measure , then, because , we obtain
From Theorem 1.1 in (Bartlett et al., 2017), we know that a subset of the characterized by the spectral norm enjoys the following upper bound on Rademacher complexity under mild conditions: for any
(4.3) 
Plugging in , we have,
(4.4) 
Interestingly, the additional factor in Theorem 1.1 in (Bartlett et al., 2017) exactly cancels with our prefactor in the norm comparison. The above calculations show that a subset of , induced by the spectral norm geometry, has good generalization error.
4.1.2 Group norm.
Definition 4 (Group norm).
Define the following “group norm” ball, for
(4.5) 
where . Here denote the matrix induced norm.
Lemma 4.2 (Group norm).
It holds that
(4.6) 
Remark 4.2.
Group norm as a capacity measure has been considered in (Neyshabur et al., 2015b). Lemma 4.2 shows that group norm serves as a more stringent constraint than Fisher-Rao norm. Again, let us provide an explanation of the prefactor here.
Note that for all
because
From Lemma 4.2, if the expectation is over the empirical measure , we know that in the case when for all ,
By Theorem 1 in (Neyshabur et al., 2015b), we know that a subset of (different from the subset induced by spectral geometry) characterized by the group norm, satisfies the following upper bound on the Rademacher complexity, for any
(4.7) 
Plugging in , we have
(4.8) 
Once again, we point out that the intriguing combinatorial factor in Theorem 1 of Neyshabur et al. (2015b) exactly cancels with our prefactor in the norm comparison. The above calculations show that another subset of , induced by the group norm geometry, has good generalization error (without additional factors).
4.1.3 Path norm.
Definition 5 (Path norm).
Define the following “path norm” ball, for
(4.9) 
where , indices set . Here is a notation for all the paths (from input to output) of the weights .
Lemma 4.3 (Path norm).
The following inequality holds for any ,
(4.10) 
Remark 4.3.
Path norm has been investigated in (Neyshabur et al., 2015a), where the definition is
Again, let us provide an intuitive explanation for our prefactor
here for the case . Due to Lemma 4.3, when the expectation is over empirical measure,
By Corollary 7 in (Neyshabur et al., 2015b), we know that for any , the Rademacher complexity of path norm ball satisfies
Plugging in , we find that the subset of the Fisher-Rao norm ball induced by path norm geometry satisfies
Once again, the additional factor appearing in the Rademacher complexity bound in (Neyshabur et al., 2015b), cancels with our prefactor in the norm comparison.
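The path norm sums over exponentially many paths, but it can be computed with a single forward pass. The sketch below assumes the path norm is the $\ell_q$ norm of the vector of path products, that is, the sum over all input-to-output paths of the absolute product of weights along the path raised to the power $q$, then the $1/q$-th root; both helper functions and the layer widths are hypothetical. It checks a brute-force enumeration against the matrix-product shortcut.

```python
import itertools
import numpy as np

def path_norm_bruteforce(weights, q):
    """Enumerate every input-to-output path and sum |product of its weights|^q."""
    dims = [weights[0].shape[0]] + [W.shape[1] for W in weights]
    total = 0.0
    for path in itertools.product(*[range(k) for k in dims]):
        prod = 1.0
        for t, W in enumerate(weights):
            prod *= W[path[t], path[t + 1]]
        total += abs(prod) ** q
    return total ** (1.0 / q)

def path_norm_fast(weights, q):
    """Same quantity via a forward pass through the entrywise |W|^q matrices."""
    v = np.ones(weights[0].shape[0])
    for W in weights:
        v = v @ (np.abs(W) ** q)
    return np.sum(v) ** (1.0 / q)

rng = np.random.default_rng(7)
sizes = [3, 4, 3, 2]   # hypothetical widths: 3 * 4 * 3 * 2 = 72 paths
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

for q in (1, 2):
    print(q, path_norm_bruteforce(weights, q), path_norm_fast(weights, q))
```

The shortcut costs one pass through the network, while the enumeration grows with the product of the layer widths.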
4.1.4 Matrix induced norm.
Definition 6 (Induced norm).
Define the following “matrix induced norm” ball, for , as
(4.11) 
Lemma 4.4 (Matrix induced norm).
For any , the following inequality holds
Remark that may contain dependence on when . This motivates us to consider the following generalization of matrix induced norm, where the norm for each can be different.
Definition 7 (Chain of induced norm).
Define the following “chain of induced norm” ball, for a chain of