Generalization by design: Shortcuts to Generalization in Deep Learning

07/05/2021, by Petr Taborsky et al., DTU

We take a geometrical viewpoint and present a unifying view on supervised deep learning with the Bregman divergence loss function, which covers the most common classification and prediction tasks. Motivated by simulations, we suggest that there is in principle no implicit bias of vanilla stochastic gradient descent training of deep models towards "simpler" functions. Instead, we show that good generalization may be instigated by bounded spectral products over layers, leading to a novel geometric regularizer. It is revealed that in deep enough models such a regularizer enables both extreme accuracy and generalization to be reached. We relate popular regularization techniques such as weight decay, dropout, batch normalization, and early stopping to this perspective. Backed up by theory, we further demonstrate that "generalization by design" is practically possible and that good generalization may be encoded into the structure of the network. We design two such easy-to-use structural regularizers that insert an additional generalization layer into a model architecture, one with a skip connection and one with dropout. We verify our theoretical results in experiments on various feedforward and convolutional architectures, including ResNets, and datasets (MNIST, CIFAR10, synthetic data). We believe this work opens up new avenues of research towards better generalizing architectures.

1 Introduction and Contributions

State-of-the-art deep neural networks are trained with some form of regularization, including well-known tools such as weight decay, dropout, and batch normalization Ioffe and Szegedy (2015) used to train deep ResNet architectures He et al. (2016); more loosely, regularization is also achieved by "early stopping". Regularization is thought to keep the weights 'under control' during training and thus to positively impact generalization by reducing overfitting. Figure 1 suggests that the distribution of the weights is indeed related to how the model generalizes to previously unseen data, as argued by many Huh et al. (2021); Jacot et al. (2018); Kawaguchi et al. (2017); Goodfellow et al. (2016). We note that in most cases regularization is applied without analyzing how it may interfere with the structure of the network.

Figure 1: Histogram of weights of a convolutional network classifier of handwritten digits (MNIST) after zero training error has been reached. Panel (a) depicts the weight distribution after standard (He) initialization, panel (b) the distribution after reaching 98.6% test accuracy, and panel (c) the same but when fitting random labels as described in Zhang et al. (2016) (test accuracy corresponds to random choice, i.e., approx. 10%). We observe that the more complex model (c) has significantly heavier tails than the fully converged and well-generalizing, close-to-optimum model (b), and that the distribution of weights, especially its tails, may carry information about generalization properties.

Practical success and well-developed theory aside, is omnipresent regularization really necessary in deep learning? There is a very wide and active area of research on the implicit regularization of deep learning with stochastic gradient descent; see the section on related work for more details. Despite many provable benefits, however, SGD alone seems not to be enough to ensure good performance outside the training data. In Fig. 2 we show that three feed-forward models of different depths, with 1, 7 and 119 hidden layers, all overfit when trained by vanilla SGD for long enough. In the light of this "counter-example" it seems some sort of regularization or "early stopping" is indeed needed. Here our aim is to improve network generalization by taking the architecture of the model into account and to design a regularizing layer as an integral part of the model; we call this regularization by design.

Without much loss of generality we consider Bregman divergence loss functions in this paper. The choice of the Bregman divergence loss may seem limiting at first glance, but as shown in Banerjee et al. (2005) the Bregman divergence encompasses many common loss functions, such as the squared loss, maximum likelihood (KL divergence), cross-entropy, etc., used to train both predictive and classification models. In Section 3 the Bregman divergence loss allows us to develop a geometrical perspective on generalization and to characterize "well" generalizing networks in weight-parameter space by using products of weights along the input-output paths, motivated by Fig. 1 and the aforementioned work.
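To make the scope of this loss family concrete, the following is a small sketch of ours (not from the paper) of the generic Bregman divergence D_φ(p, q) = φ(p) − φ(q) − ⟨∇φ(q), p − q⟩, with the squared loss and the (generalized) KL divergence recovered as the special cases mentioned above.

```python
import torch

def bregman_divergence(phi, p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """Generic Bregman divergence D_phi(p, q) = phi(p) - phi(q) - <grad phi(q), p - q>."""
    q = q.detach().requires_grad_(True)
    phi_q = phi(q)
    (grad_q,) = torch.autograd.grad(phi_q, q)
    return phi(p) - phi_q - torch.dot(grad_q, p - q)

p = torch.tensor([0.2, 0.3, 0.5])
q = torch.tensor([0.4, 0.4, 0.2])

# phi(x) = ||x||^2 recovers the squared Euclidean distance.
sq = bregman_divergence(lambda x: (x ** 2).sum(), p, q)
print(torch.allclose(sq, ((p - q) ** 2).sum()))  # True

# phi(x) = sum x log x (negative entropy) recovers the generalized KL divergence.
kl = bregman_divergence(lambda x: (x * x.log()).sum(), p, q)
print(torch.allclose(kl, (p * (p / q).log()).sum() - p.sum() + q.sum()))  # True
```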

We present 1.) a novel approach to generalization in neural networks, generalization by design. Following the theory presented in Section 3, we introduce a "generalization" layer in two forms: one encoded by skip connections, the other using dropout, each with its specific training regime. As opposed to other regularization methods, this generalizing layer is encoded into the architecture of the model. We further design 2.) a practical method and demonstrate experimentally that this enhanced architecture significantly improves generalization, see Table 1, and is applicable across classification and regression models in general. Finally, 3.) Section 3 and the Discussion derive a novel geometrical perspective, labelled Learning in the Manifold of Distributions, that provides arguments for the long-puzzling phenomenon of the extreme accuracy and generalization often reached by deep networks, observed but not yet fully understood Zhang et al. (2016).

The paper is organized as follows. After this introduction, Section 2 reviews related work. Section 3 lays out a geometrical perspective on SGD training and generalization for both classification and prediction neural networks, Learning in the Manifold of Distributions, and describes our proposed "generalization by design" approach. Experiments on MNIST, CIFAR10, and synthetic data in Section 4 provide supporting evidence for the claims. Section 5 casts the most commonly used regularization techniques, e.g., dropout, batch normalization, early stopping, and weight decay, into the geometrical perspective of the preceding sections and discusses future work.

2 Related Work

Inspired by the work of Hauser (2018) and Amari (2016), we aim to provide an intuitive and unifying geometric perspective on generalization in deep networks. We investigate the long-puzzling problem of why and how deep learning models generalize to unseen data so well despite their ability to fit arbitrary functions (random labels) Zhang et al. (2016).

The work of Hauser (2018) uses a differential-geometric formalism on smooth manifolds to (re)define the forward pass and back-propagation in a "coordinate-free" manner. This allows us to see the layers of a neural network as "coordinate representations" of the data (input) manifold. As the number of nodes changes as one moves through the layers of the neural network, we effectively change the dimensionality used by the network to represent the data manifold. The dimension of the underlying data manifold is the number of dimensions in the smallest "bottleneck" layer, while all other layers are immersion/embedding representations. For details we refer the reader to Hauser (2018).

Conveniently, the coordinates of the input (data) layer are Cartesian, and thus all data points are embedded into this smooth Riemannian manifold with the Euclidean norm-induced inner product. The topology of this embedding space provides a well-defined neighbourhood of every data point and allows us to analyze its behaviour (subject to a smoothness assumption on the data manifold).

Overall, Hauser (2018) works with the empirically supported assumption that the neural network learns a sequence of coordinate transformations that put the data into a flattened form, i.e., it is assumed that the output manifold is Riemannian and flat, that is, its Riemannian metric tensor is constant.

Further, recent work He et al. (2020); Zhang et al. (2019) shows that identity connections, also called "shortcuts", besides counteracting vanishing gradients and degradation with depth He et al. (2016), help residual networks He et al. (2016); Rousseau and Fablet (2018) generalize well. Shortcuts of varying length, motivated by modelling suitable representations of the inputs, are used in U-Nets Ronneberger et al. (2015) and DenseNets Huang et al. (2017) to improve accuracy and convergence.

Many have investigated the dynamics of stochastic gradient descent (SGD), see e.g. Huh et al. (2021); Ali et al. (2020); Roberts (2021); Smith et al. (2021); Chizat and Bach (2020); Volhejn and Lampert (2021); Kalimeris et al. (2019); Nakkiran et al. (2019) and more, under the hypothesis that SGD is implicitly biased towards "simple" functions and as such regularizes itself. It provably finds minimum-norm solutions in specific, i.e. convex, least-squares optimization problems Ali et al. (2020). Yet a minimum-norm solution alone is not enough to guarantee good generalization in deep enough networks. Another line of work shows SGD to spend exponentially longer time in shallow minima Xie et al. (2020); Li et al. (2017) when approximated by Langevin dynamics with Gaussian noise, and Bayesian and approximate-inference based works Smith and Le (2017) establish many more interesting properties.

Figure 2: Does vanilla SGD implicitly self-regularize? Fitting a noisy quadratic function by a feed-forward network (FFN) with 1 hidden layer (blue) and 7 hidden layers (red), and by a residual FFN with 119 hidden layers and skip connections (green); panels show snapshots at epochs 1, 500, and 10,000 (columns) for the three models (rows). All models use ReLU activations and have a comparable number of parameters, i.e., 35,989 (blue), 35,998 (red), and 36,160 (green). They are trained by vanilla SGD for over 10,000 epochs with a constant learning rate and no regularization (and, in the case of the ResNet, no batch normalization). We observe that, independent of the depth or architecture of the models, they all eventually overfit the training data, as reported in the rightmost column. The training data consist of randomly generated noisy function values on a 100-point grid on the interval (-8,8) (gray); the test data are generated similarly on the interval (-2,2).

3 Learning in the Manifold of Distributions

Following the approach outlined in the Introduction and using the work of Hauser (2018) as a basic concept developed further in this section, we see the layers of the neural network model as coordinate representations of a data manifold. The neural network is then a composition of layer-to-layer maps between these coordinate representations, parametrized by the learnable weights. Altogether, by assumption, it forms a smooth input-output map (ReLU can be seen as a limit case of smooth models using softmax activations; see the comments at the end of the section).

The work of Zhang et al. (2016) showed that flatness in the weight space is insufficient to capture generalization in neural networks. This section rethinks generalization of a neural network by, first, endowing the output layer with a dually flat Riemannian geometry and, second, pulling its metric back to the input layer along the network map, making use of differential geometry.

Having established a link between known metric spaces (if the loss function is a Bregman divergence, then the output layer is a dually flat Riemannian manifold with the Fisher information matrix as its metric tensor, see Amari (2016)), we define local generalization of the network around every training datum via the "flatness" of this pulled-back metric (a bilinear form at the given datum, defined by its symmetric matrix) and the weight compositions of the network, formalized in the main Proposition 3.1 of this section. In the last step, we derive global generalization from the local one by applying the local properties at every training datum and deriving global sufficient conditions for inducing local generalization (Corollary 3.0.1).

As noted in Hauser (2018), the input layer is a rather arbitrary representation of the data manifold. For example, an RGB representation of an image used for image recognition is used simply because RGB is a convenient format for image software and displays; it is, however, not a good representation for the labeling/image-recognition task.

On the contrary, the output layer is, by definition and by our choice of loss function, i.e., a Bregman divergence, a regular probabilistic model best suited for the given modelling task. For example, for a binary classifier with cross-entropy loss, the output layer is a regular, one-dimensional, dually flat parameter manifold corresponding to the Binomial distribution (of the Exponential family), spanned by its natural parameter and equipped with the corresponding cumulant function, whose mean parameter defines the probability of a particular class, e.g. '0'. From this modelling perspective the data manifold is best and most intuitively represented by the output layer.
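For reference, the standard exponential-family quantities behind this binary example are restated below in our own notation (textbook facts, cf. Banerjee et al. (2005); Amari (2016), not a new result of the paper):

\[
p(y \mid \theta) = \exp\big(y\,\theta - \psi(\theta)\big), \qquad
\psi(\theta) = \log\big(1 + e^{\theta}\big), \qquad
\mu = \frac{\partial \psi}{\partial \theta} = \sigma(\theta), \qquad
g(\theta) = \frac{\partial^{2} \psi}{\partial \theta^{2}} = \sigma(\theta)\big(1 - \sigma(\theta)\big),
\]

so the one-dimensional Fisher information g(θ) is the second derivative of the cumulant function, which is the dual-flatness relation used throughout this section.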

Further, following Hauser (2018) and using the machinery of differential geometry, which provides the necessary formal justification, a neural network can be seen in reverse as embedding/immersing the output, dually flat layer (see the Supplementary Material and Amari (2016) for details) into an arbitrary input layer. This embedding/immersion is updated over the course of training, driven by optimization in the output layer and (back)propagation of the output-layer gradient towards the input layer. In this perspective, the choice of a Bregman loss defines a probabilistic Exponential-family model in the output layer whose number of linearly independent parameters is upper bounded by the dimensionality of the output.

Following the binary classification example above, the output-layer model is a one-dimensional, unimodal, convex model parameterized by the natural parameter or by its transformed mean parameter. In the setting of deep learning (as well as in linear regression or Gaussian processes), the neural network maps data points, indexed by the training samples, to points in the output manifold of Binomial distributions; that is, for every input-target pair a different, input-dependent distribution is being fitted. In other words, the map fits an input-dependent metric tensor to the geometry of the output-layer model of the targets (see the Appendix for more on dually coupled Exponential families). Every input gives rise to an inner product, which in the case of the induced Exponential family is the Fisher Information Matrix of the output-layer manifold; it depends on the input and thus in general differs across data points. For every input, a mode (in this case the ML estimate) of the output-layer model maps to a multitude of weight-space optima, Amari (2016), Section 12.2. They all would have the same level of loss, however.

In the perspective just outlined, learning proceeds in this manifold of distributions, and the overall generalization of the model is determined by the generalization properties of all training-data-point-dependent models, as opposed to the standard single-probability-model view.

Further, we follow the machinery of differential geometry and use the neural network to "pull" the output-layer (data-dependent) inner product, the FIM, "back" to the input layer (a so-called "pull-back metric", Hauser (2018)); see Fig. 10 in the Supplementary Material (Appendix).

This operation "imprints" specific properties of the network map into the "pulled-back inner product", which is a symmetric, positive semidefinite bilinear form defined in the neighbourhood of the datum and characterized by its matrix, see Gantmakher (1959); Bhatia (1997). When we speak of the "flatness" of the bilinear form, we have in mind the operator norm of the matrix of this bilinear form; we will abuse this notation for brevity throughout the paper.

Notably, while the flatness of this bilinear form at a datum, or rather of its related quadratic form in the neighbourhood of that datum, naturally captures the generalization of the map via the curvature of the pulled-back inner product (and its eigenvalues, see Fig. 10 in the Appendix), it differs essentially from the flatness of the map itself around the datum. While the pulled-back metric at the datum depends on the first derivatives of the map (the Jacobian), given by backpropagation or the chain rule (see Supplementary Material, Fig. 1 and Eq. (4), "Back-propagated Inner Product of the Manifold of Distributions"), the flatness of the map would be defined by its second derivatives (the Hessian).

In short and fundamentally, it is the Jacobian of the map, involved in this quadratic form, not the Hessian, that defines generalization in this paper.

In the example of a ReLU network, the Hessian gives a constant flat structure (the map is piecewise linear), while generalization, defined by the flatness of the pulled-back metric above, depends on the products of weights activated at a particular data input, promoting weight configurations resulting in "small" products. This idea underpins Proposition 3.1, one of the main results of the paper.

Following this idea, we use the "flatness" of the quadratic form, seen as a function of the input in the neighbourhood of a training datum, to define local generalization properties of the model. Global generalization arises from the local one applied "for every training datum".

In particular, getting back to Zhang et al. (2016), fitting random labels, viewed from the perspective above, would correspond to fitting the individual models around the training data points exactly, yet the pulled-back inner products would have large curvatures, caused by the randomness in the labels of neighboring training data and thus resulting in bad generalization properties. On the contrary, if generalization of the kind above were enforced (by methods developed later in the paper), the randomly initialized model would not converge to a solution fitting random labels. In other words, under "good generalization" constraints on the weights, the weight configuration fitting random labels becomes unreachable.

This "rethought" generalization is a basis for our results that will materialize in Proposition 3.1 capturing the idea of local generalization.

Along the lines above the Corollary 3.0.1 then finds an upper bound of the weight products that will be used to design a generalization layer structure that would instigate the flatness of related pulled back metric around every training data point and thus enforce global generalization of the model.

Generalization in the Manifold of Distributions

Following the previous section, generalization on the output layer at a datum is given by the spectral properties of the Fisher information matrix (FIM) of the chosen probability model over the targets. The flatter the landscape w.r.t. the outputs, which play the role of natural parameters (see Appendix B), the better. Nevertheless, we would like to know how the log-likelihood changes with regard to the input layer and how it depends on the model parameters.

From now on we use upper indices to denote coordinates while lower indices index vectors/tensors. The Einstein summation convention is used whenever a pair of indices appears in an equation, to simplify notation. We further fix the inner product on the input space; it is a constant identity matrix in our case.

Without loss of generality, consider the network to be a smooth, non-invertible map between the input manifold with one coordinate system and the output manifold with another (as such it defines a push-forward operator acting on tangent spaces, which can be viewed as a generalized, coordinate-free derivative, see Hauser (2018) and the Supplementary Material).

Because the input space is, by our choice of input data representation, a Euclidean space with an orthonormal basis, it coincides with its tangent space. If, for the sake of brevity, we take for granted that back-propagation is well defined and formally equivalent to a coordinate-free chain rule over manifolds (for details see Hauser (2018)), we can skip the formalism of differential geometry and use ordinary derivatives and the chain rule instead of directional derivatives on manifolds and right Lie-group actions (defined using "pull-backs") on frame bundles.

With this simplification in mind (retreating to the formalism of linear algebra instead of that of differential geometry), an infinitesimally small line element in the output (tangent) space relates to an input vector via the Jacobian,

\[
\mathrm{d}y^{a} = \frac{\partial y^{a}}{\partial x^{i}}\,\mathrm{d}x^{i}, \qquad (1)
\]

where Einstein summation over the index i is used.

Similarly, for the output layer (a probabilistic manifold of distributions defined by the choice of Bregman loss function corresponding to a cumulant function ψ), the metric tensor is defined as

\[
g_{ab}(\theta) = \mathrm{E}\!\left[\frac{\partial \log p(y;\theta)}{\partial \theta^{a}}\,\frac{\partial \log p(y;\theta)}{\partial \theta^{b}}\right] = \frac{\partial^{2}\psi(\theta)}{\partial\theta^{a}\,\partial\theta^{b}}, \qquad (2)
\]

where y is a random variable of the output-layer Exponential family distribution derived from the dual Bregman divergence, with the cumulant function and its convex conjugate forming a dual pair; for details see the Supplementary Material and Banerjee et al. (2005). The second equality is a consequence of dual flatness, Amari (2016).
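As a quick numerical sanity check of the two equal expressions in Eq. (2) for the Bernoulli case discussed earlier, the following sketch (ours, not the authors' code) compares the Hessian of the cumulant function with a Monte Carlo estimate of the expected squared score and with the closed form σ(θ)(1 − σ(θ)):

```python
import torch

def psi(theta):
    # Cumulant (log-partition) function of the Bernoulli family in natural parameters.
    return torch.log1p(torch.exp(theta))

theta = torch.tensor([0.7])

# Metric tensor as the second derivative of the cumulant function (dual flatness).
g_dual = torch.autograd.functional.hessian(psi, theta)

# Metric tensor as the expected outer product of score functions (FIM definition).
p = torch.sigmoid(theta)
y = torch.distributions.Bernoulli(probs=p).sample((200_000,))
score = y - p                      # d/dtheta [y*theta - psi(theta)] = y - sigmoid(theta)
g_fim = (score ** 2).mean()

print(float(g_dual), float(g_fim), float(p * (1 - p)))  # all approximately equal
```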

Because the network map is not an (invertible) coordinate transformation, relating distances between the input and output coordinate representations properly is non-trivial; it is done formally in Hauser (2018). Although formally involved, the idea follows a geometric intuition. We leverage the induced geometry of the output layer, which is given by our choice of loss, i.e., the probabilistic model of the output layer. Note that by constraining ourselves to a Bregman divergence as the loss (which covers common cost functions, such as log-likelihood (KL divergence), cross-entropy, squared loss, etc., used for both classification and prediction models, see Banerjee et al. (2005)), both inner products are known, and thus the map, smooth by assumption, relates distances between the input and output (tangent) spaces. We emphasize that this combination of input and output geometry is a crucial element that enables us to advance previous works on the topic. Omitting rigorous definitions for brevity (see Appendix A for more), we follow an intuitive notion of the "generalization" of the map around a datum: the larger the vicinity of a given input point that is mapped into a fixed region of the output space around its image, i.e., a unit hyper-ball, the better the function generalizes in that vicinity. A formal build-up, definitions and proof, and additional explanatory figures are deferred to the accompanying Supplementary Material in Appendix A.

Proposition 3.1 (Informal).

In the context of the above, assuming the activation functions used in the architecture of the network are 1-Lipschitz, the pull-back metric from the output layer into the input Euclidean manifold around a datum x is, up to a constant, defined by the following positive semidefinite matrix:

\[
M(x) = J(x)^{\top}\, G\, J(x), \qquad (3)
\]

where J(x) is a real matrix with elements

\[
J^{k}{}_{i}(x) = \sum_{\pi \in P(i,k)} a_{\pi}(x)\, W_{\pi}, \qquad (4)
\]

and G is fully determined by the chosen Bregman divergence loss. Further, a_π(x) is a positive real function formed by the products of the activation functions' derivatives along the path π, such that a_π(x) ≤ 1, and P(i,k) is the set of all "back-propagation" paths connecting an input-layer node i to an output-layer node k through the network such that each layer has exactly one node present in the path. Then W_π is the product of all weights from the input node to the output node along the path π.
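To make these objects concrete, here is a small sketch of ours (not the authors' code) that builds the pulled-back form described in Proposition 3.1, in the shape reconstructed in Eq. (3), for a tiny ReLU network with a Bernoulli (cross-entropy) output; the network, its size, and the use of autograd for the Jacobian are our illustrative choices, and the sigmoid-variance term plays the role of the output-layer metric G:

```python
import torch

torch.manual_seed(0)

# Tiny ReLU network mapping R^3 -> R (the natural parameter of a Bernoulli output).
net = torch.nn.Sequential(
    torch.nn.Linear(3, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 8), torch.nn.ReLU(),
    torch.nn.Linear(8, 1),
)

x = torch.randn(3)

# Input-output Jacobian J(x): rows index outputs, columns index inputs.
J = torch.autograd.functional.jacobian(lambda z: net(z), x)   # shape (1, 3)

# Output-layer metric G for a Bernoulli model in natural parameters: sigma(theta)(1 - sigma(theta)).
theta = net(x)
p = torch.sigmoid(theta)
G = (p * (1 - p)).reshape(1, 1)

# Pulled-back positive semidefinite form at x, following the structure of Eq. (3).
M = J.T @ G @ J                                               # shape (3, 3)

# Its largest eigenvalue quantifies the local curvature ("flatness") around x.
print(torch.linalg.eigvalsh(M).max())
```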

If we follow the idea of the generalization of the map being described as the "flatness" of a positive semidefinite bilinear form (or rather the operator norm of its matrix, but we keep abusing notation as noted before) around a datum, then, according to the preceding Proposition 3.1, the degree of this flatness can be assessed by looking at the eigenvalues of the positive semidefinite matrix M(x) in Eq. (3). The next corollary shows that the degree of flatness is upper bounded by a product of largest eigenvalues.

Corollary 3.0.1 (Spectral products, Informal).

Under the conditions of Proposition 3.1, the largest eigenvalue of M(x) from Eq. (3) can be bounded from above by the following product of eigenvalues:

\[
\lambda_{\max}\big(M(x)\big) \;\leq\; \lambda_{\max}(G)\;\prod_{l \in L} \lambda_{\max}^{2}(W_{l}), \qquad (5)
\]

where L denotes the set of layers of the network, λ_max(W_l) denotes the largest eigenvalue of the matrix comprising the weights of layer l, and similarly λ_max(G) denotes the largest eigenvalue of the positive semidefinite output-layer metric tensor G.

Proof.

By the definition of M(x), the equivalence of norms on finite-dimensional vector spaces, and the 1-Lipschitz property of the activation functions assumed above. The full proof is provided in the Supplementary Material in Appendix A. ∎

Proposition 3.1 and Corollary 3.0.1 relate the geometric notion of generalization as "flatness" to the parametric space of weights. In particular, they convey that functions defined by a model architecture, activation functions, and loss function, in line with Kawaguchi et al. (2017), together with a particular point in the weight space, generalize well at a datum if sums of products of the largest eigenvalues of the layer weight matrices are small. Although a tighter bound can be given using traces (this also follows from norm equivalence on finite-dimensional vector spaces) or the sum of weight products along input-output paths, this bound suffices for the development of our method in the next section.
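As a concrete illustration of how the data-independent part of the bound in Eq. (5) could be monitored during training, here is a small diagnostic sketch of ours (not the authors' code); it uses the largest singular value of each weight matrix as the layer-wise spectral factor:

```python
import torch

def spectral_product(model: torch.nn.Module) -> torch.Tensor:
    """Product of the largest singular values of all Linear/Conv weight matrices.

    Tracks the data-independent factor of the upper bound in Eq. (5);
    keeping it small is the design goal of the generalization layer.
    """
    prod = torch.tensor(1.0)
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            w = module.weight.detach()
            w = w.reshape(w.shape[0], -1)                      # flatten conv kernels to a matrix
            prod = prod * torch.linalg.matrix_norm(w, ord=2)   # largest singular value
    return prod

model = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1))
print(spectral_product(model))
```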

Remark (Towards Global Generalization).

Notably, the upper bound (5) of Corollary 3.0.1 takes the form of a product, and the layer-wise part does not depend on the input data. It is given by the structure of the network, i.e., the composition of layers, activation functions and weights, and is therefore "global". If this part is kept low and the largest eigenvalue of the output-layer metric tensor is bounded over the data (as it is, for example, in the case of a regular output-layer model and a finite dataset), then the largest eigenvalue of the pulled-back metric (the curvature around each training datum) is small for all data points. This is the driving idea for the design of the generalization layer in the next section.

Strikingly, it follows that under the flatness constraint on every input-indexed distribution, overfitting, i.e., reaching zero training error, is not a problem, as it could be in the absence of such a regularizer. On the contrary, under the imposed constraints on the "flatness" of the pulled-back bilinear form introduced above (Proposition 3.1 and A.0.1), reaching zero training error improves accuracy and is desirable. On the other hand, reaching it requires the necessary capacity, e.g., depth of the model.

In the extreme case of a shallow network, i.e., an input layer followed directly by an output layer, keeping the spectral products small means the network is close to a constant function no matter how wide it is; that is, it has extremely limited capacity. Notably, going deeper helps: keeping the eigenvalues of some layers small enables the remaining layers to be more expressive, because they act in a product with the previous layers. This is exactly how the generalization layer is designed and experimentally verified to work in the following sections.

Hence depth enables both reaching zero training error, by increasing the capacity of the network, and keeping all pulled-back metrics around the training data flat, that is, generalizing well. An explicit experiment (though motivated differently) demonstrating this is given in Hauser (2018), Fig. 4.7.

Nevertheless, learning in the manifold of distributions suggests an explanation of a long-puzzling phenomenon: the extreme accuracy and generalization that deep neural networks are able to reach at the same time. Moreover, it renders depth the essential enabler of this capability.

3.1 Generalization Layer (GL) with Skip Connections

Corollary 3.0.1 of the previous section suggests that good generalization can be achieved by keeping the products of the maximum eigenvalues of the layers "small". In this section we design one such widely applicable method, as demonstrated later in the experiments of Section 4.

The idea is to add additional structure into the existing architecture, as depicted in Fig. 3, with hooks to control the size of the weights (blue lines in Fig. 3) during training.

This additional structure, called the generalization layer, is defined by a tuple of new nodes, new weights, and a skip connection parametrized by a scalar. The number of new nodes is chosen to match the dimensionality of the weight matrix of the following layer. Similarly, the skip connections are chosen to match the dimensionality of the layers they connect. In the simplifying diagram of Fig. 3 the skip connections are scalar multiples of the identity matrix, because the connected layers have the same number of nodes. In the general case, however, when the connected layers have different dimensionality, the skip defines a (scaled) linear projection matrix, in line with the notation used in He et al. (2016).

Figure 3: Generalization layer: panel (a) shows the architecture before and panel (b) after insertion of the "generalization layer" (blue) between two existing layers. The structure comprises additional (blue) nodes of the same size as the following layer, additional weights (blue lines), and skip connections (dashed blue lines) ensuring the gradient flow of the original architecture (a) is not broken.

This construction straightforwardly generalizes to convolutional or any other architecture.
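A minimal PyTorch sketch of how such a layer could be wired (our own illustration under stated assumptions, not the authors' released code; the module name, the use of a Linear bottleneck, and the parameter name alpha for the skip scaling are ours):

```python
import torch
import torch.nn as nn

class GeneralizationLayer(nn.Module):
    """Extra layer inserted between two existing layers, bridged by a scaled skip connection."""

    def __init__(self, in_features: int, out_features: int, alpha: float = 1.0):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)        # the additional (blue) weights
        # Projection skip if dimensions differ, identity otherwise (cf. He et al. 2016).
        self.skip = (nn.Identity() if in_features == out_features
                     else nn.Linear(in_features, out_features, bias=False))
        self.alpha = alpha                                     # skip strength, decayed during training

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x)) + self.alpha * self.skip(x)
```

The layer can then be dropped between any two layers of an existing feed-forward or flattened convolutional block; the skip term keeps gradients flowing even while the new weights are kept small.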

Keeping the weight products under control. To deliver the desired regularizing effect it is necessary to prevent the newly added weights from growing while ensuring the rest of the network is trained. This can be achieved by many means.

We have opted for training the generalization layer with a 10 times lower learning rate than the rest of the network, in addition to steering large gradients towards the lower layers through shortcuts controlled by a scalar hyperparameter. In particular, constraining the weights of the generalization layer dampens the backpropagated gradients during training and may stall or stop training altogether. To avoid breaking the gradient flow, the skip connections are parametrized by this scalar hyperparameter, which is linearly decayed during training. For details see Section 4.1.1.
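One possible way to implement this regime with PyTorch parameter groups (a sketch under our own assumptions: the attribute name gl, the base learning rate, and the decay endpoints alpha_start/alpha_end are illustrative placeholders, not values from the paper):

```python
import torch
from torch import nn

class Net(nn.Module):
    """Toy model with an extra bottleneck ('generalization') layer named `gl`."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
        self.gl = nn.Linear(64, 64)        # stands in for the generalization-layer weights
        self.head = nn.Linear(64, 10)

    def forward(self, x, alpha: float = 1.0):
        h = self.body(x)
        return self.head(torch.relu(self.gl(h)) + alpha * h)   # scaled identity skip over gl

model = Net()
gl_params = list(model.gl.parameters())
base_params = [p for n, p in model.named_parameters() if not n.startswith("gl.")]

base_lr = 0.1                                      # illustrative value, not from the paper
optimizer = torch.optim.SGD([
    {"params": base_params, "lr": base_lr},
    {"params": gl_params, "lr": base_lr / 10.0},   # 10x lower learning rate on the GL weights
])

def skip_strength(epoch, total_epochs, alpha_start=1.0, alpha_end=0.1):
    """Linear decay of the skip scaling, held constant for the last 20% of training."""
    decay_epochs = int(0.8 * total_epochs)
    t = min(epoch, decay_epochs) / max(decay_epochs, 1)
    return alpha_start + t * (alpha_end - alpha_start)
```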

The method just described delivers the desired effect, and the resulting model performs on par with or better than the original ResNet models He et al. (2016), even though in our design we removed the whole second residual block of stacked convolutional layers in ResNet and replaced it by one feed-forward layer as in Fig. 3, so our model has significantly fewer parameters. Results are presented in the experimental Section 4.

Figure 4: Regularization by skip connections: least-squares regression of a noisy quadratic function in the first two columns and of a mixture of quadratic and cubic functions in the two rightmost columns; panels (a), (b) show epoch 2000, (c), (d) epoch 500, and (e)-(h) epoch 20,000. Shown are selected epochs during SGD training of the residual feed-forward architecture with ReLU activations, 119 layers, and 36,160 trainable weights, regularized by (long) skip connections over every hidden layer as described in the experimental setup of Section 4.1. While the (cyan) model in the second and fourth columns is trained by vanilla SGD, the first (purple) and third (green) columns show the result of SGD training with an adaptive learning rate (Adam). The figure demonstrates that vanilla SGD in combination with the long-skip regularizer effectively enforces a piecewise-linear structure on the fitted function, from early epochs, (b) and (d), to over 20,000 epochs, both for the quadratic function (f) and for the more complex noisy third-degree polynomial (h). This is in stark contrast to adaptive training with Adam (not reported, but similar results were produced by RMSProp), which explores flatter directions in the loss landscape Goodfellow et al. (2016) and recovers an almost exact generating function, as shown in (a) and (c); however, if no early stopping is applied, it eventually overfits, as in (e) and (g), despite the same skip-connection regularization as in the vanilla SGD case. The training data consist of randomly generated noisy function values on a 100-point grid on the interval (-8,8) (gray); the test data are generated similarly on the interval (-2,2).

3.2 Generalization Layer with DropOut

The previous section derived one particular method, called the Generalization Layer (GL), which makes use of shortcut connections and was designed to allow experimental verification of the theory. However, as noted therein, the GL is not "the only" method; on the contrary, we believe many more are to be discovered.

In this section, we design another easy-to-implement generalization layer that works well in practice, as will shortly be demonstrated in experiments on the CIFAR10 dataset, see Section 4.2.

As before in the case of the GL, the idea is to add structure into the existing architecture, as depicted in Fig. 5, with hooks to control the size of the weights (blue lines in Fig. 5) during training.

This additional structure, labeled "GLD", is defined by a tuple of new nodes, new weights, and a dropout hyperparameter. The dropout hyperparameter defines the success probability in the Bernoulli trial of removing a node of the GLD layer (only) from the computational graph for one particular batch; it is applied independently to all nodes of the GLD layer. Importantly, this dropout is applied only to the GLD layer and independently of other regularization techniques used in training, including dropout elsewhere.

The number of new nodes is chosen to match the dimensionality of the weight matrix of the following layer. In the simplifying diagram of Fig. 5 the adjacent layers have the same number of nodes; in the general case, when they have different dimensionalities, the number of outgoing links, color-coded blue in Fig. 5, is adjusted accordingly.

Once embedded into the architecture of the desired model, the extended model is trained by standard (stochastic) gradient descent (backpropagation) techniques, with the caveat of the additional "GLD" dropout applied to this layer.
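A minimal sketch of what such a GLD insertion could look like in PyTorch (our own illustration; the module name, the Linear bottleneck, the toy encoder, and the dropout rate are assumptions or placeholders):

```python
import torch
import torch.nn as nn

class GLD(nn.Module):
    """Generalization layer with dropout: extra nodes whose outputs are randomly dropped."""

    def __init__(self, in_features: int, out_features: int, drop_rate: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(in_features, out_features)   # the additional (blue) weights
        self.drop = nn.Dropout(p=drop_rate)               # applied to the GLD nodes only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.drop(torch.relu(self.fc(x)))

# Hypothetical placement between a toy encoder and a classifier head, as tested on CIFAR10.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())
model = nn.Sequential(encoder, GLD(256, 256, drop_rate=0.6), nn.Linear(256, 10))
print(model(torch.randn(4, 3, 32, 32)).shape)  # torch.Size([4, 10])
```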

Figure 5: Generalization Layer with Dropout (GLD): panel (a) shows the architecture before and panel (b) the extended architecture after insertion of the "generalization layer" (blue) between two existing layers. The structure comprises additional (blue) nodes of the same size as the following layer, additional weights (blue lines), and a probability of dropping the added GLD nodes from a training (forward-backward) step. For instructive purposes, the figure depicts the case in which half of the GLD nodes were ruled out for a given step.

This construction straightforwardly generalizes to convolutional or any other architecture and includes any additional structure that fits incoming and outgoing dimensionality.


4 Experiments

Our experiments include both feed-forward and convolutional architectures and show the following:

  • Fig. 2 demonstrates by example that vanilla SGD alone eventually converges to a highly complex function that overfits the training data in accordance with the capacity of the model, suggesting that the implicit regularization of SGD is not enough to achieve good generalization on its own.

  • Fig. 4 and Fig. 7 report that it is possible to keep a model in a "simple function" mode over long training periods by manipulating the architecture of the model. In this case we use additional skip connections over 7 layers to regularize, as outlined in Zhang et al. (2019). It is of utmost importance to note that short skip connections without batch normalization readily overfit, as demonstrated in Fig. 2.

  • Other experiments we ran (not reported here) suggest that regularization by the GL, if overdone, may keep the map "too simple", preventing SGD from improving the training error for long periods. Further research on hyperparameter optimization and on alternative training regimes, e.g., cyclical learning rates Smith (2017), is suggested.

  • On the CIFAR10 classification task, Fig. 6 demonstrates that the GL can replace the middle stack of ResNet blocks, see He et al. (2016), and still achieve on-par or better performance while using fewer parameters.

  • Further, Fig. 8 and Fig. 9 report experiments with another version of the generalization layer, with dropout instead of skip connections, this time placed after the encoder block of ResNet. The outstanding results of this encoding suggest that the placement of the generalization layer within the model architecture plays an essential role, see the Discussion.

  • Table 1 reports on popular ResNet architectures and the CIFAR10 image classification dataset; the 56-layer ResNet model enhanced by the GLD structural regularizer outperformed all original ResNets, He et al. (2016), including the one with 1202 layers.

4.1 Experimental Setup and "Generalization Layer" Training Regime

In the experiments reported in Fig. 2 and Fig. 4 we use least-squares regression of two noisy polynomial functions: a noisy quadratic function and a noisy piecewise mixture of quadratic and cubic terms.

For the regression in Fig. 2, a feed-forward architecture of varying depth is used as a counter-example to SGD's bias towards simple functions. All models are of feed-forward (FF) architecture. The deepest (labeled "ResNet w/o BN" in the figure) has 119 hidden linear layers and one-dimensional input and output, so that the number of trainable parameters (36,160) is as close as possible, for comparison purposes, to the 7-hidden-layer (Deep) model with 35,989 parameters and the 1-hidden-layer (Shallow) model with 35,998 trainable parameters. All models use ReLU activations and are trained by SGD with a constant learning rate and no regularization (referred to as "vanilla SGD"), unless stated otherwise, over 20,000 epochs. The synthetic training dataset for regression consists of randomly generated noisy function values (the quadratic in Fig. 2 and the quadratic/cubic mixture in Fig. 4) on a 100-point grid on the interval (-8,8); the test data are generated similarly on the interval (-2,2).
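For completeness, a sketch of how such a synthetic regression setup and a plain feed-forward baseline could be generated (our own assumptions: the quadratic target, the noise scale, the layer width, and the learning rate are illustrative placeholders, since the exact values are not given here):

```python
import torch

torch.manual_seed(0)

# 100-point grid on (-8, 8); noisy quadratic targets (noise scale is a placeholder).
x_train = torch.linspace(-8.0, 8.0, 100).unsqueeze(1)
y_train = x_train ** 2 + 2.0 * torch.randn_like(x_train)

# Test grid on (-2, 2), as in the paper's captions.
x_test = torch.linspace(-2.0, 2.0, 100).unsqueeze(1)
y_test = x_test ** 2

def make_ffn(depth: int, width: int) -> torch.nn.Sequential:
    """Plain feed-forward ReLU network with 1-d input and output."""
    layers = [torch.nn.Linear(1, width), torch.nn.ReLU()]
    for _ in range(depth - 1):
        layers += [torch.nn.Linear(width, width), torch.nn.ReLU()]
    layers.append(torch.nn.Linear(width, 1))
    return torch.nn.Sequential(*layers)

model = make_ffn(depth=7, width=64)                          # width is an illustrative choice
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)     # "vanilla SGD", constant placeholder lr
loss_fn = torch.nn.MSELoss()

for epoch in range(1000):                                    # the paper trains for up to 20,000 epochs
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()

print(float(loss_fn(model(x_test), y_test)))
```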

In Fig. 4, identity skip connections, a.k.a. "shortcuts", see He et al. (2016); Zhang et al. (2019), are used to explore their regularization capability. In particular, we use skips to shortcut every feed-forward layer by the identity, and we refer to this model in Fig. 4 as regularized by skip connections. We train the model with both vanilla SGD and Adam to showcase the effect of training with an adaptive learning rate (Adam), with results elaborated on in the caption.

Fig. 7 reports on the functional fit experiment: by making use of multiple "GL" blocks (3 blocks of FFL-ReLU-FFL, each with an overarching weighted shortcut; altogether the model has 90 hidden layers with 45 skip connections, of which 3 belong to the GL, and a layer width of 17 units) it is possible to recover the generating cubic function the way adaptive-learning-rate (Adam) SGD did (the best MSE achieved in the experiments), see Fig. 4(a) and Fig. 4(c). We explicitly note that the results of Fig. 7 have been achieved with the GL as the only explicit regularizer, i.e., no batch normalization, weight decay or other regularization, using the training method described in Section 3 of the paper.

4.1.1 Training with a "Generalization Layer"

The experimental design for Fig. 4 uses the "regularizing layer" as designed in Section 3.1, together with a scheduled decay applied to the skip-connection parameter over the course of training. This regime is necessary to ensure the gradient flow is not disconnected by the "generalization" layer, whose weights we would like to keep low. To propagate gradients beyond this layer, skip connections are used. Their strength is toned down during training, starting from an initial value and linearly decayed to a final value that is kept for the last 20% of the training period (bringing it all the way to zero is not wanted, as it would disconnect the smoothness of the coordinate transformations over layers, as noted in Zhang et al. (2019); Hauser (2018)).

On top of the above, 10 times lower learning rates were used for the generalization-layer weights after the initial 20 epochs, to slow down the growth of the largest eigenvalue of this layer in line with Corollary 3.0.1.

All experiments were designed and coded in PyTorch Paszke et al. (2017) and executed on a regular 10-GPU cluster.

4.2 Results of Experiments

Experiments on Noisy Polynomial Functions

Most of the results are elaborated on in the captions of Fig. 2 and Fig. 4. In relation to the "generalization by design" method specifically, Fig. 4 demonstrates a useful regularization effect of skip connections, as covered in Zhang et al. (2019). As described in the caption of the figure, shortcuts in combination with vanilla SGD training produce a piecewise-linear function even after 20,000 epochs. However, if used in an adaptive-learning-rate SGD training regime, the model recovers the generating functions almost perfectly, yet without early stopping it continues to lower the training error and eventually overfits.

To shed more light on the regularization effect of skip connections, it is important to compare Fig. 2 to Fig. 4. Fig. 2 shows that a ResNet architecture with short skips every second layer, trained by vanilla SGD, i.e., without batch normalization (which, besides reducing covariate shift, also regularizes, see Ioffe and Szegedy (2015)), labeled "ResNet w/o BN" in the figure, tends to heavily overfit from an early stage of training. This is taken into account when designing the "regularizing layer", where skip connections are used to steer the gradient flow away from the "bottleneck" layer in the early stage of training, rather than for their regularizing effect itself. See Section 3.1 for details.

Skip Connections GL On the CIFAR10 Experiments

Experiments on the CIFAR10 dataset are reported to demonstrate the applicability and effectiveness of the "regularization layer" from Section 3.1 on a real-world dataset and the popular ResNet architecture. For results see the caption of Fig. 6.

Figure 6: Generalizing layer (GL) on CIFAR10: both architectures, GL ResNet20 (blue) and GL ResNet56 (green), based on the original ResNet architectures referenced in the labels, have fewer parameters than the original models because the whole second ResNet convolutional block is replaced by one generalizing layer with skip connections as defined in Section 3.1. Yet they reach on-par or better results than reported in the original paper He et al. (2016); color-coded dotted lines indicate the reference accuracy levels reached therein. The color-coding of epochs goes from lighter early epochs to darker later ones.
Figure 7: Generalization layer (GL) as the only regularizer. Fitting a noisy cubic function; panels (a)-(i) show training snapshots at epochs 1, 30, 50, 100, 200, 300, 400, 500, and 1,000. As shown in the snapshots, the generating function is recovered comparably to the adaptive-learning-rate (Adam) result of Fig. 4(c). The only explicit regularizer used is the "generalization layer" (GL) of 3 blocks of FFL-ReLU-FFL, with a weighted shortcut overarching every one of the three blocks, as described in Section 3 of the paper. The model consists of 90 hidden feed-forward layers (FFL), 17 nodes wide, with 45 skip connections (of which three belong to the GL and are weighted by the skip hyperparameter; the rest are identities). The models were trained by vanilla SGD for over 1,000 epochs with a learning rate decayed by a factor of 0.1 every 200 epochs. Further, the GL was trained according to the regime from Section 3, with its own learning rate and with the skip weight linearly decayed to the value 0.1 at epoch 500 and kept constant thereafter. The training data consist of randomly generated noisy function values on a 100-point grid on the interval (-8,8) (gray).

The results of this alternative implementation demonstrate that improved generalization is achieved irrespective of the specific way Proposition 3.1 and its Corollary are implemented, and as such it is a consequence not of the particular method but of the concept presented.

Drop-out Generalization Layer (GLD) On the CIFAR10 Experiments

Experiments on the CIFAR10 dataset demonstrate the applicability and effectiveness of the "regularization layer" from Section 3.2 on a real-world dataset and the popular ResNet architecture. For results see the caption of Fig. 9.

Figure 8: Structural generalization layer with drop-out (GLD) on CIFAR10: both architectures, ResNet20 (blue) and ResNet56 (green), based on the original ResNet architectures referenced in the labels, have an additional GLD layer between the encoder and decoder parts of the architecture. For this additional cost they outperform the original ResNet models of He et al. (2016) by quite a margin (see Table 1 for details); color-coded dotted lines indicate the reference accuracy levels reached therein. The color-coding of epochs goes from lighter early epochs to darker later ones.

The best results of the tested models based on ResNet architecture with additional GLD layers are depicted in Fig.9.

Figure 9: Hyperparameter-adjusted structural generalization layer with drop-out (GLD) on CIFAR10: this figure presents the best results achieved in the experiments on CIFAR10 and ResNets. The ResNet56 with a GLD layer and a dropout rate of 0.6 outperformed all models from the original ResNet paper He et al. (2016), including the deepest ResNet1202 model (reference line in red); see Table 1 for details. The color-coding used is the same as in Fig. 8.

Table 1 summarizes, for comparison, the overall results of ResNet architectures of varying depth on the CIFAR10 dataset. The test errors of the original paper (column "Test err (orig.)") and of the re-implementation of the same models by Idelbayev (last column) are compared to models using the generalization layer in its dropout variant, denoted GLD. The table is sorted according to the achieved test error (last column) in descending order. The bold rows show the results of models with the generalization layer based on ResNet20 and ResNet56, in the third and the two last rows respectively. As the last column shows, ResNet20 GLD outperformed the larger ResNet32 model, and ResNet56 GLD achieved the best results of all the models, including ResNet1202 with 1202 layers and 19.4 million parameters.

Name Layers Params Test err (orig.) Test err
ResNet20 20 0.27M 8.75% 8.27%
ResNet32 32 0.46M 7.51% 7.37%
ResNet20 GLD 0.75 22 0.3M –% 7.17%
ResNet44 44 0.66M 7.17% 6.90%
ResNet56 56 0.85M 6.97% 6.61%
ResNet110 110 1.7M 6.43% 6.32%
ResNet1202 1202 19.4M 7.93% 6.18%
ResNet56 GLD 0.5 58 0.93M –% 6.11%
ResNet56 GLD 0.6 58 0.93M –% 5.74%
Table 1: ResNet with and without the Generalization Layer with Drop-Out (GLD). The GLD is placed after the encoder block of ResNet, as opposed to the GL with skip-connection experiments earlier, testing the effect of the invariance principle with respect to arbitrary coordinate representations outlined in the Discussion section. The outstanding results of GLD presented here support placing the generalization layer after the layers that are supposed to generalize well, e.g., the encoder block, in line with this invariance principle.

5 Discussion and Future Work

As argued in Zhang et al. (2016), the norm of the weights does not necessarily capture good generalization. They show in particular that generalization in ReLU networks is invariant along hyperplanes corresponding to reciprocal rescaling of the two sides of a nonlinearity by some constant and its inverse, respectively. If the constant is absorbed into the weights, the norm along such a hyperplane gets arbitrarily large even though it represents the same function and thus has the same generalization properties. As can be seen, such a rescaling has no effect on the bound (5) of Corollary 3.0.1, because the constants cancel out along the path products involved. Regularizing path products, as in our method, is more subtle than regularizing norms, as follows from the geometric vs. arithmetic mean inequality or, more generally, Jensen's inequality. Moreover, models with regularized spectral products as in Proposition 3.1 share many properties with the low spectral rank used to characterize simple functions in Huh et al. (2021).
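For concreteness, the rescaling invariance can be written out for a single path through one ReLU unit (a standard observation restated in our own notation; w_in, w_out and c are illustrative symbols):

\[
W_{\pi} = w_{\text{out}}\, w_{\text{in}} = \Big(\tfrac{1}{c}\, w_{\text{out}}\Big)\big(c\, w_{\text{in}}\big),
\qquad
\mathrm{ReLU}(c\, w_{\text{in}}\, x) = c\,\mathrm{ReLU}(w_{\text{in}}\, x) \quad \text{for } c > 0,
\]

so the path product, and with it the bound (5), is unchanged, while the squared weight norm w_out^2/c^2 + c^2 w_in^2 can be made arbitrarily large by increasing c.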

On the other hand, because Corollary 3.0.1 is based on an upper bound that may be loose, there are differences in effectiveness across different versions of the generalization layer, as shown in the experiments, i.e. GL vs. GLD in this paper. We believe many other forms of "generalization layer(s)" are to be explored, as Proposition 3.1 only requires controlling (ideally, but not necessarily, all) backpropagation paths, no matter the architecture or the way this is done. Also, in the experiments with the dropout variant of the generalization layer (GLD), the varying results for different dropout hyperparameters presented in Table 1 suggest that hyperparameter optimization may bring about even further improvements. All these suggestions are left for future work.

On the depth and the width of the model. The generalization layer keeps the data-independent part of the upper bound (5) in Corollary 3.0.1 small. Nevertheless, the data-dependent part, defined as the largest eigenvalue of the output-layer inner product on the tangent space at a datum, plays its role too. The Corollary shows that "generalization" benefits from a flatter FIM, i.e., a small largest eigenvalue. We argue this is more easily met by a high-capacity model, which is capable, because of its flexibility, of reaching more optima and of converging to a "good" one. As in the Remark [Towards Global Generalization] in Section 3, this also suggests that generalization is conditioned on the model's capacity. In particular, that means the model has to be deep enough, because the layer eigenvalues kept low by Corollary 3.0.1 enforce generalization while limiting the capacity available through width. Proving this conjecture is left as future work.

In addition, Proposition 3.1 also provides a view on the role of the depth of the network with regard to generalization seen as smoothness of the transformation. Each path term is a product of pointwise derivatives of 1-Lipschitz activation functions along the path. Since the common activation functions like ReLU and their variants are 1-Lipschitz, with pointwise derivatives in the range [0, 1], the deeper the network the smaller this product gets. Hence a 'simpler' and better-generalizing map is obtained.

As opposed to the depth of the model, the effect of the width of the layers is more involved. On the one hand, it contributes more paths to the sum in Eq. (4); on the other hand, weights initialized randomly in a common way, e.g., 'He' or 'Xavier', He et al. (2015); Glorot and Bengio (2010), have zero mean and a variance that corrects for the number of units. So the path products involved are of both signs (even though the overall Eq. (3) is always non-negative due to positive semi-definiteness), and therefore the sum of the products is not guaranteed to grow without bound even long after initialization. Moreover, the width contributes to the capacity of the network, which is essential for generalization as argued above.

Popular regularizers in the light of Corollary 3.0.1

As noted in the Introduction, there are many regularizers at hand to be combined with SGD training that work provably and empirically well. Next we relate the most common regularization techniques to our results and show that they are mutually consistent and supportive.

Weight decay Goodfellow et al. (2016). As outlined above, norm regularization of the weights can be linked to the upper bound in Eq. (5) by a trace and operator-norm inequality (see Supplementary Material) on the layer weight matrices, i.e., from the proof of Corollary A.0.1. Thus keeping the norm of the weights small keeps the upper bound on a layer's largest eigenvalue small and hence contributes to a smaller Eq. (5). Note, however, that this bound is rather loose in general and, secondly, as noted earlier, weight decay is scale dependent and as such may rule out large-norm optima that generalize well according to our results and in line with Zhang et al. (2016). Nevertheless, a stratified or selective weight decay applied only to the "generalization layer" may be just another way of keeping path or eigenvalue products low and therefore beneficial for the generalization of the model. We leave this and other alternatives of the "generalization layer" design for future work, as well as cyclical learning rates Smith (2017), which perform especially well in the case of ResNet and other deep architectures Smith (2018).
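A selective weight decay of the kind suggested here could be implemented with per-parameter-group settings (a sketch of ours; the submodule name "gl", the toy model, and the decay strength are placeholders):

```python
import torch
from torch import nn
from collections import OrderedDict

# Toy model; the submodule name "gl" marks the generalization layer (our naming).
model = nn.Sequential(OrderedDict([
    ("body", nn.Sequential(nn.Linear(32, 64), nn.ReLU())),
    ("gl",   nn.Linear(64, 64)),
    ("head", nn.Linear(64, 10)),
]))

gl_params = [p for n, p in model.named_parameters() if n.startswith("gl.")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("gl.")]

# Weight decay only on the generalization layer: it keeps that layer's spectral
# contribution to the bound (5) small while leaving the rest of the network unconstrained.
optimizer = torch.optim.SGD([
    {"params": other_params, "weight_decay": 0.0},
    {"params": gl_params, "weight_decay": 5e-4},   # placeholder decay strength
], lr=0.1)
```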

Batch normalization (BN). The authors of BN, Ioffe and Szegedy (2015), Sections 3.3 and 3.4, elaborate on the regularization effect BN has on the weights. BN arguably makes training more resilient to the parameter scale. In particular, they argue that back-propagation through the layer is unaffected by the scale of its parameters and, moreover, that larger weights lead to smaller gradients due to the larger variance in the nodes and thus in the denominator of BN. In effect, BN stabilizes parameter growth. Referring to Corollary 3.0.1, it stabilizes the growth of the eigenvalues of the layers and therefore slows down the rate of convergence towards "complex" functions that overfit, which seems to be the inevitable course of action of vanilla SGD, as we have shown in Fig. 2 of the paper.

Drop-out. See Srivastava et al. (2014). The authors of batch normalization, Ioffe and Szegedy (2015), see above, suggest based on experiments that batch normalization reduces, partially or completely, the need for drop-out, suggesting a similar effect on training. The same arguments as for BN above apply here. Indeed, dropout, by multiplying a random or deterministic subset of a layer's outputs by zero Srivastava et al. (2014); Goodfellow et al. (2016) and thus excluding the weights leading to those units from the gradient update at the given step, slows the growth of the weight parameters similarly to BN. Alternatively, following Hinton et al. (2012), one can approximate the dropout effect as the full model but with the outgoing weights of each node multiplied by the probability of including the unit. Applied to the path products of Eq. (4), all these (by design independent) probabilities multiply, leading to the regularizing effect of dropout; the deeper the network, the larger the effect under this approximation.
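Under the weight-scaling approximation just mentioned, each node on a path that is subject to dropout with keep probability p contributes a factor p, so a path passing through d such layers is attenuated as (our illustrative restatement, with d denoting the number of dropout layers on the path):

\[
\mathrm{E}\big[W_{\pi}^{\text{dropout}}\big] = p^{\,d}\, W_{\pi},
\]

which makes the attenuation, and hence the regularizing effect on the path products of Eq. (4), grow exponentially with depth.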

Early stopping. Combined with a common random initialization of the weights around zero, i.e., He et al. (2015); Glorot and Bengio (2010) with zero mean and a variance that corrects for the number of units, the network weights are gradually updated over the course of learning, as also shown in the experiment of Fig. 1. The results of Fig. 2 also suggest that vanilla SGD leads towards a complex, overfitting map characterized by large weights. Early stopping is an effective and robust way to stop the weights along the way, Li et al. (2020); the earlier it stops, the smaller the upper bound on the maximal eigenvalues in Eq. (5), hence producing 'simpler' functions.

Coordinate Representation Invariance Principle

Another interesting topic for future work is to explore the number and placement of generalization layers in the original architecture. The experiments, Fig. 6 vs. Fig. 8, suggest that better results are achieved when the generalization layer is placed between the encoder and decoder blocks of ResNet. Moreover, the significantly better generalization properties of the GLD models raise the research question of why this is so, when according to Proposition 3.1 the placement should not matter.

It can be motivated by the following invariance principle with regard to an arbitrarily chosen input representation (for the idea of invariance see Amari (2016); Chentsov (1982), where it is, however, applied to transformations of variables, i.e., in a different context).

Recall that in the "coordinate representation of the data manifold" view the input representation is rather arbitrary, according to Hauser (2018). So while Proposition 3.1 addresses generalization properties w.r.t. the inputs, or, seen from a forward-pass perspective, w.r.t. all the layers before the GL, because it regularizes the corresponding path products, all the layers following the GL are unregularized and may cause over-fitting. This follows from applying Proposition 3.1 to any layer placed after the generalization layer and considering it a new input representation of the resulting shallower model.

6 Conclusions

This paper develops a novel approach to generalization in deep learning, a unifying geometrical perspective: Learning in the Manifold of Distributions. It encompasses both classification and prediction neural network models. The devised theory and Corollary 3.0.1 are used to design a new method, called the "generalization layer", that is embedded into the architecture of the model as a structural regularizer. Further, the developed framework suggests that in deep enough models, as opposed to shallow ones, such a regularizer enables both extreme accuracy and generalization to be reached.

In the experimental section we empirically verify that even simple setups, i.e., inserting an extra "generalization layer" and keeping its eigenvalues low, improve generalization. Another variant of the structural regularizer, based on the generalization-layer concept but using drop-out, is developed, to confirm that many ways of implementing generalization by structure are possible and, more importantly, to test the role played by the placement of the generalization layer in the architecture.

On that note the outstanding results on the CIFAR10 dataset corroborate the theory as well as the validity of invariance to coordinate representations principle from the discussion section.

In conclusion, to impose invariance of the model to the arbitrary coordinate representation of the data manifold, the generalization layer has to be placed after all the layers that are to generalize well. The experiments with a drop-out generalization layer placed after the encoder block, reported in Table 1, confirm these conclusions: with only 56 layers it outperforms by a margin the deepest, 1202-layer ResNet model from the original paper, He et al. (2016).

Further, we discuss common regularization techniques, place them into the perspective of this paper, and show that they are in line with its theory. Overall, we believe that "generalization by design" provides both theoretical and methodological novelties, and we hope to inspire a new line of research leading to better generalizing architectures.

References

  • A. Ali, E. Dobriban, and R. Tibshirani (2020) The implicit regularization of stochastic gradient flow for least squares. In International Conference on Machine Learning, pp. 233–244. Cited by: §2.
  • S. Amari (2016) Information geometry and its applications. Vol. 194, Springer. Cited by: Appendix A, Appendix A, Appendix A, Appendix A, §2, §3, §3, footnote 17, footnote 18, footnote 19, footnote 3, footnote 5.
  • A. Banerjee, I. Dhillon, J. Ghosh, and S. Merugu (2004) An information theoretic analysis of maximum likelihood mixture estimation for exponential families. In Proceedings of the twenty-first international conference on Machine learning, pp. 8. Cited by: Appendix B.
  • A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh (2005) Clustering with bregman divergences. Journal of machine learning research 6 (Oct), pp. 1705–1749. Cited by: Appendix A, Appendix B, Appendix B, Appendix B, §1, §3, footnote 20, footnote 9.
  • H. H. Bauschke, P. L. Combettes, et al. (2011) Convex analysis and monotone operator theory in hilbert spaces. Vol. 408, Springer. Cited by: Appendix B.
  • R. Bhatia (1997) Matrix analysis. Springer New York (eng). External Links: ISBN 9781461206538, 0387948465, 1461206537, 1461268575, 9780387948461 Cited by: Appendix A, §3.
  • N. Chentsov (1982) Statistical decision rules and optimal inference. transl. math. Monographs, American Mathematical Society, Providence, RI. Cited by: footnote 17.
  • L. Chizat and F. Bach (2020) Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp. 1305–1338. Cited by: §2.
  • F. R. Gantmakher (1959) The theory of matrices. Vol. 131, American Mathematical Soc.. Cited by: Appendix A, §3.
  • X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: §5, §5.
  • I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. Vol. 1, MIT Press, Cambridge. Cited by: §1, Figure 4, §5, §5.
  • M. B. Hauser (2018) Principles of riemannian geometry in neural networks. Cited by: Appendix A, Appendix A, Appendix A, §2, §2, §2, §3, §3, §3, §3, §3, §3, §5, Remark, footnote 14, footnote 18, footnote 8.
  • F. He, T. Liu, and D. Tao (2020) Why resnet works? residuals generalize. IEEE transactions on neural networks and learning systems 31 (12), pp. 5349–5362. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §5, §5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §2, §3.1, §3.1, Figure 6, Figure 8, Figure 9, 4th item, 6th item, §4.1, §6.
  • G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov (2012) Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. Cited by: §5.
  • J. Hiriart-Urruty and C. Lemaréchal (2012) Fundamentals of convex analysis. Springer Science & Business Media. Cited by: Appendix B, Appendix B, Appendix B.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §2.
  • M. Huh, H. Mobahi, R. Zhang, B. Cheung, P. Agrawal, and P. Isola (2021) The low-rank simplicity bias in deep networks. arXiv preprint arXiv:2103.10427. Cited by: §1, §2, §5.
  • [20] Y. Idelbayev Proper ResNet implementation for CIFAR10/CIFAR100 in PyTorch. Note: https://github.com/akamaster/pytorch_resnet_cifar10, Accessed: 20xx-xx-xx. Cited by: §4.2.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International conference on machine learning, pp. 448–456. Cited by: §1, §5, §5, footnote 15.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572. Cited by: §1.
  • D. Kalimeris, G. Kaplun, P. Nakkiran, B. L. Edelman, T. Yang, B. Barak, and H. Zhang (2019) SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019. Cited by: §2.
  • K. Kawaguchi, L. P. Kaelbling, and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468. Cited by: Appendix B, §1, §3.
  • M. Li, M. Soltanolkotabi, and S. Oymak (2020) Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 4313–4324. Cited by: §5.
  • Q. Li, C. Tai, and E. Weinan (2017) Stochastic modified equations and adaptive stochastic gradient algorithms. In International Conference on Machine Learning, pp. 2101–2110. Cited by: §2.
  • P. Nakkiran, G. Kaplun, D. Kalimeris, T. Yang, B. L. Edelman, F. Zhang, and B. Barak (2019) Sgd on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604. Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: §4.1.1.
  • D. A. Roberts (2021) Sgd implicitly regularizes generalization error. arXiv preprint arXiv:2104.04874. Cited by: §2.
  • O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.
  • F. Rousseau and R. Fablet (2018) Residual networks as geodesic flows of diffeomorphisms. arXiv preprint arXiv:1805.09585. Cited by: §2.
  • S. Sharma and S. Sharma (2017) Activation functions in neural networks. Towards Data Science 6 (12), pp. 310–316. Cited by: Appendix A.
  • L. N. Smith (2017) Cyclical learning rates for training neural networks. In 2017 IEEE winter conference on applications of computer vision (WACV), pp. 464–472. Cited by: 3rd item, §5.
  • L. N. Smith (2018) A disciplined approach to neural network hyper-parameters: part 1–learning rate, batch size, momentum, and weight decay. arXiv preprint arXiv:1803.09820. Cited by: §5.
  • S. L. Smith, B. Dherin, D. G. Barrett, and S. De (2021) On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176. Cited by: §2.
  • S. L. Smith and Q. V. Le (2017) A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §5.
  • V. Volhejn and C. Lampert (2021) Does sgd implicitly optimize for smoothness?. Pattern Recognition 12544, pp. 246. Cited by: §2.
  • M. J. Wainwright and M. I. Jordan (2008) Graphical models, exponential families, and variational inference. Now Publishers Inc. Cited by: Appendix B, Appendix B, Appendix B, footnote 20, footnote 21.
  • Z. Xie, I. Sato, and M. Sugiyama (2020) A diffusion theory for deep learning dynamics: stochastic gradient descent exponentially favors flat minima. arXiv e-prints, pp. arXiv–2002. Cited by: §2.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: Figure 1, §1, §2, §3, §3, §5, §5.
  • J. Zhang, B. Han, L. Wynter, K. H. Low, and M. Kankanhalli (2019) Towards robust resnet: a small step but a giant leap. arXiv preprint arXiv:1902.10887. Cited by: §2, 2nd item, §4.1, §4.2, footnote 14.

7 Supplementary Material

Appendix A Proof of Proposition 3.1

This section is dedicated to the proof of Proposition 3.1 from the main body of the paper and its consequences. For brevity it includes only the necessary minimum of definitions, and we refer to the excellent manuscript Hauser [2018] or other resources where needed. The section concludes with general remarks on the wider consequences of the statements proven.

From now on we use upper indices to denote coordinates, while lower ones index vectors/tensors. The Einstein summation convention is used whenever a pair of indices appears in an equation, unless stated otherwise.

Without loss of generality, noting that popular neural network architectures with ReLU activations can be seen as a limit of models using softmax (or softplus) activations, Sharma and Sharma [2017], consider a smooth, non-invertible map $f$ between the input manifold, with coordinates $x = (x^{1},\dots,x^{n})$, and the output manifold, with coordinates $\theta = (\theta^{1},\dots,\theta^{m})$. As such it defines a push-forward operator that acts on the tangent spaces and can be viewed as a generalized, coordinate-free derivative. We leave out the details and refer the interested reader to Hauser [2018].

Following the main document, we would like to link the generalization of the network to its structure, given by a composition of layer-to-layer maps, each defined by the layer dimension, activation function(s), and weights of the layer.

Because the input space is a Euclidean space with an orthonormal basis, it coincides with its tangent space. Let us fix an inner product on the input space; in our case it is the constant identity matrix. From this point on, the input manifold coincides with the input data layer, and we will use both notations interchangeably in this section.

Similarly, for the output layer (a probabilistic manifold of distributions defined by the choice of Bregman loss function corresponding to a cumulant function $\psi$) we have its metric tensor defined as

$g_{ij}(\theta) \;=\; \mathbb{E}\big[\,\partial_i \log p_\psi(y\mid\theta)\;\partial_j \log p_\psi(y\mid\theta)\,\big] \;=\; \partial_i\partial_j\,\psi(\theta)$   (6)

where $y$ is a random variable of the output-layer exponential family distribution derived from the dual Bregman divergence using (29), $\psi$ and $\phi$ being convex conjugates, see Section B or Banerjee et al. [2005], and where the second equality is a consequence of the dual flatness of the probabilistic manifold, Amari [2016], Theorem 2.1 therein.

An infinitesimally small line element in the output tangent space relates to an input vector via the Jacobian of the map:

(7)
(8)

where Einstein summation over the repeated index is used.

The shape of the output layer is given by its metric tensor, i.e., the Fisher information matrix (FIM) of the probabilistic model induced by the chosen Bregman divergence, see Section B or Amari [2016]. At a given point it is characterized by a positive semidefinite matrix (a Hessian), and its curvature can be analyzed by exploring its eigenvalues. In particular, the flatter the landscape w.r.t. the outputs of the network, which play the role of natural parameters of the induced exponential family (see Appendix B), the smaller the change in likelihood in a neighbourhood of those outputs, and thus the better the model generalizes.
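As a concrete, standard instance (our illustration, not taken from the paper): for a categorical output trained with the cross-entropy loss, the cumulant function is the log-sum-exp of the logits, and the FIM with respect to the logits $\theta$ reads

$$
\psi(\theta) = \log\sum_{k} e^{\theta^{k}}, \qquad
g_{ij}(\theta) = \partial_i\partial_j\,\psi(\theta) = p_i\,\delta_{ij} - p_i\,p_j, \qquad
p_i = \frac{e^{\theta^{i}}}{\sum_{k} e^{\theta^{k}}},
$$

whose eigenvalues shrink towards zero as the prediction becomes confident (some $p_i \to 1$), i.e., the likelihood landscape flattens in logit space.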

Back-propagated Inner Product of the Manifold of Distributions
Nevertheless, we would like to know how the log likelihood changes with regard to the input layer and how this depends on the model parameters. We follow the "flatness" of the loss landscape as the measure of generalization introduced above, but measure the curvature of the "back-propagated Hessian" in the input layer instead (a "pulled-back" metric: Hauser [2018] shows that this is equivalent to pulling back the output-layer frame bundle of the data manifold along the map $f$, and as such it is a well-defined operation; we omit the formal definitions for brevity and kindly refer the interested reader to Hauser [2018] and Amari [2016]). It is defined by its elements as follows:

$G_{ij}(x) \;=\; \frac{\partial f^{a}}{\partial x^{i}}\; g_{ab}\big(f(x)\big)\; \frac{\partial f^{b}}{\partial x^{j}}$   (9)

whenever the partial derivatives exist, with Einstein summation over the indices $a$ and $b$. By convexity, which follows from the choice of a Bregman divergence as the loss, $G(x)$ defines a real, positive semidefinite, symmetric quadratic form on the finite-dimensional input vector space.

This "back-propagated Hessian" carries information about the functional characteristics of the map $f$ along which we pull the outer Hessian back to the input layer (similarly to the Riemann-Christoffel (RC) curvature tensor, which captures the change of a vector transported back to its origin along a closed-loop curve; this round-the-world transport changes the original vector depending on the curvature of the manifold, see Amari [2016]), and we can use it to capture the degree of "generalization" of the map at a datum $x$.

We again emphasize that, for brevity, we leave out the details of defining the "pull-back" on frame bundles and refer to Hauser [2018], where this is done properly, ensuring that this "back-propagation" is well defined.

Max eigenvalue of the pulled-back metric and well-generalizing functions
In case the map were an (invertible) coordinate transformation, we would have the two metric tensors related by the standard Jacobian relation:

(10)

where the Jacobian matrix in general depends on the point at which it is evaluated. Local distances on the input and output manifolds would then relate through this Jacobian.

Since the network map is not a change of coordinates in general, we cannot use the last two expressions to relate distances in the tangent spaces in the usual way.

Instead, we make use of the dual flatness of the output layer, with the inner product induced by the choice of Bregman loss. A small local distance in the output layer can be written in output-layer coordinates as well as in input-layer ones as follows (using Einstein summation):

(11)

where the first equality follows from the output layer being a metric space, and we plugged in Eq. (7) and Eq. (6) to obtain the second equality. Moreover, following the notation of Amari [2016], a coordinate curve (in our case of a dually flat space it is a geodesic) defines a tangent vector as a partial derivative operator that acts on a differentiable function and gives its derivative in the direction of that coordinate curve.

The formula is understood accordingly as a composition of two derivative operators acting consecutively on differentiable functions along the respective coordinate curves, see Amari [2016], Section 5.

By the convexity of the Bregman generator we have that the pulled-back metric of Eq. (9) is a positive semidefinite, symmetric, real matrix of the dimension of the input layer, for every datum. As such its eigenvalues are all nonnegative. Let us denote its largest eigenvalue, evaluated at a given datum, as follows:

Definition A.1 ($\lambda_{\max}(x)$).

Given a datum $x$ and the smooth map $f$, the quantity $\lambda_{\max}(x)$ is defined as the largest positive eigenvalue of the positive semidefinite matrix $G(x)$ defined by Eq. (9),

$\lambda_{\max}(x) \;=\; \max\big\{\lambda \ge 0 \,:\, \det\!\big(G(x) - \lambda I\big) = 0\big\}$   (12)

where $\det$ and $I$ denote the determinant and the identity matrix, respectively.

Note that this is a standard definition; we restate it only to capture and emphasize its dependence on the input datum $x$.
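Numerically, $\lambda_{\max}(x)$ can be evaluated with automatic differentiation. The following is a hedged sketch (our construction, not the paper's code), assuming a softmax/cross-entropy output so that the outer metric is the Hessian of the log-sum-exp cumulant; the toy architecture is arbitrary:

```python
import torch

def lambda_max_pulled_back(net, x):
    # Pulled-back metric G(x) = J(x)^T H J(x), cf. Eq. (9), and its largest
    # eigenvalue, cf. Definition A.1, for a network producing logits theta.
    x = x.detach()
    theta = net(x).detach()
    J = torch.autograd.functional.jacobian(net, x)                # d theta / d x
    H = torch.autograd.functional.hessian(                        # outer metric = Hessian
        lambda t: torch.logsumexp(t, dim=0), theta)               # of the log-sum-exp cumulant
    G = J.T @ H @ J
    return torch.linalg.eigvalsh(G).max()

net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Softplus(),
                          torch.nn.Linear(16, 3))
x = torch.randn(4)
print(lambda_max_pulled_back(net, x))                             # lambda_max at datum x
```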

Remark (Generalization driven by the Jacobian of the network map).

To assess the degree of "flatness" of the loss landscape in a neighbourhood of a given input, the largest eigenvalue of the pulled-back metric can be used, as it defines the curvature of this neighbourhood, as depicted in Fig. 10. The larger $\lambda_{\max}(x)$, the more curved the loss landscape is (over all directions in the input space) in the neighbourhood of the input $x$. Note that this curvature is given by two factors: 1) the curvature of the probabilistic output manifold, i.e., how close the prediction is to the ML estimate of the induced probabilistic model, and 2) the derivatives of the network map w.r.t. its inputs, which capture how the outputs change as the inputs change. This is an essential concept when generalization relies on Jacobians (first derivatives of the network, or consequently of the loss if the output layer is taken into consideration) instead of the Hessian. See also Section 2 for the idea of generalization within the learning-in-the-manifold-of-distributions concept.

Figure 10: Pulled-back metric

Let us further note that Definition (A.1) is 'local' and 'differential' in the sense that it depends on a data point of the data manifold and is valid in an infinitesimal neighbourhood of that point (owing to the assumed smoothness) through the use of differential-geometric tools. This 'locality' is due to the fact that the outer-layer inner product (FIM) varies smoothly over the coordinates of the output layer, as a consequence of its dual flatness, and moreover the network map, which can be thought of as a "coordinate transformation" between tangent spaces, in general also changes nonlinearly with the input.

Next we restate Proposition 3.1, the proof of which is now a straightforward consequence of the preceding text:

Proposition A.1 (Proposition 3.1).

In the context of the above, assuming the activation functions used in the architecture of the network are 1-Lipschitz (for the definition see B.1), the pull-back metric from the output layer onto the input Euclidean manifold around a datum is, up to a constant, defined by the following positive semidefinite matrix:

(13)

where the matrix entering Eq. (13) is real, with elements:

(14)

is fully determined by the Bregman divergence chosen as the loss. Further, each path is weighted by a positive real function formed from the products of activation-function derivatives along that path, bounded by one under the 1-Lipschitz assumption, where the paths range over the set of all "back-propagation" paths connecting an input-layer node to an output-layer node through the network such that each layer has exactly one node present in the path. The corresponding path weight is then the product of all weights from the input node to the output node along the path.

Proof.

The first statement, Eq. (13), is the definition of the pulled-back metric, Eq. (9), rewritten in matrix form, where we take the Jacobian of the map with elements defined in Eq. (10), which has dimensionality (output dimension) × (input dimension).

The second statement, Eq. (14), follows from the above by writing the product of layer weight matrices as a sum over paths, together with the assumption that the activation functions are 1-Lipschitz, as defined in B.1. ∎

Corollary A.0.1 (Corollary 3.0.1, Spectral products, Informal).

Under the conditions of Proposition A.1, the largest eigenvalue of the pulled-back metric can be bounded from above by the following product of eigenvalues:

(15)

where the product runs over all layers of the network, each factor is the largest eigenvalue of the matrix comprising the weights of the corresponding layer, and the remaining factor is the largest eigenvalue of the positive semidefinite outer-layer metric tensor from Eq. (14).

Proof.

We can rewrite the matrix from Proposition A.1 in gradient back-propagation style as a product of weight matrices and activation derivatives over the layers:

(16)

where each factor consists of the weight matrix of a layer and the vector of that layer's activation-function derivatives, and the operation between a matrix and a vector of suitable dimension is defined as scaling the rows of the matrix by the corresponding entries of the vector.

Next we will make use of the following well-known matrix relations (see for instance Bhatia [1997], Gantmakher [1959]): the 'cyclic property of a trace',

$\operatorname{tr}(ABC) = \operatorname{tr}(CAB) = \operatorname{tr}(BCA)$   (17)

valid for any three real matrices $A$, $B$ and $C$ such that the products and traces involved are defined, and, for any real positive semidefinite $n \times n$ matrix $M$,

$\lambda_{\max}(M) \;\le\; \operatorname{tr}(M) \;\le\; n\,\lambda_{\max}(M).$   (18)

Applying (18) to (16), and further applying the 'cyclic property of a trace' (17) to reorder the products, we get:

(19)

where the last inequality follows by applying the left inequality in Eq. (18) and the 1-Lipschitz property (see B.1 for the definition) of the activation functions, which holds by assumption. The constant comes from the lower bound in Eq. (18) and is a product of the layer dimensions. By absorbing this constant, the statement follows. ∎
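The bound can also be checked numerically. The sketch below is our own construction under the assumptions of the corollary (1-Lipschitz activations; here tanh), comparing the largest eigenvalue of the pulled-back metric with the product of squared per-layer spectral norms times the largest eigenvalue of the outer metric, in the spirit of Eq. (15) and up to the dimensional constant:

```python
import torch

torch.manual_seed(0)
linear = [torch.nn.Linear(6, 8), torch.nn.Linear(8, 8), torch.nn.Linear(8, 3)]
net = torch.nn.Sequential(linear[0], torch.nn.Tanh(),
                          linear[1], torch.nn.Tanh(), linear[2])
x = torch.randn(6)

J = torch.autograd.functional.jacobian(net, x)                      # Jacobian of the network at x
H = torch.autograd.functional.hessian(lambda t: torch.logsumexp(t, 0),
                                      net(x).detach())              # outer metric (log-sum-exp cumulant)
lam_actual = torch.linalg.eigvalsh(J.T @ H @ J).max()               # lambda_max of the pulled-back metric
spec_prod = torch.prod(torch.stack(
    [torch.linalg.matrix_norm(l.weight, ord=2) for l in linear]))   # product of sigma_max per layer
lam_bound = spec_prod ** 2 * torch.linalg.eigvalsh(H).max()
print(float(lam_actual), float(lam_bound))                          # lam_actual <= lam_bound
```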

Appendix B Background on Bregman divergences, Exponential family and Notation

Let a neural network with $L$ layers be defined as the composition:

$f_w \;=\; f_L \circ f_{L-1} \circ \cdots \circ f_1, \qquad f_\ell(z) \;=\; \sigma_\ell\big(W_\ell\, z\big),$   (20)

where each vector function $f_\ell$ is an activation function $\sigma_\ell$ applied to the result of a matrix-vector product with the layer's weight matrix $W_\ell$. We denote the collation of all network weights into a single tensor by $w$.

For reasons to be revealed shortly, we define the loss function as a Bregman divergence, Banerjee et al. [2005]:

$d_\phi(x, y) \;=\; \phi(x) - \phi(y) - \langle \nabla\phi(y),\, x - y\rangle,$   (21)

where $\phi$ is a strictly convex, differentiable function.

Overall, following the generalization framework of Kawaguchi et al. [2017], we aim to minimize the total Bregman loss over a training dataset, indexed by some index set, with the hypothesis captured in the composition $f_w$:

(22)

From now on we will abuse notation and use $w$ to denote either all weights or a subset of them, depending on the context. We will also use an index instead of a function argument to denote the dataset over which the loss is evaluated, and we refer to the value of the loss over a batch accordingly.

Further, in this paper we consider back-propagation training of the network using stochastic gradient descent (SGD) with a constant learning rate over mini-batch samples:

(23)

In case the mini-batch is the whole dataset, we may refer to it as (full) gradient descent (GD) throughout the text.

Useful properties of the Bregman divergence

Minimizing the square loss, cross-entropy, and in general the negative log-likelihood (KL divergence), and many other objectives, can be suitably captured by the choice of a strictly convex, differentiable function $\phi$:

  • square loss: $\phi(x) = \tfrac{1}{2}\|x\|_2^2$, which gives $d_\phi(x, y) = \tfrac{1}{2}\|x - y\|_2^2$

  • KL divergence: $\phi(x) = \sum_i x^i \log x^i$ s.t. $\sum_i x^i = 1$, $x^i \ge 0$, which gives $d_\phi(x, y) = \sum_i x^i \log(x^i / y^i)$ (this also covers the use of the cross-entropy loss for classification tasks)

For reference see e.g. Banerjee et al. [2005], Hiriart-Urruty and Lemaréchal [2012].
Using the Bregman divergence as the loss function allows us to derive general results for a wide range of losses, including both classification and prediction problems.
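The following is a small, self-contained sketch (our illustration, not the paper's code) of a generic Bregman divergence built with automatic differentiation, recovering the two special cases listed above:

```python
import torch

def bregman(phi, x, y):
    # d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>
    y = y.detach().requires_grad_(True)
    grad_y, = torch.autograd.grad(phi(y), y)
    return phi(x) - phi(y) - grad_y @ (x - y)

sq = lambda v: 0.5 * (v * v).sum()               # phi for the square loss
negent = lambda v: (v * v.log()).sum()           # phi for KL on the simplex

x = torch.tensor([0.2, 0.5, 0.3])
y = torch.tensor([0.1, 0.6, 0.3])
print(bregman(sq, x, y), 0.5 * ((x - y) ** 2).sum())       # squared Euclidean loss
print(bregman(negent, x, y), (x * (x / y).log()).sum())    # KL(x || y)
```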

An important property of the Bregman divergence is that its derivative w.r.t. the first argument evaluates as

$\nabla_x\, d_\phi(x, y) \;=\; \nabla\phi(x) - \nabla\phi(y).$   (24)

Further, it can be shown that there exists an isomorphic dual space such that

$d_\phi(x, y) \;=\; d_{\phi^{*}}\big(\nabla\phi(y),\, \nabla\phi(x)\big),$   (duality)

where $\phi^{*}$ is the convex conjugate of $\phi$. For more details see Hiriart-Urruty and Lemaréchal [2012]. This notion will be crucial in developing the generalization-error surrogate in the next section.
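A standard worked example of this conjugacy (our illustration): the negative entropy on the probability simplex and the log-sum-exp function are convex conjugates, with the gradient maps exchanging mean-value and natural parameters,

$$
\phi(\mu) = \sum_i \mu^{i}\log\mu^{i}
\;\;\Big(\textstyle\sum_i \mu^{i}=1\Big)
\quad\Longleftrightarrow\quad
\phi^{*}(\theta) = \psi(\theta) = \log\sum_i e^{\theta^{i}},
\qquad
\nabla\psi(\theta) = \operatorname{softmax}(\theta).
$$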

Mapping between Exponential families and Bregman divergence

As presented in Banerjee et al. [2005], Theorem 4, there exists a one-to-one mapping between a regular exponential family of distributions, generated by a sufficient statistic and base measure, and a Bregman divergence, (informally)

(25)

where the related exponential family has the following form

$p_{\psi}(y \mid \theta) \;=\; b(y)\,\exp\!\big(\langle \theta,\, t(y)\rangle - \psi(\theta)\big)$   (26)

and

$\psi(\theta) \;=\; \log \int b(y)\, e^{\langle \theta,\, t(y)\rangle}\, d\nu(y)$   (27)

is uniquely determined given the base measure $\nu$ (note that an exponential family is defined with respect to some carrier measure; the density then corresponds to the Radon-Nikodym derivative of a distribution that is absolutely continuous w.r.t. the Lebesgue or counting (carrier) measure, for continuous and discrete random variables respectively, in alignment with Wainwright and Jordan [2008], Banerjee et al. [2005]). Further, for clearer notation and without loss of generality, we assume an embedding of the inputs into a real vector space of fixed dimensionality, equipped with the $\sigma$-algebra of Borel sets. To make this explicit, we let the random variable range over elements of this Borel $\sigma$-algebra.

Let the mean and natural parameters be denoted $\mu$ and $\theta$, respectively. Since $\phi$ and $\psi$ are Legendre duals, we also have the following known properties, see Wainwright and Jordan [2008]:

$\mu \;=\; \nabla\psi(\theta)$   (28)
$\theta \;=\; \nabla\phi(\mu)$   (29)

for all parameter values for which the respective gradients exist.

The conjugate function can be expressed as $\phi(\mu) = \langle \theta, \mu\rangle - \psi(\theta)$ (this follows from the definition of the conjugate and because the supremum is attained at $\theta = \nabla\phi(\mu)$; we skip the technicalities needed for the supremum to be attainable for the sake of space and brevity, see Wainwright and Jordan [2008] for details), and thus we can write the log likelihood of $y$ from Eq. (26) as

$\log p_{\psi}(y \mid \theta) \;=\; \langle \theta,\, t(y)\rangle - \psi(\theta) + \log b(y).$   (30)

Therefore, for any $y$ and $\theta$, with $\mu = \nabla\psi(\theta)$, we can write:

$\log p_{\psi}(y \mid \theta) \;=\; -\,d_\phi\big(t(y),\, \mu\big) + \phi\big(t(y)\big) + \log b(y).$   (31)

Max Likelihood in Exp. family

Assume the random variable $y$ follows a distribution from the exponential family w.r.t. some base measure, as defined in (26). The duality of $\psi$ and $\phi$ leads to the so-called Fenchel inequality; for reference see Wainwright and Jordan [2008], the variational representation of the cumulant function, Theorem 3.4:

$\psi(\theta) \;\ge\; \langle \theta,\, \mu\rangle - \phi(\mu),$   (32)

where $\theta$ belongs to the natural parameter space and $\mu$ is from the interior of the mean-value parameter space. Relating this to (31), we see that the right-hand side of (32) is a negative Bregman divergence.

It is well known that, by maximizing this lower bound, corresponding to maximum likelihood estimation, the inequality (32) turns into an equality if and only if the mean-value parameters are equal to the observed moments,

$\mu \;=\; \frac{1}{|O|}\sum_{n \in O} t(y_n),$

where $O$ denotes the index set of observed data points, Wainwright and Jordan [2008]. In such a case $\phi(\mu)$ is the negative Shannon entropy of the distribution matching the given moments, and this entropy is maximal among all such distributions, Wainwright and Jordan [2008], Hiriart-Urruty and Lemaréchal [2012].

Dually coupled Exponential family

There is an intriguing property of Bregman divergences stating that the Bregman divergence equals the Bregman divergence on the dual space defined by the gradient mapping $\nabla\phi$, for details see Bauschke et al. [2011], Banerjee et al. [2004].

Thus the "dual" Bregman divergence defines a dually coupled exponential family over the dual space (to avoid confusion with other mean values, we do not use the mean-value and natural-parameter labels for the duals in our setting).

In light of this duality there are two dually coupled parametrizations of the related exponential family:

  1. (primal) …defined by a cumulant function, with the neural network being a parametrized function of the sufficient statistics. The mean-value parameters are given by the targets, as given by Eq. (25).

  2. (dual) …defined by the conjugate cumulant function and sufficient statistics. In this formulation, the neural network parametrizes a subspace of the natural parameter space of the family.

Note that the gradient mapping to the dual space is fully determined by the choice of $\phi$ (or, equivalently, $\psi$). For instance, in the case of the square loss the gradient mapping onto the dual space is an identity map.

Definition B.1.

A function $\sigma: \mathbb{R} \to \mathbb{R}$ such that $|\sigma(u) - \sigma(v)| \le K\,|u - v|$ for all $u$ and $v$, where $K$ is a constant independent of $u$ and $v$, is called a $K$-Lipschitz function.

For example, any continuous function with first derivative bounded in absolute value by $K$ is $K$-Lipschitz.
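As a check of the 1-Lipschitz assumption used in Proposition A.1 (our remark), the common activation functions indeed have derivatives bounded by one,

$$
|\tanh'(u)| = 1-\tanh^{2}(u) \le 1, \qquad
\operatorname{softplus}'(u) = \frac{1}{1+e^{-u}} \le 1, \qquad
\operatorname{ReLU}'(u) \in \{0, 1\},
$$

so each of them is 1-Lipschitz by the criterion above.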