
Geometric Regularization from Overparameterization explains Double Descent and other findings

by   Nicholas J. Teague, et al.

The volume of the distribution of possible weight configurations associated with a loss value may be the source of implicit regularization from overparameterization due to the phenomenon of contracting volume with increasing dimensions for geometric figures demonstrated by hyperspheres. This paper introduces geometric regularization and explores potential applicability to several unexplained phenomena including double descent, the differences between wide and deep networks, the benefits of He initialization and retained proximity in training, gradient confusion, fitness landscape properties, double descent in other learning paradigms, and other findings for overparameterized learning. Experiments are conducted by aggregating histograms of loss values corresponding to randomly sampled initializations in small setups, which find directional correlations in zero or central mode dominance from deviations in width, depth, and initialization distributions. Double descent is likely due to a regularization phase change when a training path reaches low enough loss that the loss manifold volume contraction from a reduced range of potential weight sets is amplified by an overparameterized geometry.



1 Introduction

The aggregate geometry of possible weight configurations associated with a loss value has been an unexplored property of neural networks to our knowledge, likely due to intractability of derivation at such high dimensions. If one could model the full geometry of a loss manifold then backpropagation would not be required. We attempt to circumvent that challenge by considering meta properties of geometry that can be inferred independent of fine grained details.

A key contribution of this work is identifying the relationship between the extent of overparameterization and the volume of such geometry by relating it to a well understood property of hyperspheres, whose volume trends asymptotically to zero with increasing dimensionality. Such contracting volume should serve as a form of regularization by restricting degrees of freedom to weight sets along a training path anonymous_DeepReg, which we refer to as geometric regularization. Our histograms of loss values for randomly sampled weight sets show that, when loss is averaged across multiple training samples, these distributions often exhibit a consistently contracting volume below a central mode as the loss value decreases, although for single samples the histograms may exhibit multiple modes with either a dominant zero or central mode. Double descent is likely due to a regularization phase change when a training path reaches low enough loss that the loss manifold volume contraction from a reduced range of potential weight sets is amplified by an overparameterized geometry.

Our histogram experiments demonstrate a contributing factor to differences between network layer width and depth, as distributions aggregated for single training samples appeared to demonstrate directional correlations in dominant zero or central modes arising from variations along these axes, with wider networks transitioning the histograms towards a zero mode and vice versa. We thus suspect wider networks may be easier to train because architectures with dominant zero mode profiles have more minima than dominant central mode profiles, and further that proper initialization serves to find a balance between these two modes. Our discussions relate loss manifold geometry to several other phenomena, including He scaling, sparsity of deep networks, fitness landscapes, gradient confusion, and double descent in other learning paradigms.

2 Hyperspheres

Consider the equation for a three dimensional sphere: $x^2 + y^2 + z^2 = r^2$, where the volume is simply $V = \frac{4}{3}\pi r^3$, which for a unit sphere ($r = 1$) is $\frac{4}{3}\pi$. Now consider a hypersphere where we increase the number of dimensions governed by the similar formula $x_1^2 + x_2^2 + \dots + x_n^2 = r^2$. To visualize in an abstract way, consider the difference between a perfect sphere and a collection of fronds at the top of a palm tree. Perhaps surprisingly, for hyperspheres both the volume and surface area briefly increase with increasing dimensions until they reach a peak, after which point they progressively shrink to an asymptote at zero [Fig 1]. The paradox of hyperspheres is that with this decreasing volume and surface area, the expected distance between two sampled points will actually increase with parameterization Tu2002RandomDD. Due to the curse of dimensionality, once a manifold starts to reach thresholds beyond the order of 10 dimensions, evaluating fine-grained structure from random sampling becomes exponentially hard Intrinsic_Erba_2019. Mathematicians currently have a better understanding of hyperspheres in comparison to other high dimensional objects, even simple shapes like hypercubes Granata_accurateestimation. However, this type of zero asymptotic volume convergence appears likely to arise when a shape is constrained through dimensional adjustment to a single scale, which for a hypersphere could be the unit radius and for a loss manifold could be a specific loss value. We just don’t know where the peak of the volume curve would occur. It is possible the machine learning community may have found another framing that can approximate the location of such a peak by way of the double descent interpolation threshold.
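The rise and fall of hypersphere volume and surface area can be checked numerically. The sketch below is not taken from the paper. It uses the standard closed forms $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$ for the unit ball in $n$ dimensions and $S_n = 2\pi^{n/2} / \Gamma(n/2)$ for its boundary, where $n$ counts the dimensions of the surrounding space. The function names are illustrative.

```python
import math


def unit_ball_volume(n: int) -> float:
    """Volume enclosed by the unit hypersphere in n dimensions."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def unit_sphere_surface(n: int) -> float:
    """Surface area of the boundary of the unit ball in n dimensions."""
    return 2 * math.pi ** (n / 2) / math.gamma(n / 2)


dims = range(1, 31)
volumes = {n: unit_ball_volume(n) for n in dims}
surfaces = {n: unit_sphere_surface(n) for n in dims}

# Dimension at which each curve peaks before decaying towards zero.
peak_volume_dim = max(volumes, key=volumes.get)
peak_surface_dim = max(surfaces, key=surfaces.get)

for n in (1, 3, 5, 7, 10, 20, 30):
    print(f"n={n:2d}  V={volumes[n]:.6f}  S={surfaces[n]:.6f}")
print("volume peaks at n =", peak_volume_dim)
print("surface area peaks at n =", peak_surface_dim)
```

Under this convention the volume peaks at five dimensions and the surface area at seven, after which both shrink rapidly.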

One way to think about the loss function of a neural network is as an unconstrained formula with the weights as the variables, e.g. for cross entropy loss (written here in its binary form, with $\hat{y}_i(w)$ the network output for sample $i$) the formula is $L(w) = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i(w) + (1 - y_i)\log\left(1 - \hat{y}_i(w)\right)\right]$, and through backpropagation we are trying to minimize $L(w)$. However, when you consider that the fitness landscape will in general have a global minimum, the loss function through backpropagation is shifted in direction towards a minimum loss as $L(w) = L_{\min}$. This also applies to any given value for $L(w)$, that is, for any given loss, the formula is a constrained formula where each weight has some distribution of potential values associated with that loss, similar to how in a hypersphere there is some distribution of each variable associated with a specific radius. Thus $L(w)$ can be approximated as a constrained formula around the weight set associated with the global minimum as well as for losses in the backpropagation states preceding the global minimum, where the volume of the distribution of weights is expected to shrink as the loss approaches the global minimum, as there become fewer weight sets capable of achieving better performance, and the volume will converge to a point (a single weight set) at the global minimum.

Figure 1: Volume (V) and Surface Area (S) of unit hypersphere with increasing dimensions (n). wikipedia_image

Gaps in loss manifold volume refer, for a training path, to directional updates that result in increased loss from the prior epoch, which will not be considered a viable path barring some form of momentum or other deviation from a pure gradient signal. The volume of a loss manifold can be considered in two framings: the volume of weight set distributions that are capable of realizing an exact loss value, or the volume of weight set distributions that are capable of achieving a loss lower than the prior epoch, in which case the second framing’s volume plus the volume of the corresponding gaps will equal the volume of the full range of initialization sampling. Note that an initialization sampled from a normal distribution means that there will be a (vanishingly small) probability of an infinite range of possible weight values sampled at initialization, thus when we talk about the volume of initialization sampling it needs a probabilistic element to be meaningful, e.g. volume of initialized weights for


Now drawing further on our geometry analogy, the properties of volume and surface area for increasing dimensions translated to our high dimensional loss function is really just another way of saying that with increasing dimensions by parameterization the degrees of freedom associated with each weight corresponding to a given loss value will be diminished, somewhat similar to what happens with L1 regularization, which promotes collective sparsity of a weight set bengio2012practical. However, here we are not talking about the sparsity of a single collective weight set, but rather the sparsity of the distribution of all weight sets corresponding to a loss value (a loss manifold). This implies that for a given weight $w_i$, the sparsity of the distribution of $w_i$ corresponding to a given loss value will also increase with increasing parameterization.

The main idea of regularization theory is to restrict the class of admissible solutions by introducing a priori constraints on possible solutions WebbFunctional. Thus, with increasing parameterization, the trend toward decreasing volume and surface area of geometric figures implies a trend towards increasing sparsity of each weight’s distribution associated with a loss value. This enforces a kind of regularization by constraining degrees of freedom for weight distributions traversed through a training path, which is the explanation for regularizing effects of overparameterized networks. The expectation is that, if an adjacent regularizer is not dominant, a phase change to geometric regularization will emerge as a training path reaches loss values approaching a global minimum, due to the overparameterized geometry aligning with the distribution volume reductions from fewer weight sets capable of achieving such loss, explaining the emergence of double descent. This can be inferred because we know that at the global minimum the loss manifold will have one intrinsic dimension, so we expect that through training the path will traverse a loss manifold tendril of progressively shrinking dimensionality, imposing progressive geometric regularization once that surrounding intrinsic dimensionality retracts across its own peak in a volume versus dimensions curve.

This is somewhat of a conjecture, but given the supporting evidence of geometric properties, our histogram experiments described below, and the reality of double descent, we believe it is the most credible hypothesis to date.

3 Related Work

Overparameterization is commonly considered as the set of neural network architectures with number of weights exceeding the threshold where the count of weights equals that of training samples, although distinctions such as mild overparameterization versus heavy overparameterization may also come into play. A notable unexpected property of the overparameterization regime is that the conventional wisdom for the bias-variance tradeoff in training appears to be contradicted, with emergence of a double descent training curve Belkin15849, in which progressing through epochs initially manifests overfit that reaches a peak before continued training results in a recovery of test performance to realize a better generalization than what was achieved prior to the overfit state.

The convention appears to have several benefits. Empirical studies demonstrate reduced risk of overfit NEURIPS2018_54fe976b and remarkably small generalization error DBLP:conf/iclr/ZhangBHRV17. The benefits appear to manifest across frameworks and modalities of application. Overparameterized models appear to result in smoother fitness landscapes with a smaller ratio of saddle points to global minima pmlr-v139-simsek21a. The resulting models appear more robust to covariate shift, meaning distributional discrepancies between train and test data with retained label correlations Tripuraneni_overparam, and their interpolations are smoother with a smaller Lipschitz constant bubeck2021a. Although increasing parameters has resource and latency impacts to inference, the resulting models can often be pruned with little or no cost to generalization Barsbey_heacytails.

Although there can be a not insignificant increase in time, cost, and complexity of training these models, the resulting performance characteristics appear to more than offset it, and progressively higher thresholds of overparameterized transformers NIPS2017_3f5ee243 have led to natural language implementations with few shot learning capabilities like GPT-3 NEURIPS2020_1457c0d6 and several emerging foundation models since bommasani2021opportunities.

The source of this phenomenon has been somewhat of a mystery to researchers. It has been demonstrated empirically that size alone does not explain it, and that some form of capacity control or implicit regularization is at play DBLP:journals/corr/NeyshaburTS14. The phase transition to a double descent phenomenon has been directly linked to varying the ratio between number of parameters and samples in unregularized networks NEURIPS2020_37740d59. There appears to be some relevance to model initialization, as models tend to learn a network close to the initialized random weights NEURIPS2018_54fe976b. Counter to classical stochastic optimization theory, these models appear to train better with a constant SGD learning rate without momentum pmlr-v119-sankararaman20a. Perhaps even more perplexing, aspects of the phenomenon are not limited to neural networks, with a similar double descent curve and generalization benefits with increasing model complexity being demonstrated in other paradigms like kernel methods, nearest neighbors pmlr-v80-belkin18a, and decision tree paradigms like random forest and gradient boosting.


Noise injections to training features result in an increased threshold for the number of parameters needed to reach the overparameterization regime pmlr-v139-dhifallah21a, which we speculate is associated with additional perturbation vectors causing an increase to the intrinsic dimension of the modeled transformation function, in a manner similar to data augmentation’s impact on intrinsic dimension.


The practice has mostly positive impacts to performance, although it does have the possibility of lazy training, where a model converges exponentially fast to zero training loss to recover a linear model NEURIPS2019_ae614c55, which may occur under some choices of hyperparameters. Properly sampled initializations benefit the optimization pmlr-v28-sutskever13 and help to avoid the undesirable lazy training phenomenon stoger2021small. There have been reports of degradation of performance on under-represented subgroups, however this appears to resolve with better calibration to a classification output layer menon2021overparameterisation.

Theoretical study of the phenomenon has followed several branches, and this paper’s survey isn’t exhaustive. An influential channel of inquiry was to consider neural layers approaching the infinite width limit, where the network’s modeled function becomes a Gaussian distributed process at initialization, an assumption that underlies the neural tangent kernel equivalency NEURIPS2018_5a4be1fa that can be used to represent networks as a kernel function. This equivalent kernel’s positive definiteness fasshauer2011positive can be used to evaluate network convergence properties. Other researchers have attempted to reason about translations in fitness landscape properties between different parameter regimes, which are closer to the theme of this work.

An overparameterized network is capable of learning any function represented by a corresponding network of fewer parameters Sun_SampleEffic. This property appears to extend to networks of discrete activations learning the functions of smaller networks of smooth activations, as has been proven for three layer networks modeling two layers NEURIPS2019_62dad6e2. Networks have a universal connectivity property, so that if a modeled function may exist in parameter space it at least has the potential to be reached in training from a random initialization pmlr-v80-draxler18a, a finding that has also been extended to ReLU activations freeman2017topology. It has been considered that a point in the fitness landscape of a network will translate to a corresponding manifold in the fitness landscape of a larger network pmlr-v139-simsek21a. This observation is the closest we’ve seen in the literature to reasoning about geometry of loss function distributions as is the focus of this paper, and we’ll comment on this further in Section 8.

There also appear to be some differences depending on whether the source of overparameterization is increasing network width or depth. It has been observed that wider networks are easier to train NEURIPS2018_5a4be1fa, while deeper models have an implicit bias towards sparsity gissin2020the. Expected parameterization needed to reach generalization has been considered higher for deeper than wider networks, although it has recently been suggested that a mild overparameterization can also be used for deep networks chen2021how. Deeper models have been shown to be more efficient at modeling higher complexity functions than shallow networks pmlr-v49-eldan16, although we expect this may be from a different mechanism as will be discussed in this paper. Gradient confusion refers to negatively correlated gradients between mini-batches, which has been shown to trend higher with deeper networks pmlr-v119-sankararaman20a. One of the considerations of this work is that wide or deep networks may have different loss landscape properties manifesting as left tail thickness in the distribution of loss values for randomly sampled weight sets, resulting in different transitions of regularization intensity by surrounding volume deltas seen by a training path approaching a minimum.

Implementing overparameterization in practice involves several considerations. Theorists have suggested choosing a width based on the point at which learning algorithms can provably learn a zero loss in non-convex training, and then, if increasing the number of training samples, the parameterization can be increased by widening layers in a corresponding manner Song_Subquadratic. Deeper networks may realize similar benefits to wider networks with a common degree of mild overparameterization chen2021how, although some of our further discussions will demonstrate why a distinction may still be merited. Non-smooth activations like ReLU are still appropriate and train faster than smooth activations Panigrahi2020Effect. As noted above, vanilla SGD with a constant learning rate has been found to outperform scheduled learning rate methods pmlr-v119-sankararaman20a. The He initialization heuristic appears to lie at the boundary of the well-performing regime Song_Subquadratic, which is an important consideration since the models tend to learn a model close to initialization. Note that overparameterization can even be achieved by introducing intermediate linear layers that after training can be contracted algebraically to realize a more compact model NEURIPS2020_0e1ebad6.

4 Histograms

We sought to visualize loss manifold geometries by exploring meta properties of weight set distributions with a kind of Monte Carlo evaluation to derive histograms of binary cross-entropy loss values realized from randomly initialized weight sets [Appendix B]. In order to realize meaningful insights with a reasonable number of samples, we stripped the domain down to a minimum configuration, essentially modeling a network of one or more dense ReLU layers with input of three tabular features and returning a sigmoid for binary classification, with loss evaluated based on inference without training and deriving a binary cross-entropy loss from one or more training samples. Although these histograms don’t reveal the distribution volume itself, they do demonstrate the relative volumes between different loss values for a given network configuration, as a loss with a higher number of possible weight configurations should stochastically demonstrate an increased probability of representation from random sampling, with improved fidelity from an increased number of samples. By comparing trends across histograms in varied network configurations, we can infer aspects of distribution property effects realized from different degrees of overparameterization, width, depth, activation functions, initializations, etc. At the scale of sampling we applied, these histograms often didn’t include representation of weight configurations associated with the global minimum. However, in many cases, and especially when considering loss from a single training sample, there were characteristic shapes and trends demonstrated in the left tail of the distribution, which is the area of most interest for considerations of double descent.
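The sampling procedure described above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the code from Appendix B. The single Gaussian training sample, the fixed label of 1, the He-normal weight scale, the absence of bias terms, and the names `sample_losses`, `widths`, and `n_draws` are all assumptions made for the example.

```python
import numpy as np


def sample_losses(widths, n_draws=5000, seed=0):
    """Binary cross-entropy of one training sample under random weight draws.

    Assumed setup: 3 input features, dense ReLU layers of the given widths,
    a sigmoid output, He-normal weights, and no bias terms.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=3)  # one training sample with three tabular features
    y = 1.0                 # its binary label (assumed)
    losses = np.empty(n_draws)
    for i in range(n_draws):
        h = x
        for width in widths:
            fan_in = h.size
            w = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(width, fan_in))
            h = np.maximum(w @ h, 0.0)  # dense layer followed by ReLU
        w_out = rng.normal(scale=np.sqrt(2.0 / h.size), size=h.size)
        logit = w_out @ h
        # Numerically stable BCE computed from the logit of the sigmoid output.
        losses[i] = np.logaddexp(0.0, logit) - y * logit
    return losses


losses = sample_losses(widths=[6, 6])
counts, edges = np.histogram(losses, bins=60)

# Share of draws landing exactly on the loss of a 0.5 sigmoid output.
central_mode_share = np.mean(np.isclose(losses, np.log(2.0)))
print("bin with most draws starts at loss", edges[np.argmax(counts)])
print("share of draws at ln(2):", central_mode_share)
```

Passing different `widths` lists (for example `[6, 6]` versus `[4, 4, 4]`) gives histograms that can be compared across architectures in the way the text describes.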

Figure 2: One training sample, dominant central mode

There were several characteristic patterns that emerged which we will survey in the appendix. Although some of the peaks phased in and out with small variations of parameters and configurations in a manner resembling the appearance of harmonics, we suspect the effect was from bin boundaries established by sampled minimum, and in aggregate trends were still visible.

A common characteristic of these histograms was the presence of a central mode in the distribution (visible as a peak), which appeared to universally align with the loss value realized from a 0.5 sigmoid output. It turned out the peak was an artifact of the use of ReLU activations, which at such small widths may often return all zero values in a preceding layer. When we ran comparable setups with tanh for a smooth activation, the mode transitioned from a single value peak to a curve, although it was still present. In some cases for ReLU activations a second mode would appear. We speculate that the zero value secondary mode, as visible in Figure 2, might be associated with the phenomenon of lazy training NEURIPS2019_ae614c55 noted above. In a small number of cases a second mode also appeared in the right tail.
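The link between small widths and the central peak can be illustrated with a short simulation. It assumes zero biases and weights drawn independently from a zero-centered Gaussian, neither of which the text specifies. Under those assumptions each ReLU unit outputs zero with probability 1/2, so a layer of width $k$ is entirely zero with probability $2^{-k}$. The logit is then 0, the sigmoid output is 0.5, and the binary cross-entropy is $\ln 2 \approx 0.693$ for either label. The function name is illustrative.

```python
import numpy as np


def dead_layer_fraction(width, n_draws=100_000, seed=0):
    """Estimate how often every ReLU unit in a layer outputs zero.

    Assumes 3 input features, zero biases, and i.i.d. zero-mean Gaussian weights.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=3)                    # one fixed input sample
    w = rng.normal(size=(n_draws, width, 3))  # n_draws independent weight matrices
    pre_activations = w @ x                   # shape (n_draws, width)
    return np.mean(np.all(pre_activations <= 0.0, axis=1))


# Loss realized by a 0.5 sigmoid output, identical for labels 0 and 1.
central_mode_loss = -np.log(0.5)

for width in (1, 2, 4, 6):
    estimate = dead_layer_fraction(width)
    print(f"width={width}  simulated={estimate:.4f}  predicted={0.5 ** width:.4f}")
print("central mode loss:", central_mode_loss)
```

The all-zero probability halves with every added unit, which is consistent with the peak being most prominent in the narrowest configurations.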

We focused most of our attention on histograms derived from a single sample to promote trend visibility. Two characteristic patterns appeared, one with the central mode and a reduced left tail [Fig 2], and the second with progressive volume towards a dominant zero mode [Fig 3]. In some cases these two patterns appeared with the same network configuration when evaluated against different training samples without He scaling. We expect these two cases may align with what other researchers have shown when training on a single sample, where a model either memorizes or learns a representation DBLP:conf/iclr/ZhangBHRV17.

Figure 3: One training sample, dominant zero mode

The distinction between evaluations of a single training sample versus multiple training samples was notable. When loss was averaged over multiple training samples the left tail representation (for values below the primary mode) was greatly diminished, in most cases not visible at our sampling rate [Fig 4], although we knew that it existed because training a model with a given architecture would realize a loss value well below the minimum sampled in the histogram. This is consistent with expectations that a weight set that can represent transformations of multiple samples is much rarer than one representing a single sample. That this kind of central mode dominated distribution would arise even when aggregating loss across samples with majority zero mode dominant distributions like Figure 3 suggests that the zero mode dominant distributions for each sample have weight set distributions that are mostly non-overlapping.
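The shrinking left tail can be illustrated by scoring the same random weight draws against one sample and against the average over many. This is a sketch under assumed conditions rather than the paper's experiment. It uses Gaussian features, random binary labels, He-normal weights without biases, and an arbitrary threshold of 0.5 (below the central mode at $\ln 2$) to define the left tail. All function and parameter names are hypothetical.

```python
import numpy as np


def bce_from_logits(logits, labels):
    """Numerically stable binary cross-entropy for sigmoid outputs."""
    return np.logaddexp(0.0, logits) - labels * logits


def sample_logits(features, widths, n_draws, rng):
    """Logits of shape (n_draws, n_samples) from random He-normal ReLU networks."""
    out = np.empty((n_draws, features.shape[0]))
    for i in range(n_draws):
        h = features  # shape (n_samples, fan_in)
        for width in widths:
            fan_in = h.shape[1]
            w = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, width))
            h = np.maximum(h @ w, 0.0)
        w_out = rng.normal(scale=np.sqrt(2.0 / h.shape[1]), size=h.shape[1])
        out[i] = h @ w_out
    return out


def left_tail_shares(threshold=0.5, n_samples=50, n_draws=5000,
                     widths=(6, 6), seed=0):
    """Share of weight draws below `threshold` for one sample vs. the average."""
    rng = np.random.default_rng(seed)
    features = rng.normal(size=(n_samples, 3))
    labels = rng.integers(0, 2, size=n_samples).astype(float)
    losses = bce_from_logits(sample_logits(features, widths, n_draws, rng), labels)
    single = np.mean(losses[:, 0] < threshold)           # first sample only
    averaged = np.mean(losses.mean(axis=1) < threshold)  # mean over all samples
    return single, averaged


single_share, averaged_share = left_tail_shares()
print("single-sample left tail share:", single_share)
print("50-sample averaged left tail share:", averaged_share)
```

With random labels, a weight draw must fit many samples at once to land in the averaged left tail, so that share is far smaller than the single-sample share.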

Figure 4: 50 training samples, invisible left tail

5 Width versus Depth

Some discussions on the impact of increasing parameterization by network width or depth were noted above in the related work section, including the points that wider networks appear easier to train NEURIPS2018_5a4be1fa and deeper networks appear biased towards sparsity gissin2020the and are more efficient at modeling higher complexity pmlr-v49-eldan16.

The histogram framing can help illustrate implications, but first a brief conjecture on the subject, as we believe aspects of these phenomena may be simpler than expected. Consider the toy example where you have 3 features fed to 2 dense layers of 6 neurons. In this case the 18 weights from the 6 first layer neurons influence the 36 weights of the 6 second layer neurons, resulting in an 18:36 ratio of influence. Now consider the alternate configuration with a comparable parameter count where you have 3 dense layers of 4 neurons. In this case the 12 weights from the 4 neurons of the first layer influence a combined 32 weights of the layer 2 and 3 neurons, and the 16 weights of the 4 second layer neurons also influence 16 weights of the 4 third layer neurons, resulting in a 12:32 plus 16:16 ratio of influence from a comparable number of weights. We suspect each of these channels of influence adds an additional piece of complexity capacity to the model. For the two configurations, when realizing the same loss value they are modeling from the same set of candidate functions. Because the deeper network has more complexity potential, it can represent the same set of functions with sparser weight sets.
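The weight counts in this toy example can be reproduced with a small helper. The sketch counts only hidden-layer weights, ignoring biases and the output layer, which is the counting that matches the numbers quoted in the text. The function names are illustrative.

```python
def hidden_weight_counts(n_features, widths):
    """Number of weights feeding each dense hidden layer (biases excluded)."""
    counts, fan_in = [], n_features
    for width in widths:
        counts.append(fan_in * width)
        fan_in = width
    return counts


def influence_pairs(counts):
    """Pair each layer's weight count with the total weights in later hidden layers."""
    return [(counts[k], sum(counts[k + 1:])) for k in range(len(counts) - 1)]


wide = hidden_weight_counts(3, [6, 6])     # two dense layers of 6 neurons
deep = hidden_weight_counts(3, [4, 4, 4])  # three dense layers of 4 neurons

print("wide:", wide, "influence:", influence_pairs(wide), "total:", sum(wide))
print("deep:", deep, "influence:", influence_pairs(deep), "total:", sum(deep))
```

Under this counting the wide configuration has a single channel of influence and 54 hidden-layer weights, while the deep configuration has two channels and 44.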

This sparsity property would also be consistent with deeper networks increasing the intrinsic dimension of the modeled function, as when sampling from a high dimensional manifold, increasing the intrinsic dimension has the effect of making an increasingly large range of distances devoid of observations Granata_accurateestimation. As an aside, the finding that Gaussian noise injected to features increased the interpolation threshold proportionately to the number of independent perturbation vectors pmlr-v139-dhifallah21a may suggest that the commonly accepted formula of threshold derivation based on weight count equal to sample count might need more variables associated with data set intrinsic dimension. We did not see this consideration otherwise discussed in the literature in such a precise manner, although Belkin15849 alluded to it by noting that regularization can attenuate the interpolation peak by changing the capacity of the function class.

An interesting revelation from the histograms was that the influence of width and depth appeared to produce inverse directions in transitions between regimes similar to Figure 2 and Figure 3. Histograms following the central mode dominated Figure 2 notably shifted in the direction of the zero mode dominated Figure 3 with increasing width, and increasing depth shifted exactly in the counter direction of increased kurtosis to the central mode. The trends endured when inverting labels or adding batch normalization.

6 Tail Properties

We thus believe there may be an additional contributor to geometric regularization in the left tail of these distributions that is associated with architecture considerations other than just parameter count, and arises from what is manifesting as zero or central mode dominance in our histograms that correlates with width and depth configurations, so that in addition to an aggregate weight distribution manifold volume contraction with overparameterization, there may also be potential for a tail volume contraction associated with architecture considerations [Appendix E].

It is probably worth reiterating that most of these discussions so far have focused on histograms derived from single training samples. As additional samples are added to the inference basis of the loss values, the left tail of the histogram distribution quickly shrinks to invisibility for our extent of sampling, which aligns with intuition. Even though we don’t have visibility of this tiny left tail, we can infer properties from what we have demonstrated takes place in loss manifold distributions on single training samples. After all, the aggregate binary cross entropy loss value is derived from nothing more than the average of the loss for each training sample. Thus if we can identify trends in the histograms of single training samples, we can expect they will also manifest in the invisible left tail of the aggregate histogram across all training samples.

We did not find that the aggregate histograms strongly aligned with any of the traditional left tail bounded distributions like lognormal, gamma, or Weibull (we evaluated a few with statistical tests using the Wolfram Language WolframLanguage by deriving distribution parameters with FindDistributionParameters and then deriving a p-value with DistributionFitTest). However, they still demonstrated some characteristic features of single mode distributions, and after averaging across multiple samples any secondary modes appeared to contract towards the central mode until losing visibility [Appendix G]. Among those features was the presence of a single mode, what appeared to be an unbounded right tail, and a bounded left tail. Again, this left tail in many cases became invisible at our sampling rate, however we could infer that such a tail exists from the loss value of a comparable architecture trained to overfit and from the connectivity principle pmlr-v80-draxler18a.

We knew from the depth and width experiments that wider networks will have a greater proportion of low loss value weight configurations available than equivalently parameterized deep networks, which also aligns with the ratio of influence framing noted above. What we didn’t know is what the aggregate loss manifold volume was in comparisons between wider and deeper networks at a common loss value. Just because a deeper network has a stronger representation in the central mode region of sampled loss values, it might still have similar volumes of loss manifolds at points in the left tail in comparison to wider networks. The histograms only reveal relative distribution volume relationships associated with different loss values for a common architecture within the same sampling operation. We attempted to circumvent this challenge by continued sampling of single sample configurations until reaching a common threshold for number of sampled left region values between configurations in order to evaluate the region in isolation. This approach yielded a similar pattern, with increasing depth causing a transition from zero mode to central mode dominated characteristics.

This finding suggests that for aggregated loss values across multiple training samples, where the left tail region becomes invisible to our sampling assessment and these zero mode versus central mode dominated profiles average out across samples to a central mode profile, the path followed by backpropagation towards minimum loss will traverse through a different profile of loss manifold geometry transitions in wider versus deeper networks. We expect the left tail region of wider networks will thus have thicker tails in this histogram space than corresponding deeper networks, suggesting that the geometric regularization experienced by deeper networks will be of greater intensity once reaching sufficient depth into the left tail [Appendix E]. We expect that the convention that thicker tails correspond to increased kurtosis may not apply for this consideration, since the emergence of tail thickness is associated with presence of a second mode in the aggregated loss values as opposed to a dispersion from the central mode. The idea that deeper networks will have greater constraints on weight configurations is also consistent with the finding that deeper networks trend towards more sparse weight configurations gissin2020the.

To demonstrate an extreme example, consider network architectures approaching infinite width. The neural tangent kernel framing NEURIPS2018_5a4be1fa suggests that these will converge to weights modeling a Gaussian distributed process at initialization. This known property of the modeled function may become an interesting channel for researchers to relate histogram properties to properties of the resulting function. When we modeled large width scenarios in comparison to correspondingly parameterized deeper models with a tanh activation, which didn’t exhibit the absolute zero mode like ReLU, we found the dominant mode also trended leftward when approaching asymptotic width [Appendix C].

We thus believe that wide and deep architectures may offer a tradeoff between number of minima and complexity capacity of the network as evidenced by zero mode dominance and trends toward sparsity, and we’ll now show that proper initialization scaling has a similar impact that may help to retain balance between these two regimes.

7 He Initialization

Proper initialization benefits optimization pmlr-v28-sutskever13, and He initialization he2015delving, which scales weight sampling based on the dimension of the inputs to a layer, appears to lie at the boundary of the well-performing regime Song_Subquadratic. Note that He differs from Xavier initialization by not including the layer output dimensions in the scale derivation’s denominator, resulting in a larger scale, and has been demonstrated as more suitable for nonlinear activations like ReLU kumar2017weight.

He initialization samples from a Gaussian, and a similar scaling may be adapted for sampling from a Uniform distribution. One of the experiments involved aggregating histograms from initializations by Normal and Uniform, and then comparing each again at other scales. Interestingly, He scaled Normal [Fig 2] and Uniform shared a similar appearance of histogram characteristics, which for the demonstration architecture exhibited a balance between central and zero mode. However, when we increased the scale of Normal and compared it to a corresponding Uniform variation, Uniform appeared quicker to shift into a dominant zero mode [Fig 5], suggesting that Normal is a more stable configuration than Uniform [Appendix F].
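The scaling conventions compared in this section can be sketched as below. The formulas are the commonly cited forms (He: variance 2/fan_in; Xavier: variance 2/(fan_in + fan_out)), and the Uniform limit is chosen so that U(-a, a), whose variance is a²/3, matches a given standard deviation. The `scale` multiplier is a hypothetical knob standing in for the "increased scale" variations, and the function names are illustrative rather than taken from the experiments.

```python
import math
import random

def he_std(fan_in):
    """He scaling: standard deviation depends only on the layer input dimension."""
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    """Xavier scaling: the output dimension also appears in the denominator."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def uniform_limit(std):
    """Limit a such that U(-a, a) has the given standard deviation (var = a^2 / 3)."""
    return math.sqrt(3.0) * std

def sample_weights(fan_in, fan_out, dist="normal", scale=1.0, seed=0):
    """Sample a fan_in x fan_out weight matrix at `scale` times the He scaling."""
    rng = random.Random(seed)
    std = scale * he_std(fan_in)
    if dist == "normal":
        draw = lambda: rng.gauss(0.0, std)
    elif dist == "uniform":
        limit = uniform_limit(std)
        draw = lambda: rng.uniform(-limit, limit)
    else:
        raise ValueError("dist must be 'normal' or 'uniform'")
    return [[draw() for _ in range(fan_out)] for _ in range(fan_in)]

w_normal = sample_weights(100, 100, dist="normal")
w_uniform = sample_weights(100, 100, dist="uniform")
```

Matching the variance of the two distributions, as done here, is one reasonable way to make a Normal and Uniform comparison "corresponding"; the Uniform samples remain bounded by `uniform_limit(std)` while the Normal samples are not.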

Figure 5: Random Uniform initialization with increasing scale

Increasing the initialization scale shifted the histogram towards a dominant zero mode, an effect similar to widening layers. We interpret this alignment as suggesting that networks have a larger proportion of low loss weight configurations when the inputs to layers have a greater range of activations. The introduction of batch normalization did not appear to change this property, suggesting that it was arising not from weight magnitude but from spread. Increasing initialization scale is known to help break symmetry between units GoodBengCour16 and impact the implicit regularization of gradient descent Ba2020Generalization. The paradox of increased regularization with scale pmlr-v75-li18a in conjunction with a zero mode dominated tail possibly suggests that for wider networks, which also approach a zero mode, the minima may be more spread out in the loss manifold than for deeper networks. Note that the mode balance demonstrated with He scaling may become a channel for aligning width and depth configurations, e.g. when adding more training samples.

8 Other Related Phenomena

This section offers a speculative survey of how geometric properties of the loss manifold might be related to other phenomena observed with overparameterization.

  • Gradient confusion pmlr-v119-sankararaman20a: With a thicker left tail in the histogram space as well as a tendency for reduced sparsity, the set of weight configurations that can approximate a modeled function will be larger for wider networks, and the corollary is that deeper networks will have a greater density of gaps in the weight distribution manifold associated with loss values exceeding that realized in the prior epoch, such that the statistical variations between mini-batches may result in diverting the direction of a training path due to proximity to one of these gaps, causing increased gradient confusion for deeper networks.

  • Retained proximity to initialization NEURIPS2018_54fe976b: The volume contraction of a weight distribution manifold in overparameterization converges to the mass of a hypersphere being concentrated in its center, or in the infinite dimensional case to a projection in a one-dimensional world as a single point kuketayev2013probability, explaining initialization proximity. It is a kind of paradox of the curse of dimensionality that with this convergence, the expected Euclidean distance between two sampled points will increase with dimensionality.

  • Smoother fitness landscape pmlr-v139-simsek21a: Borrowing the analogy that a point on a fitness landscape will translate to a manifold on a higher dimensional landscape, if the points on the higher dimensional fitness landscape are spread further apart Tu2002RandomDD, then the landscape should appear to have a smoother characteristic when comparing comparably scaled weight deltas to an underparameterized model, which possibly explains the benefit of constant learning rates pmlr-v119-sankararaman20a. The paradox of a point translating to a manifold of decreased volume also suggests some form of contraction from a manifold of similar points on an underparameterized landscape into an aggregated smaller volume manifold with overparameterization. The observation of reduced concentration of saddle points in an overparameterized fitness landscape pmlr-v139-simsek21a suggests that these points have a higher prevalence of consolidations.

  • Regularization dampening double descent Belkin15849: We suspect the result that regularization dampens the interpolation peak at a minimum is associated with a more consistent degree of regularization through training, as geometric regularization may not be a dominant feature until the training path reaches well into the left tail of the loss manifold distributions. Without a secondary regularization at play a training path in effect will experience a phase transition from no regularization to dominant geometric regularization near the interpolation peak.

  • Dropout regularization JMLR:v15:srivastava14a: This is speculative, but our discussions related to wide vs deep networks may be relevant to dropout regularization. Consider that when randomly dropping neurons, the network is then channeling backpropagation in a path consistent with what would be realized for a narrower width network, with all of the implications discussed for sparser representations and a thinner tail region in the loss histogram. This may be an increased factor during later epochs as loss values reach into the left tail where geometric regularization is a factor.

  • Double descent in other learning paradigms pmlr-v80-belkin18a: Our comparison of the hypersphere to a set of neural network weight distributions associated with a loss value we expect will be equally valid for any learning paradigm in which a large number of parameters are tuned in the direction of a decreasing loss signal toward a global minimum. In any form, as a fitness landscape approaches a global minimum the volume of possible weight configurations will constrict, and with sufficient overparameterization that constriction somewhere around the interpolation peak will start to align with dimensionality volume contraction to see an emergence of geometric regularization.
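The hypersphere properties referenced in the list above can be checked numerically. The sketch below uses the standard closed form for the volume of a unit-radius n-ball, V_n = π^(n/2) / Γ(n/2 + 1), and a Monte Carlo estimate of the mean distance between two random points. Sampling points with independent standard normal coordinates is an assumption made for the illustration; the text does not specify a sampling distribution.

```python
import math
import random

def unit_ball_volume(n):
    """Volume of the unit-radius ball in n dimensions: pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def mean_pairwise_distance(n, pairs=2000, seed=0):
    """Monte Carlo estimate of the mean Euclidean distance between two points
    whose coordinates are independent standard normal draws (assumption)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        squared = sum((rng.gauss(0.0, 1.0) - rng.gauss(0.0, 1.0)) ** 2
                      for _ in range(n))
        total += math.sqrt(squared)
    return total / pairs

# Volume rises up to n = 5 and then contracts towards zero as n grows.
volumes = {n: unit_ball_volume(n) for n in (1, 2, 5, 10, 20, 50)}

# Typical distance between sampled points grows with dimension at the same time.
distances = {n: mean_pairwise_distance(n) for n in (2, 10, 50)}
```

The two dictionaries illustrate the paradox noted above: the enclosed volume contracts with dimension even as sampled points move further apart.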

9 Conclusion

This paper has introduced the principle of geometric regularization, which is associated with volume contraction with increasing parameterization of the distribution of possible weight sets associated with a loss value, and was inferred based on related properties demonstrated by hyperspheres. Geometric regularization explains several phenomena for overparameterized learning that have puzzled researchers. We believe double descent is a result of a training path reaching a sufficiently low loss value that the loss manifold volume contraction from overparameterization aligns with volume contraction from a reduced set of potential weight set candidates to produce a regularization phase change.

We also demonstrated related phenomena by way of loss value histograms from randomly sampled weight sets with different architecture configurations and initializations. We found some trends were most visible when evaluating loss associated with a single training sample, which included the emergence of one or more dominant modes in the histogram distribution. The relative dominance between a central mode or zero mode appeared to directionally correlate with deviations in architecture configurations like depth, width, and initialization scaling. These types of histogram framings may become a useful tool for researchers to explore loss manifold implications of other neural network conventions.

As some properties of high dimensional figures are intractable in today’s theory, we suspect the well vetted empirical demonstration of the double descent phenomenon, as well as the concrete derivation methods available for the associated interpolation threshold, may become an alternate channel for mathematicians to investigate properties of high dimensional geometries in future research.


Appendix A Table of Contents

Appendix B Histogram experiments

The histogram experiments were conducted in a series of Jupyter notebooks on the Colaboratory platform. A representative notebook is provided with the supplemental material. Many of the setups had some degree of variation over this template. (The neural network architecture was initially modeled as formulas in the cells of a spreadsheet, but we soon found we could incorporate more elaborate architecture conventions and larger depth with the support of TensorFlow tensorflow2015-whitepaper.) We used excerpts from the Titanic data set encoded with the Automunge library as features.

The notebook effectively demonstrates the initialization of a small network based on specification of width, depth, activation, and initialization type. Once a network was initialized with random weights, the features were passed to a predict operation similar to what would be performed on a trained model in inference, however in this case the predict was applied on the initialized weights without any form of training. The output of the predict and the corresponding labels were then fed to a binary cross entropy loss evaluation (without the “from_logits” option) to realize a single loss value recorded as one count within the histogram. Histogram aggregation involved repeating this setup a number of times based on either a designated sample count or in some cases running samples for the duration of the Colaboratory 24 hour run time window.
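The sampling loop described above can be sketched in plain Python as follows. This is not the supplemental notebook, which used TensorFlow; it is a minimal stand-in under stated assumptions: bias terms are omitted, hidden layers use ReLU with He-style Normal scaling, probabilities are clipped by a small `eps` before the logarithm, and the feature vector `x` and label `y` are placeholder values rather than Titanic data. All function names are hypothetical.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def init_network(n_features, width, depth, rng, scale=1.0):
    """Sample weights for `depth` hidden layers of `width` units plus one output unit.

    Each layer is a list of per-unit weight lists; biases are omitted (assumption).
    """
    sizes = [n_features] + [width] * depth + [1]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        std = scale * math.sqrt(2.0 / fan_in)  # He-style scaling
        layers.append([[rng.gauss(0.0, std) for _ in range(fan_in)]
                       for _ in range(fan_out)])
    return layers

def predict(x, layers):
    """Forward pass on untrained weights: ReLU hidden layers, sigmoid output."""
    h = list(x)
    for layer in layers[:-1]:
        h = [max(0.0, sum(w * v for w, v in zip(unit, h))) for unit in layer]
    z = sum(w * v for w, v in zip(layers[-1][0], h))
    return sigmoid(z)

def bce(y, p, eps=1e-7):
    """Binary cross entropy on a probability (not a logit), clipped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def sample_losses(x, y, width, depth, n_samples, seed=0):
    """One loss value per freshly initialized network, with no training."""
    rng = random.Random(seed)
    return [bce(y, predict(x, init_network(len(x), width, depth, rng)))
            for _ in range(n_samples)]

def histogram(losses, bin_width=0.05):
    """Count losses per bin; keys are bin indices (loss // bin_width)."""
    counts = {}
    for loss in losses:
        b = int(loss / bin_width)
        counts[b] = counts.get(b, 0) + 1
    return counts

x, y = [0.5, 1.0, 0.0], 1  # placeholder single training sample
losses = sample_losses(x, y, width=3, depth=1, n_samples=2000)
counts = histogram(losses)
```

Varying `width`, `depth`, or `scale` in `sample_losses` and comparing the resulting `counts` mirrors the kind of comparison made between histograms in the paper, although this sketch makes no claim to reproduce the reported figures.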

We limited the feature count to three, with one numeric and two categoric, because we expected that the more features that were applied, the larger the scale of sampling would be required to get reasonable representation in histogram tail regions. For similar reasons, we focussed most of our inspections on loss values calculated for a single sample.

We expect most readers in this venue will be familiar with the conventions of dense ReLU layers and a sigmoid activation output, and with how to interpret them mathematically. However, since this is the appendix, we will provide a brief demonstration of what it means when we talk about the loss equation. [Table 1] is a simple example of three features fed to a dense layer of three ReLU activations channeled through a single sigmoid output for binary classification, using uniform initialized weights to derive a loss value, and requiring 12 weight variables to implement. When we talk about distribution volume, we’re talking about the range of possible values of these 12 weights corresponding to some given loss value. This table represents one sample for a histogram, and by resampling the set of 12 weights, e.g. from a uniform or normal distribution of some scaling, an additional loss value can be derived for an additional histogram point. This is similar to what is being performed in TensorFlow.

Table 1: Simple network equations
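The network described for [Table 1] can be written out as follows. This is a sketch consistent with the description above (three features, three ReLU units, one sigmoid output, 12 weights, binary cross entropy) rather than a reproduction of the original table. The symbols $x_i$, $w_{ij}$, $v_j$, $h_j$, $z$, $p$, and $y$ are notation introduced here, and the absence of bias terms is an assumption inferred from the stated count of 12 weights.

```latex
\begin{aligned}
h_j &= \max\Big(0,\ \sum_{i=1}^{3} w_{ij}\, x_i\Big), \quad j = 1, 2, 3 && \text{(9 weights)}\\
z &= \sum_{j=1}^{3} v_j\, h_j && \text{(3 weights)}\\
p &= \frac{1}{1 + e^{-z}}\\
L &= -\big[\, y \ln p + (1 - y)\ln(1 - p) \,\big]
\end{aligned}
```

Each draw of the 12 values $(w_{ij}, v_j)$ yields one loss value $L$ for the histogram, and the distribution volume at a given loss corresponds to the set of 12-dimensional weight vectors that map to that value of $L$.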

One way to think about what is taking place with the histograms is that we are compressing the geometry of loss manifold distributions down to a binary cross entropy comparison between sampled state and a designated lowest energy state, where lowest energy refers to the case where the sampled state’s transformation function matches the natural label generating function at a global minimum. Note that this lowest energy state is not an inherent property of the geometry; it is low energy in comparison to some targeted label generating function of potentially lower intrinsic dimensions than the dimensions of the weight set. The sampled loss density is a kind of proxy for volume, and we can infer by the shape of the histogram curve which loss values will have larger manifold volumes relative to other loss values from the same architecture and initialization, as well as geometry transitions (i.e. surrounding volume expansion or contraction) that will be experienced by a training path.

The remaining appendices will survey several representative histograms aligned with discussion points from the paper.

Appendix C Smooth activations

The distinction between ReLU and tanh was noticeable [Fig 6]. ReLU exhibited sharp peaks at the modes whereas tanh was more of a smooth curve with only one visible mode. Tanh also appeared to be more stable with architecture variations. In some cases ReLU would shift significantly while the corresponding tanh would be hard to distinguish [Fig 7]. However, when we modeled architectures approaching infinite width, and especially with shallower networks, a similar zero mode transition could be seen in the tanh [Fig 8].

Figure 6: Corresponding ReLU (top) and tanh (bottom) example

Figure 7: Increasing width to 18, ReLU (top) and tanh (bottom)

Figure 8: Approaching inf width, (1) 2560 layer, tanh

Appendix D Width versus depth

We devoted some attention in the paper to implications of wide versus deep networks, for which we refer the reader to Section 5. This appendix offers a few representative examples. [Fig 9] demonstrates the transition from increasing width of a three layer network through 6, 9, and 12 neurons, demonstrating the characteristic shift towards a dominant zero mode. [Fig 10] demonstrates a similar transition through increasing depth.

Figure 9: Transitions through increasing width

Figure 10: Transitions through increasing depth

Appendix E Tail region

This series is another example of increasing depth, this time with a wider network starting point than the preceding appendix, and averaged across two training samples where the left tail is still visible. The left tail follows a progression with increasing parameterization by depth of transitioning towards a central mode and shrinking zero peak.

Figure 11: Tail region closeup with depths 2-6

Figure 12: Tail region closeup with depths 8-12

Appendix F Initialization scaling

We noted in the writeup that in some cases either of the two mode conventions could dominate for the same network configuration when evaluated against different training samples. This appeared to be more prevalent when applying initializations sampled from a uniform instead of normal distribution. As demonstrated here, He scaled Normal appears to be stable across these three representative training samples [Fig 13], while Uniform, with an arbitrary scale of +/-1, has some deviation in mode dominance characteristics across those same samples [Fig 14]. Note that He scaling is not a complicated or overly theoretical formula; it is merely a heuristic that has been found to generalize well.

Figure 13: He initialization with different training samples

Figure 14: Uniform initialization with different training samples

Appendix G Additional training samples

The majority of our focus on histograms in this paper was on inspecting derivations based on a single training sample, due to the visibility of left tail characteristics in comparison to aggregated samples. In [Fig 15] a histogram is shown based on 1, 2, and 3 aggregated samples. It appears that the first two samples were similar enough that no complicated function was needed to relate their inference, so the left tail was retained with their aggregation. The addition of a third sample resulted in the left tail compressing to close proximity to the central mode [Fig 16]. Note that a small indication of a left tail is visible adjacent to the left peak. 50 training samples produced further contraction [Fig 17].

Figure 15: Additional training samples

Figure 16: Close-up of 3 training samples

Figure 17: Close-up of 50 training samples