Naively, we might expect that over-parameterized models will overfit the training data and that under-parameterized models will be better since they have fewer degrees of freedom. However, it turns out that over-parameterized models can find better solutions than the under-parameterized models - a paradoxical phenomenon known as the double-descent curve [Belkin (2021), Nakkiran et al. (2020)]. One possible explanation for this behaviour is that over-parameterized models are subject to an Occam’s razor that filters out unnecessarily complex solutions in favour of simpler solutions.
Typically, we might expect an Occam’s razor to take the form of a complexity measure on the number of model parameters or on the size of the hypothesis space, for instance [Zhang et al. (2017)]. However, for neural networks, the precise form of this hypothesized Occam’s razor is not known, since it is not explicitly enforced during training. There has been recent progress in identifying sources of implicit regularization that may play a role here [Barrett and Dherin (2021), Smith et al. (2021), Ma and Ying (2021), Blanc et al. (2020)]. For instance, recent work has exposed a hidden form of regularization in Stochastic Gradient Descent (SGD) called Implicit Gradient Regularization (IGR) [Barrett and Dherin (2021), Smith et al. (2021)], which penalizes learning trajectories that have large loss gradients.
For over-parameterized neural networks trained with SGD, we hypothesize that the hidden Occam’s razor takes the form of a geometric complexity measure. Our key contributions are as follows: (1) we define this notion of geometric complexity; (2) we show that the Dirichlet energy can be used as a proxy for geometric complexity; (3) we show that the IGR mechanism of SGD puts a regularization pressure on the geometric complexity; and (4) we show that the strength of this pressure increases with the size of the learning rate, which we verify with numerical experiments.
2 The Geometric Occam’s razor in 1-dimensional regression
To build intuition, we begin with a simple 1-dimensional example. Consider a ReLU neural network consisting of 3 layers with 300 units per layer, trained using SGD without any form of explicit regularization, to perform 1-dimensional regression using only 10 data points. In this extreme setting, we should expect the network to overfit the dataset, since the function space described by that neural network is extremely large - consisting of piecewise linear functions with thousands of linear pieces (Arora et al., 2018). Yet, if we plot the learned function during training from the first step all the way up to interpolation, as in Figure 1, we observe that the learned function is the ‘simplest’ possible function, in some sense, among all functions with the same training error.
But what do we mean by ‘simple’? Our key intuition in this example is that the arc length of the learned function over the smallest interval containing the data points provides our measure of model complexity. At the end of training, we see that the arc length of the learned function is close to that of the shortest possible path interpolating between the data points, suggesting that this measure of geometric complexity is somehow optimised during training.
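This arc-length intuition can be made concrete with a minimal sketch (not the paper’s experiment; the dataset and the ‘complex’ candidate function below are hypothetical): the piecewise-linear interpolant through the data points is the shortest interpolating path, and a wigglier candidate over the same interval has a larger arc length.

```python
import numpy as np

def arc_length(x, y):
    """Arc length of the piecewise-linear curve through the points (x_i, y_i)."""
    return float(np.sum(np.sqrt(np.diff(x) ** 2 + np.diff(y) ** 2)))

# Hypothetical 1-D dataset of 10 points.
rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(-1.0, 1.0, size=10))
ys = np.sin(3 * xs) + 0.1 * rng.normal(size=10)

# Shortest interpolating path: the piecewise-linear curve through the data.
shortest = arc_length(xs, ys)

# A wigglier candidate over the same interval: the interpolant plus an
# oscillatory perturbation, evaluated on a fine grid.
grid = np.linspace(xs[0], xs[-1], 1000)
wiggly = np.interp(grid, xs, ys) + 0.05 * np.sin(40 * grid)
model_length = arc_length(grid, wiggly)

assert model_length > shortest  # the wigglier candidate is longer here
```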
3 Dirichlet energy as a measure of function complexity
In the previous section, we used the arc length of the learnt 1-dimensional function as a measure of its geometric complexity. What is the corresponding notion for a function in a high-dimensional feature space? In this case, we can define the geometric complexity of a function $f$ as the volume of its graph restricted to the feature polytope $K$; that is, the polytope with smallest volume containing all the feature points of the dataset $D$. From differential geometry (do Carmo, 1976), for a smooth function $f\colon K \subset \mathbb{R}^d \to \mathbb{R}$, the graph $\{(x, f(x)) : x \in K\}$ is a $d$-dimensional smooth submanifold of $\mathbb{R}^{d+1}$. Using the Riemannian metric on the graph induced from the Euclidean metric on $\mathbb{R}^{d+1}$ and its corresponding Riemannian volume form, the volume of the graph of $f$ can be expressed as

\[ \operatorname{Vol}(f, K) = \int_K \sqrt{1 + \|\nabla f(x)\|^2}\, dx. \]
This can in turn be approximated using a first-order Taylor expansion of the square root, $\sqrt{1 + u} \approx 1 + u/2$, so that

\[ \operatorname{Vol}(f, K) \approx \operatorname{Vol}(K) + \frac{1}{2} E(f, K), \quad \text{where} \quad E(f, K) = \int_K \|\nabla f(x)\|^2\, dx \quad (equation:DE) \]

is the Dirichlet energy of the function $f$ over $K$. The computation above suggests that either the volume of the function’s graph or its Dirichlet energy can be used as a measure of a function’s geometric complexity.
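The Taylor approximation above can be checked numerically in one dimension, where the graph volume is just the arc length; a minimal sketch with an assumed test function $f(x) = a\sin(2\pi x)$ on $K = [0, 1]$:

```python
import numpy as np

# Compare the exact graph volume (arc length here) with its first-order
# approximation Vol(K) + (1/2) * Dirichlet energy, for f(x) = a*sin(2πx).
a = 0.02                                  # small amplitude => small |f'|
x = np.linspace(0.0, 1.0, 100_001)
fprime = 2 * np.pi * a * np.cos(2 * np.pi * x)

def integrate(g, x):
    """Trapezoidal rule on a 1-D grid."""
    return float(np.sum((g[:-1] + g[1:]) * np.diff(x) / 2))

exact_volume = integrate(np.sqrt(1.0 + fprime ** 2), x)  # ∫ sqrt(1 + |f'|²) dx
dirichlet = integrate(fprime ** 2, x)                    # Dirichlet energy over K
approx_volume = 1.0 + 0.5 * dirichlet                    # Vol(K) = 1 here

assert abs(exact_volume - approx_volume) < 1e-4          # agree up to O(|f'|⁴)
```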
One way to compute the Dirichlet energy numerically is to use a quadrature formula, summing over a number of points in $K$ and multiplying the summands by the volume element at each point. If we use the data points themselves as evaluation points and $\operatorname{Vol}(K)/|D|$ as a proxy for the volume element, we obtain (up to the constant factor $\operatorname{Vol}(K)$) a discrete version of the Dirichlet energy, which we call the discrete Dirichlet energy and denote by $\langle f, D \rangle$. This provides an easily computable measure of a function’s geometric complexity:

\[ \langle f, D \rangle = \frac{1}{|D|} \sum_{x \in D} \|\nabla_x f(x)\|^2. \]
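As a sketch of this quantity, the following computes the discrete Dirichlet energy of a small two-layer network with hypothetical random weights, estimating the input gradient by central finite differences (in practice one would use automatic differentiation):

```python
import numpy as np

# Discrete Dirichlet energy of a model f at the data points. The weights
# below are random placeholders, not a trained model.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(16, 3)) * 0.1, np.zeros(16)
W2, b2 = rng.normal(size=(1, 16)) * 0.1, np.zeros(1)

def f(x):
    """A small two-layer tanh network with scalar output."""
    return (W2 @ np.tanh(W1 @ x + b1) + b2).item()

def input_gradient(x, eps=1e-5):
    """Central finite-difference estimate of the input gradient of f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

data = rng.normal(size=(10, 3))  # 10 hypothetical feature points in R^3
discrete_dirichlet = np.mean([np.sum(input_gradient(x) ** 2) for x in data])
```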
4 How neural networks tame model complexity
We now argue that the geometric complexity, as measured by the discrete Dirichlet energy, is implicitly regularized during the training of neural nets with vanilla SGD.
In recent work, Smith et al. (2021) showed that the discrete steps of SGD from epoch to epoch closely follow, on average, the gradient flow of a modified loss of the form

\[ \widetilde{E}(\theta) = E(\theta) + \frac{h}{4n} \sum_{i=1}^{n} \|\nabla_\theta\, \ell(f_\theta(x_i), y_i)\|^2, \]

where $E$ is the original loss and $\ell(f_\theta(x_i), y_i)$ is the error between the prediction $f_\theta(x_i)$ and the true label $y_i$. This means that during SGD the quantities $\|\nabla_\theta\, \ell(f_\theta(x_i), y_i)\|^2$ at each data point $(x_i, y_i)$ are implicitly regularized, with the learning rate $h$ acting as an implicit regularization rate.
Now, for models whose losses come from the application of maximum likelihood estimation to a conditional probability distribution in the exponential family, such as the least-squares loss or the cross-entropy loss, we obtain loss gradients that have the following form:

\[ \nabla_\theta\, \ell(f_\theta(x), y) = r\, \nabla_\theta f_\theta(x), \]

where $r = f_\theta(x) - y$ is the signed residual, yielding

\[ \widetilde{E}(\theta) = E(\theta) + \frac{h}{4n} \sum_{i=1}^{n} r_i^2\, \|\nabla_\theta f_\theta(x_i)\|^2. \quad (equation:resdidual\_loss) \]

From that last expression for $\widetilde{E}$, we see that the terms $\|\nabla_\theta f_\theta(x_i)\|^2$ are implicitly regularized at each data point, and even more so in regions where the residual errors $r_i$ are large, such as at the beginning of training.
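The residual form of the loss gradient can be verified directly for the least-squares loss $\ell = (f_\theta(x) - y)^2/2$; a minimal sketch with an assumed linear model $f_\theta(x) = \theta \cdot x$:

```python
import numpy as np

# Check numerically that ∇_θ ℓ = r ∇_θ f_θ(x) for the least-squares loss,
# so that ||∇_θ ℓ||² = r² ||∇_θ f_θ(x)||². Hypothetical linear model θ·x.
rng = np.random.default_rng(2)
theta, x, y = rng.normal(size=3), rng.normal(size=3), 0.7

r = theta @ x - y                  # signed residual f_θ(x) − y
grad_f = x                         # ∇_θ f_θ(x) for a linear model
grad_loss = r * grad_f             # analytic ∇_θ ℓ

# Central finite differences on the loss confirm the analytic gradient.
eps = 1e-6
loss = lambda t: ((t @ x - y) ** 2) / 2
num_grad = np.array([
    (loss(theta + eps * np.eye(3)[i]) - loss(theta - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])
assert np.allclose(num_grad, grad_loss, atol=1e-6)
assert np.isclose(np.sum(num_grad ** 2), r ** 2 * np.sum(grad_f ** 2), atol=1e-6)
```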
We now argue that for neural networks in particular, the regularization pressure on the gradient $\nabla_\theta f_\theta(x)$ of the network with respect to the parameters acts as a regularization pressure on the gradient $\nabla_x f_\theta(x)$ of the network with respect to the input, and hence creates a pressure for the Dirichlet energy to be implicitly regularized during training. This follows from the fact that for neural networks the derivatives with respect to the inputs and to the parameters can be related as follows (proof in Appendix A):
Theorem 1. Consider a neural network $f_\theta$ with $l$ layers, $f_\theta = f_l \circ \cdots \circ f_1$ with $f_k(z) = \alpha_k(W_k z + b_k)$, where $\theta = (W_1, b_1, \dots, W_l, b_l)$ is the vector of layer weight matrices and biases and the $\alpha_k$’s are the layer activation functions. Then we have that

\[ \|\nabla_x f_\theta(x)\|^2 \sum_{k=1}^{l} \frac{1 + \|h_{k-1}(x)\|^2}{\|W_k\|_2^2\, \|\nabla_x h_{k-1}(x)\|_2^2} \;\le\; \|\nabla_\theta f_\theta(x)\|^2, \quad (equation:gradient\_relation) \]

where $h_k = f_k \circ \cdots \circ f_1$ is the sub-network from the input to layer $k$ (with $h_0(x) = x$) and $\|W_k\|_2$ is the spectral norm of the weight matrix $W_k$.
From equation:gradient_relation, we see that the regularization pressure from IGR on $\|\nabla_\theta f_\theta(x)\|^2$ translates into a regularization pressure on the discrete Dirichlet energy when the positive quantities

\[ a_k(x) = \frac{\|W_k\|_2^2\, \|\nabla_x h_{k-1}(x)\|_2^2}{1 + \|h_{k-1}(x)\|^2} \]

remain small. Note that this is expected to happen at the beginning of training, when the spectral norms of the layers are close to zero, while they tend to grow as training progresses if no spectral regularization (Miyato et al., 2018) is applied. Furthermore, note that preventing the $a_k(x)$’s from becoming too large during training may be an important consideration informing the choice of model architecture and layer regularization.
From Equation equation:resdidual_loss, since the strength of IGR grows with the learning rate, we should expect an increased pressure on the Dirichlet energy, via Equation equation:gradient_relation, when training with higher learning rates. We verify this prediction for a ResNet-18 trained to classify CIFAR-10 images: measuring the discrete Dirichlet energy at the time of maximal test accuracy for a range of learning rates, we observe the predicted behaviour; see Figure 2.
Note also that for linear models (i.e., neural networks with a single linear layer), the Dirichlet energy of the network coincides with the squared L2-norm of the parameters. Therefore, this result recovers the already known fact that linear models trained with SGD have an inductive bias towards low L2-norm solutions (see Zhang et al. (2017)). This also suggests that the Dirichlet energy may be the right generalization of the L2-norm to general networks.
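A minimal sketch of this special case, with hypothetical weights and data:

```python
import numpy as np

# For a linear model f(x) = W x, the input gradient ∇_x f(x) = W is constant,
# so the discrete Dirichlet energy equals ||W||²_F, the squared L2-norm of the
# parameters, for any dataset.
rng = np.random.default_rng(4)
W = rng.normal(size=(2, 5))
data = rng.normal(size=(10, 5))

energy = np.mean([np.sum(W ** 2) for _ in data])  # (1/|D|) Σ ||∇_x f(x)||²
assert np.isclose(energy, np.sum(W ** 2))
```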
5 Related work, Future directions, and Discussion
Splines and connections to harmonic function theory: The Dirichlet energy equation:DE is well known in harmonic function theory (Axler et al., 2013), where it can be shown using the calculus of variations that harmonic functions subject to a boundary condition minimize the Dirichlet energy over the space of differentiable functions. This is known as Dirichlet’s principle. The minimization of the Dirichlet energy itself is also related to the theory of splines (Jin and Montufar, 2021). Our work seems to indicate that neural networks are biased towards (a notion of) harmonic functions, with the dataset acting as the boundary condition.

Complexity theory: The notion of geometric complexity introduced here has similarities to Kolmogorov complexity (Schmidhuber, 1997) as well as to the minimum description length of Hinton and Zemel (1994).

Smoothness regularization: The notion of geometric complexity considered here is related to the notion of smoothness with respect to the input, as discussed in Rosca et al. (2020), and to the Sobolev regularization effect of SGD discussed in Ma and Ying (2021), where inequalities similar to equation:gradient_relation, but involving only the first layer, are considered. In particular, various forms of gradient penalties, reminiscent of the Dirichlet energy, have been devised to achieve Lipschitz smoothness (Elsayed et al., 2018; Gulrajani et al., 2017; Fedus et al., 2018; Arbel et al., 2018; Kodali et al., 2018). It has been shown that the discrete Dirichlet energy (evaluated at the data points) is a powerful regularizer (Hoffman et al., 2020; Varga et al., 2018), and in Rosca et al. (2020) that it has advantages over other forms of smoothness regularization (such as spectral norm regularization; Miyato et al., 2018; Lin et al., 2021). Our analysis shows that we can control this form of regularization cheaply through the learning rate.
In image processing, the Dirichlet energy is also called the Rudin–Osher–Fatemi total variation, and it has been introduced as a powerful explicit regularizer for image denoising; see Rudin et al. (1992) and Getreuer (2012). It may be that these various forms of smoothness regularization are useful because they provide implicit control over the model’s geometric complexity.

Regularization through noise: The discrete Dirichlet energy is reminiscent of the Tikhonov regularizer, which is implicitly regularized by added input noise (Bishop, 1995). The modified loss in equation:resdidual_loss is also very reminiscent of the modified loss in Blanc et al. (2020), which is argued to be implicitly minimized by SGD when random white noise is added to the labels. In Section 3 of Mescheder et al. (2018), it is argued that explicit gradient regularization with respect to the input and instance noise produce similar types of regularization. Altogether, this suggests that feature noise, label noise, and the optimization scheme all conspire to implicitly tame the geometric complexity of neural networks trained with gradient-based optimization schemes.

Regularization through the number of layers: In Equation equation:gradient_relation, one sees that each layer contributes an additional positive term, increasing the pressure on the Dirichlet energy. This suggests that the pressure on the model’s geometric complexity may increase with network depth, in a similar spirit to Gao and Jojic (2016).

Training of GANs: For GANs, explicit gradient regularization both with respect to the input (Gulrajani et al., 2017; Arbel et al., 2018; Fedus et al., 2018; Kodali et al., 2018; Miyato et al., 2018) and with respect to the parameters (Rosca et al., 2021; Qin et al., 2020; Balduzzi et al., 2018; Mescheder et al., 2017; Nagarajan and Kolter, 2017) has proven beneficial and is related to smoothness. Our main theorem provides a way to relate gradient penalties with respect to the input and with respect to the parameters for neural networks, with the spectral norms of the weight matrices playing a key role. This points toward geometric complexity being a useful notion for relating and understanding these different forms of regularization (including spectral normalization, as in Miyato et al. (2018) and Lin et al. (2021)).
In conclusion, we have found that neural networks trained with SGD are subject to an implicit Geometric Occam’s razor, which selects parameter configurations with low geometric complexity ahead of configurations with high geometric complexity. This geometric complexity is given by the arc length in 1-dimensional regression; is linearly related to the Dirichlet energy in higher-dimensional settings; and has many intriguing similarities to other known quantities, including various forms of implicit and explicit regularization and model complexity. More generally, our work develops promising new theoretical connections between optimization and the geometry of over-parameterized neural networks.
We would like to thank Mihaela Rosca, Maxim Neumann, Yan Wu, Samuel Smith, and Soham De for helpful discussion and feedback. We would also like to thank Patrick Cole and Shakir Mohamed for their support.
- Arbel et al. (2018) Michael Arbel, Danica J Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for mmd gans. In Advances in Neural Information Processing Systems, volume 31, 2018.
- Arora et al. (2018) Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In International Conference on Learning Representations, 2018.
- Axler et al. (2013) Sheldon Axler, Paul Bourdon, and Ramey Wade. Harmonic Function Theory, volume 137. Springer, 2013.
- Balduzzi et al. (2018) David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, 2018.
- Barrett and Dherin (2021) David G.T. Barrett and Benoit Dherin. Implicit gradient regularization. In International Conference on Learning Representations, 2021.
- Belkin (2021) Mikhail Belkin. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. 2021. URL https://arxiv.org/abs/2105.14368.
- Bishop (1995) Chris M. Bishop. Training with noise is equivalent to tikhonov regularization. Neural Computation, 7(1), 1995.
- Blanc et al. (2020) Guy Blanc, Neha Gupta, Gregory Valiant, and Paul Valiant. Implicit regularization for deep neural networks driven by an ornstein-uhlenbeck like process. In Conference on Learning Theory, 2020.
- do Carmo (1976) Manfredo P. do Carmo. Differential geometry of curves and surfaces. Prentice Hall, 1976.
- Elsayed et al. (2018) Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Advances in Neural Information Processing Systems, volume 31, 2018.
- Fedus et al. (2018) William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M. Dai, Shakir Mohamed, and Ian J. Goodfellow. Many paths to equilibrium: Gans do not need to decrease a divergence at every step. In International Conference on Learning Representations, 2018.
- Gao and Jojic (2016) Tianxiang Gao and Vladimir Jojic. Degrees of freedom in deep neural networks. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, 2016.
- Getreuer (2012) Pascal Getreuer. Rudin-Osher-Fatemi Total Variation Denoising using Split Bregman. Image Processing On Line, 2012.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In Advances in neural information processing systems, 2017.
- Hinton and Zemel (1994) Geoffrey E Hinton and Richard Zemel. Autoencoders, minimum description length and helmholtz free energy. In J. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6. Morgan-Kaufmann, 1994.
- Hoffman et al. (2020) Judy Hoffman, Daniel A. Roberts, and Sho Yaida. Robust learning with jacobian regularization. 2020. URL https://arxiv.org/abs/1908.02729.
- Jin and Montufar (2021) Hui Jin and Guido Montufar. Implicit bias of gradient descent for mean squared error regression with wide neural networks. 2021. URL https://arxiv.org/abs/2006.07356.
- Kodali et al. (2018) Naveen Kodali, James Hays, Jacob Abernethy, and Zsolt Kira. On convergence and stability of GANs. In Advances in Neural Information Processing Systems, 2018.
- Lin et al. (2021) Zinan Lin, Vyas Sekar, and Giulia C. Fanti. Why spectral normalization stabilizes gans: Analysis and improvements. In Advances in neural information processing systems, 2021.
- Ma and Ying (2021) Chao Ma and Lexing Ying. The sobolev regularization effect of stochastic gradient descent. 2021. URL https://arxiv.org/abs/2105.13462.
- Mescheder et al. (2017) Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of gans. In Advances in Neural Information Processing Systems, 2017.
- Mescheder et al. (2018) Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In Proceedings of the 35th International Conference on Machine Learning, 2018.
- Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
- Nagarajan and Kolter (2017) Vaishnavh Nagarajan and J Zico Kolter. Gradient descent gan optimization is locally stable. In Advances in neural information processing systems, 2017.
- Nakkiran et al. (2020) Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. In International Conference on Learning Representations, 2020.
- Qin et al. (2020) Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andy Brock, Jeff Donahue, Timothy Lillicrap, and Pushmeet Kohli. Training generative adversarial networks by solving ordinary differential equations. In Advances in Neural Information Processing Systems, volume 33, 2020.
- Rosca et al. (2020) Mihaela Rosca, Theophane Weber, Arthur Gretton, and Shakir Mohamed. A case for new neural network smoothness constraints. In Proceedings on ”I Can’t Believe It’s Not Better!” at NeurIPS Workshops, volume 137 of Proceedings of Machine Learning Research, 2020.
- Rosca et al. (2021) Mihaela Rosca, Yan Wu, Benoit Dherin, and David G.T. Barrett. Discretization drift in two-player games. In Proceedings of the 38th International Conference on Machine Learning, volume 139, 2021.
- Rudin et al. (1992) Leonid Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1), 1992.
- Schmidhuber (1997) J. Schmidhuber. Discovering neural nets with low kolmogorov complexity and high generalization capability. Neural Networks, 10(5), 1997.
- Seong et al. (2018) Sihyeon Seong, Yegang Lee, Youngwook Kee, Dongyoon Han, and Junmo Kim. Towards flatter loss surface via nonmonotonic learning rate scheduling. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, 2018.
- Smith et al. (2021) Samuel L Smith, Benoit Dherin, David G.T. Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations, 2021.
- Varga et al. (2018) Dániel Varga, Adrián Csiszárik, and Zsolt Zombori. Gradient regularization improves accuracy of discriminative models. 2018. URL https://arxiv.org/abs/1712.09936.
- Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.
Appendix A Proof of Theorem 1
Consider a neural network $f_\theta$ with $l$ layers, $f_\theta = f_l \circ \cdots \circ f_1$ with $f_k(z) = \alpha_k(W_k z + b_k)$, where $\theta = (W_1, b_1, \dots, W_l, b_l)$ is the vector of layer weight matrices and biases and the $\alpha_k$’s are the layer activation functions. We will use the notation $f_{W_k}$ or $f_{b_k}$ instead of $f_\theta$ when we want to consider $f_\theta$ as dependent on $W_k$ or $b_k$ only.
For this model structure and following Pythagoras, we have:

\[ \|\nabla_\theta f_\theta(x)\|^2 = \sum_{k=1}^{l} \left( \|\nabla_{W_k} f_\theta(x)\|^2 + \|\nabla_{b_k} f_\theta(x)\|^2 \right). \quad (equation:decomposition) \]
For each layer $k$, we can rewrite the network function as

\[ f_\theta(x) = g_k\big(W_k\, h_{k-1}(x) + b_k\big), \]

where $g_k = f_l \circ \cdots \circ f_{k+1} \circ \alpha_k$ consists of the deeper layers above layer $k$ and $h_{k-1} = f_{k-1} \circ \cdots \circ f_1$ consists of the shallower layers below layer $k$ (with $h_0(x) = x$). The idea now, inspired by Seong et al. (2018), is to show that a small perturbation of the input is equivalent to a small perturbation of the weights of layer $k$. We will use this idea to prove the following two lemmas.
Lemma 1. In the notation above, for each layer $k$, we have:

\[ \|\nabla_{W_k} f_\theta(x)\|^2 \ge \frac{\|h_{k-1}(x)\|^2}{\|W_k\|_2^2\, \|\nabla_x h_{k-1}(x)\|_2^2}\, \|\nabla_x f_\theta(x)\|^2. \quad (equation:weight\_derivative) \]
Proof. Consider a small perturbation $x + \delta$ of the input $x$. We start by showing that we can always find a corresponding perturbation $W_k + \epsilon_k(\delta)$ of the weight matrix in layer $k$ such that

\[ f_{W_k + \epsilon_k(\delta)}(x) = f_\theta(x + \delta). \quad (equation:equivalence) \]

Namely, because $f_\theta(x + \delta) = g_k\big(W_k\, h_{k-1}(x + \delta) + b_k\big)$, to show this it is enough to find $\epsilon_k(\delta)$ such that

\[ \epsilon_k(\delta)\, h_{k-1}(x) = W_k\, \nabla_x h_{k-1}(x)\, \delta, \quad (equation:condition) \]

where we identify $h_{k-1}(x + \delta)$ with its linear approximation $h_{k-1}(x) + \nabla_x h_{k-1}(x)\, \delta$ around $x$ for small $\delta$. Then equation:condition is always satisfied if we set

\[ \epsilon_k(\delta) = \frac{W_k\, \nabla_x h_{k-1}(x)\, \delta\; h_{k-1}(x)^T}{\|h_{k-1}(x)\|^2}, \]
since $h_{k-1}(x)^T\, h_{k-1}(x) = \|h_{k-1}(x)\|^2$. Now taking the derivative with respect to $\delta$ at $\delta = 0$ on both sides of Equation equation:equivalence, and using the chain rule and that $\epsilon_k(\delta)$ is linear in $\delta$, we obtain a relation between the network derivative with respect to the weight matrix and with respect to the input:

\[ \nabla_x f_\theta(x)\, \delta = \nabla_{W_k} f_\theta(x)\, [\epsilon_k(\delta)] \quad \text{for all } \delta. \]

Taking the norm on both sides, using $\|\epsilon_k(\delta)\| \le \|W_k\|_2\, \|\nabla_x h_{k-1}(x)\|_2\, \|\delta\| / \|h_{k-1}(x)\|$, squaring, and rearranging the terms yields equation:weight_derivative. ∎
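The perturbation construction in the proof can be sanity-checked numerically; in the sketch below (hypothetical random weights, layer $k = 1$, so $h_0(x) = x$ and $\nabla_x h_0(x) = I$, which makes the equivalence exact rather than only first-order):

```python
import numpy as np

# Check that the weight perturbation ε_k = W_k ∇_x h_{k-1}(x) δ h_{k-1}(x)^T /
# ||h_{k-1}(x)||² reproduces the input perturbation: f_{W_k+ε_k}(x) = f_θ(x + δ).
rng = np.random.default_rng(5)
W1, b1 = rng.normal(size=(6, 3)), rng.normal(size=6)
W2, b2 = rng.normal(size=(1, 6)), rng.normal(size=1)

def f(A, x):
    """Two-layer tanh network, with the first-layer weights A exposed."""
    return W2 @ np.tanh(A @ x + b1) + b2

x = rng.normal(size=3)
delta = 1e-4 * rng.normal(size=3)

eps1 = np.outer(W1 @ delta, x) / (x @ x)  # ε_1 for layer k = 1
lhs = f(W1 + eps1, x)
rhs = f(W1, x + delta)
assert np.allclose(lhs, rhs)
```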
Following the same strategy, we now prove a corresponding lemma for the biases at each layer:
Lemma 2. In the notation above, for each layer $k$, we have:

\[ \|\nabla_{b_k} f_\theta(x)\|^2 \ge \frac{1}{\|W_k\|_2^2\, \|\nabla_x h_{k-1}(x)\|_2^2}\, \|\nabla_x f_\theta(x)\|^2. \quad (equation:bias\_derivative) \]
Proof. Consider a small perturbation $x + \delta$ of the input $x$. We start again by showing that we can always find a corresponding perturbation $b_k + \epsilon_k(\delta)$ of the biases in layer $k$ such that

\[ f_{b_k + \epsilon_k(\delta)}(x) = f_\theta(x + \delta). \]

Namely, because $f_\theta(x + \delta) = g_k\big(W_k\, h_{k-1}(x + \delta) + b_k\big)$, to show this it is enough to find $\epsilon_k(\delta)$ such that

\[ \epsilon_k(\delta) = W_k\, \nabla_x h_{k-1}(x)\, \delta, \quad (equation:condition2) \]

where we again identify $h_{k-1}(x + \delta)$ with its linear approximation around $x$ for small $\delta$. Then equation:condition2 is always satisfied if we set $\epsilon_k(\delta)$ this time to exactly this value.
Now taking the derivative with respect to $\delta$ at $\delta = 0$ on both sides of the equivalence $f_{b_k + \epsilon_k(\delta)}(x) = f_\theta(x + \delta)$, and using the chain rule and that $\epsilon_k(\delta)$ is linear in $\delta$, we obtain a relation between the network derivative with respect to the biases and with respect to the input:

\[ \nabla_x f_\theta(x)\, \delta = \nabla_{b_k} f_\theta(x)\, W_k\, \nabla_x h_{k-1}(x)\, \delta \quad \text{for all } \delta. \]

Taking the norm on both sides, squaring, and rearranging the terms yields equation:bias_derivative. ∎
Using these inequalities for each layer, i.e., substituting equation:weight_derivative and equation:bias_derivative into the decomposition given by Equation equation:decomposition, we finally obtain that

\[ \|\nabla_\theta f_\theta(x)\|^2 \ge \|\nabla_x f_\theta(x)\|^2 \sum_{k=1}^{l} \frac{1 + \|h_{k-1}(x)\|^2}{\|W_k\|_2^2\, \|\nabla_x h_{k-1}(x)\|_2^2}, \]

which is Equation equation:gradient_relation, completing the proof of Theorem 1. ∎