Minimum norm solutions do not always generalize well for over-parameterized problems

by   Vatsal Shah, et al.
The University of Texas at Austin
Rice University

Stochastic gradient descent is the de facto algorithm for training deep neural networks (DNNs). Despite its popularity, it still requires fine hyper-parameter tuning in order to achieve its best performance. This has led to the development of adaptive methods, that claim automatic hyper-parameter tuning. Recently, researchers have studied both algorithmic classes via thoughtful toy problems: e.g., for over-parameterized linear regression, [1] shows that, while SGD always converges to the minimum-norm solution (similar to the case of the maximum margin solution in SVMs that guarantees good prediction error), adaptive methods show no such inclination, leading to worse generalization capabilities. Our aim is to study this conjecture further. We empirically show that the minimum norm solution is not necessarily the proper gauge of good generalization in simplified scenaria, and different models found by adaptive methods could outperform plain gradient methods. In practical DNN settings, we observe that adaptive methods often perform at least as well as SGD, without necessarily reducing the amount of tuning required.


page 1

page 2

page 3

page 4


On Generalization of Adaptive Methods for Over-parameterized Linear Regression

Over-parameterization and adaptive methods have played a crucial role in...

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a m...

Normalized Direction-preserving Adam

Optimization algorithms for training deep models not only affects the co...

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Stochastic gradient descent (SGD) and adaptive gradient methods, such as...

Minnorm training: an algorithm for training over-parameterized deep neural networks

In this work, we propose a new training method for finding minimum weigh...

Embedded hyper-parameter tuning by Simulated Annealing

We propose a new metaheuristic training scheme that combines Stochastic ...

Is SGD a Bayesian sampler? Well, almost

Overparameterised deep neural networks (DNNs) are highly expressive and ...

Please sign up or login with your details

Forgot password? Click here to reset