Minimum norm solutions do not always generalize well for over-parameterized problems

11/16/2018
by   Vatsal Shah, et al.
0

Stochastic gradient descent is the de facto algorithm for training deep neural networks (DNNs). Despite its popularity, it still requires fine hyper-parameter tuning in order to achieve its best performance. This has led to the development of adaptive methods, that claim automatic hyper-parameter tuning. Recently, researchers have studied both algorithmic classes via thoughtful toy problems: e.g., for over-parameterized linear regression, [1] shows that, while SGD always converges to the minimum-norm solution (similar to the case of the maximum margin solution in SVMs that guarantees good prediction error), adaptive methods show no such inclination, leading to worse generalization capabilities. Our aim is to study this conjecture further. We empirically show that the minimum norm solution is not necessarily the proper gauge of good generalization in simplified scenaria, and different models found by adaptive methods could outperform plain gradient methods. In practical DNN settings, we observe that adaptive methods often perform at least as well as SGD, without necessarily reducing the amount of tuning required.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/28/2020

On Generalization of Adaptive Methods for Over-parameterized Linear Regression

Over-parameterization and adaptive methods have played a crucial role in...
research
05/23/2017

The Marginal Value of Adaptive Gradient Methods in Machine Learning

Adaptive optimization methods, which perform local optimization with a m...
research
09/13/2017

Normalized Direction-preserving Adam

Optimization algorithms for training deep models not only affects the co...
research
08/13/2023

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Stochastic gradient descent (SGD) and adaptive gradient methods, such as...
research
06/03/2018

Minnorm training: an algorithm for training over-parameterized deep neural networks

In this work, we propose a new training method for finding minimum weigh...
research
06/04/2019

Embedded hyper-parameter tuning by Simulated Annealing

We propose a new metaheuristic training scheme that combines Stochastic ...
research
06/26/2020

Is SGD a Bayesian sampler? Well, almost

Overparameterised deep neural networks (DNNs) are highly expressive and ...

Please sign up or login with your details

Forgot password? Click here to reset