The Marginal Value of Adaptive Gradient Methods in Machine Learning

05/23/2017
by   Ashia C. Wilson, et al.
0

Adaptive optimization methods, which perform local optimization with a metric constructed from the history of iterates, are becoming increasingly popular for training deep neural networks. Examples include AdaGrad, RMSProp, and Adam. We show that for simple overparameterized problems, adaptive methods often find drastically different solutions than gradient descent (GD) or stochastic gradient descent (SGD). We construct an illustrative binary classification problem where the data is linearly separable, GD and SGD achieve zero test error, and AdaGrad, Adam, and RMSProp attain test errors arbitrarily close to half. We additionally study the empirical generalization capability of adaptive methods on several state-of-the-art deep learning models. We observe that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance. These results suggest that practitioners should reconsider the use of adaptive methods to train neural networks.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/18/2018

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Adaptive gradient methods, which adopt historical gradient information t...
research
08/13/2023

Understanding the robustness difference between stochastic gradient descent and adaptive gradient methods

Stochastic gradient descent (SGD) and adaptive gradient methods, such as...
research
11/16/2018

Minimum norm solutions do not always generalize well for over-parameterized problems

Stochastic gradient descent is the de facto algorithm for training deep ...
research
08/08/2020

Why to "grow" and "harvest" deep learning models?

Current expectations from training deep learning models with gradient-ba...
research
12/20/2017

Improving Generalization Performance by Switching from Adam to SGD

Despite superior training outcomes, adaptive optimization methods such a...
research
05/31/2023

Toward Understanding Why Adam Converges Faster Than SGD for Transformers

While stochastic gradient descent (SGD) is still the most popular optimi...
research
05/25/2022

Mirror Descent Maximizes Generalized Margin and Can Be Implemented Efficiently

Driven by the empirical success and wide use of deep neural networks, un...

Please sign up or login with your details

Forgot password? Click here to reset