How Does Adaptive Optimization Impact Local Neural Network Geometry?

11/04/2022
by Kaiqi Jiang, et al.

Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce R^OPT_med, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where R^Adam_med is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where R^SGD_med is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence of the need for a new explanation of the success of adaptive methods, one that differs from the conventional wisdom.
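To make the trajectory statistic concrete, below is a minimal PyTorch sketch. It assumes a simplified reading of R^OPT_med as the median, over the iterates visited by an optimizer OPT, of the ratio of largest to smallest absolute eigenvalue of the loss Hessian at each iterate; the toy scalar analogue of a two-layer linear model and all function names are illustrative assumptions, not the paper's exact construction, which is given in the full text.

```python
# Illustrative sketch only: track a condition-number-like statistic of the loss
# Hessian along an optimizer's trajectory and report its median (a stand-in for
# R^OPT_med under the assumed definition described above).
import torch

def hessian_condition_number(loss_fn, theta):
    """Ratio of largest to smallest absolute Hessian eigenvalue at the iterate theta."""
    H = torch.autograd.functional.hessian(loss_fn, theta.detach())
    eig = torch.linalg.eigvalsh(H).abs()
    return (eig.max() / eig.min().clamp_min(1e-12)).item()

# Toy scalar analogue of a two-layer linear network: f(theta) = theta[0] * theta[1] * x.
x_data, y_data = 1.0, 2.0
def loss_fn(theta):
    return (theta[0] * theta[1] * x_data - y_data) ** 2

theta = torch.tensor([1.5, -0.5], requires_grad=True)
opt = torch.optim.Adam([theta], lr=0.05)   # swap in torch.optim.SGD to compare trajectories

ratios = []
for _ in range(200):
    opt.zero_grad()
    loss_fn(theta).backward()
    opt.step()
    ratios.append(hessian_condition_number(loss_fn, theta))

print("median Hessian condition number along the trajectory:",
      torch.tensor(ratios).median().item())
```

Running the same loop with Adam and with SGD and comparing the two medians mirrors, in miniature, the kind of comparison the abstract describes, with Adam expected to settle in regions where the statistic is smaller.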


