Stochastic Gradient Methods with Block Diagonal Matrix Adaptation

05/26/2019
by Jihun Yun, et al.

Adaptive gradient approaches that automatically adjust the learning rate on a per-feature basis have been very popular for training deep networks. This rich class of algorithms includes Adagrad, RMSprop, Adam, and recent extensions. All of these algorithms adopt diagonal matrix adaptation, because manipulating full matrices in high dimensions is computationally prohibitive. In this paper, we show that block-diagonal matrix adaptation is a practical and powerful alternative that effectively exploits the structural characteristics of deep learning architectures and significantly improves convergence and out-of-sample generalization. We present a general framework for block-diagonal matrix updates via coordinate grouping, which includes counterparts of the aforementioned algorithms, and prove their convergence in non-convex optimization, highlighting the benefits over their diagonal versions. In addition, we propose an efficient spectrum-clipping scheme that benefits from the superior generalization performance of SGD. Extensive experiments reveal that block-diagonal approaches achieve state-of-the-art results on several deep learning tasks, outperforming adaptive diagonal methods, vanilla SGD, and a recently proposed modified version of full-matrix adaptation.
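
To make the block-diagonal idea concrete, below is a minimal NumPy sketch of an Adagrad-style update that keeps one full second-moment matrix per coordinate block and clips the spectrum of the resulting preconditioner. The function name block_diag_adagrad_step, the block partition, the learning rate, and the clipping range are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def block_diag_adagrad_step(params, grads, states, blocks,
                            lr=1e-2, eps=1e-8, clip=(0.1, 10.0)):
    """One hypothetical block-diagonal Adagrad-style step.

    `blocks` is a list of index arrays that partition the coordinates;
    each block keeps its own full accumulator matrix in `states`.
    """
    lo, hi = clip
    for b, idx in enumerate(blocks):
        g = grads[idx]                              # gradient restricted to this block
        G = states.setdefault(b, np.zeros((len(idx), len(idx))))
        G += np.outer(g, g)                         # accumulate the block's second-moment matrix
        # Eigen-decompose the symmetric accumulator to form its inverse square root
        w, V = np.linalg.eigh(G + eps * np.eye(len(idx)))
        # Spectrum clipping (assumed form): keep the preconditioner's
        # eigenvalues inside [lo, hi] so the step never strays too far from SGD
        scale = np.clip(1.0 / np.sqrt(w), lo, hi)
        params[idx] -= lr * (V * scale) @ V.T @ g   # preconditioned update for this block
    return params

# Example usage: a 5-dimensional parameter vector split into blocks of sizes 3 and 2
params = np.zeros(5)
blocks = [np.arange(0, 3), np.arange(3, 5)]
states = {}
grads = np.random.randn(5)
params = block_diag_adagrad_step(params, grads, states, blocks)
```

Because each block stores and factorizes only a small b-by-b matrix, memory and eigendecomposition costs scale with the block sizes rather than with the full parameter dimension, which is what keeps this tractable relative to full-matrix adaptation; the clipping range controls how closely the update tracks plain SGD.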


