Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions

02/07/2023
by   Vladimir Feinberg, et al.

Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance for many tasks, but can be prohibitive in terms of memory and running time. We find that the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing the memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank k: in the online convex optimization (OCO) setting over dimension d, we match the regret of full-matrix methods, which need d^2 memory, using only dk memory, up to an additive error in the bottom d-k eigenvalues of the gradient covariance. Further, we show extensions of our work to Shampoo, placing the method on the memory-quality Pareto frontier of several large-scale benchmarks.
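To make the dk-memory claim concrete, here is a minimal sketch of the textbook Frequent Directions update applied to a stream of gradients. It is not the authors' implementation: the names `fd_update` and `fd_sketch` are hypothetical, the stream is assumed to arrive as the rows of an array, and the preconditioner and Shampoo extensions are not shown. The k x d sketch B stands in for the d x d gradient covariance G^T G, and the accumulated shrinkage bounds how much spectral mass was discarded.

```python
import numpy as np


def fd_update(sketch, grad):
    """One Frequent Directions step (illustrative helper, not from the paper).

    sketch: (k, d) array with k <= d whose last row is zero.
    grad:   (d,) gradient vector.
    Returns the new (k, d) sketch and the spectral mass removed in this step.
    """
    stacked = sketch.copy()
    stacked[-1] = grad  # place the new gradient in the free (zero) row
    _, s, vt = np.linalg.svd(stacked, full_matrices=False)
    rho = s[-1] ** 2  # smallest squared singular value
    # Shrink every direction by rho; the weakest direction becomes zero,
    # which frees the last row for the next gradient.
    shrunk = np.sqrt(np.maximum(s ** 2 - rho, 0.0))
    return shrunk[:, None] * vt, rho


def fd_sketch(grads, k):
    """Stream the rows of `grads` through a rank-k sketch using O(dk) memory."""
    d = grads.shape[1]
    sketch = np.zeros((k, d))
    escaped = 0.0  # total mass removed by shrinkage
    for g in grads:
        sketch, rho = fd_update(sketch, g)
        escaped += rho
    return sketch, escaped
```

For a gradient matrix G, `fd_sketch(G, k)` returns B and an escaped mass rho such that G^T G - B^T B is positive semidefinite with spectral norm at most rho. When the gradient spectrum is concentrated on a small leading eigenspace, rho stays small, which is the regime the abstract's additive error term refers to.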


