Compressing Gradient Optimizers via Count-Sketches

02/01/2019
by   Ryan Spring, et al.

Many popular first-order optimization methods (e.g., Momentum, AdaGrad, Adam) accelerate the convergence of deep learning models. However, these algorithms require auxiliary parameters, which cost additional memory proportional to the number of parameters in the model. The problem is becoming more severe as deep learning models continue to grow larger in order to learn from complex, large-scale datasets. Our proposed solution is to maintain a linear sketch to compress the auxiliary variables. We demonstrate that our technique has the same performance as the full-sized baseline, while using significantly less space for the auxiliary variables. Theoretically, we prove that count-sketch optimization maintains the SGD convergence rate, while gracefully reducing memory usage for large models. On the large-scale 1-Billion Word dataset, we save 25% of the memory used during training (relative to an 11.7 GB baseline) by compressing the Adam optimizer in the Embedding and Softmax layers, with negligible accuracy and performance loss.
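
As a rough illustration of the idea, the snippet below sketches how a count-sketch could replace the dense second-moment accumulator in an AdaGrad-style update. It is a minimal sketch under assumed names and sizes (the CountSketch class, its depth/width parameters, and the adagrad_step helper are illustrative, not the authors' implementation): the auxiliary variable lives in a small depth-by-width table instead of one counter per parameter, updates are scatter-added with random signs, and reads take the median across rows.

```python
# Minimal illustrative count-sketch for an optimizer auxiliary variable.
# All names and sizes here are assumptions for clarity, not the paper's code.
import numpy as np

class CountSketch:
    def __init__(self, depth, width, n_params, seed=0):
        rng = np.random.default_rng(seed)
        self.table = np.zeros((depth, width))                          # depth x width counters
        self.bucket = rng.integers(0, width, size=(depth, n_params))   # hash: parameter -> bucket
        self.sign = rng.choice([-1.0, 1.0], size=(depth, n_params))    # sign hash

    def update(self, idx, values):
        # Scatter-add signed values into each of the depth rows.
        for d in range(self.table.shape[0]):
            np.add.at(self.table[d], self.bucket[d, idx], self.sign[d, idx] * values)

    def query(self, idx):
        # Median of signed bucket reads is the standard count-sketch estimate.
        est = np.stack([self.sign[d, idx] * self.table[d, self.bucket[d, idx]]
                        for d in range(self.table.shape[0])])
        return np.median(est, axis=0)

def adagrad_step(params, grad_idx, grad_vals, sketch, lr=0.1, eps=1e-8):
    # AdaGrad-style step on a sparse set of gradient coordinates, with the
    # sum of squared gradients held in the sketch instead of a dense array.
    sketch.update(grad_idx, grad_vals ** 2)            # accumulate squared gradients
    v_hat = np.maximum(sketch.query(grad_idx), 0.0)    # estimated second moment
    params[grad_idx] -= lr * grad_vals / (np.sqrt(v_hat) + eps)
```

Because the count-sketch is a linear data structure, any auxiliary variable that is maintained by additive updates can be stored this way; the memory cost is set by the table size rather than by the number of model parameters, which is where the savings for large embedding and softmax layers come from.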

research · 07/25/2019
DEAM: Accumulated Momentum with Discriminative Weight for Stochastic Optimization
Optimization algorithms with momentum, e.g., Nesterov Accelerated Gradie...

research · 05/30/2019
Global Momentum Compression for Sparse Communication in Distributed SGD
With the rapid growth of data, distributed stochastic gradient descent (...

research · 08/15/2018
Frank-Wolfe Style Algorithms for Large Scale Optimization
We introduce a few variants on Frank-Wolfe style algorithms suitable for...

research · 01/30/2019
Memory-Efficient Adaptive Optimization for Large-Scale Learning
Adaptive gradient-based optimizers such as AdaGrad and Adam are among th...

research · 04/06/2020
A Count Sketch Kaczmarz Method For Solving Large Overdetermined Linear Systems
In this paper, combining count sketch and maximal weighted residual Kacz...

research · 08/20/2022
Sharp Analysis of Sketch-and-Project Methods via a Connection to Randomized Singular Value Decomposition
Sketch-and-project is a framework which unifies many known iterative met...

research · 10/03/2016
Adaptive Neuron Apoptosis for Accelerating Deep Learning on Large Scale Systems
We present novel techniques to accelerate the convergence of Deep Learni...
