MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates

06/02/2023
by Mohammad Mozaffari, et al.

This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates than their first-order counterparts, have cubic complexity with respect to the model size and/or the training batch size. They therefore exhibit poor scalability and performance in transformer models, e.g. large language models (LLMs), because the batch sizes in these models scale with the attention-mechanism sequence length, leading to large model and batch sizes. MKOR's complexity is quadratic with respect to the model size, alleviating the computation bottlenecks of second-order methods. Because of this high computation complexity, state-of-the-art implementations of second-order methods can only afford to update the second-order information infrequently, and thus do not fully exploit the promise of better convergence from these updates. By reducing the computation complexity of the second-order updates as well as achieving a linear communication complexity, MKOR increases the frequency of second-order updates. We also propose a hybrid version of MKOR (called MKOR-H) that falls back mid-training to a first-order optimizer if the second-order updates no longer accelerate convergence. Our experiments show that MKOR outperforms state-of-the-art first-order methods, e.g. the LAMB optimizer, and the best implementations of second-order methods, i.e. KAISA/KFAC, by up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs.
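The abstract's central efficiency claim is that rank-1 updates allow the inverses of curvature (Kronecker) factors to be maintained at quadratic rather than cubic cost per update. A minimal NumPy sketch of that general idea, based on the Sherman-Morrison identity, is shown below; the function and variable names are illustrative assumptions and are not taken from the MKOR implementation.

```python
import numpy as np

def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1} and a rank-1 update u v^T.

    Cost is O(n^2), versus O(n^3) for recomputing the inverse from scratch.
    """
    Au = A_inv @ u                       # A^{-1} u, O(n^2)
    vA = v @ A_inv                       # v^T A^{-1}, O(n^2)
    denom = 1.0 + v @ Au                 # scalar correction term
    return A_inv - np.outer(Au, vA) / denom


# Toy usage: track the inverse of a factor that accumulates rank-1
# statistics (a stand-in for per-layer activation/gradient outer products).
n = 512
A = np.eye(n)                            # running factor, damped identity start
A_inv = np.eye(n)                        # its tracked inverse

for step in range(10):
    a = np.random.randn(n)               # illustrative per-step statistic vector
    A += np.outer(a, a)                  # rank-1 accumulation into the factor
    A_inv = sherman_morrison_update(A_inv, a, a)

# The tracked inverse stays consistent with the accumulated factor.
assert np.allclose(A_inv, np.linalg.inv(A), atol=1e-6)
```

Because each refresh touches only O(n^2) entries, the inverse can be kept up to date far more frequently than a full O(n^3) re-inversion would allow, which is the scaling argument the abstract makes for more frequent second-order updates.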

Related research

07/04/2021 · KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks
Kronecker-factored Approximate Curvature (K-FAC) has recently been shown...

05/16/2022 · Optimizing the optimizer for data driven deep neural networks and physics informed neural networks
We investigate the role of the optimizer in determining the quality of t...

12/14/2020 · An Adaptive Memory Multi-Batch L-BFGS Algorithm for Neural Network Training
Motivated by the potential for parallel implementation of batch-based al...

02/16/2023 · FOSI: Hybrid First and Second Order Optimization
Though second-order optimization methods are highly effective, popular a...

11/25/2022 · PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices
Pipeline parallelism enables efficient training of Large Language Models...

01/28/2022 · Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization
Second-order optimizers are thought to hold the potential to speed up ne...

08/12/2022 · A Practical Second-order Latent Factor Model via Distributed Particle Swarm Optimization
Latent Factor (LF) models are effective in representing high-dimension a...
