Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training

05/23/2023
by Hong Liu, et al.

Given the massive cost of language model pre-training, a non-trivial improvement of the optimization algorithm would lead to a material reduction in the time and cost of training. Adam and its variants have been state-of-the-art for years, while more sophisticated second-order (Hessian-based) optimizers often incur too much per-step overhead. In this paper, we propose Sophia, Second-order Clipped Stochastic Optimization, a simple and scalable second-order optimizer that uses a light-weight estimate of the diagonal Hessian as the pre-conditioner. The update is the moving average of the gradients divided by the moving average of the estimated Hessian, followed by element-wise clipping. The clipping controls the worst-case update size and tames the negative impact of non-convexity and rapid changes of the Hessian along the trajectory. Sophia estimates the diagonal Hessian only every handful of iterations, which incurs negligible average per-step time and memory overhead. On language modeling with GPT-2 models of sizes ranging from 125M to 770M parameters, Sophia achieves a 2x speed-up over Adam in the number of steps, total compute, and wall-clock time. Theoretically, we show that Sophia adapts to the curvature in different components of the parameters, which can be highly heterogeneous for language modeling tasks. Our run-time bound does not depend on the condition number of the loss.
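To make the update rule concrete, the following is a minimal NumPy sketch of a Sophia-style step, assuming the caller supplies the stochastic gradient at every step and a diagonal-Hessian estimate (e.g., a Hutchinson-style estimate) every k steps. The class name SophiaSketch, the hyperparameter values, and the clipping threshold rho are illustrative assumptions, not the authors' reference implementation.

import numpy as np

class SophiaSketch:
    """Illustrative Sophia-style update: EMA of gradients divided by an EMA of
    diagonal-Hessian estimates, followed by element-wise clipping."""

    def __init__(self, lr=1e-4, beta1=0.96, beta2=0.99, rho=0.04, eps=1e-12, k=10):
        self.lr, self.beta1, self.beta2 = lr, beta1, beta2
        self.rho, self.eps, self.k = rho, eps, k
        self.m = None   # EMA of gradients
        self.h = None   # EMA of diagonal-Hessian estimates
        self.t = 0      # step counter

    def step(self, params, grad, hess_diag_estimate=None):
        """params, grad: 1-D arrays of the same shape.
        hess_diag_estimate: optional 1-D array computed by the caller
        (only needed every k steps, which keeps the average overhead low)."""
        if self.m is None:
            self.m = np.zeros_like(params)
            self.h = np.zeros_like(params)
        # Exponential moving average of the gradients
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        # Refresh the diagonal-Hessian EMA only every k steps
        if hess_diag_estimate is not None and self.t % self.k == 0:
            self.h = self.beta2 * self.h + (1 - self.beta2) * hess_diag_estimate
        self.t += 1
        # Precondition by the (clamped) Hessian EMA, then clip element-wise
        precond = self.m / np.maximum(self.h, self.eps)
        update = np.clip(precond, -self.rho, self.rho)
        return params - self.lr * update

Because of the element-wise clipping, each coordinate of the update is bounded in magnitude by lr * rho regardless of how small the Hessian estimate is, which is how the worst-case update size is controlled.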
