adaQN: An Adaptive Quasi-Newton Algorithm for Training RNNs

11/04/2015
by Nitish Shirish Keskar, et al.

Recurrent Neural Networks (RNNs) are powerful models that achieve exceptional performance on several pattern recognition problems. However, the training of RNNs is a computationally difficult task owing to the well-known "vanishing/exploding" gradient problem. Algorithms proposed for training RNNs either exploit no (or limited) curvature information, keeping the per-iteration cost cheap, or attempt to gain significant curvature information at the expense of a higher per-iteration cost. The former set includes diagonally-scaled first-order methods such as ADAGRAD and ADAM, while the latter consists of second-order algorithms like Hessian-Free Newton and K-FAC. In this paper, we present adaQN, a stochastic quasi-Newton algorithm for training RNNs. Our approach retains a low per-iteration cost while allowing for non-diagonal scaling through a stochastic L-BFGS updating scheme. The method uses a novel L-BFGS scaling initialization scheme and is judicious in storing and retaining L-BFGS curvature pairs. We present numerical experiments on two language modeling tasks and show that adaQN is competitive with popular RNN training algorithms.
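The abstract describes the core machinery only at a high level: L-BFGS updating built from stored curvature pairs, with care taken in deciding which pairs to keep. As a rough illustration (not the paper's exact algorithm), the sketch below implements the standard L-BFGS two-loop recursion in NumPy, together with a simple curvature test for accepting an (s, y) pair; the memory size, the threshold eps, and the initial scaling gamma = s^T y / y^T y are generic textbook choices, whereas adaQN's novel scaling initialization and pair-retention rules are given in the full text.

# Illustrative sketch of the kind of L-BFGS machinery adaQN builds on.
# The acceptance threshold `eps` and the gamma scaling below are generic
# textbook choices, not the paper's specific scheme.
import numpy as np

class LBFGSMemory:
    def __init__(self, memory=10, eps=1e-8):
        self.memory = memory            # max number of (s, y) pairs retained
        self.eps = eps                  # curvature threshold for accepting a pair
        self.s_list, self.y_list = [], []

    def store(self, s, y):
        # Keep the pair only if it carries useful curvature: s^T y > eps * ||s||^2.
        if np.dot(s, y) > self.eps * np.dot(s, s):
            self.s_list.append(s)
            self.y_list.append(y)
            if len(self.s_list) > self.memory:  # discard the oldest pair
                self.s_list.pop(0)
                self.y_list.pop(0)

    def direction(self, grad):
        # Two-loop recursion: returns -H @ grad, with the inverse-Hessian
        # approximation H built implicitly from the stored pairs.
        q = grad.astype(float)
        rhos = [1.0 / np.dot(y, s) for s, y in zip(self.s_list, self.y_list)]
        alphas = []
        for s, y, rho in zip(reversed(self.s_list), reversed(self.y_list),
                             reversed(rhos)):        # newest to oldest
            a = rho * np.dot(s, q)
            alphas.append(a)
            q -= a * y
        if self.s_list:                 # initial scaling gamma = s^T y / y^T y
            s, y = self.s_list[-1], self.y_list[-1]
            q *= np.dot(s, y) / np.dot(y, y)
        for s, y, rho, a in zip(self.s_list, self.y_list, rhos,
                                reversed(alphas)):   # oldest to newest
            q += (a - rho * np.dot(y, q)) * s
        return -q

# Toy usage on a strongly convex quadratic f(x) = 0.5 x^T A x - b^T x.
rng = np.random.default_rng(0)
R = rng.standard_normal((20, 20))
A = R @ R.T / 20 + np.eye(20)
b = rng.standard_normal(20)
x, mem = np.zeros(20), LBFGSMemory(memory=5)
for _ in range(50):
    g = A @ x - b                            # current gradient
    x_new = x + 0.1 * mem.direction(g)       # fixed step length for simplicity
    mem.store(x_new - x, A @ x_new - b - g)  # s = step, y = gradient change
    x = x_new
print("final gradient norm:", np.linalg.norm(A @ x - b))

In the stochastic setting, rejecting pairs with small or negative s^T y is what keeps the implicit inverse-Hessian approximation positive definite despite mini-batch gradient noise, which is the main reason quasi-Newton updating needs extra care when full gradients are replaced by stochastic ones.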

Related research

09/09/2019
An Adaptive Stochastic Nesterov Accelerated Quasi Newton Method for Training RNNs
A common problem in training neural networks is the vanishing and/or exp...

01/27/2014
A Stochastic Quasi-Newton Method for Large-Scale Optimization
The question of how to incorporate curvature information in stochastic a...

11/16/2022
SketchySGD: Reliable Stochastic Optimization via Robust Curvature Estimates
We introduce SketchySGD, a stochastic quasi-Newton method that uses sket...

05/01/2023
ISAAC Newton: Input-based Approximate Curvature for Newton's Method
We present ISAAC (Input-baSed ApproximAte Curvature), a novel method tha...

11/09/2018
Complex Unitary Recurrent Neural Networks using Scaled Cayley Transform
Recurrent neural networks (RNNs) have been successfully used on a wide r...

06/01/2022
Stochastic Gradient Methods with Preconditioned Updates
This work considers non-convex finite sum minimization. There are a numb...

10/26/2020
An Efficient Newton Method for Extreme Similarity Learning with Nonlinear Embeddings
We study the problem of learning similarity by using nonlinear embedding...
