Dynamically Adjusting Transformer Batch Size by Monitoring Gradient Direction Change

05/05/2020
by Hongfei Xu, et al.

The choice of hyper-parameters affects the performance of neural models. While much previous research (Sutskever et al., 2013; Duchi et al., 2011; Kingma and Ba, 2015) focuses on accelerating convergence and reducing the effects of the learning rate, comparatively few papers concentrate on the effect of batch size. In this paper, we analyze how increasing the batch size affects the gradient direction, and propose to evaluate the stability of gradients by their angle change. In our observations, the angle change of the gradient direction first tends to stabilize (i.e., gradually decrease) as mini-batches are accumulated, and then starts to fluctuate. We therefore propose to determine batch sizes automatically and dynamically by accumulating the gradients of mini-batches and performing an optimization step at the point where the gradient direction starts to fluctuate. To keep this approach efficient for large models, we propose a sampling approach that selects only the gradients of parameters sensitive to the batch size. Our approach determines proper and efficient batch sizes dynamically during training. In our experiments on the WMT 14 English to German and English to French tasks, our approach improves the Transformer with a fixed 25k batch size by +0.73 and +0.82 BLEU, respectively.
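The accumulation-and-stopping rule described in the abstract can be sketched in a few lines of PyTorch. The snippet below is a hypothetical illustration, not the authors' released code: it accumulates per-mini-batch gradients, measures the angle between the accumulated gradient before and after each new mini-batch, and triggers the optimizer step once that angle stops decreasing, i.e., once the direction starts to fluctuate. The function names (`flat_grad`, `angle`, `train_step_dynamic_batch`) and the `max_accum` cap are assumptions made for this example.

```python
# Hypothetical sketch of dynamic batch sizing via gradient-angle monitoring.
import math
import torch

def flat_grad(model, params=None):
    """Concatenate (a subset of) parameter gradients into one flat vector."""
    params = list(model.parameters()) if params is None else params
    return torch.cat([p.grad.detach().reshape(-1)
                      for p in params if p.grad is not None])

def angle(u, v, eps=1e-12):
    """Angle in radians between two flattened gradient vectors."""
    cos = torch.dot(u, v) / (u.norm() * v.norm() + eps)
    return torch.acos(cos.clamp(-1.0, 1.0)).item()

def train_step_dynamic_batch(model, optimizer, loss_fn, batches, max_accum=64):
    """Accumulate mini-batches until the gradient direction starts to
    fluctuate, then perform one optimization step. Returns the number of
    mini-batches that formed the effective (dynamic) batch."""
    optimizer.zero_grad()
    prev_dir, prev_angle = None, math.pi
    n_used = 0
    for n_used, (x, y) in enumerate(batches, start=1):
        loss_fn(model(x), y).backward()      # gradients accumulate in p.grad
        cur_dir = flat_grad(model)
        if prev_dir is not None:
            cur_angle = angle(cur_dir, prev_dir)
            # Keep accumulating while the angle change keeps shrinking
            # (direction stabilizing); stop once it grows again (fluctuation).
            if cur_angle > prev_angle or n_used >= max_accum:
                break
            prev_angle = cur_angle
        prev_dir = cur_dir
    optimizer.step()
    optimizer.zero_grad()
    return n_used
```

For large models, the sampling idea mentioned in the abstract could be approximated here by passing only a sampled subset of batch-size-sensitive parameters through the optional `params` argument of `flat_grad`, so the angle is computed over a much smaller vector.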


Related research

11/06/2017
AdaBatch: Efficient Gradient Aggregation Rules for Sequential and Parallel Stochastic Gradient Methods
We study a new aggregation operator for gradients coming from a mini-bat...

02/08/2018
Mini-Batch Stochastic ADMMs for Nonconvex Nonsmooth Optimization
In the paper, we study the mini-batch stochastic ADMMs (alternating dire...

05/03/2020
Adaptive Learning of the Optimal Mini-Batch Size of SGD
Recent advances in the theoretical understanding of SGD (Qian et al., 201...

05/19/2017
EE-Grad: Exploration and Exploitation for Cost-Efficient Mini-Batch SGD
We present a generic framework for trading off fidelity and cost in comp...

04/13/2023
Statistical Analysis of Fixed Mini-Batch Gradient Descent Estimator
We study here a fixed mini-batch gradient descent (FMGD) algorithm to sol...

09/15/2019
Empirical study towards understanding line search approximations for training neural networks
Choosing appropriate step sizes is critical for reducing the computation...

06/15/2020
The Limit of the Batch Size
Large-batch training is an efficient approach for current distributed de...
