Optimizer Fusion: Efficient Training with Better Locality and Parallelism

04/01/2021
by Zixuan Jiang, et al.

Machine learning frameworks adopt iterative optimizers to train neural networks. Conventional eager execution separates the updating of trainable parameters from the forward and backward computations. However, this approach introduces nontrivial training time overhead due to the lack of data locality and computation parallelism. In this work, we propose to fuse the optimizer with the forward or backward computation to better leverage locality and parallelism during training. By reordering the forward computation, gradient calculation, and parameter updating, our proposed method improves the efficiency of iterative optimizers. Experimental results demonstrate that we can achieve up to a 20% reduction in training time. Since our methods do not alter the optimizer algorithm, they can be used as a general "plug-in" technique in the training process.
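To make the idea concrete, below is a minimal PyTorch sketch of fusing a plain SGD update into the backward pass: each parameter is updated as soon as its gradient is ready, instead of in a separate optimizer.step() after backward. The model, learning rate, and hook-based mechanism are illustrative assumptions, not the paper's implementation, and the hook API requires PyTorch 2.1 or newer.

```python
import torch
import torch.nn as nn

# Sketch only: fuse a plain SGD update into backward via per-parameter hooks.
# Each update runs inside loss.backward(), right after the parameter's
# gradient has been accumulated, while that gradient is still hot in cache
# and the rest of backward can proceed in parallel.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
lr = 0.1  # illustrative learning rate

def make_fused_sgd_hook(lr):
    def hook(param):
        # Called during backward, once param.grad is ready.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place SGD step
        param.grad = None  # gradient is no longer needed after the update
    return hook

for p in model.parameters():
    # Requires PyTorch >= 2.1.
    p.register_post_accumulate_grad_hook(make_fused_sgd_hook(lr))

x = torch.randn(32, 128)
loss = model(x).sum()
loss.backward()  # parameters are updated during this call; no optimizer.step()
```

Because the update is interleaved with gradient computation, there is no separate pass over all parameters and gradients after backward finishes, which is the locality and parallelism benefit the abstract describes.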

Related research

- Ada-Grouper: Accelerating Pipeline Parallelism in Preempted Network by Adaptive Group-Scheduling for Micro-Batches (03/03/2023)
- An Efficient Vectorization Scheme for Stencil Computation (03/16/2021)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup (11/27/2020)
- Pipelined Backpropagation at Scale: Training Large Models without Batches (03/25/2020)
- Parallel Complexity of Forward and Backward Propagation (12/18/2017)
- Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations (03/16/2021)
- Layer-Parallel Training of Residual Networks with Auxiliary-Variable Networks (12/10/2021)
