2.5-dimensional distributed model training

05/30/2021
by Boxiang Wang, et al.

Data parallelism is effective at speeding up training. However, when a single device cannot hold the whole model in memory, data parallelism alone is of no help. Another option is to split the model by operator, i.e., horizontally. Megatron-LM introduced a 1-dimensional (1D) distributed method that parallelizes training across GPUs, and Optimus provides a 2D solution for distributed tensor parallelism. However, these methods suffer from high communication overhead and low scaling efficiency on large-scale computing clusters. To address this problem, we investigate 2.5-dimensional distributed tensor parallelism. 2.5-dimensional matrix multiplication, introduced by Solomonik et al., is an effective method that runs multiple instances of Cannon's algorithm simultaneously to increase efficiency. Because Cannon's algorithm imposes many restrictions and requires a large number of shift operations, a new 2.5-dimensional matrix multiplication method is needed to further improve performance. Combining the strengths of SUMMA and 2.5-dimensional matrix multiplication, we introduce SUMMA2.5-LM for language models, which avoids the unnecessary communication cost that results from the increasing degree of language model parallelism. Compared with previous 1D and 2D model parallelism for language models, SUMMA2.5-LM reduces the communication cost of each layer, achieving a 1.45x efficiency improvement in our weak-scaling comparison between the 2.5D [4,4,4] arrangement and the 2D [8,8,1] arrangement.
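To make the layout idea concrete, the sketch below simulates a 2.5D, SUMMA-style multiplication in plain NumPy on a single process: the depth dimension d splits the batch into independent slices, and each slice is multiplied on a logical q x q grid using the usual SUMMA sequence of panel broadcasts and local accumulations. The grid shape [q, q, d], the block-splitting scheme, and the function names are illustrative assumptions for exposition, not the authors' SUMMA2.5-LM implementation.

```python
# Minimal single-process NumPy sketch of a 2.5D (SUMMA-style) layout.
# Assumption: depth slices the batch, weights are logically replicated
# across depth, and each depth layer runs an independent 2D SUMMA.
import numpy as np

def summa_2d(A, B, q):
    """Simulate SUMMA on a q x q grid: C = A @ B via q panel-broadcast steps."""
    m, k = A.shape
    kb = k // q
    C = np.zeros((m, B.shape[1]))
    for t in range(q):
        # Step t: column panel t of A is broadcast along grid rows,
        # row panel t of B is broadcast along grid columns; each rank
        # accumulates the local outer product into its block of C.
        A_panel = A[:, t * kb:(t + 1) * kb]
        B_panel = B[t * kb:(t + 1) * kb, :]
        C += A_panel @ B_panel
    return C

def summa_2p5d(X, W, q, d):
    """2.5D layout: d independent q x q SUMMA layers, each owning one
    slice of the batch dimension of the activations X."""
    slices = np.array_split(X, d, axis=0)      # depth dimension splits the batch
    outs = [summa_2d(Xi, W, q) for Xi in slices]
    return np.concatenate(outs, axis=0)

if __name__ == "__main__":
    q, d = 4, 4                                 # a [4, 4, 4] arrangement
    X = np.random.rand(64, 32)                  # activations
    W = np.random.rand(32, 32)                  # layer weight
    assert np.allclose(summa_2p5d(X, W, q, d), X @ W)
```

The point of the depth dimension in this sketch is that each layer's broadcasts involve only its own batch slice, which is how a 2.5D arrangement can lower per-layer communication volume relative to a single large 2D grid.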


