MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

04/30/2022
by Zhen Zhang, et al.

Existing general-purpose frameworks for training gigantic models, i.e., models with billions to trillions of parameters, cannot scale efficiently on public cloud environments due to large communication overheads. In this paper, we propose MiCS, which Minimizes the Communication Scale to bring down communication overhead. Specifically, by decreasing the number of participants in a communication collective, MiCS can utilize the existing heterogeneous network bandwidth on the cloud, reduce network traffic over slower links, and amortize expensive global gradient synchronization overheads. Our evaluation on AWS shows that the system throughput of MiCS is up to 2.89× that of state-of-the-art large-model training systems. MiCS achieves near-linear scaling efficiency, which is up to 1.27× that of DeepSpeed. MiCS allows us to train a proprietary model with 100 billion parameters on 512 GPUs with 99.4% of the theoretical computation power of each GPU, on a public cloud with less GPU memory and more restricted networks than DGX-A100 clusters.
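The core idea of shrinking the participant set of a collective can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' implementation: it assumes a standard torch.distributed setup, and the build_partition_groups helper and the partition_size value are hypothetical. It shows an all-reduce scoped to a small sub-group of ranks instead of the full world.

# Minimal sketch of scoping collectives to small sub-groups,
# the "minimize the communication scale" idea (illustrative only).
import torch
import torch.distributed as dist


def build_partition_groups(partition_size: int):
    """Split the global ranks into consecutive sub-groups of `partition_size`.

    Model states are sharded only within a sub-group, so frequent
    collectives involve `partition_size` ranks instead of the full
    world size.
    """
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    my_group = None
    for start in range(0, world_size, partition_size):
        ranks = list(range(start, min(start + partition_size, world_size)))
        # new_group must be called by every rank with identical arguments.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            my_group = group
    return my_group


if __name__ == "__main__":
    dist.init_process_group(backend="gloo")  # "nccl" on GPU clusters
    group = build_partition_groups(partition_size=8)  # e.g., one 8-GPU node

    # Gradient-like tensor; in MiCS this would be a shard of the model states.
    t = torch.ones(4) * dist.get_rank()
    # The collective only involves ranks inside the small group,
    # keeping traffic on the faster intra-group links.
    dist.all_reduce(t, group=group)
    print(f"rank {dist.get_rank()}: {t.tolist()}")

Launched with torchrun, each group of eight ranks synchronizes only among itself; this kind of scoped collective is what keeps traffic off the slower cross-node links in the setting the abstract describes.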


Related research

ZeRO-Offload: Democratizing Billion-Scale Model Training (01/18/2021)
Large-scale model training has been a playing ground for a limited few r...

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training (06/16/2023)
Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of ...

Optimized Network Architectures for Large Language Model Training with Billions of Parameters (07/22/2023)
This paper challenges the well-established paradigm for building any-to-...

Pacer: Network Side-Channel Mitigation in the Cloud (08/30/2019)
An important concern for many Cloud customers is data confidentiality. O...
Scalable Breadth-First Search on a GPU Cluster (03/11/2018)
On a GPU cluster, the ratio of high computing power to communication ban...

Analysis of Distributed Deep Learning in the Cloud (08/30/2022)
We aim to resolve this problem by introducing a comprehensive distribute...
