ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

06/16/2023
by Guanhua Wang, et al.

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of large language models on massive GPU clusters due to its ease of use, efficiency, and good scalability. However, when training on low-bandwidth clusters, or at a scale that forces the per-GPU batch size to be small, ZeRO's effective throughput is limited by the high communication volume of gathering weights in the forward and backward passes and of averaging gradients. This paper introduces three communication volume reduction techniques, which we collectively refer to as ZeRO++, targeting each of the communication collectives in ZeRO. The first is a block-quantization-based all-gather. The second is a data remapping that trades off communication for more memory. The third is a novel all-to-all-based quantized gradient-averaging paradigm that replaces the reduce-scatter collective and preserves accuracy despite communicating low-precision data. Collectively, ZeRO++ reduces the communication volume of ZeRO by 4x, enabling up to 2.16x better throughput at 384-GPU scale.
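To make the first technique more concrete, below is a minimal sketch of block quantization as it might be applied to an all-gather payload: each weight shard is split into fixed-size blocks, each block is quantized to int8 with its own scale, and the receiver dequantizes after the collective. This is an illustrative example only; the function names (`block_quantize`, `block_dequantize`) and the block size are assumptions, not ZeRO++'s or DeepSpeed's actual implementation.

```python
# Sketch of block-quantized payloads for an all-gather of weight shards.
# Illustrative only; not the ZeRO++/DeepSpeed implementation.
import numpy as np

def block_quantize(x: np.ndarray, block_size: int = 256):
    """Quantize a flat fp16/fp32 tensor to int8 with one scale per block."""
    pad = (-x.size) % block_size
    x_pad = np.pad(x.astype(np.float32), (0, pad))
    blocks = x_pad.reshape(-1, block_size)
    # A per-block scale keeps quantization error local to each block.
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)
    q = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32), x.size

def block_dequantize(q: np.ndarray, scales: np.ndarray, numel: int) -> np.ndarray:
    """Recover an approximate fp32 tensor from int8 blocks and per-block scales."""
    return (q.astype(np.float32) * scales).reshape(-1)[:numel]

# Example: the int8 payload (plus small per-block scales) is roughly half the
# size of the original fp16 shard, cutting all-gather traffic accordingly.
w = np.random.randn(10_000).astype(np.float16)
q, s, n = block_quantize(w)
w_hat = block_dequantize(q, s, n)
print("max abs reconstruction error:", np.abs(w.astype(np.float32) - w_hat).max())
```

Sending the int8 blocks and per-block scales instead of fp16 weights roughly halves the all-gather volume in this sketch; the abstract's overall 4x reduction comes from combining all three techniques.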

