ZeRO: Memory Optimization Towards Training A Trillion Parameter Models

10/04/2019
by Samyam Rajbhandari, et al.

Training large DL models with billions and potentially trillions of parameters is challenging. Existing solutions face fundamental limitations in achieving memory efficiency and scaling (computation/communication) efficiency at the same time. Data parallelism does not reduce the memory footprint per device: a model with 1.5 billion parameters or more runs out of memory. Model parallelism scales poorly beyond the devices of a single node due to fine-grained computation and expensive communication. We develop a novel solution, the Zero Redundancy Optimizer (ZeRO), that optimizes memory to achieve both memory efficiency and scaling efficiency. Unlike basic data parallelism, where model states are replicated across data-parallel processes, ZeRO partitions the model states, so that the trainable model size scales linearly with the number of devices. Furthermore, it retains scaling efficiency by rescheduling computation and communication and by reducing the model-parallelism degree required to run large models. Our analysis of memory requirements and communication volume shows that ZeRO has the potential to scale beyond one trillion parameters on today's hardware (e.g., 1024 GPUs across 64 DGX-2 nodes). To meet near-term scaling goals and demonstrate ZeRO's capability, we implemented the stage-1 optimizations of ZeRO (out of the three stages described in the paper) and evaluated this version, ZeRO-OS. ZeRO-OS reduces memory usage and increases the trainable model size by 4x compared with the state of the art, scaling up to 100B parameters. Moving forward, we will unlock the stage-2 optimizations, with up to 8x memory savings per device, and ultimately the stage-3 optimizations, which reduce memory linearly with the number of devices and can potentially scale to models of arbitrary size. We are excited to transform very large models from impossible to train into feasible and efficient to train!
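The roughly 4x stage-1 saving follows from simple memory arithmetic. Below is a rough, illustrative Python sketch (not the authors' code), assuming the common mixed-precision Adam setup: fp16 parameters and gradients at 2 bytes per parameter each, and fp32 optimizer states (master parameters, momentum, variance) at 12 bytes per parameter. The function name and the 7.5B-parameter example are hypothetical.

# Back-of-the-envelope sketch of per-device memory for model states under
# mixed-precision Adam training, with and without ZeRO stage-1 style
# optimizer-state partitioning across data-parallel devices.

def model_state_memory_gb(num_params, num_devices, partition_optimizer_states=False):
    """Approximate per-device memory (GB) for parameters, gradients, and optimizer states."""
    param_bytes = 2 * num_params      # fp16 parameters, replicated on every device
    grad_bytes = 2 * num_params       # fp16 gradients, replicated on every device
    optim_bytes = 12 * num_params     # fp32 master params + Adam momentum + variance
    if partition_optimizer_states:    # stage-1 idea: shard optimizer states across devices
        optim_bytes /= num_devices
    return (param_bytes + grad_bytes + optim_bytes) / 1e9

# Example: a 7.5B-parameter model on 64 data-parallel devices.
baseline = model_state_memory_gb(7.5e9, 64)                                   # ~120 GB per device
stage1 = model_state_memory_gb(7.5e9, 64, partition_optimizer_states=True)    # ~31 GB per device
print(f"replicated: {baseline:.0f} GB, optimizer-state partitioning: {stage1:.0f} GB")

Under these assumptions, per-device model-state memory drops from about 120 GB to about 31 GB, which is where the ~4x figure comes from; partitioning gradients (stage 2) and parameters (stage 3) as well shrinks the remaining replicated terms further.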


