ZeRO-Offload: Democratizing Billion-Scale Model Training

by   Jie Ren, et al.

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular framework such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to CPU. To preserve compute efficiency, it is designed to minimize the data movement to/from GPU, and reduce CPU compute time while maximizing memory savings on GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for 10B parameter model compared to 30TF using PyTorch alone for a 1.4B parameter model, the largest that can be trained without running out of memory. ZeRO-Offload is also designed to scale on multiple-GPUs when available, offering near linear speedup on up to 128 GPUs. Additionally, it can work together with model parallelism to train models with over 70 billion parameters on a single DGX-2 box, a 4.5x increase in model size compared to using model parallelism alone. By combining compute and memory efficiency with ease-of-use, ZeRO-Offload democratizes large-scale model training making it accessible to even data scientists with access to just a single GPU.


ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning

In the last three years, the largest dense deep learning models have gro...

Elixir: Train a Large Language Model on a Small GPU Cluster

In recent years, the number of parameters of one deep learning (DL) mode...

MiCS: Near-linear Scaling for Training Gigantic Model on Public Cloud

Existing general purpose frameworks for gigantic model training, i.e., m...

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models

Training large DL models with billions and potentially trillions of para...

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training

Zero Redundancy Optimizer (ZeRO) has been used to train a wide range of ...

M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining

Recent expeditious developments in deep learning algorithms, distributed...

Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising

Click-Through Rate (CTR) prediction is a crucial component in the online...

Please sign up or login with your details

Forgot password? Click here to reset