Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

11/07/2021
by Sanjith Athlur, et al.

Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyper-clusters": hundreds or thousands of GPUs wired with dedicated high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, this dependence on hyper-clusters and custom high-speed interconnects limits cluster size, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyper-clusters. In this paper, we present Varuna, a new system that enables training massive deep learning models over commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user's training job to efficiently use any given set of resources. It can therefore leverage "low-priority" VMs that cost about 5x less than dedicated GPUs, significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200 billion parameter model, on such 5x cheaper "spot VMs" while maintaining high training throughput. Varuna improves end-to-end training time by up to 18x compared to other model-parallel approaches and by up to 26% compared to other pipeline-parallel approaches. The code for Varuna is available at https://github.com/microsoft/varuna.
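To make the claim that Varuna "automatically configures the user's training job to efficiently use any given set of resources" concrete, the Python sketch below shows one hypothetical way such a configuration search could look: enumerate the ways to split N GPUs into pipeline stages times data-parallel replicas, score each split with a simple cost model, and pick the cheapest. This is purely illustrative; the cost model, constants, and names (iteration_time, best_config) are assumptions for exposition, not Varuna's actual code or API.

    # Illustrative sketch only: a toy configuration search of the kind the
    # abstract describes. All names and constants here are hypothetical
    # assumptions, not Varuna's implementation.

    def iteration_time(pipeline_depth, dp_width, num_microbatches=8,
                       stage_time=0.05, grad_bytes=4e9, net_bw=1e9):
        # Pipeline compute time per iteration: fill/drain overhead grows
        # with depth, steady state with the number of microbatches.
        compute = (num_microbatches + pipeline_depth - 1) * stage_time
        # Gradient all-reduce over commodity networking (ring cost model);
        # deeper pipelines shrink each replica's gradient shard.
        comm = (2 * (dp_width - 1) / dp_width
                * (grad_bytes / pipeline_depth) / net_bw)
        return compute + comm

    def best_config(num_gpus):
        """Enumerate (pipeline_depth, data_parallel_width) splits of the
        available GPUs and return the split with the lowest estimated
        iteration time under the toy cost model above."""
        candidates = [(d, num_gpus // d) for d in range(1, num_gpus + 1)
                      if num_gpus % d == 0]
        return min(candidates, key=lambda c: iteration_time(*c))

    if __name__ == "__main__":
        depth, width = best_config(64)
        print(f"pipeline depth={depth}, data-parallel width={width}")

With the assumed constants, slow commodity bandwidth pushes the search toward deeper pipelines (smaller all-reduces), while compute overhead pushes back toward wider data parallelism; the chosen split balances the two, which is the trade-off the abstract's "thrifty use of networking resources" refers to.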

Related research

04/16/2021  ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
In the last three years, the largest dense deep learning models have gro...

10/01/2020  PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters
DNN learning jobs are common in today's clusters due to the advances in ...

08/30/2022  EasyScale: Accuracy-consistent Elastic Training for Deep Learning
Distributed synchronized GPU training is commonly used for deep learning...

09/26/2019  Elastic deep learning in multi-tenant GPU cluster
Multi-tenant GPU clusters are common nowadays due to the huge success of...

06/05/2023  How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study
Training deep learning models in the cloud or on dedicated hardware is e...

11/24/2019  Stage-based Hyper-parameter Optimization for Deep Learning
As deep learning techniques advance more than ever, hyper-parameter opti...

11/30/2022  COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training
Modern Deep Learning (DL) models have grown to sizes requiring massive c...
