COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

11/30/2022
by Divya Kiran Kadiyala, et al.

Modern Deep Learning (DL) models have grown to sizes that require massive clusters of specialized, high-end nodes to train. Designing such clusters to maximize both performance and utilization, and thereby amortize their steep cost, is a challenging task that requires a careful balance of compute, memory, and network resources. Moreover, each model exposes a plethora of tuning knobs that drastically affect performance, with optimal values that often depend on the underlying cluster's characteristics, necessitating a complex cluster-workload co-design process. To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow for jointly studying the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training. We develop a step-by-step process to establish a reusable and flexible methodology, and demonstrate its application with a case study of training a Transformer-1T model on a cluster with variable compute, memory, and network resources. Our case study demonstrates COMET's utility in identifying promising architectural optimization directions and in guiding system designers toward well-configured model and cluster parameters.
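To make the co-design loop concrete, the sketch below illustrates the kind of joint sweep the abstract describes: enumerating (data, tensor, pipeline) parallelization degrees against a candidate cluster's compute, memory, and network provisioning, and ranking the feasible points with an analytical per-step cost. Everything here, the cost model, the constants, and all names, is an illustrative assumption for a roughly 1T-parameter model, not COMET's actual workflow or numbers.

```python
# Minimal sketch of a COMET-style cluster-workload co-design sweep.
# The cost model, constants, and every name below are illustrative
# assumptions, not the paper's actual methodology or numbers.
from dataclasses import dataclass
from itertools import product

PARAMS_B = 1000.0       # assumed model size in billions of params (~Transformer-1T)
TOKENS_PER_STEP = 2048  # assumed tokens processed per training step

@dataclass(frozen=True)
class Cluster:
    num_gpus: int       # total accelerators
    tflops: float       # per-GPU sustained compute (TFLOP/s)
    mem_gb: float       # per-GPU memory capacity (GB)
    link_gbps: float    # per-GPU network bandwidth (GB/s)

@dataclass(frozen=True)
class Parallelism:
    data: int           # data-parallel degree
    tensor: int         # tensor (model) parallel degree
    pipeline: int       # pipeline-parallel degree

def step_time(c: Cluster, p: Parallelism) -> float:
    """Toy per-step cost (seconds): compute time plus a ring all-reduce
    of fp16 gradients across data-parallel replicas. A real methodology
    would also model pipeline bubbles, activation memory, and topology."""
    flops = 6 * PARAMS_B * 1e9 * TOKENS_PER_STEP       # ~6*N FLOPs per token
    compute = flops / (c.num_gpus * c.tflops * 1e12)
    shard_gb = 2 * PARAMS_B / (p.tensor * p.pipeline)  # fp16 grads per replica
    comm = 2 * shard_gb * (p.data - 1) / p.data / c.link_gbps
    return compute + comm

def sweep(c: Cluster):
    """Enumerate (data, tensor, pipeline) degrees whose product matches the
    GPU count, drop configurations whose fp16 weight shard overflows GPU
    memory, and return the cheapest remaining point."""
    best = None
    degrees = [1, 2, 4, 8, 16, 32, 64]
    for d, t, pl in product(degrees, repeat=3):
        if d * t * pl != c.num_gpus:
            continue
        if 2 * PARAMS_B / (t * pl) > c.mem_gb:         # weights must fit
            continue
        cand = (step_time(c, Parallelism(d, t, pl)), Parallelism(d, t, pl))
        if best is None or cand[0] < best[0]:
            best = cand
    return best

if __name__ == "__main__":
    for bw in (25.0, 100.0, 400.0):                    # vary network provisioning
        cluster = Cluster(num_gpus=1024, tflops=312.0, mem_gb=80.0, link_gbps=bw)
        print(f"{bw:>5} GB/s ->", sweep(cluster))
```

In this toy model, raising the per-GPU bandwidth shrinks the all-reduce term, so the best configuration drifts toward higher data-parallel degrees; the paper's case study explores this kind of interaction, in far more detail, for the Transformer-1T workload.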

