GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

02/02/2022
by Menglu Yu, et al.

Fueled by advances in distributed deep learning (DDL), recent years have witnessed a rapidly growing demand for resource-intensive distributed/parallel computing to process DDL jobs. To resolve network communication bottlenecks and load-balancing issues in distributed computing, the so-called “ring-all-reduce” decentralized architecture has been increasingly adopted to remove the need for dedicated parameter servers. To date, however, there remains a lack of theoretical understanding of how to design resource optimization algorithms for efficiently scheduling ring-all-reduce DDL jobs in computing clusters. This motivates us to fill this gap by proposing a series of new resource scheduling designs for ring-all-reduce DDL jobs. Our contributions in this paper are three-fold: i) We propose a new resource scheduling analytical model for ring-all-reduce deep learning, which covers a wide range of objectives in DDL performance optimization (e.g., excessive training avoidance, energy efficiency, fairness); ii) Based on the proposed performance analytical model, we develop an efficient resource scheduling algorithm called GADGET (greedy ring-all-reduce distributed graph embedding technique), which enjoys a provable strong performance guarantee; iii) We conduct extensive trace-driven experiments to demonstrate the effectiveness of the GADGET approach and its superiority over the state of the art.
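As background for readers, below is a minimal NumPy simulation of the standard ring all-reduce schedule (a reduce-scatter phase followed by an all-gather phase) that the paper's jobs rely on. It illustrates the communication pattern being scheduled; the function name and structure are illustrative, and this is not the GADGET algorithm itself.

import numpy as np

def ring_all_reduce(vectors):
    """Simulate ring all-reduce over p workers, each holding one vector.

    Each vector is split into p segments. In the reduce-scatter phase
    (p - 1 steps), segments travel around the ring and are summed; in
    the all-gather phase (p - 1 steps), the fully reduced segments are
    circulated so every worker ends up with the complete sum. Each
    worker transmits 2 * (p - 1) / p of one vector in total, which is
    why the ring schedule balances load without a parameter server.
    """
    p = len(vectors)
    segs = [np.array_split(v.astype(float), p) for v in vectors]

    # Reduce-scatter: at step s, worker i sends segment (i - s) mod p to
    # worker (i + 1) mod p, which adds it to its own copy. Snapshot the
    # outgoing segments first, since all sends happen simultaneously.
    for s in range(p - 1):
        outgoing = [(segs[i][(i - s) % p].copy(), (i + 1) % p, (i - s) % p)
                    for i in range(p)]
        for data, dst, k in outgoing:
            segs[dst][k] += data

    # All-gather: worker i now owns the fully reduced segment (i + 1) mod p;
    # at step s it forwards segment (i + 1 - s) mod p, overwriting the
    # receiver's stale copy.
    for s in range(p - 1):
        outgoing = [(segs[i][(i + 1 - s) % p].copy(), (i + 1) % p,
                     (i + 1 - s) % p) for i in range(p)]
        for data, dst, k in outgoing:
            segs[dst][k] = data

    return [np.concatenate(s) for s in segs]

# Sanity check: 4 workers summing gradient vectors of length 10.
rng = np.random.default_rng(0)
grads = [rng.standard_normal(10) for _ in range(4)]
result = ring_all_reduce(grads)
assert all(np.allclose(r, np.sum(grads, axis=0)) for r in result)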
