CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

08/01/2023
by Sudarsanan Rajasekaran, et al.

We present CASSINI, a network-aware job scheduler for machine learning (ML) clusters. CASSINI introduces a novel geometric abstraction to consider the communication pattern of different jobs while placing them on network links. To do so, CASSINI uses an affinity graph that finds a series of time-shift values to adjust the communication phases of a subset of jobs, such that the communication patterns of jobs sharing the same network link are interleaved with each other. Experiments with 13 common ML models on a 24-server testbed demonstrate that, compared to state-of-the-art ML schedulers, CASSINI improves the average and tail completion time of jobs by up to 1.6x and 2.5x, respectively. Moreover, we show that CASSINI reduces the number of ECN-marked packets in the cluster by up to 33x.
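
The interleaving idea in the abstract can be illustrated with a minimal Python sketch. This is not CASSINI's geometric abstraction or its affinity graph. It is a brute-force toy that assumes each job repeats a fixed iteration of `period` time slots and communicates in one contiguous window per iteration. The job tuple layout and the names `comm_pattern` and `best_time_shift` are hypothetical, chosen only for this illustration.

```python
from math import lcm  # available in Python 3.9+


def comm_pattern(period, comm_start, comm_len, horizon, shift=0):
    """Return a 0/1 list with 1 in every slot where the job is communicating.

    Assumption: the job communicates for comm_len slots starting at
    comm_start within each iteration of length period.
    """
    return [1 if ((t - shift - comm_start) % period) < comm_len else 0
            for t in range(horizon)]


def best_time_shift(job_a, job_b):
    """Brute-force the shift of job_b that minimizes overlap with job_a.

    Each job is a hypothetical (period, comm_start, comm_len) tuple.
    Returns (shift, overlapping_slots) for the first shift with the
    fewest slots in which both jobs use the shared link.
    """
    horizon = lcm(job_a[0], job_b[0])  # the combined pattern repeats after this
    pattern_a = comm_pattern(*job_a, horizon)
    best_shift, best_overlap = 0, None
    for shift in range(job_b[0]):
        pattern_b = comm_pattern(*job_b, horizon, shift=shift)
        overlap = sum(a & b for a, b in zip(pattern_a, pattern_b))
        if best_overlap is None or overlap < best_overlap:
            best_shift, best_overlap = shift, overlap
    return best_shift, best_overlap


# Two identical jobs that both communicate in slots 6..9 of a 10-slot iteration.
print(best_time_shift((10, 6, 4), (10, 6, 4)))  # (4, 0)
```

In this toy case, delaying the second job by 4 slots moves its communication window into the first job's compute phase, so the two jobs never use the link in the same slot. When the communication windows together exceed the period, some overlap remains and the search only reduces it.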
