A Scalable Deep Reinforcement Learning Model for Online Scheduling Coflows of Multi-Stage Jobs for High Performance Computing

12/21/2021
by Xin Wang, et al.

Coflow is a recently proposed networking abstraction that helps improve the communication performance of data-parallel computing jobs. In multi-stage jobs, each job consists of multiple coflows and is represented by a Directed Acyclic Graph (DAG). Scheduling coflows efficiently is critical to improving data-parallel computing performance in data centers. In contrast to hand-tuned scheduling heuristics, the existing work DeepWeave [1] uses a Reinforcement Learning (RL) framework to generate highly efficient coflow scheduling policies automatically. It employs a graph neural network (GNN) to encode the job information in a set of embedding vectors, and feeds a flat embedding vector containing the whole job information to the policy network. However, this method scales poorly: it cannot cope with jobs represented by DAGs of arbitrary sizes and shapes, because processing the high-dimensional flat embedding vector requires a large policy network that is difficult to train. In this paper, we first use a directed acyclic graph neural network (DAGNN) to process the input and propose a novel Pipelined-DAGNN, which effectively speeds up the feature extraction process of the DAGNN. Next, we feed the policy network an embedding sequence composed only of schedulable coflows, instead of a flat embedding of all coflows, and output a priority sequence. As a result, the size of the policy network depends only on the feature dimension, rather than on the product of the feature dimension and the number of nodes in the job's DAG. Furthermore, to improve the accuracy of the priority scheduling policy, we incorporate the Self-Attention Mechanism into the deep RL model to capture the interactions between different parts of the embedding sequence, so that the output priority scores reflect these interactions. Based on this model, we then develop a coflow scheduling algorithm for online multi-stage jobs.
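The scaling argument in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes a single self-attention layer followed by a linear scoring head, uses random untrained weights, and all names (`init_params`, `priority_scores`, `param_count`, `rank`) are hypothetical. It only shows that a policy which scores a sequence of coflow embeddings has a parameter count that depends on the feature dimension `d`, not on how many coflows are in the sequence.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    return [dot(row, x) for row in W]


def softmax(zs):
    m = max(zs)  # subtract the max for numerical stability
    es = [math.exp(z - m) for z in zs]
    total = sum(es)
    return [e / total for e in es]


def init_params(d, seed=0):
    """Random, untrained weights (assumption: one attention head, linear head)."""
    rng = random.Random(seed)

    def mat():
        return [[rng.gauss(0, 1 / math.sqrt(d)) for _ in range(d)] for _ in range(d)]

    return {"Wq": mat(), "Wk": mat(), "Wv": mat(),
            "w_out": [rng.gauss(0, 1) for _ in range(d)]}


def priority_scores(embeddings, params):
    """Return one priority score per coflow embedding in the sequence."""
    d = len(params["w_out"])
    Q = [matvec(params["Wq"], e) for e in embeddings]
    K = [matvec(params["Wk"], e) for e in embeddings]
    V = [matvec(params["Wv"], e) for e in embeddings]
    scores = []
    for q in Q:
        # Scaled dot-product attention over every coflow in the sequence.
        attn = softmax([dot(q, k) / math.sqrt(d) for k in K])
        context = [sum(a * v[j] for a, v in zip(attn, V)) for j in range(d)]
        scores.append(dot(params["w_out"], context))
    return scores


def param_count(params):
    mats = sum(len(row) for key in ("Wq", "Wk", "Wv") for row in params[key])
    return mats + len(params["w_out"])


def rank(scores):
    """Indices of coflows ordered from highest to lowest priority score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


if __name__ == "__main__":
    d = 4
    params = init_params(d)
    rng = random.Random(1)
    for n in (3, 7):  # two jobs with different numbers of schedulable coflows
        embs = [[rng.random() for _ in range(d)] for _ in range(n)]
        order = rank(priority_scores(embs, params))
        print(n, "coflows ->", order, "| parameters:", param_count(params))
```

The same `params` object scores sequences of 3 and 7 coflows. A policy that consumes a flat concatenation of all node embeddings would instead need an input layer whose width grows with the number of nodes.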

research
06/02/2021

Learning to schedule job-shop problems: Representation and policy learning using graph neural network and reinforcement learning

We propose a framework to learn to schedule a job-shop problem (JSSP) us...
research
06/03/2018

Efficient Two-Level Scheduling for Concurrent Graph Processing

With the rapidly growing demand of graph processing in the real scene, t...
research
12/21/2020

Scheduling Coflows with Dependency Graph

Applications in data-parallel computing typically consist of multiple st...
research
04/30/2020

Learning to Ask Screening Questions for Job Postings

At LinkedIn, we want to create economic opportunity for everyone in the ...
research
10/03/2018

Learning Scheduling Algorithms for Data Processing Clusters

Efficiently scheduling data processing jobs on distributed compute clust...
research
08/10/2020

Bilevel Learning Model Towards Industrial Scheduling

Automatic industrial scheduling, aiming at optimizing the sequence of jo...
research
05/15/2020

DeepSoCS: A Neural Scheduler for Heterogeneous System-on-Chip Resource Scheduling

In this paper, we present a novel scheduling solution for a class of Sys...
