Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm

02/01/2023
by Yihong Li, et al.
Machine learning (ML) tasks are one of the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amounts of resources to each task, falling short of best utilizing the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks by co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task at a time. We then generalize to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design can mitigate the expanded decision space and yield fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and test-bed ML task traces show that TapFinger can achieve up to a 54.9% reduction in average task completion time and improve resource efficiency compared to state-of-the-art schedulers.
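
The abstract names masking schemes without detailing them. As a rough illustration of the general idea only, and not the paper's actual design, the sketch below masks out resource allocations that do not fit on a node before turning actor scores into action probabilities. The candidate allocations, free capacities, logits, and the helper names `feasibility_mask` and `masked_softmax` are all hypothetical.

```python
import math

def feasibility_mask(demands, free):
    """Mark each candidate allocation that fits within the free resources."""
    return [all(d <= f for d, f in zip(demand, free)) for demand in demands]

def masked_softmax(logits, mask):
    """Softmax over logits where masked-out actions get probability 0."""
    if not any(mask):
        raise ValueError("no feasible action")
    # Subtract the largest feasible logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical candidate allocations as (CPU cores, GPUs) on one edge node.
candidates = [(1, 0), (2, 1), (4, 2)]
free = (2, 1)                 # assumed free capacity of the node
logits = [0.2, 1.0, 3.0]      # assumed raw scores from an actor network

mask = feasibility_mask(candidates, free)
probs = masked_softmax(logits, mask)
```

Here the third allocation exceeds the free capacity, so it receives zero probability even though its raw score is the highest; the remaining probability mass is shared between the two feasible allocations.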
