Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads

08/24/2017
by   Tian Li, et al.

We present ease.ml, a declarative machine learning service platform that we built to support the machine learning needs of more than ten research groups outside the computer science departments at ETH Zurich. With ease.ml, a user defines the high-level schema of a machine learning application and submits the task via a Web interface. The system automatically handles the rest, such as model selection and data movement. In this paper, we describe the ease.ml architecture and focus on a novel technical problem that ease.ml introduces: resource allocation. We ask: as a "service provider" that manages a cluster of machines shared among all our users running machine learning workloads, which resource allocation strategy maximizes the global satisfaction of all our users? Resource allocation is a critical yet subtle issue in this multi-tenant scenario, because we have to balance efficiency and fairness. We first formalize the problem, which we call multi-tenant model selection; its goal is to minimize the total regret of all users running automatic model selection tasks. We then develop a novel algorithm that combines multi-armed bandits with Bayesian optimization and prove a regret bound under the multi-tenant setting. Finally, we report our evaluation of ease.ml on synthetic data and on one service we provide to our users, namely, image classification with deep neural networks. Our experimental results show that our proposed solution reaches the same global quality for all users up to 9.8x faster than the two popular heuristics our users relied on before ease.ml.
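To make the multi-tenant model selection setting concrete, the following is a minimal simulation sketch, not the algorithm from the paper. It assumes a plain round-robin choice of which tenant to serve and substitutes the standard UCB1 bandit rule for the paper's combination of multi-armed bandits with Bayesian optimization. The names `ucb1_pick`, `simulate`, and `true_quality`, the Gaussian noise level, and the way regret is accumulated are all illustrative assumptions.

```python
import math
import random


def ucb1_pick(counts, means, t):
    """Return the arm with the highest UCB1 score; untried arms are tried first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )


def simulate(true_quality, rounds, seed=0):
    """Simulate a shared cluster that trains one model per time step.

    true_quality[u][m] is a hypothetical accuracy of model m for tenant u.
    Returns (total_regret, best_seen), where total_regret sums, over all
    steps and tenants, the gap between the tenant's best possible model
    and the best model that tenant has been given so far.
    """
    rng = random.Random(seed)
    n_users = len(true_quality)
    counts = [[0] * len(q) for q in true_quality]
    means = [[0.0] * len(q) for q in true_quality]
    best_seen = [0.0] * n_users
    total_regret = 0.0

    for step in range(rounds):
        u = step % n_users  # assumption: round-robin tenant scheduling
        t = sum(counts[u]) + 1
        m = ucb1_pick(counts[u], means[u], t)

        # Noisy observation of the model's quality, clipped to [0, 1].
        reward = min(1.0, max(0.0, true_quality[u][m] + rng.gauss(0, 0.02)))
        counts[u][m] += 1
        means[u][m] += (reward - means[u][m]) / counts[u][m]

        best_seen[u] = max(best_seen[u], true_quality[u][m])
        total_regret += sum(max(q) - b for q, b in zip(true_quality, best_seen))

    return total_regret, best_seen


if __name__ == "__main__":
    qualities = [[0.6, 0.9, 0.7], [0.8, 0.5, 0.85]]
    regret, best = simulate(qualities, rounds=20)
    print("total regret:", regret)
    print("best model quality per tenant:", best)
```

In this sketch every tenant receives the same share of the cluster regardless of how much it could still improve. A multi-tenant scheduler would instead choose which tenant to serve at each step with the aim of lowering the summed regret, which is the quantity the abstract describes minimizing.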


research
03/17/2018

Multi-device, Multi-tenant Model Selection with GP-EI

Bayesian optimization is the core technique behind the emergence of Auto...
research
06/01/2019

Quantitative Overfitting Management for Human-in-the-loop ML Application Development with ease.ml/meter

Simplifying machine learning (ML) application development, including dis...
research
06/01/2019

Ease.ml/meter: Quantitative Overfitting Management for Human-in-the-loop ML Application Development

Simplifying machine learning (ML) application development, including dis...
research
02/01/2023

Task Placement and Resource Allocation for Edge Machine Learning: A GNN-based Multi-Agent Reinforcement Learning Paradigm

Machine learning (ML) tasks are one of the major workloads in today's ed...
research
07/02/2019

Themis: Fair and Efficient GPU Cluster Scheduling for Machine Learning Workloads

Modern distributed machine learning (ML) training workloads benefit sign...
research
11/12/2021

Deep Reinforcement Model Selection for Communications Resource Allocation in On-Site Medical Care

Greater capabilities of mobile communications technology enable intercon...
research
01/25/2021

Online and Scalable Model Selection with Multi-Armed Bandits

Many online applications running on live traffic are powered by machine ...
