SMLT: A Serverless Framework for Scalable and Adaptive Machine Learning Design and Training

05/04/2022
by Ahsan Ali, et al.

In today's production machine learning (ML) systems, models are continuously trained, improved, and deployed. ML design and training are becoming a continuous workflow of tasks with dynamic resource demands. Serverless computing is an emerging cloud paradigm that provides transparent resource management and scaling for users and has the potential to revolutionize the routine of ML design and training. However, hosting modern ML workflows on existing serverless platforms poses non-trivial challenges due to their intrinsic design limitations, such as their stateless nature, limited communication support across function instances, and limited function execution duration. These limitations leave ML workflows without an overarching view of, or adaptation mechanism for, training dynamics, and they amplify existing problems in ML workflows. To address these challenges, we propose SMLT, an automated, scalable, and adaptive serverless framework for efficient and user-centric ML design and training. SMLT employs an automated and adaptive scheduling mechanism to dynamically optimize the deployment and resource scaling of ML tasks during training. SMLT further enables user-centric ML workflow execution by supporting user-specified training deadlines and budget limits. In addition, through its end-to-end design, SMLT addresses the intrinsic problems of serverless platforms, including communication overhead, limited function execution duration, and the need for repeated initialization, and it provides explicit fault tolerance for ML training. SMLT is open-sourced and compatible with all major ML frameworks. Our experimental evaluation with large, sophisticated modern ML models demonstrates that SMLT outperforms state-of-the-art VM-based systems and existing serverless ML training frameworks in both training speed (up to 8X) and monetary cost (up to 3X).
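To make the "user-centric" interface described above concrete, the minimal Python sketch below illustrates the kind of job specification a deadline- and budget-aware serverless scheduler could accept. All names here (TrainingJobSpec, submit, and the individual fields) are hypothetical illustrations for this summary, not SMLT's actual API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class TrainingJobSpec:
    """A user-centric job description: what to train, by when, and for how much."""
    model_entrypoint: str   # path to the user's training script
    dataset_uri: str        # location of the training data
    deadline: timedelta     # user-specified completion deadline
    budget_usd: float       # user-specified monetary budget limit


def submit(job: TrainingJobSpec) -> None:
    """Placeholder for handing the spec to an adaptive serverless scheduler,
    which would choose function sizes and parallelism to meet the constraints."""
    print(f"Submitting {job.model_entrypoint}: "
          f"deadline={job.deadline}, budget=${job.budget_usd:.2f}")


if __name__ == "__main__":
    submit(TrainingJobSpec(
        model_entrypoint="train_resnet.py",
        dataset_uri="s3://my-bucket/imagenet",
        deadline=timedelta(hours=6),
        budget_usd=50.0,
    ))
```

In such a design, the deadline and budget act as constraints on the scheduler's scaling decisions rather than fixed resource requests, which is what distinguishes this workflow from conventional VM-based provisioning.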

research
05/11/2022

Tiny Robot Learning: Challenges and Directions for Machine Learning in Resource-Constrained Robots

Machine learning (ML) has become a pervasive tool across computing syste...
research
10/04/2021

TACC: A Full-stack Cloud Computing Infrastructure for Machine Learning Tasks

In Machine Learning (ML) system research, efficient resource scheduling ...
research
03/24/2019

TonY: An Orchestrator for Distributed Machine Learning Jobs

Training machine learning (ML) models on large datasets requires conside...
research
04/03/2019

Stratum: A Serverless Framework for Lifecycle Management of Machine Learning based Data Analytics Tasks

With the proliferation of machine learning (ML) libraries and frameworks...
research
01/01/2020

Ripple: A Practical Declarative Programming Framework for Serverless Compute

Serverless computing has emerged as a promising alternative to infrastru...
research
06/27/2022

Resource-Centric Serverless Computing

Today's serverless computing has several key limitations including per-f...
research
11/04/2021

Scanflow: A multi-graph framework for Machine Learning workflow management, supervision, and debugging

Machine Learning (ML) is more than just training models, the whole workf...
