MLProxy: SLA-Aware Reverse Proxy for Machine Learning Inference Serving on Serverless Computing Platforms

02/23/2022
by   Nima Mahmoudi, et al.

Serving machine learning inference workloads on the cloud remains a challenging task at the production level. Optimally configuring an inference workload to meet SLA requirements while minimizing infrastructure costs is complicated by the complex interaction among batch configuration, resource configuration, and the variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has shown the potential to improve the response time and cost-effectiveness of machine learning serving workloads, but it is not yet supported out of the box by serverless computing platforms. Our experiments show that, for a variety of machine learning workloads, batching can substantially improve system efficiency by reducing the per-request processing overhead. In this work, we present MLProxy, an adaptive reverse proxy that supports efficient machine learning serving workloads on serverless computing systems. MLProxy uses adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy, showing that it can reduce the cost of serverless deployment by up to 92%, a result that can be generalized across state-of-the-art model serving frameworks.
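To illustrate the kind of adaptive batching a proxy like this performs, here is a minimal sketch of a deadline-aware batcher. It is not MLProxy's actual algorithm; the class name, tuning knobs (`max_batch_size`, `sla_seconds`, `est_proc_seconds`), and flush policy are all illustrative assumptions. The idea is to accumulate requests to amortize per-request overhead, flushing either when the batch is full or when waiting any longer would push the oldest queued request past its SLA:

```python
import time
from collections import deque

class AdaptiveBatcher:
    """Illustrative deadline-aware batcher (not MLProxy's actual algorithm).

    Flushes when the batch is full, or when the oldest queued request
    would risk missing its SLA if we waited any longer.
    """

    def __init__(self, max_batch_size=8, sla_seconds=1.0, est_proc_seconds=0.2):
        self.max_batch_size = max_batch_size      # hypothetical tuning knob
        self.sla_seconds = sla_seconds            # end-to-end latency target
        self.est_proc_seconds = est_proc_seconds  # estimated batch processing time
        self.queue = deque()                      # (arrival_time, request) pairs

    def enqueue(self, request, now=None):
        """Queue a request, stamping its arrival time."""
        now = time.monotonic() if now is None else now
        self.queue.append((now, request))

    def maybe_flush(self, now=None):
        """Return a batch of requests to dispatch, or None to keep waiting."""
        now = time.monotonic() if now is None else now
        if not self.queue:
            return None
        # Flush when the batch is full...
        if len(self.queue) >= self.max_batch_size:
            return self._drain()
        # ...or when waiting longer would push the oldest request past its
        # SLA, accounting for the estimated time to process the batch.
        oldest_arrival, _ = self.queue[0]
        deadline = oldest_arrival + self.sla_seconds - self.est_proc_seconds
        if now >= deadline:
            return self._drain()
        return None

    def _drain(self):
        batch = [req for _, req in self.queue]
        self.queue.clear()
        return batch
```

Under bursty traffic this policy fills batches quickly and keeps per-request overhead low; under sparse traffic it dispatches small batches just early enough to stay within the SLA, which is the cost/latency trade-off the abstract describes.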


Related research

05/10/2022 · Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures
With the advent of ubiquitous deployment of smart devices and the Intern...

12/05/2018 · InferLine: ML Inference Pipeline Composition Framework
The dominant cost in production machine learning workloads is not traini...

06/06/2023 · FaaSwap: SLO-Aware, GPU-Efficient Serverless Inference via Model Swapping
The dynamic request patterns of machine learning (ML) inference workload...

10/27/2021 · SOAR: Minimizing Network Utilization with Bounded In-network Computing
In-network computing via smart networking devices is a recent trend for ...

10/20/2018 · MMLSpark: Unifying Machine Learning Ecosystems at Massive Scales
We introduce Microsoft Machine Learning for Apache Spark (MMLSpark), an ...

04/20/2023 · Scaling ML Products At Startups: A Practitioner's Guide
How do you scale a machine learning product at a startup? In particular,...

06/22/2020 · Artist-Guided Semiautomatic Animation Colorization
There is a delicate balance between automating repetitive work in creati...
