LazyBatching: An SLA-aware Batching System for Cloud Machine Learning Inference

10/25/2020
by Yujeong Choi, et al.

In cloud ML inference systems, batching is an essential technique for increasing throughput, which in turn helps optimize total cost of ownership. Prior graph batching combines the individual DNN graphs into a single one, allowing multiple inputs to be executed concurrently. We observe that this coarse-grained graph batching handles dynamic inference request traffic suboptimally, leaving significant performance on the table. This paper proposes LazyBatching, an SLA-aware batching system that performs both scheduling and batching at the granularity of individual graph nodes rather than the entire graph, which allows more flexible batching. We show that LazyBatching can intelligently determine the set of nodes that can be efficiently batched together, achieving on average 15x, 1.5x, and 5.5x improvements over graph batching in average response time, throughput, and SLA satisfaction, respectively.
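To make the idea of an SLA-aware, node-granularity batching decision more concrete, here is a minimal Python sketch. It is not the paper's algorithm: the abstract does not describe the mechanism, so the decision rule, the `Request` type, the function `should_batch`, and all latency parameters are hypothetical, chosen only for illustration. The sketch assumes a linear graph of `num_nodes` nodes with a fixed per-node latency, and asks whether an in-flight request can wait for a newly arrived request to catch up to the same node and then run the remaining nodes as a batch without violating its SLA.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival_ms: float  # time at which the request arrived
    next_node: int     # index of the next graph node it has yet to execute


def should_batch(in_flight: Request, now_ms: float, num_nodes: int,
                 node_ms: float, batched_node_ms: float, sla_ms: float) -> bool:
    """Hypothetical rule: batch only if the in-flight request still meets its SLA.

    Assumes the newcomer must first run nodes 0..next_node-1 alone (catch-up),
    after which both requests run the remaining nodes together as a batch.
    """
    # Time the in-flight request waits while the newcomer catches up.
    catch_up_ms = in_flight.next_node * node_ms
    # Time to finish the remaining nodes at the (slower) batched latency.
    remaining_ms = (num_nodes - in_flight.next_node) * batched_node_ms
    finish_ms = now_ms + catch_up_ms + remaining_ms
    # End-to-end latency of the in-flight request must stay within the SLA.
    return finish_ms - in_flight.arrival_ms <= sla_ms


# Example: a request that arrived at t=0 has completed 4 of 10 nodes by t=5.
req = Request(arrival_ms=0.0, next_node=4)
loose = should_batch(req, now_ms=5.0, num_nodes=10,
                     node_ms=1.0, batched_node_ms=1.2, sla_ms=20.0)
tight = should_batch(req, now_ms=5.0, num_nodes=10,
                     node_ms=1.0, batched_node_ms=1.2, sla_ms=15.0)
print(loose, tight)
```

With a whole-graph batching policy, the newcomer in this toy setting would have no way to join a graph that is already executing; a per-node decision is what makes joining mid-graph possible at all, with the SLA check bounding how much delay the in-flight request may absorb.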


