Orloj: Predictably Serving Unpredictable DNNs

08/31/2022
by Peifeng Yu, et al.

Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs – e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers – are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input – the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules them without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high variance dynamic DNN workloads by 51–80% in finish rate under tight SLO constraints, and by over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj keeps comparable performance with the state-of-the-art.
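
The batching effect described above can be illustrated with a small sketch. This is not Orloj's scheduler; it only shows why the slowest request dominates a batch and how an empirical sample of execution times can be used to reason about a batch without knowing any single request's exact time. The function names, the sample values, the 50 ms SLO, and the assumption that requests in a batch are independent draws from the same sample are all hypothetical and chosen for illustration.

```python
import random


def batch_latency(exec_times):
    """A batch finishes only when its slowest request finishes."""
    return max(exec_times)


def expected_batch_latency(samples, batch_size, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean batch latency.

    Assumption: each request's execution time is an independent draw
    from the empirical sample `samples` (hypothetical profiling data).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += batch_latency(rng.choices(samples, k=batch_size))
    return total / trials


def miss_probability(samples, batch_size, slo_ms):
    """Probability that a batch exceeds the SLO under the same assumption.

    With empirical CDF F, P(max of n draws > slo) = 1 - F(slo) ** n.
    """
    cdf_at_slo = sum(1 for s in samples if s <= slo_ms) / len(samples)
    return 1.0 - cdf_at_slo ** batch_size


if __name__ == "__main__":
    # Hypothetical workload: 90% of requests take 10 ms, 10% take 100 ms.
    samples = [10.0] * 90 + [100.0] * 10
    for n in (1, 4, 8):
        print(
            f"batch={n} "
            f"mean_latency_ms={expected_batch_latency(samples, n):.1f} "
            f"p_miss={miss_probability(samples, n, slo_ms=50.0):.2f}"
        )
```

In this made-up workload a single request misses a 50 ms SLO one time in ten, while a batch of eight misses with probability 1 - 0.9^8, roughly 0.57, even though most of its members are short. That gap is the variance problem the abstract describes.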

