ODIN: Overcoming Dynamic Interference in iNference pipelines

06/02/2023
by   Pirah Noor Soomro, et al.

As an increasing number of businesses become powered by machine learning, inference becomes a core operation, with a growing trend to be offered as a service. In this context, the inference task must meet certain service-level objectives (SLOs), such as high throughput and low latency. However, these targets can be compromised by interference from long- or short-lived co-located tasks. Prior works focus on the generic problem of co-scheduling to mitigate the effect of interference on a performance-critical task. In this work, we focus on inference pipelines and propose ODIN, a technique that mitigates the effect of interference on the inference task through online scheduling of the pipeline stages. Our technique detects interference online and automatically re-balances the pipeline stages to mitigate the performance degradation of the inference task. We demonstrate that ODIN successfully mitigates the effect of interference, sustaining the latency and throughput of CNN inference, and outperforms least-loaded scheduling (LLS), a common technique for interference mitigation. Additionally, it is effective in maintaining service-level objectives for inference, and it scales to large network models executing on multiple processing elements.
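The re-balancing idea described above can be illustrated with a minimal sketch. The function name, data shapes, and the migration policy below are assumptions for illustration, not the paper's actual algorithm: each pipeline stage owns some cores, and when a stage's measured throughput drops well below its interference-free baseline, a core is migrated from the least-degraded stage to the degraded one.

```python
# Hypothetical sketch of ODIN-style online re-balancing (names and policy
# assumed, not taken from the paper).

def rebalance(cores, baseline, measured, threshold=0.8):
    """Return a new stage-to-core assignment after one re-balancing step.

    cores     -- dict: stage -> number of cores currently assigned
    baseline  -- dict: stage -> expected per-core throughput (no interference)
    measured  -- dict: stage -> observed per-core throughput
    threshold -- degradation ratio that triggers a core migration
    """
    # Detect the most degraded stage (observed / expected per core).
    ratio = {s: measured[s] / baseline[s] for s in cores}
    victim = min(ratio, key=ratio.get)
    if ratio[victim] >= threshold:
        return dict(cores)  # no stage degraded enough: keep the assignment

    # Donor: the least-degraded stage that can still spare a core.
    donors = [s for s in cores if s != victim and cores[s] > 1]
    if not donors:
        return dict(cores)
    donor = max(donors, key=lambda s: ratio[s])

    new = dict(cores)
    new[donor] -= 1
    new[victim] += 1
    return new


# Example: interference on the cores running "conv" halves its throughput,
# so a core is moved to it from the healthiest stage.
assignment = {"conv": 2, "pool": 2, "fc": 2}
base = {"conv": 100.0, "pool": 100.0, "fc": 100.0}
obs = {"conv": 50.0, "pool": 95.0, "fc": 90.0}
print(rebalance(assignment, base, obs))  # -> {'conv': 3, 'pool': 1, 'fc': 2}
```

In a real system, the loop would run continuously, with throughput sampled over a sliding window; the single-step, threshold-based policy here only conveys the detect-then-rebalance structure.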


