Dynamic Network Adaptation at Inference

04/18/2022
by Daniel Mendoza, et al.

Machine learning (ML) inference is a real-time workload that must comply with strict Service Level Objectives (SLOs), including latency and accuracy targets. Unfortunately, ensuring that SLOs are not violated in inference-serving systems is challenging due to inherent model accuracy-latency tradeoffs, SLO diversity across and within application domains, evolution of SLOs over time, unpredictable query patterns, and co-location interference. In this paper, we observe that neural networks exhibit high degrees of per-input activation sparsity during inference. Thus, we propose SLO-Aware Neural Networks, which dynamically drop out nodes per inference query, thereby tuning the amount of computation performed according to specified SLO optimization targets and machine utilization. SLO-Aware Neural Networks achieve average speedups of 1.3-56.7× with little to no accuracy loss (less than 0.3%). When accuracy constrained, SLO-Aware Neural Networks are able to serve a range of accuracy targets at low latency with the same trained model. When latency constrained, SLO-Aware Neural Networks can proactively alleviate latency degradation from co-location interference, meeting latency constraints while maintaining high accuracy.
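To make the core mechanism concrete, below is a minimal, illustrative sketch (not the authors' implementation) of per-query dynamic node dropout in PyTorch: a keep ratio, chosen per query from its latency/accuracy SLO and current machine utilization, determines how many of each hidden layer's activations are retained. The network architecture, layer sizes, and keep-ratio values are assumptions for illustration; a real serving system would pair this masking with sparse kernels so that dropped nodes translate into actual latency savings rather than just zeroed activations.

```python
import torch
import torch.nn as nn


class SLOAwareMLP(nn.Module):
    """Toy MLP whose hidden nodes can be dropped per query (illustrative only)."""

    def __init__(self, in_dim=784, hidden=512, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, out_dim)

    @staticmethod
    def _drop_nodes(x, keep_ratio):
        # Exploit per-input activation sparsity: keep only the top-k
        # activations by magnitude for this query and zero out the rest.
        k = max(1, int(keep_ratio * x.shape[-1]))
        threshold = torch.topk(x.abs(), k, dim=-1).values[..., -1:]
        return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))

    def forward(self, x, keep_ratio=1.0):
        # keep_ratio is a hypothetical knob a dispatcher would set per query
        # from the latency/accuracy SLO and current machine utilization.
        h = self._drop_nodes(torch.relu(self.fc1(x)), keep_ratio)
        h = self._drop_nodes(torch.relu(self.fc2(h)), keep_ratio)
        return self.fc3(h)


model = SLOAwareMLP().eval()
query = torch.randn(1, 784)
with torch.no_grad():
    fast = model(query, keep_ratio=0.25)   # tight latency SLO: keep 25% of nodes
    exact = model(query, keep_ratio=1.0)   # accuracy-critical: full computation
```

Note that the same trained model serves both calls above; only the per-query keep ratio changes, which is what lets one deployment cover a range of accuracy and latency targets.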

Related research

04/21/2023 · Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems
The use of machine learning (ML) inference for various applications is g...

05/02/2019 · Parity Models: A General Framework for Coding-Based Resilience in ML Inference
Machine learning models are becoming the primary workhorses for many app...

10/06/2020 · Move Fast and Meet Deadlines: Fine-grained Real-time Stream Processing with Cameo
Resource provisioning in multi-tenant stream processing systems faces th...

01/13/2021 · NetCut: Real-Time DNN Inference Using Layer Removal
Deep Learning plays a significant role in assisting humans in many aspec...

06/02/2023 · ODIN: Overcoming Dynamic Interference in iNference pipelines
As an increasing number of businesses becomes powered by machine-learnin...

06/03/2020 · Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Machine learning inference is becoming a core building block for interac...

04/19/2023 · Green Carbon Footprint for Model Inference Serving via Exploiting Mixed-Quality Models and GPU Partitioning
This paper presents a solution to the challenge of mitigating carbon emi...
