IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

08/24/2023
by Saeid Ghafouri, et al.

Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in ML production systems, given their tight end-to-end latency requirements. To simplify exploration of the vast and intricate trade-off space between accuracy and cost in inference pipelines, providers frequently opt to optimize only one of the two. The challenge, however, lies in reconciling the accuracy and cost trade-off. To address this challenge and efficiently manage model variants in inference pipelines, we present IPA, an online deep-learning Inference Pipeline Adaptation system that leverages model variants for each deep-learning task. Model variants are different versions of pre-trained models for the same deep-learning task that differ in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency SLAs using Integer Programming. It supports multi-objective settings, achieving different trade-offs between the accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Extensive experiments on a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves normalized accuracy by up to 35%.
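To make the joint configuration decision concrete, the sketch below shows a toy version of the kind of optimization IPA solves: pick a (model variant, batch size, replica count) configuration that maximizes a weighted accuracy-minus-cost objective subject to a latency SLA and a throughput constraint. This is not IPA's actual formulation; the variant profiles, the latency/throughput model, and the objective weights are all illustrative stand-ins, and a brute-force search replaces the paper's Integer Programming solver.

```python
from itertools import product

# Hypothetical profiles for one pipeline stage: variant -> (accuracy,
# base latency in ms, cost per replica). Numbers are illustrative only.
VARIANTS = {
    "resnet18": (0.70, 20.0, 1.0),
    "resnet50": (0.76, 45.0, 2.0),
    "resnet152": (0.78, 90.0, 4.0),
}

def choose_config(sla_ms, arrival_rate, alpha=1.0, beta=0.1,
                  batch_sizes=(1, 2, 4, 8), max_replicas=4):
    """Exhaustively search (variant, batch size, replicas) and return the
    configuration maximizing alpha*accuracy - beta*total_cost, subject to
    the latency SLA and serving the offered load. A tiny stand-in for an
    Integer Programming formulation."""
    best, best_score = None, float("-inf")
    for (name, (acc, lat, cost)), b, r in product(
            VARIANTS.items(), batch_sizes, range(1, max_replicas + 1)):
        # Crude latency model: larger batches add a small per-item delay.
        batch_latency = lat * (1 + 0.1 * (b - 1))
        throughput = r * b * 1000.0 / batch_latency  # requests per second
        if batch_latency > sla_ms or throughput < arrival_rate:
            continue  # violates the SLA or cannot keep up with the load
        score = alpha * acc - beta * cost * r
        if score > best_score:
            best, best_score = (name, b, r), score
    return best

# Example: 100 ms SLA at 50 req/s. With a higher accuracy weight (alpha),
# the search shifts toward a more accurate, more expensive variant.
print(choose_config(sla_ms=100.0, arrival_rate=50.0))
print(choose_config(sla_ms=100.0, arrival_rate=50.0, alpha=10.0))
```

Re-running the search whenever the arrival rate changes mirrors, in miniature, how an adaptive system can trade accuracy against cost as the workload shifts.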


