Accelerating Deep Learning Inference via Freezing

02/07/2020
by Adarsh Kumar, et al.

Over the last few years, Deep Neural Networks (DNNs) have become ubiquitous owing to their high accuracy on real-world tasks. However, this increase in accuracy comes at the cost of computationally expensive models, leading to higher prediction latencies. Prior efforts to reduce this latency, such as quantization, model distillation, and anytime prediction models, typically trade off accuracy for performance. In this work, we observe that caching intermediate layer outputs can help us avoid running all the layers of a DNN for a sizeable fraction of inference requests. We find that this can potentially halve the number of effective layers for 91.58% of CIFAR-10 requests run on ResNet-18. We present Freeze Inference, a system that introduces approximate caching at each intermediate layer, and we discuss techniques to reduce the cache size and improve the cache hit rate. Finally, we discuss some of the open research challenges in realizing such a design.
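The core idea can be sketched as follows: after each layer, the intermediate activation is used as an approximate cache key, and on a hit the cached final prediction is returned without running the remaining layers. This is a minimal illustrative sketch, not the paper's implementation; the `FreezeCache` class, the brute-force nearest-neighbor lookup, and the L2 distance threshold are all assumptions made for clarity.

```python
import numpy as np

class FreezeCache:
    """Hypothetical approximate cache over one layer's activations.

    Stores flattened activations alongside the final prediction they led
    to; a lookup hits when the nearest stored activation is within an L2
    `threshold`. (The paper discusses smarter ways to shrink the cache
    and raise the hit rate; this sketch uses a plain linear scan.)
    """
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.keys = []      # flattened intermediate activations
        self.values = []    # cached final predictions

    def lookup(self, activation):
        """Return the cached prediction on a hit, else None."""
        if not self.keys:
            return None
        q = activation.ravel()
        dists = np.linalg.norm(np.stack(self.keys) - q, axis=1)
        i = int(np.argmin(dists))
        return self.values[i] if dists[i] <= self.threshold else None

    def insert(self, activation, prediction):
        self.keys.append(activation.ravel())
        self.values.append(prediction)

def infer(x, layers, caches, classify):
    """Run `layers` in order, consulting the per-layer cache after each;
    stop early on a hit. On a full miss, classify and populate the caches.
    Returns (prediction, served_from_cache)."""
    h = x
    seen = []
    for layer, cache in zip(layers, caches):
        h = layer(h)
        hit = cache.lookup(h)
        if hit is not None:
            return hit, True           # remaining layers are skipped
        seen.append((cache, h))
    pred = classify(h)
    for cache, act in seen:            # cache this request at every layer
        cache.insert(act, pred)
    return pred, False
```

A repeated (or sufficiently similar) request then hits at the first cached layer, so only a prefix of the network runs, which is where the latency savings come from.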


Related research

- Accelerating Deep Learning Inference via Learned Caches (01/18/2021)
- Improving the Performance of DNN-based Software Services using Automated Layer Caching (09/18/2022)
- Learning Structured Sparsity in Deep Neural Networks (08/12/2016)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching (12/13/2021)
- Cache Telepathy: Leveraging Shared Resource Attacks to Learn DNN Architectures (08/14/2018)
- Adaptive Scheduling for Edge-Assisted DNN Serving (04/19/2023)
- On Optimal Caching and Model Multiplexing for Large Model Inference (06/03/2023)
