NetCut: Real-Time DNN Inference Using Layer Removal

01/13/2021
by Mehrshad Zandigohar, et al.

Deep Learning plays a significant role in assisting humans in many aspects of their lives. As these networks tend to get deeper over time, they extract more features to increase accuracy at the cost of additional inference latency. This accuracy-performance trade-off makes it especially challenging for embedded systems, resource-constrained processors with strict deadlines, to deploy them efficiently. It can lead to the selection of networks that meet a specified deadline prematurely, with excess slack time that could have potentially contributed to increased accuracy. In this work, we propose: (i) the concept of layer removal as a means of constructing TRimmed Networks (TRNs), built by removing problem-specific features of a pretrained network used in transfer learning, and (ii) NetCut, a methodology based on an empirical or an analytical latency estimator, which proposes and retrains only those TRNs that can meet the application's deadline, hence reducing the exploration time significantly. We demonstrate that TRNs can expand the Pareto frontier that trades off latency and accuracy, providing networks that can meet arbitrary deadlines with potential accuracy improvement over off-the-shelf networks. Our experimental results show that such utilization of TRNs, while transferring to a simpler dataset, in combination with NetCut, can lead to the proposal of networks that achieve a relative accuracy improvement of up to 10.43% over off-the-shelf neural architectures while meeting a specific deadline, along with a 27x speedup in exploration time.
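To make the proposal concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: constructing a TRN by removing trailing blocks from a pretrained backbone and attaching a fresh classifier head for retraining, then keeping only candidates whose empirically estimated latency fits the deadline. Everything here is an illustrative assumption rather than the paper's implementation: the ResNet-18 backbone, the 5 ms deadline, the 10-class head, and the helper names build_trn and measure_latency_ms are all hypothetical.

```python
# Minimal sketch of TRN construction via layer removal plus deadline-driven
# filtering. Illustrative only: backbone choice, deadline, input shape, and
# all helper names are assumptions, not the paper's implementation.
import time

import torch
import torch.nn as nn
from torchvision.models import resnet18

DEADLINE_MS = 5.0  # hypothetical application deadline


def build_trn(backbone: nn.Module, n_remove: int, num_classes: int) -> nn.Module:
    """Drop the last n_remove feature blocks of a pretrained backbone and
    attach a fresh classifier head, to be retrained on the target dataset."""
    blocks = list(backbone.children())[:-2]      # strip original avgpool + fc
    trimmed = blocks[: len(blocks) - n_remove]   # remove trailing blocks
    feats = nn.Sequential(*trimmed, nn.AdaptiveAvgPool2d(1), nn.Flatten())
    with torch.no_grad():                        # probe to size the new head
        dim = feats(torch.zeros(1, 3, 224, 224)).shape[1]
    return nn.Sequential(feats, nn.Linear(dim, num_classes))


def measure_latency_ms(model: nn.Module, runs: int = 20) -> float:
    """Empirical latency estimate: average wall-clock forward-pass time."""
    model.eval()
    x = torch.zeros(1, 3, 224, 224)
    with torch.no_grad():
        model(x)                                 # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs * 1e3


# Propose only TRNs whose estimated latency meets the deadline; in NetCut's
# spirit, only these candidates would then be retrained.
candidates = []
for n_remove in range(4):                        # try removing 0..3 blocks
    trn = build_trn(resnet18(weights="IMAGENET1K_V1"), n_remove, num_classes=10)
    latency = measure_latency_ms(trn)
    if latency <= DEADLINE_MS:
        candidates.append((n_remove, latency, trn))
```

Only the surviving candidates would then be retrained on the (simpler) target dataset, which is where the reported reduction in exploration time would come from.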

Related research:

- Dynamic Network Adaptation at Inference (04/18/2022): Machine learning (ML) inference is a real-time workload that must comply...
- Learning to infer: RL-based search for DNN primitive selection on Heterogeneous Embedded Systems (11/18/2018): Deep Learning is increasingly being adopted by industry for computer vis...
- Architecture Aware Latency Constrained Sparse Neural Networks (09/01/2021): Acceleration of deep neural networks to meet a specific latency constrai...
- Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination (11/08/2020): Despite the superb performance of State-Of-The-Art (SOTA) DNNs, the incr...
- Adaptive ResNet Architecture for Distributed Inference in Resource-Constrained IoT Systems (07/21/2023): As deep neural networks continue to expand and become more complex, most...
- IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency (08/24/2023): Efficiently optimizing multi-model inference pipelines for fast, accurat...
- PREMA: A Predictive Multi-task Scheduling Algorithm For Preemptible Neural Processing Units (09/06/2019): To amortize cost, cloud vendors providing DNN acceleration as a service ...
