Towards Latency-aware DNN Optimization with GPU Runtime Analysis and Tail Effect Elimination

11/08/2020
by   Fuxun Yu, et al.

Despite the superb performance of State-Of-The-Art (SOTA) DNNs, their increasing computational cost makes it very challenging for them to meet real-time latency and accuracy requirements. Although DNN runtime latency is dictated by model properties (e.g., architecture, operations), hardware properties (e.g., utilization, throughput), and, more importantly, the effective mapping between the two, many existing approaches focus only on optimizing model properties such as FLOPs reduction and overlook the mismatch between the DNN model and hardware properties. In this work, we show that the mismatch between varied DNN computation workloads and GPU capacity can cause an idle GPU tail effect, leading to GPU under-utilization and low throughput. As a result, FLOPs reduction does not translate into effective latency reduction, which causes sub-optimal accuracy-versus-latency trade-offs. Motivated by this, we propose a GPU runtime-aware DNN optimization methodology that adaptively eliminates the GPU tail effect on GPU platforms. Our methodology can be applied on top of existing SOTA DNN optimization approaches to achieve better latency and accuracy trade-offs. Experiments show 11% improvement over several SOTA DNN pruning and NAS methods.
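The tail effect described above arises from wave quantization: a GPU executes a kernel's thread blocks in "waves" sized by how many blocks its streaming multiprocessors can hold concurrently, and a partially filled last wave leaves hardware idle. The following minimal Python sketch (not from the paper; the function name and the simplified one-kernel model are illustrative assumptions) estimates that under-utilization:

```python
import math

def tail_utilization(num_blocks: int, concurrent_blocks: int):
    """Estimate GPU utilization loss from the idle-tail (wave
    quantization) effect for a single kernel launch.

    num_blocks        -- thread blocks the kernel launches
    concurrent_blocks -- blocks the GPU can run at once
                         (SM count x blocks per SM)
    Returns (waves, utilization): the number of execution waves
    and the fraction of wave-slots doing useful work.
    """
    # The GPU runs the blocks in ceil(num_blocks / concurrent_blocks) waves;
    # only the last wave can be partially filled.
    waves = math.ceil(num_blocks / concurrent_blocks)
    utilization = num_blocks / (waves * concurrent_blocks)
    return waves, utilization
```

For example, launching 100 blocks on a GPU that can run 80 concurrently takes 2 waves at only 62.5% utilization, while 160 blocks also take 2 waves at 100% utilization; this is why a small FLOPs reduction that does not shrink the wave count yields almost no latency reduction.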

