Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

03/09/2023
by Hyunho Ahn, et al.

Quantization is a popular technique used in Deep Neural Network (DNN) inference to reduce model size and improve numerical performance by exploiting native hardware support. This paper conducts an elaborate performance characterization of the benefits of quantization techniques, mainly FP16 and INT8 variants with static and dynamic schemes, using the MLPerf Edge Inference benchmarking methodology. The study covers Intel x86 processors and an ARM-based Raspberry Pi device. It evaluates several DNN inference frameworks, namely OpenVINO (Intel CPUs only), TensorFlow Lite (TFLite), ONNX, and PyTorch, on three models: MobileNetV2, VGG-19, and DenseNet-121. The single-stream, multi-stream, and offline scenarios of the MLPerf Edge Inference benchmarks are used to measure latency and throughput in our experiments. Our evaluation reveals that OpenVINO and TFLite are the most optimized frameworks for Intel CPUs and the Raspberry Pi device, respectively. We observe no loss in accuracy except with the static quantization techniques. We also observe clear benefits from quantization on these optimized frameworks: for example, INT8-quantized models deliver 3.3x and 4x better performance over FP32 using OpenVINO on the Intel CPU and TFLite on the Raspberry Pi device, respectively, for the MLPerf offline scenario. To the best of our knowledge, this paper is the first to characterize the impact of quantization across a range of DNN inference frameworks, including OpenVINO, TFLite, PyTorch, and ONNX, on Intel x86 processors and an ARM-based Raspberry Pi device using the MLPerf Edge Inference benchmark methodology.
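The INT8 schemes studied here rest on affine quantization: a floating-point tensor is mapped to 8-bit integers through a scale and a zero-point (static schemes fix these from calibration data; dynamic schemes compute them at run time). A minimal pure-Python sketch of that mapping, with illustrative function names not taken from any of the benchmarked frameworks:

```python
def quantize_int8(values, qmin=-128, qmax=127):
    """Affine-quantize a list of floats to INT8 using a scale and zero-point."""
    lo, hi = min(values), max(values)
    # Scale maps the observed float range onto the integer range;
    # guard against a degenerate range where all values are equal.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = int(round(qmin - lo / scale))
    # Round to the nearest integer and clamp into the INT8 range.
    q = [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from INT8 codes."""
    return [(qi - zero_point) * scale for qi in q]

vals = [-1.0, 0.0, 0.5, 1.0]
codes, scale, zp = quantize_int8(vals)
approx = dequantize(codes, scale, zp)
```

The round-trip error is bounded by the scale, which is why the paper sees accuracy loss only when the statically chosen range fails to match the activations actually seen at inference time.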

