Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

06/18/2020
by Animesh Jain, et al.

A growing number of applications implement predictive functions using deep learning models, which require heavy use of compute and memory. One popular technique for increasing resource efficiency is 8-bit integer quantization, in which 32-bit floating point numbers (fp32) are represented using 8-bit integers (int8). Although deep learning frameworks such as TensorFlow, TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy, they are not well suited to executing quantized models on a variety of hardware platforms. For example, TFLite is optimized to run inference on ARM CPU edge devices but does not have efficient support for Intel CPUs and Nvidia GPUs. In this paper, we address the challenge of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach. A deep learning compiler such as Apache TVM can enable the efficient execution of models from various frameworks on various targets. Many deep learning compilers today, however, are designed primarily for fp32 computation and cannot optimize a pre-quantized int8 model. To address this issue, we created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context. With this quantization context, the compiler can generate efficient code for pre-quantized models on a variety of hardware platforms. As implemented in Apache TVM, we observe that the QNN-augmented deep learning compiler achieves speedups of 2.35x, 2.15x, 1.35x, and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla T4 GPUs, and ARM-based Raspberry Pi 3 and Pi 4, respectively, over well-optimized fp32 execution, and performance comparable to state-of-the-art framework-specific solutions.
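To make the fp32-to-int8 mapping concrete, here is a minimal sketch of standard affine (scale and zero-point) quantization in Python with NumPy. It illustrates the arithmetic that pre-quantized models rely on; the helper names and the range-based parameter selection are illustrative assumptions, not TVM's QNN API.

import numpy as np

def choose_qparams(x, qmin=-128, qmax=127):
    # Derive an affine scale/zero-point from the tensor's observed fp32 range.
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)  # range must cover 0
    scale = max((hi - lo) / (qmax - qmin), 1e-8)                 # guard against zero range
    zero_point = int(np.clip(round(qmin - lo / scale), qmin, qmax))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # fp32 -> int8: q = clamp(round(x / scale) + zero_point)
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # int8 -> fp32 approximation: x ~= scale * (q - zero_point)
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32)
scale, zp = choose_qparams(x)
reconstructed = dequantize(quantize(x, scale, zp), scale, zp)
print(np.max(np.abs(x - reconstructed)))  # round-trip error is bounded by about scale / 2

A quantization-aware compiler dialect such as QNN carries exactly this scale and zero-point context through the operator graph, so that operators like convolution can be lowered to efficient integer arithmetic on each hardware target.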


Related research

DeepliteRT: Computer Vision at the Edge (09/19/2023)
The proliferation of edge devices has unlocked unprecedented opportuniti...

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version (03/09/2023)
Quantization is a popular technique used in Deep Neural Networks (DNN) i...

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model (06/03/2019)
In this work, we quantize a trained Transformer machine language transla...

swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture (04/16/2019)
The flourish of deep learning frameworks and hardware platforms has been...

Automated Backend-Aware Post-Training Quantization (03/27/2021)
Quantization is a key technique to reduce the resource requirement and i...

Pre-Quantized Deep Learning Models Codified in ONNX to Enable Hardware/Software Co-Design (10/04/2021)
This paper presents a methodology to separate the quantization process f...

Automating Generation of Low Precision Deep Learning Operators (10/25/2018)
State of the art deep learning models have made steady progress in the f...
