DeepAI AI Chat
Log In Sign Up

Efficient Execution of Quantized Deep Learning Models: A Compiler Approach

by   Animesh Jain, et al.

A growing number of applications implement predictive functions using deep learning models, which require heavy use of compute and memory. One popular technique for increasing resource efficiency is 8-bit integer quantization, in which 32-bit floating point numbers (fp32) are represented using shorter 8-bit integer numbers. Although deep learning frameworks such as TensorFlow, TFLite, MXNet, and PyTorch enable developers to quantize models with only a small drop in accuracy, they are not well suited to execute quantized models on a variety of hardware platforms. For example, TFLite is optimized to run inference on ARM CPU edge devices but it does not have efficient support for Intel CPUs and Nvidia GPUs. In this paper, we address the challenges of executing quantized deep learning models on diverse hardware platforms by proposing an augmented compiler approach. A deep learning compiler such as Apache TVM can enable the efficient execution of model from various frameworks on various targets. Many deep learning compilers today, however, are designed primarily for fp32 computation and cannot optimize a pre-quantized INT8 model. To address this issue, we created a new dialect called Quantized Neural Network (QNN) that extends the compiler's internal representation with a quantization context. With this quantization context, the compiler can generate efficient code for pre-quantized models on various hardware platforms. As implemented in Apache TVM, we observe that the QNN-augmented deep learning compiler achieves speedups of 2.35x, 2.15x, 1.35x and 1.40x on Intel Xeon Cascade Lake CPUs, Nvidia Tesla T4 GPUs, ARM Raspberry Pi3 and Pi4 respectively against well optimized fp32 execution, and comparable performance to the state-of-the-art framework-specific solutions.


page 1

page 4

page 8

page 9

page 10


Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning

The Deep Learning (DL) community sees many novel topologies published ea...

Performance Characterization of using Quantization for DNN Inference on Edge Devices: Extended Version

Quantization is a popular technique used in Deep Neural Networks (DNN) i...

Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

In this work, we quantize a trained Transformer machine language transla...

Accelerating Deep Learning Model Inference on Arm CPUs with Ultra-Low Bit Quantization and Runtime

Deep Learning has been one of the most disruptive technological advancem...

Pre-Quantized Deep Learning Models Codified in ONNX to Enable Hardware/Software Co-Design

This paper presents a methodology to separate the quantization process f...

Automating Generation of Low Precision Deep Learning Operators

State of the art deep learning models have made steady progress in the f...

swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

The flourish of deep learning frameworks and hardware platforms has been...