Analyzing Quantization in TVM

08/19/2023
by Mingfei Guo, et al.

There have been many papers in the academic literature on quantizing weight tensors in deep learning models to reduce inference latency and memory footprint. TVM also has the ability to quantize weights and support low-bit computations. Although quantization is typically expected to improve inference time, the performance of 8-bit quantization in TVM does not meet expectations. Applying 8-bit quantization to a deep learning model is usually expected to cut inference time to around 50% of the full-precision baseline. In this particular case, however, the quantized version not only fails to achieve the expected speedup, it actually performs worse, with an inference time roughly 2x slower than the non-quantized version. In this project, we thoroughly investigate the reasons behind this underperformance and assess the compatibility and optimization opportunities of 8-bit quantization in TVM. We discuss the optimization of two different types of tasks, computation-bound and memory-bound, and provide a detailed comparison of various optimization techniques in TVM. By identifying the performance issues, we improve quantization by fixing a bug in graph building. Furthermore, we analyze multiple optimization strategies to reach the optimal quantization result. The best experiment achieves 163.88% of baseline in inference time for the compute-bound task and 194.98% for the memory-bound task.
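For readers unfamiliar with the quantization path under study, the following is a minimal sketch of TVM's Relay 8-bit quantization flow: the same model is built twice, once in float32 and once quantized, and both are timed. The tiny one-convolution model, the llvm target, and the global_scale calibration mode are illustrative assumptions, not the paper's actual benchmark setup.

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # Illustrative stand-in model: a single conv2d. The paper's real workloads
    # are full networks imported through a Relay frontend.
    data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
    weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
    conv = relay.nn.conv2d(data, weight, kernel_size=(3, 3), padding=(1, 1))
    mod = tvm.IRModule.from_expr(relay.Function([data, weight], conv))
    params = {"weight": np.random.rand(16, 3, 3, 3).astype("float32")}

    def build_runtime(mod, params, target="llvm"):
        # Compile a Relay module and wrap it in a graph executor.
        with tvm.transform.PassContext(opt_level=3):
            lib = relay.build(mod, target=target, params=params)
        dev = tvm.device(target, 0)
        return graph_executor.GraphModule(lib["default"](dev)), dev

    # Float32 baseline build.
    fp32_rt, dev = build_runtime(mod, params)

    # 8-bit quantized build: quantize() annotates, calibrates (here with the
    # simple global_scale mode), and realizes the graph into int8 operators.
    # skip_conv_layers=[] forces quantization of the lone conv layer, which
    # TVM would otherwise skip by default.
    with relay.quantize.qconfig(nbit_input=8, nbit_weight=8,
                                global_scale=8.0, skip_conv_layers=[]):
        qmod = relay.quantize.quantize(mod, params)
    int8_rt, _ = build_runtime(qmod, None)

    # Wall-clock comparison. The paper's observation is that the int8 build
    # can come out slower than fp32 rather than ~2x faster.
    print("fp32:", fp32_rt.benchmark(dev))
    print("int8:", int8_rt.benchmark(dev))

Measuring both runtimes built through the same pipeline isolates the effect of quantization itself from unrelated compilation differences, which is the kind of controlled comparison the analysis relies on.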
