OHQ: On-chip Hardware-aware Quantization

09/05/2023
by Wei Huang, et al.

Quantization has emerged as one of the most promising approaches for deploying advanced deep models on resource-constrained hardware. Mixed-precision quantization leverages multiple bit-widths to unleash the accuracy and efficiency potential of quantized models. However, existing mixed-precision quantization methods suffer from an exhaustive search space that incurs immense computational overhead; the quantization process thus relies on separate high-performance devices rather than on the target hardware itself, which also opens a significant gap between the hardware metrics considered during search and those of the real deployment. In this paper, we propose an On-chip Hardware-aware Quantization (OHQ) framework that performs hardware-aware mixed-precision quantization without accessing online devices. First, we construct the On-chip Quantization Awareness (OQA) pipeline, which perceives the actual efficiency metrics of each quantization operator on the hardware. Second, we propose the Mask-guided Quantization Estimation (MQE) technique to efficiently estimate the accuracy metrics of operators under the constraints of on-chip computing power. By synthesizing network and hardware insights through linear programming, we obtain optimized bit-width configurations. Notably, the entire quantization process runs on-chip, without any additional computing devices or data access. We demonstrate accelerated inference after quantization for various architectures and compression ratios, achieving 70% … and improving latency by 15–30%.
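Concretely, the last step described above (combining per-operator accuracy and efficiency metrics through linear programming) can be posed as a small integer linear program: one 0/1 selection variable per (layer, bit-width) pair, an accuracy objective, and a hardware budget constraint. The sketch below is a minimal illustration of that allocation step, not OHQ's implementation: the layer count, the candidate bit-widths {2, 4, 8}, the random accuracy and latency numbers, and the 60%-of-INT8 budget are all assumptions, and SciPy's milp is used as a stand-in solver.

```python
# Illustrative sketch (not the authors' code): pick one bit-width per layer
# by maximizing summed accuracy scores subject to a latency budget.
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

BITS = [2, 4, 8]            # assumed candidate bit-widths
L, B = 4, len(BITS)         # 4 hypothetical layers

rng = np.random.default_rng(0)
# acc[l, b]: stand-in for MQE's accuracy estimate (higher bits score higher).
acc = rng.uniform(0.5, 1.0, (L, B)) * np.array([0.7, 0.9, 1.0])
# cost[l, b]: stand-in for OQA's measured latency, growing with bit-width.
cost = np.outer(rng.uniform(1.0, 3.0, L), BITS)
budget = 0.6 * cost[:, -1].sum()              # e.g. 60% of the all-INT8 latency

c = -acc.ravel()                              # milp minimizes, so negate accuracy
one_hot = np.kron(np.eye(L), np.ones(B))      # each layer picks exactly one width
constraints = [
    LinearConstraint(one_hot, lb=1, ub=1),                # one bit-width per layer
    LinearConstraint(cost.ravel()[None, :], ub=budget),   # stay within the budget
]
res = milp(c, constraints=constraints,
           integrality=np.ones_like(c), bounds=Bounds(0, 1))

choice = res.x.reshape(L, B).argmax(axis=1)   # decode the 0/1 solution
print("per-layer bit-widths:", [BITS[i] for i in choice])
```

The one-hot constraint forces exactly one bit-width per layer, and the budget row keeps the summed latency under the target; in OHQ the scores and costs would come from MQE and OQA measurements rather than the random placeholders used here.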

Related research

07/06/2023 · Free Bits: Latency Optimization of Mixed-Precision Quantized Neural Networks on the Edge
Mixed-precision quantization, where a deep neural network's layers are q...

03/15/2023 · SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference
The combination of Neural Architecture Search (NAS) and quantization has...

09/18/2022 · PIM-QAT: Neural Network Quantization for Processing-In-Memory (PIM) Systems
Processing-in-memory (PIM), an increasingly studied neuromorphic hardwar...

08/02/2021 · Multi-objective Recurrent Neural Networks Optimization for the Edge – a Quantization-based Approach
The compression of deep learning models is of fundamental importance in ...

02/09/2023 · Data Quality-aware Mixed-precision Quantization via Hybrid Reinforcement Learning
Mixed-precision quantization mostly predetermines the model bit-width se...

08/22/2023 · DeepBurning-MixQ: An Open Source Mixed-Precision Neural Network Accelerator Design Framework for FPGAs
Mixed-precision neural networks (MPNNs) that enable the use of just enou...

02/14/2023 · Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization
Mixed-precision quantization (MPQ) suffers from time-consuming policy se...
