APQ: Joint Search for Network Architecture, Pruning and Quantization Policy

by   Tianzhe Wang, et al.

We present APQ for efficient deep learning inference on resource-constrained hardware. Unlike previous methods that separately search the neural architecture, pruning policy, and quantization policy, we optimize them in a joint manner. To deal with the larger design space it brings, a promising approach is to train a quantization-aware accuracy predictor to quickly get the accuracy of the quantized model and feed it to the search engine to select the best fit. However, training this quantization-aware accuracy predictor requires collecting a large number of quantized <model, accuracy> pairs, which involves quantization-aware finetuning and thus is highly time-consuming. To tackle this challenge, we propose to transfer the knowledge from a full-precision (i.e., fp32) accuracy predictor to the quantization-aware (i.e., int8) accuracy predictor, which greatly improves the sample efficiency. Besides, collecting the dataset for the fp32 accuracy predictor only requires to evaluate neural networks without any training cost by sampling from a pretrained once-for-all network, which is highly efficient. Extensive experiments on ImageNet demonstrate the benefits of our joint optimization approach. With the same accuracy, APQ reduces the latency/energy by 2x/1.3x over MobileNetV2+HAQ. Compared to the separate optimization approach (ProxylessNAS+AMC+HAQ), APQ achieves 2.3 hours and CO2 emission, pushing the frontier for green AI that is environmental-friendly. The code and video are publicly available.


page 1

page 2

page 3

page 4


HQNAS: Auto CNN deployment framework for joint quantization and architecture search

Deep learning applications are being transferred from the cloud to edge ...

BatchQuant: Quantized-for-all Architecture Search with Robust Quantizer

As the applications of deep learning models on edge devices increase at ...

Pruning Large Language Models via Accuracy Predictor

Large language models(LLMs) containing tens of billions of parameters (o...

Differentiable Joint Pruning and Quantization for Hardware Efficiency

We present a differentiable joint pruning and quantization (DJPQ) scheme...

SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference

The combination of Neural Architecture Search (NAS) and quantization has...

Hardware-Centric AutoML for Mixed-Precision Quantization

Model quantization is a widely used technique to compress and accelerate...

ABS: Automatic Bit Sharing for Model Compression

We present Automatic Bit Sharing (ABS) to automatically search for optim...

Please sign up or login with your details

Forgot password? Click here to reset