Multi-objective Recurrent Neural Networks Optimization for the Edge – a Quantization-based Approach

08/02/2021
by Nesma M. Rezk, et al.
The compression of deep learning models is of fundamental importance for deploying such models on edge devices. Incorporating the hardware model and application constraints during compression maximizes the benefit but ties the result to one specific case, so the compression process needs to be automated. Searching for the optimal parameters of a compression method can then be treated as an optimization problem. This article introduces a Multi-Objective Hardware-Aware Quantization (MOHAQ) method, which considers both hardware efficiency and inference error as objectives for mixed-precision quantization. The proposed method makes the evaluation of candidate solutions in a large search space feasible through two steps. First, post-training quantization is applied for fast solution evaluation. Second, we propose a search technique named "beacon-based search" that retrains only selected solutions in the search space and uses them as beacons to estimate the effect of retraining on other solutions. To evaluate the optimization potential, we chose a speech recognition model trained on the TIMIT dataset. The model is based on the Simple Recurrent Unit (SRU) because of its considerable speedup over other recurrent units. We applied our method to two target platforms: SiLago and Bitfusion. Experimental evaluations showed that SRU can be compressed up to 8x by post-training quantization without any significant increase in error, and up to 12x with only a 1.5 percentage point increase. On SiLago, the inference-only search found solutions that achieve 80% of the maximum possible speedup and 64% of the maximum possible energy saving, with a 0.5 percentage point increase in error. On Bitfusion, under a small-SRAM-size constraint, beacon-based search reduced the error increase of the inference-only search by 4 percentage points and raised the achievable speedup to 47x over the Bitfusion baseline.
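The abstract does not include code, but a rough sketch can clarify how the two-step search fits together. The Python below is a minimal illustration under loose assumptions, none of which come from the paper: random weight matrices stand in for the trained SRU model, per-layer quantization error (MSE) stands in for the inference error on TIMIT, total weight bits stand in for the hardware-efficiency objective, random sampling plus a Pareto filter stands in for the paper's multi-objective search, and a placeholder rule (retraining halves the error proxy) stands in for real retraining of the beacon solutions.

```python
# Minimal sketch of a MOHAQ-style two-step, two-objective search over
# per-layer bit widths. All model details below are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer shapes; random weights stand in for the trained SRU model.
LAYERS = [rng.standard_normal((256, 256)) for _ in range(4)]
BIT_CHOICES = (2, 4, 8)  # candidate per-layer precisions


def quantize(w, bits):
    """Uniform symmetric post-training quantization of one weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale


def evaluate(cand):
    """Inference-only evaluation: (error proxy, memory footprint in bits)."""
    err = sum(float(np.mean((w - quantize(w, b)) ** 2)) for w, b in zip(LAYERS, cand))
    bits = sum(w.size * int(b) for w, b in zip(LAYERS, cand))
    return err, bits


def pareto_front(scored):
    """Keep solutions that are not dominated in both objectives (lower is better)."""
    return [(c, obj) for c, obj in scored
            if not any(o2[0] <= obj[0] and o2[1] <= obj[1] and o2 != obj
                       for _, o2 in scored)]


# Step 1: inference-only search -- score candidate bit-width assignments using
# post-training quantization, which is cheap enough for a large search space.
candidates = [tuple(int(b) for b in rng.choice(BIT_CHOICES, size=len(LAYERS)))
              for _ in range(64)]
scored = [(c, evaluate(c)) for c in candidates]
front = pareto_front(scored)


# Step 2: beacon-based search (heavily simplified) -- retrain a few solutions
# from the front, measure how much retraining recovers, and use that as a
# correction to the predicted error of the remaining candidates.
def retrained_error(cand):
    # Placeholder: assume retraining halves the error proxy. A real
    # implementation would run quantization-aware retraining here.
    return 0.5 * evaluate(cand)[0]


beacons = front[:2]
recovery = np.mean([retrained_error(c) / obj[0] for c, obj in beacons])
corrected = [(c, (obj[0] * recovery, obj[1])) for c, obj in scored]
print("Pareto-optimal bit-width assignments:", pareto_front(corrected)[:3])
```

In the actual method, the error objective comes from evaluating the quantized SRU model on TIMIT rather than from a weight-space proxy, and the hardware objectives (speedup, energy) come from the SiLago and Bitfusion platform models rather than a raw bit count.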
