Alternating Multi-bit Quantization for Recurrent Neural Networks

02/01/2018
by   Jianqiang Yao, et al.

Recurrent neural networks have achieved excellent performance in many applications. However, on portable devices with limited resources, the models are often too large to deploy. For server-side applications with large-scale concurrent requests, inference latency is also critical given costly computing resources. In this work, we address these problems by quantizing the network, both weights and activations, into multiple binary codes {-1, +1}. We formulate the quantization as an optimization problem. Under the key observation that, once the quantization coefficients are fixed, the binary codes can be derived efficiently by a binary search tree, alternating minimization is then applied. We test the quantization on two well-known RNNs, i.e., long short-term memory (LSTM) and gated recurrent unit (GRU), on language models. Compared with the full-precision counterpart, 2-bit quantization achieves 16x memory saving and 6x real inference acceleration on CPUs, with only a reasonable loss in accuracy. With 3-bit quantization, we achieve almost no loss in accuracy, or even surpass the original model, with 10.5x memory saving and 3x real inference acceleration. Both results beat existing quantization works by large margins. We extend our alternating quantization to image classification tasks. In both RNNs and feedforward neural networks, the method also achieves excellent performance.
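The alternating scheme described in the abstract can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the authors' implementation: it approximates a real weight vector w by a sum of k binary codes weighted by coefficients, alternating between a least-squares update of the coefficients and a nearest-value update of the codes. For clarity it enumerates the 2^k representable values by brute force where the paper uses a binary search tree; names such as `alternating_quantize` are illustrative assumptions.

```python
# Minimal sketch of alternating multi-bit quantization (assumed, illustrative code).
import numpy as np
from itertools import product

def alternating_quantize(w, k=2, iters=20):
    """Approximate w by B @ alpha, with B in {-1,+1}^(n x k) and alpha in R^k."""
    w = w.astype(np.float64)

    # Greedy initialization: quantize the residual one bit at a time.
    residual = w.copy()
    alphas, codes = [], []
    for _ in range(k):
        b = np.sign(residual)
        b[b == 0] = 1.0
        a = np.abs(residual).mean()   # optimal scalar for a fixed sign code
        alphas.append(a)
        codes.append(b)
        residual = residual - a * b
    alpha = np.array(alphas)          # shape (k,)
    B = np.stack(codes, axis=1)       # shape (n, k), entries in {-1, +1}

    for _ in range(iters):
        # Step 1: codes fixed -> coefficients by least squares.
        alpha, *_ = np.linalg.lstsq(B, w, rcond=None)
        # Step 2: coefficients fixed -> for each entry pick the closest of the
        # 2^k representable values (the paper uses a binary search tree here;
        # brute-force enumeration is used in this sketch for clarity).
        combos = np.array(list(product([-1.0, 1.0], repeat=k)))  # (2^k, k)
        values = combos @ alpha                                  # (2^k,)
        nearest = np.argmin(np.abs(w[:, None] - values[None, :]), axis=1)
        B = combos[nearest]
    return alpha, B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal(1024)
    alpha, B = alternating_quantize(w, k=2)
    print("relative error:", np.linalg.norm(w - B @ alpha) / np.linalg.norm(w))
```

Each alternating step cannot increase the reconstruction error, so the quantization error decreases monotonically until it stabilizes; the same routine can be applied per weight matrix (or per row) and, with a straight-through estimator, to activations as well.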


Related research

01/14/2021  On the quantization of recurrent neural networks
Integer quantization of neural networks can be defined as the approximat...

09/28/2018  Learning Recurrent Binary/Ternary Weights
Recurrent neural networks (RNNs) have shown excellent performance in pro...

11/29/2021  Low-bit Quantization of Recurrent Neural Network Language Models Using Alternating Direction Methods of Multipliers
The high memory consumption and computational costs of Recurrent neural ...

11/10/2017  Quantized Memory-Augmented Neural Networks
Memory-augmented neural networks (MANNs) refer to a class of neural netw...

10/14/2022  Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization
This paper presents an optimized methodology to design and deploy Speech...

06/16/2022  Accelerating Inference and Language Model Fusion of Recurrent Neural Network Transducers via End-to-End 4-bit Quantization
We report on aggressive quantization strategies that greatly accelerate ...

01/23/2020  Low-Complexity LSTM Training and Inference with FloatSD8 Weight Representation
The FloatSD technology has been shown to have excellent performance on l...
