
Effective Quantization Methods for Recurrent Neural Networks

by   Qinyao He, et al.

Reducing bit-widths of weights, activations, and gradients of a Neural Network can shrink its storage size and memory usage, and also allow for faster training and inference by exploiting bitwise operations. However, previous attempts at quantizing RNNs show considerable performance degradation when using low bit-width weights and activations. In this paper, we propose methods to quantize the structure of gates and interlinks in LSTM and GRU cells. In addition, we propose balanced quantization methods for weights to further reduce performance degradation. Experiments on the PTB and IMDB datasets confirm the effectiveness of our methods: the performance of our models matches or surpasses the previous state of the art for quantized RNNs.



1 Introduction

Deep Neural Networks (DNNs) have become important tools for modeling nonlinear functions in applications like computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012a), natural language processing (Bahdanau et al., 2014), and computer games (Silver et al., 2016).

However, inference and training of a DNN may involve up to billions of operations for inputs like images (Krizhevsky et al., 2012; Szegedy et al., 2014). A DNN may also have a large number of parameters, leading to large storage size and runtime memory usage. Such intensive resource requirements impede the adoption of DNNs in applications requiring real-time responses, especially on resource-limited platforms. To alleviate these requirements, many methods have been proposed, from both hardware and software perspectives (Farabet et al., 2011; Pham et al., 2012; Chen et al., 2014a, b). For example, constraints may be imposed on the weights of a DNN, like sparsity (Han et al., 2015b, a), circulant matrices (Cheng et al., 2015), low rank (Jaderberg et al., 2014; Zhang et al., 2015), vector quantization (Gong et al., 2014), and the hash trick (Chen et al., 2015), to reduce the number of free parameters and the computational complexity. However, these methods use high bit-width numbers for computations, which require the availability of high precision multiply-and-add instructions.

Another line of research tries to reduce the bit-width of weights and activations of a DNN by quantization to low bit-width numbers (Rastegari et al., 2016; Hubara et al., 2016b; Zhou et al., 2016; Hubara et al., 2016a). Reducing the bit-width of the weights of a 32-bit model to $k$ can shrink the storage size of the model to $k/32$ of the original size. Similarly, reducing the bit-width of activations to $k$ can shrink the runtime memory usage by the same proportion. In addition, when the underlying platform supports efficient bitwise operations and a $\mathrm{bitcount}$ operation that counts the number of set bits in a bit vector, we can compute the inner product between bit vectors by the following formula:

$$\mathbf{x} \cdot \mathbf{y} = \mathrm{bitcount}(\mathrm{and}(\mathbf{x}, \mathbf{y})), \quad x_i, y_i \in \{0, 1\} \ \ \forall i.$$

Consequently, convolutions between low bit-width numbers can be considerably accelerated on platforms supporting efficient execution of bitwise operations, including CPU, GPU, FPGA and ASIC. Previous work shows that using only 1-bit weights and 2-bit activations can achieve 51% top-1 accuracy on the ImageNet dataset (Hubara et al., 2016a).
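
As an illustration, the bitwise inner product described above can be sketched in plain Python. The helper names `pack_bits` and `bit_dot` are hypothetical and introduced only for this sketch; a real kernel would rely on hardware popcount instructions over machine words rather than Python integers.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int (bit i holds bits[i])."""
    word = 0
    for i, b in enumerate(bits):
        if b:
            word |= 1 << i
    return word


def bit_dot(x_word, y_word):
    """Inner product of two packed 0/1 vectors: bitcount(and(x, y))."""
    return bin(x_word & y_word).count("1")


x = [1, 0, 1, 1, 0, 1]
y = [1, 1, 0, 1, 0, 1]

packed = bit_dot(pack_bits(x), pack_bits(y))
reference = sum(a * b for a, b in zip(x, y))  # ordinary multiply-and-add
print(packed, reference)
```

Both computations agree because, for entries in {0, 1}, the product of two bits equals their logical AND.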

However, in contrast to the extensive study of compression and quantization of convolutional neural networks, little attention has been paid to reducing the computational resource requirements of RNNs. Ott et al. (2016) claim that the weight binarization method does not work with RNNs, introduce weight ternarization, and leave activations as floating point numbers.

Hubara et al. (2016a) experiment with different combinations of bit-widths for weights and activations, and show that 4-bit quantized CNNs and RNNs can achieve accuracy comparable to their 32-bit counterparts. However, large performance degradation occurs when quantizing weights and activations to 2-bit numbers. Though Hubara et al. (2016a) have open-sourced their quantized CNN, neither of the two works open-sources its quantized RNNs.

This paper makes the following contributions:

  1. We outline a detailed design for quantizing two popular types of RNN cells: LSTM and GRU. We evaluate our models on different sets of bit-width configurations and two NLP tasks: Penn Treebank and IMDB. We demonstrate that with our design, quantization with 4-bit weights and activations can achieve almost the same performance as 32-bit. In addition, we obtain significantly better results when quantizing to lower bit-widths.

  2. We propose methods to quantize weights deterministically and adaptively to balanced distributions, especially when weights are 2-bit numbers. The balanced distribution of quantized weights leads to better utilization of the parameter space and consequently increases the prediction accuracy. We explicitly induce the balanced distribution by introducing parameter-dependent thresholds into the quantization process during training.

  3. We release code for training our quantized RNNs online. The code is implemented in the TensorFlow framework (Abadi et al.).

2 Quantization Methods

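In this section we outline several quantization methods. W.l.o.g., we assume the input to the quantization is a matrix $X$ unless otherwise specified. When all entries of $X$ are in the closed interval $[0, 1]$, we define the $k$-bit uniform quantization $Q_k$ as follows:

$$Q_k(X) = \frac{\mathrm{round}\left((2^k - 1)\, X\right)}{2^k - 1}, \quad 0 \le x_{ij} \le 1 \ \ \forall i, j.$$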
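
A minimal sketch of the $k$-bit uniform quantization of values in $[0, 1]$ follows, in plain Python on scalars. The function name `quantize_k` is a hypothetical choice for this sketch, and rounding halves up is an assumption (Python's built-in `round` rounds halves to even, which would only differ at exact ties).

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] onto 2**k levels."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("input must lie in [0, 1]")
    n = 2 ** k - 1
    # floor(v + 0.5) rounds halves up, independent of Python's round().
    return math.floor(n * x + 0.5) / n


# Sweep [0, 1] and collect the distinct 2-bit output levels.
levels = sorted({quantize_k(i / 100, 2) for i in range(101)})
print(levels)
```

For $k = 2$ the sweep yields the four levels 0, 1/3, 2/3 and 1.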


However, the derivative of this quantization function equals zero almost everywhere. We adopt the “straight-through estimator” (STE) method (Hinton et al., 2012b; Bengio et al., 2013) to circumvent this problem.

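Using the above quantization method together with STE leads to the following rule for the forward and backward passes when training a neural network, where $p$ is the input of the quantizer, $q$ its output and $c$ the cost:

$$\text{Forward: } q \leftarrow Q_k(p), \qquad \text{Backward: } \frac{\partial c}{\partial p} \leftarrow \frac{\partial c}{\partial q}.$$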
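
The straight-through estimator can be illustrated with a hand-written forward and backward pass on a single scalar. This is a sketch under stated assumptions: the names `ste_forward` and `ste_backward`, the squared-error toy cost, the target, the learning rate, and the clamping of the parameter to $[0, 1]$ are all chosen for illustration and are not part of the method's specification.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def ste_forward(p, k):
    """Forward pass: the quantized value is used in the computation."""
    return quantize_k(p, k)


def ste_backward(grad_q):
    """Backward pass: the gradient w.r.t. q is passed to p unchanged,
    as if the quantizer were the identity function."""
    return grad_q


# Toy problem (illustrative assumption): minimise c = (q - target)**2.
p, target, lr, k = 0.9, 0.3, 0.1, 2
for _ in range(30):
    q = ste_forward(p, k)
    grad_q = 2.0 * (q - target)       # dc/dq
    p -= lr * ste_backward(grad_q)    # dc/dp := dc/dq under STE
    p = min(max(p, 0.0), 1.0)         # keep p inside the domain of Q_k

print(p, ste_forward(p, k))
```

Without STE the true derivative of the quantizer is zero almost everywhere, so `p` would never move; with STE the full-precision parameter drifts until its quantized value is the level closest to the target.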


2.1 Deterministic Quantization

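When entries of $X$ are not constrained to the closed interval $[0, 1]$, an affine transform needs to be applied before using the function $Q_k$. A straightforward transformation can be done using the minimum and maximum of $X$ to get $\tilde{X}$, the standardized version of $X$:

$$\tilde{X} = \frac{X - \beta}{\alpha}, \qquad \alpha = \max(X) - \min(X), \quad \beta = \min(X).$$

After quantization, we can apply a reverse affine transform to approximate the original values. Overall, the quantized result is:

$$Q_{det}(X) = \alpha\, Q_k(\tilde{X}) + \beta \approx X.$$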
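
The min-max scheme can be sketched as follows on a flat list of values. The name `quantize_det` and the example data are assumptions made for this illustration; the data deliberately contain one outlier to show how the extremes dictate the quantization grid.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def quantize_det(values, k):
    """Min-max standardise, quantize with Q_k, then map back."""
    beta = min(values)
    alpha = max(values) - beta
    if alpha == 0:
        return list(values)  # constant input: nothing to quantize
    return [alpha * quantize_k((v - beta) / alpha, k) + beta for v in values]


x = [-2.0, -0.5, 0.1, 0.4, 6.0]   # 6.0 acts as an outlier
x_q = quantize_det(x, 2)
print(x_q)
```

In this example the three middle values collapse onto a single level and one of the four available levels is never used, which is the kind of imbalance the next section addresses.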

2.2 Balanced Deterministic Quantization

When we quantize values, it may be desirable to make the quantized values have balanced distributions, so as to take full advantage of the available parameter space. Ordinarily, this is not possible as the distribution of the input values has already been fixed. In particular, using the deterministic quantization $Q_{det}$ of Section 2.1 does not exert any impact on the distribution of quantized values.

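Next we show that we can induce more uniform distributions of quantized values by introducing the parameter-dependent adaptive threshold $\mathrm{median}(|X|)$. We first introduce a different standardization transform that produces $\hat{X}$, and define a balanced quantization method $Q_{bal}$ as follows:

$$\hat{X} = \mathrm{clip}\left(\frac{X}{3\,\mathrm{median}(|X|)}, -\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2}, \qquad Q_{bal}(X) = 3\,\mathrm{median}(|X|)\left(Q_k(\hat{X}) - \frac{1}{2}\right). \tag{3}$$

The only difference between $Q_{det}$ and $Q_{bal}$ lies in the standardization. In fact, when the extremal values of $X$ are symmetric around zero, i.e.

$$\min(X) = -\max(X),$$

we may rewrite $Q_{det}$ equivalently as follows to make the similarity between $Q_{det}$ and $Q_{bal}$ more obvious:

$$\tilde{X} = \frac{X}{2\max(|X|)} + \frac{1}{2}, \qquad Q_{det}(X) = 2\max(|X|)\left(Q_k(\tilde{X}) - \frac{1}{2}\right).$$

Hence the only difference between $Q_{det}$ and $Q_{bal}$ lies in the difference between the properties of $\max(|X|)$ and $\mathrm{median}(|X|)$. We find that, as $\mathrm{median}(|X|)$ is an order statistic, using it as a threshold produces an auto-balancing effect.

2.2.1 The Auto-balancing effect of $\mathrm{median}(|X|)$

We consider the case when the bit-width is 2 as an example. In this case, under the symmetric distribution assumption, we can prove the auto-balancing effect of $\mathrm{median}(|X|)$.

Theorem 1.

If $k = 2$, and suppose the entries of $X$ are symmetrically distributed around zero and no two entries of $X$ are equal, then the four bars in the histogram of $Q_{bal}(X)$ will all have exactly the same height.

Proof. By Formula 3, entries of $Q_k(\hat{X})$ will be equal to 1 if the corresponding entries in $\hat{X}$ are above $\frac{5}{6}$, equal to $\frac{2}{3}$ if between $\frac{1}{2}$ and $\frac{5}{6}$, equal to $\frac{1}{3}$ if between $\frac{1}{6}$ and $\frac{1}{2}$, and equal to 0 if below $\frac{1}{6}$. When $k = 2$ and the entries of $X$ are symmetrically distributed around zero, the values in $X$ will therefore be thresholded by $-\mathrm{median}(|X|)$, $0$, and $\mathrm{median}(|X|)$ into four bins. By the property of the median and the symmetric distribution assumption, the four bins will contain the same number of quantized values. ∎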
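
The auto-balancing effect can be checked numerically on a sample that is exactly symmetric around zero. The sketch below assumes the standardization divides by three times the median of the absolute values before clipping to $[-1/2, 1/2]$ and shifting by $1/2$, so that the 2-bit thresholds fall at minus the median, zero, and the median; the helper name `standardize_balanced`, the seed and the sample size are illustrative choices. The histogram is taken over the quantization levels before any rescaling, which does not change the bar heights.

```python
import math
import random
import statistics
from collections import Counter


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def standardize_balanced(values):
    """Scale by 3 * median(|x|), clip to [-0.5, 0.5], shift into [0, 1]."""
    scale = 3.0 * statistics.median(abs(v) for v in values)
    return [min(max(v / scale, -0.5), 0.5) + 0.5 for v in values]


rng = random.Random(0)
half = [rng.gauss(0.0, 1.0) for _ in range(500)]
x = half + [-v for v in half]          # exactly symmetric around zero

hist = Counter(quantize_k(v, 2) for v in standardize_balanced(x))
print(sorted(hist.items()))
```

Each of the four levels receives a quarter of the 1000 entries, regardless of how heavy the tails of the sample are.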

In practice, computing $\mathrm{median}(|X|)$ may not be computationally convenient as it requires sorting. We note that when a distribution has bounded variance $\sigma^2$, its mean $\mu$ approximates its median $m$, as there is an inequality bounding the difference (Mallows, 1991):

$$|\mu - m| \le \sigma.$$

Hence we may use $\mathrm{mean}(|X|)$ instead of $\mathrm{median}(|X|)$ in the quantization. Although this introduces some error, empirically we still observe a nearly balanced distribution.

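If we further assume that the weights follow a zero-mean normal distribution $\mathcal{N}(0, \sigma^2)$, then $|X|$ follows a half-normal distribution. By simple calculations we have:

$$\mathrm{mean}(|X|) = \sigma\sqrt{2/\pi} \approx 0.7979\,\sigma, \qquad \mathrm{median}(|X|) = \sigma\sqrt{2}\,\mathrm{erf}^{-1}(1/2) \approx 0.6745\,\sigma \approx 0.8453\,\mathrm{mean}(|X|).$$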
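
The half-normal constants can be reproduced with the Python standard library. The sketch uses the fact that the median $m$ of $|X|$ satisfies $\Phi(m / \sigma) = 0.75$ for a zero-mean normal $X$; the variable names are illustrative.

```python
import math
from statistics import NormalDist

sigma = 1.0
mean_abs = sigma * math.sqrt(2.0 / math.pi)        # E|X| for X ~ N(0, sigma^2)
median_abs = sigma * NormalDist().inv_cdf(0.75)    # P(|X| <= m) = 0.5
ratio = median_abs / mean_abs                      # median(|X|) / mean(|X|)
gamma = 3.0 * ratio                                # factor that replaces 3 * median by gamma * mean

print(round(mean_abs, 4), round(median_abs, 4), round(ratio, 4), round(gamma, 4))
```

The ratio comes out at about 0.845, so a threshold of three times the median corresponds to roughly 2.54 times the mean, in line with the value 2.5359 quoted below up to rounding.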


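Putting all these things together, we have the balanced deterministic quantization method:

$$Q_{bal}(X) = \gamma\,\mathrm{mean}(|X|)\left(Q_k\!\left(\mathrm{clip}\left(\frac{X}{\gamma\,\mathrm{mean}(|X|)}, -\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2}\right) - \frac{1}{2}\right), \tag{4}$$

where a natural choice of $\gamma$ would be 3 or 2.5 (rounding 2.5359 to a short binary number) under different assumptions. In our following experiments, we adopt 2.5 as the scaling factor.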
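
A sketch of the mean-based variant with scaling factor 2.5 is given below, applied to synthetic Gaussian "weights". The function name `quantize_balanced`, the rescaling of the output by the same factor used for standardization, the standard deviation 0.1 and the seed are assumptions of this sketch rather than details taken from the released code.

```python
import math
import random
from collections import Counter


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def quantize_balanced(values, k, gamma=2.5):
    """Balanced quantization using gamma * mean(|x|) as the scale."""
    scale = gamma * sum(abs(v) for v in values) / len(values)
    out = []
    for v in values:
        x_hat = min(max(v / scale, -0.5), 0.5) + 0.5   # clip, shift to [0, 1]
        out.append(scale * (quantize_k(x_hat, k) - 0.5))
    return out


rng = random.Random(0)
w = [rng.gauss(0.0, 0.1) for _ in range(10000)]
w_q = quantize_balanced(w, 2)

fractions = {level: n / len(w) for level, n in Counter(w_q).items()}
for level in sorted(fractions):
    print(f"{level:+.4f}: {fractions[level]:.3f}")
```

On Gaussian input each of the four levels should receive close to a quarter of the values, although the balance is only approximate because the mean stands in for the median.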

Although the above argument for balanced quantization applies only to 2-bit quantization, we argue that larger bit-widths also benefit, because clipping prevents extreme values from stretching the value range and thereby increasing rounding error. It should be noted that for 1-bit quantization (binarization), the scaling factor should be $\mathrm{mean}(|X|)$, which can be proved to be optimal in the sense of reconstruction error measured by the Frobenius norm, as in (Rastegari et al., 2016). However, the proof relies on the constant norm property of 1-bit representations, and does not generalize to other bit-widths.

2.3 Quantization of Weights

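Weights in neural networks are sometimes known to have a bell-shaped distribution around zero, similar to a normal distribution. Hence we can assume the weight matrix $W$ to have a symmetric distribution around 0, and apply the above equation for balanced quantization as

$$W_q = Q_{bal}(W).$$

To include the quantization in the computation graph of a neural network, we apply STE to the entire expression rather than only to $Q_k$ itself:

$$\text{Forward: } q \leftarrow Q_{bal}(p), \qquad \text{Backward: } \frac{\partial c}{\partial p} \leftarrow \frac{\partial c}{\partial q}.$$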


A special property of the balanced quantization method is that, in general, it distorts the extremal values due to the clipping in Formula 4, and these values generally contribute more to the computed sums of inner products. However, when the values to be quantized are weights of neural networks and balanced quantization is introduced into the training process, we conjecture that the neural networks may gradually adapt to the distortions, so that the distributions of weights may be induced to be more balanced. The more balanced distribution will increase the effective bit-width of neural networks, leading to better prediction accuracy. We will empirically validate this conjecture through experiments in Section 4.

2.4 Quantization of Activations

Quantization of activations follows the method in Zhou et al. (2016): we assume the output of the previous layer has passed through a bounded activation function, and apply quantization directly to it. In fact, we find that adding a scaling term containing the mean or max value to the activations may harm prediction accuracy.

There is a design choice on what the range of quantized values should be. One choice is a symmetric distribution around 0. Under this choice, inputs are bounded by the activation function to $[-0.5, 0.5]$, then shifted to the right by 0.5 before being fed into $Q_k$, and then shifted back.

Another choice is a value range of $[0, 1]$, which is closer to the value range of the ReLU activation function. Under this choice, we can directly apply $Q_k$. For the commonly used $\tanh$ activation with range $[-1, 1]$ in RNNs, it seems natural to use symmetric quantization. However, we will point out some considerations for using quantization to the range $[0, 1]$ in Section 3.
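
The two choices of activation range can be sketched side by side. The function names `quantize_act_unit` and `quantize_act_symmetric` are hypothetical labels for this illustration.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def quantize_act_unit(a, k):
    """Activation already bounded to [0, 1]: apply Q_k directly."""
    return quantize_k(a, k)


def quantize_act_symmetric(a, k):
    """Activation bounded to [-0.5, 0.5]: shift right by 0.5, quantize, shift back."""
    return quantize_k(a + 0.5, k) - 0.5


for a in (-0.5, -0.2, 0.0, 0.2, 0.5):
    print(a, quantize_act_symmetric(a, 2))
for a in (0.0, 0.3, 0.5, 0.8, 1.0):
    print(a, quantize_act_unit(a, 2))
```

Note that with the symmetric range an input of exactly 0 is mapped to a nonzero level, which matters for the dropout discussion in Section 3.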

3 Quantization of Recurrent Neural Network

In this section, we detail our design considerations for quantization of recurrent neural networks. Unlike plain feed-forward neural networks, recurrent neural networks, especially Long Short Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (Chung et al., 2014), have subtle and delicately designed structures, which makes their quantization more complex and calls for more careful consideration. Nevertheless, the major algorithm is the same as Algorithm 1 in Zhou et al. (2016).

3.1 Dropout

It is well known that fully-connected (FC) layers, having a large number of parameters, are prone to overfitting (Srivastava et al., 2014). There are several FC-like structures in an RNN, for example the input, output and transition matrices in RNN cells (like GRU and LSTM) and the final FC layer for softmax classification. The dropout technique, which randomly sets a portion of features to 0 at training time, turns out to be an effective way of alleviating overfitting in RNNs as well (Zaremba et al., 2014).

As dropped activations are zero, it is necessary to have zero among the quantized values. For symmetric quantization to the range $[-0.5, 0.5]$, $0$ does not exist in the range of $Q_k(X + 0.5) - 0.5$. Hence we use $[0, 1]$ as the range of quantized values when dropout is needed.
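
The absence of zero in the symmetric range can be verified by enumerating the representable levels exactly with rational arithmetic. The helper names below are illustrative.

```python
from fractions import Fraction


def levels_unit(k):
    """Values representable by Q_k on [0, 1]."""
    n = 2 ** k - 1
    return [Fraction(i, n) for i in range(n + 1)]


def levels_symmetric(k):
    """Values representable by Q_k(a + 0.5) - 0.5 on [-0.5, 0.5]."""
    return [level - Fraction(1, 2) for level in levels_unit(k)]


for k in (1, 2, 3, 4):
    print(k, 0 in levels_unit(k), 0 in levels_symmetric(k))
```

A symmetric level is zero only if some $i / (2^k - 1)$ equals $1/2$, which would require $2i = 2^k - 1$; this is impossible because $2^k - 1$ is odd, so the result holds for every bit-width.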

3.2 Embedding Layer

In tasks related to Natural Language Processing, the input words, which are represented by IDs, are embedded into a low-dimensional space before being fed into RNNs. The word embedding matrix is in $\mathbb{R}^{|V| \times N}$, where $|V|$ is the size of the vocabulary and $N$ is the length of the embedded vectors.

Quantization of weights in embedding layers turns out to be different from quantization of weights in FC layers. In fact, the weights of embedding layers behave like activations: a certain row is selected and fed to the next layer, so the quantization method should be the same as that of activations rather than that of weights. Similarly, as dropout may also be applied to the outputs of embedding layers, it is necessary to bound the values in embedding matrices to $[0, 1]$.

To clip the value range of the weights of embedding layers, a natural choice would be to use the sigmoid function $\sigma$, such that $\sigma(W)$ is used as the parameters of embedding layers, but we observe a severe vanishing gradient problem for the gradients $\frac{\partial c}{\partial W}$ during training. Hence we instead directly apply a clip function $\max(\min(W, 1), 0)$, and randomly initialize the embedding matrices with values drawn from the uniform distribution $U(0, 1)$. These two measures are found to improve the performance of the model.
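
The embedding treatment can be sketched as a lookup that clips the selected row to $[0, 1]$ and then quantizes it like an activation. The names `init_embedding`, `clip_unit` and `embed`, and the tiny table dimensions, are hypothetical choices for this illustration.

```python
import math
import random


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def init_embedding(vocab_size, dim, rng):
    """Embedding table with entries drawn uniformly from [0, 1]."""
    return [[rng.uniform(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]


def clip_unit(value):
    """Clip to [0, 1]: max(min(value, 1), 0)."""
    return max(min(value, 1.0), 0.0)


def embed(table, word_id, k):
    """Select one row, clip it to [0, 1], and quantize it like an activation."""
    return [quantize_k(clip_unit(v), k) for v in table[word_id]]


rng = random.Random(0)
table = init_embedding(vocab_size=5, dim=4, rng=rng)
table[2][0] = 1.7          # pretend training pushed an entry out of range
vec = embed(table, word_id=2, k=2)
print(vec)
```

The clip keeps out-of-range entries at the boundary of the quantization range instead of squashing all entries as a sigmoid would.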

3.3 Quantization of GRU

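We first investigate quantization of GRU as it is structurally simpler. The basic structure of a GRU cell may be described as follows:

$$\begin{aligned} z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\ r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\ \tilde{h}_t &= \tanh(W \cdot [r_t \circ h_{t-1}, x_t]) \\ h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \end{aligned}$$

where $\sigma$ stands for the sigmoid function and $\circ$ for element-wise multiplication.

Recall that to benefit from the speed advantage of bit convolution kernels, both matrix operands of a multiplication need to be in low bit-width form, so that the dot product can be calculated by bitwise operations. For plain feed-forward neural networks, as the convolutions take up most of the computation time, we can get decent acceleration by quantizing the inputs of convolutions and their weights. But when it comes to more complex structures like GRU, we need to check the bit-width of each interlink.

Apart from the matrix multiplications needed to compute $z_t$, $r_t$ and $\tilde{h}_t$, the gate structure of $\tilde{h}_t$ and $h_t$ brings in the need for element-wise multiplication. As the output of the sigmoid function may have a large bit-width, the element-wise multiplication may need to be done in floating point numbers (or in a higher fixed-point format). As the results of these element-wise multiplications are also inputs to later computations, including those at the next timestep, and noting that a quantized value multiplied by a quantized value will have a larger bit-width, we need to insert additional quantization steps after element-wise multiplications.

Another problem with quantization of the GRU structure lies in the different value ranges of the gates. The range of $\tanh$ is $[-1, 1]$, which is different from the value range $[0, 1]$ of $z_t$ and $r_t$. If we want to preserve the original activation functions, we will have the following quantization scheme:

$$\begin{aligned} z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\ r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\ \tilde{h}_t &= \tanh\left(W \cdot \left[2\,Q_k\left(\tfrac{1}{2}(r_t \circ h_{t-1}) + \tfrac{1}{2}\right) - 1,\ x_t\right]\right) \\ h_t &= 2\,Q_k\left(\tfrac{1}{2}\left((1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t\right) + \tfrac{1}{2}\right) - 1 \end{aligned}$$

where we assume the weights $W_z$, $W_r$, $W$ have already been quantized, and the input $x_t$ has already been quantized to $[-1, 1]$.

However, we note that the quantization function already has an affine transform to shift the value range. To simplify the implementation, we replace the activation function of $\tilde{h}_t$ with the sigmoid function, so that $(1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \in [0, 1]$.

Summarizing the above considerations, the quantized version of GRU could be written as

$$\begin{aligned} z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\ r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\ \tilde{h}_t &= \sigma(W \cdot [Q_k(r_t \circ h_{t-1}), x_t]) \\ h_t &= Q_k\left((1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t\right) \end{aligned}$$

where we assume the weights $W_z$, $W_r$, $W$ have already been quantized, and the input $x_t$ has already been quantized to $[0, 1]$.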
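
One step of a quantized GRU of this kind can be sketched in plain Python. The sketch assumes the variant described above: a sigmoid candidate activation, quantization inserted after each element-wise product that feeds a matrix multiplication or the next timestep, weights that are already quantized, and inputs in $[0, 1]$. The function names and the toy weight values are illustrative and not taken from the released code.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, v):
    """Dense matrix-vector product on nested lists."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def quantized_gru_step(w_z, w_r, w_h, h_prev, x, k):
    """One GRU step with quantized interlinks; h_prev and x lie in [0, 1]."""
    hx = h_prev + x                                   # concatenation [h_{t-1}, x_t]
    z = [sigmoid(v) for v in matvec(w_z, hx)]         # update gate
    r = [sigmoid(v) for v in matvec(w_r, hx)]         # reset gate
    # Quantize r * h before it enters the next matrix multiplication.
    rh = [quantize_k(ri * hi, k) for ri, hi in zip(r, h_prev)]
    h_cand = [sigmoid(v) for v in matvec(w_h, rh + x)]
    # Quantize the new state, since it is an input at the next timestep.
    return [quantize_k((1.0 - zi) * hi + zi * ci, k)
            for zi, hi, ci in zip(z, h_prev, h_cand)]


# Toy 2-bit weights for hidden size 2 and input size 2 (illustrative values).
w_z = [[1 / 3, -1 / 3, 1.0, -1.0], [-1.0, 1 / 3, 1 / 3, 1.0]]
w_r = [[1.0, 1 / 3, -1 / 3, -1.0], [1 / 3, 1.0, -1.0, 1 / 3]]
w_h = [[-1 / 3, 1.0, 1 / 3, 1.0], [1.0, -1.0, 1 / 3, -1 / 3]]

h = [0.0, 2 / 3]
x = [1 / 3, 1.0]
for _ in range(3):
    h = quantized_gru_step(w_z, w_r, w_h, h, x, k=2)
print(h)
```

The gates themselves stay in floating point here; only the values that become operands of matrix products are forced onto the $k$-bit grid.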

3.4 Quantization of LSTM

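The structure of LSTM can be described as follows:

$$\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\ C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \circ \tanh(C_t) \end{aligned}$$

Different from GRU, $C_t$ cannot be easily quantized, since its value is unbounded: it does not pass through an activation function like $\tanh$ or the sigmoid function. This difficulty comes from the structure design and cannot be alleviated without introducing extra facilities to clip value ranges. But it can be noted that the computations involving $C_t$ are all element-wise multiplications and additions, which may take much less time than computing matrix products. For this reason, we leave $C_t$ in floating point form.

To simplify implementation, the $\tanh$ activation for the output may be changed to the sigmoid function.

Summarizing the above changes, the formula for quantized LSTM can be:

$$\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \\ C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= Q_k\left(o_t \circ \sigma(C_t)\right) \end{aligned}$$

where we assume the weights $W_f$, $W_i$, $W_C$, $W_o$ have already been quantized, and the input $x_t$ has already been quantized to $[0, 1]$.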
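
A corresponding sketch of one quantized LSTM step keeps the cell state in floating point and quantizes only the hidden output, with the output activation replaced by a sigmoid as discussed above. The parameter dictionary layout, the helper names and the toy weights are assumptions made for this illustration.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a value in [0, 1] (halves round up)."""
    n = 2 ** k - 1
    return math.floor(n * x + 0.5) / n


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, b, v):
    """Dense layer w @ v + b on nested lists."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]


def quantized_lstm_step(params, h_prev, c_prev, x, k):
    """One LSTM step: cell state stays in floating point, h is quantized."""
    hx = h_prev + x                                    # concatenation [h_{t-1}, x_t]
    f = [sigmoid(v) for v in affine(params["w_f"], params["b_f"], hx)]
    i = [sigmoid(v) for v in affine(params["w_i"], params["b_i"], hx)]
    c_cand = [math.tanh(v) for v in affine(params["w_c"], params["b_c"], hx)]
    o = [sigmoid(v) for v in affine(params["w_o"], params["b_o"], hx)]
    # Element-wise update of the (unbounded) cell state, left unquantized.
    c = [fi * cp + ii * cc for fi, cp, ii, cc in zip(f, c_prev, i, c_cand)]
    # Sigmoid instead of tanh keeps o * sigmoid(c) inside [0, 1] for Q_k.
    h = [quantize_k(oi * sigmoid(ci), k) for oi, ci in zip(o, c)]
    return h, c


# Toy 2-bit weights for hidden size 2 and input size 1 (illustrative values).
gate = [[1 / 3, -1 / 3, 1.0], [-1.0, 1 / 3, 1 / 3]]
params = {name: gate for name in ("w_f", "w_i", "w_c", "w_o")}
params.update({name: [0.0, 0.0] for name in ("b_f", "b_i", "b_c", "b_o")})

h, c = [0.0, 0.0], [0.0, 0.0]
for x in ([1.0], [1 / 3], [2 / 3]):
    h, c = quantized_lstm_step(params, h, c, x, k=2)
print(h, c)
```

Only `h` is constrained to the quantization grid; `c` takes arbitrary floating point values, reflecting the design choice of leaving the cell path unquantized.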

4 Experiment Results

We evaluate the quantized RNN models on two tasks: language modeling and sentence classification.

4.1 Experiments on Penn Treebank dataset

For language modeling we use the Penn Treebank dataset (Taylor et al., 2003), which contains 10K unique words. We download the data from Tomas Mikolov’s webpage (imikolov/rnnlm/simple-examples.tgz). For fair comparison, in the following experiments our models all use one hidden layer with 300 hidden units, which is the same setting as Hubara et al. (2016a). A word embedding layer is used at the input side of the network, with weights trained from scratch. The performance is measured in perplexity per word (PPW).

During experiments we find that the magnitudes of values in dense matrices or fully connected layers explode when using small bit-widths, resulting in overfitting and divergence. This can be alleviated by adding $\tanh$ to constrain the value ranges or by adding weight decay for regularization.

Model                  weight-bits  activation-bits  PPW (balanced)  PPW (unbalanced)
GRU                    1            2                285             diverge
GRU                    1            32               178             diverge
GRU                    2            2                150             165
GRU                    2            3                128             141
GRU                    3            3                109             110
GRU                    4            4                104             102
GRU                    32           32               -               100
LSTM                   1            2                257             diverge
LSTM                   1            32               198             diverge
LSTM                   2            2                152             164
LSTM                   2            3                142             155
LSTM                   3            3                120             122
LSTM                   4            4                114             114
LSTM                   32           32               -               109
Hubara et al. (2016a)  2            3                220
Hubara et al. (2016a)  4            4                100
Table 1: Quantized RNNs on PTB datasets. Rows for Hubara et al. (2016a) list a single PPW value.

Our result is in agreement with Hubara et al. (2016a), who claim that using 4-bit weights and activations can achieve almost the same performance as 32-bit. However, we report better results when using fewer bits, such as 2-bit weights and activations. The LSTM with 2-bit weights and 3-bit activations achieves 146 PPW, which outperforms its counterpart in Hubara et al. (2016a) by a large margin.

We also perform experiments in which weights are binarized. The models can converge, though with large performance degradations.

4.2 Experiments on IMDB datasets

We conduct further experiments on sentence classification using the IMDB dataset (Maas et al., 2011). We pad or truncate each sentence to 500 words, and use word embedding vectors of length 512 and a single recurrent layer with 512 hidden neurons. All models are trained using ADAM

(Kingma & Ba, 2014) learning rule with learning rate .

Model  weight-bits  activation-bits  Accuracy (balanced)  Accuracy (unbalanced)
GRU    1            2                0.8684               diverge
GRU    2            2                0.8708               0.86056
GRU    4            4                0.88132              0.88248
GRU    32           32               -                    0.90537
LSTM   1            2                0.87888              diverge
LSTM   2            2                0.8812               0.83971
LSTM   4            4                0.88476              0.86788
LSTM   32           32               -                    0.89541
Table 2: Quantized RNNs on IMDB sentence classification

As IMDB is a fairly simple dataset, we observe little performance degradation even when quantizing to 1-bit weights and 2-bit activations.

4.3 Effects of Balanced Distribution

The above experiments show that balanced quantization leads to better results than its unbalanced counterpart, especially when quantizing to 2-bit weights. However, for 4-bit weights, there is no clear gap between scaling by mean and scaling by max (i.e. balanced and unbalanced quantization), indicating that more effective methods for quantizing to 4-bit remain to be discovered.

5 Conclusion and Future Work

We have proposed methods for effective quantization of RNNs. By using a carefully designed structure and a balanced quantization method, we have matched or surpassed the previous state of the art in prediction accuracy, especially when quantizing to 2-bit weights.

The balanced quantization method for weights that we propose can induce a balanced distribution of quantized weight values to maximize the utilization of the parameter space. The method may also be applied to quantization of CNNs.

As future work, first, a method to induce balanced weight quantization when the bit-width is more than 2 remains to be found. Second, we have observed some difficulties in quantizing the cell paths in LSTM, which produce unbounded values. One possible way to address this problem is to introduce novel scaling schemes for quantizing activations that can deal with unbounded values. Finally, as we have observed that GRU and LSTM have different properties in quantization, it remains to be shown whether there exist more efficient recurrent structures designed specifically to facilitate quantization.