bitrnn
Quantize weights and activations in Recurrent Neural Networks.
Reducing the bitwidths of weights, activations, and gradients of a neural network can shrink its storage size and memory usage, and also allow for faster training and inference by exploiting bitwise operations. However, previous attempts at quantizing RNNs show considerable performance degradation when using low-bitwidth weights and activations. In this paper, we propose methods to quantize the structure of gates and interlinks in LSTM and GRU cells. In addition, we propose balanced quantization methods for weights to further reduce performance degradation. Experiments on the PTB and IMDB datasets confirm the effectiveness of our methods, as the performance of our models matches or surpasses the previous state of the art for quantized RNNs.
Deep Neural Networks (DNNs) have become important tools for modeling nonlinear functions in applications like computer vision (Krizhevsky et al., 2012), speech recognition (Hinton et al., 2012a) (Bahdanau et al., 2014), and computer games (Silver et al., 2016).

However, inference and training of a DNN may involve up to billions of operations for inputs like images (Krizhevsky et al., 2012; Szegedy et al., 2014). A DNN may also have a large number of parameters, leading to large storage size and runtime memory usage. Such intensive resource requirements impede the adoption of DNNs in applications requiring real-time responses, especially on resource-limited platforms. To alleviate these requirements, many methods have been proposed, from both hardware and software perspectives (Farabet et al., 2011; Pham et al., 2012; Chen et al., 2014a, b). For example, constraints may be imposed on the weights of a DNN, like sparsity (Han et al., 2015b, a), circulant matrices (Cheng et al., 2015), low rank (Jaderberg et al., 2014; Zhang et al., 2015), vector quantization (Gong et al., 2014), and the hashing trick (Chen et al., 2015), to reduce the number of free parameters and the computational complexity. However, these methods use high-bitwidth numbers for computations, which requires the availability of high-precision multiply-and-add instructions.

Another line of research tries to reduce the bitwidth of weights and activations of a DNN by quantization to low-bitwidth numbers (Rastegari et al., 2016; Hubara et al., 2016b; Zhou et al., 2016; Hubara et al., 2016a). Reducing the bitwidth of the weights of a 32-bit model to $k$ can shrink the storage size of the model to $k/32$ of the original size. Similarly, reducing the bitwidth of activations to $k$ can shrink the runtime memory usage by the same proportion. In addition, when the underlying platform supports the efficient bitwise operations $\operatorname{and}$ and $\operatorname{bitcount}$, where $\operatorname{bitcount}$ counts the number of set bits in a bit vector, we can compute the inner product between bit vectors by the following formula:
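$$\mathbf{x} \cdot \mathbf{y} = \operatorname{bitcount}(\operatorname{and}(\mathbf{x}, \mathbf{y})), \quad \forall i,\ x_i, y_i \in \{0, 1\}. \tag{1}$$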
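As a minimal sketch of the bitwise inner product, the following Python snippet packs two 0/1 vectors into integers and compares the bit-count result with the ordinary dot product. The helper names `pack_bits` and `binary_dot` are illustrative and are not taken from the released code.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one Python int (bit i holds bits[i])."""
    word = 0
    for i, b in enumerate(bits):
        if b not in (0, 1):
            raise ValueError("entries must be 0 or 1")
        word |= b << i
    return word


def binary_dot(x_word, y_word):
    """Inner product of two packed bit vectors: bitcount(and(x, y))."""
    return bin(x_word & y_word).count("1")


x = [1, 0, 1, 1, 0, 1]
y = [1, 1, 0, 1, 0, 1]

dot_bitwise = binary_dot(pack_bits(x), pack_bits(y))
dot_reference = sum(a * b for a, b in zip(x, y))
```

On hardware, the `and` and bit-count steps each map to a single instruction per machine word, which is where the speed-up comes from.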
Consequently, convolutions between low-bitwidth numbers can be considerably accelerated on platforms supporting efficient execution of bitwise operations, including CPU, GPU, FPGA and ASIC. Previous work shows that using only 1-bit weights and 2-bit activations can achieve 51% top-1 accuracy on the ImageNet dataset (Hubara et al., 2016a).

However, in contrast to the extensive study of compression and quantization of convolutional neural networks, little attention has been paid to reducing the computational resource requirements of RNNs. Ott et al. (2016) claim that the weight binarization method does not work with RNNs, and introduce weight ternarization while leaving activations as floating point numbers. Hubara et al. (2016a) experiment with different combinations of bitwidths for weights and activations, and show that 4-bit quantized CNNs and RNNs can achieve accuracy comparable to their 32-bit counterparts. However, large performance degradation occurs when quantizing weights and activations to 2-bit numbers. Though Hubara et al. (2016a) have open-sourced their quantized CNN, neither of the two works open-sources its quantized RNNs.

This paper makes the following contributions:
We outline a detailed design for quantizing two popular types of RNN cells: LSTM and GRU. We evaluate our model on different sets of bitwidth configurations and two NLP tasks: Penn Treebank and IMDB. We demonstrate that with our design, quantization with 4-bit weights and activations can achieve almost the same performance as 32-bit. In addition, we have significantly better results when quantizing to lower bitwidths.
We propose methods to quantize weights deterministically and adaptively to balanced distributions, especially when weights are 2-bit numbers. The balanced distribution of quantized weights leads to better utilization of the parameter space and consequently increases the prediction accuracy. We explicitly induce the balanced distribution by introducing parameter-dependent thresholds into the quantization process during training.
We release code for training our quantized RNNs online (https://github.com/hqythu/bitrnn). The code is implemented in the TensorFlow (Abadi et al., 2015) framework.

In this section we outline several quantization methods. W.l.o.g., we assume the input to the quantization is a matrix $X$ unless otherwise specified. When all entries of $X$ are in the closed interval $[0, 1]$, we define the $k$-bit uniform quantization $Q_k$ as follows:
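$$Q_k(X) = \frac{\operatorname{round}\left((2^k - 1) X\right)}{2^k - 1}, \quad 0 \le x_{ij} \le 1 \ \ \forall i, j. \tag{2}$$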
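The uniform quantizer can be sketched in a few lines of Python. This is an illustrative scalar version, assuming rounding of halves upward; the function name `quantize_k` is ours.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar x in [0, 1]."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    levels = 2 ** k - 1
    # floor(v + 0.5) rounds halves up; Python's round() rounds halves to even.
    return math.floor(x * levels + 0.5) / levels


values = [0.0, 0.1, 0.4, 0.6, 0.9, 1.0]
q2 = [quantize_k(v, 2) for v in values]  # 2-bit: levels 0, 1/3, 2/3, 1
```

With $k = 2$ every input is mapped to one of the four levels $0, 1/3, 2/3, 1$.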
However, the derivative of this quantization function equals zero almost everywhere. We adopt the “straight-through estimator” (STE) method (Hinton et al., 2012b; Bengio et al., 2013) to circumvent this problem.

Using the above quantization method together with STE leads to the following update rule during forward and backward propagation of neural networks, where $c$ denotes the objective function:

Forward: $q \leftarrow Q_k(p)$
Backward: $\frac{\partial c}{\partial p} \leftarrow \frac{\partial c}{\partial q}$
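The following sketch illustrates the straight-through estimator on a toy problem. It assumes a squared-error objective and a plain gradient step, both chosen only for illustration; the names `ste_forward` and `ste_backward` are hypothetical.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar x in [0, 1]."""
    levels = 2 ** k - 1
    return math.floor(x * levels + 0.5) / levels


def ste_forward(p, k):
    """Forward pass: q <- Q_k(p)."""
    return [quantize_k(v, k) for v in p]


def ste_backward(grad_q):
    """Backward pass: pass dc/dq through unchanged as dc/dp."""
    return list(grad_q)


p = [0.2, 0.7]          # full-precision parameters in [0, 1]
target = [1.0, 0.0]     # toy target, an assumption for this example

q = ste_forward(p, k=2)
# Gradient of c = 0.5 * sum((q - target)^2) with respect to q.
grad_q = [qi - ti for qi, ti in zip(q, target)]
grad_p = ste_backward(grad_q)

lr = 0.1
# The update is applied to the full-precision copy, kept inside [0, 1].
p_new = [min(max(pi - lr * gi, 0.0), 1.0) for pi, gi in zip(p, grad_p)]
```

Without STE the true gradient with respect to `p` would be zero here, so the parameters would never move.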
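When entries in $X$ are not constrained to the closed interval $[0, 1]$, an affine transform needs to be applied before using the function $Q_k$. A straightforward transformation can be done using the minimum and maximum of $X$ to get $\tilde{X}$, the standardized version of $X$:

$$\tilde{X} = \frac{X - \min(X)}{\max(X) - \min(X)}.$$

After quantization, we can apply a reverse affine transform to approximate the original values. Overall, the quantized result is:

$$Q_{\text{det}}(X) = \left(\max(X) - \min(X)\right) Q_k(\tilde{X}) + \min(X) \approx X.$$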
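A minimal sketch of quantization with the min/max affine transform is given below. The function name `quantize_minmax` and the sample values are illustrative assumptions.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar x in [0, 1]."""
    levels = 2 ** k - 1
    return math.floor(x * levels + 0.5) / levels


def quantize_minmax(xs, k):
    """Standardize with min/max, quantize, then map back to the original range."""
    lo, hi = min(xs), max(xs)
    scale = hi - lo
    if scale == 0.0:
        return list(xs)  # all entries equal: nothing to quantize
    return [scale * quantize_k((x - lo) / scale, k) + lo for x in xs]


xs = [-2.0, -0.5, 0.0, 1.0, 4.0]
xq = quantize_minmax(xs, 2)  # levels are -2, 0, 2, 4
```

Note that a single extreme value (here 4.0) stretches the grid, so most entries collapse onto the middle levels.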
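When we quantize values, it may be desirable for the quantized values to have balanced distributions, so as to take full advantage of the available parameter space. Ordinarily, this is not possible as the distribution of the input values has already been fixed. In particular, using $Q_{\text{det}}$ does not exert any impact on the distribution of quantized values.

Next we show that we can induce more uniform distributions of quantized values by introducing parameter-dependent adaptive thresholds $\operatorname{median}(|X|)$. We first introduce a different standardization transform that produces $\hat{X}$, and define a balanced quantization method $Q_{\text{bal}}$ as follows:

$$\hat{X} = \operatorname{clip}\left(\frac{X}{3 \operatorname{median}(|X|)}, -\frac{1}{2}, \frac{1}{2}\right) + \frac{1}{2}, \qquad Q_{\text{bal}}(X) = 3 \operatorname{median}(|X|) \left(Q_k(\hat{X}) - \frac{1}{2}\right). \tag{3}$$

The only difference between $Q_{\text{det}}$ and $Q_{\text{bal}}$ lies in the standardization. In fact, when the extremal values of $X$ are symmetric around zero, i.e.

$$\max(X) = -\min(X) = \max(|X|),$$

we may rewrite $Q_{\text{det}}$ equivalently as follows to make the similarity between $Q_{\text{det}}$ and $Q_{\text{bal}}$ more obvious:

$$\tilde{X} = \frac{X}{2 \max(|X|)} + \frac{1}{2}, \qquad Q_{\text{det}}(X) = 2 \max(|X|) \left(Q_k(\tilde{X}) - \frac{1}{2}\right).$$

Hence the only difference between $Q_{\text{det}}$ and $Q_{\text{bal}}$ lies in the difference between the properties of $\max(|X|)$ and $\operatorname{median}(|X|)$. We find that, as $\operatorname{median}(|X|)$ is an order statistic, using it as a threshold produces an auto-balancing effect.

We consider the case when the bitwidth is 2 as an example. In this case, under the symmetric distribution assumption, we can prove the auto-balancing effect of $Q_{\text{bal}}$.

If $k = 2$, and suppose the entries of $X$ are symmetrically distributed around zero and there are no two entries in $X$ that are equal, then the four bars in the histogram of $Q_{\text{bal}}(X)$ will all have exactly the same height.

By Formula 3, entries of $Q_k(\hat{X})$ will be equal to 1 if the corresponding entries in $X$ are above $\operatorname{median}(|X|)$, equal to $\frac{2}{3}$ if between $0$ and $\operatorname{median}(|X|)$, equal to $\frac{1}{3}$ if between $-\operatorname{median}(|X|)$ and $0$, and equal to $0$ if below $-\operatorname{median}(|X|)$. When $k = 2$ and the entries of $X$ are symmetrically distributed around zero, the values in $X$ will be thresholded by $-\operatorname{median}(|X|)$, $0$, and $\operatorname{median}(|X|)$ into four bins. By the property of the median and the symmetric distribution assumption, the four bins will contain the same number of quantized values. ∎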
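The auto-balancing effect for $k = 2$ can be checked numerically. The sketch below assumes the thresholds sit at $\pm\operatorname{median}(|X|)$ and $0$, which is implemented here by standardizing with a scale of $3 \operatorname{median}(|X|)$; this scale and the helper names are assumptions made for the illustration.

```python
import math
from collections import Counter
from statistics import median


def level_index(x, k):
    """Index (0 .. 2^k - 1) of the k-bit level nearest to x in [0, 1]."""
    return math.floor(x * (2 ** k - 1) + 0.5)


def minmax_levels(xs, k):
    """Level indices under min/max standardization."""
    lo, hi = min(xs), max(xs)
    return [level_index((x - lo) / (hi - lo), k) for x in xs]


def balanced_levels(xs, k, gamma=3.0):
    """Level indices when standardizing by gamma * median(|x|) with clipping."""
    scale = gamma * median(abs(x) for x in xs)
    return [level_index(min(max(x / scale, -0.5), 0.5) + 0.5, k) for x in xs]


# Symmetric sample with distinct entries and two extreme values.
xs = [-3.0, -0.7, -0.2, -0.1, 0.1, 0.2, 0.7, 3.0]

minmax_hist = Counter(minmax_levels(xs, 2))
balanced_hist = Counter(balanced_levels(xs, 2))
```

For this sample the min/max variant fills the four levels with 1, 3, 3 and 1 entries, while the median-based variant places exactly 2 entries on each level.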
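In practice, computing $\operatorname{median}(|X|)$ may not be computationally convenient as it requires sorting. We note that when a distribution has bounded variance $\sigma^2$, the mean $\mu$ approximates the median $m$, as there is an inequality bounding the difference (Mallows, 1991):

$$|\mu - m| \le \sigma.$$

Hence we may use $\operatorname{mean}(|X|)$ instead of $\operatorname{median}(|X|)$ in the quantization. Though this introduces some error, empirically we still observe a nearly balanced distribution.

If we further assume the weights follow a zero-mean normal distribution $N(0, \sigma^2)$, then $|X|$ follows a half-normal distribution. By simple calculations we have:

$$\operatorname{mean}(|X|) = \sigma \sqrt{\frac{2}{\pi}} \approx 0.7979\,\sigma$$

and

$$\operatorname{median}(|X|) = \sigma \sqrt{2} \operatorname{erf}^{-1}\left(\frac{1}{2}\right) \approx 0.6745\,\sigma.$$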
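Putting all these things together we have the balanced deterministic quantization method:

$$Q_{\text{bal}}(X) = \gamma \operatorname{mean}(|X|) \left( Q_k\left( \operatorname{clip}\left( \frac{X}{\gamma \operatorname{mean}(|X|)}, -\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} \right) - \frac{1}{2} \right), \tag{4}$$

where a natural choice of the scaling factor $\gamma$ would be 3 or 2.5 (rounding 2.5359 to a short binary number) under different assumptions. In our following experiments, we adopt 2.5 as the scaling factor.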
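The value of the scaling factor can be reproduced approximately with the standard library. The sketch assumes the factor arises as $3 \operatorname{median}(|X|) / \operatorname{mean}(|X|)$ for zero-mean normal weights, which is our reading of the argument above rather than a derivation given explicitly in the text.

```python
import math
from statistics import NormalDist

sigma = 1.0

# For w ~ N(0, sigma^2), the median of |w| is the 75th percentile of w.
median_abs = NormalDist(0.0, sigma).inv_cdf(0.75)

# Mean of the half-normal distribution.
mean_abs = sigma * math.sqrt(2.0 / math.pi)

# Scale that keeps the 2-bit thresholds at +/- median(|w|) when mean(|w|) is used.
gamma = 3.0 * median_abs / mean_abs
```

The result is about 2.536, in line with the 2.5359 quoted above up to rounding, and 2.5 is the nearest value with a short binary representation.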
Although the above argument for balanced quantization applies only to 2-bit quantization, we argue that larger bitwidths also benefit from preventing extreme values from extending the value range and thus increasing the rounding error. It should be noted that for 1-bit quantization (binarization), the scaling factor should be $\operatorname{mean}(|X|)$, which can be proved to be optimal in the sense of reconstruction error measured by the Frobenius norm, as in (Rastegari et al., 2016). However, the proof relies on the constant norm property of 1-bit representations, and does not generalize to other bitwidths.
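Weights in neural networks are often observed to have a bell-shaped distribution around zero, similar to a normal distribution. Hence we can assume the weights $W$ to have a symmetric distribution around 0, and apply the above equation for balanced quantization as

$$Q_{\text{bal}}(W) = 2.5 \operatorname{mean}(|W|) \left( Q_k\left( \operatorname{clip}\left( \frac{W}{2.5 \operatorname{mean}(|W|)}, -\frac{1}{2}, \frac{1}{2} \right) + \frac{1}{2} \right) - \frac{1}{2} \right).$$

To include the quantization in the computation graph of a neural network, we apply STE to the entire expression rather than only to $Q_k$ itself.

Forward: $q \leftarrow Q_{\text{bal}}(p)$
Backward: $\frac{\partial c}{\partial p} \leftarrow \frac{\partial c}{\partial q}$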
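A forward-pass sketch of balanced weight quantization with the mean-based scale follows. It is a plain Python illustration, not the released TensorFlow implementation; the names `balanced_scale` and `balanced_quantize` and the sample weights are assumptions.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar x in [0, 1]."""
    levels = 2 ** k - 1
    return math.floor(x * levels + 0.5) / levels


def balanced_scale(weights, gamma=2.5):
    """Scale gamma * mean(|w|), treated as a constant during backpropagation."""
    return gamma * sum(abs(w) for w in weights) / len(weights)


def balanced_quantize(weights, k, gamma=2.5):
    """Clip to [-scale/2, scale/2], quantize on a k-bit grid, and rescale."""
    scale = balanced_scale(weights, gamma)
    if scale == 0.0:
        return [0.0] * len(weights)
    out = []
    for w in weights:
        shifted = min(max(w / scale, -0.5), 0.5) + 0.5
        out.append(scale * (quantize_k(shifted, k) - 0.5))
    return out


weights = [-0.9, -0.3, -0.1, 0.05, 0.2, 0.45]
scale = balanced_scale(weights)
wq = balanced_quantize(weights, k=2)
```

With 2 bits the quantized weights take the four values $\pm \frac{1}{2}$ and $\pm \frac{1}{6}$ times the scale, and the extreme weight $-0.9$ is clipped instead of stretching the grid.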
A peculiarity of the balanced quantization method is that, in general, it distorts the extremal values due to the clipping in Formula 4, and these values generally contribute more to the computed sums of inner products. However, when the values to be quantized are weights of neural networks and we introduce the balanced quantization into the training process, we conjecture that the neural networks may gradually adapt to the distortions, so that the distributions of weights may be induced to be more balanced. The more balanced distribution will increase the effective bitwidth of neural networks, leading to better prediction accuracy. We will empirically validate this conjecture through experiments in Section 4.
Quantization of activations follows the method in Zhou et al. (2016), assuming the output of the previous layer has passed through a bounded activation function, so that we can apply the quantization $Q_k$ directly to it. In fact, we find that adding a scaling term containing the mean or max value to the activations may harm prediction accuracy.

There is a design choice on what the range of quantized values should be. One choice is a symmetric distribution around 0. Under this choice, inputs are bounded by the activation function to $[-0.5, 0.5]$, shifted to the right by 0.5 before being fed into $Q_k$, and then shifted back.

Another choice is a value range of $[0, 1]$, which is closer to the value range of the ReLU activation function. Under this choice, we can directly apply $Q_k$. For the commonly used $\tanh$ activation with value range $[-1, 1]$ in RNNs, it seems natural to use symmetric quantization. However, we will point out some considerations for using quantization to the range $[0, 1]$ in Section 3.

In this section, we detail our design considerations for quantization of recurrent neural networks. Different from plain feed-forward neural networks, recurrent neural networks, especially Long Short Term Memory (Hochreiter & Schmidhuber, 1997) and Gated Recurrent Unit (Chung et al., 2014), have subtle and delicately designed structures, which make their quantization more complex and in need of more careful consideration. Nevertheless, the major algorithm is the same as Algorithm 1 in Zhou et al. (2016).

It is well known that, as fully-connected (FC) layers have a large number of parameters, they are prone to overfitting (Srivastava et al., 2014). There are several FC-like structures in an RNN, for example the input, output and transition matrices in RNN cells (like GRU and LSTM) and the final FC layer for softmax classification. The dropout technique, which randomly sets a portion of features to 0 at training time, turns out to be also an effective way of alleviating overfitting in RNNs (Zaremba et al., 2014).
As dropped activations are zero, it is necessary to have zero among the quantized values. For symmetric quantization to the range $[-0.5, 0.5]$, $0$ does not exist in the range of $Q_k$. Hence we use $[0, 1]$ as the range of quantized values when dropout is needed.
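The absence of zero in the symmetric grid is easy to verify by listing the quantization levels. The sketch below assumes the symmetric scheme is the $[0, 1]$ grid shifted by $-0.5$; the helper names are illustrative.

```python
def unit_levels(k):
    """The 2^k levels of k-bit uniform quantization on [0, 1]."""
    n = 2 ** k - 1
    return [i / n for i in range(n + 1)]


def symmetric_levels(k):
    """The same grid shifted to [-0.5, 0.5]."""
    return [v - 0.5 for v in unit_levels(k)]


bitwidths = (1, 2, 3, 4)
has_zero_unit = {k: 0.0 in unit_levels(k) for k in bitwidths}
has_zero_symmetric = {
    k: any(abs(v) < 1e-12 for v in symmetric_levels(k)) for k in bitwidths
}
```

Because the shifted grid has an even number of levels placed symmetrically around zero, a dropped (zero) activation cannot be represented exactly, whereas $0$ is always the lowest level of the $[0, 1]$ grid.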
In tasks related to Natural Language Processing, the input words, which are represented by IDs, are embedded into a low-dimensional space before being fed into RNNs. The word embedding matrix is in $\mathbb{R}^{|V| \times N}$, where $|V|$ is the size of the vocabulary and $N$ is the length of the embedded vectors.
Quantization of weights in embedding layers turns out to be different from quantization of weights in FC layers. In fact, the weights of embedding layers behave like activations: a certain row is selected and fed to the next layer, so the quantization method should be the same as that of activations rather than that of weights. Similarly, as dropout may also be applied to the outputs of embedding layers, it is necessary to bound the values in embedding matrices to $[0, 1]$.
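To clip the value range of weights of embedding layers, a natural choice would be using the sigmoid function $\sigma$ such that $\sigma(W)$ will be used as parameters of embedding layers, but we observe a severe vanishing gradient problem for the gradients $\partial c / \partial W$ in the training process. Hence instead, we directly apply a $\operatorname{clip}$ function $\operatorname{clip}(W, 0, 1)$, and randomly initialize the embedding matrices with values drawn from the uniform distribution $U(0, 1)$. These two measures are found to improve the performance of the model.

We first investigate quantization of GRU as it is structurally simpler. The basic structure of a GRU cell may be described as follows:

$$\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]), \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]), \\
\tilde{h}_t &= \tanh(W \cdot [r_t \circ h_{t-1}, x_t]), \\
h_t &= (1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t,
\end{aligned}$$

where $\sigma$ stands for the sigmoid function and $\circ$ denotes element-wise multiplication.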
Recall that to benefit from the speed advantage of bit convolution kernels, both matrix inputs of a multiplication need to be in low-bitwidth form, so that the dot product can be calculated by bitwise operations. For plain feed-forward neural networks, as the convolutions take up most of the computation time, we can get decent acceleration by quantizing the inputs of convolutions and their weights. But when it comes to more complex structures like GRU, we need to check the bitwidth of each interlink.
Apart from the matrix multiplications needed to compute $z_t$, $r_t$ and $\tilde{h}_t$, the gate structure of $\tilde{h}_t$ and $h_t$ brings in the need for element-wise multiplication. As the output of the sigmoid function may have a large bitwidth, the element-wise multiplication may need to be done in floating point numbers (or in a higher fixed-point format). As the results of these element-wise multiplications, $r_t \circ h_{t-1}$ and $h_t$, are also inputs to subsequent computations (the latter at the next timestamp), and noting that a quantized value multiplied by a quantized value will have a larger bitwidth, we need to insert additional quantization steps after the element-wise multiplications.
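Another problem with quantization of the GRU structure lies in the different value ranges of the gates. The range of $\tanh$ is $[-1, 1]$, which is different from the value range $[0, 1]$ of $z_t$ and $r_t$. If we want to preserve the original activation functions, we will have the following quantization scheme:

$$\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]), \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]), \\
\tilde{h}_t &= \tanh\left(W \cdot \left[2 Q_k\left(\tfrac{1}{2}(r_t \circ h_{t-1}) + \tfrac{1}{2}\right) - 1,\ x_t\right]\right), \\
h_t &= 2 Q_k\left(\tfrac{1}{2}\left((1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t\right) + \tfrac{1}{2}\right) - 1,
\end{aligned}$$

where we assume the weights $W_z$, $W_r$, $W$ have already been quantized to $[-1, 1]$, and the input $x_t$ has already been quantized to $[-1, 1]$.

However, we note that the quantization function already has an affine transform to shift the value range. To simplify the implementation, we replace the activation function of $\tilde{h}_t$ with the sigmoid function, so that $(1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t \in [0, 1]$.

Summarizing the above considerations, the quantized version of GRU could be written as

$$\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]), \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]), \\
\tilde{h}_t &= \sigma(W \cdot [Q_k(r_t \circ h_{t-1}), x_t]), \\
h_t &= Q_k\left((1 - z_t) \circ h_{t-1} + z_t \circ \tilde{h}_t\right),
\end{aligned}$$

where we assume the weights $W_z$, $W_r$, $W$ have already been quantized to $[-1, 1]$, and the input $x_t$ has already been quantized to $[0, 1]$.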
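The considerations above can be sketched as a single GRU step in plain Python. This is an illustration under stated assumptions, not the released implementation: the candidate state uses the sigmoid instead of $\tanh$, a quantization step follows each element-wise product that feeds a later matrix product, biases are omitted, and the weight values and function names are hypothetical.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar, clamped to [0, 1]."""
    levels = 2 ** k - 1
    x = min(max(x, 0.0), 1.0)
    return math.floor(x * levels + 0.5) / levels


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def quantized_gru_step(W_z, W_r, W_h, h_prev, x, k):
    """One step; h_prev and x are lists of values on the k-bit grid in [0, 1]."""
    hx = h_prev + x  # list concatenation [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, hx)]
    r = [sigmoid(a) for a in matvec(W_r, hx)]
    # Re-quantize the gated state before it enters the next matrix product.
    rh = [quantize_k(ri * hi, k) for ri, hi in zip(r, h_prev)]
    h_cand = [sigmoid(a) for a in matvec(W_h, rh + x)]
    # Re-quantize the new hidden state, which feeds the next time step.
    return [quantize_k((1.0 - zi) * hi + zi * ci, k)
            for zi, hi, ci in zip(z, h_prev, h_cand)]


# Hypothetical 2-bit weights in [-1, 1]; hidden size 2, input size 1.
W_z = [[1.0, -1 / 3, 1 / 3], [-1.0, 1 / 3, 1.0]]
W_r = [[1 / 3, 1.0, -1.0], [-1 / 3, -1.0, 1 / 3]]
W_h = [[-1.0, 1.0, 1 / 3], [1.0, 1 / 3, -1 / 3]]

h = [0.0, 0.0]
for x_t in ([1 / 3], [1.0], [2 / 3]):
    h = quantized_gru_step(W_z, W_r, W_h, h, x_t, k=2)
```

Every operand of a matrix product (`hx`, `rh + x`) stays on the low-bitwidth grid, while the gates `z` and `r` are only used in element-wise operations and can remain in higher precision.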
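The structure of LSTM can be described as follows:

$$\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C), \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t, \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \\
h_t &= o_t \circ \tanh(C_t).
\end{aligned}$$

Different from GRU, $C_t$ cannot be easily quantized, since its value is unbounded: it is not passed through an activation function like $\tanh$ or the sigmoid function. This difficulty comes from the structure design and cannot be alleviated without introducing an extra facility to clip value ranges. But it can be noted that the computations involving $C_t$ are all element-wise multiplications and additions, which may take much less time than computing matrix products. For this reason, we leave $C_t$ in floating point form.

To simplify the implementation, the $\tanh$ activation for the output may be changed to the sigmoid function.

Summarizing the above changes, the formula for quantized LSTM can be:

$$\begin{aligned}
f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f), \\
i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i), \\
\tilde{C}_t &= \tanh(W_C \cdot [h_{t-1}, x_t] + b_C), \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tilde{C}_t, \\
o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o), \\
h_t &= Q_k(o_t \circ \sigma(C_t)),
\end{aligned}$$

where we assume the weights $W_f$, $W_i$, $W_C$, $W_o$ have already been quantized to $[-1, 1]$, and the input $x_t$ has already been quantized to $[0, 1]$.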
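A corresponding sketch for one LSTM step is given below. It keeps the cell state in floating point and quantizes only the hidden state, following the description above; biases are omitted, and the weight values and function names are hypothetical.

```python
import math


def quantize_k(x, k):
    """k-bit uniform quantization of a scalar, clamped to [0, 1]."""
    levels = 2 ** k - 1
    x = min(max(x, 0.0), 1.0)
    return math.floor(x * levels + 0.5) / levels


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def quantized_lstm_step(W_f, W_i, W_c, W_o, h_prev, c_prev, x, k):
    """One step; h_prev and x are on the k-bit grid, c_prev is floating point."""
    hx = h_prev + x  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in matvec(W_f, hx)]
    i = [sigmoid(a) for a in matvec(W_i, hx)]
    c_cand = [math.tanh(a) for a in matvec(W_c, hx)]
    o = [sigmoid(a) for a in matvec(W_o, hx)]
    # The cell state is unbounded and is left unquantized.
    c = [fi * cp + ii * cc for fi, cp, ii, cc in zip(f, c_prev, i, c_cand)]
    # Sigmoid instead of tanh keeps the output in [0, 1] before quantization.
    h = [quantize_k(oi * sigmoid(ci), k) for oi, ci in zip(o, c)]
    return h, c


# Hypothetical 2-bit weights in [-1, 1]; hidden size 2, input size 1.
W_f = [[1.0, 1 / 3, -1 / 3], [-1 / 3, 1.0, 1 / 3]]
W_i = [[-1.0, 1 / 3, 1.0], [1 / 3, -1 / 3, 1.0]]
W_c = [[1 / 3, -1.0, 1.0], [1.0, 1.0, -1 / 3]]
W_o = [[1.0, -1 / 3, 1 / 3], [-1.0, 1 / 3, 1.0]]

h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in ([1 / 3], [1.0], [2 / 3]):
    h, c = quantized_lstm_step(W_f, W_i, W_c, W_o, h, c, x_t, k=2)
```

Only `h` enters the matrix products of the next step, so leaving `c` in floating point affects the cheap element-wise operations only.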
We evaluate the quantized RNN models on two tasks: language modeling and sentence classification.
For language modeling we use the Penn Treebank dataset (Taylor et al., 2003), which contains 10K unique words. We download the data from Tomas Mikolov’s webpage (http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz). For fair comparison, in the following experiments, our models all use one hidden layer with 300 hidden units, which is the same setting as Hubara et al. (2016a). A word embedding layer is used at the input side of the network, whose weights are trained from scratch. The performance is measured in perplexity per word (PPW).
During experiments we find that the magnitudes of values in dense matrices or fully connected layers explode when using small bitwidths, resulting in overfitting and divergence. This can be alleviated by adding a bounding function to constrain the value ranges or by adding weight decay for regularization.
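| Model | Weight bits | Activation bits | PPW (balanced) | PPW (unbalanced) |
|---|---|---|---|---|
| GRU | 1 | 2 | 285 | diverge |
| GRU | 1 | 32 | 178 | diverge |
| GRU | 2 | 2 | 150 | 165 |
| GRU | 2 | 3 | 128 | 141 |
| GRU | 3 | 3 | 109 | 110 |
| GRU | 4 | 4 | 104 | 102 |
| GRU | 32 | 32 | - | 100 |
| LSTM | 1 | 2 | 257 | diverge |
| LSTM | 1 | 32 | 198 | diverge |
| LSTM | 2 | 2 | 152 | 164 |
| LSTM | 2 | 3 | 142 | 155 |
| LSTM | 3 | 3 | 120 | 122 |
| LSTM | 4 | 4 | 114 | 114 |
| LSTM | 32 | 32 | - | 109 |
| LSTM (Hubara et al., 2016a) | 2 | 3 | - | 220 |
| LSTM (Hubara et al., 2016a) | 4 | 4 | - | 100 |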
Our result is in agreement with Hubara et al. (2016a), who claim that using 4-bit weights and activations can achieve almost the same performance as 32-bit. However, we report higher accuracy when using fewer bits, such as 2-bit weights and activations. The LSTM with 2-bit weights and 3-bit activations achieves 146 PPW, which outperforms its counterpart in Hubara et al. (2016a) by a large margin.
We also perform experiments in which weights are binarized. The models can converge, though with large performance degradation.
We do further experiments on sentence classification using the IMDB dataset (Maas et al., 2011). We pad or cut each sentence to 500 words, and use word embedding vectors of length 512 and a single recurrent layer with 512 hidden neurons. All models are trained using the ADAM (Kingma & Ba, 2014) learning rule with learning rate .

| Model | Weight bits | Activation bits | Accuracy (balanced) | Accuracy (unbalanced) |
|---|---|---|---|---|
| GRU | 1 | 2 | 0.8684 | diverge |
| GRU | 2 | 2 | 0.8708 | 0.86056 |
| GRU | 4 | 4 | 0.88132 | 0.88248 |
| GRU | 32 | 32 | - | 0.90537 |
| LSTM | 1 | 2 | 0.87888 | diverge |
| LSTM | 2 | 2 | 0.8812 | 0.83971 |
| LSTM | 4 | 4 | 0.88476 | 0.86788 |
| LSTM | 32 | 32 | - | 0.89541 |
As IMDB is a fairly simple dataset, we observe little performance degradation even when quantizing to 1-bit weights and 2-bit activations.
All the above experiments show that balanced quantization leads to better results than its unbalanced counterpart, especially when quantizing to 2-bit weights. However, for 4-bit weights, there is no clear gap between scaling by mean and scaling by max (i.e. balanced and unbalanced quantization), indicating that more effective methods for quantizing to 4 bits need to be discovered.
We have proposed methods for effective quantization of RNNs. By using a carefully designed structure and a balanced quantization method, we have matched or surpassed the previous state of the art in prediction accuracy, especially when quantizing to 2-bit weights.
The balanced quantization method for weights that we propose can induce a balanced distribution of quantized weight values to maximize the utilization of the parameter space. The method may also be applied to quantization of CNNs.
As future work, first, a method to induce balanced weight quantization when the bitwidth is more than 2 remains to be found. Second, we have observed some difficulties in quantizing the cell paths in LSTM, which produce unbounded values. One possible way to address this problem is to introduce novel scaling schemes for quantizing activations that can deal with unbounded values. Finally, as we have observed that GRU and LSTM have different properties in quantization, it remains to be shown whether there exist more efficient recurrent structures designed specifically to facilitate quantization.
Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Maas et al. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.