Recurrent Neural Networks With Limited Numerical Precision
Recurrent Neural Networks (RNNs) produce state-of-the-art performance on many machine learning tasks, but their demands on memory and computational power are often high. There is therefore great interest in optimizing the computations performed with these models, especially when considering the development of specialized low-power hardware for deep networks. One way of reducing the computational needs is to limit the numerical precision of the network weights and biases. This has led to different proposed rounding methods, which have so far been applied only to Convolutional Neural Networks and Fully-Connected Networks. This paper addresses the question of how best to reduce weight precision during training in the case of RNNs. We present results from the use of different stochastic and deterministic reduced-precision training methods applied to three major RNN types, which are then tested on several datasets. The results show that the weight binarization methods do not work with the RNNs. However, the stochastic and deterministic ternarization and pow2-ternarization methods gave rise to low-precision RNNs that produce similar and even higher accuracy on certain datasets, therefore providing a path towards training more efficient implementations of RNNs in specialized hardware.
A Recurrent Neural Network (RNN) is a specific type of neural network which is able to process input and output sequences of variable length. Because of this, RNNs are suitable for sequence modeling. Various RNN architectures have been proposed in recent years, based on different forms of non-linearity, such as the Gated Recurrent Unit (GRU) (Cho et al., 2014) and Long Short-Term Memory (LSTM) (Hochreiter et al., 1997). They have enabled new levels of performance in many tasks such as speech recognition (Amodei et al., 2015; Chan et al., 2015), machine translation (Devlin et al., 2014; Chung et al., 2016; Sutskever et al., 2014), or even video games (Mnih et al., 2015) and Go (Silver et al., 2016).

Compared to standard feed-forward networks, RNNs often take longer to train and are more demanding in memory and computational power. For example, it can take weeks to train models for state-of-the-art machine translation and speech recognition. Thus it is of vital importance to accelerate computation and reduce the training time of such networks. On the other hand, even at runtime, these models require too many computational resources if we want to deploy them onto low-power embedded hardware devices.

Increasingly, dedicated deep learning hardware platforms including FPGAs (Farabet et al., 2011) and custom chips (Sim et al., 2016) are reporting higher computational efficiencies of up to tera operations per second per watt (TOPS/W). These platforms are targeted at deep CNNs. If low-precision RNNs are able to report the same performance, then the savings from the reduction of multipliers (the circuits that take the space and energy) and of memory storage for the weights would be even larger, as the bit precision of the multipliers needed for the 2 to 3 gates of the gated RNN units can be reduced or the multipliers removed completely.

Previous work showed the successful application of stochastic rounding strategies on feed-forward networks, including binarization (Courbariaux et al., 2015) and ternarization (Lin et al., 2015) of weights of vanilla Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs) (Rastegari et al., 2016), and in (Courbariaux and Bengio, 2016) even the quantization of their activations, during training and runtime. Quantization of RNN weights has so far only been used with pretrained models (Shin et al., 2016).
What remained an open question up to now was whether these weight quantization techniques could successfully be applied to RNNs during training.
In this paper, we use different methods to reduce the numerical precision of weights in RNNs, and test their performance on different benchmark datasets. We make the code for the rounding methods available (https://github.com/ottj/QuantizedRNN). We use three popular RNN models: vanilla RNNs, GRUs, and LSTMs. Section 2 covers the 4 ways of obtaining low-precision weights for the RNN models in this work, and Section 3 elaborates on the test results of the low-precision RNN models on different datasets, including the large WSJ dataset. We find that ternary quantization works very well while binary quantization fails, and we analyze this result.
This work evaluates the use of 4 different rounding methods on the weights of various types of RNNs. These methods include the stochastic and deterministic binarization method (BinaryConnect) (Courbariaux and Bengio, 2016) and ternarization method (TernaryConnect) (Lin et al., 2015), the pow2-ternarization method (Stromatias et al., 2015), and a new weight quantization method (Section 2.2). For all 4 methods, we keep a full-precision copy of the weights and biases during training to accumulate the small updates, while at test time we can either use the learned full-precision weights or their deterministic low-precision version. As the experimental results in Section 4 show, the network with learned full-precision weights usually yields better results than a baseline network trained with full precision, due to the extra regularization effect brought by stochastic binarization. The deterministic low-precision version can still yield comparable performance while drastically reducing computation and required memory storage at test time. We will briefly describe the first 3 low-precision methods, and introduce a new fourth method called Exponential Quantization.
BinaryConnect and TernaryConnect were first introduced in (Courbariaux et al., 2015) and (Lin et al., 2015) respectively. By limiting the weights to only 2 or 3 possible values, i.e., -1 or 1 for BinaryConnect and -1, 0 or 1 for TernaryConnect, these methods do not require the use of multiplications. In the stochastic versions of both methods, the low-precision weights are obtained by stochastic sampling, while in the deterministic versions, the weights are obtained by thresholding.
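Let $W$ be a matrix or vector to be binarized. The stochastic BinaryConnect update works as follows:

$$W_b = \begin{cases} +1 & \text{with probability } p = \sigma(W) \\ -1 & \text{with probability } 1 - p \end{cases} \tag{1}$$

where $\sigma$ is the hard sigmoid function:

$$\sigma(x) = \mathrm{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right) \tag{2}$$

while in the deterministic BinaryConnect method, low-precision weights are obtained by thresholding the weight value at 0:

$$W_b = \begin{cases} +1 & \text{if } W \ge 0 \\ -1 & \text{otherwise} \end{cases} \tag{3}$$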
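TernaryConnect allows weights to be additionally set to zero. Formally, the stochastic form can be expressed as

$$W_{tern} = \mathrm{sign}(W) \odot \mathrm{binom}\left(p = \mathrm{clip}(|2W|, 0, 1)\right) \tag{4}$$

where $\odot$ is an element-wise multiplication. In the deterministic form, the weights are quantized depending on 2 thresholds:

$$W_{tern} = \begin{cases} +1 & \text{if } W > 0.5 \\ -1 & \text{if } W \le -0.5 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$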
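The four rounding rules above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' released code: the function names are hypothetical, and the sketch assumes a hard sigmoid `clip((x + 1) / 2, 0, 1)`, a binarization threshold of 0, a stochastic ternarization probability of `clip(|2W|, 0, 1)`, and ternarization thresholds of ±0.5.

```python
import numpy as np


def hard_sigmoid(x):
    """Hard sigmoid: clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)


def binarize(w, stochastic=False, rng=None):
    """Map weights to {-1, +1} (BinaryConnect-style; names are illustrative)."""
    w = np.asarray(w, dtype=np.float64)
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        # +1 with probability hard_sigmoid(w), otherwise -1
        return np.where(rng.random(w.shape) < hard_sigmoid(w), 1.0, -1.0)
    # Deterministic: threshold at 0 (assumed)
    return np.where(w >= 0.0, 1.0, -1.0)


def ternarize(w, stochastic=False, rng=None):
    """Map weights to {-1, 0, +1} (TernaryConnect-style; names are illustrative)."""
    w = np.asarray(w, dtype=np.float64)
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        # Keep the sign with probability clip(|2w|, 0, 1), otherwise emit 0
        keep = rng.random(w.shape) < np.clip(np.abs(2.0 * w), 0.0, 1.0)
        return np.sign(w) * keep
    # Deterministic: two thresholds at +0.5 and -0.5 (assumed)
    return np.where(w > 0.5, 1.0, np.where(w <= -0.5, -1.0, 0.0))
```

Since the quantized values are only ±1 or 0, a matrix-vector product with them reduces to additions and subtractions of the inputs, which is the property the text relies on.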
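Pow2-ternarization is another fixed-point oriented rounding method introduced in (Stromatias et al., 2015). The precision of fixed-point numbers is described by the Qm.f notation, where m denotes the number of integer bits including the sign bit, and f the number of fractional bits. For example, Q1.1 allows $\{-0.5, 0, 0.5\}$ as values. The rounding procedure works as follows: we first clip the values to the range allowed by the number of integer bits:

$$W_{clip} = \min(\max(W, w_{\min}), w_{\max}) \tag{6}$$

where $w_{\min}$ and $w_{\max}$ are the smallest and largest values representable in the Qm.f format (for Q1.1, $-0.5$ and $0.5$). We subsequently round the fractional part of the values:

$$W_{p2t} = \frac{\mathrm{round}(2^f \cdot W_{clip})}{2^f} \tag{7}$$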
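A minimal NumPy sketch of this two-step procedure is given below. The function name `pow2_ternarize` is hypothetical, and the clipping limit `2^(m-1) - 2^(-f)` is an assumption chosen so that Q1.1 yields exactly the three values -0.5, 0 and 0.5 mentioned in the text; the original implementation may define the range differently.

```python
import numpy as np


def pow2_ternarize(w, m=1, f=1):
    """Round weights to a Qm.f fixed-point grid (illustrative sketch).

    m: integer bits including the sign bit, f: fractional bits.
    """
    step = 2.0 ** -f                    # resolution of the fractional part
    limit = 2.0 ** (m - 1) - step       # assumed symmetric clipping range
    # Step 1: clip to the range allowed by the integer bits
    w_clip = np.clip(np.asarray(w, dtype=np.float64), -limit, limit)
    # Step 2: round to the nearest multiple of 2^-f
    return np.round(w_clip / step) * step
```

With the defaults `m=1, f=1`, every weight ends up at -0.5, 0 or 0.5, so multiplying by a nonzero weight is a one-bit shift plus an optional sign flip.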
Quantizing the weight values to an integer power of 2 is also a way of storing weights in low precision and eliminating multiplications. Since quantization does not require a hard clipping of weight values, it scales well with weight values.
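Similar to the methods introduced above, there is a deterministic and a stochastic way of quantization. For the stochastic quantization, we sample the logarithm of the weight value to be one of its 2 nearest integers, with a probability that depends on the distance of the weight value from the corresponding power of 2. For negative weights, we take the logarithm of the absolute value and add the sign back after quantization, i.e.:

$$\log_2 |W_b| = \begin{cases} \lceil \log_2 |W| \rceil & \text{with probability } p = \dfrac{|W|}{2^{\lfloor \log_2 |W| \rfloor}} - 1 \\ \lfloor \log_2 |W| \rfloor & \text{with probability } 1 - p \end{cases}, \qquad W_b = \mathrm{sign}(W) \cdot 2^{\log_2 |W_b|} \tag{8}$$

For the deterministic version, we just set $\log_2 |W_b| = \lceil \log_2 |W| \rceil$ if the $p$ in Eq. 8 is larger than 0.5, and $\lfloor \log_2 |W| \rfloor$ otherwise.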
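As a quick consistency check, assume that the probability of rounding the exponent up is $p = |W| / 2^{\lfloor \log_2 |W| \rfloor} - 1$ (an assumption about the exact form of the sampling rule). Writing $k = \lfloor \log_2 |W| \rfloor$, the stochastic rule is then unbiased in the weight magnitude:

$$\mathbb{E}\left[|W_b|\right] = (1 - p) \cdot 2^{k} + p \cdot 2^{k+1} = 2^{k}(1 + p) = |W|$$

For example, $|W| = 0.3$ gives $k = -2$ and $p = 0.3 / 0.25 - 1 = 0.2$, so the quantized magnitude is $0.25$ with probability $0.8$ and $0.5$ with probability $0.2$, whose mean is $0.3$.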
Note that we just need to store the logarithm of quantized weight values. The actual instruction needed for multiplying a quantized number differs according to the numerical format. For fixed point representation, multiplying by a quantized value is equivalent to binary shifts, while for floating point representation, that is equivalent to adding the quantized number’s exponent to the exponent of the floating point number. In either case, no complex operation like multiplication would be needed.
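The following NumPy sketch illustrates exponential quantization and the shift-style multiplication described above. It is a hedged illustration rather than the authors' implementation: `exp_quantize` is a hypothetical name, the round-up probability `|w| / 2^floor(log2|w|) - 1` is an assumption, and zero weights are simply mapped to zero.

```python
import numpy as np


def exp_quantize(w, stochastic=False, rng=None):
    """Quantize weights to signed integer powers of 2 (illustrative sketch).

    Returns the quantized weights and the stored integer exponents.
    """
    w = np.asarray(w, dtype=np.float64)
    mag = np.abs(w)
    safe = np.where(mag > 0.0, mag, 1.0)     # avoid log2(0); zeros stay zero below
    low = np.floor(np.log2(safe))            # lower candidate exponent
    p = safe / 2.0 ** low - 1.0              # assumed probability of rounding up, in [0, 1)
    if stochastic:
        rng = np.random.default_rng() if rng is None else rng
        round_up = rng.random(w.shape) < p
    else:
        round_up = p > 0.5
    exponent = (low + round_up).astype(np.int32)
    return np.sign(w) * 2.0 ** exponent, exponent


def multiply_by_pow2(x, exponent):
    """Multiply x by 2**exponent by adjusting the floating-point exponent."""
    return np.ldexp(x, exponent)
```

Only the sign and the integer exponent need to be stored per weight. `np.ldexp` stands in here for the exponent addition mentioned in the text; on fixed-point hardware the same operation would be a binary shift.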
As the most basic RNN structure, the vanilla RNN adds a simple extension to feed-forward networks. Its hidden state is updated from both the current input and the state at the previous time step:

$$h_t = \sigma(W_{hh} h_{t-1} + W_{xh} x_t + b_h) \tag{9}$$

where $\sigma$ here denotes the nonlinear activation function. The hidden state can be followed by more layers to yield an output at each time step. For example, in character-level language modeling, the output at each time step is set to be the probability of each character appearing at the next time step. Thus there is a softmax layer that transforms the hidden state representation into predictive probabilities:

$$y_t = \mathrm{softmax}(W_{hy} h_t + b_y) \tag{10}$$
In the low-precision version of the RNN, we just apply a quantization function $\rho$ to each of the weights in the aforementioned RNN structure. Thus all multiplications in the forward pass (except for the softmax normalization) are eliminated:

$$h_t = \sigma(\rho(W_{hh}) h_{t-1} + \rho(W_{xh}) x_t + b_h) \tag{11}$$

$$y_t = \mathrm{softmax}(\rho(W_{hy}) h_t + b_y) \tag{12}$$

where $\rho$ is applied element-wise to all weights in a given weight matrix. We should note that, because of the quantization process, the derivative of the cost with respect to the weights is no longer smooth in the low-precision RNN (it is 0 almost everywhere). We instead compute the derivative with respect to the quantized weights, and use that derivative for the weight update. In other words, the gradients are computed as if the quantization operation were not there. This makes sense because we can think of the quantization operation as adding noise $\epsilon$:

$$\rho(W) = W + \epsilon \tag{13}$$
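The update rule described above (forward pass with quantized weights, gradient taken with respect to the quantized weights, update applied to the full-precision copy) can be sketched on a toy linear regression problem. Everything here is an illustrative assumption: the names `ternarize` and `train_step`, the ±0.5 thresholds, the squared-error loss, and the clipping of the full-precision copy to [-1, 1] (borrowed from BinaryConnect-style training) are not taken from the released code.

```python
import numpy as np


def ternarize(w):
    """Deterministic ternarization with assumed thresholds at +/-0.5."""
    return np.where(w > 0.5, 1.0, np.where(w <= -0.5, -1.0, 0.0))


def train_step(w_full, x, y, lr=0.1):
    """One update of the full-precision weights using quantized-weight gradients."""
    w_q = ternarize(w_full)              # low-precision weights used in the forward pass
    err = x @ w_q - y                    # forward pass of a linear model
    loss = 0.5 * np.mean(err ** 2)
    grad_wq = x.T @ err / len(x)         # dLoss/dW_q, treated as dLoss/dW_full
    # Accumulate the small update in the full-precision copy; clipping is an assumption
    w_full = np.clip(w_full - lr * grad_wq, -1.0, 1.0)
    return w_full, loss


def fit(x, y, n_steps=50):
    """Run a few updates starting from zero weights."""
    w_full = np.zeros(x.shape[1])
    loss = None
    for _ in range(n_steps):
        w_full, loss = train_step(w_full, x, y)
    return w_full, loss
```

The full-precision copy is what lets many small gradient steps eventually move a weight across a quantization threshold, even though each individual step leaves the quantized weights unchanged.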
We have observed across all three RNN architectures that BinaryConnect on the recurrent weights never worked. The quantization function in the recurrent direction has to allow 0 to be sampled. We conjecture that this is related to the stabilization of hidden states.

Consider the effect that BinaryConnect and TernaryConnect have on the Jacobians of the state-to-state transition. In BinaryConnect, all entries in the weight matrix $W$ are sampled to be $+1$ or $-1$. In LSTMs and GRUs, there is a strong near-1 diagonal in the Jacobian because the gates are more often turned on, i.e., letting information flow through, while the off-diagonal entries of the Jacobian tend to be much smaller when the weights have not been quantized. However, when the true value of a weight is near zero, its quantized value is stochastically sampled to be $+1$ or $-1$ with nearly equal probability. When near-0 off-diagonal entries of a matrix of real values between -1 and 1 are randomly replaced by values near +1 or -1, the magnitude of the weights increases and the condition number of the matrix will tend to worsen due to the presence of more near-0 eigenvalues. This could mean that gradients tend to vanish faster, because a gradient vector could more often have strong components in the directions of the eigenvectors associated with these small eigenvalues. With larger eigenvalues of the Jacobian (observed, Fig. 1(a)), i.e. larger derivatives, we could also see gradients explode.

In Fig. 1(a), where we use unbounded units (ReLU) as activation, if we look at the Jacobian of two neighboring hidden states ($\partial h_t / \partial h_{t-1}$), we can see that its maximum eigenvalue is around 2.5 across all time steps, much larger than 1. As a consequence, hidden states explode with respect to time steps, while this is not the case for TernaryConnect and ExpQuantize (Fig. 1(b)).

On the other hand, if we allow 0 (or a sufficiently small value) to be chosen in the sampling process, the effect of stochastic sampling on the Jacobians will not be that devastating. The Jacobian remains a quasi-diagonal matrix, which is well-conditioned.
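The argument above can be illustrated with a toy experiment (not the experiment behind the paper's figures). The sketch below builds a near-diagonal recurrent matrix, samples a binary and a ternary version of it, and compares the largest eigenvalue magnitudes. For a ReLU RNN in which all units are active, the state-to-state Jacobian equals the recurrent weight matrix, which is the simplifying assumption made here. The helper names, the matrix size, and the sampling probabilities (`clip((w + 1) / 2, 0, 1)` and `clip(|2w|, 0, 1)`) are assumptions for illustration.

```python
import numpy as np


def spectral_radius(m):
    """Largest eigenvalue magnitude of a square matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(m))))


def sample_binary(w, rng):
    """Stochastic binarization: +1 with probability clip((w + 1) / 2, 0, 1)."""
    p = np.clip((w + 1.0) / 2.0, 0.0, 1.0)
    return np.where(rng.random(w.shape) < p, 1.0, -1.0)


def sample_ternary(w, rng):
    """Stochastic ternarization: keep sign(w) with probability clip(|2w|, 0, 1)."""
    p = np.clip(np.abs(2.0 * w), 0.0, 1.0)
    return np.sign(w) * (rng.random(w.shape) < p)


def compare(n=64, seed=0):
    """Return spectral radii of a near-diagonal matrix and its two quantized samples."""
    rng = np.random.default_rng(seed)
    # Strong diagonal, tiny off-diagonal entries (toy stand-in for a trained W_hh)
    w = 0.9 * np.eye(n) + 0.002 * rng.standard_normal((n, n))
    return (spectral_radius(w),
            spectral_radius(sample_binary(w, rng)),
            spectral_radius(sample_ternary(w, rng)))
```

In this toy setting the binarized matrix turns every near-zero off-diagonal entry into ±1, so its spectrum spreads out far beyond 1, whereas the ternarized matrix stays close to the identity because almost all off-diagonal entries are sampled as 0.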
In (Krueger and Memisevic, 2015) it was shown that in a trained model, hidden state norms change in the first several timesteps, but become stable afterwards. The model can work even better if during training we punish the changes of the norm of the hidden state from one time step to the next.
LSTMs (Hochreiter et al., 1997) were first introduced in RNNs for sequence modeling. Their gate mechanism makes them a good option for dealing with the vanishing gradient problem, since it allows them to model long-term dependencies in the data. To limit the numerical precision, we apply a rounding method $\rho$ to all or a subset of the weights.

GRUs (Cho et al., 2014) can also be used in RNNs for modeling temporal sequences. They involve less computation than LSTM units, since they do not have an output gate, and are therefore sometimes preferred in large models. At timestep $t$, the state $h_t$ of a single GRU unit is computed as follows:
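$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{14}$$

where $\odot$ denotes an element-wise multiplication. The update gate $z_t$ is computed with

$$z_t = \sigma(W_{hz} h_{t-1} + W_{xz} x_t + b_z) \tag{15}$$

where $x_t$ is the input at timestep $t$, $W_{hz}$ is the state-to-state recurrent weight matrix, $h_{t-1}$ is the state at $t-1$, $W_{xz}$ is the input-to-hidden weight matrix, and $b_z$ is the bias.

The candidate state $\tilde{h}_t$, which uses the reset gate $r_t$, is computed as follows:

$$\tilde{h}_t = \tanh(W_{hh} (r_t \odot h_{t-1}) + W_{xh} x_t + b_h) \tag{16}$$

where

$$r_t = \sigma(W_{hr} h_{t-1} + W_{xr} x_t + b_r) \tag{17}$$

In our experiments, the weights are rounded in the same way as for the LSTMs. For example, for the $z$ gate, the input weight is rounded as follows: $W_{xz} \rightarrow \rho(W_{xz})$.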
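A single GRU step with rounded weights can be sketched in NumPy as follows. This is a hedged illustration under several assumptions: the standard GRU equations with update gate `z`, reset gate `r` and a tanh candidate state, the parameter names (`W_xz`, `W_hz`, and so on), the uniform initialization, and the pluggable `quantize` argument are all chosen for the example and are not taken from the released code.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def ternarize(w):
    """Deterministic ternarization with assumed thresholds at +/-0.5."""
    return np.where(w > 0.5, 1.0, np.where(w <= -0.5, -1.0, 0.0))


def init_params(n_in, n_hid, rng):
    """Hypothetical parameter layout: one input and one recurrent matrix per gate."""
    params = {}
    for gate in ("z", "r", "h"):
        params[f"W_x{gate}"] = rng.uniform(-1.0, 1.0, (n_hid, n_in))
        params[f"W_h{gate}"] = rng.uniform(-1.0, 1.0, (n_hid, n_hid))
        params[f"b_{gate}"] = np.zeros(n_hid)
    return params


def gru_step(x_t, h_prev, params, quantize=lambda w: w):
    """One GRU step; `quantize` is applied to every weight matrix, biases are kept."""
    q = {k: (quantize(v) if k.startswith("W") else v) for k, v in params.items()}
    z = sigmoid(q["W_xz"] @ x_t + q["W_hz"] @ h_prev + q["b_z"])      # update gate
    r = sigmoid(q["W_xr"] @ x_t + q["W_hr"] @ h_prev + q["b_r"])      # reset gate
    h_cand = np.tanh(q["W_xh"] @ x_t + q["W_hh"] @ (r * h_prev) + q["b_h"])
    return (1.0 - z) * h_prev + z * h_cand
```

Passing `quantize=ternarize` rounds all weight matrices; rounding only a subset, such as the input-to-GRU weights, would mean applying the function to the corresponding keys only.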
In the following experiments, we test the effectiveness of the different rounding methods on two different types of applications: character-level language modeling and speech recognition. The different RNN types (vanilla RNN, GRU, and LSTM) are evaluated in experiments using four different datasets.

We validate the low-precision vanilla RNN on 2 datasets: text8 and Penn Treebank Corpus (PTB).

The text8 dataset contains the first 100M characters from Wikipedia, excluding all punctuation. It does not discriminate between cases, so its alphabet has only 27 different characters: the 26 English characters and space. We take the first 90M characters as the training set, and split them equally into sequences of 50 characters each. The last 10M characters are split equally to form the validation and test sets.

The Penn Treebank Corpus (Taylor et al., 2003) contains 50 different characters, including English characters, numbers, and punctuation. We follow the settings in (Mikolov et al., 2012) to split our dataset, i.e., 5017k characters for the training set, and 393k and 442k characters for the validation and test sets respectively.

The models are built to predict the next character given the previous ones, and performance is evaluated with the bits-per-character (BPC) metric, which is the $\log_2$ of the perplexity, or equivalently the negative per-character log-likelihood (base 2). We use an RNN with ReLU activation and 2048 hidden units. We initialize the hidden-to-hidden weights as identity matrices, while the input-to-hidden and hidden-to-output matrices are initialized with uniform noise.
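As a small illustration of the metric, BPC can be computed from the predicted next-character distributions as the mean negative base-2 log-probability assigned to the character that actually occurred. The function name and the array layout (one row of probabilities per position) are assumptions made for this sketch.

```python
import numpy as np


def bits_per_character(probs, targets):
    """Mean negative log2-probability of the correct next character.

    probs: array of shape (n_positions, alphabet_size), rows sum to 1.
    targets: integer array of shape (n_positions,) with the true character ids.
    """
    probs = np.asarray(probs, dtype=np.float64)
    targets = np.asarray(targets)
    p_correct = probs[np.arange(len(targets)), targets]
    return float(-np.mean(np.log2(p_correct)))
```

A model that predicts uniformly over the 27 characters of text8 would score log2(27), roughly 4.75 BPC, and the per-character perplexity is recovered as 2 raised to the BPC.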
We can see the regularization effect of stochastic quantization in the results on the two datasets. On the PTB dataset, where a model of this size slightly overfits, the low-precision model trained with stochastic quantization yields a test set performance of 1.372 BPC, which surpasses its full-precision baseline (1.505 BPC) by around 0.133 BPC (Fig. 2, left). From the figure we can see that stochastic quantization does not significantly hurt training speed, and manages to generalize better when the baseline model begins to overfit. On the other hand, on the text8 dataset, where the same-sized model now underfits, the low-precision model performs worse (1.639 BPC) than its baseline (1.588 BPC) (Fig. 1(b) and Table 1).
This section presents results from the various methods to limit numerical precision in the weights and biases of GRU RNNs which are then tested on the TIDIGITS dataset.
TIDIGITS (Leonard and Doddington, 1993) is a speech dataset consisting of clean speech of spoken numbers from 326 speakers. We only use single-digit samples (zero to nine) in our experiments, giving us 2464 training samples and 2486 validation samples. The labels for the spoken ‘zero’ and ‘O’ are combined into one label, hence we have 10 possible labels. We create MFCCs from the raw waveform and apply leading zero padding to get samples of matrix size 39x200. The MFCC data is further whitened before use. We only use masking for processing the data with the RNN in some of the experiments.
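The padding and normalization steps can be sketched as follows. This is a hypothetical illustration only: the function names are invented, "whitening" is interpreted here as per-coefficient standardization to zero mean and unit variance (the text does not specify the exact procedure), and sequences longer than 200 frames are truncated to their last 200 frames, which is also an assumption.

```python
import numpy as np


def pad_leading(mfcc, n_frames=200):
    """Left-pad a (n_coeff, length) MFCC matrix with zeros to n_frames columns.

    Returns the padded matrix and a boolean mask marking the real frames.
    """
    mfcc = np.asarray(mfcc, dtype=np.float64)
    n_coeff, length = mfcc.shape
    length = min(length, n_frames)           # assumed: keep only the last n_frames
    padded = np.zeros((n_coeff, n_frames))
    mask = np.zeros(n_frames, dtype=bool)
    if length > 0:
        padded[:, n_frames - length:] = mfcc[:, -length:]
        mask[n_frames - length:] = True
    return padded, mask


def standardize(batch, eps=1e-8):
    """Per-coefficient zero-mean, unit-variance scaling over a (N, n_coeff, T) batch."""
    mean = batch.mean(axis=(0, 2), keepdims=True)
    std = batch.std(axis=(0, 2), keepdims=True)
    return (batch - mean) / (std + eps), mean, std
```

The returned mask is what an RNN implementation would use to skip the padded frames in the experiments where masking is enabled; the statistics returned by `standardize` would be computed on the training set and reused for the validation set.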
The model has a 200-unit GRU layer followed by a 200-unit fully-connected ReLU layer. The output is a 10-unit softmax layer. Weights are initialized using the Glorot & Bengio method (Glorot and Bengio, 2010). The network is trained using Adam (Kingma and Ba, 2014), and BatchNorm (Ioffe and Szegedy, 2015) is not used. We train our model for up to 400 epochs with a patience setting of 100 (no early stopping before epoch 100). The GRU results in Table 1 are from 10 experiments, each of which starts with a different random seed. This is done because previous experiments have shown that different random seeds can lead to a difference of up to a few percent in final accuracy. We show the average and maximum achieved performance on the validation set.

To evaluate weight binarization on a GRU, we trained our model with the possible binary values {-1, 1}, {-0.5, 0.5}, {0, 1}, {-0.5, 0} for the weights. Binarization was done only on the input-to-hidden weight matrices $W_{xz}$, $W_{xr}$, $W_{xh}$. We ran each experiment once with stochastic binarization and once with deterministic binarization. As shown in Table 1, none of the combinations resulted in an increase in accuracy over chance after 400 training epochs. Doubling the number of GRU units to 400 did not help either. We therefore concluded that GRUs do not function properly if all of these weights are binarized. It has yet to be tested whether at least a subset of the aforementioned weight matrices, or some of the hidden-to-hidden weight matrices, could be binarized.
To assess the impact of weight ternarization, we trained our model and quantized the weights during training using pow2-ternarization with Q1.1.

Figure 3 (a) shows how pow2-ternarization applied to different sets of GRU weights affects convergence compared to the full-precision baseline. If full-precision weights and biases are used, convergence starts after a few training epochs. As shown in Table 1, if pow2-ternarization is used on the input-to-GRU weights, the top-1 accuracy improves to 99.3%. Training takes 10 epochs longer before convergence starts, but then surpasses the baseline in terms of convergence speed; the variance between the different runs is also smaller than for the baseline runs. Limiting the precision of both the input-to-GRU weights and biases leads to a similar training curve, but the top-1 score increases to 99.42%. If pow2-ternarization is applied to all GRU weights (now also to the hidden-to-hidden weights) and biases, the top-1 accuracy decreases to 99.1%, though it is still higher than the baseline.

Table 1. SB/DB: stochastic/deterministic binarization; DT/ST: deterministic/stochastic ternarization; PT: pow2-ternarization; EQ: exponential quantization. Cells marked "..." are not recoverable here.

Dataset  | RNN Type | Baseline   | SB   | DB   | DT        | ST  | PT         | EQ
text8    | VRNN     | 1.588 BPC  | N/A  | N/A  | 1.639 BPC | ... | ...        | ...
PTB      | VRNN     | 1.505 BPC  | N/A  | N/A  | 1.372 BPC | ... | ...        | ...
TIDIGITS | GRU      | ...        | 18.7 | 18.7 | ...       | ... | ...        | ...
WSJ      | LSTM     | 11.16% WER | ...  | ...  | ...       | ... | 10.49% WER | ...
Figure 3 (b) shows that we see the same effects as with pow2-ternarization, except for the case where we ternarize all weights and biases. With pow2-ternarization we allow -0.5, 0, and 0.5 as values. With the default ternarization we allow -1, 0, and 1. This difference has a big impact on the hidden-to-hidden weights, because if we apply ternarization there, we end up with lower-than-baseline performance and much slower convergence. On the other hand, if we apply ternarization only to the input-to-GRU weights, we get 99.67%, the highest top-1 score of all our TIDIGITS experiments. This leads us to conclude that different GRU components need different sets of allowed values to function optimally. Indeed, if we change the ternarization of all weights and biases to use -0.5, 0, and 0.5 as allowed values, we see basically the same result as with pow2-ternarization. Stochastic ternarization has not been shown to be useful here. Convergence starts after 100 training epochs, and the average maximum and top-1 accuracies of 97.72% and 98.23% are almost at baseline level.
Previous work had shown that some forms of network binarization work on small datasets but do not scale well to big datasets (Rastegari et al., 2016). To determine whether low-precision networks still work on big datasets, we chose to train a large model on the WSJ dataset.

The model is trained on the Wall Street Journal (WSJ) corpus (available at the LDC as LDC93S6B and LDC94S13B), where we use the 81-hour training set "si284". The development set "dev93" is used for early stopping and the evaluation is performed on the test set "eval92". We use 40-dimensional filter bank features extended with deltas and delta-deltas, leading to 120-dimensional features per frame. Each dimension is normalized to have zero mean and unit variance over the training set. Following the text preprocessing in (Miao et al., 2015), we use 59 character labels for character-based acoustic modeling. Decoding with the language model is performed with a recently proposed approach (Miao et al., 2015) based on both Connectionist Temporal Classification (CTC) (Graves et al., 2006) and weighted finite-state transducers (WFSTs) (Mohri et al., 2008).

Both the limited-precision model and the baseline model have 4 bidirectional LSTM layers with 250 units in both directions of each layer. In order to get the unsegmented character labels directly, we use CTC on top of the model. The baseline and the limited-precision model are trained using Adam (Kingma and Ba, 2014) with a fixed learning rate. The weights are initialized following the scheme of (Glorot and Bengio, 2010). Note that, for simplicity, we do not regularize the model, for example by injecting weight noise, so the baseline results shown here could be worse than recently published numbers on the same task (Graves and Jaitly, 2014; Miao et al., 2015).

The baseline achieves a word error rate (WER) of 11.16% on the test set after training for 60 epochs, which took 8 days. The pow2-ternarization method converges considerably more slowly, similar to the GRU experiments. The model was trained for 3 weeks up to epoch 87, where it reaches a WER of 10.49%.
This paper shows for the first time how low-precision quantization of weights can be performed effectively during training for RNNs. We presented 3 existing methods and introduced 1 new method of limiting the numerical precision. We used the different methods on 3 major RNN types and determined how the limited numerical precision affects network performance across 4 datasets.

In the language modeling task, the low-precision model surpasses its full-precision baseline by a large gap (0.133 BPC) on the PTB dataset. We also show that the model works better in a slightly overfitting setting, where the regularization effect of stochastic quantization begins to take effect. In the speech recognition task, we show that it is not possible to binarize the weights of GRUs while maintaining their functionality. We conjecture that the better performance from ternarization is due to a reduced variance of the weighted sums (when a near-zero real value is quantized to +1 or -1, this introduces substantial variance), which could be more harmful in RNNs because the same weight matrices are used over and over again along the temporal sequence. Furthermore, we show that weight and bias quantization methods using ternarization, pow2-ternarization, and exponential quantization can improve performance over the baseline on the TIDIGITS dataset. The successful outcome of these experiments means that custom implementations of RNN models can have lower resource requirements.
We are grateful to INI members Danny Neil, Stefan Braun, and Enea Ceolini, and MILA members Philemon Brakel, Mohammad Pezeshki, and Matthieu Courbariaux, for useful discussions and help with data preparation. We thank the developers of Theano (Theano Development Team, 2016), Lasagne, Keras, Blocks (van Merrienboer et al., 2015), and Kaldi (Povey et al., 2011).

Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. ISSN 0028-0836. doi: 10.1038/nature14236. URL http://dx.doi.org/10.1038/nature14236.

Stromatias et al. (2015). Robustness of spiking Deep Belief Networks to noise and reduced bit precision of neuro-inspired hardware platforms. Frontiers in Neuroscience, 9(July):1–14, 2015. ISSN 1662-453X. doi: 10.3389/fnins.2015.00222. URL http://www.frontiersin.org/Journal/Abstract.aspx?s=755&name=neuromorphic_engineering&ART_DOI=10.3389/fnins.2015.00222.