1 Introduction
Neural networks have become the state of the art for machine translation (Bojar et al., 2018). While neural machine translation (NMT) (Bahdanau et al., 2014) yields better performance than its statistical counterpart, it is also more resource-demanding. NMT embeds tokens as vectors, so the embedding layer has to store a vector representation of every token in both the source and target vocabulary. Moreover, current state-of-the-art architectures such as the Transformer (Vaswani et al., 2017) or deep RNNs (Barone et al., 2017) usually require multiple layers.

Model quantization has been widely studied as a way to compress model size and speed up inference. However, most of this work focuses on convolutional neural networks for computer vision tasks (Miyashita et al., 2016; Lin et al., 2016; Hubara et al., 2016, 2017; Jacob et al., 2018). Research on model quantization for NMT remains limited.

We first explore the use of logarithmic quantization instead of fixed-point quantization (Miyashita et al., 2016), based on the empirical finding that the parameter distribution is not uniform (Lin et al., 2016; See et al., 2016). Parameter magnitudes also vary across layers, so we propose a better way to scale the quantization centers. We further observe that biases do not consume a noticeable amount of memory, and therefore question the need to compress them at all. Lastly, we explore the significance of retraining in a model compression scenario. We adopt an error-feedback mechanism (Seide et al., 2014), preserving the quantization error as a stale gradient rather than discarding it at every update during retraining.
2 Related Work
A considerable amount of research on model quantization has been done in the area of computer vision with convolutional neural networks; research on model quantization in the area of neural machine translation is much more limited. In this section, we will therefore also refer to work on neural models for image processing where appropriate.
Lin et al. (2016), Hubara et al. (2016), Hubara et al. (2017), Jacob et al. (2018), and Junczys-Dowmunt et al. (2018) all use linear quantization. Lin et al. (2016) and Hubara et al. (2016) use a scale parameter fixed prior to model training; Junczys-Dowmunt et al. (2018) and Jacob et al. (2018) base it on the maximum tensor values for each matrix observed in the trained models.
Observing that their parameters are highly concentrated near 0 (see also Lin et al., 2016; See et al., 2016), Miyashita et al. (2016) opt for logarithmic quantization. They report an improvement in preserving model accuracy over linear quantization while achieving the same model compression rate.
Hubara et al. (2017) compress an LSTM-based architecture for language modeling to 4-bit without any quality degradation (while increasing the unit size by a factor of 3). See et al. (2016) prune an NMT model by removing weight values below a certain threshold, achieving 80% model sparsity without any quality degradation.
The most relevant work for our purposes is the submission of Junczys-Dowmunt et al. (2018) to the 2018 Shared Task on Efficient Neural Machine Translation, which applied 8-bit linear quantization to NMT models without any noticeable deterioration in translation quality. Similarly, Quinn and Ballesteros (2018) proposed 8-bit matrix multiplication to speed up an NMT system.
3 Low Precision Neural Machine Translation
3.1 Log-Based Compression
Lin et al. (2016) and See et al. (2016) report that parameters in deep learning models are normally distributed, and most of them are small values. Therefore, we adopt a logarithmic quantization similar to Miyashita et al. (2016), where each center is of the form ±S · 2^t for a scaling factor S and an integer exponent t. This allows for more centers for smaller values, giving us more precision of representation where the parameter value density is highest.

When compressing the model to B bits, a single bit is used for the sign, hence we are left with B - 1 bits for representing the values. t is an integer in the range -(2^(B-1) - 1) ≤ t ≤ 0. We use a symmetric quantization: we apply the compression to the absolute value, then restore the sign afterwards. Therefore, our quantization centers (in absolute value) range from S · 2^-(2^(B-1)-1) up to S.
However, we find that model tensors might not have the same magnitude as the quantization centers. To solve this issue, we scale the model values temporarily before quantizing, then rescale them back to their original magnitude. This approach differs from that of Miyashita et al. (2016), where quantization centers are not scaled, so every layer shares the same centers.
Miyashita et al. (2016) quantize a value by rounding its logarithm to the closest integer. However, we found that this does not always quantize a value to the closest center. For example, this approach quantizes 5.8 to 8 instead of 4, because log2(5.8) is about 2.54 and rounds to 3, even though 5.8 is closer to 2^2 = 4 than to 2^3 = 8. Instead, we always round up the logarithm after dividing the value by 1.5, as shown in Eq. 1 below:

Q(v) = sign(v) · S · 2^t(v),  where t(v) = clip(⌈log2(|v| / (1.5 · S))⌉, -(2^(B-1) - 1), 0)    (1)
Base 2 was chosen for its simplicity of implementation: computing the rounded logarithm can be done by finding the leftmost '1' bit, and computing the power operation is just a bit shift. However, other bases can be chosen as well.
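The rounding rule of Eq. 1 can be sketched in a few lines of Python (an illustration of our own: the name `log_quantize` and the dense-float simulation are ours, and a real implementation would store only the sign, the exponent, and one scale per tensor):

```python
import math

def log_quantize(v, S, bits=4):
    """Quantize v to the nearest center sign(v) * S * 2^t, with the
    integer exponent t restricted to -(2^(bits-1) - 1) <= t <= 0.

    Rounding up log2(|v| / (1.5 * S)) selects the center closest to v
    in linear space, not merely closest in log space.
    """
    if v == 0.0:
        return 0.0  # simplification: keep exact zeros as-is
    t_min = -(2 ** (bits - 1) - 1)
    t = max(t_min, min(0, math.ceil(math.log2(abs(v) / (1.5 * S)))))
    return math.copysign(S * 2.0 ** t, v)
```

With S = 8, for instance, 5.8 maps to the nearest center 4, whereas rounding log2(5.8) ≈ 2.54 to the closest integer would pick 8.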
S = max_i |v_i|    (2)
Junczys-Dowmunt et al. (2018) and Jacob et al. (2018) scale the model based on its maximum value (Equation 2), which can be very unstable. Alternatively, Lin et al. (2016) and Hubara et al. (2016) use a predefined step size in their fixed-point quantization. Our objective is to select a scaling factor S such that the quantized parameter is as close to the original as possible. Therefore, we optimize S to minimize the squared error between the original and the compressed parameter.
We propose a method to fit S with Expectation-Maximization. We start with an initial scale based on the parameters' maximum value. For a given S, we apply the quantization routine described in Equation 1, resulting in a center assignment for every value:

q_i = sign(v_i) · 2^t(v_i)    (3)

For a given assignment q, we fit a new scale S that minimizes the squared quantization error:

S* = argmin_S Σ_i (v_i - S · q_i)²    (4)

To optimize this objective, we set the first derivative of Equation 4 to zero, which yields the closed-form solution:

S* = (Σ_i v_i · q_i) / (Σ_i q_i²)    (5)
We optimize S for each tensor independently.
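This alternating fit can be sketched as follows (our own illustrative code: the name `fit_scale`, the fixed iteration count, and the assumption of non-zero values are ours):

```python
import math

def fit_scale(values, S0, bits=4, iters=10):
    """Alternate center assignment under the current scale (Eq. 3)
    with the closed-form least-squares scale (Eq. 5).
    Assumes non-zero values, since weights of exactly 0.0 are rare."""
    t_min = -(2 ** (bits - 1) - 1)
    S = S0
    for _ in range(iters):
        # E-step: assignment q_i = sign(v_i) * 2^t (Eq. 3)
        q = [math.copysign(
                 2.0 ** max(t_min,
                            min(0, math.ceil(math.log2(abs(v) / (1.5 * S))))),
                 v)
             for v in values]
        # M-step: least-squares scale (Eq. 5)
        S = sum(v * c for v, c in zip(values, q)) / sum(c * c for c in q)
    return S
```

Each E-step picks the nearest available center and each M-step solves Eq. 4 exactly, so the squared error never increases across iterations.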
3.2 Retraining
Unlike Junczys-Dowmunt et al. (2018), we retrain the model after the initial quantization to allow it to recover some of the quality loss. In the retraining phase, we compute the gradients normally in full precision. We then re-quantize the model after every parameter update, including refitting the scaling factors. The re-quantization error is preserved in a residual variable and added to the parameters at the next step (Seide et al., 2014). This error-feedback mechanism was introduced in gradient compression techniques to reduce the impact of compression errors by preserving them as stale gradient updates for the next batch (Aji and Heafield, 2017; Lin et al., 2017). We view re-applying quantization after a parameter update as a form of gradient compression, hence we explore the use of an error-feedback mechanism to potentially improve the final model's quality.
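One retraining update can be sketched as follows (a simplification of our own: the name `retrain_step`, the plain-SGD update, and the stand-in `quantize` argument are assumptions; in our setting `quantize` would be the re-quantization of Section 3.1, including scale fitting):

```python
def retrain_step(w_q, grad, residual, lr, quantize):
    """One retraining update with error feedback: apply the
    full-precision gradient on top of the quantized weight plus the
    previous re-quantization error, re-quantize, and keep the new
    error for the next step."""
    w_full = w_q + residual - lr * grad   # restore last step's error first
    w_q_new = quantize(w_full)            # re-quantize after the update
    residual_new = w_full - w_q_new       # preserved, not discarded
    return w_q_new, residual_new
```

Because the residual always stores the exact gap between the quantized and full-precision weights, the sum w_q + residual tracks the full-precision trajectory, so no update information is discarded between steps.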
3.3 Handling Biases
We do not quantize bias values in the model. We found that they do not follow the same distribution as other parameters, and attempting to logquantize them used only a fraction of the available quantization points. In any case, bias values do not take up a lot of memory relative to other parameters. In our Transformer architecture, they account for only 0.2% of the parameter values.
3.4 Low Precision Dot Products
Our end goal is to run a logquantized model without decompressing it. Activations coming into a matrix multiplication are quantized on the fly; intermediate activations are not quantized.
We use the same log-based quantization procedure described in Section 3.1. However, we only use the max-based scale: running the slower EM approach to optimize the scale before every dot product would not be fast enough for inference applications.
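A dot product with on-the-fly, max-scaled activation quantization can be sketched like this (illustrative only: `quantized_dot` is a hypothetical name, the loop simulates in dense floats, and a real kernel would operate on packed sign/exponent pairs with shifts and adds):

```python
import math

def quantized_dot(x, w_q, bits=4):
    """Dot product where the activations x are log-quantized on the
    fly with a max-based scale; w_q holds the already-quantized
    weights."""
    S = max(abs(v) for v in x)            # max-based scale, as in Eq. 2
    t_min = -(2 ** (bits - 1) - 1)
    total = 0.0
    for v, w in zip(x, w_q):
        if v != 0.0:
            t = max(t_min, min(0, math.ceil(math.log2(abs(v) / (1.5 * S)))))
            v = math.copysign(S * 2.0 ** t, v)
        total += v * w
    return total
```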
The derivatives of the ceiling and sign functions are zero almost everywhere and undefined in some places. For retraining purposes, we apply a straight-through estimator (Bengio et al., 2013) to the ceiling function. For the sign function, we treat the quantization function differently for each individual value in v, based on its sign. Therefore, we now compute Q(v) as:

Q(v) = S · 2^t(v)     if v ≥ 0
Q(v) = -S · 2^t(-v)   if v < 0    (6)

with t(·) as defined in Equation 1. Since we only multiply by a constant, the derivative is multiplied by either 1 or -1. In the latter case, the inner derivative of -v contributes another -1, which returns the derivative's sign to positive. Hence, the derivative of our quantization function is:

∂Q(v)/∂v = 1    (7)
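The effect of Eqs. 6 and 7 can be illustrated with a toy straight-through training loop (entirely our own example: a single weight fitted to the unrepresentable target 0.3 with a fixed scale S = 1 and plain gradient descent):

```python
import math

def quantize(v, S=1.0, bits=4):
    """Scalar version of the log quantization in Eq. 1."""
    if v == 0.0:
        return 0.0
    t = max(-(2 ** (bits - 1) - 1),
            min(0, math.ceil(math.log2(abs(v) / (1.5 * S)))))
    return math.copysign(S * 2.0 ** t, v)

# Minimize (Q(w) - 0.3)^2 for a single weight w. The forward pass uses
# the quantized weight; the backward pass uses the straight-through
# estimator of Eq. 7, i.e. dQ(w)/dw = 1, so the gradient with respect
# to Q(w) is applied to w directly.
w = 1.0
for _ in range(100):
    grad = 2.0 * (quantize(w) - 0.3)  # dL/dQ(w) times dQ/dw = 1
    w -= 0.1 * grad
```

The weight settles between the two centers 0.25 and 0.5 that bracket the target; with the true derivative, which is zero almost everywhere, the weight would never move at all.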
4 Experiments
4.1 Experiment Setup
We use systems for the WMT 2017 English-to-German news translation task for our experiments; these differ from the WNGT shared task setting previously reported. We use back-translated monolingual corpora (Sennrich et al., 2016a) and byte-pair encoding (Sennrich et al., 2016b) to preprocess the corpus. Quality is measured as BLEU (Papineni et al., 2002), computed with the sacreBLEU script (Post, 2018).
We first pretrain baseline models with both the Transformer and RNN architectures. Our Transformer model consists of six encoder and six decoder layers with tied embeddings. Our deep RNN model consists of eight layers of bidirectional LSTMs. Models were trained synchronously with a dynamic batch size of 40 GB per batch using the Marian toolkit (Junczys-Dowmunt et al., 2018). The models were trained for 8 epochs and optimized with Adam (Kingma and Ba, 2014). The remaining hyperparameters of both models follow the suggested configurations (Vaswani et al., 2017; Sennrich et al., 2017).

4.2 4-bit Transformer Model
In this first experiment, we explore different ways to scale the quantization centers, the significance of quantizing biases, and the significance of retraining. We use the pretrained Transformer model as our baseline and apply our quantization algorithm on top of it.
Table 1:

Method                 | Without retraining        | With retraining
                       | Fixed | Max   | Optimized | Fixed | Max   | Optimized
Baseline               | 35.66
+ Model Quantization   | 25.2  | 28.08 | 33.33     | 34.92 | 34.81 | 35.26
+ No Bias Quantization | 34.16 | 34.29 | 34.31     | 35.09 | 35.25 | 35.47
Table 1 summarizes the results. A simple, albeit unstable, max-based scaling performs better than the fixed quantization scale. However, fitting the scaling factor to minimize the squared quantization error produces the best quality. Interestingly, the BLEU score differences between the scaling methods diminish after retraining.
We can also see improvements from not quantizing biases, especially without retraining. Overall, we reach the highest BLEU score of 35.47 by using the optimized scale with uncompressed biases and retraining. Without bias quantization, we obtain a 7.9x compression ratio (instead of 8x) with 4-bit quantization. Based on this trade-off, we argue that it is more beneficial to keep the biases in full precision.
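The 7.9x figure follows from a back-of-the-envelope calculation (our own sketch; it assumes biases make up roughly 0.2% of the parameters, as in Section 3.3, and ignores the per-tensor scale factors):

```python
def compression_rate(bits=4, bias_frac=0.002):
    """Overall rate vs. 32-bit floats when a bias_frac share of the
    parameters stays in full precision."""
    bits_per_param = bits * (1.0 - bias_frac) + 32.0 * bias_frac
    return 32.0 / bits_per_param
```

This gives roughly 7.9x at 4 bits instead of the ideal 8x, and roughly 30x at 1 bit.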
Retraining improves quality in general. After retraining, the quality differences between the various scaling and bias quantization configurations are minimal. These results suggest that retraining helps the model fine-tune to the new quantized parameter space.
To show the improvement of our method, we compare several compression approaches against our 4-bit quantization with retraining and without bias quantization. An arguably naive way to reduce model size is to use smaller unit sizes. For the Transformer, we set the feed-forward dimension to 512 (from 2048) and the embedding size to 128 (from 512). For the RNN, we set the RNN dimension to 320 (from 1024) and the embedding size to 160 (from 512). This way, the model sizes of both architectures are roughly equal to those of the 4-bit compressed models.
We also include a fixed-point quantization approach as a comparison, based on Junczys-Dowmunt et al. (2018), with a few modifications: first, we apply retraining, which is absent from their implementation; we also skip bias quantization; finally, we optimize the scaling factor instead of using the suggested max-based scale.
Table 2:

Method                   | Transformer   | RNN
Baseline                 | 35.66         | 34.28
Reduced Dimension        | 29.03 (-6.63) | 30.88 (-3.40)
Fixed-Point Quantization | 34.61 (-1.05) | 34.05 (-0.23)
Ours                     | 35.47 (-0.19) | 34.22 (-0.06)
Table 2 summarizes the results. Reducing the model size by simply shrinking the dimensions performs worst. Logarithm-based quantization performs better than fixed-point quantization on both architectures.

The RNN model appears more robust to compression: RNN models suffer less quality degradation in every compression scenario. Our hypothesis is that the gradients computed with a highly compressed model are very noisy, resulting in noisy parameter updates. This finding is in line with prior research stating that the Transformer is more sensitive to noisy training conditions (Chen et al., 2018; Aji and Heafield, 2019).
4.3 Quantized Dot-Product
We now apply logarithmic quantization to all dot-product inputs. We use the same quantization procedure as for the parameters; however, we do not fit the scaling factor, as doing so for every dot product is very inefficient. Hence, we try max-scale and fixed-scale. For the parameter quantization, we use the optimized scale with uncompressed biases, based on the previous experiment.
Table 3:

Method                     | Transformer   | RNN
Baseline                   | 35.66         | 34.28
+ Model Quantization       | 35.47 (-0.19) | 34.22 (-0.06)
+ Dot Product Quantization | 35.05 (-0.61) | 33.12 (-1.16)
Table 3 shows the results of this experiment. Generally, we see quality degradation compared to a full-precision dot product. There is no significant difference between max-scale and fixed-scale. Therefore, a fixed scale may be preferable, as it avoids the extra computation needed to determine the scale for every dot-product operation.
4.4 Beyond 4-bit Precision
With 4-bit quantization and uncompressed biases, we obtain a 7.9x compression rate. The bit-width can be set below 4 bits for an even better compression rate, albeit with more compression error. To explore this, we sweep several bit-widths, skipping bias quantization and optimizing the scaling factor.
Table 4:

     | Transformer                    | RNN
Bit  | Size (rate)    | BLEU (Δ)      | Size (rate)    | BLEU (Δ)
32   | 251 MB         | 35.66         | 361 MB         | 34.28
4    | 32 MB (7.88x)  | 35.47 (-0.19) | 46 MB (7.90x)  | 34.22 (-0.06)
3    | 24 MB (10.45x) | 34.95 (-0.71) | 34 MB (10.49x) | 34.11 (-0.17)
2    | 16 MB (15.50x) | 33.40 (-2.26) | 23 MB (15.59x) | 32.78 (-1.50)
1    | 8 MB (30.00x)  | 29.43 (-6.23) | 12 MB (30.35x) | 31.71 (-2.51)
Training an NMT system below 4-bit precision remains a challenge. As shown in Table 4, model performance degrades as fewer bits are used. While these results might still be acceptable, we argue that they can be improved. One interesting direction is to increase the unit size in extreme low-precision settings. We have shown that 4-bit precision nearly matches the full-precision model at a (near) 8x compression rate, and Han et al. (2015) have shown that 2-bit precision image classification can be achieved by scaling up the parameter size. An alternative approach is to use a different bit-width for each layer (Hwang and Sung, 2014; Anwar et al., 2015).

We can also see the RNN's robustness over the Transformer in this experiment, as the RNN models degrade less than their Transformer counterparts. The RNN model outperforms the Transformer when compressed to binary precision.
5 Conclusion
We compress model size in neural machine translation to approximately 7.9x smaller than 32-bit floats by using 4-bit logarithmic quantization. Bias terms behave differently and can be left uncompressed without significantly affecting the compression rate. We also find that retraining after quantization is necessary to restore the model's performance.
References

Aji and Heafield (2017). Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440-445.

Aji and Heafield (2019). Making asynchronous stochastic gradient descent work for transformers. arXiv preprint arXiv:1906.03496.

Anwar et al. (2015). Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1131-1135.

Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Barone et al. (2017). Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pp. 99-107.

Bengio et al. (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432.

Bojar et al. (2018). Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pp. 272-307.

Chen et al. (2018). The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849.

Han et al. (2015). Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

Hubara et al. (2016). Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107-4115.

Hubara et al. (2017). Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1), pp. 6869-6898.

Hwang and Sung (2014). Fixed-point feedforward deep neural network design using weights +1, 0, and -1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1-6.

Jacob et al. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704-2713.

Junczys-Dowmunt et al. (2018). Marian: cost-effective high-quality neural machine translation in C++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 129-135.

Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Lin et al. (2016). Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849-2858.

Lin et al. (2017). Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887.

Miyashita et al. (2016). Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.

Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311-318.

Post (2018). A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186-191.

Quinn and Ballesteros (2018). Pieces of eight: 8-bit neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pp. 114-120.

See et al. (2016). Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274.

Seide et al. (2014). 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association.

Sennrich et al. (2016a). Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86-96.

Sennrich et al. (2016b). Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715-1725.

Sennrich et al. (2017). The University of Edinburgh's neural MT systems for WMT17. In Proceedings of the Second Conference on Machine Translation, pp. 389-399.

Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998-6008.