Neural Machine Translation with 4-Bit Precision and Beyond

09/13/2019 ∙ by Alham Fikri Aji, et al.

Neural Machine Translation (NMT) is resource intensive. We design a quantization procedure to compress NMT models better for devices with limited hardware capability. We use logarithmic quantization instead of the more commonly used fixed-point quantization, based on the empirical fact that the parameter distribution is not uniform. We find that biases do not take much memory and show that they can be left uncompressed to improve overall quality without affecting the compression rate. We also propose an error-feedback mechanism during retraining, which preserves the quantization error as a stale gradient update. We empirically show that NMT models based on the Transformer or RNN architecture can be compressed to 4-bit precision without any noticeable quality degradation. Models can be compressed down to binary precision, albeit with lower quality. The RNN architecture appears more robust to compression than the Transformer.




1 Introduction

Neural networks have become the state of the art for machine translation Bojar et al. (2018). While neural machine translation (NMT) Bahdanau et al. (2014) yields better performance than its statistical counterpart, it is also more resource demanding. NMT embeds tokens as vectors, so the embedding layer has to store the vector representation of every token in both the source and target vocabulary. Moreover, current state-of-the-art architectures, such as the Transformer or deep RNNs, usually require multiple layers Vaswani et al. (2017); Barone et al. (2017).

Model quantization has been widely studied as a way to compress model size and speed up inference. However, most of this work has focused on convolutional neural networks for computer vision tasks Miyashita et al. (2016); Lin et al. (2016); Hubara et al. (2016, 2017); Jacob et al. (2018). Research on model quantization for NMT is limited.

We first explore the use of logarithmic quantization over fixed-point quantization Miyashita et al. (2016), based on the empirical finding that the parameter distribution is not uniform Lin et al. (2016); See et al. (2016). Parameter magnitudes also vary across layers, so we propose a better way to scale the quantization centers. We further note that biases do not consume a noticeable amount of memory, and therefore explore whether they need to be compressed at all. Lastly, we explore the significance of retraining in the model compression scenario. We adopt an error-feedback mechanism Seide et al. (2014) to preserve the compression error as a stale gradient during retraining, rather than discarding it at every update.

2 Related Work

A considerable amount of research on model quantization has been done in the area of computer vision with convolutional neural networks; research on model quantization in the area of neural machine translation is much more limited. In this section, we will therefore also refer to work on neural models for image processing where appropriate.

Lin et al. (2016), Hubara et al. (2016), Hubara et al. (2017), Jacob et al. (2018), and Junczys-Dowmunt et al. (2018) all use linear quantization. Lin et al. (2016) and Hubara et al. (2016) use a fixed scale parameter prior to model training; Junczys-Dowmunt et al. (2018) and Jacob et al. (2018) base it on the maximum tensor values observed for each matrix in the trained models.

Observing that their parameters are highly concentrated near 0 (see also Lin et al., 2016; See et al., 2016), Miyashita et al. (2016) opt for logarithmic quantization. They report an improvement in preserving model accuracy over linear quantization while achieving the same model compression rate.

Hubara et al. (2017) compress an LSTM-based architecture for language modeling to 4-bit without any quality degradation (while increasing the unit size by a factor of 3). See et al. (2016) prune an NMT model by removing any weight values below a certain threshold, achieving 80% model sparsity without any quality degradation.

The most relevant work for our purposes is the submission of Junczys-Dowmunt et al. (2018) to the 2018 Shared Task on Efficient Neural Machine Translation, which applied 8-bit linear quantization to NMT models without any noticeable deterioration in translation quality. Similarly, Quinn and Ballesteros (2018) proposed 8-bit matrix multiplication to speed up an NMT system.

3 Low Precision Neural Machine Translation

3.1 Log-Based Compression

Lin et al. (2016) and See et al. (2016) report that parameters in deep learning models are normally distributed, and most of them are small values. Therefore, we adopt logarithmic quantization, where each center is defined as ±2^(−q) for an integer q, similar to Miyashita et al. (2016). This allows more centers for smaller values, giving us more precision of representation where parameter density is highest.

When compressing the model to B-bit precision, a single bit is used for the sign, leaving B − 1 bits for representing the magnitudes, i.e., 2^(B−1) quantization centers. q is an integer with 0 ≤ q ≤ 2^(B−1) − 1. We use symmetric quantization: we apply the compression to the absolute value, then put back the sign after compression. Therefore, our quantization centers (in absolute value) range from 2⁰ down to 2^(−(2^(B−1)−1)).

However, we find that model parameters might not have the same magnitude as the quantization centers. To solve this, we scale the model values temporarily before quantizing, then re-scale them back to the original magnitude. This approach differs from that of Miyashita et al. (2016), where quantization centers are not scaled, so every layer shares the same centers.

Miyashita et al. (2016) quantize a value by rounding its logarithm to the closest integer. However, we found that this does not always quantize a value to the closest center. For example, rounding the logarithm would quantize 5.7 to 8 rather than the closer 4, because log₂(5.7) ≈ 2.51 rounds to 3. Instead, we divide the value by 1.5 before taking the logarithm and always round up, as shown in Eq. 1 below:

    q = clip(−⌈log₂(|v| / (1.5 · S))⌉, 0, 2^(B−1) − 1)
    v̂ = sign(v) · S · 2^(−q)        (1)
Base 2 was chosen for its simplicity of implementation: the rounded log can be computed by finding the leftmost '1' bit, and the power operation is just a bit shift. However, other bases can be used as well.
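The routine above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation: the function name, the guard against log of zero, and the floating-point (rather than bit-level) log are our own choices, and the scale S is taken as given.

```python
import numpy as np

def log_quantize(v, scale, bits=4):
    """Quantize a tensor to centers sign(v) * scale * 2^(-q).

    One bit holds the sign, leaving 2^(bits-1) magnitude centers
    scale*2^0, scale*2^-1, ..., scale*2^-(2^(bits-1)-1). Dividing by
    1.5 before the rounded-up log picks the center that is closest
    to the value in linear space.
    """
    sign = np.sign(v)
    a = np.abs(v) / scale
    # Guard against log2(0); zeros are restored by sign(v) == 0 anyway.
    q = -np.ceil(np.log2(np.maximum(a, 1e-38) / 1.5))
    # Clip exponents to the representable range.
    q = np.clip(q, 0, 2 ** (bits - 1) - 1)
    return sign * scale * 2.0 ** -q
```

For instance, with scale 1 and 4 bits, 0.3 maps to the center 0.25 rather than 0.5, since 0.25 is the closer center in linear space.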

Junczys-Dowmunt et al. (2018) and Jacob et al. (2018) scale the model based on its maximum value, which might be very unstable:

    S = maxᵢ |vᵢ|        (2)

Alternatively, Lin et al. (2016) and Hubara et al. (2016) use a pre-defined step size in their fixed-point quantization. Our objective is to select a scaling factor S such that the quantized parameters are as close to the originals as possible. Therefore, we optimize S to minimize the squared error between the original and the compressed parameters.

We propose a method to fit S with Expectation-Maximization. We first start with an initial scale S based on the parameters' maximum value. For a given S, we apply our quantization routine described in Equation 1, resulting in a center assignment qᵢ for every parameter vᵢ. For a given assignment, we fit a new scale S such that:

    S = argmin_S Σᵢ (vᵢ − v̂ᵢ)²        (3)

Substituting v̂ᵢ within Eq. 3 with the result of the last line in Eq. 1, we have:

    S = argmin_S Σᵢ (|vᵢ| − S · 2^(−qᵢ))²        (4)

To optimize the given objective, we take the first derivative of Equation 4 and set it to zero, which yields:

    S = (Σᵢ |vᵢ| · 2^(−qᵢ)) / (Σᵢ 2^(−2qᵢ))        (5)

We optimize S for each tensor independently.
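The alternating fit can be sketched as below. This is an illustrative implementation under our own assumptions (function name, fixed iteration count, no convergence check); the E-step assigns each value to its nearest center under the current scale, and the M-step refits the scale in closed form.

```python
import numpy as np

def fit_scale(v, bits=4, iters=10):
    """Fit the scaling factor S with EM, minimizing squared
    quantization error for one tensor."""
    x = np.abs(v[v != 0])
    s = x.max()  # initial scale from the parameters' maximum value
    for _ in range(iters):
        # E-step: exponent assignment under the current scale.
        q = np.clip(-np.ceil(np.log2(x / (1.5 * s))),
                    0, 2 ** (bits - 1) - 1)
        c = 2.0 ** -q
        # M-step: closed-form minimizer of sum_i (x_i - s*c_i)^2.
        s = np.sum(x * c) / np.sum(c * c)
    return s
```

Both steps are non-increasing in the squared error, so the fitted scale is never worse than the max-based initialization.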

3.2 Retraining

Unlike Junczys-Dowmunt et al. (2018), we retrain the model after the initial quantization to let it recover some of the quality loss. In the retraining phase, we compute the gradients normally in full precision. We then re-quantize the model after every parameter update, including re-fitting the scaling factors. The re-quantization error is preserved in a residual variable and added to the parameters at the next step Seide et al. (2014). This error-feedback mechanism was introduced in gradient compression to reduce the impact of compression errors by preserving them as stale gradient updates for the next batch Aji and Heafield (2017); Lin et al. (2017). We view re-applying quantization after parameter updates as a form of gradient compression, hence we explore error feedback to potentially improve the final model's quality.
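One retraining step with error feedback can be sketched as follows. Plain SGD stands in for the actual optimizer, `quantize` stands in for the log-quantization routine of Section 3.1, and all names are our own; the point is only where the residual enters.

```python
import numpy as np

def retrain_step(params, grad, lr, residual, quantize):
    """One update with error feedback: the re-quantization error is
    kept in `residual` and added back before the next update, instead
    of being discarded."""
    full = params + residual      # restore previously dropped error
    full = full - lr * grad       # ordinary full-precision update
    quantized = quantize(full)    # re-quantize after the update
    residual[:] = full - quantized  # carry the new error forward
    params[:] = quantized
    return params
```

With a plain rounding quantizer, a parameter of 1.2 is stored as 1.0 while the dropped 0.2 survives in the residual and re-enters at the next step.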

3.3 Handling Biases

We do not quantize bias values in the model. We found that they do not follow the same distribution as other parameters, and attempting to log-quantize them used only a fraction of the available quantization points. In any case, bias values do not take up a lot of memory relative to other parameters. In our Transformer architecture, they account for only 0.2% of the parameter values.

3.4 Low Precision Dot Products

Our end goal is to run a log-quantized model without decompressing it. Activations coming into a matrix multiplication are quantized on the fly; intermediate activations are not quantized.

We use the same log-based quantization procedure described in Section 3.1. However, we only attempt a max-based scale. Running the slower EM approach to optimize the scale before every dot product would not be fast enough for inference applications.

The derivatives of the ceiling and sign functions are zero almost everywhere and undefined in some places. For retraining purposes, we apply a straight-through estimator Bengio et al. (2013) to the ceiling function. For the sign function, we treat the quantization function differently for each individual value in v, based on its sign. Therefore, we now compute v̂ as:

    v̂ = S · 2^(−q)    if v > 0
    v̂ = −S · 2^(−q)   if v < 0        (6)

Since we multiply by a constant, the derivative will be multiplied by either 1 or −1. In the latter case, the derivative of −v with respect to v is −1, which returns the derivative's sign back to positive. Hence, the derivative of our quantization function is:

    ∂v̂/∂v = 1        (7)
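In an autograd framework this is typically written as a custom forward/backward pair; the two passes can be sketched in NumPy as below (names are ours, and the forward pass assumes nonzero inputs for brevity):

```python
import numpy as np

def quantize_forward(v, scale, bits=4):
    """Forward pass: log-quantize each value, handling the sign
    separately from the magnitude."""
    q = np.clip(-np.ceil(np.log2(np.abs(v) / (1.5 * scale))),
                0, 2 ** (bits - 1) - 1)
    return np.sign(v) * scale * 2.0 ** -q

def quantize_backward(grad_out):
    """Backward pass under the straight-through estimator: the ceiling
    and sign handling are treated as identity, so the derivative of the
    quantizer is 1 and the incoming gradient passes through unchanged."""
    return grad_out
```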
4 Experiments

4.1 Experiment Setup

We use systems for the WMT 2017 English-to-German news translation task in our experiments; these differ from the WNGT shared-task setting previously reported. We use back-translated monolingual corpora Sennrich et al. (2016a) and byte-pair encoding Sennrich et al. (2016b) to preprocess the corpus. Quality is measured in BLEU Papineni et al. (2002) using the sacreBLEU script Post (2018).

We first pre-train baseline models with both the Transformer and RNN architectures. Our Transformer model consists of six encoder and six decoder layers with tied embeddings. Our deep RNN model consists of eight layers of bidirectional LSTM. Models were trained synchronously with a dynamic batch size of 40 GB per batch using the Marian toolkit Junczys-Dowmunt et al. (2018). The models are trained for 8 epochs and optimized with Adam Kingma and Ba (2014). The remaining hyperparameters of both models follow the suggested configurations Vaswani et al. (2017); Sennrich et al. (2017).

4.2 4-bit Transformer Model

In this first experiment we explore different ways to scale the quantization centers, the significance of quantizing biases, and the significance of retraining. We use the pretrained Transformer model as our baseline and apply our quantization algorithm on top of it.

                          Without retraining        With retraining
Method                    Fixed   Max    Optimized  Fixed   Max    Optimized
Baseline                                   35.66
+ Model Quantization      25.20   28.08  33.33      34.92   34.81  35.26
+ No Bias Quantization    34.16   34.29  34.31      35.09   35.25  35.47
Table 1: 4-bit Transformer quantization performance for English-to-German translation, measured in BLEU. We explore different methods to find the scaling factor, as well as skipping bias quantization and retraining.

Table 1 summarizes the results. The simple, albeit unstable, max-based scaling performs better than a fixed quantization scale. However, fitting the scaling factor to minimize the squared quantization error produces the best quality. Interestingly, the BLEU differences between the scaling methods diminish after retraining.

We also see improvements from not quantizing biases, especially without retraining. Overall, we reach the highest BLEU score of 35.47 by using the optimized scale on top of uncompressed biases. Without bias quantization, we obtain a 7.9x compression ratio (instead of 8x) with 4-bit quantization. Given this trade-off, we argue that it is more beneficial to keep the biases in full precision.
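The 7.9x figure follows directly from the bias share reported in Section 3.3; a quick back-of-the-envelope check (the function name and the exact 0.2% fraction are illustrative):

```python
def compression_ratio(bits, bias_fraction=0.002):
    """Overall compression vs. 32-bit floats when `bias_fraction` of
    the parameters (the biases, ~0.2% for our Transformer) stays at
    32-bit while the rest is stored with `bits` bits."""
    return 32.0 / (bias_fraction * 32.0 + (1.0 - bias_fraction) * bits)
```

With 4-bit weights and full-precision biases this gives roughly 7.89x, consistent with the 7.88x and 7.90x rates in Table 4; quantizing the biases too would give exactly 8x.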

Retraining improves quality in general. After retraining, the quality differences between the various scaling and bias-quantization configurations are minimal. These results suggest that retraining helps the model fine-tune to the new quantized parameter space.

To show the improvement of our method, we compare several compression approaches against our 4-bit quantization with retraining and without bias quantization. One arguably naive way to reduce model size is to use smaller layer dimensions. For the Transformer, we set the feed-forward dimension to 512 (from 2048) and the embedding size to 128 (from 512). For the RNN, we set the RNN dimension to 320 (from 1024) and the embedding size to 160 (from 512). This way, the model sizes of both architectures are roughly equal to those of the 4-bit compressed models.

We also include a fixed-point quantization approach as a comparison, based on Junczys-Dowmunt et al. (2018). We make a few modifications: first, we apply retraining, which is absent in their implementation; we also skip bias quantization; finally, we optimize the scaling factor instead of using the suggested max-based scale.

Method                    Transformer      RNN
Baseline                  35.66            34.28
Reduced Dimension         29.03 (-6.63)    30.88 (-3.40)
Fixed-Point Quantization  34.61 (-1.05)    34.05 (-0.23)
Ours                      35.47 (-0.19)    34.22 (-0.06)
Table 2: Model performance (in BLEU) of various quantization approaches, on both the Transformer and RNN architectures.

Table 2 summarizes the results. Reducing the model size by simply shrinking the dimensions performs worst. Logarithmic quantization performs better than fixed-point quantization on both architectures.

The RNN model appears more robust to compression, showing less quality degradation in every compression scenario. Our hypothesis is that the gradients computed with a highly compressed model are very noisy, resulting in noisy parameter updates. This finding is in line with prior research stating that the Transformer is more sensitive to noisy training conditions Chen et al. (2018); Aji and Heafield (2019).

4.3 Quantized Dot-Product

We now apply logarithmic quantization to all dot-product inputs. We use the same quantization procedure as for the parameters, but do not fit the scaling factor, as doing so for every dot product is very inefficient. Instead, we try max-based and fixed scales. For parameter quantization, we use the optimized scale with uncompressed biases, based on the previous experiment.

Method                       Transformer      RNN
Baseline                     35.66            34.28
+ Model Quantization         35.47 (-0.19)    34.22 (-0.06)
+ Dot Product Quantization   35.05 (-0.61)    33.12 (-1.16)
Table 3: Model performance (in BLEU) of model quantization with dot-product quantization, on both the Transformer and RNN architectures.

Table 3 shows the results of this experiment. Generally, we see quality degradation compared to a full-precision dot product. There is no significant difference between the max-based and fixed scales; using a fixed scale might therefore be preferable, as it avoids the extra computation needed to determine the scale for every dot-product operation.

4.4 Beyond 4-bit precision

With 4-bit quantization and uncompressed biases, we obtain a 7.9x compression rate. The bit-width can be set below 4 to obtain an even better compression rate, albeit with more compression error. To explore this, we sweep several bit-widths, skipping bias quantization and optimizing the scaling factor.

Bit   Transformer                      RNN
      Size (rate)      BLEU (Δ)        Size (rate)      BLEU (Δ)
32    251 MB           35.66           361 MB           34.28
4     32 MB (7.88x)    35.47 (-0.19)   46 MB (7.90x)    34.22 (-0.06)
3     24 MB (10.45x)   34.95 (-0.71)   34 MB (10.49x)   34.11 (-0.17)
2     16 MB (15.50x)   33.40 (-2.26)   23 MB (15.59x)   32.78 (-1.50)
1     8 MB (30.00x)    29.43 (-6.23)   12 MB (30.35x)   31.71 (-2.51)
Table 4: Compression rate and performance of both the Transformer and RNN at various bit-widths. The compression rates of the Transformer and RNN differ because they have different bias-to-parameter size ratios.

Training an NMT system below 4-bit precision remains a challenge. As shown in Table 4, model performance degrades as fewer bits are used. While these results might still be acceptable, we argue that they can be improved. One interesting idea is to increase the unit size in extreme low-precision settings. We have shown that 4-bit precision performs on par with the full-precision model at a (near) 8x compression rate. In addition, Han et al. (2015) have shown that 2-bit precision image classification can be achieved by scaling up the parameter size. An alternative approach is to use a different bit-width for each layer Hwang and Sung (2014); Anwar et al. (2015).

We again see the RNN's robustness over the Transformer in this experiment, as the RNN models degrade less than their Transformer counterparts. The RNN model outperforms the Transformer when compressed to binary precision.

5 Conclusion

We compress neural machine translation models to approximately 7.9x smaller than 32-bit floats with 4-bit logarithmic quantization. Bias terms behave differently and can be left uncompressed without affecting the compression rate significantly. We also find that retraining after quantization is necessary to restore the model's performance.


  • A. F. Aji and K. Heafield (2017) Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 440–445. Cited by: §3.2.
  • A. F. Aji and K. Heafield (2019) Making asynchronous stochastic gradient descent work for transformers. arXiv preprint arXiv:1906.03496. Cited by: §4.2.
  • S. Anwar, K. Hwang, and W. Sung (2015) Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1131–1135. Cited by: §4.4.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • A. V. M. Barone, J. Helcl, R. Sennrich, B. Haddow, and A. Birch (2017) Deep architectures for neural machine translation. In Proceedings of the Second Conference on Machine Translation, pp. 99–107. Cited by: §1.
  • Y. Bengio, N. Léonard, and A. Courville (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432. Cited by: §3.4.
  • O. Bojar, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, and C. Monz (2018) Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation (WMT), Volume 2: Shared Task Papers, pp. 272–307. Cited by: §1.
  • M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, et al. (2018) The best of both worlds: combining recent advances in neural machine translation. arXiv preprint arXiv:1804.09849. Cited by: §4.2.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149. Cited by: §4.4.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115. Cited by: §1, §2, §3.1.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2017) Quantized neural networks: training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18 (1), pp. 6869–6898. Cited by: §1, §2, §2.
  • K. Hwang and W. Sung (2014) Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pp. 1–6. Cited by: §4.4.
  • B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko (2018) Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2704–2713. Cited by: §1, §2, §3.1.
  • M. Junczys-Dowmunt, K. Heafield, H. Hoang, R. Grundkiewicz, and A. Aue (2018) Marian: cost-effective high-quality neural machine translation in c++. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 129–135. Cited by: §2, §2, §3.1, §3.2, §4.1, §4.2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • D. Lin, S. Talathi, and S. Annapureddy (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858. Cited by: §1, §1, §2, §2, §3.1, §3.1.
  • Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally (2017) Deep gradient compression: reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887. Cited by: §3.2.
  • D. Miyashita, E. H. Lee, and B. Murmann (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025. Cited by: §1, §1, §2, §3.1, §3.1, §3.1.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.1.
  • M. Post (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 186–191. Cited by: §4.1.
  • J. Quinn and M. Ballesteros (2018) Pieces of eight: 8-bit neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers), pp. 114–120. Cited by: §2.
  • A. See, M. Luong, and C. D. Manning (2016) Compression of neural machine translation models via pruning. arXiv preprint arXiv:1606.09274. Cited by: §1, §2, §2, §3.1.
  • F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, Cited by: §1, §3.2.
  • R. Sennrich, A. Birch, A. Currey, U. Germann, B. Haddow, K. Heafield, A. V. M. Barone, and P. Williams (2017) The University of Edinburgh’s neural mt systems for WMT17. In Proceedings of the Second Conference on Machine Translation, pp. 389–399. Cited by: §4.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016a) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. Cited by: §4.1.
  • R. Sennrich, B. Haddow, and A. Birch (2016b) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 1715–1725. Cited by: §4.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §4.1.