1 Introduction
Neural machine translation (NMT) (Bahdanau et al., 2014; Sutskever et al., 2014) has recently achieved remarkable performance, improving fluency and adequacy over phrase-based machine translation, and is being deployed in commercial settings (Koehn and Knowles, 2017). However, this comes at the cost of slow decoding speeds compared to phrase-based and syntax-based SMT (see Section 3).
NMT models are generally trained using 32-bit floating point values. At training time, multiple sentences can be processed in parallel in batches, leveraging graphics processing units (GPUs) to good advantage. This is also true of decoding for non-interactive applications such as bulk document translation.
Why is fast execution on CPUs important? First, CPUs are cheaper than GPUs, so fast CPU computation reduces commercial deployment costs. Second, for low-latency applications such as speech-to-speech translation (Neubig et al., 2017a), it is important to translate individual sentences quickly enough that users experience a seamlessly responsive application. Translating individual sentences with NMT requires many memory-bandwidth-intensive matrix-vector or matrix-narrow-matrix multiplications (Abdelfattah et al., 2016). In addition, the batch size is 1, and GPUs have no speed advantage over CPUs due to the lack of adequate parallel work (as evidenced by increasingly difficult batching scenarios in dynamic frameworks; Neubig et al., 2017b).
Others have successfully used low precision approximations to neural net models. Vanhoucke et al. (2011) explored 8-bit quantization for feed-forward neural nets for speech recognition. Devlin (2017) explored 16-bit quantization for machine translation. In this paper we show the effectiveness of 8-bit decoding for models that have been trained using 32-bit floating point values. Results show that 8-bit decoding does not hurt the fluency or adequacy of the output, while producing results 4-6x faster. In addition, the implementation is straightforward, and the models can be used as-is without altering training.
The paper is organized as follows: Section 2 reviews the attentional model of translation to be sped up, Section 3 presents the 8-bit quantization in our implementation, Section 4 presents automatic measurements of speed and translation quality plus human evaluations, Section 5 discusses the results and some illustrative examples, Section 6 describes prior work, and Section 7 concludes the paper.
2 The Attentional Model of Translation
Our translation system implements the attentional model of translation (Bahdanau et al., 2014), consisting of an encoder-decoder network with an attention mechanism.
The encoder uses a bidirectional GRU recurrent neural network (Cho et al., 2014) to encode a source sentence $x = (x_1, \ldots, x_{T_x})$, where $x_i$ is the embedding vector for the $i$th word and $T_x$ is the sentence length. The encoded form is a sequence of hidden states $h = (h_1, \ldots, h_{T_x})$, where each $h_i$ is computed as follows:

$$h_i = \left[ \overrightarrow{h}_i ; \overleftarrow{h}_i \right] \quad (1)$$

where $\overrightarrow{h}_i = \overrightarrow{f}(x_i, \overrightarrow{h}_{i-1})$ and $\overleftarrow{h}_i = \overleftarrow{f}(x_i, \overleftarrow{h}_{i+1})$. Here $\overrightarrow{f}$ and $\overleftarrow{f}$ are GRU cells.
Given $h$, the decoder predicts the target translation $y$ by computing the output token sequence $(y_1, \ldots, y_{T_y})$, where $T_y$ is the length of the sequence. At each time step $t$, the probability of each token $y_t$ from a target vocabulary $V_y$ is

$$p(y_t \mid h, y_{t-1}, \ldots, y_1) = g(y_{t-1}, s_t, c_t) \quad (2)$$

where $g$ is a two-layer feed-forward network over the embedding of the previous target word ($y_{t-1}$), the decoder hidden state ($s_t$), and the weighted sum of encoder states ($c_t$), followed by a softmax to predict the probability distribution over the output vocabulary.
We compute $s_t$ with a two-layer GRU as

$$s'_t = \mathrm{GRU}_1(y_{t-1}, s_{t-1}) \quad (3)$$

and

$$s_t = \mathrm{GRU}_2(s'_t, c_t) \quad (4)$$

where $s'_t$ is an intermediate state and $s_0$ is initialized from the encoder states. The two GRU units $\mathrm{GRU}_1$ and $\mathrm{GRU}_2$ together with the attention constitute the conditional GRU layer of Sennrich et al. (2017). The context vector $c_t$ is computed as
$$c_t = \sum_{i=1}^{T_x} \alpha_{ti} h_i \quad (5)$$

where the $\alpha_{ti}$ are the elements of $\alpha_t$, which is the output vector of the attention model. This is computed with a two-layer feed-forward network

$$e_{ti} = v_a^{\top} \tanh\left( W_a s'_t + U_a h_i \right) \quad (6)$$

where $W_a$ and $U_a$ are weight matrices, and $v_a$ is another matrix resulting in one real value $e_{ti}$ per encoder state $h_i$. $\alpha_t$ is then the softmax over $(e_{t1}, \ldots, e_{tT_x})$.
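To make Equations (5) and (6) concrete, the following is a minimal Eigen sketch of one attention step; the function name, variable names, and dimensions are illustrative and not taken from our engine.

```cpp
#include <Eigen/Dense>

// Sketch of one attention step: given encoder states H (one column per
// source word), the decoder state s, and the attention parameters of
// Eq. (6), return the context vector c_t of Eq. (5).
Eigen::VectorXf AttentionContext(const Eigen::MatrixXf& H,
                                 const Eigen::VectorXf& s,
                                 const Eigen::MatrixXf& Wa,
                                 const Eigen::MatrixXf& Ua,
                                 const Eigen::VectorXf& va) {
  // Eq. (6): e_i = va^T tanh(Wa s + Ua h_i), vectorized over all i.
  Eigen::MatrixXf pre = Ua * H;   // one column per source position
  Eigen::VectorXf dec = Wa * s;   // decoder contribution
  pre.colwise() += dec;           // broadcast across source positions
  Eigen::VectorXf e = pre.array().tanh().matrix().transpose() * va;
  // alpha_t = softmax(e); subtract the max for numerical stability.
  Eigen::ArrayXf w = (e.array() - e.maxCoeff()).exp();
  Eigen::VectorXf alpha = (w / w.sum()).matrix();
  // Eq. (5): c_t is the attention-weighted sum of encoder states.
  return H * alpha;
}
```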
We train our models using software written in the Theano framework (Bastien et al., 2012). Models are generally trained with batch sizes ranging from 64 to 128 and the unbiased Adam stochastic optimizer (Kingma and Ba, 2014). We use an embedding size of 620 and hidden layer sizes of 1000. We select model parameters according to the best BLEU score on a held-out development set over 10 epochs.
3 8-bit Translation
Our translation engine is a C++ implementation built on the Eigen matrix library, which provides efficient matrix operations. Each CPU core translates a single sentence at a time. The same engine supports both batch and interactive applications; the latter makes single-sentence translation latency important. We report speed as both words per second (WPS) and words per core second (WPCS), which is WPS divided by the number of cores running. This gives us a measure of overall scaling across many cores and memory buses as well as of single-sentence speed.
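The two reported quantities relate as follows (a trivial helper for exposition, not code from the engine):

```cpp
// WPS is total machine throughput; WPCS normalizes by the number of
// cores running translations, so WPS = WPCS * cores.
double WordsPerSecond(double words, double seconds) {
  return words / seconds;
}
double WordsPerCoreSecond(double words, double seconds, int cores) {
  return words / (seconds * cores);
}
```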
Phrase-based SMT systems, such as that of Tillmann (2006), run at 170 words per core second (3400 words per second) for English-German on a 20-core Xeon 2690 v2 system. Similarly, syntax-based SMT systems, such as that of Zhao and Al-Onaizan (2008), run at 21.5 words per core second (430 words per second) for the same language pair.
In contrast, our NMT system (described in Section 2) with 32-bit decoding runs at 6.5 words per core second (131 words per second). Our goal is to increase decoding speed for the NMT system to what can be achieved with phrase-based systems while maintaining the levels of fluency and adequacy that NMT offers.
Benchmarks of our NMT decoder unsurprisingly show matrix multiplication as the number one consumer of compute cycles. In Table 1 we see that more than 85% of computation is spent in Eigen's matrix and vector multiply routines (Eigen matrix vector product and Eigen matrix multiply); these dwarf the costs of the transcendental function computations and the bias additions.
Time (%)  Function
–  Eigen matrix vector product
–  Eigen matrix multiply
–  NMT decoder layer
–  Eigen fast tanh
–  NMT tanh wrapper

Table 1: Profile of the decoder with 32-bit floating point math (percentages of total time).
Given this distribution of computing time, it makes sense to accelerate the matrix operations as much as possible. One approach is quantization: replacing 32-bit floating point math operations with 8-bit integer approximations in neural nets has been shown to give speedups at similar accuracy (Vanhoucke et al., 2011). We apply a similar optimization to our translation system, both to reduce memory traffic and to increase parallelism in the CPU.
Our 8-bit matrix multiply routine uses a naive implementation with no blocking or copy. The code is written with Intel SSE4 vector instructions and computes 4 rows at a time, similar to Devlin (2017). For simplicity, the 8-bit matrix multiplication writes its results into a 32-bit floating point matrix. This has the advantage of not needing to know the scale of the result. In addition, the output is a vector or narrow matrix, so little extra memory bandwidth is consumed.
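The scalar sketch below shows the shape of this computation; the per-matrix scaling policy, the names, and the loop structure are illustrative assumptions, while the production routine performs the same arithmetic 4 rows at a time with SSE4 intrinsics.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// An int8 matrix plus the scale used to quantize it: q = round(x * scale).
struct QuantizedMatrix {
  std::vector<int8_t> data;  // row-major, rows x cols
  int rows = 0, cols = 0;
  float scale = 1.f;
};

// Quantize a float matrix with one scale per matrix, chosen so the
// largest magnitude maps to 127 (a common choice; an assumption here).
QuantizedMatrix Quantize(const std::vector<float>& x, int rows, int cols) {
  float maxabs = 0.f;
  for (float v : x) maxabs = std::max(maxabs, std::fabs(v));
  if (maxabs == 0.f) maxabs = 1.f;  // avoid dividing by zero
  QuantizedMatrix q;
  q.rows = rows; q.cols = cols; q.scale = 127.f / maxabs;
  q.data.resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    q.data[i] = static_cast<int8_t>(std::lround(x[i] * q.scale));
  return q;
}

// Naive C = A * B with no blocking or copy. Products are accumulated in
// 32-bit integers and the result is written as 32-bit floats, so the
// scale of the output never needs to be known in advance.
std::vector<float> MultiplyI8(const QuantizedMatrix& A,
                              const QuantizedMatrix& B) {
  std::vector<float> C(static_cast<std::size_t>(A.rows) * B.cols, 0.f);
  const float unscale = 1.f / (A.scale * B.scale);
  for (int i = 0; i < A.rows; ++i) {
    for (int j = 0; j < B.cols; ++j) {  // B is a vector or narrow matrix
      int32_t acc = 0;
      for (int k = 0; k < A.cols; ++k)
        acc += static_cast<int32_t>(A.data[i * A.cols + k]) *
               static_cast<int32_t>(B.data[k * B.cols + j]);
      C[i * B.cols + j] = static_cast<float>(acc) * unscale;
    }
  }
  return C;
}
```

Writing the product into 32-bit floats is the design choice that removes the need to pick an output scale, which is discussed further below.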
Multi-layered blocked matrix multiply algorithms are significantly faster than naive algorithms (Goto and Geijn, 2008). This is because multiplying $N \times N$ matrices performs $O(N^3)$ math operations on only $O(N^2)$ elements, so it is worth significant effort to minimize memory operations while maximizing math operations. However, when multiplying an $N \times N$ matrix by an $N \times P$ matrix where $P$ is very small ($\leq 10$), memory operations dominate and performance does not benefit from the complex algorithm. When decoding single sentences, we typically set our beam size to a value less than 8, following standard practice for such systems (Koehn and Knowles, 2017). We actually find that at such small values of $P$, the naive algorithm is a bit faster.
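A rough worked version of this argument, using standard operation counts rather than measurements: multiplying an $N \times N$ matrix by an $N \times P$ matrix takes about $2N^2P$ arithmetic operations and touches at least $N^2 + 2NP$ matrix elements, so

$$\frac{\text{math ops}}{\text{memory ops}} \approx \frac{2N^2P}{N^2 + 2NP} \approx 2P \quad \text{for } P \ll N.$$

For beam-sized $P$ the multiply is therefore memory-bound no matter how it is blocked, while for $P = N$ the ratio grows as $2N/3$ and blocking pays off.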
Time (%)  Function
–  8-bit matrix multiply
–  Eigen fast tanh
–  NMT decoder layer
–  NMT tanh wrapper

Table 2: Profile of the decoder after converting matrix operations to 8-bit integer math.
Table 2 shows the profile after converting the matrix routines to 8-bit integer computation. There is only one entry for matrix-matrix and matrix-vector multiplies since both are handled by the same routine. After conversion, tanh and sigmoid still consume less than 7% of CPU time; in light of that, we decided not to convert these operations to integer.
It is possible to replace all the operations with 8-bit approximations (Wu et al., 2016), but this makes the implementation more complex, as the scale of the result of a matrix multiplication must be known to correctly output 8-bit numbers without dangerous loss of precision.
Assume we have two 1000x1000 matrices with values in the range $[-1, 1]$; an individual dot product in the result could then be as large as 1000. In practice with neural nets, the scale of the result is similar to that of the input matrices, so if we pick the 8-bit scale of the result for this worst case, the loss of precision gives us a matrix full of zeros. The choices are either to scale the result of the matrix multiplication by a reasonable value or to store the result as floating point. We opted for the latter.
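A toy computation that illustrates the failure mode with the numbers above (values are illustrative):

```cpp
#include <cmath>
#include <cstdio>

int main() {
  // Dot products of length 1000 over values in [-1, 1] are bounded by
  // 1000 in magnitude, so a worst-case int8 output scale has steps of
  // 1000 / 127, roughly 7.9.
  const float worst_case_bound = 1000.f;
  const float step = worst_case_bound / 127.f;
  // In practice results stay near the input scale, so a typical value
  // quantizes to zero under the worst-case scale.
  const float typical_result = 1.3f;  // assumed realistic magnitude
  std::printf("%ld\n", std::lround(typical_result / step));  // prints 0
  return 0;
}
```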
8-bit computation achieves 32.3 words per core second (646 words per second), compared to the 6.5 words per core second (131 words per second) of the 32-bit system (both systems load parameters from the same model). This is even faster than the syntax-based system, which runs at 21.5 words per core second (430 words per second). Table 3 summarizes running speeds for the phrase-based SMT system, the syntax-based system, and NMT with 32-bit and 8-bit decoding.
System  WPCS
Phrase-based  170
Syntax-based  21.5
NMT 32-bit  6.5
NMT 8-bit  32.3

Table 3: Decoding speeds in words per core second (WPCS).
4 Measurements
To demonstrate the effectiveness of approximating the floating point math with 8-bit integer computation, we show automatic evaluation results on several models, as well as independent human evaluations. We report results on Dutch-English, English-Dutch, Russian-English, German-English and English-German models. Table 4 shows training data sizes and vocabulary sizes. All models have 620-dimensional embeddings and 1000-dimensional hidden states.
Lang  Training Sentences  Source Vocabulary  Target Vocabulary
En-Nl  17M  42112  33658
Nl-En  17M  33658  42212
Ru-En  31M  42388  42840
En-De  31M  57867  63644
De-En  31M  63644  57867

Table 4: Training data and vocabulary sizes for each model.
4.1 Automatic results
Here we report automatic scores comparing decoding results from the 32-bit and 8-bit implementations. As others have found (Wu et al., 2016), 8-bit implementations impact quality very little.
In Table 6, we compare automatic scores and speeds for Dutch-English, English-Dutch, Russian-English, German-English and English-German models on news data. The English-German model was run both as a single model (1x) and as an ensemble of two models (2x) (Freitag et al., 2017). Table 5 gives the number of sentences and average sentence lengths for the test sets used.
Lang  Test Sentences  Src Sent Length  Tgt Sent Length
En-Nl  990  22.5  25.9
Nl-En  990  25.9  22.5
Ru-En  555  27.2  35.2
En-De  168  51.8  46.0
De-En  168  46.0  51.8

Table 5: Test set sizes and average source/target sentence lengths.
Lang  Mode  BLEU  Speed (WPCS)
En-Nl  32-bit  –  –
En-Nl  8-bit  –  –
Nl-En  32-bit  –  –
Nl-En  8-bit  –  –
Ru-En  32-bit  –  –
Ru-En  8-bit  –  –
De-En  32-bit  –  –
De-En  8-bit  –  –
En-De 2x  32-bit  –  –
En-De 2x  8-bit  –  –
En-De 1x  32-bit  –  –
En-De 1x  8-bit  –  –

Table 6: BLEU scores and decoding speeds for 32-bit and 8-bit decoding.
Speed is reported in words per core second (WPCS), which gives a better sense of the speed of individual engines when deployed on multi-core systems with all cores performing translations; total throughput is simply the product of WPCS and the number of cores in the machine. The reported speed is the median of 9 runs to ensure consistent numbers. The results show a 4-6x speedup over 32-bit floating point decoding. German-English shows the largest deficit for 8-bit mode versus 32-bit mode, but since that test set includes only 168 sentences, this may be a spurious difference.
4.2 Human evaluation
These automatic results suggest that 8-bit quantization causes no perceptible degradation. To confirm this, we carried out a human evaluation experiment.
In Table 7, we show the results of human evaluations on some of the same language pairs as in the previous section. For each pair, an independent native speaker of the non-English language (who is also proficient in English) scored 100 randomly selected sentences. The sentences were shuffled during the evaluation to avoid evaluator bias toward particular runs. We use a scale from 0 to 5, with 0 being unintelligible and 5 being a perfect translation.
Language  32-bit  8-bit
En-Nl  4.02  4.08
Nl-En  4.03  4.03
Ru-En  4.10  4.06
En-De 2x  4.05  4.16
En-De 1x  3.84  3.90

Table 7: Human evaluation scores on a 0-5 scale (higher is better).
Source:  Sie standen seit 1946 an der Parteispitze
32-bit (720 ms):  They had been at the party leadership since 1946
8-bit (180 ms):  They stood at the top of the party since 1946.

Source:  So erwarten die Experten für dieses Jahr lediglich einen Anstieg der Weltproduktion von 3,7 statt der im Juni prognostizierten 3,9 Prozent. Für 2009 sagt das Kieler Institut sogar eine Abschwächung auf 3,3 statt 3,7 Prozent voraus.
32-bit (4440 ms):  For this year, the experts expect only an increase in world production of 3.7 instead of the 3.9 percent forecast in June. In 2009, the Kiel Institute predictated a slowdown to 3.3 percent instead of 3.7 percent.
8-bit (750 ms):  For this year, the experts expect only an increase in world production of 3.7 instead of the 3.9 percent forecast in June. In 2009, the Kiel Institute even forecast a slowdown to 3.3% instead of 3.7 per cent.

Source:  Heftige Regenfälle wegen “Ike” werden möglicherweise schwerere Schäden anrichten als seine Windböen. Besonders gefährdet sind dicht besiedelte Gebiete im Tal des Rio Grande, die noch immer unter den Folgen des Hurrikans “Dolly” im Juli leiden.
32-bit (6150 ms):  Heavy rainfall due to “Ike” may cause more severe damage than its gusts of wind, particularly in densely populated areas in the Rio Grande valley, which are still suffering from the consequences of the “dolly” hurricane in July.
8-bit (1050 ms):  Heavy rainfall due to “Ike” may cause heavier damage than its gusts of wind, particularly in densely populated areas in the Rio Grande valley, which still suffer from the consequences of the “dolly” hurricane in July.

Table 8: German-English example translations with decoding times.
Source:  Het is tijd om de kloof te overbruggen.
32-bit (730 ms):  It’s time to bridge the gap.
8-bit (180 ms):  It is time to bridge the gap.

Source:  Niet dat Barientos met zijn vader van plaats zou willen wisselen.
32-bit (1120 ms):  Not that Barientos would want to change his father’s place.
8-bit (290 ms):  Not that Barientos would like to switch places with his father.

Table 9: Dutch-English example translations with decoding times.
Table 7 shows that the automatic scores from the previous section are sustained by human judgments: 8-bit decoding is as good as 32-bit decoding according to the human evaluators.
5 Discussion
Having a faster NMT engine with no loss of accuracy is commercially useful. In our deployment scenarios, it is the difference between an interactive user experience that is sluggish and one that is not. Even in batch mode, the same throughput can be delivered with a quarter of the hardware.
In addition, this speedup makes it practical to deploy small ensembles of models. As shown for the En-De model in Table 6, an ensemble can deliver higher accuracy at the cost of a 2x slowdown. This work makes it possible to translate with higher quality while still being at least twice as fast as the previous baseline.
As the numbers reported in Section 4 demonstrate, 8-bit and 32-bit decoding have similar average quality. As expected, the outputs produced by the two decoders are not identical: on a run of 166 De-En sentences, only 51 translations were identical between the two. Still, our human evaluation results and the automatic scores suggest that there is no systematic degradation from the 8-bit decoder compared to the 32-bit decoder. To illustrate, Table 8 shows several examples of output from the two systems for a German-English model, and Table 9 shows two more examples from a Dutch-English model.
In general, there are minor differences without any loss in adequacy or fluency due to 8-bit decoding. Sentence 2 in Table 8 shows a spelling error (“predictated”) in the 32-bit output due to reassembly of incorrect subword units.¹

¹In order to limit the vocabulary, we use BPE subword units (Sennrich et al., 2016) in all models.
6 Related Work
Reducing the resources required for decoding neural nets in general, and neural machine translation in particular, has attracted some attention in recent years.
Vanhoucke et al. (2011) explored accelerating neural nets with 8-bit integer decoding for speech recognition. They demonstrated that low precision computation could be used with no significant loss of accuracy. Han et al. (2015) investigated highly compressing image classification neural networks using network pruning, quantization, and Huffman coding so as to fit completely into on-chip cache, seeing significant improvements in speed and energy efficiency while keeping accuracy losses small.
Focusing on machine translation, Devlin (2017) implemented 16-bit fixed-point integer math to speed up matrix multiplication operations, obtaining a 2.59x improvement and competitive BLEU scores on WMT English-French NewsTest2014 with significant speedup. Similarly, Wu et al. (2016) apply 8-bit end-to-end quantization to translation models and also show that automatic metrics do not suffer as a result. In their work, however, quantization requires modifying model training to limit the size of matrix outputs.
7 Conclusions and Future Work
In this paper, we show that 8-bit decoding for neural machine translation runs 4-6x faster than a similarly optimized floating point implementation. We show that the quality of this approximation is similar to that of the 32-bit version, and that it is unnecessary to modify the training procedure to produce models compatible with 8-bit decoding.
To conclude, this paper shows that 8-bit decoding is as good as 32-bit decoding both by automatic measures and from a human perception perspective, while improving latency substantially.
In the future, we plan to implement a multi-layered matrix multiplication that falls back to the naive algorithm for matrix-panel multiplications. This will provide speed for batch decoding in applications that can take advantage of it. We also plan to explore training with low precision for faster experiment turnaround time.
Our results offer hints of improved accuracy rather than just parity. Other work has used training as part of the compression process; we would like to see whether training quantized models changes the results for better or worse.
References
 Abdelfattah et al. (2016) Ahmad Abdelfattah, David Keyes, and Hatem Ltaief. 2016. KBLAS: An optimized library for dense matrix-vector multiplication on GPU accelerators. ACM Trans. Math. Softw. 42(3):18:1–18:31. https://doi.org/10.1145/2818311.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473.
 Bastien et al. (2012) Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, James Bergstra, Ian J. Goodfellow, Arnaud Bergeron, Nicolas Bouchard, and Yoshua Bengio. 2012. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop.
 Cho et al. (2014) KyungHyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder-decoder approaches. CoRR abs/1409.1259.

 Devlin (2017) Jacob Devlin. 2017. Sharp models on dull hardware: Fast and accurate neural machine translation decoding on the CPU. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 2820–2825.
 Freitag et al. (2017) Markus Freitag, Yaser Al-Onaizan, and Baskaran Sankaran. 2017. Ensemble distillation for neural machine translation. CoRR abs/1702.01802.
 Goto and Geijn (2008) Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of highperformance matrix multiplication. ACM Trans. Math. Softw. 34(3):12:1–12:25.
 Han et al. (2015) Song Han, Huizi Mao, and William J. Dally. 2015. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR abs/1510.00149.
 Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
 Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, Vancouver, pages 28–39.
 Neubig et al. (2017a) Graham Neubig, Kyunghyun Cho, Jiatao Gu, and Victor O. K. Li. 2017a. Learning to translate in real-time with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 1: Long Papers. pages 1053–1062.
 Neubig et al. (2017b) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017b. Dynet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980 .
 Sennrich et al. (2017) Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, Valencia, Spain, pages 65–68. http://aclweb.org/anthology/E17-3017.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, pages 1715–1725.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13, 2014, Montreal, Quebec, Canada. pages 3104–3112.
 Tillmann (2006) Christoph Tillmann. 2006. Efficient dynamic programming search algorithms for phrase-based SMT. In Proceedings of the Workshop on Computationally Hard Problems and Joint Inference in Speech and Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, CHSLP ’06, pages 9–16.
 Vanhoucke et al. (2011) Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. 2011. Improving the speed of neural networks on cpus. In Deep Learning and Unsupervised Feature Learning Workshop, NIPS 2011.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
 Zhao and Al-Onaizan (2008) Bing Zhao and Yaser Al-Onaizan. 2008. Generalizing local and non-local word-reordering patterns for syntax-based machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Stroudsburg, PA, USA, EMNLP ’08, pages 572–581. http://dl.acm.org/citation.cfm?id=1613715.1613785.