1 Introduction
Attentional sequence-to-sequence models have become the new standard for machine translation over the last two years, and with the unprecedented improvements in translation accuracy comes a new set of technical challenges. One of the biggest challenges is the high training and decoding cost of these neural machine translation (NMT) systems, which is often at least an order of magnitude higher than that of a phrase-based system trained on the same data. For instance, phrasal MT systems were able to achieve single-threaded decoding speeds of 100-500 words/sec on decade-old CPUs (Quirk and Moore, 2007), while Jean et al. (2015) reported single-threaded decoding speeds of 8-10 words/sec on a shallow NMT system. Wu et al. (2016) were able to reach CPU decoding speeds of 100 words/sec for a deep model, but used 44 CPU cores to do so. There has been recent work on speeding up decoding by reducing the search space (Kim and Rush, 2016), but little on computational improvements.
In this work, we consider a production scenario which requires low-latency, high-throughput NMT decoding. We focus on CPU-based decoders, since GPU/FPGA/ASIC-based decoders require specialized hardware deployment and impose logistical constraints such as batch processing. Efficient CPU decoders can also be used for on-device mobile translation. We focus on single-threaded decoding and single-sentence processing, since multiple threads can be used to reduce latency but not total throughput.
We approach this problem from two angles: In Section 4, we describe a number of techniques for improving the speed of the decoder, and obtain a 4.4x speedup over a highly efficient baseline. These speedups do not affect decoding results, so they can be applied universally. In Section 5, we describe a simple but powerful network architecture which uses a single RNN (GRU/LSTM) layer at the bottom with a large number of fully-connected (FC) layers on top, and obtains improvements similar to a deep RNN model at a fraction of the training and decoding cost.
2 Data Set
The data set we evaluate on in this work is WMT English-French NewsTest2014, which has 380M words of parallel training data and a 3003-sentence test set. The NewsTest2013 set is used for validation. In order to compare our architecture to past work, we train a word-based system without any data augmentation techniques. The network architecture is very similar to Bahdanau et al. (2014), and specific details of layer size/depth are provided in subsequent sections. We use an 80k source/target vocab and perform standard unk-replacement (Jean et al., 2015) on out-of-vocabulary words. Training is performed using an in-house toolkit.
3 Baseline Decoder
Our baseline decoder is a standard beam search decoder (Sutskever et al., 2014) with several straightforward performance optimizations:

It is written in pure C++, with no heap allocation done during the core search.

A candidate list is used to reduce the output softmax from 80k to ~. We run word alignment (Brown et al., 1993) on the training data and keep the top 20 context-free translations for each source word in the test sentence.

The Intel MKL library is used for matrix multiplication, as it is the fastest floating point matrix multiplication library for CPUs.

Early stopping is performed when the top partial hypothesis has a log-score worse than that of the best completed hypothesis.

Batching of matrix multiplication is applied when possible. Since each sentence is decoded separately, we can only batch over the hypotheses in the beam as well as the input vectors on the source side.
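To make the candidate-list step concrete, the following is a minimal sketch of how such a list could be assembled for one sentence. It assumes the word alignments are already available as (source word, target word) pairs; the function names, the `always_include` symbols and the toy data are hypothetical and are not taken from the decoder described here.

```python
from collections import Counter, defaultdict


def build_translation_table(aligned_pairs, top_k=20):
    """Count aligned (source, target) word pairs and keep the top_k
    most frequent target words for every source word."""
    counts = defaultdict(Counter)
    for src_word, trg_word in aligned_pairs:
        counts[src_word][trg_word] += 1
    return {
        src: [word for word, _ in counter.most_common(top_k)]
        for src, counter in counts.items()
    }


def candidate_list(source_sentence, table, always_include=("</s>", "<unk>")):
    """Union of the per-word translations for one source sentence.
    The output softmax would then be evaluated only over these words.
    The always_include symbols are an assumption of this sketch."""
    candidates = set(always_include)
    for word in source_sentence:
        candidates.update(table.get(word, ()))
    return sorted(candidates)


# Toy alignment data (hypothetical).
pairs = [("cat", "chat"), ("cat", "chat"), ("cat", "minou"), ("black", "noir")]
table = build_translation_table(pairs, top_k=1)
print(candidate_list(["black", "cat", "unseen"], table))
```

Because the table depends only on the training alignments, it can be built once offline; only the cheap set union runs per test sentence.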
4 Decoder Speed Improvements
This section describes a number of speedups that can be made to a CPU-based attentional sequence-to-sequence beam decoder. Crucially, none of these speedups affect the actual mathematical computation of the decoder, so they can be applied to any network architecture with a guarantee that they will not affect the results. (Some speedups apply quantization, which leads to small random perturbations, but these change the BLEU score by less than 0.02.)
The model used here is similar to the original implementation of Bahdanau et al. (2014). The exact target GRU equation is:
Where , , , are learned parameters, is the hidden vector of the source word, is the previous target recurrent vector, is the target input (e.g., embedding of previous word).
We also denote the various hyperparameters:
for the beam size, for the recurrent hidden size, is the embedding size, for the source sentence length, and for the target sentence length, is the vocab size.
4.1 16-Bit Matrix Multiplication
Although CPU-based matrix multiplication libraries are highly optimized, they typically only operate on 32/64-bit floats, even though DNNs can almost always operate on much lower precision without degradation of accuracy (Han et al., 2016). However, low-precision math (1-bit to 7-bit) is difficult to implement efficiently on the CPU, and even 8-bit math has limited support in terms of vectorized (SIMD) instruction sets. Here, we use 16-bit fixed-point integer math, since it has first-class SIMD support and requires minimal changes to training. Training is still performed with 32-bit floats, but we clip the weights to the range [-1.0, 1.0] and the relu activation to [0.0, 10.0] to ensure that all values fit into 16 bits with high precision. A reference implementation of 16-bit multiplication in C++/SSE2 is provided in the supplementary material (included as an ancillary file in the arXiv submission, on the right side of the submission page), with a thorough description of low-level details.
A comparison between our 16-bit integer implementation and Intel MKL's 32-bit floating point multiplication is given in Figure 1. We can see that 16-bit multiplication is 2x-3x faster than 32-bit multiplication for batch sizes between 2 and 8, which is the typical range of the beam size. We are able to achieve greater than a 2x speedup in certain cases because we pre-process the weight matrix offline to have optimal memory layout, which is a capability BLAS libraries do not have.
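The fixed-point idea can be illustrated in a few lines. The sketch below is not the C++/SSE2 reference implementation; it only shows the arithmetic, and the scale factors `W_SCALE` and `A_SCALE` are assumptions chosen so that the clipped ranges fit into a signed 16-bit integer.

```python
INT16_MIN, INT16_MAX = -32768, 32767
W_SCALE = 1 << 14  # assumed scale: weights in [-1, 1] map to [-16384, 16384]
A_SCALE = 1 << 11  # assumed scale: activations in [0, 10] map to [0, 20480]


def quantize(values, scale, lo, hi):
    """Clip each value to [lo, hi] and round it to a 16-bit fixed-point integer."""
    quantized = []
    for value in values:
        clipped = min(max(value, lo), hi)
        q = int(round(clipped * scale))
        assert INT16_MIN <= q <= INT16_MAX
        quantized.append(q)
    return quantized


def fixed_point_dot(q_weights, q_activations):
    """Integer dot product, rescaled to a float at the end.
    Python integers never overflow; a real SIMD kernel has to manage the
    width of the accumulator explicitly."""
    accumulator = sum(w * a for w, a in zip(q_weights, q_activations))
    return accumulator / (W_SCALE * A_SCALE)


weights = [0.5, -0.25, 1.5]  # 1.5 is clipped to 1.0
activations = [2.0, 4.0, 1.0]
q_w = quantize(weights, W_SCALE, -1.0, 1.0)
q_a = quantize(activations, A_SCALE, 0.0, 10.0)
print(fixed_point_dot(q_w, q_a))
```

The weights can be quantized once offline, so only the activations need to be converted at decoding time.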
4.2 Pre-Compute Embeddings
In the first hidden layer on the source and target sides, the input corresponds to word embeddings. Since this is a closed set of values that are fixed after training, the vectors can be pre-computed (Devlin et al., 2014) for each word in the vocabulary and stored in a lookup table. This can only be applied to the first hidden layer.
Pre-computation does increase the memory cost of the model, since we must store floats per word instead of . However, if we only compute the most frequent words (e.g., ), this reduces the pre-computation memory by 90% but still results in 95%+ token coverage due to the Zipfian distribution of language.
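A rough sketch of this lookup-table scheme follows. It assumes a single input weight matrix and a plain list of embeddings; the class name `PrecomputedInput`, the `misses` counter and the toy values are hypothetical and only serve the illustration.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class PrecomputedInput:
    """Caches (input matrix x embedding) for a set of frequent word ids and
    falls back to an on-the-fly product for every other word."""

    def __init__(self, embeddings, input_matrix, frequent_ids):
        self.embeddings = embeddings
        self.input_matrix = input_matrix
        self.table = {
            word_id: matvec(input_matrix, embeddings[word_id])
            for word_id in frequent_ids
        }
        self.misses = 0  # number of products computed at decoding time

    def project(self, word_id):
        cached = self.table.get(word_id)
        if cached is not None:
            return cached
        self.misses += 1
        return matvec(self.input_matrix, self.embeddings[word_id])


embeddings = [[1, 0], [0, 1], [1, 1]]   # toy 3-word vocabulary
input_matrix = [[1, 2], [3, 4], [5, 6]]  # toy 3x2 weight matrix
layer = PrecomputedInput(embeddings, input_matrix, frequent_ids=[0, 1])
print(layer.project(0), layer.project(2), layer.misses)
```

Restricting the table to frequent ids trades a small number of run-time products for a much smaller table.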
4.3 Pre-Compute Attention
The attention context computation in the GRU can be refactored as follows:
Crucially, the hidden vector representation is only dependent on the source sentence, while is dependent on the target hypothesis. Therefore, the original computation requires total multiplications per sentence, but the refactored version only requires total multiplications. The expectation over must still be computed at each target timestep, but this is much less expensive than the multiplication by .
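One way to read this refactoring is as an application of linearity: multiplying a matrix by an attention-weighted sum of source vectors gives the same result as taking the attention-weighted sum of the already multiplied source vectors. The sketch below checks this on toy values. The names `U`, `source_states` and `attention` are assumed notation for this example only, not necessarily the symbols used in the equations.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def weighted_sum(weights, vectors):
    """Expectation of the vectors under the given attention weights."""
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def context_naive(U, source_states, attention):
    """One matrix product for every target step and every hypothesis."""
    return matvec(U, weighted_sum(attention, source_states))


def precompute_source(U, source_states):
    """Done once per sentence: one matrix product per source position."""
    return [matvec(U, state) for state in source_states]


def context_refactored(projected_states, attention):
    """Per target step only the cheap weighted sum remains."""
    return weighted_sum(attention, projected_states)


U = [[1.0, 2.0], [0.0, 1.0]]
source_states = [[1.0, 0.0], [0.0, 2.0]]
attention = [0.25, 0.75]
projected = precompute_source(U, source_states)
print(context_naive(U, source_states, attention))
print(context_refactored(projected, attention))
```

Both calls print the same vector; the difference is only in how many matrix products are needed over a whole sentence.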
4.4 SSE & Lookup Tables
For the element-wise vector functions used in the GRU, we can use vectorized instructions (SSE/AVX) for the add and multiply functions, and lookup tables for sigmoid and tanh. Reference implementations in C++ are provided in the supplementary material.
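The lookup-table part can be sketched as follows. This is not the C++ reference implementation; the table range of [-8, 8], the 4096 entries and the nearest-entry rounding are assumptions made for the illustration.

```python
import math


class LookupTable:
    """Tabulates a scalar function on a uniform grid over [lo, hi] and
    answers queries with the nearest grid entry."""

    def __init__(self, fn, lo=-8.0, hi=8.0, size=4096):
        self.lo, self.hi = lo, hi
        self.step = (hi - lo) / (size - 1)
        self.values = [fn(lo + i * self.step) for i in range(size)]

    def __call__(self, x):
        x = min(max(x, self.lo), self.hi)  # saturate outside the table range
        index = int(round((x - self.lo) / self.step))
        return self.values[index]


sigmoid_table = LookupTable(lambda x: 1.0 / (1.0 + math.exp(-x)))
tanh_table = LookupTable(math.tanh)

print(tanh_table(0.3), math.tanh(0.3))
print(sigmoid_table(-1.2), 1.0 / (1.0 + math.exp(1.2)))
```

With this grid the spacing is about 0.004, so the nearest-entry error stays in the low thousandths for both functions; a finer grid or interpolation would reduce it further at the cost of memory or extra arithmetic.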
4.5 Merge Recurrent States
In the GRU equation, for the first target hidden layer, the target input represents the previously generated word, and the previous target recurrent vector encodes the hypothesis up to two words before the current word. Therefore, if two partial hypotheses in the beam differ only in the last emitted word, their previous recurrent vectors will be identical. Thus, we can perform the matrix multiplication only on the unique recurrent vectors in the beam at each target timestep. For a beam size of , we measured that the ratio of unique recurrent vectors to total recurrent vectors is approximately 70%, averaged over several language pairs. This can only be applied to the first target hidden layer.
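The merging step can be sketched as a de-duplication followed by a scatter. In this illustration, identical vectors are detected by comparing their contents, which is an assumption of the sketch; a decoder could equally track which hypotheses share a parent. The function name `project_unique` is hypothetical.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def project_unique(recurrent_matrix, beam_states):
    """Multiply each distinct recurrent vector once, then copy the result
    back to every hypothesis in the beam that shares it."""
    unique_index = {}   # vector contents -> position in unique_states
    unique_states = []
    mapping = []        # for each hypothesis, the position of its vector
    for state in beam_states:
        key = tuple(state)
        if key not in unique_index:
            unique_index[key] = len(unique_states)
            unique_states.append(state)
        mapping.append(unique_index[key])
    projected = [matvec(recurrent_matrix, state) for state in unique_states]
    return [projected[i] for i in mapping], len(unique_states)


W = [[1, 1], [1, -1]]
beam = [[1, 2], [1, 2], [3, 4]]  # the first two hypotheses share a vector
outputs, n_unique = project_unique(W, beam)
print(outputs, n_unique)
```

Here three hypotheses require only two matrix products, and the outputs are identical to multiplying every vector separately.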
Type                Words/Sec. (Single-Threaded)   Speedup Factor
Baseline            95                             1.00x
+ 16-Bit Mult.      248                            2.59x
+ Pre-Comp. Emb.    311                            3.25x
+ Pre-Comp. Att.    342                            3.57x
+ SSE & Lookup      386                            4.06x
+ Merge Rec.        418                            4.37x
4.6 Speedup Results
Cumulative results from each of the preceding speedups are presented in Table 1, measured on WMT English-French NewsTest2014. The NMT architecture evaluated here uses a 3-layer 512-dimensional bidirectional GRU for the source, and a 1-layer 1024-dimensional attentional GRU for the target. Each sentence is decoded independently with a beam of 6. Since these speedups are all mathematical identities excluding quantization noise, all outputs achieve 36.2 BLEU and are 99.9%+ identical.
The largest improvement is from 16-bit matrix multiplication, but all speedups contribute a significant amount. Overall, we are able to achieve a 4.4x speedup over a fast baseline decoder. Although the absolute speed is impressive, the model only uses one target layer and is several BLEU behind the SOTA, so the next goal is to maximize model accuracy while still achieving speeds greater than some target, such as 100 words/sec.
System                                                                   BLEU   Words/Sec (Single-Threaded)
Basic Phrase-Based MT (Schwenk, 2014)                                    33.1   -
SOTA Phrase-Based MT (Durrani et al., 2014)                              37.0   -
6-Layer Non-Attentional Seq-to-Seq LSTM (Luong et al., 2014)             33.1   -
RNN Search, 1-Layer Att. GRU, w/ Large Vocab (Jean et al., 2015)         34.6
Google NMT, 8-Layer Att. LSTM, Word-Based (Wu et al., 2016)              37.9
Google NMT, 8-Layer Att. LSTM, WPM-32k (Wu et al., 2016)                 39.0
Baidu Deep Attention, 8-Layer Att. LSTM (Zhou et al., 2016)              39.2   -
(S1) Trg: 1024-AttGRU                                                    36.2   418
(S2) Trg: 1024-AttGRU + 1024-GRU                                         36.8   242
(S3) Trg: 1024-AttGRU + 3-Layer 768-FC-Relu + 1024-FC-Tanh               37.1   271
(S4) Trg: 1024-AttGRU + 7-Layer 768-FC-Relu + 1024-FC-Tanh               37.4   229
(S5) Trg: 1024-AttGRU + 7-Layer 768-FC-Relu + 1024-GRU                   37.6   157
(S6) Trg: 1024-AttGRU + 15-Layer 768-FC-Relu + 1024-FC-Tanh              37.3   163
(S7) Src: 8-Layer LSTM, Trg: 1024-AttLSTM + 7-Layer 1024-LSTM            37.8   28
(E1) Ensemble of 2x Model (S4)                                           38.3   102
(E2) Ensemble of 3x Model (S4)                                           38.5   65
5 Model Improvements
In NMT, as in many other deep learning tasks, accuracy can be greatly improved by adding more hidden layers, but training and decoding time increase significantly (Luong et al., 2014; Zhou et al., 2016; Wu et al., 2016). Several past works have noted that convolutional neural networks (CNNs) are significantly less expensive than RNNs, and replaced the source and/or target side with a CNN-based architecture (Gehring et al., 2016; Kalchbrenner et al., 2016). However, these works have found it is difficult to replace the target side of the model with CNN layers while maintaining high accuracy. The use of a recurrent target is especially important to track attentional coverage and ensure fluency.
Here, we propose a mixed model which uses an RNN layer at the bottom to both capture full-sentence context and perform attention, followed by a series of fully-connected (FC) layers applied on top at each timestep. The FC layers can be interpreted as a CNN without overlapping stride. Since each FC layer consists of a single matrix multiplication, it is a fraction of the cost of a GRU (or an LSTM). Additionally, several of the speedups from Section 4 can only be applied to the first layer, so there is strong incentive to only use a single target RNN.
To avoid vanishing gradients, we use ResNet-style skip connections (He et al., 2016). These allow very deep models to be trained from scratch and do not require any additional matrix multiplications, unlike highway networks (Srivastava et al., 2015). With 5 intermediate FC layers, target timestep is computed as:
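As a rough illustration of this kind of stack (a sketch, not the exact equation), the code below applies a sequence of relu FC layers to the output of the bottom RNN layer and adds a skip connection on every other layer. Where exactly the skip is added (here: before the relu, taken from the output two layers earlier), the equal width of all FC layers, and the name `residual_fc_stack` are assumptions of this example.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def residual_fc_stack(rnn_output, fc_weights):
    """Apply the FC layers in order. Layers with index 2, 4, ... (0-based)
    add the output from two layers earlier before the relu, so every other
    layer carries a skip connection and no extra matrix product is needed."""
    outputs = []
    current = rnn_output
    for k, weights in enumerate(fc_weights):
        pre_activation = matvec(weights, current)
        if k >= 2 and k % 2 == 0:
            skip = outputs[k - 2]
            pre_activation = [p + s for p, s in zip(pre_activation, skip)]
        current = relu(pre_activation)
        outputs.append(current)
    return current


identity = [[1.0, 0.0], [0.0, 1.0]]
# Five intermediate FC layers with toy identity weights.
print(residual_fc_stack([1.0, -2.0], [identity] * 5))
```

With identity weights the two skip connections each double the surviving activation, which makes the effect of the additions easy to see in the printed result.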
We follow He et al. (2016) and only use skip connections on every other FC layer, but do not use batch normalization. The same pattern can be used for more FC layers, and the FC layers can be a different size than the bottom or top hidden layers. The top hidden layer can be an RNN or an FC layer. It is important to use relu activations (as opposed to tanh) for ResNet-style skip connections. The GRUs still use tanh.
5.1 Model Results
Results using the mixed RNN+FC architecture are shown in Table 2, using all speedups. We have found that the benefit of using RNN+FC layers on the source is minimal, so we only perform ablation on the target. For the source, we use a 3-layer 512-dim bidirectional GRU in all models (S1)-(S6).
Models (S1) and (S2) are one- and two-layer baselines. Model (S4), which uses 7 intermediate FC layers, has similar decoding cost to (S2) while doubling the improvement over (S1) to 1.2 BLEU. We see minimal benefit from using a GRU on the top layer (S5) or using more FC layers (S6). In (E1) and (E2) we present 2- and 3-model ensembles of (S4), trained from scratch with different random seeds. We can see that the 2-model ensemble improves results by 0.9 BLEU, but the 3-model ensemble has little additional improvement. Although not presented here, we have found these improvements from decoder speedups and RNN+FC to be consistent across many language pairs.
Altogether, we were able to achieve a BLEU score of 38.3 while decoding at 100 words/sec on a single CPU core. As a point of comparison, Wu et al. (2016) achieve similar BLEU scores on this test set (37.9 to 38.9) and report a CPU decoding speed of ~100 words/sec (0.2226 sents/sec), but parallelize this decoding across 44 CPU cores. System (S7), which is our re-implementation of Wu et al. (2016), decodes at 28 words/sec on one CPU core, using all of the speedups described in Section 4. Zhou et al. (2016) has a similar computational cost to (S7), but we were not able to replicate those results in terms of accuracy.
Although we are comparing an ensemble to a single model, we can see ensemble (E1) is over 3x faster to decode than the single model (S7). Additionally, we have found that model (S4) is roughly 3x faster to train than (S7) using the same GPU resources, so (E1) is also 1.5x faster to train than a single model (S7).
References
Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Brown et al. (1993) Peter F Brown, Vincent J Della Pietra, Stephen A Della Pietra, and Robert L Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational linguistics 19(2):263–311.
Devlin et al. (2014) Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard M Schwartz, and John Makhoul. 2014. Fast and robust neural network joint models for statistical machine translation. In ACL (1). Citeseer, pages 1370–1380.
Durrani et al. (2014) Nadir Durrani, Barry Haddow, Philipp Koehn, and Kenneth Heafield. 2014. Edinburgh's phrase-based machine translation systems for WMT-14. In Proceedings of the Ninth Workshop on Statistical Machine Translation. pages 97–104.
Gehring et al. (2016) Jonas Gehring, Michael Auli, David Grangier, and Yann N Dauphin. 2016. A convolutional encoder model for neural machine translation. arXiv preprint arXiv:1611.02344.
Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. 2016. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. ICLR.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pages 770–778.
Jean et al. (2015) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. CoRR.
Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
Kim and Rush (2016) Yoon Kim and Alexander M Rush. 2016. Sequence-level knowledge distillation. arXiv preprint arXiv:1606.07947.
Luong et al. (2014) Minh-Thang Luong, Ilya Sutskever, Quoc V Le, Oriol Vinyals, and Wojciech Zaremba. 2014. Addressing the rare word problem in neural machine translation. ACL 2015.
Quirk and Moore (2007) Chris Quirk and Robert Moore. 2007. Faster beam-search decoding for phrasal statistical machine translation. Machine Translation Summit XI.
Schwenk (2014) Holger Schwenk. 2014. http://wwwlium.univlemans.fr/schwenk/cslm_joint_paper. [Online; accessed 03-September-2014].
Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. NIPS.
Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144.
Zhou et al. (2016) Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. arXiv preprint arXiv:1606.04199.