Accelerating NMT Batched Beam Decoding with LMBR Posteriors for Deployment

04/30/2018 ∙ by Gonzalo Iglesias, et al. ∙ SDL 0

We describe a batched beam decoding algorithm for NMT with LMBR n-gram posteriors, showing that LMBR techniques still yield gains on top of the best recently reported results with Transformers. We also discuss acceleration strategies for deployment, and the effect of the beam size and batching on memory and speed.




1 Introduction

The advent of Neural Machine Translation (NMT) has revolutionized the market. Objective improvements Sutskever et al. (2014); Bahdanau et al. (2015); Sennrich et al. (2016b); Gehring et al. (2017); Vaswani et al. (2017) and a fair amount of neural hype have increased the pressure on companies offering Machine Translation services to shift as quickly as possible to this new paradigm.

Such a radical change entails non-trivial challenges for deployment; consumers certainly look forward to better translation quality, but do not want to lose all the good features that have been developed over the years along with SMT technology. With NMT, real-time decoding is challenging without GPUs, and is still an avenue for research Devlin (2017). Great speeds have been reported by Junczys-Dowmunt et al. (2016) on GPUs, for which batching queries to the neural model is essential. Disk usage and memory footprint of pure neural systems are certainly lower than those of SMT systems, but at the same time GPU memory is limited and high-end GPUs are expensive.

Further to that, consumers still need the ability to constrain translations; in particular, brand-related information is often as important for companies as translation quality itself, and is currently under investigation Chatterjee et al. (2017); Hokamp and Liu (2017); Hasler et al. (2018). It is also well known that pure neural systems reach very high fluency, often sacrificing adequacy Tu et al. (2017); Zhang et al. (2017); Koehn and Knowles (2017), and have been reported to behave badly under noisy conditions Belinkov and Bisk (2018). Stahlberg et al. (2017) show an effective way to counter these problems by taking advantage of the higher adequacy inherent to SMT systems via Lattice Minimum Bayes Risk (LMBR) decoding Tromble et al. (2008). This makes the system more robust to pitfalls such as over- and under-generation Feng et al. (2016); Meng et al. (2016); Tu et al. (2016), which is important for commercial applications.

In this paper, we describe a batched beam decoding algorithm that uses NMT models with LMBR n-gram posterior probabilities Stahlberg et al. (2017). Batching in NMT beam decoding has been mentioned or assumed in the literature, e.g. Devlin (2017); Junczys-Dowmunt et al. (2016), but to the best of our knowledge it has not been formally described, and there are interesting aspects for deployment worth taking into consideration.

We also report on the effect of LMBR posteriors on state-of-the-art neural systems, for five translation tasks. Finally, we discuss how to prepare (LMBR-based) NMT systems for deployment, and how our batching algorithm performs in terms of memory and speed.

2 Neural Machine Translation and LMBR

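Given a source sentence $\mathbf{x}$, a sequence-to-sequence NMT model scores a candidate translation sentence $\mathbf{y} = y_1^T$ with $T$ words as:

$$P_{NMT}(y_1^T \mid \mathbf{x}) = \prod_{t=1}^{T} P_{NMT}(y_t \mid y_1^{t-1}, \mathbf{x}) \qquad (1)$$

where $P_{NMT}(y_t \mid y_1^{t-1}, \mathbf{x})$ uses a neural function $f_{NMT}(\cdot)$. To account for batching $B$ neural queries together, our abstract function takes the form of $f_{NMT}(S_{t-1}, \mathbf{y}_{t-1}, A)$, where $S_{t-1}$ is the previous batch state with $B$ state vectors in rows, $\mathbf{y}_{t-1}$ is a vector with the $B$ preceding generated target words, and $A$ is a matrix with the annotations Bahdanau et al. (2015) of a source sentence. The model has a vocabulary size $V$.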
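As a small illustration of the scoring above, the following sketch assumes the usual left-to-right factorization, in which the sentence score is the product of per-word probabilities, and accumulates it in log space as decoders typically do. The function name and the probability values are hypothetical and only serve the example.

```python
import math


def sequence_log_prob(step_probs):
    """Sum per-word log-probabilities; this equals the log of their product."""
    return sum(math.log(p) for p in step_probs)


# Hypothetical per-word probabilities for a 3-word candidate translation.
probs = [0.5, 0.25, 0.8]
score = sequence_log_prob(probs)
```

Working in log space turns the product into a sum, which is what allows other additive per-word scores to be combined with the neural score at each time step.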

The implementation of this function is determined by the architecture of specific models. The most successful ones in the literature typically share an attention mechanism that determines which source word to focus on, informed by the source annotations and the previous state. Bahdanau et al. (2015) use recurrent layers to compute both the annotations and the next target word. Gehring et al. (2017) use convolutional layers instead, and Vaswani et al. (2017) dispense with GRU or LSTM layers, relying heavily on multi-layered attention mechanisms, stateful only on the translation side. Finally, this function can also represent an ensemble of neural models.

Lattice Minimum Bayes Risk decoding computes n-gram posterior probabilities from an evidence space and uses them to score a hypothesis space Kumar and Byrne (2004); Tromble et al. (2008); Blackwood et al. (2010). It improves single SMT systems, and also lends itself quite nicely to system combination Sim et al. (2007); de Gispert et al. (2009). Stahlberg et al. (2017) have recently shown a way to use it with NMT decoding: a traditional SMT system is first used to create an evidence space $\mathcal{E}$, and the NMT space is then scored left-to-right with both the NMT model(s) and the n-gram posteriors gathered from $\mathcal{E}$. More formally:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} \Big( \Theta_0 + \sum_{n=1}^{4} \Theta_n P_{LMBR}(y_{t-n+1}^{t} \mid \mathcal{E}) + \lambda \log P_{NMT}(y_t \mid y_1^{t-1}, \mathbf{x}) \Big) \qquad (2)$$

For our purposes the LMBR contribution in Equation 2, denoted $L$, is arranged as a matrix with each row uniquely associated to an n-gram history identified in $\mathcal{E}$: each row contains scores for any word in the NMT vocabulary.

$L$ can be precomputed very efficiently, and stored in the GPU memory. The number of distinct n-gram histories is typically no more than for our phrase-based decoder producing hypotheses. Notice that such a matrix only containing $P_{LMBR}$ contributions would be very sparse, but it turns into a dense matrix with the summation of $\Theta_0$. Both sparse and dense operations can be performed on the GPU. We have found it more efficient to compute first all the sparse operations on CPU, and then upload to the GPU memory and sum the constant in GPU. (Ideally we would want to keep the matrix sparse and sum the constant on-the-fly; however, this is not possible with ArrayFire 3.6.)
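The arrangement described above, one row per n-gram history and one column per vocabulary word, with a constant added to sparse posterior contributions, can be sketched as follows. This is an illustration in plain Python lists rather than the authors' ArrayFire code; `build_lmbr_matrix`, the history tuples, and all numeric values are assumptions made for the example.

```python
def build_lmbr_matrix(histories, sparse_scores, vocab_size, constant):
    """Return (row index per history, dense matrix).

    Every entry starts at the constant term; the few (history, word)
    pairs seen in the evidence space receive their weighted posterior on top.
    """
    row_of = {history: i for i, history in enumerate(histories)}
    matrix = [[constant] * vocab_size for _ in histories]
    for (history, word), value in sparse_scores.items():
        matrix[row_of[history]][word] += value
    return row_of, matrix


# Hypothetical histories (tuples of word ids) and weighted posteriors.
histories = [(), (3,), (3, 1)]
sparse_scores = {((3,), 1): 0.4, ((3, 1), 2): 0.7}
row_of, lmbr = build_lmbr_matrix(histories, sparse_scores, vocab_size=5, constant=-0.1)
```

Adding the constant to every cell is what turns the otherwise sparse matrix into a dense one, which is why the order of the sparse and dense steps matters for efficiency.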

1:procedure DecodeNMT(x, )
6:      Set of EOS survivors
7:     for  = to  do
10:          Add LMBR contributions
13:         for  = to  do
14:              if  then
15:                   Track indices and score
16:                   Mask out to prevent hypothesis extension                             
17:     return
Algorithm 1 Batch decoding with LMBR n-gram posteriors

3 NMT batched beam decoding

Algorithm 1 describes NMT decoding with LMBR posteriors using a beam size equal to the batch size. The lines preceding the main loop initialize the decoder; the number of time steps is usually a heuristic function of the source length. One array keeps track of the best scores per time step, and two further arrays hold indices.

The main loop (lines 7-16) is the core of the batch decoding procedure. At each time step, given the previous batch state, the preceding target words and the source annotations, the neural function returns two matrices (line 8). The first, with B rows and one column per vocabulary word, contains log-probabilities for all possible candidates in the vocabulary given the B live hypotheses. The second is the next batch state: each of its rows is the state vector that corresponds to any candidate in the same row of the log-probability matrix.

Lines 9 and 10 add the n-gram posterior scores. Given the two index arrays, it is straightforward to read the unique histories for the open hypotheses: the topology of the hypothesis space is that of a tree, because an NMT state represents the entire live hypothesis from the first time step. Note that the same indices give access to the previous word of each hypothesis. In effect, they function as backpointers, making it possible to reconstruct not only n-grams per time step, but also complete hypotheses. As discussed for Equation 2, these histories are associated with rows in the matrix of LMBR scores. A row-fetching function simply creates a new matrix, with B rows and one column per vocabulary word, by fetching those rows from the LMBR matrix. This new matrix is added to the log-probability matrix (line 10).

In line 11, we get the indices and scores of the top B hypotheses in the combined score matrix. These best hypotheses could come from any row. For example, all B best hypotheses could have been found in row 0. In that case, the new batch state to be used in the next time step should contain copies of row 0 in the other rows. This is achieved, again by fetching rows, in line 12.

Finally, lines 13-16 identify whether there are any end-of-sentence (EOS) candidates; the corresponding indices and score are pushed into the set of EOS survivors, and these candidates are masked out (i.e. set to $-\infty$) to prevent further expansion. In line 17, the best hypothesis in that set is traced backwards, again using the two index arrays. Optionally, normalization by hypothesis length happens in this step.
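The selection, state reordering and EOS masking steps described above can be sketched as a single function. This is a simplified illustration on plain Python lists, not the authors' GPU implementation: `beam_step`, its arguments and the toy values are hypothetical, and the score matrix is assumed to already hold cumulative hypothesis scores.

```python
import heapq

NEG_INF = float("-inf")


def beam_step(scores, states, beam_size, eos_id):
    """One time step over a (beam x vocabulary) score matrix."""
    vocab_size = len(scores[0])
    # Flatten the matrix so the best candidates can come from any row.
    flat = [(s, r * vocab_size + w)
            for r, row in enumerate(scores) for w, s in enumerate(row)]
    live, new_states, finished = [], [], []
    for score, idx in heapq.nlargest(beam_size, flat):
        row, word = divmod(idx, vocab_size)  # row doubles as a backpointer
        new_states.append(states[row])       # copy the parent state row
        if word == eos_id:
            finished.append((score, row, word))  # track indices and score
            score = NEG_INF                      # mask out: no extension
        live.append((score, row, word))
    return live, new_states, finished


scores = [[-1.0, -0.2, -3.0], [-2.0, -2.5, -0.1]]
live, new_states, finished = beam_step(scores, ["s0", "s1"], beam_size=2, eos_id=2)
```

In this toy run the best candidate is an EOS extension of row 1, so it is recorded as finished and masked, while the second-best candidate extends row 0 and stays live.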

It is worth noting that:

  1. If we drop lines 9 and 10, we have a pure left-to-right NMT batched beam decoder.

  2. Applying a constraint (e.g. for lattice rescoring or other user constraints) involves masking out scores in the score matrix before line 11.

  3. Because the batch size is tied to the beam size, the memory footprint increases with the beam.

  4. Because the beam is shared by EOS and non-EOS candidates, it can be argued that EOS candidates impoverish the beam, and that they should be kept in addition to the non-EOS candidates (either by using a bigger beam, or by keeping them separately). Empirically we have found that this does not affect quality with real models.

  5. The opposite, i.e. that no EOS candidate survives in the beam at any time step, can happen, although very infrequently. Several pragmatic backoff strategies can be applied in this situation: for example, running the decoder for additional time steps, or tracking all EOS candidates that did not survive in a separate stack and picking the best hypothesis from there. We chose the latter.
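The use of indices as backpointers, and the optional length normalization mentioned earlier, can be illustrated with a short sketch. The data layout is an assumption made for the example: `words[t][row]` is the word chosen at step `t` in beam slot `row`, and `backpointers[t][row]` is the slot at step `t - 1` that it extends. Dividing the score by the hypothesis length is one simple normalization, and is not necessarily the one used in the system described here.

```python
def backtrace(words, backpointers, last_step, last_row):
    """Follow backpointers from the final slot back to the first time step."""
    hypothesis = []
    row = last_row
    for t in range(last_step, -1, -1):
        hypothesis.append(words[t][row])
        row = backpointers[t][row]
    return hypothesis[::-1]


def length_normalized(score, length):
    """Average log-score per word (one simple choice of normalization)."""
    return score / length


# Hypothetical beam of size 2 over three time steps; word id 2 plays EOS.
words = [[5, 7], [8, 9], [2, 4]]
backpointers = [[0, 0], [1, 0], [1, 1]]
best = backtrace(words, backpointers, last_step=2, last_row=0)
norm = length_normalized(-1.5, len(best))
```

The same traversal, stopped after a few steps, yields the n-gram history of a live hypothesis at any time step.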

System            ger-eng  eng-ger  chi-eng  eng-jpn  jpn-eng
PBMT                 28.9     19.6     15.8     33.4     18.0
FNMT                 32.8     26.1     20.8     39.1     25.3
LNMT                 33.7     26.6     22.0     40.4     26.1
TNMT                 35.2     28.9     24.8     44.6     29.4
LTNMT                35.4     29.2     25.4     44.9     30.2
Best submissions     35.1     28.3     26.4     43.3     28.4
Table 1: Quality assessment of our NMT systems with and without LMBR posteriors for GRU-based (FNMT, LNMT) and Transformer models (TNMT, LTNMT). Cased BLEU scores are reported on translation tasks. The exact PBMT systems used to compute n-gram posteriors for LNMT and LTNMT systems are also reported. The last row shows scores for the best official submissions to each task.

3.1 Extension to Sentence batching

In addition to batching all queries to the neural model needed to compute the next time step for one sentence, we can do sentence batching: that is, we translate several sentences simultaneously, batching queries per time step.

With small modifications, Algorithm 1 can be easily extended to handle sentence batching. For a given number of sentences translated together:

  1. Instead of one set to store EOS candidates, we need one set per sentence.

  2. For every time step, the indices and scores of the best candidates need to be matrices instead of vectors, and minor changes are required in the top-B selection to fetch the best candidates per sentence efficiently.

  3. The score matrix and the batch state can remain as matrices, in which case the new batch size is simply the number of sentences times the beam size.

  4. The heuristic function used to compute the number of time steps is typically sentence specific.
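The per-sentence selection required by sentence batching can be sketched as below, assuming the score matrix simply stacks the beam rows of each sentence one block after another. The function name, the row layout and the values are hypothetical; an efficient GPU implementation would avoid the explicit Python loop.

```python
import heapq


def top_b_per_sentence(scores, num_sentences, beam_size):
    """Pick the best `beam_size` candidates within each sentence's row block."""
    vocab_size = len(scores[0])
    result = []
    for n in range(num_sentences):
        rows = range(n * beam_size, (n + 1) * beam_size)
        flat = [(scores[r][w], r, w) for r in rows for w in range(vocab_size)]
        result.append(heapq.nlargest(beam_size, flat))
    return result


# Two sentences with beam size 2 give a batch of 4 rows; vocabulary of 2 words.
scores = [[-1.0, -5.0], [-2.0, -3.0], [-0.5, -4.0], [-0.1, -9.0]]
best = top_b_per_sentence(scores, num_sentences=2, beam_size=2)
```

Restricting the selection to each block keeps the sentences independent, so a strong candidate in one sentence can never displace candidates of another.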

4 Experiments

4.1 Experimental Setup

We report experiments on English-German, German-English and Chinese-English language pairs for the WMT17 task, and Japanese-English and English-Japanese for the WAT task. For the German tasks we use news-test2013 as a development set, and news-test2017 as a test set; for Chinese-English, we use news-dev2017 as a development set, and news-test2017 as a test set. For Japanese tasks we use the ASPEC corpus (Nakazawa et al., 2016).

We use all available data in each task for training. In addition, for German we use back-translation data Sennrich et al. (2016a). All training data for neural models is preprocessed with the byte pair encoding technique described by Sennrich et al. (2016b). We use Blocks van Merriënboer et al. (2015) with Theano Bastien et al. (2012) to train attention-based single GRU layer models Bahdanau et al. (2015), henceforth called FNMT. The vocabulary size is K. Transformer models Vaswani et al. (2017), called here TNMT, are trained using the Tensor2Tensor package with a vocabulary size of K.

Our proprietary translation system is a modular homegrown tool that supports pure neural decoding (FNMT and TNMT) and decoding with LMBR posteriors (henceforth called LNMT and LTNMT respectively), and flexibly uses other components (phrase-based decoding, byte pair encoding, etc.) to seamlessly deploy an end-to-end translation system.

FNMT/LNMT systems use ensembles of 3 neural models unless specified otherwise; TNMT/LTNMT systems decode with 1 to 2 models, each averaging over the last 20 checkpoints.

The Phrase-based decoder (PBMT) uses standard features with one single 5-gram language model Heafield et al. (2013), and is tuned with standard MERT Och (2003); n-gram posterior probabilities are computed on-the-fly over rich translation lattices, with size bounded by the PBMT stack and distortion limits. The parameter in Equation 2 is set as 0.5 divided by the number of models in the ensemble. Empirically we have found this to be a good setting in many tasks.

Unless noted otherwise, the beam size is set to 12 and the NMT beam decoder always batches queries to the neural model. The beam decoder relies on an early preview of ArrayFire 3.6 Yalamanchili et al. (2015), compiled with CUDA 8.0 libraries. For speed measurements, the decoder uses one single CPU thread. For hardware, we use an Intel Xeon CPU E5-2640 at 2.60GHz. The GPU is a GeForce GTX 1080Ti. We report cased BLEU scores Papineni et al. (2002), strictly comparable to the official scores in each task.

Figure 1: Accelerated FNMT and LNMT decoding times for newstest-2017 test set.

4.2 The effect of LMBR n-gram posteriors

Table 1 shows contrastive experiments for all five language pair/tasks. We make the following observations:

  1. LMBR posteriors show consistent gains on top of the GRU model (LNMT vs FNMT rows), ranging from +0.5 BLEU to +1.3 BLEU. This is consistent with the findings reported by Stahlberg et al. (2017).

  2. The TNMT system boasts improvements across the board, ranging from +1.5 BLEU in German-English to an impressive +4.2 BLEU in English-Japanese WAT (TNMT vs LNMT). This is in line with findings by Vaswani et al. (2017) and sets new very strong baselines to improve on.

  3. Further, applying LMBR posteriors along with the Transformer model yields gains in all tasks (LTNMT vs TNMT), up to +0.8 BLEU in Japanese-English. Interestingly, while we find that rescoring PBMT lattices Stahlberg et al. (2016) with GRU models yields similar improvements to those reported by Stahlberg et al. (2017), we did not find gains when rescoring with the stronger TNMT models instead.

Figure 2: Batch beam decoder speed measured over newstest-2017 test set, using the accelerated FNMT system (25.2 BLEU for beam size = 4).
Figure 3: Batch beam decoder speed measured over newstest-2017 test set, using the accelerated eng-ger-wmt17 FNMT system (25.2 BLEU) with additional sentence batching, up to 7 sentences.

4.3 Accelerating FNMT and LNMT systems for deployment

There is no particular constraint on speed for the research systems reported in Table 1. We now address the question of deploying NMT systems so that MT users get the best quality improvements at real-time speed and with acceptable memory footprint. As an example, we analyse in detail the English-German FNMT and LNMT case and discuss the main trade-offs if one wanted to accelerate them. Although the actual measurements vary across all our productised NMT engines, the trends are similar to the ones reported here.

In this particular case we specify a beam width of 0.01 for early pruning Wu et al. (2016); Delaney et al. (2006) and reduce the beam size to 4. We also shrink the ensemble into one single big model using the data-free shrinking method described by Stahlberg and Byrne (2017), an inexpensive way to improve both speed and GPU memory footprint. (The file size of each individual model of the ensemble is 510MB; the size of the shrunken model is 1.2GB.)
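One common reading of a beam width used for early pruning is a relative threshold: a candidate is dropped when its probability falls below the best candidate's probability multiplied by the width. The sketch below illustrates that reading only; whether the decoder described here applies exactly this rule is an assumption, and the function name and scores are hypothetical.

```python
import math


def prune(candidates, beam_width):
    """Keep (log_score, label) pairs within a relative threshold of the best."""
    best = max(score for score, _ in candidates)
    threshold = best + math.log(beam_width)  # multiplying probabilities in log space
    return [(score, label) for score, label in candidates if score >= threshold]


# Hypothetical candidates with probabilities 0.5, 0.3 and 0.004.
candidates = [(math.log(0.5), "a"), (math.log(0.3), "b"), (math.log(0.004), "c")]
kept = prune(candidates, beam_width=0.01)
```

With a width of 0.01 the threshold corresponds to a probability of 0.005, so the third candidate is discarded before it can occupy a beam slot.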

In addition, for LNMT systems we tune phrase-based decoder parameters such as the distortion limit, the number of translations per source phrase and the stack limit. To compute n-gram posteriors we now only take a -best from the phrase-based translation lattice.

Table 2 shows a contrast of our English-German WMT17 research systems versus the respective accelerated ones.

           Research          Accelerated
        BLEU    speed      BLEU    speed
FNMT    26.1     2207      25.2     9449
LNMT    26.6      263      25.7     4927
Table 2: Cased BLEU scores for research vs accelerated English-to-German WMT17 systems. Speed reported in words per minute.

In the process, both accelerated systems have lost 0.9 BLEU relative to the baseline (Table 2). As an example, let us break down the effects of accelerating the LNMT system: using only -best hypotheses from the phrase-based translation lattice reduces BLEU. Replacing the ensemble with a data-free shrunken model reduces another BLEU and decreasing the beam size reduces BLEU. The impact of reducing the beam size varies from system to system, although it often does not result in substantial quality loss for NMT models Britz et al. (2017).

It is worth noting that these two systems share exactly the same neural model and parameter values. However, LNMT runs 4522 words per minute (wpm) slower than FNMT (Table 2). Figure 1 breaks down the decoding times for both the accelerated FNMT and LNMT systems. The LNMT pipeline also requires a phrase-based decoder and the extra component to compute the n-gram posterior probabilities. In effect, while both are remarkably fast by themselves (e.g. the phrase-based decoder is running at wpm), these extra contributions explain most of the speed reduction for the accelerated LNMT system. In addition, the beam decoder itself is slightly slower for LNMT than for FNMT. This is mainly due to the computation of the LMBR score matrix, as explained in Section 2. Finally, the respective GPU memory footprints for FNMT and LNMT are and GB.
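The speed figures in Table 2 can be turned into speed-up ratios with a few lines of arithmetic. The sketch below only restates the table's words-per-minute values; the variable names are illustrative.

```python
# Words per minute, copied from Table 2.
speeds = {
    "FNMT": {"research": 2207, "accelerated": 9449},
    "LNMT": {"research": 263, "accelerated": 4927},
}

# Ratio of accelerated to research speed for each system.
speedup = {name: s["accelerated"] / s["research"] for name, s in speeds.items()}

# Difference in speed between the two accelerated systems.
gap_wpm = speeds["FNMT"]["accelerated"] - speeds["LNMT"]["accelerated"]
```

Under these numbers the accelerated FNMT system is roughly 4.3 times faster than its research counterpart and the accelerated LNMT system roughly 18.7 times faster, while the two accelerated systems differ by 4522 wpm.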

4.4 Batched beam decoding and beam size

We next discuss the impact of using batch decoding and the beam size. To this end we use the accelerated FNMT system (25.2 BLEU, 9449 wpm) to decode with and without batching; we also widen the beam. Figure 2 shows the results.

The accelerated system itself with batched beam decoding and beam size of is times faster than without batching ( wpm). The GPU memory footprint is GB bigger when batching ( vs GB). As can be expected, widening the beam decreases the speed of both decoders. The relative speed-up ratio favours the batch decoder for wider beams, i.e. it is 5 times faster for beam size 12. However, because the batch size is tied to the beam size, this comes at a cost in GPU memory footprint (under GB).

4.5 Sentence batching

As described in Section 3.1, it is straightforward to extend beam batching to sentence batching. Figure 3 shows the effect of sentence batching up to 7 sentences on our accelerated FNMT system.

Whilst the speed-up of our implementation is sub-linear, when batching 5 sentences the decoder runs at almost 21000 wpm, and goes beyond 24000 wpm for 7 sentences. Thus, our implementation of sentence batching is roughly 2.5 times faster on top of beam batching (9449 wpm in Table 2). Again, this comes at a cost: the GPU memory footprint increases as we batch more and more sentences together, up to GB for 7 sentences, which approaches the limit of GPU memory.

Note that sentence batching does not change translation quality. For example, when translating 7 sentences, we are effectively batching 28 neural queries per time step. Indeed, each individual sentence is still being translated with a beam size of 4.

Figure 3 also shows the effect of sorting the test set by sentence length. Because sorted sentences in a batch have similar lengths, less padding is required and hence we have less wasteful GPU computation. With batched sentences the decoder would run at barely wpm, that is, wpm less due to not sorting by sentence length. A similar strategy is common for neural training Sutskever et al. (2014); Morishita et al. (2017).
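The effect of sorting on padding can be illustrated by counting wasted positions when consecutive sentences are grouped into batches and each batch is padded to its longest sentence. The function and the sentence lengths below are hypothetical and only illustrate the argument.

```python
def padded_positions(lengths, batch_size):
    """Count padding slots when consecutive sentences are batched together."""
    waste = 0
    for i in range(0, len(lengths), batch_size):
        batch = lengths[i:i + batch_size]
        longest = max(batch)
        waste += sum(longest - n for n in batch)
    return waste


# Hypothetical sentence lengths, in words, in test-set order.
lengths = [5, 30, 7, 28, 6, 31]
unsorted_waste = padded_positions(lengths, batch_size=3)
sorted_waste = padded_positions(sorted(lengths), batch_size=3)
```

In this toy case batching in the original order wastes 76 padded positions, against 7 after sorting, because short and long sentences no longer share a batch.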

5 Conclusions

We have described a left-to-right batched beam NMT decoding algorithm that is transparent to the neural model and can be combined with LMBR n-gram posteriors. Our quality assessment with Transformer models Vaswani et al. (2017) has shown that LMBR posteriors can still improve such a strong baseline in terms of BLEU. Finally, we have also discussed our acceleration strategy for deployment and the effect of batching and the beam size on memory and speed.