Vocabulary Selection Strategies for Neural Machine Translation

by   Gurvan L'Hostis, et al.

Classical translation models constrain the space of possible outputs by selecting a subset of translation rules based on the input sentence. Recent work on improving the efficiency of neural translation models adopted a similar strategy by restricting the output vocabulary to a subset of likely candidates given the source. In this paper we experiment with context and embedding-based selection methods and extend previous work by examining speed and accuracy trade-offs in more detail. We show that decoding time on CPUs can be reduced by up to 90 English-Romanian tasks at the same or only negligible change in accuracy. This brings the time to decode with a state of the art neural translation system to just over 140 msec per sentence on a single CPU core for English-German.


page 1

page 2

page 3

page 4


On Using Very Large Target Vocabulary for Neural Machine Translation

Neural machine translation, a recently proposed approach to machine tran...

A Convolutional Encoder Model for Neural Machine Translation

The prevalent approach to neural machine translation relies on bi-direct...

Learning to Refine Source Representations for Neural Machine Translation

Neural machine translation (NMT) models generally adopt an encoder-decod...

Tagged Back-Translation

Recent work in Neural Machine Translation (NMT) has shown significant qu...

Syntax-aware Data Augmentation for Neural Machine Translation

Data augmentation is an effective performance enhancement in neural mach...

The Devil is in the Details: On the Pitfalls of Vocabulary Selection in Neural Machine Translation

Vocabulary selection, or lexical shortlisting, is a well-known technique...

Iterative Refinement for Machine Translation

Existing machine translation decoding algorithms generate translations i...

1 Introduction

Machine translation is a sequence to sequence problem, encoder-decoder deep learning architectures are therefore particularly appropriate to solve it. Neural machine translation has gotten good enough to challenge traditional statistical methods in recent years


One of the issues NMT faces is related to vocabulary sizes. The output layer of the decoder is of linear complexity with respects to target vocabulary size because to generate the next token one needs to score every possible token every time. The number of possible target words is thus generally set between 30k and 80k, which makes NMT systems produce a lot of out-of-vocabulary (OOV) tokens to fill the gaps.

One approach to deal with that issue is to use OOV tokens that contain alignment information, this allows a finer post-processing step of OOV replacement [LuongUnk]. However, this is only a way of doing damage-control and does not provide a solution to train large vocabularies. Characters, whether as last resort for rare words [LuongHybrid] or for everything [Ling], and sub-word units [Senrich] can be used instead of words to manage open vocabularies. To use very large vocabularies, Jean groups training data in sets that rely on smaller sub-vocabularies, and an alignment model is used to select the relevant vocabulary at decoding time. IBM build on this approach by leveraging phrase-level alignments on top of the word-based dictionary and by using it at both training and decoding time.

Our approach extends on the latter, making the following contributions:

  • we explore several other vocabulary selection methods;

  • we quantify the gains obtained by these vocabulary selection methods.

2 Translation model

The neural system we use is an in-house Torch reimplementation of the encoder-decoder architecture similar of Bahdanau. We have one embedding for the source language and three for the target language: one for the attention module, one as input to the decoder and one to be multiplied with the decoder output for scoring. The former embeddings are all needed in memory but the other three are look-up tables from which we use only a few columns at a time.

A bidirectional LSTM is used as encoder; hidden representations of the forward and the backward passes are simply stacked.

The attention module is a global dot-product style soft attention mechanism [LuongAttention]

: the previous decoder hidden state is projected in the source word space, added to the attentional embedding of the previous target token and dot-multiplied with every encoder output. These products are then used to compute a weighted sum of the encoder outputs, the attentional vector.

The decoder is a conditional LSTM which takes as inputs the attentional vector, the decoding embedding of the previous target token and the previous hidden state to output the next hidden state. Scoring embedding vectors are then dot-multiplied with this hidden representation to obtain token-wise scores.

Training samples are bucket-batched on target sentence length: we order translation pairs by target length and source length to break ties (this limits source side padding) and make batches that contain only samples of the same target length, leaving some batches not full if necessary.

3 Vocabulary selection methods

One important challenge of current neural translation systems is the vocabulary size of the target language. In figure 1, displaying BLEU scores obtained on the English-German WMT 2013 test set with varying vocabulary sizes, we see that no plateau has been reached; it would be useful to get more tokens. Some languages that are more inflected than German such as Czech may benefit from bigger vocabularies even more.

Figure 1: BLEU scores of models varying in target vocabulary size (logarithmic in x-axis) on newstest2013. Blue is direct output scoring, green is scoring after alignment-based unknown token replacement.

However, using 100k tokens or more creates two problems:

  • for every token that is added to an output sentence, we have to compute the scalar product of the hidden representation produced by the decoder with all target token embeddings for scoring purposes. This adds up linearly with the vocabulary size;

  • storing target embeddings in the limited GPU memory sparks a trade-off between vocabulary size and hidden representation sizes. Best performance is obtained when vocabulary size is at 100k in our model, but it would be better to use larger vocabularies and larger hidden representations.

We address this output layer bottleneck by using a small dynamic vocabulary. For each batch that is presented to the network, a subset of the large global vocabulary is selected and only its tokens are scored in the output layer.

To implement this, we add a look-up table from which the target vocabulary embedding matrix is extracted for each batch and multiplied to the decoder output. We next explain how these small vocabularies are obtained.

3.1 Selection methods

The selection strategies are functions that take a sentence in the source language as inputs and return a set of tokens in the target language. These functions are learned on a parallel translation corpus. Most of them build on a word level selection function of which we aggregate the selections at the sentence and batch levels. At training time, we add the reference tokens to this selection.

Word co-occurrence

The simplest method is to count co-occurrences of each (source token, target token)

pair in the dataset, which can be thought of as an estimation of

with and

source language and target language tokens. These probabilities are then used to fetch a fixed number of target tokens for every source token (the top

k, k

being a hyperparameter). The selected vocabulary is the union of all of these selection over a sentence.

111A similar approach where co-occurrence counts are normalized by target token counts (which estimates ) was tried but gave poor results because of instabilities in the the very low frequencies.

Principal component analysis

A PCA on the co-occurrence matrix of the joint source and target vocabulary can be used to compute a bilingual token embedding [PCA]; for every source token, we retrieve and rank the closest target tokens in it and take the top k closest. We then aggregate all top selections to define the sentence-wise subset. A PCA of dimension 1024 is used. This approach is a form of regularization of the co-occurrence method; while the latter requires exact matches between word pairs, PCA takes advantage of the co-occurrence sparsity to reduce the space to a dimension in which we can perform nearest neighbor search to retrieve semantically similar words.

Word alignment

We use the unsupervised word aligner fast_align [FastAlign] to estimate which (source token, target token) pairs are aligned between two parallel sentences to design another selection method. Everything works like in the co-occurrence method where we replace co-occurrence counts by alignment counts.

Phrase alignment

A more advanced alignment method consists in aligning n-grams directly. We consider every n-gram in the source sentence (

), retrieve aligned target n-grams and output the union of all the tokens that are present in the n-grams. Our alignment dictionary is thresholded at five alignments so that only relevant n-gram translations appear in it, therefore most source n-grams (which are not likely to be a general phrase) will not have any correspondence in the target language.

This method aims at capturing more subtle links that might be missed by word based methods. For instance, the tokens of “pass out”, taken individually, are unlikely to make a word-based selection function retrieve the tokens necessary to translate it as “s’évanouir” or “tomber dans les pommes”, but a phrase-based method should.

Support vector machines

Taking this idea one step further, we try to leverage the whole source sentence to select target tokens by using cheap classifiers, one per target token, that take the one-hot encoded sentence as input and output a score related to the probability of presence of a that token in a valid translation. We use sparse linear support vector machines trained by stochastic gradient descent


SVMs are trained separately and thus need to be calibrated by coherently defining one threshold for each SVM above which its predictions are considered positive; we do this either by using recall on the validation set or by matching token frequencies in the validation set with a given multiplier.

3.2 Common words

On top of the token set that is selected by aforementioned strategies, we can add a set of common words that are always used –for example, the 1000 most frequent words. Doing this is interesting for two reasons: it makes sure no important tokens such as syntactic words or punctuation are omitted; it reduces the cost of embedding reordering that happens in the look-up table by keeping fixed the columns that would be frequently moved.

4 Experimental setup

Experiments are based on the English-German WMT15 dataset, it is composed of 1.64M sentences from EU parliament, 1.81M from a web crawl and 0.2M from news commentaries. Sentences longer than 50 tokens are filtered out and the dataset is shuffled. One percent of it is used as a development set (36495 sentences) and the test set is a news dataset of 2169 sentences. We also use the 2013 test set (3000 sentences) for motivating experiments in section 3.

All models have hidden representations of size 512, they are trained on a Tesla M40 (12GB of memory) using adaptive moment estimation with a learning rate of 0.01 and gradients are clipped at 10. Training takes four days on the full 100k vocabulary model. Data processing, including vocabulary selection, is done in separates threads and do not affect reported times. Testing is made with beams of size 5.

A CPU cluster was used to train the 100k SVMs.

5 Results

Co-occurrence methods

Our aim is to reduce the average vocabulary size as much as possible while retaining similar accuracy, and all methods using co-occurrences failed to do so.

The method based on co-occurrences has a bias towards frequent words: For example, if “John” is translated half the time as “John” and half the time as “Johannes”, very frequent tokens such as “.”, “,” and “die” are likely to co-occur more often and would thus rank higher. We cannot get sufficient coverage out of this method without having vocabulary sizes nearing the original size.

PCA, while based on the same co-occurrence matrix as the former method, produces slightly better results. However, coverage is still too small to get an accuracy close to the full vocabulary model.

Model Parameters Avg. voc. Coverage BLEU
Full vocabulary 100k 93.3% 19.83
Co-occurrences 2k common - top 50 2236 80.0% 17.91
PCA No common - top 50 4178 79.7% 17.44
Word align No common - top 50 3587 88.8% 19.80
1k common - top 20 2265 86.7% 19.77
2k common - top 10 2552 85.6% 19.77
Phrase align 1k common - top 50 2803 86.8% 19.7
2k common - top 20 2817 85.8% 19.62
2k common - top 50 3529 87.4% 19.81
SVM No common 4323 84.2% 18.36
Table 1: Average vocabulary size, coverage and BLEU score of full and reduced vocabulary models on WMT15 with batches of size 16.


Word-level alignment achieves the best results, they are reported in table 1. Three possible trade-offs between using a fixed common words vocabulary and fetching more candidates are shown, all of which get within 0.06 BLEU points of the full vocabulary model. (SOME OF THESE ARE BLEU SCORES OBTAINED WITH PRETRAINED FULL MODEL, UPDATE WITH ACTUAL REDUCED TRAININGS)

Taking this further with phrase-based alignments did not help and even performs a little worse: while we can retain satisfying accuracy with 2k common words and fetching the top 50 candidates, accuracy degrades faster than with word alignment methods when we try to reduce the vocabulary size through either common words or selection.

Last, various calibration heuristics were tried for the SVM method to ensure consistency across classifiers but none produced interesting results.

We remark that while coverage does generally correlates with BLEU scores, the word alignment system achieves better accuracy than the phrase-alignment system at a given coverage.

Vocabulary selection

Figure 2 shows average vocabulary sizes for various combinations of common words and candidate selection with the word alignment method on batches of size 1 and 32. We observe that having no common words when decoding produces vocabularies smaller than a thousand words, top 50 being competitive with the full vocabulary method in terms of performance. Taking common words is therefore not needed at decoding time.

However, when using batches of size 32 like in our training setting, vocabulary size is dominated by subset selection sizes. It is thus interesting to use common words in order to fix the most used embeddings in memory and not having to reorder them at every batch through the look-up table.

Figure 2: Average vocabulary sizes for unbatched (left) and batch 32 (right), for common vocabulary sizes 0 (blue), 1k (green) and 2k (red), for various number of candidates with word alignment strategy on development set.
Figure 3: Decoding times for various average vocabulary sizes on CPU with beam size 5 as percentage of full 100k vocabulary model decoding times. Blue, green and red correspond to 0, 1k and 2k common words. Full 100k vocabulary model is 2.3 s/sentence.
Figure 4: Processing times at various average vocabulary sizes during GPU training with batchsize 32. Blue, green and red correspond to 0, 1k and 2k common words. Full 100k vocabulary model is dashed line.
Figure 5: Coverage of reference sentence with selected vocabulary for unbatched (left) and batch 32 (right), for common vocabulary sizes 0 (blue), 1k (green) and 2k (red), for various number of candidates, with word alignment strategy on development set. Dashed line is full 100k vocabulary reference.
Figure 6: BLEU score with selected vocabulary for unbatched (left) and batch 32 (right), for common vocabulary sizes 0 (blue), 1k (green) and 2k (red), for various number of candidates, with word alignment strategy on development set. Dashed line is full 100k vocabulary reference.

Selection at decoding time

Selection can also be applied at decoding time on a model that was trained with a full vocabulary, such results are shown in figure 6. COMMENT MORE


We measure the total time it takes to achieve one epoch over the development set in a training and in a decoding situation. (EITHER USE ACTUAL TRAINING TIMES OR BE EXPLICIT ABOUT THIS BEING FORWARD ONLY) Figure

3 plots times obtained with the align method using top 10, 20 or 50 candidates. There is no batching here, average vocabulary sizes are thus tiny.

The training setup (figure 4), which uses batches of size 32. Vocabulary reduction gives a smaller speedup on the GPU than on the CPU, at about a third of the duration of the full model.

We see important offsets at vocabulary size zero: the output layer is not the only bottleneck in neural machine translation. The encoder, which is not affected by the vocabulary reduction strategies, makes for a significant portion of global time.

Memory gains

In the full vocabulary model, a large part of the limited GPU memory is used to store the scoring embeddings which are dot-multiplied with the decoder output. In our setup, training with batches of size 32, the translation model uses 3.4GiB with the full 100k scoring embedding but only 1.3GiB when using a vocabulary size of 5k tokens.

The memory bottleneck that is encountered when using big vocabularies is removed; the global vocabulary size can be as big as fits RAM since it does not impact GPU memory use and hidden representation sizes can be increased significantly. For instance, doubling from 512 to 1024 the hidden size in our model with 5k tokens raises memory use to 2.7GiB.

6 Conclusion

We have explored new strategies for vocabulary selection for addressing the neural machine translation target vocabulary bottleneck. We gather from our experiments that a simple word alignment system provides the best strategy, augmenting training speed as well as decoding speed.

Vocabulary reduction makes systems faster at current vocabulary sizes, but it would also be interesting to use it as a way to handle much larger vocabularies and hidden representations. Since only a small part of the embedding is held in memory at any time, it would be possible to store larger embeddings in RAM regardless of GPU memory capacities.