Machine translation is a sequence to sequence problem, encoder-decoder deep learning architectures are therefore particularly appropriate to solve it. Neural machine translation has gotten good enough to challenge traditional statistical methods in recent years[Bahdanau].
One of the issues NMT faces is related to vocabulary sizes. The output layer of the decoder is of linear complexity with respects to target vocabulary size because to generate the next token one needs to score every possible token every time. The number of possible target words is thus generally set between 30k and 80k, which makes NMT systems produce a lot of out-of-vocabulary (OOV) tokens to fill the gaps.
One approach to deal with that issue is to use OOV tokens that contain alignment information, this allows a finer post-processing step of OOV replacement [LuongUnk]. However, this is only a way of doing damage-control and does not provide a solution to train large vocabularies. Characters, whether as last resort for rare words [LuongHybrid] or for everything [Ling], and sub-word units [Senrich] can be used instead of words to manage open vocabularies. To use very large vocabularies, Jean groups training data in sets that rely on smaller sub-vocabularies, and an alignment model is used to select the relevant vocabulary at decoding time. IBM build on this approach by leveraging phrase-level alignments on top of the word-based dictionary and by using it at both training and decoding time.
Our approach extends on the latter, making the following contributions:
we explore several other vocabulary selection methods;
we quantify the gains obtained by these vocabulary selection methods.
2 Translation model
The neural system we use is an in-house Torch reimplementation of the encoder-decoder architecture similar of Bahdanau. We have one embedding for the source language and three for the target language: one for the attention module, one as input to the decoder and one to be multiplied with the decoder output for scoring. The former embeddings are all needed in memory but the other three are look-up tables from which we use only a few columns at a time.
A bidirectional LSTM is used as encoder; hidden representations of the forward and the backward passes are simply stacked.
The attention module is a global dot-product style soft attention mechanism [LuongAttention]
: the previous decoder hidden state is projected in the source word space, added to the attentional embedding of the previous target token and dot-multiplied with every encoder output. These products are then used to compute a weighted sum of the encoder outputs, the attentional vector.
The decoder is a conditional LSTM which takes as inputs the attentional vector, the decoding embedding of the previous target token and the previous hidden state to output the next hidden state. Scoring embedding vectors are then dot-multiplied with this hidden representation to obtain token-wise scores.
Training samples are bucket-batched on target sentence length: we order translation pairs by target length and source length to break ties (this limits source side padding) and make batches that contain only samples of the same target length, leaving some batches not full if necessary.
3 Vocabulary selection methods
One important challenge of current neural translation systems is the vocabulary size of the target language. In figure 1, displaying BLEU scores obtained on the English-German WMT 2013 test set with varying vocabulary sizes, we see that no plateau has been reached; it would be useful to get more tokens. Some languages that are more inflected than German such as Czech may benefit from bigger vocabularies even more.
However, using 100k tokens or more creates two problems:
for every token that is added to an output sentence, we have to compute the scalar product of the hidden representation produced by the decoder with all target token embeddings for scoring purposes. This adds up linearly with the vocabulary size;
storing target embeddings in the limited GPU memory sparks a trade-off between vocabulary size and hidden representation sizes. Best performance is obtained when vocabulary size is at 100k in our model, but it would be better to use larger vocabularies and larger hidden representations.
We address this output layer bottleneck by using a small dynamic vocabulary. For each batch that is presented to the network, a subset of the large global vocabulary is selected and only its tokens are scored in the output layer.
To implement this, we add a look-up table from which the target vocabulary embedding matrix is extracted for each batch and multiplied to the decoder output. We next explain how these small vocabularies are obtained.
3.1 Selection methods
The selection strategies are functions that take a sentence in the source language as inputs and return a set of tokens in the target language. These functions are learned on a parallel translation corpus. Most of them build on a word level selection function of which we aggregate the selections at the sentence and batch levels. At training time, we add the reference tokens to this selection.
The simplest method is to count co-occurrences of each (source token, target token)
pair in the dataset, which can be thought of as an estimation ofwith and
source language and target language tokens. These probabilities are then used to fetch a fixed number of target tokens for every source token (the topk, k
being a hyperparameter). The selected vocabulary is the union of all of these selection over a sentence.111A similar approach where co-occurrence counts are normalized by target token counts (which estimates ) was tried but gave poor results because of instabilities in the the very low frequencies.
Principal component analysis
A PCA on the co-occurrence matrix of the joint source and target vocabulary can be used to compute a bilingual token embedding [PCA]; for every source token, we retrieve and rank the closest target tokens in it and take the top k closest. We then aggregate all top selections to define the sentence-wise subset. A PCA of dimension 1024 is used. This approach is a form of regularization of the co-occurrence method; while the latter requires exact matches between word pairs, PCA takes advantage of the co-occurrence sparsity to reduce the space to a dimension in which we can perform nearest neighbor search to retrieve semantically similar words.
We use the unsupervised word aligner fast_align [FastAlign] to estimate which (source token, target token) pairs are aligned between two parallel sentences to design another selection method. Everything works like in the co-occurrence method where we replace co-occurrence counts by alignment counts.
A more advanced alignment method consists in aligning n-grams directly. We consider every n-gram in the source sentence (), retrieve aligned target n-grams and output the union of all the tokens that are present in the n-grams. Our alignment dictionary is thresholded at five alignments so that only relevant n-gram translations appear in it, therefore most source n-grams (which are not likely to be a general phrase) will not have any correspondence in the target language.
This method aims at capturing more subtle links that might be missed by word based methods. For instance, the tokens of “pass out”, taken individually, are unlikely to make a word-based selection function retrieve the tokens necessary to translate it as “s’évanouir” or “tomber dans les pommes”, but a phrase-based method should.
Support vector machines
Taking this idea one step further, we try to leverage the whole source sentence to select target tokens by using cheap classifiers, one per target token, that take the one-hot encoded sentence as input and output a score related to the probability of presence of a that token in a valid translation. We use sparse linear support vector machines trained by stochastic gradient descent[bottou-2010].
SVMs are trained separately and thus need to be calibrated by coherently defining one threshold for each SVM above which its predictions are considered positive; we do this either by using recall on the validation set or by matching token frequencies in the validation set with a given multiplier.
3.2 Common words
On top of the token set that is selected by aforementioned strategies, we can add a set of common words that are always used –for example, the 1000 most frequent words. Doing this is interesting for two reasons: it makes sure no important tokens such as syntactic words or punctuation are omitted; it reduces the cost of embedding reordering that happens in the look-up table by keeping fixed the columns that would be frequently moved.
4 Experimental setup
Experiments are based on the English-German WMT15 dataset, it is composed of 1.64M sentences from EU parliament, 1.81M from a web crawl and 0.2M from news commentaries. Sentences longer than 50 tokens are filtered out and the dataset is shuffled. One percent of it is used as a development set (36495 sentences) and the test set is a news dataset of 2169 sentences. We also use the 2013 test set (3000 sentences) for motivating experiments in section 3.
All models have hidden representations of size 512, they are trained on a Tesla M40 (12GB of memory) using adaptive moment estimation with a learning rate of 0.01 and gradients are clipped at 10. Training takes four days on the full 100k vocabulary model. Data processing, including vocabulary selection, is done in separates threads and do not affect reported times. Testing is made with beams of size 5.
A CPU cluster was used to train the 100k SVMs.
Our aim is to reduce the average vocabulary size as much as possible while retaining similar accuracy, and all methods using co-occurrences failed to do so.
The method based on co-occurrences has a bias towards frequent words: For example, if “John” is translated half the time as “John” and half the time as “Johannes”, very frequent tokens such as “.”, “,” and “die” are likely to co-occur more often and would thus rank higher. We cannot get sufficient coverage out of this method without having vocabulary sizes nearing the original size.
PCA, while based on the same co-occurrence matrix as the former method, produces slightly better results. However, coverage is still too small to get an accuracy close to the full vocabulary model.
|Co-occurrences||2k common - top 50||2236||80.0%||17.91|
|PCA||No common - top 50||4178||79.7%||17.44|
|Word align||No common - top 50||3587||88.8%||19.80|
|1k common - top 20||2265||86.7%||19.77|
|2k common - top 10||2552||85.6%||19.77|
|Phrase align||1k common - top 50||2803||86.8%||19.7|
|2k common - top 20||2817||85.8%||19.62|
|2k common - top 50||3529||87.4%||19.81|
Word-level alignment achieves the best results, they are reported in table 1. Three possible trade-offs between using a fixed common words vocabulary and fetching more candidates are shown, all of which get within 0.06 BLEU points of the full vocabulary model. (SOME OF THESE ARE BLEU SCORES OBTAINED WITH PRETRAINED FULL MODEL, UPDATE WITH ACTUAL REDUCED TRAININGS)
Taking this further with phrase-based alignments did not help and even performs a little worse: while we can retain satisfying accuracy with 2k common words and fetching the top 50 candidates, accuracy degrades faster than with word alignment methods when we try to reduce the vocabulary size through either common words or selection.
Last, various calibration heuristics were tried for the SVM method to ensure consistency across classifiers but none produced interesting results.
We remark that while coverage does generally correlates with BLEU scores, the word alignment system achieves better accuracy than the phrase-alignment system at a given coverage.
Figure 2 shows average vocabulary sizes for various combinations of common words and candidate selection with the word alignment method on batches of size 1 and 32. We observe that having no common words when decoding produces vocabularies smaller than a thousand words, top 50 being competitive with the full vocabulary method in terms of performance. Taking common words is therefore not needed at decoding time.
However, when using batches of size 32 like in our training setting, vocabulary size is dominated by subset selection sizes. It is thus interesting to use common words in order to fix the most used embeddings in memory and not having to reorder them at every batch through the look-up table.
Selection at decoding time
Selection can also be applied at decoding time on a model that was trained with a full vocabulary, such results are shown in figure 6. COMMENT MORE
We measure the total time it takes to achieve one epoch over the development set in a training and in a decoding situation. (EITHER USE ACTUAL TRAINING TIMES OR BE EXPLICIT ABOUT THIS BEING FORWARD ONLY) Figure3 plots times obtained with the align method using top 10, 20 or 50 candidates. There is no batching here, average vocabulary sizes are thus tiny.
The training setup (figure 4), which uses batches of size 32. Vocabulary reduction gives a smaller speedup on the GPU than on the CPU, at about a third of the duration of the full model.
We see important offsets at vocabulary size zero: the output layer is not the only bottleneck in neural machine translation. The encoder, which is not affected by the vocabulary reduction strategies, makes for a significant portion of global time.
In the full vocabulary model, a large part of the limited GPU memory is used to store the scoring embeddings which are dot-multiplied with the decoder output. In our setup, training with batches of size 32, the translation model uses 3.4GiB with the full 100k scoring embedding but only 1.3GiB when using a vocabulary size of 5k tokens.
The memory bottleneck that is encountered when using big vocabularies is removed; the global vocabulary size can be as big as fits RAM since it does not impact GPU memory use and hidden representation sizes can be increased significantly. For instance, doubling from 512 to 1024 the hidden size in our model with 5k tokens raises memory use to 2.7GiB.
We have explored new strategies for vocabulary selection for addressing the neural machine translation target vocabulary bottleneck. We gather from our experiments that a simple word alignment system provides the best strategy, augmenting training speed as well as decoding speed.
Vocabulary reduction makes systems faster at current vocabulary sizes, but it would also be interesting to use it as a way to handle much larger vocabularies and hidden representations. Since only a small part of the embedding is held in memory at any time, it would be possible to store larger embeddings in RAM regardless of GPU memory capacities.