Factored Neural Machine Translation

by   Mercedes García-Martínez, et al.

We present a new approach for neural machine translation (NMT) using the morphological and grammatical decomposition of words (factors) on the output side of the neural network. This architecture addresses two main problems in MT: handling a large target language vocabulary and out-of-vocabulary (OOV) words. By means of factors, we can handle a larger vocabulary and reduce the training time (for systems with equivalent target language vocabulary size). In addition, we can produce new words that are not in the vocabulary. We use a morphological analyser to obtain a factored representation of each word (lemma, Part of Speech tag, tense, person, gender and number). We have extended the attention-based NMT approach to have two different outputs, one for the lemmas and the other for the rest of the factors. The final translation is built using some a priori linguistic information. We compare our extension with a word-based NMT system. The experiments, performed on the IWSLT'15 dataset translating from English to French, show that while the performance does not always increase, the system can manage a much larger vocabulary and consistently reduces the OOV rate. We observe up to 2% BLEU point improvement in a simulated out-of-domain translation setup.




1 Introduction

Neural Machine Translation (NMT) has developed rapidly in recent years [Bahdanau et al.2014]. In contrast to traditional phrase-based statistical machine translation [Koehn et al.2007], which automatically translates subparts of the sentences, NMT uses a sequence-to-sequence approach over words [Cho et al.2014].

Recently, NMT has improved on the results of phrase-based systems [Bahdanau et al.2014]. Despite these improvements, some problems remain. One problem is the high computational cost of the target word probability, because the softmax requires normalizing over all the output values (see Equation 1):

$$p_i = \frac{e^{o_i}}{\sum_{r=1}^{N} e^{o_r}} \qquad (1)$$

where $o_i$ are the outputs, $p_i$ their softmax normalization and $N$ the total number of outputs.

In order to solve this issue, a standard technique is to define a short-list containing the most frequent words only. This has the disadvantage of increasing the out of vocabulary (OOV) rate. OOV words correspond to those unseen in the training dataset or which are not included in the vocabulary. They are all considered as unknown words and mapped to the special UNK token.
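The short-list technique and its effect on the OOV rate can be illustrated with a minimal sketch. The function names and the toy corpus are hypothetical and only serve to show the mechanism: the softmax sums over every output, so a smaller short-list is cheaper, but every word outside it becomes UNK.

```python
import math
from collections import Counter

UNK = "UNK"

def softmax(outputs):
    """Normalize raw outputs o_1..o_N into probabilities p_1..p_N."""
    m = max(outputs)  # subtract the maximum for numerical stability
    exps = [math.exp(o - m) for o in outputs]
    total = sum(exps)  # this sum runs over all N outputs
    return [e / total for e in exps]

def build_shortlist(sentences, size):
    """Keep only the `size` most frequent words of the training data."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, _ in counts.most_common(size)}

def map_to_unk(sentence, shortlist):
    """Replace every word outside the short-list by the UNK token."""
    return [word if word in shortlist else UNK for word in sentence]

def oov_rate(sentences, shortlist):
    """Fraction of running words that fall outside the short-list."""
    tokens = [word for sentence in sentences for word in sentence]
    return sum(word not in shortlist for word in tokens) / len(tokens)

# Toy corpus (illustrative only).
corpus = [["le", "chat", "dort"], ["le", "chien", "dort"], ["le", "chat", "mange"]]
shortlist = build_shortlist(corpus, size=3)
```

With this toy corpus, shrinking `size` lowers the number of terms in the softmax sum but raises `oov_rate`, which is the trade-off discussed above.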

Jean et al. proposed to carefully organise the batches so that only a subset of the target vocabulary can be generated at training time. This allows the system to perform the softmax only on this subset during training (the complexity remains the same at test time). Another possibility is to define a structured output layer (SOUL) to handle the words not appearing in the shortlist. This allows the system to always apply the softmax normalization on a layer of reduced size [Le et al.2011].

Recently, some works have used subword units instead of words as the translation unit. Sennrich et al. encode rare and some unknown words as subword units with the Byte Pair Encoding (BPE) method. The authors show that this can also generate words which are unseen at training time. As an extreme case, character-level neural machine translation has been presented in several works [Chung et al.2016, Ling et al.2015, Costa-Jussà and Fonollosa2016] and has shown very promising results.

In this work we propose an approach that uses factors as the output unit of the neural network.

Factors refer to linguistic annotations at word level, such as Part of Speech (POS) tags. The Moses toolkit [Haddow and Koehn2012] for statistical machine translation can handle factor information in addition to the words in order to improve the translation. Some works have used factors as additional information for language modeling [Bilmes and Kirchhoff2003, Alexandrescu2006]. Recently, factors have also been used as linguistic input features to improve NMT [Sennrich and Haddow2016].

Our approach differs from previous works in that we use the linguistic decomposition of the words only on the output side. Each word is represented by its lemma along with its linguistic factors (POS tag, tense, gender, number and person). In this way, the target vocabulary size is reduced because we do not have to keep all the derived forms of the verbs, nouns, adjectives, etc. Furthermore, we are able to produce new words that are not in the vocabulary using all the derived forms of the lemmas.

We use two different outputs for the translation at word level: one output is the lemma of the word and the other is the rest of the factors mentioned earlier. Multiple-output neural networks have been used before [Firat et al.2016], with the difference that in our approach the system produces both outputs at the same time instead of scheduling them. With both outputs (lemma and factors) we are able to generate the final word using linguistic resources.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/

2 Neural Machine Translation

The encoder-decoder architecture used for NMT consists of two recurrent neural networks (RNN), one for the encoder and the other for the decoder. The encoder maps a source sequence into a continuous space representation and the decoder maps the representation back to a target sequence. Our trained neural translation models are based on a bidirectional encoder-decoder deep neural network equipped with an attention mechanism [Bahdanau et al.2014], as described in Figure 1.

Figure 1: Architecture of the NMT system equipped with an attention mechanism.

This architecture consists of a bidirectional RNN as an encoder (stage 1 of Figure 1). An input sentence is encoded as a sequence of annotations (one for each input word), each corresponding to the concatenation of the outputs of a forward and a backward RNN. Each annotation represents the full sentence with a strong focus on the current word. The decoder is composed of a conditional RNN as provided for the DL4MT winter school (https://github.com/nyu-dl/dl4mt-tutorial; see stage 3 of Figure 1), equipped with an attention mechanism (stage 2). The attention mechanism provides a weight for each annotation in order to generate a context vector (a weighted sum over the annotations). It uses the hidden state of the decoder RNN together with each annotation to generate a coefficient. A softmax operation is performed over those coefficients to generate the annotation weights. As described in [Bahdanau et al.2014], the annotation weights can be used to align the input words to the output words. The RNN takes as input the context vector, the embedding of the previous output word (stage 4), and its own hidden state. Finally, at stage 5 of Figure 1, the output probabilities over the target vocabulary are computed. The word with the highest probability is selected as the translation at each timestep. The encoder and the decoder are trained jointly to maximize the conditional probability of the correct translation.
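The attention step described above (one coefficient per annotation, a softmax over the coefficients, then a weighted sum) can be sketched as follows. This is a minimal pure-Python illustration assuming the additive scoring function of [Bahdanau et al.2014]; the parameter names `W`, `U` and `v` are placeholders for illustration, not the names used in the actual implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(matrix, vector):
    return [dot(row, vector) for row in matrix]

def attention(decoder_state, annotations, W, U, v):
    """Return (annotation weights, context vector).

    Assumed scoring: score_j = v . tanh(W * decoder_state + U * annotation_j).
    """
    projected_state = matvec(W, decoder_state)
    scores = []
    for annotation in annotations:
        projected_annotation = matvec(U, annotation)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_annotation)]
        scores.append(dot(v, hidden))
    # Softmax over the coefficients gives the annotation weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: weighted sum over the annotations.
    dim = len(annotations[0])
    context = [sum(w * a[d] for w, a in zip(weights, annotations)) for d in range(dim)]
    return weights, context

identity = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attention(
    decoder_state=[0.1, 0.2],
    annotations=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    W=identity, U=identity, v=[1.0, 1.0],
)
```

The weights sum to one, so the context vector is a convex combination of the annotations; the same weights can be read as a soft alignment between input and output words.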

3 Factors in Neural Machine Translation

To perform factored neural machine translation, we need to extend the standard NMT architecture of Figure 1 to allow generating several output symbols at the same time. For the sake of simplicity, we decided to generate only two symbols: the lemma and the concatenation of the different factors that we considered. For example, from the word devient, we obtain the lemma devenir and the factors vP3#s, meaning that it is a verb, in the present tense, 3rd person, irrelevant gender (#) and singular. The morphological and grammatical analysis is performed with the MACAON toolkit [Nasr et al.2011]. Figure 2 shows this modification.
Figure 2: Output detail of the Factored NMT architecture.
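The factored representation of the devient example can be sketched as below. Only the tag vP3#s and the field order (POS, tense, person, gender, number, with # for an irrelevant value) come from the text; the `Factors` class, its field names and the `factorize` helper are hypothetical, and the actual tag inventory is the one produced by MACAON.

```python
from dataclasses import dataclass

IRRELEVANT = "#"  # marks a factor that does not apply to the word

@dataclass(frozen=True)
class Factors:
    pos: str
    tense: str = IRRELEVANT
    person: str = IRRELEVANT
    gender: str = IRRELEVANT
    number: str = IRRELEVANT

    def tag(self):
        """Concatenate the factors into a single output symbol."""
        return self.pos + self.tense + self.person + self.gender + self.number

def factorize(analysis):
    """Split analysed words into the two target streams.

    `analysis` is a list of (word, lemma, Factors) triples, as a
    morphological analyser might return them (assumed format).
    """
    lemmas = [lemma for _, lemma, _ in analysis]
    tags = [factors.tag() for _, _, factors in analysis]
    return lemmas, tags

# devient -> lemma "devenir", factors "vP3#s"
devient = ("devient", "devenir", Factors(pos="v", tense="P", person="3", number="s"))
lemmas, tags = factorize([devient])
```

Both streams have one symbol per word, which is what lets the network emit a lemma and a factors tag at every timestep.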

As we can see, lemmas and factors are generated separately, which in some cases leads to sequences of different lengths. To solve this problem, we give priority to the length of the lemma sequence. Consequently, we constrain the length of the factors sequence to be equal to the length of the lemma sequence. This is motivated by the fact that the lemmas are closer to the final objective (a sequence of words) and that they are the symbols carrying most of the meaning.

Another issue is the feedback that the RNN receives. In the word-based model, this feedback is the embedding of the previously generated word. Since we have two outputs, we have to decide what will be given to the decoder RNN. Several options are possible and are explored in this paper (details in section 4.3.2). For the first set of experiments, only the previous lemma embedding was used as feedback (no information from the factors output).

3.1 Handling beam search with factors

The beam search procedure has also been extended with respect to the original approach, since we are actually facing two beams (one for lemmas and one for factors). We need to deal with the multiple outputs because we do not want to rely solely on the lemma sequence to decide which are the best sequences. We therefore merge the two beams. Once the best lemma and factors hypotheses are generated for each partial hypothesis, the cross product of those output spaces is performed. In this way, each lemma hypothesis is associated with each factors hypothesis. Afterwards, we keep the $k$-best combinations for each sample, with $k$ being the beam size. Finally, the number of best hypotheses is reduced again to the beam size for further processing.
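The merging of the two beams can be sketched as follows. This is an illustrative sketch under one stated assumption: a (lemma, factors) pair is scored by the sum of the two log-probabilities, which the text does not specify. The function names and the `expand` callback are hypothetical.

```python
import heapq
from itertools import product

def merge_beams(lemma_hyps, factor_hyps, beam_size):
    """Cross product of the two output spaces for one partial hypothesis.

    Each input is a list of (symbol, log_prob). Returns the beam_size best
    (lemma, factors, score) triples. Scoring by the sum of log-probabilities
    is an assumption made for this sketch.
    """
    combos = (
        (lemma, factors, lemma_lp + factors_lp)
        for (lemma, lemma_lp), (factors, factors_lp) in product(lemma_hyps, factor_hyps)
    )
    return heapq.nlargest(beam_size, combos, key=lambda c: c[2])

def beam_step(partials, expand, beam_size):
    """Extend every partial hypothesis, then prune back to beam_size.

    `partials` is a list of (lemma_seq, factors_seq, score); `expand` is a
    placeholder for the decoder and returns (lemma_hyps, factor_hyps).
    """
    candidates = []
    for lemma_seq, factors_seq, score in partials:
        lemma_hyps, factor_hyps = expand(lemma_seq, factors_seq)
        for lemma, factors, step_score in merge_beams(lemma_hyps, factor_hyps, beam_size):
            candidates.append((lemma_seq + [lemma], factors_seq + [factors], score + step_score))
    # Final reduction to the beam size for further processing.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[2])

def toy_expand(lemma_seq, factors_seq):
    # Fixed toy distributions standing in for the two softmax outputs.
    return ([("devenir", -0.1), ("être", -1.0)], [("vP3#s", -0.2), ("vP1#s", -2.0)])

best = beam_step([([], [], 0.0)], toy_expand, beam_size=2)
```

Because every extension adds exactly one lemma and one factors symbol, the two sequences of a hypothesis always keep the same length.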

3.2 From factors to word

Once we obtain the factorized outputs from the neural network, we need to fall back to the word representation. This operation is also performed with the MACAON tool, which, given a lemma and some factors, provides the word candidate.
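Conceptually, the fall-back to words is a lookup from (lemma, factors) to a surface form. The sketch below uses a toy lexicon to illustrate the idea; it does not reproduce the MACAON interface, and falling back to the bare lemma when no form is known is an assumption made for this illustration.

```python
def build_generator(lexicon):
    """Build a (lemma, factors) -> word function from lexicon entries.

    `lexicon` is an iterable of (word, lemma, factors_tag) triples
    (assumed format for this sketch).
    """
    table = {}
    for word, lemma, tag in lexicon:
        # Keep the first form seen for a given (lemma, factors) pair.
        table.setdefault((lemma, tag), word)

    def generate(lemma, tag):
        # Assumed fallback: emit the lemma itself if no form is known.
        return table.get((lemma, tag), lemma)

    return generate

def to_words(lemmas, tags, generate):
    """Rebuild the word sequence from the two output sequences."""
    return [generate(lemma, tag) for lemma, tag in zip(lemmas, tags)]

# Toy lexicon containing only the example from section 3.
generate = build_generator([("devient", "devenir", "vP3#s")])
words = to_words(["devenir"], ["vP3#s"], generate)
```

Since the lexicon is independent of the training corpus, the lookup can return word forms that never appeared in the training data.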

4 Experiments

We performed several sets of experiments trying different architectures and vocabulary sizes for Factored NMT (FNMT) and comparing them with the NMT system.

4.1 Data processing and selection

We evaluate our approach on the English to French Spoken Language Translation task from the IWSLT 2015 evaluation campaign (https://sites.google.com/site/iwsltevaluation2015). A data selection method [Rousseau2013] has been applied using the available parallel corpora (news-commentary, united-nations, europarl, wikipedia, and two crawled corpora), with the Technology Entertainment Design (TED, https://www.ted.com) corpus as the in-domain corpus. We also preprocess the data to convert HTML entities and filter out sentences with more than 50 words on either the source or the target side. We end up with a selected corpus of 2M sentences, with 147K unique words on the English side and 266K unique words on the French side.

4.2 Training

We chose the following hyperparameters to train the systems. The embedding and recurrent layers have a dimensionality of 620 and 1000, respectively. We use a minibatch size of 80 sentences and train with the Adadelta algorithm. The norm of the gradient is clipped to be no more than 1 [Pascanu et al.2012] and the weights are initialized with Xavier initialization [Glorot and Bengio2010]. Validation starts at the second epoch and is performed every 5000 updates. Early stopping is based on BLEU with a patience of 10 (training stops after 10 evaluations without improvement in BLEU). The vocabulary size of the source language is set to 30K. We varied the output layer size from 5K to 30K in order to simulate different levels of out-of-domain data. Once the model is trained, we set the beam size to 12 (the standard value for NMT [Bahdanau et al.2014]) when translating the development corpus.
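The training controls described above (gradient norm clipping at 1, validation every 5000 updates from the second epoch, early stopping with a patience of 10) can be sketched as below. This is a generic illustration, not the code of the toolkit used; all names are hypothetical and epochs are assumed to be counted from 1.

```python
import math

def clip_gradient_norm(grads, max_norm=1.0):
    """Rescale the gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def should_validate(update, epoch, every=5000, start_epoch=2):
    """Validate every `every` updates, starting at the second epoch."""
    return epoch >= start_epoch and update % every == 0

class EarlyStopper:
    """Stop after `patience` validations without BLEU improvement."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_bleu = float("-inf")
        self.bad_evaluations = 0

    def update(self, bleu):
        """Record one validation score; return True if training should stop."""
        if bleu > self.best_bleu:
            self.best_bleu = bleu
            self.bad_evaluations = 0
        else:
            self.bad_evaluations += 1
        return self.bad_evaluations >= self.patience

# Example with a short patience to keep the trace small.
stopper = EarlyStopper(patience=3)
decisions = [stopper.update(score) for score in [30.0, 31.0, 30.5, 30.9, 31.0]]
```

Note that a validation score equal to the best one counts as no improvement in this sketch; whether ties reset the patience counter is a design choice the text leaves open.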

4.3 Factors models and results

The Factored NMT system aims at integrating linguistic knowledge into the decoder in order to obtain better performance when facing out-of-domain data and/or a low-resource setup. To assess the feasibility and estimate the potential gain of our approach, we performed a set of experiments reducing the output vocabulary size, simulating such an environment. The results are presented in Table 1.


Model  Output size  Word vocab.  Coverage (%)   #OOV  #Par.  %BLEU word  %BLEU lem.  %BLEU fact.  Oracle %BLEU word
NMT    30K          30K          97.96          1775  89.7M  34.88       -           -            -
FNMT   30K+142      172K         99.23 (+1.27)  784   89.8M  34.80       37.78       42.72        36.33
NMT    20K          20K          97.03          2171  77.3M  34.21       -           -            -
FNMT   20K+142      139K         98.88 (+1.85)  1014  77.4M  34.46       37.52       42.65        36.14
NMT    10K          10K          94.51          3996  64.9M  32.61       -           -            -
FNMT   10K+142      85K          97.72 (+3.21)  1897  64.9M  34.13       37.07       42.75        35.72
NMT    5K           5K           91.02          6545  58.7M  30.54       -           -            -
FNMT   5K+142       48K          95.61 (+4.59)  3424  58.7M  32.55       35.22       42.98        33.86
Table 1: Comparison of the performance of the NMT and FNMT systems in terms of %BLEU score evaluating at word level, and separately, each output lemma and factors. The size of the output layer and the size of the corresponding vocabulary are presented in columns 2 and 3. Columns 4 and 5 show coverage in test dataset and number of OOVs, respectively. Last column corresponds to the oracle output.

The FNMT system obtains a similar performance compared to the NMT system (first two rows) in terms of word level BLEU score, despite the increased complexity of the architecture of our model (and in particular the two outputs).

In order to estimate the capacity of such a model, we computed the oracle, which corresponds to ignoring the errors caused by the factors, i.e. if we produce the correct lemma, then the correct word is generated (see last column of Table 1). We can see that a potential gain of about 1.5 BLEU points (36.33 against 34.80 for the 30K system) can be achieved with a perfect modeling of the factors, which is encouraging.

The first observation is that the Factored NMT approach is able to model a bigger word vocabulary while keeping the output layer size manageable. This is because the factors-to-word tool is able to generate words which are unseen in the training corpus, augmenting the expressiveness of our model. For the sake of comparison, we provide the target vocabulary size for the standard NMT and the FNMT systems. For example, with an output layer size of 30K, the NMT system can model 30K words against 172K words for the FNMT system. This is an almost 6 times larger word vocabulary.

One consequence is that the word coverage is higher for the FNMT than for the NMT system, as shown in column 4. However, for the first two systems (first two rows), the difference between the coverages is small. When decreasing the output layer size, the coverage decreases more slowly for the FNMT systems than for the word-based systems. The FNMT approach surpasses standard NMT when the coverage difference becomes larger. This indicates that the approach is sound and performs well when dealing with out-of-domain data. This is of course dependent on the linguistic knowledge available in the factors-to-word tool. This is exactly the sought behavior: by integrating a priori linguistic knowledge, we reduce the impact of the training conditions (domain, data availability, etc.) on the performance of the system.

The OOV count is reduced by at least 47% (from 6545 to 3424 for the 5K systems) and by up to 56% for the 30K systems, a promising result which is not always well reflected by the BLEU score. These results would be better highlighted by a human evaluation (this point will not be addressed further in this paper). To make things clear, the OOV rate corresponds to the number of UNK tokens generated by our system. In these experiments, we did not use any specific method to replace them (e.g. copying the source words aligned to them, using a dictionary, etc.).

Moreover, the number of parameters to train also decreases with the size of the output layer, as shown in column 6, which simplifies training because fewer weights have to be learned. For example, using a lemma output layer size of 10K instead of 30K (3 times smaller) for the factored model, we obtain a small drop of 0.67 BLEU points. By contrast, for the NMT base model we observe a drop of 2.27 BLEU points when comparing the same output sizes, 30K and 10K.
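The figures quoted in this discussion can be rechecked directly from Table 1. The short script below is only a convenience for the reader: the numbers are copied from the table, and the helper names are hypothetical.

```python
# (NMT, FNMT) values copied from Table 1, keyed by output layer size.
WORD_BLEU = {
    "30K": (34.88, 34.80),
    "20K": (34.21, 34.46),
    "10K": (32.61, 34.13),
    "5K": (30.54, 32.55),
}
OOV = {
    "30K": (1775, 784),
    "20K": (2171, 1014),
    "10K": (3996, 1897),
    "5K": (6545, 3424),
}

def relative_reduction(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before

def bleu_drop(system, large="30K", small="10K"):
    """Word-level BLEU lost when shrinking the output layer."""
    index = {"nmt": 0, "fnmt": 1}[system]
    return WORD_BLEU[large][index] - WORD_BLEU[small][index]

def vocabulary_ratio(fnmt_vocab_k, nmt_vocab_k):
    """How many times larger the FNMT word vocabulary is."""
    return fnmt_vocab_k / nmt_vocab_k

if __name__ == "__main__":
    for size, (nmt_oov, fnmt_oov) in OOV.items():
        print(size, "OOV reduction: %.1f%%" % relative_reduction(nmt_oov, fnmt_oov))
    print("FNMT drop 30K->10K: %.2f" % bleu_drop("fnmt"))
    print("NMT drop 30K->10K: %.2f" % bleu_drop("nmt"))
    print("Vocabulary ratio at 30K: %.1f" % vocabulary_ratio(172, 30))
```

The OOV reduction is smallest for the 5K systems and largest for the 30K systems, while the BLEU advantage of FNMT grows as the output layer shrinks.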

Another interesting remark is that the scores evaluated on lemmas and on factors are higher than the word-level BLEU for all systems. This is due to the difficulty of the final step, which generates the words. Nevertheless, the BLEU for factors is fairly low considering that the output layer size for factors is only 142. This can be due to two different causes. First, the neural network may not be able to correctly model this small output. Second, the task of translating from English words to French factors is complex.

4.3.1 Evaluating each output

We evaluated BLEU at different levels (word, lemma or factors) using base NMT systems with only one output (see Table 2). We compare these values with the results of the Factored NMT system, which models lemmas and factors at the same time. The lemma BLEU of the FNMT system is similar to that of the NMT system. However, the difference on factors between the two systems is large (2.44 %BLEU). This experiment suggests that predicting factors, whose output size is very different from that of the source words, is not easy. In the future we will also implement factors on the input side of the neural network to verify this hypothesis. We also have to take into account that we give priority to the length of the lemma sequence over that of the factors sequence during beam search. This suggests adapting our architecture so that factors are better predicted, in order to obtain a better final BLEU at word level.

Model word lemma factors
NMT 34.88 37.72 45.16
FNMT 34.80 37.78 42.72
Table 2: Comparison of the performances between standard NMT system and the Factored NMT system in terms of %BLEU computed at word, lemma and factors level. The first line corresponds to 3 standard NMT systems built to generate at the output words, lemmas and factors, respectively.

4.3.2 Feedback

As explained in section 2, the decoder RNN is a conditional GRU which is fed with the input context vector, its hidden state and the feedback (i.e. the previously generated symbol). Since we now have two outputs, we need to define what kind of feedback is most suitable for the Factored NMT system. Several solutions are possible.

The first assumption we made is highly dependent on the design of the considered factors, i.e. the lemmas are the most informative factors among all. We therefore tried using only the output lemma embedding as feedback (see equation 2).

$$\mathrm{fb}(y_{t-1}) = y^{L}_{t-1} \qquad (2)$$

where $L$ is the target language lemma lookup table and $y^{L}_{t-1}$ is the embedding of the lemma used to generate the output word $y_{t-1}$.

Another straightforward operation is to sum the embedding of the previous lemma with the embedding of the previous factors, as described in equation 3.

$$\mathrm{fb}(y_{t-1}) = y^{L}_{t-1} + y^{F}_{t-1} \qquad (3)$$

where $y_{t-1}$ is the target output word, and $y^{L}_{t-1}$ and $y^{F}_{t-1}$ are its corresponding lemma and factors embeddings. While this could seem unnatural, by doing this, we hope to obtain a joint vector representation of both the lemma and the factors.

Finally, we investigated whether the neural network can learn a better combination of the lemmas and factors embeddings using a linear (eq. 4) or non-linear (eq. 5) operation instead of a simple sum.


where and are the parameters to be learned.
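The four feedback variants compared in this section (lemma only, sum, linear, non-linear) can be sketched in a few lines. The exact parametrization of the learned combinations is not recoverable from the text, so this sketch assumes an affine map, with a weight matrix `W` and a bias `b`, applied to the summed embeddings, followed by tanh in the non-linear case. All names here are hypothetical.

```python
import math

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def affine(W, b, x):
    """Compute W x + b for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def feedback(lemma_emb, factors_emb, mode, W=None, b=None):
    """Combine the previous lemma and factors embeddings into one vector.

    `W` and `b` are the learned parameters of the assumed affine map;
    they are only needed for the "linear" and "tanh" modes.
    """
    if mode == "lemma":
        return list(lemma_emb)  # factors output is ignored
    summed = add(lemma_emb, factors_emb)
    if mode == "sum":
        return summed
    if mode == "linear":
        return affine(W, b, summed)
    if mode == "tanh":
        return [math.tanh(z) for z in affine(W, b, summed)]
    raise ValueError("unknown feedback mode: %s" % mode)

lemma_emb = [1.0, 2.0]
factors_emb = [0.5, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]
variants = {
    mode: feedback(lemma_emb, factors_emb, mode, identity, zero_bias)
    for mode in ("lemma", "sum", "linear", "tanh")
}
```

The sum requires both embeddings to have the same dimension; the learned variants keep that constraint in this sketch but let the network reweight the combined vector.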

Model Feedback word lemma factors #OOV
NMT - 34.88 - - 1775
FNMT Lemma 34.80 37.78 42.72 784
FNMT Sum 34.48 37.14 44.46 815
FNMT Linear 34.42 37.27 44.03 868
FNMT Tanh 34.58 37.28 43.96 757
Table 3: Performance in terms of %BLEU computed on word, lemma and factors when using different output embedding combinations as feedback.

Table 3 presents the results obtained with systems using the different output embedding combinations as feedback. All systems perform similarly in terms of word-level BLEU, with the best result for the lemma feedback. As expected, when using only the lemma as feedback, the system better estimates the lemma probabilities; as a consequence, the performance on factors drops. The comparison between the lemma %BLEU (fourth column of Table 3) and the number of OOVs (sixth column) shows a correlation between those two values, except for the non-linear combination, which has the lowest number of OOVs. This suggests that modeling the lemmas better is important for reducing the OOV rate (supporting our assumption that lemmas are more informative) but is not sufficient. In the future we would like to explore combining the two embeddings by concatenation to see whether we can get better results.

4.3.3 Dependency model

One observation that can be made is that while generating factors could seem easier due to the small number of possible outputs (only 142), the BLEU score is not as high as one could expect. However, one could argue that generating a sequence of French factors from a sequence of English words is not an easy task. In order to help the factors prediction, we contextualized the corresponding output with the lemma being generated. This creates a dependency between the lemma output and the factors output. The dependency has been implemented by including a transformation layer (see Figure 3) which projects the lemma embeddings into the hidden layer used to generate factors. The results obtained by combining this dependency with the different feedback types are presented in Table 4.
Figure 3: Dependency model
Model                 Feedback  word   lemma  factors  #OOV
NMT                   -         34.88  -      -        1775
FNMT with dependency  Lemma     34.45  37.45  42.15    770
FNMT with dependency  Sum       34.65  37.34  44.35    800
FNMT with dependency  Linear    34.25  37.02  43.57    822
FNMT with dependency  Tanh      34.38  37.09  43.82    915
Table 4: Results for dependency model

In Table 4, we can observe that the dependency model does not improve the word-level %BLEU scores of Table 3 for the lemma, linear and tanh feedback. However, it does improve the score for the sum feedback. For the sum feedback dependency model, the lemma BLEU improves with respect to the same model without dependency. By contrast, the factors output obtains a lower BLEU. This can occur because the factors output receives more information from the lemma, and when the factors cost is back-propagated, the lemma output can benefit from it during learning. We can also observe that improving the lemma output is more correlated with the word-level evaluation than improving the factors output. Moreover, the number of OOVs is reduced for all the feedback combinations except the tanh feedback, which is not reflected by the automatic score.

4.3.4 Qualitative analysis

We examined some of the translation outputs to better understand in which cases our FNMT system performs better or worse than the NMT system.

1  Src   set of adaptive choices that our lineage made
   Ref   de choix adaptés établis par notre lignée
   NMT   de choix UNK que notre UNK a fait
   FNMT  de choix adaptatifs que notre lignée a fait
2  Src   here ’s the updated version of this entry
   Ref   voici la version actualisée de cette entrée .
   NMT   voici la version mise à jour de cette entrée .
   FNMT  voici la version actualisée de cette entrée .
3  Src   i could draw i could paint
   Ref   je pouvais dessiner . je pouvais peindre .
   NMT   je pouvais dessiner . je pouvais peindre .
   FNMT  je pourrais dessiner . je pouvais peindre .
4  Src   and it ’s a very easy question
   Ref   c’ est une question très simple .
   NMT   c’ est une question très simple .
   FNMT  et c’ est une question très facile .
Table 5: Examples of translations with NMT and Factored NMT.

Translation examples with better BLEU performance

In the first two examples of Table 5, the FNMT system obtains better BLEU score than the NMT system.

The first example shows a case where our factored system generates words where the NMT base system predicts unknown words. First, the word lineage in the source sentence is translated as in the reference (lignée) by the FNMT system and mapped to UNK by the NMT base system. Second, the word adaptive is translated as adaptatifs by the FNMT system; the reference translation is adaptés, but we can consider the FNMT choice a better translation. The NMT system also mapped the word adaptive to UNK.

In the second example, the FNMT translation matches the reference. We are able to generate the new word actualisée (actualiser+past participle+feminine+singular), which is not in the shortlist of the NMT system vocabulary. On the one hand, the word actualisée appears only 40 times in the word vocabulary of the NMT system, so it is excluded from the shortlist. On the other hand, the lemma actualiser appears 172 times, so it is included in the lemma shortlist and we are able to generate actualisée from the lemma and factors outputs. These examples show the potential of our FNMT system for generating new words and reducing unknown words.

Translations with lower BLEU performance

We have also extracted some translations where the FNMT system obtains a lower BLEU than the NMT base system (see Table 5).

Example 3 shows a problem with the factors output: from the correct lemma pouvoir, the FNMT system has generated the word pourrais instead of pouvais. Both translations can be considered correct, but the BLEU score penalizes the FNMT translation.

Finally, in the last example, the translation of the FNMT system is more faithful to the source than that of the NMT system, because it translates the word and as et, which is not included in the reference. In addition, the FNMT system translated easy as facile, a synonym of simple. Consequently, the BLEU score penalizes the FNMT system in this example even though its translation is correct.

5 Conclusion

In this paper, we have proposed an NMT architecture which produces a factored representation of the target language words. Those factors are based on a priori linguistic knowledge. We showed that we are able to train Factored NMT systems with similar performance to word-based systems, but with the advantage of modeling an almost 6 times bigger word vocabulary with only a slight increase in computational cost. A consequence is the OOV rate reduction observed with the FNMT system. Also, the use of additional linguistic resources allows us to generate new word forms that would not be included in the standard NMT system shortlist.

By reducing the target language vocabulary, we simulated an out-of-domain setup, and we showed that our factored NMT method performs better than the basic NMT system in this case.

As future work, we would like to include linguistic features at the input. It is known that this can be helpful for NMT [Sennrich and Haddow2016]. Extending the approach with input factors could make the generation of target language factors simpler. The proposed Factored NMT method could show even better performance if applied to highly inflected target languages such as German, Arabic, Czech, Russian or Hindi.