Enriching the Transformer with Linguistic and Semantic Factors for Low-Resource Machine Translation

Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it can incorporate external knowledge. In particular, our proposed modification, the Factored Transformer, uses factors, either linguistic or semantic, that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best configuration found, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which involves an extremely low-resourced pair of very distant languages, and obtain an improvement of 1.2 BLEU. These improvements are achieved with linguistic and not with semantic information.



1 Introduction

Many classical Natural Language Processing (NLP) pipelines used linguistic or semantic features Koehn and Hoang (2007); Du et al. (2016). In recent years, the rise of neural architectures has diminished the importance of these features. Nevertheless, some works have still shown the effectiveness of introducing linguistic information into neural machine translation systems, typically in recurrent sequence-to-sequence (Seq2seq) architectures Sennrich and Haddow (2016); García-Martínez et al. (2016); España-Bonet and van Genabith (2018).

The motivation for studying strategies for incorporating linguistic or semantic information into state-of-the-art neural machine translation systems is two-fold. On the one hand, it can slightly improve the results in generic settings. On the other hand, and most importantly, it can play a key role in major challenges for machine translation, such as low-resource settings and morphologically different languages. In this work, we provide successful use cases for both situations. We suggest a modification to adapt the current state-of-the-art neural machine translation architecture, the Transformer Vaswani et al. (2017), to work with an arbitrary number of factors such as linguistic or semantic features. Specifically, we study the effect of incorporating such features at the embedding level or at the encoder level, with two different combination strategies and using different linguistic and semantic features, the former being extracted with NLP taggers and the latter from a linked data database. We report improvements in the IWSLT and FLoRes benchmarks.

2 Related Work and Contributions

By factored Neural Machine Translation (NMT), we refer to the use of word features alongside the words themselves to improve translation quality. Both the encoder and the decoder of a Seq2seq architecture can be modified to obtain better translations García-Martínez et al. (2016). The most prominent approach consists of modifying the encoder such that instead of only one embedding layer, the encoder has as many embedding layers as factors, one for words themselves and one for each feature, and then the embedding vectors are concatenated and input to the rest of the model, which remains unchanged Sennrich and Haddow (2016). The embedding sizes are set according to the respective vocabularies of the features. Notice that Byte Pair Encoding (BPE) Sennrich et al. (2016), an unsupervised preprocessing step for automatically splitting words into subwords with the goal of improving the translation of rare or unseen words, was applied to the words here. Thus, the features had to be repeated for each subword. In España-Bonet and van Genabith (2018), the exact same architecture was used, except that this new proposal used concepts extracted from a linked data database, BabelNet Navigli and Ponzetto (2012). These semantic features, synsets, were shown to improve zero-shot translations. All the cited works obtained moderate improvements with respect to the BLEU scores of the corresponding baselines.

Other works have proposed additional ways to combine sources and introduce hierarchical linguistic information: Currey and Heafield (2019); Currey and Heafield (2018); Libovický et al. (2018); Tebbifakhr et al. (2018).

The main goal of this work, in contrast to previous works that used NMT architectures based on recurrent neural networks, is to modify the Transformer to make it compatible with factored NMT, with an architecture that we call the Factored Transformer, and to inject classical linguistic knowledge from lemmas, which is the best-performing feature Sennrich and Haddow (2016), and concepts extracted from the semantic linked data database BabelNet Navigli and Ponzetto (2012). We focus on low-resource datasets.

3 Factored Transformer

Unlike the vanilla Transformer Vaswani et al. (2017), the Factored Transformer can work with factors; that is, instead of receiving only the original source sequence as input, it can also receive an arbitrary number of feature sequences. Those features can be injected at the embedding level, as in the previous works we described above (but in a Transformer instead of a recurrent Seq2seq architecture), or at the encoder level. We have implemented both model variants.

1-encoder model (depicted in Figure 1, left):

Each factor, including the words themselves, has its own embedding layer. The embedding vectors of the different factors are combined, the positional encoding is added, and the result is input to the following layer. The rest of the model remains unchanged. The positional encoding is added to the combined vector and not to each individual embedding because we are not modifying the length of the sequence; therefore, the relative positions remain unchanged.

N-encoders model (depicted in Figure 1, right):

Our intuition was that features with large vocabulary sizes could benefit from having a specific encoder. In this variant, each factor has its own full encoder (instead of just its own embedding layer). The outputs from the encoders are combined and input to the following layer. The rest of the model remains unchanged.

Once we have the outputs of the multiple embedding layers (from the 1-encoder) or the N-encoders, they must be aggregated before being input to the next layer. We have considered two combination strategies.


Concatenation:

The outputs of the different embedding layers or encoders are concatenated along the corresponding dimension.

Summation:

The outputs of the different embedding layers or encoders are summed.

In both cases, the dimensions must agree. The decoder embedding size must be equal to the encoder embedding size. If the outputs from the different encoders or embedding layers are concatenated, they do not need to have the same embedding size, but the resulting embedding size is increased. Instead, if they are summed, they must share the same dimensionality, but the resulting vector size is not increased.
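The dimension constraints of the two combination strategies can be illustrated with a minimal sketch of the 1-encoder input layer. It is written in plain Python rather than in the authors' implementation; the function names (`make_embedding`, `combine`, `embed_factors`), the toy vocabulary sizes, and the embedding size of 8 are assumptions made only for illustration. The positional encoding follows the sinusoidal formulation of Vaswani et al. (2017) and is added once, to the combined vector.

```python
import math
import random


def make_embedding(vocab_size, dim, seed):
    """Randomly initialised embedding table: one vector of size `dim` per id."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(vocab_size)]


def positional_encoding(position, dim):
    """Sinusoidal positional encoding for a single position."""
    return [
        math.sin(position / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(position / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def combine(vectors, strategy):
    """Combine one vector per factor by summation or concatenation."""
    if strategy == "sum":
        if len({len(v) for v in vectors}) != 1:
            raise ValueError("summation requires equal embedding sizes")
        return [sum(values) for values in zip(*vectors)]
    if strategy == "concat":
        return [x for v in vectors for x in v]
    raise ValueError(f"unknown strategy: {strategy}")


def embed_factors(factor_ids, tables, strategy):
    """Embed each factor, combine them, then add the positional encoding once.

    factor_ids: one id sequence per factor, all of the same length.
    tables: one embedding table per factor, in the same order.
    """
    length = len(factor_ids[0])
    if any(len(ids) != length for ids in factor_ids):
        raise ValueError("factors must be aligned token by token")
    output = []
    for pos in range(length):
        vectors = [table[ids[pos]] for table, ids in zip(tables, factor_ids)]
        combined = combine(vectors, strategy)
        pe = positional_encoding(pos, len(combined))
        output.append([c + p for c, p in zip(combined, pe)])
    return output


# Toy data (hypothetical ids): three subwords and their aligned lemma ids.
word_ids = [3, 1, 4]
lemma_ids = [2, 0, 1]
words = make_embedding(vocab_size=5, dim=8, seed=0)
lemmas = make_embedding(vocab_size=3, dim=8, seed=1)

summed = embed_factors([word_ids, lemma_ids], [words, lemmas], "sum")
concatenated = embed_factors([word_ids, lemma_ids], [words, lemmas], "concat")
print(len(summed[0]), len(concatenated[0]))
```

With summation the combined vector keeps the size of a single embedding, whereas concatenation doubles it, so every subsequent layer (and the decoder) would have to be sized accordingly.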

Figure 1: 1-encoder and N-encoders.

4 Linguistic and Semantic Features

An arbitrary number of features can be injected into the Factored Transformer, provided they are aligned with words. In this work, we suggest using linguistic or semantic features, as in the previous works we described above, even though other alternatives could be considered (for instance, domain-specific features for domain adaptation). Below, we describe how they were extracted and aligned at the subword level.

4.1 Feature Extraction

Classical linguistic features:

The corpus was tagged with classical linguistic information, namely lemmas, part-of-speech (PoS), word dependencies and morphological features, using StanfordNLP Qi et al. (2018), and aligned with respect to the original tokenization.

Semantic features:

BabelNet’s API retrieves all possible synsets (semantic identifiers) that a given token may have. Babelfy Moro et al. (2014) is a word sense disambiguation service based on BabelNet that retrieves the disambiguated synset for each token depending on the sentence-level context. We split the corpus into chunks such that the daily usage limits of the API were not exceeded and no sentence was split in half (because otherwise Babelfy would have missed the context). Babelfy returns a list of all the detected synsets with their character offsets, and they must be assigned and aligned to the original tokenization of the corpus. An additional step was needed to resolve multiword synset conflicts: when a synset spans more than one token, Babelfy may retrieve both an individual synset for each token and a collective one. We decided to prioritize the synset with the largest number of tokens, since it seemed to give the most disambiguated information (e.g., the synset "semantic network" gives more specific information than the individual synsets "semantic" and "network"). For the tokens in the corpus that do not have an assigned synset (e.g., articles or punctuation marks), we assign a backup linguistic feature, namely part-of-speech.
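The conflict-resolution rule and the part-of-speech fallback can be sketched as follows. This is only an illustration under stated assumptions: the function name `assign_synsets`, the tuple layout of tokens and annotations, the end-exclusive character offsets, and the placeholder synset identifiers are all hypothetical and do not reproduce the Babelfy response format.

```python
def assign_synsets(tokens, annotations, pos_tags):
    """Assign one feature per token, preferring the synset that spans most tokens.

    tokens: list of (text, start_char, end_char), end exclusive (assumed layout).
    annotations: list of (synset_id, start_char, end_char), end exclusive.
    pos_tags: one part-of-speech tag per token, used as the backup feature.
    """
    features = []
    for (_, tok_start, tok_end), pos in zip(tokens, pos_tags):
        best_id, best_cover = None, 0
        for synset_id, ann_start, ann_end in annotations:
            # Only annotations that fully contain the token are candidates.
            if ann_start <= tok_start and tok_end <= ann_end:
                # Number of tokens covered by this annotation.
                cover = sum(1 for _, s, e in tokens if ann_start <= s and e <= ann_end)
                if cover > best_cover:
                    best_id, best_cover = synset_id, cover
        # Tokens without any synset fall back to their part-of-speech tag.
        features.append(best_id if best_id is not None else pos)
    return features


# "the semantic network": placeholder synset ids, not real BabelNet identifiers.
tokens = [("the", 0, 3), ("semantic", 4, 12), ("network", 13, 20)]
annotations = [
    ("syn:semantic", 4, 12),
    ("syn:network", 13, 20),
    ("syn:semantic_network", 4, 20),
]
pos_tags = ["DET", "ADJ", "NOUN"]

features = assign_synsets(tokens, annotations, pos_tags)
print(features)
```

In this toy sentence both content tokens receive the collective synset, and the article, which has no synset, receives its part-of-speech tag.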

4.2 Feature Alignment at the Subword-level

To obtain state-of-the-art results in NMT, subword segmentation (typically BPE Sennrich et al. (2016)) is usually required. This presents a challenge for word features, since they must remain aligned with the tokens. The following alternatives were implemented and tested: repeating the word features for each subword; using the BPE symbol in word features, in the same manner as this symbol is used in BPE to mark split subwords; and subword tags. This last approach was used in Sennrich and Haddow (2016) and consists of repeating the word features for each subword and introducing a new factor, subword tags, to encode the position of the subword in the original word. The 4 possible tags are: B (first subword of the word), I (intermediate subword), E (last subword of the word) and O (the word was not split). This approach is not compatible with the multi-encoder architecture.
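As an illustration of the first and third alternatives, the sketch below repeats each word-level feature over the subwords of its word and also derives the B/I/E/O subword tags. It assumes that a non-final subword carries a trailing "@@" continuation marker, which is a common BPE convention but an assumption here; the function name `align_features` and the toy segmentation are hypothetical.

```python
def align_features(bpe_tokens, word_features, marker="@@"):
    """Repeat each word-level feature over its subwords and emit B/I/E/O tags.

    bpe_tokens: subword sequence in which every non-final subword of a word
        ends with `marker` (assumed convention).
    word_features: one feature (e.g. a lemma) per original word.
    """
    aligned, tags = [], []
    word_index = 0
    inside_word = False  # True once the current word has emitted a subword
    for token in bpe_tokens:
        if word_index >= len(word_features):
            raise ValueError("more words than features")
        aligned.append(word_features[word_index])
        if token.endswith(marker):
            # The word continues: first piece is B, later pieces are I.
            tags.append("I" if inside_word else "B")
            inside_word = True
        else:
            # The word ends here: E if it was split, O if it was not.
            tags.append("E" if inside_word else "O")
            inside_word = False
            word_index += 1
    if word_index != len(word_features):
        raise ValueError("more features than words")
    return aligned, tags


bpe_tokens = ["the", "trans@@", "la@@", "tions", "work"]
lemmas = ["the", "translation", "work"]

aligned_lemmas, subword_tags = align_features(bpe_tokens, lemmas)
print(aligned_lemmas)
print(subword_tags)
```

The first returned sequence alone corresponds to plain repetition of the word feature; the second one is the extra factor required by the subword-tag strategy.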

5 Experimental Framework and Results


The first experiments were conducted with a pair of similar languages, the German-to-English translation direction of IWSLT14 Cettolo et al. (2014), which is a low-resource dataset (the training set contains about 160,000 sentences). For cleaning and tokenizing, we use the data preparation script proposed by the authors of Fairseq Ott et al. (2019). As test sets, we took the test sets from the corpora released for IWSLT14 and IWSLT16. The former was used to test the best configuration, and the latter was used to measure the improvement of this configuration on another set. A joint BPE (i.e., German and English share subwords) of 32,000 operations is learned from the training data, with a threshold of 50 occurrences for the vocabulary. The second round of experiments was conducted with the English-to-Nepali translation direction of the FLoRes Low Resource MT Benchmark Guzmán et al. (2019). Although this pair has more sentences than the previous one (564,000 parallel sentences), it is considered to be extremely low-resource and far more challenging because of the lack of similarity between the two languages. In this case, we learn a joint BPE of 5000 operations (both with sentencepiece Kudo and Richardson (2018), an algorithm based on BPE, as proposed by the FLoRes authors, and with the original BPE algorithm).

Parameters and Configurations:

In the case of German-to-English, we used the Transformer architecture with the hyperparameters proposed by the Fairseq authors: specifically, 6 layers in the encoder and the decoder, 4 attention heads, embedding sizes of 512 and 1024 for the feedforward expansion size, a dropout of

and a total batch size of 4000 tokens, with a label smoothing of . For English-to-Nepali, we used the baseline proposed by the FLoRes authors: specifically, 5 layers in the encoder and the decoder, 2 attention heads, embedding sizes of 512 and 2048 for the feedforward expansion size and a total batch size of 4000 tokens, with a label smoothing of 0.2. In both cases, we used the Transformer architecture with the corresponding parameters described above as the respective baseline system, and we introduced the modifications of the Factored Transformer without modifying the rest of the architecture and parameters. As mentioned previously, linguistic features were obtained through StanfordNLP Qi et al. (2018). Regarding the BabelNet synsets, we found that approximately 70% of the tokens in the corpus we used did not have an assigned synset and were therefore assigned part-of-speech.

Preliminary experiments:

We experimented with BPE alignment strategies (including the approaches from section 4.2), and different classical linguistic features (lemmas, part-of-speech, word dependencies, morphological features). The preliminary experiments showed that the choice of BPE alignment strategy had little effect, so we adopted the simplest one, repeating the word feature for each subword. In addition, we found that the most promising classical linguistic feature was lemmas, consistent with the results obtained in Sennrich and Haddow (2016).

Reported results:

After the preliminary experiments, we report experiments with features (lemmas and synsets), architectures (1-encoder and N-encoders systems), and combination strategies (concatenation and summation). Table 1 shows the performance of the baseline and of the baseline architecture with lemmas instead of the original words. We report how different features (lemmas or BabelNet) compare for a given architecture. Then, for the best feature, lemmas, Table 1 compares different architectures and shows that the best architecture is the 1-encoder with summation. Finally, the best-performing system (lemmas with a 1-encoder and summation) is evaluated on another test set, IWSLT16. The selected model is relatively efficient, because it only needs an additional embedding layer with respect to the baseline, while the total embedding size does not have to be increased because the embeddings are summed instead of concatenated.

Test set   Model        Comb.*   Feature    BLEU
IWSLT14    Baseline     -        -          34.08
IWSLT14    Lemmas       -        -          29.83
IWSLT14    1-encoder    Sum      Lemmas     34.35
IWSLT14    1-encoder    Sum      Babelnet   33.66
IWSLT14    1-encoder    Concat   Lemmas     27.10
IWSLT14    N-encoders   Concat   Lemmas     33.58
IWSLT14    N-encoders   Sum      Lemmas     9.71
IWSLT16    Baseline     -        -          36.67
IWSLT16    1-encoder    Sum      Lemmas     37.46
FLoRes     Baseline     -        -          3.06
FLoRes     1-encoder    Sum      Lemmas     4.27
Table 1: BLEU results. In bold, best results.

Once we had found that the 1-encoder Factored Transformer with summation and lemmas was a solid configuration for low-resource settings, we applied this combination to the more challenging Facebook Low Resource (FLoRes) MT Benchmark. Specifically, we wanted to compare how this architecture performs against the baseline reported in the original work of this benchmark. The authors report the result before applying backtranslation and with sentencepiece, which is 4.30 BLEU. We reproduced that baseline and obtained slightly better results (up to 4.38 BLEU). However, our system is designed to work with BPE rather than sentencepiece, because sentencepiece output is more challenging to align to features (subwords coming from different words can be combined into a single token). Table 1 shows that our configuration clearly outperformed the baseline with BPE (almost 40% up, from 3.06 to 4.27 BLEU), and was very close to the results with sentencepiece.
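The relative gain quoted above can be checked with a short calculation on the BLEU scores of Table 1. The pairing of scores with test sets (36.67 and 37.46 for IWSLT16, 3.06 and 4.27 for FLoRes with BPE) is inferred from the gains reported in the text, and the helper name `relative_gain` is hypothetical.

```python
def relative_gain(baseline, system):
    """Relative improvement of `system` over `baseline`, in percent."""
    return 100.0 * (system - baseline) / baseline


# (baseline BLEU, Factored Transformer BLEU), as read from Table 1.
scores = {
    "IWSLT16": (36.67, 37.46),
    "FLoRes (BPE)": (3.06, 4.27),
}

for name, (baseline, system) in scores.items():
    absolute = system - baseline
    print(f"{name}: +{absolute:.2f} BLEU, {relative_gain(baseline, system):.1f}% relative")
```

The absolute gains are of comparable size on both test sets, but because the FLoRes baseline is so low, the same order of absolute improvement translates into a much larger relative one.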


The 1-encoder system outperforms the N-encoder architecture. We hypothesize that the N-encoder system does not give good results because a completely disentangled representation for each feature is being learned, and this is not an effective strategy for factored NMT. Therefore, it is better to combine features and words at the embedding level, not at the hidden-state level. In the case of N-encoder with concatenation, the decoder at least can learn to ignore half of the vector (the hidden state coming from the encoder of the linguistic features). In this case, the system would be roughly equivalent to the baseline, provided the decoder learned to ignore half of the vector entirely. In practice, it gives worse results, because there is a considerable amount of noise, but the results are closer to the baseline than in the case of summation. In the case of the N-encoder architecture with sum, since the outputs from different encoders, which are potentially in very different spaces, are summed, it is tough for the decoder to interpret the vectors. If having a different encoder for a given feature causes the encoders to learn completely disentangled representations for words and linguistic features, and this makes the information coming from the feature encoders irrelevant or noisy, the decoder should learn to undo a sum, which is more difficult than just learning to ignore half of the vector (i.e., assigning low weights).

In the case of the 1-encoder architecture, the summation gives a much more compact representation. Summing lemmas implies a simple, linear transformation that allows the decoder layers to keep a dimension of 512 (instead of doubling it, which results in overfitting).

Regarding the reasons why lemmas outperform synsets, we believe that the problem comes from BabelNet. When tagging, a significant proportion of the tokens do not get a synset (as detailed before, in this case we apply a backup linguistic feature, namely PoS). In contrast, we can tag all words with lemmas (even if tagging is not perfect and can give wrong lemmas in some cases). Besides, the use of semantic features (BabelNet) is intended to help with disambiguation, but some recent papers have shown that the Transformer is already good at this task. Lemmas, on the other hand, can help by providing the normalized form of a word that may be very infrequent in the training corpus (while its lemma might be frequent enough).

6 Conclusions

We have shown that the Transformer can take advantage of linguistic features but not semantic ones. We conclude that the best configuration for the Factored Transformer was the 1-encoder model (with multiple embedding layers) with summation instead of concatenation. For the German-to-English IWSLT task, the best configuration for the Factored Transformer shows an improvement of 0.8 BLEU, and for the extremely low-resourced English-to-Nepali task, the improvement is 1.2 BLEU. In future work, we suggest adapting the alignment algorithm to sentencepiece by combining features coming from different words into a single feature, provided their respective subwords have been merged into a single token. In addition, whether the advantage provided by linguistic features still holds once backtranslation has been applied and up to what point this holds should be investigated.


This work is supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund through the postdoctoral senior grant Ramón y Cajal and by the Agencia Estatal de Investigación through the project EUR2019-103819.