Augmenting Neural Machine Translation with Knowledge Graphs

While neural networks have been used extensively to make substantial progress in the machine translation task, they are known for being heavily dependent on the availability of large amounts of training data. Recent efforts have tried to alleviate the data sparsity problem by augmenting the training data using different strategies, such as back-translation. Along with data scarcity, out-of-vocabulary words, mostly entities and terminological expressions, pose a difficult challenge to Neural Machine Translation systems. In this paper, we hypothesize that knowledge graphs enhance the semantic feature extraction of neural models, thus optimizing the translation of entities and terminological expressions in texts and consequently leading to better translation quality. We hence investigate two different strategies for incorporating knowledge graphs into neural models without modifying the neural network architectures. We also examine the effectiveness of our augmentation method on recurrent and non-recurrent (self-attentional) neural architectures. Our knowledge graph augmented neural translation model, dubbed KG-NMT, achieves significant and consistent improvements of +3 BLEU, METEOR and chrF3 on average on the newstest datasets between 2014 and 2018 for the WMT English-German translation task.



1 Introduction

Neural Network (NN) models have shown significant improvements in translation generation and have been widely adopted given their sustained gains over the previous state-of-the-art Phrase-Based Statistical Machine Translation (PBSMT) approaches Koehn et al. (2007). A number of NN architectures have therefore been proposed in the recent past, ranging from recurrent Bahdanau et al. (2014); Sutskever et al. (2014) to self-attentional networks Vaswani et al. (2017). However, a major drawback of Neural Machine Translation (NMT) models is that they need large amounts of training data to return adequate results and have a limited vocabulary size due to their computational complexity Luong and Manning (2016). The data sparsity problem in Machine Translation (MT), which is mostly caused by a lack of training data, manifests itself in particular in the poor translation of rare and out-of-vocabulary (OOV) words, e.g., entities or terminological expressions rarely or never seen in the training phase. Previous work has attempted to deal with the data sparsity problem by introducing character-based models Luong and Manning (2016) or Byte Pair Encoding (BPE) algorithms Sennrich et al. (2016b). Additionally, different strategies were devised for overcoming the lack of training data, such as back-translation Sennrich et al. (2016a).

Despite the significant advancement of previous work in NMT, translating entities and terminological expressions remains a challenge Koehn and Knowles (2017). Entities may be subsumed in two groups, i.e., proper nouns and common nouns. Proper nouns are also known as Named Entities (NEs) and correspond to the names of persons, organizations or locations, e.g., Canada. Common nouns describe classes of objects, e.g., spoon or cancer. Both types of entities are found in a Knowledge Graph (KG), in which they are described within triples Auer et al. (2007); Vrandečić and Krötzsch (2014). Each triple consists of a subject (often an entity), a relation (often called a property) and an object (often an entity or a literal, e.g., a string or a value with a unit). For example, the triple <NAACL, areaServed, North_America> means in natural language that "NAACL takes place in North America". In this paper, we use KG and KB interchangeably. Recent work has successfully exploited the contribution of KGs to improve distinct Natural Language Processing (NLP) tasks such as Natural Language Inference (NLI) Annervaz et al. (2018), Question Answering (QA) Sorokin and Gurevych (2018); Sun et al. (2018) and Machine Reading (MR) Yang and Mitchell (2017). Additionally, the benefits of incorporating type information on entities (e.g., NE-tags such as PERSON, LOCATION or ORGANIZATION) into NMT by relying on Named Entity Recognition (NER) systems have been shown in previous works Ugawa et al. (2018); Li et al. (2018). However, none of these have exploited the combination of Entity Linking (EL) with KGs in NMT systems.

The goal of EL is to disambiguate and link a given NE contained in a text to a corresponding entity (also called a resource) in a reference Knowledge Base (KB) Moussallem et al. (2017). If the reference KB is bilingual, then the links generated by EL can be used to retrieve the translation of entities found in the text. In this work, we aim to use EL to improve the results of NMT approaches. We build upon recent works, which have devised Knowledge Graph Embeddings (KGE) approaches Bordes et al. (2013), i.e., approaches that embed KGs into continuous vector spaces. Since neural models learn translations in a continuous vector space, we hypothesize that a given KG, once converted to embeddings, can be used along with EL to improve NMT models. Our results suggest that with this proposed methodology, we are capable of enhancing the semantic feature extraction of neural models for gathering the correct translation of entities and consequently improving the translation quality of the text.

We devised two strategies to implement the insight stated above. In our first strategy, we began by annotating bilingual training data with a multilingual EL system using a reference KB. Then, we map the entities and relationships contained in the reference KB to a continuous vector space using a KGE technique. Afterwards, we concatenate the KGE to the internal NMT embeddings, thus augmenting the embedding layer of NMT training. Given that EL can be time-consuming when faced with large training corpora, we skip the EL task in our second strategy and instead semantically enrich the KGE by using the referring expressions of entities, also known as labels, to initialize the vector values at the embedding layer. Differently from Venugopalan et al. (2016), we maximize the vector values of entities found in the bilingual corpora with the values of the entities' labels from the KGE. We perform an extensive automatic and manual evaluation in order to analyze our hypothesis. Among others, we examine the effectiveness of our augmentation method when combined with recurrent and non-recurrent (self-attentional) neural architectures, dubbed RNN and Transformer respectively. Our KG-augmented neural translation model, named KG-NMT, achieves significant and consistent improvements of +3 BLEU, METEOR and chrF3 on average on the WMT newstest datasets between 2014 and 2018 for the English-German translation task, using a small set of two million parallel sentences. To the best of our knowledge, no previous work has investigated the augmentation of NMT by using KGs without affecting the NN architecture. Hence, the main contribution of this paper lies in the investigation of two different strategies for integrating KGE into neural translation models to maximize the probability score of the translation of entities. Moreover, we show that we can enhance the translation quality of NMT systems by incorporating KGE into the training phase. Our data and models will be made publicly available.

2 Related Work

NMT Augmentation.

Different methods have been suggested to overcome the limitations of NMT vocabulary size. Luong and Manning (2016) implemented a hybrid solution, which combines word and character models in order to achieve an open vocabulary NMT system. Similarly, Sennrich et al. (2016b) introduced BPE, which is a form of data compression that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. Additionally, the use of monolingual data for data augmentation has gained considerable attention as it does not alter the NN architecture while demonstrating consistent results. Sennrich et al. (2016a) explored two methods using monolingual data during the training of an NMT system. They used dummy source sentences and relied on an automatic back-translation of the monolingual data using different NMT systems. Moreover, past work exploited the use of monolingual data to augment NMT systems in distinct NN architectures. Hoang et al. (2018) presented an iterative back-translation method, which generates increasingly synthetic parallel data from monolingual data while training a given NMT system. Also, Edunov et al. (2018) attempted to understand the effectiveness of back-translation in a large-scale scenario by using different strategies on hundreds of millions of monolingual sentences. Recently, approaches other than back-translation for data augmentation were introduced. For example, Wang et al. (2018) proposed a method of randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies.

External Structured Knowledge in MT.

According to a recent survey Moussallem et al. (2018), the idea of using a structured KB in MT systems started with the work of Knight and Luk (1994). Still, only a few researchers have designed different strategies for benefiting from structured knowledge in MT architectures McCrae and Cimiano (2013); Arcan et al. (2015); Simov et al. (2016). Recently, the idea of using KGs in MT systems has gained renewed attention. Du et al. (2016) created an approach to address the problem of OOV words by using BabelNet Navigli and Ponzetto (2012). Their approach applies different methods of using BabelNet. In sum, they create additional training data and also apply a post-editing technique which replaces the OOV words while querying BabelNet. Shi et al. (2016) have recently built a semantic embedding model reliant upon a specific KB to be used in NMT systems. The model relies on semantic embeddings to encode the key information contained in words so as to translate the meaning of sentences correctly.

Named Entities in NMT.

Only a few works have investigated the NE translation issue in NMT. Some researchers worked on models specific to this problem, while others incorporated external information as features within NMT models. Li et al. (2016) and Wang et al. (2017b) rely on an NER tool to identify and align the NE pairs within the source and target sentences. Afterwards, the NE pairs are replaced with their corresponding NE-tags to train the model. In the translation phase, the targeted NE tags are then substituted with the original entities by a separate NE translation model or a bilingual NE dictionary. Ugawa et al. (2018) used a similar architecture but included one more layer in the encoder to encode the NE-tags expressed as chunk tags at each time step. The disadvantages of the methods above include NE information loss and NE alignment errors. To overcome these problems, Li et al. (2018) relied on an effective and simple method which added the NE-tags as boundary information to the entities directly inserted by an NER tool in the source sentence. It does not require either any separate model or external resource, and it therefore does not affect the NN architecture while achieving good performance.

Knowledge Graph Embeddings.

According to Annervaz et al. (2018), we classify KGE into two categories: (1) Structure-based, which encodes only entities and relations; (2) Semantically-enriched, which takes into account semantic information of entities (e.g., text) along with the entities and their relations. (1) According to Wang et al. (2017a), manifold approaches, where relationships are interpreted as displacements operating on the low-dimensional embeddings of the entities, have been implemented so far, such as TransE Bordes et al. (2013) and TransG Xiao et al. (2015). However, Joulin et al. (2017b) showed recently that a simple Bag-of-Words (BoW) based approach with the fastText algorithm Joulin et al. (2017a) generates surprisingly good KGE while achieving state-of-the-art results. (2) Wang et al. (2014) proposed a technique of learning to embed structured and unstructured data (such as text) jointly in an effort to augment the prediction models. Additionally, Zhong et al. (2015) introduced an alignment of entities and word embeddings considering the descriptions of entities. More work on agglutinating semantics with entities arose, such as SSP Xiao et al. (2017) and DKRL Xie et al. (2016a) as well as TKRL Xie et al. (2016b).

3 The KG-NMT Methodology

KG-NMT is based on the observation that more than 150 billion facts referring to more than 3 billion entities are available in the form of KGs on the Web McCrae et al. (2018). Hence, the intuition behind our methodology is as follows: given that KGs describe real-world entities, we can use KGs along with EL to optimize the vector entries of entities and consequently to achieve a better translation quality of entities in text. In the following, we give an overview of NMT and KGE. Afterwards, we present how we use EL and KGE to augment NMT models. Throughout the description of our methodology and our experiments, we use DBpedia Auer et al. (2007) as the reference KB.

3.1 Background

3.1.1 Neural Machine Translation

We use two different NMT architectures, the Recurrent Neural Network (RNN) and Transformer-based models. Both consist of an encoder and a decoder, i.e., a two-tier architecture where the encoder reads an input sequence $x = (x_1, \dots, x_T)$ and the decoder predicts a target sequence $y = (y_1, \dots, y_{T'})$. The encoder and decoder interact via a soft-attention mechanism Bahdanau et al. (2014); Luong et al. (2015), which comprises one or multiple attention layers. We follow the notation of Tang et al. (2018b) in the subsequent sections: $h_t^l$ corresponds to the hidden state at step $t$ of layer $l$; $h_{t-1}^l$ represents the hidden state at the previous step of layer $l$, while $h_t^{l-1}$ denotes the hidden state at step $t$ of layer $l-1$. $E$ is a word embedding matrix and $W$ denotes weight matrices, with $d$ being the word embedding size and $n$ the number of hidden units. $|V|$ is the vocabulary size of the source language. Thus, $E[x_t]$ refers to the embedding of $x_t$, and $PE_t$ indicates the positional embedding at position $t$.

RNN-based NMT.

In RNN models, networks change as new inputs (the previous hidden state and the current input token) come in, and each state is directly connected to the previous state only. Therefore, the path length between any two tokens at a distance of $n$ in RNNs is exactly $n$. The architecture enables adding more layers, whereby two adjoining layers are usually connected with residual connections in deeper configurations. Equation 1 displays the recurrence

$h_t^l = f(h_t^{l-1}, h_{t-1}^l) \quad (1)$

where $f$ is usually a function based on Gated Recurrent Units (GRU) Cho et al. (2014) or Long Short-Term Memories (LSTM) Hochreiter and Schmidhuber (1997). The first layer is then represented as $h_t^1 = f(E[x_t], h_{t-1}^1)$. Additionally, the initial state of the decoder is commonly initialized with the average of the hidden states or the last hidden state of the encoder.
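To make the recurrence in Equation 1 concrete, the following is a minimal numpy sketch with toy dimensions; a plain tanh cell stands in for the GRU/LSTM-based $f$, and the weight matrices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (toy value)

# Toy recurrent cell standing in for f (a GRU or LSTM in practice):
# it combines the lower-layer state h_t^{l-1} with the previous step h_{t-1}^l.
def rnn_cell(h_below, h_prev, W_in, W_rec):
    return np.tanh(h_below @ W_in + h_prev @ W_rec)

W_in = rng.normal(size=(d, d)) * 0.1
W_rec = rng.normal(size=(d, d)) * 0.1

# Layer 1 consumes word embeddings E[x_t]; each state depends only on step t-1,
# so the path between tokens at distance n is exactly n cell applications.
embeddings = rng.normal(size=(5, d))  # stands in for E[x_1..x_5]
h = np.zeros(d)                       # initial hidden state
states = []
for e in embeddings:
    h = rnn_cell(e, h, W_in, W_rec)
    states.append(h)
```

The sequential dependency visible in the loop is exactly what limits how directly distant tokens can interact in RNN encoders.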

Transformer-based NMT.

Transformer models rely entirely on self-attention networks. Each token is connected to any other token in the same sentence directly via self-attention; thus, the path length between any two tokens is $1$. Additionally, these models rely on multi-head attention to featurize attention networks, which are more complex in comparison to the $1$-head attention mechanisms used in RNNs. In contrast to RNNs, the positional information is also preserved in positional embeddings. Equation 2 represents the hidden state $h_t^l$, which is calculated from all hidden states of the previous layer:

$h_t^l = \mathrm{FFN}(\mathrm{SelfAtt}(h^{l-1})_t) \quad (2)$

$\mathrm{FFN}$ represents a feed-forward network with the rectified linear unit (ReLU) as the activation function and layer normalization. The first layer is fed with $E[x_t] + PE_t$. Moreover, the decoder has a multi-head attention over the encoder hidden states.
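The direct token-to-token connectivity can be illustrated with a single scaled dot-product attention head in numpy (toy sizes, randomly initialized projection matrices standing in for trained weights):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 5, 8  # sentence length and model size (toy values)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One attention head: every token attends to every other token directly,
# so the path length between any two positions is 1.
X = rng.normal(size=(T, d))           # stands in for E[x_t] + PE_t per position
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
attn = softmax(Q @ K.T / np.sqrt(d))  # (T, T): all pairwise connections at once
H = attn @ V                          # each h_t mixes all previous-layer states
```

The (T, T) attention matrix is the key contrast with the RNN recurrence: every position reads from every other position in a single layer.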


3.1.2 Knowledge Graph Embeddings

The underlying concept of KGE is that, in a given KB, each subject or object entity can be associated with a point in a continuous vector space, whereby its relations can be modelled as displacement vectors while preserving the inherent structure of the KG. In the methodology introduced by Joulin et al. (2017b), named fastText, the model is based on a BoW representation which considers the subject and object entities along with their relation as unique discrete tokens. Thus, fastText models the co-occurrences of entities and their relations with a linear classifier and standard cost functions. Hence, it theoretically allows creating either a Structure-based or a Semantically-enriched KGE. Therefore, we use fastText models in our experiments, represented by Equation 3:

$-\frac{1}{N}\sum_{n=1}^{N} y_n \log(f(BAx_n)) \quad (3)$

The normalized BoW of the $n$-th input set is represented as $x_n$, with $y_n$ as the label. $A$ is a matrix which is used as a look-up table over the discrete tokens, and the matrix $B$ is used for the classifier. The representations of the discrete tokens are averaged into a BoW representation, which is in turn fed to the linear classifier. $f$ (the softmax function) is used to compute the probability distribution over the classes, and $N$ is the number of input sets of discrete tokens. We denote the generated KGE as $K$.
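A minimal numpy sketch of the forward pass of this classifier, with toy sizes and random matrices in place of trained $A$ and $B$:

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, dim, n_classes = 10, 6, 3  # toy sizes

A = rng.normal(size=(vocab, dim))       # look-up table over discrete tokens
B = rng.normal(size=(dim, n_classes))   # linear classifier weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# An "input set" x_n is e.g. {subject, relation} treated as discrete tokens;
# their rows of A are averaged into one BoW vector, then classified by B.
x_n = [2, 7]                            # toy token ids for one input set
bow = A[x_n].mean(axis=0)               # normalized BoW representation
probs = softmax(bow @ B)                # f(B A x_n): distribution over classes
```

Training would then minimize the negative log-likelihood of Equation 3 over all $N$ input sets; only the forward computation is shown here.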

3.2 Methodology

Recent work has successfully devised strategies for incorporating different kinds of knowledge into NMT models, such as linguistic features Sennrich and Haddow (2016) and NE-tags Gu et al. (2016). Differently, but inspired by the above-mentioned approaches, instead of training a given NMT model on a large amount of parallel data with the aim of improving the translation of entities, our idea relies on EL solutions, which can disambiguate the entities found in a text to translate by mapping them to a corresponding node in a given reference KG. In turn, the KG can support the learning of a neural translation model through its graph structure. Figure 1 depicts the general idea of our methodology. Formally, we substantiate our methodology on the two following definitions of EL and KB. We assume that all mentions can be linked to entities in the KB.

Figure 1: Overview of the KG-NMT methodology.
Definition 1

Entity Linking: Let $E$ be a set of entities from a KB and $d$ be a document containing potential mentions of entities $m = \{m_1, \dots, m_k\}$. The goal of an EL system is to generate an assignment $\mu$ of mentions to entities, with $\mu(m_i) \in E$, for the document $d$.

Definition 2

Knowledge Base: We define a KB as a directed graph $G = (V, E)$ where the nodes $V$ are the resources of the KB, the edges $E$ are its properties, and each triple $(s, p, o)$ corresponds to a $p$-labelled edge from $s$ to $o$.
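As a toy illustration of the assignment $\mu$, here is a dictionary-based linker sketch; the surface forms and URIs are hypothetical examples, and a real multilingual EL system would additionally disambiguate between candidates using context:

```python
# Hypothetical surface-form index into the reference KB (DBpedia-style URIs).
surface_forms = {
    "kiwi": ["dbr_Kiwi", "dbr_Kiwi_(people)"],
    "cancer": ["dbr_Cancer"],
}

def link(mentions):
    """mu: m -> E. Assign each mention its first candidate entity,
    or None when no KB entity matches (no context disambiguation here)."""
    return {m: surface_forms.get(m.lower(), [None])[0] for m in mentions}

assignment = link(["Kiwi", "cancer", "Leipzig"])
```

In practice the assignment is produced by the multilingual EL system described in Section 4 rather than by a lookup table.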

We devised two strategies to instantiate our methodology. In the first training strategy, we link the NEs in the source and target texts to a reference KB using a given multilingual EL system. We then incorporate the Uniform Resource Identifiers (URIs) of entities along with the tokens, akin to the way Li et al. (2018) incorporate NE-tags. For example, the word Kiwi can be annotated with Kiwi|dbr_Kiwi or Kiwi|dbr_Kiwi_(people), depending on the context. Similarly, the word cancer can be annotated with cancer|dbr_Cancer, and its translation can be found in the German part of the DBpedia KB (dbr_Krebs_(Medizin)). After incorporating the URIs, we embed the reference KB, DBpedia, using the fastText KGE algorithm. Once the KGE embeddings are created, we concatenate their vectors to the internal vectors of the NMT embeddings. The concatenation is possible as the annotations, i.e., URIs, are present in the texts and consequently in the vocabulary Speer and Lowry-Duda (2017). Formally, let the tokens from the source and target text be elements of a fixed vocabulary $V$ used to train a given NMT model, while the assignments $\mu(m_i)$ are nodes within the KB $G$. The embeddings of $\mu(m_i)$ can be concatenated with the internal embeddings of the NMT model using a function $\gamma$, thus resulting in a new vector $v$. With this modification, the first layer of an RNN becomes $h_t^1 = f(\gamma(E[x_t], K[\mu(x_t)]), h_{t-1}^1)$.
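The concatenation function $\gamma$ of the first strategy can be sketched as follows, with hypothetical toy embeddings in place of the trained NMT and KGE lookup tables:

```python
import numpy as np

rng = np.random.default_rng(3)
d_nmt, d_kg = 6, 4  # toy embedding sizes

# Hypothetical lookups; in the full pipeline both come from trained models.
nmt_emb = {"Kiwi": rng.normal(size=d_nmt)}   # internal NMT embeddings
kge = {"dbr_Kiwi": rng.normal(size=d_kg)}    # KGE vectors keyed by URI

def gamma(token, uri):
    """Concatenate the internal NMT embedding with the KGE vector of the
    entity the token was linked to (zeros when the token is unannotated)."""
    kg_vec = kge.get(uri, np.zeros(d_kg))
    return np.concatenate([nmt_emb[token], kg_vec])

v = gamma("Kiwi", "dbr_Kiwi")  # annotated token Kiwi|dbr_Kiwi
v_plain = gamma("Kiwi", None)  # same token without an EL annotation
```

Zero-padding unannotated tokens is one simple way to keep the augmented embedding size uniform across the vocabulary; other choices are possible.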

Although incorporating EL as a feature into NMT is interesting by itself, the annotation of entities in the training set and the post-editing can be resource-intensive. Additionally, one limitation of Structure-based KGEs is that they can only work with word-based models, since it is not possible to apply a segmentation model to entities and relations: segmentation may force the algorithm to assign wrong vectors to the entities. For example, the entities dbr:Leipzig and dbr:Leibniz can be similar when considering sub-word units; however, the first is a location while the second is a person, so they should not be regarded as similar. To overcome both limitations, we devised our second strategy, which uses only Semantically-enriched KGEs and skips the EL part. Here, we enrich the Structure-based KGE with referring expressions of the entities found in the KB, thus decreasing the annotation effort. To generate the Semantically-enriched KGE, we rely on a classifier in a supervised training setting implemented in fastText, which assigns a label to a given entity. For example, we add to the triple <NAACL, areaServed, USA> the following information: <NAACL, label, North American Chapter of the Association for Computational Linguistics>. More than one label can be assigned to an entity. Enriching the KGE allows us to use the vectors to initialize the weights of the embedding layer of the NMT models, similarly to Neishi et al. (2017), who used pre-trained monolingual embeddings. Furthermore, it also enables applying segmentation to the labels, which allows working with BPE models. Commonly, the initialization of the embedding layer is a function which assigns random values to the weight matrix $E$, whereas in our second strategy, the values from the KGE matrix $K$ are used to assign constant values to the matrix $E$ using a function $\phi$.
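The initialization function $\phi$ of the second strategy amounts to overwriting matching rows of the embedding matrix, as in this sketch (toy vocabulary and hypothetical KGE vectors):

```python
import numpy as np

rng = np.random.default_rng(4)
vocab = ["the", "NAACL", "cancer"]
d = 6  # toy embedding size

# Semantically-enriched KGE vectors keyed by entity label (hypothetical data).
sem_kge = {"NAACL": rng.normal(size=d), "cancer": rng.normal(size=d)}

# phi: start from the usual random initialization of E, then overwrite rows
# whose surface form matches an entity label with the (constant) KGE values.
E = rng.normal(scale=0.1, size=(len(vocab), d))
for i, word in enumerate(vocab):
    if word in sem_kge:
        E[i] = sem_kge[word]
```

Words without a matching label keep their random initialization, so the rest of training proceeds unchanged; only the entity rows start from the KGE.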

4 Experimental Setup

In our experiments, we used the multilingual EL system introduced by Moussallem et al. (2017), which is language- and KB-agnostic. It does not require any training and has still shown competitive results according to the benchmarking platform GERBIL Usbeck et al. (2015). Different NN architectures are complex to compare as they are susceptible to hyper-parameters. Therefore, the idea was to use a minimal reasonable configuration in order to allow a fair analysis of the real KG contributions. For our overall experiments, the RNN-based models use a bi-directional 2-layer LSTM encoder-decoder model with attention Bahdanau et al. (2014). The training uses a batch size of 32 and stochastic gradient descent with an initial learning rate of 0.0002. We set the word embedding size to 500, the hidden layer size to 500, and dropout to 0.3 (naive). We use a maximum sentence length of 80, a vocabulary of 50 thousand words and a beam size of 5. All experiments were performed with OpenNMT Klein et al. (2017). In addition, we encoded words using BPE Sennrich et al. (2016b) with 32,000 merge operations to achieve an open vocabulary. OpenNMT enables substituting OOV words with the target words that have the highest attention weight according to their source words Luong et al. (2015); when such words are not found, it uses a copy mechanism which copies the source words to the position of the not-found target word Gu et al. (2016). Thus, we used all the options mentioned above to evaluate the translation quality. We trained the KGEs with a vector dimension of 500 and a window size of 50, using 12 threads and hierarchical softmax. In addition, for the Semantically-enriched KGE we added the labels, for which we use sub-word units with a minimum length of 2 and a maximum length of 5. To compare both KGE types, we dub the KG-NMT approach that relies on EL and Structure-based KGE as KG-NMT (EL+KGE). The version with semantic information is named KG-NMT (SemKGE).

For training, we attempted to be as generic as possible. Thus, our training set consists of a merge of the initial one-third of JRC-Acquis 3.0 Steinberger et al. (2006), Europarl Koehn (2005) and OpenSubtitles2013 Tiedemann (2012), obtaining a parallel training corpus of two million sentences, containing around 38M running words. We used the English and German versions of DBpedia as our reference KG. The English KB contains 4.2 million entities, 661 relations, and 2.1 million labels, while the German version has 1 million entities, 249 relations, and 0.5 million labels. As the measurement of translation quality is inherently subjective, we used three automatic MT metrics to ensure a consistent and clear evaluation. Besides BLEU Papineni et al. (2002), we use METEOR Banerjee and Lavie (2005) and chrF3 Popović (2017) on the newstest datasets between 2014 and 2018 for testing the models. Moreover, we carried out a manual analysis of the outputs to verify the contribution of the KGE, and we investigated the use of KGE in other settings.

Models newstest2014 newstest2015 newstest2016 newstest2017 newstest2018
Bleu Met chrF3 Bleu Met chrF3 Bleu Met chrF3 Bleu Met chrF3 Bleu Met chrF3
Word RNN baseline 14.47 33.52 40.03 16.77 35.20 41.11 18.55 36.62 42.54 15.1 33.75 39.52 20.53 39.02 43.92
KG-NMT (EL+KGE) 17.19 36.61 42.14 19.86 38.25 42.92 22.38 40.40 45.18 18.04 36.94 41.55 24.87 43.49 46.88
KG-NMT (SemKGE) 18.58 38.42 43.55 21.49 40.19 44.72 24.01 42.47 46.84 19.66 38.89 43.11 27.02 45.77 48.70
CopyM RNN baseline 16.75 37.16 44.93 19.63 39.20 46.38 21.37 40.90 47.85 17.88 37.89 44.85 24.22 43.96 50.15
KG-NMT (EL+KGE) 19.53 39.88 47.18 22.46 41.67 48.28 25.05 44.23 50.66 20.77 40.58 47.04 28.44 47.86 53.25
KG-NMT (SemKGE) 20.97 41.55 48.39 24.08 43.43 49.72 26.70 46.08 52.05 22.30 42.37 48.36 30.55 49.92 54.71
BPE32 RNN baseline 16.33 38.93 49.82 15.89 36.51 45.97 21.95 42.88 52.68 16.8 39.12 49.35 23.85 45.85 54.98
KG-NMT (SemKGE) 19.03 39.82 49.64 21.74 41.41 50.04 24.86 44.32 52.59 20.45 40.62 49.45 28.02 47.51 55.16
Table 1: Results of RNN models in BLEU (Bleu), METEOR (Met) and chrF3 on WMT newstest datasets. Word = word-based models, CopyM = copy mechanism, BPE32 = BPE models.
RNN vs Transformer.

Previous work has compared NN architectures on a variety of NLP tasks Yin et al. (2017); Linzen et al. (2016); Bernardy and Lappin (2017). However, few have investigated RNN and Transformer architectures on the translation task. Recently, Tran et al. (2018) concluded that RNNs perform better than Transformers on a subject-verb agreement task, while Tang et al. (2018a) found that Transformer models surpass RNN models only in high-resource conditions. Lastly, Tang et al. (2018b) compared RNNs and Transformers on subject-verb agreement and Word Sense Disambiguation (WSD) by scoring contrastive translation pairs. Their findings show that Transformer models outperform RNNs on the WSD task, indicating that they are better at extracting semantic features. In this sense, we decided to perform a comparison between both architectures in order to analyze our hypothesis with KGs. To build a Transformer-based KG-NMT model, we followed the specifications found in Vaswani et al. (2017), which use a 6-layer encoder-decoder, a batch size of 4076, 8 heads, and word embeddings and hidden layers of size 512. The Adam optimizer with a learning rate of 2 and a dropout of 0.1 was used. We used the same values for sentence length, beam size, and BPE.

Monolingual Embeddings vs. KGE.

Here, we aim to compare the performance of an NMT model using pre-trained monolingual embeddings against one using the Semantically-enriched KGE, as both can be used to initialize the internal vector values of an NMT model. Our focus is to analyze whether the KGE, with fewer words and vectors, can perform better than the monolingual embeddings for addressing the translation of entities and terminologies. We used the pre-trained monolingual embeddings from Bojanowski et al. (2017) for English (9.2 billion words) and from Grave et al. (2018) for German (1.3 billion words).

5 Results

Overall results.

Table 1 depicts the results of KG-NMT using the RNN architecture on the newstest datasets between 2014 and 2018. Using KGE leads to a clear improvement over the baseline, as it significantly improved the translation quality in terms of the BLEU (+3), METEOR (+4) and chrF3 (+3) metrics. KG-NMT (SemKGE) outperformed KG-NMT (EL+KGE) by around +1.3 in BLEU and chrF3, while we observe a +2 point improvement for METEOR. This difference between the contributions of the KGE types is directly related to the EL performance, which did not manage to annotate all kinds of entities present in the KG. The BPE models also presented consistent improvements, showing that segmentation on the labels of the KG-NMT (SemKGE) model worked. Moreover, the use of the copy mechanism along with KGE achieved the best results, as expected, since some entities which were not found in the KG, e.g., less famous persons, were copied from their source words and correctly translated. For example, the entity Chad Johnston appears in line 1487 of the newstest2015 dataset; this name was not found in the KB as an entity, yet it was translated correctly.

A detailed study of our results showed that the number of OOV words decreased considerably with the augmentation through KGE. Table 2 shows the number of OOV words generated by the RNN models across all WMT newstest datasets. These statistics cannot ensure that every OOV word that became a known word was in fact an entity present in the KG. Thus, we chose newstest2015 for a manual analysis. First, we leveraged the METEOR scores to identify sentences with a large number of OOV words. We observed that many OOV words were in fact entities contained in the KG. As an example (line 1265), UK was not translated by the RNN baseline, even when using the copy mechanism (UK) or BPE (Britische). However, it was correctly translated into German as Großbritannien by both KGE models. Similarly, the entity Coastguard (line 1540) was not translated correctly by the baseline models, whereas both KGE models were able to translate it into Küstenwache. However, we observed translation mistakes regarding gender information in German. For example, KG-NMT (EL+KGE) translated the word principal (line 438) correctly into Direktor, but with the feminine gender (die Direktorin). An interesting observation regarding the use of EL is that some entities which were not annotated in the source text were correctly annotated with a German URI in the translated text. This human evaluation suggests that the KG-augmented RNN models were able to correctly learn the translation of entities through the relations found in the KGE.

Model 2014 2015 2016 2017 2018
RNN baseline 8,367 6,004 9,559 9,707 9,383
+ Monolingual 5,438 3,832 5,669 5,624 6,055
KG-NMT (EL+KGE) 6,109 4,427 6,524 6,603 6,914
KG-NMT (SemKGE) 5,563 4,067 5,990 6,130 6,236
Table 2: Statistics of OOV words with RNN on newstest between 2014 and 2018.
Comparison of RNN vs Transformer.

Table 3 shows that KGs improved the RNN models substantially while decreasing the performance of Transformer models on newstest2014-2018. The Transformer baseline word-based model outperformed the RNN baseline word-based model across all test sets. However, once augmented with KGs, the RNNs surpassed the Transformer word-based models in BLEU. For the sake of space, we only display BLEU results, but we also measured METEOR and chrF3. While analyzing the translations manually on newstest2015, we revisited the same examples as above. In line 1265, the Transformer baseline was capable of translating UK to Vereinigtes Königreich. Also, in line 438, the Transformer baseline model translated the word Principal correctly to Direktor with the correct male gender. In line 1540, the entity Coastguard was not translated by any Transformer-based model. Our manual evaluation showed that the Transformer models ignored the translations present in the KG. This led us to believe that the activation function played a key role in improving the semantic feature extraction. We concluded that the ReLU in the Transformer was not capable of learning the KGE values along with the word embeddings, while the LSTM in the RNN was. Moreover, our findings support the results of recent studies comparing both architectures Tang et al. (2018b, c).

Models 2014 2015 2016 2017 2018
Word-based models
RNN baseline 16.75 19.63 21.37 17.88 24.22
Transformer baseline 19.88 22.44 24.12 20.63 27.70
RNN + EL+KGE 19.53 22.46 25.05 20.77 28.44
Transformer + EL+KGE 18.79 21.00 22.83 19.20 26.43
RNN + SemKGE 20.97 24.08 26.70 22.30 30.55
Transformer + SemKGE 19.10 21.31 23.22 19.90 26.84
BPE32 models
RNN baseline 16.33 15.89 21.95 16.80 23.85
Transformer baseline 21.76 24.58 26.43 22.65 30.78
RNN+SemKGE 19.03 21.74 24.86 20.45 28.02
Transformer+SemKGE 19.82 22.38 24.25 21.05 28.01
Table 3: Comparison between Transformer and RNN models in BLEU on various WMT newstest datasets.
Monolingual Embeddings vs. KGE.

Table 4 reports no significant difference between monolingual embeddings and KGE in terms of BLEU, METEOR and chrF3 (due to space limitations, we only display 2017 and 2018). At first glance, this finding is surprising, since the monolingual embeddings contain billions of words, compared to the 4.2 million entities in the DBpedia KG. However, our manual analysis showed that the OOV words addressed by the monolingual embeddings were in fact not entities but common words, and the entities remained unknown. For example, the RNN+MonoE model translated the entity Principal incorrectly into Wichtigste, while KG-NMT (SemKGE) used the knowledge documented in the KG to translate the entity correctly. Moreover, RNN+MonoE was not able to translate the entities UK and Coastguard. Therefore, we envisage that a combination of both is promising and may lead to better results.

Models newstest2017 newstest2018
Bleu Met chrF3 Bleu Met chrF3
Word-based models
RNN+MonoE 20.05 39.42 43.90 27.15 46.13 49.35
KG-NMT (SemKGE) 19.66 38.89 43.11 27.02 45.77 48.70
RNN+MonoE 22.61 42.87 49.01 30.77 50.39 55.41
KG-NMT (SemKGE) 22.30 42.37 48.36 30.55 49.92 54.71
RNN+MonoE 20.93 41.41 50.33 28.42 48.00 55.98
KG-NMT (SemKGE) 20.45 40.62 49.45 28.02 47.51 55.16
Table 4: Comparison between pre-trained monolingual embeddings and KGE (Met = METEOR).
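The combination we envisage could, at its simplest, back off between the two embedding spaces at lookup time: the KGE space covers entities such as Coastguard, the monolingual space covers common words. The following sketch is purely our own illustration under assumed names and dimensions; it is not the paper's method:

```python
import numpy as np

def combined_vector(token, kge_vectors, mono_vectors, dim=300):
    """Look up a token in the KGE space first (entities), then fall back
    to the monolingual space (common words); unknown tokens get a zero
    vector. Both embedding tables map token -> np.ndarray of size dim."""
    if token in kge_vectors:
        return kge_vectors[token]
    if token in mono_vectors:
        return mono_vectors[token]
    return np.zeros(dim)

# Toy tables standing in for DBpedia KGE and fastText-style vectors.
kge = {"Coastguard": np.ones(300)}
mono = {"principal": np.full(300, 0.5)}
```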

6 Conclusion

In this paper, we introduced an augmentation methodology which relies on KGs to improve the performance of NMT systems. We devised two strategies for incorporating KG embeddings into NMT models which work on both word- and character-based models. Additionally, we carried out an extensive evaluation, including a manual analysis, which showed consistent improvements provided by KGs in NMT. The overall methodology can be applied to any NMT model, since it does not modify the main NMT model structure and also allows plugging in different EL systems.
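Since the methodology leaves the NMT architecture untouched, the augmentation effectively amounts to seeding (part of) the embedding layer with pretrained KGE vectors before training. A hedged sketch in plain NumPy, with matrix names, dimensions, and the random initializer being our assumptions rather than the paper's implementation:

```python
import numpy as np

def init_embedding_matrix(vocab, kge_vectors, dim, seed=0):
    """Build an embedding matrix for an NMT model: rows for tokens that
    appear in the knowledge-graph embeddings are copied from the
    pretrained KGE vectors; all other rows are randomly initialized
    and learned from scratch during training."""
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim))
    for idx, token in enumerate(vocab):
        if token in kge_vectors:
            matrix[idx] = kge_vectors[token]
    return matrix

# Toy vocabulary: one entity covered by the KGE table, two common words.
vocab = ["the", "Coastguard", "house"]
kge = {"Coastguard": np.ones(4)}
emb = init_embedding_matrix(vocab, kge, dim=4)
```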


  • Annervaz et al. (2018) Annervaz, Somnath Basu Roy Chowdhury, and Ambedkar Dukkipati. 2018. Learning beyond datasets: Knowledge graph augmented neural networks for natural language processing. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 313–322. Association for Computational Linguistics.
  • Arcan et al. (2015) Mihael Arcan, Marco Turchi, and Paul Buitelaar. 2015. Knowledge Portability with Semantic Expansion of Ontology Labels. In ACL (1), pages 708–718.
  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, volume 29, pages 65–72.
  • Bernardy and Lappin (2017) Jean-Philippe Bernardy and Shalom Lappin. 2017. Using deep neural networks to learn syntactic agreement. LiLT (Linguistic Issues in Language Technology), 15.
  • Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association of Computational Linguistics, 5(1):135–146.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • Du et al. (2016) Jinhua Du, Andy Way, and Andrzej Zydron. 2016. Using babelnet to improve OOV coverage in SMT. In Proceedings of the Tenth International Conference on Language Resources and Evaluation LREC 2016, Portorož, Slovenia, May 23-28, 2016., pages 9–15.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
  • Grave et al. (2018) Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1631–1640.
  • Hoang et al. (2018) Vu Cong Duy Hoang, Philipp Koehn, Gholamreza Haffari, and Trevor Cohn. 2018. Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 18–24.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Joulin et al. (2017a) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017a. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 427–431.
  • Joulin et al. (2017b) Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel, and Tomas Mikolov. 2017b. Fast linear model for knowledge graph embeddings. arXiv preprint arXiv:1710.10881.
  • Klein et al. (2017) G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. 2017. OpenNMT: Open-Source Toolkit for Neural Machine Translation. ArXiv e-prints.
  • Knight and Luk (1994) Kevin Knight and Steve K Luk. 1994. Building a large-scale knowledge base for machine translation. In AAAI, volume 94, pages 773–778.
  • Koehn (2005) Philipp Koehn. 2005. Europarl: A parallel corpus for statistical machine translation. In MT summit, volume 5, pages 79–86.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, et al. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions, pages 177–180. Association for Computational Linguistics.
  • Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six Challenges for Neural Machine Translation. arXiv preprint arXiv:1706.03872.
  • Li et al. (2016) Xiaoqing Li, Jiajun Zhang, and Chengqing Zong. 2016. Neural name translation improves neural machine translation. arXiv preprint arXiv:1607.01856.
  • Li et al. (2018) Zhongwei Li, Xuancong Wang, AiTi Aw, Eng Siong Chng, and Haizhou Li. 2018. Named-entity tagging and domain adaptation for better customized translation. In Proceedings of the Seventh Named Entities Workshop, pages 41–46. Association for Computational Linguistics.
  • Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of lstms to learn syntax-sensitive dependencies. arXiv preprint arXiv:1611.01368.
  • Luong and Manning (2016) Minh-Thang Luong and Christopher D. Manning. 2016. Achieving open vocabulary neural machine translation with hybrid word-character models. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1054–1063. Association for Computational Linguistics.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
  • McCrae et al. (2018) John P. McCrae, Andrejs Abele, Paul Buitelaar, Richard Cyganiak, Anja Jentzsch, and Vladimir Andryushechkin. 2018. The Linked Open Data Cloud.
  • McCrae and Cimiano (2013) John Philip McCrae and Philipp Cimiano. 2013. Mining translations from the web of open linked data. In Joint Workshop on NLP&LOD and SWAIE: Semantic Web, Linked Open Data and Information Extraction, page 8.
  • Moussallem et al. (2017) Diego Moussallem, Ricardo Usbeck, Michael Röder, and Axel-Cyrille Ngonga Ngomo. 2017. MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach. In Proceedings of the Knowledge Capture Conference, page 9. ACM.
  • Moussallem et al. (2018) Diego Moussallem, Matthias Wauer, and Axel-Cyrille Ngonga Ngomo. 2018. Machine translation using semantic web technologies: A survey. Journal of Web Semantics, 51:1–19.
  • Navigli and Ponzetto (2012) Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
  • Neishi et al. (2017) Masato Neishi, Jin Sakuma, Satoshi Tohda, Shonosuke Ishiwatari, Naoki Yoshinaga, and Masashi Toyoda. 2017. A bag of useful tricks for practical neural machine translation: Embedding layer initialization and large batch size. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), pages 99–109.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Popović (2017) Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618.
  • Sennrich and Haddow (2016) Rico Sennrich and Barry Haddow. 2016. Linguistic input features improve neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, volume 1, pages 83–91.
  • Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 86–96.
  • Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
  • Shi et al. (2016) Chen Shi, Shujie Liu, Shuo Ren, Shi Feng, Mu Li, Ming Zhou, Xu Sun, and Houfeng Wang. 2016. Knowledge-Based Semantic Embedding for Machine Translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2245–2254.
  • Simov et al. (2016) Kiril Simov, Petya Osenova, and Alex Popov. 2016. Towards Semantic-based Hybrid Machine Translation between Bulgarian and English. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
  • Sorokin and Gurevych (2018) Daniil Sorokin and Iryna Gurevych. 2018. Modeling semantics with gated graph neural networks for knowledge base question answering. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3306–3317. Association for Computational Linguistics.
  • Speer and Lowry-Duda (2017) Robert Speer and Joanna Lowry-Duda. 2017. Conceptnet at semeval-2017 task 2: Extending word embeddings with multilingual relational knowledge. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 85–89.
  • Steinberger et al. (2006) Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaz Erjavec, Dan Tufis, and Dániel Varga. 2006. The jrc-acquis: A multilingual aligned parallel corpus with 20+ languages. arXiv preprint cs/0609058.
  • Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William W Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. EMNLP.
  • Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Tang et al. (2018a) Gongbo Tang, Fabienne Cap, Eva Pettersson, and Joakim Nivre. 2018a. An evaluation of neural machine translation models on historical spelling normalization. arXiv preprint arXiv:1806.05210.
  • Tang et al. (2018b) Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018b. Why self-attention? a targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272.
  • Tang et al. (2018c) Gongbo Tang, Rico Sennrich, and Joakim Nivre. 2018c. An analysis of attention mechanisms: The case of word sense disambiguation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 26–35.
  • Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  • Tran et al. (2018) Ke Tran, Arianna Bisazza, and Christof Monz. 2018. The importance of being recurrent for modeling hierarchical structure. arXiv preprint arXiv:1803.03585.
  • Ugawa et al. (2018) Arata Ugawa, Akihiro Tamura, Takashi Ninomiya, Hiroya Takamura, and Manabu Okumura. 2018. Neural machine translation incorporating named entity. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3240–3250.
  • Usbeck et al. (2015) Ricardo Usbeck, Michael Röder, Axel-Cyrille Ngonga Ngomo, Ciro Baron, Andreas Both, Martin Brümmer, Diego Ceccarelli, Marco Cornolti, Didier Cherix, Bernd Eickmann, Paolo Ferragina, Christiane Lemke, Andrea Moro, Roberto Navigli, Francesco Piccinno, Giuseppe Rizzo, Harald Sack, René Speck, Raphaël Troncy, Jörg Waitelonis, and Lars Wesemann. 2015. GERBIL: general entity annotator benchmarking framework. In Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1133–1143, Florence, Italy.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Venugopalan et al. (2016) Subhashini Venugopalan, Lisa Anne Hendricks, Raymond Mooney, and Kate Saenko. 2016. Improving lstm-based video description with linguistic knowledge mined from text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1961–1966.
  • Vrandečić and Krötzsch (2014) Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78–85.
  • Wang et al. (2017a) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017a. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743.
  • Wang et al. (2018) Xinyi Wang, Hieu Pham, Zihang Dai, and Graham Neubig. 2018. Switchout: an efficient data augmentation algorithm for neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 856–861.
  • Wang et al. (2017b) Yuguang Wang, Shanbo Cheng, Liyang Jiang, Jiajun Yang, Wei Chen, Muze Li, Lin Shi, Yanfeng Wang, and Hongtao Yang. 2017b. Sogou neural machine translation systems for wmt17. In Proceedings of the Second Conference on Machine Translation, pages 410–415.
  • Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph and text jointly embedding. In EMNLP, pages 1591–1601. Citeseer.
  • Xiao et al. (2015) Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. 2015. Transg: A generative mixture model for knowledge graph embedding. arXiv preprint arXiv:1509.05488.
  • Xiao et al. (2017) Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan Zhu. 2017. Ssp: Semantic space projection for knowledge graph embedding with text descriptions. In AAAI, volume 17, pages 3104–3110.
  • Xie et al. (2016a) Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016a. Representation learning of knowledge graphs with entity descriptions. In AAAI, pages 2659–2665.
  • Xie et al. (2016b) Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2016b. Representation learning of knowledge graphs with hierarchical types. In IJCAI, pages 2965–2971.
  • Yang and Mitchell (2017) Bishan Yang and Tom Mitchell. 2017. Leveraging knowledge bases in lstms for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1436–1446.
  • Yin et al. (2017) Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. 2017. Comparative study of cnn and rnn for natural language processing. arXiv preprint arXiv:1702.01923.
  • Zhong et al. (2015) Huaping Zhong, Jianwen Zhang, Zhen Wang, Hai Wan, and Zheng Chen. 2015. Aligning knowledge and text embeddings by entity descriptions. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 267–272.