Exploiting Semantics in Neural Machine Translation with Graph Convolutional Networks

04/23/2018 ∙ by Diego Marcheggiani, et al. ∙ University of Amsterdam

Semantic representations have long been argued as potentially useful for enforcing meaning preservation and improving generalization performance of machine translation methods. In this work, we are the first to incorporate information about predicate-argument structure of source sentences (namely, semantic-role representations) into neural machine translation. We use Graph Convolutional Networks (GCNs) to inject a semantic bias into sentence encoders and achieve improvements in BLEU scores over the linguistic-agnostic and syntax-aware versions on the English–German language pair.

1 Introduction

It has long been argued that semantic representations may provide a useful linguistic bias to machine translation systems Weaver (1955); Bar-Hillel (1960). Semantic representations provide an abstraction which can generalize over different surface realizations of the same underlying ‘meaning’. Providing this information to a machine translation system can, in principle, improve meaning preservation and boost generalization performance.

Though incorporation of semantic information into traditional statistical machine translation has been an active research topic (e.g., Baker et al. (2012); Liu and Gildea (2010); Wu and Fung (2009); Bazrafshan and Gildea (2013); Aziz et al. (2011); Jones et al. (2012)), we are not aware of any previous work considering semantic structures in neural machine translation (NMT). In this work, we aim to fill this gap by showing how information about predicate-argument structure of source sentences can be integrated into standard attention-based NMT models Bahdanau et al. (2015).

Figure 1: An example sentence annotated with a semantic-role representation.

We consider PropBank-style Palmer et al. (2005) semantic role structures, or more specifically their dependency versions Surdeanu et al. (2008). The semantic-role representations mark semantic arguments of predicates in a sentence and categorize them according to their semantic roles. Consider Figure 1: the predicate gave has three arguments (we slightly abuse the terminology here, as formally these are syntactic heads of arguments rather than arguments): John (semantic role A0, ‘the giver’), wife (A2, ‘an entity given to’) and present (A1, ‘the thing given’). Semantic roles capture commonalities between different realizations of the same underlying predicate-argument structures. For example, present will still be A1 in the sentence “John gave a nice present to his wonderful wife”, despite the different surface forms of the two sentences. We hypothesize that semantic roles can be especially beneficial in NMT, as ‘argument switching’ (flipping arguments corresponding to different roles) is one of the frequent and severe mistakes made by NMT systems Isabelle et al. (2017).

There is a limited amount of work on incorporating graph structures into neural sequence models. Unlike semantics in NMT, syntax-aware NMT has recently been a relatively active topic, with a number of approaches claiming improvements from using treebank syntax Sennrich and Haddow (2016); Eriguchi et al. (2016); Nadejde et al. (2017); Bastings et al. (2017); Aharoni and Goldberg (2017). Our graphs, however, are different from syntactic structures. Unlike syntactic dependency graphs, they are not trees and thus cannot be processed in a bottom-up fashion as in Eriguchi et al. (2016) or easily linearized as in Aharoni and Goldberg (2017). Luckily, the modeling approach of Bastings et al. (2017) does not make any assumptions about the graph structure, and thus we build on their method.

Bastings et al. (2017) used Graph Convolutional Networks (GCNs) to encode syntactic structure. GCNs were originally proposed by Kipf and Welling (2016) and modified to handle labeled and automatically predicted (hence noisy) syntactic dependency graphs by Marcheggiani and Titov (2017). Representations of nodes (i.e. words in a sentence) in GCNs are directly influenced by representations of their neighbors in the graph. The form of influence (e.g., transition matrices and parameters of gates) is learned in such a way as to benefit the end task (i.e. translation). These linguistically-aware word representations are used within a neural encoder. Although recent research has shown that neural architectures are able to learn some linguistic phenomena without explicit linguistic supervision Linzen et al. (2016); Vaswani et al. (2017), informing word representations with linguistic structures can provide a useful inductive bias.

We apply GCNs to the semantic dependency graphs and experiment on the English–German language pair (WMT16). We observe an improvement over the semantics-agnostic baseline, a BiRNN encoder (24.5 vs. 23.3 BLEU). As we use exactly the same modeling approach as in the syntactic method of Bastings et al. (2017), we can easily compare the influence of the types of linguistic structures (i.e., syntax vs. semantics). We observe that when using the full WMT data we obtain better results with semantics than with syntax (23.9 BLEU for the syntactic GCN). Using syntactic and semantic GCNs together, we obtain a further gain (24.9 BLEU), which suggests that syntax and semantics are complementary.

2 Model

2.1 Encoder-decoder Models

We use a standard attention-based encoder-decoder model Bahdanau et al. (2015) as a starting point for constructing our model. In encoder-decoder models, the encoder takes as input the source sentence $x$ and calculates a representation of each word $x_t$ in $x$. The decoder outputs a translation $y$ relying on the representations of the source sentence. Traditionally, the encoder is parametrized as a Recurrent Neural Network (RNN), but other architectures have also been successful, such as Convolutional Neural Networks (CNN) Gehring et al. (2017) and hierarchical self-attention models Vaswani et al. (2017), among others. In this paper we experiment with RNN and CNN encoders. We explore the benefits of incorporating information about semantic-role structures into such encoders.

More formally, RNNs Elman (1990) can be defined as a function $\mathrm{RNN}(x_{1:t})$ that calculates the hidden representation $h_t$ of a word $x_t$ based on its left context. Bidirectional RNNs use two RNNs: one runs in the forward direction and another one in the backward direction. The forward $\mathrm{RNN}(x_{1:t})$ represents the left context of word $x_t$, whereas the backward $\mathrm{RNN}(x_{n:t})$ computes a representation of the right context. The two representations are concatenated in order to incorporate information about the entire sentence:

$$\mathrm{BiRNN}(x, t) = \mathrm{RNN}(x_{1:t}) \circ \mathrm{RNN}(x_{n:t}).$$

In contrast to BiRNNs, CNNs LeCun et al. (2001) calculate a representation of a word $x_t$ by considering a window of $w$ words around $x_t$, such as

$$\mathrm{CNN}(x, t, w) = f(x_{t - \lfloor w/2 \rfloor}, \ldots, x_t, \ldots, x_{t + \lfloor w/2 \rfloor}),$$

where $f$ is usually an affine transformation followed by a nonlinear function.
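The two encoder families described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: it uses a plain Elman (tanh) recurrence rather than the GRUs used in the experiments, zero-pads the sentence boundaries for the CNN window, and all function and parameter names (`rnn_states`, `birnn`, `cnn`, `W_x`, `W_h`) are hypothetical.

```python
import numpy as np

def rnn_states(x, W_x, W_h):
    """Elman RNN: one hidden state per position, each based on the left context."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h)
        states.append(h)
    return np.stack(states)

def birnn(x, fwd, bwd):
    """Concatenate forward states with backward states (run on the reversed input)."""
    left = rnn_states(x, *fwd)
    right = rnn_states(x[::-1], *bwd)[::-1]
    return np.concatenate([left, right], axis=1)

def cnn(x, W, b, w=3):
    """Windowed CNN: affine map over w concatenated neighbors, followed by ReLU."""
    half = w // 2
    padded = np.pad(x, ((half, half), (0, 0)))  # assumption: zero padding at the edges
    windows = [padded[t:t + w].reshape(-1) for t in range(len(x))]
    return np.maximum(0.0, np.stack(windows) @ W.T + b)

# Toy usage with random inputs and parameters (illustrative sizes only).
rng = np.random.default_rng(0)
n, d_in, d_h = 5, 4, 3
x = rng.normal(size=(n, d_in))
fwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
bwd = (rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)))
H_rnn = birnn(x, fwd, bwd)                                        # shape (n, 2 * d_h)
H_cnn = cnn(x, rng.normal(size=(d_h, 3 * d_in)), np.zeros(d_h))   # shape (n, d_h)
```

Note that the forward half of a BiRNN state at position t never sees words to its right, while a CNN state sees only its fixed window; this locality is what the GCN layers on top are meant to complement.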

Once the sentence has been encoded, the decoder takes as input the induced sentence representation and generates the target sentence $y$. The target sentence $y$ is predicted word by word using an RNN decoder. At each step, the decoder calculates the probability of generating a word $y_t$ conditioning on a context vector $c_t$ and the previous state of the RNN decoder. The context vector $c_t$ is calculated based on the representation of the source sentence computed by the encoder, using an attention mechanism Bahdanau et al. (2015). Such a model is trained end-to-end on a parallel corpus to maximize the conditional likelihood of the target sentences.
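As a rough illustration of how a context vector can be computed from the encoder states, the sketch below implements additive attention in the style of Bahdanau et al. (2015): each source state is scored against the previous decoder state, the scores are normalized with a softmax, and the context vector is the resulting weighted average. The parameter names (`W_h`, `W_s`, `v`) and the sizes are assumptions made for the example, not values from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def context_vector(H, s_prev, W_h, W_s, v):
    """Additive attention over encoder states H given the previous decoder state."""
    scores = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v  # one score per source word
    alpha = softmax(scores)                           # attention weights, sum to 1
    return alpha @ H, alpha                           # weighted average of the states

# Toy usage with random values (illustrative sizes only).
rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 5, 6, 4, 3
H = rng.normal(size=(n, d_enc))
s_prev = rng.normal(size=d_dec)
c, alpha = context_vector(
    H, s_prev,
    rng.normal(size=(d_att, d_enc)),
    rng.normal(size=(d_att, d_dec)),
    rng.normal(size=d_att),
)
```

Because the context vector is a convex combination of encoder states, any structural information injected into those states (for instance by GCN layers) is directly available to the decoder at every step.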

2.2 Graph Convolutional Networks

Figure 2: Two layers of semantic GCN on top of a (not shown) BiRNN or CNN encoder.

| | BiRNN | CNN |
|---|---|---|
| Baseline Bastings et al. (2017) | 14.9 | 12.6 |
| +Sem | 15.6 | 13.4 |
| +Syn Bastings et al. (2017) | 16.1 | 13.7 |
| +Syn + Sem | 15.8 | 14.3 |

Table 1: Test BLEU, En–De, News Commentary.

| | BiRNN |
|---|---|
| Baseline Bastings et al. (2017) | 23.3 |
| +Sem | 24.5 |
| +Syn Bastings et al. (2017) | 23.9 |
| +Syn + Sem | 24.9 |

Table 2: Test BLEU, En–De, full WMT16.

Graph neural networks are a family of neural architectures Scarselli et al. (2009); Gilmer et al. (2017) specifically devised to induce representations of nodes in a graph relying on its graph structure. Graph convolutional networks (GCNs) belong to this family. While GCNs were introduced for modeling undirected unlabeled graphs Kipf and Welling (2016), in this paper we use a formulation of GCNs for labeled directed graphs, where the direction and the label of an edge are incorporated. In particular, we follow the formulation of Marcheggiani and Titov (2017) and Bastings et al. (2017) for syntactic graphs and apply it to dependency-based semantic-role structures Hajic et al. (2009) (as in Figure 1).

More formally, consider a directed graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. Each node $v \in V$ is represented by a feature vector $x_v \in \mathbb{R}^d$, where $d$ is the latent space dimensionality. The GCN induces a new representation $h'_v \in \mathbb{R}^d$ of a node $v$ while relying on representations $h_u$ of its neighbors:

$$h'_v = \rho\Big(\sum_{u \in \mathcal{N}(v)} g_{u,v}\,\big(W_{\mathrm{dir}(u,v)}\, h_u + b_{\mathrm{lab}(u,v)}\big)\Big),$$

where $\mathcal{N}(v)$ is the set of neighbors of $v$ and $W_{\mathrm{dir}(u,v)} \in \mathbb{R}^{d \times d}$ is a direction-specific parameter matrix. There are three possible directions ($\mathrm{dir}(u,v) \in \{\mathrm{in}, \mathrm{out}, \mathrm{loop}\}$): self-loop edges were added in order to ensure that the initial representation of node $h_v$ directly affects its new representation $h'_v$. The vector $b_{\mathrm{lab}(u,v)} \in \mathbb{R}^d$ is an embedding of the semantic role label of the edge $(u, v)$ (e.g., A0). The functions $g_{u,v}$ are scalar gates which weight the importance of each edge. Gates are particularly useful when the graph is predicted and thus may contain errors, i.e., wrong edges. In this scenario gates can down-weight the influence of such edges. $\rho$ is a non-linearity (ReLU). Refer to Marcheggiani and Titov (2017) and Bastings et al. (2017) for further details.
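The gated update described above can be sketched as a single NumPy function. This is a minimal illustration rather than the authors' implementation: it assumes one weight matrix per direction (in, out, loop), one bias vector per edge label, and a sigmoid gate computed from the sender's state, following the general recipe in the works cited above. The direction naming, the primed labels for inverse edges, the parameter dictionaries and the six-token example sentence are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn_layer(H, edges, W, b, w_gate, b_gate):
    """One gated GCN layer over a labeled directed graph.

    H: (n, d) node states; edges: (predicate, argument, role) triples.
    W / w_gate are keyed by direction, b / b_gate by edge label.
    """
    n, _ = H.shape
    # Each message is (sender u, receiver v, direction, label).
    msgs = [(v, v, "loop", "loop") for v in range(n)]
    for pred, arg, role in edges:
        msgs.append((pred, arg, "in", role))         # argument hears its predicate
        msgs.append((arg, pred, "out", role + "'"))  # predicate hears its argument
    out = np.zeros_like(H)
    for u, v, direction, label in msgs:
        gate = sigmoid(H[u] @ w_gate[direction] + b_gate[label])  # scalar edge gate
        out[v] += gate * (W[direction] @ H[u] + b[label])
    return np.maximum(0.0, out)  # ReLU

# Toy usage: a simplified sentence with three role edges from the predicate "gave".
rng = np.random.default_rng(0)
d = 4
tokens = ["John", "gave", "his", "wife", "a", "present"]
edges = [(1, 0, "A0"), (1, 3, "A2"), (1, 5, "A1")]
labels = ["loop"] + [r + s for r in ("A0", "A1", "A2") for s in ("", "'")]
W = {k: rng.normal(size=(d, d)) for k in ("in", "out", "loop")}
b = {k: rng.normal(size=d) for k in labels}
w_gate = {k: rng.normal(size=d) for k in ("in", "out", "loop")}
b_gate = {k: rng.normal() for k in labels}
H = rng.normal(size=(len(tokens), d))  # stand-in for BiRNN or CNN states
H_new = gcn_layer(H, edges, W, b, w_gate, b_gate)
```

In this toy graph, words without any role edge (such as "his") are updated only through their self-loop, whereas "gave" aggregates gated messages from all three of its arguments.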

As with CNNs, GCN layers can be stacked in order to incorporate higher-order neighborhoods. In our experiments, we used GCNs on top of a standard BiRNN encoder and a CNN encoder (Figure 2). In other words, the initial representations of words fed into the GCN were either RNN states or CNN representations.
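To see why stacking matters, note that one GCN layer only passes information between directly connected nodes, so after k layers a node can be influenced by nodes at most k edges away. The sketch below checks this with powers of an adjacency matrix that includes self-loops and treats edges as bidirectional; the helper name `reachable_within` and the toy graph are assumptions made for illustration.

```python
import numpy as np

def reachable_within(edges, n, k):
    """Boolean (n, n) matrix: entry [v, u] is True if u can influence v after k layers."""
    A = np.eye(n, dtype=int)       # self-loops
    for u, v in edges:
        A[u, v] = A[v, u] = 1      # messages flow in both directions
    return np.linalg.matrix_power(A, k) > 0

# Toy graph: predicate at index 1 with arguments at indices 0, 3 and 5.
edges = [(1, 0), (1, 3), (1, 5)]
one_layer = reachable_within(edges, 6, 1)
two_layers = reachable_within(edges, 6, 2)
print(one_layer[0, 3], two_layers[0, 3])
```

Two arguments of the same predicate (indices 0 and 3) do not exchange information after one layer, but they do after two, since the predicate relays the message.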

3 Experiments

We experimented with the English-to-German WMT16 dataset (4.5 million sentence pairs for training). We use its subset, News Commentary v11, for development and additional experiments (about 226,000 sentence pairs). For all these experiments, we use newstest2015 and newstest2016 as the validation and test set, respectively.

We parsed the English partitions of these datasets with a syntactic dependency parser Andor et al. (2016) and a dependency-based semantic role labeler Marcheggiani et al. (2017). We constructed the English vocabulary by taking all words with frequency higher than three, while for German we used byte-pair encodings (BPE) Sennrich et al. (2016). All hyperparameter selection was performed on the validation set (see Appendix A). We measured the performance of the models with (cased) BLEU scores Papineni et al. (2002). The settings and the framework (Neural Monkey Helcl and Libovický (2017)) used for experiments are the ones used in Bastings et al. (2017), which we use as baselines. As RNNs, we use GRUs Cho et al. (2014).
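The frequency cut-off used for the English vocabulary can be sketched as follows. This is a simplified illustration: whitespace tokenization, the `<unk>` placeholder and the function name `build_vocab` are assumptions, and "frequency higher than three" is read as a count of at least four.

```python
from collections import Counter

def build_vocab(sentences, min_count=4, specials=("<unk>",)):
    """Map every word seen at least min_count times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    return {w: i for i, w in enumerate(list(specials) + kept)}

# Toy corpus: "the" occurs four times, every other word fewer.
corpus = ["the cat", "the dog", "the cat sat", "the end"]
vocab = build_vocab(corpus)
print(vocab)
```

Words below the threshold would be mapped to the placeholder id at lookup time, which is why the German side uses BPE instead: subword units avoid out-of-vocabulary tokens on the target side.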

We now discuss the impact that different architectures and linguistic information have on the translation quality.

3.1 Results and Discussion

First, we start with experiments on the smaller News Commentary training set (see Table 1). As in Bastings et al. (2017), we used the standard attention-based encoder-decoder model as a baseline.

We tested the impact of semantic GCNs when used on top of CNN and BiRNN encoders. As expected, BiRNN results are stronger than CNN ones. In general, for both encoders we observe the same trend: using semantic GCNs leads to an improvement over the baseline model. The improvement is 0.7 BLEU for BiRNN and 0.8 for CNN. That the two gains are so close is slightly surprising, as the potentially non-local semantic information should in principle be more beneficial within a less powerful and local CNN encoder. The syntactic GCNs Bastings et al. (2017) appear stronger than semantic GCNs. As exactly the same model and optimization are used for both GCNs, the differences should be due to the type of linguistic representations used. (Note that the SRL system we use Marcheggiani et al. (2017) does not use syntax and is faster than the syntactic parser of Andor et al. (2016), so semantic GCNs may still be preferable from the engineering perspective even in this setting.) When syntactic and semantic GCNs are used together, we observe a further improvement with respect to the semantic GCN model, and, with a CNN encoder, a substantial improvement with respect to the syntactic GCN model.

Now we turn to the full WMT experiments (see Table 2). Though we expected that the linguistic bias should be more valuable in a resource-poor setting, the improvement from using semantic-role structures is larger here (+1.2 BLEU). This is surprising, but perhaps more data is beneficial for accurately modeling the influence of semantics on the translation task. Interestingly, the semantic GCN now outperforms the syntactic one by 0.6 BLEU. Again, it is hard to pinpoint exact reasons for this. One may speculate though that, given enough data, RNNs were able to capture syntactic dependencies, thus reducing the benefits from using treebank syntax, whereas (often less local and harder) semantic dependencies were more complementary. Finally, when syntactic and semantic GCNs are trained together, we obtain a further improvement, reaching 24.9 BLEU. These results suggest that syntactic and semantic dependency structures provide complementary information when it comes to translation.
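The gains quoted in this discussion follow directly from the BiRNN columns of Tables 1 and 2. The short check below recomputes them; the scores are copied from the tables, while the dictionary layout and the helper name `gains` are just one convenient way to organize them.

```python
# Test BLEU for the BiRNN encoder, copied from Table 1 and Table 2.
news_commentary = {"Baseline": 14.9, "+Sem": 15.6, "+Syn": 16.1, "+Syn+Sem": 15.8}
full_wmt16 = {"Baseline": 23.3, "+Sem": 24.5, "+Syn": 23.9, "+Syn+Sem": 24.9}

def gains(scores):
    """BLEU difference of every system with respect to the baseline."""
    base = scores["Baseline"]
    return {k: round(v - base, 1) for k, v in scores.items() if k != "Baseline"}

print(gains(news_commentary))  # semantic gain on the small dataset
print(gains(full_wmt16))       # semantic gain on the full dataset
```

On the small dataset syntax helps more than semantics (+1.2 vs. +0.7), whereas on the full dataset the order is reversed (+0.6 vs. +1.2), and the combined model adds +1.6 over the baseline.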

| | BiRNN | CNN |
|---|---|---|
| Baseline Bastings et al. (2017) | 14.1 | 12.1 |
| +Sem (1L) | 14.3 | 12.5 |
| +Sem (2L) | 14.4 | 12.6 |
| +Sem (3L) | 14.4 | 12.7 |
| +Syn (2L) Bastings et al. (2017) | 14.8 | 13.1 |
| +SelfLoop (1L) | 14.1 | 12.1 |
| +SelfLoop (2L) | 14.2 | 11.5 |
| +SemSyn (1L) | 14.1 | 12.7 |
| +Syn (1L) + Sem (1L) | 14.7 | 12.7 |
| +Syn (1L) + Sem (2L) | 14.6 | 12.8 |
| +Syn (2L) + Sem (1L) | 14.9 | 13.0 |
| +Syn (2L) + Sem (2L) | 14.9 | 13.5 |

Table 3: Validation BLEU, News Commentary only.

| Model | Translation |
|---|---|
| BiRNN | John verkaufte das Auto nach Mark . |
| Sem | John verkaufte das Auto an Mark . |
| BiRNN | Der Junge zu Fuß die staubige Straße ist ein Bier trinken . |
| Sem | Der Junge , der die staubige Straße hinunter geht , trinkt ein Bier . |
| BiRNN | Der Junge auf einer Bank im Park spielt Schach . |
| Sem | Der Junge sitzt auf einer Bank im Park Schach . |

Table 4: Qualitative analysis. The first two sentences are translations where the semantic structure helps. For the last sentence both translations are problematic but the BiRNN one is grammatical.

3.2 Ablation and Syntax-Semantics GCNs

We used the validation set to perform extra experiments, as well as to select hyperparameters (e.g., the number of GCN layers) for the experiments presented above. Table 3 presents the results. The annotations 1L, 2L and 3L refer to the number of GCN layers used.

First, we tested whether the gain we observed is an effect of an extra layer of non-linearity or an effect of the linguistic structures encoded with GCNs. In order to do so, we used the GCN layer without any structural information. In this way, only the self-loop edge is used within the GCN node updates. These results (e.g., BiRNN+SelfLoop) show that the linguistic-agnostic GCNs perform on par with the baseline, and thus using linguistic structure is genuinely beneficial in translation.
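To make the self-loop ablation concrete, write the GCN update of Section 2.2 with hypothetical symbols: $h_v$ for the input state of word $v$, $W_{\mathrm{loop}}$ and $b_{\mathrm{loop}}$ for the self-loop parameters, $g_{v,v}$ for the scalar gate on the self-loop and $\rho$ for the ReLU. With all other edges removed, the neighbor set of $v$ contains only $v$ itself, so the sum over neighbors collapses to a single term:

```latex
h'_v \;=\; \rho\Big(\sum_{u \in \{v\}} g_{u,v}\,\big(W_{\mathrm{dir}(u,v)}\,h_u + b_{\mathrm{lab}(u,v)}\big)\Big)
     \;=\; \rho\big(g_{v,v}\,(W_{\mathrm{loop}}\,h_v + b_{\mathrm{loop}})\big)
```

Under these assumptions the ablated layer is a gated position-wise feed-forward layer: it adds depth and a non-linearity but exchanges no information between words, which is what lets the comparison isolate the contribution of the linguistic structure.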

Since syntactic and semantic structures seem to be individually beneficial and, though related, capture different linguistic phenomena, it is natural to try combining them. When syntax and semantics are combined in the same GCN layer (SemSyn, with distinct matrices for syntax and semantics), we do not observe any improvement with respect to using semantic or syntactic information alone. We argue that the reason for this is that the two linguistic signals do not interact much when encoded into the same GCN layer with a simple aggregation function. We thus stacked a semantic GCN on top of a syntactic one and varied the number of layers. Though this approach is more successful, we manage to obtain only very moderate improvements over the single-representation models.

3.3 Qualitative Analysis

We analyzed the behavior of the BiRNN baseline and the semantic GCN model trained on the full WMT16 training set. In Table 4 we show three examples where there is a clear difference between translations produced by the two models. Besides the two translations, we show the dependency SRL structure predicted by the labeler and exploited by our GCN model.

In the first sentence, the only difference is in the choice of the preposition for the argument Mark. Note that the argument is correctly assigned to role A2 (‘Buyer’) by the semantic role labeler. The BiRNN model translates ‘to’ with ‘nach’, which in German expresses directionality and would be a correct translation should the argument refer to a location. In contrast, the semantic GCN correctly translates ‘to’ as ‘an’. We hypothesize that the semantic structure, namely the assignment of the argument to A2 rather than AM-DIR (‘Directionality’), helps the model to choose the right preposition. In the second sentence, the BiRNN’s translation is ungrammatical, whereas the semantic GCN is able to correctly translate the source sentence. Again, the arguments, correctly identified by the semantic role labeler, may have been useful in translating this somewhat tricky sentence. Finally, in the third case, we can observe that both translations are problematic. The BiRNN and the semantic GCN ignored the verbs ‘sit’ and ‘play’, respectively. However, the BiRNN’s translation for this sentence is preferable, as it is grammatically correct, even if not fluent or particularly precise.

4 Conclusions

In this work we propose injecting information about predicate-argument structures of sentences in NMT models. We observe that the semantic structures are beneficial for the English–German language pair. So far we evaluated the model performance in terms of BLEU only. It would be interesting in future work to both understand when semantics appears beneficial, and also to see which components of semantic structures play a role. Experiments on other language pairs are also left for future work.

Acknowledgments

We thank Stella Frank and Wilker Aziz for their suggestions and comments. The project was supported by the European Research Council (ERC StG BroadSem 678254), and the Dutch National Science Foundation (NWO VIDI 639.022.518). We thank NVIDIA for donating the GPUs used for this research.

Appendix A Hyperparameters

For experiments on the News Commentary data we used 8000 BPE merges, whereas we used 16000 BPE merges for En–De experiments on the full dataset. For all the experiments, we used bidirectional GRUs and set the embedding size to 256. We used word dropout with a retain probability of 0.8 and edge dropout with the same probability, and we applied L2 regularization to all the parameters. Unless stated otherwise, translations are obtained using a greedy decoder. We placed residual connections He et al. (2016) before every GCN layer. For the experiments on News Commentary data, we set the GRU (for both encoder and decoder) and CNN hidden states to 512, used Adam Kingma and Ba (2015) as the optimizer with an initial learning rate of 0.0002, and trained the models for 50 epochs. For the large-scale experiments on En–De, we set the GRU hidden states to 800, and instead of greedy decoding we employed beam search (beam size 12). We trained the model for 20 epochs with the same hyperparameters.
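For quick reference, the two settings described in this appendix can be collected into a pair of configuration dictionaries. The key names are hypothetical and do not correspond to Neural Monkey options; the values are the ones stated above, with the optimizer and learning rate assumed to be shared by both settings, and the L2 coefficient is left out because its value is not given here.

```python
# Settings shared by all experiments (key names are illustrative only).
COMMON = dict(
    rnn_cell="GRU",
    bidirectional_encoder=True,
    embedding_size=256,
    word_dropout_retain=0.8,
    edge_dropout_retain=0.8,
    residual_before_gcn=True,
    optimizer="Adam",
    learning_rate=2e-4,
)

# News Commentary: smaller model, greedy decoding.
NEWS_COMMENTARY = dict(COMMON, bpe_merges=8000, hidden_size=512,
                       epochs=50, decoding="greedy")

# Full WMT16: larger GRU states, beam search with beam size 12.
FULL_WMT16 = dict(COMMON, bpe_merges=16000, hidden_size=800,
                  epochs=20, decoding="beam", beam_size=12)
```

Keeping the shared settings in one place makes it explicit that the two runs differ only in BPE merges, hidden size, number of epochs and decoding strategy.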

Appendix B Datasets Statistics

| | Train | Val. | Test |
|---|---|---|---|
| English–German | 226822 | 2169 | 2999 |
| English–German (full) | 4500966 | 2169 | 2999 |

Table 5: The number of sentences in our datasets.

| | Source | Target |
|---|---|---|
| English–German | 37824 | 8099 (BPE) |
| English–German (full) | 50000 | 16000 (BPE) |

Table 6: Vocabulary sizes.