It has long been argued that semantic representations may provide a useful linguistic bias to machine translation systems Weaver (1955); Bar-Hillel (1960). Semantic representations provide an abstraction which can generalize over different surface realizations of the same underlying ‘meaning’. Providing this information to a machine translation system can, in principle, improve meaning preservation and boost generalization performance.
Though incorporation of semantic information into traditional statistical machine translation has been an active research topic (e.g., Baker et al. (2012); Liu and Gildea (2010); Wu and Fung (2009); Bazrafshan and Gildea (2013); Aziz et al. (2011); Jones et al. (2012)), we are not aware of any previous work considering semantic structures in neural machine translation (NMT). In this work, we aim to fill this gap by showing how information about predicate-argument structure of source sentences can be integrated into standard attention-based NMT models Bahdanau et al. (2015).
We consider PropBank-style Palmer et al. (2005) semantic role structures, or more specifically their dependency versions Surdeanu et al. (2008). The semantic-role representations mark semantic arguments of predicates in a sentence and categorize them according to their semantic roles. Consider Figure 1: the predicate gave has three arguments (we slightly abuse the terminology: formally these are syntactic heads of arguments rather than arguments): John (semantic role A0, ‘the giver’), wife (A2, ‘an entity given to’) and present (A1, ‘the thing given’). Semantic roles capture commonalities between different realizations of the same underlying predicate-argument structure. For example, present remains A1 in the sentence “John gave a nice present to his wonderful wife”, despite the different surface forms of the two sentences. We hypothesize that semantic roles can be especially beneficial in NMT, as ‘argument switching’ (flipping arguments corresponding to different roles) is one of the frequent and severe mistakes made by NMT systems Isabelle et al. (2017).
There is a limited amount of work on incorporating graph structures into neural sequence models. Although, unlike semantics in NMT, syntactically-aware NMT has recently been a relatively hot topic, with a number of approaches claiming improvements from using treebank syntax Sennrich and Haddow (2016); Eriguchi et al. (2016); Nadejde et al. (2017); Bastings et al. (2017); Aharoni and Goldberg (2017), our graphs are different from syntactic structures. Unlike syntactic dependency graphs, they are not trees and thus cannot be processed in a bottom-up fashion as in Eriguchi et al. (2016) or easily linearized as in Aharoni and Goldberg (2017). Luckily, the modeling approach of Bastings et al. (2017) does not make any assumptions about the graph structure, and thus we build on their method.
Bastings et al. (2017) used Graph Convolutional Networks (GCNs) to encode syntactic structure. GCNs were originally proposed by Kipf and Welling (2016) and modified to handle labeled and automatically predicted (hence noisy) syntactic dependency graphs by Marcheggiani and Titov (2017). Representations of nodes (i.e., words in a sentence) in GCNs are directly influenced by representations of their neighbors in the graph. The form of influence (e.g., transition matrices and parameters of gates) is learned in such a way as to benefit the end task (i.e., translation). These linguistically-aware word representations are used within a neural encoder. Although recent research has shown that neural architectures are able to learn some linguistic phenomena without explicit linguistic supervision Linzen et al. (2016); Vaswani et al. (2017), informing word representations with linguistic structures can provide a useful inductive bias.
We apply GCNs to the semantic dependency graphs and experiment on the English–German language pair (WMT16). We observe an improvement over the semantics-agnostic baseline (a BiRNN encoder; 23.3 vs. 24.5 BLEU). As we use exactly the same modeling approach as in the syntactic method of Bastings et al. (2017), we can directly compare the influence of the two types of linguistic structures (i.e., syntax vs. semantics). We observe that when using the full WMT data we obtain better results with semantics than with syntax (23.9 BLEU for the syntactic GCN). Using syntactic and semantic GCNs together, we obtain a further gain (24.9 BLEU), which suggests the complementarity of syntax and semantics.
2.1 Encoder-decoder Models
We use a standard attention-based encoder-decoder model Bahdanau et al. (2015) as a starting point for constructing our model. In encoder-decoder models, the encoder takes as input the source sentence $x = x_1, \dots, x_n$ and calculates a representation $h_t$ of each word $x_t$ in $x$. The decoder outputs a translation $y$ relying on the representations of the source sentence. Traditionally, the encoder is parametrized as a Recurrent Neural Network (RNN), but other architectures have also been successful, such as Convolutional Neural Networks (CNNs) Gehring et al. (2017) and hierarchical self-attention models Vaswani et al. (2017), among others. In this paper we experiment with RNN and CNN encoders. We explore the benefits of incorporating information about semantic-role structures into such encoders.
More formally, RNNs Elman (1990) can be defined as a function $h_t = \mathrm{RNN}(x_t, h_{t-1})$ that calculates the hidden representation $h_t$ of a word $x_t$ based on its left context. Bidirectional RNNs use two RNNs: one runs in the forward direction and another one in the backward direction. The forward $\overrightarrow{h_t}$ represents the left context of word $x_t$, whereas the backward $\overleftarrow{h_t}$ computes a representation of the right context. The two representations are concatenated in order to incorporate information about the entire sentence: $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.

In contrast to BiRNNs, CNNs LeCun et al. (2001) calculate a representation of a word $x_t$ by considering a window of $k$ words around it, such as $h_t = f(x_{t-\lfloor k/2 \rfloor}, \dots, x_t, \dots, x_{t+\lfloor k/2 \rfloor})$, where $f$ is usually an affine transformation followed by a nonlinear function.
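The two encoder families above can be sketched in a few lines. This is a toy illustration only: `birnn`, `window_encoder`, `step`, and `f` are illustrative stand-ins for learned GRU cells and convolution filters, not the paper's implementation.

```python
# Toy sketch of the two encoder families discussed above. All names
# (birnn, window_encoder, step, f) are illustrative stand-ins: a real
# encoder would use learned GRU cells and convolution filters.

def birnn(xs, step, h0):
    """Bidirectional RNN: concatenate forward and backward states."""
    fwd, h = [], h0
    for x in xs:                      # left-to-right pass
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):            # right-to-left pass
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()
    # h_t = [forward state ; backward state]
    return [f + b for f, b in zip(fwd, bwd)]

def window_encoder(xs, f, k=1):
    """CNN-style encoder: word t is represented from a window of
    k words on each side (zero-padded at the sentence boundaries)."""
    padded = [0.0] * k + xs + [0.0] * k
    return [f(padded[t:t + 2 * k + 1]) for t in range(len(xs))]
```

With a toy additive `step` whose states are one-element lists, each output position of `birnn` carries a summary of the prefix in its first half and of the suffix in its second half, which is exactly the concatenation described above.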
Once the sentence has been encoded, the decoder takes as input the induced sentence representation and generates the target sentence $y$. The target sentence is predicted word by word using an RNN decoder. At each step $i$, the decoder calculates the probability of generating word $y_i$ conditioning on a context vector $c_i$ and the previous state of the RNN decoder. The context vector $c_i$ is calculated based on the representation of the source sentence computed by the encoder, using an attention mechanism Bahdanau et al. (2015). Such a model is trained end-to-end on a parallel corpus to maximize the conditional likelihood of the target sentences.
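As a rough illustration of how the attention mechanism forms the context vector, the sketch below scores source positions with a plain dot product; Bahdanau et al. (2015) use an MLP scorer with learned projections, so the function name and the scorer here are simplifying assumptions, not the actual model.

```python
import math

# Simplified sketch of attention: score each encoder state against the
# decoder state, normalize with a softmax, and return the weighted sum
# of encoder states as the context vector. A dot-product scorer is used
# for brevity (the cited model uses a learned MLP scorer).

def attention_context(decoder_state, encoder_states):
    # score each source position against the current decoder state
    scores = [sum(s * hj for s, hj in zip(decoder_state, h))
              for h in encoder_states]
    # normalize scores into attention weights with a (stable) softmax
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # context vector: attention-weighted sum of encoder states
    dim = len(encoder_states[0])
    return [sum(w * h[k] for w, h in zip(weights, encoder_states))
            for k in range(dim)]
```

The decoder then conditions its next-word distribution on this context vector together with its previous hidden state.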
2.2 Graph Convolutional Networks
Table 1: Test BLEU, News Commentary data (BiRNN and CNN encoders).

| Model | BiRNN | CNN |
| --- | --- | --- |
| Baseline Bastings et al. (2017) | 14.9 | 12.6 |
| +Sem | 15.6 | 13.4 |
| +Syn Bastings et al. (2017) | 16.1 | 13.7 |
| +Syn + Sem | 15.8 | 14.3 |
Table 2: Test BLEU, full WMT16 English–German data.

| Model | BLEU |
| --- | --- |
| Baseline Bastings et al. (2017) | 23.3 |
| +Sem | 24.5 |
| +Syn Bastings et al. (2017) | 23.9 |
| +Syn + Sem | 24.9 |
Graph neural networks are a family of neural architectures Scarselli et al. (2009); Gilmer et al. (2017) specifically devised to induce representations of nodes in a graph relying on its graph structure. Graph convolutional networks (GCNs) belong to this family. While GCNs were introduced for modeling undirected unlabeled graphs Kipf and Welling (2016), in this paper we use a formulation of GCNs for labeled directed graphs, where the direction and the label of an edge are incorporated. In particular, we follow the formulation of Marcheggiani and Titov (2017) and Bastings et al. (2017) for syntactic graphs and apply it to dependency-based semantic-role structures Hajic et al. (2009) (as in Figure 1).
More formally, consider a directed graph $G = (V, E)$, where $V$ is a set of nodes and $E$ is a set of edges. Each node $v \in V$ is represented by a feature vector $x_v \in \mathbb{R}^d$, where $d$ is the latent space dimensionality. The GCN induces a new representation $h_v$ of a node $v$ while relying on representations of its neighbors:

$h_v = \rho\Big( \sum_{u \in N(v)} g_{u,v} \big( W_{\mathrm{dir}(u,v)} \, x_u + b_{\mathrm{lab}(u,v)} \big) \Big),$

where $N(v)$ is the set of neighbors of $v$, and $W_{\mathrm{dir}(u,v)} \in \mathbb{R}^{d \times d}$ is a direction-specific parameter matrix. There are three possible directions ($\mathrm{dir}(u,v) \in \{\text{in}, \text{out}, \text{self}\}$): self-loop edges were added in order to ensure that the initial representation of node $v$ directly affects its new representation $h_v$. The vector $b_{\mathrm{lab}(u,v)}$ is an embedding of the semantic role label of the edge (e.g., A0). The functions $g_{u,v}$ are scalar gates which weight the importance of each edge. Gates are particularly useful when the graph is predicted and thus may contain errors, i.e., wrong edges; in this scenario gates can downweight the influence of such edges. Finally, $\rho$ is a non-linearity (ReLU). (Refer to Marcheggiani and Titov (2017) and Bastings et al. (2017) for further details.)
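The update above can be sketched in a few dozen lines. This is a minimal, dependency-free illustration of the equation: the gate parametrization (a per-direction vector plus a per-label bias) follows the cited formulation in spirit, but all names and the toy setup are illustrative, not the authors' code.

```python
import math

# Minimal sketch of one GCN layer as described above: direction-specific
# weight matrices W, label embeddings b_lab, and scalar edge gates.

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def vadd(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gcn_layer(x, edges, W, b_lab, w_gate, b_gate):
    """One GCN layer.
    x:     list of node feature vectors (e.g. BiRNN states).
    edges: tuples (u, v, direction, label); self-loop edges must be
           included so each node's own state feeds its update.
    """
    d = len(x[0])
    h = [[0.0] * d for _ in x]
    for u, v, direction, label in edges:
        # scalar gate: can downweight erroneously predicted edges
        g = sigmoid(sum(wi * xi for wi, xi in zip(w_gate[direction], x[u]))
                    + b_gate[label])
        # message along the edge: W_dir(u,v) x_u + b_lab(u,v)
        msg = vadd(matvec(W[direction], x[u]), b_lab[label])
        h[v] = vadd(h[v], [g * m for m in msg])
    # rho: ReLU non-linearity
    return [[max(0.0, hi) for hi in hv] for hv in h]
```

In the model, the input `x` would be the BiRNN or CNN encoder states, and `edges` the predicted semantic-role graph plus self-loops.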
As with CNNs, GCN layers can be stacked in order to incorporate higher order neighborhoods. In our experiments, we used GCNs on top of a standard BiRNN encoder and a CNN encoder (Figure 2). In other words, the initial representations of words fed into GCN were either RNN states or CNN representations.
We experimented with the English-to-German WMT16 dataset (4.5 million sentence pairs for training). We used its subset, News Commentary v11, for development and additional experiments (226,000 sentence pairs). For all these experiments, we used newstest2015 and newstest2016 as validation and test sets, respectively.
We parsed the English partitions of these datasets with a syntactic dependency parser Andor et al. (2016) and a dependency-based semantic role labeler Marcheggiani et al. (2017). We constructed the English vocabulary by taking all words with frequency higher than three, while for German we used byte-pair encodings (BPE) Sennrich et al. (2016). All hyperparameter selection was performed on the validation set (see Appendix A). We measured the performance of the models with (cased) BLEU scores Papineni et al. (2002). The settings and the framework (Neural Monkey Helcl and Libovický (2017)) used for the experiments are the ones used in Bastings et al. (2017), which we use as baselines. As RNNs, we use GRUs Cho et al. (2014).
We now discuss the impact that different architectures and linguistic information have on the translation quality.
3.1 Results and Discussion
We start with experiments on the smaller News Commentary training set (see Table 1). As in Bastings et al. (2017), we used the standard attention-based encoder-decoder model as a baseline.
We tested the impact of semantic GCNs when used on top of CNN and BiRNN encoders. As expected, BiRNN results are stronger than CNN ones. In general, for both encoders we observe the same trend: using semantic GCNs leads to an improvement over the baseline model. The improvement is 0.7 BLEU for the BiRNN and 0.8 for the CNN. This is slightly surprising, as the potentially non-local semantic information should in principle be more beneficial within the less powerful and local CNN encoder. The syntactic GCNs Bastings et al. (2017) appear stronger than the semantic GCNs. As exactly the same model and optimization are used for both GCNs, the differences should be due to the type of linguistic representations used. (Note that the SRL system we use Marcheggiani et al. (2017) does not use syntax and is faster than the syntactic parser of Andor et al. (2016), so semantic GCNs may still be preferable from the engineering perspective even in this setting.) When syntactic and semantic GCNs are used together, we observe a further improvement with respect to the semantic GCN model, and a substantial improvement with respect to the syntactic GCN model with a CNN encoder.
Now we turn to the full WMT experiments. Though we expected the linguistic bias to be more valuable in a resource-poor setting, the improvement from using semantic-role structures is larger here (+1.2 BLEU). This is surprising, but perhaps more data is beneficial for accurately modeling the influence of semantics on the translation task. Interestingly, the semantic GCN now outperforms the syntactic one by 0.6 BLEU. Again, it is hard to pinpoint exact reasons for this. One may speculate, though, that, given enough data, RNNs are able to capture syntactic dependencies themselves, thus reducing the benefits of using treebank syntax, whereas (often less local and harder) semantic dependencies are more complementary. Finally, when syntactic and semantic GCNs are trained together, we obtain a further improvement, reaching 24.9 BLEU. These results suggest that syntactic and semantic dependency structures provide complementary information when it comes to translation.
Table 3: Validation BLEU, News Commentary data (BiRNN and CNN encoders); 1L and 2L denote the number of GCN layers.

| Model | BiRNN | CNN |
| --- | --- | --- |
| Baseline Bastings et al. (2017) | 14.1 | 12.1 |
| +Syn (2L) Bastings et al. (2017) | 14.8 | 13.1 |
| +Syn (1L) + Sem (1L) | 14.7 | 12.7 |
| +Syn (1L) + Sem (2L) | 14.6 | 12.8 |
| +Syn (2L) + Sem (1L) | 14.9 | 13.0 |
| +Syn (2L) + Sem (2L) | 14.9 | 13.5 |
Table 4: Translation examples (BiRNN baseline vs. semantic GCN).

| Model | Translation |
| --- | --- |
| BiRNN | John verkaufte das Auto nach Mark . |
| Sem | John verkaufte das Auto an Mark . |
| BiRNN | Der Junge zu Fuß die staubige Straße ist ein Bier trinken . |
| Sem | Der Junge , der die staubige Straße hinunter geht , trinkt ein Bier . |
| BiRNN | Der Junge auf einer Bank im Park spielt Schach . |
| Sem | Der Junge sitzt auf einer Bank im Park Schach . |
3.2 Ablation and Syntax-Semantics GCNs
We used the validation set to perform extra experiments, as well as to select hyperparameters (e.g., the number of GCN layers) for the experiments presented above. Table 3 presents the results. The annotations 1L, 2L, and 3L refer to the number of GCN layers used.
First, we tested whether the gain we observed is an effect of an extra layer of non-linearity or an effect of the linguistic structures encoded with GCNs. In order to do so, we used the GCN layer without any structural information: only the self-loop edge is used within the GCN node updates. These results (e.g., BiRNN+SelfLoop) show that the structure-agnostic GCNs perform on par with the baseline, and thus using linguistic structure is genuinely beneficial in translation.
Since syntactic and semantic structures seem to be individually beneficial and, though related, capture different linguistic phenomena, it is natural to try combining them. When syntax and semantics are combined in the same GCN layer (SemSyn), we do not observe any improvement over having semantic or syntactic information alone. (We used distinct matrices for syntax and semantics.) We argue that the reason for this is that the two linguistic signals do not interact much when encoded into the same GCN layer with a simple aggregation function. We thus stacked a semantic GCN on top of a syntactic one and varied the number of layers. Though this approach is more successful, we manage to obtain only very moderate improvements over the single-representation models.
3.3 Qualitative Analysis
We analyzed the behavior of the BiRNN baseline and the semantic GCN model trained on the full WMT16 training set. In Table 4 we show three examples where there is a clear difference between translations produced by the two models. Besides the two translations, we show the dependency SRL structure predicted by the labeler and exploited by our GCN model.
In the first sentence, the only difference is in the choice of the preposition for the argument Mark. Note that the argument is correctly assigned to role A2 (‘Buyer’) by the semantic role labeler. The BiRNN model translates to with nach, which in German expresses directionality and would be a correct translation if the argument referred to a location. In contrast, the semantic GCN correctly translates to as an. We hypothesize that the semantic structure, namely the assignment of the argument to A2 rather than AM-DIR (‘Directionality’), helps the model to choose the right preposition. In the second sentence, the BiRNN’s translation is ungrammatical, whereas the semantic GCN correctly translates the source sentence. Again, the arguments, correctly identified by the semantic role labeler, may have been useful in translating this somewhat tricky sentence. Finally, in the third case, we can observe that both translations are problematic: the BiRNN and the semantic GCN ignored the verbs sit and play, respectively. However, the BiRNN’s translation for this sentence is preferable, as it is grammatically correct, even if not fluent or particularly precise.
In this work we propose injecting information about predicate-argument structures of sentences in NMT models. We observe that the semantic structures are beneficial for the English–German language pair. So far we evaluated the model performance in terms of BLEU only. It would be interesting in future work to both understand when semantics appears beneficial, and also to see which components of semantic structures play a role. Experiments on other language pairs are also left for future work.
We thank Stella Frank and Wilker Aziz for their suggestions and comments. The project was supported by the European Research Council (ERC StG BroadSem 678254), and the Dutch National Science Foundation (NWO VIDI 639.022.518). We thank NVIDIA for donating the GPUs used for this research.
- Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL. pages 132–140. https://doi.org/10.18653/v1/P17-2021.
- Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL. pages 2442–2452. https://doi.org/10.18653/v1/P16-1231.
- Aziz et al. (2011) Wilker Aziz, Miguel Rios, and Lucia Specia. 2011. Shallow semantic trees for SMT. In Proceedings of the Sixth Workshop on Statistical Machine Translation, WMT@EMNLP. pages 316–322. http://aclanthology.info/papers/W11-2136/shallow-semantic-trees-for-smt.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations, ICLR. http://arxiv.org/abs/1409.0473.
- Baker et al. (2012) Kathrin Baker, Michael Bloodgood, Bonnie J. Dorr, Chris Callison-Burch, Nathaniel Wesley Filardo, Christine D. Piatko, Lori S. Levin, and Scott Miller. 2012. Modality and negation in SIMT use of modality and negation in semantically-informed syntactic MT. Computational Linguistics 38(2):411–438. https://doi.org/10.1162/COLI_a_00099.
- Bar-Hillel (1960) Yehoshua Bar-Hillel. 1960. The present status of automatic translation of languages. Advances in Computers 1:91–163.
- Bastings et al. (2017) Joost Bastings, Ivan Titov, Wilker Aziz, Diego Marcheggiani, and Khalil Simaan. 2017. Graph convolutional encoders for syntax-aware neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP. pages 1957–1967. https://www.aclweb.org/anthology/D17-1209.
- Bazrafshan and Gildea (2013) Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL. pages 419–423. http://aclweb.org/anthology/P/P13/P13-2074.pdf.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP. pages 1724–1734. http://www.aclweb.org/anthology/D14-1179.
- Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science 14(2):179–211.
- Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL. pages 823–833. http://www.aclweb.org/anthology/P16-1078.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL. pages 123–135. https://doi.org/10.18653/v1/P17-1012.
- Gilmer et al. (2017) Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, ICML. pages 1263–1272. http://proceedings.mlr.press/v70/gilmer17a.html.
- Hajic et al. (2009) Jan Hajic, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Stepánek, Pavel Stranák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning: Shared Task, CoNLL. pages 1–18. http://aclweb.org/anthology/W/W09/W09-1201.pdf.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pages 770–778. https://doi.org/10.1109/CVPR.2016.90.
- Helcl and Libovický (2017) Jindřich Helcl and Jindřich Libovický. 2017. Neural monkey: An open-source tool for sequence learning. The Prague Bulletin of Mathematical Linguistics (107):5–17. https://doi.org/10.1515/pralin-2017-0001.
- Isabelle et al. (2017) Pierre Isabelle, Colin Cherry, and George F. Foster. 2017. A challenge set approach to evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP. pages 2486–2496. https://aclanthology.info/papers/D17-1263/d17-1263.
- Jones et al. (2012) Bevan Jones, Jacob Andreas, Daniel Bauer, Karl Moritz Hermann, and Kevin Knight. 2012. Semantics-based machine translation with hyperedge replacement grammars. In Proceedings of the 24th International Conference on Computational Linguistics, COLING. pages 1359–1376. http://aclweb.org/anthology/C/C12/C12-1083.pdf.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, ICLR. http://arxiv.org/abs/1412.6980.
- Kipf and Welling (2016) Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. In Proceedings of the International Conference on Learning Representations, ICLR. http://arxiv.org/abs/1609.02907.
- LeCun et al. (2001) Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. 2001. Gradient-based learning applied to document recognition. In Proceedings of Intelligent Signal Processing.
- Linzen et al. (2016) Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521–535. https://www.transacl.org/ojs/index.php/tacl/article/view/972.
- Liu and Gildea (2010) Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING. pages 716–724. http://aclweb.org/anthology/C10-1081.
- Marcheggiani et al. (2017) Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In Proceedings of the 21st Conference on Computational Natural Language Learning, CoNLL. pages 411–420. https://doi.org/10.18653/v1/K17-1041.
- Marcheggiani and Titov (2017) Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP. pages 1506–1515. https://aclanthology.info/papers/D17-1159/d17-1159.
- Nadejde et al. (2017) Maria Nadejde, Siva Reddy, Rico Sennrich, Tomasz Dwojak, Marcin Junczys-Dowmunt, Philipp Koehn, and Alexandra Birch. 2017. Predicting target language CCG supertags improves neural machine translation. In Proceedings of the Second Conference on Machine Translation, WMT. pages 68–79. http://aclanthology.info/papers/W17-4707/w17-4707.
- Palmer et al. (2005) Martha Palmer, Paul Kingsbury, and Daniel Gildea. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics 31(1):71–106. https://doi.org/10.1162/0891201053630264.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, ACL. pages 311–318. http://www.aclweb.org/anthology/P02-1040.pdf.
- Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. 2009. The graph neural network model. IEEE Trans. Neural Networks 20(1):61–80. https://doi.org/10.1109/TNN.2008.2005605.
- Sennrich and Haddow (2016) Rico Sennrich and Barry Haddow. 2016. Linguistic Input Features Improve Neural Machine Translation. In Proceedings of the First Conference on Machine Translation, WMT. pages 83–91. http://www.aclweb.org/anthology/W16-2209.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL. pages 1715–1725. http://www.aclweb.org/anthology/P16-1162.
- Surdeanu et al. (2008) Mihai Surdeanu, Richard Johansson, Adam Meyers, Lluís Màrquez, and Joakim Nivre. 2008. The CoNLL 2008 shared task on joint parsing of syntactic and semantic dependencies. In Proceedings of the Twelfth Conference on Computational Natural Language Learning, CoNLL. pages 159–177. http://aclweb.org/anthology/W/W08/W08-2121.pdf.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, NIPS. pages 6000–6010. http://papers.nips.cc/paper/7181-attention-is-all-you-need.
- Weaver (1955) Warren Weaver. 1955. Translation. Machine translation of languages 14:15–23.
- Wu and Fung (2009) Dekai Wu and Pascale Fung. 2009. Semantic roles for SMT: A hybrid two-pass model. In Proceedings of the Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, NAACL. pages 13–16. http://www.aclweb.org/anthology/N09-2004.
Appendix A Hyperparameters
For experiments on the News Commentary data we used 8000 BPE merges, whereas for the En–De experiments on the full dataset we used 16000 BPE merges. For all the experiments, we used bidirectional GRUs and set the embedding size to 256. We used word dropout with a retain probability of 0.8 and edge dropout with the same probability, and we applied L2 regularization to all the parameters. Translations were obtained with a greedy decoder. We placed residual connections He et al. (2016) before every GCN layer. For the experiments on News Commentary data, we set the GRU (for both encoder and decoder) and CNN hidden states to 512, used Adam Kingma and Ba (2015) as optimizer with an initial learning rate of 0.0002, and trained the models for 50 epochs. For the large-scale experiments on En–De, we set the GRU hidden states to 800 and, instead of greedy decoding, employed beam search (beam size 12). We trained the model for 20 epochs with the same hyperparameters.
Appendix B Datasets Statistics
| Dataset | EN vocabulary | DE vocabulary |
| --- | --- | --- |
| English–German (full) | 50000 | 16000 (BPE) |