in which an encoder (e.g. an RNN) reads the source sentence sequentially to produce a fixed-length vector representation. Then, a decoder generates a translation from the encoded vector. The attention mechanism is one of the most successful extensions to simple encoder-decoder models. Attentional models jointly learn to softly align to the source sentence while generating words in the target language.
In natural language, the relationship between words is considered to form a latent nested hierarchy rather than a sequential order. Therefore, modelling these hierarchies (e.g. with a syntactic tree) is expected to improve the performance of NMT models, especially for linguistically distant language pairs such as English-Farsi. Incorporating syntactic constituents of the source language has been studied in Statistical Machine Translation and has improved translation accuracy. Recently, Eriguchi et al. (2016) proposed a method to incorporate the hierarchical syntactic information of the source sentence. In addition to the embeddings of words, which are computed using the basic attentional method, they compute embeddings of phrases in a bottom-up fashion, directed by a parse tree of the source sentence. Since generating gold-standard parse trees is expensive, they used trees generated by off-the-shelf parsers. However, using the 1-best parse tree can lead to translation mistakes due to parsing errors.
In this paper, we address the aforementioned issue by using a compact forest instead of a single 1-best tree. In this approach, the embeddings of phrases are calculated in a bottom-up fashion using dynamic programming over the exponentially many trees encoded in the forest. Thus, in the encoding stage, different ways of constructing a phrase can be taken into consideration, along with the probabilities of the rules in the corresponding trees. Evaluation of this approach on English-to-Chinese, Persian and German translation showed that forests achieve better performance than both the 1-best-tree and sequential ANMT models.
2 Neural Machine Translation
2.1 Attentional Sequence-to-Sequence model
The architecture of this approach is depicted in Figure 1.
The source sentence is encoded using an RNN whose hidden states are computed as follows:
$\mathbf{h}_i = f_{\text{enc}}(\mathbf{E}_s[x_i], \mathbf{h}_{i-1})$
where $\mathbf{E}_s$ is the embedding lookup table of the source words and $x_i$ is the $i$-th source word. We use a Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber, 1997) as $f_{\text{enc}}$. The context-dependent embedding of each word is the corresponding hidden state of the encoder ($\mathbf{h}_i$).
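As a concrete sketch, the sequential encoder above is simply an LSTM loop over the source word embeddings, with each hidden state serving as that word's context-dependent embedding. The following NumPy sketch is illustrative only; the gate stacking and parameter names are our own, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step; gates are stacked as [input, forget, output, candidate]."""
    d = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:d])          # input gate
    f = sigmoid(z[d:2 * d])      # forget gate
    o = sigmoid(z[2 * d:3 * d])  # output gate
    g = np.tanh(z[3 * d:4 * d])  # candidate cell update
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

def encode(embeddings, W, U, b):
    """Run the LSTM over the source word embeddings; each hidden state h_i
    is the context-dependent embedding of source word i."""
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x in embeddings:
        h, c = lstm_step(x, h, c, W, U, b)
        states.append(h)
    return states
```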
The decoder generates the output using a deep output RNN (Pascanu et al., 2013), which is an RNN followed by an MLP (1-layer in the basic version) that generates the next output word at each time-step. The decoder also uses an attention mechanism to perform a soft alignment over the computed embeddings of the source words.
For the generation of each word $y_j$, the attention mechanism generates a context vector $\mathbf{c}_j$ from the source sentence with respect to the current state of the decoder. Then, the decoder produces the $j$-th word with the following equations:
$\mathbf{s}_j = f_{\text{dec}}(\mathbf{s}_{j-1}, [\mathbf{E}_t[y_{j-1}]; \mathbf{c}_j])$
$y_j \sim \operatorname{softmax}(\operatorname{MLP}(\mathbf{s}_j, \mathbf{c}_j))$
A crucial part of the decoder is the attention mechanism, which provides relevant information from the source sentence with respect to the current state of the decoder. The context vector is calculated as follows:
$\mathbf{c}_j = \sum_{i} \alpha_{ji}\,\mathbf{h}_i, \qquad \alpha_{ji} = \frac{\exp(a_{ji})}{\sum_{i'} \exp(a_{ji'})}$
where $a_{ji}$ is the score determining how well the words around position $i$ in the source sentence match the state of the decoder before generating the $j$-th word.
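The soft-alignment step can be sketched as follows: score each source-word state against the decoder state, normalise the scores with a softmax, and take the attention-weighted sum. The bilinear ("general") scoring function used here is one common choice (Luong et al., 2015) and is an assumption, not necessarily the exact score used in the paper.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()

def context_vector(dec_state, src_states, Wa):
    """Bilinear attention: score each source state h_i against the decoder
    state, normalise, and return the weighted sum plus the alignment weights."""
    scores = np.array([dec_state @ (Wa @ h) for h in src_states])
    alpha = softmax(scores)  # alignment weights over source positions, sum to 1
    ctx = sum(a * h for a, h in zip(alpha, src_states))
    return ctx, alpha
```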
Training of this model is done by minimising the following objective function:
$\mathcal{L} = -\sum_{(\mathbf{x},\mathbf{y}) \in \mathcal{D}} \log P(\mathbf{y} \mid \mathbf{x})$
where $\mathcal{D}$ is the set of bilingual training sentence pairs.
2.2 Attentional Tree-to-Sequence model
Eriguchi et al. (2016) proposed a method to incorporate the hierarchical syntactic information of the source sentence. They compute embeddings of phrases in addition to those of words, and then attend to both words and phrases in the decoder. This process is depicted in Figure 2. Phrase embeddings are computed in a bottom-up fashion directed by the parse tree of the source sentence. Since generating gold-standard parse trees is expensive, they used binary constituency parse trees generated by a Head-driven Phrase Structure Grammar (HPSG) parser.
The encoding process is done in two phases. In the first phase, a standard sequential encoder (Section 2.1) is run over the input sequence. Then a tree-based encoder is run over the embeddings produced by the sequential encoder (this model uses a one-directional LSTM in the sequential encoder). The embedding of node $k$, $\mathbf{h}_k^{\text{phrase}}$, is calculated by combining the embeddings of its children using a non-linear function. Since the tree is binary, the combining operation is as follows:
$\mathbf{h}_k^{\text{phrase}} = f_{\text{tree}}(\mathbf{h}_k^{l}, \mathbf{h}_k^{r})$
where $f_{\text{tree}}$ is a non-linear function, and $\mathbf{h}_k^{l}$ and $\mathbf{h}_k^{r}$ are the hidden states of the left and right children, respectively. This method uses Tree-LSTM units (Tai et al., 2015) to calculate the LSTM state of the parent node from the two children units as follows:
$\mathbf{i}_k = \sigma(U^{(i)}_{l}\mathbf{h}_k^{l} + U^{(i)}_{r}\mathbf{h}_k^{r} + \mathbf{b}^{(i)})$
$\mathbf{f}_k^{l} = \sigma(U^{(f_l)}_{l}\mathbf{h}_k^{l} + U^{(f_l)}_{r}\mathbf{h}_k^{r} + \mathbf{b}^{(f)})$
$\mathbf{f}_k^{r} = \sigma(U^{(f_r)}_{l}\mathbf{h}_k^{l} + U^{(f_r)}_{r}\mathbf{h}_k^{r} + \mathbf{b}^{(f)})$
$\mathbf{o}_k = \sigma(U^{(o)}_{l}\mathbf{h}_k^{l} + U^{(o)}_{r}\mathbf{h}_k^{r} + \mathbf{b}^{(o)})$
$\mathbf{u}_k = \tanh(U^{(u)}_{l}\mathbf{h}_k^{l} + U^{(u)}_{r}\mathbf{h}_k^{r} + \mathbf{b}^{(u)})$
$\mathbf{c}_k = \mathbf{i}_k \odot \mathbf{u}_k + \mathbf{f}_k^{l} \odot \mathbf{c}_k^{l} + \mathbf{f}_k^{r} \odot \mathbf{c}_k^{r}$
$\mathbf{h}_k^{\text{phrase}} = \mathbf{o}_k \odot \tanh(\mathbf{c}_k)$
where $\mathbf{i}_k$, $\mathbf{f}_k^{l}$, $\mathbf{f}_k^{r}$, $\mathbf{o}_k$ and $\mathbf{u}_k$ are the input gate, left and right forget gates, output gate, and the candidate state for updating the memory cell, and $\mathbf{c}_k^{l}$ and $\mathbf{c}_k^{r}$ are the memory cells of the left and right units.
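The child-combination step follows the binary Tree-LSTM of Tai et al. (2015): one input gate, a separate forget gate for each child, and a parent memory cell that merges the two children's cells. A NumPy sketch (parameter names are our own):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tree_lstm_node(hl, cl, hr, cr, P):
    """Binary Tree-LSTM composition (Tai et al., 2015): combine the
    left (hl, cl) and right (hr, cr) child units into the parent unit."""
    i = sigmoid(P["Uil"] @ hl + P["Uir"] @ hr + P["bi"])    # input gate
    fl = sigmoid(P["Ufll"] @ hl + P["Uflr"] @ hr + P["bf"]) # left forget gate
    fr = sigmoid(P["Ufrl"] @ hl + P["Ufrr"] @ hr + P["bf"]) # right forget gate
    o = sigmoid(P["Uol"] @ hl + P["Uor"] @ hr + P["bo"])    # output gate
    u = np.tanh(P["Uul"] @ hl + P["Uur"] @ hr + P["bu"])    # candidate update
    c = i * u + fl * cl + fr * cr  # parent memory cell merges both children
    h = o * np.tanh(c)             # parent hidden state = phrase embedding
    return h, c
```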
The initial state of the decoder is calculated by combining the final states of the sequential and tree encoders as follows:
$\mathbf{s}_0 = g_{\text{tree}}(\mathbf{h}_n, \mathbf{h}_{\text{root}}^{\text{phrase}})$
where $g_{\text{tree}}$ is a function similar to $f_{\text{tree}}$ with another set of parameters. The rest of the decoder is similar to the vanilla attentional decoder discussed in Section 2.1. The difference is that, in this model, the attention mechanism makes use of phrases as well as words. Thus, the context vector is calculated as follows:
$\mathbf{c}_j = \sum_{i=1}^{n} \alpha_{ji}\,\mathbf{h}_i + \sum_{k=1}^{n-1} \alpha_{jk}\,\mathbf{h}_k^{\text{phrase}}$
3 Forest-to-Sequence Attentional Encoder-Decoder Model
The tree-to-sequence model uses the 1-best parse tree generated by a parser. Mistakes and uncertainty in parsing eventually affect the quality of the translation. In this section we propose a novel ANMT model which utilises a forest of trees. The forest-to-sequence model computes embeddings of phrases in a bottom-up fashion based on exponentially many trees instead of the 1-best one. Calculating the embedding of a phrase using each tree separately would result in exponential time complexity. To reduce the computational complexity, we use a forest data structure and dynamic programming for the computation of phrase embeddings. We use the packed forests generated by the parser of Huang (2008), an extension of the Charniak parser (Charniak and Johnson, 2005); the generated trees are not necessarily binary. Tags are discarded, and only the structures of the trees are taken into consideration.
Figure 3 depicts two different parse trees for a phrase. As seen, a phrase can be constructed in more than one way, so multiple embeddings can be computed for it. In order to make the computation of phrase embeddings more efficient, we combine all the different embeddings of a phrase into a unified embedding. In the upper levels of the forest, we use only the unified embeddings of the phrases in the lower levels.
A standard sequential encoder is first run over the input sequence to obtain the embeddings of words. Then, we compute the embedding of each phrase in two phases. First, for each tree that generates a phrase, we calculate an intermediate LSTM representation of the phrase using the tree-LSTM encoder. Then, we combine these intermediate representations using the proposed forest-LSTM encoder to obtain the final representation of the phrase. Our forest-LSTM combines the phrase embeddings resulting from multiple trees into a unified embedding for the phrase as follows:
where $K$ is the number of children, which corresponds to the number of different ways of generating the current phrase, and $p_k$ is the probability of the $k$-th child, i.e. the probability of the rule generating the phrase in the corresponding tree. We use an additional layer to compute an embedding vector for the probability of each child because it helps to impose the probabilities onto the gates with respect to the hidden states and probabilities of all children. Also, in the computation of the probability embedding and the forget gate of a child, we use different weight matrices for the current child and for the other children.
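The merging step can be sketched in NumPy. The gate structure below follows the prose description (a probability embedding per child, one forget gate per child, and separate weights for the current child versus the others), but the exact parameterisation is an illustrative assumption, not the paper's equations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forest_lstm_merge(children, P):
    """Merge K candidate (h, c, prob) tuples for one phrase into a unified
    (h, c). Sketch only: weight matrices Wp (current child) and Wq (other
    children) and the gate wiring are assumptions for illustration."""
    hs = [h for h, _, _ in children]
    cs = [c for _, c, _ in children]
    ps = [p for _, _, p in children]
    pooled = sum(hs)
    # probability embedding of each child, conditioned on its own state (Wp)
    # and the pooled states of the other children (Wq)
    es = [np.tanh(p * (P["Wp"] @ h) + P["Wq"] @ (pooled - h))
          for h, p in zip(hs, ps)]
    i = sigmoid(sum(P["Ui"] @ h + P["Vi"] @ e for h, e in zip(hs, es)) + P["bi"])
    o = sigmoid(sum(P["Uo"] @ h + P["Vo"] @ e for h, e in zip(hs, es)) + P["bo"])
    u = np.tanh(sum(P["Uu"] @ h for h in hs) + P["bu"])
    # one forget gate per child, again distinguishing current vs. others
    fs = [sigmoid(P["Uf"] @ h + P["Vf"] @ (pooled - h) + P["Wf"] @ e + P["bf"])
          for h, e in zip(hs, es)]
    c = i * u + sum(f * ck for f, ck in zip(fs, cs))  # unified memory cell
    h = o * np.tanh(c)                                # unified phrase embedding
    return h, c
```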
For the decoding part, we use the vanilla attentional decoder discussed in Section 2.1, with the change that the attention mechanism attends to both words and phrases. In each decoding step, the context vector is calculated as follows:
where $m$ is the number of different phrases in the forest. In contrast with the tree-to-sequence model, in our method the number of embedded phrases depends on the parser and its ambiguities during parsing. Also, the probability of each phrase is taken into consideration.
We make use of three different language pairs: English (En) to Chinese (Ch) / Persian (Fa) / German (De). For English-Chinese, we use the BTEC corpus, which has 44,016 sentence pairs for training; 'devset1_2' and 'devset_3' are used for development and test purposes, respectively. For En-Fa, we used the TEP dataset (Tiedemann, 2009), which was extracted by aligning English and Persian movie subtitles. It has about 341K sentence pairs. We split the data into 337K, 2K and 2K parts for training, development and test purposes, respectively. For En-De, we use the first 100K sentences of the Europarl v7 part of the WMT'14 training data (http://www.statmt.org/wmt14/translation-task.html), and newstest2013 and newstest2014 are used as development and test sets, respectively.
In the preprocessing stage, datasets are lower-cased and tokenised using the Moses toolkit (Koehn et al., 2007). Sentences longer than 50 words are removed, and words with frequency less than 5 are converted to a special symbol. Compact forests and trees of the English-side sentences are obtained with the parser of Huang (2008). In all experiments, we used the development data only for model selection, and the test data for evaluation purposes.
Training details We use the implementation of Cohn et al. (2016) for the ANMT model, and implement the tree-to-sequence model and our model using the DyNet toolkit (Neubig et al., 2017); experiments were run on an Nvidia Tesla K20m. Validation-based early stopping (on development sets) is applied to regularise the model and prevent overfitting. Translations are decoded greedily, and the BLEU score is computed using the 'multi-bleu.perl' script of the Moses toolkit (Koehn et al., 2007).
4.2 Experimental Results
The perplexity and BLEU scores of the models for all translation tasks are presented in Table 1. In all cases, our model achieved a higher BLEU score (by up to 0.7) compared to the other two models on all of the datasets. Therefore, it can be inferred that using grammar forests can compensate for the errors and ambiguities of parsers.
To analyse the effect of incorporating parse trees, we investigate the effect of the length of the source sentences. We bucket each dataset into three parts. The first part contains pairs whose source sentences have length at most 10, the second part contains source sentences whose length lies between 10 and 20, and the last part has source sentences with more than 20 words. Figure 5 shows the BLEU scores of all models on the bucketed datasets. One might expect that the forest model would only be beneficial for longer sentences, because the number of possible parse trees is higher. However, these results show that the forest model can be beneficial for sentences of all lengths. On the other hand, the highest improvement belongs to the English-Farsi dataset, which is the noisiest one and has the highest uncertainty in parsing. Therefore, it can be concluded that the forest model is most beneficial when there is ambiguity in the sentence, regardless of its length.
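The bucketing used in this analysis is straightforward to reproduce; a minimal helper (boundary handling at exactly 10 and 20 words follows the text):

```python
def bucket_by_source_length(pairs):
    """Split (source, target) sentence pairs into the three length buckets
    used in the analysis: <=10, 11-20, and >20 source words."""
    buckets = {"<=10": [], "11-20": [], ">20": []}
    for src, tgt in pairs:
        n = len(src.split())  # number of source words
        if n <= 10:
            buckets["<=10"].append((src, tgt))
        elif n <= 20:
            buckets["11-20"].append((src, tgt))
        else:
            buckets[">20"].append((src, tgt))
    return buckets
```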
In order to analyse how the models use the information provided by parse trees, we compute the ratio of attention on the tree/forest part to attention on the words part, for both the tree-to-sequence and forest-to-sequence models. For each sentence, we calculate the sum of attention on words and on phrases over all target words. Then, the ratio of attention on phrases to attention on words is computed and averaged over all sentences. Figure 4 shows these attention ratios for the bucketed English-Persian dataset. Forests provide a richer phrase-embedding part, since the amount of attention on forest-computed embeddings is tangibly higher than on tree-computed ones.
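This statistic can be computed directly from the attention matrices. In the sketch below, the matrix layout (word columns before phrase columns) is an assumption for illustration.

```python
import numpy as np

def phrase_to_word_attention_ratio(examples):
    """examples: list of (alpha, num_words) pairs, where alpha has shape
    (num_target_words, num_words + num_phrases) with word columns first.
    Returns the corpus-level average over sentences of
    (attention mass on phrases) / (attention mass on words)."""
    ratios = []
    for alpha, num_words in examples:
        word_mass = alpha[:, :num_words].sum()    # attention spent on words
        phrase_mass = alpha[:, num_words:].sum()  # attention spent on phrases
        ratios.append(phrase_mass / word_mass)
    return float(np.mean(ratios))
```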
We have proposed a forest-based attentional NMT model which uses a packed forest instead of the 1-best tree in the encoder. Although using the phrase-structure tree for computing the embeddings of phrases of the source sentence enhances translation accuracy, parsing errors are inevitable. Using a forest of parse trees and dynamic programming, our method efficiently considers exponentially many grammar trees in order to compensate for parsing errors. Experimental results showed our method is superior to the tree-to-sequence ANMT model.
This research was partly supported by CSIRO’s Data61. We would like to thank Wray Buntine and Bin Li for fruitful discussions.
- Charniak and Johnson (2005) Eugene Charniak and Mark Johnson. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. In Proceedings of the 43rd annual meeting on association for computational linguistics. Association for Computational Linguistics, pages 173–180.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.
- Cohn et al. (2016) Trevor Cohn, Cong Duy Vu Hoang, Ekaterina Vymolova, Kaisheng Yao, Chris Dyer, and Gholamreza Haffari. 2016. Incorporating structural alignment biases into an attentional neural translation model. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics.
- Eriguchi et al. (2016) Akiko Eriguchi, Kazuma Hashimoto, and Yoshimasa Tsuruoka. 2016. Tree-to-sequence attentional neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 823–833.
- Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
- Huang (2008) Liang Huang. 2008. Forest reranking: Discriminative parsing with non-local features. In ACL. pages 586–594.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics on Interactive Poster and Demonstration Sessions. Association for Computational Linguistics, Stroudsburg, PA, USA, ACL ’07, pages 177–180. http://dl.acm.org/citation.cfm?id=1557769.1557821.
- Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.
- Neubig et al. (2017) Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980 .
- Pascanu et al. (2013) Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2013. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026 .
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075 .
- Tiedemann (2009) Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In Recent Advances in Natural Language Processing, volume 5, pages 237–248.