Incorporating Syntactic Uncertainty in Neural Machine Translation with Forest-to-Sequence Model

11/19/2017 ∙ by Poorya Zaremoodi, et al.

Incorporating syntactic information in Neural Machine Translation (NMT) models is a way to compensate for their need for large amounts of parallel training text, especially for low-resource language pairs. Previous work on using syntactic information provided by (inevitably error-prone) parsers has been promising. In this paper, we propose a forest-to-sequence Attentional Neural Machine Translation (ANMT) model that makes use of exponentially many parse trees of the source sentence to compensate for parser errors. Our method represents the collection of parse trees as a packed forest, and learns a neural attentional transduction model from the forest to the target sentence. Experiments on English to German, Chinese and Persian translation show the superiority of our method over the tree-to-sequence and vanilla sequence-to-sequence neural translation models.




1 Introduction

Recently, a great deal of attention has been focused on Neural Machine Translation (NMT). The main NMT approach is based on the encoder-decoder architecture Cho et al. (2014); Sutskever et al. (2014), in which an encoder (e.g., an RNN) reads the source sentence sequentially to produce a fixed-length vector representation. Then, a decoder generates a translation from the encoded vector. The attention mechanism is one of the most successful extensions to the simple encoder-decoder model. Attentional models jointly learn to softly align to the source sentence while they generate words in the target language.

In natural language, the relationships between words are thought to form a latent nested hierarchy rather than a simple sequence. Therefore, modelling these hierarchies (e.g., syntactic trees) is expected to improve the performance of NMT models, especially for linguistically distant language pairs such as English-Farsi. Incorporating syntactic constituents of the source language has been studied in Statistical Machine Translation, where it has improved translation accuracy. Recently, Eriguchi et al. (2016) proposed a method to incorporate the hierarchical syntactic information of the source sentence. In addition to the embeddings of words, which are computed using the basic attentional method, they compute the embeddings of phrases in a bottom-up fashion, directed by a parse tree of the source sentence. Since generating gold-standard parse trees is expensive, they used trees generated by off-the-shelf parsers. However, using only the 1-best parse tree can lead to translation mistakes due to parsing errors.

In this paper, we address this issue by using a compact forest instead of a single 1-best tree. In this approach, the embeddings of phrases are calculated bottom-up with dynamic programming over the exponentially many trees encoded in the forest. Thus, in the encoding stage, the different ways of constructing a phrase can be taken into consideration, along with the probabilities of the rules in the corresponding trees. Evaluation on English to Chinese, Persian and German translation shows that the forest-based model performs better than the 1-best tree and sequential ANMT models.

2 Neural Machine Translation

2.1 Attentional Sequence-to-Sequence model

The architecture of this approach is depicted in Figure 1.

2.1.1 Encoder

The source sentence is encoded using an RNN, whose hidden states are computed as follows:

$$h_i = \mathrm{RNN}(h_{i-1}, E_S[x_i])$$

where $E_S[x_i]$ is the embedding of the source word $x_i$, looked up from the source embedding table $E_S$. We use the Long Short-Term Memory (LSTM) unit Hochreiter and Schmidhuber (1997) as the RNN. The context-dependent embedding of each word is the corresponding hidden state of the encoder ($h_i$).

Figure 1: Attentional Encoder-Decoder model.

2.1.2 Decoder

The decoder generates output using a deep output RNN Pascanu et al. (2013), that is, an RNN followed by an MLP (one layer in the basic version), to generate the next output word at each time step. The decoder also includes an attention mechanism that performs soft alignment to the computed embeddings of the source words.

To generate the $j$-th target word, the attention mechanism produces a context vector from the source sentence with respect to the current state of the decoder. The decoder then produces the $j$-th word from its RNN state, this context vector and the embedding of a target word looked up from the target-language embedding table; the weight matrices involved are learned parameters. The last term in Equation 1 is the input-feeding term Luong et al. (2015).

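A crucial part of the decoder is the attention mechanism, which provides relevant information from the source sentence with respect to the current state of the decoder. The context vector is calculated as follows:

$$c_j = \sum_{i=1}^{n} \alpha_{ji} h_i, \qquad \alpha_{ji} = \frac{\exp(e_{ji})}{\sum_{i'=1}^{n} \exp(e_{ji'})}$$

where $h_i$ is the encoder hidden state of the $i$-th of the $n$ source words, and $e_{ji}$ is the score determining how well the words around position $i$ in the source sentence match the state of the decoder before generating the $j$-th word.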
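The context vector computation can be illustrated with a minimal sketch in plain Python. The dot-product score used here is an assumption made for illustration (the score function is not specified in this text), and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalise raw scores into attention weights that sum to one."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def attention_context(decoder_state: Vector,
                      source_states: List[Vector]) -> Tuple[List[float], Vector]:
    """Return (attention weights, context vector) for one decoding step.

    Assumption: the score is the dot product of the decoder state with
    each source state; other score functions can be substituted.
    """
    scores = [dot(decoder_state, h) for h in source_states]
    weights = softmax(scores)
    dim = len(source_states[0])
    # The context vector is the attention-weighted sum of the source states.
    context = [sum(w * h[d] for w, h in zip(weights, source_states))
               for d in range(dim)]
    return weights, context


weights, context = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(weights, context)
```

Because the weights are normalised, the context vector is always a convex combination of the source embeddings, whichever score function is chosen.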

The model is trained by minimising the following objective function:

$$J(\theta) = -\sum_{(x, y) \in D} \log P_{\theta}(y \mid x)$$

where $D$ is the set of bilingual training sentence pairs and $\theta$ denotes the model parameters.

2.2 Attentional Tree-to-Sequence model

Eriguchi et al. (2016) proposed a method to incorporate the hierarchical syntactic information of the source sentence. They compute embeddings of phrases in addition to those of words, and then attend to both words and phrases in the decoder. This process is depicted in Figure 2. Phrase embeddings are computed in a bottom-up fashion directed by the parse tree of the source sentence. Since generating gold-standard parse trees is expensive, they used binary constituency parse trees generated by a Head-driven Phrase Structure Grammar (HPSG) parser.

Figure 2: Attentional Tree-to-Sequence model.

2.2.1 Encoder

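The encoding process has two phases. In the first phase, a standard sequential encoder (Section 2.1) is run on the input sequence. Then a tree-based encoder is run on the embeddings produced by the sequential encoder (this model uses a one-directional LSTM in the sequential encoder). The embedding of node $k$, $h^{(phr)}_k$, is calculated by combining the embeddings of its children using a non-linear function. Since the tree is binary, the combining operation is as follows:

$$h^{(phr)}_k = f_{tree}(h^l_k, h^r_k)$$

where $f_{tree}$ is a non-linear function, and $h^l_k$ and $h^r_k$ are the hidden states of the left and right children, respectively. This method uses Tree-LSTM components Tai et al. (2015) to calculate the LSTM unit of the parent node from the two child units as follows:

$$
\begin{aligned}
i_k &= \sigma\left(U^{(i)}_l h^l_k + U^{(i)}_r h^r_k + b^{(i)}\right), \\
f^l_k &= \sigma\left(U^{(f_l)}_l h^l_k + U^{(f_l)}_r h^r_k + b^{(f_l)}\right), \\
f^r_k &= \sigma\left(U^{(f_r)}_l h^l_k + U^{(f_r)}_r h^r_k + b^{(f_r)}\right), \\
o_k &= \sigma\left(U^{(o)}_l h^l_k + U^{(o)}_r h^r_k + b^{(o)}\right), \\
\tilde{c}_k &= \tanh\left(U^{(\tilde{c})}_l h^l_k + U^{(\tilde{c})}_r h^r_k + b^{(\tilde{c})}\right), \\
c^{(phr)}_k &= i_k \odot \tilde{c}_k + f^l_k \odot c^l_k + f^r_k \odot c^r_k, \\
h^{(phr)}_k &= o_k \odot \tanh\left(c^{(phr)}_k\right),
\end{aligned}
$$

where $i_k$, $f^l_k$, $f^r_k$, $o_k$ and $\tilde{c}_k$ are the input gate, the left and right forget gates, the output gate, and a state for updating the memory cell; $c^l_k$ and $c^r_k$ are the memory cells of the left and right units; and the matrices $U$ and biases $b$ are parameters.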
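The binary Tree-LSTM composition (an input gate, left and right forget gates, an output gate and a candidate state for the memory cell) can be sketched in plain Python as follows. The parameter layout (one left matrix, one right matrix and one bias per gate), the random initialisation and the names `init_params` and `tree_lstm_compose` are assumptions made for illustration; the original implementation uses DyNet.

```python
import math
import random
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Unit = Tuple[Vector, Vector]  # (hidden state, memory cell)
Params = Dict[str, Tuple[Matrix, Matrix, Vector]]

# Gate names: input, left forget, right forget, output, candidate state.
GATES = ("i", "f_l", "f_r", "o", "c_tilde")


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def init_params(dim: int, seed: int = 0) -> Params:
    """Hypothetical initialisation: small uniform weights, zero biases."""
    rng = random.Random(seed)

    def rand_matrix() -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    return {g: (rand_matrix(), rand_matrix(), [0.0] * dim) for g in GATES}


def tree_lstm_compose(params: Params, left: Unit, right: Unit) -> Unit:
    """Compute the parent (hidden state, memory cell) from two child units."""
    (h_l, c_l), (h_r, c_r) = left, right
    pre: Dict[str, Vector] = {}
    for gate, (u_left, u_right, bias) in params.items():
        a, b = matvec(u_left, h_l), matvec(u_right, h_r)
        pre[gate] = [x + y + z for x, y, z in zip(a, b, bias)]
    i = [sigmoid(x) for x in pre["i"]]
    f_l = [sigmoid(x) for x in pre["f_l"]]
    f_r = [sigmoid(x) for x in pre["f_r"]]
    o = [sigmoid(x) for x in pre["o"]]
    c_tilde = [math.tanh(x) for x in pre["c_tilde"]]
    # Each child's memory cell passes through its own forget gate.
    c = [i[d] * c_tilde[d] + f_l[d] * c_l[d] + f_r[d] * c_r[d]
         for d in range(len(c_tilde))]
    h = [o[d] * math.tanh(c[d]) for d in range(len(c))]
    return h, c


params = init_params(dim=2)
parent = tree_lstm_compose(params, ([0.1, 0.2], [0.0, 0.1]), ([0.3, -0.1], [0.2, 0.0]))
print(parent)
```

Applying `tree_lstm_compose` bottom-up along a binary parse tree, with the word units from the sequential encoder as leaves, yields one embedding per phrase node.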

2.2.2 Decoder

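The initial state of the decoder is calculated by combining the final states of the sequential and tree encoders as follows:

$$s_1 = g_{tree}(h_n, h^{(phr)}_{root})$$

where $h_n$ is the final state of the sequential encoder, $h^{(phr)}_{root}$ is the state of the root of the tree, and $g_{tree}$ is a function similar to the composition function of the tree encoder, with another set of parameters. The rest of the decoder is similar to the vanilla attentional decoder discussed in Section 2.1.2. The difference is that, in this model, the attention mechanism makes use of phrases as well as words. Thus, the context vector is calculated as follows:

$$c_j = \sum_{i=1}^{n} \alpha_{ji} h_i + \sum_{k=1}^{n-1} \alpha_{j,n+k}\, h^{(phr)}_k$$

where the first sum ranges over the $n$ word embeddings, the second over the $n-1$ phrase nodes of the binary tree, and the $\alpha$ are the attention weights at decoding step $j$.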

3 Forest-to-Sequence Attentional Encoder-Decoder Model

The tree-to-sequence model uses the 1-best parse tree generated by a parser. Mistakes and uncertainty in parsing eventually affect the performance of the translation. In this section, we propose a novel ANMT model which utilises a forest of trees. The forest-to-sequence model computes embeddings of phrases in a bottom-up fashion based on exponentially many trees instead of the 1-best one. Calculating the embeddings of phrases using each tree separately would result in exponential time complexity. To reduce the computational complexity, we use a forest data structure and dynamic programming for the computation of phrase embeddings. We use the packed forests generated by the parser of Huang (2008), which is an extension to the Charniak parser Charniak and Johnson (2005); the generated trees are not necessarily binary. Tags are discarded, and only the structures of the trees are taken into consideration.

Figure 3: An example of generating a phrase from two different parse trees

Figure 3 depicts two different parse trees for a phrase. As can be seen, a phrase can be constructed in more than one way, so multiple embeddings can be computed for it. To make the computation of phrase embeddings more efficient, we combine all the different embeddings of a phrase into a unified embedding. In the upper levels of the forest, we use only the unified embeddings of the phrases in lower levels.

To compute the embeddings of words, a standard sequential encoder is run on the input sequence. Then, we compute the embedding of each phrase in two phases. First, for each tree that generates the phrase, we calculate the intermediate LSTM representation of the phrase using the tree-LSTM encoder. Then, we combine these intermediate representations using the proposed forest-LSTM encoder to obtain the final representation of the phrase. In this way, our forest-LSTM combines the phrase embeddings resulting from multiple trees into a unified embedding for the phrase.

In the forest-LSTM, the children of a phrase node correspond to the different ways of generating the phrase, and the probability of each child is the probability of the rule generating the phrase in the corresponding tree. We use an additional layer to compute an embedding vector for the probability of each child, because this helps to inject the probabilities into the gates with respect to the hidden states and probabilities of all children. Also, in computing the probability embedding and the forget gate of a child, we use different weight matrices for the current child and for the other children.
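The dynamic programming over a packed forest can be illustrated with a deliberately simplified sketch. It is not the forest-LSTM: the gated combination is replaced by a normalised probability-weighted average, and the Tree-LSTM by a `tanh` of the element-wise mean. The forest representation (a dictionary from spans to lists of derivations) and all names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple

Span = Tuple[int, int]                  # (start, end) word positions, end exclusive
Vector = List[float]
Derivation = Tuple[float, List[Span]]   # (rule probability, child spans)


def compose(children: List[Vector]) -> Vector:
    """Stand-in for the Tree-LSTM: tanh of the element-wise mean.

    Accepts any number of children, since forest trees need not be binary.
    """
    n = len(children)
    return [math.tanh(sum(values) / n) for values in zip(*children)]


def encode_forest(word_vectors: List[Vector],
                  forest: Dict[Span, List[Derivation]]) -> Dict[Span, Vector]:
    """Compute one unified embedding per span, bottom-up with memoisation."""
    memo: Dict[Span, Vector] = {(i, i + 1): v for i, v in enumerate(word_vectors)}

    def embed(span: Span) -> Vector:
        if span in memo:                # each span is computed only once
            return memo[span]
        derivations = forest[span]
        total = sum(prob for prob, _ in derivations)
        unified = [0.0] * len(word_vectors[0])
        for prob, child_spans in derivations:
            # Children contribute their unified embeddings, not per-tree ones.
            candidate = compose([embed(child) for child in child_spans])
            for d, value in enumerate(candidate):
                unified[d] += (prob / total) * value
        memo[span] = unified
        return unified

    for span in forest:
        embed(span)
    return memo


# A three-word phrase that can be built in two ways.
example_forest: Dict[Span, List[Derivation]] = {
    (0, 3): [(0.6, [(0, 1), (1, 3)]), (0.4, [(0, 2), (2, 3)])],
    (1, 3): [(1.0, [(1, 2), (2, 3)])],
    (0, 2): [(1.0, [(0, 1), (1, 2)])],
}
embeddings = encode_forest([[0.1, 0.2], [0.3, 0.1], [0.0, 0.4]], example_forest)
print(embeddings[(0, 3)])
```

Because every span is embedded once and reused by all derivations above it, the cost grows with the number of derivations in the forest rather than with the number of trees it encodes.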

For decoding, we use the vanilla attentional decoder discussed in Section 2.1.2, with the change that the attention mechanism attends to both words and phrases. In each decoding step, the context vector is calculated as follows:

$$c_j = \sum_{i=1}^{n} \alpha_{ji} h_i + \sum_{k=1}^{N} \alpha_{j,n+k}\, h^{(phr)}_k$$

where $n$ is the number of source words, $h_i$ and $h^{(phr)}_k$ are the word and phrase embeddings, the $\alpha$ are the attention weights, and $N$ is the number of different phrases in the forest. In contrast to the tree-to-sequence model, the number of embedded phrases in our method depends on the parser and the ambiguities it encounters during parsing. Also, the probability of each phrase is taken into consideration.

4 Experiments

                                               En→De              En→Ch              En→Fa
Method                                    H    Perplexity  BLEU   Perplexity  BLEU   Perplexity  BLEU
Forest-to-sequence (proposed)             512  29.25       13.43  5.49        28.39  16.66       12.38
Forest-to-sequence (proposed)             256  30.83       13.54  6.16        27.08  17.62       11.91
Tree-to-sequence, Eriguchi et al. (2016)  512  31.86       13.05  5.71        28     16.28       11.71
Tree-to-sequence, Eriguchi et al. (2016)  256  30.13       13     6.17        26.85  17.94       11.32
Vanilla ANMT, Luong et al. (2015)         512  32.61       12.21  6.12        26.77  18.4        10.93
Vanilla ANMT, Luong et al. (2015)         256  33.07       11.98  6.48        25.43  19.21       10.17

Table 1: Comparison of the methods with different hidden dimension sizes (H) on all datasets.

4.1 Datasets

We use three language pairs: English (En) to Chinese (Ch), Persian (Fa) and German (De). For En→Ch, we use the BTEC corpus, which has 44,016 sentence pairs for training; 'devset1_2' and 'devset_3' are used for development and testing, respectively. For En→Fa, we use the TEP dataset Tiedemann (2009), which is extracted by aligning English and Persian movie subtitles. It has about 341K sentence pairs, which we split into 337K, 2K and 2K parts for training, development and testing, respectively. For En→De, we use the first 100K sentences of the Europarl v7 part of the WMT'14 training data, with newstest2013 and newstest2014 as development and test sets, respectively.

In the preprocessing stage, the datasets are lower-cased and tokenised using the Moses toolkit Koehn et al. (2007). Sentences longer than 50 words are removed, and words with frequency less than 5 are converted to a special symbol. Compact forests and trees for the English-side sentences are obtained from the Huang (2008) parser. In all experiments, we use the development data only for model selection, and the test data for evaluation.
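The preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for the Moses tokeniser, the length filter is applied to both sides of a pair, word frequencies are counted after filtering, and the special symbol is named `<unk>`. None of these details are specified in the text.

```python
from collections import Counter
from typing import List, Tuple

UNK = "<unk>"  # assumed name for the special symbol

TokenPair = Tuple[List[str], List[str]]


def preprocess(pairs: List[Tuple[str, str]],
               max_len: int = 50,
               min_freq: int = 5) -> List[TokenPair]:
    """Lower-case, tokenise, drop long sentences and mask rare words."""
    # Whitespace splitting is a stand-in for the Moses tokeniser.
    tokenised = [(src.lower().split(), tgt.lower().split()) for src, tgt in pairs]
    kept = [(s, t) for s, t in tokenised
            if len(s) <= max_len and len(t) <= max_len]
    src_counts = Counter(word for s, _ in kept for word in s)
    tgt_counts = Counter(word for _, t in kept for word in t)

    def mask(tokens: List[str], counts: Counter) -> List[str]:
        return [w if counts[w] >= min_freq else UNK for w in tokens]

    return [(mask(s, src_counts), mask(t, tgt_counts)) for s, t in kept]


corpus = [("The cat", "Die Katze")] * 5 + [("A dog", "Ein Hund")]
for source, target in preprocess(corpus):
    print(source, target)
```

In this toy corpus the words of the first pair occur five times and are kept, while the words of the last pair occur once and are replaced by the special symbol.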

Training details. We use the Cohn et al. (2016) implementation of the ANMT model, and implement the tree-to-sequence model and our model on top of it using the DyNet toolkit Neubig et al. (2017). Models are trained end-to-end using Stochastic Gradient Descent (SGD) with a mini-batch size of 128. Each model is trained for up to 20 epochs on a GPU (Nvidia Tesla K20m). Validation-based early stopping (on the development sets) is applied to regularise the model and prevent overfitting. Translations are decoded greedily, and BLEU scores are computed using the 'multi-bleu.perl' script of the Moses toolkit Koehn et al. (2007).

4.2 Experimental Results

The perplexity and BLEU scores of the models for all translation tasks are presented in Table 1. In all cases, our model achieves a higher BLEU score than the two other models: up to about 0.7 BLEU over the tree-to-sequence model, which is the stronger of the two baselines, and up to about 1.7 BLEU over the vanilla ANMT model. Therefore, it can be inferred that using forests can compensate for the errors and ambiguities of parsers.
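As an arithmetic check, the BLEU gains of the forest-to-sequence model can be recomputed from the scores reported in Table 1. The sketch below copies the BLEU columns of the table; the dictionary layout and the function names are illustrative choices.

```python
from typing import Dict, Tuple

# BLEU scores copied from Table 1, keyed by (language pair, hidden size H).
Scores = Dict[Tuple[str, int], float]

FOREST: Scores = {("En-De", 512): 13.43, ("En-De", 256): 13.54,
                  ("En-Ch", 512): 28.39, ("En-Ch", 256): 27.08,
                  ("En-Fa", 512): 12.38, ("En-Fa", 256): 11.91}
TREE: Scores = {("En-De", 512): 13.05, ("En-De", 256): 13.0,
                ("En-Ch", 512): 28.0, ("En-Ch", 256): 26.85,
                ("En-Fa", 512): 11.71, ("En-Fa", 256): 11.32}
VANILLA: Scores = {("En-De", 512): 12.21, ("En-De", 256): 11.98,
                   ("En-Ch", 512): 26.77, ("En-Ch", 256): 25.43,
                   ("En-Fa", 512): 10.93, ("En-Fa", 256): 10.17}


def gains(system: Scores, baseline: Scores) -> Scores:
    """BLEU difference (system minus baseline) for every setting."""
    return {key: system[key] - baseline[key] for key in system}


def max_gain(system: Scores, baseline: Scores) -> Tuple[Tuple[str, int], float]:
    """Setting with the largest BLEU gain, and the size of that gain."""
    diff = gains(system, baseline)
    best = max(diff, key=lambda key: diff[key])
    return best, diff[best]


print(max_gain(FOREST, TREE))
print(max_gain(FOREST, VANILLA))
```

The largest gain over the tree-to-sequence model is on En→Fa with H = 512 (about 0.67 BLEU), and the largest gain over the vanilla model is on En→Fa with H = 256 (about 1.74 BLEU); the gain is positive in every setting.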

To analyse the effect of incorporating parse trees, we investigate the effect of source sentence length. We bucket each dataset into three parts: pairs whose source sentences have at most 10 words, pairs whose source sentences have between 10 and 20 words, and pairs whose source sentences have more than 20 words. Figure 5 shows the BLEU scores of all models on the bucketed datasets. One might expect the forest model to be beneficial only for longer sentences, because the number of possible parse trees is higher. However, the results show that the forest model can be beneficial for sentences of all lengths. Moreover, the largest improvement is on the English→Farsi dataset, which is the noisiest one and has the highest uncertainty in parsing. Therefore, it can be concluded that the forest model is most beneficial when there is ambiguity in the sentence, regardless of its length.
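The bucketing by source length can be sketched as follows. The bucket labels and the treatment of the boundaries (sentences of exactly 20 words go to the middle bucket) are assumptions, since the text does not state them precisely.

```python
from typing import Dict, List, Tuple

SentencePair = Tuple[List[str], List[str]]  # (source tokens, target tokens)


def length_bucket(num_source_words: int) -> str:
    """Map a source sentence length to one of the three buckets."""
    if num_source_words <= 10:
        return "<=10"
    if num_source_words <= 20:
        return "11-20"
    return ">20"


def bucket_pairs(pairs: List[SentencePair]) -> Dict[str, List[SentencePair]]:
    """Group sentence pairs by the length of their source side."""
    buckets: Dict[str, List[SentencePair]] = {"<=10": [], "11-20": [], ">20": []}
    for source, target in pairs:
        buckets[length_bucket(len(source))].append((source, target))
    return buckets


toy_pairs = [(["w"] * n, ["v"] * n) for n in (3, 10, 11, 20, 21, 40)]
for label, members in bucket_pairs(toy_pairs).items():
    print(label, len(members))
```

BLEU is then computed separately on the test pairs of each bucket.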

To analyse how the models use the information provided by parse trees, we compute the ratio of attention on the tree/forest part to attention on the word part for both the tree-to-sequence and forest-to-sequence models. For each sentence, we calculate the sums of attention on words and on phrases over all target words. Then the ratio of attention on phrases to attention on words is computed and averaged over all sentences. Figure 4 shows these attention ratios for the bucketed English→Persian dataset. The attention on forest-computed embeddings is noticeably higher than on tree-computed ones, which indicates that forests provide a richer phrase embedding part.
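The attention ratio can be sketched as follows. The layout of the attention matrix (one row per target word, with the word columns first and the phrase columns after them) and the function names are assumptions made for illustration.

```python
from typing import List, Tuple

AttentionMatrix = List[List[float]]  # rows: target words; columns: words, then phrases


def attention_ratio(attention: AttentionMatrix, num_words: int) -> float:
    """Ratio of total attention on phrases to total attention on words."""
    word_mass = sum(sum(row[:num_words]) for row in attention)
    phrase_mass = sum(sum(row[num_words:]) for row in attention)
    return phrase_mass / word_mass


def mean_attention_ratio(sentences: List[Tuple[AttentionMatrix, int]]) -> float:
    """Average the per-sentence ratio over all sentences."""
    ratios = [attention_ratio(matrix, n) for matrix, n in sentences]
    return sum(ratios) / len(ratios)


# Two target words attending to two source words and one phrase.
toy_attention = [[0.5, 0.3, 0.2],
                 [0.4, 0.4, 0.2]]
print(attention_ratio(toy_attention, num_words=2))
```

A higher ratio means that the decoder places more of its attention on phrase embeddings relative to word embeddings.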

Figure 4: Comparison of attention ratios for the bucketed English→Persian dataset.
Figure 5: Comparison of BLEU scores for datasets bucketed based on the length of source sentences.

5 Conclusion

We have proposed a forest-based attentional NMT model which uses a packed forest instead of the 1-best tree in the encoder. Although using the phrase structure tree to compute the embeddings of source-sentence phrases enhances translation accuracy, parsing errors are inevitable. Using a forest of parse trees and dynamic programming, our method efficiently considers exponentially many parse trees in order to compensate for parsing errors. Experimental results show that our method is superior to the tree-to-sequence ANMT model.

6 Acknowledgment

This research was partly supported by CSIRO’s Data61. We would like to thank Wray Buntine and Bin Li for fruitful discussions.