C++ implementation for Neural Network-based NLP, such as LSTM machine translation!
Most of the existing Neural Machine Translation (NMT) models focus on the conversion of sequential data and do not directly use syntactic information. We propose a novel end-to-end syntactic NMT model, extending a sequence-to-sequence model with the source-side phrase structure. Our model has an attention mechanism that enables the decoder to generate a translated word while softly aligning it with phrases as well as words of the source sentence. Experimental results on the WAT'15 English-to-Japanese dataset demonstrate that our proposed model considerably outperforms sequence-to-sequence attentional NMT models and compares favorably with the state-of-the-art tree-to-string SMT system.
C++ code of "Tree-to-Sequence Attentional Neural Machine Translation (tree2seq ANMT)"
Machine Translation (MT) has traditionally been one of the most complex language processing problems, but recent advances in Neural Machine Translation (NMT) make it possible to perform translation using a simple end-to-end architecture. In the Encoder-Decoder model [Cho et al.2014b, Sutskever et al.2014], a Recurrent Neural Network (RNN) called the encoder reads the whole sequence of source words to produce a fixed-length vector, and then another RNN called the decoder generates the target words from the vector. The Encoder-Decoder model has been extended with an attention mechanism [Bahdanau et al.2015, Luong et al.2015a], which allows the model to jointly learn the soft alignment between the source language and the target language. NMT models have achieved state-of-the-art results in English-to-French and English-to-German translation tasks [Luong et al.2015b, Luong et al.2015a]. However, it is yet to be seen whether NMT is competitive with traditional Statistical Machine Translation (SMT) approaches in translation tasks for structurally distant language pairs such as English-to-Japanese.
Figure 1 shows a pair of parallel sentences in English and Japanese. English and Japanese are linguistically distant in many respects; they have different syntactic constructions, and words and phrases are defined in different lexical units. In this example, the Japanese word “緑茶” is aligned with the English words “green” and “tea”, and the English word sequence “a cup of” is aligned with a special symbol “null”, which is not explicitly translated into any Japanese words. One way to solve this mismatch problem is to consider the phrase structure of the English sentence and align the phrase “a cup of green tea” with “緑茶”. In SMT, it is known that incorporating syntactic constituents of the source language into the models improves word alignment [Yamada and Knight2001] and translation accuracy [Liu et al.2006, Neubig and Duh2014]. However, the existing NMT models do not allow us to perform this kind of alignment.
In this paper, we propose a novel attentional NMT model to take advantage of syntactic information. Following the phrase structure of a source sentence, we encode the sentence recursively in a bottom-up fashion to produce a vector representation of the sentence and decode it while aligning the input phrases and words with the output. Our experimental results on the WAT’15 English-to-Japanese translation task show that our proposed model achieves state-of-the-art translation accuracy.
In the RNN encoder, the i-th hidden unit h_i is calculated given the i-th input x_i and the previous hidden unit h_{i-1}:

h_i = f_enc(x_i, h_{i-1}),    (1)

where f_enc is a non-linear function, and the initial hidden unit h_0 is usually set to zeros. The encoding function f_enc is recursively applied until the n-th hidden unit h_n is obtained. The RNN Encoder-Decoder models assume that h_n represents a vector of the meaning of the input sequence up to the n-th word.

After encoding the whole input sentence into the vector space, we decode it in a similar way. The initial decoder unit s_1 is initialized with the input sentence vector (s_1 = h_n). Given the previous target word y_{j-1} and the j-th hidden unit s_j of the decoder, the conditional probability that the j-th target word y_j is generated is calculated as follows:

p(y_j | y_{<j}, x) = g(s_j),    (2)

where g is a non-linear function. The j-th hidden unit of the decoder s_j is calculated by using another non-linear function f_dec as follows:

s_j = f_dec(y_{j-1}, s_{j-1}).    (3)
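As a minimal sketch of this encoder recursion, the code below folds a toy elementwise f_enc over a source sequence starting from h_0 = 0. The scalars w and u stand in for the weight matrices, and the function names are illustrative only; the actual implementation uses Eigen-based LSTM layers.

```cpp
#include <cmath>
#include <vector>

// Toy elementwise encoder step h_i = f_enc(x_i, h_{i-1}) with
// f_enc(x, h) = tanh(w*x + u*h); w and u stand in for weight matrices.
std::vector<double> enc_step(const std::vector<double>& x,
                             const std::vector<double>& h,
                             double w, double u) {
    std::vector<double> out(x.size());
    for (size_t k = 0; k < x.size(); ++k)
        out[k] = std::tanh(w * x[k] + u * h[k]);
    return out;
}

// Fold the recursion over the whole source sequence, starting from h_0 = 0,
// to obtain the sentence vector h_n.
std::vector<double> encode(const std::vector<std::vector<double>>& xs,
                           double w, double u) {
    std::vector<double> h(xs.empty() ? 0 : xs[0].size(), 0.0);
    for (const auto& x : xs) h = enc_step(x, h, w, u);
    return h;
}
```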
We employ Long Short-Term Memory (LSTM) units [Hochreiter and Schmidhuber1997, Gers et al.2000] in place of vanilla RNN units. The t-th LSTM unit consists of several gates and two different types of states: a hidden unit h_t and a memory cell c_t:

i_t = σ(W^(i) x_t + U^(i) h_{t-1} + b^(i)),
f_t = σ(W^(f) x_t + U^(f) h_{t-1} + b^(f)),
o_t = σ(W^(o) x_t + U^(o) h_{t-1} + b^(o)),
c̃_t = tanh(W^(c̃) x_t + U^(c̃) h_{t-1} + b^(c̃)),
c_t = i_t ⊙ c̃_t + f_t ⊙ c_{t-1},
h_t = o_t ⊙ tanh(c_t),    (4)

where each of i_t, f_t, o_t, and c̃_t denotes an input gate, a forget gate, an output gate, and a state for updating the memory cell, respectively. W^(·) and U^(·) are weight matrices, b^(·) is a bias vector, and x_t is the word embedding of the t-th input word. σ(·) is the logistic function, and the operator ⊙ denotes element-wise multiplication between vectors.
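The LSTM step can be sketched as follows. This is a simplified elementwise version in which, for brevity, all four gates share a single pre-activation; the real model uses a separate weight matrix and bias per gate, with Eigen for the linear algebra.

```cpp
#include <cmath>
#include <vector>

// Simplified LSTM state: hidden unit h_t and memory cell c_t.
struct LSTMState {
    std::vector<double> h;
    std::vector<double> c;
};

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// One LSTM step. The scalars w, u, b stand in for the per-gate weight
// matrices and biases; all gates share one pre-activation (simplification).
LSTMState lstm_step(const std::vector<double>& x, const LSTMState& prev,
                    double w, double u, double b) {
    size_t n = x.size();
    LSTMState next{std::vector<double>(n), std::vector<double>(n)};
    for (size_t k = 0; k < n; ++k) {
        double pre = w * x[k] + u * prev.h[k] + b;
        double i = sigmoid(pre);          // input gate i_t
        double f = sigmoid(pre);          // forget gate f_t
        double o = sigmoid(pre);          // output gate o_t
        double cand = std::tanh(pre);     // candidate state c̃_t
        next.c[k] = i * cand + f * prev.c[k];  // c_t = i⊙c̃_t + f⊙c_{t-1}
        next.h[k] = o * std::tanh(next.c[k]);  // h_t = o⊙tanh(c_t)
    }
    return next;
}
```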
The NMT models with an attention mechanism [Bahdanau et al.2015, Luong et al.2015a] have been proposed to softly align each decoder state with the encoder states. The attention mechanism allows the NMT models to explicitly quantify how much each encoder state contributes to the word prediction at each time step.
In the attentional NMT model in luong-pham-manning:2015:EMNLP, at the j-th step of the decoder process, the attention score α_j(i) between the i-th source hidden unit h_i and the j-th target hidden unit s_j is calculated as follows:

α_j(i) = exp(h_i · s_j) / Σ_{k=1}^{n} exp(h_k · s_j),    (5)

where h_i · s_j is the inner product of h_i and s_j, which is used to directly calculate the similarity score between h_i and s_j. The j-th context vector d_j is calculated as the summation of the source hidden units weighted by α_j(i):

d_j = Σ_{i=1}^{n} α_j(i) h_i.    (6)
To incorporate the attention mechanism into the decoding process, the context vector d_j is used for the j-th word prediction by putting an additional hidden layer s̃_j:

s̃_j = tanh(W_d [s_j; d_j] + b_d),    (7)

where [s_j; d_j] is the concatenation of s_j and d_j, and W_d and b_d are a weight matrix and a bias vector, respectively. The model predicts the j-th word by using the softmax function:

p(y_j | y_{<j}, x) = softmax(W_s s̃_j + b_s),    (8)

where W_s and b_s are a weight matrix and a bias vector, respectively. |V| stands for the size of the vocabulary of the target language. Figure 2 shows an example of the NMT model with the attention mechanism.
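The attention step, normalized scores followed by the weighted sum, can be sketched as below. The function names are illustrative, not the repo's API.

```cpp
#include <cmath>
#include <vector>

using Vec = std::vector<double>;

double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (size_t k = 0; k < a.size(); ++k) s += a[k] * b[k];
    return s;
}

// Attention-weighted context vector for one decoder state dec over the
// encoder states enc: alpha(i) ∝ exp(enc[i] · dec), ctx = Σ alpha(i) enc[i].
Vec attention_context(const std::vector<Vec>& enc, const Vec& dec) {
    std::vector<double> score(enc.size());
    double z = 0.0;
    for (size_t i = 0; i < enc.size(); ++i) {
        score[i] = std::exp(dot(enc[i], dec));  // unnormalized exp(h_i · s_j)
        z += score[i];
    }
    Vec ctx(dec.size(), 0.0);
    for (size_t i = 0; i < enc.size(); ++i)
        for (size_t k = 0; k < ctx.size(); ++k)
            ctx[k] += (score[i] / z) * enc[i][k];  // Σ_i α_j(i) h_i
    return ctx;
}
```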
The objective function to train the NMT models is the sum of the log-likelihoods of the translation pairs in the training data:

J(θ) = Σ_{(x,y)∈D} log p(y|x),

where D denotes a set of parallel sentence pairs. The model parameters θ are learned through Stochastic Gradient Descent (SGD).
The existing NMT models treat a sentence as a sequence of words and neglect the structure inherent in language. We propose a novel tree-based encoder in order to explicitly take the syntactic structure into consideration in the NMT model. We focus on the phrase structure of a sentence and construct a sentence vector from phrase vectors in a bottom-up fashion. The sentence vector in the tree-based encoder therefore encodes structural information rather than a flat sequence. Figure 3 shows our proposed model, which we call a tree-to-sequence attentional NMT model.
In Head-driven Phrase Structure Grammar (HPSG) [Sag et al.2003], a sentence is composed of multiple phrase units and represented as a binary tree as shown in Figure 1. Following the structure of the sentence, we construct a tree-based encoder on top of the standard sequential encoder. The k-th parent hidden unit h_k^(phr) for the k-th phrase is calculated using the left and right child hidden units h_k^l and h_k^r as follows:

h_k^(phr) = f_tree(h_k^l, h_k^r),

where f_tree is a non-linear function.
We construct a tree-based encoder with LSTM units, where each node in the binary tree is represented with an LSTM unit. When initializing the leaf units of the tree-based encoder, we employ the sequential LSTM units described in Section 2.1. Each non-leaf node is also represented with an LSTM unit, and we employ Tree-LSTM [Tai et al.2015] to calculate the LSTM unit of the parent node which has two child LSTM units. The hidden unit h_k^(phr) and the memory cell c_k^(phr) for the k-th parent node are calculated as follows:

i_k = σ(U^(i)_l h_k^l + U^(i)_r h_k^r + b^(i)),
f_k^l = σ(U^(f_l)_l h_k^l + U^(f_l)_r h_k^r + b^(f_l)),
f_k^r = σ(U^(f_r)_l h_k^l + U^(f_r)_r h_k^r + b^(f_r)),
o_k = σ(U^(o)_l h_k^l + U^(o)_r h_k^r + b^(o)),
c̃_k = tanh(U^(c̃)_l h_k^l + U^(c̃)_r h_k^r + b^(c̃)),
c_k^(phr) = i_k ⊙ c̃_k + f_k^l ⊙ c_k^l + f_k^r ⊙ c_k^r,
h_k^(phr) = o_k ⊙ tanh(c_k^(phr)),

where i_k, f_k^l, f_k^r, o_k, and c̃_k are an input gate, the forget gates for the left and right child units, an output gate, and a state for updating the memory cell, respectively. c_k^l and c_k^r are the memory cells of the left and right child units, respectively. U^(·) denotes a weight matrix, and b^(·) represents a bias vector.
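The Tree-LSTM parent computation can be sketched as below. As with the earlier LSTM sketch, this is an elementwise simplification in which the gates share one pre-activation and the scalars ul, ur, b stand in for the weight matrices and biases.

```cpp
#include <cmath>
#include <vector>

// A tree node: hidden unit h and memory cell c.
struct Node {
    std::vector<double> h;
    std::vector<double> c;
};

double sigma(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Parent node from left/right children: one input gate, two forget gates
// (one per child memory cell), one output gate, and a candidate state.
Node tree_lstm(const Node& l, const Node& r, double ul, double ur, double b) {
    size_t n = l.h.size();
    Node p{std::vector<double>(n), std::vector<double>(n)};
    for (size_t k = 0; k < n; ++k) {
        double pre = ul * l.h[k] + ur * r.h[k] + b;  // shared pre-activation (simplification)
        double i = sigma(pre);         // input gate
        double fl = sigma(pre);        // forget gate for the left child cell
        double fr = sigma(pre);        // forget gate for the right child cell
        double cand = std::tanh(pre);  // candidate state
        p.c[k] = i * cand + fl * l.c[k] + fr * r.c[k];
        p.h[k] = sigma(pre) * std::tanh(p.c[k]);  // output gate ⊙ tanh(c)
    }
    return p;
}
```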
Our proposed tree-based encoder is a natural extension of the conventional sequential encoder, since Tree-LSTM is a generalization of chain-structured LSTM [Tai et al.2015]. Our encoder differs from the original Tree-LSTM in the calculation of the LSTM units for the leaf nodes. The motivation is to construct the phrase nodes in a context-sensitive way, which, for example, allows the model to compute different representations for multiple occurrences of the same word in a sentence because the sequential LSTMs are calculated in the context of the previous units. This ability contrasts with the original Tree-LSTM, in which the leaves are composed only of the word embeddings without any contextual information.
We now have two different sentence vectors: one is from the sequential encoder and the other is from the tree-based encoder. As shown in Figure 3, we provide another Tree-LSTM unit which has the final sequential encoder unit (h_n) and the tree-based encoder unit (h_root^(phr)) as two child units, and set it as the initial decoder unit s_1 as follows:

s_1 = g_tree(h_n, h_root^(phr)),

where g_tree is the same function as f_tree with another set of Tree-LSTM parameters. This initialization allows the decoder to capture information from both the sequential data and phrase structures. corr/abs/1601.00710 proposed a similar method using a Tree-LSTM for initializing the decoder, with which they translate multiple source languages into one target language. When the syntactic parser fails to output a parse tree for a sentence, we encode the sentence with the sequential encoder alone by setting h_root^(phr) = h_n. Our proposed tree-based encoder therefore works with any sentence.
We adopt the attention mechanism into our tree-to-sequence model in a novel way. Our model gives attention not only to sequential hidden units but also to phrase hidden units. This attention mechanism tells us which words or phrases in the source sentence are important when the model decodes a target word. The j-th context vector d_j is composed of the sequential and phrase hidden units weighted by the attention score α_j(i):

d_j = Σ_{i=1}^{2n-1} α_j(i) h_i,

where h_1, …, h_n are the sequential hidden units and h_{n+1}, …, h_{2n-1} are the phrase hidden units. Note that a binary tree has n−1 phrase nodes if the tree has n leaves. We set a final decoder unit s̃_j in the same way as Equation (7).
In addition, we adopt the input-feeding method [Luong et al.2015a] in our model, which feeds s̃_{j-1}, the previous unit used to predict the word y_{j-1}, into the current target hidden unit s_j:

s_j = f_dec([y_{j-1}; s̃_{j-1}], s_{j-1}),

where [y_{j-1}; s̃_{j-1}] is the concatenation of y_{j-1} and s̃_{j-1}. The input-feeding approach enriches the calculation of the decoder, because s̃_{j-1} is an informative unit that can be used both to predict the output word and to be compacted with the attentional context vectors. luong-pham-manning:2015:EMNLP showed that the input-feeding approach improves BLEU scores. We also observed the same improvement in our preliminary experiments.
The biggest computational bottleneck of training the NMT models is in the calculation of the softmax layer described in Equation (8), because its computational cost increases linearly with the size of the vocabulary. The speedup technique with GPUs has proven useful for sequence-based NMT models [Sutskever et al.2014, Luong et al.2015a] but it is not easily applicable when dealing with tree-structured data. In order to reduce the training cost of the NMT models at the softmax layer, we employ BlackOut [Ji et al.2016], a sampling-based approximation method. BlackOut has been shown to be effective in RNN Language Models (RNNLMs) and allows a model to run reasonably fast even with a million word vocabulary with CPUs.
At each word prediction step in the training, BlackOut estimates the conditional probability in Equation (2) for the target word and K negative samples using a weighted softmax function. The negative samples are drawn from the unigram distribution raised to the power α [Mikolov et al.2013], where the unigram distribution is estimated using the training data. BlackOut is closely related to Noise Contrastive Estimation (NCE) [Gutmann and Hyvärinen2012] and achieves better perplexity than the original softmax and NCE in RNNLMs. The advantages of BlackOut over the other methods are discussed in DBLP:journals/corr/JiVSAD15. Note that BlackOut can be used as the original softmax once the training is finished.
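The distorted unigram distribution that BlackOut draws its negative samples from can be computed as follows; the function name is hypothetical and the counts are toy values.

```cpp
#include <cmath>
#include <vector>

// Distorted unigram distribution for negative sampling:
// q(w) ∝ count(w)^alpha, normalized to sum to 1. Raising counts to a power
// below 1 flattens the distribution so frequent words dominate less.
std::vector<double> distorted_unigram(const std::vector<double>& counts,
                                      double alpha) {
    std::vector<double> q(counts.size());
    double z = 0.0;
    for (size_t w = 0; w < counts.size(); ++w) {
        q[w] = std::pow(counts[w], alpha);
        z += q[w];
    }
    for (double& p : q) p /= z;  // normalize
    return q;
}
```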
We applied the proposed model to the English-to-Japanese translation dataset of the ASPEC corpus given in WAT’15 (http://orchid.kuee.kyoto-u.ac.jp/WAT/WAT2015/index.html). Following zhu:2015:WAT, we extracted the first 1.5 million translation pairs from the training data. To obtain the phrase structures of the source sentences, i.e., English, we used the probabilistic HPSG parser Enju [Miyao and Tsujii2008]. We used Enju only to obtain a binary phrase structure for each sentence and did not use any HPSG-specific information. For the target language, i.e., Japanese, we used KyTea [Neubig et al.2011], a Japanese segmentation tool, and performed the pre-processing steps recommended in WAT’15 (http://orchid.kuee.kyoto-u.ac.jp/WAT/WAT2015/baseline/dataPreparationJE.html). We then filtered out the translation pairs whose sentences are longer than 50 words or whose source sentences were not parsed successfully. Table 1 shows the details of the datasets used in our experiments.
We carried out two experiments on a small training dataset to investigate the effectiveness of our proposed model and on a large training dataset to compare our proposed methods with the other systems.
The vocabulary consists of words observed in the training data at least K times, where K is set separately for the small and large training datasets. The out-of-vocabulary words are mapped to the special token “unk”. We added another special symbol “eos” for both languages and inserted it at the end of all the sentences. Table 2 shows the details of each training dataset and its corresponding vocabulary size.
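The frequency cutoff together with the “unk” and “eos” symbols can be sketched as below; the ID assignment is illustrative only.

```cpp
#include <map>
#include <string>
#include <vector>

// Build a vocabulary that keeps words seen at least min_count times.
// Everything else is later mapped to the special token "unk"; "eos" is
// the end-of-sentence symbol appended to every sentence.
std::map<std::string, int> build_vocab(const std::vector<std::string>& tokens,
                                       int min_count) {
    std::map<std::string, int> freq;
    for (const auto& t : tokens) ++freq[t];
    std::map<std::string, int> vocab;
    int id = 0;
    vocab["unk"] = id++;  // out-of-vocabulary token
    vocab["eos"] = id++;  // end-of-sentence symbol
    for (const auto& kv : freq)
        if (kv.second >= min_count) vocab[kv.first] = id++;
    return vocab;
}
```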
The biases, softmax weights, and BlackOut weights are initialized with zeros. The hyperparameter α of BlackOut is set to 0.4 as recommended by DBLP:journals/corr/JiVSAD15. Following conf/icml/JozefowiczZS15, we initialize the forget gate biases of the LSTM and Tree-LSTM with 1.0. The remaining model parameters in the NMT models in our experiments are initialized with small uniform random values. The model parameters are optimized by plain SGD with a mini-batch size of 128. The initial learning rate of SGD is 1.0, and we halve the learning rate when the development loss becomes worse. Gradient norms are clipped to 3.0 to avoid exploding gradient problems [Pascanu et al.2012].
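The gradient-norm clipping used in training amounts to the following: when the L2 norm of the gradient exceeds the threshold, rescale it to exactly that norm.

```cpp
#include <cmath>
#include <vector>

// Clip the gradient vector to a maximum L2 norm (3.0 in our experiments).
void clip_gradient(std::vector<double>& grad, double max_norm) {
    double norm = 0.0;
    for (double v : grad) norm += v * v;
    norm = std::sqrt(norm);
    if (norm > max_norm) {
        double scale = max_norm / norm;
        for (double& v : grad) v *= scale;  // rescale in place
    }
}
```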
We conduct experiments with our proposed model and the sequential attentional NMT model with the input-feeding approach. Each model has 256-dimensional hidden units and word embeddings. The number of negative samples of BlackOut is set to 500 or 2000.
Our proposed model has 512-dimensional word embeddings and d-dimensional hidden units; we train models with several values of d. The number of negative samples K of BlackOut is set to 2500.
Our code (https://github.com/tempra28/tree2seq) is implemented in C++ using the Eigen library (http://eigen.tuxfamily.org/index.php), a template library for linear algebra, and we run all of the experiments on multi-core CPUs (16 threads on an Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz). It takes about a week to train a model on the large training dataset.
We use beam search to decode a target sentence for an input sentence and calculate the sum of the log-likelihoods of the target sentence as the beam score:
Decoding in the NMT models is a generative process and depends on the target language model given a source sentence. The score becomes smaller as the target sentence becomes longer, and thus the simple beam search does not work well when decoding a long sentence [Cho et al.2014a, Pouget-Abadie et al.2014]. In our preliminary experiments, the beam search with the length normalization in DBLP:journals/corr/ChoMBB14 was not effective in English-to-Japanese translation. The method in pougetabadie-EtAl:2014:SSST-8 needs to estimate the conditional probability using another NMT model and thus is not suitable for our work.
In this paper, we use statistics on sentence lengths in beam search. Assuming that the length of a target sentence correlates with the length of a source sentence, we redefine the score of each candidate as follows:

score(x, y) = log p(y|x) + log p(len(y) | len(x)),

where log p(len(y) | len(x)) is the penalty given by the conditional probability of the target sentence length len(y) given the source sentence length len(x). It allows the model to decode a sentence by considering the length of the target sentence. In our experiments, we computed the conditional probability p(len(y) | len(x)) in advance from the statistics collected over the first one million pairs of the training dataset. We allow the decoder to generate up to 100 words.
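The length-penalized candidate score can be sketched as below; `len_prob` is a hypothetical table of p(len(y) | len(x)) precomputed from training-data length statistics.

```cpp
#include <cmath>
#include <vector>

// Candidate score: sum of target-word log-likelihoods plus the
// log-probability of the target length given the source length.
// len_prob[src_len][tgt_len] is assumed to be precomputed.
double beam_score(double log_likelihood, size_t src_len, size_t tgt_len,
                  const std::vector<std::vector<double>>& len_prob) {
    return log_likelihood + std::log(len_prob[src_len][tgt_len]);
}
```

Because log p(len(y) | len(x)) is at most 0, the penalty never raises a score; it only demotes candidates with unlikely lengths, which is what keeps long sentences competitive under a large beam.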
We evaluated the models by two automatic evaluation metrics, RIBES [Isozaki et al.2010] and BLEU [Papineni et al.2002], following WAT’15. We used the KyTea-based evaluation script for the translation results (http://lotus.kuee.kyoto-u.ac.jp/WAT/evaluation/automatic_evaluation_systems/automaticEvaluationJA.html). The RIBES score is a metric based on rank correlation coefficients with word precision, and the BLEU score is based on n-gram word precision and a Brevity Penalty (BP) for outputs shorter than the references. RIBES is known to have stronger correlation with human judgements than BLEU in translation between English and Japanese, as discussed in Isozaki:2010:AET:1870658.1870750.
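For reference, the standard BLEU Brevity Penalty can be computed as follows; it is 1 when the candidate is at least as long as the reference and decays exponentially for shorter outputs.

```cpp
#include <cmath>

// BLEU Brevity Penalty: 1 if cand_len >= ref_len, else exp(1 - r/c)
// for reference length r and candidate length c.
double brevity_penalty(double ref_len, double cand_len) {
    if (cand_len >= ref_len) return 1.0;
    return std::exp(1.0 - ref_len / cand_len);
}
```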
| Model | Negative samples | Perplexity | RIBES | BLEU | Time/epoch |
| --- | --- | --- | --- | --- | --- |
| Proposed model (Softmax) | — | 17.9 | 73.2 | 21.8 | 180 |
| ANMT [Luong et al.2015a] | 500 | 21.6 | 70.7 | 18.5 | 45 |
| + reverse input | 500 | 22.6 | 69.8 | 17.7 | 45 |
| ANMT [Luong et al.2015a] | 2000 | 23.1 | 71.5 | 19.4 | 60 |
| + reverse input | 2000 | 26.1 | 69.5 | 17.5 | 60 |
Table 3 shows the perplexity, BLEU, RIBES, and the training time on the development data with the Attentional NMT (ANMT) models trained on the small dataset. We conducted the experiments with our proposed method using BlackOut and softmax. We decoded a translation by our proposed beam search with a beam size of 20.
As shown in Table 3, the results of our proposed model with BlackOut improve as the number of negative samples increases. Although the result of softmax is better than those of BlackOut, the training time of softmax per epoch is about three times longer than that of BlackOut, even with the small dataset.
As to the results of the ANMT model, reversing the word order in the input sentence decreases the scores in English-to-Japanese translation, which contrasts with the results of other language pairs reported in previous work [Sutskever et al.2014, Luong et al.2015a]. By taking syntactic information into consideration, our proposed model improves the scores, compared to the sequential attention-based approach.
We found that better perplexity does not always lead to better translation scores with BlackOut as shown in Table 3. One of the possible reasons is that BlackOut distorts the target word distribution by the modified unigram-based negative sampling where frequent words can be treated as the negative samples multiple times at each training step.
Table 4 shows the results on the development data of the proposed method with BlackOut by the simple beam search and our proposed beam search. The beam size is set to 6 or 20 in the simple beam search, and to 20 in our proposed search. We can see that our proposed search outperforms the simple beam search in both scores. Unlike RIBES, the BLEU score is sensitive to the beam size and becomes lower as the beam size increases. We found that the BP had a relatively large impact on the BLEU score in the simple beam search as the beam size increased. Our search method works better than the simple beam search by keeping long sentences among the candidates with a large beam size.
| Method | Beam size | RIBES | BLEU (BP) |
| --- | --- | --- | --- |
| Simple BS | 6 | 72.3 | 20.0 (90.1) |
| Proposed BS | 20 | 72.6 | 20.5 (91.7) |
We also investigated the effects of the sequential LSTMs at the leaf nodes in our proposed tree-based encoder. Table 5 shows the results on the development data of our proposed encoder and of an attentional tree-based encoder without sequential LSTMs, both with BlackOut. (For this evaluation, we used the 1,789 sentences that were successfully parsed by Enju, because the encoder without sequential LSTMs always requires a parse tree.) The results show that our proposed encoder considerably outperforms the encoder without sequential LSTMs, suggesting that the sequential LSTMs at the leaf nodes contribute to the context-aware construction of the phrase representations in the tree.
| Encoder | RIBES | BLEU |
| --- | --- | --- |
| Without sequential LSTMs | 69.4 | 19.5 |
| With sequential LSTMs | 72.3 | 20.0 |
Table 6 shows the experimental results of RIBES and BLEU scores achieved by the trained models on the large dataset. We decoded the target sentences by our proposed beam search with the beam size of 20. (We found two sentences that ended without eos, and decoded them again with a beam size of 1000 following zhu:2015:WAT.) The results of the other systems are the ones reported in nakazawa-EtAl:2015:WAT.
All of our proposed models show similar performance regardless of the value of d. Our ensemble model is composed of the three models above, and it shows the best RIBES score among all systems. (Our ensemble model yields a METEOR [Denkowski and Lavie2014] score of 53.6 with language option “-l other”.)
As for the time required for training, our implementation needs about one day to perform one epoch on the large training dataset. It would take about 11 days without using the BlackOut sampling.
The model of zhu:2015:WAT is an ANMT model [Bahdanau et al.2015]
with a bi-directional LSTM encoder, and uses 1024-dimensional hidden units and 1000-dimensional word embeddings. The model of lee-EtAl:2015:WAT is also an ANMT model with a bi-directional Gated Recurrent Unit (GRU) encoder, and uses 1000-dimensional hidden units and 200-dimensional word embeddings. Both models are sequential ANMT models. Our single proposed model outperforms the best result of zhu:2015:WAT’s end-to-end NMT model with ensemble and unknown replacement in both RIBES and BLEU scores. Our ensemble model shows better performance, in both RIBES and BLEU scores, than zhu:2015:WAT’s best system, which is a hybrid of the ANMT and SMT models, and lee-EtAl:2015:WAT’s ANMT system with special character-based decoding.
PB, HPB and T2S are the baseline SMT systems in WAT’15: a phrase-based model, a hierarchical phrase-based model, and a tree-to-string model, respectively [Nakazawa et al.2015]. The best model in WAT’15 is neubig-morishita-nakamura:2015:WAT’s tree-to-string SMT model enhanced with reranking by ANMT using a bi-directional LSTM encoder. Our proposed end-to-end NMT model compares favorably with neubig-morishita-nakamura:2015:WAT.
| System | RIBES | BLEU |
| --- | --- | --- |
| Proposed model | 81.46 | 34.36 |
| Proposed model | 81.89 | 34.78 |
| Proposed model | 81.58 | 34.87 |
| Ensemble of the above three models | 82.45 | 36.95 |
| ANMT with LSTMs [Zhu2015] | 79.70 | 32.19 |
| + Ensemble, unk replacement | 80.27 | 34.19 |
| + System combination, 3 pre-reordered ensembles | 80.91 | 36.21 |
| ANMT with GRUs [Lee et al.2015], + character-based decoding | 81.15 | 35.75 |
| T2S model [Neubig and Duh2014] | 79.65 | 36.58 |
| + ANMT Rerank [Neubig et al.2015] | 81.38 | 38.17 |

(The three proposed models differ in the dimensionality d of the hidden units.)
We illustrate translations of the test data by our model and several attentional relations observed when decoding a sentence. In Figures 4 and 5, an English sentence represented as a binary tree is translated into Japanese, and several attentional relations between English words or phrases and Japanese words are shown, each with its highest attention score α_j(i). Additional attentional relations are also illustrated for comparison. We can see the target words softly aligned with source words and phrases.
In Figure 4, the Japanese word “液晶” means “liquid crystal”, and it has a high attention score with the English phrase “liquid crystal for active matrix”. This is because the corresponding target hidden unit has the contextual information about the previous words, including “活性 マトリックス の” (“for active matrix” in English). The Japanese word “セル” is softly aligned with the phrase “the cells” with the highest attention score. In Japanese, there is no definite article like “the” in English, and it is usually aligned with null, as described in Section 1.
In Figure 5, in the case of the Japanese word “示” (“showed” in English), the attention score with the English phrase “showed excellent performance” is higher than that with the English word “showed”. The Japanese word “の” (“of” in English) is softly aligned with the phrase “of Si dot MOS capacitor” with the highest attention score. This is because our attention mechanism takes the previous context of the Japanese phrases “優れ た 性能” (“excellent performance” in English) and “Ｓｉ ドット ＭＯＳ コンデンサ” (“Si dot MOS capacitor” in English) into account and softly aligns the target words with the whole phrase when translating the English verb “showed” and the preposition “of”. Our proposed model can thus flexibly learn the attentional relations between English and Japanese.
We observed that our model translated the word “active” into “活性”, a synonym of the reference word “アクティブ”. We also found similar examples in other sentences, where our model outputs synonyms of the reference words, e.g. “女” and “女性” (“female” in English) and “NASA” and “航空宇宙局” (“National Aeronautics and Space Administration” in English). These translations are penalized in terms of BLEU scores, but they do not necessarily mean that the translations were wrong. This point may be supported by the fact that the NMT models were highly evaluated in WAT’15 by crowd sourcing [Nakazawa et al.2015].
kalchbrenner-blunsom:2013:EMNLP were the first to propose an end-to-end NMT model using Convolutional Neural Networks (CNNs) as the source encoder and using RNNs as the target decoder. The Encoder-Decoder model can be seen as an extension of their model, and it replaces the CNNs with RNNs using GRUs [Cho et al.2014b] or LSTMs [Sutskever et al.2014].
NIPS2014_5346 have shown that reversing the input sequences is effective in a French-to-English translation task, and the technique has also proven effective in translation tasks between other European language pairs [Luong et al.2015a]. All of the NMT models mentioned above are based on sequential encoders. To incorporate structural information into the NMT models, DBLP:journals/corr/ChoMBB14 proposed to jointly learn structures inherent in source-side languages but did not report improvement of translation performance. These studies motivated us to investigate the role of syntactic structures explicitly given by existing syntactic parsers in the NMT models.
The attention mechanism [Bahdanau et al.2015]
has promoted NMT onto the next stage. It enables the NMT models to translate while aligning the target with the source. luong-pham-manning:2015:EMNLP refined the attention model so that it can dynamically focus on local windows rather than the entire sentence. They also proposed a more effective attentional path in the calculation of ANMT models. Subsequently, several ANMT models have been proposed [Cheng et al.2016, Cohn et al.2016]; however, each model is based on the existing sequential attentional models and does not focus on the syntactic structure of languages.
In this paper, we propose a novel syntactic approach that extends attentional NMT models. We focus on the phrase structure of the input sentence and build a tree-based encoder following the parsed tree. Our proposed tree-based encoder is a natural extension of the sequential encoder model, where the leaf units of the tree-LSTM in the encoder can work together with the original sequential LSTM encoder. Moreover, the attention mechanism allows the tree-based encoder to align not only the input words but also input phrases with the output words. Experimental results on the WAT’15 English-to-Japanese translation dataset demonstrate that our proposed model achieves the best RIBES score and outperforms the sequential attentional NMT model.
We thank the anonymous reviewers for their constructive comments and suggestions. This work was supported by CREST, JST, and JSPS KAKENHI Grant Number 15J12597.