Zoph_RNN
GPU based language modeling and machine translation toolkit
We build a multi-source machine translation model and train it to maximize the probability of a target English string given French and German sources. Using the neural encoder-decoder framework, we explore several combination methods and report up to +4.8 Bleu increases on top of a very strong attention-based neural translation model.
Kay (2000) points out that if a document is translated once, it is likely to be translated again and again into other languages. This gives rise to an interesting idea: a human does the first translation by hand, then turns the rest over to machine translation (MT). The translation system now has two strings as input, which can reduce ambiguity via “triangulation” (Kay’s term). For example, the normally ambiguous English word “bank” may be more easily translated into French in the presence of a second, German input string containing the word “Flussufer” (river bank).
Och and Ney (2001) describe such a multi-source MT system. They first train separate bilingual MT systems $F \to E$, $G \to E$, etc. At runtime, they separately translate input strings $f$ and $g$ into candidate target strings $e_1$ and $e_2$, then select the best one of the two. A typical selection factor is the product of the system scores. Schwartz (2008) revisits such factors in the context of log-linear models and Bleu score, while Max et al. (2010) re-rank $F \to E$ n-best lists using n-gram precision with respect to $G \to E$ translations. Callison-Burch (2002) exploits hypothesis selection in multi-source MT to expand available corpora, via co-training.
Others use system combination techniques to merge hypotheses at the word level, creating the ability to synthesize new translations outside those proposed by the single-source translators. These methods include confusion networks [Matusov et al. 2006, Schroeder et al. 2009], source-side string combination [Schroeder et al. 2009], and median strings [González-Rubio and Casacuberta 2010].
The above work all relies on base MT systems trained on bilingual data, using traditional methods. This follows early work in sentence alignment [Gale and Church 1993] and word alignment [Simard 1999], which exploited trilingual text, but did not build trilingual models. Previous authors possibly considered a three-dimensional translation table $t(e \mid f, g)$ to be prohibitive.
In this paper, by contrast, we train a $P(e \mid f, g)$ model directly on trilingual data, and we use that model to decode an $(f, g)$ pair simultaneously. We view this as a kind of multi-tape transduction [Elgot and Mezei 1965, Kaplan and Kay 1994, Deri and Knight 2015] with two input tapes and one output tape. Our contributions are as follows:
We train a $P(e \mid f, g)$ model directly on trilingual data, and we use it to decode a new source string pair $(f, g)$ into target string $e$.
We show positive Bleu improvements over strong single-source baselines.
We show that improvements are best when the two source languages are more distant from each other.
We are able to achieve these results using the framework of neural encoder-decoder models, where multi-target MT [Dong et al. 2015] and multi-source, cross-modal mappings have been explored [Luong et al. 2015a].
In the neural encoder-decoder framework for MT [Neco and Forcada 1997, Castaño and Casacuberta 1997, Sutskever et al. 2014, Bahdanau et al. 2014, Luong et al. 2015b], we use a recurrent neural network (encoder) to convert a source sentence into a dense, fixed-length vector. We then use another recurrent network (decoder) to convert that vector into a target sentence. (We follow previous authors in presenting the source sentence to the encoder in reverse order.)
In this paper, we use a four-layer encoder-decoder system (Figure 1) with long short-term memory (LSTM) units [Hochreiter and Schmidhuber 1997] trained for maximum likelihood (via a softmax layer) with back-propagation through time [Werbos 1990]. For our baseline single-source MT system we use two different models, one of which implements the local attention plus feed-input model from Luong et al. (2015b).
Figure 2 shows our approach to multi-source MT. Each source language has its own encoder. The question is how to combine the hidden states and cell states from each encoder, to pass on to the decoder. Black combiner blocks implement a function whose input is two hidden states ($h_1$ and $h_2$) and two cell states ($c_1$ and $c_2$), and whose output is a single hidden state $h$ and cell state $c$. We propose two combination methods.
The Basic method works by concatenating the two hidden states $h_1$ and $h_2$ from the source encoders, applying a linear transformation $W_c$ (size 2000 x 1000), then sending its output through a tanh non-linearity. This operation is represented by the equation:
$$h = \tanh\left(W_c [h_1; h_2]\right) \tag{1}$$
$W_c$ and all other weights in the network are learned from example string triples drawn from a trilingual training corpus.
The new cell state $c$ is simply the sum of the two cell states $c_1$ and $c_2$ from the encoders:
$$c = c_1 + c_2 \tag{2}$$
We also attempted to concatenate cell states and apply a linear transformation, but training diverges due to large cell values.
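The Basic combiner can be sketched in a few lines of NumPy. This is a minimal illustration rather than the toolkit's implementation: the function name `basic_combine` is hypothetical, the dimension is shrunk from 1000 to 4, and the weight matrix is stored as a `(d, 2d)` array so that it left-multiplies the concatenated column vector, which is an assumption about layout.

```python
import numpy as np

def basic_combine(h1, h2, c1, c2, W_c):
    """Merge the final states of two encoders into one decoder state.

    h1, h2 : hidden states of the two encoders, shape (d,)
    c1, c2 : cell states of the two encoders, shape (d,)
    W_c    : learned linear map, shape (d, 2d) (layout is an assumption)
    """
    # Hidden state: concatenate, apply the linear map, squash with tanh.
    h = np.tanh(W_c @ np.concatenate([h1, h2]))
    # Cell state: plain elementwise sum of the two encoder cells.
    c = c1 + c2
    return h, c

# Toy example with d = 4 instead of the 1000 used in the paper.
rng = np.random.default_rng(0)
d = 4
h1, h2, c1, c2 = (rng.standard_normal(d) for _ in range(4))
W_c = rng.uniform(-0.1, 0.1, size=(d, 2 * d))
h, c = basic_combine(h1, h2, c1, c2, W_c)
```

Because the hidden state passes through tanh, its entries stay in (-1, 1) whatever the encoders produce, while the summed cell state is unbounded.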
Our second combination method is inspired by the Child-Sum Tree-LSTMs of Tai et al. (2015). Here, we use an LSTM variant to combine the two hidden states and cells. The standard LSTM input, output, and new cell value are all calculated. Then cell states from each encoder get their own forget gates. The final cell state and hidden state are calculated as in a normal LSTM. More precisely:
$$i = \sigma\left(W_1^i h_1 + W_2^i h_2\right) \tag{3}$$
$$f_k = \sigma\left(W_k^f h_k\right), \quad k \in \{1, 2\} \tag{4}$$
$$o = \sigma\left(W_1^o h_1 + W_2^o h_2\right) \tag{5}$$
$$u = \tanh\left(W_1^u h_1 + W_2^u h_2\right) \tag{6}$$
$$c = i \odot u + f_1 \odot c_1 + f_2 \odot c_2 \tag{7}$$
$$h = o \odot \tanh(c) \tag{8}$$
This method employs eight new matrices (the $W$’s in the above equations), each of size 1000 x 1000. The symbol $\odot$ represents an elementwise multiplication. In equation 3, $i$ represents the input gate of a typical LSTM cell. In equation 4, there are two forget gates, indexed by the subscript $k$, one for the incoming cell of each encoder. In equation 5, $o$ represents the output gate of a normal LSTM. $f$, $i$, $o$, and $u$ are all size-1000 vectors.
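The Child-Sum combiner described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function name `child_sum_combine` and the dictionary keys for the eight matrices are hypothetical, biases are omitted because the text mentions none, and the dimension is 4 rather than 1000.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def child_sum_combine(h1, h2, c1, c2, W):
    """Combine two encoder states with an LSTM-style gated unit.

    W is a dict of eight (d, d) matrices; the key names are hypothetical:
    'i1', 'i2' (input gate), 'f1', 'f2' (one forget gate per encoder),
    'o1', 'o2' (output gate), 'u1', 'u2' (candidate cell value).
    """
    i = sigmoid(W["i1"] @ h1 + W["i2"] @ h2)   # input gate
    f1 = sigmoid(W["f1"] @ h1)                 # forget gate for encoder 1's cell
    f2 = sigmoid(W["f2"] @ h2)                 # forget gate for encoder 2's cell
    o = sigmoid(W["o1"] @ h1 + W["o2"] @ h2)   # output gate
    u = np.tanh(W["u1"] @ h1 + W["u2"] @ h2)   # candidate cell value
    c = i * u + f1 * c1 + f2 * c2              # gated mix of both incoming cells
    h = o * np.tanh(c)                         # hidden state as in a normal LSTM
    return h, c

# Toy example with d = 4 instead of 1000.
rng = np.random.default_rng(0)
d = 4
keys = ["i1", "i2", "f1", "f2", "o1", "o2", "u1", "u2"]
W = {k: rng.uniform(-0.1, 0.1, size=(d, d)) for k in keys}
h1, h2, c1, c2 = (rng.standard_normal(d) for _ in range(4))
h, c = child_sum_combine(h1, h2, c1, c2, W)
```

Unlike the Basic method, each incoming cell state is scaled by its own forget gate before the sum, so the unit can learn to favor one encoder's memory over the other's.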
Our single-source attention model is modeled on the local-p attention model with feed input from Luong et al. (2015b), where hidden states from the top decoder layer can look back at the top hidden states from the encoder. The top decoder hidden state $h_t$ is combined with a weighted sum of the encoder hidden states, to make a better hidden state vector ($\tilde{h}_t$), which is passed to the softmax output layer. With input-feeding, the hidden state from the attention model is sent down to the bottom decoder layer at the next time step.
The local-p attention model from Luong et al. (2015b) works as follows. First, a position $p_t$ to look at in the source encoder is predicted by equation 9:
$$p_t = S \cdot \mathrm{sigmoid}\left(v_p^\top \tanh(W_p h_t)\right) \tag{9}$$
$S$ is the source sentence length, and $v_p$ and $W_p$ are learned parameters, with $v_p$ being a vector of dimension 1000, and $W_p$ being a matrix of dimension 1000 x 1000.
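After $p_t$ is computed, a window of size $2D + 1$ is looked at in the top layer of the source encoder centered around $p_t$ ($D = 10$). For each hidden state $h_s$ in this window, we compute an alignment score $a_t(s)$, between 0 and 1. This alignment score is computed by equations 10, 11 and 12:
$$a_t(s) = \mathrm{align}(h_t, h_s) \exp\left(-\frac{(s - p_t)^2}{2\sigma^2}\right) \tag{10}$$
$$\mathrm{align}(h_t, h_s) = \frac{\exp\left(\mathrm{score}(h_t, h_s)\right)}{\sum_{s'} \exp\left(\mathrm{score}(h_t, h_{s'})\right)} \tag{11}$$
$$\mathrm{score}(h_t, h_s) = h_t^\top W_a h_s \tag{12}$$
In equation 10, $\sigma$ is set to be $D/2$ and $s$ is the source index for that hidden state. $W_a$ is a learnable parameter of dimension 1000 x 1000.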
Once all of the alignments are calculated, the context vector $c_t$ is created by summing the source hidden states in the window, each weighted by its alignment score.
The final hidden state sent to the softmax layer is given by:
$$\tilde{h}_t = \tanh\left(W_c [h_t; c_t]\right) \tag{13}$$
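The single-source attention step can be sketched end to end in NumPy. This is a minimal illustration, not the toolkit's code. It assumes the standard local-p formulation: a predicted position, a softmax over bilinear scores inside a window, a Gaussian falloff with standard deviation half the window half-width, and a tanh over the concatenated decoder state and context. The function name, argument names, and the `(d, 2d)` shape of `W_c` are hypothetical, and the window half-width `D` is an explicit argument.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_p_attention(h_t, H_s, W_p, v_p, W_a, W_c, D):
    """One decoder step of local-p attention over a single source encoder.

    h_t : (d,)    top-layer decoder hidden state at the current step
    H_s : (S, d)  top-layer encoder hidden states, one row per source position
    D   : int     half-width of the attention window
    Returns (h_tilde, c_t, a, p_t).
    """
    S = H_s.shape[0]
    # Predicted source position, a real number between 0 and S.
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))
    # Window of up to 2D + 1 positions around p_t, clipped to the sentence.
    center = int(p_t)
    lo, hi = max(0, center - D), min(S, center + D + 1)
    positions = np.arange(lo, hi)
    window = H_s[lo:hi]
    # Bilinear score h_t^T W_a h_s for each row, then a softmax over the window.
    scores = window @ (h_t @ W_a)
    align = np.exp(scores - scores.max())
    align /= align.sum()
    # Gaussian falloff around p_t with sigma = D / 2.
    sigma = D / 2.0
    a = align * np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    # Context vector: alignment-weighted sum of the window's hidden states.
    c_t = a @ window
    # Attentional hidden state that would be passed to the softmax layer.
    h_tilde = np.tanh(W_c @ np.concatenate([h_t, c_t]))
    return h_tilde, c_t, a, p_t

# Toy example: d = 4, a 7-token source sentence, window half-width 2.
rng = np.random.default_rng(0)
d, S, D = 4, 7, 2
h_t = rng.standard_normal(d)
H_s = rng.standard_normal((S, d))
W_p = rng.uniform(-0.1, 0.1, size=(d, d))
v_p = rng.uniform(-0.1, 0.1, size=d)
W_a = rng.uniform(-0.1, 0.1, size=(d, d))
W_c = rng.uniform(-0.1, 0.1, size=(d, 2 * d))
h_tilde, c_t, a, p_t = local_p_attention(h_t, H_s, W_p, v_p, W_a, W_c, D)
```

The alignment weights lie between 0 and 1 because each is a softmax probability multiplied by a Gaussian factor that is at most 1; they need not sum to 1.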
We modify this attention model to look at both source encoders simultaneously. We create a context vector from each source encoder, named $c_t^1$ and $c_t^2$, instead of just the single $c_t$ in the single-source attention model:
$$\tilde{h}_t = \tanh\left(W_c [h_t; c_t^1; c_t^2]\right) \tag{14}$$
Figure 3: Trilingual training corpus statistics.
 | French | English | German |
---|---|---|---|
Word tokens | 66.2m | 59.4m | 57.0m |
Word types | 424,832 | 381,062 | 865,806 |
Segment pairs | 2,378,112 | | |
Avg. segment length (tokens) | 27.8 | 25.0 | 24.0 |
In our multi-source attention model we now have two $p_t$ variables, one for each source encoder. We also have two separate sets of alignments and therefore now have two $c_t$ values, denoted by $c_t^1$ and $c_t^2$ as mentioned above. We also have distinct $W_p$, $v_p$, and $W_a$ parameters for each encoder.
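The multi-source variant can be sketched by computing one local-p context vector per encoder, each with its own position predictor and alignment parameters, and concatenating both with the decoder state before the tanh. This is a hedged NumPy illustration: the helper names, the dictionary layout for per-encoder parameters, and the `(d, 3d)` shape of the combining matrix are assumptions, not details taken from the released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_context(h_t, H_s, W_p, v_p, W_a, D):
    """Local-p context vector for one source encoder (see the lead-in)."""
    S = H_s.shape[0]
    p_t = S * sigmoid(v_p @ np.tanh(W_p @ h_t))        # predicted position
    center = int(p_t)
    lo, hi = max(0, center - D), min(S, center + D + 1)
    positions = np.arange(lo, hi)
    window = H_s[lo:hi]
    scores = window @ (h_t @ W_a)                      # h_t^T W_a h_s per row
    align = np.exp(scores - scores.max())
    align /= align.sum()
    sigma = D / 2.0
    a = align * np.exp(-((positions - p_t) ** 2) / (2.0 * sigma ** 2))
    return a @ window

def multi_source_attention(h_t, enc1, enc2, W_c, D):
    """Attend to two encoders at once.

    enc1, enc2 : dicts with keys 'H', 'W_p', 'v_p', 'W_a' (hypothetical
                 layout), holding each encoder's states and its own
                 attention parameters.
    W_c        : (d, 3d) combining matrix (shape is an assumption).
    """
    c1 = local_context(h_t, enc1["H"], enc1["W_p"], enc1["v_p"], enc1["W_a"], D)
    c2 = local_context(h_t, enc2["H"], enc2["W_p"], enc2["v_p"], enc2["W_a"], D)
    return np.tanh(W_c @ np.concatenate([h_t, c1, c2]))

def make_encoder(rng, S, d):
    """Random toy encoder states and attention parameters."""
    return {
        "H": rng.standard_normal((S, d)),
        "W_p": rng.uniform(-0.1, 0.1, size=(d, d)),
        "v_p": rng.uniform(-0.1, 0.1, size=d),
        "W_a": rng.uniform(-0.1, 0.1, size=(d, d)),
    }

# Toy example: the two source sentences may differ in length.
rng = np.random.default_rng(0)
d, D = 4, 2
enc_fr = make_encoder(rng, S=6, d=d)
enc_de = make_encoder(rng, S=9, d=d)
h_t = rng.standard_normal(d)
W_c = rng.uniform(-0.1, 0.1, size=(d, 3 * d))
h_tilde = multi_source_attention(h_t, enc_fr, enc_de, W_c, D)
```

Keeping the two parameter sets separate lets each encoder predict its own attention position, which matters because the two source sentences generally differ in length and word order.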
We use English, French, and German data from a subset of the WMT 2014 dataset [Bojar et al. 2014]. Figure 3 shows statistics for our training set. For development, we use the 3000 sentences supplied by WMT. For testing, we use a 1503-line trilingual subset of the WMT test set.
For the single-source models, we follow the training procedure used in Luong et al. (2015b), but with 15 epochs and halving the learning rate every full epoch after the 10th epoch. We also re-scale the normalized gradient when its norm exceeds 5. For training, we use a minibatch size of 128, a hidden state size of 1000, and dropout as in Zaremba et al. (2014). The dropout rate is 0.2, the initial parameter range is [-0.1, +0.1], and the learning rate is 1.0. For the normal and multi-source attention models, we change the dropout rate to 0.3, the initial parameter range to [-0.08, +0.08], and the learning rate to 0.7, to counter overfitting.
Target = English | |||
---|---|---|---|
Source | Method | Ppl | BLEU |
French | — | 10.3 | 21.0 |
German | — | 15.9 | 17.3 |
French+German | Basic | 8.7 | 23.2 |
French+German | Child-Sum | 9.0 | 22.5 |
French+French | Child-Sum | 10.9 | 20.7 |
French | Attention | 8.1 | 25.2 |
French+German | B-Attent. | 5.7 | 30.0 |
French+German | CS-Attent. | 6.0 | 29.6 |
Figure 4 shows our results for target English, with French and German as source languages. We see that the Basic combination method yields a +4.8 Bleu improvement over the strongest single-source, attention-based system. It also improves Bleu by +2.2 over the non-attention baseline. The Child-Sum method gives improvements of +4.4 and +1.4. We also confirm that two copies of the same French input yield no Bleu improvement.
Figure 5 shows the action of the multi-attention model during decoding.
When our source languages are English and French (Figure 6), we observe smaller BLEU gains (up to +1.1). This is evidence that the more distinct the source languages, the better they disambiguate each other.
Target = German | |||
---|---|---|---|
Source | Method | Ppl | BLEU |
French | — | 12.3 | 10.6 |
English | — | 9.6 | 13.4 |
French+English | Basic | 9.1 | 14.5 |
French+English | Child-Sum | 9.5 | 14.4 |
English | Attention | 7.3 | 17.6 |
French+English | B-Attent. | 6.9 | 18.6 |
French+English | CS-Attent. | 7.1 | 18.2 |
We describe a multi-source neural MT system that gets up to +4.8 Bleu gains over a very strong attention-based, single-source baseline. We obtain this result through a novel encoder-vector combination method and a novel multi-attention system. We release the code for these experiments at https://github.com/isi-nlp/Zoph_RNN.