Neural Machine Translation (NMT) performance degrades sharply when parallel training data is limited (Koehn and Knowles, 2017). Past work has addressed this problem by leveraging monolingual data (Sennrich et al., 2016a; Ramachandran et al., 2017) or multilingual parallel data (Zoph et al., 2016; Johnson et al., 2017; Gu et al., 2018a). We explore a complementary direction: how can we better exploit the potential of the limited bilingual training data available?
Our approach builds on the bi-directional NMT model (Niu et al., 2018), which improves low-resource translation by jointly modeling translation in both directions (e.g., Swahili to English and English to Swahili). We propose a new training objective for this model by augmenting the standard translation cross-entropy loss with a differentiable input reconstruction loss to further exploit the source side of parallel samples.
Input reconstruction is motivated by the idea of round-trip translation. Suppose a sentence is translated forward by one model and then translated back into the source language by a second model; the forward translation is more likely to be good if the distance between the original sentence and its reconstruction is small (Brislin, 1970). Prior work applied round-trip translation to monolingual examples and sampled the intermediate translation from a $k$-best list generated by the forward model using beam search (Cheng et al., 2016; He et al., 2016). However, beam search is not differentiable, which prevents back-propagating reconstruction errors to the forward model. As a result, reinforcement learning algorithms, or independent updates to the forward and backward models, were required.
In this paper, we focus on the problem of making input reconstruction differentiable to simplify training. In past work, Tu et al. (2017) addressed this issue by reconstructing source sentences from the decoder’s hidden states. However, this reconstruction task can be artificially easy if hidden states over-memorize the input.
We propose instead to combine benefits from differentiable sampling and bi-directional NMT to obtain a compact model that can be trained end-to-end with back-propagation. Specifically, a single bi-directional model is used as both a translator and a reconstructor (i.e. the two share the same parameters). By contrast, uni-directional models would require a distinct reconstructor, which would introduce additional parameters.
Experiments show that our approach performs better than reconstructing from hidden states. It achieves consistent improvements across various low-resource language pairs and directions, showing its effectiveness in making better use of limited parallel data.
2 Related Work
Using round-trip translations (source → translation → reconstructed source) as a training signal for NMT usually requires auxiliary models to perform back-translation and cannot be trained end-to-end without reinforcement learning. For instance, Cheng et al. (2016) added a reconstruction loss for monolingual examples to the training objective, but did not back-propagate errors to the forward translator. He et al. (2016) evaluated the quality of the intermediate translation by a language model and that of the reconstruction by a reconstruction likelihood. Both approaches have symmetric forward and backward translation models which are updated alternately. This requires policy gradient algorithms for training, which are not always stable.
Back-translation (Sennrich et al., 2016a) performs half of the reconstruction process by generating a synthetic source side for monolingual target-language examples. It uses an auxiliary backward model to generate the synthetic data, but only the primary forward model is updated by training on it (i.e. reconstructing the original target sentences). Forward and backward models can be updated iteratively (Zhang et al., 2018; Niu et al., 2018); however, this is an expensive process as back-translations are regenerated at each iteration.
Prior work has sought to simplify the optimization of reconstruction losses by side-stepping beam search. Tu et al. (2017) first proposed reconstructing NMT input from the decoder’s hidden states, while Wang et al. (2018a, b) suggested using both encoder and decoder hidden states to improve translation of dropped pronouns. However, these models might achieve low reconstruction errors by learning to copy the input to hidden states. To avoid copying the input, Artetxe et al. (2018) and Lample et al. (2018) used denoising autoencoders (Vincent et al., 2008) in unsupervised NMT.
Our approach is based instead on the Gumbel Softmax (Jang et al., 2017; Maddison et al., 2017), which facilitates differentiable sampling of sequences of discrete tokens. It has been successfully applied in many sequence generation tasks, including learning to communicate (Havrylov and Titov, 2017), composing tree structures from text (Choi et al., 2018), and tasks under the umbrella of generative adversarial networks (Goodfellow et al., 2014) such as generating the context-free grammar (Kusner and Hernández-Lobato, 2016), machine comprehension (Wang et al., 2017) and machine translation (Gu et al., 2018b).
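NMT is framed as a conditional language model, where the probability of predicting target token $t_i$ at step $i$ is conditioned on the previously generated sequence of tokens $\boldsymbol{t}_{<i}$ and the source sequence $\boldsymbol{s}$ given the model parameters $\theta$. This probability is a softmax over output scores $\boldsymbol{a}_i$, a linear transformation of $\boldsymbol{h}_i$, where $\boldsymbol{h}_i$ is the decoder’s hidden state at step $i$:

$$p(t_i = k \mid \boldsymbol{t}_{<i}, \boldsymbol{s}; \theta) = \mathrm{softmax}(\boldsymbol{a}_i)_k$$

The hidden state is calculated by neural network layers $f$ given the embeddings of the previous target tokens $\boldsymbol{t}_{<i}$ in the embedding matrix $E$ and the context $\boldsymbol{c}_i$ coming from the source:

$$\boldsymbol{h}_i = f\left(E(\boldsymbol{t}_{<i}), \boldsymbol{c}_i\right)$$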
In our bi-directional model, the source sentence can be in either language of the pair and is translated into the other. The language is marked by a tag (e.g., <en>) at the beginning of each source sentence (Niu et al., 2018; Johnson et al., 2017). To facilitate symmetric reconstruction, we also add language tags to target sentences. The training corpus is then built by swapping the source and target sentences of a parallel corpus and appending the swapped version to the original.
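The corpus construction described above can be sketched as follows. This is a minimal illustration rather than the actual preprocessing pipeline: the function name `build_bidirectional_corpus` is hypothetical, and we assume that every sentence is tagged with its own language, which the text does not spell out.

```python
def build_bidirectional_corpus(pairs, src_lang, tgt_lang):
    """Return (source, target) training pairs covering both directions.

    Assumption: each sentence is prefixed with a tag naming its own language.
    """
    def tag(lang, sentence):
        return f"<{lang}> {sentence}"

    # Original direction: src_lang -> tgt_lang.
    forward = [(tag(src_lang, s), tag(tgt_lang, t)) for s, t in pairs]
    # Swapped direction, appended to the original corpus.
    backward = [(tag(tgt_lang, t), tag(src_lang, s)) for s, t in pairs]
    return forward + backward


corpus = build_bidirectional_corpus([("asante", "thank you")], "sw", "en")
for source, target in corpus:
    print(source, "|||", target)
```

Because both sides carry a tag, a sampled translation can be fed back as a source sentence without any further change of format.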
3.1 Bi-Directional Reconstruction
Our bi-directional model performs both forward translation and backward reconstruction. By contrast, uni-directional models require an auxiliary reconstruction module, which introduces additional parameters. This module can be either a decoder-based reconstructor (Tu et al., 2017; Wang et al., 2018a, b) or a reversed dual NMT model (Cheng et al., 2016; He et al., 2016; Wang et al., 2018c; Zhang et al., 2018).
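Here the reconstructor, which shares the same parameters $\theta$ with the translator, can also be trained end-to-end by maximizing the log-likelihood of reconstructing the source sentence $\boldsymbol{s}$ from its sampled translation $\tilde{\boldsymbol{t}}$:

$$\mathcal{L}_R = \sum_{(\boldsymbol{s}, \boldsymbol{t})} \log p(\boldsymbol{s} \mid \tilde{\boldsymbol{t}}; \theta)$$

Combining with the forward translation likelihood

$$\mathcal{L}_T = \sum_{(\boldsymbol{s}, \boldsymbol{t})} \log p(\boldsymbol{t} \mid \boldsymbol{s}; \theta)$$

we use $\mathcal{L} = \mathcal{L}_T + \mathcal{L}_R$ as the final training objective for $\boldsymbol{s} \to \boldsymbol{t}$. The dual model is trained simultaneously by swapping the language direction in bi-directional NMT.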
3.2 Differentiable Sampling
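We use differentiable sampling to side-step beam search and back-propagate error signals. We sample a translation token at each time step, with its probability reparameterized using the Gumbel-Max trick (Maddison et al., 2014):

$$\tilde{t}_i = \arg\max_k \left( a_{ik} + G_{ik} \right)$$

where $a_{ik}$ is the decoder’s unnormalized output score for vocabulary item $k$ at step $i$, and $G_{ik}$ is i.i.d. and drawn from $\mathrm{Gumbel}(0, 1)$, i.e. $G_{ik} = -\log(-\log u_{ik})$ and $u_{ik} \sim \mathrm{Uniform}(0, 1)$. We use Gumbel noise scaled with parameter $\beta$, i.e. $\mathrm{Gumbel}(0, \beta)$, to control the randomness. The sampling becomes deterministic (which is equivalent to greedy search) as $\beta$ approaches 0.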
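Since $\arg\max$ is not differentiable, its gradient is approximated with the Straight-Through Gumbel Softmax (STGS) estimator (Jang et al., 2017; Bengio et al., 2013), which back-propagates through a softmax with temperature $\tau$ instead. As $\tau$ approaches 0, the softmax is closer to $\arg\max$ but training might be more unstable. While the STGS estimator is biased when $\tau$ is large, it performs well in practice (Gu et al., 2018b; Choi et al., 2018) and is sometimes faster and more effective than reinforcement learning (Havrylov and Titov, 2017).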
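To make the sampling step concrete, the following sketch draws one token from a vector of decoder scores. It is an illustration under stated assumptions, not our implementation: `beta` (the Gumbel noise scale) and `tau` (the softmax temperature) are names chosen here for the two hyperparameters, the helper functions are hypothetical, and plain Python carries no gradients, so only the forward computation is shown.

```python
import math
import random


def gumbel_noise(rng):
    """Draw G = -log(-log(u)) with u ~ Uniform(0, 1)."""
    # Clamp u away from 0 and 1 so that both logarithms stay finite.
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, beta, tau, rng):
    """Return (hard index, relaxed distribution) for one decoding step."""
    # Gumbel-Max: perturb each score with scaled Gumbel noise.
    perturbed = [a + beta * gumbel_noise(rng) for a in logits]
    # Forward pass: the discrete token is the argmax of the perturbed scores.
    hard = max(range(len(perturbed)), key=perturbed.__getitem__)
    # Relaxation: a temperature softmax that a straight-through estimator
    # would differentiate in place of the argmax.
    soft = softmax([x / tau for x in perturbed])
    return hard, soft


rng = random.Random(0)
hard, soft = sample_token([1.0, 3.0, 2.0], beta=0.0, tau=0.1, rng=rng)
print(hard, [round(p, 4) for p in soft])
```

With `beta=0.0` the noise vanishes and the hard choice is the greedy one. In an automatic differentiation framework, the straight-through step is commonly written as the one-hot vector plus the relaxed distribution minus a gradient-stopped copy of the relaxed distribution, so that the forward value is discrete while gradients follow the softmax.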
To rectify errors in the model’s own predictions, the decoder used for sampling translations only consumes its previously predicted tokens. This contrasts with the usual teacher forcing strategy (Williams and Zipser, 1989), which always feeds in the ground-truth previous tokens when predicting the current token, and would force the decoder to replicate the ground-truth sequence during training.
| Baseline | 33.60 ± 0.14 | 30.70 ± 0.19 | 27.23 ± 0.11 | 32.15 ± 0.21 | 12.25 ± 0.08 | 20.80 ± 0.12 | 12.90 ± 0.04 | 15.32 ± 0.11 |
| Hidden | 33.41 ± 0.15 | 30.91 ± 0.19 | 27.43 ± 0.14 | 32.20 ± 0.35 | 12.30 ± 0.11 | 20.72 ± 0.16 | 12.77 ± 0.11 | 15.34 ± 0.10 |
| Δ | -0.19 ± 0.24 | 0.21 ± 0.14 | 0.19 ± 0.13 | 0.04 ± 0.17 | 0.05 ± 0.11 | -0.08 ± 0.12 | -0.13 ± 0.13 | 0.01 ± 0.07 |
|  | 33.92 ± 0.10 | 31.37 ± 0.18 | 27.65 ± 0.09 | 32.75 ± 0.32 | 12.47 ± 0.08 | 21.14 ± 0.19 | 13.26 ± 0.07 | 15.60 ± 0.19 |
| Δ | 0.32 ± 0.12 | 0.66 ± 0.11 | 0.42 ± 0.16 | 0.59 ± 0.13 | 0.22 ± 0.04 | 0.35 ± 0.15 | 0.36 ± 0.09 | 0.28 ± 0.11 |
|  | 33.97 ± 0.08 | 31.39 ± 0.09 | 27.65 ± 0.10 | 32.65 ± 0.24 | 12.48 ± 0.09 | 21.20 ± 0.14 | 13.16 ± 0.08 | 15.52 ± 0.07 |
| Δ | 0.37 ± 0.09 | 0.69 ± 0.11 | 0.42 ± 0.11 | 0.50 ± 0.08 | 0.23 ± 0.03 | 0.41 ± 0.13 | 0.25 ± 0.09 | 0.19 ± 0.05 |
Table 2: The numbers before and after ‘±’ are the mean and standard deviation of BLEU over five randomly seeded models. Our proposed methods (bottom four rows) achieve small but consistent improvements. ΔBLEU scores are in bold if mean − std is above zero and in red if the mean is below zero.
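The per-task summary statistics and the highlighting rule from the caption can be reproduced along the following lines. This is a sketch under assumptions that the text does not state explicitly: each fine-tuned model is paired with the baseline of the same seed, and the sample standard deviation is used. The function name and the BLEU values below are hypothetical.

```python
from statistics import mean, stdev


def summarize_delta(baseline_bleu, finetuned_bleu):
    """Return (mean, std, highlight) of per-seed BLEU differences."""
    # Pair each fine-tuned model with the baseline trained from the same seed.
    deltas = [f - b for b, f in zip(baseline_bleu, finetuned_bleu)]
    mu = mean(deltas)
    sigma = stdev(deltas)  # assumption: sample standard deviation
    if mu - sigma > 0:
        highlight = "bold"   # improvement exceeds one standard deviation
    elif mu < 0:
        highlight = "red"    # average degradation
    else:
        highlight = "plain"
    return mu, sigma, highlight


# Hypothetical BLEU scores for five seeds (not taken from our experiments).
baseline = [30.0, 30.2, 29.8, 30.1, 29.9]
finetuned = [30.5, 30.6, 30.3, 30.4, 30.5]
mu, sigma, highlight = summarize_delta(baseline, finetuned)
print(f"{mu:.2f} ± {sigma:.2f} ({highlight})")
```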
4.1 Tasks and Data
We evaluate our approach on four low-resource language pairs. Parallel data for Swahili↔English (SW↔EN), Tagalog↔English (TL↔EN) and Somali↔English (SO↔EN) contains a mixture of domains such as news and weblogs and is collected from the IARPA MATERIAL program (https://www.iarpa.gov/index.php/research-programs/material), the Global Voices parallel corpus (http://casmacat.eu/corpus/global-voices.html), Common Crawl (Smith et al., 2013), and the LORELEI Somali representative language pack (LDC2018T11). The test samples are extracted from the held-out ANALYSIS set of MATERIAL. Parallel Turkish↔English (TR↔EN) data is provided by the WMT news translation task (Bojar et al., 2018). We use the pre-processed corpus/newsdev2016/newstest2017 as training/development/test sets (http://data.statmt.org/wmt18/translation-task/preprocessed/).
We apply normalization, tokenization, true-casing, joint source-target BPE with 32,000 operations (Sennrich et al., 2016b) and sentence-filtering (length 80 cutoff) to parallel data. Itemized data statistics after preprocessing can be found in Table 1. We report case-insensitive BLEU with the WMT standard ‘13a’ tokenization using SacreBLEU (Post, 2018).
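As an illustration of the sentence-filtering step only, the sketch below drops sentence pairs that exceed the length cutoff. It assumes that length is counted in whitespace-separated tokens and that the cutoff applies to both sides of a pair; `filter_by_length` is a hypothetical helper, not part of any toolkit.

```python
def filter_by_length(pairs, max_len=80):
    """Keep sentence pairs whose source and target both have at most max_len tokens."""
    kept = []
    for src, tgt in pairs:
        # Assumption: length is the number of whitespace-separated tokens.
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept


pairs = [
    ("a short source", "a short target"),
    (" ".join(["tok"] * 81), "a short target"),  # source exceeds the cutoff
]
print(len(filter_by_length(pairs)))
```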
4.2 Model Configuration and Baseline
Our translation model uses a bi-directional encoder with a single LSTM layer of size 512, multilayer perceptron attention with a layer size of 512, and word representations of size 512. We apply layer normalization (Ba et al., 2016) and add dropout to embeddings and RNNs of the encoder and decoder with probability 0.2. We train using the Adam optimizer (Kingma and Ba, 2015) with a batch size of 48 sentences and checkpoint the model every 1000 updates. The learning rate for baseline models is initialized to 0.001 and reduced by 30% after 4 checkpoints without improvement of perplexity on the development set. Training stops after 10 checkpoints without improvement.
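The learning-rate decay and stopping rule can be sketched as a small state machine. This is a hedged illustration, not the toolkit's scheduler: the class name `PlateauSchedule` is hypothetical, and we assume that the decay counter restarts after each reduction while the stopping counter keeps counting from the best checkpoint.

```python
class PlateauSchedule:
    """Learning-rate decay and early stopping driven by dev-set perplexity."""

    def __init__(self, lr=0.001, decay=0.7, reduce_patience=4, stop_patience=10):
        self.lr = lr
        self.decay = decay                      # "reduced by 30%" = multiply by 0.7
        self.reduce_patience = reduce_patience  # checkpoints before a reduction
        self.stop_patience = stop_patience      # checkpoints before stopping
        self.best = float("inf")
        self.since_best = 0    # checkpoints since the best perplexity
        self.since_reduce = 0  # non-improving checkpoints since the last reduction

    def step(self, dev_perplexity):
        """Record one checkpoint; return True while training should continue."""
        if dev_perplexity < self.best:
            self.best = dev_perplexity
            self.since_best = 0
            self.since_reduce = 0
        else:
            self.since_best += 1
            self.since_reduce += 1
            if self.since_reduce >= self.reduce_patience:
                self.lr *= self.decay
                self.since_reduce = 0  # assumption: counter restarts after a reduction
        return self.since_best < self.stop_patience


schedule = PlateauSchedule()
for perplexity in [50.0, 48.0, 48.5, 48.2, 48.1, 48.3]:
    keep_training = schedule.step(perplexity)
print(schedule.lr, keep_training)
```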
The bi-directional NMT model ties source and target embeddings to yield a bilingual vector space. It also ties the output layer’s weight matrix and embeddings to achieve better performance in low-resource scenarios (Press and Wolf, 2017; Nguyen and Chiang, 2018).
We train five randomly seeded bi-directional baseline models by optimizing the forward translation objective and report the mean and standard deviation of test BLEU. We fine-tune baseline models with the combined translation and reconstruction objective, inheriting all settings except the learning rate, which is re-initialized to 0.0001. Each randomly seeded model is fine-tuned independently, so we are able to report the standard deviation of the BLEU difference relative to the baseline.
4.3 Contrastive Reconstruction Model
We compare our approach with reconstruction from hidden states. Following the best practice of Wang et al. (2018a), two reconstructors are used to take hidden states from both the encoder and the decoder. The corresponding two reconstruction losses and the canonical translation loss were originally uniformly weighted, but we found in preliminary experiments that balancing the reconstruction and translation losses yields better results.
We use the reconstructor exclusively to compute the reconstruction training loss. It has also been used to re-rank translation hypotheses in prior work, but Tu et al. (2017) showed in ablation studies that the gains from re-ranking are small compared to those from training.
Table 2 shows that our reconstruction approach achieves small but consistent BLEU improvements over the baseline on all eight tasks.
We evaluate the impact of the Gumbel Softmax hyperparameters on the development set. We select the softmax temperature and the Gumbel noise scale based on training stability and BLEU. Interestingly, increased randomness in sampling does not have a strong impact on BLEU. Sampling by greedy search (i.e. without Gumbel noise) performs similarly to sampling with increased Gumbel noise (i.e. more random translation selection, possibly including lower-quality samples).
The scale of improvement is influenced by the distance between training and test data. For example, the out-of-vocabulary (OOV) rate of TR↔EN is more than twice the OOV rate of SW↔EN and the latter obtains higher BLEU. This suggests that reconstructing the input helps to better fit the limited parallel training data.
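For reference, an OOV rate of the kind compared above can be computed as in the sketch below. It assumes a token-level rate over whitespace-separated tokens (the fraction of test tokens never seen in training); the exact definition used for the comparison is not given in the text, and `oov_rate` is a hypothetical helper.

```python
def oov_rate(train_sentences, test_sentences):
    """Fraction of test tokens that never occur in the training data."""
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0  # avoid dividing by zero on an empty test set
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)


print(oov_rate(["a b c"], ["a d", "b e"]))
```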
Reconstructing from hidden states yields more mixed results. It fails to improve BLEU in more difficult cases, such as TR↔EN with high OOV rates. We observe extremely low training perplexity for reconstructing from hidden states compared with our proposed approach (Figure 1a). This suggests that reconstructing from hidden states yields representations that memorize the input rather than improve output representations.
Another advantage of our approach is that all parameters were jointly pre-trained, which results in more stable training behavior. By contrast, reconstructing from hidden states requires initializing the reconstructors independently and suffers from unstable early training behavior (Figure 1).
We studied reconstructing the input of NMT from its intermediate translations to better exploit training samples in low-resource settings. We used a bi-directional NMT model and the Straight-Through Gumbel Softmax to build a fully differentiable reconstruction model that does not require any additional parameters. We empirically demonstrated that our approach is effective in low-resource scenarios. In future work, we will investigate the use of differentiable reconstruction from sampled sequences in unsupervised and semi-supervised sequence generation tasks.
Acknowledgments
This research is based upon work supported in part by an Amazon Web Services Machine Learning Research Award, and by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via contract #FA8650-17-C-9117. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
References
- Artetxe et al. (2018) Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised neural machine translation. In Proceedings of the 6th International Conference on Learning Representations.
- Ba et al. (2016) Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. 2016. Layer normalization. CoRR, abs/1607.06450.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations.
- Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432.
- Bojar et al. (2018) Ondrej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation, pages 272–307. Association for Computational Linguistics.
- Brislin (1970) Richard W. Brislin. 1970. Back-translation for cross-cultural research. Journal of Cross-Cultural Psychology, 1(3):185–216.
- Cheng et al. (2016) Yong Cheng, Wei Xu, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Semi-supervised learning for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1965–1974. Association for Computational Linguistics.
- Choi et al. (2018) Jihun Choi, Kang Min Yoo, and Sang-goo Lee. 2018. Learning to compose task-specific tree structures. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5094–5101. AAAI Press.
- Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680.
- Gu et al. (2018a) Jiatao Gu, Hany Hassan, Jacob Devlin, and Victor O. K. Li. 2018a. Universal neural machine translation for extremely low resource languages. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 344–354. Association for Computational Linguistics.
- Gu et al. (2018b) Jiatao Gu, Daniel Jiwoong Im, and Victor O. K. Li. 2018b. Neural machine translation with gumbel-greedy decoding. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5125–5132. AAAI Press.
- Havrylov and Titov (2017) Serhii Havrylov and Ivan Titov. 2017. Emergence of language with multi-agent games: Learning to communicate with sequences of symbols. In Advances in Neural Information Processing Systems 30, pages 2146–2156.
- He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems 29, pages 820–828. Curran Associates, Inc.
- Hieber et al. (2017) Felix Hieber, Tobias Domhan, Michael Denkowski, David Vilar, Artem Sokolov, Ann Clifton, and Matt Post. 2017. Sockeye: A toolkit for neural machine translation. CoRR, abs/1712.05690.
- Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In Proceedings of the 5th International Conference on Learning Representations.
- Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda B. Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations.
- Koehn and Knowles (2017) Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39. Association for Computational Linguistics.
- Kusner and Hernández-Lobato (2016) Matt J. Kusner and José Miguel Hernández-Lobato. 2016. GANS for sequences of discrete elements with the gumbel-softmax distribution. In Proceedings of the NIPS 2016 Workshop on Adversarial Training.
- Lample et al. (2018) Guillaume Lample, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Unsupervised machine translation using monolingual corpora only. In Proceedings of the 6th International Conference on Learning Representations.
- Maddison et al. (2017) Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the 5th International Conference on Learning Representations.
- Maddison et al. (2014) Chris J. Maddison, Daniel Tarlow, and Tom Minka. 2014. A* sampling. In Advances in Neural Information Processing Systems 27, pages 3086–3094. Curran Associates, Inc.
- Nguyen and Chiang (2018) Toan Q. Nguyen and David Chiang. 2018. Improving lexical choice in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 334–343. Association for Computational Linguistics.
- Niu et al. (2018) Xing Niu, Michael Denkowski, and Marine Carpuat. 2018. Bi-directional neural machine translation with synthetic parallel data. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 84–91. Association for Computational Linguistics.
- Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation, pages 186–191. Association for Computational Linguistics.
- Press and Wolf (2017) Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 157–163. Association for Computational Linguistics.
- Ramachandran et al. (2017) Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. 2017. Unsupervised pretraining for sequence to sequence learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 383–391. Association for Computational Linguistics.
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 86–96. Association for Computational Linguistics.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725. Association for Computational Linguistics.
- Smith et al. (2013) Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt cheap web-scale parallel text from the common crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 1374–1383. Association for Computational Linguistics.
- Tu et al. (2017) Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 3097–3103. AAAI Press.
- Vincent et al. (2008) Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103. ACM.
- Wang et al. (2017) Bingning Wang, Kang Liu, and Jun Zhao. 2017. Conditional generative adversarial networks for commonsense machine comprehension. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, pages 4123–4129.
- Wang et al. (2018a) Longyue Wang, Zhaopeng Tu, Shuming Shi, Tong Zhang, Yvette Graham, and Qun Liu. 2018a. Translating pro-drop languages with reconstruction models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 4937–4945. AAAI Press.
- Wang et al. (2018b) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2018b. Learning to jointly translate and predict dropped pronouns with a shared reconstruction mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2997–3002. Association for Computational Linguistics.
- Wang et al. (2018c) Yijun Wang, Yingce Xia, Li Zhao, Jiang Bian, Tao Qin, Guiquan Liu, and Tie-Yan Liu. 2018c. Dual transfer learning for neural machine translation with marginal distribution regularization. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 5553–5560. AAAI Press.
- Williams and Zipser (1989) Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.
- Zhang et al. (2018) Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2018. Joint training for neural machine translation models with monolingual data. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 555–562. AAAI Press.
- Zoph et al. (2016) Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1568–1575. Association for Computational Linguistics.