One of the most attractive features of neural machine translation (NMT) [Bahdanau et al.2015, Cho et al.2014, Sutskever et al.2014] is that it is possible to train an end to end system without the need to deal with word alignments, translation rules and complicated decoding algorithms, which are a characteristic of statistical machine translation (SMT) systems. However, it is reported that NMT works better than SMT only when there is an abundance of parallel corpora. In the case of low resource domains, vanilla NMT is either worse than or comparable to SMT [Zoph et al.2016].
Domain adaptation has been shown to be effective for low resource NMT. The conventional domain adaptation method is fine tuning, in which an out-of-domain model is further trained on in-domain data [Luong and Manning2015, Sennrich et al.2016b, Servan et al.2016, Freitag and Al-Onaizan2016]. However, fine tuning tends to overfit quickly due to the small size of the in-domain data. On the other hand, multi domain NMT [Kobus et al.2016] involves training a single NMT model for multiple domains. This method adds tags “<2domain>” by modifying the parallel corpora to indicate domains without any modifications to the NMT system architecture. However, this method has not been studied for domain adaptation in particular.
Motivated by these two lines of studies, we propose a new domain adaptation method called “mixed fine tuning,” where we first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus that is a mix of the in-domain and out-of-domain corpora. Fine tuning on the mixed corpus instead of the in-domain corpus can address the overfitting problem. All corpora are augmented with artificial tags to indicate specific domains. We tried two different corpora settings:
We observed that “mixed fine tuning” works significantly better than methods that use fine tuning and domain tag based approaches separately. Our contributions are twofold:
We propose a novel method that combines the best of existing approaches and have shown that it is effective.
To the best of our knowledge this is the first work on an empirical comparison of various domain adaptation methods.
2 Related Work
Besides fine tuning and multi domian NMT using tags, another direction for domain adaptation is using in-domain monolingual data. Either training an in-domain recurrent neural language (RNN) language model for the NMT decoder [Gülçehre et al.2015] or generating synthetic data by back translating target in-domain monolingual data [Sennrich et al.2016b] have been studied.
3 Methods for Comparison
All the methods that we compare are simple and do not need any modifications to the NMT system.
3.1 Fine Tuning
Fine tuning is the conventional way for domain adaptation, and thus serves as a baseline in this study. In this method, we first train an NMT system on a resource rich out-of-domain corpus till convergence, and then fine tune its parameters on a resource poor in-domain corpus (Figure 1).
3.2 Multi Domain
The multi domain method is originally motivated by [Sennrich et al.2016a], which uses tags to control the politeness of NMT translations. The overview of this method is shown in the dotted section in Figure 2. In this method, we simply concatenate the corpora of multiple domains with two small modifications: a. Appending the domain tag “<2domain>” to the source sentences of the respective corpora.111We verified the effectiveness of the domain tags by comparing against a setting that does not use them, see the “w/o tags” settings in Tables 1 and 2. This primes the NMT decoder to generate sentences for the specific domain. b. Oversampling the smaller corpus so that the training procedure pays equal attention to each domain.
We can further fine tune the multi domain model on the in-domain data, which is named as “multi domain + fine tuning.”
3.3 Mixed Fine Tuning
The proposed mixed fine tuning method is a combination of the above methods (shown in Figure 2). The training procedure is as follows:
Train an NMT model on out-of-domain data till convergence.
Resume training the NMT model from step 1 on a mix of in-domain and out-of-domain data (by oversampling the in-domain data) till convergence.
By default, we utilize domain tags, but we also consider settings where we do not use them (i.e., “w/o tags”). We can further fine tune the model from step 2 on the in-domain data, which is named as “mixed fine tuning + fine tuning.”
Note that in the “fine tuning” method, the vocabulary obtained from the out-of-domain data is used for the in-domain data; while for the “multi domain” and “mixed fine tuning” methods, we use a vocabulary obtained from the mixed in-domain and out-of-domain data for all the training stages.
4 Experimental Settings
We conducted NMT domain adaptation experiments in two different settings as follows:
4.1 High Quality In-domain Corpus Setting
Chinese-to-English translation was the focus of the high quality in-domain corpus setting. We utilized the resource rich patent out-of-domain data to augment the resource poor spoken language in-domain data. The patent domain MT was conducted on the Chinese-English subtask (NTCIR-CE) of the patent MT task at the NTCIR-10 workshop222http://ntcir.nii.ac.jp/PatentMT-2/ [Goto et al.2013]. The NTCIR-CE task uses 1000000, 2000, and 2000 sentences for training, development, and testing, respectively. The spoken domain MT was conducted on the Chinese-English subtask (IWSLT-CE) of the TED talk MT task at the IWSLT 2015 workshop [Cettolo et al.2015]. The IWSLT-CE task contains 209,491 sentences for training. We used the dev 2010 set for development, containing 887 sentences. We evaluated all methods on the 2010, 2011, 2012, and 2013 test sets, containing 1570, 1245, 1397, and 1261 sentences, respectively.
4.2 Low Quality In-domain Corpus Setting
Chinese-to-Japanese translation was the focus of the low quality in-domain corpus setting. We utilized the resource rich scientific out-of-domain data to augment the resource poor Wikipedia (essentially open) in-domain data. The scientific domain MT was conducted on the Chinese-Japanese paper excerpt corpus (ASPEC-CJ)333http://lotus.kuee.kyoto-u.ac.jp/ASPEC/ [Nakazawa et al.2016], which is one subtask of the workshop on Asian translation (WAT)444http://orchid.kuee.kyoto-u.ac.jp/WAT/ [Nakazawa et al.2015]. The ASPEC-CJ task uses 672315, 2090, and 2107 sentences for training, development, and testing, respectively. The Wikipedia domain task was conducted on a Chinese-Japanese corpus automatically extracted from Wikipedia (WIKI-CJ) [Chu et al.2016a] using the ASPEC-CJ corpus as a seed. The WIKI-CJ task contains 136013, 198, and 198 sentences for training, development, and testing, respectively.
4.3 MT Systems
For NMT, we used the KyotoNMT system555https://github.com/fabiencro/knmt [Cromieres et al.2016]. The NMT training settings are the same as those of the best systems that participated in WAT 2016. The sizes of the source and target vocabularies, the source and target side embeddings, the hidden states, the attention mechanism hidden states, and the deep softmax output with a 2-maxout layer were set to 32,000, 620, 1000, 1000, and 500, respectively. We used 2-layer LSTMs for both the source and target sides. ADAM was used as the learning algorithm, with a dropout rate of 20% for the inter-layer dropout, and L2 regularization with a weight decay coefficient of 1e-6. The mini batch size was 64, and sentences longer than 80 tokens were discarded. We early stopped the training process when we observed that the BLEU score of the development set converges. For testing, we self ensembled the three parameters of the best development loss, the best development BLEU, and the final parameters. Beam size was set to 100.
|System||NTCIR-CE||test 2010||test 2011||test 2012||test 2013||average|
|Multi domain w/o tags||37.32||12.57||17.40||15.02||15.96||14.97|
|Multi domain + Fine tuning||14.47||13.18||18.03||16.41||16.80||15.82|
|Mixed fine tuning||37.01||15.04||20.96||18.77||18.63||18.01|
|Mixed fine tuning w/o tags||39.67||14.47||20.53||18.10||17.97||17.43|
|Mixed fine tuning + Fine tuning||32.03||14.40||19.53||17.65||17.94||17.11|
For performance comparison, we also conducted experiments on phrase based SMT (PBSMT). We used the Moses PBSMT system [Koehn et al.2007] for all of our MT experiments. For the respective tasks, we trained 5-gram language models on the target side of the training data using the KenLM toolkit666https://github.com/kpu/kenlm/
with interpolated Kneser-Ney discounting, respectively. In all of our experiments, we used the GIZA++ toolkit777http://code.google.com/p/giza-pp for word alignment; tuning was performed by minimum error rate training [Och2003], and it was re-run for every experiment.
For both MT systems, we preprocessed the data as follows. For Chinese, we used KyotoMorph888https://bitbucket.org/msmoshen/kyotomorph-beta for segmentation, which was trained on the CTB version 5 (CTB5) and SCTB [Chu et al.2016b]. For English, we tokenized and lowercased the sentences using the tokenizer.perl script in Moses. Japanese was segmented using JUMAN999http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN [Kurohashi et al.1994].
For NMT, we further split the words into sub-words using byte pair encoding (BPE) [Sennrich et al.2016c], which has been shown to be effective for the rare word problem in NMT. Another motivation of using sub-words is making the different domains share more vocabulary, which is important especially for the resource poor domain. For the Chinese-to-English tasks, we trained two BPE models on the Chinese and English vocabularies, respectively. For the Chinese-to-Japanese tasks, we trained a joint BPE model on both of the Chinese and Japanese vocabularies, because Chinese and Japanese could share some vocabularies of Chinese characters. The number of merge operations was set to 30,000 for all the tasks.
Tables 1 and 2 show the translation results on the Chinese-to-English and Chinese-to-Japanese tasks, respectively. The entries with SMT and NMT are the PBSMT and NMT systems, respectively; others are the different methods described in Section 3. In both tables, the numbers in bold indicate the best system and all systems that were not significantly different from the best system. The significance tests were performed using the bootstrap resampling method [Koehn2004] at .
We can see that without domain adaptation, the SMT systems perform significantly better than the NMT system on the resource poor domains, i.e., IWSLT-CE and WIKI-CJ; while on the resource rich domains, i.e., NTCIR-CE and ASPEC-CJ, NMT outperforms SMT. Directly using the SMT/NMT models trained on the out-of-domain data to translate the in-domain data shows bad performance. With our proposed “Mixed fine tuning” domain adaptation method, NMT significantly outperforms SMT on the in-domain tasks.
|Multi domain w/o tags||40.78||33.74|
|Multi domain + Fine tuning||22.78||34.61|
|Mixed fine tuning||42.56||37.57|
|Mixed fine tuning w/o tags||41.86||37.23|
|Mixed fine tuning + Fine tuning||31.63||37.77|
Comparing different domain adaptation methods, “Mixed fine tuning” shows the best performance. We believe the reason for this is that “Mixed fine tuning” can address the over-fitting problem of “Fine tuning.” We observed that while “Fine tuning” overfits quickly after only 1 epoch of training, “Mixed fine tuning” only slightly overfits until covergence. In addition, “Mixed fine tuning” does not worsen the quality of out-of-domain translations, while “Fine tuning” and “Multi domain” do. One shortcoming of “Mixed fine tuning” is that compared to “fine tuning,” it took a longer time for the fine tuning process, as the time until convergence is essentially proportional to the size of the data used for fine tuning.
“Multi domain” performs either as well as (IWSLT-CE) or worse than (WIKI-CJ) “Fine tuning,” but “Mixed fine tuning” performs either significantly better than (IWSLT-CE) or is comparable to (WIKI-CJ) “Fine tuning.” We believe the performance difference between the two tasks is due to their unique characteristics. As WIKI-CJ data is of relatively poorer quality, mixing it with out-of-domain data does not have the same level of positive effects as those obtained by the IWSLT-CE data.
The domain tags are helpful for both “Multi domain” and “Mixed fine tuning.” Essentially, further fine tuning on in-domain data does not help for both “Multi domain” and “Mixed fine tuning.” We believe the reason for this is that the “Multi domain” and “Mixed fine tuning” methods already utilize the in-domain data used for fine tuning.
In this paper, we proposed a novel domain adaptation method named “mixed fine tuning” for NMT. We empirically compared our proposed method against fine tuning and multi domain methods, and have shown that it is effective but is sensitive to the quality of the in-domain data used.
In the future, we plan to incorporate an RNN model into our current architecture to leverage abundant in-domain monolingual corpora. We also plan on exploring the effects of synthetic data by back translating large in-domain monolingual corpora.
- [Bahdanau et al.2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, USA, May. International Conference on Learning Representations.
- [Cettolo et al.2015] M Cettolo, J Niehues, S Stüker, L Bentivogli, R Cattoni, and M Federico. 2015. The iwslt 2015 evaluation campaign. In Proceedings of the Twelfth International Workshop on Spoken Language Translation (IWSLT).
[Cho et al.2014]
Kyunghyun Cho, Bart van Merriënboer, ÇaÄlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio.
Learning phrase representations using rnn encoder–decoder for
statistical machine translation.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, Doha, Qatar, October. Association for Computational Linguistics.
[Chu et al.2016a]
Chenhui Chu, Raj Dabre, and Sadao Kurohashi.
Parallel sentence extraction from comparable corpora with neural network features.In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, may. European Language Resources Association (ELRA).
- [Chu et al.2016b] Chenhui Chu, Toshiaki Nakazawa, Daisuke Kawahara, and Sadao Kurohashi. 2016b. SCTB: A Chinese treebank in scientific domain. In Proceedings of the 12th Workshop on Asian Language Resources (ALR12), pages 59–67, Osaka, Japan, December. The COLING 2016 Organizing Committee.
- [Cromieres et al.2016] Fabien Cromieres, Chenhui Chu, Toshiaki Nakazawa, and Sadao Kurohashi. 2016. Kyoto university participation to wat 2016. In Proceedings of the 3rd Workshop on Asian Translation (WAT2016), pages 166–174, Osaka, Japan, December. The COLING 2016 Organizing Committee.
- [Freitag and Al-Onaizan2016] Markus Freitag and Yaser Al-Onaizan. 2016. Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897.
- [Goto et al.2013] Isao Goto, Ka-Po Chow, Bin Lu, Eiichiro Sumita, and Benjamin K. Tsou. 2013. Overview of the patent machine translation task at the ntcir-10 workshop. In Proceedings of the 10th NTCIR Conference, pages 260–286, Tokyo, Japan, June. National Institute of Informatics (NII).
- [Gülçehre et al.2015] Çaglar Gülçehre, Orhan Firat, Kelvin Xu, Kyunghyun Cho, Loïc Barrault, Huei-Chi Lin, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2015. On using monolingual corpora in neural machine translation. CoRR, abs/1503.03535.
- [Kobus et al.2016] Catherine Kobus, Josep Crego, and Jean Senellart. 2016. Domain control for neural machine translation. arXiv preprint arXiv:1612.06140.
- [Koehn et al.2007] Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June. Association for Computational Linguistics.
- [Koehn2004] Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP 2004, pages 388–395, Barcelona, Spain, July. Association for Computational Linguistics.
- [Kurohashi et al.1994] Sadao Kurohashi, Toshihisa Nakamura, Yuji Matsumoto, and Makoto Nagao. 1994. Improvements of Japanese morphological analyzer JUMAN. In Proceedings of the International Workshop on Sharable Natural Language, pages 22–28.
- [Luong and Manning2015] Minh-Thang Luong and Christopher D Manning. 2015. Stanford neural machine translation systems for spoken language domains. In Proceedings of the 12th International Workshop on Spoken Language Translation, pages 76–79, Da Nang, Vietnam, December.
- [Nakazawa et al.2015] Toshiaki Nakazawa, Hideya Mino, Isao Goto, Graham Neubig, Sadao Kurohashi, and Eiichiro Sumita. 2015. Overview of the 2nd Workshop on Asian Translation. In Proceedings of the 2nd Workshop on Asian Translation (WAT2015), pages 1–28, Kyoto, Japan, October.
- [Nakazawa et al.2016] Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, and Hitoshi Isahara. 2016. Aspec: Asian scientific paper excerpt corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).
- [Och2003] Franz Josef Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July. Association for Computational Linguistics.
- [Sennrich et al.2016a] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Controlling politeness in neural machine translation via side constraints. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 35–40, San Diego, California, June. Association for Computational Linguistics.
- [Sennrich et al.2016b] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 86–96, Berlin, Germany, August. Association for Computational Linguistics.
- [Sennrich et al.2016c] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016c. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, August. Association for Computational Linguistics.
- [Servan et al.2016] Christophe Servan, Josep Crego, and Jean Senellart. 2016. Domain specialization: a post-training domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06141.
- [Sutskever et al.2014] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pages 3104–3112, Cambridge, MA, USA. MIT Press.
- [Zoph et al.2016] Barret Zoph, Deniz Yuret, Jonathan May, and Kevin Knight. 2016. Transfer learning for low-resource neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 1568–1575.