Domain adaptation (DA) techniques in machine translation (MT) have been widely studied. For statistical machine translation (SMT), several DA methods have been proposed to overcome the lack of domain-specific data. For example, self-training [1, 2] uses an MT system trained on a general corpus to translate in-domain monolingual data into additional training sentences. Topic-based DA [3, 4] employs topic-based translation models to adapt to different scenarios. Data selection approaches [5, 6, 7, 8] first score the out-of-domain data using language models trained on both domain-specific and non-domain-specific monolingual corpora, then rank and select the out-of-domain data that are similar to the in-domain data. Instance weighting methods [9, 10] score each sentence/domain using statistical rules, then train the MT models with these sentence/domain-level scores.
Neural machine translation (NMT) has become state-of-the-art in recent years [11, 12, 13, 14, 15], and there are several research works on NMT domain adaptation. For example, back-translation methods [16] use an NMT model trained on the reverse direction to translate domain-specific monolingual data into additional training sentences. Fast DA approaches [13, 17] train a base model on mixed in-domain and out-of-domain datasets, then fine-tune it on in-domain datasets. Mixed fine-tuning [18] combines fine-tuning and multi-domain NMT. Similar to instance weighting in SMT, sentence/domain weighting methods [19, 20] can also be used for NMT domain adaptation based on different objectives. DA with meta information [21] trains topic-aware models using domain-specific tags for the decoder. The chunk weighting method [22] describes a way of selecting and integrating positive partial feedback from model-generated sentences into NMT training.
In this paper, we propose word-level weighting for NMT domain adaptation. We compute the word weights in out-of-domain datasets based on the logarithm difference of probability according to a domain-specific language model and non-domain-specific language model followed by smoothing and binary quantization. This gives the in-domain words in out-of-domain sentences higher weights and biases the NMT model to generate more in-domain-like words. Thus, the work presented in this paper can be viewed as a generalization of instance weighting. To remove noise in the word weights, we study the effectiveness of using smoothing methods. Specifically, a weighted moving average filter is proposed to apply smoothing to the computed word scores with its nearby words.
Experiments on En→Zh e-commerce domain translation tasks show that: 1) a domain-adapted model with smoothed word weights gains a significant improvement over one with non-smoothed weights; 2) continuing training the model with computed word weights improves translation results significantly compared to continuing training without word weights; and 3) compared to directly fine-tuning on in-domain datasets, fine-tuning after pre-training with word weights improves translation performance on the in-domain e-commerce test set.
The rest of the paper is structured as follows. The approach and model we use are described in Section 2, where we first recap the NMT objective and then present the details of the proposed word-level weighting approach. Experimental results and discussions are presented in Section 3 and Section 4, followed by conclusions and outlook in Section 5.
We present the word weighting objective for NMT before discussing how to generate the weights.
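In this work we use an attention-based neural machine translation model [11, 12, 14] for our experiments. Given a parallel bilingual dataset $D$, the NMT model is trained to maximize the conditional likelihood $L$ of a target sequence $e_1^I: e_1, \dots, e_I$ given a source sequence $f_1^J: f_1, \dots, f_J$:

$$L = \sum_{(f_1^J,\, e_1^I) \in D} \; \sum_{i=1}^{I} \log p(e_i \mid e_1^{i-1}, f_1^J) \qquad (1)$$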
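Training objective (1) can be simply modified to a word-level loss with word weights $w_i$:

$$L = \sum_{(f_1^J,\, e_1^I) \in D} \; \sum_{i=1}^{I} w_i \log p(e_i \mid e_1^{i-1}, f_1^J) \qquad (2)$$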
The word weights $w_i$ for a target sequence can be 0 or 1. We set $w_i = 1$ for all in-domain sentences. For out-of-domain sentences, $w_i = 1$ means the word at target position $i$ in the out-of-domain sentence is related to the in-domain datasets (selected), and $w_i = 0$ means it is not.
Our training objective (2) can be seen as a generalization of the original training objective (1) and of instance weighting methods [19, 20]. The original loss (1) sets the word weight $w_i = 1$ for every word in all sentences. The instance-level loss can be expressed as setting $w_i = w$ for all positions $i$ of a given target sentence, where $w$ is the weight for the sentence or the domain. Our training objective is similar to that of [22]; however, instead of generating chunk-based user feedback for model predictions, we compute the word weights using language models trained on real target data.
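The relation between the unweighted, word-weighted, and instance-weighted losses can be illustrated with a minimal sketch for a single target sentence. The function name and the probability values below are hypothetical and chosen only for illustration; in a real system the per-token probabilities would come from the NMT model.

```python
import math


def word_weighted_log_likelihood(token_probs, word_weights):
    """Return sum_i w_i * log p_i for one target sentence.

    token_probs: model probability of each reference target token.
    word_weights: one weight per target token (0 or 1 for binary weights).
    """
    if len(token_probs) != len(word_weights):
        raise ValueError("need exactly one weight per target token")
    return sum(w * math.log(p) for p, w in zip(token_probs, word_weights))


# Hypothetical probabilities for a four-token target sentence.
probs = [0.5, 0.25, 0.8, 0.1]

# All weights equal to 1 recovers the unweighted sentence log-likelihood.
full = word_weighted_log_likelihood(probs, [1, 1, 1, 1])

# A binary mask keeps only the tokens judged to be in-domain-like.
masked = word_weighted_log_likelihood(probs, [1, 1, 0, 0])

# One constant weight for every token corresponds to instance weighting.
sentence_weighted = word_weighted_log_likelihood(probs, [0.5] * 4)
```

With a binary mask, tokens with weight 0 contribute nothing to the loss and hence no gradient, which is how unselected out-of-domain words are ignored during training.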
2.2 Approaches to the objective
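To compute discriminative word weights, we first follow the data selection methods in SMT. To state this formally, let $C_{\mathrm{in}}$ be the domain-specific corpus, $C_{\mathrm{out}}$ be the non-domain-specific corpus, and $e_i$ be the word in out-of-domain sentences at target position $i$. We denote by $P_{\mathrm{in}}(e_i \mid e_1^{i-1})$ the per-word probability conditioned on previous words, according to a language model trained on $C_{\mathrm{in}}$. Similarly, we denote by $P_{\mathrm{out}}(e_i \mid e_1^{i-1})$ the per-word probability conditioned on previous words according to a language model trained on $C_{\mathrm{out}}$. We can estimate $P_{\mathrm{in}}$ and $P_{\mathrm{out}}$ by training language models on $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$, separately. Therefore, the word scores can be computed in the log domain:

$$s_i = \log P_{\mathrm{in}}(e_i \mid e_1^{i-1}) - \log P_{\mathrm{out}}(e_i \mid e_1^{i-1}) \qquad (3)$$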
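As a rough illustration of scoring words by the log-probability difference between a domain-specific and a non-domain-specific language model, the sketch below uses a toy add-one-smoothed unigram model. This is an assumption made to keep the example self-contained: the unigram model ignores the preceding words, whereas the approach described here conditions on them. The class name, function name, and example sentences are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one-smoothed unigram model; a toy stand-in for a real n-gram LM."""

    def __init__(self, sentences):
        self.counts = Counter(tok for s in sentences for tok in s.split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def log_prob(self, word):
        return math.log((self.counts[word] + 1) / (self.total + self.vocab_size))


def word_scores(sentence, lm_in, lm_out):
    """Score each word as log P_in(word) - log P_out(word)."""
    return [lm_in.log_prob(w) - lm_out.log_prob(w) for w in sentence.split()]


# Tiny hypothetical corpora, for illustration only.
in_domain = ["leather phone case", "phone case with card holder"]
general = ["the council adopted the resolution", "the case was closed"]

lm_in = UnigramLM(in_domain)
lm_out = UnigramLM(general)
scores = word_scores("the phone case", lm_in, lm_out)
```

Words that are more likely under the domain-specific model receive positive scores, while words typical of the general corpus receive negative ones.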
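Since the value of the word score $s_i$ is strongly correlated with the neighboring words, it is worth smoothing the word scores before binary thresholding in order to remove noise. Hence, a weighted moving average kernel

$$\hat{s}_i = \frac{\sum_{j=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} k_j \, s_{i+j}}{\sum_{j=-\lfloor n/2 \rfloor}^{\lfloor n/2 \rfloor} k_j} \qquad (4)$$

is applied to smooth the word score $s_i$ at each target position $i$. Here $n$ is the kernel size and $k_j$ are the values of the kernel for $j = -\lfloor n/2 \rfloor, \dots, \lfloor n/2 \rfloor$. In our experiments, we heuristically set the values of the kernel based on either a mean average, in which all $k_j$ are equal, or a Gaussian distribution with $k_j \propto \exp\left(-j^2 / (2\sigma^2)\right)$, where we set $\sigma^2$ to be the global variance of the word scores.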
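A minimal sketch of smoothing word scores with a weighted moving average is given below, with one mean kernel and one Gaussian kernel. Several details are assumptions rather than part of the description above: the kernel size is required to be odd, the window is centered on the current position, and at sentence boundaries the kernel is renormalized over the positions that exist. The example scores are hypothetical.

```python
import math
import statistics


def mean_kernel(size):
    """Uniform kernel: every position in the window gets the same value."""
    if size % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    return [1.0 / size] * size


def gaussian_kernel(size, variance):
    """Kernel values proportional to exp(-j^2 / (2 * variance)), normalized."""
    if size % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel size")
    half = size // 2
    raw = [math.exp(-(j * j) / (2.0 * variance)) for j in range(-half, half + 1)]
    norm = sum(raw)
    return [v / norm for v in raw]


def smooth(scores, kernel):
    """Weighted moving average, renormalized at the sentence boundaries."""
    half = len(kernel) // 2
    smoothed = []
    for i in range(len(scores)):
        num = 0.0
        den = 0.0
        for offset, k in zip(range(-half, half + 1), kernel):
            pos = i + offset
            if 0 <= pos < len(scores):
                num += k * scores[pos]
                den += k
        smoothed.append(num / den)
    return smoothed


# Hypothetical raw word scores for one target sentence.
raw_scores = [-0.8, 2.1, -0.5, 0.9, 1.2, 1.0, -1.1]

# Variance of the scores; a real setup would compute it over the whole corpus.
variance = statistics.pvariance(raw_scores)

smoothed_mean = smooth(raw_scores, mean_kernel(3))
smoothed_gauss = smooth(raw_scores, gaussian_kernel(5, variance))
```

The isolated high score at the second position is pulled toward its low-scoring neighbors, which is the intended noise-reduction effect. Note that `gaussian_kernel` would divide by zero if all scores were identical, since the variance would then be 0.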
The special case of sentence-level weights can be expressed by assigning every word of a target sentence the same weight, derived from the average of the smoothed word scores over that sentence. In this case, the training objective (2) becomes equivalent to the sentence weighting method of [19, 20] with an appropriately modified scoring function.
After smoothing the word scores, we finally binarize the smoothed word scores $\hat{s}_i$ based on a threshold $\tau$:

$$w_i = \begin{cases} 1 & \text{if } \hat{s}_i > \tau \\ 0 & \text{otherwise} \end{cases} \qquad (5)$$

In our experiments we set the threshold $\tau$ to a fixed value and only keep the words above the threshold. This means we select a word if $w_i = 1$ and do not select it if $w_i = 0$. Because the word weights take a binary form during continued training, the selected words are good candidates for what we want to extract from the out-of-domain corpus. Note that the word weights are precomputed offline and then used during training. In general, a word weight can be set to any real value, depending on the way of thresholding.
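The thresholding step, together with the sentence-level special case, can be sketched as follows. The threshold value of 0.0 used in the example is a hypothetical choice for illustration, as are the function names and the scores.

```python
def binarize(smoothed_scores, threshold):
    """Word-level weights: 1 for scores strictly above the threshold, else 0."""
    return [1 if s > threshold else 0 for s in smoothed_scores]


def sentence_level_weights(smoothed_scores, threshold):
    """Sentence-level special case: every word shares one binary weight,
    decided by the average smoothed score of the sentence."""
    if not smoothed_scores:
        return []
    average = sum(smoothed_scores) / len(smoothed_scores)
    weight = 1 if average > threshold else 0
    return [weight] * len(smoothed_scores)


# Hypothetical smoothed scores for one out-of-domain target sentence.
smoothed = [-0.4, 0.3, 0.8, 0.6, -0.1]

word_weights = binarize(smoothed, threshold=0.0)
sent_weights = sentence_level_weights(smoothed, threshold=0.0)
```

Because both functions depend only on the language model scores, their output can be computed once offline and stored alongside the target side of the training corpus.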
2.3 Chunk-based weighting
Considering that the selected words in a target sentence might still be noisy, and that word-level selection can pick isolated single words, we alternatively experimented with selecting only the part (chunk) of the target sentence that has the longest consecutive weights (LCW) with $w_i = 1$. For each target sentence, we pick only one chunk and set all other weights to zero. See Figure 1 for an example. Because the surrounding context is also selected, the chunk is less likely to be noise. If there are multiple such chunks with the same length in the sentence, we simply randomly sample one of them. We found that in practice the chunk-based approach performs slightly better than word-level weighting.
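The chunk selection described above can be sketched as follows: find all maximal runs of weights equal to 1, keep one of the longest runs (sampled at random on ties), and set every other weight to zero. The function name and the optional `rng` parameter are hypothetical; the latter is only there to make the tie-breaking reproducible.

```python
import random


def longest_consecutive_chunk(weights, rng=random):
    """Keep only one longest run of 1-weights; zero out everything else."""
    runs = []  # half-open (start, end) index pairs of maximal runs of 1s
    start = None
    for i, w in enumerate(weights):
        if w == 1 and start is None:
            start = i
        elif w != 1 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(weights)))

    if not runs:
        return [0] * len(weights)

    best_length = max(end - begin for begin, end in runs)
    candidates = [run for run in runs if run[1] - run[0] == best_length]
    begin, end = rng.choice(candidates)  # random choice among ties
    return [1 if begin <= i < end else 0 for i in range(len(weights))]


# Hypothetical word weights for one target sentence.
chunk_weights = longest_consecutive_chunk([1, 0, 1, 1, 1, 0, 1, 1])
```

Passing a seeded `random.Random` instance as `rng` makes the selection deterministic when several chunks share the maximal length.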
In this section, we conduct a series of experiments to study how well NMT performs when word-level weights are given for out-of-domain training data. We also study the effectiveness of the smoothing methods.
3.1 Datasets and data processing
We report the results on our in-house English-to-Chinese e-commerce item descriptions dataset. Item descriptions are provided by private sellers and, like any user-generated content, may contain ungrammatical sentences, spelling errors, and other types of noise. We first segmented the Chinese sentences with the Stanford Chinese word segmentation tool [23] and tokenized the English sentences with the scripts provided in Moses [24]. On both languages, we use subword units based on byte-pair encoding (BPE) [25] with 42,000 subword symbols learned separately for each language. For En-Zh we have M in-domain e-commerce sentence pairs and M sampled out-of-domain sentence pairs (UN, subtitles, TAUS data collections, etc.) that have significant n-gram overlap with the item description data. We validate our models on an in-house development set consisting of 3173 item descriptions, and evaluate on an in-house test set of 739 item descriptions using case-insensitive character-level BLEU [26] and TER [27] with in-house tools. For the development and test sets, a single reference translation is used. Statistics of the data sets are reported in Table 1.
|Data set||e-commerce + out-of-domain|
|Dev||Sentences||3173 (item descriptions)|
|Test||Sentences||739 (item descriptions)|
To compute our word weights we train a domain-specific 4-gram language model and a non-domain-specific 4-gram language model using KenLM [28]. For the domain-specific language model, we collected domain-specific monolingual data from an e-commerce website, resulting in M sentences. For the non-domain-specific language model, we use sampled LDC Chinese Gigaword (LDC Catalog No.: LDC2003T09) with M sentences. It should be noted that we train our language models on the word level. In order to score a BPE-level corpus with such a language model, we score its words and copy each word's score to all of its subword units. After the word scores are computed, we then smooth them via a Gaussian kernel with window size . We choose window size considering that the language model is trained based on sequences of four words. We observed similar results with different window sizes, which is discussed in Section 4. Finally, we binarize the smoothed word scores into binary word weights by setting the threshold . The computed word weights are applied to the target side of out-of-domain sentences during the phase of continuing training. In order to get better translation results, we first trained the baseline model on mixed in-domain and out-of-domain data according to training objective (1), where no weights are used. We start our experiments by continuing training from this baseline model.
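Copying word-level scores to subword units can be sketched as below. The sketch assumes the common convention in which every non-final subword of a word ends with the continuation marker `@@`; this marker, the function name, and the example tokens are assumptions for illustration, not details given in the text.

```python
def copy_scores_to_subwords(subword_tokens, word_scores, continuation="@@"):
    """Give every subword unit the score of the word it belongs to.

    A token ending with the continuation marker is a non-final piece of a
    word; a token without it closes the current word.
    """
    subword_scores = []
    word_index = 0
    for token in subword_tokens:
        if word_index >= len(word_scores):
            raise ValueError("BPE sequence contains more words than scores")
        subword_scores.append(word_scores[word_index])
        if not token.endswith(continuation):
            word_index += 1  # the current word is complete
    if word_index != len(word_scores):
        raise ValueError("number of words and number of scores differ")
    return subword_scores


# Hypothetical example: two words, the first split into three subword units.
tokens = ["non@@", "-@@", "spill", "spout"]
scores_per_subword = copy_scores_to_subwords(tokens, [0.4, -1.2])
```

Since all pieces of a word share one score, smoothing and thresholding never select only part of a word.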
We implemented our NMT model using the TensorFlow library [29]. The encoder is a bidirectional LSTM of size 512 and the decoder is an LSTM with 2 layers of the same size. All the weight parameters are initialized uniformly in . We set dropout on RNN inputs with dropping probability . We train the networks with batch size using SGD with initial learning rate and gradually decaying to after the initial epochs.
|Corpus||Sent. count||Token count|
Statistics of the out-of-domain sentences/tokens selection after applying different types of weights are summarized in Table 2. Before the selection, the number of out-of-domain sentences is M and the number of tokens is M. When sentence-level weights are used, the sentences with are ignored, resulting in the number of remaining sentences/tokens around M and M, respectively. When word-level weights are used, there are sentences where all word weights in the sentences are equal to zero. After removing these sentences, around M sentences are preserved and the number of selected tokens with word weights = is around M. Given computed word weights, we alternatively choose only the chunk with the longest consecutive weights (LCW) where , resulting in chunk-level weights with the selected number of tokens further reduced to M.
We train a baseline NMT model on mixed in-domain and out-of-domain data with the objective defined in Eq. 1 for epochs. The data is mixed completely (mixed M in-domain e-commerce and M sampled out-of-domain sentence pairs) while training the baseline model. The baseline model initialized by a mix of in-domain/out-of-domain data can be regarded as a kind of “warm start”. We also tried training a baseline on out-of-domain data only and observed a slightly worse result after fine-tuning on in-domain data (0.5 BLEU). Hence, we use the baseline model trained on a mix of in-domain/out-of-domain data in the following experiments. Given the baseline model, we then either directly fine-tune on in-domain data for another epochs, or first continue training on the mixed data with sentence/chunk/word weights for epochs and then fine-tune on in-domain data for epochs. The model is saved after each epoch. We take the model which gives the best result on our development set for evaluation. Note that we always set the word weights to 1 for our in-domain dataset.
|No.||System description||BLEU [%]||TER [%]|
|2||1 + continue training without word weights||24.31||61.69|
|3||1 + continue training with sentence weights||25.79||60.82|
|4||1 + continue training with word weights||26.14||60.34|
|5||1 + continue training with chunk weights||26.42||60.10|
|6||1 + fine-tuning on in-domain||26.06||59.93|
|7||5 + fine-tuning on in-domain||27.30||58.29|
In Table 3, we show the effect of different types of weights on translation performance. First, the baseline trained on mixed in-domain and out-of-domain datasets gives BLEU and TER, respectively. Directly fine-tuning on in-domain dataset already improves the model due to the bias of the model towards in-domain data.
Continuing training on mixed datasets with the previous objective defined in Eq. 1 shows insignificant changes in terms of BLEU and TER. However, introducing sentence-level weights improves the model to 25.79% BLEU and 60.82% TER (system 3 in Table 3). In contrast to continuing training without weights, sentence-level weights are generated as described in Section 2.2, where all word weights of a sentence are set to the same sentence weight . We set the threshold equal to and keep the sentences with weights above the threshold. The result from sentence-level feedback suggests that mining good out-of-domain sentences which are similar to in-domain datasets and dissimilar to out-of-domain datasets benefits model translation towards in-domain-like sentences even without fine-tuning on in-domain datasets.
The use of word-level weights improves the baseline model even more, to 26.14% BLEU and 60.34% TER (system 4 in Table 3). In this approach, the number of selected tokens is drastically reduced to M from M tokens, nearly drop in number of tokens with improved translation performance. Word-level weights also outperform sentence-level weights by 0.35% in BLEU and 0.48% in TER. This can be explained by the fact that each word in a sentence is given its own similarity to the in-domain datasets. Since sentence-level weights assign the same weight to all words in a sentence, even though some of those words might not be related to the in-domain corpus, word-level weights are more accurate and effective.
Finally, chunk-level weights are generated from our word-level weights based on LCW. Here we aim to train the domain-adapted model from longer consecutive segments rather than single selected words. On top of word-level weights, this improves by another 0.28% BLEU absolute and 0.24% TER. Out-of-domain sentences can be split into chunks which can be related to the in-domain data and can be translated independently in terms of the context. The selection of a consecutive chunk with in-domain-like context can positively affect the training towards a domain-adapted model. By focusing on the in-domain-related and out-of-domain-unrelated parts, word/chunk-level weights can effectively reduce the unnecessary noise in the out-of-domain training data. Compared to continuing training without word weights, we are able to further reduce the corpus by tokens (M vs. M selected tokens), resulting in an improvement of 2.11% BLEU absolute and 1.59% TER. It should also be noted that with a similar number of tokens (M vs. M), chunk-level weights outperform sentence-level weights by 0.63% BLEU absolute and 0.72% TER.
Next, we further fine-tune the model trained with chunk-level weights and obtain further improvements of 0.88% BLEU absolute and 1.81% TER. Compared to directly fine-tuning the baseline, continuing training the model with chunk-level weights and then fine-tuning improves translation results from 26.06% to 27.30% BLEU and from 59.93% to 58.29% TER.
|System||BLEU [%]||TER [%]|
|+w.w. without smooth.||21.38||66.25|
|+w.w. (mean smooth.)||25.99||60.70|
|+w.w. (gauss. smooth.)||26.14||60.34|
Here, “gauss. smooth.” indicates using a normally distributed filter before thresholding. The approaches regarding different smoothing methods are described in Section 2.2.
Results from the study on the effect of using different smoothing methods are shown in Table 4. The word weights generated without smoothing, i.e., by thresholding the raw word scores directly, lead to poor translation quality of 21.38% BLEU and 66.25% TER. We need to smooth the word scores before thresholding because their values are noisy. If isolated words that have higher scores than the surrounding text are selected, this may cause a rare-vocabulary problem after training.
The results from word weights computed with the mean average filter and the normally distributed filter are relatively close: 25.99% vs. 26.14% BLEU and 60.70% vs. 60.34% TER. These results are obtained via a filter with window size . In practice, we also tried setting window size and , but did not observe different results. We found that the surrounding word scores have to be considered for smoothing in order to make the word weights less noisy and to represent the similarity to the in-domain/out-of-domain data more precisely.
Additionally, we also experimented with randomly selecting words in the out-of-domain sentences with binary mask. However, we observed a drop in the translation accuracy.
In Table 5, we show an example for which the system trained with word weights produces a better translation. The English sentence is “non-spill spout with patented valve”. The word “spout” is rare in our data, appearing in the out-of-domain training sentences only once. The Chinese side of this training example can be seen in Figure 1 together with the weights assigned to the individual words by our method. When smoothing is applied, isolated Chinese words such as “空气” (“air”) are removed. With the longest consecutive weights (LCW) method, the only remaining chunk is “防/溢出/喷口/内” (“inside the non-spill spout”), which is related to our in-domain data. The system with word weights is then trained only on this chunk on the target side, while the baseline model is trained on the entire sentence and generates inappropriate translations.
The domain adaptation techniques (sentence-level/chunk-level/word-level) introduced in this paper are all derived from word weights generation. They aim to select out-of-domain sentences/chunks/words which are more related to in-domain corpus and unrelated to out-of-domain corpus. The word weights are computed prior to system tuning via the logarithm difference of LM probability scoring and are then used for tuning the sequence-to-sequence model. By measuring domain similarity with external criteria such as LM, this kind of out-of-domain data selection is able to highlight the in-domain-related and out-of-domain-unrelated parts and leads to less variation and errors in our e-commerce domain adaptation. In addition, the selected out-of-domain segments have to be smoothed in order to reduce noise.
|Baseline||带 专 利 阀 的 防 溢 出 溅 漏|
|+ word weights||带 专 利 阀 的 防 溢 出 喷 口|
|Reference||带 有 专 利 阀 门 的 防 溢 口|
In this work, we generate word-level weights by calculating the logarithm difference of the probability of two external language models for domain adaptation. This approach better selects the out-of-domain segments related to e-commerce domain, and requires fewer tokens for training. We experimented with continuing training models with sentence/chunk/word weights and show that they all give translation improvement in terms of BLEU and TER compared to continuing training without word weights. Experiments on our in-house English-Chinese datasets also show that continuing training with word weights then fine-tuning improves results over directly fine-tuning on baseline model.
In the future, with the computed word weights as the initial parameters, we want to devise strategies to make online domain adaptation possible by dynamically updating word weights during training, which could in turn lead the in-domain data translation to better match its references.
-  N. Ueffing and H. Ney, “Word-level confidence estimation for machine translation using phrase-based translation models,” in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, ser. HLT ’05. Stroudsburg, PA, USA: Association for Computational Linguistics, 2005, pp. 763–770. [Online]. Available: https://doi.org/10.3115/1220575.1220671
-  H. Schwenk, “Investigations on large-scale lightly-supervised training for statistical machine translation,” in IWSLT, 2008.
-  Y.-C. Tam, I. R. Lane, and T. Schultz, “Bilingual-lsa based lm adaptation for spoken language translation,” in ACL, 2007.
-  S. Hewavitharana, D. Mehay, S. Ananthakrishnan, and P. Natarajan, “Incremental topic-based translation model adaptation for conversational spoken language translation,” in ACL, 2013.
-  R. C. Moore and W. Lewis, “Intelligent selection of language model training data,” in Proceedings of the ACL 2010 Conference Short Papers, ser. ACLShort ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 220–224. [Online]. Available: http://dl.acm.org/citation.cfm?id=1858842.1858883
-  A. Axelrod, X. He, and J. Gao, “Domain adaptation via pseudo in-domain data selection,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 355–362. [Online]. Available: http://dl.acm.org/citation.cfm?id=2145432.2145474
-  K. Duh, G. Neubig, K. Sudoh, and H. Tsukada, “Adaptation data selection using neural language models: Experiments in machine translation,” in ACL, 2013.
-  N. Durrani, H. Sajjad, S. Joty, A. Abdelali, and S. Vogel, “Using joint models for domain adaptation in statistical machine translation,” in MT Summit, 2015.
-  S. Matsoukas, A.-V. I. Rosti, and B. Zhang, “Discriminative corpus weight estimation for machine translation,” in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2 - Volume 2, ser. EMNLP ’09. Stroudsburg, PA, USA: Association for Computational Linguistics, 2009, pp. 708–717. [Online]. Available: http://dl.acm.org/citation.cfm?id=1699571.1699605
-  G. Foster, C. Goutte, and R. Kuhn, “Discriminative instance weighting for domain adaptation in statistical machine translation,” in Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, ser. EMNLP ’10. Stroudsburg, PA, USA: Association for Computational Linguistics, 2010, pp. 451–459. [Online]. Available: http://dl.acm.org/citation.cfm?id=1870658.1870702
-  I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, ser. NIPS’14. Cambridge, MA, USA: MIT Press, 2014, pp. 3104–3112. [Online]. Available: http://dl.acm.org/citation.cfm?id=2969033.2969173
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” ICLR, 2015.
-  M.-T. Luong and C. D. Manning, “Stanford neural machine translation systems for spoken language domains,” in Proceedings of the International Workshop on Spoken Language Translation : December 3-4, 2015, Da Nang, Vietnam / Edited by Marcello Federico, Sebastian Stüker, Jan Niehues. International Workshop on Spoken Language Translation, Da Nang (Vietnam), 3 Dec 2015 - 4 Dec 2015, Dec 2015.
-  Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean, “Google’s neural machine translation system: Bridging the gap between human and machine translation,” CoRR, vol. abs/1609.08144, 2016. [Online]. Available: http://arxiv.org/abs/1609.08144
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
-  R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation models with monolingual data,” in ACL, 2016.
-  M. Freitag and Y. Al-Onaizan, “Fast domain adaptation for neural machine translation,” CoRR, vol. abs/1612.06897, 2016.
-  C. Chu, R. Dabre, and S. Kurohashi, “An empirical comparison of domain adaptation methods for neural machine translation,” in ACL, 2017.
-  B. Chen, C. Cherry, G. Foster, and S. Larkin, “Cost weighting for neural machine translation domain adaptation,” in Proceedings of the First Workshop on Neural Machine Translation. Association for Computational Linguistics, 2017, pp. 40–46. [Online]. Available: http://aclweb.org/anthology/W17-3205
-  R. Wang, M. Utiyama, L. Liu, K. Chen, and E. Sumita, “Instance weighting for neural machine translation domain adaptation,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2017, pp. 1482–1488. [Online]. Available: http://aclweb.org/anthology/D17-1155
-  S. Khadivi, P. Wilken, L. Dahlmann, and E. Matusov, “Neural and statistical methods for leveraging meta-information in machine translation,” in MT Summit, 2017.
-  P. Petrushkov, S. Khadivi, and E. Matusov, “Learning from chunk-based feedback in neural machine translation,” in ACL, 2018.
-  P.-C. Chang, M. Galley, and C. D. Manning, “Optimizing chinese word segmentation for machine translation performance,” in Proceedings of the Third Workshop on Statistical Machine Translation, ser. StatMT ’08. Stroudsburg, PA, USA: Association for Computational Linguistics, 2008, pp. 224–232. [Online]. Available: http://dl.acm.org/citation.cfm?id=1626394.1626430
-  P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses: Open source toolkit for statistical machine translation,” in Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL ’07. Stroudsburg, PA, USA: Association for Computational Linguistics, 2007, pp. 177–180. [Online]. Available: http://dl.acm.org/citation.cfm?id=1557769.1557821
-  R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” CoRR, vol. abs/1508.07909, 2016.
-  K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: A method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ser. ACL ’02. Stroudsburg, PA, USA: Association for Computational Linguistics, 2002, pp. 311–318. [Online]. Available: https://doi.org/10.3115/1073083.1073135
-  M. Snover, B. J. Dorr, R. F. Schwartz, and L. Micciulla, “A study of translation edit rate with targeted human annotation,” 2006.
-  K. Heafield, “Kenlm: Faster and smaller language model queries,” in Proceedings of the Sixth Workshop on Statistical Machine Translation, ser. WMT ’11. Stroudsburg, PA, USA: Association for Computational Linguistics, 2011, pp. 187–197. [Online]. Available: http://dl.acm.org/citation.cfm?id=2132960.2132986
-  M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “Tensorflow: A system for large-scale machine learning,” in Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, ser. OSDI’16. Berkeley, CA, USA: USENIX Association, 2016, pp. 265–283. [Online]. Available: http://dl.acm.org/citation.cfm?id=3026877.3026899