Iterative Dual Domain Adaptation for Neural Machine Translation

12/16/2019 · Jiali Zeng et al. · Xiamen University, Tsinghua University

Previous studies on domain adaptation for neural machine translation (NMT) mainly focus on one-pass transfer of out-of-domain translation knowledge to the in-domain NMT model. In this paper, we argue that such a strategy fails to fully extract the domain-shared translation knowledge, and that repeatedly utilizing corpora of different domains can lead to better distillation of domain-shared translation knowledge. To this end, we propose an iterative dual domain adaptation framework for NMT. Specifically, we first pre-train in-domain and out-of-domain NMT models on their respective training corpora, and then iteratively perform bidirectional translation knowledge transfer (from in-domain to out-of-domain and then vice versa) based on knowledge distillation until the in-domain NMT model converges. Furthermore, we extend the proposed framework to the scenario of multiple out-of-domain training corpora, where the above-mentioned transfer is performed sequentially between the in-domain and each out-of-domain NMT model in ascending order of their domain similarities. Empirical results on Chinese-English and English-German translation tasks demonstrate the effectiveness of our framework.


1 Introduction

Currently, neural machine translation (NMT) has become dominant in the machine translation community due to its excellent performance Bahdanau:ICLR2015; Wu:Arxiv2016; Vaswani:NIPS2017. With the development of NMT, prevailing NMT models have become increasingly complex, with large numbers of parameters that often require abundant corpora for effective training. However, for translation tasks in most domains, domain-specific parallel sentences are scarce. If we only use domain-specific data to train the NMT model for such a domain, the performance of the resulting model is usually unsatisfactory. Therefore, NMT for low-resource domains remains a challenge in both research and applications.

To deal with this issue, many researchers have studied domain adaptation for NMT, which can be classified into two general categories. One is to transfer rich-resource (out-of-domain) translation knowledge to benefit the low-resource (in-domain) NMT model. The other is to use a mixed-domain training corpus to construct a unified NMT model for all domains. Here, we mainly focus on the first line of research, whose typical methods include fine-tuning Luong:IWSLT2015; Zoph:EMNLP2016; Servan:Arxiv2016, mixed fine-tuning Chu:ACL2017, cost weighting Chen:FWNMT2017, data selection Wang:ACL2017; Wang:EMNLP2017; zhang:NAACL2019, and so on. The underlying assumption of these approaches is that in-domain and out-of-domain NMT models share the same parameter space or prior distributions, so that the useful out-of-domain translation knowledge can be completely transferred to the in-domain NMT model in a one-pass manner. However, this goal is difficult to achieve due to domain differences. In particular, when the domain difference is significant, such brute-force transfer may fail, facing issues similar to those encountered in domain adaptation for other tasks Pan:IEEE2010.

In this paper, to tackle the above problem, we argue that corpora of different domains should be repeatedly utilized to fully distill domain-shared translation knowledge. To this end, we propose a novel Iterative Dual Domain Adaptation (IDDA) framework for NMT. Under this framework, we first train in-domain and out-of-domain NMT models on their respective training corpora, and then iteratively perform bidirectional translation knowledge transfer (from in-domain to out-of-domain and then vice versa). In this way, the in-domain and out-of-domain NMT models are expected to constantly reinforce each other, which is likely to achieve better NMT domain adaptation. In particular, we employ a knowledge distillation Hinton:arXiv2015; Kim:2016 based approach to transfer translation knowledge. During this process, the target-domain NMT model is first initialized with the source-domain NMT model, and then trained to fit its own training data and match the output of its previous best model simultaneously. By doing so, the previously transferred translation knowledge can be effectively retained for better NMT domain adaptation. Finally, we further extend the proposed framework to the scenario of multiple out-of-domain training corpora, where the above-mentioned bidirectional knowledge transfer is performed sequentially between the in-domain and each out-of-domain NMT model in ascending order of their domain similarities.

The contributions of this work are summarized as follows:

  • We propose an iterative dual domain adaptation framework for NMT, which is applicable to many conventional domain transfer approaches, such as fine-tuning and mixed fine-tuning. Compared with previous approaches, our framework better exploits domain-shared translation knowledge for NMT domain adaptation.

  • We extend our framework to the setting of multiple out-of-domain training corpora, which has rarely been studied in machine translation. Moreover, we explicitly differentiate the contributions of different out-of-domain training corpora based on their domain-level similarity to the in-domain training corpus.

  • We provide empirical evaluations of the proposed framework on Chinese-English and English-German datasets for NMT domain adaptation. Experimental results demonstrate the effectiveness of our framework. Moreover, we analyze in depth the impacts of various factors on our framework. We release code and results at https://github.com/DeepLearnXMU/IDDA.

2 Related Work

Our work is closely related to research on transferring out-of-domain translation knowledge into the in-domain NMT model. In this respect, fine-tuning Luong:IWSLT2015; Zoph:EMNLP2016; Servan:Arxiv2016 is the most popular approach, where the NMT model is first trained on the out-of-domain training corpus and then fine-tuned on the in-domain training corpus. To avoid overfitting, Chu:ACL2017 blended in-domain with out-of-domain corpora to fine-tune the pre-trained model, and Freitag:Arxiv2016 combined the fine-tuned model with the baseline via ensembling. Meanwhile, applying data weighting to NMT domain adaptation has attracted much attention. Wang:ACL2017 and Wang:EMNLP2017 proposed several sentence and domain weighting methods with a dynamic weight learning strategy. zhang:NAACL2019 ranked unlabeled-domain training samples based on their similarity to in-domain data, and then adopted a probabilistic curriculum learning strategy during training. Chen:FWNMT2017 applied sentence-level cost weighting to refine the training of the NMT model. Recently, David:NAACL2018 introduced a weight for each hidden unit of the out-of-domain model. Chu:COLING2018 gave a comprehensive survey of the dominant domain adaptation techniques for NMT. Gu:NAACL2019 not only maintained a private encoder and a private decoder for each domain, but also introduced a common encoder and a common decoder shared by all domains.

Figure 1: Traditional approach vs. IDDA framework for one-to-one NMT domain adaptation. D_out: out-of-domain training corpus, D_in: in-domain training corpus, M_out: out-of-domain NMT model, M_in: in-domain NMT model, K denotes the iteration number.
1: Input: training corpora D_in, D_out, development sets Dev_in, Dev_out, and the maximal iteration number K.
2: Output: in-domain NMT model M_in*.
3: M_in^(0) <- TrainModel(D_in),    M_out^(0) <- TrainModel(D_out)
4: M_in* <- M_in^(0),    M_out* <- M_out^(0)
5: for k = 1 to K do
6:     M_out^(k) <- TransferModel(M_in^(k-1), D_out, M_out*)
7:     if EvalModel(M_out^(k), Dev_out) > EvalModel(M_out*, Dev_out) then
8:         M_out* <- M_out^(k)
9:     end if
10:    M_in^(k) <- TransferModel(M_out^(k), D_in, M_in*)
11:    if EvalModel(M_in^(k), Dev_in) > EvalModel(M_in*, Dev_in) then
12:        M_in* <- M_in^(k)
13:    end if
14: end for
Algorithm 1: Iterative Dual Domain Adaptation for NMT

Significantly different from the above methods, and in line with studies of dual learning for NMT He:NIPS2016; Wang:AAAI2018; Zhang:AAAI2019, we iteratively perform bidirectional translation knowledge transfer between in-domain and out-of-domain training corpora. To the best of our knowledge, our work is the first attempt to explore such a dual learning based framework for NMT domain adaptation. Furthermore, we extend our framework to the scenario of multiple out-of-domain corpora. In particular, we introduce knowledge distillation into domain adaptation for NMT, and experimental results demonstrate its effectiveness, echoing its successful applications to many tasks such as speech recognition Hinton:arXiv2015 and natural language processing Kim:2016; Tan:2019.

Besides, our work is also related to studies of multi-domain NMT, which focus on building a unified NMT model trained on a mixed-domain training corpus for translation tasks in all domains Kobus:Arxiv2016; Tars:arXiv2018; Farajian:WMT2017; Pryzant:WMT2017; Sajjad:arXiv2017; Zeng:EMNLP2018; Bapna:NAACL2019. Although our framework is also able to refine the out-of-domain NMT model, it remains significantly different from multi-domain NMT, since only the performance of the in-domain NMT model is of interest.

Finally, note that, similar to our work, Tan:2019 introduced knowledge distillation into multilingual NMT. However, our work differs from Tan:2019 in the following aspects: (1) Tan:2019 mainly focused on constructing a unified NMT model for the multilingual translation task, while we aim to effectively transfer out-of-domain translation knowledge to the in-domain NMT model; (2) our translation knowledge transfer is bidirectional, while the knowledge distillation procedure in Tan:2019 is unidirectional; (3) when using knowledge distillation under our framework, we iteratively update teacher models for better domain adaptation, whereas all language-specific teacher NMT models in Tan:2019 remain fixed.

3 Iterative Dual Domain Adaptation Framework

In this section, we first describe in detail our proposed framework for conventional one-to-one NMT domain adaptation, and then extend it to the scenario of multiple out-of-domain corpora (many-to-one).

Figure 2: Traditional approach vs. IDDA framework for many-to-one NMT domain adaptation. D_mix: a mixed out-of-domain training corpus.

3.1 One-to-one Domain Adaptation

As shown in Figure 1(a), previous studies mainly focus on one-pass translation knowledge transfer from one out-of-domain NMT model to the in-domain NMT model. Unlike these studies, we propose to conduct iterative dual domain adaptation for NMT, whose framework is illustrated in Figure 1(b).

To better describe our framework, we summarize its training procedure in Algorithm 1. Specifically, we first individually train the initial in-domain and out-of-domain NMT models, denoted by M_in^(0) and M_out^(0), by minimizing the negative log-likelihood on their respective training corpora D_in and D_out (Line 3):

\theta_{in}^{(0)} = \arg\min_{\theta} \sum_{(x,y) \in D_{in}} -\log P(y \mid x; \theta),    (1)

\theta_{out}^{(0)} = \arg\min_{\theta} \sum_{(x,y) \in D_{out}} -\log P(y \mid x; \theta).    (2)

Then, we iteratively perform bidirectional translation knowledge transfer to update both the in-domain and out-of-domain NMT models, until the maximal iteration number K is reached (Lines 5-14). More specifically, at the k-th iteration, we first transfer the translation knowledge of the previous in-domain NMT model M_in^(k-1) to the out-of-domain NMT model M_out^(k) trained on D_out (Line 6), and then reversely transfer the translation knowledge encoded by M_out^(k) to the in-domain NMT model M_in^(k) trained on D_in (Line 10). During this process, we evaluate the new models M_out^(k) and M_in^(k) on their corresponding development sets, and record the best model parameters as M_out* and M_in* (Lines 7-9, 11-13).
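For concreteness, the following Python sketch mirrors Algorithm 1. The helpers train_model, transfer_model, and eval_bleu are hypothetical placeholders for the training, knowledge-distillation-based transfer, and development-set evaluation steps described above; they are not functions from our released code.

```python
import copy

def idda(d_in, d_out, dev_in, dev_out, K, train_model, transfer_model, eval_bleu):
    """Iterative dual domain adaptation (Algorithm 1); a minimal sketch.

    train_model(corpus)                  -> model trained from scratch (Eqs. 1-2)
    transfer_model(src, corpus, teacher) -> student initialized from `src`, trained
                                            on `corpus` under the loss of Eq. (3)
    eval_bleu(model, dev)                -> BLEU score on a development set
    """
    m_in, m_out = train_model(d_in), train_model(d_out)             # Line 3
    best_in, best_out = copy.deepcopy(m_in), copy.deepcopy(m_out)   # Line 4
    for _ in range(K):                                              # Lines 5-14
        # in-domain -> out-of-domain transfer, teacher = previous best M_out* (Line 6)
        m_out = transfer_model(m_in, d_out, teacher=best_out)
        if eval_bleu(m_out, dev_out) > eval_bleu(best_out, dev_out):
            best_out = copy.deepcopy(m_out)                         # Lines 7-9
        # out-of-domain -> in-domain transfer, teacher = previous best M_in* (Line 10)
        m_in = transfer_model(m_out, d_in, teacher=best_in)
        if eval_bleu(m_in, dev_in) > eval_bleu(best_in, dev_in):
            best_in = copy.deepcopy(m_in)                           # Lines 11-13
    return best_in                                                  # output: M_in*
```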

Obviously, one important step in the above procedure is how to transfer translation knowledge from one domain-specific NMT model to the other. However, if we directly employ conventional domain transfer approaches such as fine-tuning, the previously learned translation knowledge tends to be forgotten as the iterative dual domain adaptation proceeds. To deal with this issue, we introduce knowledge distillation Kim:2016 to conduct the translation knowledge transfer. Specifically, during the transfer from M_out^(k) to M_in^(k), we first initialize M_in^(k) with the parameters of M_out^(k), and then train M_in^(k) not only to match the references of D_in, but also to be consistent with the probability outputs of the previous best in-domain NMT model M_in*, which serves as the teacher model. To this end, we define the loss function as

L(\theta_{in}^{(k)}) = (1-\lambda)\, L_{NLL}(\theta_{in}^{(k)}; D_{in}) + \lambda\, L_{KD}(\theta_{in}^{(k)}; D_{in}, M_{in}^{*}),    (3)

where L_{NLL} is the negative log-likelihood on D_in, L_{KD} is the cross-entropy between the output distribution of M_in^(k) and that of the teacher M_in*, and \lambda is the coefficient used to trade off these two loss terms, which can be tuned on the development set. Notably, when \lambda = 0, only the likelihood term affects model training, and our transfer approach degenerates into fine-tuning at each iteration.

In this way, the in-domain NMT model M_in^(k) not only retains the previously learned effective translation knowledge, but also fully absorbs useful translation knowledge from the out-of-domain NMT model M_out^(k). Similarly, we employ the above method to transfer translation knowledge from M_in^(k-1) to M_out^(k), using the out-of-domain corpus D_out and the previous best out-of-domain model M_out*. Due to space limitations, we omit the detailed description of this procedure.
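To make Eq. (3) concrete, below is a minimal PyTorch-style sketch of one training step of the KD-based transfer. The tensor shapes, the helper name transfer_step, and the default value of lambda_kd are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def transfer_step(student_logits, teacher_logits, target_ids, pad_id, lambda_kd=0.5):
    """Distillation-based transfer loss of Eq. (3); a minimal sketch.

    student_logits: [batch, tgt_len, vocab], produced by the model being transferred
    teacher_logits: [batch, tgt_len, vocab], produced by the frozen previous best model
    target_ids:     [batch, tgt_len], gold references from the current domain's corpus
    lambda_kd:      distillation coefficient (tuned on the development set)
    """
    vocab = student_logits.size(-1)
    # Negative log-likelihood w.r.t. the gold references (L_NLL).
    nll = F.cross_entropy(student_logits.view(-1, vocab),
                          target_ids.view(-1),
                          ignore_index=pad_id)
    # Cross-entropy between the teacher's output distribution and the student's (L_KD).
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    mask = target_ids.ne(pad_id).unsqueeze(-1).float()
    kd = -(p_teacher * log_p_student * mask).sum() / mask.sum()
    # Interpolate the two terms as in Eq. (3); lambda_kd = 0 recovers plain fine-tuning.
    return (1.0 - lambda_kd) * nll + lambda_kd * kd
```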

3.2 Many-to-one Domain Adaptation

Usually, in practical applications, multiple out-of-domain training corpora are available simultaneously. As shown in Figure 2(a), previous studies usually mix them into one out-of-domain corpus, to which conventional one-to-one NMT domain adaptation can then be applied. However, different out-of-domain corpora are semantically related to the in-domain corpus to different degrees, and thus, intuitively, it is difficult to fully exploit them without distinguishing among them.

To address this issue, we extend the proposed framework to many-to-one NMT domain adaptation, as illustrated in Figure 2(b). Given an in-domain corpus and multiple out-of-domain corpora, we first measure the semantic distance between each out-of-domain corpus and the in-domain corpus using the proxy A-distance d_A = 2(1 - 2\epsilon) Ganin:MLR2015; Pryzant:WMT2017, where \epsilon is the generalization error of a linear bag-of-words SVM classifier trained to discriminate between the two domains. Then, we determine the transfer order of the out-of-domain NMT models according to the distances of their training corpora to the in-domain corpus, in decreasing order. The reasoning behind this step is that the translation knowledge of previously transferred out-of-domain NMT models is partially forgotten as the transfer continues. By setting the transfer order according to decreasing d_A values, we enable the in-domain NMT model to best preserve the translation knowledge transferred from the most relevant out-of-domain NMT model, which comes last. Finally, we sequentially perform bidirectional knowledge transfer between the in-domain and each out-of-domain model, and repeat this process for K iterations.
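As an illustration of the ordering step, the sketch below estimates the proxy A-distance with scikit-learn. The 80/20 split, the default vectorizer settings, and the function and variable names are our own illustrative choices, not the exact setup used in the paper.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def proxy_a_distance(in_domain_sents, out_domain_sents, seed=0):
    """Proxy A-distance d_A = 2 * (1 - 2 * eps); a minimal sketch.

    eps is the held-out error of a linear bag-of-words SVM trained to tell the
    in-domain source sentences apart from the out-of-domain ones.
    """
    texts = list(in_domain_sents) + list(out_domain_sents)
    labels = np.array([0] * len(in_domain_sents) + [1] * len(out_domain_sents))
    features = CountVectorizer().fit_transform(texts)
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, random_state=seed, stratify=labels)
    clf = LinearSVC().fit(x_tr, y_tr)
    eps = 1.0 - clf.score(x_te, y_te)          # generalization error
    return 2.0 * (1.0 - 2.0 * eps)

# Out-of-domain corpora would then be arranged in decreasing order of d_A, so that
# the most similar domain is transferred last, e.g. (illustrative names):
# order = sorted(out_corpora, key=lambda c: proxy_a_distance(in_src, c), reverse=True)
```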

4 Experiments

To verify the effectiveness of our framework, we first conducted one-to-one domain adaptation experiments on Chinese-English translation, where we further investigated the impacts of various factors on our framework. Then, we carried out two-to-one domain adaptation experiments on English-German translation, so as to demonstrate the generality of our framework across language pairs and with multiple out-of-domain corpora.

4.1 Setup

Datasets. In the Chinese-English translation task, our in-domain training corpus is from the IWSLT2015 dataset, consisting of 210K TED Talk sentence pairs, and the out-of-domain training corpus contains 1.12M LDC sentence pairs from the News domain. For these two domains, we chose IWSLT dev2010 and the NIST 2002 dataset as development sets, and used IWSLT tst2010, tst2011, tst2012 and tst2013 as in-domain test sets. In particular, to verify whether our framework enables the NMT models of the two domains to benefit each other, we also tested the performance of the out-of-domain NMT model on the NIST 2003, 2004, 2005 and 2006 datasets.

For the English-German translation task, our training corpora include one in-domain dataset, 200K TED Talk sentence pairs provided by IWSLT2015, and two out-of-domain datasets: 500K sentence pairs (News topic) extracted from the WMT2014 corpus, and 500K sentence pairs (Medical topic) sampled from the OPUS EMEA corpus (http://opus.nlpl.eu/). As development sets, we chose IWSLT tst2012, WMT tst2012 and 1K sampled sentence pairs from the OPUS EMEA corpus, respectively. In addition, IWSLT tst2013 and tst2014 were used as in-domain test sets, while WMT news-test2014 (News topic) and 1K sampled sentence pairs from the OPUS EMEA corpus were used as the two out-of-domain test sets.

We first employed the Stanford Segmenter (https://nlp.stanford.edu/) to conduct word segmentation on Chinese sentences and the MOSES scripts (http://www.statmt.org/moses/) to tokenize English and German sentences. Then, we limited sentence length to 50 words in the training stage. Besides, we employed Byte Pair Encoding Sennrich:ACL2016 to split words into subwords and set the vocabulary size for both Chinese-English and English-German to 32,000. We evaluated translation quality with BLEU Papineni:ACL2002, as calculated by the multi-bleu.perl script.
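As a small illustration of the preprocessing step, here is a hedged Python sketch of the sentence-length filter applied to the already tokenized training corpora; the function name and whitespace splitting are our own simplifications, standing in for the Stanford/MOSES tokenized input.

```python
def filter_by_length(src_lines, tgt_lines, max_len=50):
    """Keep only sentence pairs whose source and target sides are at most
    max_len tokens long, as done for the training corpora; a minimal sketch."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            kept.append((src, tgt))
    return kept
```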

Settings. We chose Transformer Vaswani:NIPS2017 as our NMT model, which exhibits excellent performance thanks to its parallel computation and long-range dependency modeling. We followed Vaswani:NIPS2017 in setting the configurations: the dimensionality of all input and output layers is 512, that of the FFN layer is 2048, and we employed 8 parallel attention heads in both encoder and decoder. Parameter optimization was performed using stochastic gradient descent, where Adam Kingma:ICLR2015 was used to automatically adjust the learning rate of each parameter. We batched sentence pairs by approximate length, and limited input and output tokens per batch to 25,000. For decoding, we employed beam search with a beam size of 4. Besides, the distillation coefficient \lambda was tuned on the development set.
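For reference, the hyper-parameters stated above can be gathered into an illustrative configuration dict; the field names are ours and do not correspond to any released configuration file.

```python
# Hyper-parameters reported in the text, collected into an illustrative config.
TRANSFORMER_CONFIG = {
    "d_model": 512,                 # dimensionality of all input/output layers
    "d_ffn": 2048,                  # dimensionality of the FFN layer
    "num_heads": 8,                 # attention heads in encoder and decoder
    "optimizer": "adam",            # Adam adjusts per-parameter learning rates
    "max_tokens_per_batch": 25000,  # input/output tokens per batch
    "beam_size": 4,                 # beam search width at decoding time
    "bpe_vocab_size": 32000,        # shared BPE vocabulary size
    "max_sentence_length": 50,      # training-time length limit (in words)
    # the distillation coefficient lambda is tuned on the development set
}
```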

Contrast Models. We compared our framework with the following models:

  • Single: A reimplemented Transformer trained only on a single domain-specific (in-domain or out-of-domain) training corpus.

  • Mix: A reimplemented Transformer trained on the mix of the in-domain and out-of-domain training corpora.

  • Fine-tuning (FT) Luong:IWSLT2015: It first trains the NMT model on the out-of-domain training corpus and then fine-tunes it on the in-domain training corpus.

  • Mixed Fine-tuning (MFT) Chu:ACL2017: It also first trains the NMT model on the out-of-domain training corpus, and then fine-tunes it on the combination of the out-of-domain and over-sampled in-domain training corpora.

  • Knowledge Distillation (KD) Kim:2016: It first trains an out-of-domain and an in-domain NMT model on their own training corpora, respectively. Then, the out-of-domain NMT model is fine-tuned on the in-domain training corpus, supervised by the in-domain NMT model.

Besides, we reported the performance of some recently proposed multi-domain NMT models.

Model      | TED Talk (In-domain)                | News (Out-of-domain)
           | Tst10  Tst11  Tst12  Tst13  Ave.    | Nist03  Nist04  Nist05  Nist06  Ave.
Cross-domain Transfer Methods
Single     | 15.82  20.80  17.77  18.33  18.18   | 45.38   45.93   42.80   42.70   44.20
Mix        | 16.46  20.85  19.13  19.87  19.08   | 44.87   45.71   42.24   42.02   43.71
FT         | 16.77  21.16  19.31  20.53  19.44   | --      --      --      --      --
MFT        | 17.19  22.02  20.09  21.05  20.08   | --      --      --      --      --
KD         | 17.62  21.88  19.97  20.43  19.98   | --      --      --      --      --
Multi-domain NMT Methods
DC         | 17.23  22.10  19.68  20.58  19.90   | 46.03   46.62   44.39   43.82   45.21
DM         | 16.45  21.35  18.77  20.27  19.21   | 45.12   45.83   42.77   42.59   44.08
WDCD       | 17.32  22.23  20.02  21.10  20.17   | 46.33   46.36   44.62   43.80   45.27
IDDA Framework
IDDA(λ=0)  | 18.00  22.71  20.36  21.82  20.72   | 45.91   45.84   43.61   42.17   44.46
IDDA       | 18.36  23.14  20.78  21.79  21.02   | 47.17   47.44   45.38   44.04   46.01
Table 1: Experimental results on the Chinese-English translation task. Marked results are statistically significantly better than WDCD (p < 0.01).
  • Domain Control (DC) Kobus:Arxiv2016: It is also based on the mixed-domain NMT model. However, it adds an additional domain tag to each source sentence, incorporating domain information into the source annotations.

  • Discriminative Mixing (DM) Pryzant:WMT2017: It jointly trains NMT with domain classification via multi-task learning. Note that it performs the best among the three approaches proposed in Pryzant:WMT2017.

  • Word-level Domain Context Discrimination (WDCD) Zeng:EMNLP2018: It discriminates source-side word-level domain-specific and domain-shared contexts for multi-domain NMT by jointly modeling translation and domain classification.

4.2 Results on Chinese-English Translation

4.2.1 Effect of Iteration Number

The iteration number K is a crucial hyper-parameter that directly determines the amount of transferred translation knowledge under our framework. Therefore, we first inspected its impact on the development sets. To this end, we varied K from 0 to 7 with an increment of 1, where our framework degrades to Single when K=0.

Figure 3: Effect of the iteration number K on the Chinese-English in-domain development set.

Figure 3 provides the experimental results with different values of K. We observe that both IDDA(λ=0) and IDDA achieve their best performance at the 3rd iteration. Therefore, we directly used K=3 in all subsequent experiments.

4.2.2 Overall Performance

Table 1 shows the overall experimental results. On all test sets, our framework significantly outperforms other contrast models. Furthermore, we reach the following conclusions:

First, on the in-domain test sets, both IDDA(λ=0) and IDDA surpass Single, Mix, FT, MFT and KD, most of which are commonly used for domain adaptation in NMT. This confirms the difficulty of completely transferring useful out-of-domain translation knowledge to the in-domain NMT model in one pass. Moreover, the in-domain NMT model benefits from multiple passes of knowledge transfer under our framework.

Second, compared with DC, DM and WDCD, which are proposed for multi-domain NMT, both IDDA(λ=0) and IDDA still exhibit better performance on the in-domain test sets. The underlying reason is that these multi-domain models discriminate domain-specific and domain-shared information in the encoder, but their shared decoders are inadequate for preserving domain-related text style and idioms. In contrast, our framework is adept at preserving such information since we construct an individual NMT model for each domain.

Third, IDDA achieves better performance than IDDA(λ=0), demonstrating the importance of retaining previously learned translation knowledge. Surprisingly, IDDA also significantly outperforms IDDA(λ=0) on the out-of-domain test sets. We conjecture that during knowledge distillation, by assigning non-zero probabilities to multiple words, the teacher model's output distribution is smoother, leading to smaller variance in gradients Hinton:arXiv2015. Consequently, the out-of-domain NMT model becomes more robust by iteratively absorbing translation knowledge from the best out-of-domain model.

Finally, note that even on the out-of-domain test sets, IDDA still outperforms all listed contrast models. This result demonstrates the advantage of dual domain adaptation under our framework.

Based on the performance of our framework reported in Table 1, we only considered IDDA in all subsequent experiments. Besides, we only chose MFT, KD, and WDCD as typical contrast models, because KD is the basic domain adaptation approach underlying our framework, while MFT and WDCD are the best-performing domain adaptation method and multi-domain NMT model for comparison, respectively.

4.2.3 Results on Source Sentences with Different Lengths

Figure 4: BLEU scores on different IWSLT test sets divided according to source sentence lengths.

Following previous work Bahdanau:ICLR2015, we divided IWSLT test sets into different groups based on the lengths of source sentences and then investigated the performance of various models.

Figure 4 illustrates the results. We observe that our framework achieves the best performance in all groups, although the performance of all models degrades as source sentences become longer.

4.2.4 Effect of Out-of-domain Corpus Size

In this group of experiments, we investigated the impact of out-of-domain corpus size on our proposed framework. Specifically, we inspected the results of our framework using out-of-domain corpora of different sizes: 50K, 200K and 1.12M sentence pairs, respectively.

Figure 5: Experimental results with different sizes of out-of-domain corpora.

Figure 5 shows the comparison in terms of the average BLEU score over all IWSLT test sets. No matter how much out-of-domain data is used, IDDA always outperforms the other contrast models, demonstrating the effectiveness and generality of our framework. Notably, IDDA with the 200K out-of-domain corpus is comparable to KD with the 1.12M corpus. This result confirms again that our framework better exploits the complementary information between domains than KD.

4.2.5 Effects of Dual Domain Adaptation and Updating Teacher Models

Model Ave.
IDDA-unidir 20.43
IDDA-fixTea 20.60
IDDA 21.02
Table 2: Experimental results of comparing IDDA with its two variants.

Two highlights of our framework are the bidirectional translation knowledge transfer and the continuous updating of the teacher models M_out* and M_in* (see Lines 6 and 10 of Algorithm 1). To inspect their effects, we compared our framework with two variants: (1) IDDA-unidir, which only iteratively transfers out-of-domain translation knowledge to the in-domain NMT model; and (2) IDDA-fixTea, where the teacher models are fixed to the initial out-of-domain and in-domain NMT models, respectively.

The results are displayed in Table 2. Our framework performs better than both variants, which demonstrates that dual domain adaptation enables the NMT models of the two domains to benefit from each other, and that updating the teacher models is more helpful for retaining useful translation knowledge.

4.2.6 Case Study

Model    Translation
Src      zhè shì dì yī zhǒng zhílì xíngzǒu de língzhǎnglèi dòngwù
Ref      that was the first upright primate
MFT      this is the first animal to walk upright
KD       this is the first growing primate
WDCD     this is the first primate walking around
IDDA-1   this is the first upright - walking primate
IDDA-2   this is the first upright - walking primate
IDDA-3   this is the first primates walking upright
IDDA-4   this is the first upright primate
IDDA-5   this is the first upright primate
IDDA-6   this is the first upright primate
IDDA-7   this is the first upright primate
Table 3: Translation examples of different NMT models. Src: source sentence; Ref: target reference. IDDA-k denotes the in-domain NMT model at the k-th iteration under our framework.

Table 3 displays the 1-best translations of a sampled test sentence generated by MFT, KD, WDCD, and IDDA at different iterations. This example offers some insight into the advantage of our proposed framework. Specifically, we observe that MFT, KD and WDCD fail to correctly understand the meaning of "zhílì xíngzǒu de língzhǎnglèi dòngwù" ("upright-walking primate") and thus generate incorrect or incomplete translations, while IDDA gradually corrects these errors by absorbing the transferred translation knowledge.

4.3 Results on English-German Translation

4.3.1 Overall Performance

Model      | TED Talk (In-domain)          | News (Out-of-domain 1) | Medical (Out-of-domain 2)
           | IWSLT2013  IWSLT2014  Ave.    | WMT14                  | EMEA
Cross-domain Transfer Methods
Single     | 29.76  25.99  27.88           | 20.54                  | 51.11
Mix        | 31.45  27.03  29.24           | 21.17                  | 50.60
FT         | 30.54  27.02  28.78           | --                     | --
MFT        | 31.86  27.49  29.67           | --                     | --
KD         | 31.33  27.96  29.64           | --                     | --
Multi-domain NMT Methods
DC         | 31.13  28.02  29.57           | 21.61                  | 52.25
DM         | 31.57  27.60  29.58           | 21.75                  | 52.60
WDCD       | 31.87  27.82  29.84           | 21.86                  | 52.84
IDDA Framework
IDDA(λ=0)  | 32.11  28.10  30.11           | 22.01                  | 52.07
IDDA       | 32.93  28.88  30.91           | 22.17                  | 53.39
Table 4: Experimental results of the English-German translation task. * indicates statistically significantly better than WDCD (p < 0.05).

We first calculated the proxy A-distance between the in-domain corpus and each out-of-domain corpus: d_A(TED Talk, News) = 0.92 and d_A(TED Talk, Medical) = 1.92. Obviously, the News domain is more relevant to the TED Talk domain than the Medical domain, and thus we determined the final transfer order as {Medical, News} for this task. Then, as in the previous Chinese-English experiments, we determined the optimal K=2 on the development set.

Table 4 shows the experimental results. As in the previously reported experiments, our framework still obtains the best performance among all models, which verifies its effectiveness for many-to-one NMT domain adaptation.

As described above, our many-to-one NMT domain adaptation involves two careful design choices: (1) we distinguish different out-of-domain corpora, and iteratively perform bidirectional translation knowledge transfer between the in-domain and each out-of-domain NMT model; (2) we determine the transfer order according to the semantic distance between each out-of-domain corpus and the in-domain training corpus. Here, we carried out two groups of experiments to investigate their impacts on our framework. In the first group, we combined all out-of-domain training corpora into a mixed corpus and then applied our framework to build the in-domain NMT model. In the second group, we applied our framework with different transfer orders.

Table 5 shows the final experimental results, which are in line with our expectations and verify the validity of our designs.

Model      Transfer Order      Ave.
IDDA-mix   --                  30.17
IDDA       {News, Medical}     30.51
IDDA       {Medical, News}     30.91
Table 5: Experimental results of IDDA with different configurations.

5 Conclusion

In this paper, we have proposed an iterative dual domain adaptation framework for NMT, which continuously and fully exploits the mutual complementarity between in-domain and out-of-domain corpora for translation knowledge transfer. Experimental results and in-depth analyses on translation tasks of two language pairs strongly demonstrate the effectiveness of our framework.

In the future, we plan to extend our framework to multi-domain NMT. Besides, we will study how to leverage monolingual sentences of different domains to refine our proposed framework. Finally, we will apply our framework to other translation models Bahdanau:ICLR2015; Su:TASLP2018; Song:TACL2019, so as to verify its generality.

Acknowledgments

The authors were supported by the National Natural Science Foundation of China (No. 61672440), the Beijing Advanced Innovation Center for Language Resources, NSF Award (No. 1704337), the Fundamental Research Funds for the Central Universities (Grant No. ZK1024), and the Scientific Research Project of the National Language Committee of China (Grant No. YB135-49). We also thank the reviewers for their insightful comments.

References