Meta-Learning for Few-Shot NMT Adaptation

Amr Sharaf et al.
University of Maryland

We present META-MT, a meta-learning approach to adapt Neural Machine Translation (NMT) systems in a few-shot setting. META-MT provides a new approach to make NMT models easily adaptable to many target domains with a minimal amount of in-domain data. We frame the adaptation of NMT systems as a meta-learning problem, where we learn to adapt to new unseen domains based on simulated offline meta-training domain adaptation tasks. We evaluate the proposed meta-learning strategy on ten domains with general large-scale NMT systems. We show that META-MT significantly outperforms classical domain adaptation when very few in-domain examples are available. Our experiments show that META-MT can outperform classical fine-tuning by up to 2.5 BLEU points after seeing only 4,000 translated words (300 parallel sentences).




1 Introduction

Neural Machine Translation (NMT) systems (Bahdanau et al., 2016; Sutskever et al., 2014) are usually trained on large general-domain parallel corpora to achieve state-of-the-art results (Barrault et al., 2019). Unfortunately, these generic corpora are often qualitatively different from the target domain of the translation system. Moreover, NMT models trained on one domain tend to perform poorly when translating sentences in a significantly different domain (Koehn and Knowles, 2017; Chu and Wang, 2018). A widely used approach for adapting NMT is domain adaptation by fine-tuning (Luong and Manning, 2015; Freitag and Al-Onaizan, 2016; Sennrich et al., 2016), where a model is first trained on general-domain data and then adapted by continuing the training on a smaller amount of in-domain data. This approach often leads to empirical improvements in the targeted domain; however, it falls short when the amount of in-domain training data is insufficient, leading to model over-fitting and catastrophic forgetting, where adapting to a new domain leads to degradation on the general domain (Thompson et al., 2019). Ideally, we would like to have a model that is easily adaptable to many target domains with a minimal amount of in-domain data.

We present a meta-learning approach that learns to adapt neural machine translation systems to new domains given only a small amount of training data in that domain. To achieve this, we simulate many domain adaptation tasks, on which we use a meta-learning strategy to learn how to adapt. Specifically, based on these simulations, our proposed approach, Meta-MT (Meta-learning for Machine Translation), learns model parameters that should generalize to future (real) adaptation tasks (§ 4.1).

At training time (§ 4.2), Meta-MT simulates many small-data domain adaptation tasks from a large pool of data. Using these tasks, Meta-MT simulates what would happen after fine-tuning the model parameters to each such task. It then uses this information to compute parameter updates that will lead to efficient adaptation during deployment. We optimize this using the Model Agnostic Meta-Learning algorithm (MAML) (Finn et al., 2017).

The contribution of this paper is as follows: first, we propose a new approach that enables NMT systems to effectively adapt to a new domain using few-shot learning. Second, we show what models and conditions enable meta-learning to be useful for NMT adaptation. Finally, we evaluate Meta-MT on ten different domains, showing the efficacy of our approach. To the best of our knowledge, this is the first work on adapting large-scale NMT systems in a few-shot learning setup. (Code release: we make the code publicly available online.)

2 Related Work

Our goal for few-shot NMT adaptation is to adapt a pre-trained NMT model (e.g. trained on general-domain data) to new domains (e.g. the medical domain) with a small number of training examples. Chu and Wang (2018) surveyed several recent approaches that address the shortcomings of traditional fine-tuning when applied to domain adaptation. Our work distinguishes itself from prior work by learning to fine-tune with tiny amounts of training examples.

Most recently, Bapna et al. (2019) proposed a simple approach for adaptation in NMT. The approach consists of injecting task specific adapter layers into a pre-trained model. These adapters enable the model to adapt to new tasks as it introduces a bottleneck in the architecture that makes it easier to adapt. Our approach uses a similar model architecture, however, instead of injecting a new adapter for each task separately, Meta-MT uses a single adapter layer, and meta-learns a better initialization for this layer that can easily be fine-tuned to new domains with very few training examples.

Similar to our goal, Michel and Neubig (2018) proposed a space efficient approach to adaptation that learns domain specific biases to the output vocabulary. This enables large-scale personalization for NMT models when small amounts of data are available for a lot of different domains. However, this approach assumes that these domains are static and known at training time, while Meta-MT can dynamically generalize to totally new domains, previously unseen at meta-training time.

Several approaches have been proposed for lightweight adaptation of NMT systems. Vilar (2018) introduced domain specific gates to control the contribution of hidden units feeding into the next layer. However, Bapna et al. (2019) showed that this introduced a limited amount of per-domain capacity; in addition, the learned gates are not guaranteed to be easily adaptable to unseen domains. Khayrallah et al. (2017) proposed a lattice search algorithm for NMT adaptation, however, this algorithm assumes access to lattices generated from a phrase based machine translation system.

Our meta-learning strategy mirrors that of Gu et al. (2018) in the low-resource translation setting, as well as Wu et al. (2019) for cross-lingual named entity recognition with minimal resources, Mi et al. (2019) for low-resource natural language generation in task-oriented dialogue systems, and Dou et al. (2019) for low-resource natural language understanding tasks. To the best of our knowledge, this is the first work using meta-learning for few-shot NMT adaptation.

3 Background

3.1 Neural Machine Translation

Neural Machine Translation (NMT) is a sequence-to-sequence model that parametrizes the conditional probability of the target sequence given the source sequence as a neural network following an encoder-decoder architecture (Bahdanau et al., 2016; Sutskever et al., 2014). Initially, the encoder-decoder architecture was implemented with recurrent networks; these have since been replaced by self-attention models, also known as Transformer models (Vaswani et al., 2017). Transformer models currently achieve state-of-the-art performance in NMT as well as in many other language modeling tasks. While Transformers perform quite well on large-scale NMT tasks, they have a huge number of parameters and require large amounts of training data, which is prohibitive for adaptation tasks, especially in a few-shot setup like ours.

3.2 Few-Shot Domain Adaptation

Traditional domain adaptation for NMT models assumes the availability of a relatively large amount of in-domain data. For instance, most related work utilizing traditional fine-tuning experiments with hundreds of thousands of in-domain sentences. This setup is quite prohibitive, since in practice a domain may be defined by only a few examples. In this work we focus on the few-shot adaptation scenario, where we adapt to a new domain not seen during training using just a couple of hundred in-domain sentences. This introduces a new challenge: the model has to be quickly responsive to adaptation as well as robust to domain shift. Since we focus on the setting in which very little in-domain data is available, many traditional domain adaptation approaches become inappropriate.

3.3 Meta-Learning

Meta-learning, or learning-to-learn, is widely used for few-shot learning in many applications, where a model trained on a particular task can learn a new task from only a few examples. A number of algorithms are used in meta-learning, notably Model-Agnostic Meta-Learning (MAML) and its first-order approximations, First-Order MAML (FoMAML) (Finn et al., 2017) and Reptile (Nichol et al., 2018). In this work, we focus on using MAML to enable few-shot adaptation of NMT Transformer models.

4 Approach: Meta-Learning for Few-Shot NMT Adaptation

Neural Machine Translation systems are not robust to domain shifts (Chu and Wang, 2018). It is highly desirable for such a system to adapt to any domain shift using weak supervision, without degrading its performance on the general domain. This dynamic adaptation task can be viewed naturally as a learning-to-learn (meta-learning) problem: how can we train a global model that is capable of using its previous experience in adaptation to learn to adapt faster to unseen domains? A particularly simple and effective strategy for adaptation is fine-tuning: the global model is adapted by training on in-domain data. One would hope to improve on such a strategy by decreasing the amount of required in-domain data. Meta-MT takes into account information from previous adaptation tasks, and aims at learning how to update the global model parameters, so that the resulting learned parameters after meta-learning can be adapted faster and better to previously unseen domains via a weakly supervised fine-tuning approach on a tiny amount of data.

Our goal in this paper is to learn how to adapt a neural machine translation system from experience. The training procedure for Meta-MT uses offline simulated adaptation problems to learn model parameters which can adapt faster to previously unseen domains. In this section, we describe Meta-MT, first by describing how it operates at test time when applied to a new domain adaptation task (§ 4.1), and then by describing how to train it using offline simulated adaptation tasks (§ 4.2).

Figure 1: Example meta-learning set-up for few-shot NMT adaptation. The top represents the meta-training set, where each box is a separate dataset that consists of the support set (left of the dashed line) and the query set (right of the dashed line). In this illustration, we consider the Books and TED talks domains for meta-training. The meta-test set is defined in the same way, but with a different set of domains not present in any of the meta-training datasets: Medical and News.
Figure 2: [Top-A] a training step of Meta-MT. [Bottom-B] Differences between meta-learning and traditional fine-tuning. Wide lines represent high-resource domains (Medical, News), while thin lines represent low-resource domains (TED, Books). Traditional fine-tuning may favor high-resource domains over low-resource ones, while meta-learning aims at learning a good initialization that can be adapted to any domain with minimal training samples. (A colorblind-friendly palette was selected from Neuwirth and Brewer, 2014.)

4.1 Test Time Behavior of Meta-MT

At test time, Meta-MT adapts a pre-trained NMT model to a new given domain. The adaptation is done using a small amount of in-domain data that we call the support set, and the adapted model is then tested on the new domain using a query set. More formally, the model, parametrized by θ, takes as input a new adaptation task T. This is illustrated in Figure 1: the adaptation task T consists of a standard domain adaptation problem: it includes a support set used for training the fine-tuned model, and a query set used for evaluation. We are particularly interested in the distribution of tasks where the support and query sets are very small. In our experiments, we restrict the size of these sets to only a few hundred parallel training sentences, considering support sets of 4k to 64k source words. At test time, the meta-learned model interacts with the world as follows (Figure 2):

  1. Step 1: The world draws an adaptation task T from a distribution p(T);

  2. Step 2: The model adapts its parameters from θ to θ′ by fine-tuning on the task's support set;

  3. Step 3: The fine-tuned model θ′ is evaluated on the task's query set.

Intuitively, meta-training should optimize for a representation that can quickly adapt to new tasks, rather than a single individual task.
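These three test-time steps can be illustrated with a small self-contained sketch. Here a toy linear least-squares model stands in for the NMT system, and plain SGD on the support set stands in for fine-tuning; all names, sizes, and learning rates are illustrative, not from the paper:

```python
import numpy as np

# Toy stand-in for the NMT model: a linear predictor whose parameters
# theta play the role of the (meta-learned) NMT parameters. An adaptation
# task is a (support, query) pair of small datasets.

def task_loss(theta, X, y):
    # mean squared error of the linear predictor X @ theta
    return float(np.mean((X @ theta - y) ** 2))

def task_grad(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def adapt(theta, support, lr=0.1, steps=10):
    # Step 2: fine-tune theta -> theta' on the task's support set
    X_s, y_s = support
    theta = theta.copy()
    for _ in range(steps):
        theta = theta - lr * task_grad(theta, X_s, y_s)
    return theta

# Step 1: the "world" draws an adaptation task (here: a random linear map)
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(40, 2))
y = X @ w_true
support, query = (X[:20], y[:20]), (X[20:], y[20:])

theta0 = np.zeros(2)                      # pre-trained / meta-learned init
theta_prime = adapt(theta0, support)      # Step 2: fine-tune on support
loss_before = task_loss(theta0, *query)   # Step 3: evaluate on the query set
loss_after = task_loss(theta_prime, *query)
```

The point of meta-training, described next, is to choose the initialization so that this fine-tune-then-evaluate loop succeeds with as little support data as possible.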

4.2 Training Meta-MT via Meta-learning

The meta-learning challenge is: how do we learn a good representation θ? We initialize θ by training an NMT model on general-domain data. In addition, we assume access to meta-training tasks on which we can train θ; these tasks must include support/query pairs, where we can simulate a domain adaptation setting by fine-tuning on the support set and then evaluating on the query set. This is a weak assumption: in practice, we use purely simulated data as this meta-training data. We construct this data as follows: given a parallel corpus for the desired language pair, we randomly sample training examples to form a few-shot adaptation task. We build tasks of 4k, 8k, 16k, 32k, and 64k training words. Under this formulation, it is natural to think of the learning process for θ as learning a good parameter initialization for fast adaptation, for which a natural class of learning algorithms is Model-Agnostic Meta-Learning (MAML) and its first-order approximations, First-Order MAML (FoMAML) (Finn et al., 2017) and Reptile (Nichol et al., 2018).

Informally, at training time, Meta-MT will treat one of these simulated domains as if it were a domain adaptation dataset. At each time step, it will update the current model representation from θ to θ′ by fine-tuning on the task's support set, and then ask: what is the meta-learning loss estimate given θ, θ′, and the task's query set? The model representation θ is then updated to minimize this meta-learning loss. More formally, in meta-learning we assume access to a distribution p(T) over different tasks T. From this, we can sample a meta-training dataset D. The meta-learning problem is then to estimate θ to minimize the meta-learning loss on D.

1:  while not done do
2:     Sample a batch of domain adaptation tasks T_i ∼ D
3:     for all T_i do
4:        Evaluate ∇_θ L_{T_i}(f_θ) on the support set of T_i
5:        Compute adapted parameters with gradient descent: θ′_i = θ − α ∇_θ L_{T_i}(f_θ)
6:     end for
7:     Update θ ← θ − β ∇_θ Σ_{T_i} L_{T_i}(f_{θ′_i}) on the query set of each T_i
8:  end while
Algorithm 1 Meta-MT (trained model f_θ, meta-training dataset D, learning rates α, β)

The meta-learning algorithm we use is MAML (Finn et al., 2017), instantiated for meta-learning to adapt NMT systems in Alg 1. MAML considers a model represented by a parametrized function f_θ with parameters θ. When adapting to a new task T_i, the model's parameters become θ′_i. The updated vector θ′_i is computed using one or more gradient descent updates on the task T_i. For example, when using one gradient update:

θ′_i = θ − α ∇_θ L_{T_i}(f_θ)

where α is the learning rate and L_{T_i} is the loss function for task T_i. The model parameters θ are trained by optimizing for the performance of f_{θ′_i} with respect to θ across tasks sampled from p(T). More concretely, the meta-learning objective is:

min_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ′_i}) = min_θ Σ_{T_i ∼ p(T)} L_{T_i}(f_{θ − α ∇_θ L_{T_i}(f_θ)})
Following the MAML template, Meta-MT operates in an iterative fashion, starting with a trained NMT model θ and improving it through optimizing the meta-learning loss on the meta-training dataset D. Over learning rounds, Meta-MT selects a random batch of training tasks from the meta-training dataset and simulates the test-time behavior on these tasks (Line 2). The core functionality is to observe how the current model representation θ is adapted for each task in the batch, and to use this information to improve θ by optimizing the meta-learning loss (Line 7). Meta-MT achieves this by simulating a domain adaptation setting: fine-tuning on the task-specific support set (Line 4) yields, for each task T_i, a new adapted set of parameters θ′_i (Line 5). These parameters are evaluated on the query set of each task T_i, and a meta-gradient w.r.t. the original model representation θ is used to improve θ (Line 7).
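As a minimal illustration of this loop, the first-order variant (FoMAML, which our experiments actually use to avoid second-order gradients) can be sketched on toy regression tasks. Everything here, the model, task distribution, and learning rates, is illustrative rather than the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def loss(theta, X, y):
    return float(np.mean((X @ theta - y) ** 2))

def grad(theta, X, y):
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def sample_task():
    # A simulated "domain": a random linear map, split into support/query
    w = rng.normal(size=2)
    X = rng.normal(size=(32, 2))
    y = X @ w
    return (X[:16], y[:16]), (X[16:], y[16:])

alpha, beta = 0.1, 0.05     # inner (task) and outer (meta) learning rates
theta = np.zeros(2)         # the shared initialization being meta-learned

for _ in range(100):                                  # while not done
    batch = [sample_task() for _ in range(4)]         # line 2: sample tasks
    meta_grad = np.zeros_like(theta)
    for support, query in batch:                      # lines 3-6
        theta_i = theta - alpha * grad(theta, *support)  # adapted parameters
        # FoMAML: use the query-set gradient at theta_i directly,
        # ignoring second-order terms for efficiency
        meta_grad += grad(theta_i, *query)
    theta = theta - beta * meta_grad / len(batch)     # line 7: meta-update

# After meta-training, a single inner step on a fresh task's support set
# should already reduce the query loss relative to the unadapted theta.
support, query = sample_task()
theta_i = theta - alpha * grad(theta, *support)
improved = loss(theta_i, *query) < loss(theta, *query)
```

The full MAML objective would differentiate through the inner update; dropping that second-order term, as above, is exactly the memory-saving approximation described in § 6.2.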

Our pre-trained baseline NMT model is a sequence-to-sequence model that parametrizes the conditional probability of the target sequence given the source sequence using an encoder-decoder architecture with self-attention Transformer models (Vaswani et al., 2017).

5 Experimental Setup and Results

We seek to answer the following questions experimentally:

  1. How does Meta-MT compare empirically to alternative adaptation strategies? (§ 6.4)

  2. What is the impact of the support and the query sizes used for meta-learning? (§ 6.5)

  3. What is the effect of the NMT model architecture on performance? (§ 6.6)

6 Statistics of in-domain data sets

Domain # sentences # En Tokens
bible-uedin 62195 1550431
ECB 113174 3061513
KDE4 224035 1746216
Tanzil 537128 9489824
WMT-News 912212 5462820
Books 51467 1054718
EMEA 1108752 12322425
GlobalVoices 66650 1239921
ufal-Med 140600 5527010
TED 51368 1060765
Table 1: Dataset statistics for different domains.

Table 1 lists the sizes of various in-domain datasets from which we sample our in-domain data to simulate the few-shot adaptation setup.

In our experiments, we train Meta-MT only on simulated data, where we simulate a few-shot domain adaptation setting as described in § 4.2. This is possible because Meta-MT learns model parameters that can generalize to future adaptation tasks by optimizing the meta-objective function from § 4.2.

We train and evaluate Meta-MT on a collection of ten different datasets. All of these datasets are collected from the Open Parallel Corpus (OPUS) (Tiedemann, 2012) and are publicly available online. The datasets cover a variety of diverse domains that should enable us to evaluate our proposed approach. The datasets we consider are:

  1. Bible: a parallel corpus created from translations of the Bible (Christodouloupoulos and Steedman, 2015).

  2. European Central Bank (ECB): website and documentation from the European Central Bank.

  3. KDE: a corpus of KDE4 localization files.

  4. Quran (Tanzil): a collection of Quran translations compiled by the Tanzil project.

  5. WMT-News: a parallel corpus of news test sets provided by WMT.

  6. Books: a collection of copyright-free books.

  7. European Medicines Agency (EMEA): a parallel corpus made out of PDF documents from the European Medicines Agency.

  8. Global Voices: parallel news stories from the Global Voices web site.

  9. Medical (ufal-Med): the UFAL medical domain dataset from Yepes et al. (2017).

  10. TED talks: talk subtitles from Duh (2018).

We simulate the few-shot NMT adaptation scenarios by randomly sub-sampling these datasets at different sizes, ranging from 4k to 64k training words. This data is the only data used for any given domain across all adaptation setups. It is worth noting that the different datasets have a wide range of sentence lengths. We opted to sample using the number of words instead of the number of sentences to avoid giving an advantage to domains with longer sentences.
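This word-budget sampling can be sketched as follows (a hypothetical helper, not the paper's code; following the paper, budgets are counted on the source side):

```python
import random

def sample_task_by_words(corpus, support_words=4000, query_words=32000, seed=0):
    """Sample a (support, query) adaptation task by a source-word budget
    rather than a sentence count, so domains with longer sentences get no
    size advantage. `corpus` is a list of (src, tgt) sentence pairs."""
    rng = random.Random(seed)
    pool = list(corpus)
    rng.shuffle(pool)

    def take(budget):
        chosen, total = [], 0
        while pool and total < budget:
            src, tgt = pool.pop()
            chosen.append((src, tgt))
            total += len(src.split())   # count source-side words
        return chosen

    support = take(support_words)       # e.g. 4k source words
    query = take(query_words)           # e.g. 32k source words
    return support, query

# tiny demonstration corpus: 4 source words per pair
corpus = [(f"s{i} a b c", f"t{i} x y z") for i in range(100)]
support, query = sample_task_by_words(corpus, support_words=20, query_words=40)
```

Because sentences are popped from a shared shuffled pool, the support and query sets of a task are always disjoint.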

6.1 Domain Adaptation Approaches

Our experiments aim to determine how Meta-MT compares to standard domain adaptation strategies. In particular, we compare to:

  (A) No fine-tuning: the non-adaptive baseline. Here, the pre-trained model is evaluated on the meta-test and meta-validation datasets (see Figure 1) without any kind of adaptation.

  (B) Fine-tuning on a single task: the domain adaptation by fine-tuning baseline. For a single adaptation task T, this approach performs domain adaptation by fine-tuning only on the task's support set, i.e. the pre-trained model is fine-tuned on just a few thousand training words, to show how classical fine-tuning behaves in few-shot settings.

  (C) Fine-tuning on meta-train: similar to (B); however, this approach fine-tunes on much more data: all the support sets used for meta-training. The goal of this baseline is to ensure that Meta-MT doesn't get an additional advantage by training on more data during the meta-training phase. It establishes a fair baseline for how classical fine-tuning performs given the same data, albeit in a different configuration.

  (D) Meta-MT: our proposed approach from Alg 1. In this setup, we use the adaptation tasks in the meta-training dataset, each with its own support set, to perform meta-learning. Second-order meta-gradients are ignored to decrease the computational complexity.

6.2 Model Architecture and Implementation Details

We use the Transformer model (Vaswani et al., 2017) implemented in fairseq (Ott et al., 2019). In this work, we use a Transformer with a modified architecture that facilitates better adaptation: "adapter modules" (Houlsby et al., 2019; Bapna et al., 2019), which introduce an extra layer after each Transformer block and enable more efficient tuning of the model. Following Bapna et al. (2019), we augment the Transformer model with feed-forward adapters: simple single-hidden-layer feed-forward networks with a nonlinear activation function between the two projection layers. These adapter modules are introduced after the layer norm and before the residual connection; each is composed of a down-projection layer, followed by a ReLU, followed by an up-projection layer. This bottlenecked module with few parameters is very attractive for domain adaptation, as we discuss in § 6.6. These modules are introduced after every layer in both the encoder and the decoder. All experiments are based on the "base" Transformer model with six blocks in the encoder and decoder networks. Each encoder block contains a self-attention layer, followed by two fully connected feed-forward layers with a ReLU non-linearity between them. Each decoder block contains self-attention, followed by encoder-decoder attention, followed by two fully connected feed-forward layers with a ReLU non-linearity between them.
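The adapter computation just described can be sketched in a few lines of NumPy (forward pass only; the hidden size and initialization here are illustrative, not the paper's values):

```python
import numpy as np

d_model, d_adapter = 512, 64     # illustrative sizes; d_adapter << d_model

rng = np.random.default_rng(0)
W_down = rng.normal(scale=0.02, size=(d_model, d_adapter))  # down-projection
W_up = rng.normal(scale=0.02, size=(d_adapter, d_model))    # up-projection

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adapter(x):
    """x: (seq_len, d_model) output of a transformer block. The adapter
    normalizes, bottlenecks through d_adapter units with a ReLU, projects
    back to d_model, and adds the result to the residual stream."""
    h = layer_norm(x)
    h = np.maximum(h @ W_down, 0.0)   # down-project + ReLU
    h = h @ W_up                      # up-project
    return x + h                      # residual connection

x = rng.normal(size=(7, d_model))
out = adapter(x)
```

During adaptation only W_down and W_up (and any adapter layer-norm parameters) would be updated, which is what makes the module cheap to fine-tune per domain.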

We use word representations of size , feed-forward layers with inner dimensions , multi-head attention with attention heads, and adapter modules with hidden units. We apply dropout (Srivastava et al., 2014) with probability . The model is optimized with Adam (Kingma and Ba, 2014) using , and a learning rate . We use the same learning rate schedule as Vaswani et al. (2017) where the learning rate increases linearly for steps to , after which it is decayed proportionally to the inverse square root of the number of steps. For meta-learning, we used a meta-batch size of . We optimized the meta-learning loss function using Adam with a learning rate of and default parameters for .

All data is pre-processed with joint sentence-pieces (Kudo and Richardson, 2018) of size 40k. In all cases, the baseline machine translation system is a neural English-to-German (En-De) Transformer model (Vaswani et al., 2017), initially trained on 5.2M sentences filtered from the standard parallel data (Europarl-v9, CommonCrawl, NewsCommentary-v14, wikititles-v1, and Rapid-2019) from the WMT19 shared task (Barrault et al., 2019). We use the WMT14 and WMT19 newstest sets as validation and test sets respectively. The baseline system scores 37.99 BLEU on the full WMT19 newstest, which compares favorably with strong single-system baselines at the WMT19 shared task (Ng et al., 2019; Junczys-Dowmunt, 2019).

For meta-learning, we use the MAML algorithm as described in Alg 1. To minimize memory consumption, we ignored the second-order gradient terms, implementing the First-Order MAML approximation (FoMAML) as described in Finn et al. (2017). We also experimented with the first-order meta-learning algorithm Reptile (Nichol et al., 2018). We found that since Reptile doesn't directly account for the performance on the task query set, combined with the large model capacity of the Transformer architecture, it can easily over-fit to the support set, achieving almost perfect performance on the support set while the performance on the query set degrades significantly. Even after performing early stopping on the query set, Reptile didn't account correctly for learning rate scheduling, and finding suitable learning rates for optimizing the meta-learner and the task adaptation was difficult. In our experiments, we found it essential to match the behavior of the dropout layers when computing the meta-objective function with the test-time behavior described in § 4.1. In particular, the model has to run in "evaluation mode" when computing the loss on the task query set to match the test-time behavior during evaluation.

Table 2: BLEU scores on the meta-test split for the different approaches (A. No fine-tuning, B. Fine-tuning on task, C. Fine-tuning on meta-train, D. Meta-MT) evaluated across ten domains. Best results are highlighted in bold; results within two standard deviations of the best value are underlined.

Figure 3: BLEU scores on meta-test split for different approaches evaluated across ten domains.

6.3 Evaluation Tasks and Metrics

Our experimental setup operates as follows: using a collection of simulated machine translation adaptation tasks, we train an NMT model using Meta-MT (Alg 1). This model learns to adapt faster to new domains by fine-tuning on a tiny support set. Once θ is learned and fixed, we follow the test-time behavior described in § 4.1. We evaluate Meta-MT on the collection of ten different domains described in § 5. We simulate domain adaptation problems by sub-sampling tasks with 4k English tokens for the support set and 32k tokens for the query set. We study the effect of varying the sizes of the query and support sets in § 6.5. For the meta-training dataset, we sample an equal number of tasks from each of the ten different domains; for the meta-validation and meta-test sets, we sample a single task from each domain. We report the mean and standard deviation over three different meta-test sets. For evaluation, we use BLEU (Papineni et al., 2002); we measure case-sensitive de-tokenized BLEU with SacreBLEU (Post, 2018). All results use beam search with a beam of size five.

6.4 Experimental Results

Here, we describe our experimental results comparing the several algorithms from § 6.1. The overall results are shown in Table 2 and Figure 3. Table 2 shows the BLEU scores on the meta-test dataset for all the different approaches across the ten domains. From these results we draw the following conclusions:

  1. The pre-trained En-De NMT model performs well on general domains. For instance, BLEU for WMT-News (a subset of the full test set, to match the sizes of the query sets from other domains), GlobalVoices, and ECB is comparatively high, while performance degrades on closed domains like Books, Quran, and Bible [Column A].

  2. Domain adaptation by fine-tuning on a single task doesn't improve the BLEU score. This is expected, since we're only fine-tuning on 4k tokens [A vs B].

  3. Significant leverage is gained by increasing the amount of fine-tuning data: fine-tuning on all the data available for meta-learning improves the BLEU score significantly across all domains [B vs C]. To put this into perspective, this setup is tuned on the data aggregated from all the tasks' support sets.

  4. Meta-MT outperforms the alternative domain adaptation approaches on all domains, with negligible degradation on the baseline domain. Meta-MT is better than the non-adaptive baseline [A vs D], and succeeds in learning to adapt faster given the same amount of fine-tuning data [B vs D, C vs D]. Both fine-tuning on meta-train [C] and Meta-MT [D] have access to exactly the same amount of training data, and both use the same model architecture. The difference, however, is in the learning algorithm: Meta-MT uses MAML (Alg 1) to optimize the meta-objective, which ensures that the learned model initialization can easily be fine-tuned to new domains with very few examples.

Figure 4: Meta-MT and fine-tuning adaptation performance on the meta-test set vs different support set sizes per adaptation task.
Figure 5: Meta-MT and fine-tuning adaptation performance on the meta-test set vs different query set sizes per adaptation task.

6.5 Impact of Adaptation Task Size

To evaluate the effectiveness of Meta-MT when adapting with small in-domain corpora, we further compare the performance of Meta-MT with classical fine-tuning on varying amounts of training data per adaptation task. In Figure 4 we plot the overall adaptation performance on the ten domains when using different data sizes for the support set. In this experiment, the only parameter that varies is the size of the task support set; we fix the size of the query set per task and vary the size of the support set. To ensure that the total amount of meta-training data is the same, we use proportionally more meta-training tasks for smaller support sizes and fewer tasks for larger ones. This controlled setup ensures that no setting has an advantage by getting access to additional amounts of training data. We notice that for reasonably small support sets, Meta-MT outperforms the classical fine-tuning baseline. However, as the support size increases, Meta-MT is outperformed by the fine-tuning baseline. This happens because for a larger support size we have access to only a handful of meta-training tasks, which is not enough to generalize to new unseen adaptation tasks: Meta-MT over-fits to the meta-training tasks, and its performance doesn't generalize to the meta-test set.

To understand more directly the impact of the query set on Meta-MT's performance, in Figure 5 we show Meta-MT and fine-tuning adaptation performance on the meta-test set for varying query set sizes. We fix the support size and vary the size of the query set. We observe that Meta-MT's edge over fine-tuning adaptation increases as we increase the size of the query set: the improvement in BLEU is largest for the largest query sets and smallest for the smallest ones.

6.6 Impact of Model Architecture

In our experiments, we used the Adapter Transformer architecture (Bapna et al., 2019). This architecture fixes the parameters of the pre-trained Transformer model and only adapts the feed-forward adapter modules, a small fraction of the model's parameters. We found this adaptation strategy to be more robust to meta-learning. To better understand this, Figure 6 shows the BLEU scores for the two different model architectures. We find that while the meta-learned Transformer architecture (right) slightly outperforms the Adapter model (left), it suffers from catastrophic forgetting: Meta-MT-0 shows the zero-shot BLEU score before fine-tuning on the task's support set. For the Transformer model, this score drops to zero and then quickly improves once the parameters are tuned on the support set. This is undesirable, since it hurts the performance of the pre-trained model even on the general domain data. We notice that the Adapter model doesn't suffer from this problem.

Figure 6: BLEU scores reported for two different model architectures: the Adapter Transformer Bapna et al. (2019) (Left), and the Transformer base architecture Vaswani et al. (2017) (Right).

7 Conclusion

We presented Meta-MT, a meta-learning approach for few-shot NMT adaptation. We formulated few-shot NMT adaptation as a meta-learning problem, and presented a strategy that learns better parameters for NMT systems, parameters that can be easily adapted to new domains. We validated the superiority of Meta-MT over alternative domain adaptation approaches: Meta-MT outperforms alternative strategies in most domains using only a small fraction of the fine-tuning data.


The authors would like to thank members of the Microsoft Machine Translation Team as well as members of the Computational Linguistics and Information Processing (CLIP) lab for reviewing earlier versions of this work. Part of this work was conducted when the first author was on a summer internship with Microsoft Research. This material is based upon work supported by the National Science Foundation under Grant No. 1618193. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


  • D. Bahdanau, K. Cho, and Y. Bengio (2016) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. External Links: 1409.0473v7, Link Cited by: §1, §3.1.
  • A. Bapna, N. Arivazhagan, and O. Firat (2019) Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478. Cited by: §2, §2, Figure 6, §6.2, §6.6.
  • L. Barrault, O. Bojar, M. R. Costa-jussà, C. Federmann, M. Fishel, Y. Graham, B. Haddow, M. Huck, P. Koehn, S. Malmasi, et al. (2019) Findings of the 2019 conference on machine translation (wmt19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pp. 1–61. Cited by: §1, §6.2.
  • C. Christodoulopoulos and M. Steedman (2015) A massively parallel corpus: the Bible in 100 languages. Language resources and evaluation 49 (2), pp. 375–395. Cited by: item 1.
  • C. Chu, R. Dabre, and S. Kurohashi (2018) A comprehensive empirical comparison of domain adaptation methods for neural machine translation. Journal of Information Processing 26, pp. 529–538. Cited by: §2.
  • C. Chu and R. Wang (2018) A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1304–1319. External Links: Link Cited by: §1, §4.
  • Z. Dou, K. Yu, and A. Anastasopoulos (2019) Investigating meta-learning algorithms for low-resource natural language understanding tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 1192–1197. External Links: Link, Document Cited by: §2.
  • K. Duh (2018) The multitarget TED talks task. Cited by: item 10.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 1126–1135. External Links: Link Cited by: §1, §3.3, §4.2, §4.2, §6.2.
  • M. Freitag and Y. Al-Onaizan (2016) Fast domain adaptation for neural machine translation. arXiv preprint arXiv:1612.06897. Cited by: §1.
  • J. Gu, Y. Wang, Y. Chen, V. O. K. Li, and K. Cho (2018) Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3622–3631. External Links: Link, Document Cited by: §2.
  • N. Houlsby, A. Giurgiu, S. Jastrzębski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. CoRR abs/1902.00751. External Links: Link, 1902.00751 Cited by: §6.2.
  • M. Junczys-Dowmunt (2019) Microsoft translator at wmt 2019: towards large-scale document-level neural machine translation. In WMT, Cited by: §6.2.
  • H. Khayrallah, G. Kumar, K. Duh, M. Post, and P. Koehn (2017) Neural lattice search for domain adaptation in machine translation. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Taipei, Taiwan, pp. 20–25. External Links: Link Cited by: §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §6.2.
  • P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, pp. 28–39. External Links: Link, Document Cited by: §1.
  • T. Kudo and J. Richardson (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 66–71. External Links: Link, Document Cited by: §6.2.
  • M. Luong and C. D. Manning (2015) Stanford neural machine translation systems for spoken language domain. In International Workshop on Spoken Language Translation, Cited by: §1.
  • F. Mi, M. Huang, J. Zhang, and B. Faltings (2019) Meta-learning for low-resource natural language generation in task-oriented dialogue systems. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, pp. 3151–3157. External Links: ISBN 978-0-9992411-4-1, Link Cited by: §2.
  • P. Michel and G. Neubig (2018) Extreme adaptation for personalized neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 312–318. External Links: Link, Document Cited by: §2.
  • E. Neuwirth (2014) RColorBrewer: ColorBrewer palettes. R package. Cited by: footnote 3.
  • N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook FAIR’s WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), Florence, Italy, pp. 314–319. External Links: Link, Document Cited by: §6.2.
  • A. Nichol, J. Achiam, and J. Schulman (2018) On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999. Cited by: §3.3, §4.2, §6.2.
  • M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §6.2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §6.3.
  • M. Post (2018) A call for clarity in reporting bleu scores. arXiv preprint arXiv:1804.08771. Cited by: §6.3.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 86–96. External Links: Link, Document Cited by: §1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §6.2.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 3104–3112. External Links: Link Cited by: §1, §3.1.
  • B. Thompson, J. Gwinnup, H. Khayrallah, K. Duh, and P. Koehn (2019) Overcoming catastrophic forgetting during domain adaptation of neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 2062–2068. External Links: Link, Document Cited by: §1.
  • J. Tiedemann (2012) Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Dogan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis (Eds.), Istanbul, Turkey. External Links: ISBN 978-2-9517408-7-7 Cited by: §6.
  • A. Vaswani, L. Huang, and D. Chiang (2012) Smaller alignment models for better translations: unsupervised word alignment with the l0-norm. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 311–319. External Links: Link Cited by: Figure 6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.1, §4.2, §6.2, §6.2, §6.2.
  • D. Vilar (2018) Learning hidden unit contribution for adapting neural machine translation models. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 500–505. External Links: Link, Document Cited by: §2.
  • Q. Wu, Z. Lin, G. Wang, H. Chen, B. F. Karlsson, B. Huang, and C. Lin (2019) Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. arXiv preprint arXiv:1911.06161. Cited by: §2.
  • A. J. Yepes, A. Névéol, M. Neves, K. Verspoor, O. Bojar, A. Boyer, C. Grozea, B. Haddow, M. Kittner, Y. Lichtblau, et al. (2017) Findings of the wmt 2017 biomedical translation shared task. In Proceedings of the Second Conference on Machine Translation, pp. 234–247. Cited by: item 9.