Soft Layer Selection with Meta-Learning for Zero-Shot Cross-Lingual Transfer

07/21/2021 · Weijia Xu, et al. · University of Maryland, Amazon

Multilingual pre-trained contextual embedding models (Devlin et al., 2019) have achieved impressive performance on zero-shot cross-lingual transfer tasks. Finding the most effective strategy for fine-tuning these models on high-resource languages so that they transfer well to zero-shot languages is a non-trivial task. In this paper, we propose a novel meta-optimizer that soft-selects which layers of the pre-trained model to freeze during fine-tuning. We train the meta-optimizer by simulating the zero-shot transfer scenario. Results on cross-lingual natural language inference show that our approach improves over the simple fine-tuning baseline and X-MAML (Nooralahzadeh et al., 2020).


1 Introduction

Despite the impressive performance of neural models on a wide variety of NLP tasks, these models are extremely data-hungry: training them requires large amounts of annotated data. As collecting such data for every language of interest is extremely expensive, cross-lingual transfer, which aims to transfer task knowledge from high-resource (source) languages, where annotated data are more readily available, to low-resource (target) languages, is a promising direction. Cross-lingual transfer approaches using cross-lingual resources such as machine translation (MT) systems (Wan, 2009; Conneau et al., 2018) or bilingual dictionaries (Prettenhofer and Stein, 2010) have effectively reduced the amount of annotated data required to obtain reasonable performance on the target language. However, such cross-lingual resources are often limited for low-resource languages.

Recent advances in cross-lingual contextual embedding models have reduced the need for cross-lingual supervision (Devlin et al., 2019; Lample and Conneau, 2019). Wu and Dredze (2019) show that multilingual BERT (mBERT) (Devlin et al., 2019), a contextual embedding model pre-trained on concatenated Wikipedia data from 104 languages without cross-lingual alignment, does surprisingly well on zero-shot cross-lingual transfer tasks, where the model is fine-tuned on annotated data from the source languages and evaluated on the target language. Wu and Dredze (2019) further propose to freeze the bottom layers of mBERT during fine-tuning to improve cross-lingual performance over the simple fine-tune-all-parameters strategy, as different layers of mBERT capture different linguistic information (Jawahar et al., 2019).

Selecting which layers to freeze for a downstream task is a non-trivial problem. In this paper, we propose a novel meta-learning algorithm for soft layer selection. Our meta-learning algorithm learns layer-wise update rates by simulating the zero-shot transfer scenario: at each round, we randomly split the source languages into a held-out language and the remaining training languages, fine-tune the model on the training languages, and update the meta-parameters based on the model's performance on the held-out language. We build the meta-optimizer on top of a standard optimizer and learnable update rates, so that it generalizes well to large numbers of updates. Our method uses far fewer meta-parameters than X-MAML (Nooralahzadeh et al., 2020), an adaptation of model-agnostic meta-learning (MAML) (Finn et al., 2017) to zero-shot cross-lingual transfer.

Experiments on zero-shot cross-lingual natural language inference show that our approach outperforms both the simple fine-tuning baseline and the X-MAML algorithm, and that our approach brings larger gains when transferring from multiple source languages. An ablation study shows that both the layer-wise update rates and cross-lingual meta-training are key to the success of our approach.

2 Meta-Learning for Zero-Shot Cross-lingual Transfer

The idea of transfer learning is to improve the performance on the target task $\mathcal{T}_{tgt}$ by learning from a set of related source tasks $\mathcal{T}_{src} = \{\mathcal{T}_1, \ldots, \mathcal{T}_K\}$. In the context of cross-lingual transfer, we treat different languages as separate tasks, and our goal is to transfer the task knowledge from the source languages to the target language. In contrast to the standard transfer learning case, where the inputs of the source and target tasks are from the same language, in cross-lingual transfer learning we need to handle inputs from different languages with different vocabularies and syntactic structures. To handle this issue, we use the pre-trained multilingual BERT (Devlin et al., 2019), a language model encoder trained on the concatenation of monolingual corpora from 104 languages.

The most widely used approach to zero-shot cross-lingual transfer with multilingual BERT is to fine-tune the BERT model $f_\theta$ on the source language tasks $\mathcal{T}_{src}$ with training objective $\mathcal{L}_{\mathcal{T}_{src}}(\theta)$, and then evaluate the fine-tuned model $f_{\theta^*}$ on the target language task $\mathcal{T}_{tgt}$. The gap between training and test languages can lead to sub-optimal performance on the target language.

To address this issue, we propose to train a meta-optimizer $M_\phi$ for fine-tuning so that the fine-tuned model generalizes better to unseen languages. We train the meta-optimizer by

$$\phi^* = \arg\min_\phi \; \mathbb{E}_{\mathcal{T}_s \sim \mathcal{T}_{src}} \Big[ \mathcal{L}_{\mathcal{T}_s}\big(\theta^*_{M_\phi}(\mathcal{T}_{src} \setminus \mathcal{T}_s)\big) \Big],$$

where $\mathcal{T}_s$ is a “surprise” language randomly selected from the source language tasks $\mathcal{T}_{src}$, and $\theta^*_{M_\phi}(\mathcal{T}_{src} \setminus \mathcal{T}_s)$ denotes the parameters obtained by fine-tuning on the remaining source languages with the meta-optimizer $M_\phi$.

Input: Training data $\mathcal{D}$ in the source languages, learner model $f$ with parameters $\theta$, and meta-optimizer with base optimizer $g$ and meta-parameters $\phi$.
Output: Meta-optimizer with parameters $\phi$.
1  Randomly initialize $\phi$.
2  repeat $N$ times
3      Initialize $\theta$ with mBERT and random values for the classification layer.
4      Randomly select a test language $\ell$ to form the test data $\mathcal{D}_{test}$.
5      $\mathcal{D}_{train} \leftarrow \mathcal{D} \setminus \mathcal{D}_{test}$
6      repeat $K$ times
7          $\mathcal{B}_t \leftarrow$ random batch from $\mathcal{D}_{train}$
8          Compute the loss $\mathcal{L}(f_{\theta_t}; \mathcal{B}_t)$ and the gradients $\nabla_{\theta_t}$
9          $\Delta_t \leftarrow g(\nabla_{\theta_0}, \ldots, \nabla_{\theta_t})$   // update proposed by the base optimizer
10         $\alpha \leftarrow \sigma(\phi)$   // layer-wise update rates
11         $\theta_{t+1} \leftarrow \theta_t - \alpha \odot \Delta_t$   // gated update (Eq. 1)
12     end
13     Compute the loss $\mathcal{L}(f_{\theta_K}; \mathcal{D}_{test})$ on the test data
14     Update $\phi$ with the gradient of this loss with respect to $\phi$, without back-propagating through the inner loop
15 end
Algorithm 1 Meta-Training
Figure 1: Computational graph for the forward pass of the meta-optimizer. Each batch $\mathcal{B}_t$ is drawn from the training data $\mathcal{D}_{train}$, and $\mathcal{D}_{test}$ denotes the entire test set. The meta-learner comprises a base optimizer, which takes the history and current-step gradients as inputs and suggests an update $\Delta_t$, and the meta-parameters $\phi$, which control the layer-wise update rates $\alpha$ for the learner model $f_\theta$. The dashed arrows indicate that we do not back-propagate the gradients through that step when updating the meta-parameters.

2.1 Meta-Optimizer

Our meta-optimizer consists of a standard optimizer as the base optimizer and a set of meta-parameters that control the layer-wise update rates. An update step is formulated as:

$$\theta_{t+1} = \theta_t - \alpha \odot \Delta_t, \qquad (1)$$

where $\theta_t$ denotes the parameters of the learner model at time step $t$, and $\Delta_t$ is the update vector produced by the base optimizer $g$ given the gradients $\nabla_{\theta_0}, \ldots, \nabla_{\theta_t}$ at the current and previous steps. The function $g$ is defined by the optimization algorithm and its hyper-parameters; for example, a typical gradient descent algorithm uses $g(\nabla_{\theta_0}, \ldots, \nabla_{\theta_t}) = \eta \nabla_{\theta_t}$, where $\eta$ is the learning rate. A standard optimization algorithm updates the model parameters by:

$$\theta_{t+1} = \theta_t - \Delta_t. \qquad (2)$$

Our meta-optimizer differs in that it performs a gated update using parametric update rates $\alpha$, computed as $\alpha = \sigma(\phi)$, where $\phi$ denotes the meta-parameters of the meta-optimizer $M_\phi$. The sigmoid function ensures that the update rates lie within the range $(0, 1)$. Different from Andrychowicz et al. (2016), in which the optimizer parameters are shared across all coordinates of the model, our meta-optimizer learns a different update rate for each model layer. This is based on the finding that different layers of the BERT encoder capture different linguistic information, with syntactic features in middle layers and semantic information in higher layers (Jawahar et al., 2019); thus, different layers may generalize differently across languages.
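
To make Eq. 1 concrete, below is a minimal NumPy sketch of the gated layer-wise update, assuming the base optimizer's update vectors $\Delta_t$ have already been computed; the dictionary layout and names are illustrative, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_update(params, updates, phi):
    """Apply the gated layer-wise update of Eq. 1.

    params:  dict mapping layer name -> parameter array (theta_t)
    updates: dict mapping layer name -> update vector from the base
             optimizer (Delta_t), e.g. Adam's proposed step for that layer
    phi:     dict mapping layer name -> scalar meta-parameter; the
             update rate is alpha_l = sigmoid(phi_l), which lies in (0, 1)
    """
    new_params = {}
    for layer, theta in params.items():
        alpha = sigmoid(phi[layer])                 # layer-wise update rate
        new_params[layer] = theta - alpha * updates[layer]
    return new_params

# Toy usage: two layers; the base optimizer here is plain SGD (lr * grad).
params = {"layer_0": np.ones(4), "layer_1": np.ones(4)}
grads = {"layer_0": np.full(4, 0.5), "layer_1": np.full(4, 0.5)}
lr = 0.1
updates = {name: lr * g for name, g in grads.items()}
phi = {"layer_0": -2.0, "layer_1": 2.0}             # layer_1 is updated faster
params = gated_update(params, updates, phi)
```

A small phi value pushes the update rate toward zero, which corresponds to (softly) freezing that layer during fine-tuning.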

Figure 1 illustrates the computational graph for the forward pass when training the meta-optimizer. Note that since the losses $\mathcal{L}_t$ and gradients $\nabla_{\theta_t}$ depend on the parameters of the meta-optimizer, computing the gradients along the dashed edges would normally require taking second derivatives, which is computationally expensive. Following Andrychowicz et al. (2016), we drop the gradients along the dashed edges and only compute gradients along the solid edges.

2.2 Meta-Training

A good meta-optimizer will, given the training data in the source languages and the training objective, suggest an update rule for the learner model so that it performs well on the target language. Thus, we would like the training condition to match that of the test time. However, in zero-shot transfer we assume no access to the target language data, so we need to simulate the test scenario using only the training data on the source languages.

As shown in Algorithm 1, at each episode of the outer loop we randomly choose a test language $\ell$ to construct the test data $\mathcal{D}_{test}$ and use the remaining data as the training data $\mathcal{D}_{train}$. Then, we re-initialize the parameters of the learner model and start the training simulation. At each training step, we first use the base optimizer $g$ to compute the update vector $\Delta_t$ based on the current and history gradients $\nabla_{\theta_0}, \ldots, \nabla_{\theta_t}$. We then perform the gated update using the meta-optimizer $M_\phi$ with Eq. 1. The resulting model $f_{\theta_{t+1}}$ can be viewed as the output of a forward pass of the meta-optimizer. After every $K$ iterations of model update, we compute the gradient of the loss on the test data $\mathcal{D}_{test}$ with respect to the old meta-parameters $\phi$ and make an update to the meta-parameters. Our meta-learning algorithm differs from X-MAML (Nooralahzadeh et al., 2020) in that 1) X-MAML is designed mainly for few-shot transfer while our algorithm is designed for zero-shot transfer, and 2) our algorithm uses far fewer meta-parameters than X-MAML, as it only requires training an update rate for each layer, whereas X-MAML meta-learns the initial parameters of the entire model.
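
The meta-training procedure can be summarized by the schematic Python skeleton below. All helper callables (init_learner, sample_batch, grad_fn, base_step, gated_update, meta_step) are placeholders for the components described above, not the paper's actual implementation; the sketch only illustrates the control flow of Algorithm 1.

```python
import random

def meta_train(source_languages, data, num_episodes, inner_steps,
               init_learner, sample_batch, grad_fn, base_step,
               gated_update, meta_step, phi):
    """Schematic meta-training loop (cf. Algorithm 1).

    At each episode, one source language is held out as a simulated
    zero-shot target; the learner is fine-tuned on the rest with gated
    layer-wise updates, and the meta-parameters phi are then updated
    from the loss on the held-out language.
    """
    for _ in range(num_episodes):
        # Re-initialize the learner from mBERT plus a fresh classifier head.
        theta = init_learner()
        # Pick a "surprise" language and split the data accordingly.
        held_out = random.choice(source_languages)
        train_data = {lang: d for lang, d in data.items() if lang != held_out}
        test_data = data[held_out]

        opt_state = None
        for _ in range(inner_steps):
            batch = sample_batch(train_data)
            grads = grad_fn(theta, batch)
            # Base optimizer (e.g. Adam) proposes an update from the
            # current and past gradients ...
            delta, opt_state = base_step(grads, opt_state)
            # ... which is scaled layer-wise by alpha = sigmoid(phi) (Eq. 1).
            theta = gated_update(theta, delta, phi)

        # Evaluate on the held-out language and update phi, without
        # back-propagating through the whole inner loop (first-order).
        phi = meta_step(phi, theta, test_data)
    return phi
```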

3 Experiments

 | fr | es | de | ar | ur | bg | sw | th | tr | vi | zh | ru | el | hi | avg
Devlin et al. (2019) | – | 74.30 | 70.50 | 62.10 | 58.35 | – | – | – | – | – | 63.80 | – | – | – | –
Wu and Dredze (2019) | 74.60 | 74.90 | 72.00 | 66.10 | 58.60 | 69.80 | 49.40 | 55.70 | 62.00 | 71.90 | 70.40 | 69.80 | 67.90 | 61.20 | 66.02
Nooralahzadeh et al. (2020) | 74.42 | 75.07 | 71.83 | 66.05 | 61.51 | 69.45 | 49.76 | 55.39 | 61.20 | 71.82 | 71.11 | 70.19 | 67.95 | 62.20 | 66.28
Aux. language | el | el | el | el | el | el | el | el | el | el | ur | ur | ur | ur | –
Fine-tuning baseline | 75.42 | 75.77 | 72.57 | 67.22 | 61.08 | 70.23 | 51.70 | 51.03 | 64.26 | 71.61 | 72.52 | 69.97 | 69.16 | 55.40 | 66.28
Meta-Optimizer | 75.78 | 75.87 | 73.15 | 67.34 | 62.00 | 70.47 | 51.22 | 50.54 | 63.96 | 72.06 | 72.32 | 70.20 | 69.34 | 55.88 | 66.44
Aux. language: el + ur
Fine-tuning baseline | 74.87 | 75.78 | 72.27 | 66.96 | 62.73 | 70.16 | 50.21 | 48.20 | 63.86 | 71.61 | 71.97 | 70.24 | 69.64 | 56.04 | 66.04
Meta-Optimizer | 75.53 | 75.93 | 72.68 | 67.04 | 63.33 | 70.88 | 51.51 | 49.89 | 64.33 | 72.06 | 72.36 | 70.32 | 70.38 | 56.29 | 66.61
Table 1: Accuracy of our approach compared with baselines on the XNLI dataset (averaged over five runs). We compare our approach (Meta-Optimizer) with our fine-tuning baseline using one or two auxiliary languages, the fine-tuning results in Devlin et al. (2019), the highest scores (with a selected subset of layers fixed during fine-tuning) in Wu and Dredze (2019), and the best zero-shot results using X-MAML (Nooralahzadeh et al., 2020) with one auxiliary language. We boldface the highest scores within each auxiliary language setting.

We evaluate our meta-learning approach on natural language inference. Natural Language Inference (NLI) can be cast into a sequence pair classification problem where, given a premise and a hypothesis sentence, the model needs to predict whether the premise entails the hypothesis, contradicts it, or neither (neutral). We use the Multi-Genre Natural Language Inference Corpus (Williams et al., 2018), which consists of 433k English sentence pairs labeled with textual entailment information, and the XNLI dataset (Conneau et al., 2018), which has 2.5k development and 5k test sentence pairs in 15 languages including English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw), and Urdu (ur). We use this dataset to evaluate the effectiveness of our meta-learning algorithm when transferring from English and one or more low-resource auxiliary languages to the target language.
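
For concreteness, the sketch below shows how a premise/hypothesis pair is laid out as a single BERT input sequence. It is a simplified illustration of the standard sequence-pair format from Devlin et al. (2019); the token lists are toy examples, not actual WordPiece output.

```python
def build_nli_input(premise_tokens, hypothesis_tokens):
    """Lay out a WordPiece-tokenized premise/hypothesis pair in the standard
    BERT sequence-pair format: [CLS] premise [SEP] hypothesis [SEP].
    Segment ids mark whether a token belongs to the premise (0) or the
    hypothesis (1)."""
    tokens = ["[CLS]"] + premise_tokens + ["[SEP]"] + hypothesis_tokens + ["[SEP]"]
    segment_ids = [0] * (len(premise_tokens) + 2) + [1] * (len(hypothesis_tokens) + 1)
    return tokens, segment_ids

# Toy French XNLI-style pair (pretend it is already WordPiece-tokenized).
tokens, segment_ids = build_nli_input(
    ["le", "chat", "dort", "sur", "le", "canapé"],
    ["un", "animal", "se", "repose"],
)
```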

 | fr | es | de | ar | ur | bg | sw | th | tr | vi | zh | ru | el | hi | avg
Meta-Optim | 75.53 | 75.93 | 72.68 | 67.04 | 63.33 | 70.88 | 51.51 | 49.89 | 64.33 | 72.06 | 72.36 | 70.32 | 70.38 | 56.29 | 66.61
No layer-wise update | 73.45 | 73.90 | 70.73 | 65.19 | 60.31 | 69.10 | 50.87 | 46.47 | 62.74 | 70.42 | 70.24 | 68.85 | 68.17 | 53.50 | 64.57
No cross-lingual meta-train | 73.66 | 74.84 | 71.54 | 66.15 | 61.16 | 69.33 | 50.89 | 48.43 | 63.16 | 71.57 | 70.53 | 69.14 | 67.93 | 55.07 | 65.24
Table 2: Ablation results on the XNLI dataset using Greek and Urdu as the auxiliary languages (averaged over five runs). Results show that ablating the layer-wise update rate or cross-lingual meta-training degrades accuracy on all target languages.

3.1 Model and Training Configurations

Our model is based on multilingual BERT (mBERT) (Devlin et al., 2019) as implemented in GluonNLP (Guo et al., 2020). As in previous work (Devlin et al., 2019; Wu and Dredze, 2019), we tokenize the input sentences using WordPiece, concatenate them, feed the sequence to BERT, and use the hidden representation of the first token ([CLS]) for classification. The final output is computed by applying a linear projection and a softmax layer to this hidden representation. We apply dropout to the final encoder layer and fix the embedding layer during fine-tuning. Following Nooralahzadeh et al. (2020), we fine-tune mBERT in two steps: 1) fine-tune on the English data for one epoch to obtain initial model parameters, and 2) continue fine-tuning on the other source languages for two epochs. For Step 2, we compare using the standard optimizer (fine-tuning baseline) against our meta-optimizer, with the Adam optimizer (Kingma and Ba, 2015) serving as both the standard optimizer and the base optimizer inside our meta-optimizer. To train the meta-optimizer itself, we also use Adam, running $K$ training batches per outer iteration before each meta-parameter update (Algorithm 1). Different from Nooralahzadeh et al. (2020), who select for each target language the auxiliary languages that lead to the best transfer results, we simulate a more realistic scenario where only a limited set of auxiliary languages is available. We choose two distant auxiliary languages, Greek (Hellenic branch of the Indo-European language family) and Urdu (Indo-Aryan branch of the Indo-European language family), and evaluate the transfer performance on the other languages.
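
Below is a minimal NumPy sketch of the classification head described above. It is illustrative only: the actual model uses GluonNLP's mBERT implementation and trained parameters; the 768-dimensional hidden size is mBERT's standard value, and the three output classes are entailment, contradiction, and neutral.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nli_head(cls_hidden, W, b):
    """Classification head on top of the encoder.

    cls_hidden: (batch, hidden) representation of the first ([CLS]) token
                produced by mBERT for the concatenated premise/hypothesis pair
    W, b:       parameters of the linear projection to the 3 NLI labels
    """
    logits = cls_hidden @ W + b
    return softmax(logits)

# Toy shapes: mBERT hidden size 768, 3 NLI classes, batch of 2.
rng = np.random.default_rng(0)
cls_hidden = rng.normal(size=(2, 768))
W = rng.normal(scale=0.02, size=(768, 3))
b = np.zeros(3)
probs = nli_head(cls_hidden, W, b)   # shape (2, 3); each row sums to 1
```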

3.2 Main Results

As shown in Table 1, we compare our meta-learning approach with the fine-tuning baseline and with the zero-shot transfer results reported in prior work using mBERT. Our approach outperforms the fine-tuning results in Devlin et al. (2019) by 1.6–8.5%. Compared with the best fine-tuning method in Wu and Dredze (2019), which freezes a selected subset of mBERT layers during fine-tuning, our approach achieves +0.4% higher accuracy on average. Our fine-tuning baseline is strong: it achieves accuracy competitive with the best X-MAML results (Nooralahzadeh et al., 2020) using a single auxiliary language, even though we limit our choice of auxiliary language to Greek or Urdu, whereas Nooralahzadeh et al. (2020) select the best auxiliary language among all languages except the target one. Overall, our approach outperforms this strong fine-tuning baseline on 10 out of 14 languages and by +0.2% accuracy on average.

Our approach brings larger gains when using two auxiliary languages: it outperforms the fine-tuning baseline on all languages and improves the average accuracy by +0.6%. This suggests that our meta-learning approach is more effective when transferring from multiple source languages.[1]

[1] Using two auxiliary languages improves over one auxiliary language the most on languages that are lower-resource in mBERT pre-training (such as Turkish and Hindi), but does not bring gains, or even hurts, on high-resource languages (such as French and German). This is consistent with the finding in prior work that the choice of auxiliary languages is crucial in cross-lingual transfer (Lin et al., 2019). We leave further investigation of its impact on our meta-learning approach for future work.

3.3 Ablation Study

Our approach differs from Andrychowicz et al. (2016) in that 1) it adopts layer-wise update rates, whereas the meta-parameters in Andrychowicz et al. (2016) are shared across all model parameters, and 2) it trains the meta-parameters in a cross-lingual setting, whereas Andrychowicz et al. (2016) is designed for few-shot learning. We conduct ablation experiments on XNLI using Greek and Urdu as the auxiliary languages to understand how these two components contribute to model performance.

Impact of Layer-Wise Update Rate

We compare our approach with a variant that replaces the layer-wise update rates with a single update rate shared across all layers. Table 2 shows that our approach significantly outperforms this variant on all target languages, with an average margin of 2.0%. This suggests that the layer-wise update rates contribute greatly to the effectiveness of our approach.

Impact of Cross-Lingual Meta-Training

We measure the impact of cross-lingual meta-training by replacing it with joint training of the layer-wise update rates and the model parameters. As shown in Table 2, ablating cross-lingual meta-training degrades accuracy on all target languages, by 1.4% on average, which shows that our cross-lingual meta-training strategy is beneficial.

4 Related Work

4.1 Cross-lingual Transfer Learning

The idea of cross-lingual transfer is to use the annotated data in the source languages to improve the task performance on the target language with minimal or even zero target labeled data (aka zero-shot). There is a large body of work on using external cross-lingual resources such as bilingual word dictionaries (Prettenhofer and Stein, 2010; Schuster et al., 2019b; Liu et al., 2020), MT systems (Wan, 2009), or parallel corpora (Eriguchi et al., 2018; Yu et al., 2018; Singla et al., 2018; Conneau et al., 2018) to bridge the gap between the source and target languages. Recent advances in unsupervised cross-lingual representations have paved the road for transfer learning without cross-lingual resources (Yang et al., 2017; Chen et al., 2018; Schuster et al., 2019a). Our work builds on Mulcaire et al. (2019), Lample and Conneau (2019), and Pires et al. (2019), who show that language models trained on monolingual text from multiple languages provide powerful multilingual representations that generalize across languages. Recent work has shown that more advanced techniques such as freezing the model's bottom layers (Wu and Dredze, 2019) or continual learning (Liu et al., 2020) can further boost cross-lingual performance on downstream tasks. In this paper, we explore meta-learning to softly select the layers to freeze during fine-tuning.

4.2 Meta Learning

A typical meta-learning algorithm consists of two loops of training: 1) an inner loop where the learner model is trained, and 2) an outer loop where, given a meta-objective, we optimize a set of meta-parameters which controls aspects of the learning process in the inner loop. The goal is to find the optimal meta-parameters such that the inner loop performs well on the meta-objective. Existing meta-learning approaches differ in the choice of meta-parameters to be optimized and the meta-objective. Depending on the choice of meta-parameters, existing work can be divided into four categories: (a) neural architecture search (Stanley and Miikkulainen, 2002; Zoph and Le, 2016; Baker et al., 2016; Real et al., 2017; Zoph et al., 2018); (b) metric-based (Koch et al., 2015; Vinyals et al., 2016); (c) model-agnostic (MAML) (Finn et al., 2017; Ravi and Larochelle, 2016); (d) model-based (learning update rules) (Schmidhuber, 1987; Hochreiter et al., 2001; Maclaurin et al., 2015; Li and Malik, 2017).

In this paper, we focus on model-based meta-learning for zero-shot cross-lingual transfer. Early work introduces a type of network that can update its own weights (Schmidhuber, 1987, 1992, 1993). More recently, Andrychowicz et al. (2016) propose to model gradient-based update rules using an RNN and optimize it with gradient descent. However, as Wichrowska et al. (2017) point out, RNN-based meta-optimizers fail to make progress when run for large numbers of steps. They address the issue by incorporating features motivated by standard optimizers into the meta-optimizer. We instead base our meta-optimizer on a standard optimizer like Adam so that it generalizes better to large-scale training.

Meta-learning has previously been applied to few-shot cross-lingual named entity recognition (Wu et al., 2019), low-resource machine translation (Gu et al., 2018), and improving cross-domain generalization for semantic parsing (Wang et al., 2021). For zero-shot cross-lingual transfer, Nooralahzadeh et al. (2020) introduce an optimization-based meta-learning algorithm called X-MAML, which meta-learns the initial model parameters on supervised data from low-resource languages. By contrast, our meta-learning algorithm requires far fewer meta-parameters and is thus simpler than X-MAML. Bansal et al. (2020) show that MAML combined with meta-learning of learning rates improves few-shot learning. Unlike their approach, which learns layer-wise learning rates only for task-specific layers specified as a hyper-parameter within the MAML algorithm, our approach learns layer-wise learning rates for all layers, and we show its effectiveness on zero-shot cross-lingual transfer without combining it with MAML.

5 Conclusion

We propose a novel meta-optimizer that learns to soft-select which layers to freeze when fine-tuning a pre-trained language model (mBERT) for zero-shot cross-lingual transfer. Our meta-optimizer learns the update rate for each layer by simulating the zero-shot transfer scenario, in which the model fine-tuned on the source languages is tested on an unseen language. Experiments show that our approach outperforms the simple fine-tuning baseline and the X-MAML algorithm on cross-lingual natural language inference.

References

  • M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas (2016) Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989.
  • B. Baker, O. Gupta, N. Naik, and R. Raskar (2016) Designing neural network architectures using reinforcement learning. arXiv preprint arXiv:1611.02167.
  • T. Bansal, R. Jha, and A. McCallum (2020) Learning to few-shot learn across diverse natural language classification tasks. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 5108–5123.
  • X. Chen, Y. Sun, B. Athiwaratkun, C. Cardie, and K. Weinberger (2018) Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570.
  • A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov (2018) XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2475–2485.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • A. Eriguchi, M. Johnson, O. Firat, H. Kazawa, and W. Macherey (2018) Zero-shot cross-lingual classification using multilingual neural machine translation. CoRR abs/1809.04686.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1126–1135.
  • J. Gu, Y. Wang, Y. Chen, V. O. K. Li, and K. Cho (2018) Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3622–3631.
  • J. Guo, H. He, T. He, L. Lausen, M. Li, H. Lin, X. Shi, C. Wang, J. Xie, S. Zha, A. Zhang, H. Zhang, Z. Zhang, Z. Zhang, S. Zheng, and Y. Zhu (2020) GluonCV and GluonNLP: deep learning in computer vision and natural language processing. Journal of Machine Learning Research 21 (23), pp. 1–7.
  • S. Hochreiter, A. S. Younger, and P. R. Conwell (2001) Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94.
  • G. Jawahar, B. Sagot, and D. Seddah (2019) What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3651–3657.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
  • G. Koch, R. Zemel, and R. Salakhutdinov (2015) Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, Vol. 2.
  • G. Lample and A. Conneau (2019) Cross-lingual language model pretraining. Advances in Neural Information Processing Systems (NeurIPS).
  • K. Li and J. Malik (2017) Learning to optimize. In International Conference on Learning Representations.
  • Y. Lin, C. Chen, J. Lee, Z. Li, Y. Zhang, M. Xia, S. Rijhwani, J. He, Z. Zhang, X. Ma, A. Anastasopoulos, P. Littell, and G. Neubig (2019) Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135.
  • Z. Liu, G. I. Winata, Z. Lin, P. Xu, and P. Fung (2020) Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8433–8440.
  • Z. Liu, G. I. Winata, A. Madotto, and P. Fung (2020) Exploring fine-tuning techniques for pre-trained cross-lingual models via continual learning. arXiv preprint arXiv:2004.14218.
  • D. Maclaurin, D. Duvenaud, and R. Adams (2015) Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122.
  • P. Mulcaire, J. Kasai, and N. A. Smith (2019) Polyglot contextual representations improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3912–3918.
  • F. Nooralahzadeh, G. Bekoulis, J. Bjerva, and I. Augenstein (2020) Zero-shot cross-lingual transfer with meta learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4547–4562.
  • T. Pires, E. Schlinger, and D. Garrette (2019) How multilingual is multilingual BERT? CoRR abs/1906.01502.
  • P. Prettenhofer and B. Stein (2010) Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1118–1127.
  • S. Ravi and H. Larochelle (2016) Optimization as a model for few-shot learning. In International Conference on Learning Representations.
  • E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin (2017) Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning, pp. 2902–2911.
  • J. Schmidhuber (1987) Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. Ph.D. Thesis, Technische Universität München.
  • J. Schmidhuber (1992) Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139.
  • J. Schmidhuber (1993) A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, pp. 407–412.
  • S. Schuster, S. Gupta, R. Shah, and M. Lewis (2019a) Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3795–3805.
  • T. Schuster, O. Ram, R. Barzilay, and A. Globerson (2019b) Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1599–1613.
  • K. Singla, D. Can, and S. Narayanan (2018) A multi-task approach to learning multilingual representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 214–220.
  • K. O. Stanley and R. Miikkulainen (2002) Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127.
  • O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. (2016) Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
  • X. Wan (2009) Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 235–243.
  • B. Wang, M. Lapata, and I. Titov (2021) Meta-learning for domain generalization in semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 366–379.
  • O. Wichrowska, N. Maheswaranathan, M. W. Hoffman, S. G. Colmenarejo, M. Denil, N. de Freitas, and J. Sohl-Dickstein (2017) Learned optimizers that scale and generalize. In International Conference on Machine Learning, pp. 3751–3760.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
  • Q. Wu, Z. Lin, G. Wang, H. Chen, B. F. Karlsson, B. Huang, and C. Lin (2019) Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. arXiv preprint arXiv:1911.06161.
  • S. Wu and M. Dredze (2019) Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 833–844.
  • Z. Yang, R. Salakhutdinov, and W. W. Cohen (2017) Transfer learning for sequence tagging with hierarchical recurrent networks. In International Conference on Learning Representations.
  • K. Yu, H. Li, and B. Oguz (2018) Multilingual seq2seq training with similarity loss for cross-lingual document classification. In Proceedings of The Third Workshop on Representation Learning for NLP, pp. 175–179.
  • B. Zoph and Q. V. Le (2016) Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
  • B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.