Despite the impressive performance of neural models on a wide variety of NLP tasks, these models are extremely data hungry – training them requires a large amount of annotated data. As collecting such amounts of data for every language of interest is extremely expensive, cross-lingual transfer that aims to transfer the task knowledge from high-resource (source) languages for which annotated data are more readily available to low-resource (target) languages becomes a promising direction. Cross-lingual transfer approaches using cross-lingual resources such as machine translation (MT) systems (Wan, 2009; Conneau et al., 2018) or bilingual dictionaries (Prettenhofer and Stein, 2010) have effectively reduced the amount of annotated data required to obtain reasonable performance on the target language. However, such cross-lingual resources are often limited for low-resource languages.
Recent advances in cross-lingual contextual embedding models have reduced the need for cross-lingual supervision (Devlin et al., 2019; Lample and Conneau, 2019). Wu and Dredze (2019) show that multilingual BERT (mBERT) (Devlin et al., 2019), a contextual embedding model pre-trained on the concatenated Wikipedia data from 104 languages without cross-lingual alignment, does surprisingly well on zero-shot cross-lingual transfer tasks, where the model is fine-tuned on annotated data from the source languages and evaluated on the target language. They further propose freezing the bottom layers of mBERT during fine-tuning, which improves cross-lingual performance over the simple fine-tune-all-parameters strategy, as different layers of mBERT capture different linguistic information (Jawahar et al., 2019).
Selecting which layers to freeze for a downstream task is a non-trivial problem. In this paper, we propose a novel meta-learning algorithm for soft layer selection. Our meta-learning algorithm learns layer-wise update rates by simulating the zero-shot transfer scenario: at each round, we randomly split the source languages into a held-out language and the remaining training languages, fine-tune the model on the training languages, and update the meta-parameters based on the model performance on the held-out language. We build the meta-optimizer on top of a standard optimizer and learnable update rates, so that it generalizes well to large numbers of updates. Our method uses far fewer meta-parameters than the X-MAML approach (Nooralahzadeh et al., 2020), which adapts model-agnostic meta-learning (MAML) (Finn et al., 2017) to zero-shot cross-lingual transfer.
Experiments on zero-shot cross-lingual natural language inference show that our approach outperforms both the simple fine-tuning baseline and the X-MAML algorithm, and that it brings larger gains when transferring from multiple source languages. An ablation study shows that both the layer-wise update rates and cross-lingual meta-training are key to the success of our approach.
2 Meta-Learning for Zero-Shot Cross-lingual Transfer
The idea of transfer learning is to improve the performance on the target task by learning from a set of related source tasks. In the context of cross-lingual transfer, we treat different languages as separate tasks, and our goal is to transfer the task knowledge from the source languages to the target language. In contrast to the standard transfer learning setting, where the inputs of the source and target tasks are from the same language, in cross-lingual transfer learning we need to handle inputs from different languages with different vocabularies and syntactic structures. To handle this issue, we use the pre-trained multilingual BERT (Devlin et al., 2019), a language model encoder trained on the concatenation of monolingual corpora from 104 languages.
The most widely used approach to zero-shot cross-lingual transfer using multilingual BERT is to fine-tune the BERT model on the source language tasks with training objective
and then evaluate the fine-tuned model on the target language task. The gap between training and testing can lead to sub-optimal performance on the target language.
To address the issue, we propose to train a meta-optimizer for fine-tuning so that the fine-tuned model generalizes better to unseen languages. We train the meta-optimizer by
where the held-out task is a “surprise” language randomly selected from the source language tasks.
Our meta-optimizer consists of a standard optimizer as the base optimizer and a set of meta-parameters to control the layer-wise update rates. An update step is formulated as:
where the learner model's parameters are indexed by time step, and the update vector is produced by the base optimizer given the gradients at the current and previous steps. The update function is defined by the optimization algorithm and its hyper-parameters. For example, plain gradient descent scales the current gradient by the learning rate. A standard optimization algorithm will update the model parameters by:
Our meta-optimizer differs in that it performs a gated update using parametric update rates, which are computed by applying a sigmoid function to the meta-parameters of the meta-optimizer. The sigmoid function ensures that the update rates lie within the range (0, 1). Unlike Andrychowicz et al. (2016), in which the optimizer parameters are shared across all coordinates of the model, our meta-optimizer learns different update rates for different model layers. This is motivated by the finding that different layers of the BERT encoder capture different linguistic information, with syntactic features in middle layers and semantic information in higher layers (Jawahar et al., 2019). Different layers may therefore generalize differently across languages.
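The gated update described above can be illustrated with a minimal NumPy sketch. The names `gated_update`, `params`, `updates`, and `phi` are hypothetical, and the sign convention (the update vector is subtracted from the parameters) is an assumption, since the original equations are not reproduced here. The sketch only shows the idea of one sigmoid-gated rate per layer.

```python
import numpy as np


def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_update(params, updates, phi):
    """Apply one gated step per layer.

    params:  dict mapping layer name -> parameter array
    updates: dict mapping layer name -> update vector from the base optimizer
    phi:     dict mapping layer name -> scalar meta-parameter (assumed layout)
    """
    new_params = {}
    for name, theta in params.items():
        rate = sigmoid(phi[name])  # one update rate per layer, in (0, 1)
        new_params[name] = theta - rate * updates[name]
    return new_params


# Toy example: a very negative meta-parameter nearly freezes a layer,
# a very positive one lets the base optimizer's update through almost unchanged.
params = {"layer_0": np.ones(3), "layer_1": np.ones(3)}
updates = {"layer_0": np.full(3, 0.5), "layer_1": np.full(3, 0.5)}
phi = {"layer_0": -20.0, "layer_1": 20.0}
new_params = gated_update(params, updates, phi)
```

In this toy setting `layer_0` stays close to its initial value (a soft "freeze"), while `layer_1` moves by almost the full update, which is the sense in which the rates act as a soft layer selection.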
Figure 1 illustrates the computational graph for the forward pass when training the meta-optimizer. Note that as the losses and gradients are dependent on the parameters of the meta-optimizer, computing the gradients along the dashed edges would normally require taking second derivatives, which is computationally expensive. Following Andrychowicz et al. (2016), we drop the gradients along the dashed edges and only compute gradients along the solid edges.
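One way to read the first-order approximation above is sketched below for a single layer and a single step. Treating the base optimizer's update vector as a constant (that is, ignoring its dependence on the meta-parameter), the gradient of the held-out loss with respect to the meta-parameter follows from the chain rule through the sigmoid gate. The function name, the quadratic toy loss, and the subtractive sign convention are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def first_order_meta_grad(held_out_grad, update, phi):
    """d(held-out loss)/d(phi) for theta_new = theta - sigmoid(phi) * update,
    with `update` treated as a constant (no second derivatives)."""
    rate = sigmoid(phi)
    # d theta_new / d phi = -sigmoid'(phi) * update, and sigmoid' = rate * (1 - rate)
    return -rate * (1.0 - rate) * float(np.sum(held_out_grad * update))


# Toy check against a finite-difference estimate on a quadratic held-out loss.
theta = np.array([1.0, 2.0])
update = np.array([0.3, -0.1])   # pretend output of the base optimizer
target = np.array([0.5, 2.5])    # minimizer of the toy held-out loss
phi = 0.2


def held_out_loss(phi_value):
    new_theta = theta - sigmoid(phi_value) * update
    return 0.5 * float(np.sum((new_theta - target) ** 2))


new_theta = theta - sigmoid(phi) * update
held_out_grad = new_theta - target  # gradient of the toy loss w.r.t. new_theta
analytic = first_order_meta_grad(held_out_grad, update, phi)
eps = 1e-6
numeric = (held_out_loss(phi + eps) - held_out_loss(phi - eps)) / (2 * eps)
```

Because the update vector is held fixed in this toy problem, the analytic and numerical values agree; in real training the update depends on earlier gradients, and that dependence is exactly what the approximation drops.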
A good meta-optimizer will, given the training data in the source languages and the training objective, suggest an update rule for the learner model so that it performs well on the target language. Thus, we would like the training condition to match that of the test time. However, in zero-shot transfer we assume no access to the target language data, so we need to simulate the test scenario using only the training data on the source languages.
As shown in Algorithm 1, at each episode in the outer loop, we randomly choose a test language to construct the test data and use the remaining data as the training data. Then, we re-initialize the parameters of the learner model and start the training simulation. At each training step, we first use the base optimizer to compute the update vector based on the current and past gradients. We then perform the gated update using the meta-optimizer with Eq. 1. The resulting model can be viewed as the output of a forward pass of the meta-optimizer. After a fixed number of model updates, we compute the gradient of the loss on the test data with respect to the old meta-parameters and update the meta-parameters. Our meta-learning algorithm differs from X-MAML (Nooralahzadeh et al., 2020) in that 1) X-MAML is designed mainly for few-shot transfer while our algorithm is designed for zero-shot transfer, and 2) our algorithm uses far fewer meta-parameters than X-MAML, as it only requires training the update rate for each layer, whereas X-MAML meta-learns the initial parameters of the entire model.
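The episode structure described above can be sketched on a toy problem. The code below is an illustrative simplification, not a reproduction of Algorithm 1: the learner is a linear regressor whose weight vector is split into two hypothetical "layers", the base optimizer is plain SGD rather than Adam, the meta-update is a plain gradient step, and the meta-gradient uses the first-order approximation with update vectors treated as constants. All names (`meta_train`, `layers`, `k`, `meta_lr`) are assumptions.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mse_grad(w, X, y):
    """Gradient of the mean squared error of a linear model."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def meta_train(data, layers, init_w, episodes=20, inner_steps=10, k=5,
               lr=0.05, meta_lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    langs = sorted(data)
    phi = {name: 0.0 for name in layers}  # one meta-parameter per layer
    for _ in range(episodes):
        # Simulate zero-shot transfer: hold out one source language.
        held_out = langs[rng.integers(len(langs))]
        train_langs = [lang for lang in langs if lang != held_out]
        w = init_w.copy()  # re-initialize the learner
        acc = {name: np.zeros_like(w[idx]) for name, idx in layers.items()}
        for step in range(1, inner_steps + 1):
            lang = train_langs[rng.integers(len(train_langs))]
            X, y = data[lang]
            delta = lr * mse_grad(w, X, y)  # base optimizer update (SGD here)
            for name, idx in layers.items():
                w[idx] -= sigmoid(phi[name]) * delta[idx]  # gated update
                acc[name] += delta[idx]  # updates since the last meta-step
            if step % k == 0:
                X_t, y_t = data[held_out]
                g_test = mse_grad(w, X_t, y_t)
                for name, idx in layers.items():
                    r = sigmoid(phi[name])
                    # First-order meta-gradient; updates treated as constants.
                    meta_grad = -r * (1.0 - r) * float(g_test[idx] @ acc[name])
                    phi[name] -= meta_lr * meta_grad
                    acc[name][:] = 0.0
    return {name: sigmoid(p) for name, p in phi.items()}


# Synthetic "languages": the first two weights are shared, the last two differ.
data_rng = np.random.default_rng(1)
shared = np.array([1.0, -1.0])
data = {}
for lang in ["en", "el", "ur"]:
    X = data_rng.normal(size=(64, 4))
    w_true = np.concatenate([shared, data_rng.normal(size=2)])
    data[lang] = (X, X @ w_true)

layers = {"shared": slice(0, 2), "specific": slice(2, 4)}
rates = meta_train(data, layers, init_w=np.zeros(4))
```

The returned `rates` dictionary holds one learned update rate per layer. On real models the same loop structure would wrap mBERT fine-tuning, with each transformer layer playing the role of one entry in `layers`.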
| Devlin et al. (2019) | – | 74.30 | 70.50 | 62.10 | 58.35 | – | – | – | – | – | 63.80 | – | – | – | – |
| Wu and Dredze (2019) | 74.60 | 74.90 | 72.00 | 66.10 | 58.60 | 69.80 | 49.40 | 55.70 | 62.00 | 71.90 | 70.40 | 69.80 | 67.90 | 61.20 | 66.02 |
| Nooralahzadeh et al. (2020) | 74.42 | 75.07 | 71.83 | 66.05 | 61.51 | 69.45 | 49.76 | 55.39 | 61.20 | 71.82 | 71.11 | 70.19 | 67.95 | 62.20 | 66.28 |
| Aux. language: el + ur |
We evaluate our meta-learning approach on natural language inference (NLI). NLI can be cast as a sequence-pair classification problem where, given a premise and a hypothesis sentence, the model needs to predict whether the premise entails the hypothesis, contradicts it, or neither (neutral). We use the Multi-Genre Natural Language Inference Corpus (Williams et al., 2018), which consists of 433k English sentence pairs labeled with textual entailment information, and the XNLI dataset (Conneau et al., 2018), which has 2.5k development and 5k test sentence pairs in 15 languages including English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw), and Urdu (ur). We use this dataset to evaluate the effectiveness of our meta-learning algorithm when transferring from English and one or more low-resource auxiliary languages to the target language.
| No layer-wise update | 73.45 | 73.90 | 70.73 | 65.19 | 60.31 | 69.10 | 50.87 | 46.47 | 62.74 | 70.42 | 70.24 | 68.85 | 68.17 | 53.50 | 64.57 |
| No cross-lingual meta-train | 73.66 | 74.84 | 71.54 | 66.15 | 61.16 | 69.33 | 50.89 | 48.43 | 63.16 | 71.57 | 70.53 | 69.14 | 67.93 | 55.07 | 65.24 |
3.1 Model and Training Configurations
We tokenize the input sentences using WordPiece, concatenate them, feed the sequence to BERT, and use the hidden representation of the first token ([CLS]) for classification. The final output is computed by applying a linear projection and a softmax layer to the hidden representation. We apply dropout on the final encoder layer and fix the embedding layer during fine-tuning. Following Nooralahzadeh et al. (2020), we fine-tune mBERT in two steps: 1) we fine-tune mBERT on the English data for one epoch to get initial model parameters, and 2) we continue fine-tuning the model on the other source languages for two epochs. We compare using the standard optimizer (fine-tuning baseline) and our meta-optimizer for Step 2. We use the Adam optimizer (Kingma and Ba, 2015) as both the standard optimizer and the base optimizer in our meta-optimizer. To train our meta-optimizer, we also use Adam (Algorithm 1).
Unlike Nooralahzadeh et al. (2020), who select for each target language the auxiliary languages that lead to the best transfer results, we simulate a more realistic scenario where only a limited set of auxiliary languages is available. We choose two distant auxiliary languages, Greek (Hellenic branch of the Indo-European language family) and Urdu (Indo-Aryan branch of the Indo-European language family), and evaluate the transfer performance on the other languages.
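The classification head described in this section (a linear projection followed by a softmax over the first token's hidden representation) can be sketched as follows. This is a stand-alone NumPy illustration with random weights, not the training code: the hidden size of 768 matches BERT-base, the three classes correspond to entailment, contradiction, and neutral, and the names `classify`, `W`, and `b` are hypothetical.

```python
import numpy as np


def classify(first_token_hidden, W, b):
    """Linear projection + softmax over NLI labels.

    first_token_hidden: array of shape (batch, hidden_size)
    W: array of shape (hidden_size, num_labels)
    b: array of shape (num_labels,)
    """
    logits = first_token_hidden @ W + b
    # Subtract the row-wise maximum for numerical stability.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
hidden_size, num_labels = 768, 3  # BERT-base width; 3 NLI labels
hidden = rng.normal(size=(2, hidden_size))  # stand-in for encoder outputs
W = rng.normal(scale=0.02, size=(hidden_size, num_labels))
b = np.zeros(num_labels)

probs = classify(hidden, W, b)
predicted = probs.argmax(axis=-1)  # index of the most probable label per pair
```

In an actual system the `hidden` array would come from the encoder's output at the first position of the concatenated premise and hypothesis sequence.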
3.2 Main Results
As shown in Table 1, we compare our meta-learning approach with the fine-tuning baseline and the zero-shot transfer results reported in prior work that uses mBERT. Our approach outperforms the fine-tuning methods in Devlin et al. (2019) by 1.6–8.5%. Compared with the best fine-tuning method in Wu and Dredze (2019), which freezes a selected subset of mBERT layers during fine-tuning, our approach achieves +0.4% higher accuracy on average. We also compare our approach with a strong fine-tuning baseline whose accuracy is competitive with the best X-MAML results (Nooralahzadeh et al., 2020) using a single auxiliary language, even though we limit our choice of the auxiliary language to Greek and Urdu, while Nooralahzadeh et al. (2020) select the best auxiliary language among all languages except the target one. Overall, our approach outperforms the strong fine-tuning baseline on 10 out of 14 languages and by +0.2% accuracy on average.
Our approach brings larger gains when using two auxiliary languages: it outperforms the fine-tuning baseline on all languages and improves the average accuracy by +0.6%. This suggests that our meta-learning approach is more effective when transferring from multiple source languages.
Footnote 1: Using two auxiliary languages improves over one auxiliary language the most on lower-resource languages in mBERT pre-training (such as Turkish and Hindi), but does not bring gains or even hurts on high-resource languages (such as French and German). This is consistent with the findings in prior work that the choice of the auxiliary languages is crucial in cross-lingual transfer (Lin et al., 2019). We leave further investigation on its impact on our meta-learning approach for future work.
3.3 Ablation Study
Our approach differs from Andrychowicz et al. (2016) in that 1) it adopts layer-wise update rates, whereas the meta-parameters are shared across all model parameters in Andrychowicz et al. (2016), and 2) it trains the meta-parameters in a cross-lingual setting, whereas Andrychowicz et al. (2016) is designed for few-shot learning. We conduct ablation experiments on XNLI using Greek and Urdu as the auxiliary languages to understand how these two design choices contribute to the model performance.
Impact of Layer-Wise Update Rate
We compare our approach with its variant that replaces the layer-wise update rate with one update rate for all layers. Table 2 shows that our approach significantly outperforms this variant on all target languages with an average margin of 2.0%. This suggests that layer-wise update rate contributes greatly to the effectiveness of our approach.
Impact of Cross-Lingual Meta-Training
We measure the impact of cross-lingual meta-training by replacing the cross-lingual meta-training in our approach with a joint training of the layer-wise update rate and model parameters. As shown in Table 2, ablating the cross-lingual meta-training degrades accuracy significantly on all target languages by 1.4% on average, which shows that our cross-lingual meta-training strategy is beneficial.
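As a small arithmetic check on the ablation rows, the sketch below recomputes the final (average) column from the 14 per-language accuracies listed for each variant. The numbers are copied from the two ablation rows; the variable and function names are hypothetical. The stated margins of 2.0 and 1.4 points then both imply an average of roughly 66.6 for the full approach, which is a derived estimate rather than a figure quoted from the table.

```python
def mean_accuracy(scores):
    """Unweighted mean over target languages."""
    return sum(scores) / len(scores)


# Per-language accuracies from the two ablation rows (last column excluded).
no_layerwise = [73.45, 73.90, 70.73, 65.19, 60.31, 69.10, 50.87,
                46.47, 62.74, 70.42, 70.24, 68.85, 68.17, 53.50]
no_meta_train = [73.66, 74.84, 71.54, 66.15, 61.16, 69.33, 50.89,
                 48.43, 63.16, 71.57, 70.53, 69.14, 67.93, 55.07]

avg_no_layerwise = round(mean_accuracy(no_layerwise), 2)   # 64.57
avg_no_meta_train = round(mean_accuracy(no_meta_train), 2)  # 65.24

# Full-approach average implied by each reported margin (derived, approximate).
implied_from_layerwise = avg_no_layerwise + 2.0
implied_from_meta_train = avg_no_meta_train + 1.4
```

Both recomputed averages match the last column of the corresponding rows, and the two implied full-approach averages agree to within rounding of the reported margins.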
4 Related Work
4.1 Cross-lingual Transfer Learning
The idea of cross-lingual transfer is to use the annotated data in the source languages to improve the task performance on the target language with minimal or even zero target labeled data (aka zero-shot). There is a large body of work on using external cross-lingual resources such as bilingual word dictionaries (Prettenhofer and Stein, 2010; Schuster et al., 2019b; Liu et al., 2020), MT systems (Wan, 2009), or parallel corpora (Eriguchi et al., 2018; Yu et al., 2018; Singla et al., 2018; Conneau et al., 2018) to bridge the gap between the source and target languages. Recent advances in unsupervised cross-lingual representations have paved the road for transfer learning without cross-lingual resources (Yang et al., 2017; Chen et al., 2018; Schuster et al., 2019a). Our work builds on Mulcaire et al. (2019); Lample and Conneau (2019); Pires et al. (2019), who show that language models trained on monolingual text from multiple languages provide powerful multilingual representations that generalize across languages. Recent work has shown that more advanced techniques such as freezing the model’s bottom layers (Wu and Dredze, 2019) or continual learning (Liu et al., 2020) can further boost the cross-lingual performance on downstream tasks. In this paper, we explore meta-learning to softly select the layers to freeze during fine-tuning.
4.2 Meta Learning
A typical meta-learning algorithm consists of two loops of training: 1) an inner loop where the learner model is trained, and 2) an outer loop where, given a meta-objective, we optimize a set of meta-parameters which controls aspects of the learning process in the inner loop. The goal is to find the optimal meta-parameters such that the inner loop performs well on the meta-objective. Existing meta-learning approaches differ in the choice of meta-parameters to be optimized and the meta-objective. Depending on the choice of meta-parameters, existing work can be divided into four categories: (a) neural architecture search (Stanley and Miikkulainen, 2002; Zoph and Le, 2016; Baker et al., 2016; Real et al., 2017; Zoph et al., 2018); (b) metric-based (Koch et al., 2015; Vinyals et al., 2016); (c) model-agnostic (MAML) (Finn et al., 2017; Ravi and Larochelle, 2016); (d) model-based (learning update rules) (Schmidhuber, 1987; Hochreiter et al., 2001; Maclaurin et al., 2015; Li and Malik, 2017).
In this paper, we focus on model-based meta-learning for zero-shot cross-lingual transfer. Early work introduces a type of network that can update its own weights (Schmidhuber, 1987, 1992, 1993). More recently, Andrychowicz et al. (2016) propose to model gradient-based update rules using an RNN and optimize it with gradient descent. However, as Wichrowska et al. (2017) point out, the RNN-based meta-optimizers fail to make progress when run for large numbers of steps. They address the issue by incorporating features motivated by the standard optimizers into the meta-optimizer. We instead base our meta-optimizer on a standard optimizer like Adam so that it generalizes better to large-scale training.
Meta-learning has been previously applied to few-shot cross-lingual named entity recognition (Wu et al., 2019), low-resource machine translation (Gu et al., 2018), and improving cross-domain generalization for semantic parsing (Wang et al., 2021). For zero-shot cross-lingual transfer, Nooralahzadeh et al. (2020) introduce an optimization-based meta-learning algorithm called X-MAML, which meta-learns the initial model parameters on supervised data from low-resource languages. By contrast, our meta-learning algorithm requires far fewer meta-parameters and is thus simpler than X-MAML. Bansal et al. (2020) show that MAML combined with meta-learning for learning rates improves few-shot learning. Unlike their approach, which learns layer-wise learning rates only for task-specific layers specified as a hyper-parameter within the MAML algorithm, our approach learns layer-wise learning rates for all layers, and we show that it is effective on zero-shot cross-lingual transfer without being combined with MAML.
We propose a novel meta-optimizer that learns to soft-select which layers to freeze when fine-tuning a pretrained language model (mBERT) for zero-shot cross-lingual transfer. Our meta-optimizer learns the update rate for each layer by simulating the zero-shot transfer scenario where the model fine-tuned on the source languages is tested on an unseen language. Experiments show that our approach outperforms the simple fine-tuning baseline and the X-MAML algorithm on cross-lingual natural language inference.
- Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, pp. 3981–3989.
- arXiv preprint arXiv:1611.02167.
- Learning to few-shot learn across diverse natural language classification tasks. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 5108–5123.
- Adversarial deep averaging networks for cross-lingual sentiment classification. Transactions of the Association for Computational Linguistics 6, pp. 557–570.
- XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2475–2485.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
- Zero-shot cross-lingual classification using multilingual neural machine translation. CoRR abs/1809.04686.
- Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1126–1135.
- Meta-learning for low-resource neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3622–3631.
- Journal of Machine Learning Research 21 (23), pp. 1–7.
- Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, pp. 87–94.
- What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3651–3657.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, Vol. 2.
- Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems (NeurIPS).
- Learning to optimize. In International Conference on Learning Representations.
- Choosing transfer languages for cross-lingual learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3125–3135.
- Attention-informed mixed-language training for zero-shot cross-lingual task-oriented dialogue systems. Proceedings of the AAAI Conference on Artificial Intelligence 34 (05), pp. 8433–8440.
- Exploring fine-tuning techniques for pre-trained cross-lingual models via continual learning.
- Gradient-based hyperparameter optimization through reversible learning. In International Conference on Machine Learning, pp. 2113–2122.
- Polyglot contextual representations improve crosslingual transfer. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3912–3918.
- Zero-shot cross-lingual transfer with meta learning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 4547–4562.
- How multilingual is multilingual BERT? CoRR abs/1906.01502.
- Cross-language text classification using structural correspondence learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 1118–1127.
- Optimization as a model for few-shot learning. In International Conference on Learning Representations.
- Large-scale evolution of image classifiers. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2902–2911.
- Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-… hook. Ph.D. Thesis, Technische Universität München.
- Learning to control fast-weight memories: an alternative to dynamic recurrent networks. Neural Computation 4 (1), pp. 131–139.
- A neural network that embeds its own meta-levels. In IEEE International Conference on Neural Networks, pp. 407–412.
- Cross-lingual transfer learning for multilingual task oriented dialog. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 3795–3805.
- Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 1599–1613.
- A multi-task approach to learning multilingual representations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 214–220.
- Evolving neural networks through augmenting topologies. Evolutionary Computation 10 (2), pp. 99–127.
- Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638.
- Co-training for cross-lingual sentiment classification. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pp. 235–243.
- Meta-learning for domain generalization in semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 366–379.
- Learned optimizers that scale and generalize. In International Conference on Machine Learning, pp. 3751–3760.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
- Enhanced meta-learning for cross-lingual named entity recognition with minimal resources. arXiv preprint arXiv:1911.06161.
- Beto, bentz, becas: the surprising cross-lingual effectiveness of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 833–844.
- Transfer learning for sequence tagging with hierarchical recurrent networks. In International Conference on Learning Representations.
- Multilingual seq2seq training with similarity loss for cross-lingual document classification. In Proceedings of The Third Workshop on Representation Learning for NLP, pp. 175–179.
- Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.
- Learning transferable architectures for scalable image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710.