The rise of massively pre-trained multilingual language models (LMs)111We loosely use the term LM to describe unsupervised pre-trained models, including masked LMs and causal LMs XLM; XLMR; chi2020infoxlm; luo2020veco; xue2020mt5 has significantly improved cross-lingual generalization across many languages xBERT; pires-etal-2019-multilingual; k2020crosslingual; Keung_2019. Recent work on zero-shot cross-lingual adaptation fang2020filter; pfeiffer2020madx; bari2020multimix, in the absence of labelled target data, has also demonstrated impressive performance gains. Despite these successes, however, a sizeable gap remains between supervised and zero-shot performance. On the other hand, when limited target language data are available (i.e., the few-shot setting), traditional fine-tuning of large pre-trained models can cause over-fitting overfitting.
One way to deal with the scarcity of annotated data is to generate synthetic data using techniques like paraphrasing gao2020paraphrase, machine translation BT, and/or data diversification bari2020multimix. Few-shot learning, on the other hand, deals with out-of-distribution (OOD) generalization using only a small amount of data Koch2015SiameseNN; vinyals2017matching; snell2017prototypical; santoro2017simple; finn2017modelagnostic. In this setup, the model is evaluated over few-shot tasks, such that it learns to generalize to new data (the query set) using only a handful of labeled samples (the support set).
In a cross-lingual few-shot setup, the model learns cross-lingual features to generalize to new languages. Recently, nooralahzadeh2020zeroshot used Meta-Learning MAML for few-shot adaptation on several cross-lingual tasks. Their few-shot setup used the full development datasets of various target languages (the XNLI development set, for instance, has over 2K samples). In general, they showed the effectiveness of cross-lingual meta-training in the presence of a large quantity of OOD data; however, they did not provide any fine-tuning baseline. Conversely, lauscher2020zero explored few-shot learning but did not go beyond fine-tuning. To the best of our knowledge, there has been no prior work in cross-lingual NLP that uses only a handful of target samples and yet surpasses or matches traditional fine-tuning (on the same number of samples).
Traditional fine-tuning (parametric) approaches require careful hyper-parameter tuning of the learning rate, scheduler, optimizer, batch size, and up-sampling of few-shot support samples; failing to do so often leads to over-fitting. It is also expensive to update the parameters of a large model each time a fresh batch of support samples arrives, and as models grow bigger, frequent weight updates for few-shot adaptation become nearly unscalable. Computing gradient updates for a small number of samples before every inference pass takes a significant amount of time.
In this work, we explore a simple Nearest Neighbor Few-shot Inference (NNFS) approach for cross-lingual classification tasks. Our main objective is to utilize very few samples to perform adaptation on a given target language. To achieve this, we first fine-tune a multilingual LM on a high-resource source language (i.e., English), and then apply few-shot inference using a few support examples from the target language. Unlike other popular meta-learning approaches that focus on improving the fine-tuning/training setup to achieve better generalization finn2017modelagnostic; Ravi2017, our approach applies to the inference phase. Hence, we do not update the weights of the LM using target language samples. This makes our approach complementary to other regularized fine-tuning based few-shot meta-learning approaches. Our key contributions are as follows:
We propose a simple method for cross-lingual few-shot adaptation on classification tasks during inference. Since our approach applies to inference, it does not require updating the LM weights using target language data.
Using only a few labeled target support samples, we test our approach across 16 distinct languages belonging to two NLP tasks (XNLI and PAWS-X), and achieve consistent sizable improvements over traditional fine-tuning.
We also demonstrate that our proposed method can generalize well not only across languages but also across tasks.
The objective of few-shot learning is to adapt from a source distribution to a new target distribution using only a few samples. The traditional few-shot setup finn2017modelagnostic; proto_net; vinyals2017matching involves adapting a model to the distribution of new classes. Similarly, in a cross-lingual setup, we adapt a pre-trained LM that has been fine-tuned on a high-resource language to a new target language distribution lauscher2020zero; nooralahzadeh2020zeroshot.
We begin by fine-tuning a pre-trained model XLMR on a specific task using a high-resource (source) language dataset, obtaining an adapted task model. We then use this model to perform few-shot adaptation.
In our few-shot setup, we assume access to very few labeled support samples from the target language distribution. A support set covers $C$ classes, where each class carries $N$ samples. This is a standard $C$-way-$N$-shot few-shot learning setup. The objective of our proposed method is to classify the unlabeled query samples. We denote the latent representations of the support and query samples as $\mathbf{z}_s$ and $\mathbf{z}_q$, respectively.
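As an illustration, the C-way-N-shot episode construction described above can be sketched as follows (a minimal sketch with toy placeholder data; function and variable names are our own, not from the paper):

```python
import random
from collections import defaultdict

def sample_episode(examples, c_way, n_shot, n_query):
    """Sample one C-way-N-shot episode from labeled (text, label) pairs.

    Returns a support set of n_shot samples per class and a query set of
    n_query samples per class, with no overlap between the two.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    classes = random.sample(sorted(by_label), c_way)
    support, query = [], []
    for label in classes:
        pool = random.sample(by_label[label], n_shot + n_query)
        support += [(x, label) for x in pool[:n_shot]]
        query += [(x, label) for x in pool[n_shot:]]
    return support, query

# Toy usage: a 2-way-5-shot episode with 3 queries per class.
data = [(f"sent{i}", i % 2) for i in range(40)]
support, query = sample_episode(data, c_way=2, n_shot=5, n_query=3)
assert len(support) == 2 * 5 and len(query) == 2 * 3
```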
2.2 Nearest Neighbor Classification
Let $S$ and $Q$ be the total number of support and query samples. For each query sample $x_q$, a feature representation $\mathbf{z}_q = f_{\theta}(x_q)$ is obtained by forward propagation through the model. For each query representation $\mathbf{z}_q$, we define a latent binary assignment vector $\mathbf{y}_q = [y_{q1}, \dots, y_{qC}]$, where $y_{qc} \in \{0, 1\}$ is a binary variable such that $\sum_{c=1}^{C} y_{qc} = 1$. Let $\mathbf{Y} \in \{0,1\}^{Q \times C}$ denote the matrix where each row is the assignment vector $\mathbf{y}_q$ of a query.

We compute the centroid $\mathbf{m}_c$ of each class $c$ by taking the mean of its support representations, $\mathbf{m}_c = \frac{1}{N} \sum_{x_s \in S_c} \mathbf{z}_s$. Next, we compute the distance between each $\mathbf{z}_q$ and $\mathbf{m}_c$ (Equation 2):

$$d(\mathbf{z}_q, \mathbf{m}_c) = \lVert \mathbf{z}_q - \mathbf{m}_c \rVert_2^2 \qquad (2)$$

Our loss function becomes

$$\mathcal{L}(\mathbf{Y}) = \sum_{q=1}^{Q} \sum_{c=1}^{C} y_{qc} \, d(\mathbf{z}_q, \mathbf{m}_c).$$

Finally, we assign each $\mathbf{z}_q$ the label of the class to which it has the minimum distance. This is done using the following function:

$$\hat{c}_q = \operatorname*{arg\,min}_{c} \, d(\mathbf{z}_q, \mathbf{m}_c) \qquad (3)$$
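The nearest-centroid classification step can be sketched in NumPy as follows (a minimal sketch; the function name and toy data are our own):

```python
import numpy as np

def nearest_centroid_predict(z_support, y_support, z_query):
    """Assign each query the label of its nearest class centroid.

    z_support: (S, d) support representations, y_support: (S,) labels,
    z_query: (Q, d) query representations.  Distances are squared
    Euclidean, matching Equation 2.
    """
    classes = np.unique(y_support)
    # Centroid m_c: mean of the support representations of class c.
    centroids = np.stack([z_support[y_support == c].mean(axis=0)
                          for c in classes])
    # Pairwise squared distances d(z_q, m_c), shape (Q, C).
    dists = ((z_query[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    # Equation 3: label of the minimum-distance centroid.
    return classes[dists.argmin(axis=1)]

# Two well-separated toy clusters: each query gets its cluster's label.
zs = np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]])
ys = np.array([0, 0, 1, 1])
zq = np.array([[0., 0.5], [5., 5.5]])
pred = nearest_centroid_predict(zs, ys, zq)  # -> array([0, 1])
```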
Traditional inductive inference handles each query sample one at a time, independently of the other query samples. On the contrary, our proposed approach includes additional Normalization and Transduction steps. Algorithm 1 illustrates our approach. Here we discuss these additional steps in more detail.
Norm. We measure the cross-lingual shift as the difference between the mean representations of the support set (target language) and the training set (en). We then perform cross-lingual shift correction on the query set. To achieve this, we first extract the latent representations of both support and query samples from the fine-tuned model. We then center the representations (Alg. 1, #3) by subtracting the mean representation of the train/dev data of the source language, followed by L2 normalization of both representations. Algorithm 1 (#2-7) further details this step.
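The centering and L2 normalization described in this step can be sketched as follows (a minimal sketch; the function name and random placeholder data are our own, and in practice `source_mean` would come from English train/dev representations):

```python
import numpy as np

def center_and_normalize(z, source_mean):
    """Cross-lingual shift correction: subtract the mean representation
    of the source-language train/dev data, then L2-normalize each row."""
    z = z - source_mean
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Placeholder representations standing in for extracted LM features.
rng = np.random.default_rng(0)
z_support = rng.normal(size=(6, 4))
source_mean = rng.normal(size=(4,))
z_hat = center_and_normalize(z_support, source_mean)
# Every corrected representation now lies on the unit sphere.
assert np.allclose(np.linalg.norm(z_hat, axis=1), 1.0)
```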
Transduction. We apply prototypical rectification (proto-rect) proto-rect on the features extracted from the LM. In the rectification step, we first obtain the mean representation of each support class (in Alg. 1) by taking a weighted combination of the support and query representations. Finally, we calculate predictions on the query set using Equation 3. We also present our proposed NNFS inference in Figure 2 in the Appendix.
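One common form of prototype rectification mixes each class centroid with the mean of the queries currently assigned to it; the sketch below follows that pattern under our own assumptions (the mixing weight `eps` and function name are hypothetical, and the paper's exact weighting may differ):

```python
import numpy as np

def rectify_prototypes(centroids, z_query, eps=0.5):
    """Prototype rectification sketch: update each centroid as a weighted
    combination of the original support centroid and the mean of the
    query representations currently assigned to it (eps is assumed)."""
    # Assign each query to its nearest centroid (squared Euclidean).
    dists = ((z_query[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    new_centroids = centroids.copy()
    for c in range(len(centroids)):
        assigned = z_query[assign == c]
        if len(assigned):
            new_centroids[c] = (eps * centroids[c]
                                + (1.0 - eps) * assigned.mean(axis=0))
    return new_centroids

# Toy usage: each rectified prototype moves toward its assigned query.
protos = np.array([[0., 0.], [4., 4.]])
queries = np.array([[1., 1.], [5., 5.]])
rectified = rectify_prototypes(protos, queries, eps=0.5)
```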
3 Experimental Settings
We use two standard multilingual datasets - XNLI multiNLI (15 languages) and PAWS-X zhang-etal-2019-paws (7 languages) - to evaluate our proposed method. Additional details on the languages and the complexity of the tasks can be found in the Appendix. For few-shot inference, we use samples from the target language development data to construct the support sets and the test data to construct the query sets.
[Table: Fine-tuned XLM-R-large with the XNLI dataset]
Few-shot XNLI accuracy results across 14 languages, with average improvements for each method. All confidence intervals in our experiments are below 0.07. "fs-3.5" means 3-way-5-shot learning.
[Table: Fine-tuned XLM-R-large with the PAWS-X dataset]
[Table: Fine-tuned XLM-R-large with the XNLI dataset]
We use XLM-R-large XLMR as our pre-trained language model and perform standard fine-tuning on labeled English data to adapt it to the task. We tune the hyper-parameters using the English development data and report results using the best-performing model (the optimal hyper-parameters are listed in the appendix). We train our model with 5 different seeds and report average results across them. We use the same optimal hyper-parameters to fine-tune on the target languages. As baselines, we add two additional fine-tuning variants, named head and full. In full fine-tuning, all the parameters of the model are updated, which is rarely practical in few-shot scenarios. In head fine-tuning, only the parameters of the last linear layer are updated.
3.3 Evaluation Setup
nooralahzadeh2020zeroshot and lauscher2020zero used 10 and 5 different seeds, respectively, to measure few-shot performance. As few-shot learning involves randomly selecting small support sets, results may vary greatly from one experiment to the next and hence may not be reliable le2020continual. In computer vision, episodic testing Ravi2017OptimizationAA; DBLP:journals/corr/abs-1905-11116; ziko2020laplacian is often used to evaluate few-shot experiments. Each episode is composed of small, randomly selected support and query sets. The model's performance on each episode is noted, and the average score, alongside the 95% confidence interval across all episodes, is reported. To the best of our knowledge, episodic testing has not previously been leveraged for cross-lingual few-shot learning in NLP.
We evaluate our approach using 300 episodes per seed model, totalling 1,500 test episodes, and report their average scores. For each episode, we perform C-way-N-shot inference. For the 2-way-5-shot setting, for instance, we randomly select 15 query samples and 5 support samples per class. For XNLI and PAWS-X, we use C=3 and C=2, respectively. Our episodic testing approach is detailed further in the Episodic Algorithm of the Appendix.
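The episodic evaluation loop can be sketched as follows (a minimal sketch; in practice `run_episode` would sample a support/query episode and return NNFS accuracy, and the simulated accuracies below are placeholders):

```python
import math
import random
import statistics

def episodic_eval(run_episode, n_episodes=300):
    """Run n_episodes few-shot episodes and report the mean accuracy
    with a 95% confidence interval (1.96 * standard error)."""
    accs = [run_episode() for _ in range(n_episodes)]
    mean = statistics.mean(accs)
    ci = 1.96 * statistics.stdev(accs) / math.sqrt(len(accs))
    return mean, ci

# Placeholder episode: a random accuracy standing in for one NNFS run.
random.seed(0)
mean, ci = episodic_eval(lambda: random.uniform(0.7, 0.9), n_episodes=300)
```

Averaging over many small episodes, rather than a handful of full-dev-set runs, is what makes the reported score and its confidence interval stable.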
3.4 Results and Analysis
After training the model on the source language samples (i.e., labeled English data), we perform additional fine-tuning using C-way-5-shot target language samples. Finally, we perform our proposed NNFS inference.
The fine-tuning baseline using limited target language samples results in small, non-significant improvements over the zero-shot baseline. The NNFS inference approach, however, yields performance gains using only 15 (3-way-5-shot) and 10 (2-way-5-shot) support examples for the XNLI and PAWS-X tasks, respectively. Compared to the few-shot baseline, we obtain an average improvement of 0.6 on XNLI (Table 4) and 1.0 on PAWS-X (Table 5). We first experimented with 3-shot support samples but did not observe any few-shot capability in the model. We also experimented with a 10-shot setup and found similar improvements of NNFS over the fine-tuning baseline (results are in the Appendix). Interestingly, in both cases we observed higher performance gains on low-resource languages.
To further evaluate the effectiveness of our model, we tested it in a cross-task setting. We first trained the model on XNLI (EN data) and then used NNFS inference on PAWS-X. Table 3 demonstrates an impressive average performance gain of +7.3 across all PAWS-X languages, over the fine-tuning baseline.
In addition, NNFS inference is fast: its computational cost is only marginally higher than that of zero-shot inference, and far lower than the cost of fine-tuning. Table 6 in the appendix shows the inference time details on both tasks.
This paper proposes a nearest-neighbour-based few-shot adaptation algorithm, accompanied by a necessary evaluation protocol. Our approach does not require updating the LM weights and thus avoids over-fitting to limited samples. We experiment with two classification tasks, and the results demonstrate consistent improvements over fine-tuning, not only across languages but also across tasks.
Appendix A Appendices
a.1 Design Choices for Episodic Testing
In the traditional testing framework, we sample a batch from the dataset, calculate the batch's predictions, and finally accumulate all predictions to compute the evaluation metric. Few-shot experiments, however, are quite unpredictable for the following two reasons:
Support set: The per-class sampling strategy for the support set is random. In few-shot experiments, we perform inference on the test dataset utilizing the support samples, and predictions may vary drastically for different support sets. Taking a few samples (i.e., 10 out of 2,500 or 15 out of 2,000) and repeating the experiment 5-10 times does not reflect the true potential of a few-shot algorithm.
Transductive inference: Moreover, few-shot algorithms often perform transductive inference, in which predictions may vary based on the combination of query samples. Hence, it is challenging to benchmark few-shot algorithms with the traditional testing framework.
In episodic testing, we randomly sample a query set and a support set from the dataset and perform the few-shot experiment, repeating until we obtain a narrow 95% confidence interval. In this way, we may iterate over the test dataset 5-10 times; however, episodic testing is not affected by the problems mentioned above and can benchmark any few-shot algorithm properly.
a.2 Extended Dataset
We use the XNLI dataset XNLI, which extends the MultiNLI dataset multiNLI to 15 languages. MultiNLI contains sentences from 10 different genres; the objective is to identify whether a premise entails the hypothesis. XNLI is a crowd-sourced 3-class classification dataset covering 14 additional languages translated from English: French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw), and Urdu (ur). It comes with human-translated dev and test splits. The dataset is balanced and contains 392,702 train, 2,490 dev, and 5,010 test instances for each language, respectively.
Given a pair of sentences, the objective of PAWS (Paraphrase Adversaries from Word Scrambling) zhang-etal-2019-paws is to classify whether the pair is a paraphrase. The PAWS-X dataset covers six typologically different languages that have been machine translated from English: French (fr), Spanish (es), German (de), Korean (ko), Japanese (ja), and Chinese (zh). Similar to XNLI, it also comes with human-translated dev and test splits.
The two datasets pose different challenges. The NLI task requires a rich, high-level factual understanding of the text. The PAWS task, on the other hand, contains pairs of sentences that usually have high lexical overlap yet may or may not be paraphrases. We use accuracy as the evaluation metric for both datasets.
10-Shot Results
For reference, we have added 10-shot experiments for the XNLI and PAWS-X datasets with the same setup as Tables 1 and 2 of the main paper.
[Table: Fine-tuned XLM-R-large with the XNLI dataset]
[Table: Fine-tuned XLM-R-large with the PAWS-X dataset]
a.3 Hyperparameters and Resource Description
We used 8 V100 GPUs (Amazon p3.16xlarge) to run all experiments. The hyper-parameters of the best-performing model are listed in Table 6. When fine-tuning the pre-trained language model, we searched over the learning rates (1e-5, 3e-5, 5e-5, 7.5e-6, 5e-6).
|# of params||550M|
|Max sequence length||128|
|Per-GPU batch size||8|
|Gradient accumulation steps||2|
|Effective batch size||128|
|Number of epochs|| |
|Warmup steps||6% of total number of steps|
|Total number of episodic tests||1000|
|Fine-tuning learning rate||7.5e-6|
|Fine-tuning scheduler||constant scheduler|
Appendix B Reproducibility Settings and Notes
Average runtime: see Table 7.