DeepAI
Log In Sign Up

Crosslingual Generalization through Multitask Finetuning

11/03/2022
by   Niklas Muennighoff, et al.
0

Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are publicly available at https://github.com/bigscience-workshop/xmtf.

READ FULL TEXT VIEW PDF

page 2

page 5

page 18

page 19

12/19/2022

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

The BLOOM model is a large open-source multilingual language model capab...
06/07/2021

Multilingual Neural Semantic Parsing for Low-Resourced Languages

Multilingual semantic parsing is a cost-effective method that allows a s...
10/07/2021

mRAT-SQL+GAP:A Portuguese Text-to-SQL Transformer

The translation of natural language questions to SQL queries has attract...
09/08/2021

Discrete and Soft Prompting for Multilingual Models

It has been shown for English that discrete and soft prompting perform s...
12/11/2022

IndicXTREME: A Multi-Task Benchmark For Evaluating Indic Languages

In this work, we introduce IndicXTREME, a benchmark consisting of nine d...
10/10/2019

Language Transfer for Early Warning of Epidemics from Social Media

Statements on social media can be analysed to identify individuals who a...
11/21/2022

Extended Multilingual Protest News Detection – Shared Task 1, CASE 2021 and 2022

We report results of the CASE 2022 Shared Task 1 on Multilingual Protest...

Code Repositories

xmtf

Crosslingual Generalization through Multitask Finetuning


view repo

1 Introduction

Figure 1: An overview of datasets in xP3. Datasets added to P3 in this work are marked bold. Yellow datasets are trained on. Green datasets are held out for evaluation.
Figure 2: Language composition of xP3, ROOTS, and the corpus of mT5. All ROOTS and xP3 languages are depicted. The mT5 corpus covers additional languages that are not included in the graph.
Figure 3: Comparison of dataset variants P3, xP3, and xP3mt on a sample from PAWS for P3 Zhang et al. (2019) and PAWS-X Yang et al. (2019) for xP3 and xP3mt. P3 pairs English datasets with English prompts, xP3 pairs multilingual datasets with English prompts and xP3mt pairs multilingual datasets with prompts machine-translated from English to match the dataset language. Expressions in curly brackets are replaced, e.g. for xP3mt the target shown as {{Choices[label]}}) becomes .

Large language models pretrained on vast amounts of text show some capability of solving tasks expressed in natural language, even without explicit training on these tasks Brown et al. (2020). Finetuning on groups of language tasks has been shown to significantly boost this zero-shot task generalization of language models Wei et al. (2021); Sanh et al. (2022); Min et al. (2021). For example, Sanh et al. (2022) finetune on tasks like summarization and question answering leading to better performance on unseen tasks like natural language inference. Previous work has focused on multitask finetuning in the context of large English language models and tasks.

Multilingual large language models show the same zero-shot learning capabilities for both monolingual and crosslingual tasks Goyal et al. (2021a); Lin et al. (2021); Patel et al. (2022); Soltan et al. (2022)

. However, zero-shot performance tends to be significantly lower than finetuned performance. Thus, task-specific or language-specific transfer learning via finetuning remains the predominant practice

Devlin et al. (2018); Conneau et al. (2019). This is particularly challenging for low-resource languages or tasks with limited data available, such as writing a fable that teaches a specified moral. In the spirit of multitask finetuning, it would be desirable to improve the zero-shot task generalization of multilingual models to make them usable on tasks from low-resource languages without requiring further finetuning.

To address this goal, we focus on crosslingual multitask finetuning. Due to the difficulty of collecting supervised task data in low-resource languages, previous work typically aims to transfer capabilities learned from finetuning on English data, which can improve performance on non-English language tasks Wu and Dredze (2019); Chalkidis et al. (2021); Vu et al. (2022). We investigate whether English-only multitask finetuning also improves performance on non-English held-out tasks using the multilingual BLOOM BigScience Workshop (2022) and mT5 Xue et al. (2020) models. We find that after finetuning on the English-only multitask mixture used for T0 Sanh et al. (2022) (P3), performance on a diverse set of non-English held-out tasks increases.

To investigate whether multilingual task data can further improve performance, we extend P3 to xP3 by adding datasets from 46 different languages that cover tasks previously not present in P3 (such as translation and program synthesis). Finetuning on xP3 leads to even better zero-shot task generalization in both English and non-English compared to the P3-trained baseline. Models finetuned on xP3 perform best on English prompts, even for non-English samples. Hypothesizing that better performance could be attained by training on non-English prompts, we construct a variant of xP3 with machine-translated prompts called xP3mt. We find that finetuning on machine-translated prompts is enough to significantly increase performance on held-out tasks with non-English human-written prompts. However, reducing the number of English prompts in the finetuning also worsens English prompt performance on multilingual tasks.

Notably, we also find that models finetuned on xP3 generalize to held-out tasks in languages never intentionally seen during pretraining nor finetuning. We conduct a contamination analysis and find that only small amounts of these languages were included in the pretraining corpus. Thus, we hypothesize the models learn some language- and task-agnostic capabilities.

We publicly release all our datasets and models (URLs in Appendix §D).

2 Related work

2.1 Multitask learning

Multitask finetuning Sanh et al. (2022) (or instruction tuning Wei et al. (2021)) has emerged as a recipe for improving the zero-shot task generalization of large language models. Typically, these works define a task as a collection of datasets that require a certain set of skills. To inform large language models which task to perform given an input, a prompt is used to add natural language instructions to dataset instances Schick and Schütze (2020); Scao and Rush (2021). In this line of work, zero-shot task generalization refers to the ability to perform a held-out task based on prompted instructions alone. Our work builds on T0 (Sanh et al., 2022), a variant of T5 (Raffel et al., 2020) that underwent MTF and was subsequently shown to have strong zero-shot task generalization capabilities.

Increasing the number and diversity of finetuning tasks and datasets has been shown to increase model performance Min et al. (2021); Fries et al. (2022); Wang et al. (2022c); Scialom et al. (2022); Chung et al. (2022); Mishra et al. (2021b). PromptSource Bach et al. (2022) is a software application that provides a framework for developing and applying prompts. PromptSource was used to construct P3, the training dataset of T0. While most prior work has focused on using English prompts on English datasets, Wang et al. (2022b) trained both English and multilingual models on prompted datasets. Their multilingual model, called mTk-instruct, attains strong crosslingual performance. In contrast with Wang et al. (2022b), our sole focus is crosslingual zero-shot generalization. Therefore, we consider a wider variety of prompting settings and perform a more detailed evaluation of multilingual capabilities. Separately, Radford et al. (2019) find that accidental inclusion of non-English text gave the GPT-2 model a limited ability to process and generate non-English text. We similarly discover that our finetuned models can process text in languages not intentionally trained on.

2.2 Multilingual models

Many language models are pretrained on English data only. Multilingual pretrained language models Lample and Conneau (2019); Conneau et al. (2019); Fan et al. (2021) aim to enable processing a wide variety of non-English languages. Unlike monolingual models, multilingual models can also be used for crosslingual tasks, such as translation. For language generation, recent efforts have focused on two different model architectures based on the Transformer Vaswani et al. (2017). On the one hand, encoder-decoder transformers trained with a denoising objective such as mBART Liu et al. (2020) and mT5 Xue et al. (2020) learn to predict tokens masked out in the input sequence. Predicting masked tokens is only a pretraining task and these models are generally finetuned on downstream datasets before being used. On the other hand, decoder-only models pretrained on next token prediction such as mGPT Shliazhko et al. (2022), XGLM Lin et al. (2021) and BLOOM BigScience Workshop (2022) can be used to solve tasks expressed in natural language directly in a zero-shot or few-shot setting Brown et al. (2020). XGLM demonstrated competitive few-shot performance even when the model was prompted in a language different than the sample being processed. In particular, using English prompts for multilingual datasets provides better performance with XGLM than human-translating the English prompt to the dataset language.

In this work, we use the BLOOM models BigScience Workshop (2022); Scao et al. (2022), which were pretrained on the ROOTS corpus Laurençon et al. (2022) in 46 natural languages and 13 programming languages. We also finetune mT5 Xue et al. (2020) to compare encoder-decoder and decoder-only performance. mT5 is pretrained on a corpus sampled from mC4 covering 101 languages.

3 Finetuning data and models

To study crosslingual multitask prompted finetuning, we create xP3 by extending the P3 dataset collection with additional non-English tasks. We finetune both BLOOM and mT5 models on xP3. We refer to Appendix §D for public links to released models and datasets.

3.1 Finetuning data

We build on the P3 Sanh et al. (2022) task taxonomy and add 28 new multilingual datasets illustrated in Figure 1

. We define four task clusters previously not present in P3: translation, simplification, program synthesis, and miscellaneous code datasets. As 11% of BLOOM’s pretraining data is code, we add code datasets classified as program synthesis (text-to-code) or miscellaneous. The latter includes tasks such as estimating the computational complexity of a provided code snippet and generating a name for a given function. We extend the XWinograd dataset 

Tikhonov and Ryabinin (2021) with winograd schemas from CLUE Xu et al. (2020) to increase its Chinese samples from 16 to 504. Similar to P3, a fraction of our prompts invert the task at hand. For example, a prompt may invert a closed-book QA sample by asking the model to generate a question given an answer.

With xP3 we aim to replicate the language distribution of the ROOTS corpus Laurençon et al. (2022) used to pretrain BLOOM. Thus, xP3 consists of the same 46 natural languages and code as ROOTS. ROOTS, xP3 and the mT5 corpus Xue et al. (2020) language distributions are visualized in Figure 2. 39% of xP3 data is English, slightly more than the 30% of English data in ROOTS. Various African languages such as Twi (tw) and Bambara (bm) form the tail of xP3’s language distribution. Many of them are not included in the mT5 pretraining corpus. In xP3, Twi and others are represented solely as a translation task using data from Flores-200 NLLB Team et al. (2022).

To study the importance of non-English prompts, we construct a machine-translated variant of xP3, xP3mt. We translate prompts of monolingual datasets into the respective dataset language. For example, for the Chinese dataset C3 Sun et al. (2020) prompts in xP3mt are in Chinese instead of English in xP3. For crosslingual datasets prompts remain in English (such as Wiki-Lingua, which involves producing a summary in one language based on text in another language). We use the Google Cloud API for machine translation111https://cloud.google.com/translate. Figure 3 compares the dataset variants we train on.

3.2 Models

Figure 4: Zero-shot multilingual task generalization with English prompts. BLOOM models have 176 billion parameters. Scores are the language average for each task. Appendix §B breaks down performance by language.

We use publicly available pretrained BLOOM models ranging from 560 million to 176 billion parameters. BLOOM models are large decoder-only language models pretrained for around 350 billion tokens with an architecture similar to GPT-3 Brown et al. (2020). We finetune the models for an additional 13 billion tokens with loss only being computed on target tokens. For example, given the input “Translate to English: Je t’aime.” and a space-separated target “I love you.”, the model is trained to predict only the targets. As targets vary in length from just one to hundreds of tokens, we downscale the loss of each token by the length of the target it belongs to. This ensures short targets (e.g. for multiple-choice QA) get the same weight as long targets (e.g. for translation). We skip samples longer than 2048 tokens and use packing to train efficiently on multiple samples at a time Kosec et al. (2021). We select the final checkpoint based on validation performance.

For mT5 models, we finetune using the T5X Roberts et al. (2022) framework on TPUs. mT5 uses the same encoder-decoder architecture, pretraining objective (masked language modeling), and pretraining length (1 trillion tokens) as T5 Raffel et al. (2020). For finetuning mT5, we follow the same procedure as described above for BLOOM, except that inputs are fed into the encoder and thus are not space-separated from targets.

We produce three core model variants available in different sizes:

  • BLOOMZ-P3 / mT0-P3: Models finetuned on the English-only P3.

  • BLOOMZ / mT0: Models finetuned on xP3, which consists of multilingual datasets with English prompts.

  • BLOOMZ-MT / mT0-MT: Models finetuned on xP3mt, which consists of multilingual datasets with English and machine-translated prompts.

We evaluate on three held-out tasks: coreference resolution, sentence completion and natural language inference (NLI) as depicted in Figure 1. We also evaluate on HumanEval due to its popularity for code evaluations Chen et al. (2021). For datasets that involve choosing the correct completion from several options, we follow prior work Sanh et al. (2022); Brown et al. (2020) and use rank classification: We compute the log-likelihood of each possible completion and select the highest scoring option. For each evaluation dataset, we select 5 prompts at random from PromptSource and use them for all language splits of the dataset. We report the median of the 5 prompts for results per language split. Thus, in constrast to XGLM Lin et al. (2021), we do not tune prompts based on performance on validation data. A selection of prompts can be found in Appendix §L. For generation evaluations we use lm-evaluation-harness Gao et al. (2021).

4 Results

Figure 5: Zero-shot task and language generalization using English prompts on tasks and languages not intentionally seen during pretraining nor finetuning. Language codes are ISO 639-1, except for JP (Japanese).

We first examine generalization to new tasks in languages included in finetuning in §4.1. Then, in §4.2, we look at language generalization: Can models generalize to tasks in languages that (a) they have only seen during pretraining and (b) they have never seen intentionally? In §4.3, we investigate performance on multilingual prompts and finetuning on xP3mt. Scaling laws are analyzed in §4.4. Finally, §4.5 looks at performance on generative tasks and §4.6 at the effect of language proportions on performance.

4.1 Task generalization

Previous work has shown that large language models finetuned on prompted multitask mixtures generalize to unseen tasks Zhong et al. (2021); Wei et al. (2021); Mishra et al. (2021b, a); Wang et al. (2022b). In Figure 4, we show that the same applies to multilingual models: Finetuned BLOOMZ and BLOOMZ-P3 models significantly improve over BLOOM and XGLM on held-out tasks. Despite an order of magnitude fewer parameters, mT0 (13 billion parameters) is ahead of BLOOMZ (176 billion parameters). We attribute this to the encoder-decoder architecture paired with a masked language modeling pretraining objective Wang et al. (2022a); Tay et al. (2022a) as well as the longer pretraining of mT5 Hoffmann et al. (2022); Su et al. (2022) (1 trillion tokens for mT5 vs. 366 billion for BLOOM). Despite also having gone through crosslingual multitask finetuning, mTk performs significantly worse than the same-sized mT0. We attribute this to our prompting style, which aims to replicate natural human communication. mTk is finetuned on more structured prompts with specific “Definition”, “Input” and “Output” fields. Similarly, Wang et al. (2022b) find that T0 performs worse than Tk on their prompts. We also find models finetuned on xP3 (BLOOMZ, mT0-13B) with 39% of English data outperform models finetuned on P3 (BLOOMZ-P3, mT0-13B-P3), which is 100% English (see Appendix §B). Even the fully English T0-11B model Sanh et al. (2022) is outperformed by our mT0-13B model. Ignoring embedding parameters these models have about the same size. This is likely due to xP3 adding additional tasks and prompts, which has been shown to help generalization Chung et al. (2022).

4.2 Language generalization

Here we add another layer of generalization: languages. Figure 4 already shows that finetuning on English data only (P3) leads to better performance on non-English data: For example, BLOOMZ-P3 improves by over 50% on multilingual sentence completion compared to BLOOM. Thus, zero-shot task performance in languages only seen during pretraining improves after finetuning on English. This has major practical benefits as it can be more difficult to collect data for low-resource languages.

Next, we investigate performance on languages the model has never intentionally seen. Due to the scale of large language model pretraining, it is difficult to label tasks or languages as strictly unseen. It is likely that the training data unintentionally includes small fractions of these languages (just as many tasks might appear “implicitly” in the pretraining corpus Sanh et al. (2022)). In Figure 5 we show that after multitask finetuning on xP3, the models can perform unseen tasks in languages that were not intentionally trained on. After probing the pretraining corpus in Appendix §C, we do find small amounts of these languages that were not intentionally included in the ROOTS corpus Laurençon et al. (2022). However, for XNLI, performance increases across all languages, many of which only show up in tiny fractions in our language contamination analysis, such as Thai with 0.006%. If we extrapolate this proportion to the entire ROOTS corpus, the BLOOM models would have seen a mere 20 million tokens of Thai during pretraining. One possibility is that better-than-random XNLI performance can be attained with little or no language understanding. In Appendix §G, we investigate Levenshtein distances of XNLI samples and find that there are meaningful differences across labels. Thus, sole inspection of characters without language understanding may be enough for better-than-random performance.

4.3 Multilingual prompting

Task Prompt Average accuracy
BLOOMZ BLOOMZ-MT mT0-13B mT0-13B-MT
XNLI EN 53.58 49.74 48.43 51.52
MT 37.87 42.03 39.83 42.64
HT 41.13 44.55 45.19 47.03
XCOPA EN 75.5 75.75 84.45 81.6
MT 71.95 74.25 82.9 81.1
XStoryCloze EN 84.42 84.07 82.52 82.58
MT 84.37 85.31 84.01 83.31
XWinograd EN 60.07 59.15 70.49 73.24
MT 58.48 60.14 66.89 72.33
Table 1: Comparison between EN (English), MT (machine-translated) and HT (human-translated) prompts for 176B BLOOMZ and 13B mT0 models finetuned on either only English or English and machine-translated multilingual prompts (-MT).

Since all prompts in xP3 are in English (even for multilingual datasets), we created xP3mt, an extension with machine-translated prompts. To investigate performance on non-English prompts, we additionally human- and machine-translated the English prompts used for evaluation. In Table 1, we report performance when prompting in non-English languages. BLOOMZ performs much better on English than on non-English prompts. BLOOMZ-MT, which is finetuned on xP3mt, significantly improves on multilingual prompts. On XNLI, BLOOMZ-MT raises the average performance on human-translated prompts from 41.13 to 45.55. This comes at the cost of a reduction in its performance on English prompts, from 53.58 to 49.74. For mT0, the MT version provides similar performance gains on XNLI and XWinograd non-English prompts, while results on XCOPA and XStoryCloze are mixed. Similar to Lin et al. (2021), we also find that models perform better on human-translated prompts than machine-translated ones for XNLI.

4.4 Scaling

Figure 6: Aggregate performance vs. size. Transparent lines correspond to individual languages, while thick lines are average accuracy scores.

In Figure 4, the average performance of BLOOM is near the random baselines of 0.50 for Sentence Completion and Coreference Resolution and 0.33 for NLI. We think this is due to all of our experiments being zero-shot and using untuned prompts Perez et al. (2021a). We find in Figure 6 that even at 560M parameters, multitask finetuning improves zero-shot generalization. The gap between pretrained and multitask finetuned models grows significantly as parameters increase. Scaling up parameters benefits all languages evaluated.

4.5 Generation tasks

Figure 7: Validation performance during training on natural language understanding (NLU) and natural language generation (NLG) tasks. The former are scored using accuracy and the latter using BLEU Papineni et al. (2002). The NLG tasks measured are translation and summarization.
Pass@
GPT-Neo 1.3B 4.79% 7.47% 16.30%
GPT-Neo 2.7B 6.41% 11.27% 21.37%
GPT-J 6B 11.62% 15.74% 27.74%
GPT-NeoX 20B 15.4% 25.6% 41.2%
Codex-300M 13.17% 20.37% 36.27%
Codex-679M 16.22% 25.7% 40.95%
Codex-2.5B 21.36% 35.42% 59.5%
Codex-12B 28.81% 46.81% 72.31%
BLOOM-560M 0.82% 3.02% 5.91%
BLOOM-1.1B 2.48% 5.93% 9.62%
BLOOM-1.7B 4.03% 7.45% 12.75%
BLOOM-3B 6.48% 11.35% 20.43%
BLOOM-7.1B 7.73% 17.38% 29.47%
BLOOM 15.52% 32.20% 55.45%
BLOOMZ-560M 2.18 % 4.11% 9.00%
BLOOMZ-1.1B 2.63% 6.22% 11.68%
BLOOMZ-1.7B 4.38% 8.73% 16.09%
BLOOMZ-3B 6.29% 11.94% 19.06%
BLOOMZ-7.1B 8.06% 15.03% 27.49%
BLOOMZ 12.06% 26.53% 48.44%
BLOOMZ-P3 6.13% 11.79% 18.73%
Table 2: Code continuation on HumanEval. Non-BLOOM results come from prior work Chen et al. (2021); Fried et al. (2022). Codex is a language model finetuned on code, while the GPT models Black et al. (2021); Wang and Komatsuzaki (2021); Black et al. (2022) are trained on a mix of code and text like BLOOM.

In this section, we investigate the impact of multitask finetuning on generative tasks. In Figure 7, we plot validation performance throughout the training process. We find that while performance on natural language understanding tasks continues to increase, generative performance jumps initially and then decreases. Relatedly, in Table 2, we find that multitask finetuning does not improve performance on HumanEval Chen et al. (2021). Only for small models, such as BLOOM-560M vs. BLOOMZ-560M, there are meaningful performance gains. When no code data is included in finetuning (BLOOMZ-P3) performance decreases significantly. mT0 models, which have not been pretrained on code fail to solve any HumanEval problems (see full results in Appendix §K). Given a Python docstring, HumanEval requires models to complete a function. Inspecting generations reveals that the multitask finetuned models are biased towards short generations. In Appendix §E, we show example solutions from HumanEval and compute average length statistics. BLOOMZ tries to solve problems with 70% fewer characters than BLOOM. One possible reason for this is that a majority of samples seen during multitask finetuning are only single sentences, so finetuned models learn to produce short answers. This could be causing the decreasing performance on generative tasks, which require longer answers than natural language understanding tasks. To force longer generations at inference time, we find it beneficial to enforce a minimum number of tokens during which the end-of-sequence token is ignored. We provide qualitative examples of forcing a minimum number of tokens in Appendix §F.

4.6 Effect of language proportions

Figure 8: Performance across languages by size in the BLOOM pretraining corpus, ROOTS.

In Figure 8, we find that finetuned BLOOM models perform better on languages seen extensively during pretraining. As the language distribution in the finetuning dataset, xP3, closely follows that of pretraining, these languages are also seen most frequently during finetuning. Specifically, XCOPA and XNLI show significantly better performance on these high-resource languages, such as English, Spanish or French, which all make up more than 10% of pretraining individually. The trend is less consistent for XWinograd. This may be caused by the fact that XWinograd language subsets are not translations of each other and have a significantly different number of samples. Thus, some language subsets of XWinograd may be inherently more difficult than others.

5 Conclusion

In this work we investigated crosslingual multitask finetuning. We developed xP3, a corpus consisting of tasks in 46 languages. Further, we have extended xP3 to xP3mt with machine-translated prompts. We have finetuned pretrained BLOOM and mT5 models on the newly created corpora as well as the English-only P3 corpus to produce BLOOMZ and mT0 models.

We found that English-only finetuning suffices for a multilingual pretrained large language model to generalize to tasks in other pretrained languages. However, finetuning on multiple languages using xP3 provided even better performance. We have further observed finetuned models to be capable of generalization to new tasks in languages they have never intentionally seen. We investigated multilingual prompting and found performance after finetuning on English prompts only to be poor. However, finetuning on a corpus with machine-translated prompts (xP3mt) lead to significantly better performance on human-written non-English prompts. Comparing models from 560 million up to 176 billion parameters revealed that the performance gap between only pretraining and finetuning widens as parameters increase. Lastly, we found multitask finetuning on billions of short targets biases models to produce short answers, which can hurt performance on generative tasks.

To contribute to future progress on improving zero-shot generalization, we release all datasets and models introduced in this work.

References

Appendix A Contributions

This research was conducted under the BigScience project for open research, a year-long initiative targeting the study of large models and datasets. The goal of the project is to research language models in a public environment. The project has hundreds of researchers from more than 50 countries and over 250 institutions. The BigScience project was initiated by Thomas Wolf at Hugging Face, and this collaboration would not have been possible without his effort. In the following, we list contributions made to this work.

Niklas Muennighoff evaluated all models, created xP3 and wrote most of the paper.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts and Hailey Schoelkopf wrote the training and evaluation code.

Niklas Muennighoff and Adam Roberts trained the models.

Niklas Muennighoff, Teven Le Scao, Hailey Schoelkopf, Zheng-Xin Yong, Thomas Wang, Khalid Almubarak, Alham Fikri Aji, M Saiful Bari and Zaid Alyafeai contributed prompts or datasets.

Lintang Sutawika, Stella Biderman, Zheng-Xin Yong, Khalid Almubarak, M Saiful Bari and Albert Webson initiated the project.

Sheng Shen conducted the contamination analysis.

Samuel Albanie wrote the prompt appendix.

Thomas Wang and Zheng-Xin Yong converted checkpoints.

Colin Raffel, Thomas Wang, Teven Le Scao, M Saiful Bari, Edward Raff and Dragomir Radev advised the project.

Niklas Muennighoff, Lintang Sutawika, Teven Le Scao, Colin Raffel, Stella Biderman, Alham Fikri Aji, Adam Roberts, Samuel Albanie, Sheng Shen, M Saiful Bari, Albert Webson, Xiangru Tang, Dragomir Radev and Edward Raff contributed to the paper.

Appendix B Task generalization breakdown

In Figure 9, we compare performance on English held-out tasks. We find that (a) finetuning on xP3 outperforms P3 (b) multilingual mT0 is stronger than monolingual T0 on English. We conjecture that both improvements come from xP3 having more prompts and datasets than P3 Chung et al. (2022).

Figure 9: Zero-shot English task generalization. Each dot represents performance on one English evaluation prompt.

In Figure 10, we visualize task generalization to multilingual datasets. The same data is aggregated in Figure 4. Performance by prompt varies substantially highlighting that prompt engineering may still be necessary after MTF. We also find that mT0 consistently outperforms BLOOMZ on Swahili (SW), possibly due to it being a larger part of its pretraining corpus (see Figure 2 and §4.6).

Figure 10: Zero-shot multilingual task generalization on languages seen during pretraining and finetuning. Each dot represents performance on one English evaluation prompt.

Appendix C ROOTS language contamination

Figure 11: Language composition of ROOTS-IDENTIFY-1%, ROOTS-1% and the mT5 corpus. All mT5 languages are depicted. ROOTS-1% is a random 1% sample of ROOTS with its assigned meta-languages. ROOTS-IDENTIFY-1% are the actual languages in ROOTS-1% re-identified using cld3.

While the BLOOM ROOTS corpus Laurençon et al. (2022) was collected from 46 natural languages and 13 programming languages, we find that sentences from the same document do not always belong to the collected (meta) language. Some sentences use languages like Russian or Japanese that were not the intentionally collected parts. This “language contamination” may stem from “code-mixing” or different languages being used in code comments. To investigate the extent of contamination, we randomly sample 1% of the documents from ROOTS for a total of 51M documents. For each document, we use cld3222https://github.com/google/cld3 Xue et al. (2020) to identify the languages used in each sentence and compare them with the meta language of the document. We summarize our results in Figure 11. It shows that ROOTS contains unintentionally collected languages, such as Burmese (my: 0.00003%), Thai (th: 0.006%), Turkish (tr: 0.03%), Greek (el: 0.03%), Russian (ru: 0.03%), Bulgarian (bg: 0.05%), Estonian (et: 0.06%), Haitian (ht: 0.12%), German (de: 0.21%), Italian (it: 0.28%) and Japanese (ja: 0.54%). These “unseen” languages only have small sentence proportions in our subsample compared to English (en: 46.23%), French (fr: 15.73%) and Spanish (es: 13.38%). Yet, they may help the language generalization of BLOOMZ models described in §4.2. Japanese is mostly mixed in the meta English documents (47%), meta Code documents (8%) and meta Chinese documents (5%). Meanwhile, Russian is mostly mixed in the meta English documents (52%), meta Code documents (19%) and meta French documents (11%).

Appendix D Artifacts

Table 3 lists all artifacts used or released in this work.

Artifact Explanation Public link
ROOTS Multilingual pretraining corpus of BLOOM https://huggingface.co/bigscience-data
mC4 Multilingual pretraining corpus used for mT5 https://huggingface.co/datasets/mc4
P3 Multitask finetuning dataset with English data & English prompts https://huggingface.co/datasets/bigscience/P3
xP3 Multitask finetuning dataset with multilingual data & English prompts https://huggingface.co/datasets/bigscience/xP3
xP3all Same as xP3 with held-out evaluation sets https://huggingface.co/datasets/bigscience/xP3all
xP3mt Same as xP3 with English & multilingual machine-translated prompts https://huggingface.co/datasets/bigscience/xP3mt
xP3megds Processed version of xP3 for easy usage with Megatron-DeepSpeed https://huggingface.co/datasets/bigscience/xP3megds
XGLM-7.5B 7.5B parameter pretrained multilingual transformer https://huggingface.co/facebook/xglm-7.5B
T0-11B 11B parameter model finetuned on P3 https://huggingface.co/bigscience/t0
mTk-3.7B 3.7B parameter multitask finetuned multilingual transformer https://huggingface.co/allenai/mtk-instruct-3b-def-pos
mTk-13B 13B parameter multitask finetuned multilingual transformer https://huggingface.co/allenai/mtk-instruct-11b-def-pos
BLOOM-560M 560M parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom-560m
BLOOM-1.1B 1.1B parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom-1b1
BLOOM-1.7B 1.7B parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom-1b7
BLOOM-3B 3B parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom-3b
BLOOM-7.1B 7.1B parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom-7b1
BLOOM 176B parameter model pretrained on ROOTS https://huggingface.co/bigscience/bloom
BLOOMZ-560M 560M parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz-560m
BLOOMZ-1.1B 1.1B parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz-1b1
BLOOMZ-1.7B 1.7B parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz-1b7
BLOOMZ-3B 3B parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz-3b
BLOOMZ-7.1B 7.1B parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz-7b1
BLOOMZ-7.1B-MT 7.1B parameter model finetuned on xP3mt https://huggingface.co/bigscience/bloomz-7b1-mt
BLOOMZ-7.1B-P3 7.1B parameter model finetuned on P3 https://huggingface.co/bigscience/bloomz-7b1-p3
BLOOMZ 176B parameter model finetuned on xP3 https://huggingface.co/bigscience/bloomz
BLOOMZ-MT 176B parameter model finetuned on xP3mt https://huggingface.co/bigscience/bloomz-mt
BLOOMZ-P3 176B parameter model finetuned on P3 https://huggingface.co/bigscience/bloomz-p3
mT5-300M 300M parameter model pretrained on a sampled version of mC4 https://huggingface.co/google/mt5-small
mT5-580M 580M parameter model pretrained on a sampled version of mC4 https://huggingface.co/google/mt5-base
mT5-1.2B 1.2B parameter model pretrained on a sampled version of mC4 https://huggingface.co/google/mt5-large
mT5-3.7B 3.7B parameter model pretrained on a sampled version of mC4 https://huggingface.co/google/mt5-xl
mT5-13B 13B parameter model pretrained on a sampled version of mC4 https://huggingface.co/google/mt5-xxl
mT0-300M 300M parameter model finetuned on xP3 https://huggingface.co/bigscience/mt0-small
mT0-580M 580M parameter model finetuned on xP3 https://huggingface.co/bigscience/mt0-base
mT0-1.2B 1.2B parameter model finetuned on xP3 https://huggingface.co/bigscience/mt0-large
mT0-3.7B 3.7B parameter model finetuned on xP3 https://huggingface.co/bigscience/mt0-xl
mT0-13B 13B parameter model finetuned on xP3 https://huggingface.co/bigscience/mt0-xxl
mT0-13B-MT 13B parameter model finetuned on xP3mt https://huggingface.co/bigscience/mt0-xxl-mt
mT0-13B-P3 13B parameter model finetuned on P3 https://huggingface.co/bigscience/mt0-xxl-p3
Table 3: Links to all models & datasets used as part of this work. BLOOMZ models have an additional repository containing the final optimizer states for training with Megatron-Deepspeed that can be found by appending “-optimizer-states” to the respective URL.

Appendix E Code generations

Table 4 provides statistics on code generations and code data. We find that BLOOM generates on average 70% more characters and 17x more comments than BLOOMZ for a given problem from HumanEval. Figure 12 compares an example solution from BLOOM and BLOOMZ. While both solutions are correct, BLOOMZ is biased towards short and concise answers.

(a) BLOOM

(b) BLOOMZ
Figure 12: Code generations of BLOOM and BLOOMZ on HumanEval. The model is prompted to generate after the final . The generation is stopped after an end-of-sequence token or a return statement followed by a newline.
Data () HumanEval generations Targets of xP3
BLOOM BLOOMZ code datasets
Average characters 247 144 531
Average Python comments (#) 0.69 0.04 0.85
Table 4: Number of characters and comments for generations and targets in the finetuning corpus.

Appendix F Qualitative examples

(a) English prompt
(b) Non-English prompt
Figure 13:

Greedy generations for sentiment analysis, a task trained on. BLOOMZ and mT0-13B have not been trained on non-English prompts, but are still able to handle them. BLOOMZ, however, answers in English. The review is a five star review of Star Wars Episode IV.

(a) English prompt
(b) Non-English prompt
Figure 14: Greedy generations for zero-shot query expansion, a task not trained on. The models sometimes fail to output at least five terms as requested in the prompt.
(a) English prompt
(b) English prompt
Figure 15: Greedy generations on question answering, a task trained on. Left: Specifying the language in the prompt is an effective way to force the output language. Right:

Specifying a minimum token length as a generation hyperparameter is an effective way to force long generations. The output of BLOOM is shortened (marked with

).
(a) English prompt
(b) English prompt
Figure 16: Non-greedy fable generations given a moral, a task not trained on. The generations are cherry-picked from 16 outputs with no minimum length, a temperature of 0.9 and top of 40. Left: BLOOMZ generates an interesting fable with the desired moral. mT0 is significantly worse at writing stories likely due to its different pretraining objective. Right: BLOOMZ does not seem to understand the moral correctly.

Appendix G XNLI Levenshtein distances

To investigate whether XNLI can be solved without any language understanding, we compute Levenshtein distances Levenshtein et al. (1966) between premise and hypothesis and average them by the XNLI label. In Table 5 we find that the distances are smallest between entailment pairs and largest between neutral pairs. This is intuitive as entailment pairs generally need to cover similar content. Contradiction pairs still need to cover similar content but differ in at least one major way. Meanwhile for neutral pairs hypothesis and premise may be about completely different topics. This highlights that XNLI can be solved to some degree by solely comparing the similarity of characters across premise and hypothesis.

Label () Entailment Neutral Contradiction
Language ()
Thai (th) 79.08 82.64 81.52
Turkish (tr) 76.93 80.59 80.24
Greek (el) 90.90 95.10 93.93
Table 5: Levenshtein distances between hypothesis and premise averaged across samples from different XNLI labels. Each label has 830 samples per language subset.

Appendix H Multilingual prompting in unseen languages

Table 6 shows aggregate performances on languages not intentionally seen during pretraining nor finetuning for BLOOMZ and only seen during pretraining for mT0. For BLOOMZ, performance drops significantly when translating the prompts to the respective unseen languages. Further, BLOOMZ-MT loses its edge over BLOOMZ as it has not been finetuned on prompts in these languages. For mT0 differences are less significant.

Task Prompt Average accuracy
BLOOMZ BLOOMZ-MT mT0-13B mT0-13B-MT
XNLI EN 45.65 43.2 48.52 51.33
MT 36.48 35.67 41.86 39.78
XCOPA EN 54.27 53.67 72.67 71.6
MT 53.2 53.0 71.57 70.87
XStoryCloze EN 61.59 61.36 79.31 80.13
MT 60.5 59.91 80.21 80.28
XWinograd EN 55.98 54.54 70.81 72.0
MT 53.11 52.46 67.86 70.45
Table 6: Comparison between EN (English), MT (machine-translated) and HT (human-translated) prompts for 176B BLOOMZ and 13B mT0 models finetuned on either only English or English and machine-translated multilingual prompts (-MT). For BLOOMZ the evaluation languages averaged are never intentionally seen, such as Japanese and Russian for XWinograd (see Figure 5). For mT0 the evaluation languages are only seen during pretraining.

Appendix I Ideas that did not work

We list several experiments that did not improve over baseline results:

Non-causal

In a non-causal or prefix language model, the model attends bidirectionally over input tokens and only causally over target tokens. Given a pretrained causal decoder, previous work found that multitask finetuning in a non-causal setup performed better than causal finetuning Wang et al. (2022a); Tay et al. (2022b). However, in our experiments, non-causal finetuning did not improve over causal finetuning.

Special tokens

Instead of separating inputs and targets with a space, we experimented with special tokens. Using the end-of-sequence token as a separator or a completely new token that the model would learn during finetuning significantly worsened results. The models may need to train on more tokens, possibly even during pretraining, to learn these new special tokens Zeng et al. (2022).

Fixing prompts

PromptSource has been written with encoder-decoder models in mind, where inputs and targets are fed into different models. As a consequence, human-written prompts in PromptSource often lack separators between input and target. For our decoder models, we decided to separate them with a space. We additionally experimented with leaving them as is or rewriting a significant amount of prompts, but neither improved significantly over space separation.

BitFit

Previous work has shown bias-only finetuning Zaken et al. (2021) of large language models to be sufficient for strong downstream performance Logan et al. (2021); Hu et al. (2021); Muennighoff (2022); Liu et al. (2022); Ding et al. (2022); Muennighoff et al. (2022). We found multitask finetuning of only biases to perform 15 absolute percentage points worse on the average of held-out tasks for BLOOMZ-7.1B.

Appendix J Limitations

We highlight several limitations of our work:

Unnatural prompting format

The choice to separate inputs and targets using a space character has proven effective to multitask finetune our decoder-only models. Nonetheless, poorly formatted prompts may result in undesirable behavior. For example, given the following prompt: “Translate to English: Je t’aime”, the model may continue the input with additional French content before starting to solve the task, i.e. translating the input from French to English. This can be mitigated by improving the prompts with a trailing full stop or a newline symbol. Encoder-decoder models, such as our mT0, do not suffer from this problem, as inputs and targets are fed into different parts of the model.

Limited languages xP3

The pretraining corpus of mT0 contains more than 101 languages Xue et al. (2020), however we finetune on only 46 languages. As shown in Appendix §B, more datasets lead to better performance. Likely, extending xP3 to the full 101 languages mT0 has seen during pretraining would lead to better performance. However, we decided to use only the languages of BLOOM in order to study language generalization (§4.2). Similarly, one could likely attain better performance by enhancing xP3 with more datasets, such as via BIG-Bench Srivastava et al. (2022); Suzgun et al. (2022), or more prompts, such as via NL-Augmenter Dhole et al. (2021).

Performance

While our models show strong capabilities of performing tasks zero-shot, there remain numerous failure modes that are common in large language models Rae et al. (2021); Bommasani et al. (2021); Zhang et al. (2022); Smith et al. (2022); Ouyang et al. (2022); Chowdhery et al. (2022). In Figure 16 of Appendix §F, BLOOMZ fails to understand the moral of a fable resulting in an undesirable generation. Similarly, in Figure 15, mT0-13B is asked to provide an explanation, but answers with a question.

Learning new languages during finetuning

While we investigated generalization to languages only seen during pretraining, we did not investigate generalization to languages only seen during finetuning. Our mT0 models are finetuned on several new languages not seen in pretraining (see Figure 2). Out of those, we only evaluated on code (HumanEval), where mT0 performed at the random baseline (0.00 in Table 7). Future work may investigate language acquisition via crosslingual multitask finetuning. We point to prior work that has looked into extending BLOOM to new languages Yong and Nikoulina (2022).

Appendix K Full results

Table 7 shows all experimental results reported in this paper.

Pretrained Pretrained + Multitask finetuned Task Dataset Config Split Prompt Metric XGLM-7.5B BLOOM-560M BLOOM-1.1B BLOOM-1.7B BLOOM-3B BLOOM-7.1B BLOOM T0-11B mTk-3.7B mTk-13B mT0-300M mT0-560M mT0-1.2B mT0-3.7B mT0-13B mT0-13B-MT mT0-13B-P3 BLOOMZ-560M BLOOMZ-1.1B BLOOMZ-1.7B BLOOMZ-3B BLOOMZ-7.1B BLOOMZ-7.1B-MT BLOOMZ-7.1B-P3 BLOOMZ BLOOMZ-MT BLOOMZ-P3 Coref. resolution winogrande xl validation EN Median acc. 49.25 49.88 50.99 49.57 49.96 49.41 48.62 60.46 50.99 52.33 49.57 51.62 50.51 52.01 62.27 62.51 56.91 49.80 51.07 50.75 51.78 55.41 55.88 51.78 58.41 58.64 55.64 Coref. resolution winogrande xl validation EN Max acc. 50.12 50.99 51.62 50.91 51.46 50.91 49.64 63.61 51.14 54.54 50.51 53.28 51.78 52.49 63.38 62.67 58.56 52.41 52.33 51.14 53.67 55.80 56.51 54.06 59.27 59.98 57.06 Coref. resolution xwinograd en test EN Median acc. 50.88 50.62 51.10 50.67 50.97 50.15 50.28 62.75 52.22 52.77 50.11 51.01 52.30 57.94 79.91 81.33 59.87 50.24 50.15 52.09 54.84 60.09 59.31 52.26 67.87 64.73 59.74 Coref. resolution xwinograd en test EN Max acc. 51.61 51.53 51.57 51.66 51.70 50.71 51.27 70.71 53.12 60.82 51.31 51.40 54.80 61.89 81.29 83.31 70.71 51.01 50.49 56.34 59.23 66.02 65.76 53.72 69.08 69.33 60.65 Coref. resolution xwinograd fr test EN Median acc. 50.60 46.99 48.19 50.60 46.99 50.60 51.81 54.22 50.60 53.01 50.60 51.81 49.40 56.63 77.11 73.49 55.42 49.40 53.01 51.81 49.40 53.01 53.01 53.01 65.06 59.04 53.01 Coref. resolution xwinograd fr test EN Max acc. 51.81 51.81 56.63 55.42 54.22 51.81 53.01 56.63 53.01 60.24 53.01 55.42 56.63 59.04 78.31 78.31 61.45 51.81 56.63 55.42 53.01 57.83 55.42 55.42 68.67 68.67 59.04 Coref. resolution xwinograd fr test MT Median acc. 46.99 48.19 53.01 48.19 46.99 50.60 49.40 54.22 50.60 53.01 49.40 53.01 53.01 56.63 68.67 75.90 53.01 48.19 50.60 50.60 50.60 51.81 55.42 51.81 56.63 57.83 53.01 Coref. resolution xwinograd fr test MT Max acc. 51.81 51.81 55.42 53.01 55.42 51.81 50.60 59.04 57.83 63.86 54.22 55.42 55.42 59.04 75.90 75.90 61.45 51.81 59.04 51.81 50.60 57.83 57.83 54.22 65.06 66.27 56.63 Coref. resolution xwinograd jp test EN Median acc. 49.22 50.36 50.89 51.62 51.41 50.89 50.26 51.51 52.66 53.18 50.89 50.68 51.41 56.93 74.35 77.37 60.27 49.74 49.84 49.95 50.26 50.68 49.64 50.36 57.46 55.47 51.09 Coref. resolution xwinograd jp test EN Max acc. 52.03 51.09 52.03 52.35 52.24 52.76 50.99 51.82 53.18 56.20 52.14 51.41 52.24 60.27 78.62 78.62 65.59 50.57 51.09 52.55 52.45 52.87 51.62 51.93 59.65 58.39 56.00 Coref. resolution xwinograd jp test MT Median acc. 48.91 50.89 50.26 50.78 51.93 49.53 51.72 51.51 51.20 53.28 51.41 50.05 50.26 55.27 73.31 78.42 61.00 50.78 50.57 49.64 50.68 49.95 50.26 50.36 52.87 52.66 50.89 Coref. resolution xwinograd jp test MT Max acc. 50.99 52.03 52.03 52.24 52.97 50.99 53.18 52.03 53.70 56.41 52.45 51.09 53.08 59.02 78.21 80.19 66.11 52.03 51.82 49.95 52.14 52.76 51.82 51.51 53.91 53.60 54.33 Coref. resolution xwinograd pt test EN Median acc. 50.57 51.33 51.71 51.71 50.19 48.67 50.95 52.47 52.09 56.27 49.81 49.81 53.61 58.17 72.24 76.05 56.27 50.19 50.19 50.95 52.47 53.99 54.37 51.33 63.50 60.08 53.99 Coref. resolution xwinograd pt test EN Max acc. 53.99 53.99 53.99 53.99 54.37 50.19 51.33 54.75 52.09 58.56 50.57 52.09 55.13 60.84 76.43 80.99 61.98 52.09 51.33 53.23 53.61 57.79 57.41 53.99 64.26 64.64 60.46 Coref. resolution xwinograd pt test MT Median acc. 50.95 52.09 50.57 49.81 50.57 50.57 53.23 52.47 53.23 52.47 49.81 47.15 52.47 54.75 71.48 75.67 55.89 52.47 50.57 49.81 50.19 52.85 53.61 51.33 60.46 59.70 54.75 Coref. resolution xwinograd pt test MT Max acc. 53.99 53.99 53.99 53.99 53.99 53.99 53.99 56.65 54.37 55.89 51.71 52.09 56.27 66.16 77.95 80.61 64.26 53.99 54.75 53.23 52.47 53.99 55.51 52.09 64.26 62.74 59.32 Coref. resolution xwinograd ru test EN Median acc. 53.33 51.43 52.38 54.29 52.70 54.29 54.29 51.43 53.97 56.83 49.52 51.11 52.38 56.83 74.29 73.97 56.51 52.06 49.52 51.75 52.38 53.97 53.02 48.57 57.78 56.51 52.70 Coref. resolution xwinograd ru test EN Max acc. 53.97 53.97 53.97 56.19 54.92 55.24 57.14 53.33 55.56 60.32 53.65 52.70 55.56 59.05 76.51 79.05 62.22 53.97 50.48 53.33 53.97 54.92 55.87 49.21 60.95 60.32 56.19 Coref. resolution xwinograd ru test MT Median acc. 53.33 51.75 52.38 53.97 52.06 53.97 52.70 50.16 53.33 54.29 52.06 51.75 52.70 52.38 66.98 71.43 55.87 51.43 51.43 53.02 49.52 52.06 52.70 47.62 54.29 55.87 54.92 Coref. resolution xwinograd ru test MT Max acc. 54.60 53.97 53.97 54.60 54.92 55.56 55.87 52.70 54.92 58.73 54.29 53.97 54.60 54.60 72.06 75.24 58.41 53.97 53.97 55.24 53.97 53.33 54.92 53.97 60.32 57.14 57.14 Coref. resolution xwinograd zh test EN Median acc. 49.01 49.21 48.81 50.20 50.00 50.60 49.21 49.21 52.18 56.75 52.78 52.18 51.59 57.54 69.25 76.19 58.53 54.17 53.97 51.39 55.16 57.94 54.37 52.18 68.65 62.10 51.59 Coref. resolution xwinograd zh test EN Max acc. 50.79 52.18 52.78 53.77 55.16 55.36 52.98 49.40 54.76 57.14 54.17 53.77 54.17 62.90 77.38 79.17 65.67 54.76 55.16 56.15 60.91 63.69 62.70 52.98 69.05 70.63 55.95 Coref. resolution xwinograd zh test HT Median acc. - - - - - - - - - - - - - - - - - - - - - 50.99 - 49.40 - - - Coref. resolution xwinograd zh test HT Max acc. - - - - - - - - - - - - - - - - - - - - - 59.72 - 52.18 - - - Coref. resolution xwinograd zh test MT Median acc. 48.02 49.01 49.01 49.40 49.60 50.79 49.60 49.21 53.17 53.17 51.19 51.79 50.60 56.35 67.86 72.42 57.74 50.79 51.19 51.79 52.98 52.38 57.94 50.40 62.70 67.46 57.14 Coref. resolution xwinograd zh test MT Max acc. 49.21 55.56 53.17 56.15 53.57 56.94 57.74 49.21 54.56 57.74 53.37 53.97 54.37 62.10 72.82 82.34 64.09 51.98 54.17 54.17 55.16 60.71 62.50 52.38 70.24 76.39 60.71 NLI anli r1 validation EN Median acc. 33.30 33.60 33.50 33.40 32.90 33.40 36.20 44.50 29.90 34.20 33.30 31.30 30.70 37.50 48.00 48.50 44.90 29.60 29.10 33.10 38.60 40.90 40.10 34.50 46.00 45.60 40.60 NLI anli r1 validation EN Max acc. 33.50 34.40 33.70 33.80 33.40 33.70 37.60 45.00 34.80 35.40 34.70 33.30 33.30 38.20 49.50 49.50 47.30 33.40 33.30 34.00 40.10 42.10 42.60 35.10 48.60 49.70 41.70 NLI anli r2 validation EN Median acc. 33.40 33.20 33.10 33.30 33.20 33.30 33.70 39.30 32.40 32.50 33.20 33.30 32.50 34.40 41.70 40.60 37.90 32.00 33.20 34.30 34.60 38.20 37.60 33.90 41.90 41.00 37.80 NLI anli r2 validation EN Max acc. 35.00 33.70 33.50 36.00 34.90 33.40 34.80 39.60 33.20 34.20 34.00 33.50 34.70 34.80 43.00 42.00 40.20 33.40 33.50 36.10 36.80 39.50 39.40 35.40 44.10 45.00 39.30 NLI anli r3 validation EN Median acc. 32.92 33.50 33.42 33.17 33.33 33.08 34.58 41.33 32.83 33.33 33.00 33.00 33.50 37.42 44.83 46.25 40.50 33.25 33.08 35.42 37.75 38.00 38.92 34.08 42.67 41.33 40.08 NLI anli r3 validation EN Max acc. 34.25 35.58 33.50 33.67 33.58 33.58 36.33 43.75 33.00 34.83 33.83 33.33 34.75 39.00 46.08 48.17 44.17 33.50 34.50 37.08 40.00 41.00 42.00 37.58 45.50 45.58 42.83 NLI super_glue cb validation EN Median acc. 39.29 42.86 42.86 28.57 32.14 44.64 33.93 76.79 44.64 26.79 44.64 46.43 50.00 69.64 82.14 87.50 67.86 51.79 53.57 58.93 67.86 57.14 71.43 53.57 76.79 76.79 75.00 NLI super_glue cb validation EN Max acc. 41.07 60.71 48.21 42.86 51.79 57.14 42.86 78.57 51.79 57.14 50.00 50.00 51.79 85.71 85.71 87.50 76.79 53.57 58.93 71.43 75.00 80.36 83.93 62.50 82.14 87.50 85.71 NLI super_glue rte validation EN Median acc. 52.71 53.07 47.65 49.46 54.15 52.35 50.18 83.39 56.68 51.26 58.84 65.70 62.09 76.90 83.03 83.75 80.14 64.26 53.07 73.29 72.56 79.06 79.06 67.15 81.95 81.23 73.65 NLI super_glue rte validation EN Max acc. 53.07 54.15 52.71 53.79 57.40 55.96 54.15 84.84 61.37 55.23 61.01 66.43 64.26 78.70 85.56 84.84 83.03 67.15 65.70 76.17 76.17 84.12 82.67 78.70 85.56 85.92 85.20 NLI xnli ar validation EN Median acc. 33.33 33.82 33.57 33.98 35.94 33.82 33.78 33.90 35.10 35.10 33.90 40.84 38.15 49.72 56.51 57.63 54.82 39.80 41.33 46.99 48.23 51.20 49.28 47.99 54.38 51.85 46.39 NLI xnli ar validation EN Max acc. 34.98 36.95 34.78 35.90 36.59 37.99 34.46 34.22 39.72 38.31 37.43 41.85 42.61 51.85 57.91 58.03 56.06 44.46 46.59 50.04 53.29 53.25 55.58 50.64 60.68 58.03 55.22 NLI xnli ar validation HT Median acc. 34.30 33.37 33.33 34.06 33.37 33.33 33.37 33.33 33.65 35.62 32.57 33.37 34.34 39.32 42.93 49.16 47.55 34.78 35.30 35.82 39.36 41.85 39.00 36.35 38.63 37.31 50.32 NLI xnli ar validation HT Max acc. 37.31 33.45 33.41 35.38 35.26 37.79 34.18 33.65 35.50 37.79 36.02 34.14 35.14 49.56 50.04 55.22 54.82 36.47 38.27 45.06 47.39 48.07 46.10 43.21 42.29 43.01 56.71 NLI xnli ar validation MT Median acc. 33.33 33.33 33.33 33.33 33.49 35.06 33.61 33.33 33.25 33.78 33.29 33.37 33.33 33.41 33.33 35.14 33.37 33.25 33.33 33.53 33.90 33.49 34.66 33.53 33.33 41.97 33.37 NLI xnli ar validation MT Max acc. 34.22 33.45 35.42 33.69 34.54 36.67 36.95 33.45 34.10 36.27 33.33 33.53 34.22 33.94 34.18 42.85 39.48 36.18 40.32 35.94 41.89 49.12 48.55 42.89 35.50 45.42 47.51 NLI xnli bg validation EN Median acc. 33.37 33.33 33.33 33.41 33.33 33.37 33.13 34.66 34.30 34.50 33.86 40.44 41.49 52.65 59.24 59.80 56.79 37.27 35.46 38.59 39.36 43.49 41.20 41.65 47.19 43.69 41.16 NLI xnli bg validation EN Max acc. 37.23 33.45 34.66 34.78 34.62 34.66 33.90 35.66 39.92 36.59 37.55 42.33 43.94 54.18 59.88 59.92 58.23 39.76 40.40 42.17 43.82 43.61 44.90 43.98 48.43 46.75 46.63 NLI xnli bg validation MT Median acc. 33.33 33.33 33.33 33.37 33.33 33.09 33.05 33.33 34.34 36.63 33.33 33.69 34.10 40.72 38.39 46.67 41.93 33.33 34.74 33.33 33.33 33.49 33.65 33.86 34.10 33.41 33.41 NLI xnli bg validation MT Max acc. 33.33 33.73 33.33 33.45 34.62 34.70 33.49 33.33 37.19 39.44 33.65 35.54 38.31 48.55 54.70 52.53 48.23 33.33 39.80 34.10 36.27 35.02 34.66 34.58 39.20 37.43 43.98 NLI xnli de validation EN Median acc. 33.37 33.37 34.14 33.29 33.45 33.33 33.09 43.94 36.10 35.14 33.94 41.37 42.25 52.93 60.12 59.84 57.03 37.71 36.27 41.12 40.76 45.86 44.22 41.85 53.13 45.18 46.43 NLI xnli de validation EN Max acc. 36.10 34.02 35.02 33.65 35.18 34.26 33.65 45.90 40.52 35.50 35.78 42.41 44.18 54.78 60.64 60.16 58.59 39.36 40.12 42.73 45.26 46.83 48.92 47.03 54.38 53.69 50.16 NLI xnli de validation MT Median acc. 33.13 33.29 33.37 33.37 33.33 33.41 33.41 33.94 34.30 39.32 33.33 33.33 33.41 37.87 37.51 35.26 33.41 33.41 34.78 33.61 34.26 33.69 33.86 35.70 39.56 36.79 39.32 NLI xnli de validation MT Max acc. 36.06 33.45 33.45 33.65 34.54 34.58 36.39 40.08 35.74 41.45 33.86 33.41 34.54 47.47 51.37 45.82 50.56 35.50 35.58 38.27 37.63 36.22 37.99 36.47 42.13 38.96 44.70 NLI xnli el validation EN Median acc. 33.29 33.33 33.45 33.37 33.33 33.33 33.49 33.57 34.50 34.82 34.62 39.76 41.57 52.57 58.80 58.88 56.10 37.95 35.50 38.43 40.36 41.08 40.92 40.68 45.94 42.29 39.12 NLI xnli el validation EN Max acc. 36.83 33.90 34.02 34.58 35.54 34.42 33.73 34.10 40.08 37.31 37.43 40.92 43.94 53.78 59.00 59.20 57.35 40.96 39.32 41.81 42.61 41.53 42.89 41.89 47.43 46.55 43.05 NLI xnli el validation MT Median acc. 33.33 33.33 33.33 33.33 33.33 33.37 33.33 33.33 33.53 34.70 33.21 33.33 33.33 33.29 45.06 34.78 33.49 33.33 33.33 33.41 33.33 33.78 33.33 33.37 34.82 34.74 33.61 NLI xnli el validation MT Max acc. 34.22 33.41 34.62 33.33 33.37 34.90 33.37 34.70 39.08 36.22 33.73 33.33 39.28 34.98 51.24 42.37 35.58 33.49 33.37 35.50 33.37 34.82 34.38 33.94 37.19 37.43 35.70 NLI xnli en validation EN Median acc. 33.49 34.38 33.61 33.61 33.57 33.29 33.49 59.12 36.79 35.66 34.10 43.13 40.60 55.18 61.24 61.93 60.36 44.62 39.04 49.84 52.21 55.38 54.38 46.35 60.92 57.47 55.02 NLI xnli en validation EN Max acc. 36.79 35.34 34.74 35.22 37.83 33.45 36.02 60.16 41.00 36.83 38.47 43.78 44.26 56.83 62.01 62.25 61.00 46.43 47.11 55.02 57.31 59.68 58.92 55.90 67.47 61.81 59.72 NLI xnli es validation EN Median acc. 33.33 33.37 33.33 33.49 33.90 33.86 34.10 45.70 34.26 35.58 34.26 40.36 42.49 51.97 60.32 60.72 58.03 43.90 41.24 46.91 50.32 52.53 46.10 45.34 58.51 43.98 52.09 NLI xnli es validation EN Max acc. 33.41 34.38 33.98 33.86 35.98 35.94 39.48 48.67 38.96 36.27 36.75 41.93 45.34 54.78 60.80 60.92 59.40 44.98 47.55 52.97 56.14 55.10 57.35 53.73 61.24 59.12 59.32 NLI xnli es validation HT Median acc. 33.37 33.33 33.33 33.45 33.33 33.33 33.33 33.33 34.66 36.75 33.33 33.41 33.37 35.66 38.76 54.74 33.33 33.37 34.10 33.33 33.45 33.33 37.63 33.33 33.37 46.55 33.37 NLI xnli es validation HT Max acc. 33.49 34.86 33.33 34.86 34.94 33.37 34.94 43.05 35.22 42.05 33.90 33.78 36.02 39.04 58.15 60.76 51.65 37.23 38.96 48.03 53.09 48.76 53.13 52.97 56.83 56.99 58.76 NLI xnli es validation MT Median acc. 33.45 33.33 33.33 33.33 33.33 33.33 33.61 34.22 33.82 36.27 33.29 33.33 33.37 33.53 45.54 55.34 33.45 33.33 33.45 33.33 33.45 33.33 39.80 34.50 33.37 49.52 33.90 NLI xnli es validation MT Max acc. 34.22 35.38 34.10 34.90 34.02 35.38 36.14 37.27 40.08 39.00 33.53 33.57 34.14 47.67 55.38 60.12 47.23 43.21 34.46 43.01 51.24 55.10 53.86 52.21 57.71 53.82 58.96 NLI xnli fr validation EN Median acc. 33.33 33.82 33.90 33.45 34.50 34.46 33.45 45.38 35.78 35.26 34.94 40.36 40.36 52.37 59.52 59.56 58.15 44.22 42.65 48.59 50.52 52.41 51.00 47.39 57.71 53.69 51.81 NLI xnli fr validation EN Max acc. 35.46 35.26 34.86 33.90 37.47 37.27 36.95 46.71 36.75 35.94 37.15 42.45 42.01 54.22 59.88 59.88 58.47 45.54 48.51 52.21 55.78 55.26 56.67 53.37 61.37 59.12 57.99 NLI xnli fr validation HT Median acc. 33.49 33.82 33.53 34.82 34.22 33.98 33.90 33.61 35.26 36.06 33.41 33.98 34.58 35.90 39.76 55.86 33.94 33.33 35.14 37.59 34.58 46.55 48.92 42.73 49.52 51.53 48.59 NLI xnli fr validation HT Max acc. 36.27 34.02 35.70 35.50 35.82 34.34 35.06 47.59 41.45 36.83 34.14 38.92 37.79 47.31 57.07 58.80 50.92 43.53 47.75 46.83 51.89 53.21 53.45 47.11 58.47 56.95 55.50 NLI xnli fr validation MT Median acc. 34.10 33.33 33.33 33.90 32.69 33.45 33.49 33.61 37.07 35.58 33.53 34.78 34.38 35.10 37.79 55.66 34.98 33.73 34.26 37.83 36.67 41.41 47.23 41.85 46.95 52.01 49.00 NLI xnli fr validation MT Max acc. 34.98 34.54 35.22 34.98 33.49 34.70 35.66 34.46 43.37 35.90 35.26 38.27 36.51 44.02 57.83 58.71 50.32 40.48 46.35 42.13 51.41 48.84 52.41 47.91 57.91 54.82 55.86 NLI xnli hi validation EN Median acc. 33.33 33.41 33.49 34.54 33.78 33.33 33.53 33.33 34.18 34.02 34.18 38.88 38.19 47.87 56.14 57.31 54.94 39.08 37.59 44.54 46.18 48.35 44.70 41.41 53.53 44.38 45.78 NLI xnli hi validation EN Max acc. 34.86 34.94 34.34 35.10 37.07 35.38 36.75 33.65 39.48 34.86 35.38 39.76 41.89 50.24 57.23 57.47 55.46 41.81 42.89 48.07 51.49 50.88 53.45 49.84 56.83 52.53 55.02 NLI xnli hi validation HT Median acc. 33.41 33.33 33.29 33.33 33.33 33.33 34.18 33.33 34.22 34.22 33.29 36.79 37.59 39.24 47.99 44.78 34.62 37.27 35.50 36.59 34.50 40.56 37.31 43.90 44.54 49.64 47.15 NLI xnli hi validation HT Max acc. 34.62 34.46 34.14 33.33 33.37 33.41 35.98 35.22 38.67 37.23 34.94 37.99 41.69 48.67 56.06 56.39 53.41 40.32 41.12 41.24 43.41 42.61 47.03 49.28 60.20 52.37 52.21 NLI xnli hi validation MT Median acc. 33.29 33.37 33.33 33.33 33.45 33.37 34.22 33.33 35.22 33.33 33.37 34.86 34.26 41.61 47.59 36.39 33.45 34.54 37.39 36.71 33.33 33.94 34.50 33.90 38.51 44.14 38.07 NLI xnli hi validation MT Max acc. 33.73 33.98 33.45 34.14 33.61 33.45 36.22 33.33 38.31 36.91 34.30 36.79 39.36 47.23 50.24 48.59 33.69 37.15 39.04 39.08 36.18 36.27 36.55 39.88 41.61 49.32 43.94 NLI xnli ru validation EN Median acc. 33.33 33.33 33.37 33.33 33.61 33.33 33.29 38.15 33.86 35.30 33.73 41.16 41.37 49.84 57.95 57.67 55.10 36.99 35.78 41.45 43.73 46.67 44.62 43.90 52.33 46.75 44.74 NLI xnli ru validation EN Max acc. 36.10 34.34 35.02 33.82 35.10 35.26 34.66 41.24 38.84 37.99 37.35 41.93 42.13 53.09 58.88 58.67 56.55 39.64 42.81 45.10 47.11 47.75 50.24 46.55 54.02 52.85 50.12 NLI xnli ru validation MT Median acc. 33.41 33.33 33.33 33.37 33.41 34.14 33.37 33.37 34.10 36.10 35.26 33.57 33.29 39.44 39.72 33.94 34.14 33.33 33.49 33.73 33.29 33.78 34.46 33.94 42.89 38.59 41.49 NLI xnli ru validation MT Max acc. 33.98 33.53 33.61 33.82 34.86 36.79 33.45 33.65 37.63 38.27 36.67 35.86 35.42 42.93 56.71 48.03 38.11 33.86 33.57 36.63 33.57 38.96 37.31 34.94 45.34 46.31 43.49 NLI xnli sw validation EN Median acc. 33.25 33.82 33.45 33.53 33.98 33.33 33.21 33.73 34.82 34.14 33.94 38.23 39.64 45.78 55.46 55.70 53.25 37.19 35.50 41.20 41.77 43.90 42.29 36.51 50.36 43.98 37.27 NLI xnli sw validation EN Max acc. 34.82 34.94 35.46 34.46 36.75 36.55 33.73 33.78 37.79 35.46 35.18 39.68 40.08 49.60 55.66 56.79 53.73 38.35 41.29 44.34 47.83 46.63 48.27 43.49 52.09 50.36 50.04 NLI xnli sw validation HT Median acc. 33.45 33.33 33.41 33.33 33.33 33.37 34.54 33.94 35.02 34.58 33.41 33.33 33.57 37.75 41.73 46.95 43.25 34.54 33.53 34.02 33.41 35.70 33.61 34.10 34.98 39.40 34.54 NLI xnli sw validation HT Max acc. 35.54 34.50 33.45 33.41 33.37 35.02 35.94 34.58 35.42 37.19 34.02 35.46 36.59 46.31 52.37 49.60 49.68 35.14 33.98 34.94 36.10 35.94 35.58 37.19 37.71 42.85 35.02 NLI xnli sw validation MT Median acc. 33.57 33.33 33.33 33.33 33.33 33.33 33.41 33.33 33.25 35.18 33.33 33.33 33.33 35.98 33.33 33.37 34.58 33.33 33.53 33.57 32.97 33.41 33.37 33.33 35.82 35.10 33.37 NLI xnli sw validation MT Max acc. 34.22 33.33 33.45 34.38 33.37 34.62 35.46 34.06 35.02 37.11 33.57 34.82 34.78 38.03 39.64 37.55 41.45 34.34 34.98 33.94 33.53 37.03 33.41 33.41 36.27 36.83 33.57 NLI xnli th validation EN Median acc. 33.37 33.73 33.37 33.41 33.33 33.41 33.33 33.69 34.74 34.58 33.41 40.92 39.68 45.34 56.31 57.19 54.98 34.14 33.61 38.88 38.19 39.00 39.48 41.65 41.45 41.33 37.31 NLI xnli th validation EN Max acc. 35.22 33.86 33.90 34.46 34.02 33.82 36.31 34.70 39.88 35.46 37.55 41.97 40.80 52.13 57.43 58.03 56.02 35.50 42.93 40.36 42.93 40.12 41.08 43.17 43.78 43.98 42.29 NLI xnli th validation MT Median acc. 33.53 33.33 33.37 33.33 33.33 33.33 34.58 33.33 34.06 35.54 33.25 35.34 33.57 35.78 38.55 39.20 40.32 33.57 33.33 33.29 33.65 32.53 33.78 33.94 33.29 34.10 34.02 NLI xnli th validation MT Max acc. 35.46 33.69 33.69 33.41 35.18 36.39 40.44 33.33 35.98 36.06 33.57 35.82 33.69 40.84 52.45 57.95 49.08 34.94 34.30 34.22 33.98 34.78 34.90 35.38 36.27 37.63 36.99 NLI xnli tr validation EN Median acc. 33.33 33.33 33.33 33.33 33.37 33.33 34.22 34.14 35.26 34.62 34.26 39.16 40.76 48.71 56.75 56.39 54.34 36.95 33.98 35.46 36.06 36.14 37.47 38.31 42.73 39.76 39.08 NLI xnli tr validation EN Max acc. 36.51 33.73 33.57 33.37 33.41 35.18 34.66 34.90 40.24 36.79 36.51 40.28 41.29 50.56 57.59 57.67 55.38 37.31 37.51 37.15 37.23 37.55 38.71 40.44 45.70 43.78 43.78 NLI xnli tr validation MT Median acc. 33.33 33.33 33.41 33.33 33.33 33.21 33.37 33.49 34.34 34.02 33.29 33.21 33.94 34.06 38.80 38.67 38.67 33.49 33.33 33.33 33.25 33.33 33.37 34.46 33.33 33.37 33.82 NLI xnli tr validation MT Max acc. 33.45 33.41 34.62 34.34 33.98 33.49 34.02 34.02 34.86 38.80 34.26 35.94 34.46 37.63 48.35 54.98 46.99 36.67 33.61 35.34 34.46 33.73 33.98 36.18 37.27 37.27 40.88 NLI xnli ur validation EN Median acc. 33.05 33.61 33.37 33.69 33.41 34.42 34.02 33.13 33.29 34.14 34.50 36.59 37.07 46.67 54.70 54.58 53.57 36.67 35.26 39.88 42.89 43.94 40.84 40.12 49.96 45.86 40.28 NLI xnli ur validation EN Max acc. 34.10 33.69 34.58 34.78 34.30 35.82 34.26 33.33 38.43 34.62 35.78 38.71 39.80 47.91 55.42 55.98 54.02 38.96 41.37 44.38 49.04 46.51 49.48 45.18 50.80 51.24 51.81 NLI xnli ur validation HT Median acc. 33.90 33.61 33.09 33.53 32.93 31.69 33.69 33.25 34.86 34.10 34.30 34.70 35.74 35.50 48.92 45.42 35.66 34.18 33.37 37.95 38.35 34.26 35.62 36.10 34.62 40.44 38.92 NLI xnli ur validation HT Max acc. 34.98 34.06 33.37 33.78 33.53 33.13 35.06 33.53 35.78 35.86 35.14 37.23 39.88 41.12 52.17 53.82 46.87 36.79 33.86 39.92 41.49 41.77 40.00 37.67 41.77 46.39 46.95 NLI xnli ur validation MT Median acc. 33.33 33.33 33.33 33.25 33.25 33.25 33.29 33.45 33.33 33.49 33.61 35.02 33.49 40.08 40.32 36.10 33.49 34.14 33.33 33.29 33.33 33.33 33.37 33.37 33.33 33.82 36.67 NLI xnli ur validation MT Max acc. 34.02 33.45 33.45 33.53 33.29 33.33 34.66 33.57 34.06 34.78 34.22 35.46 38.88 42.69 53.78 51.81 49.80 36.02 33.33 33.33 33.45 35.66 33.78 37.51 35.50 35.86 37.87 NLI xnli vi validation EN Median acc. 33.33 34.42 33.37 33.45 33.57 33.69 33.49 34.70 34.98 34.58 35.46 39.60 39.76 52.45 58.19 58.35 56.39 43.65 40.52 46.35 46.22 50.08 46.51 43.61 55.78 48.27 49.80 NLI xnli vi validation EN Max acc. 37.79 35.26 34.54 34.82 37.91 37.19 33.73 35.58 38.27 35.14 36.95 40.20 41.81 53.21 58.51 58.92 57.11 44.74 47.19 51.08 53.98 52.93 54.50 51.97 61.00 55.82 57.27 NLI xnli vi validation HT Median acc. 33.41 33.05 33.41 33.29 33.37 32.97 33.73 33.21 34.02 37.63 33.13 33.78 34.98 41.37 43.57 33.45 45.78 37.23 37.79 35.94 40.92 41.24 50.28 33.57 39.60 46.55 33.61 NLI xnli vi validation HT Max acc. 34.86 33.53 33.61 33.78 34.14 33.53 34.46 33.25 37.51 38.63 33.61 37.99 39.56 46.31 56.14 56.75 49.24 39.20 42.21 44.14 43.29 47.59 52.65 39.76 46.99 54.82 48.03 NLI xnli vi validation MT Median acc. 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.41 33.98 33.33 33.33 33.33 33.33 33.78 33.33 33.73 33.57 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 NLI xnli vi validation MT Max acc. 34.62 33.41 33.82 34.66 34.14 34.02 35.02 34.06 33.69 34.70 33.37 33.94 34.34 33.53 36.79 33.94 35.82 37.03 34.66 34.42 33.57 33.94 38.88 34.26 39.20 33.33 33.90 NLI xnli zh validation EN Median acc. 33.41 35.06 33.41 33.33 35.54 34.10 33.73 33.33 34.62 34.74 36.14 41.37 38.92 44.54 57.11 57.39 53.90 43.94 39.40 48.11 47.71 51.24 47.07 47.39 56.22 44.02 48.47 NLI xnli zh validation EN Max acc. 35.14 36.31 34.22 36.95 36.67 37.27 34.86 33.82 41.85 35.10 37.07 42.49 40.84 50.64 59.12 58.71 55.22 44.66 47.63 51.12 54.18 53.61 54.30 52.29 56.91 55.50 56.95 NLI xnli zh validation HT Median acc. 33.45 33.33 33.33 33.33 33.33 33.37 33.53 33.33 33.25 34.70 33.13 33.78 34.10 40.24 46.83 50.40 39.00 34.70 33.37 33.45 33.69 35.38 34.46 39.32 33.69 36.95 52.89 NLI xnli zh validation HT Max acc. 33.86 33.37 33.45 33.57 34.74 35.02 36.91 33.33 37.63 37.51 34.02 35.22 35.66 41.89 55.98 56.99 51.49 37.43 38.96 34.06 42.09 44.10 40.64 45.10 41.37 49.24 53.69 NLI xnli zh validation MT Median acc. 33.69 33.25 33.25 32.61 34.38 33.41 33.82 33.33 34.42 34.46 32.85 33.69 34.54 36.31 48.92 54.86 33.33 33.82 33.98 34.30 34.06 34.98 46.35 35.10 34.38 49.56 38.96 NLI xnli zh validation MT Max acc. 35.62 33.49 33.45 33.86 34.54 34.02 34.70 33.33 36.63 35.30 34.10 34.82 35.90 39.12 51.49 56.87 39.28 35.58 36.95 36.63 39.16 42.41 48.92 39.60 50.60 52.25 44.14 Program synthesis openai_humaneval None test EN Pass@1 - 0.82 2.48 4.03 6.48 7.73 15.52 - - - - - - 0.00 - - - 2.18 2.62 4.38 6.29 8.06 7.23 1.55 12.06 13.55 6.13 Program synthesis openai_humaneval None test EN Pass@10 - 3.02 5.93 7.45 11.35 17.38 32.20 - - - - - - 0.00 - - - 4.11 6.22 8.73 11.94 15.03 14.46 4.12 26.53 26.26 11.79 Program synthesis openai_humaneval None test EN Pass@100 - 6.23 9.62 12.75 20.43 29.47 55.45 - - - - - - 0.00 - - - 9.00 11.68 16.09 19.06 27.49 25.86 9.60 48.44 47.01 18.73 Sent. completion story_cloze 2016 validation EN Median acc. 51.68 50.08 47.25 48.48 47.51 49.44 50.99 94.71 46.71 48.21 52.38 57.14 58.69 77.61 95.40 93.85 96.31 58.52 59.01 79.64 85.20 89.10 88.51 84.66 95.67 95.83 94.01 Sent. completion story_cloze 2016 validation EN Max acc. 66.27 59.43 62.05 64.30 66.44 70.92 76.22 94.92 52.27 57.08 54.36 57.83 59.49 79.10 96.04 94.66 96.63 60.29 62.75 82.90 87.33 90.43 89.58 87.07 96.26 96.69 94.66 Sent. completion super_glue copa validation EN Median acc. 55.00 55.00 57.00 54.00 62.00 69.00 55.00 93.00 53.00 58.00 51.00 53.00 65.00 66.00 90.00 86.00 90.00 51.00 58.00 66.00 73.00 83.00 80.00 78.00 88.00 90.00 89.00 Sent. completion super_glue copa validation EN Max acc. 67.00 67.00 62.00 65.00 66.00 78.00 75.00 94.00 54.00 66.00 57.00 55.00 65.00 72.00 93.00 88.00 91.00 52.00 63.00 69.00 76.00 86.00 84.00 81.00 91.00 91.00 91.00 Sent. completion xcopa et validation EN Median acc. 57.00 53.00 50.00 53.00 50.00 53.00 51.00 53.00 52.00 51.00 53.00 49.00 53.00 65.00 72.00 79.00 75.00 48.00 49.00 48.00 50.00 49.00 51.00 52.00 48.00 52.00 49.00 Sent. completion xcopa et validation EN Max acc. 58.00 58.00 56.00 57.00 57.00 57.00 52.00 55.00 56.00 61.00 57.00 51.00 56.00 70.00 75.00 81.00 79.00 53.00 55.00 50.00 51.00 50.00 51.00 57.00 50.00 54.00 53.00 Sent. completion xcopa et validation MT Median acc. 56.00 56.00 54.00 51.00 53.00 53.00 46.00 49.00 55.00 56.00 53.00 50.00 54.00 60.00 74.00 76.00 75.00 48.00 53.00 47.00 49.00 47.00 47.00 48.00 49.00 48.00 46.00 Sent. completion xcopa et validation MT Max acc. 57.00 60.00 58.00 55.00 61.00 56.00 54.00 54.00 58.00 64.00 56.00 52.00 55.00 69.00 79.00 79.00 77.00 50.00 54.00 48.00 53.00 48.00 52.00 52.00 51.00 52.00 53.00 Sent. completion xcopa ht validation EN Median acc. 52.00 48.00 48.00 52.00 53.00 52.00 56.00 47.00 53.00 47.00 55.00 55.00 59.00 60.00 76.00 76.00 75.00 42.00 46.00 48.00 52.00 49.00 51.00 51.00 55.00 51.00 51.00 Sent. completion xcopa ht validation EN Max acc. 53.00 53.00 56.00 56.00 57.00 59.00 57.00 51.00 63.00 54.00 60.00 59.00 62.00 66.00 79.00 79.00 77.00 49.00 52.00 51.00 62.00 54.00 52.00 55.00 58.00 55.00 56.00 Sent. completion xcopa ht validation MT Median acc. 53.00 52.00 54.00 50.00 55.00 53.00 57.00 49.00 58.00 45.00 54.00 55.00 54.00 60.00 72.00 75.00 73.00 45.00 44.00 47.00 49.00 50.00 51.00 54.00 53.00 50.00 56.00 Sent. completion xcopa ht validation MT Max acc. 57.00 53.00 62.00 57.00 59.00 56.00 66.00 56.00 58.00 52.00 60.00 60.00 58.00 61.00 81.00 78.00 80.00 47.00 51.00 54.00 64.00 52.00 54.00 56.00 56.00 54.00 58.00 Sent. completion xcopa id validation EN Median acc. 51.00 53.00 51.00 56.00 50.00 54.00 54.00 55.00 48.00 51.00 56.00 50.00 59.00 65.00 90.00 88.00 84.00 50.00 59.00 58.00 66.00 70.00 70.00 65.00 79.00 83.00 78.00 Sent. completion xcopa id validation EN Max acc. 54.00 56.00 61.00 58.00 53.00 59.00 62.00 58.00 52.00 53.00 58.00 54.00 59.00 70.00 92.00 90.00 86.00 57.00 60.00 61.00 70.00 76.00 73.00 67.00 86.00 87.00 82.00 Sent. completion xcopa id validation MT Median acc. 52.00 55.00 51.00 59.00 53.00 53.00 56.00 56.00 47.00 53.00 53.00 53.00 55.00 59.00 90.00 88.00 84.00 53.00 56.00 56.00 60.00 62.00 64.00 64.00 76.00 83.00 75.00 Sent. completion xcopa id validation MT Max acc. 58.00 60.00 60.00 61.00 58.00 57.00 61.00 59.00 51.00 57.00 59.00 55.00 61.00 71.00 91.00 89.00 87.00 54.00 57.00 59.00 64.00 67.00 71.00 70.00 82.00 84.00 87.00 Sent. completion xcopa it validation EN Median acc. 49.00 56.00 50.00 46.00 55.00 50.00 54.00 66.00 55.00 54.00 53.00 55.00 60.00 64.00 87.00 85.00 87.00 51.00 51.00 45.00 50.00 59.00 57.00 57.00 72.00 69.00 69.00 Sent. completion xcopa it validation EN Max acc. 52.00 59.00 55.00 55.00 60.00 55.00 56.00 68.00 58.00 55.00 57.00 61.00 61.00 69.00 90.00 88.00 90.00 52.00 55.00 48.00 53.00 61.00 62.00 60.00 74.00 72.00 74.00 Sent. completion xcopa it validation MT Median acc. 53.00 53.00 53.00 45.00 54.00 52.00 54.00 63.00 53.00 56.00 57.00 54.00 59.00 66.00 84.00 84.00 85.00 49.00 54.00 43.00 48.00 55.00 57.00 55.00 69.00 69.00 68.00 Sent. completion xcopa it validation MT Max acc. 55.00 58.00 55.00 48.00 57.00 55.00 57.00 72.00 59.00 57.00 59.00 56.00 63.00 70.00 88.00 86.00 88.00 52.00 56.00 49.00 51.00 57.00 60.00 58.00 73.00 74.00 71.00 Sent. completion xcopa qu validation EN Median acc. 59.00 54.00 48.00 56.00 59.00 61.00 56.00 52.00 48.00 48.00 47.00 52.00 54.00 53.00 58.00 54.00 48.00 54.00 44.00 52.00 51.00 45.00 48.00 50.00 49.00 51.00 51.00 Sent. completion xcopa qu validation EN Max acc. 61.00 56.00 56.00 58.00 59.00 65.00 59.00 58.00 55.00 53.00 51.00 53.00 57.00 55.00 58.00 56.00 49.00 55.00 50.00 56.00 56.00 60.00 60.00 54.00 56.00 53.00 54.00 Sent. completion xcopa qu validation MT Median acc. 60.00 49.00 50.00 55.00 52.00 60.00 51.00 57.00 47.00 53.00 49.00 53.00 54.00 54.00 56.00 54.00 53.00 53.00 47.00 50.00 45.00 51.00 46.00 50.00 49.00 50.00 51.00 Sent. completion xcopa qu validation MT Max acc. 63.00 60.00 58.00 57.00 55.00 63.00 60.00 60.00 51.00 55.00 54.00 55.00 56.00 56.00 59.00 55.00 56.00 55.00 56.00 54.00 50.00 60.00 61.00 51.00 53.00 56.00 57.00 Sent. completion xcopa sw validation EN Median acc. 50.00 51.00 52.00 53.00 48.00 52.00 58.00 57.00 45.00 47.00 50.00 54.00 53.00 48.00 70.00 76.00 71.00 52.00 56.00 53.00 55.00 47.00 54.00 48.00 60.00 64.00 56.00 Sent. completion xcopa sw validation EN Max acc. 56.00 59.00 61.00 61.00 58.00 59.00 65.00 58.00 58.00 50.00 53.00 59.00 54.00 53.00 73.00 81.00 74.00 55.00 64.00 55.00 66.00 63.00 60.00 58.00 64.00 66.00 58.00 Sent. completion xcopa sw validation MT Median acc. 53.00 49.00 49.00 49.00 53.00 53.00 60.00 57.00 46.00 47.00 50.00 54.00 51.00 52.00 77.00 76.00 72.00 53.00 57.00 53.00 54.00 59.00 59.00 56.00 63.00 62.00 58.00 Sent. completion xcopa sw validation MT Max acc. 57.00 60.00 62.00 59.00 57.00 55.00 62.00 59.00 54.00 52.00 55.00 58.00 54.00 53.00 79.00 78.00 75.00 56.00 61.00 57.00 62.00 60.00 61.00 62.00 64.00 67.00 61.00 Sent. completion xcopa ta validation EN Median acc. 52.00 48.00 54.00 53.00 58.00 54.00 53.00 55.00 57.00 50.00 57.00 60.00 59.00 60.00 84.00 78.00 79.00 48.00 54.00 52.00 55.00 55.00 59.00 69.00 67.00 66.00 66.00 Sent. completion xcopa ta validation EN Max acc. 61.00 55.00 54.00 59.00 61.00 56.00 63.00 62.00 61.00 59.00 59.00 63.00 61.00 62.00 84.00 79.00 84.00 50.00 57.00 56.00 59.00 57.00 62.00 71.00 69.00 70.00 69.00 Sent. completion xcopa ta validation MT Median acc. 54.00 44.00 55.00 53.00 57.00 56.00 59.00 55.00 50.00 52.00 57.00 60.00 61.00 55.00 77.00 74.00 71.00 46.00 52.00 50.00 54.00 61.00 56.00 63.00 62.00 63.00 63.00 Sent. completion xcopa ta validation MT Max acc. 58.00 55.00 60.00 55.00 62.00 57.00 68.00 58.00 60.00 62.00 58.00 61.00 62.00 64.00 80.00 81.00 82.00 58.00 55.00 54.00 57.00 64.00 60.00 66.00 64.00 64.00 69.00 Sent. completion xcopa th validation EN Median acc. 54.00 53.00 53.00 54.00 52.00 50.00 55.00 55.00 52.00 55.00 60.00 50.00 51.00 56.00 73.00 71.00 74.00 54.00 55.00 56.00 54.00 57.00 55.00 56.00 54.00 51.00 51.00 Sent. completion xcopa th validation EN Max acc. 55.00 57.00 53.00 56.00 54.00 52.00 59.00 55.00 56.00 57.00 65.00 51.00 53.00 60.00 74.00 74.00 77.00 58.00 59.00 60.00 55.00 57.00 61.00 63.00 58.00 53.00 59.00 Sent. completion xcopa th validation MT Median acc. 53.00 52.00 52.00 52.00 48.00 45.00 55.00 55.00 52.00 58.00 56.00 51.00 53.00 54.00 71.00 72.00 72.00 54.00 50.00 56.00 55.00 51.00 51.00 55.00 52.00 51.00 54.00 Sent. completion xcopa th validation MT Max acc. 55.00 54.00 59.00 55.00 52.00 54.00 57.00 55.00 58.00 63.00 59.00 55.00 57.00 58.00 77.00 76.00 76.00 57.00 58.00 59.00 63.00 56.00 56.00 56.00 57.00 53.00 61.00 Sent. completion xcopa tr validation EN Median acc. 48.00 56.00 54.00 51.00 52.00 49.00 49.00 49.00 48.00 49.00 51.00 55.00 47.00 55.00 73.00 73.00 74.00 55.00 47.00 55.00 53.00 49.00 55.00 53.00 54.00 51.00 54.00 Sent. completion xcopa tr validation EN Max acc. 56.00 60.00 55.00 55.00 55.00 52.00 50.00 51.00 51.00 56.00 58.00 58.00 49.00 57.00 79.00 76.00 76.00 58.00 55.00 59.00 58.00 53.00 56.00 55.00 57.00 54.00 55.00 Sent. completion xcopa tr validation MT Median acc. 53.00 55.00 50.00 50.00 48.00 42.00 51.00 49.00 50.00 49.00 52.00 56.00 50.00 55.00 77.00 73.00 73.00 55.00 52.00 58.00 54.00 44.00 48.00 54.00 51.00 50.00 55.00 Sent. completion xcopa tr validation MT Max acc. 56.00 57.00 56.00 58.00 54.00 49.00 55.00 53.00 56.00 58.00 57.00 60.00 57.00 58.00 79.00 76.00 74.00 61.00 54.00 59.00 61.00 48.00 51.00 58.00 54.00 53.00 56.00 Sent. completion xcopa vi validation EN Median acc. 51.00 60.00 50.00 59.00 54.00 59.00 56.00 49.00 51.00 54.00 43.00 51.00 57.00 62.00 85.00 83.00 82.00 56.00 58.00 68.00 73.00 78.00 71.00 65.00 84.00 84.00 74.00 Sent. completion xcopa vi validation EN Max acc. 56.00 62.00 59.00 64.00 57.00 62.00 59.00 53.00 52.00 61.00 54.00 52.00 63.00 68.00 87.00 85.00 84.00 61.00 63.00 70.00 77.00 79.00 72.00 67.00 87.00 91.00 79.00 Sent. completion xcopa vi validation MT Median acc. 52.00 61.00 52.00 57.00 59.00 62.00 57.00 61.00 50.00 57.00 48.00 50.00 57.00 60.00 87.00 84.00 81.00 54.00 57.00 63.00 64.00 68.00 72.00 61.00 82.00 85.00 76.00 Sent. completion xcopa vi validation MT Max acc. 57.00 66.00 65.00 65.00 69.00 67.00 68.00 65.00 54.00 61.00 52.00 52.00 63.00 64.00 88.00 84.00 83.00 57.00 57.00 67.00 65.00 74.00 77.00 64.00 84.00 89.00 81.00 Sent. completion xcopa zh validation EN Median acc. 55.00 51.00 49.00 57.00 51.00 60.00 56.00 55.00 51.00 53.00 53.00 53.00 56.00 63.00 85.00 83.00 79.00 55.00 55.00 62.00 72.00 76.00 77.00 72.00 86.00 84.00 80.00 Sent. completion xcopa zh validation EN Max acc. 58.00 61.00 63.00 73.00 66.00 68.00 72.00 55.00 52.00 65.00 54.00 58.00 58.00 65.00 89.00 86.00 79.00 61.00 61.00 66.00 73.00 80.00 80.00 77.00 90.00 86.00 82.00 Sent. completion xcopa zh validation HT Median acc. - - - - - - - - - - - - - - - - - - - - - 76.00 - 75.00 - - - Sent. completion xcopa zh validation HT Max acc. - - - - - - - - - - - - - - - - - - - - - 78.00 - 79.00 - - - Sent. completion xcopa zh validation MT Median acc. 54.00 52.00 49.00 57.00 52.00 61.00 53.00 55.00 50.00 56.00 48.00 51.00 55.00 57.00 83.00 83.00 77.00 54.00 54.00 59.00 57.00 72.00 74.00 72.00 86.00 83.00 82.00 Sent. completion xcopa zh validation MT Max acc. 63.00 62.00 58.00 67.00 66.00 67.00 73.00 55.00 52.00 58.00 56.00 52.00 58.00 59.00 88.00 87.00 79.00 59.00 55.00 67.00 61.00 81.00 80.00 76.00 90.00 86.00 83.00 Sent. completion xstory_cloze ar validation EN Median acc. 51.69 49.44 49.57 49.57 50.23 50.36 52.42 48.11 47.98 49.97 48.11 53.61 54.53 65.92 85.37 87.23 88.29 52.75 52.08 69.49 78.23 81.47 82.13 75.18 91.26 91.59 91.86 Sent. completion xstory_cloze ar validation EN Max acc. 52.95 51.36 52.08 54.20 56.25 59.17 65.52 49.64 49.04 51.62 48.71 54.53 56.59 67.50 90.93 90.60 88.95 53.67 53.54 73.33 80.61 83.26 83.79 77.50 92.65 92.92 92.12 Sent. completion xstory_cloze ar validation MT Median acc. 51.62 48.91 49.11 49.24 51.49 49.37 52.61 51.69 48.71 51.36 47.58 53.08 54.86 66.71 90.54 90.14 88.29 52.55 52.28 62.48 67.24 80.01 82.46 75.12 91.13 91.73 91.79 Sent. completion xstory_cloze ar validation MT Max acc. 53.01 51.75 52.48 54.20 55.72 58.37 65.12 53.47 49.90 53.01 48.78 54.20 55.06 70.09 91.07 91.00 89.15 54.40 53.08 70.48 78.03 83.06 83.85 78.69 92.79 94.04 92.46 Sent. completion xstory_cloze es validation EN Median acc. 52.28 50.89 46.86 47.72 48.58 49.83 50.56 81.80 47.19 48.11 53.28 54.33 55.33 73.73 89.94 89.68 93.32 55.33 55.99 69.56 81.67 86.76 86.76 78.89 93.18 94.77 92.85 Sent. completion xstory_cloze es validation EN Max acc. 59.36 55.20 58.77 60.95 63.07 65.25 72.34 83.32 49.97 52.35 55.20 54.93 55.72 74.39 92.52 93.32 93.58 55.86 58.04 77.96 85.90 88.88 88.62 82.93 94.31 95.23 93.91 Sent. completion xstory_cloze es validation MT Median acc. 52.08 50.56 46.72 49.97 49.77 50.69 50.63 80.08 49.44 48.05 53.87 55.00 54.60 74.45 91.20 92.85 92.72 55.53 56.06 55.72 66.25 84.98 86.57 78.49 92.98 94.90 92.85 Sent. completion xstory_cloze es validation MT Max acc. 60.56 54.93 57.84 60.89 62.81 65.78 72.47 82.13 50.63 53.47 54.86 55.39 55.33 77.17 92.52 93.38 93.98 56.45 58.37 72.53 84.51 88.95 88.82 81.60 94.37 95.50 94.44 Sent. completion xstory_cloze eu validation EN Median acc. 51.56 49.31 48.25 50.30 50.36 48.38 50.96 49.11 45.14 49.31 51.82 51.82 50.50 64.13 86.04 84.38 90.87 49.44 46.72 57.05 67.44 70.81 70.02 67.17 85.24 85.04 85.04 Sent. completion xstory_cloze eu validation EN Max acc. 54.86 52.95 54.14 54.00 55.00 56.52 62.61 51.36 50.83 53.01 52.95 52.88 52.55 66.64 89.48 89.68 91.33 50.56 52.22 60.49 70.95 73.33 72.67 70.42 86.90 85.90 86.70 Sent. completion xstory_cloze eu validation MT Median acc. 49.83 50.69 48.64 50.50 49.31 48.44 50.89 50.83 46.46 50.30 51.42 52.55 51.29 65.92 89.54 87.16 91.20 47.39 45.14 49.83 50.17 59.89 65.59 66.38 81.40 82.59 82.73 Sent. completion xstory_cloze eu validation MT Max acc. 55.13 52.22 52.95 54.20 54.60 56.32 62.67 52.15 50.96 53.47 52.28 53.67 52.61 69.03 90.60 91.13 91.66 47.52 52.35 57.91 64.13 70.81 73.26 68.63 86.76 86.83 84.25 Sent. completion xstory_cloze hi validation EN Median acc. 50.96 48.11 47.65 50.17 51.42 49.44 52.42 49.70 46.33 52.22 51.09 52.55 50.89 67.84 89.81 87.76 88.68 53.34 51.82 68.03 75.98 75.31 74.85 68.17 87.03 87.43 87.09 Sent. completion xstory_cloze hi validation EN Max acc. 55.33 52.95 54.53 56.78 56.39 59.70 63.80 50.23 52.02 53.61 51.62 53.87 52.15 70.68 92.32 90.73 89.41 53.74 55.20 72.87 78.89 79.48 78.82 72.20 87.89 88.68 88.35 Sent. completion xstory_cloze hi validation MT Median acc. 49.64 46.99 47.72 50.23 52.35 50.56 51.09 48.78 47.12 52.08 51.89 54.60 50.17 69.89 91.79 88.95 87.82 54.80 50.89 55.46 65.06 73.33 75.84 68.83 86.70 87.36 86.23 Sent. completion xstory_cloze hi validation MT Max acc. 56.59 53.87 54.40 56.78 57.31 60.23 65.39 50.76 53.14 54.86 53.01 55.00 51.16 71.08 92.19 90.14 89.15 55.79 55.92 70.75 75.25 80.61 80.41 71.61 88.42 89.15 88.09 Sent. completion xstory_cloze id validation EN Median acc. 52.42 48.97 45.86 47.85 50.63 52.28 52.02 70.28 46.86 48.58 50.76 53.54 54.14 72.34 90.80 90.54 91.86 55.79 55.79 64.00 71.81 82.40 78.16 74.45 90.67 90.87 91.40 Sent. completion xstory_cloze id validation EN Max acc. 59.36 55.00 57.51 58.77 60.29 63.53 69.03 73.06 49.90 51.03 52.35 54.40 54.67 73.86 93.25 93.05 92.46 57.25 57.97 74.92 82.99 84.25 83.32 77.10 92.12 92.06 92.59 Sent. completion xstory_cloze id validation MT Median acc. 52.15 49.90 48.58 49.70 51.03 52.81 51.89 68.96 47.85 48.05 50.83 54.73 54.14 74.98 92.46 91.73 91.66 52.75 54.67 53.14 58.97 68.96 82.26 73.46 89.28 90.87 90.07 Sent. completion xstory_cloze id validation MT Max acc. 59.63 55.33 57.97 60.29 60.89 63.60 68.76 70.15 49.97 51.49 53.08 57.38 54.27 75.71 93.51 92.19 92.65 57.84 56.98 69.29 78.89 83.32 84.58 75.12 91.46 92.79 91.40 Sent. completion xstory_cloze my validation EN Median acc. 51.42 51.49 47.32 52.02 49.64 52.48 52.68 52.75 46.00 48.91 50.89 51.16 51.16 63.20 82.79 84.78 86.96 46.46 46.00 48.31 46.99 49.70 49.17 50.63 49.97 51.89 50.63 Sent. completion xstory_cloze my validation EN Max acc. 53.21 52.61 48.44 52.95 50.56 52.61 52.95 54.80 50.89 50.56 50.96 51.42 51.69 65.65 87.49 86.70 88.35 47.05 46.39 51.03 49.90 50.43 51.42 51.03 52.35 52.35 52.68 Sent. completion xstory_cloze my validation MT Median acc. 49.83 50.03 46.86 50.50 49.31 51.62 52.28 48.18 45.47 47.39 49.57 52.28 50.03 63.07 83.45 81.07 84.32 46.06 46.06 47.39 47.58 51.42 50.56 49.90 49.83 50.23 51.22 Sent. completion xstory_cloze my validation MT Max acc. 52.48 51.69 47.58 52.61 50.56 52.68 53.41 50.17 50.50 50.83 51.82 52.75 50.23 64.66 85.57 85.90 85.44 46.59 47.05 51.09 49.04 52.55 51.56 51.49 50.56 51.42 51.95 Sent. completion xstory_cloze ru validation EN Median acc. 50.23 49.70 50.63 52.88 50.89 50.10 50.36 65.12 46.19 49.04 48.38 51.82 52.75 69.49 87.49 86.30 83.65 51.29 48.44 54.00 57.78 63.27 62.08 64.06 79.02 77.56 79.42 Sent. completion xstory_cloze ru validation EN Max acc. 60.09 50.69 51.22 54.14 52.08 52.35 56.06 67.50 51.42 56.39 49.24 53.87 53.08 71.14 90.80 90.87 84.51 53.14 50.23 56.39 61.42 65.32 64.26 66.45 81.73 79.09 79.62 Sent. completion xstory_cloze ru validation MT Median acc. 49.50 50.23 51.09 52.02 52.35 50.50 49.83 61.95 47.52 49.24 48.97 50.63 53.01 71.28 89.54 90.01 82.86 51.03 49.11 50.17 50.10 58.31 58.77 61.02 78.42 74.19 75.98 Sent. completion xstory_cloze ru validation MT Max acc. 60.42 50.30 51.42 53.14 53.41 52.15 55.46 63.93 51.82 55.39 49.70 53.01 53.74 74.85 91.40 91.66 84.91 52.22 50.30 55.13 56.65 63.40 62.41 64.13 79.09 78.16 76.57 Sent. completion xstory_cloze sw validation EN Median acc. 52.08 49.90 49.64 49.83 53.08 51.89 49.31 49.31 46.59 49.04 53.61 53.21 53.94 67.11 86.17 87.76 89.15 49.24 49.24 55.59 67.44 67.70 66.31 58.84 77.83 79.42 75.71 Sent. completion xstory_cloze sw validation EN Max acc. 56.32 50.03 50.30 51.62 55.06 53.94 60.42 51.49 49.31 53.21 54.53 53.34 54.73 68.83 88.82 89.61 89.61 51.36 49.24 61.28 69.69 71.67 71.01 60.82 79.81 81.14 77.76 Sent. completion xstory_cloze sw validation MT Median acc. 50.69 50.83 49.83 50.76 51.49 50.30 48.84 51.56 46.46 47.92 52.81 53.14 53.81 69.69 87.36 88.15 89.08 48.64 49.17 49.24 50.23 53.28 58.37 55.79 70.81 73.00 70.28 Sent. completion xstory_cloze sw validation MT Max acc. 55.33 51.62 50.69 51.69 53.01 53.67 60.56 52.28 49.37 53.94 54.20 54.40 55.53 71.14 89.41 89.15 89.34 50.50 49.97 58.44 63.73 68.43 69.69 57.18 78.36 80.01 72.60 Sent. completion xstory_cloze te validation EN Median acc. 51.29 49.90 51.95 50.23 52.68 50.83 51.69 48.44 49.04 49.97 53.21 54.67 56.32 64.86 85.04 86.10 86.37 52.02 51.36 63.40 70.28 72.01 70.09 62.08 80.61 80.15 77.70 Sent. completion xstory_cloze te validation EN Max acc. 57.38 55.33 55.92 55.79 57.58 57.91 61.61 49.17 54.00 53.34 53.34 55.72 57.18 68.70 89.54 90.40 87.29 53.61 55.26 66.25 73.66 74.72 73.06 63.14 81.20 82.40 79.88 Sent. completion xstory_cloze te validation MT Median acc. 49.11 51.03 52.68 49.44 52.61 50.89 50.76 49.50 49.77 48.97 53.21 56.25 55.13 67.50 87.16 86.90 82.20 51.16 50.36 55.33 54.73 63.73 66.98 58.31 76.17 78.29 73.79 Sent. completion xstory_cloze te validation MT Max acc. 57.05 55.79 56.12 56.32 58.11 57.64 62.41 49.83 53.61 54.00 53.67 56.92 56.85 68.89 90.54 89.41 85.51 54.86 55.86 61.88 66.38 73.59 71.54 59.30 80.28 80.54 77.37 Sent. completion xstory_cloze zh validation EN Median acc. 52.08 49.83 47.39 47.85 49.17 50.30 51.36 49.70 47.65 53.21 56.25 54.00 56.32 68.17 91.20 91.79 92.92 54.53 57.18 76.84 82.59 84.91 83.85 76.70 91.99 92.32 91.13 Sent. completion xstory_cloze zh validation EN Max acc. 55.59 54.47 56.45 58.04 59.89 61.22 65.65 50.89 49.04 53.94 56.78 54.60 58.84 71.74 92.72 93.05 93.18 56.52 58.17 78.69 84.32 85.04 85.84 79.62 93.12 92.85 92.26 Sent. completion xstory_cloze zh validation HT Median acc. - - - - - - - - - - - - - - - - - - - - - 81.67 - 77.10 - - - Sent. completion xstory_cloze zh validation HT Max acc. - - - - - - - - - - - - - - - - - - - - - 85.37 - 79.48 - - - Sent. completion xstory_cloze zh validation MT Median acc. 52.15 49.24 47.45 47.65 50.23 51.89 53.01 48.05 46.99 52.02 55.00 54.27 57.71 72.01 92.59 91.79 91.79 55.59 56.45 70.88 74.26 81.20 84.65 78.42 91.86 91.40 90.40 Sent. completion xstory_cloze zh validation MT Max acc. 56.12 54.33 56.59 57.38 60.09 61.22 66.64 50.03 48.97 54.20 57.78 55.72 59.50 72.93 93.85 93.05 93.58 56.45 56.85 77.17 80.87 85.11 85.90 80.34 92.39 92.52 91.93
Table 7: Evaluation results. Results per prompt can be found at https://huggingface.co/datasets/bigscience/evaluation-results

Appendix L Prompts used

This section describes the prompts used for training and evaluation.

See pages 2- of prompt-appendix.pdf