Crosslingual Generalization through Multitask Finetuning
Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are publicly available at https://github.com/bigscience-workshop/xmtf.
Large language models pretrained on vast amounts of text show some capability of solving tasks expressed in natural language, even without explicit training on these tasks Brown et al. (2020). Finetuning on groups of language tasks has been shown to significantly boost this zero-shot task generalization of language models Wei et al. (2021); Sanh et al. (2022); Min et al. (2021). For example, Sanh et al. (2022) finetune on tasks like summarization and question answering leading to better performance on unseen tasks like natural language inference. Previous work has focused on multitask finetuning in the context of large English language models and tasks.
Multilingual large language models show the same zero-shot learning capabilities for both monolingual and crosslingual tasks Goyal et al. (2021a); Lin et al. (2021); Patel et al. (2022); Soltan et al. (2022). However, zero-shot performance tends to be significantly lower than finetuned performance. Thus, task-specific or language-specific transfer learning via finetuning remains the predominant practice Devlin et al. (2018); Conneau et al. (2019). This is particularly challenging for low-resource languages or tasks with limited data available, such as writing a fable that teaches a specified moral. In the spirit of multitask finetuning, it would be desirable to improve the zero-shot task generalization of multilingual models to make them usable on tasks from low-resource languages without requiring further finetuning.
To address this goal, we focus on crosslingual multitask finetuning. Due to the difficulty of collecting supervised task data in low-resource languages, previous work typically aims to transfer capabilities learned from finetuning on English data, which can improve performance on non-English language tasks Wu and Dredze (2019); Chalkidis et al. (2021); Vu et al. (2022). We investigate whether English-only multitask finetuning also improves performance on non-English held-out tasks using the multilingual BLOOM BigScience Workshop (2022) and mT5 Xue et al. (2020) models. We find that after finetuning on the English-only multitask mixture used for T0 Sanh et al. (2022) (P3), performance on a diverse set of non-English held-out tasks increases.
To investigate whether multilingual task data can further improve performance, we extend P3 to xP3 by adding datasets from 46 different languages that cover tasks previously not present in P3 (such as translation and program synthesis). Finetuning on xP3 leads to even better zero-shot task generalization on both English and non-English tasks compared to the P3-trained baseline. Models finetuned on xP3 perform best on English prompts, even for non-English samples. Hypothesizing that better performance could be attained by training on non-English prompts, we construct a variant of xP3 with machine-translated prompts called xP3mt. We find that finetuning on machine-translated prompts is enough to significantly increase performance on held-out tasks with non-English human-written prompts. However, reducing the number of English prompts during finetuning also worsens English-prompt performance on multilingual tasks.
Notably, we also find that models finetuned on xP3 generalize to held-out tasks in languages never intentionally seen during pretraining nor finetuning. We conduct a contamination analysis and find that only small amounts of these languages were included in the pretraining corpus. Thus, we hypothesize the models learn some language- and task-agnostic capabilities.
We publicly release all our datasets and models (URLs in Appendix §D).
Multitask finetuning Sanh et al. (2022) (or instruction tuning Wei et al. (2021)) has emerged as a recipe for improving the zero-shot task generalization of large language models. Typically, these works define a task as a collection of datasets that require a certain set of skills. To inform large language models which task to perform given an input, a prompt is used to add natural language instructions to dataset instances Schick and Schütze (2020); Scao and Rush (2021). In this line of work, zero-shot task generalization refers to the ability to perform a held-out task based on prompted instructions alone. Our work builds on T0 (Sanh et al., 2022), a variant of T5 (Raffel et al., 2020) that underwent MTF and was subsequently shown to have strong zero-shot task generalization capabilities.
Increasing the number and diversity of finetuning tasks and datasets has been shown to increase model performance Min et al. (2021); Fries et al. (2022); Wang et al. (2022c); Scialom et al. (2022); Chung et al. (2022); Mishra et al. (2021b). PromptSource Bach et al. (2022) is a software application that provides a framework for developing and applying prompts. PromptSource was used to construct P3, the training dataset of T0. While most prior work has focused on using English prompts on English datasets, Wang et al. (2022b) trained both English and multilingual models on prompted datasets. Their multilingual model, called mTk-instruct, attains strong crosslingual performance. In contrast with Wang et al. (2022b), our sole focus is crosslingual zero-shot generalization. Therefore, we consider a wider variety of prompting settings and perform a more detailed evaluation of multilingual capabilities. Separately, Radford et al. (2019) find that accidental inclusion of non-English text gave the GPT-2 model a limited ability to process and generate non-English text. We similarly discover that our finetuned models can process text in languages not intentionally trained on.
Many language models are pretrained on English data only. Multilingual pretrained language models Lample and Conneau (2019); Conneau et al. (2019); Fan et al. (2021) aim to enable processing a wide variety of non-English languages. Unlike monolingual models, multilingual models can also be used for crosslingual tasks, such as translation. For language generation, recent efforts have focused on two different model architectures based on the Transformer Vaswani et al. (2017). On the one hand, encoder-decoder transformers trained with a denoising objective such as mBART Liu et al. (2020) and mT5 Xue et al. (2020) learn to predict tokens masked out in the input sequence. Predicting masked tokens is only a pretraining task and these models are generally finetuned on downstream datasets before being used. On the other hand, decoder-only models pretrained on next token prediction such as mGPT Shliazhko et al. (2022), XGLM Lin et al. (2021) and BLOOM BigScience Workshop (2022) can be used to solve tasks expressed in natural language directly in a zero-shot or few-shot setting Brown et al. (2020). XGLM demonstrated competitive few-shot performance even when the model was prompted in a language different than the sample being processed. In particular, using English prompts for multilingual datasets provides better performance with XGLM than human-translating the English prompt to the dataset language.
In this work, we use the BLOOM models BigScience Workshop (2022); Scao et al. (2022), which were pretrained on the ROOTS corpus Laurençon et al. (2022) in 46 natural languages and 13 programming languages. We also finetune mT5 Xue et al. (2020) to compare encoder-decoder and decoder-only performance. mT5 is pretrained on a corpus sampled from mC4 covering 101 languages.
To study crosslingual multitask prompted finetuning, we create xP3 by extending the P3 dataset collection with additional non-English tasks. We finetune both BLOOM and mT5 models on xP3. We refer to Appendix §D for public links to released models and datasets.
We build on the P3 Sanh et al. (2022) task taxonomy and add 28 new multilingual datasets, illustrated in Figure 1. We define four task clusters previously not present in P3: translation, simplification, program synthesis, and miscellaneous code datasets. As 11% of BLOOM’s pretraining data is code, we add code datasets classified as program synthesis (text-to-code) or miscellaneous. The latter includes tasks such as estimating the computational complexity of a provided code snippet and generating a name for a given function. We extend the XWinograd dataset Tikhonov and Ryabinin (2021) with Winograd schemas from CLUE Xu et al. (2020) to increase its Chinese samples from 16 to 504. Similar to P3, a fraction of our prompts invert the task at hand. For example, a prompt may invert a closed-book QA sample by asking the model to generate a question given an answer.
With xP3 we aim to replicate the language distribution of the ROOTS corpus Laurençon et al. (2022) used to pretrain BLOOM. Thus, xP3 consists of the same 46 natural languages and code as ROOTS. The language distributions of ROOTS, xP3 and the mT5 corpus Xue et al. (2020) are visualized in Figure 2. 39% of xP3 data is English, slightly more than the 30% of English data in ROOTS. Various African languages such as Twi (tw) and Bambara (bm) form the tail of xP3’s language distribution. Many of them are not included in the mT5 pretraining corpus. In xP3, Twi and others are represented solely as a translation task using data from Flores-200 NLLB Team et al. (2022).
To study the importance of non-English prompts, we construct a machine-translated variant of xP3, xP3mt. We translate the prompts of monolingual datasets into the respective dataset language. For example, for the Chinese dataset C3 Sun et al. (2020), prompts in xP3mt are in Chinese instead of English. For crosslingual datasets, prompts remain in English (such as Wiki-Lingua, which involves producing a summary in one language based on text in another language). We use the Google Cloud API for machine translation (https://cloud.google.com/translate). Figure 3 compares the dataset variants we train on.
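For illustration, the snippet below sketches how English prompt templates could be machine-translated with the Google Cloud Translation API via the google-cloud-translate Python client. The prompt strings, language codes, and placeholder handling are illustrative assumptions, not the exact pipeline used to build xP3mt.

```python
# Illustrative sketch of machine-translating English prompt templates into the
# dataset language (the exact xP3mt pipeline may differ). Requires the
# `google-cloud-translate` package and valid Google Cloud credentials.
from google.cloud import translate_v2 as translate

client = translate.Client()

# Hypothetical examples: English prompt templates paired with the language code
# of the monolingual dataset they belong to. Note that template placeholders
# such as {passage} may need to be protected from translation in practice.
english_prompts = {
    "zh": "Read the passage and answer the question. {passage} {question}",
    "es": "Summarize the following article: {article}",
}

translated_prompts = {}
for lang, prompt in english_prompts.items():
    result = client.translate(prompt, source_language="en", target_language=lang)
    translated_prompts[lang] = result["translatedText"]
    print(lang, "->", translated_prompts[lang])
```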
We use publicly available pretrained BLOOM models ranging from 560 million to 176 billion parameters. BLOOM models are large decoder-only language models pretrained for around 350 billion tokens with an architecture similar to GPT-3 Brown et al. (2020). We finetune the models for an additional 13 billion tokens with loss only being computed on target tokens. For example, given the input “Translate to English: Je t’aime.” and a space-separated target “I love you.”, the model is trained to predict only the targets. As targets vary in length from just one to hundreds of tokens, we downscale the loss of each token by the length of the target it belongs to. This ensures short targets (e.g. for multiple-choice QA) get the same weight as long targets (e.g. for translation). We skip samples longer than 2048 tokens and use packing to train efficiently on multiple samples at a time Kosec et al. (2021). We select the final checkpoint based on validation performance.
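As an illustration of the target-only, length-normalized loss described above, the following is a minimal PyTorch sketch. The tensor layout, the -100 masking convention, and the helper names are assumptions, not the training code actually used.

```python
import torch
import torch.nn.functional as F

def multitask_loss(logits, labels, target_ids):
    """Length-normalized loss over target tokens only (illustrative sketch).

    Assumes `labels` is already aligned with `logits` (i.e. shifted for causal
    models), with -100 at input (non-target) positions.

    logits:     (batch, seq_len, vocab) model outputs
    labels:     (batch, seq_len) token ids, -100 at non-target positions
    target_ids: (batch, seq_len) integer id of the packed target each token
                belongs to, so that per-target lengths can be computed
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )  # (batch, seq_len); 0 at ignored positions
    target_mask = labels != -100

    # Length of the target that each token position belongs to.
    lengths = torch.zeros_like(per_token)
    for b in range(labels.shape[0]):
        for t in target_ids[b].unique():
            mask = (target_ids[b] == t) & target_mask[b]
            lengths[b, mask] = mask.sum().float()

    # Downweight each token by its target length so that short targets (e.g.
    # multiple-choice answers) and long targets (e.g. translations) contribute
    # equally, then average over the number of targets in the batch.
    weighted = per_token * target_mask.float() / lengths.clamp(min=1.0)
    num_targets = sum(len(t[m].unique()) for t, m in zip(target_ids, target_mask))
    return weighted.sum() / max(num_targets, 1)
```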
For mT5 models, we finetune using the T5X Roberts et al. (2022) framework on TPUs. mT5 uses the same encoder-decoder architecture, pretraining objective (masked language modeling), and pretraining length (1 trillion tokens) as T5 Raffel et al. (2020). For finetuning mT5, we follow the same procedure as described above for BLOOM, except that inputs are fed into the encoder and thus are not space-separated from targets.
We produce three core model variants available in different sizes:
BLOOMZ-P3 / mT0-P3: Models finetuned on the English-only P3.
BLOOMZ / mT0: Models finetuned on xP3, which consists of multilingual datasets with English prompts.
BLOOMZ-MT / mT0-MT: Models finetuned on xP3mt, which consists of multilingual datasets with English and machine-translated prompts.
We evaluate on three held-out tasks: coreference resolution, sentence completion and natural language inference (NLI), as depicted in Figure 1. We also evaluate on HumanEval due to its popularity for code evaluations Chen et al. (2021). For datasets that involve choosing the correct completion from several options, we follow prior work Sanh et al. (2022); Brown et al. (2020) and use rank classification: we compute the log-likelihood of each possible completion and select the highest-scoring option. For each evaluation dataset, we select 5 prompts at random from PromptSource and use them for all language splits of the dataset. We report the median of the 5 prompts for results per language split. Thus, in contrast to XGLM Lin et al. (2021), we do not tune prompts based on performance on validation data. A selection of prompts can be found in Appendix §L. For generation evaluations we use lm-evaluation-harness Gao et al. (2021).
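The following is a minimal sketch of rank classification with a Hugging Face causal language model. The checkpoint, prompt, and candidate options are placeholders, and the option-length bookkeeping is a simplification of what an evaluation harness would do.

```python
# Illustrative rank classification: score each candidate completion by the sum
# of its token log-likelihoods and pick the highest-scoring option.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")
model.eval()

def rank_classify(prompt, options):
    scores = []
    for option in options:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits  # (1, seq_len, vocab)
        # Log-probabilities of each token given everything before it.
        logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
        target_ids = full_ids[:, 1:]
        token_logprobs = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
        # Approximate the option length by the difference in token counts,
        # which is sufficient for a sketch (tokenization boundaries may shift).
        option_len = full_ids.shape[1] - prompt_ids.shape[1]
        scores.append(token_logprobs[0, -option_len:].sum().item())
    return options[int(torch.tensor(scores).argmax())]

print(rank_classify(
    "The movie was fantastic. Is this review positive or negative?",
    ["positive", "negative"],
))
```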
We first examine generalization to new tasks in languages included in finetuning in §4.1. Then, in §4.2, we look at language generalization: Can models generalize to tasks in languages that (a) they have only seen during pretraining and (b) they have never seen intentionally? In §4.3, we investigate performance on multilingual prompts and finetuning on xP3mt. Scaling laws are analyzed in §4.4. Finally, §4.5 looks at performance on generative tasks and §4.6 at the effect of language proportions on performance.
Previous work has shown that large language models finetuned on prompted multitask mixtures generalize to unseen tasks Zhong et al. (2021); Wei et al. (2021); Mishra et al. (2021b, a); Wang et al. (2022b). In Figure 4, we show that the same applies to multilingual models: Finetuned BLOOMZ and BLOOMZ-P3 models significantly improve over BLOOM and XGLM on held-out tasks. Despite an order of magnitude fewer parameters, mT0 (13 billion parameters) is ahead of BLOOMZ (176 billion parameters). We attribute this to the encoder-decoder architecture paired with a masked language modeling pretraining objective Wang et al. (2022a); Tay et al. (2022a) as well as the longer pretraining of mT5 Hoffmann et al. (2022); Su et al. (2022) (1 trillion tokens for mT5 vs. 366 billion for BLOOM). Despite also having gone through crosslingual multitask finetuning, mTk performs significantly worse than the same-sized mT0. We attribute this to our prompting style, which aims to replicate natural human communication. mTk is finetuned on more structured prompts with specific “Definition”, “Input” and “Output” fields. Similarly, Wang et al. (2022b) find that T0 performs worse than Tk on their prompts. We also find that models finetuned on xP3 (BLOOMZ, mT0-13B), which is only 39% English, outperform models finetuned on the 100% English P3 (BLOOMZ-P3, mT0-13B-P3) (see Appendix §B). Even the fully English T0-11B model Sanh et al. (2022) is outperformed by our mT0-13B model; ignoring embedding parameters, these models are about the same size. This is likely because xP3 adds additional tasks and prompts, which has been shown to help generalization Chung et al. (2022).
Here we add another layer of generalization: languages. Figure 4 already shows that finetuning on English data only (P3) leads to better performance on non-English data: For example, BLOOMZ-P3 improves by over 50% on multilingual sentence completion compared to BLOOM. Thus, zero-shot task performance in languages only seen during pretraining improves after finetuning on English. This has major practical benefits as it can be more difficult to collect data for low-resource languages.
Next, we investigate performance on languages the model has never intentionally seen. Due to the scale of large language model pretraining, it is difficult to label tasks or languages as strictly unseen. It is likely that the training data unintentionally includes small fractions of these languages (just as many tasks might appear “implicitly” in the pretraining corpus Sanh et al. (2022)). In Figure 5 we show that after multitask finetuning on xP3, the models can perform unseen tasks in languages that were not intentionally trained on. After probing the pretraining corpus in Appendix §C, we do find small amounts of these languages that were not intentionally included in the ROOTS corpus Laurençon et al. (2022). However, for XNLI, performance increases across all languages, many of which only show up in tiny fractions in our language contamination analysis, such as Thai with 0.006%. If we extrapolate this proportion to the entire ROOTS corpus, the BLOOM models would have seen a mere 20 million tokens of Thai during pretraining. One possibility is that better-than-random XNLI performance can be attained with little or no language understanding. In Appendix §G, we investigate Levenshtein distances of XNLI samples and find that there are meaningful differences across labels. Thus, sole inspection of characters without language understanding may be enough for better-than-random performance.
Table 1: Average accuracy with English (EN), machine-translated (MT) and human-translated (HT) prompts.

| Task | Prompt | BLOOMZ | BLOOMZ-MT | mT0-13B | mT0-13B-MT |
|---|---|---|---|---|---|
| XNLI | EN | 53.58 | 49.74 | 48.43 | 51.52 |
| XNLI | MT | 37.87 | 42.03 | 39.83 | 42.64 |
| XNLI | HT | 41.13 | 44.55 | 45.19 | 47.03 |
| XCOPA | EN | 75.5 | 75.75 | 84.45 | 81.6 |
| XCOPA | MT | 71.95 | 74.25 | 82.9 | 81.1 |
| XStoryCloze | EN | 84.42 | 84.07 | 82.52 | 82.58 |
| XStoryCloze | MT | 84.37 | 85.31 | 84.01 | 83.31 |
| XWinograd | EN | 60.07 | 59.15 | 70.49 | 73.24 |
| XWinograd | MT | 58.48 | 60.14 | 66.89 | 72.33 |
Since all prompts in xP3 are in English (even for multilingual datasets), we created xP3mt, an extension with machine-translated prompts. To investigate performance on non-English prompts, we additionally human- and machine-translated the English prompts used for evaluation. In Table 1, we report performance when prompting in non-English languages. BLOOMZ performs much better on English than on non-English prompts. BLOOMZ-MT, which is finetuned on xP3mt, significantly improves on multilingual prompts. On XNLI, BLOOMZ-MT raises the average performance on human-translated prompts from 41.13 to 44.55. This comes at the cost of a reduction in its performance on English prompts, from 53.58 to 49.74. For mT0, the MT version provides similar performance gains on XNLI and XWinograd non-English prompts, while results on XCOPA and XStoryCloze are mixed. Similar to Lin et al. (2021), we also find that models perform better on human-translated prompts than machine-translated ones for XNLI.
In Figure 4, the average performance of BLOOM is near the random baselines of 0.50 for Sentence Completion and Coreference Resolution and 0.33 for NLI. We think this is due to all of our experiments being zero-shot and using untuned prompts Perez et al. (2021a). We find in Figure 6 that even at 560M parameters, multitask finetuning improves zero-shot generalization. The gap between pretrained and multitask finetuned models grows significantly as parameters increase. Scaling up parameters benefits all languages evaluated.
Table 2: Performance on HumanEval Chen et al. (2021).

| Model | Pass@1 | Pass@10 | Pass@100 |
|---|---|---|---|
| GPT-Neo 1.3B | 4.79% | 7.47% | 16.30% |
| GPT-Neo 2.7B | 6.41% | 11.27% | 21.37% |
| GPT-J 6B | 11.62% | 15.74% | 27.74% |
| GPT-NeoX 20B | 15.4% | 25.6% | 41.2% |
| Codex-300M | 13.17% | 20.37% | 36.27% |
| Codex-679M | 16.22% | 25.7% | 40.95% |
| Codex-2.5B | 21.36% | 35.42% | 59.5% |
| Codex-12B | 28.81% | 46.81% | 72.31% |
| BLOOM-560M | 0.82% | 3.02% | 5.91% |
| BLOOM-1.1B | 2.48% | 5.93% | 9.62% |
| BLOOM-1.7B | 4.03% | 7.45% | 12.75% |
| BLOOM-3B | 6.48% | 11.35% | 20.43% |
| BLOOM-7.1B | 7.73% | 17.38% | 29.47% |
| BLOOM | 15.52% | 32.20% | 55.45% |
| BLOOMZ-560M | 2.18% | 4.11% | 9.00% |
| BLOOMZ-1.1B | 2.63% | 6.22% | 11.68% |
| BLOOMZ-1.7B | 4.38% | 8.73% | 16.09% |
| BLOOMZ-3B | 6.29% | 11.94% | 19.06% |
| BLOOMZ-7.1B | 8.06% | 15.03% | 27.49% |
| BLOOMZ | 12.06% | 26.53% | 48.44% |
| BLOOMZ-P3 | 6.13% | 11.79% | 18.73% |
In this section, we investigate the impact of multitask finetuning on generative tasks. In Figure 7, we plot validation performance throughout the training process. We find that while performance on natural language understanding tasks continues to increase, generative performance jumps initially and then decreases. Relatedly, in Table 2, we find that multitask finetuning does not improve performance on HumanEval Chen et al. (2021). Only for small models, such as BLOOM-560M vs. BLOOMZ-560M, are there meaningful performance gains. When no code data is included in finetuning (BLOOMZ-P3), performance decreases significantly. mT0 models, which have not been pretrained on code, fail to solve any HumanEval problems (see full results in Appendix §K). Given a Python docstring, HumanEval requires models to complete a function. Inspecting generations reveals that the multitask finetuned models are biased towards short generations. In Appendix §E, we show example solutions from HumanEval and compute average length statistics. BLOOMZ tries to solve problems with 70% fewer characters than BLOOM. One possible reason for this is that a majority of samples seen during multitask finetuning are only single sentences, so finetuned models learn to produce short answers. This could be causing the decreasing performance on generative tasks, which require longer answers than natural language understanding tasks. To force longer generations at inference time, we find it beneficial to enforce a minimum number of tokens during which the end-of-sequence token is ignored. We provide qualitative examples of forcing a minimum number of tokens in Appendix §F.
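As a sketch of this inference-time trick, the snippet below suppresses the end-of-sequence token until a minimum number of new tokens has been generated, using the min_new_tokens argument available in recent versions of transformers (older versions only expose min_length, which also counts the prompt for decoder-only models). The checkpoint, prompt, and token budgets are illustrative.

```python
# Sketch: force longer generations by ignoring the end-of-sequence token until
# a minimum number of new tokens has been produced.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloomz-560m")

prompt = "Write a fable that teaches the moral 'slow and steady wins the race'."
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    min_new_tokens=64,   # EOS is suppressed until at least 64 new tokens exist
    do_sample=False,
)
# Strip the prompt, which decoder-only generate() returns as a prefix.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```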
In Figure 8, we find that finetuned BLOOM models perform better on languages seen extensively during pretraining. As the language distribution in the finetuning dataset, xP3, closely follows that of pretraining, these languages are also seen most frequently during finetuning. Specifically, XCOPA and XNLI show significantly better performance on high-resource languages, such as English, Spanish or French, which each make up more than 10% of pretraining. The trend is less consistent for XWinograd. This may be because the XWinograd language subsets are not translations of each other and differ significantly in their number of samples. Thus, some language subsets of XWinograd may be inherently more difficult than others.
In this work we investigated crosslingual multitask finetuning. We developed xP3, a corpus consisting of tasks in 46 languages. Further, we have extended xP3 to xP3mt with machine-translated prompts. We have finetuned pretrained BLOOM and mT5 models on the newly created corpora as well as the English-only P3 corpus to produce BLOOMZ and mT0 models.
We found that English-only finetuning suffices for a multilingual pretrained large language model to generalize to tasks in other pretrained languages. However, finetuning on multiple languages using xP3 provided even better performance. We further observed that finetuned models are capable of generalizing to new tasks in languages they have never intentionally seen. We investigated multilingual prompting and found performance on non-English prompts to be poor after finetuning on English prompts only. However, finetuning on a corpus with machine-translated prompts (xP3mt) led to significantly better performance on human-written non-English prompts. Comparing models from 560 million up to 176 billion parameters revealed that the performance gap between only pretraining and finetuning widens as parameters increase. Lastly, we found that multitask finetuning on billions of short targets biases models to produce short answers, which can hurt performance on generative tasks.
To contribute to future progress on improving zero-shot generalization, we release all datasets and models introduced in this work.
This research was conducted under the BigScience project for open research, a year-long initiative targeting the study of large models and datasets. The goal of the project is to research language models in a public environment. The project has hundreds of researchers from more than 50 countries and over 250 institutions. The BigScience project was initiated by Thomas Wolf at Hugging Face, and this collaboration would not have been possible without his effort. In the following, we list contributions made to this work.
Niklas Muennighoff evaluated all models, created xP3 and wrote most of the paper.
Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts and Hailey Schoelkopf wrote the training and evaluation code.
Niklas Muennighoff and Adam Roberts trained the models.
Niklas Muennighoff, Teven Le Scao, Hailey Schoelkopf, Zheng-Xin Yong, Thomas Wang, Khalid Almubarak, Alham Fikri Aji, M Saiful Bari and Zaid Alyafeai contributed prompts or datasets.
Lintang Sutawika, Stella Biderman, Zheng-Xin Yong, Khalid Almubarak, M Saiful Bari and Albert Webson initiated the project.
Sheng Shen conducted the contamination analysis.
Samuel Albanie wrote the prompt appendix.
Thomas Wang and Zheng-Xin Yong converted checkpoints.
Colin Raffel, Thomas Wang, Teven Le Scao, M Saiful Bari, Edward Raff and Dragomir Radev advised the project.
Niklas Muennighoff, Lintang Sutawika, Teven Le Scao, Colin Raffel, Stella Biderman, Alham Fikri Aji, Adam Roberts, Samuel Albanie, Sheng Shen, M Saiful Bari, Albert Webson, Xiangru Tang, Dragomir Radev and Edward Raff contributed to the paper.
In Figure 9, we compare performance on English held-out tasks. We find that (a) finetuning on xP3 outperforms finetuning on P3, and (b) the multilingual mT0 is stronger than the monolingual T0 on English. We conjecture that both improvements come from xP3 having more prompts and datasets than P3 Chung et al. (2022).
In Figure 10, we visualize task generalization to multilingual datasets. The same data is aggregated in Figure 4. Performance varies substantially across prompts, highlighting that prompt engineering may still be necessary after MTF. We also find that mT0 consistently outperforms BLOOMZ on Swahili (SW), possibly because Swahili makes up a larger part of mT0's pretraining corpus (see Figure 2 and §4.6).
While the BLOOM ROOTS corpus Laurençon et al. (2022) was collected from 46 natural languages and 13 programming languages, we find that sentences from the same document do not always belong to the collected (meta) language. Some sentences use languages, such as Russian or Japanese, that were not among the intentionally collected languages. This “language contamination” may stem from “code-mixing” or from different languages being used in code comments. To investigate the extent of contamination, we randomly sample 1% of the documents from ROOTS for a total of 51M documents. For each document, we use cld3 (https://github.com/google/cld3) Xue et al. (2020) to identify the languages used in each sentence and compare them with the meta language of the document. We summarize our results in Figure 11. It shows that ROOTS contains unintentionally collected languages, such as Burmese (my: 0.00003%), Thai (th: 0.006%), Turkish (tr: 0.03%), Greek (el: 0.03%), Russian (ru: 0.03%), Bulgarian (bg: 0.05%), Estonian (et: 0.06%), Haitian (ht: 0.12%), German (de: 0.21%), Italian (it: 0.28%) and Japanese (ja: 0.54%). These “unseen” languages only make up small sentence proportions in our subsample compared to English (en: 46.23%), French (fr: 15.73%) and Spanish (es: 13.38%). Yet, they may help the language generalization of BLOOMZ models described in §4.2. Japanese is mostly mixed into meta-English documents (47%), meta-Code documents (8%) and meta-Chinese documents (5%). Meanwhile, Russian is mostly mixed into meta-English documents (52%), meta-Code documents (19%) and meta-French documents (11%).
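The sketch below illustrates the kind of sentence-level check described above, assuming the pycld3 binding of cld3 and a toy document format; the actual ROOTS sampling and sentence segmentation pipeline may differ.

```python
# Sketch of a sentence-level language contamination check (assuming the pycld3
# binding of cld3 and an illustrative document format).
from collections import Counter

import cld3  # pip install pycld3

def sentence_language_counts(documents):
    """Count detected sentence-level languages that differ from the document's
    meta language. `documents` is a list of (meta_language, text) pairs."""
    mismatches = Counter()
    for meta_lang, text in documents:
        # Naive sentence split; real documents would need proper segmentation.
        for sentence in text.split("."):
            sentence = sentence.strip()
            if not sentence:
                continue
            pred = cld3.get_language(sentence)
            if pred is not None and pred.is_reliable and pred.language != meta_lang:
                mismatches[(meta_lang, pred.language)] += 1
    return mismatches

# Hypothetical example: Japanese mixed into a document labeled as English.
docs = [("en", "This is an English document. これは日本語の文です. More English text.")]
print(sentence_language_counts(docs))
```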
Table 3 lists all artifacts used or released in this work.
Artifact | Explanation | Public link |
---|---|---|
ROOTS | Multilingual pretraining corpus of BLOOM | https://huggingface.co/bigscience-data |
mC4 | Multilingual pretraining corpus used for mT5 | https://huggingface.co/datasets/mc4 |
P3 | Multitask finetuning dataset with English data & English prompts | https://huggingface.co/datasets/bigscience/P3 |
xP3 | Multitask finetuning dataset with multilingual data & English prompts | https://huggingface.co/datasets/bigscience/xP3 |
xP3all | Same as xP3 with held-out evaluation sets | https://huggingface.co/datasets/bigscience/xP3all |
xP3mt | Same as xP3 with English & multilingual machine-translated prompts | https://huggingface.co/datasets/bigscience/xP3mt |
xP3megds | Processed version of xP3 for easy usage with Megatron-DeepSpeed | https://huggingface.co/datasets/bigscience/xP3megds |
XGLM-7.5B | 7.5B parameter pretrained multilingual transformer | https://huggingface.co/facebook/xglm-7.5B |
T0-11B | 11B parameter model finetuned on P3 | https://huggingface.co/bigscience/t0 |
mTk-3.7B | 3.7B parameter multitask finetuned multilingual transformer | https://huggingface.co/allenai/mtk-instruct-3b-def-pos |
mTk-13B | 13B parameter multitask finetuned multilingual transformer | https://huggingface.co/allenai/mtk-instruct-11b-def-pos |
BLOOM-560M | 560M parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom-560m |
BLOOM-1.1B | 1.1B parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom-1b1 |
BLOOM-1.7B | 1.7B parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom-1b7 |
BLOOM-3B | 3B parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom-3b |
BLOOM-7.1B | 7.1B parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom-7b1 |
BLOOM | 176B parameter model pretrained on ROOTS | https://huggingface.co/bigscience/bloom |
BLOOMZ-560M | 560M parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz-560m |
BLOOMZ-1.1B | 1.1B parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz-1b1 |
BLOOMZ-1.7B | 1.7B parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz-1b7 |
BLOOMZ-3B | 3B parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz-3b |
BLOOMZ-7.1B | 7.1B parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz-7b1 |
BLOOMZ-7.1B-MT | 7.1B parameter model finetuned on xP3mt | https://huggingface.co/bigscience/bloomz-7b1-mt |
BLOOMZ-7.1B-P3 | 7.1B parameter model finetuned on P3 | https://huggingface.co/bigscience/bloomz-7b1-p3 |
BLOOMZ | 176B parameter model finetuned on xP3 | https://huggingface.co/bigscience/bloomz |
BLOOMZ-MT | 176B parameter model finetuned on xP3mt | https://huggingface.co/bigscience/bloomz-mt |
BLOOMZ-P3 | 176B parameter model finetuned on P3 | https://huggingface.co/bigscience/bloomz-p3 |
mT5-300M | 300M parameter model pretrained on a sampled version of mC4 | https://huggingface.co/google/mt5-small |
mT5-580M | 580M parameter model pretrained on a sampled version of mC4 | https://huggingface.co/google/mt5-base |
mT5-1.2B | 1.2B parameter model pretrained on a sampled version of mC4 | https://huggingface.co/google/mt5-large |
mT5-3.7B | 3.7B parameter model pretrained on a sampled version of mC4 | https://huggingface.co/google/mt5-xl |
mT5-13B | 13B parameter model pretrained on a sampled version of mC4 | https://huggingface.co/google/mt5-xxl |
mT0-300M | 300M parameter model finetuned on xP3 | https://huggingface.co/bigscience/mt0-small |
mT0-580M | 580M parameter model finetuned on xP3 | https://huggingface.co/bigscience/mt0-base |
mT0-1.2B | 1.2B parameter model finetuned on xP3 | https://huggingface.co/bigscience/mt0-large |
mT0-3.7B | 3.7B parameter model finetuned on xP3 | https://huggingface.co/bigscience/mt0-xl |
mT0-13B | 13B parameter model finetuned on xP3 | https://huggingface.co/bigscience/mt0-xxl |
mT0-13B-MT | 13B parameter model finetuned on xP3mt | https://huggingface.co/bigscience/mt0-xxl-mt |
mT0-13B-P3 | 13B parameter model finetuned on P3 | https://huggingface.co/bigscience/mt0-xxl-p3 |
Table 4 provides statistics on code generations and code data. We find that BLOOM generates on average 70% more characters and 17x more comments than BLOOMZ for a given problem from HumanEval. Figure 12 compares an example solution from BLOOM and BLOOMZ. While both solutions are correct, BLOOMZ is biased towards short and concise answers.
Table 4: Average length statistics of HumanEval generations and of the targets of code datasets in xP3.

| | BLOOM (HumanEval generations) | BLOOMZ (HumanEval generations) | xP3 code dataset targets |
|---|---|---|---|
| Average characters | 247 | 144 | 531 |
| Average Python comments (#) | 0.69 | 0.04 | 0.85 |
Greedy generations for sentiment analysis, a task the models were trained on. BLOOMZ and mT0-13B have not been trained on non-English prompts but are still able to handle them; BLOOMZ, however, answers in English. The review is a five-star review of Star Wars Episode IV.
Specifying a minimum token length as a generation hyperparameter is an effective way to force long generations. The output of BLOOM is shortened (marked in the figure).
To investigate whether XNLI can be solved without any language understanding, we compute Levenshtein distances Levenshtein et al. (1966) between premise and hypothesis and average them by XNLI label. In Table 5, we find that the distances are smallest between entailment pairs and largest between neutral pairs. This is intuitive, as entailment pairs generally need to cover similar content. Contradiction pairs still need to cover similar content but differ in at least one major way. Meanwhile, for neutral pairs, the hypothesis and premise may be about completely different topics. This highlights that XNLI can be solved to some degree by solely comparing the character-level similarity of premise and hypothesis.
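A minimal sketch of this analysis is shown below: a standard dynamic-programming Levenshtein distance averaged per gold label. Loading XNLI through the Hugging Face datasets library is an assumption for the usage example; the exact preprocessing used for Table 5 is not shown.

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def average_distance_per_label(examples):
    """`examples` is an iterable of dicts with premise, hypothesis, label
    (0: entailment, 1: neutral, 2: contradiction, as in the XNLI dataset)."""
    sums, counts = defaultdict(int), defaultdict(int)
    for ex in examples:
        sums[ex["label"]] += levenshtein(ex["premise"], ex["hypothesis"])
        counts[ex["label"]] += 1
    names = {0: "entailment", 1: "neutral", 2: "contradiction"}
    return {names[label]: sums[label] / counts[label] for label in sums}

# Example usage with the Thai XNLI validation split:
# from datasets import load_dataset
# print(average_distance_per_label(load_dataset("xnli", "th", split="validation")))
```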
Table 5: Average Levenshtein distance between premise and hypothesis per XNLI label.

| Language | Entailment | Neutral | Contradiction |
|---|---|---|---|
| Thai (th) | 79.08 | 82.64 | 81.52 |
| Turkish (tr) | 76.93 | 80.59 | 80.24 |
| Greek (el) | 90.90 | 95.10 | 93.93 |
Table 6 shows aggregate performance on languages that were not intentionally seen during pretraining or finetuning for BLOOMZ, and that were seen only during pretraining for mT0. For BLOOMZ, performance drops significantly when translating the prompts into the respective unseen languages. Further, BLOOMZ-MT loses its edge over BLOOMZ, as it has not been finetuned on prompts in these languages. For mT0, the differences are less significant.
Table 6: Average accuracy with English (EN) and machine-translated (MT) prompts on languages not intentionally seen during pretraining or finetuning (BLOOMZ) or seen only during pretraining (mT0).

| Task | Prompt | BLOOMZ | BLOOMZ-MT | mT0-13B | mT0-13B-MT |
|---|---|---|---|---|---|
| XNLI | EN | 45.65 | 43.2 | 48.52 | 51.33 |
| XNLI | MT | 36.48 | 35.67 | 41.86 | 39.78 |
| XCOPA | EN | 54.27 | 53.67 | 72.67 | 71.6 |
| XCOPA | MT | 53.2 | 53.0 | 71.57 | 70.87 |
| XStoryCloze | EN | 61.59 | 61.36 | 79.31 | 80.13 |
| XStoryCloze | MT | 60.5 | 59.91 | 80.21 | 80.28 |
| XWinograd | EN | 55.98 | 54.54 | 70.81 | 72.0 |
| XWinograd | MT | 53.11 | 52.46 | 67.86 | 70.45 |
We list several experiments that did not improve over baseline results:
In a non-causal or prefix language model, the model attends bidirectionally over input tokens and only causally over target tokens. Given a pretrained causal decoder, previous work found that multitask finetuning in a non-causal setup performed better than causal finetuning Wang et al. (2022a); Tay et al. (2022b). However, in our experiments, non-causal finetuning did not improve over causal finetuning.
Instead of separating inputs and targets with a space, we experimented with special tokens. Using the end-of-sequence token as a separator or a completely new token that the model would learn during finetuning significantly worsened results. The models may need to train on more tokens, possibly even during pretraining, to learn these new special tokens Zeng et al. (2022).
PromptSource has been written with encoder-decoder models in mind, where inputs and targets are fed into different models. As a consequence, human-written prompts in PromptSource often lack separators between input and target. For our decoder models, we decided to separate them with a space. We additionally experimented with leaving them as is or rewriting a significant amount of prompts, but neither improved significantly over space separation.
Previous work has shown bias-only finetuning Zaken et al. (2021) of large language models to be sufficient for strong downstream performance Logan et al. (2021); Hu et al. (2021); Muennighoff (2022); Liu et al. (2022); Ding et al. (2022); Muennighoff et al. (2022). We found multitask finetuning of only biases to perform 15 absolute percentage points worse on the average of held-out tasks for BLOOMZ-7.1B.
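For reference, bias-only finetuning can be sketched as follows: freeze every parameter except bias terms before running the usual multitask finetuning loop. The checkpoint, learning rate, and parameter-name matching are illustrative assumptions, not the configuration used for the BLOOMZ-7.1B comparison.

```python
# Sketch of bias-only (BitFit-style) finetuning: only bias parameters are
# updated; everything else stays frozen.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

for name, param in model.named_parameters():
    # Train only parameters whose name ends in "bias".
    param.requires_grad = name.endswith("bias")

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Training {trainable / total:.4%} of parameters")

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
# ... regular multitask finetuning loop over xP3 batches goes here ...
```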
We highlight several limitations of our work:
The choice to separate inputs and targets using a space character has proven effective for multitask finetuning our decoder-only models. Nonetheless, poorly formatted prompts may result in undesirable behavior. For example, given the prompt “Translate to English: Je t’aime”, the model may continue the input with additional French content before starting to solve the task, i.e. translating the input from French to English. This can be mitigated by improving the prompts with a trailing full stop or a newline symbol. Encoder-decoder models, such as our mT0, do not suffer from this problem, as inputs and targets are fed into different parts of the model.
The pretraining corpus of mT0 contains more than 101 languages Xue et al. (2020); however, we finetune on only 46 languages. As shown in Appendix §B, more datasets lead to better performance. Extending xP3 to the full 101 languages mT0 has seen during pretraining would thus likely lead to better performance. However, we decided to use only the languages of BLOOM in order to study language generalization (§4.2). Similarly, one could likely attain better performance by enhancing xP3 with more datasets, such as via BIG-Bench Srivastava et al. (2022); Suzgun et al. (2022), or more prompts, such as via NL-Augmenter Dhole et al. (2021).
While our models show strong capabilities of performing tasks zero-shot, there remain numerous failure modes that are common in large language models Rae et al. (2021); Bommasani et al. (2021); Zhang et al. (2022); Smith et al. (2022); Ouyang et al. (2022); Chowdhery et al. (2022). In Figure 16 of Appendix §F, BLOOMZ fails to understand the moral of a fable resulting in an undesirable generation. Similarly, in Figure 15, mT0-13B is asked to provide an explanation, but answers with a question.
While we investigated generalization to languages only seen during pretraining, we did not investigate generalization to languages only seen during finetuning. Our mT0 models are finetuned on several new languages not seen in pretraining (see Figure 2). Out of those, we only evaluated on code (HumanEval), where mT0 performed at the random baseline (0.00 in Table 7). Future work may investigate language acquisition via crosslingual multitask finetuning. We point to prior work that has looked into extending BLOOM to new languages Yong and Nikoulina (2022).
Table 7 shows all experimental results reported in this paper.
This section describes the prompts used for training and evaluation.
See prompt-appendix.pdf (pages 2 onward) for the full set of prompts.