
On the Multilingual Capabilities of Very Large-Scale English Language Models

Generative Pre-trained Transformers (GPTs) have recently been scaled to unprecedented sizes in the history of machine learning. These models, solely trained on the language modeling objective, have been shown to exhibit outstanding few-shot learning capabilities in a number of different tasks. Nevertheless, aside from anecdotal experiences, little is known regarding their multilingual capabilities, given the fact that the pre-training corpus is almost entirely composed of English text. In this work, we investigate the multilingual skills of GPT-3, focusing on one language that barely appears in the pre-training corpus, Catalan, which makes the results especially meaningful; we assume that our results may be relevant for other languages as well. We find that the model shows an outstanding performance, particularly in generative tasks, with predictable limitations mostly in language understanding tasks but still with remarkable results given the zero-shot scenario. We investigate its potential and limits in extractive question-answering and natural language generation, as well as the effect of scale in terms of model size.



1 Introduction

Improving Natural Language Understanding (NLU) and Generation (NLG) by pre-training autoregressive language models based on the Transformer Vaswani et al. (2017) decoder architecture has been commonplace since the original GPT (Generative Pretrained Transformer) Radford and Narasimhan (2018) first appeared. In the race to scale up these language models Radford et al. (2019), the arrival of GPT-3 Brown et al. (2020) has changed the rules of the game. As claimed by its creators, its ability to learn from a few examples "via text interaction" makes it stand out from the rest. Its impressive generative capabilities have caused a big sensation, not only at the research level but also in the mainstream media.

A particular feature of GPT-3 is, besides the sheer size of the data it has been trained on, the fact that, although the data is generally of good quality, it has not been filtered for language (on purpose). Therefore, although GPT-3 is in principle a language model for English, its training data contains many other languages, even if they account for a small portion of the dataset in comparison to English (93% by word count). Intuitively, one would expect that this quantity would not be enough to obtain a high-quality language model in these other languages, especially in the low-resource ones. Some evidence in this regard is provided by the large amount of data required to train language-specific models Nozza et al. (2020). Even the multilingual ones, such as mBERT Devlin et al. (2018) or XLM-R Conneau et al. (2019), employ large multilingual datasets based on Wikipedia or CommonCrawl (note that both mBERT and XLM-R are encoder-based models, unlike GPT, but the point still holds). A very recent work trained a language-specific Catalan model with around 1.7B tokens Armengol-Estapé et al. (2021), but it was published after the elaboration of this article and thus is not included in our comparisons. The code for reproducing the GPT-3 API queries and the results we obtained is openly available.

2 Related Work

In Brown et al. (2020), the authors of GPT-3 already conducted a thorough evaluation on many different benchmarks, including question-answering, cloze tasks, and Natural Language Inference (NLI), among many others. Crucially, they train and evaluate models of different sizes, and find that by simply scaling up the exact same architecture, the diminishing returns that one would expect are not observed. Recently, some works have estimated the increase in performance of autoregressive models in terms of model size, data, and compute Kaplan et al. (2020); Henighan et al. (2020). Also in Brown et al. (2020), and relevant to our work, the authors evaluate GPT-3 in machine translation, both in zero-shot and few-shot settings, and find that in the latter, GPT-3 outperforms previous unsupervised NMT models by 5 BLEU in some pairs. Specifically, this success is observed in the evaluated pairs in which English is the target language and not in the ones in which English is the source, since GPT-3 is an English language model. No other analysis involving languages other than English was conducted.

Since the original article of GPT-3, several works have investigated the capabilities and limits of the model in English Zhao et al. (2021). Moreover, with the possibility of querying the model via API, hundreds of researchers, journalists and the curious alike have embarked on all sorts of experiments, including automatic programming or solving arithmetic operations Floridi and Chiriatti (2020). The Internet is full of examples of the amazing generative capabilities of the model, from poetry to news or essay writing Elkins and Chun (2020).

Furthermore, many researchers are interested in the ethical concerns regarding such a capable generative model and are studying the impact it may have if it were released to the public Dale (2021); McGuffie and Newhouse (2020). In a more consequential approach, with the purpose of harnessing the full learning potential of GPT, we are seeing the emergence of a new line of research exploring optimal ways to "prompt" the model Liu et al. (2021).

Nevertheless, to our knowledge, no work has studied its potential for solving tasks in languages other than English, aside from machine translation. In this work, we investigate the multilingual skills of GPT-3, focusing on Catalan, a language barely appearing in the pre-training corpus.

3 Methodology

In this work we have explored how good GPT-3 is at generating natural text in Catalan and solving one NLU task, specifically extractive Q&A. Catalan only accounts for 0.01798% of the words in the training corpus, that is, around 35M words. Language models, even if at a considerably smaller scale than GPT-3, are usually trained on corpora with a number of tokens in the billions, as can be seen in Table 1. Even considering the effect of certain factors particular to each language, such as linguistic proximity to English (e.g. being an Indo-European language), affiliation to well-populated families (e.g. Romance), number of tokens in the training corpus, etc., we can assume that our results may be relevant for other languages as well.

Model    Words (M)    Catalan words (M)
mBERT    Unclear      ~200
XLM-R    295,008      1,752
GPT-3    196,755      35
Table 1: Pre-training word count in some models
mBERT: mBERT was trained with the top 100 largest Wikipedias, but there are no details on the exact amount of tokens. For Catalan, we estimate the size at 200M tokens from a dump from January 2020.
XLM-R: Summing up tokens from all languages from Table 6 in Conneau et al. (2019).
GPT-3: In the dataset statistics on GitHub, OpenAI claims that English, with around 181B tokens, accounts for about 93% of the dataset. This implies a total size of around 197B tokens, the one we use in the table. However, in the article the authors say the model was trained with a total of 300B tokens. We have not been able to clarify this apparent inconsistency.
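As a quick sanity check, the Catalan word count in Table 1 can be recovered from the reported share and the total corpus size. The following is a minimal sketch using only the figures quoted above; the function and constant names are illustrative and do not come from the original code release.

```python
# Figures taken from Table 1 and the text; units are millions of words.
GPT3_TOTAL_WORDS_M = 196_755      # total words in GPT-3's corpus (Table 1)
CATALAN_SHARE_PERCENT = 0.01798   # share of Catalan words reported in the text


def words_for_share(total_m: float, share_percent: float) -> float:
    """Return the number of words (in millions) that a percentage share represents."""
    return total_m * share_percent / 100.0


catalan_words_m = words_for_share(GPT3_TOTAL_WORDS_M, CATALAN_SHARE_PERCENT)
print(f"Catalan words in GPT-3's corpus: ~{catalan_words_m:.1f}M")
```

The result is consistent with the rounded value of 35M words given in the table.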

3.1 Question-answering

To evaluate GPT-3 in question-answering, we use a Catalan translation (introduced in Armengol-Estapé et al. (2021), Rodriguez-Penagos and Armentano-Oller (2021b)) of XQuAD Artetxe et al. (2019), a cross-lingual question-answering dataset consisting of 240 paragraphs and 1,060 question-answer pairs. We focus on the zero-shot setting, in which the model is not given any example. GPT-3 is asked to answer one question at a time, each paired with its context in a single prompt, as shown below (the text after "Resposta:" is GPT-3’s answer):

Això és un sistema de resposta de preguntes en català.
Context: La defensa dels Panthers va cedir només 308 punts […]
Pregunta: Quants punts va cedir la defensa dels Panthers?
Resposta: 308 punts

The whole prompt, including the instruction to answer the question (the first sentence), the context, the question (Pregunta), and the final word (Resposta, "Answer"), is given in Catalan, with the hope that this will further condition the model to answer in Catalan. To study the effect of scale, we run the model with the 4 engines provided in OpenAI’s API, in increasing size (in parameters): Ada, Babbage, Curie, and Davinci. To the best of our knowledge, OpenAI has not clarified the exact size of each of the models in the API; however, some evaluation results seem to suggest that Ada, Babbage, Curie and Davinci would correspond to 350M, 1.3B, 6.7B, and 175B parameters, respectively. We use the default sampling parameters (a temperature of 0.7, a frequency penalty of 0, a presence penalty of 0, and top_p = 1) except for max_tokens, which we set to 64 to allow the longest answers.
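The prompt layout and sampling settings described above can be sketched as follows. This is an illustration rather than the released code: the helper names, the lowercase engine identifiers, the newline separators, and the shape of the request dictionaries are assumptions, and no API call is made.

```python
from typing import Dict, List

# Sampling parameters as reported in the text (defaults, except max_tokens).
SAMPLING_PARAMS = {
    "temperature": 0.7,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "max_tokens": 64,
}

# Engine identifiers in increasing size; the lowercase spelling is an assumption.
ENGINES = ["ada", "babbage", "curie", "davinci"]


def build_prompt(context: str, question: str) -> str:
    """Assemble the zero-shot Catalan QA prompt: instruction, context, question, answer cue."""
    return (
        "Això és un sistema de resposta de preguntes en català.\n"
        f"Context: {context}\n"
        f"Pregunta: {question}\n"
        "Resposta:"
    )


def build_requests(context: str, question: str) -> List[Dict]:
    """Return one hypothetical request payload per engine, sharing the same prompt."""
    prompt = build_prompt(context, question)
    return [{"engine": engine, "prompt": prompt, **SAMPLING_PARAMS} for engine in ENGINES]


requests_for_example = build_requests(
    "La defensa dels Panthers va cedir només 308 punts",
    "Quants punts va cedir la defensa dels Panthers?",
)
```

Ending the prompt with the cue "Resposta:" leaves the model to continue with the answer itself, which is what makes the setup zero-shot: the prompt contains no solved example.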

As a reference, we include the results of what should be considered state-of-the-art: the ones obtained by fine-tuning mBERT and XLM-RoBERTa (base size for both models) on a Catalan question-answering dataset Rodriguez-Penagos and Armentano-Oller (2021a), using the script from the Huggingface library Wolf et al. (2019) used for fine-tuning on the SQuAD dataset. For all models (including GPT-3), we apply the same evaluation script as in SQuAD.
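For readers unfamiliar with SQuAD-style scoring, the two metrics reported later (Exact Match and token-level F1) can be sketched as below. This is a simplified approximation of the usual SQuAD v1.1 logic, not the script used for the paper: it assumes a single gold answer per question, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("308 punts", "308 punts."), f1_score("308", "308 punts"))
```

Because F1 gives partial credit for overlapping tokens while EM does not, a generative model that paraphrases or over-generates can score reasonably on F1 and still obtain a low EM.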


3.2 Natural Language Generation

In order to evaluate the generative capabilities of GPT-3 in Catalan, we want to assess how “natural” the generated text is to Catalan natives. For this, we create a synthetic set of 60 sentences and mix them randomly with 60 control sentences coming from a news corpus (crawled in 2021, in Catalan), and ask our evaluators to score each sentence based on its overall fluency and correctness. To obtain the synthetic sentences, we first query GPT-3 with a set of 20 headlines extracted from the same news corpus, and then sample 60 sentences from the generated output. For this evaluation we only use the output of the largest version of GPT-3 (i.e. Davinci). We manually checked that the sentences did not appear on the Internet (by searching for them on Google; none of the sentences appeared verbatim, although we removed a similar one), to avoid sentences that could have been directly memorized in training. As in question-answering, we used the default sampling parameters of OpenAI’s API, this time setting max_tokens to 1024 in order to generate more sentences to sample from. For the human evaluation, similarly to Casas et al. (2020), sentences were evaluated by a pool of 9 annotators, who were asked to rate each sentence on an integer scale from 1 to 5. Each sentence, randomly distributed among the pool of evaluators, was scored by 3 different evaluators; this redundancy accounts for the variance and subjectivity in human scores.
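One way to organize such an evaluation is sketched below: the synthetic and control sentences are shuffled together and each one is assigned to 3 of the 9 annotators. The paper does not describe the exact assignment procedure, so the round-robin scheme, the seed, and all names here are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Dict, List


def build_evaluation_plan(
    synthetic: List[str],
    control: List[str],
    annotators: List[str],
    per_sentence: int = 3,
    seed: int = 0,
) -> List[Dict]:
    """Shuffle both sets together and give each sentence `per_sentence` distinct raters."""
    rng = random.Random(seed)
    items = [(s, "gpt3") for s in synthetic] + [(s, "human") for s in control]
    rng.shuffle(items)  # raters see the two sources interleaved and unlabeled
    plan = []
    for i, (sentence, source) in enumerate(items):
        start = (i * per_sentence) % len(annotators)
        raters = [annotators[(start + k) % len(annotators)] for k in range(per_sentence)]
        plan.append({"sentence": sentence, "source": source, "raters": raters})
    return plan


# Placeholder sentence identifiers stand in for the real sentences.
synthetic = [f"synthetic-{i}" for i in range(60)]
control = [f"control-{i}" for i in range(60)]
annotators = [f"annotator-{i}" for i in range(9)]

plan = build_evaluation_plan(synthetic, control, annotators)
load = Counter(rater for item in plan for rater in item["raters"])
```

With 120 sentences, 3 ratings each, and 9 annotators, this scheme gives every annotator 40 sentences. A drawback of the round-robin choice is that annotators always rate in the same groups of three; sampling raters independently per sentence would avoid that at the cost of an uneven workload.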

Model F1 EM
GPT-3: Ada 5.26 0.38
GPT-3: Babbage 10.08 1.13
GPT-3: Curie 16.66 5.00
GPT-3: Davinci 38.43 17.74
XLM-RoBERTa 67.10 46.42
mBERT 67.15 46.51
Table 2: Question answering results for XQuAD-ca

4 Results


The results obtained by GPT-3 in question-answering are reported in Table 2, which shows the F1 score and the Exact Match (EM) value on XQuAD-ca for the different GPT-3 model sizes. We also include, as a reference, the results of two supervised, fine-tuned models considered state-of-the-art. Note that this is not a direct comparison, since GPT-3 is evaluated in a zero-shot setting. GPT-3 Davinci obtains an F1 score that is more than 50% of the score obtained by the SOTA models, which is remarkable for a pure zero-shot setting. Figure 1 shows the scaling curves of the different model sizes of GPT-3.

Figure 1: Question-answering results for GPT-3 sizes
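The comparison with the supervised baselines can be made explicit by expressing each GPT-3 F1 score as a fraction of the best supervised F1 in Table 2. The sketch below only restates the table's numbers; the variable names are illustrative.

```python
# F1 scores from Table 2 (XQuAD-ca).
GPT3_F1 = {"Ada": 5.26, "Babbage": 10.08, "Curie": 16.66, "Davinci": 38.43}
BEST_SUPERVISED_F1 = 67.15  # mBERT, the higher of the two fine-tuned baselines

# Fraction of the supervised F1 reached by each zero-shot GPT-3 engine.
relative_f1 = {model: f1 / BEST_SUPERVISED_F1 for model, f1 in GPT3_F1.items()}

for model, ratio in relative_f1.items():
    print(f"{model}: {ratio:.1%} of the mBERT F1")
```

Only Davinci exceeds half of the supervised score; the three smaller engines stay below a quarter of it.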

Natural Language Generation

Table 3 shows the results of the human evaluation. The sentences generated by GPT-3 obtain an average score of 3.89, compared to 4.49 of the control. The difference is statistically significant: with a t-test, we obtain a p-value of 0.00026 < 0.001. As can be seen from the difference between the standard deviations and the distribution of scores in Figure 2, GPT-3 is less consistent than the control in quality; however, most of the sentences are rated between 4 and 5 by the evaluators. In fact, a third of the sentences are above the average of the control, versus half of the ones generated by humans.

Source    Average    St. Dev.    % > Human Av.
Human     4.49       0.57        53.33
GPT-3     3.83       1.05        33.33
Table 3: Human evaluation (for GPT-3, Davinci)
Figure 2: Distribution of Human Evaluation ratings
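A significance test of this kind can be approximated from summary statistics alone. The sketch below computes Welch's t statistic and degrees of freedom from the means and standard deviations in Table 3, assuming 60 averaged scores per group. The paper does not state which t-test variant or which unit of analysis was used, so this is an illustration of the procedure and is not expected to reproduce the reported p-value exactly.

```python
import math
from typing import Tuple


def welch_t(
    mean_a: float, sd_a: float, n_a: int,
    mean_b: float, sd_b: float, n_b: int,
) -> Tuple[float, float]:
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Table 3 values; n = 60 per group is an assumption (one averaged score per sentence).
t_stat, dof = welch_t(4.49, 0.57, 60, 3.83, 1.05, 60)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Welch's variant is used here because the two groups have clearly different standard deviations, which the pooled-variance t-test does not account for.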

5 Discussion

Qualitative analysis

A closer inspection of the results shows some surprising abilities of GPT-3 in addition to the naturalness of most of the sentences. An interesting example is that, following the prompt of a headline about Valencia, GPT-3 is able to write using the Valencian variant of Catalan, which is truly remarkable. An analysis of the errors shows that the sentences with a score of 2 or less (13% of the sample) contain gibberish fragments, often mixing Catalan and English; in fact, no control sentence received such low scores. On the other hand, sentences with a score of 3 (21.6%) are mostly syntactically impeccable but show some peculiarities in meaning, as for example: "La IV Mostra de Patrimoni Cultural de Bétera ha comptat amb una participació de 15.000 persones, que han pogut gaudir d’un espai on diversos grups han mostrat els seus valors patrimonials."


As shown in Figure 1, there is a steep curve of F1 score in terms of model size, while the pre-training data (and, thus, the amount of Catalan) remains the same. This shows that transfer learning between English and the other languages in zero-shot settings scales with model size in a very steep curve. This is consistent with Figure H.11 in Brown et al. (2020), where zero-shot translation in which English is the target language reaches a plateau, but when the target languages are languages other than English, the curves keep climbing.
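The steepness of the curve can be quantified as the F1 gain per tenfold increase in parameters between consecutive engines. The sketch below relies on the parameter counts that, as noted in Section 3.1, are only suggested by external evaluations and not confirmed by OpenAI; they should be read as assumptions, and the names are illustrative.

```python
import math

# Parameter counts are unconfirmed estimates (see Section 3.1); F1 scores are from Table 2.
ASSUMED_PARAMS = {"Ada": 350e6, "Babbage": 1.3e9, "Curie": 6.7e9, "Davinci": 175e9}
F1 = {"Ada": 5.26, "Babbage": 10.08, "Curie": 16.66, "Davinci": 38.43}
ORDER = ["Ada", "Babbage", "Curie", "Davinci"]


def f1_gain_per_decade(small: str, large: str) -> float:
    """F1 points gained per 10x increase in parameters between two engines."""
    return (F1[large] - F1[small]) / math.log10(ASSUMED_PARAMS[large] / ASSUMED_PARAMS[small])


slopes = [f1_gain_per_decade(a, b) for a, b in zip(ORDER, ORDER[1:])]
for (a, b), slope in zip(zip(ORDER, ORDER[1:]), slopes):
    print(f"{a} -> {b}: {slope:.1f} F1 points per decade of parameters")
```

Under these assumed sizes the gain per decade grows at each step, so the curve is steeper than linear even on a logarithmic parameter axis.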

Usability in practice

We believe the model can be useful in multilingual applications (at least, to a degree not far from the one for English), especially since we used the model in zero-shot settings and without any effort in prompt design. We expect the model to perform considerably better in few-shot settings, and even better in languages with more data in GPT-3’s corpus. Nevertheless, a caveat, at least for Catalan, is that the smaller versions of GPT-3 are not usable. In addition, because the vocabulary was trained fundamentally on English, Catalan sentences are tokenized into considerably long sequences, which makes them expensive to compute.
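The tokenization caveat can be illustrated with a toy example. The sketch below is not GPT-3's tokenizer and not byte-pair encoding: it is a greedy longest-match segmenter over a tiny, hypothetical English-only vocabulary that falls back to single characters, which is enough to show why text in a language the vocabulary was not built for splits into many more pieces.

```python
from typing import List, Set

# Hypothetical vocabulary containing only a few English words and the space character.
TOY_ENGLISH_VOCAB = {"the", "defense", "allowed", "only", "points", " "}


def greedy_segment(text: str, vocab: Set[str]) -> List[str]:
    """Split text by repeatedly taking the longest vocabulary entry, else one character."""
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocab:  # single characters are the fallback
                tokens.append(piece)
                i += length
                break
    return tokens


english_tokens = greedy_segment("the defense allowed only 308 points", TOY_ENGLISH_VOCAB)
catalan_tokens = greedy_segment("la defensa va cedir només 308 punts", TOY_ENGLISH_VOCAB)
print(len(english_tokens), len(catalan_tokens))
```

Since the cost of running a Transformer depends on sequence length, longer token sequences for the same content translate directly into more computation per sentence.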

Limitations of our study

We have restricted our analysis to the case of Catalan, and to two specific tasks, even if we believe them to be relevant and reasonably representative of the NLP scenario. We have constrained the analysis to the zero-shot setting, which we believe to be the most interesting one. For the human evaluation, we have tried to make it as balanced as possible by using a redundancy of 3 evaluators, but human ratings can be biased. Regarding the relevance to other languages, as already mentioned, Catalan probably benefits from linguistic similarities with Romance and Indo-European languages at large (including English).

6 Conclusions and Future Work

We have seen that GPT-3 does, indeed, exhibit remarkable zero-shot NLU and NLG capabilities in Catalan. This is surprising in view of the tiny proportion of Catalan in the training corpus. Our results show that GPT-3 can be useful not only for English but for many other languages present in the corpus as well. Nevertheless, some practical concerns (the needed model scale and suboptimal tokenization) make it less computationally efficient than for English. Overall, this is a very interesting exercise in how linguistic structures (universals) transfer across languages. Given the large number of tasks GPT-3 has been implicitly exposed to during the training procedure, handling a different language can be considered as working on yet another domain. As future work, we suggest extending the study of the scaling laws of language models Kaplan et al. (2020) in terms of cross-lingual transfer, similarly to Hernandez et al. (2021).


References

  • J. Armengol-Estapé, C. P. Carrino, C. Rodriguez-Penagos, O. de Gibert Bonet, C. Armentano-Oller, A. Gonzalez-Agirre, M. Melero, and M. Villegas (2021) Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online, pp. 4933–4946.
  • M. Artetxe, S. Ruder, and D. Yogatama (2019) On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. CoRR abs/2005.14165.
  • N. Casas, J. A. Fonollosa, and M. R. Costa-jussà (2020) Syntax-driven iterative expansion language models for controllable text generation. arXiv preprint arXiv:2004.02211.
  • A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov (2019) Unsupervised cross-lingual representation learning at scale. CoRR abs/1911.02116.
  • R. Dale (2021) GPT-3: what’s it good for?. Natural Language Engineering 27 (1), pp. 113–118.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
  • K. Elkins and J. Chun (2020) Can GPT-3 pass a writer’s Turing test. Journal of Cultural Analytics 2371, pp. 4549.
  • L. Floridi and M. Chiriatti (2020) GPT-3: its nature, scope, limits, and consequences. Minds and Machines 30 (4), pp. 681–694.
  • T. Henighan, J. Kaplan, M. Katz, M. Chen, C. Hesse, J. Jackson, H. Jun, T. B. Brown, P. Dhariwal, S. Gray, C. Hallacy, B. Mann, A. Radford, A. Ramesh, N. Ryder, D. M. Ziegler, J. Schulman, D. Amodei, and S. McCandlish (2020) Scaling laws for autoregressive generative modeling. CoRR abs/2010.14701.
  • D. Hernandez, J. Kaplan, T. Henighan, and S. McCandlish (2021) Scaling laws for transfer. CoRR abs/2102.01293.
  • J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei (2020) Scaling laws for neural language models. CoRR abs/2001.08361.
  • J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, and W. Chen (2021) What makes good in-context examples for GPT-3?. CoRR abs/2101.06804.
  • K. McGuffie and A. Newhouse (2020) The radicalization risks of GPT-3 and advanced neural language models. arXiv preprint arXiv:2009.06807.
  • D. Nozza, F. Bianchi, and D. Hovy (2020) What the [mask]? making sense of language-specific BERT models. CoRR abs/2003.02912.
  • A. Radford and K. Narasimhan (2018) Improving language understanding by generative pre-training.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners.
  • C. G. Rodriguez-Penagos and C. Armentano-Oller (2021a)
  • C. G. Rodriguez-Penagos and C. Armentano-Oller (2021b)
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. CoRR abs/1706.03762.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. CoRR abs/1910.03771.
  • T. Z. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh (2021) Calibrate before use: improving few-shot performance of language models. arXiv preprint arXiv:2102.09690.