XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning

In order to simulate human language capacity, natural language processing systems must complement the explicit information derived from raw text with the ability to reason about the possible causes and outcomes of everyday situations. Moreover, the acquired world knowledge should generalise to new languages, modulo cultural differences. Advances in machine commonsense reasoning and cross-lingual transfer depend on the availability of challenging evaluation benchmarks. Motivated by both demands, we introduce Cross-lingual Choice of Plausible Alternatives (XCOPA), a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages. We benchmark a range of state-of-the-art models on this novel dataset, revealing that current methods based on multilingual pretraining and zero-shot fine-tuning transfer suffer from the curse of multilinguality and fall short of performance in monolingual settings by a large margin. Finally, we propose ways to adapt these models to out-of-sample resource-lean languages where only a small corpus or a bilingual dictionary is available, and report substantial improvements over the random baseline. XCOPA is available at github.com/cambridgeltl/xcopa.


1 Introduction

| Lang | Premise | Type | Choice 1 | Choice 2 |
|---|---|---|---|---|
| qu | Sipasqa cereal mikhunanpi kuruta tarirqan. | R | Payqa pukunman ñuqñuta churakurqan. | Payqa manam mikhuyta munarqanchu. |
| en | The girl found a bug in her cereal. | | She poured milk in the bowl. | She lost her appetite. |
| th | ตาของฉันแดงและบวม | C | ฉันร้องไห้ | ฉันหัวเราะ |
| en | My eyes became red and puffy. | | I was sobbing. | I was laughing. |

Table 1: Examples of forward and backward reasoning (Result [R] and Cause [C]) in the XCOPA validation sets.

Commonsense reasoning is a critical component of any natural language understanding system (Davis:2015cacm). Contrary to textual entailment, commonsense reasoning must bridge between premises and possible hypotheses with world knowledge that is not explicit in text (singer1992validation). Such world knowledge encompasses, among other aspects: temporal and spatial relations, causality, laws of nature, social conventions, politeness, emotional responses, and multiple modalities. Hence, it corresponds to the individuals’ expectations about typical situations (shoham1990nonmonotonic) [1]. In short, commonsense reasoning does not just involve understanding what is possible, but also ranking what is most plausible.

[1] Moreover, there are often multiple legitimate chains of sentences that can be invoked in between premises and hypotheses.

A seminal work on the quantitative evaluation of commonsense reasoning is the Choice Of Plausible Alternatives dataset (COPA; Roemmele:2011aaai), which focuses on cause–effect relationships. In recent years, more datasets have been dedicated to other facets of world knowledge (Sakaguchi:2020aaai; Bisk:2020aaai; Bhagavatula:2020iclr; Rashkin:2018acl; Sap:2019siqa, inter alia). Unfortunately, the extensive efforts related to this thread of research have so far been limited only to the English language [2]. Such a narrow scope not only curbs the development of natural language understanding tools in other languages (Bender:2011lilt; Ponti:2019cl), but also exacerbates the Anglo-centric bias in modeling commonsense reasoning. In fact, the expectations about typical situations vary cross-culturally (thomas1983cross).

[2] The only exception is direct translation of the 272 paired English Winograd Schema Challenge instances to Japanese (WSCja), French (Amsili:2017french), and Portuguese (Melo:2020eniac).

Datasets that cover multiple languages for other natural language understanding tasks, such as language inference (Conneau:2018emnlp), question answering (Lewis:2019arxiv; artetxe2020translation; tydiqa), and paraphrase identification (Yang:2019emnlp), have received increasing attention. In fact, the requirement to generalise to new languages encourages the development of more versatile language understanding models, which can be ported across different grammars and lexica. These efforts have recently culminated in the integration of several multilingual tasks into the XTREME evaluation suite (Hu:2020arxiv). However, a comprehensive multilingual benchmark for commonsense reasoning in particular is still missing.

In order to address this gap, we develop a novel dataset, XCOPA (see examples in Table 1), by carefully translating and re-annotating the validation and test sets of English COPA into 11 target languages. A key design choice is the selection of a typologically diverse sample of languages. In particular, we privilege internal variety over the abundance of digital resources in each language. Since resource-rich languages tend to belong to only a few families and areas, their sample is highly biased and is not indicative of the expected model performance (gerz-etal-2018-relation; Ponti:2019cl; Joshi:2020arxiv). Following this guiding principle, we select 11 languages from 11 distinct families, and 4 geographical macro-areas (Africa, South America, Eurasia, Southeast Asia and Oceania).

We leverage XCOPA to benchmark a series of state-of-the-art pretrained multilingual models, including XLM-R (conneau2019unsupervised), MBERT (Devlin:2019bert), and multilingual USE (yang2019multilingual). Two XCOPA languages (i.e., Quechua and Haitian Creole) are out-of-sample for the pretrained models: this naturally raises the question of how to adapt the pretrained models to such unseen languages. In particular, we investigate the resource-lean scenarios where either some monolingual data or a bilingual dictionary with English (or both) are available for the target language.

In summary, we offer the following contributions. 1) We create the first large-scale multilingual evaluation set for commonsense reasoning, spanning 11 languages, and discuss the challenges in accounting for world knowledge across different cultures and languages. 2) We propose quantitative metrics to measure the internal variety of a language sample, which can guide the design of any multilingual dataset in the future. 3) We benchmark pretrained state-of-the-art models in cross-lingual transfer of commonsense knowledge, and investigate how to (post-hoc) improve transfer for languages unseen at pretraining time.

In order to rise to the challenge of this dataset, models must be able not only to combine textual evidence with world knowledge, which makes commonsense reasoning challenging per se (Talmor:2019naacl; Rajani2019acl), but also to transfer the acquired causal reasoning abilities across languages. The results we obtain on XCOPA may thus indicate the limitations of current state-of-the-art multilingual models in cross-lingual transfer for complex reasoning tasks.

2 Annotation Design

Design Objectives. The principal objectives in XCOPA creation were: 1) to create a genuinely typologically diverse multilingual dataset, aligned across target languages in order to make performance scores comparable, and 2) to ensure high quality, naturalness and idiomaticity of each monolingual dataset. While the commonly used translation approach addresses the former objective, it is prone to compromise the latter goal, bending the target language to the structural and lexical properties of the source language: the resulting evaluation benchmarks thus fail to measure system performance adequately (koppel2011translationese; volansky2015features; artetxe2020translation).

In order to avoid these pitfalls, we: (i) entrusted the translation task to a single (but carefully selected) translator for each target language [3], and (ii) offered enough leeway for necessary target-language adjustments (e.g., substitutions with culture-specific concepts and multi-word paraphrases, wherever the original text eluded direct translation). Detailed translation guidelines are available in Appendix A.

[3] Crowd-sourcing offers faster annotation at a lower cost; however, in our trial experiments, chasing low annotation times and costs resulted in low translation quality.

Typology [0, 1] 0.41 0.41 0.39 0.36 0.32 0.31
Family [0, 1] 1 0.9 0.78 0.6 0.66 0.66
Geography [0, ] 1.79 1.16 0.95 0.72 0.66 0
Table 2: Indices of typological, genealogical, and areal diversity for the language samples of a set of NLU datasets.

Language Sampling. Multilingual evaluation benchmarks assess the expected performance of a model across languages. However, should such languages be randomly sampled from the distribution of digital texts? Or rather, should the sample represent the distribution over the languages spoken around the world? Resource-rich languages tend to belong to the same families and areas, which facilitates knowledge transfer and leads to an overestimation of the expected performance in the second sense (gerz-etal-2018-relation; Ponti:2019cl). Moreover, rather than samples that account for independent and identically distributed draws from the ‘true’ language distribution (known as probability sampling), we opt for a uniform distribution of linguistic phenomena, which encourages the inclusion of outliers (known as variety sampling; rijkhoff1993method; dryer1989large). Thus, the performance on XCOPA also reflects the robustness of a model, i.e. its resilience to phenomena that are unlikely to be observed in the training data.

Inspired by rijkhoff1993method and miestamo2004clausal, we propose a series of simple and interpretable metrics that quantify diversity of a language sample independent of its size:

1) The typology index is based on 103 typological features of each language from URIEL (Littel-et-al:2017), originally sourced from the World Atlas of Language Structures (WALS; wals). Each feature is binary and indicates the presence or absence of a phenomenon in a language. We estimate the entropy of the distribution of values in a sample, as shown in the heatmap of Figure 2 in the Appendix. Afterwards, we average across all 103 feature-specific entropies. Intuitively, if all values are equally represented, the entropy is high. If all languages have identical features, the entropy is 0.

2) The family index is simply the number of distinct families divided by the sample size.

3) The geography index is the entropy of the distribution over macro-areas in a sample [4].

[4] Six macro-areas, as defined by dryer1989large, are Africa, Eurasia, Southeast Asia and Oceania, Australia and New Guinea, North America, and South America.
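The three indices described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes base-2 entropy (consistent with a [0, 1] range for binary features), and the function names and the toy sample are hypothetical.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Shannon entropy (base 2, an assumption) of the empirical distribution."""
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in Counter(values).values())


def typology_index(features_by_language):
    """Mean per-feature entropy over binary typological features."""
    columns = list(zip(*features_by_language))  # one tuple of values per feature
    return sum(entropy_bits(col) for col in columns) / len(columns)


def family_index(families):
    """Number of distinct families divided by the sample size."""
    return len(set(families)) / len(families)


def geography_index(macro_areas):
    """Entropy of the distribution over macro-areas in the sample."""
    return entropy_bits(macro_areas)


# Hypothetical toy sample: four languages, two binary features each.
toy_features = [[1, 1], [1, 1], [0, 1], [0, 1]]
toy_families = ["A", "B", "B", "C"]
toy_areas = ["Africa", "Africa", "Eurasia", "Eurasia"]

print(typology_index(toy_features))
print(family_index(toy_families))
print(geography_index(toy_areas))
```

In the toy sample the first feature is split evenly (entropy 1) and the second is constant (entropy 0), so the typology index is their mean. A real computation would replace the toy lists with the 103 URIEL features per language.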

| | et | ht | id | it | qu | sw | ta | th | tr | vi | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| val | 97.0 | 97.0 | 99.0 | 98.0 | 98.0 | 99.0 | 100.0 | 99.0 | 97.0 | 97.0 | 96.0 |
| test | 98.2 | 96.4 | 100.0 | 97.0 | 94.8 | 99.0 | 98.6 | 98.2 | 96.4 | 98.4 | 96.6 |

Table 3: Percentage of annotated labels in each language agreeing with the majority label. Note that the majority label is highly reliable, as we observed a 100% agreement with the development set labels in the original COPA.

The sample of languages for XCOPA aims at maximising these indices. In particular, XCOPA includes Estonian (et), Haitian Creole (ht), Indonesian (id), Italian (it), Cusco-Collao Quechua (qu) [5], Kiswahili (sw), Tamil (ta), Thai (th), Turkish (tr), Vietnamese (vi), and Mandarin Chinese (zh). These languages belong to distinct families, respectively: Uralic, Creole, Austronesian, Indo-European, Quechuan, Niger-Congo, Dravidian, Kra-Dai, Turkic, Austroasiatic, and Sino-Tibetan. Moreover, qu and ht are spoken in South America, which is an underrepresented macro-area. We report the 3 metrics in Table 2 and compare them to samples from other standard multilingual NLU datasets. XCOPA offers the most diverse sample in terms of typology (on a par with TyDiQA), family, and geography.

[5] The translator is an Eastern Apurímac Quechua speaker.

Final Dataset. As shown in Table 1, each (X)COPA instance corresponds to a premise, a question (“What was the cause?” or “What happened as a result?”), and two alternatives. The task is framed as binary classification where the machine has to predict the more plausible choice. For each target language, XCOPA comprises 100 annotated data instances in the validation set and 500 instances in the test set, which are translations of the respective English COPA validation and test sets (see Table 1 again). Our translators performed labeling prior to translation, deciding on the correct alternative for the English premise and preserving the correctness of the same alternative in translation. We measure inter-translator agreement using the Fleiss’ κ statistic (fleiss1971measuring): the scores obtained for development data and for test data reveal very high agreement between translators (i.e., landis1977measurement define κ in the range 0.81–1.00 as almost perfect agreement).
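For readers unfamiliar with the statistic, the following is a minimal sketch of the standard Fleiss' κ computation. The function name and the toy labels are hypothetical, and it assumes every item is labelled by the same number of raters, as is the case when each of the 11 translators labels every instance.

```python
def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for `ratings`: one list of rater labels per item.

    Assumes the same number of raters per item. The value is undefined
    (division by zero) when all labels fall into a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[item.count(c) for c in categories] for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each category.
    shares = [sum(col) / (n_items * n_raters) for col in zip(*counts)]
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: 4 items, 3 raters, labels 1 or 2 (the chosen alternative).
toy_labels = [[1, 1, 1], [2, 2, 2], [1, 1, 2], [2, 2, 1]]
print(fleiss_kappa(toy_labels, categories=[1, 2]))
```

Two of the four toy items have full agreement and two have one dissenting rater, which yields a κ well below the "almost perfect" band.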

From the 11 sets of annotation labels we obtain the majority labels (i.e., 6+ translators agree). We observe perfect agreement between our majority labels and the English COPA labels for development data. We then compute the percentage of annotated labels which agree with the majority label for each language individually, reported in Table 3, and find very high agreement across 11 languages. The small discrepancies in label choices in our work stem not only from the actual semantic ambiguity of the original English question, but also reflect the translators’ different cultural frames of reference and patterns of association. On average, 2.1% of labels in the validation set and 2.4% of labels in the test set do not match the majority label [6].

[6] In order to accurately represent ambiguity of the small number of disagreement labels, in the final datasets we explicitly tag the corresponding questions with an apposite marker.
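The majority-label procedure and the per-language agreement percentage can be sketched as follows. This is an illustration with hypothetical function names and toy labels; with binary labels and an odd number of translators (11 in XCOPA) no ties can occur.

```python
from collections import Counter


def majority_labels(labels_by_translator):
    """Majority label per instance, given one label list per translator."""
    per_instance = zip(*labels_by_translator)
    return [Counter(votes).most_common(1)[0][0] for votes in per_instance]


def agreement_with_majority(labels, majority):
    """Percentage of one translator's labels that equal the majority label."""
    matches = sum(a == b for a, b in zip(labels, majority))
    return 100.0 * matches / len(majority)


# Hypothetical toy data: 3 translators, 4 instances, labels 1 or 2.
toy = [
    [1, 2, 1, 2],
    [1, 2, 1, 1],
    [1, 2, 2, 2],
]
majority = majority_labels(toy)
print(majority)
print([agreement_with_majority(labels, majority) for labels in toy])
```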

3 Qualitative Analysis

As highlighted in §2, our guidelines anticipated that the adopted translation approach may entail language-specific challenges, e.g., the lack of equivalent concepts or the grammatical expression of tense and aspect. We now analyse the main design challenges and the adopted solutions.

Cultural Context. The scenarios included in English COPA were authored by American English speakers with a particular cultural background. It is therefore inevitable that some concepts, intended as commonplace, sound unusual or even completely foreign in the target language. The examples include: (i) concrete referents with no language-specific term available (e.g., bowling ball, hamburger, lottery); (ii) systems of social norms absent in the target culture, such as traffic regulations (e.g., parallel parking, parking meter); (iii) social, political, and cultural institutions and related terminology (e.g., mortgage, lobbyist, gavel); (iv) idiomatic expressions (e.g., put the caller on hold).

In such cases, the translators were advised to resort to (i) paraphrasing; (ii) substitutions with similar concepts, e.g., ‘faucet’ is replaced with ‘pipe’ in Tamil (kuḻāy) and Haitian Creole (tiyo); or (iii) phonetically transcribed loan words, e.g., in Tamil: pauliṅ pantu (‘bowling ball’) and cōppu (‘soap’).

Grammatical Tense. The temporal contiguity between two events and their duration is crucial in establishing their causal relationship (enfieldmacro). A number of languages in our sample (i.e., th, vi, id, zh) do not have the grammatical category of tense and express temporality by means of aspect, mood or lexical items and expressions referring to time (e.g., adverbs), or rely entirely on pragmatic context to provide sufficient information for the interpretation of the utterance. Even if aspectual viewpoint markers exist, they are optional, e.g., the perfective marker 了 (le) in zh. To ensure naturalness of the translated sentences and faithfully represent the properties of the so-called tenseless languages, we favoured the unmarked variants, with the temporal relations established by the situational context. Compare, for example: (a) 我想节约能源。, Wǒ xiǎng jiéyuē néngyuán., ‘I want(ed) to conserve energy.’ (no perfective marker), and (b) 学生拼错了这个词。, Xuéshēng pīn cuòle zhège cí., ‘The student misspelled the word.’ (with completed action marker). Further considerations on anteriority and aspect are provided in Appendix B.

Label Discrepancies. The analysis of inter-translator agreement in §2 revealed a small number of COPA scenarios with discrepancies in annotations across languages. To better understand the source of such disagreements, we identified all the validation set instances on which one or more translators diverged from the majority label [7]. We identified two cases where the translator’s experience and cultural frame of reference played a role (as attested in translator feedback), which required, for instance, understanding of the procedures and structure of U.S. court trials (e.g., The judge pounded the gavel. cause: (a) The courtroom broke into uproar. (b) The jury announced its verdict.).

[7] Overall, there were 10 validation set questions with 1 translator out of 11 in disagreement, 5 questions with 2, and 1 question with 3.

Most disagreement cases (87.5%), however, seem to be culturally independent and concern genuinely ambiguous cases (e.g., The detective revealed an anomaly in the case. result: (a) He finalized his theory. (b) He scrapped his theory.). To verify this in a monolingual setting, we carried out a follow-up experiment where 4 Italian native speakers labeled the translated validation and test set instances. The Fleiss’ κ agreement scores were 0.926 (validation) and 0.917 (test), respectively. This corroborates our decision to override a single translator’s label with the majority label without altering the translation.

4 Experiments and Results

We now benchmark a series of state-of-the-art multilingual models on XCOPA to provide baseline scores for future research, as well as to exhibit the challenging nature of the dataset. The only direct in-domain data available are: 1) the original English COPA training set covering 400 instances and 2) validation sets in English and all target languages spanning 100 instances each.

We evaluate the following state-of-the-art pretrained multilingual encoders: 1) multilingual BERT (MBERT) (Devlin:2019bert) and XLM-on-RoBERTa (XLM-R) (conneau2019unsupervised) in the standard fine-tuning regime (i.e., their parameters are fine-tuned together with the task classifier’s parameters), and 2) multilingual Universal Sentence Encoder (USE) (yang2019multilingual) in the feature-based regime (i.e., its parameters are fixed during the task classifier’s training). Both MBERT and XLM-R include all XCOPA languages in their pretraining data spanning 100 languages, except for Haitian Creole and Quechua. Multilingual USE was trained on 16 languages, covering it, th, tr, and zh from the XCOPA language sample.

4.1 Multiple–Choice Classification

COPA and XCOPA are multiple–choice classification tasks: given a premise and a prompt (cause or result), the goal is to select the more meaningful of the two choices (see Table 1). Due to training data scarcity in COPA, we probe the usefulness of first “pretraining” the classifier on larger multiple–choice English commonsense reasoning datasets, such as SocialIQa (SIQA; Sap:2019siqa) or WinoGrande Sakaguchi:2020aaai. As different multiple–choice selection tasks differ in the number of choices (e.g., there are 2 possible answers in COPA, whereas there are 3 in SIQA), a classifier with a fixed number of classes is not a good fit for this scenario. We thus follow Sap:2019siqa and couple the (pretrained) encoder with a feed-forward network which produces a single scalar score for each of the possible answers. The scores for individual choices are then concatenated and passed to the softmax function.

Encoder Input. For each instance, we couple each of the answer choices with the concatenation of the premise and the prompt and feed that as a “sentence” pair input to MBERT and XLM-R, or as a single “sentence” to USE [8].

[8] For MBERT and XLM-R, we insert the standard special tokens. For example, for the last example from Table 1 and Choice 1, the input for MBERT would be as follows: [CLS] My eyes became red and puffy. What was the cause? [SEP] I was sobbing. [SEP]
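The input construction can be illustrated with plain string handling. The sketch below is an assumption-laden illustration: the function names and the "cause"/"effect" keys are hypothetical, the question strings are the two prompts quoted in §2, and in practice the special tokens would normally be inserted by the encoder's own tokenizer rather than by hand.

```python
# The two prompts quoted in the dataset description; the dictionary keys are hypothetical.
QUESTIONS = {
    "cause": "What was the cause?",
    "effect": "What happened as a result?",
}


def build_encoder_inputs(premise, question_type, choices):
    """Pair the premise+prompt concatenation with each answer choice."""
    first_segment = f"{premise} {QUESTIONS[question_type]}"
    return [(first_segment, choice) for choice in choices]


def as_bert_string(pair):
    """Render a sentence pair with BERT-style special tokens, for display only."""
    first_segment, second_segment = pair
    return f"[CLS] {first_segment} [SEP] {second_segment} [SEP]"


pairs = build_encoder_inputs(
    "My eyes became red and puffy.",
    "cause",
    ["I was sobbing.", "I was laughing."],
)
for pair in pairs:
    print(as_bert_string(pair))
```

For a single-"sentence" encoder such as USE, the two segments of each pair would simply be joined into one string.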

Classifier Head. Let $a_i$ be the $i$-th answer choice of an instance of a multiple-choice dataset (i.e., $i \in \{1, 2\}$ in COPA and $i \in \{1, 2, 3\}$ in SIQA) and let $\mathbf{x}_i \in \mathbb{R}^{H}$ (with $H$ as the vector size of the encoder) be the encoding of its corresponding input consisting of the premise, prompt and the answer itself, as explained above [9]. The predicted score $s_i$ for the answer $a_i$ is then obtained by passing $\mathbf{x}_i$ through a feed-forward net with trainable weight and bias parameters. We obtain the score for each answer and concatenate them into a prediction vector $\mathbf{s} = [s_1, \dots, s_N]$ to which we apply a softmax normalisation: $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{s})$, where $N$ is the number of answers in the multiple–choice selection dataset. The loss for the training instance is then the standard cross-entropy classification loss.

[9] For MBERT and XLM-R, $\mathbf{x}_i$ is the transformer representation of the sequence start token. For USE, $\mathbf{x}_i$ is the average of contextualised vectors of all tokens.
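The scoring scheme can be sketched in pure Python. This is a hedged illustration, not the paper's exact architecture: the single tanh hidden layer, all parameter names, and the toy values are assumptions. What it does show is the key design point, namely that one shared scorer produces a scalar per answer, so the same head works for any number of choices.

```python
import math


def score(x, w_hidden, b_hidden, w_out, b_out):
    """Scalar score for one encoded input (assumed: one tanh hidden layer)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def softmax(scores):
    """Numerically stable softmax over the per-answer scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, gold_index):
    """Standard cross-entropy loss for the gold answer."""
    return -math.log(probs[gold_index])


# Hypothetical toy encodings (H = 2) for the two answers of one instance.
encodings = [[1.0, 0.0], [0.0, 1.0]]
w_hidden = [[1.0, 0.0], [0.0, 1.0]]
b_hidden = [0.0, 0.0]
w_out = [1.0, -1.0]
b_out = 0.0

scores = [score(x, w_hidden, b_hidden, w_out, b_out) for x in encodings]
probs = softmax(scores)
loss = cross_entropy(probs, gold_index=0)
print(scores, probs, loss)
```

Because the scorer is applied independently to each answer encoding, the same parameters can be trained on three-way SIQA instances and then reused on two-way COPA instances.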

Experimental Setup. We rely on the following pretrained Transformer-based encoder configurations: multilingual BERT (Base), XLM-R (Base), and multilingual USE (Large). We evaluate them in different transfer learning setups based on 1) different sources of training data: SIQA [10], COPA, or both; and 2) different model selection regimes for hyper-parameter tuning and early stopping (based on English or target language validation set). The resulting combinations are shown in Table 4.

[10] The SIQA dataset is similar in nature to COPA (i.e., it is a multiple–choice dataset for commonsense reasoning about social interactions, with open-format prompts and three answer choices). It comes with a much larger training set, consisting of 33K instances, and can therefore provide a useful learning signal also for causal commonsense reasoning in XCOPA.

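| Setup | Train dataset: SIQA | Train dataset: COPA | Model selection: en | Model selection: target |
|---|---|---|---|---|
| CO-ZS | | ✓ | ✓ | |
| CO-TLV | | ✓ | | ✓ |
| SI-ZS | ✓ | | ✓ | |
| SI+CO-ZS | ✓ | ✓ | ✓ | |
| SI+CO-TLV | ✓ | ✓ | | ✓ |

Table 4: Different fine-tuning and transfer setups. CO=COPA; SI=SIQA; ZS=Zero-Shot; TLV=Target Language Validation (Set).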

4.2 Results and Discussion

| Setup | Model | All | MBERT ∩ XCOPA | USE ∩ XCOPA |
|---|---|---|---|---|
| CO-ZS | XLM-R | 55.6 | 56.9 | 55.4 |
| | MBERT | 54.1 | 54.4 | 55.7 |
| | USE | 54.7 | 56.0 | 58.1 |
| CO-TLV | XLM-R | 55.1 | 56.4 | 55.2 |
| | MBERT | 54.2 | 54.5 | 55.8 |
| | USE | 54.8 | 55.4 | 59.0 |
| SI-ZS | XLM-R | 60.1 | 62.3 | 62.9 |
| | MBERT | 54.7 | 55.6 | 56.4 |
| | USE | 55.0 | 56.4 | 60.1 |
| SI+CO-ZS | XLM-R | 59.0 | 60.7 | 61.9 |
| | MBERT | 55.8 | 56.8 | 57.9 |
| | USE | 54.1 | 54.9 | 58.9 |
| SI+CO-TLV | XLM-R | 60.7 | 63.5 | 63.6 |
| | MBERT | 54.4 | 54.8 | 54.2 |
| | USE | 54.3 | 55.2 | 59.1 |

Table 5: Summary of XCOPA results. All: average over all 11 XCOPA languages; MBERT ∩ XCOPA: average over 9 XCOPA languages (without ht and qu) included in MBERT and XLM-R pretraining; USE ∩ XCOPA: average over 4 XCOPA languages (it, th, tr, and zh), included in the USE pretraining.

Table 5 shows the aggregate accuracy of MBERT, XLM-R and USE over 11 XCOPA languages for each of the previously described training setups from Table 4. We first compare our XCOPA results with the English COPA performance of the monolingual English BERT (Base) reported by Sap:2019siqa: 63% accuracy in COPA-only fine-tuning and 80% after sequential SIQA+COPA fine-tuning, which is approximately 7% (COPA-only) and 17% (SIQA + COPA) better than our best average XCOPA performances in the respective setups. This contributes to recent suspicions cao2020multilingual; Hu:2020arxiv that massively multilingual pretrained transformers do not offer a completely satisfactory solution for language transfer.

Figure 1: Per-language XCOPA results for XLM-R, MBERT, and USE in the SIQA + COPA-TLV setup. Striped bars correspond to language-model pairs where the language was not included in model pretraining.

XLM-R outperforms MBERT and USE in all setups, but the gains are pronounced only in setups in which the models were first fine-tuned on SIQA (SI-ZS, SI+CO-ZS, and SI+CO-TLV). USE outperforms MBERT surprisingly often. This might have been expected in the COPA-only setups (CO-ZS and CO-TLV) where the small COPA training set is insufficient to meaningfully fine-tune MBERT transformer parameters. However, the finding that MBERT does not benefit more than USE from prior SIQA training is surprising and warrants further investigation. What is more, USE in some setups even outperforms MBERT for some of the languages (e.g., id, ta, sw) on which MBERT was pretrained and USE was not (cf. the scores in the MBERT ∩ XCOPA column). We speculate that this is due to the combination of two effects: (1) the infamous “curse of multilinguality” (conneau2019unsupervised) is much more pronounced for MBERT (which is pretrained on 104 languages) than for USE, pretrained on only 16 languages; and (2) there are subword-level similarities between XCOPA target languages and the 16 languages used in USE pretraining.

We also note that training models only on SIQA yields performance that is comparable (and for MBERT and USE often better) to performance we obtain with additional COPA training (setups SI + CO-ZS and SI + CO-TLV). While this is in part due to the limited size of the COPA training set, it confirms the assumption that SIQA and COPA are highly compatible tasks. We also note that only slight gains are achieved by hyper-parameter tuning on the target language validation set (TLV).

In Figure 1, we display per-language performance in the best setup, SIQA + COPA-TLV, while detailed results for all other setups are available in Appendix D. As expected, all models fluctuate around random-level performance on out-of-sample languages, ht and qu. For all other languages, XLM-R outperforms MBERT. Surprisingly, we also observe that for some languages (id, vi, zh) performance of transfer from English is slightly higher than the actual performance in English, without transfer. Another observation is that the transfer performance is often better for some languages typologically distant from English than for languages closer to English (e.g., th, vi, zh versus it). This might be partially due to good representation of languages such as zh and th in the pretrained models due to their large training data and very specific scripts.

4.3 Adaptation to Unseen Languages

Even massively multilingual pretrained encoders like MBERT and XLM-R, pretrained on corpora of over 100 languages, cover only a fraction of the world’s 7000+ languages. Pretraining a multilingual encoder covering the majority of the world’s languages is infeasible: we thus explore several resource-lean approaches for extending it post-hoc to support transfer to languages not observed during pretraining, such as qu and ht in XCOPA.

Adaptation Strategies. We rely on XLM-R, as the best-performing multilingual encoder in XCOPA evaluation (see Figure 1) and probe several strategies for adapting it to the two unseen XCOPA target languages. In all strategies, we simply continue training the XLM-R model via masked language modeling (MLM) on different combinations of data, and in particular:

1) T. Sentences in the target language. We create the monolingual corpora for ht and qu by concatenating their respective Wikipedia dumps with their respective text from the JW300 corpus agic-vulic-2019-jw300. In total, the training size is 5,710,426 tokens for ht, and 2,263,134 tokens for qu.

2) S. Sentences in English (en). This could prevent (catastrophic) forgetting of the source language while fine-tuning, which presumably may occur with T only. We create the English corpus of comparable size to ht and qu corpora by randomly sampling 200K sentences from en Wikipedia.

3) D. A bilingual en–ht and en–qu dictionary. The dictionaries were extracted from PanLex kamholz-etal-2014-panlex: we retain the 5k most reliable word translation pairs according to the available PanLex confidence scores. We create a synthetic corpus from the dictionary (termed D-corpus henceforth) by concatenating each translation pair from the dictionary into a quasi-sentence.

4) T-REP. T data with all occurrences of target language terms from the 5K dictionary replaced with their English translations.

We select 5k target language sentences as the development corpus and use it for early stopping of the MLM training (i.e., we measure perplexity).
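The two dictionary-based corpora (D and T-REP) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the order of terms inside a quasi-sentence (target term first) and whitespace tokenisation with lowercasing are guesses rather than details given in the text, and the toy dictionary uses made-up placeholder words instead of real ht or qu entries.

```python
def build_d_corpus(dictionary):
    """One quasi-sentence per translation pair (assumed order: target, English)."""
    return [f"{target} {english}" for target, english in dictionary.items()]


def build_t_rep(sentences, dictionary):
    """Replace target-language terms found in the dictionary with English.

    Assumes whitespace tokenisation and case-insensitive lookup; punctuation
    attached to a token is not handled in this sketch.
    """
    replaced = []
    for sentence in sentences:
        tokens = [dictionary.get(tok.lower(), tok) for tok in sentence.split()]
        replaced.append(" ".join(tokens))
    return replaced


# Placeholder entries: "aaa" and "bbb" stand in for target-language words.
toy_dictionary = {"aaa": "house", "bbb": "dog"}
toy_sentences = ["aaa ccc bbb", "ddd Aaa"]

d_corpus = build_d_corpus(toy_dictionary)
t_rep = build_t_rep(toy_sentences, toy_dictionary)
print(d_corpus)
print(t_rep)
```

Either corpus would then be fed to the same masked language modeling objective as the plain target-language sentences, so the adaptation procedure itself stays unchanged across variants.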

4.4 Results and Discussion

| Setup | Model | ht | qu |
|---|---|---|---|
| CO-ZS | XLM-R | 49.4 | 50.7 |
| | +T | 53.8 | 49.8 |
| | +S+T | 52.8 | 54.0 |
| | +D | 52.2 | 51.2 |
| | +S+T+D | 53.6 | 52.0 |
| | +T-REP | 49.6 | 55.0 |
| SI-ZS | XLM-R | 49.2 | 51.0 |
| | +T | 56.2 | 57.9 |
| | +S+T | 55.2 | 55.0 |
| | +D | 55.4 | 57.4 |
| | +S+T+D | 56.4 | 53.5 |
| | +T-REP | 58.6 | 57.7 |
| SI+CO-ZS | XLM-R | 51.4 | 51.2 |
| | +T | 57.8 | 54.0 |
| | +S+T | 55.8 | 55.2 |
| | +D | 57.8 | 57.9 |
| | +S+T+D | 55.4 | 54.0 |
| | +T-REP | 58.4 | 54.4 |

Table 6: XCOPA accuracy scores of different transfer variants that adapt to out-of-sample languages.

The performance of the five adaptation variants with XLM-R on ht and qu in the zero-shot XCOPA evaluation setups is reported in Table 6. When using sufficiently large fine-tuning datasets (SI-ZS and SI+CO-ZS setups), all adaptation methods yield substantial improvements over the base XLM-R model. The improvements are less consistent in the CO-ZS setup. However, we attribute this to the limited size of the English COPA dataset (only 400 instances) used for fine-tuning rather than to the ineffectiveness of the adaptation strategies. A comparison between XLM-R+T and XLM-R+S+T suggests that additional MLM pretraining on a moderately sized target language corpus does not lead to forgetting of the source language information.

The results of the light-weight post-hoc XLM-R adaptations for ht and qu are quite encouraging [11], as they bypass retraining the encoder from scratch while achieving downstream results almost comparable with seen languages. Moreover, the results from Table 6 suggest that leveraging additional knowledge from a general bilingual dictionary can lead to further benefits: e.g., note the results of XLM-R+T-REP in the SI-ZS and SI+CO-ZS transfer setups. Further, the results with the most resource-lean method (XLM-R+D) also reveal positive trends. Further adaptation strategies (ponti2019towards) and downstream tasks warrant future investigation.

[11] Note that the unseen languages, however, must rely on seen scripts (e.g., both ht and qu are written in Latin script).

5 Related Work

Evaluation of Commonsense Reasoning. Besides COPA, another important early dataset that instigated computational modeling of commonsense reasoning is the Winograd Schema Challenge (WSC) (Levesque:2012wsc; Morgenstern:2015aaai). WSC targets a pronoun coreference resolution task with paired instances, and has been recently expanded into the WinoGrande dataset (Sakaguchi:2020aaai) through crowd-sourcing, now spanning 44k paired instances.

Recent advances across a range of NLP tasks driven by large pre-trained language models (Wang:2019neurips; Ruder:2019naacl) have spurred further interest in this area as a way to probe their reasoning abilities. Some evaluation sets target particular well-defined aspects of commonsense, such as the physical aspect (Bisk:2020aaai), abductive reasoning (Bhagavatula:2020iclr) [12], intents and reactions to events (Rashkin:2018acl), social interactions (Sap:2019siqa), or visual commonsense (Zellers:2019cvpr). Other recent datasets such as CommonsenseQA (Talmor:2019naacl), SWAG (Zellers:2018emnlp), and HellaSWAG (Zellers:2019acl) are cast as open-ended multiple-choice problems where the system is expected to choose the most sensible option. Another line of evaluation targets commonsense-enabled reading comprehension and question answering (Ostermann:2018lrec; Zhang:2018arxiv; Huang:2019emnlp).

[12] Abductive reasoning is inference to the most plausible explanation of incomplete observations (Peirce:1960book).

Multilingual Evaluation of Natural Language Understanding. While all these datasets for commonsense reasoning are limited to English, several multilingual datasets for other natural language understanding tasks are available. These include lexical semantic similarity (Multi-SimLex) Vulic:2020multisimlex, document classification (MLDoc) Schwenk:2018lrec, sentiment analysis Barnes:2018acl, and natural language inference (XNLI) Conneau:2018emnlp. Other recent multilingual datasets target the QA task based on reading comprehension. MLQA Lewis:2019arxiv includes 7 languages; XQuAD Artetxe:2019arxiv spans 10 languages; TyDiQA tydiqa covers 11 typologically diverse languages. Further, PAWS-X Yang:2019emnlp evaluates paraphrase identification over 6 languages. A standard and pragmatic approach to multilingual dataset creation is translation from an existing (English) dataset: Multi-SimLex was created starting from the extended English SimLex-999 Hill:2015cl, XNLI from MultiNLI Williams:2018naacl, XQuAD from SQuAD Rajpurkar:2016emnlp, and PAWS-X from PAWS Zhang:2019naacl. On the other hand, TyDiQA was built independently in each language. Finally, a large number of tasks have recently been integrated into a unified multilingual evaluation suite, XTREME Hu:2020arxiv.

6 Conclusion

We presented the Cross-lingual Choice of Plausible Alternatives (XCOPA), a multilingual evaluation benchmark for causal commonsense reasoning. All XCOPA instances are aligned across 11 languages, which enables cross-lingual comparisons. The language selection was informed by variety sampling, in order to maximise the diversity in terms of typological features, geographical macro-area, and family. This allows for assessing the robustness of machine learning models to an array of rare phenomena displayed by the languages in the sample.

We also ran a series of cross-lingual transfer experiments, evaluating state-of-the-art transfer methods based on multilingual pretraining and fine-tuning on English. We observed that, although these methods perform better than chance, they still lag significantly behind the monolingual supervised learning setting. Overall, the scores are held down by the ‘curse of multilinguality’, the need to account for a wide sample of languages in pretraining. In addition, transfer performance seems to depend less on the distance from the source language than on the abundance of monolingual target-language data available for multilingual pretraining. Finally, we investigated how to adapt pretrained multilingual models to new out-of-sample languages in resource-lean scenarios where only a small monolingual corpus and/or a bilingual English–target dictionary are available, with notable gains reported.

We hope that this new challenging evaluation set will foster further research in multilingual commonsense reasoning and cross-lingual transfer.


This work is supported by the ERC Consolidator Grant LEXICAL (no 648909). EMP, IV, and AK are also funded through the Google Faculty Research Award 2018 for Natural Language Processing. GG is supported by the Eliteprogramm of the Baden-Württemberg Stiftung (AGREE Grant). Heartfelt thanks to Ulla Meeri Petti and Fangyu Liu for their invaluable help.


Appendix A Detailed Translation Guidelines

Translation of the English COPA validation and test set instances into each of the 11 languages was carried out by a single translator per language, meeting the following eligibility criteria: (i) a native speaker of the target language, (ii) fluent in English, (iii) with minimum undergraduate education level. Each translator was presented with translation guidelines and a spreadsheet accessible online, containing one English premise-hypothesis triple per line, followed by an empty line where target translations were entered. The task consisted of (a) identifying the correct alternative for the English premise and (b) translating the premise and both alternatives into the target language, preserving the causal relations present in the original (see §3 for discussion of ambiguous and problematic cases). Each translator worked independently (using any external resources, such as English-target language dictionaries, if needed) and completed the task in its entirety, producing 100 validation and 500 test instance translations, and a label for each. To ensure the output preserves the lexical, temporal, and causal relations present in the original triples, the guidelines instructed translators to:

  1. maintain the original correspondence relations between lexical items, i.e., if the same English word appeared both in the premise and the alternatives (Premise: The friends decided to share the hamburger.; A1: They cut the hamburger in half.; A2: They ordered fries with the hamburger.), it was translated into the same target-language equivalent in all three translated sentences;

  2. ensure that the original chronology and temporal extension of events is preserved through appropriate choice of verbal tense and aspect in the target language, e.g., maintaining the distinction between perfective and imperfective aspect (Premise: My eyes became red and puffy. [perf]; A1: I was sobbing. [imperf]; A2: I was laughing. [imperf]; see §3 for discussion of the challenges posed by tenseless languages);

  3. in case of English words with no exact translations in the target language or referring to concepts absent from the target language culture (e.g., peach), the following solutions were to be adopted, in order of preference: (1) using a common loanword from another language, provided it is understood by the general population of target-language speakers; (2) using a periphrasis to describe the same concept (e.g., a juicy fruit); (3) substituting the original concept with a similar one that is more familiar to the target language speaker community (e.g., santol), provided that it can play a similar role in the causal relations captured by the original premise-hypothesis triple.

The translators were encouraged to split the workload into multiple sessions with breaks in between. On average, the task took 20 hours of work to complete. Additionally, translators were encouraged to provide feedback, commenting on translation challenges and chosen solutions, which we discuss in §3.
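The lexical-correspondence guideline above lends itself to a simple automatic sanity check. The following is a minimal sketch, not part of the original annotation pipeline: the `Triple` class, the `inconsistent_terms` helper, the glossary, and the Italian sentences are all hypothetical illustrations (they are not taken from XCOPA), and plain substring matching will miss inflected forms in morphologically rich languages.

```python
from dataclasses import dataclass


@dataclass
class Triple:
    """A COPA-style instance: one premise and two alternatives (hypothetical container)."""
    premise: str
    alt1: str
    alt2: str

    def sentences(self):
        return [self.premise, self.alt1, self.alt2]


def inconsistent_terms(source, target, glossary):
    """Return English terms that recur in the source triple but whose chosen
    target-language equivalent does not occur in the same sentences.

    `glossary` maps a lowercase English word to the lowercase target-language
    string chosen by the translator (an assumption made for this sketch).
    """
    flagged = []
    for en_word, tgt_word in glossary.items():
        src_hits = [en_word in s.lower() for s in source.sentences()]
        tgt_hits = [tgt_word in s.lower() for s in target.sentences()]
        # Only words repeated across the source sentences are constrained.
        if sum(src_hits) > 1 and src_hits != tgt_hits:
            flagged.append(en_word)
    return flagged


english = Triple(
    "The friends decided to share the hamburger.",
    "They cut the hamburger in half.",
    "They ordered fries with the hamburger.",
)
# Hypothetical Italian renderings, written only for this example.
consistent = Triple(
    "Gli amici hanno deciso di dividere l'hamburger.",
    "Hanno tagliato l'hamburger a metà.",
    "Hanno ordinato le patatine con l'hamburger.",
)
inconsistent = Triple(
    "Gli amici hanno deciso di dividere l'hamburger.",
    "Hanno tagliato il panino a metà.",
    "Hanno ordinato le patatine con l'hamburger.",
)

print(inconsistent_terms(english, consistent, {"hamburger": "hamburger"}))
print(inconsistent_terms(english, inconsistent, {"hamburger": "hamburger"}))
```

A check of this kind can only flag candidates for manual review; it cannot judge whether a deviation was a deliberate and legitimate translation choice.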

Figure 2: Heatmap of the entropy of the distributions of WALS features (x axis) in language samples from famous cross-lingual datasets outlined in §5 (y axis).
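As a rough illustration of the quantity plotted in Figure 2, the sketch below computes the Shannon entropy of one typological feature over a language sample. It assumes entropy in bits over the empirical distribution of feature values; the exact normalisation used for the figure is not stated here, and the two toy samples are hypothetical rather than real WALS data.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of feature values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical word-order values for two toy language samples (not real WALS data).
uniform_sample = ["SOV", "SVO", "VSO", "SOV", "SVO", "VSO"]
skewed_sample = ["SVO", "SVO", "SVO", "SVO", "SVO", "SOV"]

print(feature_entropy(uniform_sample))
print(feature_entropy(skewed_sample))
```

Under this definition, a sample whose languages spread evenly over the values of a feature scores higher than one dominated by a single value, which is the sense in which higher entropy indicates a more typologically diverse sample.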
Setup      Model  en    et    ht    id    it    qu    sw    ta    th    tr    vi    zh
CO-ZS      XLM-R  57.6  59.8  49.4  58    56    50.7  57.2  56.6  52.8  56.2  58.5  56.6
CO-ZS      MBERT  62    50.6  51.4  55    53.8  54.7  53.6  52    53.2  56.8  55.4  59
CO-ZS      USE    63    53.8  49.4  57.6  60    48.3  52.2  53    57.2  55    54.8  60.2
CO-TLV     XLM-R  57.6  57.8  48.6  60.8  54.4  49.5  55.4  55.8  54.2  54.8  57.6  57.2
CO-TLV     MBERT  62    52    52.6  58.2  55    52.7  53    52    52.4  53.8  52.6  61.8
CO-TLV     USE    63    49.4  49.6  57.6  62    54    50.8  53.6  58.6  56.2  51.4  59.2
SI-ZS      XLM-R  68    59.4  49.2  67.2  63.6  51    57.6  58.8  61.6  60.4  65.8  66
SI-ZS      MBERT  62.2  55.2  51.4  57    57    50.2  51    52.2  51    53.2  59.2  64.4
SI-ZS      USE    62.6  51.6  46.8  60.2  61.8  50.5  52.4  48.8  60.8  54.6  54.8  63
SI+CO-ZS   XLM-R  66.8  58    51.4  65    60.2  51.2  52    58.4  62    56.6  65.6  68.8
SI+CO-ZS   MBERT  63.2  52.2  54    59.4  57.2  48    56    54.6  51.2  57.4  58    65.6
SI+CO-ZS   USE    63.8  51.2  48.4  57.6  61.8  52    51.8  47    58    55.6  51    60.2
SI+CO-TLV  XLM-R  66.8  59.4  50    71    61.6  46    58.8  60    63.2  62.2  67.6  67.4
SI+CO-TLV  MBERT  63.2  52.2  51.8  58.2  57.2  53    51    57.2  52.6  54.6  57.8  52.4
SI+CO-TLV  USE    63.8  51.8  47.8  56.6  61.6  52.2  52.4  47    59.8  54.4  52.8  60.6
Table 7: Detailed per-language XCOPA results. None of the models was exposed to ht and qu in pretraining. USE was exposed in pretraining only to it, th, tr, and zh.

Appendix B Why is Grammatical Tense Problematic for XCOPA?

The scenarios included in COPA refer to events that took place in the past and are formulated in what can be described as a narrative register (one of the sources from which question topics were drawn was a corpus of personal stories published online gordon2009identifying). This is grammatically rendered exclusively by means of past simple (preterite) or past continuous (imperfect) verb forms. Temporal anteriority of a hypothesis sentence with respect to the premise is not grammatically marked (e.g., with a past perfect verb form) and can only be deduced based on the prompt (“What was the cause of this?”). The preterite-imperfect contrast used in English to distinguish background states (imperfective) from the main event (perfective) (e.g., I was expecting company. [imperf] vs. I tidied up my house. [perf]) is not universally applicable, and different languages employ different discourse grounding strategies hopper1979aspect, which has interesting implications for the multilingual extension of COPA to XCOPA.

In the languages with grammatical tense, different strategies are employed to capture the perfective-imperfective distinction, which is prominent in COPA. For example, in Haitian Creole, the simple past marker te is used to indicate a bounded event in the past, while the continuous aspect is signaled with an ap marker. Italian additionally distinguishes between two perfective past tenses, expressed by means of a simple and compound past (vidi - ho visto, ‘I saw’). The opposition is between completed actions whose effects are detached from the present and those with persisting effects on the present. Both contrast with the imperfect, which emphasises the event’s extension or repetition in time. Given that the opposition is a matter of the speaker’s perspective on events rather than based on deixis (remote versus proximate past), the translator opted for the most natural choice given a specific context/situation.

Appendix C Hyper-Parameter Search

For MBERT and XLM-R, we searched a hyperparameter grid in both SIQA and COPA training over the learning rate, the dropout rate (applied to the output layer of the transformer and the hidden layer of the feed-forward scoring net), and the batch size. For USE, we searched over different values for the learning rate. We evaluate the performance on the respective development set at a fixed interval of updates (set separately for SIQA and COPA) and stop the training if there is no improvement over a set number of consecutive evaluations. In all setups, we optimize the parameters with the Adam algorithm kingma2015adam (no weight decay or warmup) and clip the norm of the gradients in each update.
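The search procedure described in this appendix can be sketched as a grid enumeration combined with patience-based early stopping. The code below is an illustrative sketch only: the grid values, the patience, and the simulated development scores are hypothetical placeholders, not the settings used in the experiments, and the helper names `grid_configs` and `early_stopping_step` are introduced here for illustration. Gradient clipping and the optimizer itself are left out.

```python
from itertools import product

# Hypothetical grid values; the values used in the paper are not reproduced here.
GRID = {
    "learning_rate": [1e-5, 5e-6],
    "dropout": [0.0, 0.1],
    "batch_size": [8, 16],
}


def grid_configs(grid):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(grid)
    for combo in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, combo))


def early_stopping_step(dev_scores, patience):
    """Return the index of the evaluation at which training stops.

    Training stops once `patience` consecutive evaluations fail to improve on
    the best development score seen so far. If that never happens, the index
    of the last evaluation is returned.
    """
    best = float("-inf")
    bad = 0
    for i, score in enumerate(dev_scores):
        if score > best:
            best, bad = score, 0
        else:
            bad += 1
            if bad >= patience:
                return i
    return len(dev_scores) - 1


configs = list(grid_configs(GRID))
print(len(configs), configs[0])

# Simulated development accuracies at successive evaluation points.
print(early_stopping_step([0.50, 0.60, 0.59, 0.58, 0.70], patience=2))
```

In a real run, each configuration would be trained with periodic development-set evaluation, and the configuration with the best development score at its stopping point would be retained.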

Appendix D Full Results (Per Language)

Table 7 contains the detailed per-language results for all XCOPA languages and all five of our evaluation setups (CO-ZS, CO-TLV, SI-ZS, SI+CO-ZS, SI+CO-TLV).
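To make the per-language numbers easier to compare, one can aggregate a row of Table 7 over target languages, with and without the two languages unseen in pretraining. The sketch below does this for the XLM-R row of the SI-ZS setup. The scores are copied from Table 7, but the column order (English first, then the target languages alphabetically by ISO code) is an assumption, and `mean_accuracy` is a hypothetical helper.

```python
# XLM-R accuracies in the SI-ZS setup, copied from Table 7.
# Assumption: columns are ordered en first, then targets alphabetically by ISO code.
LANGS = ["en", "et", "ht", "id", "it", "qu", "sw", "ta", "th", "tr", "vi", "zh"]
XLMR_SI_ZS = [68, 59.4, 49.2, 67.2, 63.6, 51, 57.6, 58.8, 61.6, 60.4, 65.8, 66]
UNSEEN = {"ht", "qu"}  # not seen by any model in pretraining (Table 7 caption)


def mean_accuracy(langs, scores, exclude=()):
    """Average the scores of all languages that are not in `exclude`."""
    kept = [s for lang, s in zip(langs, scores) if lang not in exclude]
    return sum(kept) / len(kept)


all_targets = mean_accuracy(LANGS, XLMR_SI_ZS, exclude={"en"})
seen_targets = mean_accuracy(LANGS, XLMR_SI_ZS, exclude={"en"} | UNSEEN)
print(f"all targets: {all_targets:.1f}, seen targets only: {seen_targets:.1f}")
```

Reporting the two averages side by side separates the effect of out-of-sample languages from the overall transfer gap relative to English.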