To measure progress in the fast-evolving area of natural language processing (NLP), several prominent benchmarking suites have been proposed, such as SentEval, GLUE, and SuperGLUE. SentEval aims to evaluate sentence encoders, while GLUE and its more challenging successor, SuperGLUE, evaluate general language understanding. Both GLUE and SuperGLUE contain public training sets and private test sets that can be assessed through an evaluation server giving separate scores for each of the tasks and an overall single-number score. The tasks in SuperGLUE are diverse and comprise question answering (QA), natural language inference (NLI), coreference resolution, and word sense disambiguation (WSD). Non-expert humans evaluated all the tasks to provide a human performance baseline.
Despite recent criticism of how the evaluation scores in these benchmarks are computed (e.g., the arithmetic mean of separate metrics is used for all tasks, regardless of their complexity and the sizes of their testing sets) [9, 5], there is little doubt that the datasets contained in the benchmarks contribute significantly to the progress of NLP. Unfortunately, most of the research is on English, which often limits the generality of the approaches and does not address the full scale of language complexity. Still, we can observe significant efforts to make relevant datasets cross-lingual or to adapt them to more languages. Broad multilingual benchmarking suites, such as XTREME and XGLUE, provide machine-translated datasets for several relevant tasks and languages. Complementary to that, we present an effort in the direction of less-resourced languages, namely a combined machine-human translation of the SuperGLUE benchmarking suite to Slovene. We describe the difficulties arising in the translation and provide the first evaluation of the datasets using large pretrained BERT-like language models. Our evaluation encompasses monolingual, cross-lingual and multilingual approaches, a comparison of human and machine-translated parts, and a comparison of two state-of-the-art machine translation systems.
The paper is structured into five further sections. In Section 2, we present related work on non-English benchmarking suites. In Section 3, we describe the translation process and the final SuperGLUE datasets in Slovene. We describe three evaluation settings (monolingual, cross-lingual, and multilingual) in Section 4, and present the results in Section 5. Section 6 presents conclusions, limitations of our work, and ideas for further improvements. To make the paper self-contained and ease understanding of the different datasets in the SuperGLUE benchmark, we follow the SuperGLUE paper and present an example for each of the tasks in the Appendix.
2 Related work
In the area of NLP, there are several prominent benchmarking suites, the most prominent being GLUE and SuperGLUE. While most benchmarking datasets are available in English, there are a few notable attempts to extend them to other languages.
XTREME is a multi-task benchmark for evaluating cross-lingual representations across 40 languages and 9 tasks. It focuses on zero-shot cross-lingual transfer and provides large English training sets and much smaller testing sets in the target languages (from 5 to 40 human translated or checked; the rest are machine translated without human intervention). The suite covers tasks such as natural language inference (NLI), paraphrasing, part-of-speech (POS) tagging, named entity recognition (NER), question answering, and parallel sentence extraction and detection.
XGLUE is intended to evaluate cross-lingual pretrained models on 11 tasks, including NER, POS tagging, question answering, paraphrasing, ad relevance, web page relevance, question-answer matching, question generation, and news title generation. Similarly to XTREME, the suite provides training data in English and tests the cross-lingual models on testing data in several target languages (from 3 to 18).
SuperGLUE is likely the most prominent benchmarking suite for English. It follows the design of the GLUE suite, consisting of a public leaderboard for eight tasks, which are evaluated individually and jointly. The suite provides public training and development datasets, while testing data is hidden and only used to evaluate predictions submitted to the leaderboard. The benchmark includes QA, NLI, coreference resolution, and WSD tasks, for which non-expert human baselines are provided.
Following the English example, Russian SuperGLUE was created, with tasks either developed independently or manually translated from the English SuperGLUE. It contains nine tasks covering linguistic diagnostics, WSD, QA, NLI, and coreference resolution.
The Slovene language has significantly fewer resources than English or Russian and is not included in the massively cross-lingual XTREME and XGLUE benchmarks. We aimed to provide a modern natural language understanding suite for this less-resourced language using the modest resources at our disposal. For that purpose, we followed the English SuperGLUE design and translated its datasets (except WiC) to Slovene, mainly using machine translation and, in small part, human translation (120,000 words). This allows testing of monolingual, cross-lingual and multilingual approaches. Because we kept the format of the original SuperGLUE, our datasets can be evaluated with the original SuperGLUE leaderboard and compared to English baselines and the state of the art.
3 Slovene SuperGLUE translation
To evaluate cross-lingual transfer and test the specifics of morphologically rich languages, we translated the SuperGLUE datasets to Slovene. Due to limited funds, we partially used human translation (HT) and partially machine translation (MT). Altogether, approximately 120,000 words were human translated. Some datasets (BoolQ, MultiRC, ReCoRD, RTE) are too large to be fully human translated. We thus provide the ratios between the human translated and the original English sizes in Table 1. For MT from English to Slovene, we used the GoogleMT Cloud service. In our evaluation, we use six of the original eight tasks. As explained below, we excluded ReCoRD and WiC.
We decided to use only the human translated part of our test sets in our evaluations to avoid noise due to machine translation. This makes some test sets much smaller than the English test sets. We did not include ReCoRD in the Slovene benchmark due to the low quality of our test set, which consists of confusing and ambiguous examples. Further, there are differences between the English and Slovene ReCoRD tasks due to the morphological richness of Slovene. Namely, in Slovene, the correct declension of a query is often not present in the text, making it impossible to provide the right answer. Finally, similarly to WSC (discussed below), ReCoRD is also affected by the problem of translating HTML tags with GoogleMT.
The WiC task cannot be translated and would have to be conceived anew because it is impossible to transfer the same set of meanings of a given word from English to a target language. Taking the WiC example in Table 7, we note that the word board in two different contexts translates to two completely different words in Slovene (Context 1: Bivanje in hrana. Context 2: Čez okna je nabil deske.).
The WSC dataset cannot be machine-translated because it requires human assistance and verification. First, GoogleMT translations cannot handle the correct placement of HTML tags indicating coreferences. The second reason is that in Slovene coreferences can also be expressed with verbs, while coreferences in English are mainly nouns, proper names and pronouns. This makes the task different in Slovene compared to English. On the one hand, the task is more difficult in Slovene because solutions cover more types of words; on the other hand, the Slovene verbs might reveal the coreference information for some instances.
4 Evaluation settings
The SuperGLUE benchmark is extensively used to compare large pretrained models in English (https://super.gluebenchmark.com/leaderboard). In contrast, we concentrate on the Slovene translation of the SuperGLUE tasks and the new opportunities that arise from it.
We compare four BERT models available for Slovene: monolingual Slovene SloBERTa, trilingual (Croatian-Slovene-English) CroSloEngual BERT, massively multilingual mBERT (bert-base-multilingual-cased, https://huggingface.co/bert-base-multilingual-cased), and XLM-R (xlm-roberta-base, https://huggingface.co/xlm-roberta-base).
The single-number overall average score (i.e., Avg, shown in the second column of, e.g., Table 2) comprises equally weighted tasks. For English, there are eight tasks; for Slovene, there are the six translated tasks: BoolQ, CB, COPA, MultiRC, RTE, and WSC. In tasks with multiple metrics, we averaged those metrics to get a single task score. For the details on how the score is calculated for each task, see the original SuperGLUE paper.
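The two-level averaging described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function names `task_score` and `overall_score` are hypothetical, and the numbers in `example` are invented purely to show the arithmetic.

```python
from statistics import mean


def task_score(metrics):
    """Average all metrics reported for one task into a single task score."""
    return mean(metrics.values())


def overall_score(per_task_metrics):
    """Equally weighted mean of the per-task scores (the Avg column)."""
    return mean(task_score(m) for m in per_task_metrics.values())


# Hypothetical numbers for illustration only (not results from the paper).
example = {
    "BoolQ": {"accuracy": 60.0},
    "CB": {"accuracy": 70.0, "f1": 50.0},  # two metrics are averaged first
    "COPA": {"accuracy": 60.0},
}
avg = overall_score(example)
```

Note that every task contributes equally to `avg`, regardless of how many metrics it reports or how large its test set is, which is the property criticised in the works cited in the introduction.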
We test models in three settings. The monolingual setting uses the same language (Slovene or English) for training and testing. In the cross-lingual setting, we tested the cross-lingual models (CroSloEngual, mBERT, and XLM-R) and the transfer between English and Slovene datasets in both directions. In the multilingual setting, the models were trained on the combined full-size English and Slovene data. The results for the three settings, as well as a comparison of two MT systems (GoogleMT and DeepL), are reported in Section 5.
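As a compact summary of the three settings, the sketch below maps each setting to the languages used for training and for testing. The function name `languages_for_setting` and the language codes `en` and `sl` are assumptions made for this illustration; the few-shot variant of the cross-lingual setting, which adds a handful of target-language examples, is not covered.

```python
def languages_for_setting(setting, source="en", target="sl"):
    """Return (train_languages, test_languages) for one evaluation setting."""
    if setting == "monolingual":
        # Train and test on the same language.
        return [source], [source]
    if setting == "cross-lingual":
        # Zero-shot transfer: train on the source, test on the target.
        return [source], [target]
    if setting == "multilingual":
        # Train on both languages; each test language is scored separately.
        return [source, target], [source, target]
    raise ValueError(f"unknown setting: {setting}")


# Both transfer directions of the cross-lingual setting.
en_to_sl = languages_for_setting("cross-lingual", source="en", target="sl")
sl_to_en = languages_for_setting("cross-lingual", source="sl", target="en")
```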
We fine-tuned the available Slovene BERT models on SuperGLUE tasks using the Jiant tool. We used a single-task learning setting for each task and fine-tuned models for 100 epochs, with the initial learning rate of. Each model was fine-tuned using the dataset corresponding to one of the three settings.
5 Results
We first report the results for each of the three settings (monolingual, cross-lingual, and multilingual) separately. We end the section with a comparison between human and machine-translated data and a comparison of machine translation systems.
5.1 Monolingual results
Our monolingual analysis compares different Slovene prediction models on the complete datasets, which consist of the existing human translated instances, with the remaining instances machine translated. Table 2 shows the results together with several baselines trained on the original English datasets. Some comparisons to English baselines are questionable, as the Slovene models are trained on only a small fraction of better quality HT data (BoolQ, MultiRC) and tested on smaller sets of HT test data. In the case of the BERT++ model, the English model was additionally pretrained with transfer learning tasks similar to the target one (CB, RTE, BoolQ, COPA). In terms of datasets, a fair comparison is possible only for CB, COPA, and WSC.
Considering the Avg scores in Table 2, the monolingual SloBERTa is the best performing Slovene model. On average, all Slovene BERT models perform better than the Most Frequent baseline. Concerning individual tasks, none of the Slovene models exceeds the Most Frequent baseline in the MultiRC task. SloBERTa was significantly better than the rest of the models in CB, COPA, and WSC, while XLM-R was the best on BoolQ.
Compared to English models, the best Slovene model (SloBERTa) achieved better results on WSC. It seems that none of the English models learned anything from WSC (they are below the Most Frequent baseline), but the SloBERTa model achieved the score of 73.3 (the Most Frequent baseline gives 65.8). The success of SloBERTa on WSC might stem from the morphology of Slovene verbs, which include the information on the gender; this information is beneficial in coreference resolution and makes some instances easier in Slovene than in English. Nevertheless, there is still a large gap compared to human performance. All Slovene models showed good performance on CB and fell between English CBoW and BERT.
5.2 Cross-lingual results
In the cross-lingual scenario, we tested the three multilingual BERT models (mBERT, CroSloEngual, XLM-R) and the transfer between English and Slovene datasets (in both directions). For Slovene as the source language, we used the available human translated examples. To make the comparison balanced, we used only the same examples from the English datasets. We tested both zero-shot transfer (no training data in the target language) and few-shot transfer. In the few-shot training, we used 10 additional examples from the target language for each task. We randomly sampled these 10 examples five times and reported the averages to obtain more statistically reliable results. The fine-tuning hyperparameters are the same as in the monolingual setup.
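The few-shot protocol described above (10 extra target-language examples, resampled five times, with the scores averaged) can be sketched as follows. This is only an illustration under stated assumptions: `few_shot_average` is a hypothetical helper, and `train_and_eval` stands in for whatever routine fine-tunes a model on the given training examples and returns its test score; neither is part of the Jiant toolkit.

```python
import random
from statistics import mean


def few_shot_average(source_train, target_pool, train_and_eval,
                     k=10, repeats=5, seed=0):
    """Average the score over several random draws of k target-language examples."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    scores = []
    for _ in range(repeats):
        # Draw k distinct target-language examples without replacement.
        extra = rng.sample(target_pool, k)
        # Train on the source data plus the few target examples, then score.
        scores.append(train_and_eval(source_train + extra))
    return mean(scores)


# Toy usage: the stand-in "score" is just the size of the training set.
toy_source = list(range(100))
toy_target = list(range(100, 150))
toy_result = few_shot_average(toy_source, toy_target, len)
```

Setting `k=0` reduces the procedure to the zero-shot case, in which the model sees no target-language training data.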
The results are presented in Table 3. Averaged over all tasks, some models improved over the Most Frequent baseline. In general, the models were quite unsuccessful on BoolQ, MultiRC, and WSC but showed some promising results on COPA, RTE, and especially CB. Additional training examples in the few-shot scenario brought some visible improvements. It seems that models perform better in the English-Slovene direction than vice versa. The best performing model is XLM-R, followed by CroSloEngual BERT and mBERT.
The low overall performance can be explained by the low number of training examples in the source language. If we take a closer look at specific models, we can observe that XLM-R shows very good results on CB in both directions. CroSloEngual BERT achieved a similarly outstanding result on COPA. It is also the only model that learned something on this dataset.
We can conclude that for the difficult SuperGLUE benchmark, the cross-lingual transfer is challenging but not impossible. In the future, we plan to expand the current set of experiments in several directions. First, we will train English models on the complete SuperGLUE datasets and transfer them to Slovene human and machine-translated datasets. Second, we will train Slovene models on the combined machine and human translated datasets and transfer them to complete English datasets. Finally, we will also combine training for several tasks and test transfer learning scenarios.
5.3 Multilingual results
In the multilingual setting, the three multilingual BERT models (CroSloEngual BERT, mBERT, and XLM-R) were trained on the combined full-size English and Slovene data. The Slovene data comprises human-translated and machine-translated data. For Slovene, the models are tested only on HT data. The results are reported in Table 4.
Interestingly, the best scores for all tasks were achieved when testing on the English testing sets (the training sets were identical and comprised both languages). This might be due to the smaller amount of data used in pretraining the Slovene BERT models and to the quality of the Slovene translation.
The overall best model with the highest Avg score is mBERT (best in BoolQ, MultiRC and RTE), followed by XLM-R (best in CB and WSC) and CroSloEngual BERT (best in COPA). In MultiRC and WSC, all models lag behind the Most Frequent baseline.
5.4 Comparing human and machine translation
To test the difference between human and machine translation, we repeated the monolingual experiments separately on human translated (HT) data and on the same instances translated by machine (MT) (the sizes of these HT datasets are reported in Table 1). Each model was fine-tuned using either MT or HT datasets of the same size. Only the translated content varies between the two translation types; otherwise, they contain exactly the same examples. The split of instances into train, validation and test sets is the same as in the English variant (but the sets are mostly considerably smaller, see Table 1). WSC is excluded from this evaluation as it can only be human translated, so the average score (Avg) is computed from the five remaining tasks. Table 5 shows the results. The comparison with the results in Table 2 is not fair because the models there used significantly more training examples.
Considering the Avg scores in Table 5, the monolingual SloBERTa is again the best performing Slovene model and all BERT models, regardless of translation type, perform better than the Most Frequent baselines. From the translation type perspective, the models trained on HT datasets perform better than those trained on MT datasets by 1.5 points. The only task where MT is better than HT is MultiRC, but looking at single scores, we can observe that none of the models learned anything in this task as there is a large gap between the Most Frequent baseline and the rest of the models.
Analysis of specific tasks shows that, for the BoolQ dataset, all models predict the most frequent class (the testing set might be too small for reliable conclusions in BoolQ). We can safely assume that training sample sizes are too small in BoolQ and MultiRC (none of the models learned anything) and must be increased (we have only 92 HT examples in BoolQ and 15 HT examples in MultiRC). The same is also true for RTE.
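To make the Most Frequent baseline concrete, the following sketch computes it for a classification task scored by accuracy. It assumes that the majority class is taken from the training labels, which is the usual convention but is not spelled out in the text; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def most_frequent_baseline(train_labels, test_labels):
    """Predict the majority training class for every test instance.

    Returns the majority label and the resulting accuracy in percent.
    """
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return majority, 100.0 * correct / len(test_labels)


# Toy labels for illustration only (not taken from the Slovene datasets).
toy_train = [True, True, False]
toy_test = [True, False, True, True]
toy_majority, toy_accuracy = most_frequent_baseline(toy_train, toy_test)
```

A model that outputs the same class for every instance, as observed for BoolQ above, scores exactly this baseline, so it cannot be said to have learned the task.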
5.5 Comparing machine translation systems
To check the effect of MT systems on the performance of models, we perform a separate study in the monolingual setting using the two currently best MT systems for Slovene: GoogleMT (https://translate.google.com/) and DeepL (https://www.deepl.com/translator).
We compare all four Slovene BERT models on the two datasets where all models performed better than the Most Frequent baseline, CB and COPA (WSC needs manual translation, as discussed in Section 3). The results in Table 6 show surprisingly large differences between the two MT systems. DeepL gives favourable results on the CB dataset, while the inverse is true on the COPA dataset. We cannot explain the differences and leave a more detailed analysis of the issue for further work.
6 Conclusions
We prepared the Slovene translation of the natural language understanding benchmark suite SuperGLUE and released it under an open-source licence as SloSuperGLUE (http://hdl.handle.net/11356/1380). We described the translation process and the obstacles in the transfer to a morphologically rich language. The partially machine and partially human translated datasets were used in the assessment of four BERT-based models available for Slovene. The results show that the monolingual SloBERTa model is currently the best performing Slovene pretrained model. However, the performance of Slovene models is still significantly worse than that of the state-of-the-art English models, showing considerable potential for improvement of NLP approaches for less-resourced languages.
Our analyses show that the models’ performance improved with human translated datasets, and in the future, we intend to increase the share of human translated data. For two of the English SuperGLUE datasets, MT was not possible. We intend to create a Slovene version of the WiC task from scratch using word sense disambiguation tasks and to manually adapt the ReCoRD task to get the full SuperGLUE suite. In the WSC task, Slovene verb morphology leaks some coreference information. We intend to analyze this issue and form a more challenging Slovene WSC. Next, both Slovene and Russian are Slavic languages and are therefore more similar to each other than to English. It would be interesting to combine equivalent Russian and Slovene tasks from the translated SuperGLUE tasks in further multilingual and cross-lingual experiments. Finally, as we kept the format of the original SuperGLUE benchmarks, Slovene datasets can be evaluated with the original SuperGLUE leaderboard. While this allows comparison to English models, the obtained results are not publicly available. We intend to prepare a separate Slovene leaderboard and encourage the NLP community to pay attention to less-resourced languages.
The work was partially supported by the Slovenian Research Agency (ARRS) core research programme P6-0411, as well as by the Ministry of Culture of the Republic of Slovenia through the project Development of Slovene in Digital Environment (RSDO). This paper is supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 825153, project EMBEDDIA (Cross-Lingual Embeddings for Less-Represented Languages in European News Media). We thank the authors of the SuperGLUE benchmark for sharing test set answers for some of the tasks.
|BoolQ||Passage:Barq’s – Barq’s is an American soft drink. Its brand of root beer is notable for having caffeine. Barq’s, created by Edward Barq and bottled since the turn of the 20th century, is owned by the Barq family but bottled by the Coca-Cola Company. It was known as Barq’s Famous Olde Tyme Root Beer until 2012.|
|Question:is barq’s root beer a pepsi product|
|CB||Text:B: And yet, uh, I we-, I hope to see employer based, you know, helping out. You know, child, uh, care centers at the place of employment and things like that, that will help out. A: Uh-huh. B: What do you think, do you think we are, setting a trend?|
|Hypothesis:they are setting a trend|
|COPA||Premise:My body cast a shadow over the grass.|
|Question:What’s the CAUSE for this|
|Alternative 1:The sun was rising.|
|Alternative 2:The grass was cut.|
|MultiRC||Paragraph:Susan wanted to have a birthday party. She called all of her friends. She has five friends. Her mom said that Susan can invite them all to the party. Her first friend could not go to the party because she was sick. Her second friend was going out of town. Her third friend was not so sure if her parents would let her. The fourth friend said maybe. The fifth friend could go to the party for sure. Susan was a little sad. On the day of the party, all five friends showed up. Each friend had a present for Susan. Susan was happy and sent each friend a thank you card the next week.|
|Question:Did Susan’s sick friend recover?|
|Candidate answers: Yes, she recovered (T), No (F), Yes (T), No, she didn’t recover (F), Yes, she was at Susan’s party (T)|
|ReCoRD||Paragraph:(CNN) Puerto Rico on Sunday overwhelmingly voted for statehood. But Congress, the only body that can approve new states, will ultimately decide whether the status of the US commonwealth changes. Ninety-seven percent of the votes in the nonbinding referendum favored statehood, an increase over the results of a 2012 referendum, official results from the State Electoral Commission show. It was the fifth such vote on statehood. Today, we the people of Puerto Rico are sending a strong and clear message to the US Congress … and to the world … claiming our equal rights as American citizens, Puerto Rico Gov. Ricardo Rossello said in a news release.@highlight Puerto Rico voted Sunday in favor of US statehood|
|Query:For one, they can truthfully say, “Don’t blame me, I didn’t vote for them,” when discussing the placeholder presidency.|
|Correct Entities: US|
|RTE||Text:Dana Reeve, the widow of the actor Christopher Reeve, has died of lung cancer at age 44, according to the Christopher Reeve Foundation.|
|Hypothesis:Christopher Reeve had an accident.|
|WiC||Context 1:Room and board.|
|Context 2:He nailed boards across the windows.|
|Sense match: False|
|WSC||Text:Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful.|
7 Bibliographical References
- (2019) Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- (2018) Word translation without parallel data. In Proceedings of the International Conference on Learning Representations (ICLR).
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421.
- (2021) Bidimensional leaderboards: generate and evaluate language hand in hand. arXiv preprint arXiv:2112.04139.
- (2020) XGLUE: a new benchmark dataset for cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6008–6018.
- (2020) jiant 2.0: a software toolkit for research on general-purpose text understanding models. http://jiant.info/
- (2020) RussianSuperGLUE: a Russian language understanding evaluation benchmark. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4717–4726.
- (2021) How not to lie with a benchmark: rearranging NLP leaderboards. arXiv preprint arXiv:2112.01342.
- (2020) FinEst BERT and CroSloEngual BERT: less is more in multilingual models. In Proceedings of Text, Speech, and Dialogue, TSD 2020, pp. 104–111.
- (2021) SloBERTa: Slovene monolingual large pretrained masked language model. In Proceedings of Data Mining and Data Warehousing, SiKDD.
- (2019) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. Advances in Neural Information Processing Systems 32.
- (2019) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
8 Language Resource References