ScandEval: A Benchmark for Scandinavian Natural Language Processing

04/03/2023
by Dan Saattrup Nielsen, et al.

This paper introduces ScandEval, a Scandinavian benchmarking platform that can benchmark any pretrained model on four different tasks in the Scandinavian languages. The datasets used in two of the tasks, linguistic acceptability and question answering, are new. We develop and release a Python package and command-line interface, scandeval, which can benchmark any model that has been uploaded to the Hugging Face Hub, with reproducible results. Using this package, we benchmark more than 100 Scandinavian or multilingual models, present the results in an interactive online leaderboard, and provide an analysis of them. The analysis shows substantial cross-lingual transfer among the Mainland Scandinavian languages (Danish, Swedish and Norwegian), but only limited transfer between the Mainland Scandinavian languages and the Insular Scandinavian languages (Icelandic and Faroese). The benchmarking results also show that the investment in language technology in Norway, Sweden and Denmark has led to language models that outperform massively multilingual models such as XLM-RoBERTa and mDeBERTaV3. We release the source code for both the package and the leaderboard.


