SweLL on the rise: Swedish Learner Language corpus for European Reference Level studies

04/22/2016
by Elena Volodina, et al.

We present a new resource for Swedish, SweLL, a corpus of Swedish learner essays linked to learners' performance according to the Common European Framework of Reference (CEFR). SweLL consists of three subcorpora (SpIn, SW1203 and Tisus) collected from three different educational establishments. The metadata common to all subcorpora includes age, gender, native languages, time of residence in Sweden and type of written task. Depending on the subcorpus, learner texts may carry additional information, such as text genre, topic and grade. Five of the six CEFR levels are represented in the corpus: A1, A2, B1, B2 and C1, comprising 339 essays in total. The C2 level is not included, since courses at C2 level are not offered. The workflow consists of collecting essays and permits, essay digitization and registration, metadata annotation and automatic linguistic annotation. Inter-rater agreement is presented on the basis of the SW1203 subcorpus. Work on SweLL is ongoing, with more than 100 essays waiting in the pipeline. This article describes both the resource and the "how-to" behind the compilation of SweLL.
