CompLex — A New Corpus for Lexical Complexity Prediction from Likert Scale Data

03/16/2020
by Matthew Shardlow, et al.

Predicting which words are considered hard to understand for a given target population is a vital step in many NLP applications such as text simplification. This task is commonly referred to as Complex Word Identification (CWI). With a few exceptions, previous studies have approached it as a binary classification task in which systems predict a complexity label (complex vs. non-complex) for a set of target words in a text. This choice is motivated by the fact that all CWI datasets compiled so far have been annotated using a binary annotation scheme. Our paper addresses this limitation by presenting the first English dataset for continuous lexical complexity prediction. We use a 5-point Likert scale scheme to annotate complex words in texts from three sources/domains: the Bible, Europarl, and biomedical texts. This resulted in a corpus of 9,476 sentences, each annotated by around 7 annotators.
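To illustrate how a 5-point Likert annotation can yield a continuous complexity value rather than a binary label, here is a minimal sketch. The abstract does not state how ratings are aggregated, so the approach below (rescaling each rating to [0, 1] and averaging across annotators) is an assumption, and the function name `likert_to_score` and the sample ratings are hypothetical.

```python
from statistics import mean


def likert_to_score(ratings, low=1, high=5):
    """Rescale Likert ratings to [0, 1] and average them.

    Assumption: the aggregation rule is illustrative only; the abstract
    does not specify how annotator ratings are combined.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside the {low}-{high} scale")
    # Map low -> 0.0 and high -> 1.0, then take the mean over annotators.
    return mean([(r - low) / (high - low) for r in ratings])


# Hypothetical ratings from seven annotators for one target word.
example_ratings = [1, 2, 2, 3, 1, 2, 4]
score = likert_to_score(example_ratings)
print(f"continuous complexity: {score:.3f}")
```

A binary scheme would collapse these seven judgments into a single complex/non-complex label, whereas the averaged score keeps the degree of disagreement among annotators visible.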

research
02/17/2021

Predicting Lexical Complexity in English Texts

The first step in most text simplification is to predict which words are...
research
07/27/2023

Retrieval-based Text Selection for Addressing Class-Imbalanced Data in Classification

This paper addresses the problem of selecting a set of texts for anno...
research
10/13/2017

Complex Word Identification: Challenges in Data Annotation and System Performance

This paper revisits the problem of complex word identification (CWI) fol...
research
06/30/2023

Japanese Lexical Complexity for Non-Native Readers: A New Dataset

Lexical complexity prediction (LCP) is the task of predicting the comple...
research
03/08/2023

Lexical Complexity Prediction: An Overview

The occurrence of unknown words in texts significantly hinders reading c...
research
07/09/2020

DISCO PAL: Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels

Nowadays, there are many applications of text mining over corpus from di...
research
11/14/2021

Towards annotation of text worlds in a literary work

Literary texts are usually rich in meanings and their interpretation com...
