Language Variety Identification with True Labels

03/02/2023
by   Marcos Zampieri, et al.

Language identification is an important first step in many IR and NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by the source from which the text was retrieved. Research has shown that this assumption is problematic, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where a text may contain no distinctive marker of the particular language or variety. To overcome this limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in three languages: Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark for the development of more robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

research
03/16/2020

Offensive Language Identification in Greek

As offensive language has become a rising issue for online communities a...
research
09/10/2021

FBERT: A Neural Transformer for Identifying Offensive Content

Transformer-based models such as BERT, XLNET, and XLM-R have achieved st...
research
05/30/2017

A Low Dimensionality Representation for Language Variety Identification

Language variety identification aims at labelling texts in a native lang...
research
05/31/2021

Singing Language Identification using a Deep Phonotactic Approach

Extensive works have tackled Language Identification (LID) in the speech...
research
02/20/2023

EuroCrops: All you need to know about the Largest Harmonised Open Crop Dataset Across the European Union

EuroCrops contains geo-referenced polygons of agricultural croplands fro...
research
05/23/2023

Towards Massively Multi-domain Multilingual Readability Assessment

We present ReadMe++, a massively multi-domain multilingual dataset for a...
research
09/19/2022

Challenges and Opportunities of Large Transnational Datasets: A Case Study on European Administrative Crop Data

Expansive, informative datasets are vital in providing foundations and p...
