Corpus-Guided Contrast Sets for Morphosyntactic Feature Detection in Low-Resource English Varieties

09/15/2022
by   Tessa Masis, et al.

The study of language variation examines how language varies between and within different groups of speakers, shedding light on how we use language to construct identities and how social contexts affect language use. A common method is to identify instances of a certain linguistic feature - say, the zero copula construction - in a corpus, and analyze the feature's distribution across speakers, topics, and other variables, to either gain a qualitative understanding of the feature's function or systematically measure variation. In this paper, we explore the challenging task of automatic morphosyntactic feature detection in low-resource English varieties. We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits. We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
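As a rough illustration of the corpus-analysis step described above (measuring how a feature is distributed across speakers), here is a minimal sketch. It assumes each utterance has already been labeled for the presence of a feature, for example by a detector or an annotator. The function name `feature_rate_by_group`, the record layout, and the sample data are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def feature_rate_by_group(records):
    """Return the share of utterances containing the feature, per group.

    `records` is an iterable of (group, has_feature) pairs, where `group`
    is any hashable label (speaker, topic, ...) and `has_feature` is a bool.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)  # utterances seen per group
    hits = defaultdict(int)    # utterances with the feature per group
    for group, has_feature in records:
        totals[group] += 1
        if has_feature:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Made-up labels, e.g. "does this utterance contain a zero copula?"
records = [
    ("speaker_a", True),
    ("speaker_a", False),
    ("speaker_a", True),
    ("speaker_b", False),
    ("speaker_b", False),
    ("speaker_b", True),
]
rates = feature_rate_by_group(records)
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.2f}")
```

The same grouping works for topics or any other variable; a real study would also need to account for unequal sample sizes and detector error.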

