Multi-VALUE: A Framework for Cross-Dialectal English NLP

12/15/2022
by Caleb Ziems, et al.

Dialect differences arising from regional, social, and economic barriers cause performance discrepancies for many groups of language technology users. Fair, inclusive, and equitable language technology must critically be dialect invariant, meaning that performance remains constant over dialectal shifts. Current English systems often fall significantly short of this ideal because they are designed and tested on a single dialect: Standard American English. We introduce Multi-VALUE, a suite of resources for evaluating and achieving English dialect invariance. We build a controllable rule-based translation system spanning 50 English dialects and a total of 189 unique linguistic features. Our translation maps Standard American English text to a synthetic form of each dialect, using an upper bound on the natural density of features in that dialect. First, we use this system to build stress tests for question answering, machine translation, and semantic parsing tasks. These stress tests reveal significant performance disparities for leading models on non-standard dialects. Second, we use this system as a data augmentation technique to improve the dialect robustness of existing systems. Finally, we partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
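To make the idea of a controllable rule-based translation concrete, here is a minimal toy sketch in Python. It is not the Multi-VALUE implementation: the two regex rules, the names `RULES` and `perturb`, and the `density` parameter are all hypothetical simplifications chosen for illustration. Each rule rewrites one Standard American English pattern, and `density` sets the probability that a matching site is rewritten, so `density=1.0` loosely corresponds to applying every feature wherever it can occur.

```python
import random
import re

# Hypothetical, heavily simplified rules for illustration only.
# The system described in the abstract covers 189 features; these two
# regexes are assumptions, not rules taken from Multi-VALUE.
RULES = {
    # "is/are VERBing" -> "VERBing"
    "copula_deletion": (re.compile(r"\b(?:is|are)\s+(\w+ing)\b"), r"\1"),
    # "...n't ... any" -> "...n't ... no" (within one sentence)
    "negative_concord": (re.compile(r"(n't\b[^.?!]*?)\bany\b"), r"\1no"),
}


def perturb(text, features, density=1.0, seed=0):
    """Apply the selected rules; each match is rewritten with probability `density`."""
    rng = random.Random(seed)
    for name in features:
        pattern, template = RULES[name]

        def maybe_apply(match):
            # Rewrite this match site only if the random draw falls under `density`.
            if rng.random() < density:
                return match.expand(template)
            return match.group(0)

        text = pattern.sub(maybe_apply, text)
    return text


sentence = "She is working and they don't have any time."
print(perturb(sentence, ["copula_deletion", "negative_concord"]))
# prints: She working and they don't have no time.
```

A stress test in this spirit would run the same model on the original and the perturbed inputs and compare the task metric; with `density=0.0` the function returns its input unchanged, which gives the unperturbed baseline.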

research
05/28/2023

HaVQA: A Dataset for Visual Question Answering and Multimodal Research in Hausa Language

This paper presents HaVQA, the first multimodal dataset for visual quest...
research
03/18/2020

Unsupervised Pidgin Text Generation By Pivoting English Data and Self-Training

West African Pidgin English is a language that is significantly spoken i...
research
04/26/2022

Disambiguation of morpho-syntactic features of African American English – the case of habitual be

Recent research has highlighted that natural language processing (NLP) s...
research
04/30/2020

Use of Machine Translation to Obtain Labeled Datasets for Resource-Constrained Languages

The large annotated datasets in NLP are overwhelmingly in English. This ...
research
04/06/2022

VALUE: Understanding Dialect Disparity in NLU

English Natural Language Understanding (NLU) systems have achieved great...
research
05/26/2023

TADA: Task-Agnostic Dialect Adapters for English

Large Language Models, the dominant starting point for Natural Language ...
research
01/14/2021

SICK-NL: A Dataset for Dutch Natural Language Inference

We present SICK-NL (read: signal), a dataset targeting Natural Language ...
