Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

02/04/2021
by Stephanie Hirmer, et al.

Most well-established data collection methods currently adopted in NLP assume that speakers are literate. Consequently, the collected corpora largely fail to represent swathes of the global population, who tend to be among the most vulnerable and marginalised people in society and often live in rural developing areas. Such underrepresented groups are thus not only ignored when modeling and system design decisions are made, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.
