Building Representative Corpora from Illiterate Communities: A Review of Challenges and Mitigation Strategies for Developing Countries

by Stephanie Hirmer, et al.

Most well-established data collection methods currently adopted in NLP depend on the assumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora: we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.




