Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages

07/11/2020
by Esma Wali, et al.

Natural Language Processing (NLP) is increasingly used as a key ingredient in critical decision-making systems, such as resume parsers used to sort lists of job candidates. NLP systems often ingest large corpora of human text, attempting to learn from past human behavior and decisions in order to produce systems that will make recommendations about our future world. Over 7000 human languages are spoken today, and the typical NLP pipeline underrepresents speakers of most of them while amplifying the voices of speakers of other languages. In this paper, a team including speakers of 8 languages (English, Chinese, Urdu, Farsi, Arabic, French, Spanish, and Wolof) takes a critical look at the typical NLP pipeline and at how, even when a language is technically supported, substantial caveats remain that prevent full participation. Despite huge and admirable investments in multilingual support in many tools and resources, we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world.
