Dim Wihl Gat Tun: The Case for Linguistic Expertise in NLP for Underdocumented Languages

03/17/2022
by   Clarissa Forbes, et al.

Recent progress in NLP is driven by pretrained models that leverage massive datasets, and it has predominantly benefited the languages of the world's political and economic superpowers. Technologically underserved languages are left behind because they lack such resources. Nevertheless, hundreds of underserved languages have data available in the form of interlinear glossed text (IGT) from language documentation efforts. IGT remains underutilized in NLP work, perhaps because its annotations are only semi-structured and often language-specific. With this paper, we make the case that IGT data can be leveraged successfully, provided that target language expertise is available. We specifically advocate for collaboration with documentary linguists. Our paper provides a roadmap for successful projects utilizing IGT data: (1) It is essential to define which NLP tasks can be accomplished with the given IGT data and how these will benefit the speech community. (2) Great care and target language expertise are required when converting the data into structured formats commonly employed in NLP. (3) Task-specific and user-specific evaluation can help to ensure that the tools which are created benefit the target language speech community. We illustrate each step through a case study on developing a morphological reinflection system for the Tsimchianic language Gitksan.
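To make step (2) concrete, here is a minimal sketch of turning one IGT record into structured morpheme-gloss pairs. It is not the authors' pipeline. It assumes the common convention that words are separated by spaces and morphemes by hyphens, with the segmented line and the gloss line aligned one-to-one. The names `Morpheme` and `align_igt` are hypothetical, and the example record is an English toy, not Gitksan data. Real IGT often uses further delimiters (for example `=` for clitics) and contains inconsistencies, which is where target language expertise comes in.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Morpheme:
    form: str   # surface form of the morpheme
    gloss: str  # gloss label aligned with that form


def align_igt(segmented: str, glosses: str) -> List[List[Morpheme]]:
    """Pair each morpheme with its gloss, word by word.

    Assumes hyphen-delimited morphemes and space-delimited words.
    Raises ValueError when the two lines do not align, so that
    mismatches are surfaced for review instead of silently dropped.
    """
    words = segmented.split()
    word_glosses = glosses.split()
    if len(words) != len(word_glosses):
        raise ValueError(
            f"word count mismatch: {len(words)} words, "
            f"{len(word_glosses)} glosses"
        )

    aligned = []
    for word, gloss in zip(words, word_glosses):
        forms = word.split("-")
        labels = gloss.split("-")
        if len(forms) != len(labels):
            raise ValueError(f"morpheme count mismatch in {word!r} / {gloss!r}")
        aligned.append([Morpheme(f, g) for f, g in zip(forms, labels)])
    return aligned


# Toy English record, used only to illustrate the data shape.
record = align_igt("dog-s bark-ed", "dog-PL bark-PST")
for word in record:
    print([(m.form, m.gloss) for m in word])
```

Raising an error on misaligned lines, instead of guessing an alignment, is a deliberate choice in this sketch: records that fail can be set aside for a linguist to inspect.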

