One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia

by Alham Fikri Aji, et al.

NLP research is impeded by a lack of resources and awareness of the challenges presented by underrepresented languages and dialects. Focusing on the languages spoken in Indonesia, the second most linguistically diverse and the fourth most populous nation of the world, we provide an overview of the current state of NLP research for Indonesia's 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.





Challenges of language technologies for the indigenous languages of the Americas

Indigenous languages of the American continent are highly diverse. Howev...

NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages

Natural language processing (NLP) has a significant impact on society vi...

NLP for Ghanaian Languages

NLP Ghana is an open-source non-profit organization aiming to advance th...

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

The NLP pipeline has evolved dramatically in the last few years. The fir...

Systematic Inequalities in Language Technology Performance across the World's Languages

Natural language processing (NLP) systems have become a central technolo...

Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages

Natural Language Processing (NLP) is increasingly used as a key ingredie...

The State and Fate of Linguistic Diversity and Inclusion in the NLP World

Language technologies contribute to promoting multilingualism and lingui...