Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages

06/15/2022
by Mukhlis Amien, et al.

Twitter contains an abundance of real-world linguistic data. We examine Twitter for user-generated content in low-resource languages, specifically the local languages of Indonesia. For NLP to work in Indonesian, it must account for the local dialects, geographic context, and regional culture that influence Indonesian languages. This paper identifies the problems we faced when constructing a local Indonesian NLP dataset. Furthermore, we are developing a framework for creating, collecting, and classifying local Indonesian datasets for NLP, using Twitter's geolocation tool for automatic annotation.
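
To illustrate the idea of location-based automatic annotation mentioned in the abstract, the following is a minimal sketch rather than the authors' framework. It assumes tweets have already been collected as dictionaries with a `text` field and an optional `coordinates` field holding a `(longitude, latitude)` pair. The names `REGION_BOXES`, `label_by_location`, and `annotate` are hypothetical, and the bounding boxes are rough placeholders chosen for illustration, not values from the paper.

```python
from typing import Optional

# Hypothetical, rough bounding boxes: (min_lon, min_lat, max_lon, max_lat).
# These are illustrative placeholders, not regions defined in the paper.
REGION_BOXES = {
    "jv": (110.0, -8.3, 111.5, -6.4),  # placeholder for a Javanese-speaking area
    "su": (106.4, -7.8, 108.5, -6.0),  # placeholder for a Sundanese-speaking area
}


def label_by_location(lon: float, lat: float) -> Optional[str]:
    """Return the label of the first box containing the point, else None."""
    for label, (min_lon, min_lat, max_lon, max_lat) in REGION_BOXES.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return label
    return None


def annotate(tweets: list) -> list:
    """Attach a candidate language label to each geotagged tweet.

    Tweets without coordinates, or outside every box, are skipped.
    """
    labeled = []
    for tweet in tweets:
        coords = tweet.get("coordinates")
        if coords is None:
            continue
        label = label_by_location(*coords)
        if label is not None:
            labeled.append({"text": tweet["text"], "label": label})
    return labeled
```

A label produced this way is only a weak signal: a tweet posted from a region is not necessarily written in that region's language, so labels of this kind would likely need a further language-identification or manual check.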

