Approaches to Corpus Creation for Low-Resource Language Technology: the Case of Southern Kurdish and Laki

04/03/2023
by   Sina Ahmadi, et al.
0

One of the major challenges that under-represented and endangered language communities face in language technology is the lack or paucity of language data. This is also the case of the Southern varieties of the Kurdish and Laki languages for which very limited resources are available with insubstantial progress in tools. To tackle this, we provide a few approaches that rely on the content of local news websites, a local radio station that broadcasts content in Southern Kurdish and fieldwork for Laki. In this paper, we describe some of the challenges of such under-represented languages, particularly in writing and standardization, and also, in retrieving sources of data and retro-digitizing handwritten content to create a corpus for Southern Kurdish and Laki. In addition, we study the task of language identification in light of the other variants of Kurdish and Zaza-Gorani languages.

READ FULL TEXT
research
02/18/2020

Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi

The recent advances in Natural Language Processing have been a boon for ...
research
04/12/2022

Not always about you: Prioritizing community needs when developing endangered language technology

Languages are classified as low-resource when they lack the quantity of ...
research
05/25/2023

Script Normalization for Unconventional Writing of Under-Resourced Languages in Bilingual Communities

The wide accessibility of social media has provided linguistically under...
research
01/18/2019

Automatic Keyboard Layout Design for Low-Resource Latin-Script Languages

We present our approach to automatically designing and implementing keyb...
research
11/30/2019

Automatic Creation of Text Corpora for Low-Resource Languages from the Internet: The Case of Swiss German

This paper presents SwissCrawl, the largest Swiss German text corpus to ...
research
11/05/2021

Developing Successful Shared Tasks on Offensive Language Identification for Dravidian Languages

With the fast growth of mobile computing and Web technologies, offensive...
research
04/02/2020

Mapping Languages: The Corpus of Global Language Use

This paper describes a web-based corpus of global language use with a fo...

Please sign up or login with your details

Forgot password? Click here to reset