Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi

11/30/2021
by   Ritesh Kumar, et al.

Magahi is an Indo-Aryan language spoken mainly in the eastern parts of India. Despite its significant number of speakers, virtually no language resources (LRs) or language technologies (LTs) have been developed for it, mainly because of its status as a non-scheduled language. The present paper describes an attempt to develop an annotated corpus of Magahi. The data are drawn mainly from a couple of Magahi blogs, some collections of Magahi stories, and recordings of conversations in Magahi, and are annotated at the part-of-speech (POS) level using the BIS tagset.

01/13/2022

Speech Resources in the Tamasheq Language

In this paper we present two datasets for Tamasheq, a developing languag...
04/06/2022

Language Resources and Technologies for Non-Scheduled and Endangered Indian Languages

In the present paper, we will present a survey of the language resources...
04/12/2022

Not always about you: Prioritizing community needs when developing endangered language technology

Languages are classified as low-resource when they lack the quantity of ...
07/04/2018

A Formal Ontology-Based Classification of Lexemes and its Applications

The paper describes the enrichment of OntoSenseNet - a verb-centric lexi...
05/19/2022

Curras + Baladi: Towards a Levantine Corpus

The processing of the Arabic language is a complex field of research. Th...
05/11/2020

Luganda Text-to-Speech Machine

In Uganda, Luganda is the most spoken native language. It is used for co...
04/20/2022

Who Is Missing? Characterizing the Participation of Different Demographic Groups in a Korean Nationwide Daily Conversation Corpus

A conversation corpus is essential to build interactive AI applications....