Basic Linguistic Resources and Baselines for Bhojpuri, Magahi and Maithili for Natural Language Processing

by Rajesh Kumar Mundotiya, et al.

Preparing corpora for low-resource languages, and developing human language technology to analyze or computationally process them, is a laborious task, primarily because of the unavailability of expert linguists who are native speakers of these languages and because of the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, a relatively high-resource language, which is why we make our comparisons with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible without changing the data in them. The text belongs to different domains and genres. We calculated basic statistical measures for these corpora at the character, word, syllable, and morpheme levels. The corpora were also annotated with part-of-speech (POS) and chunk tags. The statistical measures were both absolute and relative, and were meant to indicate linguistic properties such as morphological, lexical, phonological, and syntactic complexity (or richness). The results were compared with a standard Hindi corpus. For most of the measures, we tried to keep the corpus size the same across languages to avoid the effect of corpus size, but in some cases using the full corpus turned out to be better, even when the sizes were very different. Although the results are not very clear, we try to draw some conclusions about the languages and the corpora. For POS tagging and chunking, the BIS tagset was used to manually annotate the data. The POS-tagged data comprise 16067, 14669, and 12310 sentences for Bhojpuri, Magahi, and Maithili, respectively. The chunk-annotated data comprise 9695 and 1954 sentences for Bhojpuri and Maithili, respectively.
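As a rough illustration of the kind of absolute and relative measures mentioned above, the following minimal sketch computes word- and character-level counts and a type-token ratio, with an optional token cap so that corpora of different sizes can be compared on equal-sized samples. The function name `corpus_stats`, the `max_tokens` parameter, and the whitespace tokenization are assumptions made for this example; they are not taken from the paper, and the authors' actual measures and tooling may differ.

```python
from collections import Counter


def corpus_stats(sentences, max_tokens=None):
    """Word- and character-level counts for a list of sentence strings.

    Hypothetical helper for illustration only. Tokens are split on
    whitespace. Characters are counted as Unicode code points, which for
    Devanagari text is not the same as counting syllables (aksharas).
    """
    tokens = [tok for sent in sentences for tok in sent.split()]
    if max_tokens is not None:
        # Truncate so that corpora of different sizes are compared
        # on the same number of tokens.
        tokens = tokens[:max_tokens]

    word_counts = Counter(tokens)
    char_counts = Counter(ch for tok in tokens for ch in tok)
    n_tokens = len(tokens)
    n_types = len(word_counts)

    return {
        # Absolute measures
        "tokens": n_tokens,
        "types": n_types,
        "chars": sum(char_counts.values()),
        "char_types": len(char_counts),
        # Relative measures
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "mean_word_length": (
            sum(len(tok) for tok in tokens) / n_tokens if n_tokens else 0.0
        ),
    }


# Toy placeholder input, not data from any of the corpora.
stats = corpus_stats(["aa b aa", "cc b"])
capped = corpus_stats(["aa b aa", "cc b"], max_tokens=3)
```

Capping with `max_tokens` mirrors the idea of holding corpus size constant, since measures such as the type-token ratio depend strongly on sample size.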




