Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

06/26/2022
by Ritesh Kumar, et al.

In this paper we discuss in-progress work on the development of a speech corpus for four low-resource Indo-Aryan languages (Awadhi, Bhojpuri, Braj and Magahi) using the field methods of linguistic data collection. The corpus currently totals approximately 18 hours (approx. 4-5 hours for each language) and is transcribed and annotated with grammatical information such as part-of-speech tags, morphological features and Universal Dependencies relations. We discuss our methodology for data collection in these languages, most of which was carried out in the middle of the COVID-19 pandemic, with one of the aims being to generate some additional income for low-income groups speaking these languages. We also discuss the results of baseline experiments with automatic speech recognition systems for these languages.


