Huqariq: A Multilingual Speech Corpus of Native Languages of Peru for Speech Recognition

07/12/2022
by Rodolfo Zevallos, et al.

The Huqariq corpus is a multilingual collection of speech in native languages of Peru. The transcribed corpus is intended for research and development of speech technologies that help preserve Peru's endangered languages. Huqariq is designed primarily for developing automatic speech recognition, language identification, and text-to-speech tools. To make corpus collection sustainable, we use crowdsourcing. Huqariq includes four native languages of Peru, and it is expected to reach up to 20 of the country's 48 native languages by the end of 2022. The corpus contains 220 hours of transcribed audio recorded by more than 500 volunteers, making it the largest speech corpus for native languages of Peru. To verify the quality of the corpus, we present speech recognition experiments using the 220 hours of fully transcribed audio.

research
12/13/2019

Common Voice: A Massively-Multilingual Speech Corpus

The Common Voice corpus is a massively-multilingual collection of transc...
research
06/26/2022

Annotated Speech Corpus for Low Resource Indian Languages: Awadhi, Bhojpuri, Braj and Magahi

In this paper we discuss an in-progress work on the development of a spe...
research
03/27/2018

Comprehending Real Numbers: Development of Bengali Real Number Speech Corpus

Speech recognition has received a less attention in Bengali literature d...
research
08/23/2019

Deploying Technology to Save Endangered Languages

Computer scientists working on natural language processing, native speak...
research
02/15/2021

Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon

In this paper, we introduce the first large vocabulary speech recognitio...
research
01/22/2020

TLT-school: a Corpus of Non Native Children Speech

This paper describes "TLT-school" a corpus of speech utterances collecte...
research
04/27/2021

Using Radio Archives for Low-Resource Speech Recognition: Towards an Intelligent Virtual Assistant for Illiterate Users

For many of the 700 million illiterate people around the world, speech r...
