Google Crowdsourced Speech Corpora and Related Open-Source Resources for Low-Resource Languages and Dialects: An Overview

10/14/2020
by   Alena Butryna, et al.

This paper presents an overview of a program designed to address the growing need for freely available speech resources for under-represented languages. At present, we have released 38 datasets for building text-to-speech and automatic speech recognition applications for languages and dialects of South and Southeast Asia, Africa, Europe, and South America. The paper describes the methodology used for developing such corpora and presents some of our findings that could benefit under-represented language communities.


