ASR2K: Speech Recognition for Around 2000 Languages without Audio

09/06/2022
by Xinjian Li, et al.

Most recent speech recognition models rely on large supervised datasets, which are unavailable for many low-resource languages. In this work, we present a speech recognition pipeline that does not require any audio for the target language. The only assumption is that we have access to raw text datasets or a set of n-gram statistics. Our speech pipeline consists of three components: acoustic, pronunciation, and language models. Unlike the standard pipeline, our acoustic and pronunciation models use multilingual models without any supervision. The language model is built using n-gram statistics or the raw text dataset. We build speech recognition for 1909 languages by combining the pipeline with Crubadan, a large n-gram database of endangered languages. Furthermore, we test our approach on 129 languages across two datasets: Common Voice and the CMU Wilderness dataset. We achieve 50% CER and 74% WER on the Wilderness dataset with Crubadan statistics only and improve them to 45% CER and 69% WER when using 10000 raw text utterances.
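
To make the text-only language model component concrete, here is a minimal sketch of estimating an n-gram model from raw text and scoring candidate strings with it. It assumes character bigrams with add-one smoothing; neither choice is taken from the paper, and the function names `train_char_ngram` and `log_prob` are hypothetical, introduced only for illustration.

```python
import math
from collections import Counter


def train_char_ngram(lines, n=2):
    """Count character n-grams and their (n-1)-character contexts.

    Each line is padded with '^' start markers and a '$' end marker
    (an illustrative convention, not one from the paper).
    """
    ngrams, contexts, vocab = Counter(), Counter(), set()
    for line in lines:
        padded = "^" * (n - 1) + line.strip() + "$"
        vocab.update(padded)
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts, vocab


def log_prob(text, ngrams, contexts, vocab, n=2):
    """Return the add-one-smoothed log probability of `text`."""
    padded = "^" * (n - 1) + text + "$"
    vocab_size = len(vocab)
    total = 0.0
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        # Counter returns 0 for unseen n-grams, so smoothing keeps this finite.
        total += math.log((ngrams[gram] + 1) / (contexts[gram[:-1]] + vocab_size))
    return total


if __name__ == "__main__":
    corpus = ["hello world", "hello there"]
    model = train_char_ngram(corpus)
    # A string resembling the training text should score higher
    # than an unrelated character sequence of the same length.
    print(log_prob("hello", *model) > log_prob("xqzjk", *model))
```

When only precomputed n-gram statistics are available rather than raw text, the two counters could in principle be filled directly from those counts instead of being re-estimated from lines of text.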


