Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

08/26/2022
by Kaushal Santosh Bhogale, et al.

End-to-end (E2E) models have become the default choice for state-of-the-art speech recognition systems. Such models are trained on large amounts of labelled data, which are often not available for low-resource languages. Techniques such as self-supervised learning and transfer learning hold promise, but have not yet been effective in training accurate models. On the other hand, collecting labelled datasets on a diverse set of domains and speakers is very expensive. In this work, we demonstrate an inexpensive and effective alternative to these approaches by “mining” text and audio pairs for Indian languages from public sources, specifically from the public archives of All India Radio. As a key component, we adapt the Needleman-Wunsch algorithm to align sentences with corresponding audio segments given a long audio recording and a PDF of its transcript, while being robust to errors due to OCR, extraneous text, and non-transcribed speech. We thus create Shrutilipi, a dataset which contains over 6,400 hours of labelled audio across 12 Indian languages, totalling 4.95M sentences. On average, Shrutilipi results in a 2.3x increase over publicly available labelled data. We establish the quality of Shrutilipi with 21 human evaluators across the 12 languages. We also establish the diversity of Shrutilipi in terms of represented regions, speakers, and mentioned named entities. Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark. For Hindi, which has the most benchmarks (7), the average WER falls below the 18.8% baseline. This improvement extends to smaller models: we show a 2.3% drop in WER for a compact model (much smaller than Wav2Vec). Finally, we demonstrate the diversity of Shrutilipi by showing that the model trained with it is more robust to noisy input.
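
The alignment step described above builds on Needleman-Wunsch global alignment. Below is a minimal Python sketch of that underlying algorithm, not the authors' exact adaptation from the paper: it aligns word tokens from an OCR'd transcript against a hypothetical first-pass ASR decode of the audio, letting gaps absorb extraneous text or non-transcribed speech and mismatches absorb OCR/ASR errors. The scoring constants and the example inputs are illustrative assumptions.

# Minimal sketch of Needleman-Wunsch global alignment between an OCR'd
# transcript and a first-pass ASR decode. Scores, tokenisation, and the
# example inputs are illustrative assumptions, not taken from the paper.

def needleman_wunsch(ref, hyp, match=2, mismatch=-1, gap=-1):
    """Globally align two token sequences and return aligned pairs.

    Gaps (None) absorb extraneous OCR text or non-transcribed speech;
    mismatches absorb OCR/ASR errors.
    """
    n, m = len(ref), len(hyp)
    # DP table: score[i][j] = best score aligning ref[:i] with hyp[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if ref[i - 1] == hyp[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Traceback from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if ref[i - 1] == hyp[j - 1] else mismatch
        ):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((ref[i - 1], None))  # token only in the transcript
            i -= 1
        else:
            pairs.append((None, hyp[j - 1]))  # token only in the audio decode
            j -= 1
    return list(reversed(pairs))


if __name__ == "__main__":
    # Hypothetical example: "advertisement" is extraneous transcript text,
    # and "wether" is an OCR/ASR error aligned as a mismatch.
    ocr_words = "the weather report for today follows advertisement".split()
    asr_words = "the wether report for today follows".split()
    for ref_tok, hyp_tok in needleman_wunsch(ocr_words, asr_words):
        print(f"{ref_tok or '-':>15}  |  {hyp_tok or '-'}")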


Related research

- 03/31/2022: Effectiveness of text to speech pseudo labels for forced alignment and cross lingual pretrained models for low resource speech recognition
- 06/01/2022: Snow Mountain: Dataset of Audio Recordings of The Bible in Low Resource Languages
- 11/23/2022: IMaSC – ICFOSS Malayalam Speech Corpus
- 11/25/2020: De-STT: De-entaglement of unwanted Nuisances and Biases in Speech to Text System using Adversarial Forgetting
- 11/06/2021: Towards Building ASR Systems for the Next Billion Users
- 10/06/2021: Integrating Categorical Features in End-to-End ASR
- 01/01/2020: Low-Budget Unsupervised Label Query through Domain Alignment Enforcement
