Leveraging supplementary text data to kick-start automatic speech recognition system development with limited transcriptions

02/09/2023
by Nay San, et al.

Recent research using pre-trained transformer models suggests that just 10 minutes of transcribed speech may be enough to fine-tune such a model for automatic speech recognition (ASR) – at least if we can also leverage vast amounts of text data (803 million tokens). But is that much text data necessary? We study the use of different amounts of text data, both for creating a lexicon that constrains ASR decoding to possible words (e.g. *dogz vs. dogs), and for training larger language models that bias the system toward probable word sequences (e.g. too dogs vs. two dogs). We perform experiments using 10 minutes of transcribed speech from English (for replicating prior work) and two additional pairs of languages differing in the availability of supplemental text data: Gronings and Frisian (7.5M token corpora available), and Besemah and Nasal (only small lexica available). For all languages, we found that using only a lexicon did not appreciably improve ASR performance. For Gronings and Frisian, we found that lexica and language models derived from 'novel-length' 80k token subcorpora reduced the word error rate (WER) to 39% on average. Our findings suggest that where a text corpus in the upper tens of thousands of tokens or more is available, fine-tuning a transformer model with just tens of minutes of transcribed speech holds some promise towards obtaining human-correctable transcriptions near the 30% WER mark.
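
The abstract reports results as word error rate (WER). As a reminder of what that figure measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcription and the system output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: words are split on whitespace, with no
    text normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of two reference words.
    print(word_error_rate("two dogs", "too dogs"))
```

Under this definition, a WER of 39% means roughly four word-level edits are needed per ten reference words, and because insertions count as errors the value can exceed 100%.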


