USTED: Improving ASR with a Unified Speech and Text Encoder-Decoder

02/12/2022
by Bolaji Yusuf, et al.

Improving end-to-end (E2E) automatic speech recognition (ASR) by incorporating external text data has been a longstanding research topic. Recent work has focused on training E2E ASR models that gain the performance benefits of external text data without incurring the extra cost of evaluating an external language model at inference time. In this work, we propose training an ASR model jointly with a set of text-to-text auxiliary tasks with which it shares a decoder and parts of the encoder. When we jointly train ASR and a masked language model on the 960-hour Librispeech and Opensubtitles data respectively, we observe WER reductions of 16 without any extra cost at inference time, and reductions of 6 relative to a stronger MUTE-L baseline which trains the decoder with the same text data as our model. We achieve further improvements when we train the masked language model on Librispeech data or when we use machine translation as the auxiliary task, without significantly sacrificing performance on the task itself.
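The abstract does not spell out how the text-to-text examples are built or how the two tasks are mixed during training. As a rough illustration only, the sketch below shows one plausible way to turn plain text into masked-language-model pairs (masked tokens in, original tokens out) and to alternate them with ASR pairs so that a shared decoder sees both. The `<mask>` symbol, the 0.15 masking probability, the strict one-to-one alternation, and the function names `make_mlm_example` and `interleave_tasks` are all assumptions for this example, not details from the paper.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not taken from the paper


def make_mlm_example(tokens, mask_prob=0.15, rng=None):
    """Build one text-to-text pair: masked tokens in, original tokens out."""
    if rng is None:
        rng = random.Random()
    # Each token is independently replaced by MASK with probability mask_prob.
    source = [MASK if rng.random() < mask_prob else tok for tok in tokens]
    return source, list(tokens)


def interleave_tasks(asr_pairs, sentences, mask_prob=0.15, seed=0):
    """Yield (task, source, target) tuples, alternating ASR and masked-LM examples.

    asr_pairs: iterable of (audio, transcript) pairs; audio is treated as opaque.
    sentences: iterable of plain-text sentences from the external text corpus.
    Both tasks share the same target type (a token list), which is what
    would let them share a decoder.
    """
    rng = random.Random(seed)
    for (audio, transcript), sentence in zip(asr_pairs, sentences):
        yield "asr", audio, transcript.split()
        source, target = make_mlm_example(sentence.split(), mask_prob, rng)
        yield "mlm", source, target


if __name__ == "__main__":
    paired = [("utt0.flac", "hello world")]  # placeholder audio reference
    text_only = ["this line comes from a subtitle corpus"]
    for task, source, target in interleave_tasks(paired, text_only):
        print(task, source, target)
```

In a real system the `task` tag would select the task-specific front end (acoustic features versus text embeddings) before the shared encoder layers; here it only labels the examples.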
