Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

10/08/2021
by Zhiyun Lu, et al.

End-to-end models have achieved state-of-the-art results on several automatic speech recognition tasks. However, they perform poorly when evaluated on long-form data, e.g., minutes-long conversational telephony audio. One reason the model fails on long-form speech is that it has only seen short utterances during training. This paper presents an empirical study on the effect of training utterance length on the word error rate (WER) of the RNN-transducer (RNN-T) model. We compare two widely used training objectives, log loss (or RNN-T loss) and minimum word error rate (MWER) loss. We conduct experiments on telephony datasets in four languages. Our experiments show that for both losses, the WER on long-form speech drops substantially as the training utterance length increases. The average relative WER gain is 15.7% for log loss and 8.8% for MWER loss. When training on short utterances, MWER loss leads to a lower WER than the log loss. This difference between the two losses diminishes as the input length increases.
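For readers unfamiliar with the metrics, the following is a minimal sketch of how WER and a relative WER gain are conventionally computed. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_wer_reduction` are hypothetical, whitespace tokenization is an assumption, and the example sentences and numbers are illustrative only, not results from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumes simple whitespace tokenization (an illustrative choice).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which improved_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Illustrative inputs only: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
    # Illustrative WER values (in percent), not taken from the paper.
    print(relative_wer_reduction(20.0, 17.0))
```

Under this convention, a relative gain is measured against the baseline WER, so the same absolute improvement counts for more when the baseline is already low.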


