Improving short-video speech recognition using random utterance concatenation

10/28/2022
by Haihua Xu, et al.

One limitation of end-to-end automatic speech recognition frameworks is that performance degrades when training and test utterance lengths are mismatched. In this paper, we propose a random utterance concatenation (RUC) method to alleviate this train-test utterance length mismatch for the short-video speech recognition task. Specifically, we are motivated by the observation that our human-transcribed training utterances for short-video spontaneous speech tend to be much shorter (3 seconds on average), while our test utterances, generated by a voice activity detection front-end, are much longer (10 seconds on average). Such a mismatch can lead to sub-optimal performance. Experimentally, with the proposed RUC method, the best word error rate reduction (WERR) is achieved with an approximately threefold increase in training data size and two utterances per concatenation. In practice, the proposed method consistently outperforms strong baseline models, achieving a 3.64% average WERR across 14 languages.
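To make the idea concrete, here is a minimal sketch of random utterance concatenation, not the authors' implementation. It assumes each training utterance is a pair of audio samples and a transcript, and that concatenation simply joins the audio and joins the transcripts with a space. The names `Utterance`, `random_utterance_concatenation`, `num_concat`, and `num_new` are hypothetical and chosen for illustration only.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    samples: List[float]  # raw audio samples (could equally be feature frames)
    text: str             # human transcript


def random_utterance_concatenation(
    utts: List[Utterance],
    num_concat: int = 2,            # utterances joined per new example (assumed default)
    num_new: Optional[int] = None,  # how many concatenated examples to create
    seed: int = 0,
) -> List[Utterance]:
    """Build longer training examples by joining randomly chosen utterances."""
    if num_concat > len(utts):
        raise ValueError("num_concat cannot exceed the number of utterances")
    rng = random.Random(seed)
    if num_new is None:
        num_new = len(utts)
    augmented = []
    for _ in range(num_new):
        # Pick distinct utterances in random order, without replacement.
        picked = rng.sample(utts, num_concat)
        samples = [s for u in picked for s in u.samples]
        text = " ".join(u.text for u in picked)
        augmented.append(Utterance(samples, text))
    return augmented


# One possible reading of "threefold data size with two utterances each":
# keep the original data and add twice as many concatenated examples.
corpus = [
    Utterance([0.1] * 3, "a"),
    Utterance([0.2] * 5, "b"),
    Utterance([0.3] * 2, "c"),
    Utterance([0.4] * 4, "d"),
]
extra = random_utterance_concatenation(corpus, num_concat=2, num_new=2 * len(corpus))
train_set = corpus + extra
```

How the threefold increase is counted (in utterances or in audio duration, with or without the original data) is not specified in the abstract, so the mixing ratio in the last lines is an assumption.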


