Recognizing long-form speech using streaming end-to-end models

10/24/2019
by   Arun Narayanan, et al.
0

All-neural end-to-end (E2E) automatic speech recognition (ASR) systems that use a single neural network to transduce audio to word sequences have been shown to achieve state-of-the-art results on several tasks. In this work, we examine the ability of E2E models to generalize to unseen domains, where we find that models trained on short utterances fail to generalize to long-form speech. We propose two complementary solutions to address this: training on diverse acoustic data, and LSTM state manipulation to simulate long-form audio when training using short utterances. On a synthesized long-form test set, adding data diversity improves word error rate (WER) by 90 simulating long-form training improves it by 67 combination doesn't improve over data diversity alone. On a real long-form call-center test set, adding data diversity improves WER by 40 Simulating long-form training on top of data diversity improves performance by an additional 27

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/07/2020

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

In recent years, all-neural end-to-end approaches have obtained state-of...
research
10/08/2021

Input Length Matters: An Empirical Study Of RNN-T And MWER Training For Long-form Telephony Speech Recognition

End-to-end models have achieved state-of-the-art results on several auto...
research
09/18/2023

Investigating End-to-End ASR Architectures for Long Form Audio Transcription

This paper presents an overview and evaluation of some of the end-to-end...
research
06/02/2023

Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation

Punctuated text prediction is crucial for automatic speech recognition a...
research
11/06/2019

A comparison of end-to-end models for long-form speech recognition

End-to-end automatic speech recognition (ASR) models, including both att...
research
04/22/2022

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Improving the performance of end-to-end ASR models on long utterances ra...
research
06/01/2021

A Neural Acoustic Echo Canceller Optimized Using An Automatic Speech Recognizer And Large Scale Synthetic Data

We consider the problem of recognizing speech utterances spoken to a dev...

Please sign up or login with your details

Forgot password? Click here to reset