E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

04/22/2022
by W. Ronny Huang, et al.

Improving the performance of end-to-end ASR models on long utterances, ranging from minutes to hours in length, is an ongoing challenge in speech recognition. A common solution is to segment the audio in advance using a separate voice activity detector (VAD) that decides segment boundary locations based purely on acoustic speech/non-speech information. VAD segmenters, however, may be sub-optimal for real-world speech where, e.g., a complete sentence that should be taken as a whole may contain hesitations in the middle ("set an alarm for... 5 o'clock"). We propose to replace the VAD with an end-to-end ASR model capable of predicting segment boundaries in a streaming fashion, allowing the segmentation decision to be conditioned not only on better acoustic features but also on semantic features from the decoded text, with negligible extra computation. In experiments on real-world long-form audio (YouTube) with lengths of up to 30 minutes, we demonstrate 8.5% relative WER improvement and 250 ms reduction in median end-of-segment latency compared to the VAD segmenter baseline on a state-of-the-art Conformer RNN-T model.
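The contrast between the two segmentation strategies can be illustrated with a toy sketch. It is not the paper's model: the frame representation (a speech/non-speech flag plus an optional decoded word), the `<eos>` end-of-segment token, the `max_silence` threshold, and both function names are assumptions made purely for illustration. The VAD-style rule cuts after a fixed run of non-speech frames, while the E2E-style rule cuts only when the decoder itself emits the end-of-segment token.

```python
from typing import List, Optional, Tuple

EOS = "<eos>"  # hypothetical end-of-segment token emitted by the decoder

# One frame: (is_speech, token decoded at this frame, or None)
Frame = Tuple[bool, Optional[str]]


def vad_segments(frames: List[Frame], max_silence: int = 2) -> List[List[str]]:
    """Acoustic-only rule: close a segment after `max_silence` silent frames."""
    segments: List[List[str]] = []
    current: List[str] = []
    silence = 0
    for is_speech, token in frames:
        if token is not None and token != EOS:
            current.append(token)
        silence = 0 if is_speech else silence + 1
        if silence >= max_silence and current:
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def e2e_segments(frames: List[Frame]) -> List[List[str]]:
    """Joint rule: close a segment only when the decoder emits EOS."""
    segments: List[List[str]] = []
    current: List[str] = []
    for _, token in frames:
        if token == EOS:
            if current:
                segments.append(current)
                current = []
        elif token is not None:
            current.append(token)
    if current:
        segments.append(current)
    return segments


# "set an alarm for... 5 o'clock": a hesitation in the middle of one sentence.
frames: List[Frame] = [
    (True, "set"), (True, "an"), (True, "alarm"), (True, "for"),
    (False, None), (False, None), (False, None),  # hesitation (silence)
    (True, "5"), (True, "o'clock"),
    (False, EOS),  # decoder predicts the sentence is complete
]

vad_result = vad_segments(frames)
e2e_result = e2e_segments(frames)
print(vad_result)
print(e2e_result)
```

In this toy input the silence-based rule splits the sentence at the hesitation, while the token-based rule keeps it whole, which is the failure mode the abstract describes. A real system would decode wordpieces from audio features rather than receive pre-labelled frames.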


