Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems

11/01/2022
by   Shaan Bijwadia, et al.
0

Automatic speech recognition (ASR) systems typically rely on an external endpointer (EP) model to identify speech boundaries. In this work, we propose a method to jointly train the ASR and EP tasks in a single end-to-end (E2E) multitask model, improving EP quality by optionally leveraging information from the ASR audio encoder. We introduce a "switch" connection, which trains the EP to consume either the audio frames directly or low-level latent representations from the ASR model. This results in a single E2E model that can be used during inference to perform frame filtering at low cost, and also make high quality end-of-query (EOQ) predictions based on ongoing ASR computation. We present results on a voice search test set showing that, compared to separate single-task models, this approach reduces median endpoint latency by 120 ms (30.8 without regressing word error rate. For continuous recognition, WER improves by 10.6

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/21/2020

FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

Streaming automatic speech recognition (ASR) aims to emit each hypothesi...
research
03/02/2021

Long-Running Speech Recognizer:An End-to-End Multi-Task Learning Framework for Online ASR and VAD

When we use End-to-end automatic speech recognition (E2E-ASR) system for...
research
04/22/2022

E2E Segmenter: Joint Segmenting and Decoding for Long-Form ASR

Improving the performance of end-to-end ASR models on long utterances ra...
research
07/22/2022

ASR Error Detection via Audio-Transcript entailment

Despite improved performances of the latest Automatic Speech Recognition...
research
07/03/2019

End-to-End Speech Recognition with High-Frame-Rate Features Extraction

State-of-the-art end-to-end automatic speech recognition (ASR) extracts ...
research
11/21/2022

SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale

End-to-end automatic speech recognition systems represent the state of t...
research
12/21/2021

Voice Quality and Pitch Features in Transformer-Based Speech Recognition

Jitter and shimmer measurements have shown to be carriers of voice quali...

Please sign up or login with your details

Forgot password? Click here to reset