Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

04/06/2021
by Yuan Shangguan, et al.

As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems need to decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques (model architectures, training criteria, decoding hyperparameters, and endpointer parameters) on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes) and measures of computation (e.g., FLOPS, RTF) that reflect the model's ability to process input frames are not always strongly correlated with observed UPL. Thus, conventional algorithmic latency measurements may not accurately capture the latency observed when models are deployed on embedded devices. Instead, we find that factors affecting token emission latency and endpointing behavior significantly impact UPL. We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing and using the recently proposed alignment regularization.
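The gap between compute-based measures and UPL can be illustrated with a small sketch. It is not taken from the paper: the `Utterance` fields, the example timings, and the UPL definition used here (time from the end of speech until both the last token has been emitted and the endpointer has fired) are assumptions made for illustration. RTF is computed in the usual way, as processing time divided by audio duration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical per-utterance timings, all in seconds."""
    audio_sec: float       # duration of the input audio
    compute_sec: float     # total time spent in model computation
    speech_end_sec: float  # moment the user stops speaking
    last_token_sec: float  # moment the final token is emitted
    endpoint_sec: float    # moment the endpointer closes the utterance


def real_time_factor(u: Utterance) -> float:
    # Standard RTF: processing time relative to audio duration.
    return u.compute_sec / u.audio_sec


def user_perceived_latency(u: Utterance) -> float:
    # Assumed definition: the result is final only once the last token
    # is out AND the endpointer has fired, measured from end of speech.
    return max(u.last_token_sec, u.endpoint_sec) - u.speech_end_sec


# Two made-up systems with identical compute cost but different
# token emission and endpointing timings.
fast_emit = Utterance(4.0, 1.0, 3.8, 3.9, 4.0)
slow_emit = Utterance(4.0, 1.0, 3.8, 4.3, 4.6)

for name, u in [("fast_emit", fast_emit), ("slow_emit", slow_emit)]:
    rtf = real_time_factor(u)
    upl = user_perceived_latency(u)
    print(f"{name}: RTF={rtf:.2f}, UPL={upl:.2f}s")
```

Under these assumed timings, both systems have the same RTF, yet the one with later token emission and endpointing shows a noticeably larger UPL, which is the kind of mismatch the abstract describes.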


research
06/02/2023

Streaming Speech-to-Confusion Network Speech Recognition

In interactive automatic speech recognition (ASR) systems, low-latency r...
research
11/15/2018

Streaming End-to-end Speech Recognition For Mobile Devices

End-to-end (E2E) models, which directly predict output character sequenc...
research
06/02/2020

Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer

The demand for fast and accurate incremental speech recognition increase...
research
04/06/2021

Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

Often, the storage and computational constraints of embedded devices dema...
research
03/22/2020

Low Latency ASR for Simultaneous Speech Translation

User studies have shown that reducing the latency of our simultaneous le...
research
01/27/2020

Scaling Up Online Speech Recognition Using ConvNets

We design an online end-to-end speech recognition system based on Time-D...
research
05/12/2023

Accelerator-Aware Training for Transducer-Based Speech Recognition

Machine learning model weights and activations are represented in full-p...
