
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios

by Jay Mahadeokar, et al.

Often, the storage and computational constraints of embedded devices demand that a single on-device ASR model serve multiple use-cases / domains. In this paper, we propose a Flexible Transducer (FlexiT) for on-device automatic speech recognition to flexibly deal with multiple use-cases / domains with different accuracy and latency requirements. Specifically, using a single compact model, FlexiT provides a fast response for voice commands, and accurate but higher-latency transcription for dictation. To achieve flexible and better accuracy and latency trade-offs, we use the following techniques. First, we propose domain-specific altering of the segment size for the Emformer encoder, which enables FlexiT to achieve flexible decoding. Second, we use the Alignment Restricted RNNT loss to achieve flexible, fine-grained control of token emission latency for different domains. Finally, we add a domain indicator vector as an additional input to the FlexiT model. Using the combination of these techniques, we show that a single model can be used to improve WERs and real time factor for dictation scenarios while maintaining optimal latency for voice commands use-cases.
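
To make the first and third techniques concrete, here is a minimal sketch of how a domain label could select an encoder segment size and attach a domain indicator vector to each input frame. It is an illustration under stated assumptions, not the paper's implementation: the names `DomainConfig`, `DOMAINS`, and `prepare_segments`, the one-hot encoding of the indicator, and the segment sizes of 4 and 16 frames are all hypothetical. The Emformer encoder and the Alignment Restricted RNNT loss are not modeled.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class DomainConfig:
    domain_id: int       # index used to build the indicator vector
    segment_frames: int  # hypothetical encoder segment size, in frames


# Hypothetical values: a short segment for low-latency voice commands,
# a longer segment for dictation, where more latency is acceptable.
DOMAINS: Dict[str, DomainConfig] = {
    "voice_commands": DomainConfig(domain_id=0, segment_frames=4),
    "dictation": DomainConfig(domain_id=1, segment_frames=16),
}


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def prepare_segments(
    frames: List[List[float]], domain: str
) -> List[List[List[float]]]:
    """Append the domain indicator to every frame, then split the
    sequence into segments whose length depends on the domain."""
    cfg = DOMAINS[domain]
    indicator = one_hot(cfg.domain_id, len(DOMAINS))
    augmented = [list(frame) + indicator for frame in frames]
    step = cfg.segment_frames
    return [augmented[i:i + step] for i in range(0, len(augmented), step)]


if __name__ == "__main__":
    utterance = [[0.0, 0.0, 0.0] for _ in range(32)]  # 32 dummy frames
    for name in DOMAINS:
        segments = prepare_segments(utterance, name)
        print(name, len(segments), len(segments[0]))
```

In this sketch the same utterance is cut into many short segments for voice commands and a few long segments for dictation, while the appended indicator tells the model which domain it is serving.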




A Language Agnostic Multilingual Streaming On-Device ASR System

On-device end-to-end (E2E) models have shown improvements over a convent...

Multi-mode Transformer Transducer with Stochastic Future Context

Automatic speech recognition (ASR) models make fewer errors when more su...

Dissecting User-Perceived Latency of On-Device E2E Speech Recognition

As speech-enabled devices such as smartphones and smart speakers become ...

Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

We propose a dynamic encoder transducer (DET) for on-device speech recog...

Unified Modeling of Multi-Domain Multi-Device ASR Systems

Modern Automatic Speech Recognition (ASR) systems often use a portfolio ...

Personal VAD 2.0: Optimizing Personal Voice Activity Detection for On-Device Speech Recognition

Personalization of on-device speech recognition (ASR) has seen explosive...

Efficiently Scaling Transformer Inference

We study the problem of efficient generative inference for Transformer m...