Multi-mode Transformer Transducer with Stochastic Future Context

06/17/2021
by Kwangyoun Kim, et al.

Automatic speech recognition (ASR) models make fewer errors when they are given more surrounding speech as context. Unfortunately, using a larger future context increases latency, so there is an unavoidable trade-off between speed and accuracy. A naive way to meet different latency requirements is to store multiple models and pick the best one that satisfies the constraints. A more desirable approach is a single model that can dynamically adjust its latency to different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference: when a larger latency is acceptable, the model can process a longer future context to achieve higher accuracy, and when the latency budget is strict, the model can depend less on future context while still achieving reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on the AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
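The sampling idea in the abstract can be illustrated with a minimal sketch. The abstract only states that one streaming configuration is sampled in each training iteration; everything else here is an assumption made for illustration: the candidate future-context sizes, uniform sampling, unlimited past context, the use of a self-attention mask to limit future context, and the hypothetical names `FUTURE_CONTEXT_CHOICES`, `sample_future_context`, `build_attention_mask`, and `training_masks`. It is not the authors' implementation.

```python
import random

# Hypothetical candidate future-context sizes, in frames.
# None stands for full (unlimited) future context. These values are assumed.
FUTURE_CONTEXT_CHOICES = [0, 2, 8, None]


def sample_future_context(rng):
    """Pick one streaming configuration for the current iteration (uniform, assumed)."""
    return rng.choice(FUTURE_CONTEXT_CHOICES)


def build_attention_mask(num_frames, future_context):
    """Return mask[i][j] == True when frame i may attend to frame j.

    Assumption: every past frame is visible, and at most `future_context`
    future frames are visible (None means all future frames).
    """
    mask = []
    for i in range(num_frames):
        if future_context is None:
            last_visible = num_frames - 1
        else:
            last_visible = min(num_frames - 1, i + future_context)
        mask.append([j <= last_visible for j in range(num_frames)])
    return mask


def training_masks(num_iterations, num_frames, seed=0):
    """Yield (future_context, mask) pairs, one freshly sampled per iteration."""
    rng = random.Random(seed)
    for _ in range(num_iterations):
        future_context = sample_future_context(rng)
        yield future_context, build_attention_mask(num_frames, future_context)


if __name__ == "__main__":
    for future_context, mask in training_masks(num_iterations=3, num_frames=5):
        visible = [sum(row) for row in mask]
        print(f"future_context={future_context}, visible frames per row={visible}")
```

In a sketch like this, the mask would be passed to the encoder's self-attention during the forward pass for that iteration, and at inference time a fixed future-context size would be chosen to match the latency budget. In a stacked encoder, the effective look-ahead grows with depth if every layer uses the same per-layer future context, so the per-layer value would need to be chosen with that in mind.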

