Dynamic latency speech recognition with asynchronous revision

11/03/2020
by Mingkun Huang, et al.

In this work we propose an inference technique, asynchronous revision, to unify streaming and non-streaming speech recognition models. Specifically, we achieve dynamic latency with a single model by using arbitrary right context during inference. The model is composed of a stack of convolutional layers for audio encoding. At inference time, the history states of the encoder and decoder can be asynchronously revised to trade off latency against accuracy. To alleviate the mismatch between training and inference, we propose a training technique, segment cropping, which randomly splits input utterances into several segments with forward connections. This yields dynamic latency speech recognition results with large improvements in accuracy. Experiments show that our dynamic latency model with asynchronous revision gives 8%-14% relative improvements over the streaming models.
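The abstract does not spell out how segment cropping is implemented, so the following is only a minimal sketch of one plausible reading: an utterance of `num_frames` frames is cut at random points into contiguous segments, and "forward connections" are taken to mean that each frame may use every frame in its own segment and in all earlier segments, but none from later segments. The function names `crop_segments` and `visibility_mask` and this interpretation of the connections are assumptions made for illustration, not the authors' code.

```python
import random


def crop_segments(num_frames, num_segments, rng):
    """Split range(num_frames) into contiguous segments at random cut points.

    Returns a list of (start, end) pairs with end exclusive.
    Hypothetical helper: the paper's actual sampling scheme may differ.
    """
    if not 1 <= num_segments <= num_frames:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Distinct interior cut points, sorted so the segments are in time order.
    cuts = sorted(rng.sample(range(1, num_frames), num_segments - 1))
    bounds = [0] + cuts + [num_frames]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def visibility_mask(segments, num_frames):
    """mask[t][s] is True when frame t may use frame s.

    Assumed rule: s is visible to t if s lies before the end of t's segment,
    i.e. in the same segment (right context inside the segment) or in an
    earlier one (the forward connection from past segments).
    """
    segment_end = [0] * num_frames
    for start, end in segments:
        for t in range(start, end):
            segment_end[t] = end
    return [[s < segment_end[t] for s in range(num_frames)]
            for t in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    segments = crop_segments(num_frames=10, num_segments=3, rng=rng)
    mask = visibility_mask(segments, num_frames=10)
    print(segments)
    for row in mask:
        print("".join("x" if visible else "." for visible in row))
```

Under this reading, the segment length bounds the right context a frame sees during training, so sampling different cut points for each utterance exposes the model to a range of look-ahead sizes, which is the kind of variation that dynamic-latency inference would then exercise.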

research
10/07/2020

Transformer Transducer: One Model Unifying Streaming and Non-streaming Speech Recognition

In this paper we present a Transformer-Transducer model architecture and...
research
03/29/2022

Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

An inferior performance of the streaming automatic speech recognition mo...
research
03/14/2023

Adapting Offline Speech Translation Models for Streaming with Future-Aware Distillation and Inference

A popular approach to streaming speech translation is to employ a single...
research
04/05/2021

Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

We propose a dynamic encoder transducer (DET) for on-device speech recog...
research
11/16/2022

Streaming Joint Speech Recognition and Disfluency Detection

Disfluency detection has mainly been solved in a pipeline approach, as p...
research
05/26/2022

Global Normalization for Streaming Speech Recognition in a Modular Framework

We introduce the Globally Normalized Autoregressive Transducer (GNAT) fo...
research
10/07/2021

Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

This paper improves the streaming transformer transducer for speech reco...
