Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer

03/29/2022
by   Jingyu Sun, et al.
SenseTime Corporation

Streaming automatic speech recognition models often perform worse than non-streaming models because they lack future context. To improve the performance of the streaming model and reduce computational complexity, this paper employs a frame-level model that uses an efficient augmented memory transformer block and a dynamic latency training method. Long-range history context is stored in the augmented memory bank to complement the limited history context used in the encoder. Keys and values are cached and reused for the next chunk to reduce computation. A dynamic latency training method is then proposed to obtain better performance and to support low- and high-latency inference simultaneously. Experiments are conducted on the 960-hour LibriSpeech benchmark data set. With an average latency of 640 ms, our model achieves a relative WER reduction of 6.0 3.0
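The abstract names two mechanisms that a small example may make more concrete: caching keys and values so the next chunk can reuse them, and training with varying latency. The sketch below is illustrative only and is not the authors' implementation. It assumes single-head attention with identity projections (each frame serves as its own query, key, and value), omits the augmented memory bank entirely, and treats "dynamic latency training" as drawing a chunk size at random per batch, which is an assumption the abstract does not confirm. The names `attend`, `stream_chunks`, and `sample_chunk_size` are hypothetical.

```python
import math
import random
from collections import deque


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def stream_chunks(frames, chunk_size, left_context):
    """Process frames chunk by chunk, reusing cached keys/values.

    Assumption: identity projections, so a frame is its own key and value.
    The cache keeps at most `left_context` past frames (limited history).
    """
    cache = deque(maxlen=left_context)  # (key, value) pairs from earlier chunks
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Cached history is reused as-is; only the new chunk is appended.
        keys = [k for k, _ in cache] + chunk
        values = [v for _, v in cache] + chunk
        outputs.extend(attend(q, keys, values) for q in chunk)
        # Store this chunk for the next one; the oldest entries drop out.
        cache.extend((f, f) for f in chunk)
    return outputs


def sample_chunk_size(choices, rng):
    """Assumed reading of dynamic latency training: pick a chunk size per batch."""
    return rng.choice(choices)


if __name__ == "__main__":
    frames = [[0.5, -1.0], [1.0, 0.25], [-0.75, 2.0], [0.0, 1.5]]
    rng = random.Random(0)
    chunk_size = sample_chunk_size([1, 2, 4], rng)
    for row in stream_chunks(frames, chunk_size, left_context=2):
        print([round(x, 3) for x in row])
```

A larger chunk size lets each frame see more of the frames that follow it within the chunk, at the cost of waiting longer before emitting output, which is the latency trade-off the abstract refers to.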

10/21/2020

Emformer: Efficient Memory Transformer Based Acoustic Model For Low Latency Streaming Speech Recognition

This paper proposes an efficient memory transformer Emformer for low lat...
10/07/2021

Streaming Transformer Transducer Based Speech Recognition Using Non-Causal Convolution

This paper improves the streaming transformer transducer for speech reco...
11/03/2020

Dynamic latency speech recognition with asynchronous revision

In this work we propose an inference technique, asynchronous revision, t...
11/02/2022

Fast-U2++: Fast and Accurate End-to-End Speech Recognition in Joint CTC/Attention Frames

Recently, the unified streaming and non-streaming two-pass (U2/U2++) end...
03/11/2022

Transformer-based Streaming ASR with Cumulative Attention

In this paper, we propose an online attention mechanism, known as cumula...
02/23/2023

Evaluating Automatic Speech Recognition in an Incremental Setting

The increasing reliability of automatic speech recognition has prolifera...
11/04/2022

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Sequence transducers, such as the RNN-T and the Conformer-T, are one of ...