TrimTail: Low-Latency Streaming ASR with Simple but Effective Spectrogram-Level Length Penalty

by Xingchen Song, et al.
Horizon Robotics

In this paper, we present TrimTail, a simple but effective emission regularization method that reduces the latency of streaming ASR models. The core idea of TrimTail is to apply a length penalty (i.e., trimming trailing frames, see Fig. 1-(b)) directly to the spectrogram of input utterances, which does not require any alignment. We demonstrate that TrimTail is computationally cheap, can be applied online, and can be optimized with any training loss or model architecture on any dataset without extra effort, by applying it to various end-to-end streaming ASR networks trained with either CTC loss [1] or Transducer loss [2]. We achieve a 100 to 200 ms latency reduction with equal or even better accuracy on both Aishell-1 and Librispeech. Moreover, by using TrimTail, we can achieve a 400 ms algorithmic improvement in User Sensitive Delay (USD) with an accuracy loss of less than 0.2.
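
The trimming step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `trim_tail`, the `max_trim` parameter, its default value, and the choice to sample the trim length uniformly at random are all assumptions made for illustration. The abstract only states that trailing frames are trimmed from the input spectrogram.

```python
import numpy as np


def trim_tail(spec, max_trim=50, rng=None):
    """Drop a random number of trailing frames from a spectrogram.

    spec:     array of shape (num_frames, num_bins)
    max_trim: hypothetical upper bound on frames removed (an assumption,
              not a value taken from the paper)
    rng:      optional numpy Generator, for reproducibility
    """
    rng = np.random.default_rng() if rng is None else rng
    num_frames = spec.shape[0]

    # Keep at least one frame so the utterance never becomes empty.
    upper = min(max_trim, num_frames - 1)
    if upper <= 0:
        return spec

    # Assumed sampling scheme: uniform over 0..upper inclusive
    # (Generator.integers excludes the high end, hence the +1).
    trim = int(rng.integers(0, upper + 1))

    # Only the input is shortened; the transcript (label sequence) is
    # left unchanged, so no alignment information is needed.
    return spec[: num_frames - trim]


if __name__ == "__main__":
    features = np.random.default_rng(0).standard_normal((200, 80))
    trimmed = trim_tail(features, max_trim=50, rng=np.random.default_rng(1))
    print(features.shape, "->", trimmed.shape)
```

Because the transformation touches only the input features, a sketch like this could sit in a data-loading pipeline next to other spectrogram augmentations, which is consistent with the abstract's claim that the method is independent of the training loss and model architecture.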


FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization

Streaming automatic speech recognition (ASR) aims to emit each hypothesi...

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the tri...

Reducing Streaming ASR Model Delay with Self Alignment

Reducing prediction delay for streaming end-to-end ASR models with minim...

Detecting Rumours with Latency Guarantees using Massive Streaming Data

Today's social networks continuously generate massive streams of data, w...

CUSIDE: Chunking, Simulating Future Context and Decoding for Streaming ASR

History and future contextual information are known to be important for ...

Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models

Streaming end-to-end automatic speech recognition (ASR) systems are wide...

Conversation-oriented ASR with multi-look-ahead CBS architecture

During conversations, humans are capable of inferring the intention of t...