ZeroPrompt: Streaming Acoustic Encoders are Zero-Shot Masked LMs

05/18/2023
by   Xingchen Song, et al.

In this paper, we present ZeroPrompt (Figure 1-(a)) and the corresponding Prompt-and-Refine strategy (Figure 3), two simple but effective training-free methods that decrease the Token Display Time (TDT) of streaming ASR models without any accuracy loss. The core idea of ZeroPrompt is to append zeroed content to each chunk during inference, which acts as a prompt that encourages the model to predict future tokens even before they are spoken. We argue that streaming acoustic encoders naturally possess the modeling ability of Masked Language Models, and our experiments demonstrate that ZeroPrompt is cheap to engineer and can be applied to streaming acoustic encoders on any dataset without any accuracy loss. Specifically, compared with our baseline models, we achieve a 350 ∼ 700ms reduction in First Token Display Time (TDT-F) and a 100 ∼ 400ms reduction in Last Token Display Time (TDT-L), with theoretically and experimentally equal WER on both the Aishell-1 and Librispeech datasets.
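As a rough illustration (not the authors' code), the core ZeroPrompt operation — appending zeroed "future" frames to each streaming chunk before it is fed to the acoustic encoder — can be sketched as below. The `zero_prompt_chunk` helper, the frame counts, and the 80-dim feature shape are assumptions for illustration only:

```python
import numpy as np

def zero_prompt_chunk(chunk: np.ndarray, prompt_frames: int) -> np.ndarray:
    """Append zeroed frames (the 'zero prompt') to an acoustic feature chunk.

    chunk: (num_frames, feat_dim) features for the current streaming chunk.
    prompt_frames: number of zeroed future frames to append; the encoder
    treats these like masked positions and may emit tokens for them early.
    """
    feat_dim = chunk.shape[1]
    zero_prompt = np.zeros((prompt_frames, feat_dim), dtype=chunk.dtype)
    return np.concatenate([chunk, zero_prompt], axis=0)

# Example: a 16-frame chunk of 80-dim fbank features, padded with 8 zeroed frames.
chunk = np.random.randn(16, 80).astype(np.float32)
padded = zero_prompt_chunk(chunk, prompt_frames=8)
```

Under the Prompt-and-Refine strategy described in the paper, any tokens the model emits over the zeroed region are provisional: they are displayed immediately to reduce TDT and then refined once the real audio for those frames arrives in later chunks.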


