Conversation-oriented ASR with multi-look-ahead CBS architecture

11/02/2022
by Huaibo Zhao, et al.

During conversations, humans can infer the speaker's intention at any point in the speech and prepare the following action promptly. This ability is also key for conversational systems to achieve rhythmic and natural conversation. To achieve this, the automatic speech recognition (ASR) system that transcribes speech in real time must be highly accurate without delay. In streaming ASR, high accuracy is obtained by attending to look-ahead frames, which increases delay. To tackle this trade-off, we propose a multiple-latency streaming ASR system that achieves high accuracy with zero look-ahead. The proposed system contains two encoders that operate in parallel: a primary encoder generates accurate outputs using look-ahead frames, and an auxiliary encoder recognizes the look-ahead portion of the primary encoder without look-ahead. The proposed system is built on the contextual block streaming (CBS) architecture, which leverages block processing and has a high affinity for the multiple-latency architecture. Various methods for constructing the system are also studied, including shifting the network to perform as different encoders and generating both encoders' outputs in one encoding pass.
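
The two-encoder idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the encoders here are placeholder functions that only label frame indices, the block sizes are arbitrary, and the way the two outputs are combined (committed primary outputs followed by provisional auxiliary outputs) is an assumption made for illustration. The names `split_blocks`, `stream`, `toy_primary`, and `toy_auxiliary` are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple


def split_blocks(num_frames: int, center: int, look_ahead: int) -> Iterator[Tuple[range, range]]:
    """Yield (central frames, look-ahead frames) index ranges for each block."""
    for start in range(0, num_frames, center):
        mid = min(start + center, num_frames)
        end = min(mid + look_ahead, num_frames)
        yield range(start, mid), range(mid, end)


def stream(
    num_frames: int,
    center: int,
    look_ahead: int,
    primary: Callable[[range, range], List[str]],
    auxiliary: Callable[[range], List[str]],
) -> Iterator[List[str]]:
    """After each block, yield outputs covering every frame received so far."""
    committed: List[str] = []
    for central, ahead in split_blocks(num_frames, center, look_ahead):
        # Primary encoder: outputs for the central frames, with access to the look-ahead frames.
        committed.extend(primary(central, ahead))
        # Auxiliary encoder: outputs for the look-ahead frames, with no further look-ahead.
        provisional = auxiliary(ahead)
        # Assumed combination: committed primary outputs plus provisional auxiliary outputs.
        yield list(committed) + provisional


def toy_primary(central: range, ahead: range) -> List[str]:
    # Placeholder for a neural encoder: tags each central frame with "P".
    return [f"P{i}" for i in central]


def toy_auxiliary(ahead: range) -> List[str]:
    # Placeholder for a neural encoder: tags each look-ahead frame with "A".
    return [f"A{i}" for i in ahead]


steps = list(stream(6, center=2, look_ahead=1, primary=toy_primary, auxiliary=toy_auxiliary))
for step in steps:
    print(step)
```

In this toy run, each step's output reaches the most recent frame received: frames in the look-ahead region are first labeled by the auxiliary function and are later replaced by the primary function's output once they become central frames of the next block.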

research
09/09/2023

Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

Achieving high accuracy with low latency has always been a challenge in ...
research
10/25/2022

Streaming Parrotron for on-device speech-to-speech conversion

We present a fully on-device and streaming Speech-To-Speech (STS) conver...
research
10/20/2021

An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR

In the present paper, an attempt is made to combine Mask-CTC and the tri...
research
07/06/2022

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

There is often a trade-off between performance and latency in streaming ...
research
03/29/2022

Streaming parallel transducer beam search with fast-slow cascaded encoders

Streaming ASR with strict latency constraints is required in many speech...
research
08/11/2023

Bilingual Streaming ASR with Grapheme units and Auxiliary Monolingual Loss

We introduce a bilingual solution to support English as secondary locale...
research
01/23/2023

Efficient Encoders for Streaming Sequence Tagging

A naive application of state-of-the-art bidirectional encoders for strea...
