Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

07/02/2022
by   Kun Wei, et al.

Transformer-based models have demonstrated their effectiveness in automatic speech recognition (ASR) tasks and have even shown superior performance over the conventional hybrid framework. The main idea of Transformers is to capture the long-range global context within an utterance through self-attention layers. However, for scenarios like conversational speech, such utterance-level modeling neglects contextual dependencies that span across utterances. In this paper, we propose to explicitly model inter-sentential information in a Transformer-based end-to-end architecture for conversational speech recognition. Specifically, in the encoder network, we capture the context of preceding speech and incorporate this historic information into the current input through a context-aware residual attention mechanism. In the decoder, the prediction of the current utterance is likewise conditioned on historic linguistic information through a conditional decoder framework. We show the effectiveness of the proposed method on several open-source dialogue corpora, where it consistently improves performance over utterance-level Transformer-based ASR models.
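To make the encoder-side idea concrete, the sketch below shows one plausible way to fuse a cached representation of previous utterances into the current utterance through a residual cross-attention branch. This is not the authors' implementation: it is a minimal PyTorch-style illustration, and all names (ContextAwareEncoderLayer, d_model, n_heads) and the specific layer ordering are assumptions made for the example.

# Hypothetical sketch, NOT the paper's implementation. It only illustrates
# the idea of a context-aware residual attention: the current utterance
# attends to cached encodings of prior utterances, and the resulting
# context vector is added back through a residual connection.
import torch
import torch.nn as nn

class ContextAwareEncoderLayer(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Standard intra-utterance self-attention.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention from the current utterance to historic context.
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # x:       (batch, T_cur, d_model)  frames of the current utterance
        # context: (batch, T_hist, d_model) cached encodings of prior utterances
        x = self.norm1(x + self.self_attn(x, x, x, need_weights=False)[0])
        # Residual branch injecting inter-sentential information.
        x = self.norm2(x + self.ctx_attn(x, context, context, need_weights=False)[0])
        return self.norm3(x + self.ffn(x))

if __name__ == "__main__":
    layer = ContextAwareEncoderLayer()
    cur = torch.randn(2, 50, 256)   # current utterance features
    hist = torch.randn(2, 30, 256)  # cached summary of previous utterance(s)
    print(layer(cur, hist).shape)   # torch.Size([2, 50, 256])

The paper's decoder-side counterpart (the conditional decoder) would analogously condition next-token prediction on historic linguistic information; the exact mechanism is described in the full text.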


Related research:

02/16/2021 · Hierarchical Transformer-based Large-Context End-to-end ASR with Large-Context Knowledge Distillation
We present a novel large-context end-to-end automatic speech recognition...

04/19/2021 · Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers
This paper addresses end-to-end automatic speech recognition (ASR) for l...

06/23/2023 · Towards Effective and Compact Contextual Representation for Conformer Transducer Speech Recognition Systems
Current ASR systems are mainly trained and evaluated at the utterance le...

11/05/2021 · Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling
Conversational speech normally is embodied with loose syntactic structur...

10/23/2020 · GraphSpeech: Syntax-Aware Graph Attention Network For Neural Speech Synthesis
Attention-based end-to-end text-to-speech synthesis (TTS) is superior to...

04/21/2021 · Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and Backward Transformers
This paper proposes a novel label-synchronous speech-to-text alignment t...

04/10/2021 · Boundary and Context Aware Training for CIF-based Non-Autoregressive End-to-end ASR
Continuous integrate-and-fire (CIF) based models, which use a soft and m...
