Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

04/19/2021
by   Takaaki Hori, et al.

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lectures and conversational speech. Most end-to-end ASR models are designed to recognize independent utterances, but contextual information (e.g., speaker or topic) spanning multiple utterances is known to be useful for ASR. In our prior work, we proposed a context-expanded Transformer that accepts multiple consecutive utterances at the same time and predicts an output sequence for the last utterance, achieving a 5-15% error reduction from utterance-based baselines in lecture and conversational ASR benchmarks. Although these results show a remarkable performance gain, there is still potential to improve the model architecture and the decoding process. In this paper, we extend our prior work by (1) introducing the Conformer architecture to further improve accuracy, (2) accelerating the decoding process with a novel activation recycling technique, and (3) enabling streaming decoding with triggered attention. We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance, obtaining a 17.3% [...] rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new decoding method reduces decoding time by more than 50% and enables streaming ASR with limited accuracy degradation.
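The context-expansion idea in the abstract (feed several consecutive utterances, predict only the last one) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `context_expanded_inputs`, the `num_context` parameter, and the use of plain strings in place of acoustic feature sequences are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def context_expanded_inputs(
    utterances: Sequence[str], num_context: int = 2
) -> List[Tuple[List[str], str]]:
    """Pair each utterance with up to `num_context` preceding utterances.

    Hypothetical helper: each returned tuple is (context, target), where
    `context` holds the preceding utterances and `target` is the utterance
    whose transcript a model would predict.
    """
    windows = []
    for i, current in enumerate(utterances):
        start = max(0, i - num_context)  # early utterances get less context
        context = list(utterances[start:i])
        windows.append((context, current))
    return windows


# Placeholder utterance IDs standing in for segments of one long recording.
recording = ["utt1", "utt2", "utt3", "utt4"]
windows = context_expanded_inputs(recording, num_context=2)
for context, target in windows:
    print(context, "->", target)
```

In a real system the context and target would be concatenated as feature or token sequences before being passed to the encoder and decoder; the sketch only shows how the sliding window over consecutive utterances might be formed.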

research
07/02/2022

Improving Transformer-based Conversational ASR by Inter-Sentential Attention Mechanism

Transformer-based models have demonstrated their effectiveness in automa...
research
10/27/2022

Contextual-Utterance Training for Automatic Speech Recognition

Recent studies of streaming automatic speech recognition (ASR) recurrent...
research
07/05/2022

Compute Cost Amortized Transformer for Streaming ASR

We present a streaming, Transformer-based end-to-end automatic speech re...
research
06/29/2023

Leveraging Cross-Utterance Context For ASR Decoding

While external language models (LMs) are often incorporated into the dec...
research
11/05/2021

Conversational speech recognition leveraging effective fusion methods for cross-utterance language modeling

Conversational speech normally is embodied with loose syntactic structur...
research
07/17/2020

Towards an Automated SOAP Note: Classifying Utterances from Medical Conversations

Summaries generated from medical conversations can improve recall and un...
research
01/14/2022

A Study of Transducer based End-to-End ASR with ESPnet: Architecture, Auxiliary Loss and Decoding Strategies

In this study, we present recent developments of models trained with the...
