Contextual-Utterance Training for Automatic Speech Recognition

10/27/2022
by   Alejandro Gomez-Alanis, et al.
0

Recent studies of streaming automatic speech recognition (ASR) recurrent neural network transducer (RNN-T)-based systems have fed the encoder with past contextual information in order to improve its word error rate (WER) performance. In this paper, we first propose a contextual-utterance training technique which makes use of the previous and future contextual utterances in order to do an implicit adaptation to the speaker, topic and acoustic environment. Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems. This proposed approach allows to make a better use of the available acoustic context in streaming models by distilling "in-place" the knowledge of a teacher, which is able to see both past and future contextual utterances, to the student which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique is able to reduce both the WER and the average last token emission latency by more than 6 respectively.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/02/2021

Attention-based Contextual Language Model Adaptation for Speech Recognition

Language modeling (LM) for automatic speech recognition (ASR) does not u...
research
06/11/2021

Improving RNN-T ASR Performance with Date-Time and Location Awareness

In this paper, we explore the benefits of incorporating context into a R...
research
04/19/2021

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

This paper addresses end-to-end automatic speech recognition (ASR) for l...
research
08/19/2020

Cross-Utterance Language Models with Acoustic Error Sampling

The effective exploitation of richer contextual information in language ...
research
05/14/2021

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Comprehending the overall intent of an utterance helps a listener recogn...
research
04/25/2021

Bridging the gap between streaming and non-streaming ASR systems bydistilling ensembles of CTC and RNN-T models

Streaming end-to-end automatic speech recognition (ASR) systems are wide...
research
11/20/2020

Improving RNN-T ASR Accuracy Using Untranscribed Context Audio

We present a new training scheme for streaming automatic speech recognit...

Please sign up or login with your details

Forgot password? Click here to reset