LongFNT: Long-form Speech Recognition with Factorized Neural Transducer

11/17/2022
by Xun Gong, et al.

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances and do not exploit the useful historical information available in long-form speech, which is the more practical setting in real scenarios. In our preliminary experiments, simply attending to a longer transcription history in a vanilla neural transducer model yields little gain, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, using the pre-trained contextual encoder RoBERTa to further boost performance. Moreover, we propose the LongFNT architecture, which extends the original speech input with long-form speech and achieves the best performance. The effectiveness of our LongFNT approach is validated on LibriSpeech and GigaSpeech corpora with 19 respectively.
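The sentence-level fusion described in the abstract can be illustrated with a small sketch. The abstract does not state how the long-form features are combined with the vocabulary predictor output, so the mean pooling and additive fusion below are assumptions chosen for simplicity, and the names `mean_pool`, `fuse`, and `scale` are hypothetical. In the paper's setting the history vectors would come from a contextual encoder such as RoBERTa; here they are hand-written toy values.

```python
from typing import List

Vector = List[float]


def mean_pool(history: List[Vector]) -> Vector:
    """Average the per-sentence history embeddings into one context vector."""
    dim = len(history[0])
    return [sum(vec[i] for vec in history) / len(history) for i in range(dim)]


def fuse(predictor_states: List[Vector], context: Vector, scale: float = 1.0) -> List[Vector]:
    """Add the same sentence-level context vector to every token-level state.

    Additive fusion is an assumption made for this sketch, not the
    operator confirmed by the abstract.
    """
    return [[h + scale * c for h, c in zip(state, context)] for state in predictor_states]


# Toy embeddings for two previous sentences (stand-ins for encoder outputs).
history = [[1.0, 0.0], [3.0, 2.0]]
context = mean_pool(history)

# Toy vocabulary-predictor outputs for two tokens of the current utterance.
states = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(states, context)
```

Every token position receives the same context vector, which is what makes this a sentence-level rather than token-level feature; the token-level variant mentioned in the abstract would instead operate inside the vocabulary predictor.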

research
09/27/2021

Factorized Neural Transducer for Efficient Language Model Adaptation

In recent years, end-to-end (E2E) based automatic speech recognition (AS...
research
12/05/2022

Fast and accurate factorized neural transducer for text adaption of end-to-end speech recognition models

Neural transducer is now the most popular end-to-end model for speech re...
research
05/22/2017

Use of Knowledge Graph in Rescoring the N-Best List in Automatic Speech Recognition

With the evolution of neural network based methods, automatic speech rec...
research
10/23/2020

Enriching Under-Represented Named-Entities To Improve Speech Recognition Performance

Automatic speech recognition (ASR) for under-represented named-entity (U...
research
01/15/2023

Rationalizing Predictions by Adversarial Information Calibration

Explaining the predictions of AI models is paramount in safety-critical ...
research
05/04/2020

Fast and Robust Unsupervised Contextual Biasing for Speech Recognition

Automatic speech recognition (ASR) system is becoming a ubiquitous techn...
