JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

02/16/2023
by   Zhong Meng, et al.
0

We propose JEIT, a joint end-to-end (E2E) model and internal language model (ILM) training method to inject large-scale unpaired text into ILM during E2E training which improves rare-word speech recognition. With JEIT, the E2E model computes an E2E loss on audio-transcript pairs while its ILM estimates a cross-entropy loss on unpaired text. The E2E model is trained to minimize a weighted sum of E2E and ILM losses. During JEIT, ILM absorbs knowledge from unpaired text while the E2E training serves as regularization. Unlike ILM adaptation methods, JEIT does not require a separate adaptation step and avoids the need for Kullback-Leibler divergence regularization of ILM. We also show that modular hybrid autoregressive transducer (MHAT) performs better than HAT in the JEIT framework, and is much more robust than HAT during ILM adaptation. To push the limit of unpaired text injection, we further propose a combined JEIT and JOIST training (CJJT) that benefits from modality matching, encoder text injection and ILM training. Both JEIT and CJJT can foster a more effective LM fusion. With 100B unpaired sentences, JEIT/CJJT improves rare-word recognition accuracy by up to 16.4

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/06/2021

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Text-only adaptation of an end-to-end (E2E) model remains a challenging ...
research
10/03/2016

Nonsymbolic Text Representation

We introduce the first generic text representation model that is complet...
research
10/31/2022

Modular Hybrid Autoregressive Transducer

Text-only adaptation of a transducer model remains challenging for end-t...
research
06/15/2022

Residual Language Model for End-to-end Speech Recognition

End-to-end automatic speech recognition suffers from adaptation to unkno...
research
10/05/2021

Fast Contextual Adaptation with Neural Associative Memory for On-Device Personalized Speech Recognition

Fast contextual adaptation has shown to be effective in improving Automa...
research
10/13/2022

JOIST: A Joint Speech and Text Streaming Model For ASR

We present JOIST, an algorithm to train a streaming, cascaded, encoder e...
research
03/09/2022

Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

Language model fusion helps smart assistants recognize words which are r...

Please sign up or login with your details

Forgot password? Click here to reset