Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition

06/04/2021
by   Zhong Meng, et al.

Integrating external language models (LMs) into end-to-end (E2E) models remains a challenging task for domain-adaptive speech recognition. Recently, internal language model estimation (ILME)-based LM fusion has shown significant word error rate (WER) reduction over Shallow Fusion by subtracting a weighted internal LM score from an interpolation of E2E model and external LM scores during beam search. However, the optimal LM interpolation weights vary over a wide range across test sets and have to be tuned extensively on well-matched validation sets. In this work, we perform LM fusion in the minimum WER (MWER) training of an E2E model to obviate the need for LM weight tuning during inference. Besides MWER training with Shallow Fusion (MWER-SF), we propose a novel MWER training with ILME (MWER-ILME), in which ILME-based fusion is used to generate the N-best hypotheses and their posteriors. An additional gradient is induced when the internal LM is engaged in the MWER-ILME loss computation. During inference, the LM weights pre-determined in MWER training enable robust LM integration on test sets from different domains. Experimented with 30K-hour trained transformer transducers, MWER-ILME achieves on average 8.8 respectively, on 6 different test sets
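The scoring and loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each N-best hypothesis comes with three sequence-level log-probabilities (E2E model, external LM, estimated internal LM), that posteriors are obtained by renormalizing the fused scores over the N-best list, and that the loss subtracts the mean word error count as a baseline, as is common in MWER formulations. The function names, weight names, and numeric values are all hypothetical.

```python
import math


def fused_score(log_p_e2e, log_p_ext_lm, log_p_ilm, lm_weight, ilm_weight):
    """ILME-based fusion score for one hypothesis.

    Interpolates the E2E and external LM log-scores, then subtracts a
    weighted internal LM log-score. Setting ilm_weight to 0 reduces this
    to Shallow Fusion.
    """
    return log_p_e2e + lm_weight * log_p_ext_lm - ilm_weight * log_p_ilm


def nbest_posteriors(scores):
    """Renormalize fused log-scores over the N-best list (stable softmax)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mwer_loss(scores, word_errors):
    """Expected word errors over the N-best list, relative to the mean.

    Assumption: the mean error count is used as a baseline, so the loss is
    zero when all hypotheses have the same number of errors.
    """
    posteriors = nbest_posteriors(scores)
    mean_err = sum(word_errors) / len(word_errors)
    return sum(p * (w - mean_err) for p, w in zip(posteriors, word_errors))


# Hypothetical 3-best list: (log P_E2E, log P_extLM, log P_ILM, word errors)
nbest = [(-4.0, -6.0, -5.0, 0), (-3.5, -9.0, -4.0, 2), (-5.0, -7.0, -6.5, 1)]
scores = [fused_score(a, b, c, lm_weight=0.5, ilm_weight=0.25) for a, b, c, _ in nbest]
loss = mwer_loss(scores, [errs for _, _, _, errs in nbest])
```

In an actual training setup the three log-probabilities would be differentiable outputs of the networks, so that the internal LM term contributes its own gradient to the loss; the plain-float version here only shows the arithmetic.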


research
11/03/2020

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

The external language models (LM) integration remains a challenging task...
research
04/15/2022

Improving Rare Word Recognition with LM-aware MWER Training

Language models (LMs) significantly improve the recognition accuracy of ...
research
01/28/2022

Neural-FST Class Language Model for End-to-End Speech Recognition

We propose Neural-FST Class Language Model (NFCLM) for end-to-end speech...
research
10/23/2020

On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer

Hybrid Autoregressive Transducer (HAT) is a recently proposed end-to-end...
research
10/06/2021

Internal Language Model Adaptation with Text-Only Data for End-to-End Speech Recognition

Text-only adaptation of an end-to-end (E2E) model remains a challenging ...
research
02/17/2023

Massively Multilingual Shallow Fusion with Large Language Models

While large language models (LLM) have made impressive progress in natur...
research
11/02/2022

Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation

ASR model deployment environment is ever-changing, and the incoming spee...
