A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition

02/26/2020
by Erik McDermott, et al.

This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR). The method is applied to three components: a Recurrent Neural Network Transducer (RNN-T) ASR model trained on a given domain, a matched in-domain RNN-LM, and a target-domain RNN-LM. It uses Bayes' Rule to define RNN-T posteriors for the target domain, in a manner directly analogous to the classic hybrid model for ASR based on Deep Neural Networks (DNNs) or LSTMs in the Hidden Markov Model (HMM) framework (Bourlard & Morgan, 1994). The proposed approach is evaluated in cross-domain and limited-data scenarios, in which a significant amount of target-domain text data is used for LM training, but only limited (or no) {audio, transcript} training data pairs are used to train the RNN-T. Specifically, an RNN-T model trained on paired audio and transcript data from YouTube is evaluated for its ability to generalize to Voice Search data. The Density Ratio method was found to consistently outperform Shallow Fusion, the dominant approach to integrating LMs with end-to-end ASR.


research
11/03/2020

Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition

The external language models (LM) integration remains a challenging task...
research
02/26/2022

Integrating Text Inputs For Training and Adapting RNN Transducer ASR Models

Compared to hybrid automatic speech recognition (ASR) systems that use a...
research
03/31/2022

An Empirical Study of Language Model Integration for Transducer based Speech Recognition

Utilizing text-only data with an external language model (LM) in end-to-...
research
01/10/2022

A Likelihood Ratio based Domain Adaptation Method for E2E Models

End-to-end (E2E) automatic speech recognition models like Recurrent Neur...
research
05/07/2020

RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions

In recent years, all-neural end-to-end approaches have obtained state-of...
research
02/16/2023

Adaptable End-to-End ASR Models using Replaceable Internal LMs and Residual Softmax

End-to-end (E2E) automatic speech recognition (ASR) implicitly learns th...
research
10/26/2020

Improved Neural Language Model Fusion for Streaming Recurrent Neural Network Transducer

Recurrent Neural Network Transducer (RNN-T), like most end-to-end speech...
