Sentence-Select: Large-Scale Language Model Data Selection for Rare-Word Speech Recognition

03/09/2022
by   W. Ronny Huang, et al.

Language model fusion helps smart assistants recognize words that are rare in acoustic data but abundant in text-only corpora (typed search logs). However, such corpora have properties that hinder downstream performance, including being (1) too large, (2) beset with domain-mismatched content, and (3) heavy-headed rather than heavy-tailed (excessively many duplicate search queries such as "weather"). We show that three simple strategies for selecting language modeling data can dramatically improve rare-word recognition without harming overall performance. First, to address the heavy-headedness, we downsample the data according to a soft log function, which tunably reduces high-frequency (head) sentences. Second, to encourage rare-word exposure, we explicitly filter for words rare in the acoustic data. Finally, we tackle domain mismatch via perplexity-based contrastive selection, filtering for examples matched to the target domain. We down-select a large corpus of web search queries by a factor of 53x and achieve better LM perplexities than without down-selection. When shallow-fused with a state-of-the-art, production speech engine, our LM achieves WER reductions of up to 24% relative on rare-word sentences (without changing overall WER) compared to a baseline LM trained on the raw corpus. These gains are further validated through favorable side-by-side evaluations on live voice search traffic.
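To make the three selection strategies concrete, here is a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption for illustration: the soft log form `threshold * log(1 + count / threshold)`, the rare-word cutoff `max_count`, the add-one-smoothed unigram models standing in for the target-domain and background LMs, and all function names. It is not the authors' implementation.

```python
import math
from collections import Counter


def soft_log_count(count, threshold):
    # Assumed soft log form: roughly linear when count << threshold,
    # roughly logarithmic when count >> threshold.
    return threshold * math.log1p(count / threshold)


def downsample_counts(sentences, threshold):
    # Map each unique sentence to the number of copies to keep.
    # Rounding up means every unique sentence keeps at least one copy.
    counts = Counter(sentences)
    return {s: math.ceil(soft_log_count(c, threshold)) for s, c in counts.items()}


def has_rare_word(sentence, acoustic_word_counts, max_count):
    # True if any word occurs at most max_count times in the acoustic
    # training transcripts (unseen words count as zero occurrences).
    return any(acoustic_word_counts.get(w, 0) <= max_count for w in sentence.split())


def unigram_logprob(corpus):
    # Toy add-one-smoothed unigram LM, a stand-in for a real LM.
    counts = Counter(w for s in corpus for w in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen words

    def logprob(sentence):
        return sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence.split())

    return logprob


def contrastive_score(sentence, target_logprob, background_logprob):
    # Per-word log-likelihood difference, equal to
    # log(background perplexity) - log(target perplexity).
    # Higher means the sentence looks more like the target domain.
    n = max(len(sentence.split()), 1)
    return (target_logprob(sentence) - background_logprob(sentence)) / n
```

In this sketch a sentence would be kept if it passes `has_rare_word` and its `contrastive_score` exceeds a chosen cutoff, with the number of copies capped by `downsample_counts`. Lowering `threshold` flattens the head more aggressively, which is the tunable behavior the abstract describes.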


research
10/13/2021

Dict-BERT: Enhancing Language Model Pre-training with Dictionary

Pre-trained language models (PLMs) aim to learn universal language repre...
research
11/30/2020

Improving accuracy of rare words for RNN-Transducer through unigram shallow fusion

End-to-end automatic speech recognition (ASR) systems, such as recurrent...
research
11/23/2020

Multi-task Language Modeling for Improving Speech Recognition of Rare Words

End-to-end automatic speech recognition (ASR) systems are increasingly p...
research
04/20/2022

Detecting Unintended Memorization in Language-Model-Fused ASR

End-to-end (E2E) models are often being accompanied by language models (...
research
02/16/2023

JEIT: Joint End-to-End Model and Internal Language Model Training for Speech Recognition

We propose JEIT, a joint end-to-end (E2E) model and internal language mo...
research
07/20/2021

Seed Words Based Data Selection for Language Model Adaptation

We address the problem of language model customization in applications w...
research
04/15/2022

Improving Rare Word Recognition with LM-aware MWER Training

Language models (LMs) significantly improve the recognition accuracy of ...
