On Sampling-Based Training Criteria for Neural Language Modeling

04/21/2021
by Yingbo Gao, et al.

As the vocabulary size of modern word-based language models grows ever larger, many sampling-based training criteria have been proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice in the current landscape of such sampling methods is the lack of a systematic comparison, together with some myths about preferring one method over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups.
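To illustrate the kind of sampling-based criterion discussed in the abstract, the following is a minimal sketch, not the authors' implementation, of a sampled softmax with an importance-sampling correction, written in PyTorch. All function names, tensor shapes, and the uniform proposal distribution are assumptions chosen for the example; the key step is the subtraction of the log proposal probability, which plays the role of the correction for the intended class posterior probabilities mentioned above.

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(hidden, weight, bias, target, noise_probs, num_samples):
    """Importance-sampling approximation of the full-softmax cross-entropy.

    hidden       : [batch, dim]           context representations
    weight, bias : [vocab, dim], [vocab]  output projection
    target       : [batch]                gold class indices
    noise_probs  : [vocab]                proposal distribution q(w)
    """
    # Draw a shared set of noise classes from the proposal q (with replacement).
    samples = torch.multinomial(noise_probs, num_samples, replacement=True)  # [S]

    # Score only the target and the sampled classes; the loss never
    # traverses the full vocabulary.
    w_true, b_true = weight[target], bias[target]       # [batch, dim], [batch]
    w_samp, b_samp = weight[samples], bias[samples]      # [S, dim], [S]
    logit_true = (hidden * w_true).sum(-1) + b_true      # [batch]
    logit_samp = hidden @ w_samp.t() + b_samp            # [batch, S]

    # Correction: subtracting log q(w) is the standard adjustment so that
    # the sampled criterion targets the same class posteriors as the
    # full softmax.
    logit_true = logit_true - torch.log(noise_probs[target] + 1e-10)
    logit_samp = logit_samp - torch.log(noise_probs[samples] + 1e-10)

    # The target occupies index 0 of the reduced (1 + S)-way softmax.
    logits = torch.cat([logit_true.unsqueeze(1), logit_samp], dim=1)
    return F.cross_entropy(logits, torch.zeros_like(target))

# Toy usage with a uniform proposal; vocabulary and batch sizes are arbitrary.
vocab, dim, batch, S = 50_000, 512, 32, 100
hidden = torch.randn(batch, dim)
weight, bias = torch.randn(vocab, dim), torch.zeros(vocab)
target = torch.randint(0, vocab, (batch,))
noise_probs = torch.full((vocab,), 1.0 / vocab)
loss = sampled_softmax_loss(hidden, weight, bias, target, noise_probs, S)
```

With a uniform proposal the correction term is a constant and cancels inside the softmax; its effect shows up with non-uniform proposals such as a unigram distribution, which is where uncorrected variants can drift away from the intended class posteriors.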


Related research

11/11/2021 · Self-Normalized Importance Sampling for Neural Language Modeling
To mitigate the problem of having to traverse over the full vocabulary i...

05/20/2020 · Investigation of Large-Margin Softmax in Neural Language Modeling
To encourage intra-class compactness and inter-class separability among ...

09/05/2016 · PMI Matrix Approximations with Applications to Neural Language Modeling
The negative sampling (NEG) objective function, used in word2vec, is a s...

11/21/2015 · BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies
We propose BlackOut, an approximation algorithm to efficiently train mas...

06/08/2017 · Optimizing expected word error rate via sampling for speech recognition
State-level minimum Bayes risk (sMBR) training has become the de facto s...

10/30/2020 · Phoneme Based Neural Transducer for Large Vocabulary Speech Recognition
To join the advantages of classical and end-to-end approaches for speech...

02/10/2022 · Learning Branch Probabilities in Compiler from Datacenter Workloads
Estimating the probability with which a conditional branch instruction i...
