BlackOut: Speeding up Recurrent Neural Network Language Models With Very Large Vocabularies

11/21/2015
by   Shihao Ji, et al.
0

We propose BlackOut, an approximation algorithm to efficiently train massive recurrent neural network language models (RNNLMs) with million word vocabularies. BlackOut is motivated by using a discriminative loss, and we describe a new sampling strategy which significantly reduces computation while improving stability, sample efficiency, and rate of convergence. One way to understand BlackOut is to view it as an extension of the DropOut strategy to the output layer, wherein we use a discriminative training loss and a weighted sampling scheme. We also establish close connections between BlackOut, importance sampling, and noise contrastive estimation (NCE). Our experiments, on the recently released one billion word language modeling benchmark, demonstrate scalability and accuracy of BlackOut; we outperform the state-of-the art, and achieve the lowest perplexity scores on this dataset. Moreover, unlike other established methods which typically require GPUs or CPU clusters, we show that a carefully implemented version of BlackOut requires only 1-10 days on a single machine to train a RNNLM with a million word vocabulary and billions of parameters on one billion words. Although we describe BlackOut in the context of RNNLM training, it can be used to any networks with large softmax output layers.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/13/2016

Compressing Neural Language Models by Sparse Word Representations

Neural networks are among the state-of-the-art techniques for language m...
research
02/02/2015

Scaling Recurrent Neural Network Language Models

This paper investigates the scaling properties of Recurrent Neural Netwo...
research
08/20/2017

A Batch Noise Contrastive Estimation Approach for Training Large Vocabulary Language Models

Training large vocabulary Neural Network Language Models (NNLMs) is a di...
research
09/14/2016

Efficient softmax approximation for GPUs

We propose an approximate strategy to efficiently train neural network b...
research
02/04/2016

A Factorized Recurrent Neural Network based architecture for medium to large vocabulary Language Modelling

Statistical language models are central to many applications that use se...
research
04/21/2021

On Sampling-Based Training Criteria for Neural Language Modeling

As the vocabulary size of modern word-based language models becomes ever...
research
11/11/2021

Self-Normalized Importance Sampling for Neural Language Modeling

To mitigate the problem of having to traverse over the full vocabulary i...

Please sign up or login with your details

Forgot password? Click here to reset