Partially Shuffling the Training Data to Improve Language Models

03/11/2019
by Ofir Press, et al.

Although SGD requires shuffling the training data between epochs, currently none of the word-level language modeling systems do this. Naively shuffling all sentences in the training data would not permit the model to learn inter-sentence dependencies. Here we present a method that partially shuffles the training data between epochs. This method makes each batch random, while keeping most of the sentence ordering intact. It achieves new state-of-the-art results on word-level language modeling on both the Penn Treebank and WikiText-2 datasets.
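The idea described above can be pictured as a per-stream rotation: before each epoch, every one of the parallel token streams in the batched corpus is circularly shifted by an independent random offset, so the batches change between epochs while almost all local ordering within a stream survives. Below is a minimal sketch of that idea, assuming the corpus has already been "batchified" into a seq_len x batch_size tensor of token ids as in standard LSTM language-model training; the function name, shapes, and toy data are illustrative and are not the authors' released code.

```python
import torch


def partial_shuffle(batched_data: torch.Tensor) -> torch.Tensor:
    """Rotate each parallel token stream by an independent random offset.

    batched_data: LongTensor of shape (seq_len, batch_size), as produced by
    the usual batchify step in LSTM language-model training. The returned
    tensor has the same shape; each column keeps its ordering except at one
    random seam, so batches differ across epochs.
    """
    seq_len, batch_size = batched_data.shape
    shifted_columns = []
    for b in range(batch_size):
        column = batched_data[:, b]
        offset = torch.randint(0, seq_len, (1,)).item()
        # Circular shift: ordering within the stream is preserved except at the seam.
        shifted_columns.append(torch.cat((column[offset:], column[:offset])))
    return torch.stack(shifted_columns, dim=1)


if __name__ == "__main__":
    # Toy corpus: 20 token ids arranged as 2 parallel streams of length 10.
    data = torch.arange(20).view(2, 10).t()  # shape (10, 2)
    print(partial_shuffle(data))
```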
