To develop new architectures or techniques in deep learning, it is important to have small, easy-to-train datasets for quick experiments. Datasets like MNIST (Cireşan et al., 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR (Krizhevsky and Hinton, 2009)
have become the standard testbeds in the field of computer vision. Their contributions are invaluable.
Given how popular the task of language modeling has become, it is important to have a small dataset with long-term dependencies that is representative of bigger datasets, to serve as a testbed and benchmark for the language modeling task. However, this is hard to achieve due to one peculiarity of language: the larger the body of text, the higher the average number of times a word appears in that text. For simplicity, let FREQ denote the average number of times a token appears in a dataset.
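The FREQ statistic is straightforward to compute from a tokenized corpus; a minimal sketch (the variable names are illustrative, not from the paper):

```python
from collections import Counter

def freq(tokens):
    """Average number of times each distinct token appears:
    total token count divided by vocabulary size."""
    counts = Counter(tokens)
    return len(tokens) / len(counts)

tokens = "the cat sat on the mat and the dog sat".split()
print(freq(tokens))  # 10 tokens, 7 distinct -> ~1.43
```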
Consider the most popular datasets for word-level LMs:
Penn TreeBank (PTB) dataset contains the Penn Treebank portion of the Wall Street Journal corpus, pre-processed by Mikolov et al. (Mikolov et al., 2011). It consists of 929k tokens for train, 73k for validation, and 82k for test. All words are lower-cased, numbers are replaced with N, and most punctuation is removed. The vocabulary consists of the 10k most frequent words; out-of-vocabulary (OOV) words are replaced by an unk token. PTB contains sentences instead of paragraphs, so its context is limited.
WikiText-103 consists of 28,475 good and featured articles from Wikipedia. It has long-term dependency with 103 million tokens. After replacing all tokens that appear fewer than 3 times with an unk token, it has a vocabulary size of 267,735 (Merity et al., 2016). This makes it prohibitive to experiment with word-level LMs on this dataset: for an embedding size of 400, the embedding layer alone has 267,735 × 400 ≈ 107M parameters.
WikiText-2 is a 2M token version of WikiText-103 with a vocabulary size of 33,278.
One-Billion Word (1Billion) dataset consists of 829M tokens over a vocabulary of 793K. Sentences in this dataset are shuffled, so the context is limited. It is also too big for quick experimentation.
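The embedding-layer arithmetic above (for WikiText-103) can be checked directly:

```python
# Size of the embedding layer alone for a word-level LM on WikiText-103.
vocab_size = 267_735  # WikiText-103 vocabulary
emb_dim = 400
emb_params = vocab_size * emb_dim
print(f"{emb_params:,}")  # 107,094,000 -> ~107M parameters
```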
Table 1 shows that the bigger the body of text, the higher FREQ. The low FREQ for PTB and WikiText-2 explains why it is so hard to achieve low perplexity on these two datasets: each token simply does not appear enough times for the language model to learn a good representation of each token. The high percentage of OOV tokens also adds to the difficulty.
Looking at the state-of-the-art (SOTA) results, there is a pattern: the best performing models on small datasets like PTB and WikiText-2 are LSTM-based while the best performing models on larger datasets like WikiText-103 and 1Billion are dominated by Transformer models (See Figures 1 and 2).
There are a few possible reasons. One is that because LSTMs have been around longer, more regularization techniques have been developed for them, which makes them work better on small datasets that often require more regularization.
Another is that for datasets with low FREQ, models have to rely more on the structural information of text, and RNNs are better at capturing and exploiting hierarchical information (Tran et al., 2018). RNNs, due to their recurrent nature, have a stronger inductive bias towards the most recent symbols. Transformer models, since they can attend to any symbol within the context, need a lot of data to learn that the most recent symbols are more relevant. When incorporating inductive bias, Transformer models seem to generalize better on small datasets (Dehghani et al., 2018).
One thing is clear: an architecture that works well for a small dataset might not work well for a bigger one. This makes it challenging for setups like architectural search where it is prohibitive to run the search on a large dataset, yet architectures found by the search on a small dataset might not be useful.
We believe that a small long-term dependency dataset with high FREQ will not only provide a useful benchmark for language modeling, but also a more suitable testbed for setups like architectural search and meta-learning. We introduce SimpleBooks-92, a dataset of 92M tokens, 90% that of WikiText-103, but with a vocabulary size of 98K, one third of that of WikiText-103. It has FREQ of 931.4, 90% that of 1Billion, with OOV tokens accounting for only 0.11%, even lower than 1Billion. See Table 1 for comparison.
We also include a 2M-token version, SimpleBooks-2, whose vocabulary is one third the size of WikiText-2's. Transformer models outperform LSTM models on both the small and large versions of SimpleBooks.
| Dataset | Source | Size | Vocab | Long-term dependency | OOV rate | FREQ | SOTA perplexity |
|---|---|---|---|---|---|---|---|
| 1Billion | News | 829M | 793,471 | No | 0.28% | 1045.09 | 21.8 (Dai et al., 2019) |
| WikiText-103 | Wikipedia | 103M | 267,735 | Yes | 0.4% | 385.56 | 16.4 (Krause et al., 2019) |
| WikiText-2 | Wikipedia | 2M | 33,278 | Yes | 2.6% | 62.76 | 39.14 (Gong et al., 2018) |
| PTB | News | 0.9M | 10,000 | No | 4.8% | 88.75 | 46.54 (Gong et al., 2018) |
2 SimpleBooks dataset
To create this dataset, we downloaded all available books from Gutenberg US (www.gutenberg.org). After discarding malformed books as well as books of poems, plays, manuals (knitting was apparently a hit in the early 20th century), recipes, and literary nonsense, we obtained 39,432 books. We removed metadata, tables of contents, and illustrations, and tokenized the books by simply separating words by whitespace. Let n be the number of tokens in a book and v be its vocabulary size. Our goal is to choose a subset of those books such that, when combined, they form a body of text of approximately 100M tokens with a vocabulary size of less than 100K.
To do so, we originally chose books with a high n/v ratio (which is FREQ). However, this biases towards long books because they tend to have higher FREQ, so we chose books with a high n/v² ratio instead. See Figure 3 for the distribution of this ratio.
We picked all books with a ratio of at least 0.0012. Most of them are children’s books, which makes sense since children’s books tend to use simpler English. We then went over each book from the largest to the smallest, either adding it to the to-use list or discarding it if it had at least 50% 8-gram token overlap with the books already on the list. We ended up with 1,573 books. The scripts used to create this dataset are from the lazynlp library (Nguyen, 2019).
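The greedy deduplication pass can be sketched as follows. This is an assumed set-based implementation for illustration (the actual scripts are from lazynlp), treating each book as a list of tokens:

```python
def ngrams(tokens, n=8):
    """Set of all n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate_tokens, accepted_ngrams, n=8):
    """Fraction of the candidate's n-grams already seen in accepted books."""
    cand = ngrams(candidate_tokens, n)
    if not cand:
        return 0.0
    return len(cand & accepted_ngrams) / len(cand)

def dedup(books, threshold=0.5, n=8):
    """Greedy pass from largest to smallest book, discarding any book whose
    n-gram overlap with the already-accepted books reaches the threshold."""
    to_use, seen = [], set()
    for book in sorted(books, key=len, reverse=True):
        g = ngrams(book, n)
        if g and len(g & seen) / len(g) >= threshold:
            continue
        to_use.append(book)
        seen |= g
    return to_use
```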
Of these 1,573 books, 5 are used for the validation set and 5 for the test set. We tokenized each book using SpaCy (Honnibal and Montani, 2017) and separated numbers like “300,000” and “1.93” into “300 @,@ 000” and “1 @.@ 93”. Otherwise, the original casing and punctuation are preserved. SimpleBooks-92 contains 92M tokens in the train set and 200k tokens in each of the validation and test sets. SimpleBooks-2 has the same validation and test sets as SimpleBooks-92, but only 2M tokens in the train set.
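The number-splitting step can be expressed as a single substitution; a sketch assuming a regex implementation (the exact rule used in the pipeline may differ):

```python
import re

def split_numbers(text):
    """Separate , and . that sit between digits, WikiText-style:
    '300,000' -> '300 @,@ 000', '1.93' -> '1 @.@ 93'.
    Punctuation not flanked by digits is left untouched."""
    return re.sub(r"(?<=\d)([.,])(?=\d)", r" @\1@ ", text)

print(split_numbers("It cost 300,000 dollars, about 1.93 per unit."))
# It cost 300 @,@ 000 dollars, about 1 @.@ 93 per unit.
```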
We also include the raw version of unprocessed text for character-level LMs.
3.1 Language modeling
We trained word-level LMs on SimpleBooks-2 and SimpleBooks-92 using both AWD-LSTM (Merity et al., 2017; implementation from https://github.com/salesforce/awd-lstm-lm) and Transformer-XL (Dai et al., 2019; implementation from https://github.com/kimiyoung/transformer-xl). Note that AWD-LSTM is a highly regularized version of LSTM, while the only regularization Transformer-XL uses is dropout.
3.1.1 LSTM vs Transformer on SimpleBooks-2
We evaluated whether, on a small dataset with high FREQ, a vanilla implementation of Transformer models can outperform RNNs, consistent with the results on much larger datasets. We used Milano (https://github.com/NVIDIA/Milano) to search through 500 sets of hyperparameters on the first 30 epochs of SimpleBooks-2 for both AWD-LSTM and Transformer-XL. We then trained each architecture on its best set of hyperparameters until convergence. For the hyperparameters we used, see Appendix A.
We found that Transformer-XL indeed outperformed AWD-LSTM on SimpleBooks-2 (see Table 2), while also requiring fewer parameters and fewer epochs to converge.
3.1.2 WikiText-103 vs SimpleBooks-92
It is not surprising that on SimpleBooks-92, both AWD-LSTM and Transformer-XL converge faster and require fewer parameters than on WikiText-103. With identical settings that lead to near-SOTA validation perplexity on both datasets, SimpleBooks-92 reduces the parameter count by 45.3% for Transformer-XL and 39.7% for AWD-LSTM (see Table 3). Note that both models tie the embedding and softmax layers.
[Table 3: number of embedding parameters (# emb) and total parameters (# params) per dataset for AWD-LSTM and Transformer-XL.]
3.2 Transfer learning from SimpleBooks to WikiText
One interesting note is that even though SimpleBooks-92 has a vocabulary only 36.7% the size of WikiText-103’s, it covers 92% (93% uncased) of all tokens in a slightly differently tokenized version of WikiText-103 (in the public copy of WikiText-103, negation contractions such as “don’t” are tokenized as “don ’t”; we re-tokenized them as “do n’t” to be consistent with SimpleBooks-92). This raises a research question: can what we learn from text in simplified English (SimpleBooks-92) be transferred to tasks using normal English (WikiText-103)?
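Token coverage of this kind counts occurrences, not vocabulary entries; a minimal sketch (names are illustrative):

```python
def token_coverage(corpus_tokens, vocab):
    """Fraction of token occurrences in a corpus whose type is in a given
    vocabulary. A few high-frequency shared words can cover most of a corpus
    even when the vocabularies differ greatly in size."""
    vocab = set(vocab)
    covered = sum(1 for t in corpus_tokens if t in vocab)
    return covered / len(corpus_tokens)

# e.g. token_coverage(wikitext103_tokens, simplebooks92_vocab) -> ~0.92
```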
We experimented with training word embeddings using the word2vec skip-gram algorithm (Mikolov et al., 2013). We first trained a skip-gram model on SimpleBooks-92 for 100k steps. We then ran two experiments on WikiText-103, each for 200k steps:
Train a skip-gram model on WikiText-103 from scratch.
For the words in WikiText-103 that are also in SimpleBooks-92, initialize the corresponding rows with the embeddings learned on SimpleBooks-92. Initialize all other rows uniformly at random within the (min, max) range, with min being the smallest value in the learned SimpleBooks-92 embedding and max the largest.
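The initialization scheme in the second experiment can be sketched as follows, assuming embeddings are held in a NumPy matrix with one row per vocabulary word (function and variable names are illustrative):

```python
import numpy as np

def init_embeddings(target_vocab, source_vocab, source_emb, seed=0):
    """Build a target embedding matrix: copy rows for words shared with the
    source vocabulary, and sample the remaining rows uniformly within the
    (min, max) value range of the source embedding."""
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    lo, hi = source_emb.min(), source_emb.max()
    emb = rng.uniform(lo, hi, size=(len(target_vocab), dim))
    src_index = {w: i for i, w in enumerate(source_vocab)}
    for i, w in enumerate(target_vocab):
        if w in src_index:
            emb[i] = source_emb[src_index[w]]
    return emb
```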
We found that while the model in the second experiment learned much faster early in training, the final losses of both models were comparable (see Figure 4).
We introduced SimpleBooks-2 and SimpleBooks-92, a 2-million-token and a 92-million-token dataset with a unique property: they have a much smaller word-level vocabulary than current datasets of the same size. This property makes it faster and easier to train word-level LMs on these datasets to convergence, which makes them ideal benchmarks and testbeds for the task of language modeling. While Transformer models usually outperform RNNs on large datasets but underperform them on small datasets, in our experiments Transformer-XL outperformed AWD-LSTM on both SimpleBooks-2 and SimpleBooks-92.
We also experimented with transfer learning from simple English to normal English on the task of training word embeddings and saw some potential. In the future, we would like to investigate whether it saves time to train a language model on simple English first and use the learned weights to train a language model on normal English.
I’d like to thank my wonderful colleagues Boris Ginsburg, Oleksii Kuchaiev, and Oleksii Hrinchuk for helping me with this project!
- Cireşan et al. (2012) Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Mikolov et al. (2011) Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černockỳ. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Tran et al. (2018) Ke Tran, Arianna Bisazza, and Christof Monz. The importance of being recurrent for modeling hierarchical structure. arXiv preprint arXiv:1803.03585, 2018.
- Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Krause et al. (2019) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019.
- Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1334–1345, 2018.
- Nguyen (2019) Huyen Nguyen. github.com/chiphuyen/lazynlp: First release of lazynlp. Mar 2019. doi: 10.5281/zenodo.2582057.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
- Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
Appendix A Hyperparameters used for training on SimpleBooks-2
AWD-LSTM:
- alpha: 2.0
- batch_size: 64
- beta: 1.0
- bptt: 48
- clip: 0.9431390850687401
- dropout: 0.09351714464370996
- dropoute: 0.15413135362263264
- dropouth: 0.2379440016364301
- dropouti: 0.782495906512577
- emsize: 576
- lr: 18.0
- nhid: 1152
- nlayers: 3
- nonmono: 5
- optimizer: sgd
- seed: 1882
- tied: True
- wdecay: 1.2e-06
- wdrop: 0.2983586710139643
Transformer-XL:
- n_layer: 12
- n_head: 10
- d_head: 40
- d_embed: 320
- d_model: 320
- d_inner: 1280
- dropout: 0.35
- dropatt: 0.35
- init_range: 0.1
- emb_init_range: 0.01
- init_std: 0.02
- proj_init_std: 0.01
- optim: adam
- lr: 0.00025
- decay_rate: 0.5
- lr_min: 0.0
- clip: 0.25
- clip_nonemb: False
- max_step: 20000
- batch_size: 32
- tgt_len: 150
- eval_tgt_len: 150
- mem_len: 150
- not_tied: False
- seed: 1111
- pre_lnorm: False
- attn_type: 0
- clamp_len: -1