To develop new architectures or techniques in deep learning, it is important to have small, easy-to-train datasets for quick experiments. Datasets like MNIST (Cireşan et al., 2012), Fashion-MNIST (Xiao et al., 2017), and CIFAR (Krizhevsky and Hinton, 2009)
have become the standard testbeds in the field of computer vision. Their contributions are invaluable.
Given how popular the task of language modeling has become, it is important to have a small dataset with long-term dependencies that is representative of bigger datasets, to serve as a testbed and benchmark for the language modeling task. However, this is hard to achieve due to one peculiarity of language: the larger the body of text, the higher the average number of times a word appears in that text. For simplicity, let FREQ denote the average number of times a token appears in a dataset.
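The FREQ statistic is straightforward to compute from a tokenized corpus; a minimal sketch (the variable names are illustrative, not from the paper):

```python
from collections import Counter

def freq(tokens):
    """Average number of times each distinct token appears:
    total token count divided by vocabulary size."""
    counts = Counter(tokens)
    return len(tokens) / len(counts)

tokens = "the cat sat on the mat and the dog sat".split()
print(freq(tokens))  # 10 tokens, 7 distinct -> ~1.43
```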
Consider the most popular datasets for word-level LMs:
Penn TreeBank (PTB) dataset contains the Penn Treebank portion of the Wall Street Journal corpus, pre-processed by Mikolov et al. (Mikolov et al., 2011). It consists of 929k tokens for train, 73k for validation, and 82k for test. All words are lower-cased, numbers are replaced with N, and most punctuation is removed. The vocabulary consists of the 10k most frequent words; out-of-vocabulary (OOV) words are replaced by an unk token. PTB contains sentences instead of paragraphs, so its context is limited.
WikiText-103 consists of 28,475 good and featured articles from Wikipedia. It has long-term dependency with 103 million tokens. After replacing all tokens that appear fewer than 3 times with an unk token, it has a vocabulary size of 267,735 (Merity et al., 2016). This makes it prohibitive to experiment with word-level LMs on this dataset: for an embedding size of 400, the embedding layer alone has 267,735 × 400 ≈ 107M parameters.
WikiText-2 is a 2M token version of WikiText-103 with a vocabulary size of 33,278.
One-Billion Word (1Billion) dataset consists of 829M tokens over a vocabulary of 793K. Sentences in this dataset are shuffled, so the context is limited. It is also too big for quick experimentation.
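The embedding-layer arithmetic above (for WikiText-103) can be checked directly:

```python
# Size of the embedding layer alone for a word-level LM on WikiText-103.
vocab_size = 267_735  # WikiText-103 vocabulary
emb_dim = 400
emb_params = vocab_size * emb_dim
print(f"{emb_params:,}")  # 107,094,000 -> ~107M parameters
```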
Table 1 shows that the bigger the body of text, the higher FREQ. The low FREQ for PTB and WikiText-2 explains why it is so hard to achieve low perplexity on these two datasets: each token simply does not appear enough times for the language model to learn a good representation of each token. The high percentage of OOV tokens also adds to the difficulty.
Looking at the state-of-the-art (SOTA) results, there is a pattern: the best performing models on small datasets like PTB and WikiText-2 are LSTM-based while the best performing models on larger datasets like WikiText-103 and 1Billion are dominated by Transformer models (See Figures 1 and 2).
There are a few possible reasons. One is that because LSTMs have been around longer, more regularization techniques have been developed for them, which makes them work better on small datasets that often require more regularization.
Another is that for datasets with low FREQ, models have to rely more on the structural information of text, and RNNs are better at capturing and exploiting hierarchical information (Tran et al., 2018). RNNs, due to their recurrent nature, have a stronger inductive bias towards the most recent symbols. Transformer models, since they can attend to any symbol within the context, need a lot of data to learn that the most recent symbols are more relevant. When incorporating inductive bias, Transformer models seem to generalize better on small datasets (Dehghani et al., 2018).
One thing is clear: an architecture that works well for a small dataset might not work well for a bigger one. This makes it challenging for setups like architectural search where it is prohibitive to run the search on a large dataset, yet architectures found by the search on a small dataset might not be useful.
We believe that a small long-term dependency dataset with high FREQ will not only provide a useful benchmark for language modeling, but also a more suitable testbed for setups like architectural search and meta-learning. We introduce SimpleBooks-92, a dataset of 92M tokens, 90% that of WikiText-103, but with a vocabulary size of 98K, one third of that of WikiText-103. It has FREQ of 931.4, 90% that of 1Billion, with OOV tokens accounting for only 0.11%, even lower than 1Billion. See Table 1 for comparison.
We also include a 2M-token version, SimpleBooks-2, whose vocabulary is one third the size of WikiText-2's. Transformer models outperform LSTM models on both the small and large versions of SimpleBooks.
| Dataset | Source | Size | Vocab | Long-term dependency | OOV rate | FREQ | SOTA perplexity |
|---|---|---|---|---|---|---|---|
| 1Billion | News | 829M | 793,471 | No | 0.28% | 1045.09 | 21.8 (Dai et al., 2019) |
| WikiText-103 | Wikipedia | 103M | 267,735 | Yes | 0.4% | 385.56 | 16.4 (Krause et al., 2019) |
| WikiText-2 | Wikipedia | 2M | 33,278 | Yes | 2.6% | 62.76 | 39.14 (Gong et al., 2018) |
| PTB | News | 0.9M | 10,000 | No | 4.8% | 88.75 | 46.54 (Gong et al., 2018) |
2 SimpleBooks dataset
To create this dataset, we downloaded all available books from Gutenberg US (www.gutenberg.org). After discarding malformed books as well as books of poems, plays, manuals (knitting was apparently a hit in the early 20th century), recipes, and literary nonsense, we obtained 39,432 books. We removed metadata, tables of contents, and illustrations, and tokenized the books by simply separating words by whitespace. Let n be the number of tokens in a book and v be its vocabulary size. Our goal is to choose a subset of those books such that, when combined, they form a body of text of approximately 100M tokens with a vocabulary size of less than 100K.
To do so, we originally chose books with a high n/v ratio (which is FREQ). However, this biases towards long books because they tend to have higher FREQ, so we chose books with a high n/v² ratio instead. See Figure 3 for the distribution of this ratio.
We picked all books with a ratio of at least 0.0012. Most of them are children’s books, which makes sense since children’s books tend to use simpler English. We then went over each book from the largest to the smallest, either adding it to the to-use list or discarding it if it had at least 50% 8-gram token overlap with the books already on the list. We ended up with 1,573 books. The scripts used to create this dataset are from the lazynlp library (Nguyen, 2019).
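The greedy deduplication pass can be sketched as follows. This is an assumed set-based implementation for illustration (the actual scripts are from lazynlp), treating each book as a list of tokens:

```python
def ngrams(tokens, n=8):
    """Set of all n-grams (as tuples) in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(candidate_tokens, accepted_ngrams, n=8):
    """Fraction of the candidate's n-grams already seen in accepted books."""
    cand = ngrams(candidate_tokens, n)
    if not cand:
        return 0.0
    return len(cand & accepted_ngrams) / len(cand)

def dedup(books, threshold=0.5, n=8):
    """Greedy pass from largest to smallest book, discarding any book whose
    n-gram overlap with the already-accepted books reaches the threshold."""
    to_use, seen = [], set()
    for book in sorted(books, key=len, reverse=True):
        g = ngrams(book, n)
        if g and len(g & seen) / len(g) >= threshold:
            continue
        to_use.append(book)
        seen |= g
    return to_use
```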
Of these 1,573 books, 5 are used for the validation set and 5 for the test set. We tokenized each book using SpaCy (Honnibal and Montani, 2017) and separated numbers like “300,000” and “1.93” into “300 @,@ 000” and “1 @.@ 93”. Otherwise, the original casing and punctuation are preserved. SimpleBooks-92 contains 92M tokens in the train set and 200k tokens in each of the validation and test sets. SimpleBooks-2 has the same validation and test sets as SimpleBooks-92, but only 2M tokens in the train set.
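The number-splitting step can be expressed as a single substitution; a sketch assuming a regex implementation (the exact rule used in the pipeline may differ):

```python
import re

def split_numbers(text):
    """Separate , and . that sit between digits, WikiText-style:
    '300,000' -> '300 @,@ 000', '1.93' -> '1 @.@ 93'.
    Punctuation not flanked by digits is left untouched."""
    return re.sub(r"(?<=\d)([.,])(?=\d)", r" @\1@ ", text)

print(split_numbers("It cost 300,000 dollars, about 1.93 per unit."))
# It cost 300 @,@ 000 dollars, about 1 @.@ 93 per unit.
```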
We also include the raw version of unprocessed text for character-level LMs.
3.1 Language modeling
We trained word-level LMs on SimpleBooks-2 and SimpleBooks-92 using both AWD-LSTM (Merity et al., 2017; implementation from https://github.com/salesforce/awd-lstm-lm) and Transformer-XL (Dai et al., 2019; implementation from https://github.com/kimiyoung/transformer-xl). Note that AWD-LSTM is a highly regularized version of LSTM, while the only regularization Transformer-XL uses is dropout.
3.1.1 LSTM vs Transformer on SimpleBooks-2
We evaluated whether, on a small dataset with high FREQ, a vanilla implementation of Transformer models can outperform RNNs, consistent with the results on much larger datasets. We used Milano (https://github.com/NVIDIA/Milano) to search through 500 sets of hyperparameters on the first 30 epochs of SimpleBooks-2 for both AWD-LSTM and Transformer-XL. We then trained each architecture on its best set of hyperparameters until convergence. For the hyperparameters we used, see Appendix A.
We found that Transformer-XL indeed outperformed AWD-LSTM on SimpleBooks-2 (see Table 2), while also requiring fewer parameters and fewer epochs to converge.
3.1.2 WikiText-103 vs SimpleBooks-92
It is not surprising that on SimpleBooks-92, both AWD-LSTM and Transformer-XL converge faster and require fewer parameters than on WikiText-103. With identical settings that lead to near-SOTA validation perplexity on both datasets, SimpleBooks-92 reduces the parameter count by 45.3% for Transformer-XL and 39.7% for AWD-LSTM (see Table 3). Note that both models tie the embedding and softmax layers.
[Table 3: number of embedding parameters (# emb) and total parameters (# params) per dataset for AWD-LSTM and Transformer-XL.]
3.2 Transfer learning from SimpleBooks to WikiText
One interesting note is that even though SimpleBooks-92 has a vocabulary only 36.7% the size of WikiText-103’s, it covers 92% (93% uncased) of all tokens in a slightly differently tokenized version of WikiText-103 (in the public copy of WikiText-103, negation contractions such as “don’t” are tokenized as “don ’t”; we re-tokenized them as “do n’t” to be consistent with SimpleBooks-92). This raises a research question: can what we learn from text in simplified English (SimpleBooks-92) be transferred to tasks using normal English (WikiText-103)?
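Token coverage of this kind counts occurrences, not vocabulary entries; a minimal sketch (names are illustrative):

```python
def token_coverage(corpus_tokens, vocab):
    """Fraction of token occurrences in a corpus whose type is in a given
    vocabulary. A few high-frequency shared words can cover most of a corpus
    even when the vocabularies differ greatly in size."""
    vocab = set(vocab)
    covered = sum(1 for t in corpus_tokens if t in vocab)
    return covered / len(corpus_tokens)

# e.g. token_coverage(wikitext103_tokens, simplebooks92_vocab) -> ~0.92
```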
We experimented with training word embeddings using the word2vec skip-gram algorithm (Mikolov et al., 2013). We first trained a skip-gram model on SimpleBooks-92 for 100k steps. We then ran two experiments on WikiText-103, each for 200k steps:
Train a skip-gram model on WikiText-103 from scratch.
For the words in WikiText-103 that are also in SimpleBooks-92, initialize the corresponding rows with the embeddings learned on SimpleBooks-92. Initialize all other rows uniformly at random within the (min, max) range, with min being the smallest value in the learned SimpleBooks-92 embedding and max the largest.
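The initialization scheme in the second experiment can be sketched as follows, assuming embeddings are held in a NumPy matrix with one row per vocabulary word (function and variable names are illustrative):

```python
import numpy as np

def init_embeddings(target_vocab, source_vocab, source_emb, seed=0):
    """Build a target embedding matrix: copy rows for words shared with the
    source vocabulary, and sample the remaining rows uniformly within the
    (min, max) value range of the source embedding."""
    rng = np.random.default_rng(seed)
    dim = source_emb.shape[1]
    lo, hi = source_emb.min(), source_emb.max()
    emb = rng.uniform(lo, hi, size=(len(target_vocab), dim))
    src_index = {w: i for i, w in enumerate(source_vocab)}
    for i, w in enumerate(target_vocab):
        if w in src_index:
            emb[i] = source_emb[src_index[w]]
    return emb
```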
We found that while the model in the second experiment learned much faster early in training, the final losses of both models were comparable (see Figure 4).
We introduced SimpleBooks-2 and SimpleBooks-92, a 2-million-token and a 92-million-token dataset with a unique property: they have a much smaller word-level vocabulary than current datasets of the same size. This property makes it faster and easier to train word-level LMs on these datasets to convergence, which makes them ideal benchmarks and testbeds for the task of language modeling. While Transformer models usually outperform RNNs on large datasets but underperform them on small datasets, in our experiments Transformer-XL outperformed AWD-LSTM on both SimpleBooks-2 and SimpleBooks-92.
We also experimented with transfer learning from simple English to normal English on the task of training word embeddings and saw some potential. In the future, we would like to investigate whether it saves time to train a language model on simple English first and use the learned weights to train a language model on normal English.
I’d like to thank my wonderful colleagues Boris Ginsburg, Oleksii Kuchaiev, and Oleksii Hrinchuk for helping me with this project!
- Cireşan et al. (2012) Dan Cireşan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745, 2012.
- Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
- Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
- Mikolov et al. (2011) Tomáš Mikolov, Anoop Deoras, Stefan Kombrink, Lukáš Burget, and Jan Černockỳ. Empirical evaluation and combination of advanced language modeling techniques. In Twelfth Annual Conference of the International Speech Communication Association, 2011.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Tran et al. (2018) Ke Tran, Arianna Bisazza, and Christof Monz. The importance of being recurrent for modeling hierarchical structure. arXiv preprint arXiv:1803.03585, 2018.
- Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
- Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, William W Cohen, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.
- Krause et al. (2019) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of transformer language models. arXiv preprint arXiv:1904.08378, 2019.
- Gong et al. (2018) Chengyue Gong, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Frage: frequency-agnostic word representation. In Advances in Neural Information Processing Systems, pages 1334–1345, 2018.
- Nguyen (2019) Huyen Nguyen. github.com/chiphuyen/lazynlp: First release of lazynlp. Mar 2019. doi: 10.5281/zenodo.2582057.
- Honnibal and Montani (2017) Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
- Merity et al. (2017) Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing lstm language models. arXiv preprint arXiv:1708.02182, 2017.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119, 2013.
Appendix A Hyperparameters used for training on SimpleBooks-2
AWD-LSTM:
- alpha: 2.0
- batch_size: 64
- beta: 1.0
- bptt: 48
- clip: 0.9431390850687401
- dropout: 0.09351714464370996
- dropoute: 0.15413135362263264
- dropouth: 0.2379440016364301
- dropouti: 0.782495906512577
- emsize: 576
- lr: 18.0
- nhid: 1152
- nlayers: 3
- nonmono: 5
- optimizer: sgd
- seed: 1882
- tied: True
- wdecay: 1.2e-06
- wdrop: 0.2983586710139643
Transformer-XL:
- n_layer: 12
- n_head: 10
- d_head: 40
- d_embed: 320
- d_model: 320
- d_inner: 1280
- dropout: 0.35
- dropatt: 0.35
- init_range: 0.1
- emb_init_range: 0.01
- init_std: 0.02
- proj_init_std: 0.01
- optim: adam
- lr: 0.00025
- decay_rate: 0.5
- lr_min: 0.0
- clip: 0.25
- clip_nonemb: False
- max_step: 20000
- batch_size: 32
- tgt_len: 150
- eval_tgt_len: 150
- mem_len: 150
- not_tied: False
- seed: 1111
- pre_lnorm: False
- attn_type: 0
- clamp_len: -1