Related research:

- Language Models are Open Knowledge Graphs. This paper shows how to construct knowledge graphs (KGs) from pre-trained...
- Unsupervised Paraphrase Generation using Pre-trained Language Models. Large-scale pre-trained language models have proven to be very powerful...
- Limits of Detecting Text Generated by Large-Scale Language Models. Some consider large-scale language models that can generate long and coherent...
- Enabling Language Models to Fill in the Blanks. We present a simple approach for text infilling, the task of predicting...
- Neural Academic Paper Generation. In this work, we tackle the problem of structured text generation, specifically...
- WangchanBERTa: Pretraining transformer-based Thai Language Models. Transformer-based language models, more specifically BERT-based architectures...
- Neural Sentence Ordering. Sentence ordering is a general and critical task for natural language generation...
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models. With this in mind, we present the Pile: an 825 GiB English text corpus targeted at training large-scale language models. The Pile is constructed from 22 diverse high-quality subsets – both existing and newly constructed – many of which derive from academic or professional sources. Our evaluation of the untuned performance of GPT-2 and GPT-3 on the Pile shows that these models struggle on many of its components, such as academic writing. Conversely, models trained on the Pile improve significantly over both Raw CC and CC-100 on all components of the Pile, while improving performance on downstream evaluations. Through an in-depth exploratory analysis, we document potentially concerning aspects of the data for prospective users. We make publicly available the code used in its construction.
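The public release of the Pile is distributed as zstandard-compressed JSON-lines shards, where each record holds the document text plus metadata naming which of the 22 subsets it came from. As a rough sketch of how one might stream such a shard in Python, the shard filename "00.jsonl.zst" and the "meta"/"pile_set_name" keys below are assumptions about the released format, not details stated in the abstract:

import io
import json

import zstandard as zstd  # third-party package: pip install zstandard


def iter_pile_documents(path):
    # Yield (text, subset_name) pairs from one zstd-compressed JSON-lines shard.
    with open(path, "rb") as fh:
        reader = zstd.ZstdDecompressor().stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            record = json.loads(line)
            # "pile_set_name" (assumed key) identifies which subset the document came from.
            yield record["text"], record.get("meta", {}).get("pile_set_name")


if __name__ == "__main__":
    # "00.jsonl.zst" is a hypothetical shard filename used only for illustration.
    for text, subset in iter_pile_documents("00.jsonl.zst"):
        print(subset, text[:80].replace("\n", " "))
        break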