The Nordic Pile: A 1.2TB Nordic Dataset for Language Modeling

03/30/2023
by   Joey Öhman, et al.

Pre-training Large Language Models (LLMs) requires massive amounts of text data, and the performance of LLMs typically correlates with the scale and quality of the datasets. This means that it may be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. In order to facilitate the development of LLMs in the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.


