A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics

12/19/2018
by Martin Gerlach, et al.

Project Gutenberg (PG) has been an extremely popular text corpus in the statistical analysis of language for more than 25 years. However, in contrast to other major linguistic datasets of similar importance, no consensual full version of PG exists to date. In fact, most PG studies so far either consider only a small number of manually selected books, leading to potentially biased subsets, or employ vastly different pre-processing strategies (often specified in insufficient detail), raising concerns about the reproducibility of published results. To address these shortcomings, we present the Standardized Project Gutenberg Corpus (SPGC), an open science approach to a curated version of the complete PG data containing more than 50,000 books and more than 3 × 10^9 word-tokens. Using different sources of annotated metadata, we not only provide a broad characterization of the content of PG, but also show examples highlighting the potential of SPGC for investigating language variability across time, subjects, and authors. We publish our methodology in detail, the code to download and process the data, and the resulting corpus itself at three levels of granularity (raw text, time series of word tokens, and word counts). In this way, we provide a reproducible, pre-processed, full-size version of Project Gutenberg as a new scientific resource for corpus linguistics, natural language processing, and information retrieval.
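To make the three levels of granularity concrete, the following minimal Python sketch derives a token sequence and word counts from a raw text string. It is an illustration only, not the SPGC pipeline: the function names `tokenize` and `count_words`, the regular-expression tokenizer, the lowercasing step, and the sample sentence are all assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(raw_text):
    """Return lowercase alphabetic word tokens in reading order.

    Assumption: a simple regex tokenizer stands in for whatever
    tokenizer a real pipeline would use.
    """
    # [^\W\d_]+ matches runs of Unicode letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", raw_text.lower())


def count_words(tokens):
    """Collapse the ordered token sequence into word frequencies."""
    return Counter(tokens)


# Level 1: raw text (a made-up sample, not taken from the corpus).
raw_text = "The whale, the white whale! The sea."

# Level 2: ordered sequence ("time series") of word tokens.
tokens = tokenize(raw_text)

# Level 3: word counts, which discard token order.
counts = count_words(tokens)

print(tokens)
print(counts.most_common(2))
```

Each level discards information relative to the one before it: the token sequence drops punctuation and case, and the counts drop word order, so the appropriate level depends on the analysis at hand.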

research
03/28/2023

Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information

This paper presents the first publicly available version of the Carolina...
research
06/29/2020

Towards the Study of Morphological Processing of the Tangkhul Language

There is no or little work on natural language processing of Tangkhul la...
research
09/25/2019

Developing a Fine-Grained Corpus for a Less-resourced Language: the case of Kurdish

Kurdish is a less-resourced language consisting of different dialects wr...
research
11/21/2021

More Romanian word embeddings from the RETEROM project

Automatically learned vector representations of words, also known as "wo...
research
04/18/2021

The Preposition Project

Prepositions are an important vehicle for indicating semantic roles. The...
research
07/11/2022

TArC: Tunisian Arabish Corpus First complete release

In this paper we present the final result of a project on Tunisian Arabi...
research
08/29/2022

naab: A ready-to-use plug-and-play corpus for Farsi

Huge corpora of textual data are always known to be a crucial need for t...
