Computational analyses of the topics, sentiments, literariness, creativity and beauty of texts in a large Corpus of English Literature

01/12/2022
by Arthur M. Jacobs, et al.

The Gutenberg Literary English Corpus (GLEC; Jacobs, 2018a) provides a rich source of textual data for research in digital humanities, computational linguistics, or neurocognitive poetics. In this study we address differences among the literature categories in GLEC, as well as differences between authors. We report the results of three studies providing i) topic and sentiment analyses for six text categories of GLEC (i.e., children and youth, essays, novels, plays, poems, stories) and its >100 authors, ii) novel measures of semantic complexity as indices of the literariness, creativity, and beauty of the works in GLEC (e.g., Jane Austen's six novels), and iii) two experiments on text classification and authorship recognition using novel features of semantic complexity. Two novel measures estimating a text's literariness, intratextual variance and stepwise distance (van Cranenburgh et al., 2019), revealed that plays are the most literary texts in GLEC, followed by poems and novels. A novel index of text creativity (Gray et al., 2016) revealed poems and plays as the most creative categories, with the most creative authors all being poets (Milton, Pope, Keats, Byron, and Wordsworth). We also computed a novel index of the perceived beauty of verbal art (Kintsch, 2012) for the works in GLEC and predict that Emma is, theoretically, the most beautiful of Austen's novels. Finally, we demonstrate that these novel measures of semantic complexity are important features for text classification and authorship recognition, with overall predictive accuracies in the range of .75 to .97. Our data pave the way for future computational and empirical studies of literature or experiments in reading psychology, and offer multiple baselines and benchmarks for analysing and validating other book corpora.
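
The two literariness measures named in the abstract can be illustrated with a minimal sketch. It assumes one common reading of van Cranenburgh et al. (2019): a text is split into chunks, each chunk is represented by a vector, intratextual variance is the mean Euclidean distance of the chunk vectors to their centroid, and stepwise distance is the mean Euclidean distance between consecutive chunk vectors. The function names, the choice of Euclidean distance, and the toy vectors are assumptions made for illustration; the abstract does not specify how the authors computed these measures on GLEC.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def intratextual_variance(chunk_vectors):
    """Mean distance of each chunk vector to the text centroid (assumed definition)."""
    center = centroid(chunk_vectors)
    return sum(euclidean(v, center) for v in chunk_vectors) / len(chunk_vectors)


def stepwise_distance(chunk_vectors):
    """Mean distance between consecutive chunk vectors (assumed definition)."""
    steps = [euclidean(u, v) for u, v in zip(chunk_vectors, chunk_vectors[1:])]
    return sum(steps) / len(steps)


# Hypothetical 2-dimensional chunk vectors for a three-chunk text.
toy_chunks = [[0.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
variance = intratextual_variance(toy_chunks)
stepwise = stepwise_distance(toy_chunks)
```

In practice the chunk vectors would come from a document-embedding or topic model rather than being written by hand; under these assumed definitions, higher values indicate a text whose chunks are semantically more spread out or change more from one chunk to the next.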

research
10/21/2020

Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set

The Gutenberg Literary English Corpus (GLEC) provides a rich source of t...
research
09/24/2020

A Comparative Study of Feature Types for Age-Based Text Classification

The ability to automatically determine the age audience of a novel provi...
research
09/26/2021

Electoral Programs of German Parties 2021: A Computational Analysis Of Their Comprehensibility and Likeability Based On SentiArt

The electoral programs of six German parties issued before the parliamen...
research
01/07/2020

Text Complexity Classification Based on Linguistic Information: Application to Intelligent Tutoring of ESL

The goal of this work is to build a classifier that can identify text co...
research
07/27/2022

CompText: Visualizing, Comparing & Understanding Text Corpus

A common practice in Natural Language Processing (NLP) is to visualize t...
research
08/25/2020

Comparative Computational Analysis of Global Structure in Canonical, Non-Canonical and Non-Literary Texts

This study investigates global properties of literary and non-literary t...
research
05/08/2016

A corpus of preposition supersenses in English web reviews

We present the first corpus annotated with preposition supersenses, unle...