Quasi Error-free Text Classification and Authorship Recognition in a large Corpus of English Literature based on a Novel Feature Set

10/21/2020
by Arthur M. Jacobs, et al.

The Gutenberg Literary English Corpus (GLEC) provides a rich source of textual data for research in digital humanities, computational linguistics, or neurocognitive poetics. So far, however, only a small subcorpus, the Gutenberg English Poetry Corpus, has been submitted to quantitative text analyses providing predictions for scientific studies of literature. Here we show that, across the entire GLEC, quasi error-free text classification and authorship recognition are possible with a method that uses the same set of five style and five content features, computed via style and sentiment analysis, in both tasks. Our results identify two standard and two novel features (i.e., type-token ratio, frequency, sonority score, surprise) as most diagnostic in these tasks. By providing a simple tool that applies to both short poems and long novels and generates quantitative predictions about features that co-determine the cognitive and affective processing of specific text categories or authors, our data pave the way for many future computational and empirical studies of literature or experiments in reading psychology.
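To make one of the standard features named above concrete, the following is a minimal sketch of a type-token ratio: the number of distinct word forms divided by the total number of words. The function name `type_token_ratio`, the lowercase alphabetic tokenizer, and the sample sentence are illustrative assumptions; the abstract does not specify how the authors tokenize or normalize text, so this should not be read as their implementation.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct word forms (types) divided by total words (tokens)."""
    # Assumption: tokens are runs of lowercase ASCII letters.
    # The tokenization used in the paper is not stated in the abstract.
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty input
    return len(set(tokens)) / len(tokens)


# Hypothetical sample: 11 tokens, 8 distinct types.
sample = "the cat sat on the mat and the dog sat too"
ttr = type_token_ratio(sample)
print(f"type-token ratio: {ttr:.3f}")
```

Because longer texts repeat words more often, a raw ratio of this kind tends to fall as text length grows, which matters when comparing short poems with long novels.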


research
01/12/2022

Computational analyses of the topics, sentiments, literariness, creativity and beauty of texts in a large Corpus of English Literature

The Gutenberg Literary English Corpus (GLEC, Jacobs, 2018a) provides a r...
research
03/16/2018

Corpus Statistics in Text Classification of Online Data

Transformation of Machine Learning (ML) from a boutique science to a gen...
research
01/06/2018

Explorations in an English Poetry Corpus: A Neurocognitive Poetics Perspective

This paper describes a corpus of about 3000 English literary texts with ...
research
10/15/2020

Token Sequence Labeling vs. Clause Classification for English Emotion Stimulus Detection

Emotion stimulus detection is the task of finding the cause of an emotio...
research
10/05/2022

Token Classification for Disambiguating Medical Abbreviations

Abbreviations are unavoidable yet critical parts of the medical text. Us...
research
06/14/2021

Evaluating Various Tokenizers for Arabic Text Classification

The first step in any NLP pipeline is learning word vector representatio...
research
09/26/2021

Electoral Programs of German Parties 2021: A Computational Analysis Of Their Comprehensibility and Likeability Based On SentiArt

The electoral programs of six German parties issued before the parliamen...
