Computational thematics: Comparing algorithms for clustering the genres of literary fiction

05/18/2023
by Oleg Sobchuk, et al.

What are the best methods of capturing thematic similarity between literary texts? Knowing the answer to this question would be useful for automatic clustering of book genres, or any other thematic grouping. This paper compares a variety of algorithms for unsupervised learning of thematic similarities between texts, which we call "computational thematics". These algorithms belong to three steps of analysis: text preprocessing, extraction of text features, and measuring distances between the lists of features. Each of these steps includes a variety of options. We test all possible combinations of these options: every combination of algorithms is tasked with clustering a corpus of books belonging to four pre-tagged genres of fiction. The resulting clustering is then validated against the "ground truth" genre labels. This comparison lets us identify the best and the worst combinations for computational thematic analysis. To illustrate the sharp difference between the best and the worst methods, we then cluster 5000 random novels from the HathiTrust corpus of fiction.
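The three-step pipeline described above, followed by validation against genre labels, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the abstract does not name the specific options it compares, so the lowercase tokenizer, relative word frequencies, cosine distance, average-linkage clustering, the adjusted Rand index, and the four-sentence toy corpus are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter
from itertools import combinations


def preprocess(text):
    """Step 1 (assumed option): lowercase and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_features(tokens):
    """Step 2 (assumed option): relative word frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def cosine_distance(a, b):
    """Step 3 (assumed option): cosine distance between sparse vectors."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def cluster(vectors, k):
    """Average-linkage agglomerative clustering, merged down to k clusters."""
    n = len(vectors)
    dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(c1, c2):
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]

    labels = [0] * n
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def adjusted_rand_index(truth, pred):
    """Agreement between a clustering and the ground-truth labels."""
    def pairs(m):
        return m * (m - 1) / 2

    sum_cells = sum(pairs(m) for m in Counter(zip(truth, pred)).values())
    sum_rows = sum(pairs(m) for m in Counter(truth).values())
    sum_cols = sum(pairs(m) for m in Counter(pred).values())
    expected = sum_rows * sum_cols / pairs(len(truth))
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy corpus standing in for books from two genres.
docs = [
    "The detective examined the murder weapon and questioned the suspect",
    "A murder suspect fled while the detective searched for clues",
    "The starship crew landed on an alien planet near a distant galaxy",
    "An alien fleet crossed the galaxy to attack the starship",
]
truth = ["detective", "detective", "scifi", "scifi"]

vectors = [extract_features(preprocess(d)) for d in docs]
labels = cluster(vectors, k=2)
ari = adjusted_rand_index(truth, labels)
print(labels, ari)
```

Comparing combinations of options, as the paper does, would amount to swapping in alternative functions for each step and recording the validation score of every combination.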


