Chapter Captor: Text Segmentation in Novels

by Charuta Pethe, et al.

Books are typically segmented into chapters and sections, representing coherent subnarratives and topics. We investigate the task of predicting chapter boundaries, as a proxy for the general task of segmenting long texts. We build a Project Gutenberg chapter segmentation data set of 9,126 English novels, using a hybrid approach combining neural inference and rule matching to recognize chapter title headers in books, achieving an F1-score of 0.77 on this task. Using this annotated data as ground truth after removing structural cues, we present cut-based and neural methods for chapter segmentation, achieving an F1-score of 0.453 on the challenging task of exact break prediction over book-length documents. Finally, we reveal interesting historical trends in the chapter structure of novels.
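To make the "exact break prediction" score concrete, the following is a minimal sketch of how an F1-score over predicted chapter breaks could be computed. It assumes breaks are represented as integer positions (for example, sentence or paragraph indices) and that a prediction counts as correct only when it matches a ground-truth position exactly. The function name `boundary_f1` and this representation are illustrative assumptions, not the evaluation protocol used by the authors.

```python
def boundary_f1(predicted, gold):
    """Precision, recall, and F1 for exact-match break positions.

    Assumption: each break is an integer index into the text, and a
    predicted break is a true positive only if the same index appears
    in the ground truth.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)

    # Guard against empty inputs to avoid division by zero.
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 3 predicted breaks, 4 true breaks, 2 exact matches.
precision, recall, f1 = boundary_f1([10, 25, 40], [10, 30, 40, 55])
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Under exact matching, a break predicted one position away from the true boundary earns no credit, which is one reason this task is hard on book-length documents.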






