American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

08/24/2023
by Melissa Dell, et al.

Existing full-text datasets of U.S. public domain newspapers do not recognize the often complex layouts of newspaper scans, and as a result the digitized content scrambles texts from articles, headlines, captions, advertisements, and other layout regions. OCR quality can also be low. This study develops a novel deep learning pipeline for extracting full article texts from newspaper images and applies it to the nearly 20 million scans in the Library of Congress's public domain Chronicling America collection. The pipeline includes layout detection, legibility classification, custom OCR, and association of article texts spanning multiple bounding boxes. To achieve high scalability, it is built with efficient architectures designed for mobile phones. The resulting American Stories dataset provides high-quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge. The dataset could also be added to the external database of a retrieval-augmented language model to make historical information, ranging from interpretations of political events to minutiae about the lives of people's ancestors, more widely accessible. Furthermore, structured article texts facilitate using transformer-based methods for popular social science applications like topic classification, detection of reproduced content, and news story clustering. Finally, American Stories provides a massive silver-quality dataset for innovating multimodal layout analysis models and other multimodal applications.
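
The four stages named in the abstract (layout detection, legibility classification, OCR, and association of article texts across bounding boxes) can be pictured with a minimal sketch. This is not the authors' implementation: the `Region` type, the label names, the `run_pipeline` and `associate_articles` functions, and especially the toy association rule (a headline opens an article, and later article boxes in column order are appended to it) are all assumptions made for illustration. The three model stages are passed in as plain callables so that no real library API is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass
class Region:
    """One layout region on a scan (hypothetical structure)."""
    bbox: BBox
    label: str      # assumed label set, e.g. "headline", "article", "advertisement"
    text: str = ""  # filled in by the OCR stage


def associate_articles(regions: List[Region]) -> List[str]:
    """Toy association rule (an assumption, not the paper's method).

    Regions are read column by column (sorted by left edge, then top edge).
    Each headline starts a new article; following "article" regions are
    appended to the most recent headline. Other labels are ignored, and
    article regions seen before any headline are dropped.
    """
    ordered = sorted(regions, key=lambda r: (r.bbox[0], r.bbox[1]))
    articles: List[List[str]] = []
    for region in ordered:
        if region.label == "headline":
            articles.append([region.text])
        elif region.label == "article" and articles:
            articles[-1].append(region.text)
    return ["\n".join(parts) for parts in articles]


def run_pipeline(
    scan: Any,
    detect_layout: Callable[[Any], List[Region]],
    is_legible: Callable[[Any, Region], bool],
    ocr: Callable[[Any, Region], str],
) -> List[str]:
    """Chain the four stages; the three callables stand in for trained models."""
    # Stages 1 and 2: detect regions, then keep only the legible ones.
    regions = [r for r in detect_layout(scan) if is_legible(scan, r)]
    # Stage 3: transcribe each legible region.
    for region in regions:
        region.text = ocr(scan, region)
    # Stage 4: merge regions that belong to the same article.
    return associate_articles(regions)
```

Keeping the stages as injected callables lets each one be replaced independently, for example swapping a stub OCR function for a real model, without touching the association logic.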


research
04/18/2020

A Large Dataset of Historical Japanese Documents with Complex Layouts

Deep learning-based approaches for automatic document layout analysis an...
research
06/07/2023

Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks

To promote the development of Vision-Language Pre-training (VLP) and mul...
research
08/24/2021

Detection of Criminal Texts for the Polish State Border Guard

This paper describes research on the detection of Polish criminal texts ...
research
05/02/2022

POLITICS: Pretraining with Same-story Article Comparison for Ideology Prediction and Stance Detection

Ideology is at the core of political science research. Yet, there still ...
research
01/26/2023

LoRaLay: A Multilingual and Multimodal Dataset for Long Range and Layout-Aware Summarization

Text Summarization is a popular task and an active area of research for ...
research
07/14/2023

Aspect-Driven Structuring of Historical Dutch Newspaper Archives

Digital libraries oftentimes provide access to historical newspaper arch...
research
04/08/2015

Mining and discovering biographical information in Difangzhi with a language-model-based approach

We present results of expanding the contents of the China Biographical D...
