Processing topical queries on images of historical newspaper pages

02/20/2020
by José E. B. Maia, et al.

Historical newspapers are a research source for the human and social sciences. However, these image collections are difficult for machines to read because of poor print quality, the lack of page standardization, and the low-quality photography of some files. This paper presents the processing model of a topic navigation system for images of historical newspaper pages. The general procedure consists of four modules: segmentation of text sub-images and text extraction; preprocessing and representation; induced topic extraction and representation; and a document viewing and retrieval interface. The algorithmic and technological approaches of each module are described, and initial test results on a collection spanning 28 years are presented.
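As a rough illustration of how the four modules might chain together, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not name the algorithms used, so every function and class name below is hypothetical. Module 1 is replaced by a stand-in that takes already-extracted page text and splits it on blank lines, and the "topics" are simply the most frequent terms in each block, a crude proxy for whatever induced topic extraction the paper actually uses.

```python
import re
from collections import Counter
from dataclasses import dataclass, field

# Tiny illustrative stopword list (assumption, not from the paper).
STOPWORDS = {"the", "and", "for", "with", "that", "this"}


@dataclass
class Block:
    """One text block found on a page (hypothetical structure)."""
    page_id: str
    text: str
    tokens: list = field(default_factory=list)
    topics: list = field(default_factory=list)


def segment_and_extract(pages):
    """Module 1 stand-in: pages maps page_id -> already-extracted text.

    A real system would segment text sub-images and run OCR here;
    this sketch only splits the text on blank lines.
    """
    blocks = []
    for page_id, raw in pages.items():
        for chunk in re.split(r"\n\s*\n", raw):
            if chunk.strip():
                blocks.append(Block(page_id, chunk.strip()))
    return blocks


def preprocess(block):
    """Module 2: lowercase, tokenize, drop stopwords and very short words."""
    words = re.findall(r"[a-z]+", block.text.lower())
    block.tokens = [w for w in words if w not in STOPWORDS and len(w) > 2]
    return block


def extract_topics(block, k=2):
    """Module 3 stand-in: the k most frequent terms act as topic labels."""
    block.topics = [w for w, _ in Counter(block.tokens).most_common(k)]
    return block


def build_index(blocks):
    """Module 4, part 1: map each topic label to the pages that contain it."""
    index = {}
    for b in blocks:
        for t in b.topics:
            index.setdefault(t, set()).add(b.page_id)
    return index


def query(index, term):
    """Module 4, part 2: return the sorted page ids for a topic label."""
    return sorted(index.get(term.lower(), set()))


def run_pipeline(pages):
    """Chain the four modules and return the topic index."""
    blocks = [extract_topics(preprocess(b)) for b in segment_and_extract(pages)]
    return build_index(blocks)
```

Calling `run_pipeline` on a dictionary of page texts returns an index that `query` can search by topic label. Keeping each module as a separate function means that any stand-in can be swapped for a real segmentation, OCR, or topic-modelling component without touching the others.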

research
04/10/2007

Text Line Segmentation of Historical Documents: a Survey

There is a huge amount of historical documents in libraries and in vario...
research
04/08/2022

A Generic Image Retrieval Method for Date Estimation of Historical Document Collections

Date estimation of historical document images is a challenging problem, ...
research
02/15/2020

Historical Document Processing: A Survey of Techniques, Tools, and Trends

Historical Document Processing is the process of digitizing written mate...
research
12/06/2018

Neural Word Search in Historical Manuscript Collections

We address the problem of segmenting and retrieving word images in colle...
research
04/15/2020

An Evaluation of DNN Architectures for Page Segmentation of Historical Newspapers

One important and particularly challenging step in the optical character...
research
11/16/2016

How to do lexical quality estimation of a large OCRed historical Finnish newspaper collection with scarce resources

The National Library of Finland has digitized the historical newspapers ...
research
11/20/2020

Topic modelling discourse dynamics in historical newspapers

This paper addresses methodological issues in diachronic data analysis f...
