Generalizability in Document Layout Analysis for Scientific Article Figure Caption Extraction

01/25/2023
by Jill P. Naiman, et al.

The lack of generalizability – in which a model trained on one dataset cannot provide accurate results for a different dataset – is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, the model often cannot be applied successfully to new domains. While several solutions have been proposed, including newer and updated deep learning models, larger hand-annotated datasets, and the generation of large synthetic datasets, so far there is no "magic bullet" for translating a model trained on a particular domain or historical time period to a new field. Here we present our ongoing work in translating our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example as an avenue to highlight some of the problems with generalizability in the document layout analysis community and discuss several challenges and possible solutions to address these issues. All code for this work is available on The Reading Time Machine GitHub repository (https://github.com/ReadingTimeMachine/htrc_short_conf).

research
04/18/2020

A Large Dataset of Historical Japanese Documents with Complex Layouts

Deep learning-based approaches for automatic document layout analysis an...
research
04/06/2018

Extracting Scientific Figures with Distantly Supervised Neural Networks

Non-textual components such as charts, diagrams and tables provide key i...
research
08/21/2023

Performance Enhancement Leveraging Mask-RCNN on Bengali Document Layout Analysis

Understanding digital documents is like solving a puzzle, especially his...
research
06/01/2021

Incorporating Visual Layout Structures for Scientific Text Classification

Classifying the core textual components of a scientific paper-title, aut...
research
11/09/2022

DoSA : A System to Accelerate Annotations on Business Documents with Human-in-the-Loop

Business documents come in a variety of structures, formats and informat...
research
06/29/2021

SDL: New data generation tools for full-level annotated document layout

We present a novel data generation tool for document processing. The too...
research
02/16/2022

Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Background. In recent years, libraries and archives led important digiti...
