A Large Dataset of Historical Japanese Documents with Complex Layouts

04/18/2020
by Zejiang Shen, et al.

Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models. In particular, little training data exists for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders of the layout elements. The dataset is constructed using a combination of human and machine efforts: a semi-rule-based method extracts the layout elements, and human inspectors check the results. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models. We also demonstrate the usefulness of the dataset on real-world document digitization tasks. The dataset is available at https://dell-research-harvard.github.io/HJDataset/.
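
To make the annotation structure described above more concrete, the following is a minimal sketch of how layout elements with bounding boxes, a parent-child hierarchy, and a per-sibling reading order might be represented and traversed. It is not the dataset's actual schema: the class `LayoutElement`, its field names, the category labels, and the function `reading_sequence` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class LayoutElement:
    """One annotated region. All field names here are hypothetical."""
    element_id: int
    category: str                             # illustrative label, not an official type name
    bbox: Tuple[float, float, float, float]   # assumed (x, y, width, height)
    parent_id: Optional[int] = None           # None marks a top-level element
    reading_order: int = 0                    # position among siblings


def reading_sequence(elements: List[LayoutElement]) -> List[int]:
    """Return element ids in depth-first order, siblings sorted by reading_order."""
    # Group elements by their parent so each level can be sorted independently.
    children: Dict[Optional[int], List[LayoutElement]] = {}
    for el in elements:
        children.setdefault(el.parent_id, []).append(el)

    ordered: List[int] = []

    def visit(parent_id: Optional[int]) -> None:
        siblings = sorted(children.get(parent_id, []), key=lambda e: e.reading_order)
        for el in siblings:
            ordered.append(el.element_id)
            visit(el.element_id)  # descend before moving to the next sibling

    visit(None)
    return ordered


# Made-up example: a page with two text regions, the right one read first.
elements = [
    LayoutElement(1, "page", (0, 0, 1000, 1400)),
    LayoutElement(3, "text", (100, 50, 300, 600), parent_id=1, reading_order=1),
    LayoutElement(2, "text", (500, 50, 300, 600), parent_id=1, reading_order=0),
    LayoutElement(4, "title", (520, 60, 40, 200), parent_id=2, reading_order=0),
]
print(reading_sequence(elements))  # [1, 2, 4, 3]
```

Storing the reading order relative to siblings, rather than as one global index, is a design choice made for this sketch; it keeps the hierarchy and the ordering consistent with each other when regions are nested.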


research
08/26/2021

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

Reading order detection is the cornerstone to understanding visually-ric...
research
01/25/2023

Generalizability in Document Layout Analysis for Scientific Article Figure Caption Extraction

The lack of generalizability – in which a model trained on one dataset c...
research
02/16/2022

Processing the structure of documents: Logical Layout Analysis of historical newspapers in French

Background. In recent years, libraries and archives led important digiti...
research
11/11/2021

Synthetic Document Generator for Annotation-free Layout Recognition

Analyzing the layout of a document to identify headers, sections, tables...
research
08/12/2021

VTLayout: Fusion of Visual and Text Features for Document Layout Analysis

Documents often contain complex physical structures, which make the Docu...
research
08/24/2023

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

Existing full text datasets of U.S. public domain newspapers do not reco...
research
12/15/2019

Indiscapes: Instance Segmentation Networks for Layout Parsing of Historical Indic Manuscripts

Historical palm-leaf manuscript and early paper documents from Indian su...
