CHARTER: heatmap-based multi-type chart data extraction

11/28/2021
by Joseph Shtok, et al.

The digital conversion of information stored in documents is a great source of knowledge. In contrast to document text, the conversion of graphics embedded in documents, such as charts and plots, has been much less explored. We present a method and a system for end-to-end conversion of document charts into a machine-readable tabular data format, which can be easily stored and analyzed in the digital domain. Our approach extracts and analyzes charts along with their graphical elements and supporting structures such as legends, axes, titles, and captions. Our detection system is based on neural networks trained solely on synthetic data, eliminating the limiting factor of data collection. As opposed to previous methods, which detect graphical elements using bounding boxes, our networks feature auxiliary domain-specific heatmap prediction, enabling the precise detection of pie charts and of line and scatter plots, which do not fit the rectangular bounding-box presumption. Qualitative and quantitative results show high robustness and precision, improving upon previous works on popular benchmarks.
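
To illustrate why a heatmap can localize elements that a bounding box cannot (for example, individual scatter-plot markers), the sketch below picks local maxima from a 2D score map. This is a generic peak-picking routine written for illustration only; the abstract does not describe how the authors decode their heatmaps, and the function name `heatmap_peaks`, the `threshold` parameter, and the 8-neighbour rule are all assumptions.

```python
from typing import List, Sequence, Tuple


def heatmap_peaks(
    heatmap: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Tuple[int, int]]:
    """Return (row, col) cells that are local maxima with score >= threshold.

    Hypothetical helper, not taken from the paper. A cell counts as a peak
    when it is at least as large as each of its (up to) 8 neighbours, so
    every cell of a flat plateau above the threshold is reported.
    """
    rows = len(heatmap)
    cols = len(heatmap[0]) if rows else 0
    peaks: List[Tuple[int, int]] = []
    for r in range(rows):
        for c in range(cols):
            value = heatmap[r][c]
            if value < threshold:
                continue
            # Collect the neighbours that fall inside the map.
            neighbours = [
                heatmap[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value >= n for n in neighbours):
                peaks.append((r, c))
    return peaks


# Toy 5x5 score map with two strong responses and one weak one.
demo = [[0.0] * 5 for _ in range(5)]
demo[1][1] = 0.9
demo[1][2] = 0.4  # below threshold, ignored
demo[3][4] = 0.8
print(heatmap_peaks(demo))  # [(1, 1), (3, 4)]
```

Each returned cell is a point estimate rather than a rectangle, which is the property that makes heatmap outputs a natural fit for markers, line vertices, or pie-slice boundaries.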


