OCR Graph Features for Manipulation Detection in Documents

09/10/2020
by   Hailey James, et al.
0

Detecting manipulations in digital documents is becoming increasingly important for information verification purposes. Due to the proliferation of image editing software, altering key information in documents has become widely accessible. Nearly all approaches in this domain rely on a procedural approach, using carefully generated features and a hand-tuned scoring system, rather than a data-driven and generalizable approach. We frame this issue as a graph comparison problem using the character bounding boxes, and propose a model that leverages graph features using OCR (Optical Character Recognition). Our model relies on a data-driven approach to detect alterations by training a random forest classifier on the graph-based OCR features. We evaluate our algorithm's forgery detection performance on dataset constructed from real business documents with slight forgery imperfections. Our proposed model dramatically outperforms the most closely-related document manipulation detection model on this task.

READ FULL TEXT
research
07/04/2022

BusiNet – a Light and Fast Text Detection Network for Business Documents

For digitizing or indexing physical documents, Optical Character Recogni...
research
01/28/2022

Self-paced learning to improve text row detection in historical documents with missing labels

An important preliminary step of optical character recognition systems i...
research
08/12/2019

Self-supervised Data Bootstrapping for Deep Optical Character Recognition of Identity Documents

The essential task of verifying person identities at airports and nation...
research
05/07/2020

A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition

Optical Character Recognition and extraction is a key tool in the automa...
research
05/17/2022

Detection Masking for Improved OCR on Noisy Documents

Optical Character Recognition (OCR), the task of extracting textual info...
research
07/16/2023

DocTr: Document Transformer for Structured Information Extraction in Documents

We present a new formulation for structured information extraction (SIE)...
research
11/28/2021

CHARTER: heatmap-based multi-type chart data extraction

The digital conversion of information stored in documents is a great sou...

Please sign up or login with your details

Forgot password? Click here to reset