Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning

03/23/2020
by   Thiago M. Paixão, et al.

The reconstruction of shredded documents consists of arranging the pieces of paper (shreds) so as to restore the original appearance of the documents. This task is particularly relevant to forensic investigation, as documents may contain criminal evidence. As an alternative to the laborious and time-consuming manual process, several researchers have been investigating ways to perform automatic digital reconstruction. A central problem in the automatic reconstruction of shredded documents is the pairwise compatibility evaluation of the shreds, notably for binary text documents. In this context, deep learning has enabled great progress toward accurate reconstructions in the domain of mechanically shredded documents. A sensitive issue, however, is that current deep model solutions require an inference whenever a pair of shreds has to be evaluated. This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly (rather than quadratically) with the number of shreds. Instead of predicting compatibility directly, deep models are leveraged to asymmetrically project the raw shred content onto a common metric space in which distance is proportional to the compatibility. Experimental results show that our method has accuracy comparable to the state of the art, with a speed-up of about 22 times for a test instance with 505 shreds (20 mixed shredded pages from different documents).
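To make the linear-versus-quadratic point concrete, the following is a minimal sketch and not the authors' implementation. It assumes two hypothetical encoders, one for the right side of a shred and one for the left side, each stood in for by a random linear map with a `tanh`. It also assumes that a smaller distance in the shared space means a better fit. The names `CountingEncoder` and `cost_matrix` are illustrative only, and for brevity both encoders receive the same feature vector per shred, whereas a real system would presumably feed each one the relevant boundary region.

```python
import numpy as np


class CountingEncoder:
    """Hypothetical stand-in for a trained deep model; counts forward passes."""

    def __init__(self, weights):
        self.weights = weights
        self.calls = 0

    def __call__(self, x):
        self.calls += 1
        return np.tanh(self.weights @ x)


def cost_matrix(shreds, enc_right, enc_left):
    """Return cost[i, j] for placing shred j immediately to the right of shred i."""
    # One forward pass per shred and per encoder: 2 * n inferences in total.
    right = np.stack([enc_right(s) for s in shreds])
    left = np.stack([enc_left(s) for s in shreds])
    # All n * n pairwise squared distances come from cheap array arithmetic,
    # with no further model inference.
    diff = right[:, None, :] - left[None, :, :]
    cost = np.sum(diff ** 2, axis=-1)
    # A shred cannot be its own neighbour.
    np.fill_diagonal(cost, np.inf)
    return cost


rng = np.random.default_rng(0)
n, feat_dim, emb_dim = 6, 16, 4  # toy sizes, chosen arbitrarily
shreds = [rng.normal(size=feat_dim) for _ in range(n)]
enc_right = CountingEncoder(rng.normal(size=(emb_dim, feat_dim)))
enc_left = CountingEncoder(rng.normal(size=(emb_dim, feat_dim)))

cost = cost_matrix(shreds, enc_right, enc_left)
print(cost.shape, enc_right.calls + enc_left.calls)
```

Because the two encoders differ, `cost[i, j]` and `cost[j, i]` are generally not equal, which is what lets the measure distinguish "j to the right of i" from "i to the right of j". Under this scheme, an instance with 505 shreds would need 2 × 505 = 1,010 forward passes, whereas a model that scores each ordered pair separately would need 505 × 504 = 254,520.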

