A Retrieval Framework and Implementation for Electronic Documents with Similar Layouts

10/16/2018
by Hyunji Chung, et al.

As the number of digital documents requiring investigation grows, identifying the documents relevant to a given case has become increasingly important, and investigators have long asked for better ways to find them. In some situations, no metadata is available for finding similar files: no timestamp, file size, title, subject, template, or author. Investigators then fall back on searching for document files that contain specific case-related keywords. Although traditional keyword search with elaborate regular expressions is useful in digital forensics, it can miss closely related documents whose body contents are entirely different. In this paper, we introduce a recent real case involving a large number of document files. The case suggests that similar-layout search can make digital investigations more efficient when it is used appropriately to supplement the results of traditional keyword search. Until now, research on electronic-document similarity has focused mainly on byte streams, format structures, and body contents, and there has been little research on the similarity of visual layouts from a digital forensics viewpoint. To narrow this gap, this study presents a novel framework for retrieving electronic document files with similar layouts and, based on that framework, implements a tool for finding similar Microsoft OOXML files using user-controlled layout queries.
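To make the idea of layout similarity concrete, here is a minimal sketch of how two OOXML word-processing files could be compared by structure alone. It is not the authors' framework or tool, and the abstract does not describe their features or scoring. The sketch assumes that a layout can be approximated by the order of paragraphs and tables in `word/document.xml`, and that `difflib.SequenceMatcher` is an acceptable stand-in for a similarity measure. The names `layout_signature`, `layout_similarity`, and `make_docx` are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher

# WordprocessingML main namespace used inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def layout_signature(docx_bytes):
    """Return the ordered block-level element types of a document body.

    Text content is ignored on purpose: only the structure is kept.
    This is an assumed, simplified notion of "layout".
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return []
    signature = []
    for child in body:
        tag = child.tag.split("}")[-1]  # strip the XML namespace
        if tag == "tbl":
            rows = child.findall(f"{{{W_NS}}}tr")
            signature.append(f"tbl:{len(rows)}")  # table with its row count
        elif tag == "p":
            signature.append("p")  # paragraph
    return signature


def layout_similarity(sig_a, sig_b):
    """Score two signatures between 0.0 and 1.0 (assumed measure)."""
    return SequenceMatcher(None, sig_a, sig_b).ratio()


def make_docx(body_xml):
    """Build a toy in-memory package holding only word/document.xml."""
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body_xml}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


# Two forms with different wording but the same structure, and one plain memo.
form_a = make_docx("<w:p><w:r><w:t>Invoice</w:t></w:r></w:p><w:tbl><w:tr/><w:tr/></w:tbl><w:p/>")
form_b = make_docx("<w:p><w:r><w:t>Receipt</w:t></w:r></w:p><w:tbl><w:tr/><w:tr/></w:tbl><w:p/>")
memo = make_docx("<w:p/><w:p/><w:p/><w:p/>")

same_layout = layout_similarity(layout_signature(form_a), layout_signature(form_b))
other_layout = layout_similarity(layout_signature(form_a), layout_signature(memo))
```

In this toy setup the two forms share no keywords, so a keyword search for one would miss the other, yet their signatures are identical. A real tool would need far richer features, such as page geometry, styles, headers, and embedded objects.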



research
03/09/2020

Forensic Analysis of Residual Information in Adobe PDF Files

In recent years, as electronic files include personal records and busine...
research
09/12/2022

One-Shot Doc Snippet Detection: Powering Search in Document Beyond Text

Active consumption of digital documents has yielded scope for research i...
research
07/28/2019

TopicSifter: Interactive Search Space Reduction Through Targeted Topic Modeling

Topic modeling is commonly used to analyze and understand large document...
research
01/07/2020

Provenance-based Classification Policy based on Encrypted Search

As an important type of cloud data, digital provenance is arousing incre...
research
10/22/2019

One-Shot Template Matching for Automatic Document Data Capture

In this paper, we propose a novel one-shot template-matching algorithm t...
research
09/20/2021

Traitor-Proof PDF Watermarking

This paper presents a traitor-tracing technique based on the watermarkin...
research
01/28/2022

Probably Reasonable Search in eDiscovery

In eDiscovery, a party to a lawsuit or similar action must search throug...
