Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

07/21/2023
by Peter Coates, et al.

Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be computationally expensive. For instance, the Levenshtein Distance is a common metric for the similarity between two documents, but its quadratic runtime makes it impractical for large documents, where "large" begins at a few hundred kilobytes. In this paper, we present a novel concept for estimating the Levenshtein Distance: the algorithm first compresses documents to signatures (similar to hash values) using a user-defined compression ratio. Signatures can then be compared against each other (some constraints apply), and the outcome is the estimated Levenshtein Distance. Our evaluation shows promising results in terms of runtime efficiency and accuracy. In addition, we introduce a significance score that allows examiners to set a threshold and identify related documents.
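
The idea summarized above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how signatures are built, so the windowed SHA-256 sampling, the output alphabet, the function names `signature` and `estimate_distance`, the parameters `ratio` and `window`, and the final scaling by the compression ratio are all assumptions made for illustration. Only `levenshtein` is the standard quadratic dynamic-programming algorithm.

```python
import hashlib
import string

# Hypothetical output alphabet for signature characters (an assumption).
ALPHABET = string.ascii_letters + string.digits


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance; time is O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def signature(text: str, ratio: int = 8, window: int = 4) -> str:
    """Assumed scheme: hash every window of `window` characters and emit
    one signature character for roughly 1 in `ratio` positions."""
    out = []
    for i in range(len(text) - window + 1):
        digest = hashlib.sha256(text[i:i + window].encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        if value % ratio == 0:  # content-defined sampling
            out.append(ALPHABET[(value // ratio) % len(ALPHABET)])
    return "".join(out)


def estimate_distance(a: str, b: str, ratio: int = 8, window: int = 4) -> int:
    """Assumed estimator: signature distance scaled back by the ratio."""
    sig_a = signature(a, ratio, window)
    sig_b = signature(b, ratio, window)
    return levenshtein(sig_a, sig_b) * ratio
```

Because each signature is roughly `ratio` times shorter than its document, the quadratic comparison runs on inputs that are about `ratio` times smaller on each side. The estimate is only meaningful for documents long enough to yield non-trivial signatures, which is one plausible reading of the constraints the abstract mentions.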

research
01/20/2022

JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

The JavaScript Object Notation (JSON) is a popular data format used in d...
research
01/07/2019

IDStack - The Common Protocol for Document Verification built on Digital Signatures

The use of physical documents is inconvenient and inefficient in today's...
research
10/07/2018

A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

One of the important factors that make a search engine fast and accurate...
research
02/14/2014

Authorship Analysis based on Data Compression

This paper proposes to perform authorship analysis using the Fast Compre...
research
12/15/2020

Efficient Clustering from Distributions over Topics

There are many scenarios where we may want to find pairs of textually si...
research
10/28/2018

Dynamic Thresholding Mechanisms for IR-Based Filtering in Efficient Source Code Plagiarism Detection

To solve time inefficiency issue, only potential pairs are compared in s...
research
06/14/2022

A Hierarchical-DBSCAN Method for Extracting Microservices from Monolithic Applications

The microservices architectural style offers many advantages such as sca...
