A Fast Text Similarity Measure for Large Document Collections using Multi-reference Cosine and Genetic Algorithm

10/07/2018
by Hamid Mohammadi, et al.

A concise, duplicate-free index is one of the key factors that make a search engine fast and accurate. To remove duplicate and near-duplicate documents from the index, a search engine needs a fast and reliable system for detecting duplicate and near-duplicate text documents. Traditional approaches to this problem, such as brute-force comparison or simple hash-based algorithms, are unsuitable because they do not scale and cannot detect near-duplicate documents effectively. This paper introduces a new signature-based approach to text similarity detection that is fast, scalable, and reliable, and that needs less storage space. The proposed method is based on the idea of using reference texts to generate signatures for text documents; the novelty of this paper is the use of genetic algorithms to generate better reference texts. The method is evaluated on popular text document datasets, including CiteseerX, Enron, and the Gold Set of Near-duplicate News Articles. The results are promising and comparable with the best state-of-the-art algorithms in terms of accuracy and performance.
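
The idea of a reference-text signature can be illustrated with a minimal sketch. The abstract does not specify how signatures are computed, so everything below is an assumption made for illustration: whitespace tokenization, raw term counts, one cosine value per reference text, and a maximum-difference comparison between signatures. The function names (`term_vector`, `cosine`, `signature`, `signature_distance`) and the example reference texts are hypothetical, and the genetic search for better reference texts is not shown.

```python
import math
from collections import Counter


def term_vector(text):
    # Assumption: lowercase whitespace tokens with raw counts.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def signature(text, references):
    # Hypothetical signature: one cosine value per reference text.
    vec = term_vector(text)
    return [cosine(vec, term_vector(ref)) for ref in references]


def signature_distance(sig_a, sig_b):
    # Assumption: compare signatures by their largest coordinate gap.
    return max(abs(x - y) for x, y in zip(sig_a, sig_b))


# Hypothetical reference texts, chosen by hand for this example.
references = [
    "search engine index document",
    "genetic algorithm reference text",
]

doc_a = "a search engine needs a duplicate free index"
doc_b = "a search engine needs a duplicate free index"
doc_c = "genetic algorithm generate better reference text"

sig_a = signature(doc_a, references)
sig_b = signature(doc_b, references)
sig_c = signature(doc_c, references)
```

In this sketch each document is reduced to a short list of numbers, one per reference text, so two documents can be compared without revisiting their full contents. Identical documents produce identical signatures, while documents that share no terms with the same reference text diverge on that coordinate.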


