Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

10/07/2018
by Hamid Mohammadi, et al.

An efficient, scalable document similarity detection system is essential for modern search engines. Search engines need batch text similarity measures to detect duplicate and near-duplicate web pages in their indexes, so that the same page is not indexed multiple times. Furthermore, in the scoring phase, they need similarity measures to detect duplicated content across web pages in order to improve the quality of their results. In this paper, a new approach to batch text similarity detection is proposed that combines ideas from dimensionality reduction techniques and information gain theory. The approach targets search engines' need to detect duplicate and near-duplicate web pages. It is evaluated on the NEWS20 dataset, and the results show that it outperforms the cosine text similarity algorithm in both speed and performance. It is also faster and more accurate than the rival method, the Simhash similarity algorithm.
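For context, the cosine text similarity baseline named in the abstract can be sketched as below. This is a minimal illustration of plain cosine similarity over term-frequency vectors, not the multi-reference method the paper proposes; the whitespace tokenizer and the names `term_frequencies` and `cosine_similarity` are assumptions made for this example.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split on whitespace, and count tokens.

    Assumption: a real system would use a proper tokenizer and weighting.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc_a = term_frequencies("the quick brown fox")
doc_b = term_frequencies("the quick brown dog")
doc_c = term_frequencies("an unrelated sentence entirely")

sim_ab = cosine_similarity(doc_a, doc_b)  # three of four terms shared
sim_ac = cosine_similarity(doc_a, doc_c)  # no terms shared
```

Comparing every pair of documents this way requires a number of comparisons that grows quadratically with the collection size, which is the cost that batch similarity methods for large collections aim to reduce.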


