A Method for Finding Similar Documents Relying on Adding Repetition of Symbols in Length Based Filtering

12/08/2017
by   Hossein Azgomi, et al.
Finding similar items is a basic topic in the mining of massive datasets; finding similar documents is one example. Many methods exist for this task, among them shingling and length-based filtering. In shingling, substrings (referred to as symbols) are extracted from each document and placed in a set; the similarity of two documents is then computed as the similarity of their sets. In length-based filtering, only documents whose lengths are close to each other are compared. Neither method considers the repetition of symbols. Taking repetition into account allows the length of a document to be calculated more accurately. In this paper we propose a method for finding similar documents that considers the repetition of symbols. This method separates documents more effectively. The main goal of this paper is to present a method for finding similar documents that requires fewer comparisons and less time.
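
The ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes character shingles of length `k`, keeps repeated shingles in a multiset, and uses a multiset Jaccard similarity (sum of minimum counts over sum of maximum counts). The function names, `k=3`, and the 0.5 threshold are hypothetical choices made for illustration. The length filter relies on the fact that this similarity can never exceed the ratio of the smaller multiset size to the larger one, so pairs whose sizes differ too much are skipped without being compared.

```python
from collections import Counter
from itertools import combinations


def shingle_counts(text, k=3):
    """Count every k-character shingle, keeping repetitions (assumed shingle type)."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def bag_jaccard(a, b):
    """Multiset Jaccard: sum of min counts divided by sum of max counts."""
    inter = sum((a & b).values())  # Counter & keeps the minimum count per shingle
    union = sum((a | b).values())  # Counter | keeps the maximum count per shingle
    return inter / union if union else 0.0


def similar_pairs(docs, k=3, threshold=0.5):
    """Return (i, j, similarity) for document pairs at or above the threshold."""
    bags = [shingle_counts(d, k) for d in docs]
    sizes = [sum(b.values()) for b in bags]  # document length counted with repetition
    results = []
    for i, j in combinations(range(len(docs)), 2):
        small, large = sorted((sizes[i], sizes[j]))
        # Length filter: the similarity cannot exceed small / large,
        # so such a pair can be skipped without computing it.
        if large == 0 or small / large < threshold:
            continue
        sim = bag_jaccard(bags[i], bags[j])
        if sim >= threshold:
            results.append((i, j, sim))
    return results


if __name__ == "__main__":
    docs = ["abcabc", "abcabd", "xyz" * 10]
    print(similar_pairs(docs))
```

In this example the third document is much longer than the other two, so the length filter discards both pairs that involve it before any similarity is computed; only the first two documents are actually compared.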


research
08/20/2016

Topic Sensitive Neural Headline Generation

Neural models have recently been used in text summarization including he...
research
11/25/2019

FLATM: A Fuzzy Logic Approach Topic Model for Medical Documents

One of the challenges for text analysis in medical domains is analyzing ...
research
01/29/2020

Comparison of scanned administrative document images

In this work the methods of comparison of digitized copies of administra...
research
03/22/2020

Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles

Many digital libraries recommend literature to their users considering t...
research
06/12/2020

On the Impact of Finite-Length Probabilistic Shaping on Fiber Nonlinear Interference

The interplay of shaped signaling and fiber nonlinearities is reviewed i...
research
02/08/2015

Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches

Term frequency normalization is a serious issue since lengths of documen...
