Method for Determining the Similarity of Text Documents for the Kazakh language, Taking Into Account Synonyms: Extension to TF-IDF

11/22/2022
by Bakhyt Bakiyev, et al.

Determining the similarity of text documents has received considerable attention in areas such as Information Retrieval, Text Mining, Natural Language Processing (NLP), and Computational Linguistics. Converting text into numeric vectors is a complex task that involves steps such as tokenization, stopword filtering, stemming, and term weighting. Term frequency-inverse document frequency (TF-IDF) is the most widely used term weighting method for finding relevant documents, and many extensions of TF-IDF have been proposed to improve term weighting. This paper proposes another extension of TF-IDF that takes synonyms into account. The effectiveness of the method is confirmed by experiments that measure the similarity of Kazakh-language text documents with the cosine, Dice, and Jaccard similarity functions.
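The abstract does not spell out how synonyms enter the weighting, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that every word is mapped to a canonical representative of its synonym group before term and document frequencies are counted. The synonym table, the function names, and the smoothed IDF variant are all illustrative assumptions, and English placeholder words stand in for Kazakh vocabulary. The Dice and Jaccard functions use their common weighted-vector generalizations.

```python
import math
from collections import Counter

# Hypothetical synonym table (assumption): word -> canonical representative.
# English placeholders are used here; a real system would need a Kazakh lexicon.
SYNONYMS = {"automobile": "car", "auto": "car", "quick": "fast"}


def normalize(tokens, synonyms):
    """Replace each token with its synonym-group representative, if any."""
    return [synonyms.get(t, t) for t in tokens]


def tfidf_vectors(docs, synonyms=SYNONYMS):
    """Build sparse TF-IDF vectors (dicts) after folding synonyms together.

    Uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1, which is an assumption
    of this sketch and not necessarily the weighting used in the paper.
    """
    docs = [normalize(d, synonyms) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each (folded) term.
    df = Counter(term for d in docs for term in set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append({
            t: (count / len(d)) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, count in tf.items()
        })
    return vectors


def _dot(a, b):
    return sum(w * b[t] for t, w in a.items() if t in b)


def cosine(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    # Weighted (Tanimoto) form of the Jaccard coefficient.
    ab = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - ab
    return ab / denom if denom else 0.0


if __name__ == "__main__":
    docs = [["fast", "car"], ["quick", "automobile"], ["slow", "train"]]
    with_syn = tfidf_vectors(docs)
    without_syn = tfidf_vectors(docs, synonyms={})
    print("cosine with synonyms:   ", cosine(with_syn[0], with_syn[1]))
    print("cosine without synonyms:", cosine(without_syn[0], without_syn[1]))
```

In this toy corpus the first two documents share no surface words, so plain TF-IDF scores them as unrelated, while the synonym-folded vectors coincide. Folding synonyms before counting is the simplest design; a weighted scheme that gives synonyms partial credit would be another option under the same assumption.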


