Testing different Log Bases For Vector Model Weighting Technique

07/12/2023
by Kamel Assaf, et al.

Information retrieval systems retrieve relevant documents based on a query submitted by the user. The documents are first indexed, and the words in them are assigned weights using a weighting technique called TF-IDF, the product of Term Frequency (TF) and Inverse Document Frequency (IDF). TF is the number of occurrences of a term in a document. IDF measures whether the term is common or rare across all documents: it is computed by dividing the total number of documents in the system by the number of documents containing the term and then taking the logarithm of the quotient. By default, we use base 10 for the logarithm. In this paper, we test this weighting technique using a range of log bases from 0.1 to 100.0 to calculate the IDF. The aim of testing different log bases for the vector model weighting technique is to highlight the importance of understanding how the system performs at different weighting values. We use the documents of the MED, CRAN, NPL, LISA, and CISI test collections, which scientists assembled explicitly for experiments in information retrieval systems.
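
The weighting described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `idf` and `tfidf_weights`, the tokenized-list input format, and the use of raw counts for TF are assumptions made for illustration. The sketch rejects base 1, for which the logarithm is undefined, and non-positive bases. For any valid base b, the IDF equals the natural-log IDF divided by ln(b), so bases below 1 produce negative weights.

```python
import math
from collections import Counter


def idf(num_docs, doc_freq, base=10.0):
    """IDF as log_base(N / df). Hypothetical helper, not from the paper."""
    if base <= 0 or base == 1.0:
        raise ValueError("log base must be positive and not equal to 1")
    return math.log(num_docs / doc_freq, base)


def tfidf_weights(docs, base=10.0):
    """Return one {term: weight} dict per document.

    docs is assumed to be a list of token lists; TF is the raw count of
    the term in the document, as described in the abstract.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * idf(n, df[t], base) for t in tf})
    return weights


if __name__ == "__main__":
    toy_docs = [["heart", "disease", "heart"], ["disease", "wing"]]
    for b in (0.5, 2.0, 10.0, 100.0):
        print(b, tfidf_weights(toy_docs, base=b)[0])
```

In this toy collection, "disease" occurs in every document, so its weight is zero at every base, while the weight of "heart" changes in magnitude (and, for bases below 1, in sign) as the base varies.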
