A hypergeometric test interpretation of a common tf-idf variant

02/26/2020
by Paul Sheridan, et al.

Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a term of interest in one out of many documents. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. Following in this tradition, we here advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the hypergeometric test from classical statistics corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. Then we set forth a mathematical argument that suggests the tf-idf variant functions as an approximation to the hypergeometric test (and vice versa). The hypergeometric test interpretation of this common tf-idf variant equips the working statistician with a ready explanation of tf-idf's long-established effectiveness.
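
To make the claimed correspondence concrete, the following minimal Python sketch scores the terms of one document in a toy corpus two ways: with a tf-idf weight and with the negative log p-value of a one-sided hypergeometric test. The abstract does not spell out which tf-idf variant or which test parameterization is meant, so everything here is an assumption made for illustration: the weight is taken to be raw term frequency times log(number of documents / document frequency), and the test treats the document as n tokens drawn without replacement from a collection of N tokens, K of which are the term of interest. The function names and the toy corpus are hypothetical.

```python
from collections import Counter
from math import comb, log


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: tokens in the collection, K: occurrences of the term in the collection,
    n: tokens in the document, k: occurrences of the term in the document.
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def tf_idf(tf, df, num_docs):
    """Assumed variant: raw term frequency times log(num_docs / document frequency)."""
    return tf * log(num_docs / df)


def score_terms(docs, doc_index):
    """Return {term: (tf-idf weight, -log p-value)} for one tokenized document."""
    doc = docs[doc_index]
    N = sum(len(d) for d in docs)  # total tokens in the collection
    n = len(doc)                   # tokens in the chosen document
    collection_counts = Counter(t for d in docs for t in d)
    scores = {}
    for term, k in Counter(doc).items():
        K = collection_counts[term]
        df = sum(1 for d in docs if term in d)
        p = hypergeom_pvalue(k, n, K, N)
        scores[term] = (tf_idf(k, df, len(docs)), -log(p))
    return scores


# Hypothetical toy corpus, already tokenized.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
    "quantum cat quantum cat quantum".split(),
]

for term, (weight, surprise) in sorted(score_terms(docs, 3).items()):
    print(f"{term:8s} tf-idf={weight:.3f}  -log(p)={surprise:.3f}")
```

In this toy corpus both scores rank "quantum" above "cat" for the last document, which is the kind of agreement the abstract describes. A four-document example is only suggestive and says nothing about how closely the two measures track each other on real retrieval tasks.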


