A hypergeometric test interpretation of a common tf-idf variant
Term frequency-inverse document frequency, or tf-idf for short, is a numerical measure that is widely used in information retrieval to quantify the importance of a given term in one document out of a collection. While tf-idf was originally proposed as a heuristic, much work has been devoted over the years to placing it on a solid theoretical foundation. In this tradition, we advance the first justification for tf-idf that is grounded in statistical hypothesis testing. More precisely, we first show that the hypergeometric test from classical statistics corresponds well with a common tf-idf variant on selected real-data information retrieval tasks. We then set forth a mathematical argument suggesting that the tf-idf variant functions as an approximation to the hypergeometric test (and vice versa). The hypergeometric test interpretation of this common tf-idf variant equips the working statistician with a ready explanation of tf-idf's long-established effectiveness.
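To make the comparison concrete, here is a minimal Python sketch that scores a term in each document of a toy corpus in two ways. It is illustrative only and rests on assumptions not stated in the abstract: the tf-idf variant is taken to be the raw term count times log(number of documents / document frequency), and the hypergeometric test treats a document as a sample of tokens drawn without replacement from the whole collection, scored by the negative log of the upper-tail p-value. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def hypergeom_pvalue(k, population, successes, draws):
    """Upper-tail probability P(X >= k) for a hypergeometric variable X."""
    upper = min(successes, draws)
    # Sum exact integer counts first, then divide once to limit rounding error.
    favourable = sum(
        math.comb(successes, i) * math.comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / math.comb(population, draws)


def score_term(term, docs):
    """Return a (tf-idf, -log p-value) pair for `term` in each document.

    Assumed tf-idf variant: raw count * log(num_docs / document_frequency).
    Assumed test setup: tokens of a document are draws from the collection.
    """
    counts = [Counter(doc) for doc in docs]
    total_tokens = sum(len(doc) for doc in docs)       # population size
    term_tokens = sum(c[term] for c in counts)         # successes in population
    doc_freq = sum(1 for c in counts if c[term] > 0)   # documents containing term
    scores = []
    for doc, c in zip(docs, counts):
        k = c[term]
        tfidf = k * math.log(len(docs) / doc_freq) if doc_freq else 0.0
        p = hypergeom_pvalue(k, total_tokens, term_tokens, len(doc))
        scores.append((tfidf, -math.log(p)))
    return scores


# Hypothetical toy corpus, tokenised by whitespace.
docs = [
    "cat cat cat dog".split(),
    "dog bird fish fish".split(),
    "bird fish dog tree".split(),
]
for i, (tfidf, neg_log_p) in enumerate(score_term("cat", docs)):
    print(f"doc {i}: tf-idf={tfidf:.3f}  -log(p)={neg_log_p:.3f}")
```

On this toy input both scores single out the first document for the term "cat", which is the kind of agreement the abstract describes. The sketch does not reproduce the paper's experiments or its mathematical argument.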