Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches

02/08/2015
by Seung-Hoon Na, et al.
Term frequency normalization is a serious issue because document lengths vary widely. Documents generally become long for two different reasons: verbosity and multi-topicality. Verbosity means that the same topic is mentioned repeatedly using terms related to that topic, so term frequencies are higher than they would be in a well-summarized document. Multi-topicality means that a document discusses multiple topics broadly rather than a single topic. Although these two characteristics should be handled differently, all previous term frequency normalization methods have ignored the difference and used a simplified length-driven approach that reduces term frequency based only on document length, which penalizes documents unreasonably. To address this problem, we propose a novel TF normalization method that takes a partially axiomatic approach. We first formulate two formal constraints that a retrieval model should satisfy for verbose documents and for multi-topical documents, respectively. We then modify language modeling approaches to better satisfy these two constraints and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries and substantially improves MAP (Mean Average Precision) for verbose queries.
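
The length-driven penalization described in the abstract can be illustrated with a minimal sketch. The code below uses standard Dirichlet-prior smoothing for query likelihood, which is a common language modeling baseline and not the smoothing method the authors derive. The toy documents, the term names, the collection probability, and the value of `mu` are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_prob, mu=100.0):
    """Log query likelihood under standard Dirichlet-prior smoothing.

    p(w | d) = (tf(w, d) + mu * p(w | C)) / (|d| + mu)
    The only document-side normalizer is the length |d|.
    """
    tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for w in query:
        p = (tf[w] + mu * collection_prob[w]) / (doc_len + mu)
        score += math.log(p)
    return score


# Hypothetical collection statistics for the single query term.
collection_prob = {"jaguar": 0.001}
query = ["jaguar"]

# A focused document: 100 tokens, 5 of them the query term.
focused = ["jaguar"] * 5 + ["filler"] * 95

# A verbose document: the same content stated twice.
verbose = focused * 2

# A multi-topical document: the same focused passage plus
# 100 tokens about an unrelated topic.
multi_topical = focused + ["other"] * 100

score_focused = dirichlet_score(query, focused, collection_prob)
score_verbose = dirichlet_score(query, verbose, collection_prob)
score_multi = dirichlet_score(query, multi_topical, collection_prob)

for name, s in [("focused", score_focused),
                ("verbose", score_verbose),
                ("multi-topical", score_multi)]:
    print(f"{name:>14}: {s:.4f}")
```

In this toy setting the multi-topical document scores lower than the focused one even though it contains exactly the same passage about the query topic, while the verbose document scores higher merely because it repeats itself. Both effects follow from normalizing by document length alone, which is the behavior the abstract argues against.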
