1 Introduction
High-performing retrieval models rely on two different factors: TF (term frequency) and IDF (inverse document frequency). Among them, the TF factor is nontrivial, since long documents tend to have higher term frequencies than short ones, so a naive estimation of term frequency is unlikely to succeed. Thus, the term frequency of long documents deserves careful treatment. Regarding this, Singhal observed the following two reasons why a document may become long [1].¹

¹ Robertson and Walker referred to these two reasons as the scope hypothesis and the verbosity hypothesis, respectively [2].
High term frequency: The same term occurs repeatedly in a long document. As a result, the term frequency factors may be large for long documents, increasing the average contribution of their terms towards the query-document similarity.

More terms: A long document has a large vocabulary. This increases the number of matches between a query and a long document, increasing the query-document similarity and the chances of long documents being retrieved in preference over shorter ones.
Without loss of meaning, we can conceptualize these two reasons as verbosity and multi-topicality. First, verbosity means that the same topic is repeatedly mentioned using terms related to that topic, making term frequencies high. Second, multi-topicality indicates that a document broadly discusses multiple topics rather than a single topic, producing more distinct terms. Using these concepts, we divide long documents into two ideal types: verbose documents and multi-topical documents. A verbose document is one that becomes long mainly due to verbosity rather than multi-topicality, while a multi-topical document is one that follows the typical characteristics of multi-topicality rather than verbosity.
Singhal presupposed that long documents should be penalized regardless of whether their length stems from verbosity or multi-topicality [1]. Basically, their approach belongs to a simplified length-driven method which decreases the term frequency of all long documents according to the document length factor only. However, we argue that this presupposition fails: the penalization should be applied only to verbose documents, not to multi-topical ones. The main reason is that terms in a multi-topical document are repeated less often than those in a verbose document, since the length of the multi-topical document is increased by its broad range of topics. Singhal missed the point that these two types of documents should be handled differently. Therefore, a retrieval function adopting Singhal's penalization makes multi-topical documents unreasonably less preferred, causing an unfair retrieval ranking.
To support our argument regarding verbose and multi-topical documents, we exemplify two situations that illustrate the different tendencies of term frequencies in each type. First, let us examine the situation given by two document samples d1 and d2 which have the same term frequency ratio.
d1: Language modeling approach
d2: Language modeling approach Language modeling approach
Here, d2 is the twofold concatenation of d1. Suppose that the query "language modeling approach" is given. Then a question arises: which of d1 and d2 is more relevant? By comparing the contained information, we see that the two documents have exactly the same content, although d2 is twice as long as d1. Thus, d1 and d2 should have the same relevance score. However, the absolute term frequency in d2 is twice that in d1, so naive TF·IDF prefers d2 to d1. To avoid this unfair comparison, we should introduce a TF normalization. To this end, let |d| be the length of document d, and c(w,d) the term frequency of a query term w in d. One reasonable TF normalization strategy is to use c(w,d)/|d| instead of c(w,d); the modified TF·IDF then produces the same score for d1 and d2. Note that Singhal's pivoted length normalization also works well here, since |d| is well reflected in Singhal's original formula. Remark that d2 is a verbose document, not a multi-topical one, which is the main reason for the success of this normalization. Now, we examine the second situation by considering a multi-topical document sample d3, which contains all the content of d1 as a sub-part.
d3: Information retrieval model Language modeling approach
Here, d3 describes a broad topic, "information retrieval model", and contains "language modeling approach" as a sub-topic. Again, suppose that the same query "language modeling approach" is given, and consider what relevance score d3 should be assigned compared with d1 and d2. d3 contains all the content of d1 and d2, although d3 differs from them. In this case, if a user sees d3, he or she would judge d3 relevant as well, because all the relevant content, namely d1, is embedded in d3. From this viewpoint, d3 should receive the same score as d1 and d2 (due to partial relevance). However, if we apply the previous TF normalization (i.e., c(w,d)/|d|) to d3, then d3 is much less preferred than d1 and d2, since its term frequency for each query term is the same as in d1 but its length is twice that of d1. Of course, Singhal's method will also assign a lower score to d3 than to d1 and d2. The main reason for this failure is that d3 is not a verbose document but a multi-topical one. This result shows that the TF normalization problem is more complex, requiring different strategies according to the type of long document. To avoid unreasonably penalizing multi-topical documents, the TF normalization problem should be re-investigated more deeply by discriminating multi-topical documents from verbose ones.
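The two situations above can be checked with a small calculation. The sketch below is illustrative only: tokenization is simplified and a bare TF sum stands in for a full TF·IDF score. It shows that the length normalization c(w,d)/|d| equates d1 and d2 but penalizes d3:

```python
# Toy check of the two situations: naive TF vs. length-normalized TF.
d1 = "language modeling approach".split()
d2 = ("language modeling approach " * 2).split()
d3 = "information retrieval model language modeling approach".split()
query = "language modeling approach".split()

def naive_tf(doc, q):
    # sum of raw term frequencies c(w, d) over query terms
    return sum(doc.count(w) for w in q)

def norm_tf(doc, q):
    # sum of length-normalized frequencies c(w, d) / |d|
    return sum(doc.count(w) / len(doc) for w in q)

print(naive_tf(d1, query), naive_tf(d2, query), naive_tf(d3, query))  # 3 6 3
print(norm_tf(d1, query), norm_tf(d2, query), norm_tf(d3, query))     # 1.0 1.0 0.5
```

Naive TF prefers the verbose copy d2; the normalized form treats d1 and d2 equally, as desired, but halves the score of the multi-topical d3, which is exactly the failure discussed above.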
To obtain a more accurate TF normalization, we propose a novel TF normalization method that takes an axiomatic approach. We modify the language modeling approach as a case study, without loss of its elegance and principles. To this end, we first formulate two constraints that retrieval scoring functions should satisfy for verbose and multi-topical documents, respectively. Then, we present an analysis showing that previous language modeling approaches do not sufficiently satisfy these constraints. After that, we modify the language modeling approaches so that they better satisfy the two constraints, derive novel smoothing methods, and evaluate the proposed ones.
2 Formal Constraints of New TF Normalization, and Analysis of Previous Language Modeling Approaches
2.1 Constraints
From now on, we assume that topic(d) is a measurement for calculating the number of topics in document d. We define K-verbosity and N-topicality as follows.
Definition (K-verbosity): Suppose that d1 and d2 are given. Let c(w,d1) and c(w,d2) be the term frequencies of term w in d1 and d2, respectively. If, for all terms w, c(w,d2) = K·c(w,d1) and topic(d2) = topic(d1), then d2 has K-verbosity to d1, or d2 is K-verbose to d1.
Definition (N-topicality): Suppose that d1 and d2 are given with topic(d2) = N·topic(d1). Let |d1| and |d2| be the lengths of d1 and d2, respectively, with |d2| = N·|d1|. If, for all terms w in d1, c(w,d2) = c(w,d1), then d2 has N-topicality to d1, or d2 is N-topical to d1.
In our three samples from the introduction, d2 has 2-verbosity to d1, and d3 has 2-topicality to d1. Recall our argument that d1, d2 and d3 should have the same relevance score. This argument can be reformulated as the following two constraints, VNC and TNC, which the retrieval function should satisfy when one document has K-verbosity or N-topicality to another. Let S(d,q) be a similarity function between a document d and a query q.
VNC (Verbosity Normalization Constraint): Suppose a pair of documents d1 and d2. If d2 is K-verbose to d1, then S(d1,q) = S(d2,q).
TNC (Topicality Normalization Constraint): Suppose a pair of documents d1 and d2. If d2 is N-topical to d1, then S(d1,q) = S(d2,q).
These constraints can be directly utilized to derive a new class of retrieval functions, as in Fang's exploration [3]. Originally, Fang formulated two constraints related to term frequency, LNC1 and LNC2 [3]. Among them, LNC2 is highly relevant to VNC, where VNC is the more specific constraint: VNC entails LNC2, but not vice versa. TNC is a new constraint which is not connected to any of Fang's constraints. Note that our exploration of a retrieval function differs from Fang's. We focus only on the few constraints related to our issue, without identifying all constraints. We then select one previously well-performing retrieval model as the backbone, and modify it to better satisfy the focused constraints, without losing the elegance and principles of the original model. In this regard, our exploration method belongs to the partially-axiomatic approach: 1) it uses partial constraints rather than a full set of constraints; 2) it uses the restricted functional space that the backbone retrieval model allows, rather than relying on the full functional space. In contrast, Fang's approach is a fully-axiomatic approach [3, 4]. In Fang's approach, the full set of constraints is completely identified, beyond the focused ones, and a new class of retrieval functions is explored in a separate functional space not tied to previous retrieval models. However, a fully-axiomatic approach such as Fang's requires unprincipled heuristics that are not derived from a well-designed retrieval model. A partially-axiomatic approach does not need to discard a well-founded retrieval model such as the language modeling approach, enabling us to pursue a more elaborate retrieval model without losing its mathematical elegance and principles.
2.2 Analysis of Language Modeling Approaches
We selected the language modeling approaches as the backbone retrieval model [5]. Our goal is to modify them so that they better satisfy the two proposed constraints, VNC and TNC. We investigate two popular smoothing methods: Jelinek-Mercer smoothing (JM) and Dirichlet-prior smoothing (Dir) [6]. Before modifying them, we begin by discussing in this subsection whether each smoothing method satisfies VNC and TNC. The notations used in this paper are summarized as follows:
q: a given query
c(w,d): term frequency of term w in document d
|d|: length of document d
c(w,C): term frequency of w in collection C
|C|: total term frequency of collection C
p(w|d): smoothed document language model of d
p_ml(w|d): unsmoothed document language model of d (MLE)
p(w|C): collection language model (MLE)
2.2.1 Analysis of Jelinek-Mercer Smoothing
In JM (Jelinek-Mercer smoothing), a smoothed document model is obtained by interpolating the MLE (maximum likelihood estimate) of the document model with the collection model as follows [6]:

p(w|d) = (1 - λ) p_ml(w|d) + λ p(w|C)    (1)

where λ is a smoothing parameter. By using JM, S(d,q), the similarity score of document d for query q, can be written using only the query-matching terms as follows:

S(d,q) = Σ_{w ∈ q∩d} c(w,q) log( 1 + ((1 - λ) p_ml(w|d)) / (λ p(w|C)) )    (2)
Our analysis of whether or not JM satisfies VNC and TNC is given as follows:

JM satisfies VNC: Suppose that d2 is K-verbose to d1. Then the MLEs of the two document models are the same (p_ml(w|d2) = K·c(w,d1)/(K·|d1|) = p_ml(w|d1)), resulting in the same scores.

JM does not satisfy TNC: Generally, JM prefers normal documents to multi-topical documents, regardless of our definition of the topicality measurement topic(d). The proof is skipped.
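As an illustration, the following sketch implements Eq. (2) over the toy documents d1, d2 and d3 from the introduction. The collection model used here is a made-up uniform distribution, assumed purely for the example. It exhibits the VNC behaviour (d1 and d2 tie) and the TNC violation (d3 scores lower):

```python
import math

# Toy documents and query from the introduction.
d1 = "language modeling approach".split()
d2 = d1 * 2
d3 = "information retrieval model language modeling approach".split()
query = "language modeling approach".split()

# Assumed collection model p(w|C): uniform over the toy vocabulary.
vocab = set(d1) | set(d3)
p_C = {w: 1.0 / len(vocab) for w in vocab}

def jm_score(doc, q, lam=0.5):
    # Eq. (2): sum over query-matching terms of
    # c(w,q) * log(1 + (1 - lam) * p_ml(w|d) / (lam * p(w|C)))
    score = 0.0
    for w in set(q):
        if w in doc:
            p_ml = doc.count(w) / len(doc)
            score += q.count(w) * math.log(1 + (1 - lam) * p_ml / (lam * p_C[w]))
    return score

print(jm_score(d1, query), jm_score(d2, query), jm_score(d3, query))
```

Under JM the verbose copy d2 ties with d1 (VNC holds), while d3 is penalized relative to d1 even though it contains all the relevant content — the TNC violation discussed above.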
2.2.2 Analysis of Dirichlet-Prior Smoothing
In Dir (Dirichlet-prior smoothing), a smoothed document model is estimated as the posterior model when taking p(w|C) as the prior probability of term w, as follows [6]:

p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ)    (3)

The equation can be rewritten as

p(w|d) = (|d| / (|d| + μ)) p_ml(w|d) + (μ / (|d| + μ)) p(w|C)    (4)

If we set λ_d = μ/(|d| + μ), then Dir is equivalent to JM-style smoothing using the document-specific smoothing parameter λ_d. S(d,q) based on Dir is formulated as follows:

S(d,q) = Σ_{w ∈ q∩d} c(w,q) log( 1 + c(w,d) / (μ p(w|C)) ) + |q| log( μ / (|d| + μ) )
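The equivalence between Eq. (3) and the JM-style interpolation of Eq. (4) can be verified numerically; the counts below are toy values assumed for illustration:

```python
# Check that Dirichlet smoothing (Eq. 3) equals JM-style interpolation (Eq. 4)
# with the document-specific parameter lambda_d = mu / (|d| + mu).
mu = 2000.0
doc_len = 150          # |d|, assumed toy value
c_wd = 3               # c(w, d), assumed toy value
p_wC = 0.001           # p(w|C), assumed toy value

dir_form = (c_wd + mu * p_wC) / (doc_len + mu)            # Eq. (3)

lam_d = mu / (doc_len + mu)
p_ml = c_wd / doc_len
jm_form = (1 - lam_d) * p_ml + lam_d * p_wC               # Eq. (4)

assert abs(dir_form - jm_form) < 1e-12
print(dir_form)
```

Because λ_d grows as |d| shrinks, the amount of smoothing varies with document length, which is the root of Dir's VNC violation analyzed next.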
The analysis of whether Dir satisfies VNC and TNC is somewhat complicated, due to its document-specific smoothing parameter. We can show that Dir satisfies neither VNC nor TNC. The following summarizes the analysis.

Dir doesn’t satisfy VNC: Generally, Dir makes inconsistent preferences according to whether or not a query term is topical. For a topical query term, Dir assigns a higher score to verbose documents than to normal documents. For a non-topical query term, Dir assigns a lower score to verbose documents than to normal documents. The detailed proof is skipped.

Dir doesn’t satisfy TNC: The detailed proof is skipped.
3 Modification of Previous Retrieval Models
In the previous section, we showed that the two smoothing methods do not satisfy the two constraints well. In this section, we introduce a measurement of the number of topics, and modify the previous retrieval models so that they better satisfy VNC and TNC.
3.1 Measurement of The Number of Topics
To determine which measurement is suitable for calculating the number of topics in document d, we propose two simple candidates for topic(d): the first is vocabulary size, and the second is information quantity.
Vocabulary Size: Generally, the more distinct terms a document contains, the more topics it has. Based on this idea, we can use the vocabulary size, the number of unique terms in a given document, as a measurement of the number of topics.
Information Quantity: Even though the vocabulary size is simple and reasonable, it cannot discriminate the mainly topical terms from casually occurring ones. When using the vocabulary size, the number of topics may be unreasonably inflated by casually occurring terms. As an alternative measurement, we consider an entropy-driven value. Recall that entropy measures the uncertainty of a generated sample. Entropy has the following positive properties for resolving the limitation of the vocabulary size. 1) As the number of possible events increases, the entropy becomes larger. Here, events correspond to terms, so the more terms there are, the larger the entropy is likely to be. Thus, when a document has more topics, its content can be described in more varied ways, resulting in a larger entropy value. 2) The term generative probability of a document is used as the weight when calculating the entropy value. The larger a term's probability, the more it contributes to the final entropy value. This property allows us to differentiate the effects of mainly topical terms and casually occurring terms.
The information quantity topic(d) is defined as an exponential function of the entropy of a document as follows:

topic(d) = exp( - Σ_w p_ml(w|d) log p_ml(w|d) )
Some Useful Definitions: We define some useful notations. Let ntopic(d) = topic(d)/avg(topic) be the normalized measurement of the number of topics, where avg(topic) is the mean of topic(d) over all documents in a given test collection, and define the informative verbosity as verb(d) = |d|/topic(d). Note that the informative verbosity indicates the average term frequency per unit of information.
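A minimal sketch of these measurements, using the exp-entropy (perplexity) of the MLE document model as topic(d):

```python
import math
from collections import Counter

def topic(doc):
    # topic(d) = exp(-sum_w p_ml(w|d) * log p_ml(w|d)), the exp-entropy
    # of the unsmoothed document model.
    counts = Counter(doc)
    n = len(doc)
    return math.exp(-sum((c / n) * math.log(c / n) for c in counts.values()))

d1 = "language modeling approach".split()
d2 = d1 * 2                                                            # 2-verbose to d1
d3 = "information retrieval model language modeling approach".split()  # 2-topical to d1

print(topic(d1), topic(d2), topic(d3))        # 3.0, 3.0, 6.0 (up to rounding)

docs = [d1, d2, d3]
avg_topic = sum(topic(d) for d in docs) / len(docs)
ntopic = {i: topic(d) / avg_topic for i, d in enumerate(docs)}   # normalized topic count
verb = {i: len(d) / topic(d) for i, d in enumerate(docs)}        # informative verbosity
print(verb)   # d1: 1.0, d2: 2.0, d3: 1.0
```

The verbose copy d2 doubles verb(d) while keeping topic(d) fixed, whereas the multi-topical d3 doubles topic(d) while keeping verb(d) fixed, matching the K-verbosity and N-topicality definitions above.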
3.2 Modification of JM
3.2.1 First Modification of JM
Since JM already satisfies VNC, we modify JM to additionally support TNC. The core idea of the modification of JM smoothing is a pseudo document. The pseudo document consists mainly of the parts relevant to a query, as if constructed by separating the relevant parts from the non-relevant ones. Then, the score of a document is calculated using the pseudo document model instead of the original document model.
Thus, the pseudo document leads us to a dynamic viewpoint of document representation, where a document changes dynamically according to the query. Note that a pseudo document is an imaginary concept which is never actually constructed at retrieval time; all we require are the generative probabilities of query terms under the pseudo document model.
To estimate the probabilities of query terms in a pseudo document, we simplify the estimation problem by using the probabilities in the original document. In other words, for terms with non-zero probability in the pseudo document, their probabilities are assumed to be proportional to the corresponding probabilities in the original document. As a result, the estimation is complete once we determine the length of the pseudo document from the original length |d|.
Intuitively, the more topics a document has, the smaller its query-specific pseudo document should be. This intuition makes the length of the pseudo document inversely proportional to topic(d), i.e., |d̃| = α′·|d|/topic(d) for a constant α′. Thus, if p(w|d̃) is the language model of the pseudo document d̃, then

p(w|d̃) = c(w,d) / |d̃| = topic(d)·c(w,d) / (α′·|d|)

This can be rewritten using ntopic(d) instead of topic(d) and a constant α as

p(w|d̃) = (ntopic(d)/α) · p_ml(w|d)

If we assume that the constant α is independent of any document and query, then α is not a tuning parameter, since it can be absorbed into the smoothing parameter λ.
Let us derive a modified JM by substituting this pseudo document model for the original document model in Eq. (2). Then, S(d,q) is reformulated as follows:

S(d,q) = Σ_{w ∈ q∩d} c(w,q) log( 1 + ((1 - λ̃)·ntopic(d)·p_ml(w|d)) / (α·λ̃·p(w|C)) )    (5)

where λ̃ is another smoothing parameter for the pseudo document model. Since α is independent of any document and query, we can select λ̃ such that (1 - λ̃)/(α·λ̃) equals (1 - λ)/λ, in order to eliminate the constant α. Then, Eq. (5) is rewritten as

S(d,q) = Σ_{w ∈ q∩d} c(w,q) log( 1 + ((1 - λ)·ntopic(d)·p_ml(w|d)) / (λ·p(w|C)) )    (6)

Using the MLE of the original document model, p_ml(w|d) = c(w,d)/|d|, Eq. (6) is rewritten as

S(d,q) = Σ_{w ∈ q∩d} c(w,q) log( 1 + ((1 - λ)·ntopic(d)·c(w,d)) / (λ·|d|·p(w|C)) )    (7)
Eq. (7) is the final modified JM, which we call JMV. JMV satisfies both VNC and TNC.

JMV satisfies VNC: Let d2 be K-verbose to d1. Then p_ml(w|d2) = p_ml(w|d1) and ntopic(d2) = ntopic(d1). Thus, S(d2,q) = S(d1,q).

JMV satisfies TNC: Let d2 be N-topical to d1. Then p_ml(w|d2) = p_ml(w|d1)/N and ntopic(d2) = N·ntopic(d1). This makes ntopic(d2)·p_ml(w|d2) = ntopic(d1)·p_ml(w|d1). Therefore, S(d2,q) = S(d1,q).
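A sketch of Eq. (7) over the toy documents, using the exp-entropy topic measurement; the collection model is an assumed uniform toy distribution:

```python
import math
from collections import Counter

def topic(doc):
    # exp-entropy of the MLE document model, used as topic(d)
    n, counts = len(doc), Counter(doc)
    return math.exp(-sum((c / n) * math.log(c / n) for c in counts.values()))

d1 = "language modeling approach".split()
d2 = d1 * 2
d3 = "information retrieval model language modeling approach".split()
docs = [d1, d2, d3]
query = "language modeling approach".split()

avg_topic = sum(topic(d) for d in docs) / len(docs)
vocab = set().union(*docs)
p_C = {w: 1.0 / len(vocab) for w in vocab}   # assumed uniform collection model

def jmv_score(doc, q, lam=0.5):
    # Eq. (7): JMV multiplies the normalized topic count into the TF part.
    ntopic = topic(doc) / avg_topic
    score = 0.0
    for w in set(q):
        if w in doc:
            score += q.count(w) * math.log(
                1 + (1 - lam) * ntopic * doc.count(w) / (lam * len(doc) * p_C[w]))
    return score

s1, s2, s3 = (jmv_score(d, query) for d in docs)
print(s1, s2, s3)   # all three scores coincide
```

On this toy collection d1, d2 and d3 receive identical scores, consistent with JMV satisfying both VNC and TNC.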
3.2.2 Second Modification of JM
In our preliminary experiments, we found that JMV performs well for keyword queries (i.e., title queries) but is not reliable for verbose queries (i.e., description queries), showing serious sensitivity to the smoothing parameter λ. To explain this result, we focus on the main differences between keyword and verbose queries. First, verbose queries contain common terms. Unlike topical terms, common terms can be shared by all topics; a common term always acts verbosely, in both verbose and multi-topical documents. Thus, the previous TF normalization would prefer multi-topical documents for queries including common terms. Second, verbose queries often contain noise terms such as “relevant”, “find” and “documents”. When a document has more topics, the chance that such noise terms appear increases. Moreover, once our previous TF normalization is applied, noise terms become especially harmful, because the number of topics is further multiplied into the normalized term frequency. Thus, the previous TF normalization would increase the scores of multi-topical documents for noisy queries. These two differences may be the reason why Singhal et al. penalized multi-topical documents as well as verbose ones [1]. However, as discussed, their approach is not acceptable for topical terms.
To handle the problems of verbose-type queries, our TF normalization should be restricted to document-specific terms only, excluding noise terms and common terms. The more topical a query term is in a given document, the more TF normalization we want to perform, and vice versa. To this end, we define the term specificity s(w,d) of term w in document d; this paper uses a probabilistic metric for s(w,d), involving an additional smoothing parameter with a default value of 0.25. Using the term specificity, we modify the pseudo document model again as follows:

p(w|d̃) = (ntopic(d)^{s(w,d)} / α) · p_ml(w|d)    (8)

Since s(w,d) lies between 0 and 1, the normalization is fully applied when s(w,d) is 1, and weakened as s(w,d) approaches 0. One problem arises when ntopic(d) is smaller than 1: in that case, a larger s(w,d) weakens rather than strengthens the intended effect of normalization. To resolve this, we considered an exceptional TF normalization that keeps the normalization proportional to s(w,d) even when ntopic(d) is smaller than 1. In preliminary experiments, however, we found that the final retrieval performance was almost unchanged after applying the exceptional TF normalization. We therefore select Eq. (8) as the second modification, which we call JMV2.
4 Modification of Dir
Our goal for the Dir modification is to provide VNC. We again introduce a pseudo document model to modify Dir. Unlike the pseudo document for the JM modification, which consists of query-relevant parts only, the pseudo document for the Dir modification consists of all topics in the original document but has a different length from the original. Note that changing the length alone yields a different smoothed model, since the smoothed model p(w|d) depends on the document length; in fact, this length dependence is the main reason why Dir does not satisfy VNC.
We assume that the pseudo document model is proportional to the original MLE document model. In addition, we set the length of the pseudo document to topic(d). Recall that the informative verbosity verb(d) is defined as |d|/topic(d). That is, the pseudo document of length topic(d) compacts the original document of length |d| by verb(d) times. Therefore, each term w of document d has the following term frequency in the pseudo document:

c̃(w,d) = c(w,d) / verb(d)    (9)

As a result, the pseudo document model becomes length-independent, even though the MLE of the pseudo document model is the same as that of the original document. Using the pseudo document, Dir produces the following smoothed model:

p(w|d̃) = (c̃(w,d) + μ p(w|C)) / (topic(d) + μ)    (10)

Substituting Eq. (9) into Eq. (10), Eq. (10) becomes

p(w|d̃) = (c(w,d) + verb(d)·μ·p(w|C)) / (|d| + verb(d)·μ)    (11)

This final modified model can be viewed as JM-style smoothing using the document-specific smoothing parameter λ_d = μ/(topic(d) + μ), which no longer depends on the length. We call this modification DirV. We can easily prove that DirV additionally satisfies VNC.

DirV satisfies VNC: Let d2 be K-verbose to d1. Then the two MLE models are equal (i.e., p_ml(w|d1) = p_ml(w|d2)), and λ_d is the same since topic(d1) and topic(d2) are equal. Thus, DirV gives the same score to d1 and d2.

DirV does not satisfy TNC: For DirV, we do not have a special consideration for supporting TNC.
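A quick numerical check of Eq. (11) on the toy documents (with an assumed toy collection probability and μ) shows the smoothed DirV model of a K-verbose copy matching the original, unlike plain Dir:

```python
import math
from collections import Counter

def topic(doc):
    # exp-entropy of the MLE document model, used as topic(d)
    n, counts = len(doc), Counter(doc)
    return math.exp(-sum((c / n) * math.log(c / n) for c in counts.values()))

d1 = "language modeling approach".split()
d2 = d1 * 2                                # 2-verbose to d1
p_wC, mu, w = 1.0 / 6, 100.0, "language"   # assumed toy collection statistics

def dir_model(doc):
    # plain Dirichlet smoothing, Eq. (3)
    return (doc.count(w) + mu * p_wC) / (len(doc) + mu)

def dirv_model(doc):
    # DirV, Eq. (11): verb(d) = |d| / topic(d) rescales the pseudo counts
    verb = len(doc) / topic(doc)
    return (doc.count(w) + verb * mu * p_wC) / (len(doc) + verb * mu)

print(dir_model(d1), dir_model(d2))    # differ: Dir violates VNC
print(dirv_model(d1), dirv_model(d2))  # equal: DirV satisfies VNC
```

Since the K-verbose copy multiplies both c(w,d) and |d| by K while verb(d) also grows by K, numerator and denominator of Eq. (11) scale together and the smoothed probability is unchanged.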
5 Experimentation
5.1 Experimental Setting
For evaluation, we used five TREC test collections. A standard method was applied to extract index terms: we first tokenized on whitespace, removed stopwords, and then applied Porter stemming. Table 1 summarizes the basic information of each test collection. The columns #Q, Topics, #R, #Doc, avglen, and #Term are the number of query topics, the corresponding topic IDs, the number of relevant documents, the number of documents, the average document length, and the number of terms, respectively.
Collection  #Q  Topics   #R     #Doc       avglen  #Term
TREC7       50  350-400  4,674  528,155    154.6   970,977
TREC8       50  401-450  4,728
WT2G        50  401-450  2,279  247,491    254.99  2,585,383
TREC9       50  451-500  2,617  1,692,096  165.16  13,018,003
TREC10      50  501-550  3,363
Following Zhai’s work [6], we used the following three types of queries:
1) Short keyword (SK): Using only the title of the topic description.
2) Short Verbose (SV): Using only the description field (usually one sentence).
3) Long Verbose (LV): Using the title, description and the narrative field (more than 50 words on average).
As for retrieval evaluation, we used MAP (Mean Average Precision), Pr@5 (Precision at 5 documents), and Pr@10 (Precision at 10 documents).
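For reference, the evaluation measures can be sketched as follows; the definitions are standard, and the ranked list and judgments below are hypothetical:

```python
def precision_at_k(ranked, relevant, k):
    # Pr@k: fraction of the top-k retrieved documents that are relevant
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    # mean of the precision values at the rank of each relevant document;
    # MAP is this value averaged over all query topics
    hits, total = 0, 0.0
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d3", "d1", "d7", "d2"]     # hypothetical ranking for one topic
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(average_precision(ranked, relevant))   # (1/2 + 2/4) / 2 = 0.5
```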
5.2 Experimental Results
Table 2 shows the best performance (MAP, Pr@5, Pr@10) of DirV and JMV2, compared with Dir. As the topic measurement topic(d), we selected the information quantity, since JMV2 and DirV using the information quantity outperform those using the vocabulary size. We used the MLE document model, without any smoothing, to calculate the information quantity. We selected Dir as the baseline due to its superiority over JM on all test collections. To obtain the best performance of each run, we searched 20 different values between 0.01 and 0.99 for λ, and 22 values between 100 and 30,000 for μ. To check whether the proposed methods (DirV and JMV2) significantly improve the baseline, we performed the Wilcoxon signed-rank test at the 95% and 99% confidence levels, marking the performance number of a cell when the test passes at the 95% and 99% levels, respectively. The results are summarized as follows:

DirV significantly improves the MAP of Dir for verbose queries (SV and LV). The exception is TREC10, which did not show a significant improvement for verbose queries.

DirV does not significantly improve the MAP of Dir for keyword queries (SK), but it improves precision (Pr@5 or Pr@10). In particular, on TREC7 and TREC8, Pr@10 is significantly improved over Dir. Although the other test collections do not show a statistically significant improvement, the numerical increases are substantial.

DirV or JMV2 can show an improvement on a specific test collection even for keyword queries. For DirV, TREC10 is such a collection, showing a significant improvement in MAP; for JMV2, WT2G is such a collection.

Overall, DirV is slightly better than JMV2 on most test collections. WT2G is the exception, where JMV2 significantly improves over DirV.
MAP  Dir  DirV  JMV2  
SK  SV  LV  SK  SV  LV  SK  SV  LV  
TREC7  0.1786  0.1790  0.2209  0.1835  0.1967  0.2348  0.1825  0.1926  0.2250 
TREC8  0.2481  0.2294  0.2598  0.2492  0.2393  0.2621  0.2505  0.2354  0.2500 
WT2G  0.3101  0.2854  0.2863  0.3125  0.3103  0.3267  0.3278  0.3112  0.3263 
TREC9  0.2038  0.1990  0.2468  0.2040  0.2336  0.2581  0.2068  0.2245  0.2494 
TREC10  0.1950  0.1865  0.2347  0.2049  0.2248  0.2640  0.2091  0.2133  0.2555 
Pr@5  Dir  DirV  JMV2  
SK  SV  LV  SK  SV  LV  SK  SV  LV  
TREC7  0.4400  0.4280  0.5240  0.4560  0.4840  0.5680  0.4680  0.4920  0.5800 
TREC8  0.4920  0.4320  0.5120  0.5120  0.5040  0.5360  0.5240  0.4880  0.5280 
WT2G  0.5160  0.5120  0.5280  0.5360  0.5520  0.5720  0.5400  0.5560  0.5920 
TREC9  0.3000  0.3480  0.4160  0.3320  0.4240  0.4320  0.3440  0.3720  0.3880 
TREC10  0.3520  0.4040  0.4720  0.3840  0.4520  0.4920  0.3800  0.4200  0.4880 
Pr@10  Dir  DirV  JMV2  
SK  SV  LV  SK  SV  LV  SK  SV  LV  
TREC7  0.3980  0.4120  0.4420  0.4180  0.4420  0.4720  0.4100  0.4440  0.4800 
TREC8  0.4460  0.4120  0.4660  0.4740  0.4380  0.4780  0.4700  0.4400  0.4480 
WT2G  0.4660  0.4220  0.4240  0.4840  0.4840  0.4800  0.4920  0.4900  0.4820 
TREC9  0.2560  0.2860  0.3160  0.2780  0.3260  0.3540  0.2780  0.3160  0.3220 
TREC10  0.3060  0.3500  0.4040  0.3300  0.3820  0.4340  0.3300  0.3700  0.4340 
6 Conclusion
This paper introduced a new issue for TF normalization by considering two different types of long documents: verbose documents and multi-topical documents. We proposed a novel TF normalization method based on a partially-axiomatic approach. To this end, we formulated two desirable constraints that a retrieval function should satisfy, and showed that previous language modeling approaches do not satisfy them well. We then derived novel smoothing methods for language modeling approaches, without losing their basic principles, and showed that the proposed methods satisfy the constraints more effectively. Experimental results on five standard TREC collections show that the proposed methods outperform previous smoothing methods, especially for verbose queries. JMV2 significantly improved JM for all types of queries, and DirV eliminated the limitation of Dir by providing robust performance for verbose queries, as well as improving precision (Pr@5 or Pr@10) for keyword queries. This is comparable to recent results using more complicated query-specific smoothing based on a Poisson language model [7].
To handle long documents, passage-based retrieval could also be applied [8]. However, passage-based retrieval carries an efficiency burden, since it requires additional processing such as indexing position information, pre-segmenting individual passages, and, more importantly, additional overhead at online retrieval time. In contrast to such complicated methods, this paper handles multi-topical documents in a simple manner by investigating a more accurate TF normalization with no additional efficiency cost.
Acknowledgement. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Advanced Information Technology Research Center (AITrc), also in part by the BK 21 Project and MIC & IITA through IT Leading R&D Support Project in 2007.
References
 [1] Singhal, A., Buckley, C., Mitra, M.: Pivoted document length normalization. In: SIGIR ’96. (1996) 21–29
 [2] Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In: SIGIR ’94. (1994) 232–241
 [3] Fang, H., Tao, T., Zhai, C.: A formal study of information retrieval heuristics. In: SIGIR ’04. (2004) 49–56
 [4] Fang, H., Zhai, C.: An exploration of axiomatic approaches to information retrieval. In: SIGIR ’05. (2005)
 [5] Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: SIGIR ’98. (1998) 275–281
 [6] Zhai, C., Lafferty, J.: A study of smoothing methods for language models applied to ad hoc information retrieval. In: SIGIR ’01. (2001) 334–342
 [7] Mei, Q., Fang, H., Zhai, C.: A study of Poisson query generation model for information retrieval. In: SIGIR ’07. (2007) 319–326
 [8] Kaszkiel, M., Zobel, J.: Effective ranking with arbitrary passages. Journal of the American Society for Information Science and Technology (JASIST) 52(4) (2001) 344–364