1 Introduction
Pairwise similarity measurement of documents is a fundamental task in many text mining problems such as query-by-example, document classification and clustering.
In the bag-of-words (BoW) (Salton and McGill, 1986; Manning et al., 2008) vector space model, a document d is represented by an M-dimensional vector x_d, where M is the number of terms in a given dictionary T (i.e., M = |T|); and it has the following two representations:

Term-frequency-based representation: each component is f_{i,d} ∈ ℕ (ℕ is the set of non-negative integers), the occurrence frequency of term t_i in document d.

Binary representation: each component is in {0, 1}, where 0 represents the absence of term t_i in document d and 1 represents the presence of t_i in d.
Because the number of terms in a document is significantly smaller than the number of terms in the dictionary, every document is represented as a sparse BoW vector in which many entries are zero. Because of this sparsity, Euclidean distance is not a good similarity measure, and the angular distance, aka cosine distance, is the preferred choice of inter-document similarity measure (Salton and McGill, 1986; Salton and Buckley, 1988).
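For concreteness, the two representations can be sketched as follows (a toy corpus and dictionary; all names are illustrative, not from the paper):

```python
# A minimal sketch of the two BoW representations described above.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]
dictionary = sorted({t for d in docs for t in d.split()})
# dictionary = ['cat', 'dog', 'mat', 'on', 'sat', 'the']

def tf_vector(doc):
    """Term-frequency-based representation: counts per dictionary term."""
    counts = Counter(doc.split())
    return [counts[t] for t in dictionary]

def binary_vector(doc):
    """Binary representation: presence (1) or absence (0) of each term."""
    return [1 if c > 0 else 0 for c in tf_vector(doc)]

# tf_vector(docs[0]) -> [1, 0, 1, 1, 1, 2]  ('the' occurs twice)
```

Both vectors are mostly zero once the dictionary grows to realistic size, which is the sparsity the paragraph above refers to.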
Because not all terms in a document are equally important in representing its subject, different 'term weighting' schemes (Manning et al., 2008; Salton and Buckley, 1988) are used to adjust vector components according to the importance of their terms.
The idea of term weighting was first introduced in the field of Information Retrieval (IR), where the task is to measure the relevance of documents in a given collection to a given query phrase consisting of a few terms. It is based on the following two assumptions (Manning et al., 2008; Salton and Buckley, 1988; Zobel and Moffat, 1998):

A term is important in a document if it occurs multiple times in the document.

A rare term (one that occurs in only a few documents in the collection) is more important than a frequent term (one that occurs in many documents in the collection).
The importance of terms in a document is estimated independently of the query. Because a query in the IR task is short and each term generally occurs only once, it is not an issue that the weights are determined independently of the query.
However, it can be counterproductive in the query-by-example task, where the query itself is a document and terms often occur more than once in the query document. For example, given a query document q, a document d_1 having more occurrences of the terms in q may not be more similar to q than a document d_2 which has exactly the same occurrences of the terms in q.
Prior research in the BoW inter-document similarity measurement task was focused on developing effective term weighting schemes to improve the task-specific performances of existing measures such as cosine and Best Match 25 (BM25) (Salton and Buckley, 1988; Robertson et al., 1994; Joachims, 1997; Singhal, 1997; Robertson and Zaragoza, 2009; Paltoglou and Thelwall, 2010; Han et al., 2012; Wang and Zhang, 2013). In contrast, we investigate an alternative similarity measure in which an adjustment of vector components using term weighting is not required.
This paper makes the following contributions:

Identify the shortcomings of the underlying assumptions of term weighting schemes employed in existing measures; and provide an alternative which is more congruous with the requirements of inter-document similarity measurements.

Introduce a new simple but effective inter-document similarity measure which is based on the new assumption and does not require explicit term weighting. It uses a more nuanced probabilistic approach than those used in term weighting to measure the similarity of two documents w.r.t. each term occurring in both documents under measurement.

Compare the performance of the new measure with existing measures (which use different term weighting schemes) in the query-by-example task. Our results reveal that the new measure produces (i) better results than existing measures in the binary BoW representation; and (ii) competitive and more consistent results than existing measures in the term-frequency-based BoW representation.
The rest of the paper is organized as follows. Related work in the areas of term weighting and inter-document similarity measures is discussed in Section 2. Issues of term weighting in inter-document similarity measurements are discussed in Section 3. The proposed new inter-document similarity measure is presented in Section 4, followed by empirical results in Section 5, related discussion in Section 6, and the last section presents the conclusions.
The key notations used in this paper are defined in Table 1.
D : A collection of N documents (i.e., N = |D|)
x_d : BoW vector of a document d
t_i : The i-th term in the dictionary
f_{i,d} : The occurrence frequency of t_i in d
n_i : The number of documents in D having t_i
T_d : The set of terms in d
w_{i,d} : The importance or weight of t_i in d
tf(t_i, d) : Term frequency factor of t_i in d
idf(t_i) : Inverse document frequency factor of t_i
s(d_1, d_2) : The similarity of two documents d_1 and d_2
|d| : The length of document d (i.e., Σ_{t_i ∈ T_d} f_{i,d})
avgdl : The average length of documents in D
d_1 ≻_{t_i} d_2 : d_1 is more similar to q than d_2 w.r.t. t_i
d_1 ≈_{t_i} d_2 : d_1 is equally similar to q as d_2 w.r.t. t_i
2 Related work
In this section, we present the pertinent details of term weighting and some widely used existing BoW inter-document similarity measures.
2.1 Term weighting
In the field of IR, there has been considerable research investigating effective term weighting schemes. The importance of a term t_i in document d, w_{i,d}, is estimated using different variants and combinations of two factors (Manning et al., 2008; Salton and Buckley, 1988; Joachims, 1997; Robertson et al., 1994; Singhal, 1997; Robertson and Zaragoza, 2009; Paltoglou and Thelwall, 2010; Han et al., 2012; Wang and Zhang, 2013): (i) a document-based factor based on the frequency of t_i in d, f_{i,d}; and (ii) a collection-based factor based on the number of documents in which t_i occurs, n_i.
The most widely used term weighting scheme is tf-idf (term frequency × inverse document frequency), where w_{i,d} = tf(t_i, d) × idf(t_i) (Manning et al., 2008; Salton and Buckley, 1988); and it consists of:

Document-based factor: tf(t_i, d) = 1 + log f_{i,d} if f_{i,d} > 0, and 0 otherwise;

Collection-based factor: idf(t_i) = log(N / n_i).
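As a concrete sketch, the tf-idf weight under the two factors above can be computed as follows (the toy documents and all names are illustrative):

```python
# A sketch of tf-idf weighting: tf(t, d) = 1 + log f when f > 0
# (0 otherwise) and idf(t) = log(N / n_t).
import math

docs = [
    {"apple": 3, "pie": 1},   # documents as term -> frequency maps
    {"apple": 1, "tart": 2},
    {"pie": 4},
]
N = len(docs)
# n_t: number of documents containing term t
n = {}
for d in docs:
    for t in d:
        n[t] = n.get(t, 0) + 1

def tfidf(t, d):
    f = d.get(t, 0)
    tf = 1 + math.log(f) if f > 0 else 0.0
    idf = math.log(N / n[t])
    return tf * idf
```

Note that a term occurring in every document gets idf = log(N/N) = 0 and is effectively ignored, which is exactly the behavior examined in Section 3.2.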
In the IR task, the idea of tf-idf term weighting is based on the following assumptions (Zobel and Moffat, 1998):

Documents with multiple occurrences of query terms are more relevant than documents with a single occurrence of query terms [the tf assumption].

Documents having rare query terms (occurring in a few documents in the collection) are more relevant to the query than documents having frequent query terms (occurring in many documents in the collection) [the idf assumption].
The tf factor considers the importance of t_i in a document. Even though a document with multiple occurrences of a query term is more likely to be relevant to the given query, a document having more occurrences of one query term is not necessarily more relevant than a document having fewer occurrences of two query terms. Therefore, logarithmic scaling of the raw term frequencies is used to reduce the undue influence of high frequencies of query terms (Manning et al., 2008; Salton and Buckley, 1988).
The idf factor considers the importance of t_i in the given collection. Basically, it ranks the importance of terms in the given dictionary based on the number of documents in which they occur. Terms occurring in only a few documents (i.e., rare terms) are considered more important in documents, and they are given larger weights than terms occurring in many documents (i.e., frequent terms) (Manning et al., 2008; Salton and Buckley, 1988).
2.2 Inter-document similarity measures
Here, we discuss three commonly used measures to estimate the similarity of two document vectors x_{d_1} and x_{d_2} ∈ ℝ^M, where ℝ is the real domain.
2.2.1 Cosine similarity
The cosine similarity measure with the tf-idf term weighting is the most commonly used inter-document similarity measure. Using term-weighted vectors, the cosine similarity of two documents d_1 and d_2 is estimated as:
(1) s_cos(d_1, d_2) = [ Σ_{t_i ∈ T_{d_1} ∩ T_{d_2}} w_{i,d_1} · w_{i,d_2} ] / [ √(Σ_{t_i ∈ T_{d_1}} w_{i,d_1}²) · √(Σ_{t_i ∈ T_{d_2}} w_{i,d_2}²) ]
Note that the two terms in the denominator of Eqn 1 are the Euclidean lengths (norms) of the term-weighted vectors.
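Eqn 1 can be sketched as follows over term-weighted vectors represented as term-to-weight maps (illustrative names):

```python
# A sketch of cosine similarity over term-weighted vectors: the dot
# product of shared terms, normalized by the Euclidean lengths of the
# two weighted vectors.
import math

def cosine(w1, w2):
    dot = sum(w1[t] * w2[t] for t in w1 if t in w2)
    norm1 = math.sqrt(sum(v * v for v in w1.values()))
    norm2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (norm1 * norm2)

# Identical weighted vectors have self-similarity exactly 1, a fixed
# upper bound that Section 4.2 contrasts with Sp's data-dependent one.
```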
It is important to normalize the similarity of documents by their lengths; otherwise the measure favors longer documents, which have a higher probability of having more terms in common with the query document, over shorter documents (Salton and McGill, 1986; Manning et al., 2008; Salton and Buckley, 1988; Singhal et al., 1996).
2.2.2 Best Match 25 (BM25)
BM25 (Robertson and Zaragoza, 2009; Jones et al., 2000) is the state-of-the-art document ranking measure in IR. It is based on the probabilistic framework of term weighting by Robertson et al. (1994). Han et al. (2012) used BM25 to measure the similarity of two documents d_1 and d_2 as follows:
(2) s_BM25(d_1, d_2) = Σ_{t_i ∈ T_{d_1} ∩ T_{d_2}} idf(t_i) · [(k_1 + 1) f_{i,d_1}] / [k_1 (1 − b + b |d_1|/avgdl) + f_{i,d_1}] · [(k_1 + 1) f_{i,d_2}] / [k_1 (1 − b + b |d_2|/avgdl) + f_{i,d_2}]
where |d| is the normal length of document d (i.e., Σ_{t_i ∈ T_d} f_{i,d}, the length of the unweighted vector); avgdl is the average normal document length; k_1 and b are free parameters that control the influence of the term frequencies and document lengths; and idf(t_i) is the idf factor of term t_i, defined as follows:
(3) idf(t_i) = log[ (N − n_i + 0.5) / (n_i + 0.5) ]
It uses different variants of the tf and idf factors in the similarity measure. The pivoted normal document length (Singhal et al., 1996) is used in the tf factor so that longer documents, which have a higher probability of having more terms in common with the query document, are not favored over shorter documents.
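Under our reading of Eqn 2 and Eqn 3, a document-to-document BM25 can be sketched as follows (a sketch, not a verified reproduction of Han et al.'s implementation; the toy collection and all names are illustrative):

```python
# A sketch of pairwise BM25: each shared term contributes its BM25 idf
# times a saturated, length-normalized tf factor for each document.
import math

def bm25_sim(d1, d2, docs, k1=1.2, b=0.95):
    N = len(docs)
    avgdl = sum(sum(d.values()) for d in docs) / N
    def idf(t):
        n_t = sum(1 for d in docs if t in d)
        return math.log((N - n_t + 0.5) / (n_t + 0.5))
    def tf_part(f, dlen):
        # pivoted length normalization inside the saturated tf factor
        return ((k1 + 1) * f) / (k1 * (1 - b + b * dlen / avgdl) + f)
    len1, len2 = sum(d1.values()), sum(d2.values())
    return sum(idf(t) * tf_part(d1[t], len1) * tf_part(d2[t], len2)
               for t in d1 if t in d2)

# Toy collection: term "a" occurs in 3 of 4 documents, "b" in 1 of 4,
# so idf("a") is negative while idf("b") is positive (see Section 5.2).
toy = [{"a": 1}, {"a": 1}, {"a": 1}, {"b": 1}]
```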
2.2.3 Jaccard similarity
The Jaccard similarity (Jaccard, 1901) of two documents d_1 and d_2 is estimated as follows:
(4) s_Jac(d_1, d_2) = |T_{d_1} ∩ T_{d_2}| / |T_{d_1} ∪ T_{d_2}|
where T_d is the set of terms in document d and |·| is the cardinality of a set.
It only considers the number of terms occurring in both d_1 and d_2 and does not take into account the importance of terms in documents. The similarity is normalized by the number of distinct terms occurring in either d_1 or d_2, to account for the fact that d_1 and d_2 have a higher chance of having terms in common if they have more terms.
The weighted or generalized version of Jaccard similarity (Chierichetti et al., 2010) of two documents using term-weighted vectors is defined as follows:
(5) s_WJac(d_1, d_2) = [ Σ_{t_i ∈ T} min(w_{i,d_1}, w_{i,d_2}) ] / [ Σ_{t_i ∈ T} max(w_{i,d_1}, w_{i,d_2}) ]
The similarity of d_1 and d_2 w.r.t. t_i depends on the importance of t_i in the two documents. The similarity is normalized by the sum of the maximum weights of all terms occurring in either document.
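Eqn 4 and Eqn 5 can be sketched as follows, with documents as term-to-frequency maps and weighted vectors as term-to-weight maps (illustrative names):

```python
# Sketches of the traditional Jaccard similarity over term sets (Eqn 4)
# and the weighted/generalized Jaccard over weighted vectors (Eqn 5).
def jaccard(d1, d2):
    t1, t2 = set(d1), set(d2)
    return len(t1 & t2) / len(t1 | t2)

def weighted_jaccard(w1, w2):
    terms = set(w1) | set(w2)
    num = sum(min(w1.get(t, 0.0), w2.get(t, 0.0)) for t in terms)
    den = sum(max(w1.get(t, 0.0), w2.get(t, 0.0)) for t in terms)
    return num / den
```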
3 Issues of the tf-idf assumptions in inter-document similarity measurements
Even though the tf and idf assumptions discussed in Section 2.1 are intuitive in the IR task of ranking documents for a given query phrase of a few terms, they can be counterintuitive in the query-by-example task, which requires inter-document similarity measurements to rank documents in D w.r.t. a given query document.
In the literature, the query-by-example task is treated as the IR task where the query is a document, and the same idea of tf-idf term weighting is used. However, there is a fundamental difference between the two tasks — unlike in the typical IR task, where the query comprises a few distinct terms (i.e., each term generally occurs only once in the query phrase), the query in the query-by-example task is a long document which often has multiple occurrences of terms.
3.1 Issue of the tf assumption
For a query document q with term t_i ∈ T_q, a document d_1 having more occurrences of t_i than q may not be more similar to q than another document d_2 which has the same occurrences of t_i as q. For example, let's assume that d_1 and d_2 have frequencies of t_i such that f_{i,d_1} > f_{i,d_2} and f_{i,d_2} = f_{i,q}. It is difficult to say that d_1 is more similar to q than d_2 w.r.t. t_i simply because f_{i,d_1} > f_{i,d_2} (and thus w_{i,d_1} > w_{i,d_2}). It might be the case that d_2 is exactly the same document as q.
Because of the tf-based term weighting factor, d_1 can be more similar to q than q itself under some existing measures such as BM25 (depending on the lengths of the documents and the parameters k_1 and b). Thus, the tf assumption can be counterintuitive in inter-document similarity measurements.
3.2 Issue of the idf assumption
Similarly, a document d_1 having rare terms of q may not be more similar to q than a document d_2 having frequent terms of q. For example, let's assume the scenario presented in Table 2:
     f in q  f in d_1  f in d_2  f in d ∈ D \ {d_1, d_2}
t_a  10      0         10        1
t_b  1       1         1         0
tf-idf term weighting vs. Sp (the proposed approach)

(i) Underlying assumptions
tf-idf: tf: w_{i,d_1} > w_{i,d_2} if f_{i,d_1} > f_{i,d_2}; idf: w_i > w_j if n_i < n_j
Sp: d_1 and d_2 are more similar w.r.t. t_i if fewer documents in D have frequencies of t_i in [min(f_{i,d_1}, f_{i,d_2}), max(f_{i,d_1}, f_{i,d_2})]

(ii) Example discussed in Section 3.1 (f_{i,d_1} > f_{i,d_2} = f_{i,q})
tf-idf: d_1 ≻_{t_i} d_2 because w_{i,d_1} > w_{i,d_2} (even though f_{i,d_2} = f_{i,q})
Sp: d_2 ≻_{t_i} d_1 because fewer documents have frequencies of t_i in [f_{i,d_2}, f_{i,q}] than in [f_{i,q}, f_{i,d_1}]

(iii) Example discussed in Section 3.2 (Table 2)
tf-idf: d_1 and d_2 equally similar to q because (i) idf(t_a) ≈ 0; and (ii) w_{b,d_1} = w_{b,d_2} (even though f_{a,d_2} = f_{a,q} and f_{a,d_1} ≠ f_{a,q})
Sp: d_2 more similar to q than d_1 because (i) d_2 is the only document with as many occurrences of t_a as q; and (ii) d_1 ≈_{t_b} d_2
Because t_a occurs in every document in D except d_1, idf(t_a) = log(N/(N−1)) ≈ 0 and the term t_a will be effectively ignored. However, t_a is more useful than t_b because d_2 is the only document in D which has as many occurrences of t_a as q. Even though there is no discrimination between documents w.r.t. t_b (all documents with t_b have its frequency of 1), t_b is assigned more weight with idf than t_a. As a result, d_1 and d_2 become equally similar to q w.r.t. t_a and t_b, even though d_2 has exactly the same occurrences of t_a and t_b as q. This example shows that the idf assumption can be counterintuitive in document similarity measurements.
4 Our proposal to overcome the issues of tf-idf based term weighting in document similarity measurements
The main problem of the tf-idf term weighting in inter-document similarity measurements is that the importance of t_i in d, w_{i,d}, is estimated without considering the frequency of t_i in the query document q, f_{i,q}. This is not an issue in the IR task because f_{i,q} is almost always 1 if t_i occurs in the given query phrase q. In a query document, however, f_{i,q} can be larger than 1. Therefore, judging the importance of t_i in d without considering f_{i,q} can be counterproductive in inter-document similarity measurements.
A more fit-for-purpose approach would be to evaluate the importance of t_i in d by examining the similarity of f_{i,d} w.r.t. f_{i,q}. However, as discussed in Section 3.2, simply having similar occurrences of t_i (i.e., f_{i,d} ≈ f_{i,q}) is not sufficient to consider them similar. The similarity shall also consider how rare the frequency of t_i is in the collection.
Putting the above requirements together, the similarity of d_1 and d_2 w.r.t. t_i shall be based on the number of documents in D which have occurrence frequencies of t_i similar to those in d_1 and d_2. More formally, d_1 and d_2 are more likely to be similar w.r.t. t_i if the number of documents with frequencies of t_i between f_{i,d_1} and f_{i,d_2} is small. The first part of Table 3 compares the underlying assumptions of the tf-idf term weighting (used in existing measures) and the proposed approach, called Sp, to be introduced in the next subsection.
This approach addresses the limitations of both the tf and idf assumptions discussed in Section 3. The results of the new approach using the same examples discussed in Sections 3.1 and 3.2 are provided in the second part of Table 3. The comparisons demonstrate that the new approach provides more intuitive outcomes than the tfidf term weighting.
4.1 Sp: A new document similarity measure
Recently, Aryal et al. (2014, 2017) introduced a data-dependent measure in which the similarity of two data objects depends on the distribution of data between them. The intuition is that two objects are more likely to be similar if there are less data between them. For example, two individuals earning 800k and 900k are judged to be more similar by humans than two individuals earning 50k and 150k, because a lot more people earn in [50k, 150k] than in [800k, 900k].
Using a similar idea, the similarity of two documents d_1 and d_2 can be estimated as:
(6) s_Sp(d_1, d_2) = (1 / |T_{d_1} ∪ T_{d_2}|) · Σ_{t_i ∈ T_{d_1} ∩ T_{d_2}} log( N / n_i(d_1, d_2) ), where n_i(d_1, d_2) = |{d ∈ D : min(f_{i,d_1}, f_{i,d_2}) ≤ f_{i,d} ≤ max(f_{i,d_1}, f_{i,d_2})}|
where 1/|T_{d_1} ∪ T_{d_2}| is a normalization term to account for the probability of a term occurring in both d_1 and d_2. It reduces the bias towards documents having more terms, because they have a higher probability than documents having fewer terms of containing terms which also exist in a query document.
The number of distinct terms |T_{d_1} ∪ T_{d_2}| is used as a normalization factor (as in the traditional Jaccard similarity) because it is not sensitive to multiple occurrences of terms in a document which do not occur in the query document. In the IR task, Singhal et al. (1996) have shown that it is more effective than the cosine or normal length normalizations, which penalize more heavily documents having multiple occurrences of terms which are not in the query phrase.
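Under the formulation above, Sp can be sketched as follows (a naive implementation; a constant-time variant is discussed later, and the toy collection and names are illustrative):

```python
# A sketch of Sp: each shared term contributes log(N / n_i(d1, d2)),
# where n_i(d1, d2) counts the documents whose frequency of the term
# lies between its frequencies in d1 and d2; the sum is normalized by
# the number of distinct terms in the two documents.
import math

def sp(d1, d2, docs):
    N = len(docs)
    total = 0.0
    for t in set(d1) & set(d2):
        lo, hi = sorted((d1[t], d2[t]))
        n_range = sum(1 for d in docs if lo <= d.get(t, 0) <= hi)
        total += math.log(N / n_range)
    return total / len(set(d1) | set(d2))

# Toy collection: matching the common frequency 1 (shared by only two
# documents) scores higher than a wide [1, 5] range covering everyone.
corpus = [{"a": 1}, {"a": 1}, {"a": 5}]
```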
Sp can be interpreted as a simple probabilistic measure where the similarity of two documents w.r.t. t_i is assigned based on the probability of the frequency of t_i lying in [min(f_{i,d_1}, f_{i,d_2}), max(f_{i,d_1}, f_{i,d_2})]. In practice, this probability is estimated as n_i(d_1, d_2)/N, which is the inverse of the term used in Eqn 6.
4.2 Characteristics of Sp
The proposed measure has the following characteristics:

Term weighting is not required:
Unlike in existing measures such as cosine and BM25, f_{i,d_1} and f_{i,d_2} are not used directly in the similarity measure. They are used just to define the range [min(f_{i,d_1}, f_{i,d_2}), max(f_{i,d_1}, f_{i,d_2})]; and the similarity is based on n_i(d_1, d_2), the number of documents with frequencies of t_i in that range, which is invariant to monotonic scaling of frequency values. Hence, Sp does not require additional term weighting to adjust frequency values.

Self-similarity is data dependent and is the upper bound of similarity:
Unlike cosine and both variants of the Jaccard similarity, where the self-similarity of documents is fixed at the maximum of 1, Sp has data-dependent self-similarity because s_Sp(d, d) depends on n_i(d, d) for all t_i ∈ T_d. Thus, s_Sp(d_1, d_1) and s_Sp(d_2, d_2) can be different.
The similarity in Sp is bounded by its self-similarity, i.e., s_Sp(d_1, d_2) ≤ min(s_Sp(d_1, d_1), s_Sp(d_2, d_2)). Although BM25 also has data-dependent self-similarity, it is possible for the similarity of different documents to be larger than the self-similarity, i.e., there may be d_2 with s_BM25(d_1, d_2) > s_BM25(d_1, d_1), depending on the lengths of d_1 and d_2 and the parameters k_1 and b.

Relationship with the traditional Jaccard similarity and idf term weighting:
The formulation of Sp (Eqn 6) looks similar to that of the traditional Jaccard similarity (Eqn 4), except that the similarity of d_1 and d_2 w.r.t. t_i is based on log(N / n_i(d_1, d_2)) in Sp, whereas it is the fixed constant 1 in the traditional Jaccard similarity.
In the binary BoW vector representation, when f_{i,d_1} = 1 and f_{i,d_2} = 1, Sp assigns the similarity of d_1 and d_2 w.r.t. t_i based on log(N / n_i), whereas in the traditional Jaccard similarity it is 1, irrespective of whether t_i is rare or frequent in D.
In the term-frequency-based BoW representation, Sp is different from the idf weighting because idf(t_i) is based on n_i, whereas Sp is based on n_i(d_1, d_2), where n_i(d_1, d_2) ≤ n_i; and n_i(d_1, d_2) = n_i when min(f_{i,d_1}, f_{i,d_2}) = 1 and max(f_{i,d_1}, f_{i,d_2}) is the maximum frequency of t_i in any document in D.
4.3 Computational complexity
In the term-frequency-based BoW vector representation, computing n_i(d_1, d_2) naively can be expensive, as it requires a range search to find the number of documents having frequencies of t_i between min(f_{i,d_1}, f_{i,d_2}) and max(f_{i,d_1}, f_{i,d_2}). Since all f_{i,d} are integers (term occurrence frequency counts), it can be computed in constant time with the following simple preprocessing.
Let m_i be the maximum occurrence frequency of term t_i in any document in the given collection D. We can maintain a cumulative frequency count array B_i of size m_i + 1, where B_i[j] contains the number of documents having occurrences of t_i fewer than or equal to j.
Using B_i, n_i(d_1, d_2) can be estimated in constant time as B_i[max(f_{i,d_1}, f_{i,d_2})] − B_i[min(f_{i,d_1}, f_{i,d_2}) − 1]. Note that n_i(d_1, d_2) cannot be 0, because it is computed only if t_i ∈ T_{d_1} ∩ T_{d_2} (i.e., f_{i,d_1} ≥ 1 and f_{i,d_2} ≥ 1), and thus at least one document in D has a frequency of t_i in the range.
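The preprocessing can be sketched as follows (illustrative names):

```python
# A sketch of the constant-time lookup: for one term, precompute a
# cumulative array B where B[j] is the number of documents with at most
# j occurrences of the term; the count of documents with frequency in
# [lo, hi] is then B[hi] - B[lo - 1].
def build_cumulative(freqs, max_f):
    """freqs: occurrence counts of one term across all documents."""
    B = [0] * (max_f + 1)
    for f in freqs:
        B[f] += 1
    for j in range(1, max_f + 1):
        B[j] += B[j - 1]
    return B

def count_in_range(B, lo, hi):
    # assumes lo >= 1, which holds because both documents contain the term
    return B[hi] - B[lo - 1]

freqs = [0, 1, 1, 3, 5]          # term frequency in each of 5 documents
B = build_cumulative(freqs, 5)   # B == [1, 3, 3, 4, 4, 5]
```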
The above preprocessing requires O(M × m̄) time and space complexity, where M is the number of terms in the dictionary and m̄ is the average maximum frequency of terms.
With the above preprocessing, the runtime of computing the similarity of d_1 and d_2 using Sp is of the same order as that of the existing similarity measures, i.e., linear in the number of terms in the two documents.
5 Empirical evaluation
In this section, we present the results of experiments conducted to evaluate the task-specific performances of Sp, BM25, weighted Jaccard and cosine similarity in the query-by-example task of retrieving documents similar to a given query document. We experimented with both the term-frequency-based and binary BoW vector representations. We used different combinations of tf- and idf-based term weighting factors with the weighted Jaccard and cosine similarity measures.
5.1 Datasets and experimental setup
We used 10 datasets from 6 benchmark document collections. The characteristics of the datasets are provided in Table 4. NG20, R8, R52 and Webkb are from Cardoso-Cachopo (2007) (BoW vectors available at http://web.ist.utl.pt/acardoso/datasets/); the others are from Han and Karypis (2000) (BoW vectors available at http://www.cs.waikato.ac.nz/ml/weka/datasets.html).
Name  #Documents  #Terms  #Classes  Collection

Fbis  2,463  2,000  17  TREC collection 
La1s  3,204  13,195  6  TREC collection 
La2s  3,075  12,432  6  TREC collection 
New3s  9,558  26,832  44  TREC collection 
Ng20  18,821  5,489  20  20 Newsgroup collection 
Ohscal  11,162  11,465  10  Ohsumed patient records 
R8  7,674  3,497  8  Reuters collection 
R52  9,100  7,379  52  Reuters collection 
Wap  1,560  8,460  20  Yahoo web pages 
Webkb  4,199  1,817  4  University web pages 
Each dataset was divided into two subsets, D and Q, using 10-fold cross-validation, containing 90% and 10% of the documents, respectively. D was used as the given collection from which similar documents were retrieved for each query document in Q. For each q ∈ Q, documents in D were ranked in descending order of their similarities to q using each of the contending similarity measures, and the top k documents were presented as the documents similar to q.
For performance evaluation, a retrieved document was considered to be similar to q if the two have the same class label. In order to assess the consistency of a measure at different numbers of top retrieved results, we evaluated the precision at the top k retrieved results, P@k (in percentage), for k = 1, 2, …, 10, and used the mean average precision over these cut-offs as the performance evaluation criterion: MAP = (1/10) Σ_{k=1}^{10} P@k.
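The evaluation criterion can be sketched as follows (our reading of the protocol; the class labels below are illustrative):

```python
# A sketch of the evaluation: precision at the top-k retrieved results,
# averaged over the cut-offs k = 1..10. A retrieved document counts as
# relevant when its class label matches the query's label.
def precision_at_k(retrieved_labels, query_label, k):
    top = retrieved_labels[:k]
    return sum(1 for lab in top if lab == query_label) / k

def mean_avg_precision(retrieved_labels, query_label, max_k=10):
    return sum(precision_at_k(retrieved_labels, query_label, k)
               for k in range(1, max_k + 1)) / max_k
```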
We repeated the experiment 10 times, using each of the 10 folds as Q and the remaining 9 folds as D. The average MAP and standard error over the 10 runs were reported. The average MAPs of two measures were considered to be significantly different if their confidence intervals based on two standard errors did not overlap.
The free parameters k_1 and b in BM25 were set to 1.2 and 0.95, respectively, as recommended by Paltoglou and Thelwall (2010) and Jones et al. (2000).
All the experimental setups and similarity measures were implemented in Java using the WEKA platform (Hall et al., 2009). All the experiments were conducted on a Linux machine with a 2.27 GHz processor and 16 GB memory. We discuss the experimental results with the term-frequency-based and binary BoW vector representations separately in the following two subsections.
5.2 Results in the term-frequency-based BoW vector representation
Here we used two term weighting schemes, the tf factor only and the tf-idf factors, with weighted Jaccard and cosine. The six contending measures are: Sp, BM25, Cos.tfidf (cosine with tf-idf), Cos.tf (cosine with tf only), WJac.tfidf (weighted Jaccard with tf-idf) and WJac.tf (weighted Jaccard with tf only).
The average MAP and standard error over 10 runs for the six contending measures are provided in Table 5, and the summarized results, in terms of pairwise win-loss-draw counts of the contending measures based on the two-standard-errors significance test over the 10 datasets, are provided in Table 6.
BM25  Cos.tfidf  Cos.tf  WJac.tfidf  WJac.tf  Sp  

Fbis  65.12±0.62  68.42±0.61  68.28±0.58  68.48±0.49  66.75±0.54  67.77±0.51 
La1s  74.41±0.32  75.97±0.42  73.08±0.49  79.18±0.33  77.54±0.47  79.36±0.32 
La2s  76.42±0.49  78.11±0.42  75.24±0.44  81.06±0.42  79.45±0.37  80.89±0.40 
New3s  67.01±0.18  68.31±0.19  70.19±0.19  69.36±0.16  68.45±0.15  68.98±0.16 
Ng20  76.47±0.19  74.81±0.24  67.80±0.28  73.67±0.23  64.28±0.24  72.30±0.20 
Ohscal  59.72±0.22  53.59±0.21  61.06±0.26  59.68±0.21  60.81±0.20  60.14±0.19 
R52  85.50±0.20  80.80±0.27  86.57±0.15  84.55±0.21  84.72±0.19  84.39±0.22 
R8  91.05±0.14  86.14±0.22  92.93±0.19  91.03±0.18  91.94±0.21  91.40±0.17 
Wap  19.67±0.42  65.33±0.34  61.97±0.41  70.54±0.46  65.10±0.48  70.92±0.50 
Webkb  70.28±0.23  68.55±0.24  73.04±0.27  73.90±0.31  75.25±0.25  74.91±0.33 
Sp  WJac.tf  WJac.tfidf  Cos.tf  Cos.tfidf  

BM25  8-2-0  8-2-0  6-2-2  7-0-3  5-5-0 
Cos.tfidf  8-1-1  6-2-2  8-1-1  5-4-1  
Cos.tf  5-4-1  4-5-1  5-4-1  
WJac.tfidf  3-2-5  3-6-1  
WJac.tf  5-2-3 
Table 5 shows that Sp and Cos.tf each produced the best result, or a result competitive with the best, on five datasets; followed by WJac.tfidf on four; whereas Cos.tfidf, BM25 and WJac.tf were best or competitive with the best on only one dataset each.
The first column in Table 6 shows that Sp had more wins than losses against all contending measures. It had one more win than loss against the closest contenders, Cos.tf and WJac.tfidf.
Of the two cosine measures, Cos.tf had more wins than losses against Cos.tfidf. This shows that idf term weighting can be counterproductive with cosine in inter-document similarity measurements. This is mainly due to the cosine normalization, which penalizes more heavily documents having rare terms (with high idf weights) which are not in q. In comparison to BM25, Cos.tf produced better results with seven wins and no losses, and Cos.tfidf was competitive with five wins versus five losses.
It is interesting to note that, on the Wap dataset, BM25 produced a significantly worse result than the other contenders. This is due to the idf factor used in BM25: if a term t_i occurs in more than half of the documents in D (i.e., n_i > N/2), idf(t_i) is negative and makes a negative contribution to the similarity of two documents. When idf(t_i) was replaced by the traditional log(N / n_i) in the formulation of BM25 (Eqn 2), it produced MAP = 67.04%, which was still worse than those of Sp and WJac.tfidf.
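The sign change of the BM25 idf factor can be checked directly (toy counts; the function name is illustrative):

```python
# The BM25 idf of Eqn 3 turns negative once a term appears in more than
# half of the collection, which is the effect observed on Wap.
import math

def bm25_idf(N, n_t):
    return math.log((N - n_t + 0.5) / (n_t + 0.5))

# bm25_idf(100, 60) < 0 : term in 60 of 100 documents contributes negatively
# bm25_idf(100, 5)  > 0 : a rare term contributes positively
```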
Among the weighted Jaccard variants, WJac.tfidf produced better retrieval results than WJac.tf. It is interesting to note that WJac.tfidf also produced better retrieval results than Cos.tfidf, Cos.tf and BM25. This could be mainly due to the vector length normalization used in BM25 and cosine, which penalizes more heavily documents having higher frequencies of terms which are not in q.
It is interesting to note that Sp and WJac.tfidf produced more consistent results than the other contending measures: they did not produce the worst result on any dataset, whereas WJac.tf produced the worst result on one dataset (NG20); Cos.tf on two datasets (La1s and La2s); BM25 on three datasets (Fbis, New3s and Wap); and Cos.tfidf on four datasets (Ohscal, R8, R52 and Webkb).
In terms of runtime, all measures had runtimes in the same order of magnitude. For example, on the NG20 dataset, the average total runtime of one run (including preprocessing) using Sp was 15935 seconds, whereas BM25, Cos.tfidf and WJac.tfidf took 27432, 16089 and 14875 seconds, respectively.
5.3 Results in the binary BoW vector representation
Here, the six contending measures are: Sp, BM25, Cos.idf (cosine with idf), Cos (cosine without idf), WJac.idf (weighted Jaccard with idf) and WJac (weighted Jaccard without idf). Note that WJac, which does not use any term weighting, is equivalent to the traditional Jaccard similarity defined in Eqn 4.
The average MAP and standard error over 10 runs for the six contending measures are provided in Table 7, and the summarized results, in terms of pairwise win-loss-draw counts of the contending measures based on the two-standard-errors significance test over the 10 datasets, are provided in Table 8.
BM25  Cos.idf  Cos  WJac.idf  WJac  Sp  

Fbis  67.90±0.50  66.46±0.50  63.24±0.56  67.17±0.46  64.58±0.52  66.94±0.47 
La1s  74.78±0.25  76.78±0.34  75.96±0.38  78.54±0.34  77.55±0.39  79.04±0.30 
La2s  76.71±0.48  78.48±0.42  77.55±0.38  80.02±0.39  79.12±0.35  80.54±0.40 
New3s  69.61±0.20  66.73±0.16  64.88±0.15  67.76±0.15  65.66±0.16  68.13±0.16 
Ng20  74.37±0.16  73.80±0.17  64.12±0.20  72.26±0.19  63.07±0.21  72.61±0.20 
Ohscal  58.95±0.19  55.06±0.17  58.56±0.18  58.66±0.21  58.45±0.17  59.23±0.19 
R52  83.87±0.24  79.01±0.28  84.19±0.20  83.23±0.22  83.36±0.21  83.80±0.22 
R8  90.54±0.16  86.03±0.19  91.60±0.17  90.24±0.17  91.10±0.20  90.92±0.18 
Wap  16.47±0.34  66.97±0.47  59.16±0.44  70.18±0.54  65.09±0.48  70.02±0.53 
Webkb  73.29±0.39  70.86±0.23  75.61±0.27  74.19±0.37  75.59±0.29  74.97±0.35 
Sp  WJac  WJac.idf  Cos  Cos.idf  

BM25  5-2-3  5-5-0  4-3-2  4-4-2  3-7-0 
Cos.idf  8-1-1  5-4-1  8-1-1  4-6-0  
Cos  7-2-1  5-3-2  6-3-1  
WJac.idf  5-0-5  2-6-2  
WJac  8-0-2 
Table 7 shows that Sp produced the best result, or a result competitive with the best, on six datasets; followed by BM25 on five; WJac.idf on four; Cos on two; and WJac on one dataset only. Cos.idf did not produce a result competitive with the best performing measure on any dataset.
In terms of the pairwise win-loss-draw counts shown in the first column of Table 8, Sp had many more wins than losses against all other contending measures.
It is interesting to note that BM25, Cos.idf and Cos using the binary BoW representation produced better retrieval results than their respective counterparts using the term-frequency-based BoW representation on some datasets: (i) BM25 on Fbis, New3s and Webkb; (ii) Cos.idf on La1s, Ohscal, Wap and Webkb; and (iii) Cos on La1s, La2s and Webkb. In contrast, WJac.idf, WJac and Sp using binary BoW vectors did not produce better retrieval results than their respective counterparts using term-frequency-based BoW vectors.
As in the term-frequency-based BoW representation, all measures had runtimes in the same order of magnitude.
6 Discussion
Even though some studies have used different variants of tf and idf term weighting factors with the most widely used cosine similarity, the tf and idf factors discussed in Section 2.1 have been shown to be the most consistent in the IR task (Singhal, 1997).
For the tf factor, instead of the logarithmic scaling of f_{i,d}, some researchers have used other scaling approaches such as augmented tf (Salton and Buckley, 1988) and Okapi tf (Robertson et al., 1994). Similarly, for the idf factor, instead of log(N / n_i), some researchers have used the probabilistic idf factor based on log((N − n_i) / n_i) (Robertson et al., 1994; Singhal, 1997). Note that BM25 (Eqn 2) uses a tf factor similar to the Okapi tf and an idf factor similar to the probabilistic idf factor (Robertson and Zaragoza, 2009).
In the supervised text mining task of document classification, different approaches utilizing class information have been proposed to estimate the collection-based term weighting factors (Wang and Zhang, 2013; Debole and Sebastiani, 2003; Lan et al., 2009). Inverse category frequency (icf) (Wang and Zhang, 2013) has been shown to produce better classification results than the traditional idf factor with the cosine similarity measure. It considers the distribution of a term among classes rather than among documents in the given collection. The intuition behind icf is that the fewer classes a term occurs in, the more discriminating power the term contributes to classification (Wang and Zhang, 2013). If C and c_i are the total number of classes and the number of classes in which t_i occurs at least once in at least one document, then the icf factor is estimated as: icf(t_i) = log(C / c_i).
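The icf factor can be sketched as follows (our reading of the definition; the class-to-documents map and names are illustrative):

```python
# A sketch of inverse category frequency: with C classes in total and
# c_t classes containing term t, icf(t) = log(C / c_t).
import math

def icf(term, class_to_docs):
    """class_to_docs: class label -> list of term->frequency documents."""
    C = len(class_to_docs)
    c_t = sum(1 for docs in class_to_docs.values()
              if any(term in d for d in docs))
    return math.log(C / c_t)

# "a" occurs in both classes (icf = 0); "b" occurs in one class only.
ctd = {"x": [{"a": 1}], "y": [{"b": 1}, {"a": 2}]}
```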
We have evaluated the performance of Sp in the k-NN document classification task against existing measures using the supervised icf term weighting scheme (Wang and Zhang, 2013). Sp produced better or competitive classification results compared with existing measures using supervised or unsupervised term weighting in the 5-NN classification task. The classification results are provided in the Appendix.

Even though the weighted Jaccard similarity has been used in other application domains (Chierichetti et al., 2010), it is not widely used in the literature to measure similarities of BoW documents. Our experimental results in Section 5 show that the weighted Jaccard similarity with the tf-idf term weighting scheme can be an effective alternative to cosine and BM25 in inter-document similarity measurements.
Sp has superior performance over all contenders in the binary BoW vector representation. It can be very useful in application domains such as law and medicine, where the exact term frequency information may not be available due to privacy issues, because it is possible to infer information in a document from its term frequencies (Zhu et al., 2008).
7 Concluding remarks
For the inter-document similarity measurement task, we identify the limitations of the underlying assumptions of the most widely used tf-idf term weighting scheme employed in existing measures such as cosine and BM25, and provide an alternative which is more intuitive for this task.
Based on the new assumption, we introduce a new simple but effective inter-document similarity measure called Sp.
Our empirical evaluation in the querybyexample task shows that:

Sp produces results that are better than, or at least competitive with, those of the existing similarity measures with the state-of-the-art term weighting schemes in the term-frequency-based BoW representation. Sp also produces more consistent results than the existing measures across different datasets.

Sp produces better results than the existing similarity measures, with or without idf term weighting, in the case of the binary BoW representation.
When cosine and BM25 are employed, our results show that it is important to use an appropriate BoW vector representation (binary or term-frequency-based) as well as an appropriate term weighting scheme. Using an inappropriate representation or term weighting scheme can result in poor performance.
In contrast, using Sp, users do not have to worry about applying any additional term weighting to measure the similarity of two documents and still get better or competitive results in comparison to the best results obtained by cosine or BM25.
Acknowledgement
A preliminary version of this paper was published in the Proceedings of the 11th Asia Information Retrieval Societies Conference 2015 (Aryal et al., 2015).
References
 Aryal et al. [2014] S. Aryal, K. M. Ting, G. Haffari, and T. Washio. Mp-dissimilarity: A data dependent dissimilarity measure. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 707–712, 2014.
 Aryal et al. [2015] S. Aryal, K. M. Ting, G. Haffari, and T. Washio. Beyond tf-idf and cosine distance in documents dissimilarity measure. In Proceedings of the 11th Asia Information Retrieval Societies Conference, pages 400–406, 2015.
 Aryal et al. [2017] S. Aryal, K. M. Ting, T. Washio, and G. Haffari. Data-dependent dissimilarity measure: an effective alternative to geometric distance measures. Knowledge and Information Systems, 53(2):479–506, 2017.
 Cardoso-Cachopo [2007] A. Cardoso-Cachopo. Improving Methods for Single-label Text Categorization. PhD thesis, Instituto Superior Técnico, Technical University of Lisbon, Lisbon, Portugal, 2007.
 Chierichetti et al. [2010] F. Chierichetti, R. Kumar, S. Pandey, and S. Vassilvitskii. Finding the Jaccard median. In Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pages 293–311. Society for Industrial and Applied Mathematics, 2010.
 Debole and Sebastiani [2003] F. Debole and F. Sebastiani. Supervised term weighting for automated text categorization. In Proceedings of the 2003 ACM Symposium on Applied Computing, pages 784–788, New York, NY, USA, 2003. ACM.
 Hall et al. [2009] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explorations Newsletter, 11(1):10–18, Nov. 2009.
 Han and Karypis [2000] E.-H. Han and G. Karypis. Centroid-based document classification: Analysis and experimental results. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 424–431, London, UK, 2000. Springer-Verlag.
 Han et al. [2012] X. Han, S. Li, and Z. Shen. A kNN method for large scale hierarchical text classification at LSHTC3. In Proceedings of the Workshop on Large Scale Hierarchical Classification, pages 1–12, 2012.
 Jaccard [1901] P. Jaccard. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547–579, 1901.

 Joachims [1997] T. Joachims. A probabilistic analysis of the Rocchio algorithm with tf-idf for text categorization. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 143–151, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
 Jones et al. [2000] K. S. Jones, S. Walker, and S. E. Robertson. A probabilistic model of information retrieval: Development and comparative experiments. Information Processing and Management, 36(6):779–808, 2000.
 Lan et al. [2009] M. Lan, C. L. Tan, J. Su, and Y. Lu. Supervised and traditional term weighting methods for automatic text categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(4):721–735, 2009.
 Manning et al. [2008] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

 Paltoglou and Thelwall [2010] G. Paltoglou and M. Thelwall. A study of information retrieval weighting schemes for sentiment analysis. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 1386–1395. Association for Computational Linguistics, 2010.
 Robertson and Zaragoza [2009] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333–389, 2009.
 Robertson et al. [1994] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of the Third Text Retrieval Conference (TREC 1994), pages 109–126, 1994.
 Salton and Buckley [1988] G. Salton and C. Buckley. Termweighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.
 Salton and McGill [1986] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGrawHill, Inc., New York, NY, USA, 1986.
 Singhal et al. [1996] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–29, New York, NY, USA, 1996. ACM.
 Singhal [1997] A. K. Singhal. Term Weighting Revisited. PhD thesis, The Faculty of the Graduate School, Cornell University, 1997.
 Wang and Zhang [2013] D. Wang and H. Zhang. Inverse-category-frequency based supervised term weighting schemes for text categorization. Journal of Information Science and Engineering, 29(2):209–225, 2013.
 Zhu et al. [2008] X. Zhu, A. B. Goldberg, M. Rabbat, and R. Nowak. Learning Bigrams from Unigrams. In Proceedings of ACL08: HLT, pages 656–664. Association for Computational Linguistics, 2008.
 Zobel and Moffat [1998] J. Zobel and A. Moffat. Exploring the similarity space. SIGIR Forum, 32(1):18–34, 1998.
Appendix A: kNN classification results
In order to predict the class label of a test document, its k nearest neighbour (i.e., most similar) documents were searched in the given labelled training set using a contending similarity measure, and the majority class among the k NNs was predicted as the class label of the test document.
All classification experiments were conducted using 10-fold cross validation (10 runs, each with one of the 10 folds as the test set and the remaining 9 folds as the training set). The average classification accuracy and standard error over the 10 folds are reported. All collection-based term weighting factors (idf and icf) were computed from the training set and used for both the training and test documents. The parameter k was set to the commonly used value of 5 (i.e., 5NN classification was used).
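The prediction step of this protocol can be sketched as follows, where `similarity` stands in for any of the contending measures (Sp, cosine, weighted Jaccard or BM25); the function is illustrative, not the exact experimental code:

```python
from collections import Counter

def knn_predict(test_doc, train_docs, train_labels, similarity, k=5):
    """Predict the label of test_doc as the majority class among its
    k most similar training documents (k=5 in the experiments)."""
    # Score every training document against the test document.
    scored = [(similarity(test_doc, d), lbl)
              for d, lbl in zip(train_docs, train_labels)]
    # Keep the k most similar and take a majority vote on their labels.
    scored.sort(key=lambda s: s[0], reverse=True)
    top_k = [lbl for _, lbl in scored[:k]]
    return Counter(top_k).most_common(1)[0][0]
```

Any similarity function over the chosen BoW representation can be plugged in; the cross-validation loop around it simply repeats this prediction for every document in each held-out fold.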
We discuss the 5NN classification results with the term-frequency-based and binary BoW vector representations separately in the following two subsections.
Termfrequencybased BoW vector representation
We used term weighting based on tf only, tf-idf and tf-icf with weighted Jaccard and cosine, resulting in eight contending measures: Sp, BM25, Cos.tf-icf, Cos.tf-idf, Cos.tf, WJac.tf-icf, WJac.tf-idf and WJac.tf.
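The three weighting variants can be sketched as below, assuming idf(t) = log(N/df(t)) and icf(t) = log(|C|/cf(t)), with cf(t) the number of classes whose documents contain term t (following Wang and Zhang, 2013); any smoothing used in the actual experiments may differ, and the function name is ours:

```python
import math

def term_weights(docs, labels, scheme="tf-idf"):
    """Weight term-frequency BoW dicts by 'tf', 'tf-idf' or 'tf-icf'.

    idf(t) = log(N / df(t)); icf(t) = log(|C| / cf(t)), where cf(t) is
    the number of classes containing term t. Illustrative sketch only.
    """
    n = len(docs)
    classes = set(labels)
    df, class_terms = {}, {c: set() for c in classes}
    for doc, lbl in zip(docs, labels):
        for t in doc:
            df[t] = df.get(t, 0) + 1
            class_terms[lbl].add(t)
    cf = {t: sum(t in ts for ts in class_terms.values()) for t in df}

    def w(t):
        if scheme == "tf":
            return 1.0
        if scheme == "tf-idf":
            return math.log(n / df[t])
        return math.log(len(classes) / cf[t])  # tf-icf

    return [{t: tf * w(t) for t, tf in doc.items()} for doc in docs]
```

Note that icf needs the class labels of the training documents, which is why tf-icf is a supervised scheme while tf and tf-idf are unsupervised.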
The average classification accuracies and standard errors over 10-fold cross validation of the eight contending measures are provided in Table 10, and the summarized results, in terms of pairwise win-loss-draw counts based on a two-standard-errors significance test over the 10 datasets used in the experiment, are provided in Table 11.
Dataset  BM25.tf     Cos.tf-icf  Cos.tf-idf  Cos.tf      WJac.tf-icf  WJac.tf-idf  WJac.tf     Sp
Fbis     76.98±1.04  80.83±0.84  79.33±1.05  80.23±0.78  79.29±0.95   79.54±0.91   79.21±0.84  79.21±0.80
La1s     83.68±0.58  83.43±0.84  86.70±0.66  82.30±0.80  87.30±0.80   88.89±0.61   87.05±0.70  88.48±0.46
La2s     86.28±0.50  84.52±0.47  87.93±0.77  84.23±0.61  88.46±0.63   90.11±0.41   88.03±0.56  89.59±0.48
New3s    79.31±0.30  79.54±0.34  78.99±0.34  81.47±0.29  80.90±0.30   80.86±0.33   80.60±0.36  80.55±0.36
Ng20     88.55±0.15  87.57±0.22  86.92±0.22  84.74±0.34  86.28±0.27   87.41±0.25   83.05±0.28  86.62±0.17
Ohscal   72.63±0.29  72.04±0.43  66.95±0.45  74.25±0.46  74.50±0.29   72.36±0.21   74.22±0.33  73.19±0.34
R52      92.30±0.20  91.20±0.28  87.72±0.54  92.18±0.18  91.69±0.22   91.17±0.22   90.63±0.21  90.94±0.25
R8       95.19±0.21  95.34±0.17  90.80±0.25  95.81±0.23  95.39±0.20   94.98±0.23   95.27±0.31  95.28±0.27
Wap      17.76±0.79  75.90±0.46  76.92±0.76  72.44±0.58  80.70±0.80   82.31±0.92   76.22±0.58  82.50±0.79
Webkb    81.16±0.38  81.86±0.48  77.92±0.43  81.58±0.57  84.14±0.42   83.33±0.61   84.40±0.38  84.33±0.53
Dataset  BM25        Cos.icf     Cos.idf     Cos         WJac.icf    WJac.idf    WJac        Sp
Fbis     79.29±0.65  79.62±0.99  77.75±0.95  78.20±0.88  79.82±1.08  79.86±0.98  78.28±0.81  79.05±0.82
La1s     84.27±0.63  87.45±0.79  87.70±0.63  85.89±0.72  88.05±0.73  88.70±0.54  87.55±0.62  88.67±0.49
La2s     86.08±0.58  87.48±0.49  89.43±0.42  86.67±0.50  88.52±0.68  89.99±0.43  88.00±0.60  90.15±0.48
New3s    80.72±0.42  79.10±0.36  78.31±0.41  78.19±0.35  79.79±0.36  80.03±0.41  78.74±0.29  80.15±0.40
Ng20     87.59±0.13  87.19±0.20  87.25±0.19  82.80±0.20  85.64±0.12  86.61±0.18  82.16±0.20  86.84±0.24
Ohscal   72.02±0.32  72.36±0.31  68.54±0.32  72.89±0.24  73.65±0.28  72.36±0.32  72.89±0.29  72.79±0.33
R52      91.20±0.24  89.74±0.38  86.14±0.43  90.21±0.27  90.51±0.25  89.81±0.24  89.75±0.17  90.80±0.19
R8       94.80±0.21  95.05±0.13  90.98±0.44  94.99±0.23  95.10±0.22  94.54±0.30  94.86±0.17  95.05±0.31
Wap      15.51±0.69  76.86±0.65  78.27±1.01  69.68±0.82  80.00±0.56  81.92±0.96  76.28±0.62  81.60±0.81
Webkb    83.97±0.49  84.68±0.54  81.45±0.60  84.26±0.49  84.88±0.44  84.16±0.47  84.73±0.42  84.71±0.53
(Each cell gives the win-loss-draw counts of the column measure against the row measure over the 10 datasets.)
             Sp     WJac.tf  WJac.tf-idf  WJac.tf-icf  Cos.tf  Cos.tf-idf  Cos.tf-icf
BM25         6-2-2  7-2-1    6-2-2        7-2-1        5-2-3   4-5-1       2-3-5
Cos.tf-icf   6-1-3  5-2-3    5-0-5        6-1-2        4-2-4   2-5-3
Cos.tf-idf   8-0-2  5-1-4    9-0-1        6-1-3        5-4-1
Cos.tf       5-4-1  4-3-3    5-3-2        5-1-4
WJac.tf-icf  2-2-6  0-3-7    3-2-5
WJac.tf-idf  1-1-8  2-5-3
WJac.tf      4-1-5
(Each cell gives the win-loss-draw counts of the column measure against the row measure over the 10 datasets.)
          Sp     WJac   WJac.idf  WJac.icf  Cos    Cos.idf  Cos.icf
BM25      4-1-5  4-3-3  3-2-5     4-3-3     3-3-4  3-6-1    3-3-4
Cos.icf   4-0-6  0-1-8  3-2-5     3-1-6     0-4-6  1-5-4
Cos.idf   6-0-4  4-3-3  7-1-2     7-1-2     4-4-2
Cos       6-0-4  3-2-5  5-0-5     6-0-4
WJac.icf  3-1-6  0-5-5  3-3-4
WJac.idf  1-0-9  0-4-6
WJac      6-0-4
The 5NN classification accuracies in Table 10 show that Sp, WJac.tf-idf, WJac.tf-icf and Cos.tf produced the best result, or results competitive to the best, in five datasets each, followed by WJac.tf in four; Cos.tf-icf and BM25 in two datasets each; and Cos.tf-idf in one dataset only.
The pairwise win-loss-draw counts of Sp in the first column of Table 11 show that it had more wins than losses against all contending measures except WJac.tf-idf and WJac.tf-icf, against which it had competitive results with the same number of wins and losses.
It is interesting to note that Sp and all three variants of the weighted Jaccard similarity produced better classification results than all three variants of cosine and BM25. As in the similar-document retrieval task discussed in Section 5, BM25 produced the worst classification accuracy in the Wap dataset because of its idf factor. The classification accuracy increased to 79.42% when it was replaced by the traditional idf.
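For intuition, the standard BM25 idf component, log((N - df + 0.5) / (df + 0.5)), becomes negative for any term that appears in more than half of the documents in the collection, whereas the traditional idf, log(N/df), never does. The sketch below assumes this standard parameterisation, which may differ from the exact variant used in our experiments:

```python
import math

def bm25_idf(n_docs, df):
    """Standard BM25 idf; negative whenever df > n_docs / 2."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))

def traditional_idf(n_docs, df):
    """Traditional idf; non-negative for any df <= n_docs."""
    return math.log(n_docs / df)
```

For example, in a collection of 1000 documents, a term with document frequency 700 receives a negative BM25 idf but a positive traditional idf, so very common terms can actively penalise matches under BM25.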
The supervised term weighting using icf (tf-icf) did not always produce better classification results than the traditional tf-idf based term weighting with either cosine or weighted Jaccard. It had five wins and two losses with cosine, whereas it had two wins and three losses with weighted Jaccard.
Binary BoW vector representation
We used the weighted Jaccard and cosine similarities with and without idf and icf weighting, resulting in eight contending measures: Sp, BM25, Cos.idf, Cos.icf, Cos, WJac.idf, WJac.icf and WJac.
The average classification accuracies and standard errors over 10-fold cross validation of the eight contending measures are provided in Table 10, and the summarized results, in terms of pairwise win-loss-draw counts based on the two-standard-errors significance test over the 10 datasets used in the experiment, are provided in Table 12.
The 5NN classification accuracies in Table 10 show that Sp produced the best result, or results competitive to the best, in eight datasets. The closest contenders, BM25 and WJac.idf, produced the best or competitive-to-the-best results in six datasets each, followed by WJac.icf in five; Cos.icf in four; WJac and Cos in three datasets each; and Cos.idf in two datasets only.
In terms of the pairwise win-loss-draw counts shown in the first column of Table 12, Sp had more wins than losses against all other contending measures. It had one win and no losses against WJac.idf, and three wins and one loss against WJac.icf.
As in the term-frequency-based BoW representation, the supervised weighting scheme based on icf did not always produce better classification results than the traditional idf based term weighting scheme with either cosine or weighted Jaccard in the binary BoW vector representation. It had five wins and one loss with cosine, whereas it had three wins and three losses with weighted Jaccard.
It is interesting to note that BM25, Cos.icf, Cos.idf and Cos, which use the binary BoW vector representation, produced better classification accuracies than their respective counterparts using the term-frequency-based BoW representation in some datasets; e.g., BM25 was better in three datasets (Fbis, New3s, Webkb); Cos.icf and Cos in three datasets (La1s, La2s, Webkb); and Cos.idf in two datasets (La2s, Webkb). However, all three variants of weighted Jaccard and Sp with the term-frequency-based BoW representation produced results better than or competitive with those using the binary BoW representation.