Scalable Methods for Calculating Term Co-Occurrence Frequencies

by Bodo Billerbeck, et al.

Search techniques make use of elementary information such as term frequencies and document lengths in the computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast – around 100,000 in a typical 1000-word news article, for example. We do not employ approximation algorithms, as we want to be able to find exact counts. We explore the efficiency of these methods, finding that a naïve approach based on a dictionary is indeed very slow, while methods based on a combination of inverted indexes and linear scanning provide both massive speed-ups and better observed asymptotic behaviour. Our careful implementation shows that, with our novel list-pairs approach, it is possible to process several hundred thousand documents per hour.
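To make the statistic concrete, the following is a minimal Python sketch, not the authors' implementation and not their list-pairs method. It shows a naïve dictionary keyed on term pairs, and an alternative that counts co-occurrences for one pair by merging two sorted postings lists from an inverted index. The function names, the whitespace tokenisation, and the toy documents are all assumptions made for illustration. The helper `distinct_pairs` shows why the pair count grows so quickly: a document with n distinct terms has n(n-1)/2 unordered pairs, so a hypothetical document with 450 distinct terms would already yield about 100,000 pairs.

```python
from collections import defaultdict
from itertools import combinations


def distinct_pairs(n):
    """Number of unordered pairs among n distinct terms."""
    return n * (n - 1) // 2


def cooccurrence_naive(docs):
    """Naive approach: one dictionary entry per term pair seen in any document."""
    counts = defaultdict(int)
    for doc in docs:
        # Assumed tokenisation: lowercase, split on whitespace, deduplicate.
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            counts[(a, b)] += 1  # number of documents containing both terms
    return dict(counts)


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):
        for term in set(doc.lower().split()):
            index[term].append(doc_id)  # ids arrive in increasing order
    return index


def cooccurrence_from_index(index, a, b):
    """Count documents containing both terms by merging two sorted lists."""
    pa, pb = index.get(a, []), index.get(b, [])
    i = j = shared = 0
    while i < len(pa) and j < len(pb):
        if pa[i] == pb[j]:
            shared += 1
            i += 1
            j += 1
        elif pa[i] < pb[j]:
            i += 1
        else:
            j += 1
    return shared


# Toy collection, invented for this example.
docs = ["the cat sat", "the cat ran", "a dog ran"]
naive = cooccurrence_naive(docs)
index = build_inverted_index(docs)

print(distinct_pairs(450))                            # 101025
print(naive[("cat", "the")])                          # 2
print(cooccurrence_from_index(index, "cat", "the"))   # 2
```

Both routes return exact counts. The dictionary holds an entry for every pair observed anywhere in the collection, whereas the index-based route stores only one postings list per term and computes a pair's count on demand.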






