Scalable Methods for Calculating Term Co-Occurrence Frequencies

07/17/2020
by Bodo Billerbeck, et al.

Search techniques make use of elementary information such as term frequencies and document lengths when computing similarity weights. They can also exploit richer statistics, in particular the number of documents in which any two terms co-occur. In this paper we propose alternative methods for computing this statistic, a challenging task because the number of distinct pairs of terms is vast: around 100,000 in a typical 1000-word news article, for example. We do not employ approximation algorithms, because we want to be able to find exact counts. We explore the efficiency of these methods, finding that a naïve approach based on a dictionary is indeed very slow, while methods based on a combination of inverted indexes and linear scanning provide both massive speed-ups and better observed asymptotic behaviour. Our careful implementation shows that, with our novel list-pairs approach, it is possible to process several hundred thousand documents per hour.
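To make the statistic concrete, here is a minimal Python sketch of a naïve dictionary-based count of the kind the abstract calls slow. It is an illustration only, not the authors' implementation: the function names, the whitespace tokenisation, and the toy documents are all assumptions, and the figure of 450 distinct terms is an assumed value chosen to show how a 1000-word article can yield roughly 100,000 pairs.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """For each unordered term pair, count the documents containing both terms.

    Hypothetical helper: tokenisation is a plain lowercase whitespace split,
    which is an assumption made only for this sketch.
    """
    counts = Counter()
    for text in documents:
        # Use the set of distinct terms so a pair counts once per document,
        # and sort so each pair has a single canonical ordering.
        terms = sorted(set(text.lower().split()))
        counts.update(combinations(terms, 2))
    return counts


def distinct_pairs(num_distinct_terms):
    """Number of unordered pairs among n distinct terms: n(n-1)/2."""
    return num_distinct_terms * (num_distinct_terms - 1) // 2


# Toy collection (invented for illustration).
docs = ["the cat sat", "the cat ran", "a dog ran"]
counts = cooccurrence_counts(docs)
print(counts[("cat", "the")])   # documents containing both "cat" and "the"
print(distinct_pairs(450))      # pairs for an assumed 450 distinct terms
```

Every document contributes a number of dictionary updates that is quadratic in its count of distinct terms, which suggests why this direct approach scales poorly.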


