Probabilistic Models for Query Approximation with Large Sparse Binary Datasets

01/16/2013
by Dmitry Y. Pavlov, et al.

Large sparse sets of binary transaction data with millions of records and thousands of attributes occur in various domains: customers purchasing products, users visiting web pages, and documents containing words are just three typical examples. Real-time query selectivity estimation (the problem of estimating the number of rows in the data that satisfy a given predicate) is an important practical problem for such databases. We investigate the application of probabilistic models to this problem. In particular, we study a Markov random field (MRF) approach based on frequent sets and maximum entropy, and compare it to the independence model and the Chow-Liu tree model. We find that the MRF model provides substantially more accurate probability estimates than the other methods but is more expensive in both computation and memory. To alleviate the computational requirements, we show how one can apply bucket elimination and clique tree approaches to take advantage of structure in the models and in the queries. We provide experimental results on two large real-world transaction datasets.
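
To make the selectivity estimation task concrete, the following is a minimal sketch of the simplest baseline named above, the independence model, applied to a conjunctive query over binary attributes. It is an illustration under stated assumptions, not the authors' implementation: the toy transactions, the attribute names, and the helper functions `item_frequencies`, `independence_estimate`, and `exact_count` are all hypothetical, and the query is assumed to be a conjunction of attribute = 0/1 conditions.

```python
from typing import Dict, List, Set

# Hypothetical toy data: each row is the set of attributes equal to 1.
transactions: List[Set[str]] = [
    {"bread", "milk"},
    {"bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def item_frequencies(rows: List[Set[str]]) -> Dict[str, float]:
    """Return, for each attribute, the fraction of rows in which it is 1."""
    counts: Dict[str, int] = {}
    for row in rows:
        for item in row:
            counts[item] = counts.get(item, 0) + 1
    return {item: c / len(rows) for item, c in counts.items()}


def independence_estimate(
    freqs: Dict[str, float], query: Dict[str, int], n_rows: int
) -> float:
    """Estimate the number of rows matching a conjunctive query.

    Assumes attributes are mutually independent, so the joint probability
    is the product of the per-attribute marginals.
    """
    p = 1.0
    for attr, value in query.items():
        f = freqs.get(attr, 0.0)  # unseen attributes have frequency 0
        p *= f if value == 1 else 1.0 - f
    return p * n_rows


def exact_count(rows: List[Set[str]], query: Dict[str, int]) -> int:
    """Count matching rows by scanning the data (the slow ground truth)."""
    return sum(
        all((attr in row) == bool(value) for attr, value in query.items())
        for row in rows
    )


freqs = item_frequencies(transactions)
query = {"bread": 1, "milk": 1}
estimate = independence_estimate(freqs, query, len(transactions))
truth = exact_count(transactions, query)
```

The estimate needs only the stored per-attribute frequencies, which is why it can answer in real time, whereas the exact count requires a scan of the data. The gap between `estimate` and `truth` comes from correlations between attributes that the independence assumption ignores; per the abstract, the Chow-Liu tree and MRF models are the alternatives studied for capturing such structure.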


