Finding the True Frequent Itemsets

01/07/2013
by   Matteo Riondato, et al.
0

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires to identify all itemsets appearing in at least a fraction θ of a transactional dataset D. Often though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1-δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) of the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/03/2002

Mining All Non-Derivable Frequent Itemsets

Recent studies on frequent itemset mining algorithms resulted in signifi...
research
06/16/2020

MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

We present MCRapper, an algorithm for efficient computation of Monte-Car...
research
08/11/2021

Parallel algorithms for mining of frequent itemsets

In the recent decade companies started collecting of large amount of dat...
research
04/21/2009

Fast Algorithms for Mining Interesting Frequent Itemsets without Minimum Support

Real world datasets are sparse, dirty and contain hundreds of items. In ...
research
01/30/2017

Comparing Dataset Characteristics that Favor the Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms

Frequent itemset mining is a popular data mining technique. Apriori, Ecl...
research
01/20/2022

FreSCo: Mining Frequent Patterns in Simplicial Complexes

Simplicial complexes are a generalization of graphs that model higher-or...
research
06/18/2018

Mining frequent items in unstructured P2P networks

Large scale decentralized systems, such as P2P, sensor or IoT device net...

Please sign up or login with your details

Forgot password? Click here to reset