Using Background Knowledge to Rank Itemsets

02/08/2019
by Nikolaj Tatti, et al.

Assessing the quality of discovered results is an important open problem in data mining. Such assessment is particularly vital when mining itemsets, since many of the discovered patterns can often be easily explained by background knowledge. The simplest approach to screening out uninteresting patterns is to compare the observed frequency against the prediction of the independence model. Since the parameters of the independence model are the column margins, we can view such screening as a way of using the column margins as background knowledge. In this paper we study more flexible techniques for infusing background knowledge. Namely, we show that we can efficiently use additional knowledge such as row margins, lazarus counts, and bounds of ones. We demonstrate that these statistics describe forms of data that occur in practice and have been studied in data mining. To infuse the information efficiently we use a maximum entropy approach. In its general setting, solving a maximum entropy model is infeasible, but we demonstrate that for our setting it can be solved in polynomial time. Experiments show that more sophisticated models fit the data better and that using more information improves the frequency prediction of itemsets.
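
The independence-model screening mentioned in the abstract can be illustrated with a minimal sketch. It assumes the data is a binary transaction matrix (rows are transactions, columns are items); the function names and the toy matrix are hypothetical and are not taken from the paper. The sketch covers only the column-margin baseline, not the maximum entropy models with row margins, lazarus counts, or bounds of ones.

```python
import numpy as np


def independence_estimate(data, itemset):
    """Predicted itemset frequency if items were independent.

    The only parameters are the column margins (per-item frequencies),
    so the prediction is the product of the margins of the items.
    """
    margins = data.mean(axis=0)
    return float(np.prod(margins[list(itemset)]))


def observed_frequency(data, itemset):
    """Fraction of rows in which every item of the itemset is present."""
    return float(data[:, list(itemset)].all(axis=1).mean())


# Toy binary dataset (illustrative only): items 0 and 1 always co-occur.
data = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
])

itemset = (0, 1)
expected = independence_estimate(data, itemset)
observed = observed_frequency(data, itemset)

# A large gap means the column margins alone do not explain the itemset.
deviation = observed - expected
print(f"expected={expected:.2f} observed={observed:.2f} deviation={deviation:+.2f}")
```

In this toy example the two items each appear in half of the rows, so independence predicts a joint frequency of one quarter, while the items actually co-occur in half of the rows. An itemset whose observed frequency sits close to the prediction would be screened out as explained by the column margins.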

research
02/18/2019

Comparing Apples and Oranges: Measuring Differences between Data Mining Results

Deciding whether the results of two different mining algorithms provide ...
research
02/05/2019

Discovering bursts revisited: guaranteed optimization of the model parameters

One of the classic data mining tasks is to discover bursts, time interva...
research
02/18/2019

Comparing Apples and Oranges: Measuring Differences between Exploratory Data Mining Results

Deciding whether the results of two different mining algorithms provide ...
research
06/16/2020

Tell Me Something I Don't Know: Randomization Strategies for Iterative Data Mining

There is a wide variety of data mining methods available, and it is gene...
research
02/04/2019

Ranking Episodes using a Partition Model

One of the biggest setbacks in traditional frequent pattern mining is th...
research
11/09/2020

Binary Matrix Factorisation via Column Generation

Identifying discrete patterns in binary data is an important dimensional...
research
04/24/2019

Maximum Entropy Based Significance of Itemsets

We consider the problem of defining the significance of an itemset. We s...
