Probably the Best Itemsets

02/07/2019
by Nikolaj Tatti, et al.
UAntwerpen

One of the main current challenges in itemset mining is to discover a small set of high-quality itemsets. In this paper we propose a new and general approach for measuring the quality of itemsets. The measure is solidly founded in Bayesian statistics and decreases monotonically, allowing for efficient discovery of all interesting itemsets. The measure is defined by connecting statistical models and collections of itemsets. This allows us to score individual itemsets with the probability of them occurring in random models built on the data. As a concrete example of this framework we use exponential models. This class of models possesses many desirable properties. Most importantly, Occam's razor in Bayesian model selection provides a defence against the pattern explosion. As general exponential models are infeasible in practice, we use decomposable models, a large sub-class for which the measure is solvable. For the actual computation of the score we sample models from the posterior distribution using an MCMC approach. Experiments demonstrate that the measure works in practice and results in interpretable and insightful itemsets for both synthetic and real-world data.


1 Introduction

Discovering frequent itemsets is one of the most active fields in data mining. As a measure of quality, frequency has many desirable properties: it is easy to interpret and, since it decreases monotonically, there exist efficient algorithms for discovering large collections of frequent itemsets [2]. However, frequency also has serious drawbacks. A frequent itemset may be uninteresting if its elevated frequency is caused by frequent singletons. On the other hand, some non-frequent itemsets could be interesting. Another drawback is the problem of pattern explosion when mining with a low threshold.
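Since the paper only references the classical level-wise algorithm [2], the following short Python sketch is our own illustration of it (Apriori-style mining that exploits the downward closure of frequency); none of the names come from the paper.

    from itertools import combinations

    def frequent_itemsets(transactions, min_support):
        """Level-wise mining: a k-itemset can be frequent only if all its
        (k-1)-subsets are frequent, so candidates are pruned aggressively."""
        n = len(transactions)
        items = sorted({a for t in transactions for a in t})
        # Level 1: frequent singletons.
        current = {frozenset([a]) for a in items
                   if sum(a in t for t in transactions) / n >= min_support}
        frequent = set(current)
        k = 2
        while current:
            # Join step: candidates of size k built from frequent (k-1)-itemsets.
            candidates = {x | y for x in current for y in current if len(x | y) == k}
            # Prune step: all (k-1)-subsets must already be frequent.
            candidates = {c for c in candidates
                          if all(frozenset(s) in current for s in combinations(c, k - 1))}
            current = {c for c in candidates
                       if sum(c <= set(t) for t in transactions) / n >= min_support}
            frequent |= current
            k += 1
        return frequent

    print(frequent_itemsets([{'a', 'b'}, {'a', 'b', 'c'}, {'a'}], min_support=0.6))
    # {a}, {b}, and {a, b} are frequent; {c} and anything containing it are not.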

Many different quality measures have been suggested to overcome the mentioned problems (see Section 5 for a more detailed discussion). Usually these measures compare the observed frequency to some expected value derived, for example, from the independence model. Using such measures we may obtain better results. However, these approaches still suffer from pattern explosion. To illustrate the problem, assume that two items, say $a$ and $b$, are correlated, and hence the pair is considered significant. Then any itemset containing $a$ and $b$ will also be considered significant.

Example 1

Assume a dataset with $K$ items such that items $a$ and $b$ always have identical values and the rest of the items are independently distributed. Assume that we apply some statistical method to evaluate the significance of itemsets by using the independence model as the ground truth. If an itemset contains $a$ and $b$, its frequency will be higher than the estimate given by the independence model. Hence, given enough data, the P-value of the statistical test will go to $0$, and we will conclude that the itemset is interesting. Consequently, we will find all $2^{K-2}$ itemsets containing both $a$ and $b$ interesting.

In this work we approach the problem of defining quality measure from a novel point of view. We construct a connection between itemsets and statistical models and use this connection to define a new quality measure for itemsets. To motivate this approach further, let us consider the following example.

Example 2

Consider a binary dataset with $K$ items generated from the independence model. We argue that if we know that the data comes from the independence model, then the only interesting itemsets are the singletons. The reasoning behind this claim is that the frequencies of singletons correspond exactly to the column margins, the parameters of the independence model. Once we know the singleton frequencies, there is nothing left in the data that would be statistically interesting.

Let us consider a more complicated example. Assume that the data is generated from a Chow-Liu tree model [4]. Again, if we know that the data is generated from this model, then we argue that the interesting itemsets are the singletons together with the parent-child pairs of the tree. The reasoning here is the same as with the independence model: if we know the frequencies of these itemsets we can derive the parameters of the distribution. For example, the frequency of a parent-child pair, together with the singleton frequencies, determines the corresponding conditional probability in the tree.

Let us now demonstrate that this approach will produce much smaller and more meaningful output than the method given in Example 1.

Example 3

Consider the data given in Example 1. To fully describe the data we only need to know the frequencies of the singletons and the fact that $a$ and $b$ are identical. This information can be expressed by outputting the frequencies of the $K$ singleton itemsets and the frequency of the itemset $ab$. This gives us $K+1$ interesting patterns in total.

Our approach is to extend the idea pitched in the preceding example to a general itemset mining framework. In the example we knew which model generated the data; in practice, we typically do not. To solve this we will use the Bayesian approach, and instead of considering just one specific model, we will consider a large collection of models, namely exponential models. A virtue of these models is that we can naturally connect each model to certain itemsets. A model $M$ has a posterior probability $p(M \mid D)$, that is, how probable the model is given the data. The score of a single itemset is then the probability of it being a parameter of a random model given the data. This setup fits the preceding example perfectly. If we have strong evidence that the data comes from the independence model, say $M_{\mathit{ind}}$, then the posterior probability $p(M_{\mathit{ind}} \mid D)$ will be close to $1$, and the posterior probability of any other model will be close to $0$. Since the independence model is connected to the singletons, the score for the singletons will be close to $1$ and the score for any other itemset will be close to $0$.

Interestingly, using statistical models for defining significant itemsets provides an approach to the problem of pattern set explosion (see Section 3.2 for more technical details). Bayesian model selection has an in-built Occam's razor, favoring simple models over complex ones. Our connection between models and itemsets is such that simple models correspond to small collections of itemsets. As a result, only a small collection of itemsets will be considered interesting, unless the data provides a sufficient amount of evidence.

Our contribution in the paper is two-fold. First, we introduce a general framework for using statistical models to score itemsets in Section 2. Secondly, we provide an example of this framework in Section 3 by using exponential models, and provide solid theoretical evidence that our choices are well-founded. We provide the sampling algorithm in Section 4. We discuss related work in Section 5 and present our experiments in Section 6. Finally, we conclude our work in Section 7. The proofs are given in the appendix. The implementation is provided for research purposes (http://adrem.ua.ac.be/implementations).

2 Significance of Itemsets by Statistical Models

As we discussed in the introduction, our goal is to define a quality measure for itemsets using statistical models. In this section we provide a general framework for such a score. We will define the actual models in the next section.

We begin with some preliminary definitions and notation. In our setup a binary dataset $D$ is a collection of $n$ transactions, binary vectors of length $K$. We assume that these vectors are independently generated from some unknown distribution. Such a dataset can be represented by a binary matrix of size $n \times K$. By an attribute $a_i$ we mean a Bernoulli random variable corresponding to the $i$th column of the data. We denote the set of attributes by $A$.

An itemset $X$ is simply a subset of $A$. Given an itemset $X$ and a transaction $t$ we denote by $t_X$ the projection of $t$ onto $X$. We say that $t$ covers $X$ if all elements of $t_X$ are equal to 1.

We say that a collection of itemsets is downward closed if for each member any sub-itemset is also included. This property plays a crucial role in mining frequent patterns since it allows effective candidate pruning in a level-wise approach and branch pruning in a DFS approach; a small check of the property is sketched below.
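As a side illustration (our own, not from the paper), the following Python sketch spells out what the downward-closed property requires of a family of itemsets.

    from itertools import combinations

    def is_downward_closed(family):
        """A family of itemsets (frozensets) is downward closed if every
        proper subset of every member is also a member."""
        family = set(family)
        for itemset in family:
            for k in range(1, len(itemset)):
                for subset in combinations(itemset, k):
                    if frozenset(subset) not in family:
                        return False
        return True

    # {a}, {b}, {a,b} is downward closed; {a}, {b}, {a,b,c} is not (missing {c}).
    print(is_downward_closed({frozenset('a'), frozenset('b'), frozenset('ab')}))   # True
    print(is_downward_closed({frozenset('a'), frozenset('b'), frozenset('abc')}))  # False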

Our next step is to define the correspondence between statistical models and families of itemsets. Assume that we have a set $\mathcal{M}$ of statistical models for the data. We will discuss in later sections what specific models we are interested in, but for the moment, we will keep our discussion on a high level. Each model $M \in \mathcal{M}$ has a posterior probability $p(M \mid D)$, that is, how probable the model is given the data $D$. To link the models to families of itemsets we assume that we have a function $f$ that identifies a model with a downward closed family of itemsets. As we will see later on, there is one particular natural choice for such a function.

Now that we have our models connected to certain families of itemsets, we are ready to define a score for individual itemsets. The score for an itemset $X$ is the posterior probability of $X$ being a member of the family of itemsets,

$$\mathit{score}(X) \;=\; P\bigl(X \in f(M) \mid D\bigr) \;=\; \sum_{M \in \mathcal{M},\; X \in f(M)} p(M \mid D). \qquad (1)$$

The motivation for such a score is as follows. If we are sure that some particular model $M$ is the correct model for $D$, then the posterior probability of that model will be close to $1$ and the posterior probabilities of the other models will be close to $0$. Consequently, the score of an itemset $X$ will be close to $1$ if $X \in f(M)$, and close to $0$ otherwise.

Naturally, the pivotal choice behind this score is the mapping $f$. Such a mapping needs to be statistically well-founded, and in particular the size of an itemset family should be reflected in the complexity of the corresponding model. We will see in the following section that a particular choice for the model family and the mapping has these properties and leads to certain important qualities.

Proposition 4

The score decreases monotonically, that is, $X \subseteq Y$ implies $\mathit{score}(X) \ge \mathit{score}(Y)$.

We allow $f$ to map only onto downward closed families. Hence, for every model $M$, $Y \in f(M)$ implies $X \in f(M)$, and consequently $\mathit{score}(Y) \le \mathit{score}(X)$. This completes the proof.
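Written out with the sum from Eq. 1 and the notation used above, the argument is a one-line chain:

\[
  \mathit{score}(Y)
    \;=\; \sum_{M \in \mathcal{M},\; Y \in f(M)} p(M \mid D)
    \;\le\; \sum_{M \in \mathcal{M},\; X \in f(M)} p(M \mid D)
    \;=\; \mathit{score}(X),
\]

where the inequality holds because every downward closed family containing $Y$ also contains its subset $X$, so the second sum runs over at least the same models as the first.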

3 Exponential Models

In this section we will make our framework more concrete by providing a specific set of statistical models and the function identifying the models with families of itemsets. We will first give the definition of the models and the mapping. After this, we point out the main properties of our model and justify our choices. However, as it turns out, computing the score for general exponential models is infeasible, so we solve this problem by considering decomposable models.

3.1 Definition of the models

Models of exponential form have been studied exhaustively in statistics, and have been shown to have good theoretical and practical properties. In our case, using exponential models provides a natural way of describing the dependencies between the variables. In fact, the exponential model class contains many natural models such as the independence model, the Chow-Liu tree model, and the discrete Gaussian model. Finally, such models have been used successfully for predicting itemset frequencies [14] and ranking itemsets [18].

In order to define our model, let $\mathcal{F}$ be a downward closed family of itemsets containing all singletons. For an itemset $X$ we define an indicator function $S_X$ mapping a transaction $t$ to a binary value. If the transaction covers $X$, then $S_X(t) = 1$, and $S_X(t) = 0$ otherwise. We define the exponential model $M_{\mathcal{F}}$ associated with $\mathcal{F}$ to be the collection of distributions having the exponential form

$$p(t) \;=\; \exp\Bigl(\sum_{X \in \mathcal{F} \cup \{\emptyset\}} \theta_X S_X(t)\Bigr),$$

where $\theta_X$ is a parameter, a real value, for an itemset $X$. The model $M_{\mathcal{F}}$ also contains all the distributions that can be obtained as limits of distributions having the exponential form. This technicality is needed to handle distributions with zero probabilities. Since the indicator function $S_\emptyset$ is equal to $1$ for any $t$, the parameter $\theta_\emptyset$ acts like a normalization constant. The rest of the parameters form a parameter vector of length $|\mathcal{F}|$. Naturally, we set $f(M_{\mathcal{F}}) = \mathcal{F}$.

Example 5

Assume that $\mathcal{F}$ consists only of singleton itemsets. Then the corresponding model has the form

$$p(t) \;=\; \exp\Bigl(\theta_\emptyset + \sum_{a \in A} \theta_a S_a(t)\Bigr). \qquad (2)$$

Since $S_a(t)$ depends only on $t_a$, the model factorizes over the attributes and is actually the independence model. The other extreme is when $\mathcal{F}$ consists of all itemsets. Then we can show that the corresponding model contains all possible distributions, that is, the model is in fact the parameter-free model.

As an intermediate example, the tree model in Example 2 is also an exponential model, with the corresponding family consisting of the singletons and the parent-child pairs of the tree.

The intuition behind the model is that when an itemset $X$ is an element of $\mathcal{F}$, then the dependencies between the items in $X$ are considered important in the corresponding model. For example, if $\mathcal{F}$ consists only of singletons, then there are no important correlations, hence the corresponding model should be the independence model. On the other hand, in the tree model given in Example 2 the important correlations are the parent-child item pairs. These are exactly the itemsets (along with the singletons) that correspond to the model.

Our choice of models is particularly good because the complexity of a model reflects the size of its itemset family. Since the Bayesian approach has an in-built tendency to punish complex families (we will see this in Section 3.2), we are punishing large families of itemsets. If the data states that simple models are sufficient, then the probability of complex models will be low, and consequently the score for large itemsets will also be low. In other words, we cast the problem of pattern set explosion as a model overfitting problem and use Occam's razor to punish the complex models.

3.2 Computing the Model

Now that we have defined our model $M_{\mathcal{F}}$, our next step is to compute the posterior probability $p(M_{\mathcal{F}} \mid D)$, that is, the probability of $M_{\mathcal{F}}$ given the dataset $D$. We select the model prior to be uniform. Recall that in Eq. 2 a model has a set of parameters, that is, to pinpoint a single distribution in $M_{\mathcal{F}}$ we need a set of parameters $\theta$. Following the Bayesian approach, to compute $p(M_{\mathcal{F}} \mid D)$ we need to marginalize out the nuisance parameters $\theta$,

$$p(M_{\mathcal{F}} \mid D) \;\propto\; p(M_{\mathcal{F}}) \int p(D \mid \theta, M_{\mathcal{F}})\, p(\theta \mid M_{\mathcal{F}})\, d\theta.$$

In the general case, this integral is too complex to solve analytically, so we employ the popular BIC estimate [15],

$$\log p(M_{\mathcal{F}} \mid D) \;=\; \log p(D \mid \theta^*, M_{\mathcal{F}}) \;-\; \frac{|\mathcal{F}|}{2} \log n \;+\; C, \qquad (3)$$

where $C$ is a constant and $\theta^*$ is the maximum likelihood estimate of the model parameters. This estimate is correct as $n$ approaches infinity [15]. So instead of computing a complex integral, our challenge is to discover the maximum likelihood estimate and compute the likelihood of the data. Unfortunately, using such a model is an NP-hard problem (see, for example, [17]). We will remedy this problem in Section 3.4 by considering a large subclass of exponential models for which the maximum likelihood can be easily computed.

3.3 Justifications for Exponential Model

In this section we will provide strong theoretical justification for our choices and show that our score fulfills the goals we set in the introduction.

We saw in Example 2 that if the data comes from the independence model, then we only need the frequencies of the singleton itemsets to completely explain the underlying model. The next theorem shows that this holds in the general case.

Theorem 6

Assume that the data is generated from a distribution that comes from an exponential model $M_{\mathcal{F}}$. We can derive the maximum likelihood estimate from the frequencies of the itemsets in $\mathcal{F}$. Moreover, as the number of transactions goes to infinity, we can derive the true distribution from the frequencies of the itemsets in $\mathcal{F}$.

The preceding theorem showed that $\mathcal{F}$ is a sufficient family of itemsets for deriving the correct true distribution. The next theorem shows that we favor small families: if the data can be explained with a simpler model, that is, using fewer itemsets, then the simpler model will be chosen and, consequently, redundant itemsets will have a low score.

Theorem 7

Assume that the data is generated from a distribution that comes from a model $M_{\mathcal{F}}$, and that $M_{\mathcal{F}}$ is the simplest such model, that is, any other model containing this distribution corresponds to a larger family of itemsets. Then the following holds: as the number of data points in $D$ goes to infinity, $\mathit{score}(X) \to 1$ if $X \in \mathcal{F}$, and $\mathit{score}(X) \to 0$ otherwise.

3.4 Decomposable Models

We saw in Section 3.2 that in practice we cannot compute the score for general exponential models. In this section we study a subclass of exponential models for which we can easily compute the needed score. Roughly speaking, a decomposable model is an exponential model whose maximal itemsets can be arranged into a specific tree, called a junction tree. By considering only decomposable models we obviously lose some models; for example, the discrete Gaussian model, that is, the model corresponding to all itemsets of size 1 and 2, is not decomposable. On the other hand, many interesting and practically relevant models are decomposable, for example Chow-Liu trees. Finally, these models are closely related to Bayesian networks and Markov random fields (see [5] for more details).

To define a decomposable model, let $\mathcal{F}$ be a downward closed family of itemsets. We write $\max(\mathcal{F})$ for the set of maximal itemsets of $\mathcal{F}$. Assume that we can build a tree $T$ using the itemsets of $\max(\mathcal{F})$ as nodes with the following property: if two maximal itemsets $X$ and $Y$ have a common item, say $a$, then $X$ and $Y$ are connected in $T$ (by a unique path) and every itemset along that path contains $a$. If this property holds for $T$, then $T$ is called a junction tree and $\mathcal{F}$ is decomposable. We will use $E(T)$ to denote the edges of the tree.

Not all families have junction trees and some families may have multiple junction trees.
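To make the requirement concrete, here is a small Python sketch (our own illustration with hypothetical function names) that checks whether a given tree over maximal itemsets satisfies the running-intersection property.

    from collections import deque

    def is_junction_tree(nodes, edges):
        """nodes: list of frozensets (maximal itemsets); edges: list of index pairs.
        Checks the running intersection property: for every item, the nodes
        containing it must form a connected subtree of the given tree."""
        adjacency = {i: set() for i in range(len(nodes))}
        for i, j in edges:
            adjacency[i].add(j)
            adjacency[j].add(i)
        for item in set().union(*nodes):
            holders = {i for i, node in enumerate(nodes) if item in node}
            # BFS restricted to nodes containing `item`; it must reach all holders.
            start = next(iter(holders))
            seen, queue = {start}, deque([start])
            while queue:
                current = queue.popleft()
                for neighbour in adjacency[current]:
                    if neighbour in holders and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
            if seen != holders:
                return False
        return True

    # The three pairs over items {a, b, c} admit no junction tree:
    triangle = [frozenset('ab'), frozenset('bc'), frozenset('ca')]
    print(is_junction_tree(triangle, [(0, 1), (1, 2)]))  # False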

Example 8

(a) Decomposable family of itemsets

(b) Decomposable family after merge

(c) Decomposable family after the second merge
Figure 1: Figure 1(a) shows that the itemset family given in Example 2 is decomposable. Figure 1(b) shows the junction tree for the family after the first merge and Figure 1(c) shows the junction tree after the second merge.

Let $\mathcal{F}$ be the family of itemsets connected to the Chow-Liu model given in Example 2. The maximal itemsets are the parent-child pairs of the tree. Figure 1(a) shows a junction tree for this family, making the family decomposable. On the other hand, the family consisting of all singletons and all three pairs over three items (without the triple itself) is not decomposable, since there is no junction tree for this family.

The most important property of decomposable families is that we can compute the maximum likelihood efficiently. We first define the entropy of an itemset $X$, denoted by $H(X)$, as

$$H(X) \;=\; -\sum_{v \in \{0,1\}^{|X|}} q_X(v) \log q_X(v),$$

where $q_X$ is the empirical distribution of $t_X$ in the data.

Theorem 9

Let $\mathcal{F}$ be a decomposable family and let $T$ be its junction tree. The maximum log-likelihood is equal to

$$\log p(D \mid \theta^*, M_{\mathcal{F}}) \;=\; -n \Bigl( \sum_{X \in \max(\mathcal{F})} H(X) \;-\; \sum_{(X, Y) \in E(T)} H(X \cap Y) \Bigr).$$

Example 10

Assume that our model space consists of only two models, namely the tree model $M_T$ given in Example 2 and the independence model, which we denote $M_{\mathit{ind}}$. Assume also that we have a dataset $D$ with $n$ transactions.

To compute the probabilities $p(M_{\mathit{ind}} \mid D)$ and $p(M_T \mid D)$, we need to know the entropies of certain itemsets.

The log-likelihood of the independence model is equal to $-n$ times the sum of the singleton entropies.

We use the junction tree given in Figure 1(a) and Theorem 9 to compute the log-likelihood of $M_T$.

Note the number of free parameters in each of the two models. Thus Eq. 3 yields the BIC estimates of the two posteriors.

We get the final probabilities by noticing that the two posteriors must sum to one, which gives us $p(M_{\mathit{ind}} \mid D)$ and $p(M_T \mid D)$. Consequently, the score is equal to $1$ for the singletons, $p(M_T \mid D)$ for the parent-child pairs of the tree, and $0$ for every other itemset.
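The following Python sketch replays this kind of two-model comparison with arbitrary placeholder entropies; the numbers are ours and purely illustrative, and the code assumes the BIC form of Eq. 3 together with the junction-tree likelihood of Theorem 9.

    import math

    n = 1000                                            # number of transactions (placeholder)
    # Placeholder empirical entropies (in nats) for singletons and tree edges.
    H_single = {'a': 0.65, 'b': 0.60, 'c': 0.62}
    H_pair = {('a', 'b'): 1.05, ('b', 'c'): 1.10}       # maximal itemsets of the tree
    separators = ['b']                                  # intersections along the junction tree

    def bic(log_likelihood, num_params):
        """Eq. 3: maximum log-likelihood minus (number of parameters / 2) * log n."""
        return log_likelihood - 0.5 * num_params * math.log(n)

    # Independence model: log-likelihood is -n * sum of singleton entropies.
    bic_ind = bic(-n * sum(H_single.values()), num_params=len(H_single))

    # Tree model (Theorem 9): -n * (sum of clique entropies - sum of separator entropies).
    ll_tree = -n * (sum(H_pair.values()) - sum(H_single[s] for s in separators))
    bic_tree = bic(ll_tree, num_params=len(H_single) + len(H_pair))

    # Posteriors over the two-model space, obtained by normalizing the BIC scores.
    p_tree = 1.0 / (1.0 + math.exp(bic_ind - bic_tree))
    p_ind = 1.0 - p_tree
    print(p_ind, p_tree)   # singletons score 1, tree edges score p_tree, others 0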

4 Sampling Models

Now that we have the means for computing the posterior probability of a single decomposable model, our next step is to compute the score of an itemset, namely the sum in Eq. 1. The problem is that this sum has an exponential number of terms, and hence we cannot compute it by enumerating all possible families. We approach this problem from a different point of view. Instead of computing the score for each itemset individually, we divide our mining method into two steps:

  1. Sample random decomposable models from the posterior distribution .

  2. Estimate the true score of an itemset from the fraction of sampled families of itemsets in which the itemset occurs, as sketched below.
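A minimal Python sketch of step 2 (the names are ours; `sampled_families` stands for the itemset families of the models drawn in step 1):

    def estimate_scores(sampled_families, candidate_itemsets):
        """Monte Carlo estimate of Eq. 1: the score of an itemset is the
        fraction of sampled downward closed families that contain it."""
        k = len(sampled_families)
        scores = {}
        for itemset in candidate_itemsets:
            hits = sum(itemset in family for family in sampled_families)
            scores[itemset] = hits / k
        return scores

    # Example with two sampled families over items a, b:
    families = [
        {frozenset('a'), frozenset('b'), frozenset('ab')},
        {frozenset('a'), frozenset('b')},
    ]
    print(estimate_scores(families, [frozenset('a'), frozenset('ab')]))
    # {a} -> 1.0, {a, b} -> 0.5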

4.1 Moving from One Model to Another

In order to sample we will use an MCMC approach, modifying the current decomposable family by two possible operations, namely

  • Merge: Select two maximal itemsets, say $V$ and $W$, and let $Y = V \cap W$. Since $V$ and $W$ are maximal, $V \setminus Y \ne \emptyset$ and $W \setminus Y \ne \emptyset$. Select $a \in V \setminus Y$ and $b \in W \setminus Y$. Add the new itemset $Y \cup \{a, b\}$ into the family along with all possible sub-itemsets. We will use the notation $\mathit{merge}(Y; a, b)$ to denote this operation.

  • Split: Select a maximal itemset $X$ and two items $a, b \in X$. Delete from the family all itemsets containing $a$ and $b$ simultaneously. We will denote this operation by $\mathit{split}(X; a, b)$.

Naturally, not all splits and merges are legal, since some operations may result in a family that is not decomposable, or even downward closed.

Example 11

The family given in Figure 1(b) is obtained from the family given in Figure 1(a) by performing a merge. Moreover, the family in Figure 1(c) is obtained from the one in Figure 1(b) by performing a second merge. Conversely, we can go back by performing the corresponding splits in the reverse order.

The next theorem tells us which splits are legal.

Theorem 12

Let $\mathcal{F}$ be a decomposable family, let $X \in \max(\mathcal{F})$, and let $a, b \in X$. Then the family resulting from the split operation $\mathit{split}(X; a, b)$ is decomposable if and only if there is no other maximal itemset in $\mathcal{F}$ containing $a$ and $b$ simultaneously.

Example 13

All possible split combinations are legal in the families given in Figure 1(a) and Figure 1(b). However, in the family given in Figure 1(c) some splits are illegal, because the corresponding pair of items is contained in more than one maximal itemset.

In order to identify legal merges, we will need some additional structure. Let $\mathcal{F}$ be a downward closed family and let $\max(\mathcal{F})$ be its maximal itemsets. Let $Y$ be an itemset. We construct a reduced family, denoted by $\mathit{red}(\mathcal{F}, Y)$, with the following procedure. Let us first collect the sets $V \setminus Y$ for every maximal itemset $V$ that properly contains $Y$. To obtain the reduced family from this collection, assume there are two itemsets $V$ and $W$ in it such that $V \cap W \ne \emptyset$. We remove these two sets from the collection and replace them with $V \cup W$. This is continued until no such replacements are possible. We ignore any reduced family that contains fewer than two itemsets. The reason for this will be seen in Theorem 15, which implies that such families do not induce any legal merges.

Example 14

The non-trivial reduced families of the family given in Figure 1(a) are and . Similarly, the reduced families for the family given in Figure 1(b) are , and . Finally, the reduced families for the family given in Figure 1(c) are and .

The next theorem tells us when is legal.

Theorem 15

Let $\mathcal{F}$ be a decomposable family. A merge operation $\mathit{merge}(Y; a, b)$ is legal, that is, the family is still decomposable after adding $Y \cup \{a, b\}$, if and only if there are sets $V, W \in \mathit{red}(\mathcal{F}, Y)$ with $V \ne W$ such that $a \in V$ and $b \in W$.

Example 16

The family in Figure 1(b) is obtained from the family in Figure 1(a) by a merge. This is a legal operation since the corresponding reduced family contains two distinct sets, one containing each of the merged items. Similarly, the merge transforming the family in Figure 1(b) into the one in Figure 1(c) is legal for the same reason. However, that merge would not be legal in the family of Figure 1(a), since its reduced family does not yet contain the required sets.

4.2 MCMC Sampling Algorithm

Sampling requires a proposal distribution. Let $M$ be the current model. We denote the number of legal operations, either a split or a merge, by $c(M)$. Let $M'$ be a model obtained by sampling uniformly one of the legal operations and applying it to $M$. The probability of reaching $M'$ from $M$ with a single step is $1 / c(M)$. Similarly, the probability of reaching $M$ from $M'$ with a single step is $1 / c(M')$. Consequently, if we sample $u$ uniformly from the interval $[0, 1]$ and accept the step moving from $M$ to $M'$ if and only if $u$ is smaller than

$$\frac{p(M' \mid D)\, c(M)}{p(M \mid D)\, c(M')}, \qquad (4)$$

then the limit distribution of the MCMC will be the posterior distribution $p(M \mid D)$, provided that the MCMC chain is ergodic. The next theorem shows that this is the case.
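A minimal sketch of this acceptance step in Python, assuming we already have the BIC log-posteriors and the counts of legal operations for both models (all names are ours):

    import math
    import random

    def accept_move(bic_current, bic_proposed, ops_current, ops_proposed):
        """Eq. 4: accept M -> M' if a uniform u falls below
        p(M'|D) c(M) / (p(M|D) c(M')), computed from BIC log-posteriors."""
        log_ratio = (bic_proposed - bic_current) \
                    + math.log(ops_current) - math.log(ops_proposed)
        return random.random() < math.exp(min(0.0, log_ratio))

    # Example: the proposed model scores much better and has as many legal moves.
    print(accept_move(bic_current=-1880.4, bic_proposed=-1567.3,
                      ops_current=12, ops_proposed=12))   # True (almost surely)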

Theorem 17

Any decomposable model can be reached from any other model by a sequence of legal operations.

Our first step is to compute the ratio of the models given in Eq. 4. To do that we will use the BIC estimate given in Eq. 3 and Theorem 9. Let us first define a gain function in terms of the entropies of the itemsets involved, where $Y$ is an itemset and $a$ and $b$ are items.

Theorem 18

Let $M$ be a decomposable model and let $M'$ be a model obtained from $M$ by a legal split. Then the BIC estimate of $M'$ is obtained from the BIC estimate of $M$ by adding the entropy-based gain of the split together with the change in the BIC penalty.

Similarly, if $M'$ is obtained from $M$ by a legal merge, then the BIC estimate changes by the corresponding gain with the opposite sign.

To compute the gain we need the entropies of four itemsets. Let $X$ be an itemset. To compute $H(X)$ we first order the transactions in $D$ so that the values corresponding to $X$ are in lexicographical order. This is done with the radix sort given in Algorithm 1, which runs in $O(|X|\,|D|)$ time. After the data is sorted we can compute the entropy with a single data scan: set $c = 1$ and $e = 0$. If the projection $t_X$ of the current transaction is equal to that of the previous transaction, we increase $c$ by one; otherwise we add $c \log c$ to $e$ and reset $c$ to one. Once the scan is finished (and the last run has been added), the entropy is obtained as $H(X) = \log |D| - e / |D|$. The pseudo code for computing the entropy is given in Algorithm 2.

1 if $D = \emptyset$ or $X = \emptyset$ then return $D$;
2 $a \leftarrow$ first item in $X$;
3 $D_1 \leftarrow \{t \in D : t_a = 1\}$; $D_0 \leftarrow \{t \in D : t_a = 0\}$;
4 $D_1 \leftarrow \mathrm{Sort}(D_1, X \setminus \{a\})$; $D_0 \leftarrow \mathrm{Sort}(D_0, X \setminus \{a\})$;
5 return $D_1$ concatenated with $D_0$.
Algorithm 1 Sort. Routine for sorting the transactions. Used by Entropy as a pre-step for computing the entropy.
1 $D \leftarrow \mathrm{Sort}(D, X)$;
2 $c \leftarrow 1$; $e \leftarrow 0$;
3 $s \leftarrow$ first transaction in $D$;
4 foreach remaining transaction $t$ in $D$ do
5       if $t_X \ne s_X$ then
6             $e \leftarrow e + c \log c$;
7             $c \leftarrow 1$;
8             $s \leftarrow t$;
9
10      else
11             $c \leftarrow c + 1$;
12
13
14 $e \leftarrow e + c \log c$;
15 return $\log |D| - e / |D|$;
Algorithm 2 Entropy. Computes the entropy $H(X)$ of $X$ from the dataset $D$.
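For reference, here is an equivalent Python sketch of Algorithms 1 and 2 combined; it is our own illustration and uses Python's built-in sort in place of the radix sort.

    import math

    def itemset_entropy(transactions, itemset):
        """Entropy H(X) of the data projected onto `itemset`, computed by
        sorting the projections and counting runs of identical rows."""
        items = sorted(itemset)
        projections = sorted(tuple(t[i] for i in items) for t in transactions)
        n = len(projections)
        entropy_sum, run = 0.0, 1
        for previous, current in zip(projections, projections[1:]):
            if current == previous:
                run += 1
            else:
                entropy_sum += run * math.log(run)
                run = 1
        entropy_sum += run * math.log(run)            # add the final run
        return math.log(n) - entropy_sum / n

    # Two items that always agree: only two distinct rows occur, so H = log 2.
    data = [{'a': 1, 'b': 1}, {'a': 0, 'b': 0}, {'a': 1, 'b': 1}, {'a': 0, 'b': 0}]
    print(itemset_entropy(data, {'a', 'b'}))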

Our final step is to compute $c(M)$ and actually sample the operations. To do that we first write $c_s(M)$ for the total number of legal split operations and let $c_s(X)$ be the number of legal split operations using the maximal itemset $X$. Similarly, we write $c_m(Y)$ for the number of legal merges using $Y$ and $c_m(M)$ for the total number of legal merges.

Given a maximal itemset $X$ we build an occurrence table, which we denote by $o_X$, of size $|X| \times |X|$. For $a, b \in X$, the entry $o_X(a, b)$ of the table is the number of maximal itemsets containing $a$ and $b$. If $o_X(a, b) = 1$, then Theorem 12 states that $\mathit{split}(X; a, b)$ is legal. Consequently, to sample a split operation we first select a maximal itemset $X$ weighted by $c_s(X)$. Once $X$ is selected we select uniformly one legal pair $a, b \in X$.
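The occurrence table and the resulting legal splits can be sketched as follows (Python, our own illustration):

    from itertools import combinations
    from collections import Counter

    def legal_splits(maximal_itemsets):
        """For each maximal itemset X, list the pairs (a, b) for which
        split(X; a, b) is legal, i.e. X is the only maximal itemset
        containing both a and b (Theorem 12)."""
        occurrence = Counter()
        for itemset in maximal_itemsets:
            for a, b in combinations(sorted(itemset), 2):
                occurrence[(a, b)] += 1
        splits = {}
        for itemset in maximal_itemsets:
            splits[itemset] = [(a, b) for a, b in combinations(sorted(itemset), 2)
                               if occurrence[(a, b)] == 1]
        return splits

    # A path-shaped family: every pair lies in exactly one maximal itemset,
    # so every split is legal.
    print(legal_splits([frozenset('ab'), frozenset('bc')]))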

To sample legal merges, recall that $\mathit{merge}(Y; a, b)$ involves selecting two maximal itemsets $V$ and $W$ such that $Y \subseteq V$, $Y \subseteq W$, $a \in V \setminus Y$, and $b \in W \setminus Y$. Instead of selecting these itemsets, we will directly sample an itemset $Y$ and then select two items $a$ and $b$. This sampling will work only if two legal merges result in two different outcomes whenever their parameters differ.

Theorem 19

Let $Y$ and $Y'$ be two different itemsets and let $a$, $b$, $a'$, and $b'$ be items. Assume that $\mathit{merge}(Y; a, b)$ and $\mathit{merge}(Y'; a', b')$ are legal merges for $\mathcal{F}$. Then the two merges produce different itemsets, $Y \cup \{a, b\} \ne Y' \cup \{a', b'\}$.

The construction of a reduced family guarantees that if $V, W \in \mathit{red}(\mathcal{F}, Y)$ and $V \ne W$, then $V \cap W = \emptyset$. It follows from Theorem 15 that the number of legal merges using $Y$ is determined by the sizes of the sets in $\mathit{red}(\mathcal{F}, Y)$.

To sample a merge we first sample an itemset $Y$ weighted by $c_m(Y)$. Once $Y$ is selected, we sample two different itemsets $V, W \in \mathit{red}(\mathcal{F}, Y)$ (weighted by $|V|$ and $|W|$). Finally, we sample $a \in V$ and $b \in W$.

Sampling for a merge operation is feasible only if the number of reduced families for which the merge degree is larger than zero is small.

Theorem 20

Let $K$ be the number of items. There are at most $K$ maximal itemsets, and the number of itemsets $Y$ for which the merge degree is positive is likewise bounded.

Pseudo-code for a sampling step is given in Algorithm 3.

1 $k \leftarrow$ random integer between 1 and $c(M)$;
2 if $k \le c_s(M)$ then
3       Sample $X$ from $\max(\mathcal{F})$ weighted by $c_s(X)$;
4       Sample $a, b \in X$ such that $X$ is the only maximal itemset containing both $a$ and $b$;
5       $M' \leftarrow$ model obtained from $M$ by $\mathit{split}(X; a, b)$;
6       Update the sampling structures for $M'$;
7
8 else
9       Sample $Y$ weighted by $c_m(Y)$;
10       Sample $V \in \mathit{red}(\mathcal{F}, Y)$ weighted by $|V|$;
11       Sample $W \in \mathit{red}(\mathcal{F}, Y)$, $W \ne V$, weighted by $|W|$;
12       Sample $a \in V$ and $b \in W$;
13       $M' \leftarrow$ model obtained from $M$ by $\mathit{merge}(Y; a, b)$;
14       Update the sampling structures for $M'$;
15
16 $u \leftarrow$ random real number from $[0, 1]$;
17 if $u$ is smaller than the ratio in Eq. 4 then return $M'$;
18
19 else return $M$;
20
Algorithm 3 MCMC step for sampling decomposable models.

4.3 Speeding Up the Sampling

We have demonstrated which structures we need in order to sample legal operations. After each step we could reconstruct these structures from scratch; in this section we show how to speed up the sampling by updating the structures incrementally using Algorithms 4-7.

First of all, we store only the maximal itemsets of $\mathcal{F}$. Theorem 20 states that there can be at most $K$ such sets, hence split and merge operations can be done efficiently.

During a split or a merge, we need to update which split operations are legal after the operation. We do this by updating the occurrence tables $o_X$; each such update is cheap. The next theorem shows which maximal itemsets we need to update for legal split operations after a merge.

Theorem 21

Let $\mathcal{F}$ be a downward closed family of itemsets and let $\mathcal{F}'$ be the family after performing $\mathit{merge}(Y; a, b)$. Let $X$ be a maximal itemset in $\mathcal{F}$. Then the legal split operations using $X$ remain unchanged during the merge unless $X$ is the unique itemset among the maximal itemsets in $\mathcal{F}$ containing either $Y \cup \{a\}$ or $Y \cup \{b\}$.

The following theorem tells us how reduced families should be updated after a merge operation. To ease the notation, let us denote by the unique itemset (if such exists) in containing .

Theorem 22

Let be a downward closed family of itemsets and let be the family after performing . Then the reduced families are updated as follows:

  1. Itemsets and in are merged into one itemset in .

  2. Itemset is added into . Itemset is added into .

  3. Let and let . The itemset containing in is augmented with item . Similarly, itemset containing in is augmented with item .

  4. Otherwise, or and .

Theorems 21 and 22 only covered the updates during merges. Since merge and split are opposite operations, we can derive the needed updates for splits from the preceding theorems.

Corollary 23 (of Theorem 21)

Let $\mathcal{F}$ be a downward closed family of itemsets and let $\mathcal{F}'$ be the family after performing $\mathit{split}(X; a, b)$. Let $Z$ be a maximal itemset in $\mathcal{F}'$. Then the legal split operations using $Z$ remain unchanged during the split unless $Z$ is the unique itemset among the maximal itemsets in $\mathcal{F}'$ containing either $X \setminus \{b\}$ or $X \setminus \{a\}$.

Corollary 24 (of Theorem 22)

Let be a downward closed family of itemsets and let be the family after performing . Let . Then the reduced families are updated as follows:

  1. Itemset containing in is split into two parts, and .

  2. Itemset is removed from . Itemset is removed from .

  3. Let and let . Item is removed from the itemset containing in . Similarly, item is removed from the itemset containing in .

  4. Otherwise, or and .

We keep in memory only those reduced families that have a positive merge degree. Theorem 20 tells us that there are only a bounded number of such families. By studying the update algorithms we see that, except in two cases, the update of a reduced family is either an insertion or deletion of an element in an itemset, or a merge of two itemsets. The first complex case is given on Line 7 in MergeSide, which corresponds to Case 2 in Theorem 22. The problem is that this family may have contained only one itemset before the merge, and hence we did not store it. Consequently, we need to recreate the missing itemset. The second case occurs on Line 5 in SplitSide. This corresponds to the case where we need to break apart the itemset containing $a$ and $b$ during a split (Case 1 in Corollary 24). This is done by constructing the new sets from scratch; the cost of this construction depends on the size of the largest itemset in the reduced family.

1 Update ;
2 Remove from ;
3 ;
4 ; ;
Algorithm 4 SplitUpdate. Routine for updating the structures during a split.
1 ;
2 while changes do
3       ;
4      
5Add into ;
6 if  exists then
7       Remove from ;
8      
9for , exists do
10       (any) item in ;
11       Remove from ;
12      
13if there is unique s.t.  then
14       Update ;
15      
Algorithm 5 SplitSide. Subroutine used by SplitUpdate.
1 Merge and in ;
2 itemset in such that ;
3 itemset in such that ;
4 Build from and ;
5 ; ;
6 Update ;
Algorithm 6 MergeUpdate. Routine for updating the structures during a merge.
1 ;
2 if  and does not exists then
3       ;
4      
5Add into ;
6 for , exists do
7       (any) item in ;
8       Augment with ;
9      
10if there is unique s.t.  then
11       Update ;
12      
Algorithm 7 MergeSide. Subroutine used by MergeUpdate.

5 Related Work

Many quality measures have been suggested for itemsets. A major part of these measures is based on how much the itemset deviates from some null hypothesis. For example, itemset measures that use the independence model as background knowledge have been suggested in [1, 3]. More flexible models have been proposed as well, such as comparing itemsets against graphical models [11] and local Maximum Entropy models [12, 18]. In addition, mining itemsets with low entropy has been suggested in [9].

Our main theoretical advantage over these approaches is that we look at the itemsets as a whole collection. For example, suppose we discover that items $a$ and $b$ deviate greatly from the null hypothesis. Then any itemset containing both $a$ and $b$ will also be deemed interesting. The reason for this is that these methods do not adapt to the discovered fact that $a$ and $b$ are correlated; instead they continue to use the same null hypothesis. We, on the other hand, avoid this problem by considering models: if an itemset is found interesting, that information is added into the statistical model. If this new model then explains bigger itemsets containing $a$ and $b$, then we have no reason to add these itemsets into the model, and hence such itemsets will not be considered interesting.

The idea of mining a pattern set as a whole in order to reduce the number of patterns is not new. For example, pattern reduction techniques based on the minimum description length principle have been suggested [16, 20, 10]. Discovering decomposable models has been studied in [19]. In addition, a framework that incrementally adapts to the patterns approved by the user has been suggested in [8]. Our main advantage is that these methods require an already discovered itemset collection as input, which can be substantially large for low thresholds. We, on the other hand, skip this step and define the significance of itemsets in such a way that we can mine the patterns directly.

6 Experiments

In this section we present our empirical evaluation of the measure. We first describe the datasets and the setup for the experiments, then present the results with synthetic datasets and finally the results with real-world datasets.

6.1 Setup for the Experiments

We used six synthetic datasets and three real-world datasets.

The first three synthetic datasets, called Ind, contained independent items and differed only in their number of transactions. We set the frequency of each individual item to the same value. The next three synthetic datasets, called Path, contained the same number of items. In these datasets, each item was generated from the previous one, so that consecutive items are correlated, and the probability of the first item was fixed. The three Path datasets again differed only in their number of transactions.

Our first real-world dataset, Paleo (NOW public release 030717, available from [7]), contains information about species fossils found at specific paleontological sites in Europe [7]. The dataset Courses contains the enrollment records of students taking courses at the Department of Computer Science of the University of Helsinki. Finally, our last dataset, Dna, is a DNA copy number amplification data collection of human neoplasms [13]. We used a prefix of the items from this data and removed empty transactions. The basic characteristics of the datasets are given in Table 1.

For each dataset we sampled models from the posterior distribution using the techniques described in the previous sections. We used the singleton model as a starting point and performed several restarts. The number of required MCMC steps is hard to predict, since the structure of the state space of decomposable models is complex and also depends on the actual data. Hence, we settle for a heuristic: for each restart we perform a number of MCMC steps that scales with the number of items. Doing so we obtained a collection of random models for each dataset. The execution times for sampling are given in Table 1. We estimated the itemset scores from the discovered models and mined the interesting itemsets using a simple depth-first approach.

Name K # of steps time
Ind 15
Path 15
Dna 1160 100
Paleo 501 139
Courses 3506 90
Table 1: Basic characteristics of the datasets. The fourth column contains the number of sample steps and the last column is the execution time.

6.2 Synthetic datasets

Our main purpose for the experiments with synthetic datasets is to demonstrate how the score behaves as a function of the number of data points. To this end, we plotted the number of significant itemsets, that is, itemsets whose score exceeded a given threshold, as a function of that threshold. The results are shown in Figures 2(a) and 2(b).

(a) Ind
(b) Path
(c) Real datasets
Figure 2: Number of significant itemsets as a function of the threshold, for (a) Ind, (b) Path, and (c) the real-world datasets.

Ideally, for Ind, the dataset with independent variables, only the singletons should be significant for any threshold. Similarly, for Path only the singletons and the pairs of consecutive items should be significant. We can see from Figures 2(a) and 2(b) that as we increase the number of transactions in the data, the number of significant itemsets approaches these ideal cases, as predicted by Theorem 7. The convergence to the ideal case is faster in Path than in Ind. The reason for this can be explained by the curse of dimensionality. In Ind every pair of items is a candidate, and with so many pairs there is a high probability that some of them appear to be correlated. On the other hand, for Path, let us assume that we have the correct model, that is, the singletons and the consecutive pairs. The only valid itemsets of size three that we can add to this model consist of three consecutive items. There are only linearly many such sets, hence the probability of finding such an itemset important is much lower. Interestingly, in Path we actually benefit from the fact that we are using decomposable models instead of general exponential models.

6.3 Use cases with real-world datasets

Our first experiment with real-world data is to study the number of significant itemsets as a function of the threshold. Figure 2(c) shows the number of significant itemsets for all three datasets. We see that the number of significant itemsets increases faster than for the synthetic datasets as the threshold decreases. The main reason for this difference is that with the real-world datasets we have more items and fewer transactions. This is seen especially in the Paleo dataset, for which the number of significant itemsets increases steeply compared to Dna and Courses.

Our next experiment is to compare the score against two baselines, namely the frequency and the entropy. These comparisons are given in Figures 3(a) and 3(b). In addition, we computed the correlation coefficients (given in Table 2). From the results we see that the score has a positive correlation with frequency and a negative correlation with entropy. The correlation with entropy is expected, since low entropy implies that the empirical distribution of an itemset differs from the uniform distribution. Hence, using the frequency of such an itemset should improve the model, and consequently the itemset is considered interesting.

(a) score as a function of frequency
(b) score as a function of entropy
(c) index difference vs. score
Figure 3: Score