1 Introduction
Discovering frequent itemsets is one of the most active fields in data mining. As a measure of quality, frequency has many appealing properties: it is easy to interpret and, since it decreases monotonically, there exist efficient algorithms for discovering large collections of frequent itemsets [2]. However, frequency also has serious drawbacks. A frequent itemset may be uninteresting if its elevated frequency is caused by frequent singletons. On the other hand, some infrequent itemsets could be interesting. Another drawback is the problem of pattern explosion when mining with a low threshold.
Many different quality measures have been suggested to overcome these problems (see Section 5 for a more detailed discussion). Usually these measures compare the observed frequency to some expected value derived, for example, from the independence model. Using such measures we may obtain better results. However, these approaches still suffer from pattern explosion. To illustrate the problem, assume that two items are correlated and hence considered significant. Then any itemset containing both items will also be considered significant.
Example 1
Assume a dataset in which two particular items always have identical values while the remaining items are independently distributed. Assume that we apply some statistical method to evaluate the significance of itemsets, using the independence model as the ground truth. If an itemset contains both of the correlated items, its frequency will be higher than the estimate of the independence model. Hence, given enough data, the p-value of the statistical test will go to 0, and we will conclude that the itemset is interesting. Consequently, we will find a large number of interesting itemsets.

In this work we approach the problem of defining a quality measure from a novel point of view. We construct a connection between itemsets and statistical models and use this connection to define a new quality measure for itemsets. To motivate this approach further, let us consider the following example.
Example 2
Consider a binary dataset generated from the independence model. We argue that if we know that the data comes from the independence model, then the only interesting itemsets are the singletons. The reasoning behind this claim is that the frequencies of the singletons correspond exactly to the column margins, the parameters of the independence model. Once we know the singleton frequencies, there is nothing left in the data that would be statistically interesting.
Let us consider a more complicated example. Assume that the data is generated from a Chow-Liu tree model [4]. Again, if we know that the data is generated from this model, then we argue that the interesting itemsets are the singletons together with the parent-child pairs of the tree. The reasoning here is the same as with the independence model: if we know the frequencies of these itemsets, we can derive the parameters of the distribution.
Let us now demonstrate that this approach will produce much smaller and more meaningful output than the method given in Example 1.
Example 3
Consider the data given in Example 1. To fully describe the data we only need to know the frequencies of the singletons and the fact that the two correlated items are identical. This information can be expressed by outputting the frequencies of the singleton itemsets and the frequency of the itemset consisting of the two correlated items. This gives only a small number of interesting patterns in total.
Our approach is to extend the idea sketched in the preceding example to a general itemset mining framework. In the example we knew which model generated the data; in practice, we typically do not. To solve this we use the Bayesian approach and, instead of considering just one specific model, consider a large collection of models, namely exponential models. A virtue of these models is that we can naturally connect each model to certain itemsets. A model $M$ has a posterior probability $p(M \mid D)$, that is, how probable the model is given the data. The score of a single itemset is then the probability of it being a parameter of a random model given the data. This setup fits the given example perfectly. If we have strong evidence that the data comes from the independence model, then the posterior probability of that model will be close to $1$, and the posterior probability of any other model will be close to $0$. Since the independence model is connected to the singletons, the score for the singletons will be close to $1$ and the score for any other itemset will be close to $0$.

Interestingly, using statistical models for defining significant itemsets provides an approach to the problem of pattern set explosion (see Section 3.2 for more technical details). Bayesian model selection has an inbuilt Occam's razor, favoring simple models over complex ones. Our connection between models and itemsets is such that simple models correspond to small collections of itemsets. As a result, only a small collection of itemsets will be considered interesting, unless the data provides a sufficient amount of evidence.
Our contribution in this paper is twofold. First, we introduce a general framework for using statistical models to score itemsets in Section 2. Second, we provide an instance of this framework in Section 3 using exponential models and provide solid theoretical evidence that our choices are well-founded. We describe the sampling algorithm in Section 4. We discuss related work in Section 5 and present our experiments in Section 6. Finally, we conclude our work in Section 7. The proofs are given in the Appendix. The implementation is provided for research purposes (http://adrem.ua.ac.be/implementations).
2 Significance of Itemsets by Statistical Models
As we discussed in the introduction, our goal is to define a quality measure for itemsets using statistical models. In this section we provide a general framework for such a score. We will define the actual models in the next section.
We begin with some preliminary definitions and notation. In our setup a binary dataset $D$ is a collection of $n$ transactions, binary vectors of length $K$. We assume that these vectors are independently generated from some unknown distribution. Such a dataset can be represented by a binary matrix of size $n \times K$. By an attribute we mean a Bernoulli random variable corresponding to a column of the data. An itemset is simply a subset of the attributes. Given an itemset $X$ and a transaction $t$, we write $t_X$ for the projection of $t$ onto $X$. We say that $t$ covers $X$ if all elements of $t_X$ are equal to $1$.
We say that a collection of itemsets is downward closed if, for each member, every sub-itemset is also included. This property plays a crucial role in mining frequent patterns, since it allows effective candidate pruning in a level-wise approach and branch pruning in a DFS approach.
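To make the property concrete, here is a minimal Python sketch (a hypothetical helper, not part of the paper's implementation) that checks whether a collection of itemsets is downward closed:

```python
from itertools import combinations

def is_downward_closed(family):
    """Return True if every non-empty proper subset of each member
    itemset is also a member of the family."""
    fam = {frozenset(s) for s in family}
    for itemset in fam:
        for k in range(1, len(itemset)):
            for sub in combinations(itemset, k):
                if frozenset(sub) not in fam:
                    return False
    return True
```

For instance, the family {{1}, {2}, {1, 2}} is downward closed, while {{1}, {1, 2}} is not, since it is missing the singleton {2}.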
Our next step is to define the correspondence between statistical models and families of itemsets. Assume that we have a set
of statistical models for the data. We will discuss in later sections what specific models we are interested in, but for the moment, we will keep our discussion on a high level. Each model
has a posterior probability $p(M \mid D)$, that is, how probable the model $M$ is given the data $D$. To link the models to families of itemsets, we assume that we have a function $\phi$ that identifies a model with a downward closed family of itemsets. As we will see later on, there is one particularly natural choice for such a function.

Now that we have models connected to certain families of itemsets, we are ready to define a score for individual itemsets. The score for an itemset is the posterior probability of it being a member of a family of itemsets,
$$ s(X) \;=\; \sum_{M : X \in \phi(M)} p(M \mid D). \qquad (1) $$
The motivation for this score is as follows. If we are sure that some particular model $M$ is the correct model for $D$, then the posterior probability of that model will be close to $1$ and the posterior probabilities of the other models will be close to $0$. Consequently, the score for an itemset will be close to $1$ if the itemset belongs to $\phi(M)$, and close to $0$ otherwise.
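The score of Eq. 1 can be sketched directly in Python. This is an illustrative helper under the assumption that the model space is small enough to enumerate (which, as Section 4 discusses, it is not in general); the representation of a model as a (posterior, family) pair is hypothetical:

```python
def itemset_score(itemset, models):
    """Score of an itemset per Eq. 1: the total posterior mass of the
    models whose associated family phi(M) contains the itemset.

    models: iterable of (posterior, family) pairs, where `family` is a
    collection of itemsets and the posteriors sum to one.
    """
    x = frozenset(itemset)
    return sum(p for p, family in models
               if any(frozenset(s) == x for s in family))
```

For example, with two models of posterior 0.9 and 0.1 whose families are {{1}, {2}} and {{1}, {2}, {1, 2}}, the score of {1, 2} is 0.1 while the score of {1} is 1.0.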
Naturally, the pivotal choice in this score lies in the mapping $\phi$. Such a mapping needs to be statistically well-founded, and in particular the size of an itemset family should be reflected in the complexity of the corresponding model. We will see in the following section that a particular choice for the model family and the mapping has these properties, and leads to certain important qualities.
Proposition 4
The score decreases monotonically, that is, $X \subseteq Y$ implies $s(Y) \leq s(X)$.
We allow $\phi$ to map only to downward closed families. Hence, if $Y \in \phi(M)$ for some model $M$, then $X \in \phi(M)$ as well, so the sum defining $s(Y)$ in Eq. 1 ranges over a subset of the models summed in $s(X)$. Consequently, we have $s(Y) \leq s(X)$. This completes the proof.
3 Exponential Models
In this section we make our framework more concrete by providing a specific set of statistical models and the function identifying the models with families of itemsets. We first give the definition of the models and the mapping. After this, we point out the main properties of our model and justify our choices. However, as it turns out, computing the score for these models is infeasible, so we solve this problem by considering decomposable models.
3.1 Definition of the models
Models of exponential form have been studied extensively in statistics and have been shown to have good theoretical and practical properties. In our case, using exponential models provides a natural way of describing the dependencies between the variables. In fact, the exponential model class contains many natural models, such as the independence model, the Chow-Liu tree model, and the discrete Gaussian model. Finally, such models have been used successfully for predicting itemset frequencies [14] and ranking itemsets [18].
In order to define our model, let $\mathcal{F}$ be a downward closed family of itemsets containing all singletons. For an itemset $X$ we define an indicator function $S_X$ mapping a transaction $t$ to a binary value: if $t$ covers $X$, then $S_X(t) = 1$, and $S_X(t) = 0$ otherwise. We define the exponential model $M_{\mathcal{F}}$ associated with $\mathcal{F}$ to be the collection of distributions having the exponential form
$$ p(t) = \exp\Big( \sum_{X \in \mathcal{F}} \theta_X S_X(t) \Big), $$
where $\theta_X$ is a parameter, a real value, for an itemset $X \in \mathcal{F}$. The model $M_{\mathcal{F}}$ also contains all the distributions that can be obtained as limits of distributions of the exponential form; this technicality is needed to handle distributions with zero probabilities. Since the indicator function $S_\emptyset$ is equal to $1$ for any $t$, the parameter $\theta_\emptyset$ acts as a normalization constant. The remaining parameters form a parameter vector. Naturally, we set $\phi(M_{\mathcal{F}}) = \mathcal{F}$.
Example 5
Assume that $\mathcal{F}$ consists only of singleton itemsets. Then the corresponding model has the form
$$ p(t) = \exp\Big( \theta_\emptyset + \sum_{i} \theta_{a_i} S_{a_i}(t) \Big). \qquad (2) $$
Since each term in the exponent depends only on a single attribute, the model is actually the independence model. The other extreme is when $\mathcal{F}$ consists of all itemsets. Then we can show that the corresponding model contains all possible distributions, that is, the model is in fact the parameter-free model.
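To make the first extreme concrete, the distribution that Eq. 2 collapses to can be parameterized directly by the column margins. A minimal sketch (the function name and representation are hypothetical illustrations, not the paper's code):

```python
import math

def independence_log_prob(transaction, margins):
    """Log-probability of a binary transaction under the independence
    model whose parameters are the column margins p_i = P(a_i = 1)."""
    logp = 0.0
    for t, p in zip(transaction, margins):
        # each attribute contributes independently
        logp += math.log(p if t == 1 else 1.0 - p)
    return logp
```

For example, with margins (0.5, 0.5), every transaction of length two has probability 0.25 under the model.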
As an intermediate example, the tree model in Example 2 is also an exponential model, with a corresponding family consisting of the singletons and the parent-child pairs of the tree.

The intuition behind the model is that when an itemset is an element of $\mathcal{F}$, then the dependencies between the items of that itemset are considered important in the corresponding model. For example, if $\mathcal{F}$ consists only of singletons, then there are no important correlations, and hence the corresponding model should be the independence model. On the other hand, in the tree model given in Example 2 the important correlations are the parent-child item pairs. These are exactly the itemsets (along with the singletons) that correspond to the model.
Our choice of models is particularly good because the complexity of a model reflects the size of its itemset family. Since the Bayesian approach has an inbuilt tendency to punish complex families (we will see this in Section 3.2), we are punishing large families of itemsets. If the data indicates that simple models are sufficient, then the probability of complex models will be low, and consequently the score for large itemsets will also be low. In other words, we have cast the problem of pattern set explosion as a model overfitting problem and used Occam's razor to punish the complex models!
3.2 Computing the Model
Now that we have defined our model $M_{\mathcal{F}}$, our next step is to compute the posterior probability $p(M_{\mathcal{F}} \mid D)$, that is, the probability of the model given the data set $D$. We select the model prior to be uniform. Recall that in Eq. 2 a model has a set of parameters; to pinpoint a single distribution in $M_{\mathcal{F}}$ we need a parameter vector $\theta$. Following the Bayesian approach, to compute $p(M_{\mathcal{F}} \mid D)$ we need to marginalize out the nuisance parameters $\theta$,
$$ p(M_{\mathcal{F}} \mid D) \propto \int p(D \mid M_{\mathcal{F}}, \theta)\, p(\theta \mid M_{\mathcal{F}})\, d\theta. $$
In the general case this integral is too complex to solve analytically, so we employ the popular BIC estimate [15],
$$ \log p(M_{\mathcal{F}} \mid D) \approx \log p(D \mid M_{\mathcal{F}}, \theta^*) - \frac{|\theta|}{2} \log n + C, \qquad (3) $$
where $C$ is a constant and $\theta^*$ is the maximum likelihood estimate of the model parameters. This estimate becomes exact as the number of transactions $n$ approaches infinity [15]. So instead of computing a complex integral, our challenge is to discover the maximum likelihood estimate and compute the likelihood of the data. Unfortunately, using such a model is an NP-hard problem (see, for example, [17]). We remedy this problem in Section 3.4 by considering a large subclass of exponential models for which the maximum likelihood can be computed efficiently.
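The BIC estimate of Eq. 3 (up to its additive constant) is a one-liner; the sketch below is an illustrative helper, not the paper's implementation:

```python
import math

def bic_log_posterior(max_log_likelihood, num_params, n):
    """BIC approximation of the log posterior (Eq. 3), dropping the
    additive constant: log L(theta*) - (|theta| / 2) * log n."""
    return max_log_likelihood - 0.5 * num_params * math.log(n)
```

Note how models with more parameters pay a larger penalty for the same likelihood; this is the inbuilt Occam's razor that translates into punishing large itemset families.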
3.3 Justifications for Exponential Model
In this section we will provide strong theoretical justification for our choices and show that our score fulfills the goals we set in the introduction.
We saw in Example 2 that if the data comes from the independence model, then we only need the frequencies of the singleton itemsets to completely explain the underlying model. The next theorem shows that this holds in the general case.
Theorem 6
Assume that the data is generated from a distribution that comes from an exponential model $M$, and let $\mathcal{F} = \phi(M)$ be the corresponding family of itemsets. We can derive the maximum likelihood estimate from the frequencies of $\mathcal{F}$. Moreover, as the number of transactions goes to infinity, we can derive the true distribution from the frequencies of $\mathcal{F}$.
The preceding theorem showed that $\phi(M)$ is a sufficient family of itemsets for deriving the correct true distribution. The next theorem shows that we favor small families: if the data can be explained with a simpler model, that is, using fewer itemsets, then the simpler model will be chosen and, consequently, redundant itemsets will have a low score.
Theorem 7
Assume that the data is generated from a distribution that comes from a model $M$. Assume also that if any other model, say $M'$, contains this distribution, then $\phi(M) \subseteq \phi(M')$. Then the following holds: as the number of data points in $D$ goes to infinity, $s(X) \to 1$ if $X \in \phi(M)$, and otherwise $s(X) \to 0$.
3.4 Decomposable Models
We saw in Section 3.2 that in practice we cannot compute the score for general exponential models. In this section we study a subclass of exponential models for which we can compute the needed score efficiently. Roughly speaking, a decomposable model is an exponential model whose maximal itemsets can be arranged into a specific tree, called a junction tree. By considering only decomposable models we obviously lose some models; for example, the discrete Gaussian model, that is, the model corresponding to all itemsets of size 1 and 2, is not decomposable. On the other hand, many interesting and practically relevant models are decomposable, for example Chow-Liu trees. Finally, these models are closely related to Bayesian networks and Markov random fields (see [5] for more details).

To define a decomposable model, let $\mathcal{F}$ be a downward closed family of itemsets. We write $\max(\mathcal{F})$ for the set of maximal itemsets of $\mathcal{F}$. Assume that we can build a tree $T$ using the itemsets of $\max(\mathcal{F})$ as nodes with the following property: if two nodes $X$ and $Y$ have a common item, say $a$, then $X$ and $Y$ are connected in $T$ (by a unique path) and every itemset along that path contains $a$. If this property holds, then $T$ is called a junction tree and $\mathcal{F}$ is decomposable. We use $E(T)$ to denote the edges of the tree.
Not all families have junction trees and some families may have multiple junction trees.
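The defining property can be checked mechanically. Below is a minimal sketch (representation hypothetical: maximal itemsets as frozensets, edges as index pairs) that verifies the running-intersection property of a candidate junction tree:

```python
from collections import defaultdict, deque

def is_junction_tree(nodes, edges):
    """nodes: list of frozensets (the maximal itemsets); edges: list of
    (i, j) index pairs forming a tree over the nodes.  Check that for
    every item, the nodes containing it induce a connected subtree."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    for item in set().union(*nodes):
        holders = {i for i, node in enumerate(nodes) if item in node}
        # BFS restricted to nodes that contain the item
        start = next(iter(holders))
        seen, queue = {start}, deque([start])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w in holders and w not in seen:
                    seen.add(w)
                    queue.append(w)
        if seen != holders:
            return False
    return True
```

For instance, the tree {a, b} — {b, c} is a junction tree, whereas the path {a, b} — {c} — {b, d} is not, because the two nodes containing b are joined only through a node that lacks b.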
Example 8
The most important property of decomposable families is that we can compute the maximum likelihood efficiently. We first define the entropy of an itemset $X$, denoted by $H(X)$, as
$$ H(X) = -\sum_{v} q_X(v) \log q_X(v), $$
where $q_X$ is the empirical distribution of the data projected onto $X$.
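A direct sketch of this definition in Python (illustrative helper names; transactions represented as tuples of 0/1 values):

```python
import math
from collections import Counter

def itemset_entropy(data, itemset):
    """Empirical entropy H(X): entropy of the distribution of the
    data rows projected onto the columns in `itemset`."""
    cols = sorted(itemset)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    n = len(data)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

For example, on four transactions whose projections onto a pair of columns are all distinct, the entropy is log 4; projected onto a single balanced column it is log 2.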
Theorem 9
Let $\mathcal{F}$ be a decomposable family and let $T$ be its junction tree. The maximum log-likelihood is equal to
$$ \log p(D \mid M_{\mathcal{F}}, \theta^*) \;=\; -n \Big( \sum_{X \in \max(\mathcal{F})} H(X) \;-\; \sum_{(X, Y) \in E(T)} H(X \cap Y) \Big). $$
Example 10
Assume that our model space consists of only two models, namely the tree model given in Example 2 and the independence model. Assume also that we have a dataset with $n$ transactions. To compute the posterior probabilities of the two models, we need to know the entropies of certain itemsets. The log-likelihood of the independence model is obtained from the singleton entropies, and the log-likelihood of the tree model is computed using the junction tree given in Figure 1(a) and Theorem 9. Plugging these likelihoods and the parameter counts of the two models into Eq. 3 gives the BIC estimates; normalizing over the two models yields the final posterior probabilities. Consequently, the score of a singleton is $1$, the score of an itemset that appears only in the tree model's family equals the posterior probability of the tree model, and the score of any other itemset is $0$.
4 Sampling Models
Now that we have the means for computing the posterior probability of a single decomposable model, our next step is to compute the score of an itemset, namely the sum in Eq. 1. The problem is that this sum has an exponential number of terms, and hence we cannot evaluate it by enumerating all possible families. We approach this problem from a different point of view. Instead of computing the score for each itemset individually, we divide our mining method into two steps:

Sample random decomposable models from the posterior distribution $p(M \mid D)$.

Estimate the true score of an itemset by computing the fraction of sampled families of itemsets in which the itemset occurs.
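The second step is a plain Monte Carlo estimate. A minimal sketch (hypothetical helper, assuming the sampled families are available as collections of itemsets):

```python
def estimate_score(itemset, sampled_families):
    """Monte Carlo estimate of the score of Eq. 1: the fraction of
    sampled families of itemsets that contain the given itemset."""
    x = frozenset(itemset)
    hits = sum(1 for family in sampled_families
               if any(frozenset(s) == x for s in family))
    return hits / len(sampled_families)
```

For example, if an itemset occurs in 2 of 3 sampled families, its estimated score is 2/3; the estimate converges to the true score as the number of (independent) samples grows.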
4.1 Moving from One Model to Another
In order to sample, we use an MCMC approach, modifying the current decomposable family with two possible operations, namely

Merge: Select two maximal itemsets, say $X$ and $Y$. Let $C = X \cap Y$. Since $X$ and $Y$ are maximal, $X \setminus C \neq \emptyset$ and $Y \setminus C \neq \emptyset$. Select $a \in X \setminus C$ and $b \in Y \setminus C$. Add the new itemset $C \cup \{a, b\}$ into the family along with all its sub-itemsets. We use the notation $\mathrm{merge}(C, a, b)$ to denote this operation.

Split: Select a maximal itemset $X$. Select two items $a, b \in X$. Delete all sub-itemsets of $X$ containing $a$ and $b$ simultaneously. We denote this operation by $\mathrm{split}(X, a, b)$.
Naturally, not all splits and merges are legal, since some operations may result in a family that is not decomposable, or even downward closed.
Example 11
The next theorem tells us which splits are legal.
Theorem 12
Let $\mathcal{F}$ be a decomposable family, let $X \in \max(\mathcal{F})$, and let $a, b \in X$. Then the family resulting from $\mathrm{split}(X, a, b)$ is decomposable if and only if there are no other maximal itemsets in $\mathcal{F}$ containing $a$ and $b$ simultaneously.
Example 13
In order to identify legal merges, we need some additional structure. Let $\mathcal{F}$ be a downward closed family and let $\max(\mathcal{F})$ be its maximal itemsets. Let $C$ be an itemset. We construct a reduced family, denoted by $\mathrm{red}(\mathcal{F}, C)$, with the following procedure. Let us first define
$$ \mathrm{red}(\mathcal{F}, C) = \{ X \setminus C \mid X \in \max(\mathcal{F}),\ C \subset X \}. $$
To obtain the final reduced family, assume there are two itemsets $U, V \in \mathrm{red}(\mathcal{F}, C)$ such that $U \cap V \neq \emptyset$. We remove these two sets and replace them with $U \cup V$. This is continued until no such replacements are possible. We ignore any reduced family that contains $0$ or $1$ itemsets; the reason for this will be seen in Theorem 15, which implies that such families do not induce any legal merges.
Example 14
The next theorem tells us when a merge is legal.
Theorem 15
Let $\mathcal{F}$ be a decomposable family. A merge operation $\mathrm{merge}(C, a, b)$ is legal, that is, $\mathcal{F}$ is still decomposable after adding $C \cup \{a, b\}$ and its sub-itemsets, if and only if there are sets $U, V \in \mathrm{red}(\mathcal{F}, C)$, $U \neq V$, such that $a \in U$ and $b \in V$.
4.2 MCMC Sampling Algorithm
Sampling requires a proposal distribution $q$. Let $M$ be the current model. We denote the number of legal operations, either a split or a merge, by $c(M)$. Let $M'$ be a model obtained by sampling uniformly one of the legal operations and applying it to $M$. The probability of reaching $M'$ from $M$ with a single step is $q(M' \mid M) = 1/c(M)$. Similarly, the probability of reaching $M$ from $M'$ with a single step is $q(M \mid M') = 1/c(M')$. Consequently, if we sample $u$ uniformly from the interval $[0, 1]$ and accept the step moving from $M$ to $M'$ if and only if $u$ is smaller than
$$ \min\left(1,\ \frac{p(M' \mid D)\, c(M)}{p(M \mid D)\, c(M')}\right), \qquad (4) $$
then the limit distribution of the MCMC will be the posterior distribution provided that the MCMC chain is ergodic. The next theorem shows that this is the case.
Theorem 17
Any decomposable model can be reached from any other model by a sequence of legal operations.
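The acceptance test of Eq. 4 is conveniently done in log space, since the posteriors are available only as BIC log-scores. A minimal sketch (hypothetical helper; the log-posteriors and operation counts are assumed to be supplied by the surrounding sampler):

```python
import math
import random

def mh_accept(log_post_cur, log_post_new, n_ops_cur, n_ops_new, rng=random):
    """One Metropolis-Hastings acceptance test for a uniformly chosen
    legal operation.  The proposal correction q(M|M') / q(M'|M) equals
    n_ops_cur / n_ops_new, giving the ratio of Eq. 4 in log space."""
    log_ratio = (log_post_new - log_post_cur
                 + math.log(n_ops_cur) - math.log(n_ops_new))
    # accept with probability min(1, exp(log_ratio))
    return rng.random() < math.exp(min(0.0, log_ratio))
```

A step to a model with a much higher score is always accepted, while a step to a drastically worse model is (essentially) always rejected; intermediate steps are accepted with the probability of Eq. 4.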
Our first step is to compute the ratio of the models given in Eq. 4. To do that we will use the BIC estimate given in Eq. 3 and Theorem 9. Let us first define a function
$$ g(X, a, b) = H(X \setminus \{a\}) + H(X \setminus \{b\}) - H(X \setminus \{a, b\}) - H(X), $$
where $X$ is an itemset and $a, b \in X$ are items.
Theorem 18
Let $M$ be a decomposable model and let $M'$ be a model obtained by the legal split $\mathrm{split}(X, a, b)$. Let $b(M)$ be the BIC estimate of $M$ and let $b(M')$ be the BIC estimate of $M'$. Then
$$ b(M') = b(M) - n\, g(X, a, b) + 2^{|X| - 2}\, \frac{\log n}{2}. $$
Similarly, if $M'$ is obtained by the legal merge $\mathrm{merge}(C, a, b)$ and we write $Z = C \cup \{a, b\}$, then
$$ b(M') = b(M) + n\, g(Z, a, b) - 2^{|Z| - 2}\, \frac{\log n}{2}. $$
To compute the gain we need the entropies of 4 itemsets. Let $X$ be an itemset. To compute $H(X)$ we first order the transactions in $D$ such that the values corresponding to $X$ are in lexicographical order. This is done with the radix sort given in Algorithm 1, in $O(|X|\, n)$ time. After the data is sorted we can compute the entropy with a single data scan: set $h = 0$ and $c = 1$. If the values of $X$ in the current transaction are equal to those in the previous transaction, we increase $c$ by one; otherwise we add $-(c/n) \log(c/n)$ to $h$ and reset $c$ to one. Once the scan is finished (and the final run is added), $h$ is equal to $H(X)$. The pseudocode for computing the entropy is given in Algorithm 2.
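The sort-and-scan scheme just described can be sketched as follows (Python's built-in sort stands in for the radix sort of Algorithm 1; helper names are hypothetical):

```python
import math

def entropy_by_scan(data, itemset):
    """H(X) via sorting: order the transactions by their projection
    onto X, then accumulate -(c/n) log(c/n) for each run of identical
    projections in a single scan."""
    cols = sorted(itemset)
    rows = sorted(tuple(t[c] for c in cols) for t in data)
    n = len(rows)
    h, c = 0.0, 1
    for i in range(1, n):
        if rows[i] == rows[i - 1]:
            c += 1                          # extend the current run
        else:
            h -= (c / n) * math.log(c / n)  # close the run
            c = 1
    h -= (c / n) * math.log(c / n)          # close the final run
    return h
```

After sorting, equal projections are adjacent, so each distinct value of the projection appears as exactly one run, and the scan reproduces the entropy definition of Section 3.4.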
Our final step is to compute $q$ and actually sample the operations. To do that, we first write $n_s$ for the total number of possible split operations and $n_s(X)$ for the number of possible split operations using itemset $X$. Similarly, we write $n_m(C)$ for the number of legal merges using $C$ and $n_m$ for the total number of legal merges.
Given a maximal itemset $X$ we build an occurrence table, which we denote by $O(X)$, of size $|X| \times |X|$. For $a, b \in X$, the entry $O(X)_{ab}$ of the table is the number of maximal itemsets containing $a$ and $b$. If $O(X)_{ab} = 1$, then Theorem 12 states that $\mathrm{split}(X, a, b)$ is legal. Consequently, to sample a split operation we first select a maximal itemset $X$ weighted by $n_s(X)$. Once $X$ is selected we select uniformly one legal pair $a, b \in X$.
To sample legal merges, recall that a merge involves selecting two maximal itemsets $X$ and $Y$ such that $C = X \cap Y$, $a \in X \setminus C$, and $b \in Y \setminus C$. Instead of selecting these itemsets, we directly sample an itemset $C$ and then select two items $a$ and $b$. This sampling works only if two legal merges $\mathrm{merge}(C_1, a, b)$ and $\mathrm{merge}(C_2, a, b)$ result in two different outcomes whenever $C_1 \neq C_2$.
Theorem 19
Let $C_1$ and $C_2$ be two different itemsets and let $a$ and $b$ be items. Assume that $\mathrm{merge}(C_i, a, b)$ is a legal merge for $i = 1, 2$. Define $Z_i = C_i \cup \{a, b\}$ for $i = 1, 2$. Then $Z_1 \neq Z_2$.
The construction of a reduced family guarantees that two distinct sets $U, V \in \mathrm{red}(\mathcal{F}, C)$ are disjoint. It follows from Theorem 15 that every legal $\mathrm{merge}(C, a, b)$ corresponds to a choice of two distinct sets $U, V \in \mathrm{red}(\mathcal{F}, C)$ and items $a \in U$, $b \in V$, so $n_m(C)$ is determined by the sizes of the sets in $\mathrm{red}(\mathcal{F}, C)$.

To sample a merge we first sample an itemset $C$ weighted by $n_m(C)$. Once $C$ is selected, we sample two different itemsets $U, V \in \mathrm{red}(\mathcal{F}, C)$ (weighted by $|U|$ and $|V|$). Finally, we sample $a \in U$ and $b \in V$.
Sampling for a merge operation is feasible only if the number of reduced families for which the merge degree is larger than zero is small.
Theorem 20
Let $K$ be the number of items. There are at most $K$ maximal itemsets. There are at most $K$ itemsets $C$ for which the merge degree of $\mathrm{red}(\mathcal{F}, C)$ is larger than zero.
Pseudocode for a sampling step is given in Algorithm 3.
4.3 Speeding Up the Sampling
We have demonstrated which structures we need in order to sample legal operations. After a sampling step we could reconstruct these structures from scratch; in this section we show how to optimize the sampling by constructing the structures incrementally using Algorithms 4–7.

First of all, we store only the maximal itemsets of $\mathcal{F}$. Theorem 20 states that there can be at most $K$ such sets, hence split and merge operations can be done efficiently.

During a split or a merge, we need to update which split operations are legal afterwards. We do this by updating the occurrence tables $O(X)$; each update can be done efficiently. The next theorem shows which maximal itemsets we need to update for legal split operations after a merge.
Theorem 21
Let $\mathcal{F}$ be a downward closed family of itemsets and let $\mathcal{F}'$ be the family after performing $\mathrm{merge}(C, a, b)$. Let $X$ be a maximal itemset in $\mathcal{F}'$. Then the legal split operations using $X$ remain unchanged during the merge unless $X$ is the unique itemset among the maximal itemsets in $\mathcal{F}'$ containing either $a$ or $b$.

The following theorem tells us how the reduced families should be updated after a merge operation. To ease the notation, we denote by $U(a; C)$ the unique itemset (if such exists) in $\mathrm{red}(\mathcal{F}, C)$ containing item $a$.
Theorem 22
Let be a downward closed family of itemsets and let be the family after performing . Then the reduced families are updated as follows:

Itemsets and in are merged into one itemset in .

Itemset is added into . Itemset is added into .

Let and let . The itemset containing in is augmented with item . Similarly, itemset containing in is augmented with item .

Otherwise, or and .
Theorems 21 and 22 covered only the updates during merges. Since split and merge are opposite operations, we can derive the needed updates for splits from the preceding theorems.
Corollary 23 (of Theorem 21)
Let $\mathcal{F}$ be a downward closed family of itemsets and let $\mathcal{F}'$ be the family after performing $\mathrm{split}(X, a, b)$. Let $Y$ be a maximal itemset in $\mathcal{F}'$. Then the legal split operations using $Y$ remain unchanged during the split unless $Y$ is the unique itemset among the maximal itemsets in $\mathcal{F}'$ containing either $a$ or $b$.
Corollary 24 (of Theorem 22)
Let be a downward closed family of itemsets and let be the family after performing . Let . Then the reduced families are updated as follows:

Itemset containing in is split into two parts, and .

Itemset is removed from . Itemset is removed from .

Let and let . Item is removed from the itemset containing in . Similarly, item is removed from the itemset containing in .

Otherwise, or and .
We keep in memory only those reduced families that have a positive merge degree. Theorem 20 tells us that there are at most $K$ such families. By studying the code in the update algorithm we see that, except in two cases, the update of a family is either an insertion/deletion of an element into an itemset or a merge of two itemsets. The first complex case is given on Line 7 in MergeSide, which corresponds to Case 2 in Theorem 22. The problem is that this family may have contained only one itemset before the merge, and hence we did not store it. Consequently, we need to recreate the missing itemset. The second case occurs on Line 5 in SplitSide. This corresponds to the case where we need to break apart the itemset containing $a$ and $b$ during a split (Case 1 in Corollary 24). This is done by constructing the new sets from scratch; the construction time depends on the size of the largest itemset in the reduced family.
5 Related Work
Many quality measures have been suggested for itemsets. A major part of these measures are based on how much the itemset deviates from some null hypothesis. For example, itemset measures that use the independence model as background knowledge have been suggested in [1, 3]. More flexible models have been proposed, such as comparing itemsets against graphical models [11] and local Maximum Entropy models [12, 18]. In addition, mining itemsets with low entropy has been suggested in [9].

Our main theoretical advantage over these approaches is that we look at the itemsets as a whole collection. For example, suppose that we discover that two items jointly deviate greatly from the null hypothesis. Then any itemset containing both items will also be deemed interesting. The reason for this is that these methods do not adapt to the discovered fact that the two items are correlated; instead they continue to use the same null hypothesis. We, on the other hand, avoid this problem by considering models: if an itemset is found interesting, that information is added into the statistical model. If this new model then explains bigger itemsets containing the correlated items, then we have no reason to add these itemsets into the model, and hence such itemsets will not be considered interesting.
The idea of mining a pattern set as a whole in order to reduce the number of patterns is not new. For example, pattern reduction techniques based on the minimum description length principle have been suggested [16, 20, 10]. Discovering decomposable models has been studied in [19]. In addition, a framework that incrementally adapts to the patterns approved by the user has been suggested in [8]. Our main advantage is that these methods require an already discovered itemset collection as input, which can be substantially large for low thresholds. We, on the other hand, skip this step and define the significance of itemsets such that we can mine the patterns directly.
6 Experiments
In this section we present our empirical evaluation of the measure. We first describe the datasets and the setup for the experiments, then present the results on synthetic datasets, and finally the results on real-world datasets.
6.1 Setup for the Experiments
We used six synthetic datasets and three real-world datasets.

The first three synthetic datasets, called Ind, contained independent items; the three datasets differ only in their number of transactions. The frequency of each individual item was set to the same value. The next three synthetic datasets, called Path, contained the same number of items. In these datasets each item was generated from the previous one with a fixed conditional probability, and the probability of the first item was fixed as well. Again, the three datasets differ in their number of transactions.

Our first real-world dataset, Paleo (NOW public release 030717, available from [7]), contains information about species fossils found at specific paleontological sites in Europe [7]. The dataset Courses contains the enrollment records of students taking courses at the Department of Computer Science of the University of Helsinki. Finally, our last dataset, Dna, is a DNA copy number amplification data collection of human neoplasms [13]. We used an initial subset of the items from this data and removed empty transactions. The basic characteristics of the datasets are given in Table 1.
For each dataset we sampled models from the posterior distribution using the techniques described in Section 4. We used the singleton model as a starting point and performed several restarts. The number of required MCMC steps is hard to predict, since the structure of the state space of decomposable models is complex; furthermore, it also depends on the actual data. Hence, we settle for a heuristic: for each restart we perform a number of MCMC steps that grows with the number of items $K$. Doing so we obtained a collection of random models for each dataset. The execution times for sampling are given in Table 1. We estimated the itemset scores from the discovered models and mined the interesting itemsets using a simple depth-first approach.

Name      K     # of steps   time
Ind       –     15           –
Path      –     15           –
Dna       1160  100
Paleo     501   139
Courses   3506  90
6.2 Synthetic datasets
Our main purpose for the experiments with synthetic datasets is to demonstrate how the score behaves as a function of the number of data points. To this end, we plotted the number of significant itemsets, that is, itemsets whose score exceeds a given threshold, as a function of that threshold. The results are shown in Figures 2(a) and 2(b).
Ideally, for Ind, the dataset with independent variables, only the singleton itemsets should be significant for any threshold. Similarly, for Path only the singletons and the pairs of consecutive items should be significant. We can see from Figures 2(a) and 2(b) that as we increase the number of transactions in the data, the number of significant itemsets approaches these ideal cases, as predicted by Theorem 7. The convergence to the ideal case is faster in Path than in Ind. The reason for this can be explained by the curse of dimensionality. In Ind there are $\binom{K}{2}$ pairs of items, so there is a high probability that some of these pairs appear to be correlated. On the other hand, for Path, assume that we have the correct model, that is, the singletons and the pairs of consecutive items. The only itemsets of size 3 that we can add to this model consist of three consecutive items. There are only $K - 2$ such sets, hence the probability of finding such an itemset important is much lower. Interestingly, in Path we actually benefit from the fact that we are using decomposable models instead of general exponential models.

6.3 Use Cases with Real-World Datasets
Our first experiment with real-world data is to study the number of significant itemsets as a function of the threshold. Figure 2(c) shows the number of significant itemsets for all three datasets. We see that the number of significant itemsets increases faster than for the synthetic datasets as the threshold decreases. The main reason for this difference is that with the real-world datasets we have more items and fewer transactions. This is seen especially in the Paleo dataset, for which the number of significant itemsets increases steeply compared to Dna and Courses.
Our next experiment is to compare the score against two baselines, namely the frequency and the entropy of an itemset. These comparisons are given in Figures 3(a) and 3(b). In addition, we computed the correlation coefficients (given in Table 2). From the results, we see that the score has a positive correlation with frequency and a negative correlation with entropy. The correlation with entropy is expected, since low entropy implies that the empirical distribution of an itemset differs from the uniform distribution. Hence, using the frequency of such an itemset should improve the model, and consequently the itemset is considered interesting.