Bayesian Mixture Models for Frequent Itemset Discovery

09/26/2012 ∙ by Ruefei He, et al. ∙ The University of Manchester

In binary-transaction data-mining, traditional frequent itemset mining often produces results which are not straightforward to interpret. To overcome this problem, probability models are often used to produce more compact and conclusive results, albeit with some loss of accuracy. Bayesian statistics have been widely used in the development of probability models in machine learning in recent years, and these methods have many advantages, including their ability to avoid overfitting. In this paper, we develop two Bayesian mixture models, with a Dirichlet distribution prior and a Dirichlet process (DP) prior respectively, to improve the previous non-Bayesian mixture model developed for transaction dataset mining. We implement the inference of both mixture models using two methods: a collapsed Gibbs sampling scheme and a variational approximation algorithm. Experiments on several benchmark problems have shown that both mixture models achieve better performance than a non-Bayesian mixture model. The variational algorithm is the faster of the two approaches, while the Gibbs sampling method achieves more accurate results. The Dirichlet process mixture model can automatically grow to a proper complexity for a better approximation. Once the model is built, it can be very fast to query and run analysis on (typically 10 times faster than Eclat, as we will show in the experiment section). However, these approaches also show that mixture models underestimate the probabilities of frequent itemsets. Consequently, these models have a higher sensitivity but a lower specificity.




1 Introduction

Transaction data sets are binary data sets with rows corresponding to transactions and columns corresponding to items or attributes. Data mining techniques for such data sets have been developed for over a decade. Methods for finding correlations and regularities in transaction data can have many commercial and practical applications, including targeted marketing, recommender systems, more effective product placement, and many others.

Retail records and web site logs are two examples of transaction data sets. For example, in a retailing application, the rows of the data correspond to purchases made by various customers, and the columns correspond to different items for sale in the store. This kind of data is often sparse, i.e., there may be thousands of items for sale, but a typical transaction may contain only a handful of items, as most of the customers buy only a small fraction of the possible merchandise. In this paper we will only consider binary transaction data, but transaction data can also contain the numbers of each item purchased (multi-nomial data). An important correlation which data mining seeks to elucidate is which items co-occur in purchases and which items are mutually exclusive, and never (or rarely) co-occur in transactions. This information allows prediction of future purchases from past ones.

Frequent itemset mining and association rule mining Agrawal et al. (1993) are the key approaches for finding correlations in transaction data. Frequent itemset mining finds all frequently occurring item combinations along with their frequencies in the dataset with a given minimum frequency threshold. Association rule mining uses the results of frequent itemset mining to find the dependencies between items or sets of items. If we regard the minimum frequency threshold as an importance standard, then the set of frequent itemsets contains all the “important” information about the correlation of the dataset. The aim of frequent itemset mining is to extract useful information from the kinds of binary datasets which are now ubiquitous in human society. It aims to help people realize and understand the various latent correlations hidden in the data and to assist people in decision making, policy adjustment and the performance of other activities which rely on correct analysis and knowledge of the data.

However, the results of such mining are difficult to use. The threshold or criterion of mining is hard to choose so as to obtain a compact but representative set of itemsets. To prevent the loss of important information, the threshold is often set quite low, producing a huge set of itemsets which brings difficulties in interpretation. These properties of large scale and weak interpretability block a wider use of the mining technique and are barriers to a further understanding of the data itself. Traditionally, Frequent Itemset Mining (FIM) suffers from three difficulties. The first is scalability: the data sets are often very large, the number of frequent itemsets at the chosen support is also large, and there may be a need to run the algorithm multiple times to find the appropriate frequency threshold. The second difficulty is that the support-confidence framework is often not able to provide the information that people really need, so people seek other criteria or measurements for more “interesting” results. The third difficulty is in interpreting the results or getting some explanation of the data. The recent focus of FIM research has therefore been in the following three directions.

  1. Looking for more compact but representative forms of the itemsets - in other words, mining compressed itemsets. The research in this direction consists of two types: lossless compression such as closed itemset mining Pasquier et al. (1999) and lossy compression such as maximal itemset mining Calders and Goethals (2002). In closed itemset mining, a method is proposed to mine the set of closed itemsets which is a subset of the set of frequent itemsets. This can be used to derive the whole set of frequent itemsets without loss of information. In maximal itemset mining, the support information of the itemsets is ignored and only a few longest itemsets are used to represent the whole set of frequent itemsets.

  2. Looking for better standards and qualifications for filtering the itemsets so that the results are more “interesting” to users. Work in this direction focuses on how to extract information which is both useful and unexpected, as people want to find a measure that is closest to the ideal of “interestingness”. Several objective and subjective measures have been proposed, such as lift Machine (1996); Brin et al. (1997), and the work of Jaroszewicz (2004), in which a Bayesian network is used as background knowledge to measure the interestingness of frequent itemsets.

  3. Looking for mathematical models which reveal and describe both the structure and the inner relationships of the data more accurately, clearly and thoroughly. There are two ways of using probability models in FIM. The first is to build a probability model that can organize and utilize the results of mining, such as the Maximal Entropy model Tatti (2008). The second is to build a probability model that is directly generated from the data itself, which can not only predict the frequent itemsets but also explain the data. An example of such a model is the Mixture model.

These three directions influence each other and form the main stream of current FIM research. Of the three, the probability model solution considers the data as a sampled result from the underlying probability model and tries to explain the system in an understandable, structural and quantified way. With a good probability model, we can expect the following advantages in comparison with normal frequent itemset mining:

  1. The model can reveal correlations and dependencies in the dataset, whilst frequent itemsets are merely collections of facts awaiting interpretation. A probability model can handle several kinds of probability queries, such as joint, marginal and conditional probabilities, whilst frequent itemset mining and association rule mining focus only on high marginal and conditional probabilities. Prediction is made easy with a model, whereas in order to predict with frequent itemsets, we would still need to organize them and build a structured model first.

  2. It is easier to observe interesting dependencies between the items, both positive and negative, from the model’s parameters than it is to discriminate interesting itemsets or rules from the whole set of frequent itemsets or association rules. In fact, the parameters of the probability model trained from a dataset can be seen as a collection of features of the original data. Normally, the size of a probability model is far smaller than the set of frequent itemsets. Therefore the parameters of the model are highly representative. Useful knowledge can be obtained by simply “mining” the parameters of the model directly.

  3. As the scale of the model is often smaller than the original data, it can sometimes serve as a proxy or a replacement for the original data. In real-world applications, the original dataset may be huge and involve large time costs in querying or scanning the dataset. One may also need to run multiple queries on the data, e.g. FIM queries with different thresholds. In such circumstances, if we just want an approximate estimation, a better choice is obviously to use the model to make the inference. As we will show in this paper, when we want to predict all frequent itemsets, generating them from the model is much faster than mining them from the original dataset, because the cost of model prediction does not depend on the scale of the data. And because the model is independent of the minimum frequency threshold, we only need to train the model once and can then make predictions for multiple thresholds while consuming less time.

Several probability models have been proposed to represent the data. Here we give a brief review.

The simplest and most intuitive model is the Independent model. This assumes that the probability of an item appearing in a transaction is independent of all the other items in that transaction. The probabilities of the itemsets are products of the probabilities of the corresponding items. This model is obviously too simple to describe the correlation and association between items, but it is the starting point and base line of many more effective models.
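As a baseline, the independent model only needs each item's marginal frequency; an itemset is then scored as the product of those marginals. A minimal sketch (the toy dataset and item names here are illustrative, not from the paper):

```python
def independent_model(transactions):
    """Return a function scoring an itemset under the independence assumption."""
    n = len(transactions)
    # Marginal frequency of every item seen in the data.
    freq = {}
    for t in transactions:
        for item in t:
            freq[item] = freq.get(item, 0) + 1
    marginal = {item: c / n for item, c in freq.items()}

    def prob(itemset):
        # P(X) = product of the items' marginal probabilities.
        p = 1.0
        for item in itemset:
            p *= marginal.get(item, 0.0)
        return p

    return prob

data = [{"milk", "bread"}, {"milk"}, {"bread", "beer"}, {"milk", "bread", "beer"}]
p = independent_model(data)
# P(milk) = 3/4 and P(bread) = 3/4, so independence predicts 9/16 for {milk, bread},
# while the observed frequency is 2/4: correlations are not captured.
print(p({"milk", "bread"}))  # 0.5625
```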

The Multivariant Tree Distribution model Chow and Liu (1968), also called the Chow-Liu Tree, assumes that there are only pairwise dependencies between the variables, and that the dependency graph on the attributes has a tree structure. There are three steps in building the model: computing the pairwise marginals of the attributes, computing the mutual information between the attributes, and applying Kruskal’s algorithm Kruskal (1956) to find the maximum spanning tree of the full graph, whose nodes are the attributes and whose edge weights are the mutual information between them. Given the tree, the marginal probability of an itemset can first be decomposed into a product of factors via the chain rule and then calculated with the standard belief propagation algorithm Pearl (1988).
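The three tree-building steps can be sketched compactly: estimate pairwise mutual information from counts, then run Kruskal's algorithm preferring high-information edges (a maximum-weight spanning tree). This is an illustration under toy data, not the paper's implementation:

```python
import math
from itertools import combinations

def chow_liu_edges(rows):
    """Return the maximum-mutual-information spanning tree as attribute pairs."""
    n, m = len(rows), len(rows[0])

    def mi(i, j):
        # Empirical mutual information between binary attributes i and j.
        total = 0.0
        for a in (0, 1):
            for b in (0, 1):
                pij = sum(1 for r in rows if r[i] == a and r[j] == b) / n
                pi = sum(1 for r in rows if r[i] == a) / n
                pj = sum(1 for r in rows if r[j] == b) / n
                if pij > 0:
                    total += pij * math.log(pij / (pi * pj))
        return total

    # Kruskal's algorithm on edges sorted by decreasing mutual information.
    parent = list(range(m))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for i, j in sorted(combinations(range(m), 2), key=lambda e: -mi(*e)):
        ri, rj = find(i), find(j)
        if ri != rj:          # keep the edge only if it joins two subtrees
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Attributes 0 and 1 are perfectly correlated; attribute 2 is independent noise.
rows = [(1, 1, 0), (1, 1, 1), (0, 0, 0), (0, 0, 1)]
print(chow_liu_edges(rows))  # the edge (0, 1) carries the highest mutual information
```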

The Maximal Entropy model tries to find a distribution that maximizes the entropy within the constraints of frequent itemsets Pavlov et al. (2003); Tatti (2008) or other statistics Tatti and Mampaey (2010). The model is solved with the Iterative Scaling algorithm, a process for finding the probability of a given itemset query. The algorithm starts from an “ignorant” initial state and iteratively updates the parameters, enforcing that they satisfy the related constraints, until convergence. Finally, the probability of the given query can be calculated from the parameters.

The Bernoulli Mixture model Pavlov et al. (2003); Everitt and Hand (1981) is based on the assumption that there are latent or unobserved types controlling the distribution of the items. Within each type, the items are independent. In other words, the items are conditionally independent given the type. This assumption is a natural extension of the Independent model. The Bernoulli Mixture model is a widely used model for statistical and machine learning tasks. The idea is to use an additive mixture of simple distributions to approximate a more complex distribution. This model is the focus of this paper.

When applying a mixture model to data, one needs to tune the model to the data. There are two ways to do this. In a Maximum-Likelihood Mixture Model, which in this paper we will call the non-Bayesian Mixture Model, the probability is characterised by a set of parameters, which are set by optimizing them to maximize the likelihood of the data. The alternatives are Bayesian Mixture models, in which the parameters are treated as random variables which themselves need to be described via probability distributions. Our work is focused on elucidating the benefits of Bayesian mixtures over non-Bayesian mixtures for frequent itemset mining.

Compared with non-Bayesian machine learning methods, Bayesian approaches have several valuable advantages. Firstly, Bayesian integration does not suffer from over-fitting, because it does not fit parameters directly to the data; it integrates over all parameters, weighted by how well they fit the data. Secondly, prior knowledge can be incorporated naturally and all uncertainty is manipulated in a consistent manner. One of the most prominent recent developments in this field is the application of the Dirichlet process (DP) mixture model Ferguson (1973), a nonparametric Bayesian technique for mixture modelling, which allows for the automatic determination of an appropriate number of mixture components. Here, the term “nonparametric” means the number of mixture components can grow automatically to the necessary scale. The DP is an infinite extension of the Dirichlet distribution, which is the prior distribution for finite Bayesian mixture models. Therefore the DP mixture model can contain as many components as necessary to describe an unknown distribution. By using a model with an unbounded complexity, under-fitting is mitigated, whilst the Bayesian approach of computing or approximating the full posterior over parameters mitigates over-fitting.

The difficulty of such Bayesian approaches is that finding the right model for the data is often computationally intractable. A standard methodology for the DP mixture model is Markov chain Monte Carlo (MCMC) sampling. However, the MCMC approach can be slow to converge and its convergence can be difficult to diagnose. An alternative is the variational inference method developed in recent years Wainwright and Jordan (2003). In this paper, we develop both finite and infinite Bayesian Bernoulli mixture models for transaction data sets, with both MCMC sampling and variational inference, and use them to generate frequent itemsets. We perform experiments to compare the performance of the Bayesian mixture models and the non-Bayesian mixture model. Experimental results show that the Bayesian mixture models can achieve better precision. The DP mixture model can find a proper number of mixtures automatically.

In this paper, we extend the non-Bayesian mixture model to a Bayesian mixture model. The assumptions and the structure of the Bayesian model are proposed, and the corresponding algorithms for inference via MCMC sampling and variational approximation are described. For the sampling approach, we implement a Gibbs sampling algorithm Geman and Geman (1984) for the finite Bayesian mixture model (GSFBM), a multivariate Markov chain Monte Carlo (MCMC) sampling Metropolis and Ulam (1949); Metropolis et al. (1953); Hastings (1970) scheme. For the variational approximation, we implement the variational EM algorithm for the finite Bayesian mixture model (VFBM) by approximating the true posterior with a factorized distribution function. We also extend the finite Bayesian mixture model to the infinite case. The Dirichlet process prior is introduced to the model so that the model can grow to a proper complexity by itself. This solves the problem of finding the proper number of components used in traditional probability models. For this model, we also implement two algorithms. The first is Gibbs sampling for the Dirichlet Process mixture model (GSDPM). The second is the truncated variational EM algorithm for the Dirichlet Process mixture model (VDPM). The word “truncated” means we approximate the model with a finite number of components.

The rest of the paper is organized as follows. In the next section, we define the problem, briefly review the development of FIM, and introduce the notation used in this paper. In section 3, we introduce the non-Bayesian Bernoulli mixture model and its inference by the EM algorithm. In sections 4 and 5, we develop the Bayesian mixture models, including how to do inference via Gibbs sampling and variational EM and how to use the models for predictive inference. Then, in section 6, we use 4 benchmark transaction data sets to test the models, and compare their performance with the non-Bayesian mixture model. We also compare the MCMC approach and the EM approach by their result accuracies and time costs. Finally, we conclude the paper with a discussion of future work.

2 Problem and Notations

Let $I = \{a_1, a_2, \dots, a_M\}$ be the set of items, where $M$ is the number of items. A set $X \subseteq I$ with $|X| = k$ is called an itemset with length $k$, or a $k$-itemset.

A transaction data set over $I$ is a collection of $N$ transactions: $D = \{t^{(1)}, \dots, t^{(N)}\}$. A transaction $t = (t_1, \dots, t_M)$ is an $M$-dimensional binary vector where $t_m \in \{0, 1\}$. A transaction $t$ is said to support an itemset $X$ if and only if $t_m = 1$ for every $a_m \in X$. A transaction can also be written as an itemset $T = \{a_m \in I : t_m = 1\}$; then $T$ supports $X$ if $X \subseteq T$. The frequency of an itemset $X$ is:

$$f(X) = \frac{|\{t \in D : t \text{ supports } X\}|}{N}$$

An itemset $X$ is frequent if its frequency meets a given minimum frequency threshold $f_{\min}$: $f(X) \ge f_{\min}$. The aim of frequent itemset mining is to discover all the frequent itemsets along with their frequencies.

From a probabilistic view, the data set can be regarded as a sample from an unknown distribution. Our aim is to find or approximate the probability distribution which generated the data, and to use this to predict all the frequent itemsets. Inference is the task of narrowing down the possible probability models from the data. In the Bayesian approach, this usually means putting a probability distribution over the unknown parameters; in the non-Bayesian approach, it usually means finding the best or most likely parameters.

3 Bernoulli Mixtures

In this section, we describe the non-Bayesian mixture model. Consider a grocery store where the transactions are purchases of the items the store sells. The simplest model would treat each item as independent, so the probability of a sale containing item A and item B is just the product of the two separate probabilities. However, this would fail to model non-trivial correlations between the items. A more complex model assumes a mixture of independent models. The model assumes the buyers of the store can be characterized into different types representing different consumer preferences. Within each type, the probabilities are independent. In other words, the items are conditionally independent, when conditioned on the component, or type, which generated the given transaction. However, although we observe the transaction, we do not observe the type. Thus, we must employ the machinery of inference to deal with this.

Suppose there are $K$ components or types; then each transaction is generated by one of the components following a multinomial distribution with parameter $\pi = (\pi_1, \dots, \pi_K)$, where $\sum_{k=1}^{K} \pi_k = 1$. Here we introduce a component indicator $z_n$ indicating which component each transaction is generated from: $z_n = k$ if $t^{(n)}$ is generated from the $k$th component. According to the model assumption, once the component is selected, the probabilities of the items are independent of each other. That is, for transaction $t$:

$$p(t \mid z = k, \Theta) = \prod_{m=1}^{M} p(t_m \mid \phi_{km})$$

where $\Theta = (\pi, \phi)$ represents all the parameters of the model. Thus, the probability of a transaction given by the mixture model is:

$$p(t \mid \Theta) = \sum_{k=1}^{K} \pi_k \prod_{m=1}^{M} p(t_m \mid \phi_{km})$$

Since the transactions are binary vectors, we assume the conditional probability of each item follows a Bernoulli distribution with parameter $\phi_{km}$:

$$p(t_m \mid \phi_{km}) = \phi_{km}^{t_m} (1 - \phi_{km})^{1 - t_m}$$

A graphical representation of this model is shown in Figure 1, where circles denote random variables, arrows denote dependencies, and the box (or plate) denotes replication over all data points. In Figure 1, the distribution of each transaction $t^{(n)}$ depends on the selection of $z_n$ and the model parameter $\phi$, and $z_n$ depends on $\pi$. This process is repeated $N$ times to generate the whole data set.

In this model, we need to estimate $\pi$ and $\phi$ from the data. If we knew which component generated each transaction this would be easy. For example, we could estimate $\phi_{km}$ as the frequency with which item $a_m$ occurs in the transactions of component $k$, and $\pi_k$ would be the frequency with which component $k$ occurs in the data. Unfortunately, we do not know which component generated each transaction; it is an unobserved variable. The EM algorithm Dempster et al. (1977) is often used for the parameter estimation problem for models with hidden variables in general, and for mixture models in particular. We describe this in more detail in Appendix 1. For a detailed explanation, see section 9.3.3 of Bishop (2007). The EM algorithm is given in Algorithm 1.

Figure 1: non-Bayesian mixture graphic representation
  initialize π and φ
  repeat
     for n = 1 to N do (E-step)
        for k = 1 to K do
           compute the responsibility γ_nk ∝ π_k ∏_m p(t_m^(n) | φ_km)
        end for
        normalize γ_n over k
     end for
     update π_k = (1/N) ∑_n γ_nk and φ_km = ∑_n γ_nk t_m^(n) / ∑_n γ_nk (M-step)
  until convergence
Algorithm 1 EM algorithm for Bernoulli Mixtures
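A runnable sketch of the EM procedure on toy data (pure Python with illustrative variable names; a practical implementation would vectorize the loops and test convergence instead of running a fixed number of iterations):

```python
import math, random

def em_bernoulli_mixture(data, K, iters=50, seed=0):
    """Fit a K-component Bernoulli mixture by EM; returns (pi, phi)."""
    rng = random.Random(seed)
    N, M = len(data), len(data[0])
    pi = [1.0 / K] * K
    phi = [[rng.uniform(0.25, 0.75) for _ in range(M)] for _ in range(K)]

    for _ in range(iters):
        # E-step: responsibilities gamma[n][k] ∝ pi_k * prod_m p(t_m | phi_km).
        gamma = []
        for t in data:
            logw = []
            for k in range(K):
                lp = math.log(pi[k])
                for m in range(M):
                    p = phi[k][m] if t[m] else 1.0 - phi[k][m]
                    lp += math.log(max(p, 1e-12))  # guard against log(0)
                logw.append(lp)
            mx = max(logw)
            w = [math.exp(x - mx) for x in logw]
            s = sum(w)
            gamma.append([x / s for x in w])
        # M-step: re-estimate mixing weights and Bernoulli parameters.
        for k in range(K):
            nk = sum(g[k] for g in gamma)
            pi[k] = nk / N
            for m in range(M):
                phi[k][m] = sum(g[k] * t[m] for g, t in zip(gamma, data)) / nk
    return pi, phi

# Two obvious "types": {items 0,1} buyers and {items 2,3} buyers.
data = [(1, 1, 0, 0)] * 5 + [(0, 0, 1, 1)] * 5
pi, phi = em_bernoulli_mixture(data, K=2)
print(sorted(round(p, 2) for p in pi))
```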

Another problem with this algorithm is the selection of $K$. The choice of $K$ will greatly influence the quality of the result. If $K$ is too small, the model cannot provide sufficiently accurate results. Conversely, if $K$ is too large, it may cause over-fitting problems. There is no single procedure for finding the correct $K$. People often try several increasing values of $K$ and determine the proper one by comparing the result qualities, preventing over-fitting by cross-validation or some other criterion such as the Bayesian Information Criterion Schwarz (1978).

Predicting frequent itemsets with this model is quite straightforward. For any itemset $X$, its probability is calculated by taking into account only the items occurring in $X$ and ignoring (i.e., marginalizing over) the items which are not in $X$:

$$p(X) = \sum_{k=1}^{K} \pi_k \prod_{a_m \in X} \phi_{km}$$

The number of free parameters used for prediction is $K(M+1)$.
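The marginalization is a few lines of code: items outside the itemset simply do not appear in the product. A sketch with hypothetical fitted parameters:

```python
def itemset_probability(itemset, pi, phi):
    """P(X) = sum_k pi_k * prod_{m in X} phi[k][m]; items outside X marginalize away."""
    total = 0.0
    for k in range(len(pi)):
        p = pi[k]
        for m in itemset:
            p *= phi[k][m]
        total += p
    return total

# Toy parameters: one component always buys item 0, the other never does.
pi = [0.6, 0.4]
phi = [[1.0, 0.2], [0.0, 0.5]]
print(itemset_probability({0, 1}, pi, phi))  # 0.6*1.0*0.2 + 0.4*0.0*0.5 = 0.12
```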

The last issue is how to generate the full set of frequent itemsets. In frequent itemset mining algorithms, obtaining the frequencies of the itemsets from the data set is always time consuming. Most algorithms, such as Apriori Agrawal and Srikant (1994), require multiple scans of the data set, or use extra memory to maintain special data structures such as the tid_lists of Eclat Zaki (2000) and the FP-tree of FP-growth Han et al. (2000). In the Bernoulli mixture model approach, with a prepared model, both time and memory cost can be greatly reduced, with some accuracy loss, since the frequency counting process is replaced by a simple calculation of sums and products. To find the frequent itemsets using any of the probability models in this paper, we simply mine the probability models instead of the data. To do this, one can use any frequent itemset mining algorithm; we use Eclat. However, instead of measuring the frequency of the itemsets, we calculate their probabilities from the probability model.

Typically this results in a great improvement in the complexity of determining itemset frequency. For a given candidate itemset, to check the exact frequency of the itemset we need to scan the original dataset for Apriori, or check the cached data structure in memory for Eclat. In both algorithms, the time complexity is $O(N)$, where $N$ is the number of transactions in the dataset. However, the calculation in the mixture model merely needs $Kk$ multiplications and $K$ additions, where $k$ is the length of the itemset. Normally, $Kk$ is much smaller than $N$.

The exact search strategy with the Bernoulli mixture model is similar to Eclat or Apriori, based on the Apriori principle Agrawal and Srikant (1994): all of a frequent itemset's sub-itemsets are frequent, and all of an infrequent itemset's super-itemsets are infrequent. Following this principle, the search space can be significantly reduced. In our research we use the Eclat lattice-decomposing framework to organize the search process. We do not discuss this framework in detail in this paper; a more specific explanation is given by Zaki (2000).
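A minimal depth-first enumeration in this spirit, pruning with the Apriori principle but scoring candidates with the model instead of counting in the data (toy parameters; a simplified illustration rather than the Eclat framework the paper uses):

```python
import math

def mine_frequent(pi, phi, threshold):
    """Enumerate itemsets whose model probability meets the threshold, depth first."""
    M = len(phi[0])

    def prob(itemset):
        return sum(pi[k] * math.prod(phi[k][m] for m in itemset)
                   for k in range(len(pi)))

    results = {}
    def expand(prefix, start):
        for m in range(start, M):
            candidate = prefix + (m,)
            p = prob(candidate)
            if p >= threshold:           # Apriori principle: only frequent
                results[candidate] = p   # prefixes are ever extended
                expand(candidate, m + 1)
    expand((), 0)
    return results

pi = [0.5, 0.5]
phi = [[0.9, 0.9, 0.1], [0.1, 0.1, 0.9]]
found = mine_frequent(pi, phi, threshold=0.3)
print(sorted(found))  # singletons (0,), (1,), (2,) and the correlated pair (0, 1)
```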

4 The Finite Bayesian Mixtures

4.1 Definition of the model

For easier model comparison, we use the same notation for the non-Bayesian model, the finite Bayesian model and the later infinite Bayesian model when this causes no ambiguity. The difference between Bayesian mixture models and non-Bayesian mixture models is that Bayesian mixtures try to form a smooth distribution over the model parameters by introducing appropriate priors. The original mixture model introduced in the previous section is a two-layer model. The top layer is the multinomial distribution for choosing the mixtures, and the next layer is the Bernoulli distribution for the items. In the Bayesian mixture we introduce a Dirichlet distribution Ferguson (1973) as the prior of the multinomial parameter $\pi$ and Beta distributions as the priors of the Bernoulli parameters $\phi_{km}$. The new model assumes that the data was generated as follows.

  1. Assign $(\alpha, a, b)$ as the hyperparameters of the model, where $\alpha$, $a$ and $b$ are all positive scalars. These are chosen a priori.

  2. Choose $\pi \sim \mathrm{Dir}(\alpha)$, where

$$p(\pi \mid \alpha) = \frac{\Gamma(K\alpha)}{\Gamma(\alpha)^K} \prod_{k=1}^{K} \pi_k^{\alpha - 1}$$

    with $\sum_{k=1}^{K} \pi_k = 1$; $\sim$ denotes sampling, and Dir is the Dirichlet distribution.

  3. For each item $m$ and component $k$, choose $\phi_{km} \sim \mathrm{Beta}(a, b)$, where

$$p(\phi_{km} \mid a, b) = \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \phi_{km}^{a - 1} (1 - \phi_{km})^{b - 1}$$

    with $0 \le \phi_{km} \le 1$, and Beta denotes the Beta distribution.

  4. For each transaction $t^{(n)}$:

    1. Choose a component $z_n \sim \mathrm{Multinomial}(\pi)$, where $p(z_n = k \mid \pi) = \pi_k$.

    2. Then generate the data by: $t_m^{(n)} \sim \mathrm{Bernoulli}(\phi_{z_n m})$ for $m = 1, \dots, M$.
Figure 2 is a graphical representation of the Bayesian mixture model.

Figure 2: finite Bayesian mixture graphic representation

This process can be briefly written as:

$$\pi \sim \mathrm{Dir}(\alpha), \quad \phi_{km} \sim \mathrm{Beta}(a, b), \quad z_n \sim \mathrm{Multinomial}(\pi), \quad t_m^{(n)} \sim \mathrm{Bernoulli}(\phi_{z_n m})$$

In other words, the assumption is that the data was generated by first performing the first two steps to obtain the parameters, then performing the last two steps $N$ times to generate the data. Since important variables of the model, namely $\pi$, $\phi$ and $z$, are not known, Bayesian principles say that we should compute distributions over them, and then integrate them out to get the quantities of interest. However, this is not tractable. Therefore, we implement two common approximation schemes: Gibbs sampling and variational Bayes.
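The generative steps can be simulated directly with standard-library samplers; a sketch (hyperparameter values and the function name are illustrative):

```python
import random

def generate(N, M, K, alpha=1.0, a=1.0, b=1.0, seed=1):
    """Sample a transaction data set from the finite Bayesian mixture prior."""
    rng = random.Random(seed)
    # Step 2: pi ~ Dir(alpha), drawn as normalized Gamma variates.
    g = [rng.gammavariate(alpha, 1.0) for _ in range(K)]
    pi = [x / sum(g) for x in g]
    # Step 3: phi[k][m] ~ Beta(a, b).
    phi = [[rng.betavariate(a, b) for _ in range(M)] for _ in range(K)]
    data = []
    for _ in range(N):
        # Step 4.1: pick a component; Step 4.2: emit each item independently.
        z = rng.choices(range(K), weights=pi)[0]
        data.append(tuple(int(rng.random() < phi[z][m]) for m in range(M)))
    return pi, phi, data

pi, phi, data = generate(N=100, M=5, K=3)
print(len(data), len(data[0]))  # 100 5
```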

4.2 Finite Bayesian mixtures via Gibbs sampling

One approach to Bayesian inference is to approximate probabilistic integrals by sums over finite samples drawn from the probability distribution of interest. Gibbs sampling is an example of the Markov chain Monte Carlo method, a method of sampling from a probability distribution; it works by sampling one variable at a time. We will use a collapsed Gibbs sampler, which means we will not use sampling to estimate all parameters: we use sampling to infer the components which generated each data point and integrate out the other parameters.

We first introduce the inference of the model via Gibbs sampling. Similar to the non-Bayesian mixture model, we need to work with the distribution of the component indicators $z = (z_1, \dots, z_N)$. According to the model, the joint distribution of $z$ is:

$$p(z \mid \alpha) = \int p(z \mid \pi) \, p(\pi \mid \alpha) \, d\pi = \frac{\Gamma(K\alpha)}{\Gamma(N + K\alpha)} \prod_{k=1}^{K} \frac{\Gamma(N_k + \alpha)}{\Gamma(\alpha)} \quad (11)$$

where $N_k = \sum_{n=1}^{N} \mathbb{1}[z_n = k]$ is the number of points assigned to the $k$th component, the integral over $\pi$ means the integral over a $(K-1)$-dimensional simplex, and the indicator function $\mathbb{1}[\cdot]$ equals 1 if its argument is true and 0 otherwise.

The conditional probability of the $n$th assignment given the other assignments is:

$$p(z_n = k \mid z_{-n}, \alpha) = \frac{N_{k,-n} + \alpha}{N - 1 + K\alpha} \quad (12)$$

where $N_{k,-n}$ is the number of points assigned to the $k$th component excluding the $n$th point. The posterior distribution of the Bernoulli parameter $\phi_{km}$, if we know the component assignment, is:

$$p(\phi_{km} \mid z, D) = \mathrm{Beta}(a + n_{km}, \; b + N_k - n_{km}) \quad (13)$$

where $n_{km}$ is the number of transactions assigned to component $k$ that contain item $a_m$.

Combining Equations (11) and (13), we can calculate the posterior probability of the $n$th assignment by integrating out $\phi$:

$$p(z_n = k \mid z_{-n}, D) \propto \frac{N_{k,-n} + \alpha}{N - 1 + K\alpha} \prod_{m=1}^{M} \frac{(a + n_{km,-n})^{t_m^{(n)}} (b + N_{k,-n} - n_{km,-n})^{1 - t_m^{(n)}}}{a + b + N_{k,-n}} \quad (14)$$

where $N_{k,-n}$ and $n_{km,-n}$ are calculated excluding the $n$th point, and the integral over $\phi$ means the integral over a $K \times M$-dimensional vector $\phi$. Equation (14) shows how to sample the component indicator based on the other assignments of the transactions. The whole process of the collapsed Gibbs sampling for the finite Bayesian mixture model is shown in Algorithm 2. Initialization of the parameters $\alpha$, $a$ and $b$ is discussed in section 6.

  input parameters α, a, b
  input parameter K as the number of components
  initialize z to be a random assignment
  repeat
     for n = 1 to N do
        For all k update N_k,-n and n_km,-n by excluding point n
        For all k calculate the multinomial probabilities p(z_n = k | z_-n, D) based on Equation (14)
        Normalize over k
        Sample z_n based on the normalized probabilities
     end for
  until convergence
Algorithm 2 collapsed Gibbs sampling for finite Bayesian mixture model
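A compact sketch of this sampler in pure Python (the count bookkeeping follows the collapsed posterior described above; variable names are ours, and a real run would monitor convergence rather than use a fixed number of sweeps):

```python
import random

def collapsed_gibbs(data, K, alpha=1.0, a=1.0, b=1.0, sweeps=30, seed=0):
    """Collapsed Gibbs sampling of component indicators for the finite model."""
    rng = random.Random(seed)
    N, M = len(data), len(data[0])
    z = [rng.randrange(K) for _ in range(N)]
    Nk = [0] * K                        # points per component
    nkm = [[0] * M for _ in range(K)]   # per-component item counts
    for n, t in enumerate(data):
        Nk[z[n]] += 1
        for m in range(M):
            nkm[z[n]][m] += t[m]

    for _ in range(sweeps):
        for n, t in enumerate(data):
            old = z[n]                  # remove point n from its component
            Nk[old] -= 1
            for m in range(M):
                nkm[old][m] -= t[m]
            weights = []
            for k in range(K):
                w = Nk[k] + alpha       # Dirichlet part: counts of other points
                denom = a + b + Nk[k]
                for m in range(M):
                    # Beta-Bernoulli predictive for item m in component k.
                    num = (a + nkm[k][m]) if t[m] else (b + Nk[k] - nkm[k][m])
                    w *= num / denom
                weights.append(w)
            new = rng.choices(range(K), weights=weights)[0]
            z[n] = new                  # add point n back under its new label
            Nk[new] += 1
            for m in range(M):
                nkm[new][m] += t[m]
    return z

data = [(1, 1, 0, 0)] * 6 + [(0, 0, 1, 1)] * 6
z = collapsed_gibbs(data, K=2)
# With well-separated data the two blocks typically land in different components.
print(z)
```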

The predictive inference after Gibbs sampling is quite straightforward. We can estimate the proportion $\pi$ and the conditional probability parameters $\phi$ from the sampling results. The proportion is inferred from the component indicators we sampled:

$$\hat{\pi}_k = \frac{N_k + \alpha}{N + K\alpha}$$

The conditional Bernoulli parameters are estimated as follows:

$$\hat{\phi}_{km} = \frac{n_{km} + a}{N_k + a + b}$$

For a given itemset $X$, its predictive probability is:

$$p(X) = \sum_{k=1}^{K} \hat{\pi}_k \prod_{a_m \in X} \hat{\phi}_{km}$$

In practice, the parameters $\hat{\pi}$ and $\hat{\phi}$ need to be calculated only once for prediction. The model contains $K(M+1)$ free parameters.

4.3 Finite Bayesian Mixture Model via Variational Inference

In this section we describe the variational EM algorithm Beal (2003); Bishop (2007) for this model. Based on the model assumption, the joint probability of a transaction $t^{(n)}$, its component indicator $z_n$ and the model parameters $\pi$ and $\phi$ is:

$$p(t^{(n)}, z_n, \pi, \phi) = p(\pi \mid \alpha) \, p(\phi \mid a, b) \, p(z_n \mid \pi) \, p(t^{(n)} \mid z_n, \phi)$$

For the whole data set:

$$p(D, z, \pi, \phi) = p(\pi \mid \alpha) \, p(\phi \mid a, b) \prod_{n=1}^{N} p(z_n \mid \pi) \, p(t^{(n)} \mid z_n, \phi)$$

Integrating over $\pi$ and $\phi$, summing over $z$ and taking the logarithm, we obtain the log-likelihood of the data set:

$$\log p(D) = \log \int \!\! \int \sum_{z} p(D, z, \pi, \phi) \, d\pi \, d\phi$$

Here the integral over $\pi$ means the integral over a $(K-1)$-dimensional simplex, the integral over $\phi$ means the integral over a $K \times M$-dimensional vector, and the sum over $z$ is over all possible configurations. This integral is intractable because of the coupling of $\pi$, $\phi$ and $z$. The variational approach therefore approximates the true posterior with a simpler distribution, chosen so that the variables are decoupled and the approximate distribution is as close as possible to the true distribution. In other words, the task is to find the decoupled distribution most like the true distribution, and to use that approximate distribution to do inference.

We assume the approximating distribution has the following factorized form:

$$q(z, \pi, \phi) = q(\pi \mid \tilde{\alpha}) \prod_{k=1}^{K} \prod_{m=1}^{M} q(\phi_{km} \mid \tilde{a}_{km}, \tilde{b}_{km}) \prod_{n=1}^{N} q(z_n \mid \gamma_n)$$

Here $\tilde{\alpha}$, $\tilde{a}$ and $\tilde{b}$ are free variational parameters corresponding to the hyperparameters $\alpha$, $a$ and $b$, and $\gamma$ is the multinomial parameter for decoupling $z$ and $\pi$. We use this function to approximate the true posterior distribution of the parameters. To achieve this, we need to estimate the values of $\tilde{\alpha}$, $\tilde{a}$, $\tilde{b}$ and $\gamma$. Similar to the non-Bayesian mixture EM, we expand the log-likelihood and optimize its lower bound. The optimization process is quite similar to the calculations in the non-Bayesian EM part. In the optimization, we use the fact that if $\pi \sim \mathrm{Dir}(\alpha)$ then $\mathbb{E}[\log \pi_k] = \psi(\alpha_k) - \psi(\sum_j \alpha_j)$, where $\psi$ is the digamma function. This yields:

$$\tilde{\alpha}_k = \alpha + \sum_{n=1}^{N} \gamma_{nk} \quad (22)$$

$$\tilde{a}_{km} = a + \sum_{n=1}^{N} \gamma_{nk} t_m^{(n)} \quad (23)$$

$$\tilde{b}_{km} = b + \sum_{n=1}^{N} \gamma_{nk} (1 - t_m^{(n)}) \quad (24)$$

$$\gamma_{nk} \propto \exp\Big( \psi(\tilde{\alpha}_k) - \psi\big(\textstyle\sum_j \tilde{\alpha}_j\big) + \sum_{m=1}^{M} \big[ t_m^{(n)} \psi(\tilde{a}_{km}) + (1 - t_m^{(n)}) \psi(\tilde{b}_{km}) - \psi(\tilde{a}_{km} + \tilde{b}_{km}) \big] \Big) \quad (25)$$

Equations (22) to (25) form an iterated optimization procedure. A brief demonstration of this procedure is given in Algorithm 3.

  input parameters α, a and b
  input parameter K as the number of components
  initialize γ to be a random assignment
  repeat
     For all k update α̃_k, ã_km and b̃_km by Equations (22) to (24)
     for n = 1 to N do
        for k = 1 to K do
           Update γ_nk according to (25)
        end for
        Normalize γ_n over k
     end for
  until convergence
Algorithm 3 Variational EM for Finite Bayesian Bernoulli Mixtures
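The key quantity in the update of γ is the expectation E[log π_k] = ψ(α̃_k) − ψ(∑_j α̃_j) and its Beta analogues. A sketch of a single responsibility update; since the digamma function is not in Python's standard library, it is implemented here with the usual recurrence-plus-asymptotic-series trick (all names are ours):

```python
import math

def digamma(x):
    """Digamma via recurrence up to x >= 6, then the asymptotic expansion."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def responsibilities(t, alpha_t, a_t, b_t):
    """One variational E-step row: gamma_nk from current variational parameters."""
    K, M = len(alpha_t), len(t)
    psi_total = digamma(sum(alpha_t))
    logw = []
    for k in range(K):
        s = digamma(alpha_t[k]) - psi_total          # E[log pi_k]
        for m in range(M):
            psi_ab = digamma(a_t[k][m] + b_t[k][m])
            if t[m]:
                s += digamma(a_t[k][m]) - psi_ab     # E[log phi_km]
            else:
                s += digamma(b_t[k][m]) - psi_ab     # E[log (1 - phi_km)]
        logw.append(s)
    mx = max(logw)
    w = [math.exp(x - mx) for x in logw]
    return [x / sum(w) for x in w]

# Component 0's variational Beta strongly favours item 0 being present.
g = responsibilities((1,), alpha_t=[1.0, 1.0], a_t=[[9.0], [1.0]], b_t=[[1.0], [9.0]])
print([round(x, 3) for x in g])  # most of the weight goes to component 0
```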

For any itemset $X$, its predictive probability given by the model is:

$$p(X) = \int \!\! \int q(\pi \mid \tilde{\alpha}) \, q(\phi) \sum_{k=1}^{K} \pi_k \prod_{a_m \in X} \phi_{km} \, d\pi \, d\phi = \sum_{k=1}^{K} \frac{\tilde{\alpha}_k}{\sum_j \tilde{\alpha}_j} \prod_{a_m \in X} \frac{\tilde{a}_{km}}{\tilde{a}_{km} + \tilde{b}_{km}} \quad (26)$$

In Equation (26), we use the decoupled $q$ to replace the true posterior distribution so that the integral is solvable. Equation (26) shows that for predictive inference we only need the values of $\tilde{\alpha}_k$, and of $\tilde{a}_{km}$ and $\tilde{b}_{km}$ in proportion. Therefore the number of parameters is exactly the same as in the non-Bayesian model.

5 The Dirichlet Process Mixture Model

The finite Bayesian mixture model is still restricted by the fact that the number of components must be chosen in advance. Ferguson Ferguson (1973) proposed the Dirichlet Process (DP) as the infinite extension of the Dirichlet distribution. Applying the DP as the prior of the mixture model allows us to have an arbitrary number of components, growing as necessary during the learning process. In the finite Bayesian mixture model, the Dirichlet distribution is a prior for choosing components. Here the components are in fact distributions drawn from a base distribution $G_0 = \mathrm{Beta}(a, b)$. In the Dirichlet distribution, the number of components is a fixed number $K$, so each time we draw a distribution, the result is equal to one of the $K$ distributions drawn from the base distribution, with probabilities given by the Dirichlet distribution. Now we relax the number of components to be unlimited while keeping the discreteness of the components, which means that each time we draw a distribution (component), the result is either equal to an existing distribution or is a new draw from the base distribution. This new process is called the Dirichlet Process Ferguson (1973) and the drawing scheme is Blackwell-MacQueen's Pólya urn scheme Blackwell and Macqueen (1973):

$$\theta_n \mid \theta_1, \dots, \theta_{n-1} \sim \frac{\alpha}{n - 1 + \alpha} G_0 + \sum_{j=1}^{n-1} \frac{1}{n - 1 + \alpha} \delta_{\theta_j} \quad (27)$$

The previous model should then be rewritten as:

$$G \sim \mathrm{DP}(\alpha, G_0), \quad \phi_n \sim G, \quad t_m^{(n)} \sim \mathrm{Bernoulli}(\phi_{nm})$$

5.1 The Dirichlet Process Mixture Model via Gibbs Sampling

Based on the Pólya urn scheme, we can allow the number of components to grow. Following this scheme, every time we draw a distribution there is a chance that the distribution comes from the base distribution, thereby adding a new component to the model. This gives the number of components the potential to grow to any positive integer.
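The urn scheme is easy to simulate, which also makes the growth behaviour visible. A sketch, assuming the base distribution is supplied as a sampling function (all names are ours):

```python
import random

def polya_urn(n, alpha, base_draw, seed=0):
    """Blackwell-MacQueen Polya urn: the (i+1)-th draw is a fresh value from
    the base distribution with probability alpha / (i + alpha), otherwise it
    repeats one of the i earlier draws chosen uniformly at random."""
    rng = random.Random(seed)
    draws = []
    for i in range(n):
        if rng.random() < alpha / (i + alpha):
            draws.append(base_draw(rng))      # new component from the base
        else:
            draws.append(rng.choice(draws))   # join an existing component
    return draws

samples = polya_urn(500, 2.0, lambda rng: rng.betavariate(1.5, 1.5))
print(len(set(samples)))  # number of distinct components; grows roughly like alpha * log n
```

Because early draws accumulate repeats, the number of distinct values grows only logarithmically in the number of draws, which is what lets the DP mixture settle on a modest number of components.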

Assume at a certain stage, the actual number of components is . Based on Equation (27):

Then the probability that the th point is in a new component is:

The rest of the posterior probability remains the same, as there is no involved:


For the new component, and we have,


Equations (29) and (30) form a collapsed Gibbs sampling scheme. At the beginning, all data points are assigned to a single initial component. Then, for each data point in the data set, the component indicator is sampled according to the posterior distribution given by Equations (29) and (30). After the indicator is sampled, the relevant parameters , and are updated before moving to the next data point. The whole process keeps running until a convergence condition is met. Algorithm 4 describes the method.

  input parameters , ,
  initialize by assigning all data points to a single initial component
  repeat
     for  to  do
         For all , update by
         Calculate multinomial probabilities based on
         Normalize over
         Sample based on
         if a new component is selected then
            add the new component and update its parameters
         end if
     end for
  until convergence
Algorithm 4 Collapsed Gibbs sampling for the Dirichlet process mixture model

The predictive inference is generally the same as in the finite version.
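One sweep of the collapsed sampler can be sketched as follows. This is our own illustrative implementation, not the paper's code: each transaction is removed from its current component, scored against every occupied component's Beta-Bernoulli posterior predictive and against a fresh component, and then reassigned by sampling.

```python
import math
import random

def gibbs_sweep(X, z, sizes, counts, alpha=1.0, a0=1.5, b0=1.5, rng=None):
    """One collapsed Gibbs sweep over a DP mixture of Bernoullis.
    z[n] is row n's component id; sizes[k] its member count;
    counts[k][d] the number of 1s for item d among its members."""
    rng = rng or random.Random(0)
    D = len(X[0])
    for n, x in enumerate(X):
        k_old = z[n]                      # remove row n from its component
        sizes[k_old] -= 1
        for d in range(D):
            counts[k_old][d] -= x[d]
        if sizes[k_old] == 0:             # drop an emptied component
            del sizes[k_old], counts[k_old]
        keys, logw = [], []
        for k in sizes:                   # score occupied components
            lp = math.log(sizes[k])       # urn prior: proportional to size
            for d in range(D):
                p1 = (a0 + counts[k][d]) / (a0 + b0 + sizes[k])
                lp += math.log(p1 if x[d] else 1.0 - p1)
            keys.append(k)
            logw.append(lp)
        lp = math.log(alpha)              # ...and a brand-new component
        p1 = a0 / (a0 + b0)
        for d in range(D):
            lp += math.log(p1 if x[d] else 1.0 - p1)
        keys.append(max(sizes, default=-1) + 1)
        logw.append(lp)
        m = max(logw)                     # sample the new assignment
        w = [math.exp(v - m) for v in logw]
        k_new = rng.choices(keys, weights=w)[0]
        if k_new not in sizes:
            sizes[k_new], counts[k_new] = 0, [0] * D
        z[n] = k_new
        sizes[k_new] += 1
        for d in range(D):
            counts[k_new][d] += x[d]
```

Running repeated sweeps lets the number of occupied components rise or fall as the data demand, which is the behaviour the text describes.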

5.2 DP Mixtures via Variational Inference

Although the Gibbs sampler can provide a very accurate approximation to the posterior distribution over the component indicators, it needs to update the relevant parameters for every data point. Thus it is computationally expensive and not well suited to large-scale problems. In 1994, Sethuraman developed the stick-breaking representation Sethuraman (1994) of the DP, which captures the DP prior most explicitly among the available representations. In the stick-breaking representation, an unknown random distribution is represented as a sum of countably infinitely many atomic distributions. This representation provides a way to perform inference for DP mixtures with variational methods. A variational method for DP mixtures was proposed by Blei and Jordan (2005); they showed that it produces results comparable to MCMC sampling algorithms, including collapsed Gibbs sampling, but is much faster.

In the transaction data set setting, the target distribution is the distribution of transactions, and the atomic distributions are the conditional distributions such as . Based on the stick-breaking representation, the Dirichlet process mixture model is as follows.

  1. Assign as the hyperparameter of the Dirichlet process and , as the hyperparameters of the base Beta distribution, all of which are positive scalars.

  2. Choose

  3. Choose

  4. For each transaction :

    1. Choose a component Multinomial() where

    2. Then we can generate data by:


The stick-breaking construction for the DP mixture is depicted in Figure 3.

Figure 3: Graphic representation of DP mixture in stick-breaking representation
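The stick-breaking construction is straightforward to simulate, and doing so previews why the truncation used later in this section is harmless: the mass left after T sticks shrinks geometrically, with expectation (alpha / (1 + alpha))**T. A sketch with our own naming:

```python
import random

def stick_breaking(alpha, T, seed=0):
    """First T stick-breaking weights: v_k ~ Beta(1, alpha),
    pi_k = v_k * prod_{j<k} (1 - v_j); also return the unbroken remainder."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(T):
        v = rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights, remaining

pi, leftover = stick_breaking(alpha=1.0, T=20)
print(sum(pi) + leftover)  # always 1 up to rounding
# E[leftover] = (alpha / (1 + alpha))**T, i.e. 2**-20 here, so truncating
# at T discards almost no probability mass in expectation.
```

Each transaction is then generated by sampling a component from these weights and drawing its items from that component's Bernoulli parameters.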

With the model assumption, the joint probability of the data set , components indicators and the model parameters and is:


Integrating over , , summing over and applying the logarithm, we obtain the log-likelihood of the data set:


Here the integral over means an integral over the vector , and the integral over means an integral over the vector . The sum over is over all possible configurations. This integral is intractable because of the integration over infinitely many dimensions and the coupling of and .

Notice the following limit with a given truncation :


Equation (35) shows that for a large enough truncation level , all components beyond the th can be ignored, as the sum of their proportions is very close to 0; in other words, the infinite model can be approximated by a finite number of components. The difference from the finite Bayesian model is that in the finite Bayesian mixture the number of components is finite, whereas in the truncated DP mixture the number of components remains infinite and we merely use a finite distribution to approximate it. We can therefore use a finite and fully decoupled function to approximate the true posterior distribution, and we propose the following factorized family of variational distributions:


Here , , and are free variational parameters corresponding to the hyperparameters 1, , and , and is the multinomial parameter that decouples and . As we assume the proportion of the components beyond is 0, the value of in the approximation is always 1. We use this function to approximate the true posterior distribution of the parameters; to do so, we need to estimate the values of , , and . A detailed derivation of the optimization is given by Blei and Jordan (2005). The optimization yields:


Equations (37) to (41) form an iterative optimization procedure. A brief description of this procedure is given in Algorithm 5.

  input parameters , and
  input parameter as the truncated number of components
  initialize to be a random assignment
  repeat
     For all , update by
     for  to  do
        for  to  do
           Update according to (41)
        end for
        Normalize over
     end for
  until convergence
Algorithm 5 Variational EM for DP Bernoulli Mixtures

The predictive inference is given by Equation (42). As in the finite model, we use the decoupled function in place of the true posterior distribution so that the integral can be computed analytically. In fact, we only need the value of as the proportion of each component. Thus, the number of parameters used for prediction is the same as in the finite model if we set the truncation level to the same value as the number of components of the finite model.


6 Empirical Results and Discussion

In this section, we compare the performance of the proposed models with the non-Bayesian mixture model using 5 synthetic data sets and 4 real benchmark data sets. We generate the five synthetic data sets from five mixture models with 15, 25, 50, 75 and 140 components respectively, and apply the four methods to them to see how closely the new models match the original mixture model. For the real data sets, we choose mushroom, chess, the Anonymous Microsoft Web data Frank and Asuncion (2010) and accidents Geurts et al. (2003). The mushroom and chess data sets we used were transformed into discrete binary data sets by Roberto Bayardo, and the transformed versions can be downloaded at . Table 1 summarizes the main characteristics of these 4 data sets.

Data set      Records   Items   Number of 1s   Density
chess         3197      75      118252         49.32%
mushroom      8125      119     186852         19.33%
MS Web Data   37711     294     113845         1.03%
accidents     341084    468     11500870       7.22%
Table 1: General characteristics of the testing data sets: Records is the number of records, Items is the number of items, Number of 1s is the number of “1”s, and Density reflects the sparseness of the data set, calculated as the number of “1”s divided by (Records × Items).
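The density column can be reproduced directly from the other three columns. A quick check (the accidents entry comes out marginally below the reported 7.22%, presumably a small rounding or counting difference in the original table):

```python
datasets = {
    # name: (records, items, number of 1s), taken from Table 1
    "chess":       (3197, 75, 118252),
    "mushroom":    (8125, 119, 186852),
    "MS Web Data": (37711, 294, 113845),
    "accidents":   (341084, 468, 11500870),
}
for name, (n_rows, n_items, n_ones) in datasets.items():
    density = n_ones / (n_rows * n_items)
    print(f"{name}: {100 * density:.2f}%")
```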

We randomly sampled a proportion of each data set for training and used the rest for testing. For the synthetic data sets, mushroom and chess, we sampled half of the data for training and used the rest for testing. For the MS Web data, the training and testing sets were already prepared, with 32711 records for training and 5000 records for testing. For accidents, as this data set is too large to fit into memory, we sampled 1/20 of it as training data and another 1/20 for testing. We use the following 3 evaluation criteria for model comparison.

  1. We measure the difference between the predicted set of frequent itemsets and the true set of frequent itemsets by calculating the false negative rate () and the false positive rate (). They are calculated by

    where is the number of itemsets that the model failed to predict, is the number of itemsets that the model falsely predicted and is the number of itemsets that the model predicted correctly. Note that gives the recall and gives the precision of the data-miner.

  2. For any true frequent itemset , we calculate the relative error by:

    where is the probability predicted by the model. The overall quality of the estimation is:


    where is the total number of true frequent itemsets.

  3. To test whether the model is under-estimating or over-estimating, we define the empirical mean of relative difference for a given set as:


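These criteria are simple to compute once the true and predicted frequent itemsets are collected as sets of frozensets. The exact symbols above were lost in extraction, so the sketch below simply follows the stated identities (recall = 1 - false negative rate, precision = 1 - false positive rate); all names are illustrative:

```python
def itemset_error_rates(true_itemsets, predicted_itemsets):
    """False negative and false positive rates over itemset collections,
    defined so that recall = 1 - fnr and precision = 1 - fpr."""
    true_itemsets, predicted_itemsets = set(true_itemsets), set(predicted_itemsets)
    n_correct = len(true_itemsets & predicted_itemsets)
    n_missed = len(true_itemsets - predicted_itemsets)   # frequent but not predicted
    n_false = len(predicted_itemsets - true_itemsets)    # predicted but not frequent
    fnr = n_missed / (n_missed + n_correct) if n_missed + n_correct else 0.0
    fpr = n_false / (n_false + n_correct) if n_false + n_correct else 0.0
    return fnr, fpr

def mean_relative_error(true_probs, model_probs):
    """Average relative error over the true frequent itemsets (criterion 2)."""
    errors = [abs(m - t) / t for t, m in zip(true_probs, model_probs)]
    return sum(errors) / len(errors)

def mean_relative_difference(true_probs, model_probs):
    """Signed version (criterion 3): negative means the model under-estimates."""
    diffs = [(m - t) / t for t, m in zip(true_probs, model_probs)]
    return sum(diffs) / len(diffs)
```

The signed variant is what reveals the systematic under-estimation of frequent-itemset probabilities reported in the abstract.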
The parameter settings of the experiments are as follows. As the aim of applying the algorithms to the synthetic data sets is to see how closely the new models match the original mixture model, we assume that we already know the correct model before learning. Therefore, for the synthetic data sets, we choose the same number of components as the original mixture model, except for the DP mixture via Gibbs sampling. For the real data sets, we used 15, 25, 50 and 75 components respectively for the finite Bayesian models and the truncated DP mixture model. For the DP mixture model via the Gibbs sampler, the number of components does not need to be set in advance. For each parameter configuration, we repeat the experiment 5 times to reduce the variance. The hyper-parameters for both the finite and infinite Bayesian models are set as follows: equals 1.5, equals the frequency of the items in the whole data set, and equals .

            chess     mushroom   MS Web   accidents
threshold   50%       20%        0.5%     30%
total       1262028   53575      570      146904
1 37