Finding the True Frequent Itemsets

01/07/2013 ∙ by Matteo Riondato, et al. ∙ Brown University

Frequent Itemsets (FIs) mining is a fundamental primitive in data mining. It requires identifying all itemsets appearing in at least a fraction θ of a transactional dataset D. Often, though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated it. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in D one attempts to infer itemsets that are frequently (i.e., with probability at least θ) generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently stochastic nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of false positives, i.e., spurious itemsets that are not among the TFIs. In this work we design and analyze an algorithm to identify a threshold θ̂ such that the collection of itemsets with frequency at least θ̂ in D contains only TFIs with probability at least 1-δ, for some user-specified δ. Our method uses results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify almost all the TFIs without including any false positive. We also experimentally compare our method with the direct mining of D at frequency θ and with techniques based on widely-used standard bounds (i.e., the Chernoff bounds) on the binomial distribution, and show that our algorithm outperforms these methods and achieves even better results than what is guaranteed by the theoretical analysis.


1 Introduction

The extraction of association rules is one of the fundamental primitives in data mining and knowledge discovery from large databases [1]. In its most general definition, the problem can be reduced to identifying frequent sets of items, or frequent itemsets, appearing in at least a fraction θ of all transactions in a dataset, where θ is provided in input by the user. Frequent itemsets and association rules are not only of interest for classic data mining applications (e.g., market basket analysis), but are also useful for other data analysis and mining tasks, including clustering, classification, and indexing [16, 15].

In most applications, one is not interested in mining a dataset to extract the frequent itemsets per se, but the mining process is used to infer properties of the underlying process that generated the dataset. For example, in market basket analysis the dataset is analyzed with the intent to understand the purchase behavior of customers, assuming that the purchase behavior of customers that generated the current dataset is the same that will be followed in the future. As an example, if one analyzes the transactional dataset of purchases to identify itemsets frequently sold on a specific day of the week (e.g., Monday), she will use these patterns to infer the itemsets that will be frequently sold on the same day of the following weeks. Analogously, in online recommendation systems the itemsets describing the frequent purchases of current customers are used to infer the frequent purchases of the future customers.

A natural and general model for these settings is to assume that the transactions in the dataset are independent identically distributed (i.i.d.) samples from an (unknown) probability distribution π on all the possible transactions built on the items of a ground set I, and that each itemset has a fixed probability of appearing in a random transaction drawn from π. The goal of the mining process is then to identify the itemsets that have probability at least θ of appearing in a random transaction drawn from π, given that for an itemset such probability corresponds to the fraction of transactions, among an infinite number of transactions generated by π, that contain the itemset. Since the dataset D represents only a finite sample from π, the frequent itemsets of D only provide an approximation of such itemsets, and due to the random nature of the generative process a number of spurious, or false, discoveries, i.e., itemsets that appear among the frequent itemsets of D but are not generated with probability at least θ, may be reported. Given that π is not known, on the one hand one can not aim at identifying exactly all and only the itemsets having probability at least θ. On the other hand, using the frequent itemsets of D as a proxy for such itemsets does not provide any guarantee on the number of false discoveries that are reported. The problem of identifying the itemsets that appear with probability at least θ, with guarantees on the quality of the returned set, has received scant attention in the literature.

1.1 Our contributions

In this paper we address the problem of identifying the True Frequent Itemsets (TFIs), that is, the itemsets that appear with probability at least θ, while providing rigorous probabilistic guarantees on the number of false discoveries, and without making any assumption on the generative model of the transactions, i.e., on π. This makes the methods we introduce completely distribution-free. We develop our methods within the statistical hypothesis testing framework: each itemset has an associated null hypothesis stating that the itemset has probability less than θ. This hypothesis is tested using information obtained from the dataset, and accepted or rejected accordingly. If the hypothesis is rejected, the itemset is included in the output collection. We focus on returning a set of TFIs with bounded Family-Wise Error Rate (FWER), that is, we bound the probability that one or more false discoveries (itemsets that have probability less than θ) are reported among the TFIs. In particular, our methods have an FWER guaranteed to be within the user-specified limits.

We use results from statistical learning theory to develop and analyze our methods. We define a range set associated to a collection of itemsets and give an upper bound to its (empirical) VC-dimension, showing an interesting connection with a variant of the knapsack optimization problem. This generalizes results from Riondato and Upfal [23]. To the best of our knowledge, ours is the first work to apply these techniques to the identification of TFIs, and in general the first application of the sample complexity bound based on the empirical VC-dimension to the field of data mining. We implemented our methods and evaluated their performance in terms of actual control of the FWER and of statistical power. We found that the methods control the FWER even better than what the theory guarantees, and that they offer very high power: only a small fraction of the TFIs is not included in the output collection. We also compared our methods with currently available techniques and found them competitive or superior.

We stress that we do not impose any restriction on the generative model of the transactions that are observed in the transactional dataset. In fact, we only assume that the transactions are independent samples from the distribution π, without any constraint on the properties of π. This is in contrast with the assumptions made by methods that perform statistical testing after the frequent patterns have been identified: these methods require a well-specified, limited generative model to characterize the significance of a pattern.

Outline.

The article is organized as follows. In Sect. 1.2 we review relevant previous contributions. Sections  2 and 3 contain preliminaries to formally define the problem and key concepts that we will use throughout the work. Our methods are described and analyzed in Sect. 4. We present the methodology and results of our experimental evaluation in Sect. 5. Conclusions and future work are presented in Sect. 6.

1.2 Previous work

While the problem of identifying the TFIs has received scant attention in the literature, a number of approaches have been proposed to filter out of the FIs the spurious patterns, i.e., the patterns that are not actually interesting according to some interestingness measure. We refer the reader to [15, Sect. 3] and [8] for surveys on different measures. We remark that, as noted by Liu et al. [20], the use of the minimum support threshold θ, reflecting the level of domain significance, is complementary to the use of interestingness measures, and that “statistical significance measures and domain significance measures should be used together to filter uninteresting rules from different perspectives.” A number of works explored the idea of using statistical properties of the patterns in order to assess their interestingness. While this is not the focus of our work, some of the techniques and models proposed are relevant to our framework.

Most of these works focus on association rules, but some results can be applied to itemsets. In these works, the notion of interestingness is related to the deviation between the actual support of a pattern in the dataset and its expected support in a random dataset generated according to a statistical model that can incorporate prior beliefs and that can be updated during the mining process to ensure that the most “surprising” patterns are extracted. In many previous works, the statistical model is a simple independence model: an item belongs to a transaction independently from the other items [25, 22, 7, 3, 10, 14, 19]. In contrast, our work does not assume any statistical model for data generation, or rather, does not impose any restriction on the model, with the result that our method is as general as possible, being distribution-free.

Kirsch et al. [19] developed a multi-hypothesis testing procedure to identify the best support threshold such that the number of itemsets with at least such support deviates significantly from its expectation in a random dataset of the same size and with the same frequency distribution for the individual items. In our work, the minimum threshold θ is an input parameter fixed by the user, and we return a collection of itemsets such that they all have a support at least as high as the threshold with respect to the distribution that generates the sample data.

Bolton et al. [3] suggest that, in pattern extraction settings, it may be more relevant to bound the False Discovery Rate rather than the Family-Wise Error Rate, due to the high number of statistical tests involved. In our experimental evaluation we noticed that the high number of tests is indeed a problem when using traditional multiple-hypothesis correction techniques. The methods we present do not suffer from this issue because they consider all the itemsets together, without the need to test each of them individually.

Gionis et al. [10] present a method to create random datasets that can act as samples from a distribution satisfying an assumed generative model. The main idea is to swap items in a given dataset while keeping the length of the transactions and the sum over the columns constant. This method is only applicable if one can actually derive a procedure to perform the swapping in such a way that the generated datasets are indeed random samples from the assumed distribution. For the problem we are interested in, such a procedure is not available. Considering the same generative model, Hanhijärvi [17] presents a direct adjustment method to bound the FWER while taking into consideration the actual number of hypotheses to be tested.

Webb [28] proposes the use of established statistical techniques to control the probability of false discoveries. In one of these methods (called holdout), the available data are split into two parts: one is used for pattern discovery, while the second is used to verify the significance of the discovered patterns, testing one statistical hypothesis at a time. A newer method (layered critical values) to choose the critical values when using a direct adjustment technique to control the FWER is presented by Webb [29] and works by exploiting the itemset lattice. The first method we present is inspired by the holdout technique, while the second can also be used when the dataset can not be split.

Liu et al. [20] conduct an experimental evaluation of direct corrections, holdout data, and random permutations methods to control the false positives. They test the methods on a very specific problem (association rules for binary classification).

In contrast with the methods presented in the works above, ours do not employ an explicit direct correction for multiple hypothesis testing, and they use all the available data to obtain more accurate results, without the need to resample it to create random datasets.

Teytaud and Lallich [26] suggest the use of VC-dimension to bound the risk of accepting spurious rules extracted from the database. Although they refer to them as “association rules”, the rules they focus on involve ranges over domains and conjunctions of Boolean formulas to express subsets of interest. This is different from the transactional market basket analysis setting of our work.

2 Preliminaries

In this section we introduce the definitions, lemmas, and tools that we will use throughout the work, providing the details that are needed in later sections.

2.1 Itemsets mining

Given a ground set I of items, let π be a probability distribution on the set of all possible transactions built on I. A transaction τ is a single sample drawn from π. The length |τ| of a transaction τ is the number of items in τ. A dataset D is a bag of transactions, i.e., of independent identically distributed (i.i.d.) samples from π. We call a non-empty subset of I an itemset. For any itemset A, let T(A) = {τ : A ⊆ τ} be the support set of A. We define the true frequency t_π(A) of A with respect to π as the probability that a transaction sampled from π contains A:

t_π(A) = Pr_{τ∼π}[A ⊆ τ] = π(T(A)).

Analogously, given a dataset D, let T_D(A) denote the set of transactions in D containing A. The frequency of A in D is the fraction of transactions in D that contain A: f_D(A) = |T_D(A)| / |D|. It is easy to see that f_D(A) is the empirical average of the event “A ⊆ τ” on D and an unbiased estimator for t_π(A): E[f_D(A)] = t_π(A).
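
As a concrete illustration of these definitions, here is a minimal Python sketch (not the authors' implementation; the transactions and itemsets are made-up toy data) that computes the empirical frequency f_D(A) of a few itemsets in a small dataset.

```python
# Minimal sketch: empirical frequency f_D(A) of an itemset A in a dataset D,
# where D is represented as a list of transactions (sets of items).

def frequency(itemset, dataset):
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    count = sum(1 for transaction in dataset if itemset <= transaction)
    return count / len(dataset)

if __name__ == "__main__":
    D = [frozenset(t) for t in (
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"beer", "bread"},
        {"milk"},
    )]
    for A in ({"bread"}, {"bread", "milk"}, {"beer", "milk"}):
        print(sorted(A), frequency(A, D))
```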

Traditionally, the interest has been on extracting the set of Frequent Itemsets (FIs) of D with respect to a minimum frequency threshold θ ∈ (0,1] [1], that is, the set

FI(D, I, θ) = { itemsets A : f_D(A) ≥ θ }.

In most applications the final goal of data mining is to gain a better understanding of the process generating the data, i.e., of the true frequency t_π, which is unknown and only approximately reflected in the dataset D. Therefore, we are interested in finding the itemsets with true frequency at least θ for some user-specified θ ∈ (0,1]. We call these itemsets the True Frequent Itemsets (TFIs) and denote the set they form as

TFI(π, I, θ) = { itemsets A : t_π(A) ≥ θ }.

If one is only given a finite number of random samples from π (the dataset D), as is usually the case, one can not aim at finding the exact set TFI(π, I, θ): no assumption can be made on the set-inclusion relationship between FI(D, I, θ) and TFI(π, I, θ), because an itemset of one set may not appear in the other, and vice versa. One can instead try to approximate the set of TFIs. Specifically, given a user-specified parameter δ ∈ (0,1), we aim at approximating TFI(π, I, θ) with a collection C of itemsets that, with probability at least 1 − δ, does not contain any spurious itemset:

Pr[ ∃ A ∈ C such that t_π(A) < θ ] < δ.

At the same time, we want to maximize the size of C, i.e., to find as many TFIs as possible. In this work we present methods to identify a large number of the TFIs. These methods do not assume any limitation on π but only use information from D, and guarantee a small probability of false positives while achieving a high success rate.

2.2 Vapnik-Chervonenkis dimension

The Vapnik-Chervonenkis (VC) dimension of a class of subsets defined on a set of points is a measure of the complexity or expressiveness of such a class [27]. A finite bound on the VC-dimension of a structure implies a bound on the number of random samples required to approximate the expectation of each indicator function associated to a set with its empirical average. We outline here some basic definitions and results and refer the reader to the works of Alon and Spencer [2, Sect. 14.4] and Boucheron et al. [4, Sect. 3] for an introduction to VC-dimension and a survey of recent developments.

Let X be a domain and R be a collection of subsets of X. We call R a range set on X. Given A ⊆ X, the projection of R on A is the set P_R(A) = { r ∩ A : r ∈ R }. We say that the set A is shattered by R if P_R(A) = 2^A.

Definition 1.

Given a set S ⊆ X, the empirical Vapnik-Chervonenkis (VC) dimension of R on S, denoted as EVC(R, S), is the cardinality of the largest subset of S that is shattered by R. The VC-dimension of R is defined as VC(R) = EVC(R, X).

The main application of the (empirical) VC-dimension in statistics and learning theory is in computing the number of samples needed to approximate the probabilities associated to the ranges through their empirical averages. Formally, let S = {x_1, …, x_m} be a collection of independent identically distributed random variables taking values in X, sampled according to some distribution π on the elements of X. For a set r ⊆ X, let π(r) be the probability that a sample from π belongs to the set r, and let

f_S(r) = (1/m) Σ_{i=1}^{m} 1_r(x_i),

where 1_r is the indicator function for the set r. The function f_S(r) is the empirical average of r on S.

Definition 2.

Let R be a range set on X and π be a probability distribution on X. For ε ∈ (0,1), an ε-approximation to (X, R) is a bag S of elements of X such that

|π(r) − f_S(r)| ≤ ε    for every r ∈ R.

An ε-approximation can be constructed by sampling points of the domain according to the distribution π, provided an upper bound to the VC-dimension of R or to its empirical VC-dimension is known:

Theorem 1 (Thm. 2.12 [18]).

Let R be a range set on X with VC(R) ≤ d, and let π be a distribution on X. Given δ ∈ (0,1) and a positive integer m, let

ε = √( (c/m) ( d + ln(1/δ) ) ),    (1)

where c is a universal positive constant. Then, a bag of m elements of X sampled independently according to π is an ε-approximation to (X, R) with probability at least 1 − δ.

Löffler and Phillips [21] showed experimentally that the constant c is approximately 0.5. We used this value in our experiments.
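
For concreteness, the sketch below computes ε from the sample size, a bound d on the VC-dimension, and δ, assuming the reconstructed form of Eq. (1) above and the empirical constant c ≈ 0.5; it is an illustration, not the authors' code.

```python
import math

def eps_from_vc(m, d, delta, c=0.5):
    """Accuracy eps such that m i.i.d. samples form an eps-approximation with
    probability at least 1 - delta, given an upper bound d to the VC-dimension
    (reconstructed form of Eq. (1), with c ~= 0.5 as per Loeffler and Phillips)."""
    return math.sqrt((c / m) * (d + math.log(1.0 / delta)))

# Example: 20 million transactions, VC-dimension bound 10, delta = 0.1.
print(eps_from_vc(20_000_000, 10, 0.1))  # approximately 0.00055
```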

Theorem 2 (Sect. 3 [4]).

Let R be a range set on X, and let π be a distribution on X. Let S be a collection of m elements from X sampled independently according to π. Let d be an integer such that d ≥ EVC(R, S). Given δ ∈ (0,1), let ε be defined as a function of m, d, and δ as in

(2)

Then, S is an ε-approximation for (X, R) with probability at least 1 − δ.

2.3 Statistical hypothesis testing

We develop our methods to identify the True Frequent Itemsets within the framework of statistical hypothesis testing. In statistical hypothesis testing, one uses some data to evaluate a null hypothesis H_0, whose rejection corresponds to claiming the identification of a significant phenomenon. A test statistic associated to the hypothesis is computed from the data. If the test statistic belongs to a predefined acceptance region (a subset of its domain), then H_0 is accepted, otherwise it is rejected. The acceptance region is defined a priori in such a way that the probability of a false positive, i.e., the probability of rejecting a true null hypothesis (one corresponding to a non-significant phenomenon), is at most some critical value α. Rejecting a true null hypothesis is also called a “Type-1 error”. Often, the acceptance region is defined only implicitly as a function of α: instead of verifying whether the statistic belongs to it, one evaluates the associated p-value, viz. the probability that the statistic is at least as extreme as the value observed on the given data, conditioning on the null hypothesis being true, and rejects H_0 if the p-value is not larger than α. Another important characteristic of a statistical test is its power, that is, the probability that the test correctly rejects the null hypothesis when the null hypothesis is false (also defined as 1 minus the probability of a “Type-2 error”, which consists in not rejecting a false null hypothesis).

In our scenario, the naïve way to employ statistical hypothesis testing to find the TFIs is to define a null hypothesis H_A for every itemset A, with H_A = “t_π(A) < θ”, and to compute the p-value for such null hypothesis using the frequency f_D(A) as the test statistic. In particular, the p-value is given by the probability that the frequency of A in a random dataset (with the same number of transactions as D) sampled from π is at least f_D(A), conditioning on the event “t_π(A) < θ”. This is easy to compute, given that the number of transactions in a random dataset that contain A has a Binomial distribution whose parameters are the size of the dataset and the true frequency of A (Binomial test). If the p-value is “small enough”, the null hypothesis is rejected, that is, A is flagged as a TFI. The issue, in our case, is that we are considering a very large number of itemsets, that is, we are facing a multiple hypothesis testing problem [20]. In this case, one is interested in controlling the probability of false discoveries among all the hypotheses tested, i.e., in controlling the Family-Wise Error Rate.

Definition 3.

The Family-Wise Error Rate (FWER) of a statistical test is the probability of reporting at least one false discovery.

In order to achieve the desired FWER, one must define a sequence of critical values for the individual hypotheses, that is, implicitly define a sequence of acceptance regions for the test statistics. The Bonferroni correction [6] is a widely-employed method to define such a sequence. In its simplest form, the Bonferroni correction suggests comparing the p-value of each hypothesis to the critical value δ/n, where δ is the desired FWER and n is the number of hypotheses to be tested. The Bonferroni correction is not a good choice to identify TFIs, since the statistical power of methods that use it decreases with the number of hypotheses tested. In practical cases there would be hundreds or thousands of TFIs, making the use of the Bonferroni correction impractical if one wants to achieve high statistical power.
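
To make this baseline concrete, here is a minimal sketch (not the authors' code; scipy is assumed to be available and the candidate frequencies are toy values) of the Binomial test with the Bonferroni-corrected critical value δ/n described above.

```python
from scipy.stats import binom

def bonferroni_binomial_tfis(freqs, num_transactions, theta, delta):
    """Naive baseline: flag as TFI every candidate itemset whose Binomial-test
    p-value under H0: "true frequency < theta" is at most the Bonferroni-corrected
    critical value delta / (number of hypotheses)."""
    critical = delta / len(freqs)
    output = []
    for itemset, f in freqs.items():
        successes = round(f * num_transactions)
        # P[X >= successes] when X ~ Binomial(num_transactions, theta),
        # i.e., the p-value at the boundary of the null hypothesis.
        p_value = binom.sf(successes - 1, num_transactions, theta)
        if p_value <= critical:
            output.append(itemset)
    return output

# Toy example with made-up frequencies.
freqs = {("bread",): 0.62, ("bread", "milk"): 0.51, ("beer",): 0.505}
print(bonferroni_binomial_tfis(freqs, num_transactions=10000, theta=0.5, delta=0.05))
```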

3 The range set of a collection of itemsets

In this section we define the concept of a range set associated to a collection of itemsets and show how to bound the VC-dimension and the empirical VC-dimension of such range sets. We use these tools in the methods presented in later sections.

Definition 4.

Given a collection C of itemsets built on a ground set I, the range set R(C) associated to C is a range set on the space of transactions built on I, containing the support sets of the itemsets in C: R(C) = { T(A) : A ∈ C }.

Lemma 1.

Let C be a collection of itemsets and let D be a dataset. Let d be the maximum integer for which there are at least d transactions τ_1, …, τ_d ∈ D such that the set {τ_1, …, τ_d} is an antichain, and each τ_i, 1 ≤ i ≤ d, contains at least 2^{d−1} itemsets from C. Then EVC(R(C), D) ≤ d.

Proof.

The first requirement guarantees that the set of transactions considered in the computation of d could indeed theoretically be shattered. Assume that a subset S of D contains two transactions τ′ and τ″ such that τ′ ⊆ τ″. Any itemset from C appearing in τ′ would also appear in τ″, so there would not be any itemset A ∈ C such that τ″ ∈ T(A) ∩ S but τ′ ∉ T(A) ∩ S, which would imply that S can not be shattered. Hence sets that are not antichains should not be considered. This has the net effect of potentially resulting in a lower d, i.e., in a stricter upper bound to EVC(R(C), D).

Let now ℓ > d and consider a set S of ℓ transactions from D that is an antichain. Assume that S is shattered by R(C). Let τ be a transaction in S. The transaction τ belongs to 2^{ℓ−1} subsets of S. Let K ⊆ S be one of these subsets containing τ. Since S is shattered, there exists an itemset A ∈ C such that T(A) ∩ S = K. From this and the fact that τ ∈ K, we have that τ ∈ T(A), or equivalently that A ⊆ τ. Given that all the 2^{ℓ−1} subsets K containing τ are different, all the corresponding intersections T(A) ∩ S must also be different, which in turn implies that all the corresponding itemsets A must be different and that they must all appear in τ. There are 2^{ℓ−1} subsets of S containing τ, therefore τ must contain at least 2^{ℓ−1} itemsets from C, and this holds for all the ℓ transactions in S. This is a contradiction, because ℓ > d and d is the maximum integer for which there are at least d transactions forming an antichain and each containing at least 2^{d−1} itemsets from C. Hence S cannot be shattered and the thesis follows. ∎

The exact computation of d as defined in Lemma 1 could be extremely expensive, since it requires scanning the transactions one by one, computing the number of itemsets from C appearing in each transaction, and making sure to only consider antichains. Given the very large number of transactions in typical datasets and the fact that the number of itemsets in a transaction is exponential in its length, this method would be computationally too expensive. An upper bound to d (hence to EVC(R(C), D)) can be computed more efficiently by solving a Set-Union Knapsack Problem (SUKP) [12] associated to C.

Definition 5 ([12]).

Let U = {a_1, …, a_n} be a set of elements and let S = {S_1, …, S_k} be a set of subsets of U, i.e., S_i ⊆ U for 1 ≤ i ≤ k. Each subset S_i, 1 ≤ i ≤ k, has an associated non-negative profit ρ(S_i), and each element a_j, 1 ≤ j ≤ n, has an associated non-negative weight w(a_j). Given a subset S′ ⊆ S, we define the profit of S′ as P(S′) = Σ_{S_i ∈ S′} ρ(S_i). Let U_{S′} = ∪_{S_i ∈ S′} S_i. We define the weight of S′ as W(S′) = Σ_{a_j ∈ U_{S′}} w(a_j). Given a non-negative parameter B that we call capacity, the Set-Union Knapsack Problem (SUKP) requires to find the subset S* ⊆ S that maximizes P(S′) over all sets S′ for which W(S′) ≤ B.

In our case, U is the set of items that appear in the itemsets of C, S = C, the profits and the weights are all unitary, and the capacity constraint is an integer ℓ. We call this optimization problem the SUKP associated to C with capacity ℓ. It should be evident that the optimal profit of this SUKP is the maximum number of itemsets from C that a transaction of length ℓ can contain. In order to show how to use this fact to compute an upper bound to EVC(R(C), D), we need to define some additional terminology. Let ℓ_1, …, ℓ_w be the sequence of the distinct transaction lengths of D, i.e., for each value ℓ for which there is at least a transaction in D of length ℓ, there is one (and only one) index i, 1 ≤ i ≤ w, such that ℓ_i = ℓ. Assume that the ℓ_i's are labelled in sorted decreasing order: ℓ_1 > ℓ_2 > … > ℓ_w. Let now L_i, 1 ≤ i ≤ w, be the number of transactions in D that have length at least ℓ_i and such that for no two τ′, τ″ of them we have either τ′ ⊆ τ″ or τ″ ⊆ τ′. The sequence of the ℓ_i's and a sequence of upper bounds to the L_i's can be computed efficiently with a scan of the dataset. Let now q_i be the optimal profit of the SUKP associated to C with capacity ℓ_i, and let b_i = ⌊log_2 q_i⌋ + 1. The following lemma uses these sequences to show how to obtain an upper bound to the empirical VC-dimension of R(C) on D.

Lemma 2.

Let i* be the minimum integer for which b_i ≤ L_i. Then EVC(R(C), D) ≤ b_{i*}.

Proof.

If b_{i*} ≤ L_{i*}, then there are at least b_{i*} transactions which can contain 2^{b_{i*}−1} itemsets from C, and b_{i*} is the largest among the b_i for which this happens, because the sequence of the b_i's is sorted in decreasing order, given that the sequence of the q_i's is. Then b_{i*} satisfies the conditions of Lemma 1. Hence EVC(R(C), D) ≤ b_{i*}. ∎

Corollary 1.

Let q be the optimal profit of the SUKP associated to C with capacity equal to |I_C|, the number of items such that there is at least one itemset in C containing them. Let b = ⌊log_2 q⌋ + 1. Then VC(R(C)) ≤ b.

The SUKP is NP-hard in the general case, but there are known restrictions for which it can be solved in polynomial time using dynamic programming [12]. In our case, available optimization problem solvers can compute the optimal solution reasonably fast even for very large instances with thousands of items and hundreds of thousands of itemsets in C. Moreover, it is actually not necessary to compute the optimal solution to the SUKP: any upper bound to the optimal profit for which we can prove that there is no power of two between that bound and the optimal profit would result in the same upper bound to the (empirical) VC-dimension, while substantially speeding up the computation.
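
For intuition, the sketch below solves the unit-profit, unit-weight SUKP associated to a tiny collection C by exhaustive search and derives the bound of Corollary 1 as reconstructed above; a real implementation would use an optimization solver (the authors used CPLEX), and the helper names are illustrative.

```python
import math
from itertools import combinations

def sukp_optimal_profit(itemsets, capacity):
    """Unit-profit, unit-weight SUKP associated to a collection of itemsets:
    the maximum number of itemsets whose union of items has size <= capacity.
    Exhaustive search, viable only for tiny instances."""
    best = 0
    for r in range(len(itemsets), 0, -1):
        if r <= best:
            break
        for combo in combinations(itemsets, r):
            if len(set().union(*combo)) <= capacity:
                best = r
                break
    return best

def vc_bound(itemsets):
    """Upper bound to VC(R(C)) as in Corollary 1 (reconstructed form):
    b = floor(log2 q) + 1, where q is the SUKP profit with capacity |I_C|."""
    items = set().union(*itemsets)
    q = sukp_optimal_profit(itemsets, len(items))  # q >= 1 since capacity = |I_C|
    return math.floor(math.log2(q)) + 1

C = [frozenset(s) for s in ({"a"}, {"a", "b"}, {"b", "c"}, {"a", "c"})]
print(vc_bound(C))  # q = 4 (all itemsets fit), bound = floor(log2 4) + 1 = 3
```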

The range set R(2^I) associated to the collection of all the itemsets built on I is particularly interesting for us. Better bounds to its VC-dimension and to its empirical VC-dimension on D are available.

Theorem 3 ([23]).

Let D be a dataset built on a ground set I. The d-index d(D) of D is the maximum integer d such that D contains at least d transactions of length at least d that form an antichain. Then EVC(R(2^I), D) ≤ d(D).

Corollary 2.

VC(R(2^I)) ≤ |I| − 1.

Riondato and Upfal [23] presented an efficient algorithm to compute an upper bound to the d-index of a dataset with a single linear scan of the dataset. The upper bound presented in Thm. 3 is tight: there are datasets for which it holds with equality [23]. This implies that the upper bound presented in Corol. 2 is also tight.
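
A simple way to upper bound the d-index is to ignore the antichain requirement, which can only overestimate it; the sketch below (an illustration, not the linear-scan algorithm of Riondato and Upfal) does exactly that.

```python
def d_index_upper_bound(dataset):
    """Largest d such that the dataset has at least d transactions of length
    at least d. Dropping the antichain requirement can only increase the value,
    so this is a valid upper bound to the d-index."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i
        else:
            break
    return d

D = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}, {"a"}]
print(d_index_upper_bound(D))  # lengths 3, 2, 2, 1 -> bound = 2
```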

4 Finding the true frequent itemsets

In this section we present two methods to identify True Frequent Itemsets with respect to a minimum true frequency threshold θ, while guaranteeing that the Family-Wise Error Rate (FWER) is less than δ, for some user-specified parameter δ ∈ (0,1). In other words, we present two algorithms to find a collection of itemsets that, with probability at least 1 − δ, contains only TFIs. Our methods achieve this goal using the same tools in two different ways and are applicable to different and complementary situations. Specifically, the first method can be used when it is possible to randomly split the available dataset in two parts, and is inspired by the holdout technique [28]. While splitting the dataset may be advantageous in certain cases, this may not always be possible, and in general a better characterization of the TFIs may be obtained using the whole dataset. We develop a second method for this situation.

The intuition behind the two methods is similar. Each method starts by building a set C of “candidate TFIs”. To each itemset A ∈ C we associate a null hypothesis H_A = “t_π(A) < θ”, and the tests use the frequency of A in the dataset (or in a portion of the dataset) as the test statistic. If the frequency falls into the acceptance region [0, θ + ε), where ε is a quantity computed by our methods, then H_A is accepted, otherwise H_A is rejected and A is flagged as True Frequent and included in the output collection. Any itemset not in C is not considered and will not be reported in output. It should be clear that the definition of the acceptance region is critical for the method to have the desired FWER at most δ: one needs to compute an ε such that

Pr[ ∃ A ∈ C with t_π(A) < θ whose test statistic is at least θ + ε ] < δ.

The two methods we present differ in the definition of C and in the computation of ε, but both use the tools we developed in Sect. 3.

4.1 Method 1: a split-dataset approach

We now present and analyze our first method for identifying TFIs, which draws inspiration from the holdout technique presented by Webb [28]. This method requires that the dataset can be randomly split into two parts, not necessarily of the same size: an exploratory part D_1 and an evaluation part D_2. The method works in two phases: we first use the exploratory part to identify a small set of candidate TFIs, which will be used to compute the acceptance region, and then we test these same itemsets using their frequencies in the evaluation part as test statistics.

Let δ_1 and δ_2 be such that (1 − δ_1)(1 − δ_2) ≥ 1 − δ. Let R(2^I) be the range space of all itemsets. We use Corol. 2 (resp. Thm. 3) to compute an upper bound to VC(R(2^I)) (resp. to EVC(R(2^I), D_1)). Then, given that D_1 is still a collection of i.i.d. samples from the generative distribution π, we can use this bound in Thm. 1 (resp. in Thm. 2) to compute an ε′_1 (resp. an ε″_1) such that D_1 is, with probability at least 1 − δ_1, an ε′_1-approximation (resp. an ε″_1-approximation) to (X, R(2^I)). Then, if we let ε_1 = min(ε′_1, ε″_1), D_1 is, with probability at least 1 − δ_1, an ε_1-approximation to (X, R(2^I)).

In the first phase of the method we compute, from the exploratory part D_1, two collections of itemsets: the collection B_1 of itemsets whose frequency in D_1 is at least θ + ε_1, and the collection C of candidate itemsets to be tested in the second phase. To obtain them, we extract a set of frequent itemsets from D_1 and partition it appropriately into B_1 and C. In the second phase, we compute a value ε_2 such that, with probability at least 1 − δ_2, the evaluation dataset D_2 is an ε_2-approximation to the range set R(C) associated to the candidates. In order to obtain ε_2 through Thms. 1 and 2, we need to compute upper bounds to VC(R(C)) and to EVC(R(C), D_2). We obtain such bounds by solving the SUKPs associated to C, as stated in Corol. 1 and Lemma 2 respectively. As in the first phase, we use these bounds and Thm. 1 (resp. Thm. 2) to compute an ε′_2 (resp. an ε″_2) such that D_2 is, with probability at least 1 − δ_2, an ε′_2-approximation (resp. an ε″_2-approximation) to R(C), and we let ε_2 = min(ε′_2, ε″_2). Once we have obtained ε_2, we compute the set of candidates in C whose frequency in D_2 is at least θ + ε_2.

The method returns this set of candidates together with the itemsets in B_1. The proof that the FWER is at most δ comes from the definition of ε_1 and ε_2 through Thms. 1 and 2.
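
A minimal sketch of the second-phase filtering step under the reconstruction above: a candidate itemset is reported only if its frequency in the evaluation part D_2 reaches θ + ε_2. The candidate set and ε_2 are assumed to come from the first phase and from the VC-dimension machinery, respectively.

```python
def second_phase_filter(candidates, evaluation_part, theta, eps2):
    """Report a candidate itemset as True Frequent only if its frequency in the
    evaluation part is at least theta + eps2 (sketch of Method 1, second phase)."""
    m = len(evaluation_part)
    output = []
    for itemset in map(frozenset, candidates):
        freq = sum(1 for transaction in evaluation_part if itemset <= transaction) / m
        if freq >= theta + eps2:
            output.append(itemset)
    return output
```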

Lemma 3.

The FWER of Method 1 is at most δ.

Proof.

Consider the two events E_1 = “D_1 is an ε_1-approximation for R(2^I)” and E_2 = “D_2 is an ε_2-approximation for R(C)”. From the above discussion it follows that the event E = E_1 ∩ E_2 occurs with probability at least (1 − δ_1)(1 − δ_2) ≥ 1 − δ. Suppose from now on that E indeed occurs.

Given that E_1 occurs, all the itemsets with frequency in D_1 at least θ + ε_1 must have a true frequency at least θ. This is equivalent to saying that all the itemsets in B_1 are True Frequent Itemsets (B_1 ⊆ TFI(π, I, θ)).

Given that E_2 occurs, we know that all the itemsets in C have a frequency in D_2 that is at most ε_2 away from their true frequency:

|f_{D_2}(A) − t_π(A)| ≤ ε_2    for every A ∈ C.

In particular, this means that an itemset A ∈ C can have f_{D_2}(A) ≥ θ + ε_2 only if t_π(A) ≥ θ, that is, only if A ∈ TFI(π, I, θ). Hence, all the candidates reported in the second phase are TFIs.

We can then conclude that if the event E occurs, the output collection contains only TFIs. Since E occurs with probability at least 1 − δ, this is equivalent to saying that the FWER of Method 1 is at most δ. ∎

4.2 Method 2: a full-dataset approach

We now present a method for identifying TFIs that can be used in all cases, even when it is not possible or desirable to split the available dataset in two parts. Our experimental evaluation also showed that this method can often identify a larger fraction of the TFIs than Method 1, at the expense of requiring more computational resources. The intuition behind the definition of the acceptance region is the following. Let B be the negative border of TFI(π, I, θ), that is, the set of itemsets not in TFI(π, I, θ) but such that all their proper subsets are in TFI(π, I, θ). If we can find an ε such that D is an ε-approximation to the range set R(B) with probability at least 1 − δ, then any itemset A ∈ B has a frequency in D less than θ + ε, given that it must be t_π(A) < θ. By the antimonotonicity property of the frequency, the same holds for all itemsets that are supersets of those in B. Hence, the only itemsets that can have frequency in D greater than or equal to θ + ε are those with true frequency at least θ. In the following paragraphs we show how to compute ε.
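
As a concrete illustration of the negative border, the key structure behind this method, here is a minimal brute-force sketch; restricting the enumeration to a given list of items is an implementation choice for the example, not part of the definition.

```python
from itertools import combinations

def negative_border(collection, items):
    """Itemsets over `items` that are NOT in `collection` but whose proper
    (non-empty) subsets all are. Brute-force enumeration, for illustration only."""
    coll = {frozenset(s) for s in collection}
    border = set()
    for size in range(1, len(items) + 1):
        for candidate in map(frozenset, combinations(items, size)):
            if candidate in coll:
                continue
            # Checking the maximal proper subsets suffices when `collection`
            # is downward closed, as a collection of frequent itemsets is.
            maximal_subsets = combinations(candidate, size - 1) if size > 1 else ()
            if all(frozenset(s) in coll for s in maximal_subsets):
                border.add(candidate)
    return border

C = [{"a"}, {"b"}, {"a", "b"}]
print(negative_border(C, ["a", "b", "c"]))  # {frozenset({'c'})}
```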

Let δ_1 and δ_2 be such that (1 − δ_1)(1 − δ_2) ≥ 1 − δ. Following the same steps as in the first phase of Method 1, we can find an ε_1 such that D is an ε_1-approximation for (X, R(2^I)) with probability at least 1 − δ_1. Using ε_1 and the frequencies in D, we then compute a collection G of itemsets which, as shown below, contains the negative border B of TFI(π, I, θ). We want to find an upper bound to the (empirical) VC-dimension of R(B). To this end, we use the fact that the negative border of a collection of itemsets is a maximal antichain on 2^I.

Lemma 4.

Let M be the set of maximal antichains in G. If D is an ε_1-approximation to (X, R(2^I)), then the negative border B of TFI(π, I, θ) is contained in G, and B belongs to M.

Proof.

Assume that D is an ε_1-approximation to (X, R(2^I)). From the definition of ε_1, this happens with probability at least 1 − δ_1. Then |f_D(A) − t_π(A)| ≤ ε_1 for every itemset A. From this and the definitions of the negative border and of G, we have that B ⊆ G. Since B is a maximal antichain, then B ∈ M, and the thesis follows immediately. ∎

In order to compute upper bounds to VC(R(B)) and to EVC(R(B), D), we can solve slightly modified SUKPs associated to G, with the additional constraint that the optimal solution, which is a collection of itemsets, must be a maximal antichain. Using these bounds in Thms. 1 and 2, we compute an ε_2 such that, with probability at least 1 − δ_2, D is an ε_2-approximation to (X, R(B)). The method returns the collection of itemsets with frequency in D at least θ + ε_2, and we prove that the probability that this collection contains any itemset that is not a TFI is at most δ.

Lemma 5.

The FWER of Method 2 is at most δ.

Proof.

Consider the two events E_1 = “D is an ε_1-approximation for R(2^I)” and E_2 = “D is an ε_2-approximation for R(B)”. From the above discussion and the definitions of ε_1 and ε_2, it follows that the event E = E_1 ∩ E_2 occurs with probability at least (1 − δ_1)(1 − δ_2) ≥ 1 − δ. Suppose from now on that E indeed occurs.

Since E_1 occurs, Lemma 4 holds, and the bounds we compute by solving the modified SUKP problems are indeed bounds to VC(R(B)) and EVC(R(B), D). Then the computation of ε_2 is valid. Since E_2 also occurs, for any A ∈ B we have |f_D(A) − t_π(A)| ≤ ε_2; but given that t_π(A) < θ, because the elements of B are not TFIs, we have f_D(A) < θ + ε_2. Because of the antimonotonicity property of the frequency and the definition of the negative border, the same holds for any itemset that is not True Frequent. Hence, the only itemsets that can have a frequency in D at least θ + ε_2 are the TFIs, and the output collection contains only TFIs. ∎

5 Experimental evaluation

We conducted an extensive evaluation to assess the practical applicability and the statistical power of the methods we propose. In the following sections we describe the methodology and present the results.

Implementation.

We implemented our methods in Python 3.3. To mine the FIs, we used the C implementation by Grahne and Zhu [13]. Our solver of choice for the SUKPs was IBM® ILOG® CPLEX® Optimization Studio 12.3. We ran the experiments on a number of machines with x86-64 processors running GNU/Linux 2.6.32.

5.1 Datasets generation

We evaluated our methods using datasets from the FIMI'04 repository (http://fimi.ua.ac.be/data/): accidents [9], BMS-POS, chess, connect, kosarak, pumsb*, and retail [5]. These datasets differ in size, number of items, and, more importantly, in the distribution of the frequencies of the itemsets [11]. For currently available datasets the ground truth is not known, that is, we do not know the distribution π, and hence we do not know the true frequencies of the itemsets, which we need in order to evaluate the performance of our methods. Therefore, to establish a ground truth, we created new enlarged versions of the datasets from the FIMI repository by sampling transactions uniformly at random from them until the desired size (in our case 20 million transactions) was reached. We considered the true frequencies of the itemsets to be their frequencies in the original datasets. Given that our method to control the FWER is distribution-independent, this is a valid way to establish a ground truth. We used these enlarged datasets in our experiments.
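
The enlargement procedure is plain resampling with replacement; a minimal sketch (the toy dataset and target size are illustrative):

```python
import random

def enlarge_dataset(transactions, target_size, seed=None):
    """Build an enlarged dataset by sampling transactions uniformly at random,
    with replacement, from the original dataset until `target_size` transactions
    have been drawn. The frequencies in the original dataset then play the role
    of the true frequencies."""
    rng = random.Random(seed)
    return [rng.choice(transactions) for _ in range(target_size)]

# Illustrative usage with a toy original dataset (the paper uses 20M transactions).
original = [{"a", "b"}, {"b", "c"}, {"a"}, {"a", "b", "c"}]
enlarged = enlarge_dataset(original, target_size=10, seed=0)
print(len(enlarged), enlarged[:3])
```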

5.2 Control of the FWER

We evaluated the capability of our methods to control the FWER by first creating a number of enlarged datasets, and then running our methods to extract a collection of TFIs from these datasets. We used a wide range of values for the minimum true frequency threshold θ (see Table 1) and fixed the value of δ. We repeated each experiment on 20 different enlarged datasets generated from the same distribution, for each original distribution. In all the hundreds of runs of our algorithms, the returned collection of itemsets never contained any false positive, i.e., it always contained only TFIs. In other words, the precision of the output was 1.0 in all our experiments. This means that not only can our methods control the FWER effectively, but they are even more conservative than what is guaranteed by the theoretical analysis.

Statistical power (%)
(Holdout and Method 1 use the split dataset; Bonferroni and Method 2 use the full dataset)
Dataset   θ   No. of TFIs   Holdout   Method 1   Bonferroni   Method 2
accidents 0.8 149 94.631 97.987 83.893 95.973
0.7 529 97.543 98.488 92.439 98.488
0.6 2074 98.505 98.987 96.625 99.132
0.5 8057 98.349 99.007 94.551 99.081
0.45 16123 98.177 98.915 94.691 99.076
0.4 32528 98.032 98.761 94.774 98.970
0.35 68222 98.140 98.666
0.3 149545 98.033 98.529
0.25 346525 98.165 98.382
0.2 889883 97.995 98.057
BMS-POS 0.05 59 98.305 93.220 84.746 93.220
0.03 134 99.254 99.254 88.060 99.254
0.02 308 98.377 95.130 84.740 95.455
0.01 1099 95.814 85.805 81.620 88.171
0.0075 1896 96.097 82.331
chess 0.8 8227 97.265 98.918 96.475 99.198
0.75 20993 96.375 98.042 95.756 98.418
0.7 48731 97.613 99.001
0.6 254944 97.243 98.464
0.5 1272932 97.352 98.293
connect 0.9 27127 93.217 97.213 89.369 97.921
0.875 65959 94.481 96.040
0.85 142127 94.242 96.882
0.8 533975 95.047 97.464
kosarak 0.04 42 97.620 97.620 73.810 95.238
0.035 50 100 98.000 72.000 98.000
0.03 65 96.923 93.846
0.025 82 98.780 97.561
0.02 121 100 97.521
0.015 189 98.942 93.122
pumsb* 0.55 305 92.131 94.426 80.984 95.401
0.5 679 99.853 99.853 92.931 99.853
0.49 804 97.637 98.507 86.692 98.756
0.45 1913 96.968 97.909
0.4 27354 98.183 99.101
0.35 116787 96.435 96.972
0.3 432698 96.567 97.326
retail 0.03 32 100 100 62.500 100
0.025 38 100 97.368 84.211 97.368
0.02 55 100 94.545 72.340 95.745
0.015 84 100 95.238
0.01 159 99.371 93.082
Table 1: Statistical power (%) of the various methods to extract the True Frequent Itemsets while controlling the FWER, for different datasets and minimum true frequency thresholds θ. Missing entries for the full-dataset methods correspond to runs that exceeded the available computational resources (see Sect. 5.3).

5.3 Statistical power and comparison with other methods

The power of a statistical test is the probability that the test correctly rejects the null hypothesis when the null hypothesis is false. In the case of TFIs, this corresponds to the probability of including a TFI in the output collection. It is often difficult to analytically quantify the statistical power of a test, especially in multiple hypothesis testing settings with correlated hypotheses. This is indeed our case, and we therefore conducted an empirical evaluation of the statistical power of our methods by assessing what fraction of the total number of TFIs is reported in output. This corresponds to evaluating the recall of the output collection. We fixed δ, considered θ in a wide range (see Table 1), and repeated each experiment 20 times on different datasets, finding insignificant variance in the results. We also wanted to compare the power of our methods with that of established techniques that can control the probability of Type-1 errors. To this end, we adapted the holdout technique proposed by Webb [28] to the TFIs problem and compared its power with that of Method 1. In this technique, the dataset is randomly split into two portions, an exploratory part and an evaluation part. First, the frequent itemsets with respect to θ are extracted from the exploratory part. Then, each of the null hypotheses associated to these itemsets is tested using a Binomial test on the frequency of the itemset in the evaluation dataset, with critical value δ/n, where n is the number of patterns found in the first step. To evaluate the power of Method 2, which does not need to split the dataset, we compared its results with the power of a method that uses the Bonferroni correction over all possible itemsets to test, with a Binomial test, the FIs with respect to θ obtained from the entire dataset.
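
Measuring the empirical power therefore reduces to computing the recall of the output collection against the known ground truth; a trivial sketch:

```python
def empirical_power(reported, true_frequent):
    """Fraction of the True Frequent Itemsets that were reported in output,
    i.e., the recall of the output collection."""
    reported = {frozenset(a) for a in reported}
    true_frequent = {frozenset(a) for a in true_frequent}
    return len(reported & true_frequent) / len(true_frequent)
```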

The statistical power evaluation and the comparison with the holdout technique are presented in Table 1, where we report the average over 20 runs for each experiment. Only partial results are reported for the full-dataset approaches because of issues with the amount of computational resources required by these methods when there is a very high number of itemsets in the negative border. The statistical power of our methods is very high in general, for all the tested datasets and values of θ. The comparison of Method 1 with the holdout technique shows that the relative performances of these methods are dataset specific: the statistical power seems to depend heavily on the true frequency distribution of the itemsets. From the table it is possible to see that Method 1 performs better than the holdout technique when the number of TFIs is high, while the opposite is true when the minimum frequency threshold is very low. We conjecture that this is due to the fact that the value of θ is not taken into consideration when computing the ε that defines the acceptance region in Method 1, as the computation of ε in Thms. 1 and 2 does not depend on θ. In future work, we plan to investigate whether it is possible to include θ in the computation of ε. On the other hand, the direct correction applied in the holdout technique becomes less and less powerful as the number of TFIs increases, because it does not take into account the many correlations between the itemsets. Given that in common applications the number of frequent itemsets of interest is large, we believe that in real scenarios Method 1 can be more effective in extracting the True Frequent Itemsets while rigorously controlling the probability of false discoveries. For the full-dataset case, we can see that Method 2 performs much better than the Bonferroni correction method in all cases, because it does not have to take all possible hypotheses into account when computing the acceptance region. One can also compare Method 1 and Method 2 to each other, given that Method 2 can be used in all cases where Method 1 can. Method 2 manages to achieve higher statistical power than Method 1 in almost all cases, but we must stress that this comes at the expense of a longer running time and a higher amount of required memory.

5.4 Runtime evaluation

We found our methods to be quite efficient in practice in terms of time needed to compute the output collection. The running time is dominated by the time needed to compute the negative border (only for Method 2) and to solve the SUKPs. Various optimizations and shortcuts can be implemented to substantially speed up our methods (see also the discussion in Sect. 3).

6 Conclusions

In this work we developed two methods to approximate the collection of True Frequent Itemsets while guaranteeing that the probability of any false positive (FWER) is bounded by a user-specified threshold, and without imposing any restriction on the generative model. We used tools from statistical learning theory to develop and analyze our methods. The two methods cover complementary use scenarios. Our experimental evaluation shows that the methods we propose achieve very high statistical power and are competitive or superior to existing techniques to solve the same problem. There are a number of directions for further research, including improving the power of methods to identify TFIs by making the computation of the acceptance region dependent on the minimum true frequency threshold, and studying lower bounds to the VC-dimension of the range set of a collection of itemsets. Moreover, while this work focuses on itemsets mining, further research is needed for the extraction of more structured (e.g., sequences, graphs) True Frequent patterns.

References

  • Agrawal et al. [1993] Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. SIGMOD Rec. 22, 207–216 (1993)
  • Alon and Spencer [2008] Alon, N., Spencer, J.H.: The Probabilistic Method. Wiley, third edn. (2008)
  • Bolton et al. [2002] Bolton, R.J., Hand, D.J., Adams, N.M.: Determining hit rate in pattern search. In: Pattern Detection and Discovery. LNCS, vol. 2447, pp. 36–48 (2002)
  • Boucheron et al. [2005] Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: A survey of some recent advances. ESAIM: Probab. and Stat. 9, 323–375 (2005)
  • Brijs et al. [1999] Brijs, T., Swinnen, G., Vanhoof, K., Wets, G.: Using association rules for product assortment decisions: A case study. KDD’99 (1999)
  • Dudoit et al. [2003] Dudoit, S., Shaffer, J., Boldrick, J.: Multiple hypothesis testing in microarray experiments. Statis. Science 18(1), 71–103 (2003)
  • DuMouchel and Pregibon [2001] DuMouchel, W., Pregibon, D.: Empirical Bayes screening for multi-item associations. KDD’01 (2001)
  • Geng and Hamilton [2006] Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey. ACM Comput. Surveys 38(3) (2006)
  • Geurts et al. [2003] Geurts, K., Wets, G., Brijs, T., Vanhoof, K.: Profiling of high-frequency accident locations by use of association rules. Transp. Res. Rec.: J. of the Transp. Res. Board 1840, 123–130 (2003)
  • Gionis et al. [2007] Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. on Knowl. Disc. from Data 1(3) (2007)
  • Goethals and Zaki [2004] Goethals, B., Zaki, M.J.: Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor. Newsl. 6(1), 109–117 (2004)
  • Goldschmidt et al. [1994] Goldschmidt, O., Nehme, D., Yu, G.: Note: On the set-union knapsack problem. Naval Research Logistics 41(6), 833–842 (1994)
  • Grahne and Zhu [2003] Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. FIMI’03 (2003)
  • Hämäläinen [2010] Hämäläinen, W.: StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl. and Inf. Sys. 23(3), 373–399 (2010)
  • Han et al. [2007] Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. and Knowl. Disc. 15, 55–86 (2007)
  • Han et al. [2006] Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann (2006)
  • Hanhijärvi [2011] Hanhijärvi, S.: Multiple hypothesis testing in pattern discovery. DS’11 (2011)
  • Har-Peled and Sharir [2011] Har-Peled, S., Sharir, M.: Relative (p, ε)-approximations in geometry. Discr. & Comput. Geom. 45(3), 462–496 (2011)
  • Kirsch et al. [2012] Kirsch, A., Mitzenmacher, M., Pietracaprina, A., Pucci, G., Upfal, E., Vandin, F.: An efficient rigorous approach for identifying statistically significant frequent itemsets. J. of the ACM 59(3), 12:1–12:22 (2012)
  • Liu et al. [2011] Liu, G., Zhang, H., Wong, L.: Controlling false positives in association rule mining. Proc. of the VLDB Endow. 5(2), 145–156 (2011)
  • Löffler and Phillips [2009] Löffler, M., Phillips, J.M.: Shape fitting on point sets with probability distributions. ESA'09 (2009)
  • Megiddo and Srikant [1998] Megiddo, N., Srikant, R.: Discovering predictive association rules. KDD'98 (1998)
  • Riondato and Upfal [2012] Riondato, M., Upfal, E.: Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ECML PKDD’12 (2012)
  • Riondato and Vandin [2013] Riondato, M., Vandin, F.: Finding the True Frequent Itemsets. CoRR abs/1301.1218 (2013). http://arxiv.org/abs/1301.1218
  • Silverstein et al. [1998] Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: Generalizing association rules to dependence rules. Data Min. and Knowl. Disc. 2(1), 39–68 (1998)
  • Teytaud and Lallich [2001] Teytaud, O., Lallich, S.: Contribution of statistical learning to validation of association rules (2001)
  • Vapnik and Chervonenkis [1971] Vapnik, V.N., Chervonenkis, A.J.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16(2), 264–280 (1971)
  • Webb [2007] Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
  • Webb [2008] Webb, G.I.: Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach. Learn. 71, 307–323 (2008)