## 1 Introduction

The extraction of association rules is one of the fundamental primitives in
data mining and knowledge discovery from large databases [1].
In its most general definition, the problem can be reduced to identifying
frequent sets of items, or *frequent itemsets*, appearing in a fraction at
least θ of all transactions in a dataset, where θ is provided in
input by the user. Frequent itemsets and association rules are not only of
interest for classic data mining applications (e.g., market basket analysis), but
are also useful for other data analysis and mining tasks, including clustering,
classification, and indexing [16, 15].

In most applications, one is not interested in mining a dataset to extract the frequent itemsets *per se*, but the mining process is used to infer properties of the underlying process that generated the dataset. For example, in market basket analysis the dataset is analyzed with the intent to understand the purchase behavior of customers, assuming that the purchase behavior of customers that generated the current dataset is the same that will be followed in the future.
As an example, if one analyzes the transactional dataset of purchases to identify itemsets frequently sold on a specific day of the week (e.g., Monday), she will use these patterns to infer the itemsets that will be frequently sold on the same day of the following weeks. Analogously, in online recommendation systems the itemsets describing the frequent purchases of current customers are used to infer the frequent purchases of the future customers.

A natural and general model for these settings is to assume that the
transactions in the dataset D are independent identically distributed
(i.i.d.) samples from an (unknown) probability distribution π on all the possible
transactions on the items of I, and each itemset has a fixed probability to
appear in a random transaction from π. The goal of the mining process is
then to identify the itemsets that have probability at least θ to appear in
a random transaction drawn from π, given that for an itemset such
probability corresponds to the fraction of transactions, among an infinite
number of transactions, that contain the itemset. Since D represents only a
finite sample from π, the frequent itemsets of D only provide an
approximation of such itemsets, and due to the random nature of the generative
process a number of *spurious, or false, discoveries*, i.e., itemsets that
appear among the frequent itemsets of D but are not generated with
probability at least θ, may be reported. Given that π
is not known, on one hand one cannot aim at identifying all
and only the itemsets having probability at least θ.
On the
other hand, using the frequent itemsets of D as a proxy for such itemsets
does not provide any guarantee on the number of false discoveries
that are reported. The problem of identifying the itemsets that appear with
probability at least θ with guarantees on the quality of the returned set has received scant attention
in the literature.

### 1.1 Our contributions

In this paper we address the problem of
identifying the *True Frequent Itemsets* (TFIs), that is, the itemsets that
appear with probability at least θ, while providing rigorous probabilistic guarantees on the number of false
discoveries, and without making any assumption on the generative model of the
transactions, i.e., on π. This makes the methods we introduce completely *distribution free*. We develop our methods within the statistical hypothesis testing framework: each itemset has an associated null hypothesis claiming that the itemset has probability less than θ. This hypothesis is tested using information obtained from the dataset, and accepted or rejected accordingly. If the hypothesis is rejected, the itemset is included in the output collection. We focus on returning a set of TFIs with bounded Family-Wise Error Rate (FWER), that is, we bound the probability that one or more false discoveries (itemsets that have probability less than θ) are reported among the TFIs. In particular, the methods have an FWER guaranteed to be within the user-specified limits.

We use results from statistical learning theory to develop and analyze our methods. We define a range set associated to a collection of itemsets and give an upper bound to its (empirical) VC-dimension, showing an interesting connection with a variant of the knapsack optimization problem. This generalizes results from Riondato and Upfal [23]. To the best of our knowledge, ours is the first work to apply these techniques to the identification of TFIs, and in general the first application of the sample complexity bound based on empirical VC-dimension to the field of data mining.

We implemented our methods and evaluated their performance in terms of actually controlling the FWER and achieving high statistical power. We found that the methods control the FWER even better than what the theory guarantees, and that they offer very high power: only a small fraction of the TFIs is not included in the output collection. We also compared our methods with currently available techniques and found them competitive or superior.

We stress that we do not impose any restriction on the generative model of the transactions that are observed in the transactional dataset. In fact, we only assume that the transactions are independent samples from the distribution π, without any constraint on the properties of π. This is in contrast with the assumptions made by methods that perform statistical testing after the frequent patterns have been identified: these methods require a well-specified, limited generative model to characterize the significance of a pattern.

#### Outline.

The article is organized as follows. In Sect. 1.2 we review relevant previous contributions. Sections 2 and 3 contain preliminaries to formally define the problem and key concepts that we will use throughout the work. Our methods are described and analyzed in Sect. 4. We present the methodology and results of our experimental evaluation in Sect. 5. Conclusions and future work are presented in Sect. 6.

### 1.2 Previous work

While the problem of identifying the TFIs has received scant attention in the literature, a number
of approaches have been proposed to filter the FIs of *spurious patterns*, i.e., patterns that are not
actually *interesting*, according to some interestingness measure. We refer the reader to [15, Sect. 3]
and [8] for surveys on different measures.
We remark that, as noted by Liu et al. [20], the use of the minimum
support threshold θ, reflecting the level of domain significance, is complementary to the
use of interestingness measures, and that “statistical significance measures and domain significance
measures should be used together to filter uninteresting rules from different perspectives.”
A number of works explored the idea to use statistical properties of the
patterns in order to assess their interestingness. While this is not the focus of our work,
some of the techniques and models proposed are relevant to our framework.

Most of these works are focused on association rules, but some results can be applied to itemsets. In these works, the notion of interestingness is related to the deviation between the actual support of a pattern in the dataset and its expected support in a random dataset generated according to a statistical model that can incorporate prior belief and that can be updated during the mining process to ensure that the most “surprising” patterns are extracted. In many previous works, the statistical model is a simple independence model: an item belongs to a transaction independently from other items [25, 22, 7, 3, 10, 14, 19]. In contrast, our work does not assume any statistical model for data generation, or, more precisely, does not impose any restriction on the model, with the result that our method is as general as possible, being distribution-free.

Kirsch et al. [19] developed a multi-hypothesis testing procedure to identify the best support threshold such that the number of itemsets with at least such support deviates significantly from its expectation in a random dataset of the same size and with the same frequency distribution for the individual items. In our work, the minimum threshold θ is an input parameter fixed by the user, and we return a collection of itemsets such that they all have a support at least θ with respect to the distribution that generates the sample data.

Bolton et al. [3] suggest that, in pattern extraction settings, it may be more relevant to bound the False Discovery Rate rather than the Family-Wise Error Rate, due to the high number of statistical tests involved. In our experimental evaluation we noticed that the high number of tests is a problem when using traditional multiple-hypothesis correction techniques. The methods we present do not incur this issue because they consider all the itemsets together, without the need to test each of them individually.

Gionis et al. [10] present a method to create random datasets that can act as samples from a distribution satisfying an assumed generative model. The main idea is to swap items in a given dataset while keeping the length of the transactions and the sum over the columns constant. This method is only applicable if one can actually derive a procedure to perform the swapping in such a way that the generated datasets are indeed random samples from the assumed distribution. For the problem we are interested in such a procedure is not available. Considering the same generative model, Hanhijärvi [17] presents a direct adjustment method to bound the FWER while taking into consideration the actual number of hypotheses to be tested.

Webb [28] proposes the use of established statistical techniques to control the probability of false discoveries. In one of these methods (called holdout), the available data are split into two parts: one is used for pattern discovery, while the second is used to verify the significance of the discovered patterns, testing one statistical hypothesis at a time. A new method (layered critical values) to choose the critical values when using a direct adjustment technique to control the FWER is presented by Webb [29] and works by exploiting the itemset lattice. The first method we present is inspired by the holdout technique, while the second can be used even when the dataset cannot be split.

Liu et al. [20] conduct an experimental evaluation of direct corrections, holdout data, and random permutations methods to control the false positives. They test the methods on a very specific problem (association rules for binary classification).

In contrast with the methods presented in the works above, ours do not employ an explicit direct correction for multiple hypothesis testing, and use all the data to obtain more accurate results, without the need to resample it to create random datasets.

Teytaud and Lallich [26] suggest the use of VC-dimension to bound the risk of accepting spurious rules extracted from the database. Although referring to them as “association rules”, the rules they focus on involve ranges over domains and conjunctions of Boolean formulas to express subsets of interest. This is different from the transactional market basket analysis setting in our work.

## 2 Preliminaries

In this section we introduce the definitions, lemmas, and tools that we will use throughout the work, providing the details that are needed in later sections.

### 2.1 Itemsets mining

Given a ground set I of *items*, let π be a
probability distribution on 2^I. A *transaction* τ ⊆ I is a single sample
drawn from π. The *length* |τ| of
a transaction τ is the number of items in τ.
A *dataset* D
is a bag of |D| transactions, i.e., of
*independent identically distributed* (i.i.d.) samples from π. We
call a subset of I an *itemset*. For any itemset A, let
T(A) = {τ ⊆ I : A ⊆ τ} be the *support set*
of A. We define the
*true frequency* t_π(A) of A with respect to π as the
probability that a transaction sampled from π contains A:

t_π(A) = Pr_{τ∼π}(A ⊆ τ).

Analogously, given a dataset D, let T_D(A) denote
the set of transactions in D containing A. The *frequency* of A
in D is the fraction of transactions in D that contain A:
f_D(A) = |T_D(A)| / |D|. It is easy to see that f_D(A) is the
*empirical average* and an *unbiased estimator* for t_π(A):
E[f_D(A)] = t_π(A).
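As a concrete illustration, the frequency of an itemset in a dataset can be computed with a single scan of the transactions. This is a minimal sketch under our own conventions (transactions represented as Python sets; the function name is ours, not from the paper):

```python
def frequency(dataset, itemset):
    """Fraction of transactions in `dataset` containing `itemset`:
    the empirical average f_D(A), an unbiased estimator of t_pi(A)."""
    a = frozenset(itemset)
    return sum(1 for t in dataset if a <= t) / len(dataset)
```

For example, over the dataset [{1, 2}, {1}, {2, 3}], the itemset {1} has frequency 2/3.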

Traditionally, the interest has been on extracting the set FIs(D, θ)
of *Frequent Itemsets* (FIs) from D with respect to a minimum frequency
threshold θ [1], that is, the set

FIs(D, θ) = { A ⊆ I : f_D(A) ≥ θ }.
In most applications the final goal of data mining is to gain a better
understanding of the *process generating the data*, i.e., of the true
frequency t_π, which is *unknown* and only approximately reflected
in the dataset D. Therefore, we are interested in finding the itemsets with *true* frequency
at least θ for some θ ∈ (0,1]. We call these itemsets the
*True Frequent Itemsets* (TFIs) and denote the set they form as

TFIs(π, θ) = { A ⊆ I : t_π(A) ≥ θ }.

If one is only given a *finite* number of random
samples (the dataset D) from π, as is usually the case, one cannot
aim at finding the exact set TFIs(π, θ): no assumption can be
made on the set-inclusion relationship between FIs(D, θ) and
TFIs(π, θ),
because an itemset in one of the two sets may not appear in
the other, and vice versa. One can instead try
to *approximate* the set of TFIs. Specifically, given a user-specified
parameter δ ∈ (0,1), we aim at approximating TFIs(π, θ)
with a collection C of itemsets that, with probability at least
1 − δ, does not contain any spurious itemset:

Pr(∃A ∈ C with t_π(A) < θ) < δ.

At the same time, we want to maximize the size of C, i.e., to find as many TFIs as possible. In this work we present methods to identify a large number of the TFIs. These methods do not assume any limitation on π but use information from D, and guarantee a small probability of false positives while achieving a high success rate.

### 2.2 Vapnik-Chervonenkis dimension

The Vapnik-Chervonenkis (VC) dimension of a class of subsets defined on a set of points is a measure of the complexity or expressiveness of such class [27]. A finite bound on the VC-dimension of a structure implies a bound on the number of random samples required to approximate the expectation of each indicator function associated to a set with its empirical average. We outline here some basic definitions and results and refer the reader to the works of Alon and Spencer [2, Sect. 14.4] and Boucheron et al. [4, Sect. 3] for an introduction to VC-dimension and a survey of recent developments.

Let X be a domain and R be a collection of subsets of X. We call R a
*range set on X*.
Given B ⊆ X, the *projection of R on B* is the set
P_R(B) = { B ∩ r : r ∈ R }. We say that the set B is
*shattered* by R if P_R(B) = 2^B.

###### Definition 1.

Given a set S ⊆ X, the *empirical Vapnik-Chervonenkis
(VC) dimension of R on S*, denoted as EVC(R, S), is
the cardinality of the largest subset of S that is shattered by
R. The *VC-dimension of R* is defined as VC(R) = EVC(R, X).
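For intuition, the definition can be checked directly on tiny instances by brute force. The sketch below is our own illustrative code (exponential in the number of points, so usable only on toy examples); it tests whether a set is shattered and computes the empirical VC-dimension of a range set on a small point set:

```python
from itertools import combinations

def is_shattered(points, ranges):
    """True if every subset of `points` arises as `points` intersected
    with some range, i.e., the projection has all 2^|points| subsets."""
    projections = {frozenset(points) & frozenset(r) for r in ranges}
    return len(projections) == 2 ** len(points)

def empirical_vc(points, ranges):
    """Brute-force empirical VC-dimension of `ranges` on `points`:
    size of the largest shattered subset of `points`."""
    best = 0
    pts = list(points)
    for k in range(1, len(pts) + 1):
        if any(is_shattered(c, ranges) for c in combinations(pts, k)):
            best = k
    return best
```

For instance, the ranges {∅, {1}, {2}, {1,2}} shatter the set {1, 2} but cannot shatter three points, since shattering them would require eight distinct projections.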

The main application of (empirical) VC-dimension in statistics and learning theory is in computing the number of samples needed to approximate the probabilities associated to the ranges through their empirical averages. Formally, let x₁, …, x_m
be a collection of independent identically distributed random variables taking values in
X, sampled according to some distribution ν on the elements of X. For a set r ∈ R, let ν(r) be the probability that a sample from ν belongs to the set r, and let

ν_S(r) = (1/m) Σᵢ₌₁..m 1_r(xᵢ),

where 1_r is the indicator function for the set r and S = (x₁, …, x_m). The function ν_S(r)
is the *empirical average* of ν(r) on S.

###### Definition 2.

Let R be a range set on X and ν be a probability distribution on X. For ε ∈ (0,1),
an *ε-approximation to (X, R, ν)* is a bag S of
elements of X such that

|ν(r) − ν_S(r)| ≤ ε for all r ∈ R.
An ε-approximation can be constructed by sampling points of the domain according to the distribution ν, provided an upper bound to the VC-dimension of R or to its empirical VC-dimension is known:

###### Theorem 1 (Thm. 2.12 [18]).

Let R be a range set on X with VC(R) ≤ d, and let ν be a distribution on X. Given ε, δ ∈ (0,1) and a positive integer m, let

m ≥ (c/ε²) (d + ln(1/δ)),   (1)

where c is a universal positive constant. Then, a bag of m elements of X sampled independently according to ν is an ε-approximation to (X, R, ν) with probability at least 1 − δ.

Löffler and Phillips [21] showed experimentally that the constant c is approximately 0.5. We used this value in our experiments.

###### Theorem 2 (Sect. 3 [4]).

Let R be a range set on X, and let ν be a distribution on X. Let S be a collection of m elements of X sampled independently according to ν. Let d be an integer such that EVC(R, S) ≤ d. Given δ ∈ (0,1), let

ε = 2 √( 2 d ln(m+1) / m ) + √( 2 ln(2/δ) / m ).   (2)

Then, S is an ε-approximation for (X, R, ν) with probability at least 1 − δ.
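In practice one fixes the sample size m and solves the bounds for ε. The sketch below uses our own helper names, c = 0.5 as in the experiments, and the empirical-VC bound in the standard form 2·sqrt(2d·ln(m+1)/m) + sqrt(2·ln(2/δ)/m), whose constants may differ slightly from the source:

```python
import math

def eps_vc(m, d, delta, c=0.5):
    """Epsilon obtained by solving Eq. (1) for epsilon: a sample of m
    i.i.d. points is an eps-approximation w.p. >= 1 - delta when the
    VC-dimension is at most d."""
    return math.sqrt((c / m) * (d + math.log(1.0 / delta)))

def eps_empirical_vc(m, d, delta):
    """Epsilon based on a bound d on the empirical VC-dimension of the
    observed sample itself (constants as stated in the lead-in)."""
    return (2.0 * math.sqrt(2.0 * d * math.log(m + 1) / m)
            + math.sqrt(2.0 * math.log(2.0 / delta) / m))
```

Both bounds shrink as m grows, so larger datasets yield tighter acceptance regions.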

### 2.3 Statistical hypothesis testing

We develop our methods to identify the True Frequent Itemsets within the
framework of *statistical
hypothesis testing*. In statistical hypothesis testing, one uses some data
to evaluate a *null hypothesis* H, whose rejection corresponds to claiming
the identification of a significant phenomenon. A *test statistic* t associated
to the hypothesis is computed from the data. If t belongs to a predefined
*acceptance region* (a subset of the domain of t), then H is
accepted, otherwise it is rejected. The acceptance region is defined a priori
in such a way that the probability of a *false positive*, i.e., the
probability of rejecting a true null hypothesis (corresponding to a non-significant
phenomenon), is at most some *critical
value* α. Rejecting a true null hypothesis is also called a “Type-1
error”. Often, defining an acceptance region as a function of α is done
implicitly, and instead of verifying whether the statistic belongs to it, one
evaluates the associated *p-value*, viz. the probability that the
statistic is at least as extreme as the value observed on the given data,
conditioning on the null hypothesis being true, and rejects H if the p-value
is not larger than α. Another important factor for a
statistical test is its *power*, that is, the probability that the test
correctly rejects the null hypothesis when the null hypothesis is false (also
defined as 1 − Pr(Type-2 error), where a “Type-2 error” consists in
not rejecting a false null hypothesis).

In our scenario, the naïve method to employ statistical hypothesis testing
to find the TFIs is to define a null hypothesis H_A for every itemset
A,
with H_A = “t_π(A) < θ”, and to compute the p-value for such null
hypothesis using f_D(A) as statistic. In particular, the p-value is given
by the probability that the frequency of A in a random dataset (with the same
number of transactions as D) sampled from π is at least f_D(A),
*conditioning* on the event “t_π(A) < θ”. This is easy to
compute given that the number of transactions in a dataset that contain A has
a Binomial distribution whose parameters are the size of the dataset and the
true frequency of A (*Binomial test*). If the p-value is “small enough”, the
null hypothesis is rejected, that is, A is flagged as a TFI. The issue, in
our case, is that we are considering a very large number of itemsets, that is, we are facing a
*multiple hypothesis testing* problem [20]. In this case, one is
interested in controlling the probability of false discoveries among all the
hypotheses tested, i.e., in controlling the *Family-Wise Error Rate*.

###### Definition 3.

The Family-Wise Error Rate (FWER) of a statistical test is the probability of reporting at least one false discovery.

In order to achieve the desired FWER, one must define a sequence of critical
values for the individual hypotheses, that is, implicitly define a sequence of
acceptance regions for the test statistics. The *Bonferroni
correction* [6] is a widely-employed method to define such a
sequence. In its simplest form, the Bonferroni correction suggests comparing
the p-value of each hypothesis to the critical value α/n, where α is the
desired FWER and n is the number of hypotheses to be tested. The Bonferroni
correction is not a good choice to identify TFIs,
since the statistical power of methods that use the Bonferroni
correction decreases with the number of hypotheses tested. In practical cases
there would be hundreds or thousands of TFIs, making the use of the Bonferroni
correction impractical if one wants to achieve high statistical power.
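To make the naïve approach concrete, the following sketch combines the exact Binomial test with the Bonferroni correction. The function names and the dict-based interface are our own; the p-value is evaluated at the boundary t_π(A) = θ of the null hypothesis, which is the worst case over the null:

```python
import math

def binom_pvalue(k, n, theta):
    """One-sided exact Binomial test: probability of observing at least
    k successes out of n when each success has probability theta."""
    return sum(math.comb(n, i) * theta**i * (1.0 - theta)**(n - i)
               for i in range(k, n + 1))

def bonferroni_tfis(support_counts, n, theta, alpha):
    """Reject the null for each itemset whose p-value is at most
    alpha / (number of hypotheses): guarantees FWER <= alpha, but the
    power degrades as the number of tested itemsets grows."""
    crit = alpha / len(support_counts)
    return {a for a, k in support_counts.items()
            if binom_pvalue(k, n, theta) <= crit}
```

With only two hypotheses the correction is mild; with thousands of itemsets the per-test critical value becomes tiny, which is exactly the power loss discussed above.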

## 3 The range set of a collection of itemsets

In this section we define the concept of a range set associated to a collection of itemsets and show how to bound the VC-dimension and the empirical VC-dimension of such range sets. We use these tools in the methods presented in later sections.

###### Definition 4.

Given a collection C of itemsets built on a ground set I, the
*range set associated to C* is a range
set on 2^I containing the support sets of the itemsets in C:
R(C) = { T(A) : A ∈ C }.

###### Lemma 1.

Let C be a collection of itemsets and let D be a dataset. Let d be the maximum integer for which there are at least d transactions τ₁, …, τ_d ∈ D such that the set {τ₁, …, τ_d} is an antichain, and each τᵢ, 1 ≤ i ≤ d, contains at least 2^{d−1} itemsets from C. Then EVC(R(C), D) ≤ d.

###### Proof.

The first requirement guarantees that the set of transactions considered in the computation of d could indeed theoretically be shattered. Assume that a subset Φ of D contains two transactions τ′ and τ″ such that τ′ ⊆ τ″. Any itemset from C appearing in τ′ would also appear in τ″, so there would not be any itemset A ∈ C such that τ″ ∈ T(A) but τ′ ∉ T(A), which would imply that Φ cannot be shattered. Hence sets of transactions that are not antichains should not be considered. This has the net effect of potentially resulting in a lower d, i.e., in a stricter upper bound to EVC(R(C), D).

Let now ℓ > d and consider a set Φ of ℓ transactions from D that is an antichain. Assume that Φ is shattered by R(C). Let τ be a transaction in Φ. The transaction τ belongs to 2^{ℓ−1} subsets of Φ. Let Ψ ⊆ Φ be one of these subsets. Since Φ is shattered, there exists an itemset A ∈ C such that Ψ = Φ ∩ T(A). From this and the fact that τ ∈ Ψ, we have that τ ∈ T(A), or equivalently that A ⊆ τ. Given that all the subsets Ψ containing τ are different, then also all the T(A)’s such that Ψ = Φ ∩ T(A) should be different, which in turn implies that all the corresponding itemsets A should be different and that they should all appear in τ. There are 2^{ℓ−1} subsets of Φ containing τ, therefore τ must contain at least 2^{ℓ−1} itemsets from C, and this holds for all transactions in Φ. This is a contradiction because ℓ > d and d is the maximum integer for which there are at least d transactions containing at least 2^{d−1} itemsets from C. Hence Φ cannot be shattered and the thesis follows. ∎

The exact computation of d as defined in Lemma 1 could
be extremely expensive, since it requires to scan the transactions one by one,
compute the number of itemsets from C appearing in each
transaction, and make sure to consider only antichains. Given the very large
number of transactions in typical datasets and the fact that the number of
itemsets in a transaction is exponential in its length, this method would be
computationally too expensive. An upper bound to d (hence to
EVC(R(C), D)) can be computed more efficiently by solving a
*Set-Union Knapsack Problem* (SUKP) [12] associated to
C.

###### Definition 5 ([12]).

Let U = {a₁, …, aₙ} be a set of elements and let
S = {S₁, …, S_k} be a set of subsets of U, i.e.,
Sᵢ ⊆ U for 1 ≤ i ≤ k. Each subset Sᵢ, 1 ≤ i ≤ k, has an associated
non-negative profit ρ(Sᵢ), and each element aⱼ, 1 ≤ j ≤ n, has an associated non-negative weight w(aⱼ).
Given a subset S′ ⊆ S, we define the profit of S′
as P(S′) = Σ_{Sᵢ ∈ S′} ρ(Sᵢ). Let
U_{S′} = ∪_{Sᵢ ∈ S′} Sᵢ. We
define the weight of S′ as W(S′) = Σ_{aⱼ ∈ U_{S′}} w(aⱼ). Given a non-negative parameter B that we call
*capacity*, the *Set-Union Knapsack Problem* (SUKP) requires to find
the set S′ ⊆ S which *maximizes* P(S′)
over all sets S″ for which W(S″) ≤ B.

In our case, U is the set of items that appear in the itemsets of
C, S = C, the profits and the weights are all
unitary, and the capacity constraint is an integer ℓ. We call this
optimization problem the *SUKP associated to C with capacity
ℓ*.
It should be evident that the optimal profit of this SUKP is the maximum number
of itemsets from C that a transaction of length ℓ can contain. In order to show how to use this fact to compute an upper bound to
EVC(R(C), D), we need to define some additional terminology. Let
ℓ₁, …, ℓ_w be the sequence of the
*transaction lengths* of D, i.e., for each value ℓ
for which there is at least a transaction in D of length ℓ, there is
one (and only one) index i, 1 ≤ i ≤ w, such that ℓᵢ = ℓ. Assume that
the ℓᵢ’s are labelled in sorted decreasing order:
ℓ₁ > ℓ₂ > ⋯ > ℓ_w. Let now Lᵢ, 1 ≤ i ≤ w, be the number of
transactions in D that have length at least ℓᵢ and such that
for no two τ′, τ″ of them we have either τ′ ⊆ τ″ or
τ″ ⊆ τ′. The sequence (ℓᵢ)ᵢ and a sequence (L̃ᵢ)ᵢ
of upper bounds to (Lᵢ)ᵢ can be computed efficiently with a scan of the
dataset. Let now qᵢ be the optimal profit of the SUKP associated to
C with capacity ℓᵢ, and let bᵢ = ⌊log₂ qᵢ⌋ + 1.
The following lemma uses these sequences to show how to obtain an upper bound to
the empirical VC-dimension of R(C) on D.

###### Lemma 2.

Let j be the minimum integer for which bⱼ ≤ L̃ⱼ. Then EVC(R(C), D) ≤ bⱼ.

###### Proof.

If bⱼ ≤ L̃ⱼ, then there are at least bⱼ transactions which can contain 2^{bⱼ−1} itemsets from C, and bⱼ is the maximum value for which this happens, because the sequence (bᵢ)ᵢ is sorted in decreasing order, given that the sequence (qᵢ)ᵢ is. Then bⱼ satisfies the conditions of Lemma 1. Hence EVC(R(C), D) ≤ bⱼ. ∎

###### Corollary 1.

Let q be the optimal profit of the SUKP associated to C with capacity equal to the number of items such that there is at least one itemset in C containing them. Let b = ⌊log₂ q⌋ + 1. Then VC(R(C)) ≤ b.

The SUKP is NP-hard in the general case, but there are known restrictions for which it can be solved in polynomial time using dynamic programming [12]. For our case, available optimization problem solvers can compute the optimal solution reasonably fast even for very large instances with thousands of items and hundreds of thousands of itemsets in C. Moreover, it is actually not necessary to compute the optimal solution to the SUKP: any upper bound to the optimal profit for which we can prove that there is no power of two between that bound and the optimal profit would result in the same upper bound to the (empirical) VC-dimension, while substantially speeding up the computation.
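For small instances, the unit-profit, unit-weight SUKP used here can be solved by brute force. The following sketch is our own illustrative code (exponential in the number of itemsets; the paper uses an ILP solver such as CPLEX for real instances); it returns the maximum number of itemsets whose union of items fits within the capacity:

```python
from itertools import combinations

def sukp_unit(itemsets, capacity):
    """Unit-profit, unit-weight SUKP: maximum number of itemsets whose
    union of items has size at most `capacity`. Tries the largest
    cardinalities first and returns at the first feasible selection."""
    sets = [frozenset(s) for s in itemsets]
    for k in range(len(sets), 0, -1):
        for combo in combinations(sets, k):
            if len(frozenset().union(*combo)) <= capacity:
                return k
    return 0
```

For example, among the itemsets {1}, {2}, {1,2}, {3} with capacity 2, the best choice is {1}, {2}, {1,2} (profit 3), since their union {1, 2} fits in the capacity.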

The range set associated to the collection 2^I of all itemsets is particularly interesting for us. Better bounds to its VC-dimension and to its empirical VC-dimension are available.

###### Theorem 3 ([23]).

Let D be a dataset built on a
ground set I. The
*d-index* d(D) of D is the maximum integer d such that D
contains at least d transactions of length at least d that form an antichain.
Then EVC(R(2^I), D) ≤ d(D).

###### Corollary 2.

.

Riondato and Upfal [23] presented an efficient algorithm to compute an upper bound to the d-index of a dataset with a single linear scan of the dataset. The upper bound presented in Thm. 3 is tight: there are datasets for which it holds with equality [23]. This implies that the upper bound presented in Corol. 2 is also tight.
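Dropping the antichain condition yields a quick, slightly weaker upper bound on the d-index that can be computed with one pass over the sorted transaction lengths (an h-index-style computation; this simplification is ours, not the algorithm of [23]):

```python
def d_index_upper_bound(dataset):
    """Upper bound to the d-index: the maximum d such that at least d
    transactions have length at least d. Ignoring the antichain
    condition can only increase the value, so the bound is valid."""
    lengths = sorted((len(t) for t in dataset), reverse=True)
    d = 0
    for i, length in enumerate(lengths, start=1):
        if length >= i:
            d = i
        else:
            break
    return d
```

For example, a dataset with transaction lengths 3, 2, 2, 1 has at least 2 transactions of length at least 2 but not 3 of length at least 3, so the bound is 2.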

## 4 Finding the true frequent itemsets

In this section we present two methods to identify True Frequent Itemsets with respect to a minimum true frequency threshold θ, while guaranteeing that the Family-Wise Error Rate (FWER) is less than δ, for some user-specified parameter δ ∈ (0,1). In other words, we present two algorithms to find a collection C of itemsets such that Pr(∃A ∈ C with t_π(A) < θ) < δ. Our methods achieve this goal using the same tools in two different ways and are applicable to different and complementary situations. Specifically, the first method can be used when it is possible to randomly split the available dataset in two parts, and is inspired by the holdout technique [28]. While splitting the dataset may be advantageous in certain cases, it may not always be possible, and in general a better characterization of the TFIs may be obtained using the whole dataset. We develop a second method for this situation.

The intuition behind the two methods is similar. Each method starts by building a set
B of “candidate TFIs”. To each itemset A ∈ B
we associate a *null hypothesis*
H_A = “t_π(A) < θ”, and the tests use the frequency f(A) of A in
the dataset (or in a portion of the dataset) as the *test statistic*. If
the frequency falls into the *acceptance region* [0, θ + ε̂), where ε̂ is a function of δ and of the dataset
computed by our methods,
then H_A is *accepted*, otherwise H_A is *rejected*
and A is flagged as True Frequent and included in the output collection
C. Any
itemset not in B is not considered and will not
be reported in output. It should be clear that the definition of the acceptance
region is critical for the method to have the desired FWER at most
δ: one needs to compute an ε̂ such that

Pr(∃A ∈ B with t_π(A) < θ and f(A) ≥ θ + ε̂) < δ.

The two methods we present differ in the definition of B and in the computation of ε̂, but both use the tools we developed in Sect. 3.

### 4.1 Method 1: a split-dataset approach

We now present and analyze our first method for identifying TFIs, which draws
inspiration from the holdout technique presented by Webb [28]. This
method requires that the dataset D can be randomly split into two
parts, not necessarily of the same size: an *exploratory* part D₁
and an *evaluation* part D₂.
The method works in two phases: we first use the exploratory part D₁
to identify a small set of candidate TFIs which will be used to
compute the acceptance region, then we test these same itemsets using their
frequencies in the evaluation part D₂ as test statistics.

Let δ₁ and δ₂ be such that (1 − δ₁)(1 − δ₂) ≥ 1 − δ. Let R(2^I) be the range set of all itemsets. We use Corol. 2 (resp. Thm. 3) to compute an upper bound to its VC-dimension (resp. to its empirical VC-dimension on D₁). Then, given that D₁ is still a collection of i.i.d. samples from the generative distribution π, we can use this bound in Thm. 1 (resp. in Thm. 2) to compute an ε′₁ (resp. an ε″₁) such that D₁ is, with probability at least 1 − δ₁/2, an ε′₁-approximation (resp. ε″₁-approximation) to (2^I, R(2^I), π). Then, if we let ε₁ = min{ε′₁, ε″₁}, D₁ is, with probability at least 1 − δ₁, an ε₁-approximation to (2^I, R(2^I), π).

In the first phase of the method we compute, from the exploratory part, the collection B of candidate itemsets: we extract the set FIs(D₁, θ − ε₁), which, if D₁ is an ε₁-approximation to (2^I, R(2^I), π), contains all the True Frequent Itemsets. In the second phase, we compute a value ε₂ such that, with probability at least 1 − δ₂, the evaluation dataset D₂ is an ε₂-approximation to (2^I, R(B), π). In order to obtain ε₂ through Thms. 1 and 2, we need to compute upper bounds to the VC-dimension of R(B) and to its empirical VC-dimension on D₂. We solve the SUKPs associated to B to obtain such bounds, as stated in Corol. 1 and Lemma 2 respectively. As in the first phase, we use these bounds and Thm. 1 (resp. Thm. 2) to compute an ε′₂ (resp. ε″₂) such that D₂ is, with probability at least 1 − δ₂/2, an ε′₂-approximation (resp. an ε″₂-approximation) to (2^I, R(B), π). Once we have obtained ε₂ = min{ε′₂, ε″₂}, we compute the set

C = { A ∈ B : f_{D₂}(A) ≥ θ + ε₂ }.

The method returns the collection of itemsets C. The proof that Pr(∃A ∈ C with t_π(A) < θ) < δ follows from the definitions of ε₁ and ε₂ through Thms. 1 and 2.

###### Lemma 3.

The FWER of Method 1 is at most δ.

###### Proof.

Consider the two events E₁ = “D₁ is an ε₁-approximation for (2^I, R(2^I), π)” and E₂ = “D₂ is an ε₂-approximation for (2^I, R(B), π)”. From the above discussion it follows that the event E = E₁ ∩ E₂ occurs with probability at least (1 − δ₁)(1 − δ₂) ≥ 1 − δ. Suppose from now on that E indeed occurs.

Given that E₁ occurs, all the itemsets with true frequency at least θ must have frequency in D₁ at least θ − ε₁. This is equivalent to saying that all the True Frequent Itemsets are in B (TFIs(π, θ) ⊆ B).

Given that E₂ occurs, we know that all itemsets in B have a frequency in D₂ that is at most ε₂ far from their true frequency:

|t_π(A) − f_{D₂}(A)| ≤ ε₂ for all A ∈ B.

In particular this means that an itemset A ∈ B can have
f_{D₂}(A) ≥ θ + ε₂ *only* if
t_π(A) ≥ θ, that is *only* if A ∈ TFIs(π, θ).
Hence, C ⊆ TFIs(π, θ).

We can then conclude that if the event E occurs, we have C ⊆ TFIs(π, θ). Since E occurs with probability at least 1 − δ, this is equivalent to saying that

Pr(∃A ∈ C with t_π(A) < θ) < δ. ∎
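The two phases above can be sketched end to end. This is a simplified illustration under our own naming: the naive miner enumerates all itemsets (exponential, for toy data only; real implementations use e.g. FP-growth), the VC-dimension bounds are taken as inputs (in the paper they come from the d-index and the SUKP), and ε is computed via Thm. 1 with c = 0.5:

```python
import math
from itertools import combinations

def epsilon(m, d, delta, c=0.5):
    """Epsilon such that an i.i.d. sample of size m is an
    eps-approximation w.p. >= 1 - delta when the VC-dimension <= d."""
    return math.sqrt((c / m) * (d + math.log(1.0 / delta)))

def freq(dataset, itemset):
    a = frozenset(itemset)
    return sum(1 for t in dataset if a <= t) / len(dataset)

def frequent_itemsets(dataset, theta):
    """Naive miner: enumerate every itemset over the observed items."""
    items = sorted(set().union(*dataset))
    return [frozenset(c) for k in range(1, len(items) + 1)
            for c in combinations(items, k) if freq(dataset, c) >= theta]

def method1_sketch(d1, d2, theta, delta, vc_bound_d1, vc_bound_cands):
    """Sketch of the split-dataset method: candidates from the
    exploratory part d1, testing on the evaluation part d2."""
    # split the error probability so that (1 - d)(1 - d) >= 1 - delta
    delta1 = delta2 = 1.0 - math.sqrt(1.0 - delta)
    eps1 = epsilon(len(d1), vc_bound_d1, delta1)
    # phase 1: every itemset that could still be a TFI given d1
    candidates = frequent_itemsets(d1, theta - eps1)
    eps2 = epsilon(len(d2), vc_bound_cands, delta2)
    # phase 2: flag as True Frequent only when frequency >= theta + eps2
    return [a for a in candidates if freq(d2, a) >= theta + eps2]
```

On larger datasets the two ε values shrink, so the candidate set tightens and more of the true TFIs clear the acceptance threshold.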

### 4.2 Method 2: a full-dataset approach

We now present a method for identifying TFIs that can be used in all cases, even when it is not possible or desirable to split the available dataset in two parts. Our experimental evaluation also showed that this method can often identify a larger fraction of the TFIs than Method 1, at the expense of requiring more computational resources. The intuition behind the definition of the acceptance region is the following. Let W be the negative border of TFIs(π, θ), that is, the set of itemsets not in TFIs(π, θ) but such that all their proper subsets are in TFIs(π, θ). If we can find an ε such that D is an ε-approximation to (2^I, R(W), π) with probability at least 1 − δ, then any itemset A ∈ W has a frequency in D less than θ + ε, given that it must be t_π(A) < θ. By the antimonotonicity property of the frequency, the same holds for all itemsets that are supersets of those in W. Hence, the only itemsets that can have frequency in D greater than or equal to θ + ε are those with true frequency at least θ. In the following paragraphs we show how to compute such an ε.
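The negative border can be computed by brute force on small instances. A minimal sketch (our own code; real implementations exploit the itemset lattice instead of enumerating all candidate itemsets, and the empty itemset is treated as trivially frequent):

```python
from itertools import combinations

def negative_border(collection, items):
    """Negative border of a downward-closed collection of itemsets:
    the itemsets not in the collection all of whose proper subsets
    are in the collection."""
    coll = {frozenset(s) for s in collection}
    coll.add(frozenset())  # the empty itemset is trivially "frequent"
    border = set()
    for k in range(1, len(items) + 1):
        for cand in combinations(sorted(items), k):
            c = frozenset(cand)
            if c not in coll and all(c - {x} in coll for x in c):
                border.add(c)
    return border
```

For example, the border of {{1}, {2}, {1,2}} over the items {1, 2, 3} is {{3}}: the itemset {3} is not in the collection, while {1,3}, {2,3}, and {1,2,3} each have a proper subset outside the collection.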

Let δ₁ and δ₂ be such that (1 − δ₁)(1 − δ₂) ≥ 1 − δ. Following the same steps as in the first phase of Method 1, we can
find an ε₁ such that D is an ε₁-approximation for
(2^I, R(2^I), π) with probability at
least 1 − δ₁. Let F = FIs(D, θ − ε₁), let B(F) be the *negative
border* of F,
and let G = F ∪ B(F).
We want to find an upper bound to the (empirical) VC-dimension of
R(W). To this end, we use the fact that the negative border of a
collection of itemsets is a *maximal
antichain* on 2^I.

###### Lemma 4.

Let A be the set of maximal antichains in G. If D is an ε₁-approximation to (2^I, R(2^I), π), then

VC(R(W)) ≤ max_{A ∈ A} VC(R(A)), and

EVC(R(W), D) ≤ max_{A ∈ A} EVC(R(A), D).

###### Proof.

Assume that D is an ε₁-approximation to (2^I, R(2^I), π). From the definition of ε₁, this happens with probability at least 1 − δ₁. Then TFIs(π, θ) ⊆ F. From this and the definition of the negative border and of G, we have that W ⊆ G. Since W is a maximal antichain, then W ∈ A, and the thesis follows immediately. ∎

In order to compute upper bounds to the VC-dimension and the empirical VC-dimension of the range set of $\mathcal{B}(\mathsf{TFI}(\pi,\mathcal{I},\theta))$, we can solve slightly modified SUKPs associated with $\mathcal{W}$, with the additional constraint that the optimal solution, which is a collection of itemsets, must be a maximal antichain. Using these bounds in Thms. 1 and 2, we compute an $\varepsilon_2$ such that, with probability at least $1-\delta_2$, $\mathcal{D}$ is an $\varepsilon_2$-approximation to the range set of $\mathcal{B}(\mathsf{TFI}(\pi,\mathcal{I},\theta))$. The method returns the collection of itemsets $\mathsf{FI}(\mathcal{D},\mathcal{I},\theta+\varepsilon_2)$, and we prove that the probability that this collection contains any itemset that is not a TFI is at most $\delta$.
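The shape of the $\varepsilon_2$ computation can be sketched as follows. We stress that the exact bound and constant in the paper's Thms. 1 and 2 may differ; the sketch below uses the classic VC deviation bound $\varepsilon=\sqrt{(c/n)(d+\ln(1/\delta))}$ with an assumed universal constant $c=0.5$.

```python
from math import log, sqrt

def eps_from_vc_bound(d, n, delta, c=0.5):
    """Deviation bound for a range set of VC-dimension at most d on a
    sample of size n: with probability >= 1 - delta, the sample is an
    eps-approximation for eps = sqrt((c / n) * (d + ln(1 / delta))).
    The constant c = 0.5 is an assumption for illustration; the form
    used by the paper's theorems may be different."""
    return sqrt((c / n) * (d + log(1.0 / delta)))
```

As expected, the bound tightens as the sample grows and loosens as the (bound on the) VC-dimension grows.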

###### Lemma 5.

The FWER of Method 2 is at most $\delta$.

###### Proof.

Consider the two events $\mathsf{E}_1=$ “$\mathcal{D}$ is an $\varepsilon_1$-approximation for $(2^{\mathcal{I}},\pi)$” and $\mathsf{E}_2=$ “$\mathcal{D}$ is an $\varepsilon_2$-approximation for the range set of $\mathcal{B}(\mathsf{TFI}(\pi,\mathcal{I},\theta))$”. From the above discussion and the definitions of $\delta_1$ and $\delta_2$, it follows that the event $\mathsf{E}=\mathsf{E}_1\cap\mathsf{E}_2$ occurs with probability at least $1-\delta$. Suppose from now on that $\mathsf{E}$ indeed occurs.

Since $\mathsf{E}_1$ occurs, Lemma 4 holds, and the bounds we compute by solving the modified SUKPs are indeed upper bounds to the VC-dimension and the empirical VC-dimension of the range set of $\mathcal{B}(\mathsf{TFI}(\pi,\mathcal{I},\theta))$. The computation of $\varepsilon_2$ is therefore valid. Since $\mathsf{E}_2$ also occurs, for any $A\in\mathcal{B}(\mathsf{TFI}(\pi,\mathcal{I},\theta))$ we have $|f_{\mathcal{D}}(A)-t_\pi(A)|\le\varepsilon_2$; given that $t_\pi(A)<\theta$, because the elements of the negative border are not TFIs, we then have $f_{\mathcal{D}}(A)<\theta+\varepsilon_2$. Because of the antimonotonicity property of the frequency and the definition of the negative border, the same holds for any itemset that is not True Frequent. Hence, the only itemsets that can have a frequency in $\mathcal{D}$ at least $\theta+\varepsilon_2$ are the TFIs, so $\mathsf{FI}(\mathcal{D},\mathcal{I},\theta+\varepsilon_2)\subseteq\mathsf{TFI}(\pi,\mathcal{I},\theta)$. ∎

## 5 Experimental evaluation

We conducted an extensive evaluation to assess the practical applicability and the statistical power of the methods we propose. In the following sections we describe the methodology and present the results.

#### Implementation.

We implemented our methods in Python 3.3. To mine the FIs, we used the C implementation by Grahne
and Zhu [13]. Our solver of choice for the SUKPs was IBM® ILOG® CPLEX®
Optimization Studio 12.3. We ran the experiments on a number of machines with x86-64 processors running
GNU/Linux (kernel 2.6.32).

### 5.1 Datasets generation

We evaluated our methods using datasets from the FIMI’04
repository (http://fimi.ua.ac.be/data/):
accidents [9], BMS-POS, chess,
connect, kosarak, pumsb*, and
retail [5]. These datasets differ in size, number of
items, and, more importantly, in the distribution of the frequencies of the
itemsets [11]. For currently available datasets the *ground
truth* is not known: we do not know the distribution $\pi$, and
therefore we do not know the true frequencies of the itemsets, which we need in order to evaluate
the performance of our methods. To establish a ground truth, we
created new *enlarged* versions of the datasets from the FIMI
repository by sampling transactions uniformly at random from them until the desired size
(in our case, 20 million transactions) was reached. We considered the true
frequencies of the itemsets to be their frequencies in the original datasets.
Given that our methods to control the FWER are distribution-independent, this is a
valid way to establish a ground truth. We used these enlarged datasets in our
experiments.
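The enlargement procedure can be sketched in a few lines of Python (illustrative code; the function and parameter names are ours):

```python
import random

def enlarge_dataset(transactions, target_size, seed=None):
    """Sample transactions uniformly at random, with replacement, from
    the original dataset until the desired size is reached. Itemset
    frequencies in the original dataset then play the role of the true
    frequencies for the enlarged dataset."""
    rng = random.Random(seed)
    return [rng.choice(transactions) for _ in range(target_size)]
```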

### 5.2 Control of the FWER

We evaluated the capability of our methods to control the FWER by first creating
a number of enlarged datasets, and then running our methods to extract a collection of TFIs from each of them.
We used a wide range of values for the minimum
true frequency threshold $\theta$ (see Table 1) and a fixed value of $\delta$.
We repeated each experiment on 20 different enlarged datasets generated from
the same distribution, for each original distribution. In all the hundreds of
runs of our algorithms, the returned collection of itemsets never contained
*any false positive*, i.e., it *always contained only TFIs*. In other
words, the *precision* of the output was 1.0 in all our experiments. This
means that not only can our methods control the FWER effectively, but they are also
more *conservative* than what is guaranteed by the theoretical
analysis.
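The precision we report is simply the fraction of reported itemsets that belong to the ground truth; as a short sketch (the helper is ours, for illustration):

```python
def precision(reported, true_frequent):
    """Fraction of reported itemsets that are True Frequent; a
    precision of 1.0 means the output contains no false positives."""
    reported = {frozenset(s) for s in reported}
    tfi = {frozenset(s) for s in true_frequent}
    return len(reported & tfi) / len(reported) if reported else 1.0
```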

**Table 1.** Statistical power (%). “Holdout” and Method 1 operate on a split dataset; “Bonferroni” and Method 2 operate on the full dataset. Empty cells correspond to configurations for which the full-dataset approaches were not run.

| Dataset | $\theta$ | No. of TFIs | Holdout | Method 1 | Bonferroni | Method 2 |
|---|---|---|---|---|---|---|
| accidents | 0.8 | 149 | 94.631 | 97.987 | 83.893 | 95.973 |
|  | 0.7 | 529 | 97.543 | 98.488 | 92.439 | 98.488 |
|  | 0.6 | 2074 | 98.505 | 98.987 | 96.625 | 99.132 |
|  | 0.5 | 8057 | 98.349 | 99.007 | 94.551 | 99.081 |
|  | 0.45 | 16123 | 98.177 | 98.915 | 94.691 | 99.076 |
|  | 0.4 | 32528 | 98.032 | 98.761 | 94.774 | 98.970 |
|  | 0.35 | 68222 | 98.140 | 98.666 |  |  |
|  | 0.3 | 149545 | 98.033 | 98.529 |  |  |
|  | 0.25 | 346525 | 98.165 | 98.382 |  |  |
|  | 0.2 | 889883 | 97.995 | 98.057 |  |  |
| BMS-POS | 0.05 | 59 | 98.305 | 93.220 | 84.746 | 93.220 |
|  | 0.03 | 134 | 99.254 | 99.254 | 88.060 | 99.254 |
|  | 0.02 | 308 | 98.377 | 95.130 | 84.740 | 95.455 |
|  | 0.01 | 1099 | 95.814 | 85.805 | 81.620 | 88.171 |
|  | 0.0075 | 1896 | 96.097 | 82.331 |  |  |
| chess | 0.8 | 8227 | 97.265 | 98.918 | 96.475 | 99.198 |
|  | 0.75 | 20993 | 96.375 | 98.042 | 95.756 | 98.418 |
|  | 0.7 | 48731 | 97.613 | 99.001 |  |  |
|  | 0.6 | 254944 | 97.243 | 98.464 |  |  |
|  | 0.5 | 1272932 | 97.352 | 98.293 |  |  |
| connect | 0.9 | 27127 | 93.217 | 97.213 | 89.369 | 97.921 |
|  | 0.875 | 65959 | 94.481 | 96.040 |  |  |
|  | 0.85 | 142127 | 94.242 | 96.882 |  |  |
|  | 0.8 | 533975 | 95.047 | 97.464 |  |  |
| kosarak | 0.04 | 42 | 97.620 | 97.620 | 73.810 | 95.238 |
|  | 0.035 | 50 | 100 | 98.000 | 72.000 | 98.000 |
|  | 0.03 | 65 | 96.923 | 93.846 |  |  |
|  | 0.025 | 82 | 98.780 | 97.561 |  |  |
|  | 0.02 | 121 | 100 | 97.521 |  |  |
|  | 0.015 | 189 | 98.942 | 93.122 |  |  |
| pumsb* | 0.55 | 305 | 92.131 | 94.426 | 80.984 | 95.401 |
|  | 0.5 | 679 | 99.853 | 99.853 | 92.931 | 99.853 |
|  | 0.49 | 804 | 97.637 | 98.507 | 86.692 | 98.756 |
|  | 0.45 | 1913 | 96.968 | 97.909 |  |  |
|  | 0.4 | 27354 | 98.183 | 99.101 |  |  |
|  | 0.35 | 116787 | 96.435 | 96.972 |  |  |
|  | 0.3 | 432698 | 96.567 | 97.326 |  |  |
| retail | 0.03 | 32 | 100 | 100 | 62.500 | 100 |
|  | 0.025 | 38 | 100 | 97.368 | 84.211 | 97.368 |
|  | 0.02 | 55 | 100 | 94.545 | 72.340 | 95.745 |
|  | 0.015 | 84 | 100 | 95.238 |  |  |
|  | 0.01 | 159 | 99.371 | 93.082 |  |  |

### 5.3 Statistical power and comparison with other methods

The power of a statistical test is the probability that the test will reject the
null hypothesis when the null hypothesis is false. In the case of TFIs, this
corresponds to the probability of including a TFI in the output collection. It
is often difficult to
analytically quantify the statistical power of a test, especially
in multiple-hypothesis-testing settings with correlated hypotheses.
This is indeed our case, and we therefore conducted an empirical evaluation of
the statistical power of our methods by assessing what
fraction of the total number of TFIs is reported in the output. This corresponds
to evaluating the *recall* of the output collection.
We fixed $\delta$ and considered $\theta$ in a wide range (see Table 1), repeating each experiment 20 times on different datasets; we found insignificant variance in the results. We also wanted to compare the power of our methods with that of established techniques that can control the probability of Type-1 errors. To this end, we adapted the holdout technique proposed by Webb [28] to the TFIs problem and compared its power with that of Method 1. In this technique, the dataset is randomly split into two portions, an *exploratory* part and an *evaluation* part. First, the itemsets frequent w.r.t. $\theta$ are extracted from the exploratory part. Then, the null hypothesis of each of these itemsets is tested with a Binomial test on the frequency of the itemset in the evaluation part, using the critical value $\delta/m$, where $m$ is the number of patterns found in the first step. To evaluate the power of Method 2, which does not need to split the dataset, we compared its results with the power of a method that uses the Bonferroni correction over all possible itemsets to test, with a Binomial test, the FIs w.r.t. $\theta$ obtained from the entire dataset.
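A minimal sketch of this adapted holdout technique follows. The mining step is abstracted away, the interface and names are ours, and the exact test and correction used by Webb [28] may differ in detail; here each null hypothesis is tested with an exact one-sided Binomial test at a Bonferroni-corrected critical value over the candidates only.

```python
from math import comb

def binom_pvalue_upper(successes, n, p0):
    """Exact one-sided Binomial p-value: P[X >= successes], X ~ Bin(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

def holdout_tfi(candidates, eval_counts, eval_size, theta, delta):
    """`candidates`: itemsets frequent w.r.t. theta in the exploratory
    part; `eval_counts[A]`: support count of A in the evaluation part.
    Each null hypothesis "true frequency of A is below theta" is tested
    at critical value delta / |candidates|."""
    m = len(candidates)
    return [a for a in candidates
            if binom_pvalue_upper(eval_counts[a], eval_size, theta)
               <= delta / m]
```

Because only the candidates surviving the exploratory phase are tested, the correction factor is the number of mined patterns rather than the number of all possible itemsets.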

The statistical power evaluation and the comparison with the holdout technique are presented in Table 1, where we report the average over 20 runs for each experiment. Only partial results are reported for the full-dataset approaches because of the amount of computational resources these methods require when the negative border contains a very high number of itemsets. The statistical power of our methods is very high in general, for all the tested datasets and values of $\theta$.

The comparison of Method 1 with the holdout technique shows that the relative performance of these methods is dataset-specific: the statistical power seems to depend heavily on the true frequency distribution of the itemsets. From the table it is possible to see that Method 1 performs better than the holdout technique when the number of TFIs is high, while the opposite is true when the minimum frequency threshold is very low. We conjecture that this is because the value of $\theta$ is not taken into consideration when computing the $\varepsilon$ that defines the acceptance region in Method 1, as the computation of $\varepsilon$ in Thms. 1 and 2 does not depend on $\theta$. In future work, we plan to investigate whether it is possible to include $\theta$ in the computation of $\varepsilon$. On the other hand, the direct correction applied in the holdout technique becomes less and less powerful as the number of TFIs increases, because it does not take into account the many correlations between the itemsets. Given that in common applications the number of frequent itemsets of interest is large, we believe that in real scenarios Method 1 can be more effective at extracting the True Frequent Itemsets while rigorously controlling the probability of false discoveries.

For the full-dataset case, we can see that Method 2 performs much better than the Bonferroni-correction method in all cases, because it does not have to take all possible hypotheses into account when computing the acceptance region. One can also compare Method 1 and Method 2 directly, given that Method 2 can be used in all cases where Method 1 can. Method 2 achieves higher statistical power than Method 1 in almost all cases, but we must stress that this comes at the expense of a longer running time and a higher amount of required memory.

### 5.4 Runtime evaluation

We found our methods to be quite efficient in practice in terms of time needed to compute the output collection. The running time is dominated by the time needed to compute the negative border (only for Method 2) and to solve the SUKPs. Various optimizations and shortcuts can be implemented to substantially speed up our methods (see also the discussion in Sect. 3).

## 6 Conclusions

In this work we developed two methods to approximate the collection of True Frequent Itemsets while guaranteeing that the probability of any false positive (FWER) is bounded by a user-specified threshold, and without imposing any restriction on the generative model. We used tools from statistical learning theory to develop and analyze our methods. The two methods cover complementary use scenarios. Our experimental evaluation shows that the methods we propose achieve very high statistical power and are competitive or superior to existing techniques to solve the same problem. There are a number of directions for further research, including improving the power of methods to identify TFIs by making the computation of the acceptance region dependent on the minimum true frequency threshold, and studying lower bounds to the VC-dimension of the range set of a collection of itemsets. Moreover, while this work focuses on itemset mining, further research is needed for the extraction of more structured (e.g., sequences, graphs) True Frequent patterns.

## References

- Agrawal et al. [1993] Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in large databases. SIGMOD Rec. 22, 207–216 (1993)
- Alon and Spencer [2008] Alon, N., Spencer, J.H.: The Probabilistic Method. Wiley, third edn. (2008)
- Bolton et al. [2002] Bolton, R.J., Hand, D.J., Adams, N.M.: Determining hit rate in pattern search. In: Pattern Detection and Discovery. LNCS, vol. 2447, pp. 36–48 (2002)
- Boucheron et al. [2005] Boucheron, S., Bousquet, O., Lugosi, G.: Theory of classification: A survey of some recent advances. ESAIM: Probab. and Stat. 9, 323–375 (2005)
- Brijs et al. [1999] Brijs, T., Swinnen, G., Vanhoof, K., Wets, G.: Using association rules for product assortment decisions: A case study. KDD’99 (1999)
- Dudoit et al. [2003] Dudoit, S., Shaffer, J., Boldrick, J.: Multiple hypothesis testing in microarray experiments. Statis. Science 18(1), 71–103 (2003)
- DuMouchel and Pregibon [2001] DuMouchel, W., Pregibon, D.: Empirical Bayes screening for multi-item associations. KDD’01 (2001)
- Geng and Hamilton [2006] Geng, L., Hamilton, H.J.: Interestingness measures for data mining: A survey. ACM Comput. Surveys 38(3) (2006)
- Geurts et al. [2003] Geurts, K., Wets, G., Brijs, T., Vanhoof, K.: Profiling of high-frequency accident locations by use of association rules. Transp. Res. Rec.: J. of the Transp. Res. Board 1840, 123–130 (2003)
- Gionis et al. [2007] Gionis, A., Mannila, H., Mielikäinen, T., Tsaparas, P.: Assessing data mining results via swap randomization. ACM Trans. on Knowl. Disc. from Data 1(3) (2007)
- Goethals and Zaki [2004] Goethals, B., Zaki, M.J.: Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor. Newsl. 6(1), 109–117 (2004)
- Goldshmidt et al. [1994] Goldshmidt, O., Nehme, D., Yu, G.: Note: On the set-union knapsack problem. Naval Research Logistics 41(6), 833–842 (1994)
- Grahne and Zhu [2003] Grahne, G., Zhu, J.: Efficiently using prefix-trees in mining frequent itemsets. FIMI’03 (2003)
- Hämäläinen [2010] Hämäläinen, W.: StatApriori: an efficient algorithm for searching statistically significant association rules. Knowl. and Inf. Sys. 23(3), 373–399 (2010)
- Han et al. [2007] Han, J., Cheng, H., Xin, D., Yan, X.: Frequent pattern mining: current status and future directions. Data Min. and Knowl. Disc. 15, 55–86 (2007)
- Han et al. [2006] Han, J., Kamber, M., Pei, J.: Data mining: concepts and techniques. Morgan Kaufmann (2006)
- Hanhijärvi [2011] Hanhijärvi, S.: Multiple hypothesis testing in pattern discovery. DS’11 (2011)
- Har-Peled and Sharir [2011] Har-Peled, S., Sharir, M.: Relative $(p,\varepsilon)$-approximations in geometry. Discr. & Comput. Geom. 45(3), 462–496 (2011)
- Kirsch et al. [2012] Kirsch, A., Mitzenmacher, M., Pietracaprina, A., Pucci, G., Upfal, E., Vandin, F.: An efficient rigorous approach for identifying statistically significant frequent itemsets. J. of the ACM 59(3), 12:1–12:22 (2012)
- Liu et al. [2011] Liu, G., Zhang, H., Wong, L.: Controlling false positives in association rule mining. Proc. of the VLDB Endow. 5(2), 145–156 (2011)
- Löffler and Phillips [2009] Löffler, M., Phillips, J.M.: Shape fitting on point sets with probability distributions. ESA’09 (2009)
- Megiddo and Srikant [1998] Megiddo, N., Srikant, R.: Discovering predictive association rules. KDD’98 (1998)
- Riondato and Upfal [2012] Riondato, M., Upfal, E.: Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ECML PKDD’12 (2012)
- Riondato and Vandin [2013] Riondato, M., Vandin, F.: Finding the True Frequent Itemsets. CoRR abs/1301.1218 (2013). http://arxiv.org/abs/1301.1218
- Silverstein et al. [1998] Silverstein, C., Brin, S., Motwani, R.: Beyond market baskets: Generalizing association rules to dependence rules. Data Min. and Knowl. Disc. 2(1), 39–68 (1998)
- Teytaud and Lallich [2001] Teytaud, O., Lallich, S.: Contribution of statistical learning to validation of association rules (2001)
- Vapnik and Chervonenkis [1971] Vapnik, V.N., Chervonenkis, A.J.: On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications 16(2), 264–280 (1971)
- Webb [2007] Webb, G.I.: Discovering significant patterns. Mach. Learn. 68(1), 1–33 (2007)
- Webb [2008] Webb, G.I.: Layered critical values: a powerful direct-adjustment approach to discovering significant patterns. Mach. Learn. 71, 307–323 (2008)
