MCRapper: Monte-Carlo Rademacher Averages for Poset Families and Approximate Pattern Mining

06/16/2020 · by Leonardo Pellegrina, et al. · Università di Padova, Brown University, Amherst College

We present MCRapper, an algorithm for efficient computation of Monte-Carlo Empirical Rademacher Averages (MCERA) for families of functions exhibiting poset (e.g., lattice) structure, such as those that arise in many pattern mining tasks. The MCERA allows us to compute upper bounds to the maximum deviation of sample means from their expectations, thus it can be used to find both statistically-significant functions (i.e., patterns) when the available data is seen as a sample from an unknown distribution, and approximations of collections of high-expectation functions (e.g., frequent patterns) when the available data is a small sample from a large dataset. This feature is a strong improvement over previously proposed solutions that could only achieve one of the two. MCRapper uses upper bounds to the discrepancy of the functions to efficiently explore and prune the search space, a technique borrowed from pattern mining itself. To show the practical use of MCRapper, we employ it to develop an algorithm TFP-R for the task of True Frequent Pattern (TFP) mining. TFP-R gives guarantees on the probability of including any false positives (precision) and exhibits higher statistical power (recall) than existing methods offering the same guarantees. We evaluate MCRapper and TFP-R and show that they outperform the state-of-the-art for their respective tasks.


1. Introduction

Pattern mining is a key sub-area of knowledge discovery from data, with a large number of variants (from itemsets mining (Agrawal et al., 1993) to subgroup discovery (Klösgen, 1992), to sequential patterns (Agrawal and Srikant, 1995), to graphlets (Ahmed et al., 2015)) tailored to applications ranging from market basket analysis to spam detection to recommendation systems. Ingenious algorithms have been proposed over the years, and pattern mining is both widely used in practice and a very vibrant area of research.

In this work we are interested in the analysis of samples for pattern mining. There are two meanings of “sample” in this context, but, as we now argue, they are really two sides of the same coin, and our methods work for both sides.

The first meaning is sample as a small random sample of a large dataset: since mining patterns becomes more expensive as the dataset grows, it is reasonable to mine only a small random sample that fits into the main memory of the machine. Recently, this meaning of sample as “sample-of-the-dataset” has been used also to enable interactive data exploration using progressive algorithms for pattern mining (Servan-Schreiber et al., 2018). The patterns obtained from the sample are an approximation of the exact collection, due to the noise introduced by the sampling process. To obtain desirable probabilistic guarantees on the quality of the approximation, one must study the trade-off between the size of the sample and the quality of the approximation. Many works have progressively obtained better characterizations of the trade-off using advanced probabilistic concepts (Toivonen, 1996; Chakaravarthy et al., 2009; Riondato and Upfal, 2014, 2015; Riondato and Vandin, 2018; Servan-Schreiber et al., 2018). Recent methods (Riondato and Upfal, 2014, 2015; Riondato and Vandin, 2018; Servan-Schreiber et al., 2018) use VC-dimension, pseudodimension, and Rademacher averages (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2000), key concepts from statistical learning theory (Vapnik, 1998) (see also Sects. 2 and 3.2), because they allow one to obtain uniform (i.e., simultaneous) probabilistic guarantees on the deviations of all sample means (e.g., the sample frequencies, or other measures of interestingness, of all patterns) from their expectations (the exact interestingness of the patterns in the dataset).

The second meaning is sample as a sample from an unknown data-generating distribution: the whole dataset is seen as a collection of samples from an unknown distribution, and the goal of mining patterns from the available dataset is to gain approximate information (or better, discover knowledge) about the distribution. This area is known as statistically-sound pattern discovery (Hämäläinen and Webb, 2018), and there are many different flavors of it, from significant pattern mining (Terada et al., 2013) from transactional datasets (Pellegrina et al., 2019; Kirsch et al., 2012), sequences (Tonon and Vandin, 2019), or graphs (Sugiyama et al., 2015), to true frequent itemset mining (Riondato and Vandin, 2014), to, at least in part, contrast pattern mining (Bay and Pazzani, 2001). Many works in this area also use concepts from statistical learning theory, such as empirical VC-dimension (Riondato and Vandin, 2014) or Rademacher averages (Pellegrina et al., 2019), because, once again, these concepts allow one to obtain very sharp bounds on the maximum difference between the observed interestingness on the sample and the unknown interestingness according to the distribution.

The two meanings of “sample” are really two sides of the same coin: in the first case, too, the goal is to approximate an unknown distribution from a sample, thus falling back into the second case. Despite this similarity, previous contributions have been extremely point-of-view-specific and pattern-specific. In part, these limitations are due to the techniques used to study the trade-off between sample size and quality of the approximation obtained from the sample. Our work instead proposes a unifying solution for mining approximate collections of patterns from samples, while giving guarantees on the quality of the approximation: our proposed method can easily be adapted to approximate collections of frequent itemsets, frequent sequences, true frequent patterns, significant patterns, and many other tasks, even outside of pattern mining.

At the core of our approach is the $n$-Samples Monte-Carlo (Empirical) Rademacher Average ($n$-MCERA) (Bartlett and Mendelson, 2002) (see (4)), which has the flexibility and the power needed to achieve our goals, as it gives much sharper bounds to the deviation than other approaches. The challenge in using the $n$-MCERA, like other quantities from statistical learning theory, is how to compute it efficiently.

Contributions

We present MCRapper, an algorithm for the fast computation of the $n$-MCERA of families of functions with a poset structure, which often arise in pattern mining tasks (Sect. 3.1).


  • MCRapper is the first algorithm to compute the $n$-MCERA efficiently. It achieves this goal by using sharp upper bounds to the discrepancy of each function in the family (Sect. 4.1) to quickly prune large parts of the function search space during the exploration necessary to compute the $n$-MCERA, in a branch-and-bound fashion. We also develop a novel sharper upper bound to the supremum deviation using the 1-MCERA (Thm. 4.6). It holds for any family of functions, and is of independent interest.

  • To showcase the practical strength of MCRapper, we develop TFP-R (Sect. 5), a novel algorithm for the extraction of the True Frequent Patterns (TFP) (Riondato and Vandin, 2014). TFP-R gives probabilistic guarantees on the quality of its output: with probability at least $1-\delta$ (over the choice of the sample and the randomness used in the algorithm), for a user-supplied $\delta \in (0,1)$, the output is guaranteed to not contain any false positives. That is, TFP-R controls the Family-Wise Error Rate (FWER) at level $\delta$ while achieving high statistical power, thanks to the use of the $n$-MCERA and of novel variance-aware tail bounds (Thm. 3.2). We also discuss other applications of MCRapper, to remark on its flexibility as a general-purpose algorithm.

  • We conduct an extensive experimental evaluation of MCRapper and TFP-R on real datasets (Sect. 6), and compare their performance with that of state-of-the-art algorithms for their respective tasks. MCRapper, thanks to the $n$-MCERA, computes much sharper (i.e., lower) upper bounds to the supremum deviation than algorithms using the looser Massart’s lemma (Shalev-Shwartz and Ben-David, 2014, Lemma 26.8). TFP-R extracts many more TFPs (i.e., has higher statistical power) than existing algorithms with the same guarantees.

2. Related Work

Our work applies to both the “small-random-sample-from-large-dataset” and the “dataset-as-a-sample” settings, so we now discuss the relationship of our work to prior art in both settings. We do not study the important but different task of output sampling in pattern mining (Boley et al., 2011; Dzyuba et al., 2017). We focus on works that use concepts from statistical learning theory: these are the most related to our work, and most often the state of the art in their areas. More details are available in surveys (Riondato and Upfal, 2014; Hämäläinen and Webb, 2018).

The idea of mining a small random sample of a large dataset to speed up the pattern extraction step was proposed for the case of itemsets by Toivonen (1996), shortly after the first algorithm for the task had been introduced. The trade-off between the sample size and the quality of the approximation obtained from the sample has been progressively better characterized (Chakaravarthy et al., 2009; Riondato and Upfal, 2014, 2015), with large improvements due to the use of concepts from statistical learning theory. Riondato and Upfal (2014) study the VC-dimension of the itemsets mining task, which results in a worst-case, dataset-dependent, but sample- and distribution-agnostic characterization of the trade-off. The major advantage of using Rademacher averages (Koltchinskii and Panchenko, 2000), as we do in MCRapper, is that the characterization is now sample-and-distribution-dependent, which gives much better upper bounds to the maximum deviation of sample means from their expectations. Rademacher averages were also used by Riondato and Upfal (2015), but they used worst-case upper bounds (based on Massart’s lemma (Shalev-Shwartz and Ben-David, 2014, Lemma 26.8)) to the empirical Rademacher average of the task, resulting in excessively large bounds. MCRapper instead computes the exact $n$-MCERA of the family of interest on the observed sample, without having to consider the worst case. For other kinds of patterns, Riondato and Vandin (2018) studied the pseudodimension of subgroups, while Servan-Schreiber et al. (2018) and Santoro et al. (2020) considered the (empirical) VC-dimension and Rademacher averages for sequential patterns. MCRapper can be applied in all these cases, and obtains better bounds because it uses the sample-and-distribution-dependent $n$-MCERA, rather than a worst-case, dataset-dependent bound.

Significant pattern mining considers the dataset as a sample from an unknown distribution. Many variants and algorithms are described in the survey by Hämäläinen and Webb (2018). We discuss only the two most related to our work. Riondato and Vandin (2014) introduce the problem of finding the true frequent itemsets, i.e., the itemsets that are frequent w.r.t. the unknown distribution. They propose a method based on empirical VC-dimension to compute the frequency threshold to use to obtain a collection of true frequent patterns with no false positives (see also Sect. 5). Our algorithm TFP-R uses the $n$-MCERA and, as we show in Sect. 6, it greatly outperforms the state of the art (a modified version of the algorithm by Riondato and Upfal (2015) for approximate frequent itemsets mining). Pellegrina et al. (2019) use empirical Rademacher averages in their work for significant pattern mining. As their work uses the bound by Riondato and Upfal (2015), the same comments about the $n$-MCERA being a superior approach hold.

Our approach of bounding the supremum deviation by computing the $n$-MCERA with efficient search-space exploration techniques is novel, and not just in knowledge discovery, as the $n$-MCERA has received scant attention. De Stefani and Upfal (2019) use it to control the generalization error in a sequential and adaptive setting, but do not discuss efficient computation. We believe that the lack of attention to the $n$-MCERA can be explained by the fact that there were no efficient algorithms for it, a gap now filled by MCRapper.

3. Preliminaries

We now define the most important concepts and results that we use throughout this work. Let $\mathcal{F}$ be a class of real-valued functions from a domain $\mathcal{D}$ to the interval $[a,b]$. We use $z$ to denote $\max\{|a|,|b|\}$ and $c$ to denote $b-a$. In this work, we focus on a specific class of families (see Sect. 3.1). In pattern mining from transactional datasets, $\mathcal{D}$ is the set of all possible transactions (or, e.g., sequences). Let $\mu$ be an unknown probability distribution over $\mathcal{D}$ and the sample $\mathcal{S} = \{s_1, \dots, s_m\}$ be a bag of $m$ i.i.d. random samples from $\mathcal{D}$ drawn according to $\mu$. We discussed in Sect. 1 how, in the pattern mining case, the sample may either be the whole dataset (sampled according to an unknown distribution) or a random sample of a large dataset (more details in Sect. 3.1). For each $f \in \mathcal{F}$, we define its empirical sample average (or sample mean) on $\mathcal{S}$ and its expectation respectively as

$$\hat{\mathbb{E}}_\mathcal{S}[f] \doteq \frac{1}{m} \sum_{i=1}^{m} f(s_i) \quad\text{and}\quad \mathbb{E}[f] \doteq \mathbb{E}_{s \sim \mu}\left[f(s)\right].$$

In the pattern mining case, the sample mean $\hat{\mathbb{E}}_\mathcal{S}[f_P]$ is the observed interestingness of a pattern $P$, e.g., its frequency (but other measures of interestingness can be modeled as above, as discussed for subgroups by Riondato and Vandin (2018)), while the expectation $\mathbb{E}[f_P]$ is the unknown exact interestingness that we are interested in approximating, that is, either in the large dataset or w.r.t. the unknown data-generating distribution. We are interested in developing tight and fast-to-compute upper bounds to the supremum deviation (SD) of $\mathcal{F}$ on $\mathcal{S}$ between the empirical sample average and the expectation, simultaneously for all $f \in \mathcal{F}$, defined as

$$\mathsf{D}(\mathcal{F},\mathcal{S}) \doteq \sup_{f \in \mathcal{F}} \left| \hat{\mathbb{E}}_\mathcal{S}[f] - \mathbb{E}[f] \right|. \qquad (1)$$

The supremum deviation quantifies how good the estimates obtained from the sample are. Because $\mu$ is unknown, it is not possible to compute $\mathsf{D}(\mathcal{F},\mathcal{S})$ exactly. We introduce concepts, such as the Monte-Carlo Rademacher Average, and results to compute bounds to the SD in Sect. 3.2, but first we elaborate on the specific class of families that we are interested in.

3.1. Poset families and patterns

A partially-ordered set, or poset, is a pair $(A, \sqsubseteq)$ where $A$ is a set and $\sqsubseteq$ is a binary relation between elements of $A$ that is reflexive, anti-symmetric, and transitive. Examples of posets include the integers with the obvious “less-than-or-equal-to” ($\le$) relation, and the powerset of a set of elements with the “subset-or-equal” ($\subseteq$) relation. For any element $y \in A$, we call an element $w \in A$, $w \neq y$, a descendant of $y$ (and call $y$ an ancestor of $w$) if $w \sqsubseteq y$. Additionally, if $w \sqsubseteq y$ and there is no $q \in A$, $q \notin \{w, y\}$, such that $w \sqsubseteq q \sqsubseteq y$, then we say that $w$ is a child of $y$ and that $y$ is a parent of $w$. For example, the set $\{1\}$ is a parent of the set $\{1,2\}$ and an ancestor of the set $\{1,2,3\}$, when considering $A$ to be all possible subsets of the integers, ordered so that $w \sqsubseteq y$ iff $w \supseteq y$.

In this work we are interested in posets where the set is a family $\mathcal{F}$ of functions as in Sect. 3.2, and the relation $\sqsubseteq$ is the following: for any $f, h \in \mathcal{F}$,

$$h \sqsubseteq f \iff \text{for every } s \in \mathcal{D}, \text{ either } 0 \le h(s) \le f(s) \text{ or } f(s) \le h(s) \le 0. \qquad (2)$$

The very general but a bit complicated requirement in (2) often collapses to much simpler ones, as we discuss below. We aim for generality, as our goal is to develop a unifying approach for many pattern mining tasks, for both meanings of “sample”, as discussed in Sect. 1. For now, consider for example that requiring $0 \le h(s) \le f(s)$ for every $s \in \mathcal{D}$ is a specialization of the above more general requirement. We assume to have access to a blackbox function children that, given any function $f \in \mathcal{F}$, returns the list of children of $f$ according to $\sqsubseteq$, and to a blackbox function minimals that, given $\mathcal{F}$, returns the minimal elements of $\mathcal{F}$ w.r.t. $\sqsubseteq$, i.e., all the functions without any parents. We refer to families that satisfy these conditions as poset families, even if the conditions are more about the relation than about the family. We now discuss how poset families arise in many pattern mining tasks.

In pattern mining, one assumes a language $\mathcal{L}$ containing the patterns of interest. For example, in itemsets mining (Agrawal et al., 1993), $\mathcal{L}$ is the set of all possible itemsets, i.e., all non-empty subsets of an alphabet $\mathcal{I}$ of items, while in sequential pattern mining (Agrawal and Srikant, 1995), $\mathcal{L}$ is the set of sequences, and in subgroup discovery (Klösgen, 1992), $\mathcal{L}$ is set by the user as the set of patterns of interest. In all these cases, for each pattern $P \in \mathcal{L}$, it is possible to define a function $f_P$ from the domain $\mathcal{D}$, which is the set of all possible transactions, i.e., elementary components of the dataset or of the sample, to an appropriate co-domain $[a,b]$, such that $f_P(\tau)$ denotes the “value” of the pattern $P$ on the transaction $\tau \in \mathcal{D}$. For example, for itemsets mining, $\mathcal{D}$ is all the subsets of $\mathcal{I}$, and $f_P$ maps $\mathcal{D}$ to $\{0,1\}$ so that $f_P(\tau) = 1$ iff $P \subseteq \tau$ and $f_P(\tau) = 0$ otherwise. A consequence of this definition is that $\hat{\mathbb{E}}_\mathcal{S}[f_P]$ is the frequency of $P$ in $\mathcal{S}$, i.e., the fraction of transactions of $\mathcal{S}$ that contain the pattern $P$. A more complex (due to the nature of the patterns) but similar definition would hold for sequential patterns. For the case of high-utility itemset mining (Fournier-Viger et al., 2019), the value of $f_P(\tau)$ would be the utility of $P$ in the transaction $\tau$. The family $\mathcal{F}$ is the set of the functions $f_P$ for every pattern $P \in \mathcal{L}$. Similar reasoning also applies to patterns on graphs, such as graphlets (Ahmed et al., 2015).

Now that we have defined the set that we are interested in, let us comment on the relation $\sqsubseteq$ that, together with the set, forms the poset. In the itemsets case, for any two patterns $P'$ and $P''$, i.e., for any two functions $f_{P'}$ and $f_{P''}$, it holds $f_{P''} \sqsubseteq f_{P'}$ iff $P' \subseteq P''$. For sequences, the subsequence relation defines $\sqsubseteq$ instead. In all these pattern mining tasks, the only minimal element of $\mathcal{F}$ w.r.t. $\sqsubseteq$ is the function corresponding to the empty itemset (or sequence) $\emptyset$. Our assumption to have access to the blackboxes children and minimals is therefore very reasonable, because computing these collections is extremely straightforward in all the pattern mining cases we just mentioned and many others.
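
As a purely illustrative sketch (ours, not the authors' implementation), the following Python snippet instantiates the two blackboxes for the itemsets case over a small alphabet `I`, an assumption made only for this example; if the language contains only non-empty itemsets, the singletons are the parentless minimal elements:

```python
# Illustrative sketch (ours, not the authors' implementation) of the poset
# blackboxes for the itemsets case. The alphabet I is an assumption for the
# example. f_P(tau) = 1 iff pattern P occurs in transaction tau.

I = frozenset({1, 2, 3, 4})          # alphabet of items (example assumption)

def f(P, tau):
    """Value of the function f_P on transaction tau."""
    return 1 if P <= tau else 0

def minimals():
    """Parentless minimal elements of F: here, the singleton itemsets,
    since this language contains only non-empty subsets of I."""
    return [frozenset({i}) for i in I]

def children(P):
    """Children of f_P w.r.t. the relation (2): the functions of the
    supersets of P with exactly one additional item."""
    return [P | {i} for i in I - P]

def freq(P, S):
    """Sample mean of f_P on a bag S of transactions, i.e., the frequency."""
    return sum(f(P, tau) for tau in S) / len(S)
```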

3.2. Rademacher Averages

Here we present Rademacher averages (Koltchinskii and Panchenko, 2000; Bartlett and Mendelson, 2002) and related results at the core of statistical learning theory (Vapnik, 1998). Our presentation uses the most recent and sharpest results, and we also introduce new results (Thm. 3.2, and later Thm. 4.6) that may be of independent interest. For an introduction to statistical learning theory and more details about Rademacher averages, we refer the interested reader to the textbook by Shalev-Shwartz and Ben-David (2014). In this section we consider a generic family $\mathcal{F}$, not necessarily a poset family.

A key quantity to study the supremum deviation (SD) from (1) is the empirical Rademacher average (ERA) of $\mathcal{F}$ on $\mathcal{S}$ (Koltchinskii and Panchenko, 2000; Bartlett and Mendelson, 2002), defined as follows. Let $\boldsymbol{\sigma} = \langle \sigma_1, \dots, \sigma_m \rangle$ be a collection of $m$ i.i.d. Rademacher random variables, i.e., each taking value in $\{-1, 1\}$ with equal probability. The ERA of $\mathcal{F}$ on $\mathcal{S}$ is the quantity

$$\hat{\mathsf{R}}(\mathcal{F}, \mathcal{S}) \doteq \mathbb{E}_{\boldsymbol{\sigma}}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(s_i) \right]. \qquad (3)$$

Computing the ERA exactly is often intractable, due to the expectation over the $2^m$ possible assignments for $\boldsymbol{\sigma}$, and the need to compute a supremum for each of these assignments, which precludes many standard techniques for computing expectations. Bounds to the SD are then obtained through efficiently-computable upper bounds to the ERA. Massart’s lemma (Shalev-Shwartz and Ben-David, 2014, Lemma 26.8) gives a deterministic upper bound to the ERA that is often very loose. Monte-Carlo estimation allows to obtain an often sharper probabilistic upper bound to the ERA. For $n \ge 1$, let $\boldsymbol{\sigma} \in \{-1,1\}^{n \times m}$ be an $n \times m$ matrix of i.i.d. Rademacher random variables. The $n$-Samples Monte-Carlo Empirical Rademacher Average ($n$-MCERA) of $\mathcal{F}$ on $\mathcal{S}$ using $\boldsymbol{\sigma}$ is (Bartlett and Mendelson, 2002)

$$\hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) \doteq \frac{1}{n} \sum_{j=1}^{n} \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_{j,i} f(s_i). \qquad (4)$$

The $n$-MCERA allows one to obtain probabilistic upper bounds to the SD as follows (proof in Sect. A.1). In Sect. 4.3 we show a novel improved bound for the special case $n = 1$ (Thm. 4.6).
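
Before describing how MCRapper avoids enumeration, it may help to see what the $n$-MCERA is operationally. The following sketch (ours) computes (4) by brute force for a small, explicitly enumerated family, given the matrix of function values on the sample:

```python
import numpy as np

# Illustrative brute-force computation of the n-MCERA in (4) for a small,
# explicitly enumerated family (a sketch; MCRapper exists precisely to avoid
# this enumeration on large poset families).

def n_mcera(values, n, rng):
    """values: len(F) x m array, with values[k, i] = f_k(s_i).
    Returns (1/n) * sum_j sup_f (1/m) * sum_i sigma[j, i] * f(s_i)."""
    num_f, m = values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n, m))  # n x m Rademacher matrix
    # For each Monte-Carlo trial j (row of sigma), take the supremum over F.
    sups = (sigma @ values.T / m).max(axis=1)     # shape (n,)
    return sups.mean()

# Example: 3 binary functions evaluated on a sample of size m = 8.
rng = np.random.default_rng(0)
vals = rng.integers(0, 2, size=(3, 8)).astype(float)
print(n_mcera(vals, n=100, rng=rng))
```

The supremum over $\mathcal{F}$ is taken once per Monte-Carlo trial, and the $n$ suprema are then averaged; this per-trial supremum is exactly the quantity that MCRapper computes without enumerating $\mathcal{F}$.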

Theorem 3.1.

Let $\eta \in (0,1)$. For ease of notation let

$$\tilde{\mathsf{R}} \doteq \hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + 2z \sqrt{\frac{\ln \frac{4}{\eta}}{2 n m}}. \qquad (5)$$

With probability at least $1-\eta$ over the choice of $\mathcal{S}$ and $\boldsymbol{\sigma}$, it holds

$$\mathsf{D}(\mathcal{F}, \mathcal{S}) \le 2 \tilde{\mathsf{R}} + \sqrt{\frac{c \left( 4 m \tilde{\mathsf{R}} + c \ln \frac{4}{\eta} \right) \ln \frac{4}{\eta}}{m}} + \frac{c \ln \frac{4}{\eta}}{m} + c \sqrt{\frac{\ln \frac{4}{\eta}}{2 m}}. \qquad (6)$$
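
As a usage sketch (ours, under the reconstruction of (5) and (6) above): given the $n$-MCERA value, the bound is a direct computation, with $z = \max\{|a|,|b|\}$ and $c = b - a$ as defined at the start of Sect. 3:

```python
import math

# Sketch of the SD bound of Thm. 3.1: given the n-MCERA computed on an
# n x m Rademacher matrix, return the upper bound (6).

def sd_bound(mcera, n, m, z, c, eta):
    ln_t = math.log(4.0 / eta)
    r_tilde = mcera + 2.0 * z * math.sqrt(ln_t / (2.0 * n * m))   # eq. (5)
    return (2.0 * r_tilde                                          # eq. (6)
            + math.sqrt(c * (4.0 * m * r_tilde + c * ln_t) * ln_t / m)
            + c * ln_t / m
            + c * math.sqrt(ln_t / (2.0 * m)))

# e.g., binary pattern-frequency functions: a, b = 0, 1, so z = 1 and c = 1
print(sd_bound(mcera=0.05, n=10, m=10000, z=1.0, c=1.0, eta=0.1))
```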

Sharper upper bounds to $\mathsf{D}(\mathcal{F},\mathcal{S})$ can be obtained with the $n$-MCERA when more information about $\mathcal{F}$ is available. The proof is in Sect. A.1. We use this result for a specific pattern mining task in Sect. 5.

Theorem 3.2.

Let $v$ be an upper bound to the variance of every function in $\mathcal{F}$, and let $\eta \in (0,1)$. Define the following quantities

$$\rho \doteq \hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + 2z \sqrt{\frac{\ln \frac{5}{\eta}}{2 n m}}, \qquad (7)$$
$$r \doteq \rho + \sqrt{\frac{c \left( 4 m \rho + c \ln \frac{5}{\eta} \right) \ln \frac{5}{\eta}}{m}} + \frac{c \ln \frac{5}{\eta}}{m}. \qquad (8)$$

Then, with probability at least $1-\eta$ over the choice of $\mathcal{S}$ and $\boldsymbol{\sigma}$, it holds

$$\mathsf{D}(\mathcal{F}, \mathcal{S}) \le 2r + \sqrt{\frac{2 \ln \frac{5}{\eta} \left( v + 4 c r \right)}{m}} + \frac{c \ln \frac{5}{\eta}}{3m}.$$

Due to the dependency on $z$ in Thms. 3.2 and 3.1, it is often convenient to use $\bar{\mathcal{F}}$ in place of $\mathcal{F}$ in the above theorems, where $\bar{\mathcal{F}}$ denotes the range-centralized family of functions obtained by shifting every function in $\mathcal{F}$ by $-\frac{a+b}{2}$. The results still hold for $\bar{\mathcal{F}}$ because the SD is invariant to shifting, but the bounds to the SD usually improve since the corresponding $z$ for the range-centralized family is smaller.
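
For a concrete instance of the shift (our arithmetic, using the notation introduced in Sect. 3): binary pattern-indicator functions have $a = 0$ and $b = 1$, hence $z = 1$ and $c = 1$. The range-centralized family consists of the functions

$$\bar{f} \doteq f - \frac{a+b}{2} = f - \frac{1}{2},$$

whose range is $[-\frac{1}{2}, \frac{1}{2}]$, so the corresponding $\bar{z}$ equals $\frac{c}{2} = \frac{1}{2}$: the term $2z\sqrt{\ln(4/\eta)/(2nm)}$ in (5) is halved, while $\hat{\mathbb{E}}_\mathcal{S}[\bar{f}] - \mathbb{E}[\bar{f}] = \hat{\mathbb{E}}_\mathcal{S}[f] - \mathbb{E}[f]$ for every $f$, so the SD is unchanged.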

4. MCRapper

We now describe and analyze our algorithm MCRapper for the efficient computation of the $n$-MCERA (see (4)) for a family $\mathcal{F}$ with the binary relation $\sqsubseteq$ defined in (2) and the blackbox functions children and minimals described in Sect. 3.1.

4.1. Discrepancy bounds

For $j \in \{1, \dots, n\}$, we denote as the $j$-discrepancy of $f \in \mathcal{F}$ on $\mathcal{S}$ w.r.t. $\boldsymbol{\sigma}$ the quantity

$$\Delta_j(f) \doteq \sum_{i=1}^{m} \sigma_{j,i} f(s_i).$$

The $j$-discrepancy is not an anti-monotonic function, in the sense that it does not necessarily hold that $\Delta_j(h) \le \Delta_j(f)$ for every descendant $h$ of $f$. Clearly, it holds

$$\hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) = \frac{1}{nm} \sum_{j=1}^{n} \sup_{f \in \mathcal{F}} \Delta_j(f). \qquad (9)$$

A naïve computation of the $n$-MCERA would require enumerating all the functions in $\mathcal{F}$ and computing their $j$-discrepancies, $1 \le j \le n$, in order to find each of the $n$ suprema. We now present novel easy-to-compute upper bounds $\tilde{\Delta}_j(f)$ and $\hat{\Delta}(f)$ to the $j$-discrepancy such that $\Delta_j(h) \le \tilde{\Delta}_j(f)$ and $\Delta_j(h) \le \hat{\Delta}(f)$ for every $h \in d(f) \cup \{f\}$, where $d(f)$ denotes the set of the descendants of $f$ w.r.t. $\sqsubseteq$. This key property (which is a generalization of anti-monotonicity to posets) allows us to derive efficient algorithms for computing the $n$-MCERA exactly without enumerating all the functions in $\mathcal{F}$. Such algorithms take a branch-and-bound approach, using the upper bounds to $\Delta_j$ to prune large portions of the search space (see Sect. 4.2).

For every $j \in \{1,\dots,n\}$ and $i \in \{1,\dots,m\}$, let

$$\sigma^+_{j,i} \doteq \max\{\sigma_{j,i}, 0\} \quad\text{and}\quad \sigma^-_{j,i} \doteq \min\{\sigma_{j,i}, 0\},$$

and for every $f \in \mathcal{F}$ and $s \in \mathcal{D}$, define the functions

$$f^+(s) \doteq \max\{f(s), 0\} \quad\text{and}\quad f^-(s) \doteq \min\{f(s), 0\}.$$

It holds $f^+(s) \ge f(s)$ and $f^-(s) \le f(s)$ for every $f \in \mathcal{F}$ and $s \in \mathcal{D}$. For every $f \in \mathcal{F}$ and $j \in \{1,\dots,n\}$, define

$$\tilde{\Delta}_j(f) \doteq \sum_{i=1}^{m} \left( \sigma^+_{j,i} f^+(s_i) + \sigma^-_{j,i} f^-(s_i) \right)$$

and

$$\hat{\Delta}(f) \doteq \sum_{i=1}^{m} \left| f(s_i) \right|. \qquad (10)$$

Computationally, these quantities are extremely straightforward to obtain. Both $\tilde{\Delta}_j(f)$ and $\hat{\Delta}(f)$ are upper bounds to $\Delta_j(f)$ and to $\Delta_j(h)$ for all $h \in d(f)$ (proof in Sect. A.1).

Theorem 4.1.

For any $j \in \{1,\dots,n\}$ and $f \in \mathcal{F}$, it holds

$$\Delta_j(h) \le \tilde{\Delta}_j(f) \le \hat{\Delta}(f) \quad \text{for every } h \in d(f) \cup \{f\}.$$

The bounds we derived in this section are deterministic. An interesting direction for future research is how to obtain sharper probabilistic bounds.
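
As a quick numerical sanity check of Thm. 4.1 (ours, for binary functions, where a descendant is simulated as a pointwise-dominated function as in the specialization of (2)):

```python
import numpy as np

# Numerical sanity check (our illustration) of the generalized
# anti-monotonicity behind the pruning: for a descendant h dominated by f
# (here 0 <= h(s) <= f(s)), the j-discrepancy of h never exceeds the upper
# bounds computed from f alone.

rng = np.random.default_rng(1)
m = 1000
sigma = rng.choice([-1.0, 1.0], size=m)                  # one trial j, fixed

f_vals = rng.integers(0, 2, size=m).astype(float)        # parent pattern f
h_vals = f_vals * rng.integers(0, 2, size=m)             # dominated descendant

delta_h = np.sum(sigma * h_vals)                         # Delta_j(h)
sig_pos, sig_neg = np.maximum(sigma, 0), np.minimum(sigma, 0)
f_pos, f_neg = np.maximum(f_vals, 0), np.minimum(f_vals, 0)
tilde = np.sum(sig_pos * f_pos + sig_neg * f_neg)        # tilde-Delta_j(f)
hat = np.sum(np.abs(f_vals))                             # hat-Delta(f), sigma-free

assert delta_h <= tilde <= hat
print(delta_h, tilde, hat)
```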

4.2. Algorithms

We now use the discrepancy bounds $\tilde{\Delta}_j$ and $\hat{\Delta}$ from Sect. 4.1 in our algorithm MCRapper for computing the exact $n$-MCERA. As the real problem is usually not to only compute the $n$-MCERA but to actually compute an upper bound to the SD, our description of MCRapper includes this final step; this also enables a fair comparison with existing algorithms that use deterministic bounds to the ERA to compute an upper bound to the SD (see also Sect. 6).

MCRapper offers probabilistic guarantees on the quality of the bound it computes (proof deferred to after the presentation).

Theorem 4.2.

Let $\delta \in (0,1)$. With probability at least $1-\delta$ over the choice of $\mathcal{S}$ and of $\boldsymbol{\sigma}$, the value $\varepsilon$ returned by MCRapper is such that $\mathsf{D}(\mathcal{F},\mathcal{S}) \le \varepsilon$.

Input: Poset family $\mathcal{F}$, sample $\mathcal{S}$ of size $m$, $n \ge 1$, $\delta \in (0,1)$
Output: Upper bound to $\mathsf{D}(\mathcal{F},\mathcal{S})$ with probability $\ge 1-\delta$.
1  $\boldsymbol{\sigma} \leftarrow$ $n \times m$ matrix of i.i.d. Rademacher random variables
2  return getSupDevBound($\mathcal{F}$, $\mathcal{S}$, $\delta$, $\boldsymbol{\sigma}$)
3  Function getSupDevBound($\mathcal{F}$, $\mathcal{S}$, $\delta$, $\boldsymbol{\sigma}$):
4      $\tilde{\mathsf{R}} \leftarrow$ getNMCERA($\mathcal{F}$, $\mathcal{S}$, $\boldsymbol{\sigma}$) $+\ 2z\sqrt{\ln(4/\delta)/(2nm)}$
5      return r.h.s. of (6) using $\tilde{\mathsf{R}}$ and $\eta = \delta$
6  Function getNMCERA($\mathcal{F}$, $\mathcal{S}$, $\boldsymbol{\sigma}$):
7      $Q \leftarrow$ empty priority queue
8      foreach $j \in \{1,\dots,n\}$ do $\epsilon_j \leftarrow -zm$
9      $H \leftarrow$ empty dictionary from $\mathcal{F}$ to subsets of $\{1,\dots,n\}$; $G \leftarrow \emptyset$
10     foreach $f \in$ minimals($\mathcal{F}$) do $H[f] \leftarrow \{1,\dots,n\}$; $Q$.push($f$)
11     while $Q$ is not empty do
12         $f \leftarrow Q$.pop(); $Y \leftarrow \emptyset$
13         foreach $j \in H[f]$ s.t. $\tilde{\Delta}_j(f) > \epsilon_j$ do
14             $\epsilon_j \leftarrow \max\{\epsilon_j, \Delta_j(f)\}$; $Y \leftarrow Y \cup \{j\}$
15         foreach $c \in$ children($f$) s.t. $c \notin G$ do
16             if $c$ is a key of $H$ then $Z \leftarrow H[c] \cap Y$ else $Z \leftarrow Y$
17             if $Z = \emptyset$ then
18                 $G \leftarrow G \cup \{c\}$; if $c$ is a key of $H$ then $H$.delete($c$); $Q$.delete($c$)
19             else
20                 if $c$ is not a key of $H$ then $Q$.push($c$)
21                 $H[c] \leftarrow Z$
22     return $\frac{1}{nm} \sum_{j=1}^{n} \epsilon_j$
Algorithm 1 MCRapper

The pseudocode of MCRapper is presented in Alg. 1. The division in functions is useful for reusing parts of the algorithm in later sections (e.g., Alg. 3). After having sampled the matrix $\boldsymbol{\sigma}$ of i.i.d. Rademacher random variables (line 1), the algorithm calls the function getSupDevBound with appropriate parameters, which in turn calls the function getNMCERA, the real heart of the algorithm. This function computes the $n$-MCERA by exploring and pruning the search space (i.e., $\mathcal{F}$) according to the order of the elements in the priority queue $Q$ (line 7). One possibility is to explore the space in Breadth-First-Search order (so $Q$ is just a FIFO queue), while another is to use the upper bound $\hat{\Delta}$ as the priority, with the top element in the queue being the one with maximum priority among those in the queue. Other orders are possible, but we assume that the order is such that all parents of a function are explored before the function, which is reasonable to ensure maximum pruning, and is satisfied by the two mentioned orders. We assume that the priority queue also has a method delete() to delete an element in the queue. This requirement could be avoided with some additional book-keeping, but it simplifies the presentation of the algorithm.

The algorithm keeps in the quantities $\epsilon_j$, $1 \le j \le n$, the currently best available lower bound to the quantity $\sup_{f \in \mathcal{F}} \Delta_j(f)$ (see (9)); initially they are all $-zm$, the lowest possible value of a discrepancy (line 8). MCRapper also maintains a dictionary $H$ (line 9), initially empty, whose keys will be elements of $\mathcal{F}$ and whose values are subsets of $\{1,\dots,n\}$. The value $H[f]$ associated to a key $f$ in the dictionary is a superset of the set of values $j$ for which $f$ or one of its descendants may be the function attaining the supremum $j$-discrepancy among all the functions in $\mathcal{F}$ (see (9)). A function and all its descendants are pruned when this set is the empty set. The set of keys of $H$ is, at all times, the set of all and only the functions in $\mathcal{F}$ that have ever been added to $Q$. The last data structure is the set $G$ (line 9), initially empty, which will contain pruned elements of $\mathcal{F}$, in order to avoid visiting either them or their descendants.

MCRapper populates $H$ and $Q$ by inserting in them the minimal elements of $\mathcal{F}$ w.r.t. $\sqsubseteq$ (line 10), using the set $\{1,\dots,n\}$ as the value for these keys in the dictionary. It then enters a loop that keeps iterating as long as there are elements in $Q$ (line 11). The top element $f$ of $Q$ is extracted at the beginning of each iteration (line 12). A set $Y$, initially empty, is created to maintain a superset of the set of values $j$ for which a child of $f$ may be the function attaining the supremum $j$-discrepancy among all the functions in $\mathcal{F}$ (see (9)). The algorithm then iterates over the elements $j \in H[f]$ s.t. $\tilde{\Delta}_j(f)$ is greater than $\epsilon_j$ (line 13). The elements $j$ for which $\tilde{\Delta}_j(f) \le \epsilon_j$ can be ignored, because $f$ and its descendants cannot attain the supremum of the $j$-discrepancy in this case, thanks to Thm. 4.1. Computing $\tilde{\Delta}_j(f)$ is straightforward, and can be done even faster if one keeps a frequent-pattern tree or a similar data structure to avoid having to scan $\mathcal{S}$ every time, but we do not discuss this case for ease of presentation. For the values $j$ that satisfy the condition on line 13, the algorithm computes $\Delta_j(f)$ and updates $\epsilon_j$ to this value if larger than the current value of $\epsilon_j$ (line 14), to maintain the invariant that $\epsilon_j$ stores the highest value of $j$-discrepancy seen so far (this invariant, together with the one maintained by the pruning strategy, is at the basis of the correctness of MCRapper). Finally, $j$ is added to the set $Y$ (line 14), as it may still be the case that a descendant of $f$ has $j$-discrepancy higher than $\epsilon_j$. The algorithm then iterates over the children of $f$ that have not been pruned, i.e., those not in $G$ (line 15). If the child $c$ is such that there is a key $c$ in $H$ (because we previously visited another parent of $c$), then let $Z$ be $H[c] \cap Y$, otherwise let $Z$ be $Y$ (line 16). The set $Z$ is a superset of the indices $j$ s.t. $c$ may attain the supremum $j$-discrepancy. Indeed, for a value $j$ to have this property, it is necessary that $j$ belonged to the set $Y$ of every parent of $c$ (where the value of $Y$ in this expression is the one that $Y$ had when that parent was visited). If $Z = \emptyset$, then $c$ and all its descendants can be pruned, which is achieved by adding $c$ to $G$ and removing $c$ from $H$ and $Q$ if present (line 18). When $Z \neq \emptyset$, first $c$ is added to $Q$ (with the appropriate priority depending on the ordering of $Q$) if it did not belong to $Q$ yet (line 20), and then $H[c]$ is set to $Z$ (line 21). This operation completes the iteration of the loop body starting at line 11.

Once $Q$ is empty, the loop ends and the function getNMCERA() returns the sum of the $\epsilon_j$ values divided by $nm$ (line 22). The returned value is summed to an appropriate term to obtain $\tilde{\mathsf{R}}$ (line 4), which is used to compute the return value of the function getSupDevBound() using (6) with $\eta = \delta$ (line 5). This value is returned in output by MCRapper when it terminates (line 2).
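
The following Python sketch (ours; the authors' implementation is in C) mirrors the walkthrough above for the itemsets case, with binary functions, singletons as the minimal elements, and a FIFO queue so that parents are explored before children:

```python
from collections import deque
import numpy as np

# Runnable sketch (our reconstruction, not the authors' C implementation) of
# getNMCERA for the itemsets case. Patterns are frozensets; f_P(tau) = 1 iff
# P <= tau, so Delta_j(P) = sum_i sigma[j,i] * [P <= s_i], and the pruning
# bound of Sect. 4.1 reduces to summing over {i : sigma[j,i] = +1, P <= s_i}.

def get_nmcera(S, items, sigma):
    n, m = sigma.shape
    eps = np.full(n, -float(m))            # lower bounds to sup_f Delta_j(f)
    H = {frozenset({a}): set(range(n)) for a in items}
    G = set()                              # pruned patterns
    Q = deque(H.keys())                    # minimal elements: the singletons
    while Q:
        P = Q.popleft()
        if P not in H:                     # pruned after being queued: skip
            continue                       # (plays the role of Q.delete())
        cover = np.array([P <= tau for tau in S])
        delta = sigma[:, cover].sum(axis=1)                 # Delta_j(P)
        tilde = np.maximum(sigma, 0)[:, cover].sum(axis=1)  # tilde-Delta_j(P)
        Y = {j for j in H[P] if tilde[j] > eps[j]}
        for j in Y:
            eps[j] = max(eps[j], delta[j])
        for a in items - P:                # children: one-item-larger supersets
            C = P | {a}
            if C in G:
                continue
            Z = (H[C] & Y) if C in H else set(Y)
            if not Z:                      # no trial j left: prune C and below
                G.add(C)
                H.pop(C, None)
            else:
                if C not in H:
                    Q.append(C)
                H[C] = Z
    return eps.sum() / (n * m)

# Tiny example: 4 items, 6 transactions, n = 5 Monte-Carlo trials.
rng = np.random.default_rng(2)
items = frozenset({0, 1, 2, 3})
S = [frozenset(int(x) for x in np.flatnonzero(rng.integers(0, 2, 4)))
     for _ in range(6)]
sigma = rng.choice([-1, 1], size=(5, len(S)))
print(get_nmcera(S, items, sigma))
```

Pruned patterns that were already queued are skipped lazily when popped, which stands in for the priority queue's delete() method in the pseudocode.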

The following result is at the core of the correctness of MCRapper (proof in Sect. A.1).

Lemma 4.3.

getNMCERA($\mathcal{F}$, $\mathcal{S}$, $\boldsymbol{\sigma}$) returns the value $\hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma})$.

The proof of Thm. 4.2 is then just an application of Lemma 4.3 and Thm. 3.1 (with $\eta = \delta$), as the value returned by MCRapper is computed according to (6).

4.2.1. Limiting the exploration of the search space

Despite the very efficient pruning strategy made possible by the upper bounds to the $j$-discrepancy, MCRapper may still need to explore a large fraction of the search space, with a negative impact on the running time. We now present a “hybrid” approach that limits this exploration, while still ensuring the guarantees from Thm. 4.2.

Let $\beta$ be any positive value and define

$$\mathcal{F}_\beta \doteq \left\{ f \in \mathcal{F} \,:\, \hat{\mathbb{E}}_\mathcal{S}\left[|f|\right] \ge \beta \right\}$$

and $\bar{\mathcal{F}}_\beta \doteq \mathcal{F} \setminus \mathcal{F}_\beta$. In the case of itemsets mining, $\mathcal{F}_\beta$ would be the set of frequent itemsets w.r.t. $\beta$.

The following result is a consequence of Hoeffding’s inequality and a union bound over $n \left| \bar{\mathcal{F}}_\beta \right|$ events.

Lemma 4.4.

Let $\gamma \in (0,1)$. Then, with probability at least $1-\gamma$ over the choice of $\boldsymbol{\sigma}$, it holds that, simultaneously for all $j \in \{1,\dots,n\}$,

$$\sup_{f \in \bar{\mathcal{F}}_\beta} \frac{1}{m} \sum_{i=1}^{m} \sigma_{j,i} f(s_i) \le \sqrt{\frac{2 z \beta \ln \frac{n |\bar{\mathcal{F}}_\beta|}{\gamma}}{m}}. \qquad (11)$$

The following is an immediate consequence of the above and of the definition of the $n$-MCERA.

Theorem 4.5.

Let $\gamma \in (0,1)$. Then, with probability at least $1-\gamma$ over the choice of $\boldsymbol{\sigma}$, it holds

$$\hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) \le \frac{1}{n} \sum_{j=1}^{n} \max\left\{ \sup_{f \in \mathcal{F}_\beta} \frac{1}{m} \sum_{i=1}^{m} \sigma_{j,i} f(s_i), \; \sqrt{\frac{2 z \beta \ln \frac{n |\bar{\mathcal{F}}_\beta|}{\gamma}}{m}} \right\}.$$

The result of Thm. 4.5 is especially useful in situations where it is possible to efficiently compute reasonable upper bounds on the cardinality of $\bar{\mathcal{F}}_\beta$, possibly using information from $\mathcal{S}$ (but not from $\boldsymbol{\sigma}$). For the case of pattern mining, these bounds are often easy to obtain: e.g., in the case of itemsets, one can replace $|\bar{\mathcal{F}}_\beta|$ in (11) with $\sum_{\tau \in \mathcal{S}} 2^{|\tau|}$, where $|\tau|$ is the number of items in the transaction $\tau$, an upper bound to the number of itemsets appearing in at least one transaction (itemsets with empty support have non-positive discrepancy). Much better bounds are possible, and in many other cases, but we cannot discuss them here due to space limitations.
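
A small sketch (ours, following the form of (11) above and the naive cardinality bound just mentioned; `hybrid_tail` is a hypothetical helper name) of the tail term that MCRapper-H uses for the unexplored, infrequent part of the search space:

```python
import math

# Sketch of the "hybrid" tail term of Thm. 4.5 for itemsets (z = 1), using
# the naive cardinality bound sum_tau 2^|tau| discussed above. The returned
# value bounds the contribution of the infrequent functions to the n-MCERA.

def hybrid_tail(S, beta, n, gamma, z=1.0):
    card_bound = sum(2 ** len(tau) for tau in S)   # naive cardinality bound
    m = len(S)
    return math.sqrt(2.0 * z * beta * math.log(n * card_bound / gamma) / m)

# e.g., 10^4 transactions of 10 items each, beta = 0.05, n = 10, gamma = 0.01
S = [range(10)] * 10**4
print(hybrid_tail(S, beta=0.05, n=10, gamma=0.01))
```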

Combining the above with MCRapper may lead to a significant speed-up, thanks to the fact that MCRapper would be exploring only (a subset of) $\mathcal{F}_\beta$ instead of (a subset of) the entire search space $\mathcal{F}$, at the cost of computing an upper bound to the $n$-MCERA, rather than its exact value. We study this trade-off, which is governed by the choice of $\beta$, experimentally in Sect. 6.3. The correctness follows from Thms. 3.1, 4.5 and 4.2, and an application of the union bound.

We now describe this variant, MCRapper-H, presented in Alg. 2. MCRapper-H accepts in input the same parameters as MCRapper, and additionally the parameters $\beta$ and $\gamma$, where $\gamma$ controls the confidence of the probabilistic bound from Thm. 4.5. After having drawn $\boldsymbol{\sigma}$, MCRapper-H computes the upper bound $\omega$ to $|\bar{\mathcal{F}}_\beta|$ (line 2), and calls the function getNMCERA($\mathcal{F}_\beta$, $\mathcal{S}$, $\boldsymbol{\sigma}$) (line 3), slightly modified w.r.t. the one in Alg. 1 so that it returns the set of values $\epsilon_j$, $1 \le j \le n$, instead of their average. Then, it computes an upper bound to the $n$-MCERA using the r.h.s. of (11) as in Thm. 4.5 (line 4), and returns the bound to the SD obtained from the r.h.s. of (6) with this upper bound in place of the exact $n$-MCERA (line 5).

Input: Poset family $\mathcal{F}$, sample $\mathcal{S}$ of size $m$, $n$, $\delta$, $\beta$, $\gamma$
Output: Upper bound to $\mathsf{D}(\mathcal{F},\mathcal{S})$ with prob. $\ge 1-\delta-\gamma$.
1  $\boldsymbol{\sigma} \leftarrow$ $n \times m$ matrix of i.i.d. Rademacher random variables
2  $\omega \leftarrow$ upper bound to $|\bar{\mathcal{F}}_\beta|$
3  $\{\epsilon_j\}_{j=1}^{n} \leftarrow$ getNMCERA($\mathcal{F}_\beta$, $\mathcal{S}$, $\boldsymbol{\sigma}$)
4  $\mathsf{R}' \leftarrow \frac{1}{n}\sum_{j=1}^{n} \max\left\{\frac{\epsilon_j}{m},\ \sqrt{2 z \beta \ln(n \omega / \gamma) / m}\right\}$
5  return r.h.s. of (6) using $\tilde{\mathsf{R}} = \mathsf{R}' + 2z\sqrt{\ln(4/\delta)/(2nm)}$ and $\eta = \delta$
Algorithm 2 MCRapper-H

It is not necessary to choose $\beta$ a priori, as long as it is chosen without using any information that depends on $\boldsymbol{\sigma}$. In situations where deciding $\beta$ a priori is not simple, one may define instead, for a given value of $k$ set by the user, the quantity $\beta_k$ defined as the $k$-th largest value of $\hat{\mathbb{E}}_\mathcal{S}[|f|]$ over $f \in \mathcal{F}$, so that $|\mathcal{F}_{\beta_k}| \le k$ (up to ties).

When the queue $Q$ (line 7 of Alg. 1) is sorted by decreasing value of the discrepancy upper bound, the value $k$ is the maximum number of nodes that the branch-and-bound search in getNMCERA() may enumerate. We are investigating more refined bounds than Thm. 4.5.

4.3. Improved bounds for $n = 1$

For the special case $n = 1$, it is possible to derive a better bound to the SD than the one presented in Thm. 3.1. This result is new and of independent interest, because it holds for any family $\mathcal{F}$. The proof is in Sect. A.1.

Theorem 4.6.

Let $\eta \in (0,1)$. With probability at least $1-\eta$ over the choice of $\mathcal{S}$ and $\boldsymbol{\sigma}$, it holds that

$$\mathsf{D}(\mathcal{F}, \mathcal{S}) \le 2 \hat{\mathsf{R}}^1_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + 3 c \sqrt{\frac{\ln \frac{2}{\eta}}{2 m}}. \qquad (12)$$

The advantage of (12) over (6) (with $n = 1$) is in the smaller “tail bounds” terms, which arise thanks to a single application of a probabilistic tail bound, rather than three such applications. To use this result in MCRapper, the computation of the returned value (lines 4–5 of Alg. 1) must be replaced with

$$\text{return } 2\,\text{getNMCERA}(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + 3 c \sqrt{\frac{\ln \frac{2}{\delta}}{2 m}},$$

so the upper bound to the SD is computed according to (12). The same guarantees as in Thm. 4.2 hold for this modified algorithm.
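
A minimal sketch (ours) of the resulting computation for $n = 1$, following (12); compare it with the `sd_bound` sketch after Thm. 3.1, which stacks three tail bounds even when $n = 1$:

```python
import math

# Sketch of the specialized n = 1 bound (12): a single tail application on
# top of the 1-MCERA. c is the range b - a of the functions' co-domain.

def sd_bound_1mcera(mcera1, m, c, eta):
    return 2.0 * mcera1 + 3.0 * c * math.sqrt(math.log(2.0 / eta) / (2.0 * m))

print(sd_bound_1mcera(mcera1=0.05, m=10000, c=1.0, eta=0.1))
```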

Figure 1. Ratios of the SD bound obtained by MCRapper and Amira for the entire $\mathcal{F}$, for three of the datasets we analyzed (panels (a)–(c), one dataset each). For $n = 1$, dashed lines use the tail bound from Thm. 3.1 instead of the one from Thm. 4.6.

Figure 2. (a) Bound on the Supremum Deviation obtained by TFP-R and TFP-A. (b) Number of reported patterns (left $y$-axis) and their ratio (right $y$-axis) for TFP-R and TFP-A. (c) Running times of MCRapper, MCRapper-H, and Amira vs. the corresponding upper bound on the SD of the entire $\mathcal{F}$. For MCRapper-H we use different values of $\beta$. Each marker shape corresponds to one of the datasets we considered (others are shown in the Appendix). For Amira we also show the time for mining the TFPs (Amira+Min.) at the lowered frequency threshold, as needed after processing the sample.

5. Applications

To showcase MCRapper’s practical strengths, we now discuss applications to various pattern mining tasks. The value $\varepsilon$ computed by MCRapper can be used, for example, to compute, from a random sample $\mathcal{S}$, a high-quality approximation of the collection of frequent itemsets in a dataset w.r.t. a frequency threshold $\theta$, by mining the sample at frequency $\theta - \varepsilon$ (Riondato and Upfal, 2014). Also, it can be used in the algorithm by Pellegrina et al. (2019) to achieve statistical power in significant pattern mining, or in the progressive algorithm by Servan-Schreiber et al. (2018) to enable even more accurate interactive data exploration. Essentially any of the tasks we mentioned in Sects. 1 and 2 would benefit from the improved bound to the SD computed by MCRapper. To support this claim, we now discuss in depth one specific application.

Mining True Frequent Patterns

We now show how to use MCRapper together with sharp variance-aware bounds to the SD (Thm. 3.2) for the specific application of identifying the True Frequent Patterns (TFPs) (Riondato and Vandin, 2014). The original work considered the problem only for itemsets, but we solve the problem for a general poset family $\mathcal{F}$ of functions, thus for many other pattern classes, such as sequences.

The task of TFP mining is, given a pattern language $\mathcal{F}$ (i.e., a poset family) and a threshold $\theta \in (0,1)$, to output the set

$$\mathsf{TFP} \doteq \left\{ f \in \mathcal{F} \,:\, \mathbb{E}[f] \ge \theta \right\}.$$

Computing $\mathsf{TFP}$ exactly requires knowing $\mathbb{E}[f]$ for all $f \in \mathcal{F}$; since this is almost never the case (and in such a case the task is trivial), it is only possible to compute an approximation of $\mathsf{TFP}$ using information available from a random bag $\mathcal{S}$ of $m$ i.i.d. samples from $\mu$. In this work, mimicking the guarantees given in significant pattern mining (Hämäläinen and Webb, 2018) and in multiple hypothesis testing settings, we are interested in approximations that are a subset of $\mathsf{TFP}$, i.e., we do not want false positives in our approximation, but we accept false negatives. A variant that returns a superset of $\mathsf{TFP}$ is possible and only requires minimal modifications to the algorithm. Due to the randomness in the generation of $\mathcal{S}$, no algorithm can guarantee to be able to compute a (non-trivial) subset of $\mathsf{TFP}$ from every possible $\mathcal{S}$. Thus, one has to accept that there is a probability, over the choice of $\mathcal{S}$ and other random choices made by the algorithm, of obtaining a set of patterns that is not a subset of $\mathsf{TFP}$. We now present an algorithm TFP-R with the following guarantee (proof in Sect. A.1).

Theorem 5.1.

Given $\delta \in (0,1)$, $\theta \in (0,1)$, $\mathcal{F}$, $\mathcal{S}$, and a number $n$ of Monte-Carlo trials, TFP-R returns a set $\mathsf{O} \subseteq \mathcal{F}$ such that

$$\Pr\left( \mathsf{O} \subseteq \mathsf{TFP} \right) \ge 1 - \delta,$$

where the probability is over the choice of both $\mathcal{S}$ and the randomness in TFP-R, i.e., an $n \times m$ matrix $\boldsymbol{\sigma}$ of i.i.d. Rademacher variables.

The intuition behind TFP-R is the following. Let $\mathcal{B}$ be the negative border of $\mathsf{TFP}$, that is, the set of functions in $\mathcal{F} \setminus \mathsf{TFP}$ such that every parent w.r.t. $\sqsubseteq$ of such a function is in $\mathsf{TFP}$. If we can compute an $\varepsilon$ such that, for every $f \in \mathcal{B}$, it holds $\left| \hat{\mathbb{E}}_\mathcal{S}[f] - \mathbb{E}[f] \right| \le \varepsilon$, then we can be sure that any $f \in \mathcal{F}$ such that $\hat{\mathbb{E}}_\mathcal{S}[f] \ge \theta + \varepsilon$ belongs to $\mathsf{TFP}$. This guarantee will naturally be probabilistic, for the reasons we already discussed. Since $\mathcal{B}$ is unknown, TFP-R approximates it by progressively refining a superset $\mathcal{G}$ of it, starting from $\mathcal{G} = \mathcal{F}$. The correctness of TFP-R is based on the fact that, at every point in the execution, it holds $\mathcal{B} \subseteq \mathcal{G}$, as we show in the proof of Thm. 5.1.

Input: Poset family $\mathcal{F}$, sample $\mathcal{S}$ of size $m$, $\theta$, $\delta$, $n$.
Output: A set $\mathsf{O}$ of patterns s.t. $\mathsf{O} \subseteq \mathsf{TFP}$ with probability $\ge 1-\delta$.
1  $\boldsymbol{\sigma} \leftarrow$ draw($n$, $m$)
2  if $\theta \le 1/2$ then $v \leftarrow \theta(1-\theta)$ else $v \leftarrow 1/4$
3  $\mathsf{O} \leftarrow \emptyset$; $\mathcal{G} \leftarrow \mathcal{F}$
4  repeat
5      $\varepsilon \leftarrow$ getSupDevBoundVar($\mathcal{G}$, $\mathcal{S}$, $\delta$, $\boldsymbol{\sigma}$, $v$)
6      $\mathcal{G}' \leftarrow \mathcal{G}$; $\mathcal{G} \leftarrow \{f \in \mathcal{G}' : \hat{\mathbb{E}}_\mathcal{S}[f] < \theta + \varepsilon\}$; $\mathsf{O} \leftarrow \mathsf{O} \cup (\mathcal{G}' \setminus \mathcal{G})$
7  until the value of $\varepsilon$ has not changed from the previous iteration
8  return $\mathsf{O}$
Algorithm 3 TFP-R

The pseudocode of TFP-R is presented in Alg. 3. The algorithm first draws the matrix $\boldsymbol{\sigma}$ (line 1), and then computes an upper bound $v$ to the variances of the frequencies of the functions in $\mathcal{B}$ (line 2). It then initializes, as discussed above, the set $\mathcal{G}$ to $\mathcal{F}$ (line 3) and enters a loop. At each iteration of the loop, TFP-R calls the function getSupDevBoundVar (line 5), which returns a value $\varepsilon$ computed from the bound of Thm. 3.2 using $v$, the quantities in (7) and (8), and $\eta = \delta$. The function getNMCERA from Alg. 1 is used inside of getSupDevBoundVar (with parameters $\mathcal{G}$, $\mathcal{S}$, and $\boldsymbol{\sigma}$) to compute the $n$-MCERA appearing in the value $\rho$ from (7). The properties of $\varepsilon$ are discussed in the proof of Thm. 5.1.

TFP-R uses $\varepsilon$ to refine the set $\mathcal{G}$ with the goal of obtaining a better approximation of $\mathcal{B}$. The set $\mathcal{G}'$ stores the current value of $\mathcal{G}$, and the new value of $\mathcal{G}$ is obtained by keeping all and only the patterns $f \in \mathcal{G}$ such that $\hat{\mathbb{E}}_\mathcal{S}[f] < \theta + \varepsilon$ (line 6). All the patterns that have been filtered out, i.e., the patterns in $\mathcal{G}' \setminus \mathcal{G}$, or in other words all the patterns $f$ such that $\hat{\mathbb{E}}_\mathcal{S}[f] \ge \theta + \varepsilon$, are added to the output set $\mathsf{O}$ (line 6). TFP-R keeps iterating until the value of $\varepsilon$ does not change from the previous iteration (condition on line 7), and finally the set $\mathsf{O}$ is returned in output. While we focused on a conceptually high-level description of TFP-R, we note that an efficient implementation only requires one exploration of $\mathcal{G}$, such that $\mathsf{O}$ can be provided in output as $\mathcal{G}$ is explored, therefore without executing either multiple instances of MCRapper or, at the end of TFP-R, a frequent pattern mining algorithm to compute $\mathsf{O}$.
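
A compact sketch of the refinement loop (ours; `get_sup_dev_bound_var` and `freq` are assumed helpers standing in, respectively, for the variance-aware bound of Thm. 3.2 and the sample frequency):

```python
# High-level sketch (our illustration) of the TFP-R refinement loop.
# get_sup_dev_bound_var(G, S, delta, sigma, v) is assumed to return the
# Thm. 3.2 bound for the family G; freq(f, S) returns the sample frequency.

def tfp_r(F, S, theta, delta, sigma, get_sup_dev_bound_var, freq):
    v = theta * (1 - theta) if theta <= 0.5 else 0.25   # variance bound
    out, G = set(), set(F)
    prev_eps = None
    while True:
        eps = get_sup_dev_bound_var(G, S, delta, sigma, v)
        accepted = {f for f in G if freq(f, S) >= theta + eps}
        out |= accepted       # declared true frequent: no false positives whp
        G -= accepted         # shrink the superset of the negative border
        if eps == prev_eps:   # epsilon stabilized: no further refinement
            return out
        prev_eps = eps
```

Shrinking $\mathcal{G}$ shrinks the family over which the $n$-MCERA is computed, so $\varepsilon$ can only decrease across iterations, allowing further discoveries until a fixed point is reached.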

6. Experiments

In this section we present the results of our experimental evaluation of MCRapper. We compare MCRapper to Amira (Riondato and Upfal, 2015), an algorithm that bounds the Supremum Deviation by computing a deterministic upper bound to the ERA with one pass on the random sample. The goal of our experimental evaluation is to compare MCRapper to Amira in terms of the upper bound to the SD they compute. We also assess the impact of the difference in the SD bound provided by MCRapper and Amira for the application of mining true frequent patterns, by comparing our algorithm TFP-R with TFP-A, a simplified variant of TFP-R that uses Amira to compute a bound $\varepsilon$ on the SD for all functions in $\mathcal{F}$, and returns as candidate true frequent patterns the set $\{ f \in \mathcal{F} : \hat{\mathbb{E}}_\mathcal{S}[f] \ge \theta + \varepsilon \}$. It is easy to prove that the output of TFP-A is a subset of the true frequent patterns with probability at least $1-\delta$. We also evaluate the running time of MCRapper and of its variant MCRapper-H.

Datasets and implementation

We implemented MCRapper and MCRapper-H in C, by modifying TopKWY (Pellegrina and Vandin, 2020). Our implementations are available at https://github.com/VandinLab/MCRapper. The implementation of Amira (Riondato and Upfal, 2015) has been provided by the authors. We test both methods on 18 datasets (see Table 1 in the Appendix for their statistics), widely used as benchmarks for frequent itemset mining algorithms. To compare MCRapper to Amira in terms of the upper bound to the SD, we draw, from every dataset, random samples of increasing size $m$, with the values of $m$ equally spaced in logarithmic space; we only consider values of $m$ smaller than the dataset size. For both algorithms we fix the same confidence parameter $\delta$; for MCRapper, we also fix the number $n$ of Monte-Carlo trials.

To compare TFP-R to TFP-A, we analyze synthetic datasets obtained by randomly sampling transactions from each dataset: the true frequency of a pattern corresponds to its frequency in the original dataset, which we use as the ground truth. We use the same values of $n$ and $\delta$ for TFP-R, and a fixed frequency threshold $\theta$. We report the results for one representative combination of parameters (other values of $\theta$ and $\delta$ produced similar results).

For all experiments and parameter combinations we perform multiple runs (i.e., we create several random samples of the same size from the same dataset). In all the figures we report the averages of these runs, together with their standard deviations.

6.1. Bounds on the SD

Figure 1 shows the ratio between the upper bound on the SD obtained by MCRapper and the one obtained by Amira, for different values of $n$. The bound provided by MCRapper is always better (i.e., lower) than the bound provided by Amira, often by a large margin. For $n = 1$, one can see that the novel improved bound from Thm. 4.6 should really be preferred over the “standard” one (dashed lines). Similar results hold for all other datasets. These results highlight the effectiveness of MCRapper in providing a much tighter bound to the SD than currently available approaches.

6.2. Mining True Frequent Patterns

We compare the final bound on the SD computed by TFP-R with the one computed by TFP-A. The results are shown in Fig. 2(a). Similarly to what we observed in Sect. 6.1, MCRapper provides much tighter bounds, being, in most cases, a small fraction of the bound reported by Amira. We then assessed the impact of such a difference on the mining of TFPs, by comparing the number of patterns reported by TFP-R and by TFP-A. Since for both algorithms the output is a subset of the true frequent patterns with probability at least $1-\delta$, reporting a higher number of patterns corresponds to identifying more true frequent patterns, i.e., to higher statistical power. Figure 2(b) shows the number of patterns reported by TFP-R and by TFP-A (left $y$-axis) and the ratio between such quantities (right $y$-axis). The SD bound from MCRapper is always lower than the SD bound from Amira, so TFP-R always reports at least as many patterns as TFP-A, and for 10 out of 18 datasets it reports at least twice as many patterns as TFP-A. These results show that the SD bound computed by TFP-R provides a great improvement in terms of power for mining TFPs w.r.t. current state-of-the-art methods for SD bound computation.

6.3. Running time

For these experiments we take random samples of the most demanding datasets (accidents, chess, connect, phishing, pumb-star, susy; for the other datasets, MCRapper takes much less time than for the ones shown) and use the hybrid approach MCRapper-H (Sect. 4.2.1) with different values of $\beta$ (and a fixed $n$, which gives a good trade-off between the bounds and the running time). We naïvely upper bound the cardinality in (11) with $\sum_{\tau \in \mathcal{S}} 2^{|\tau|}$, where $|\tau|$ is the length of the transaction $\tau$, a very loose bound that could be improved using more information from $\mathcal{S}$. Figures 3 (in the Appendix) and 2(c) show the running time of MCRapper and Amira vs. the obtained upper bound on the SD (different colors correspond to different values of $\beta$). With Amira one can quickly obtain a fairly loose bound on the SD; by using MCRapper and MCRapper-H, one can trade off running time for smaller bounds on the SD.

7. Conclusion

We present MCRapper, an algorithm for computing a bound to the supremum deviation of the sample means from their expectations for families of functions with a poset structure, such as those that arise in pattern mining tasks. At the core of MCRapper there is a novel efficient approach to compute the $n$-sample Monte-Carlo Empirical Rademacher Average, based on fast search-space exploration and pruning techniques. MCRapper returns a much better (i.e., smaller) bound to the supremum deviation than existing techniques. We use MCRapper to extract true frequent patterns and show that it finds many more patterns than the state of the art.

Acknowledgements.
Part of this work was conducted while L.P. was visiting the Department of Computer Science of Brown University, supported by a fellowship from Fondazione Ing. Aldo Gini (https://www.unipd.it/fondazionegini). Part of this work is supported by the National Science Foundation (https://www.nsf.gov), by the MIUR of Italy (http://www.miur.it) under the project AHeAD (Efficient Algorithms for HArnessing Networked Data), and by the University of Padova (http://www.unipd.it).

References

  • R. Agrawal, T. Imieliński, and A. Swami (1993) Mining association rules between sets of items in large databases. SIGMOD Rec. 22, pp. 207–216.
  • R. Agrawal and R. Srikant (1995) Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, ICDE ’95, pp. 3–14.
  • N. K. Ahmed, J. Neville, R. A. Rossi, and N. Duffield (2015) Efficient graphlet counting for large networks. In 2015 IEEE International Conference on Data Mining, pp. 1–10.
  • P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482.
  • S. D. Bay and M. J. Pazzani (2001) Detecting group differences: mining contrast sets. Data Mining and Knowledge Discovery 5 (3), pp. 213–246.
  • M. Boley, C. Lucchese, D. Paurat, and T. Gärtner (2011) Direct local pattern sampling by efficient two-step random procedures. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’11.
  • V. T. Chakaravarthy, V. Pandit, and Y. Sabharwal (2009) Analysis of sampling techniques for association rule mining. In Proc. 12th Int. Conf. Database Theory, ICDT ’09, New York, NY, USA, pp. 276–283.
  • L. De Stefani and E. Upfal (2019) A Rademacher complexity based method for controlling power and confidence level in adaptive statistical analysis. In 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pp. 71–80.
  • V. Dzyuba, M. van Leeuwen, and L. De Raedt (2017) Flexible constrained sampling with guarantees for pattern mining. Data Mining and Knowledge Discovery 31 (5), pp. 1266–1293.
  • P. Fournier-Viger, J. Chun-Wei Lin, T. Truong-Chi, and R. Nkambou (2019) A survey of high utility itemset mining. In High-Utility Pattern Mining.
  • W. Hämäläinen and G. I. Webb (2018) A tutorial on statistically sound pattern discovery. Data Mining and Knowledge Discovery.
  • A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal, and F. Vandin (2012) An efficient rigorous approach for identifying statistically significant frequent itemsets. Journal of the ACM 59 (3), pp. 1–22.
  • W. Klösgen (1992) Problems for knowledge discovery in databases and their treatment in the Statistics Interpreter Explora. International Journal of Intelligent Systems 7, pp. 649–673.
  • V. Koltchinskii and D. Panchenko (2000) Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pp. 443–457.
  • L. Pellegrina, M. Riondato, and F. Vandin (2019) SPuManTE: significant pattern mining with unconditional testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, New York, NY, USA, pp. 1528–1538.
  • L. Pellegrina and F. Vandin (2020) Efficient mining of the most significant patterns with permutation testing. Data Mining and Knowledge Discovery.
  • M. Riondato and E. Upfal (2014) Efficient discovery of association rules and frequent itemsets through sampling with tight performance guarantees. ACM Trans. Knowl. Disc. from Data 8 (4), pp. 20.
  • M. Riondato and E. Upfal (2015) Mining frequent itemsets through progressive sampling with Rademacher averages. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 1005–1014.
  • M. Riondato and F. Vandin (2014) Finding the true frequent itemsets. In Proceedings of the 2014 SIAM International Conference on Data Mining, pp. 497–505.
  • M. Riondato and F. Vandin (2018) MiSoSouP: mining interesting subgroups with sampling and pseudodimension. In Proc. 24th ACM SIGKDD Int. Conf. Knowl. Disc. and Data Mining, KDD ’18, pp. 2130–2139.
  • D. Santoro, A. Tonon, and F. Vandin (2020) Mining sequential patterns with VC-dimension and Rademacher complexity. Algorithms 13 (5), pp. 123.
  • S. Servan-Schreiber, M. Riondato, and E. Zgraggen (2018) ProSecCo: progressive sequence mining with convergence guarantees. In Proceedings of the 18th IEEE International Conference on Data Mining, pp. 417–426.
  • S. Shalev-Shwartz and S. Ben-David (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
  • M. Sugiyama, F. Llinares-López, N. Kasenburg, and K. M. Borgwardt (2015) Significant subgraph mining with multiple testing correction. In Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 37–45.
  • A. Terada, M. Okada-Hatakeyama, K. Tsuda, and J. Sese (2013) Statistical significance of combinatorial regulations. Proceedings of the National Academy of Sciences 110 (32), pp. 12996–13001.
  • H. Toivonen (1996) Sampling large databases for association rules. In Proc. 22nd Int. Conf. Very Large Data Bases, VLDB ’96, San Francisco, CA, USA, pp. 134–145.
  • A. Tonon and F. Vandin (2019) Permutation strategies for mining significant sequential patterns. In 2019 IEEE International Conference on Data Mining (ICDM), pp. 1330–1335.
  • V. N. Vapnik (1998) Statistical Learning Theory. Wiley.

Appendix A

A.1. Missing Proofs

Theorem A.1 (Symmetrization inequality [Koltchinskii and Panchenko, 2000]).

For any family $\mathcal{F}$ it holds

$$\mathbb{E}_{\mathcal{S}}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}[f] - \hat{\mathbb{E}}_\mathcal{S}[f] \right) \right] \le 2\, \mathbb{E}_{\mathcal{S}}\left[ \hat{\mathsf{R}}(\mathcal{F}, \mathcal{S}) \right].$$

Theorem A.2 ([Bousquet, 2002, Thm. 2.2]).

Let $\delta \in (0,1)$, and let $v$ be an upper bound to the variance of every function in $\mathcal{F}$. Let $Z \doteq \sup_{f \in \mathcal{F}} \left( \mathbb{E}[f] - \hat{\mathbb{E}}_\mathcal{S}[f] \right)$. Then, with probability at least $1-\delta$ over the choice of $\mathcal{S}$, it holds

$$Z \le \mathbb{E}[Z] + \sqrt{\frac{2 \ln \frac{1}{\delta} \left( v + 2 c\, \mathbb{E}[Z] \right)}{m}} + \frac{c \ln \frac{1}{\delta}}{3 m}. \qquad (13)$$
Proof of Thm. 3.2.

Consider the following events: $\mathsf{E}_1$, stating that $\hat{\mathsf{R}}(\mathcal{F},\mathcal{S}) \le \rho$; $\mathsf{E}_2$, the event of [Oneto et al., 2013, (generalization of) Thm. 3.11], relating the Rademacher average $\mathsf{R}(\mathcal{F},m)$ to the ERA; and $\mathsf{E}_3$ and $\mathsf{E}_4$, the event in (13) for $\mathcal{F}$ and for $-\mathcal{F} \doteq \{-f \,:\, f \in \mathcal{F}\}$, respectively.

From Lemma A.4, we know that $\mathsf{E}_1$ holds with probability at least $1-\eta/5$ over the choice of $\boldsymbol{\sigma}$. $\mathsf{E}_2$ is guaranteed to hold with probability at least $1-\eta/5$ over the choice of $\mathcal{S}$ [Oneto et al., 2013, (generalization of) Thm. 3.11]. [Bousquet, 2002, Thm. 2.2] tells us that the events $\mathsf{E}_3$ and $\mathsf{E}_4$ hold each with probability at least $1-\eta/5$ over the choice of $\mathcal{S}$. Thus, from the union bound, we have that the event $\mathsf{E} \doteq \mathsf{E}_1 \cap \mathsf{E}_2 \cap \mathsf{E}_3 \cap \mathsf{E}_4$ holds with probability at least $1-\eta$ over the choice of $\mathcal{S}$ and $\boldsymbol{\sigma}$. Assume from now on that the event $\mathsf{E}$ holds.

Because $\mathsf{E}_1$ and $\mathsf{E}_2$ hold, it must be $\mathsf{R}(\mathcal{F},m) \le r$. From this result and Thm. A.1 we have that

$$\mathbb{E}_{\mathcal{S}}\left[ \sup_{f \in \mathcal{F}} \left( \mathbb{E}[f] - \hat{\mathbb{E}}_\mathcal{S}[f] \right) \right] \le 2r.$$

From here, and again because $\mathsf{E}$ holds, by plugging $2r$ in place of $\mathbb{E}[Z]$ into (13) (for $\mathcal{F}$), we obtain that $\sup_{f \in \mathcal{F}} \left( \mathbb{E}[f] - \hat{\mathbb{E}}_\mathcal{S}[f] \right)$ is at most the r.h.s. of the thesis. To show that it also holds

$$\sup_{f \in \mathcal{F}} \left( \hat{\mathbb{E}}_\mathcal{S}[f] - \mathbb{E}[f] \right) \le 2r + \sqrt{\frac{2 \ln \frac{5}{\eta} \left( v + 4 c r \right)}{m}} + \frac{c \ln \frac{5}{\eta}}{3 m}$$

(which allows us to conclude that the same bound holds for $\mathsf{D}(\mathcal{F},\mathcal{S})$), we repeat the reasoning above for $-\mathcal{F}$ and use the fact that $\hat{\mathsf{R}}(-\mathcal{F}, \mathcal{S}) = \hat{\mathsf{R}}(\mathcal{F}, \mathcal{S})$, a known property of the ERA. ∎

Theorem A.3 (McDiarmid’s inequality [McDiarmid, 1989]).

Let $t > 0$, and let $g$ be a function of $m$ arguments such that, for each $i$, $1 \le i \le m$, there is a nonnegative constant $c_i$ such that:

$$\sup_{x_1,\dots,x_m,\, x_i'} \left| g(x_1,\dots,x_m) - g(x_1,\dots,x_{i-1},x_i',x_{i+1},\dots,x_m) \right| \le c_i. \qquad (14)$$

Let $X_1,\dots,X_m$ be independent random variables taking value in the domain of the arguments of $g$. Then it holds

$$\Pr\left( g(X_1,\dots,X_m) - \mathbb{E}\left[ g(X_1,\dots,X_m) \right] > t \right) \le e^{-2 t^2 / C},$$

where $C \doteq \sum_{i=1}^{m} c_i^2$.

The following result is an application of McDiarmid’s inequality to the $n$-MCERA, with constants $c_{j,i} = \frac{2z}{nm}$, one for each of the $nm$ entries of $\boldsymbol{\sigma}$.

Lemma A.4.

Let $\delta \in (0,1)$. Then, with probability at least $1-\delta$ over the choice of $\boldsymbol{\sigma}$, it holds

$$\hat{\mathsf{R}}(\mathcal{F}, \mathcal{S}) \le \hat{\mathsf{R}}^n_m(\mathcal{F}, \mathcal{S}, \boldsymbol{\sigma}) + 2 z \sqrt{\frac{\ln \frac{1}{\delta}}{2 n m}}.$$

The following result gives a probabilistic upper bound to the supremum deviation using the Rademacher average (RA) and the ERA [Oneto et al., 2013, Thm. 3.11].

Theorem A.5.

Let $\delta \in (0,1)$. Then, with probability at least $1-\delta$ over the choice of $\mathcal{S}$, it holds

$$\mathsf{D}(\mathcal{F}, \mathcal{S}) \le 2 \hat{\mathsf{R}}(\mathcal{F}, \mathcal{S}) + \sqrt{\frac{c \left( 4 m \hat{\mathsf{R}}(\mathcal{F}, \mathcal{S}) + c \ln \frac{3}{\delta} \right) \ln \frac{3}{\delta}}{m}} + \frac{c \ln \frac{3}{\delta}}{m} + c \sqrt{\frac{\ln \frac{3}{\delta}}{2 m}}. \qquad (15)$$
Proof of Thm. 3.1.

Through Lemma A.4 (using $\delta$ there equal to $\eta/4$), Thm. A.5 (using $\delta$ there equal to $3\eta/4$), and an application of the union bound. ∎

Proof of Thm. 4.1.

It is immediate from the definitions of $\tilde{\Delta}_j(f)$ and $\hat{\Delta}(f)$ in (10) that $\tilde{\Delta}_j(f) \le \hat{\Delta}(f)$, so we can focus on $\tilde{\Delta}_j(f)$. We start by showing that $\Delta_j(f) \le \tilde{\Delta}_j(f)$. It holds