ALLSAT compressed with wildcards: Frequent Set Mining

10/31/2019
by   Marcel Wild, et al.

Once the maximal frequent sets are known, the family of all frequent sets can be efficiently compressed (without loss of information) by the use of suitable wildcards.

1 Introduction

This is the first draft of a hopefully longer article in spe for the series ‘ALLSAT compressed with wildcards’. For more about this series as a whole, see [W2]. The author has never published on Frequent Set Mining (FSM) before (but is acquainted with related data mining frameworks such as Formal Concept Analysis, Knowledge Spaces and Relational Databases). He therefore hopes that this draft attracts co-authors who help to pit the (clearly promising) newcomer algorithm below against state-of-the-art methods such as Apriori, Eclat and FP-growth. (The latter compete against each other in [H].)

Let $W$ be any finite set (our universe). A simplicial complex is any hereditary family $\mathcal{SC}$ of subsets (= faces) of $W$, i.e. $X \in \mathcal{SC}$ and $Y \subseteq X$ implies $Y \in \mathcal{SC}$. Our main idea exploits wildcards for compressing any simplicial complex, provided its maximal faces (= facets) are known. Finding the facets in the first place also entails an apparently new method.

We assume that the reader is familiar with the basics of FSM. Consider an arbitrary binary table whose columns are labelled by items (e.g. matching the items sold in a supermarket) and whose rows are called transactions (e.g. matching the itemsets bought by customers during a specific day). Fix any natural number $t$, called the threshold, and call an itemset $X$ frequent if $X$ appears (as a subset) in at least $t$ transactions. Stripped to its core, FSM attempts the following: display all frequent sets in a succinct way. Since the family of all frequent sets constitutes a simplicial complex, the above-mentioned techniques apply.
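The definition of a frequent itemset can be sketched in a few lines of Python. The transactions and the threshold below are made-up illustration data (not taken from the tables of this draft), and the names `support` and `is_frequent` are hypothetical.

```python
from typing import FrozenSet, Iterable, List

def support(itemset: Iterable[int], transactions: List[FrozenSet[int]]) -> int:
    """Number of transactions that contain `itemset` as a subset."""
    s = frozenset(itemset)
    return sum(1 for t in transactions if s <= t)

def is_frequent(itemset: Iterable[int],
                transactions: List[FrozenSet[int]],
                threshold: int) -> bool:
    """An itemset is frequent if its support reaches the threshold."""
    return support(itemset, transactions) >= threshold

# Hypothetical toy database: four transactions over the items 1..4.
transactions = [frozenset(t) for t in ([1, 2, 3], [1, 2], [2, 3], [1, 2, 3, 4])]
threshold = 2

print(support({1, 2}, transactions))                    # 3
print(is_frequent({1, 2, 3}, transactions, threshold))  # True
print(is_frequent({4}, transactions, threshold))        # False
```

Because every subset of a frequent set occurs in at least as many transactions as the set itself, the frequent sets are closed under taking subsets, which is exactly the hereditary property.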

2 The first toy example

Our universe will always be the set of all items. For convenience we take In our first binary Table 1 (which for better visualization uses x and blanks instead of 1 and 0) the maximal frequent sets (=facets) are easily determined. Specifically, if then is a frequent itemset because it is contained in and . Obviously is maximal. Likewise and and are facets. We leave it as an exercise to verify that to are the only maximal facets.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Table 1: The four maximal frequent sets of this binary table are found by inspection

Hence the simplicial complex of all frequent sets is . Unfortunately this union of powersets is not disjoint; for instance belongs to three powersets. We can make the union disjoint (indicated by ) as follows:

(1) 

This is achieved neatly by applying the Facets-To-Faces algorithm of [W3] to the facets to :

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

Table 2: Compressed representation of

Specifically, we use the don’t-care symbol ‘2’ to indicate that a bit at this position is free to be 0 or 1. Hence the row comprises bitstrings and matches the powerset . More subtly, the wildcard $(e, e, \ldots, e)$ means ‘at least one 1 here’. It follows that . Similarly are compact ways to write the second and third set difference appearing in (1). One readily finds that the ‘database’ given by Table 1 has frequent sets.
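The meaning of the symbols 0, 1, 2 and e can be made concrete by expanding a row into the bitstrings it represents. The following is a minimal sketch; the list-of-strings encoding, the convention that every label starting with `e` names one wildcard, and the example row are assumptions made for illustration and do not reproduce Table 2.

```python
from itertools import product

def expand_row(row):
    """Yield all bitstrings (tuples of 0/1) matched by a 012e-row.

    '0' and '1' are fixed bits, '2' is a don't-care, and every symbol
    starting with 'e' (e.g. 'e', 'e1', 'e2') labels an e-wildcard: the
    positions sharing one label must contain at least one 1.
    """
    choices = [(0,) if s == '0' else (1,) if s == '1' else (0, 1) for s in row]
    groups = {}
    for pos, s in enumerate(row):
        if s.startswith('e'):
            groups.setdefault(s, []).append(pos)
    for bits in product(*choices):
        # Keep the bitstring only if every e-wildcard carries at least one 1.
        if all(any(bits[p] for p in positions) for positions in groups.values()):
            yield bits

# Hypothetical row over five items: two don't-cares, one e-wildcard, one 0.
row = ['2', '2', 'e', 'e', '0']
faces = [frozenset(i + 1 for i, b in enumerate(bits) if b)
         for bits in expand_row(row)]
print(len(faces))  # 12, i.e. 2^2 * (2^2 - 1)
```

Expanding a row defeats the purpose of compression, so this is only meant for checking small examples by hand.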

The format of Table 2 invites a further statistical analysis of . For instance [W3, Section 5], for any fixed the number of -element bitstrings within a 012e-row is readily calculated. Furthermore, e.g., the number of frequent sets containing any fixed set is easy to obtain. To witness, if then with respect to this number is
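Both counts can be read off a row without expanding it. The sketch below assumes the same hypothetical encoding as before (a list of the symbols '0', '1', '2' and labels starting with 'e'); the function names and the example row are illustrative and are not taken from [W3].

```python
from collections import Counter
from math import comb

def count_by_size(row):
    """counts[k] = number of bitstrings in the row with exactly k ones."""
    ones = sum(1 for s in row if s == '1')
    twos = sum(1 for s in row if s == '2')
    lengths = list(Counter(s for s in row if s.startswith('e')).values())
    # Generating polynomial: x^ones * (1+x)^twos * prod_n ((1+x)^n - 1)
    poly = [0] * ones + [1]
    factors = [[comb(twos, j) for j in range(twos + 1)]]
    factors += [[0] + [comb(n, j) for j in range(1, n + 1)] for n in lengths]
    for f in factors:
        new = [0] * (len(poly) + len(f) - 1)
        for i, a in enumerate(poly):
            for j, b in enumerate(f):
                new[i + j] += a * b
        poly = new
    return poly

def count_containing(row, fixed):
    """Number of bitstrings in the row with 1s at all 1-based positions in `fixed`."""
    if any(row[i - 1] == '0' for i in fixed):
        return 0
    total = 1
    groups = {}
    for i, s in enumerate(row, start=1):
        if s == '2' and i not in fixed:
            total *= 2
        elif s.startswith('e'):
            groups.setdefault(s, []).append(i)
    for positions in groups.values():
        free = [p for p in positions if p not in fixed]
        if len(free) < len(positions):
            total *= 2 ** len(free)           # wildcard already satisfied
        else:
            total *= 2 ** len(positions) - 1  # still needs at least one 1
    return total

row = ['2', '2', 'e', 'e', '0']     # hypothetical row
print(count_by_size(row))           # [0, 2, 5, 4, 1]
print(count_containing(row, {3}))   # 8
```

Since the rows of a compressed table are mutually disjoint, summing either count over all rows gives the corresponding number for the whole simplicial complex.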

3 The second toy example

Here comes a less structured binary table (Table 3), which hence induces a simplicial complex of frequent sets (again ) whose facets are harder to retrieve. In 3.1 we present an apparently novel (remarks to the contrary are welcome) method to calculate them. With the facets at hand, we compress the whole of in 3.2.

1 2 3 4 5 6 7 8 9

Table 3: The seven maximal frequent sets of this binary table are not obvious

3.1 By inspection the frequent set is clearly maximal. Although at this stage a second facet could again be found by inspection, let us launch our systematic procedure. For starters, for any put

and .

Then is to be found in

All four (set filter) generators happen to be frequent, and so each extends to some (at least one) maximal frequent set. (Footnote 1: Specifically, we add arbitrary elements to until any further addition would yield an infrequent set. Thus one needs to check repeatedly whether any given itemset is frequent or not. This works efficiently with an idea that is coined ‘vertical layout’ in the FSM literature. It seems it was independently, and in different contexts, discovered in 1995, namely in [HKMT, p.151] and [W1, p.113], respectively.) For instance extends to . Consequently the next facet is to be found in

All four generators are frequent and the facet happens to be contained in all four set filters. One calculates that

Observe that the sets are infrequent, and a fortiori so are their supersets. One can e.g. extend to the facet . Then

As to the last equality, observe that the sixteen cancellations are due to two set filters being contained in , two in , two in , two in ; and the remaining eight being infrequent. (Footnote 2: Generally speaking, these two calculations must be carried out repeatedly: (a) determine the minimal members in a family of sets and (b) decide whether specific sets are infrequent by simply scanning the database once.) Upon extending to the facet one gets

Here all cancellations are due to infrequent sets; the details are left to the reader. Upon extending say to the facet one gets

Upon extending say to the facet one calculates

Since all ten set filter generators are infrequent, we conclude that .
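Three subroutines recur throughout 3.1: checking frequency in vertical layout, greedily extending a frequent set to a facet, and picking the minimal members of a family of sets. The sketch below illustrates only these subroutines, not the set-filter bookkeeping itself; all function names and the toy database are hypothetical and unrelated to Table 3.

```python
def vertical_layout(transactions, items):
    """Map each item to the set of indices of the transactions containing it."""
    return {i: frozenset(t for t, row in enumerate(transactions) if i in row)
            for i in items}

def extend_to_facet(seed, items, tidsets, n_transactions, threshold):
    """Greedily add items to a frequent seed until no addition stays frequent."""
    facet = set(seed)
    tids = frozenset(range(n_transactions))
    for i in facet:
        tids &= tidsets[i]
    if len(tids) < threshold:
        raise ValueError("seed is not frequent")
    # One pass suffices: tids only shrinks, so a rejected item stays rejected.
    for i in sorted(items):
        if i not in facet and len(tids & tidsets[i]) >= threshold:
            facet.add(i)
            tids &= tidsets[i]
    return frozenset(facet)

def minimal_members(family):
    """Keep only the inclusion-minimal sets of a family of sets."""
    minimal = []
    for s in sorted(set(map(frozenset, family)), key=len):
        if not any(m <= s for m in minimal):
            minimal.append(s)
    return minimal

# Hypothetical toy database with threshold 2.
transactions = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3, 4}]
items = {1, 2, 3, 4}
tidsets = vertical_layout(transactions, items)

print(sorted(extend_to_facet({3}, items, tidsets, len(transactions), 2)))  # [1, 2, 3]
print(minimal_members([{1, 2}, {1}, {2, 3}, {1, 2, 3}]))
```

In vertical layout the support of an itemset is the size of the intersection of its items' transaction-index sets, so each tentative extension costs one set intersection rather than a scan of the whole table.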

3.2 As in Section 2, applying the Facets-To-Faces algorithm to yields frequent sets, packed in seven 012e-rows:

1 2 3 4 5 6 7 8 9
12
42
9
14
24
56
16

Table 4: Compressed representation of
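For small examples, the number of faces determined by a list of facets can be cross-checked by brute force. The sketch below is not the Facets-To-Faces algorithm of [W3]; it merely splits the union of powersets into disjoint pieces, where piece i collects the subsets of the i-th facet that lie in no earlier facet. The facets used are hypothetical and unrelated to Table 3.

```python
from itertools import combinations

def powerset(s):
    """All subsets of s, as frozensets."""
    s = sorted(s)
    return (frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r))

def disjoint_pieces(facets):
    """Piece i = subsets of facets[i] contained in none of facets[0..i-1]."""
    pieces = []
    for i, f in enumerate(facets):
        earlier = facets[:i]
        pieces.append([x for x in powerset(f)
                       if not any(x <= g for g in earlier)])
    return pieces

# Hypothetical facets over the items 1..5.
facets = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5})]
pieces = disjoint_pieces(facets)
print([len(p) for p in pieces])     # [8, 4, 2]
print(sum(len(p) for p in pieces))  # 14 faces in total
```

This enumeration is exponential in the facet size, which is precisely what a wildcard representation avoids: here the second piece consists of all subsets of {2, 3, 4} that contain 4, so it fits into the single row (0, 2, 2, 1, 0).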

The conclusion of this preliminary draft is as follows. While the best way to find the maximal frequent sets remains debatable, our method of compression seems hard to beat, particularly when the facets are large. For instance, according to [W3] it took Facets-To-Faces 1114 seconds to compress approximately frequent sets (contained in 70 random facets each of cardinality 300) into many 012e-rows.

References

  [HKMT] M. Holsheimer, M. Kersten, H. Mannila, H. Toivonen, A perspective on databases and data mining, KDD-95 Proceedings.

  [HPY] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, Manuscript (8268 citations on Google Scholar, no further bibliography).

  [H] J. Heaton, Comparing dataset characteristics that favor the Apriori, Eclat, or FP-Growth Frequent Itemset mining algorithms, South-East Conference 2016, pages 1-7.

  [W1] M. Wild, Computations with finite closure systems and implications, LNCS 959 (1995) 111-120.

  [W2] M. Wild, ALLSAT compressed with wildcards: From CNFs to orthogonal DNFs by imposing the clauses one after another. Submitted.

  [W3] M. Wild, ALLSAT compressed with wildcards: Partitionings and face-numbers of simplicial complexes. Submitted.