This is the first draft of a hopefully longer article in spe for the series 'ALLSAT compressed with wildcards'. For more about this series as a whole, see [W2]. The author has never published on Frequent Set Mining (FSM) before (but is acquainted with related data mining frameworks such as Formal Concept Analysis, Knowledge Spaces and Relational Databases). He therefore hopes that this draft attracts co-authors who help to pit the (clearly promising) newcomer algorithm below against state-of-the-art methods such as Apriori, Eclat and FP-growth. (The latter compete against each other in [H].)
Fix any finite set (our universe). A simplicial complex is any hereditary family of subsets (= faces) of the universe, i.e. every subset of a face is again a face. Our main idea exploits wildcards for compressing any simplicial complex, provided its maximal faces (= facets) are known. Finding the facets in the first place also entails an apparently new method.
We assume that the reader is familiar with the basics of FSM. Consider an arbitrary binary table whose columns are labelled by items (e.g. matching the items sold in a supermarket) and whose rows are called transactions (e.g. matching the itemsets bought by customers during a specific day). Fix any natural number, called the threshold, and call an itemset frequent if it appears (as a subset) in at least threshold many transactions. Stripped to its core, FSM attempts the following: display all frequent sets in a succinct way. Since the family of all frequent sets constitutes a simplicial complex, the above-mentioned techniques apply.
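The definition of a frequent itemset can be made concrete with a minimal Python sketch. The names `support`, `is_frequent` and the database `toy_db` are hypothetical illustrations; in particular `toy_db` is not one of the tables of this article.

```python
from typing import FrozenSet, Iterable, List

Transaction = FrozenSet[str]

def support(itemset: Iterable[str], transactions: List[Transaction]) -> int:
    """Number of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for row in transactions if items <= row)

def is_frequent(itemset: Iterable[str], transactions: List[Transaction],
                threshold: int) -> bool:
    """An itemset is frequent if its support reaches the threshold."""
    return support(itemset, transactions) >= threshold

# Hypothetical toy database: each transaction is the set of items bought.
toy_db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
```

With threshold 3 the itemset {a, b} is frequent in `toy_db`, and so are its subsets {a}, {b} and the empty set. This is the hereditary property that makes the family of frequent sets a simplicial complex.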
2 The first toy example
Our universe will always be the set of all items. For convenience we take In our first binary Table 1 (which for better visualization uses x and blanks instead of 1 and 0) the maximal frequent sets (=facets) are easily determined. Specifically, if then is a frequent itemset because it is contained in and . Obviously is maximal. Likewise and and are facets. We leave it as an exercise to verify that to are the only maximal facets.
Table 1: The four maximal frequent sets of this binary table are found by inspection
Hence the simplicial complex of all frequent sets is . Unfortunately this union of powersets is not disjoint; for instance belongs to three powersets. We can make the union disjoint (indicated by ) as follows:
This is achieved neatly by applying the Facets-To-Faces algorithm of [W3] to the facets to :
Table 2: Compressed representation of
Specifically, we use the don't-care symbol '2' to indicate that a bit at this position is free to be 0 or 1. Hence the row comprises bitstrings and matches the powerset . More subtly, the e-wildcard means 'at least one 1 here'. It follows that . Similarly are compact ways to write the second and third set difference appearing in (1). One readily finds that the 'database' given by Table 1 has frequent sets.
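The semantics of the two wildcards can be checked by brute force on small cases. The following Python sketch is not the Facets-To-Faces algorithm of [W3]; it only expands a single row and compares it with a difference of two powersets. It assumes that a row is a string over the symbols 0, 1, 2, e and that all e's of a row belong to one and the same wildcard. The facets {a, b, c} and {b, c, d} are hypothetical and not taken from Table 1.

```python
from itertools import combinations, product

def expand_row(row):
    """All bitstrings matched by a row over '0', '1', '2', 'e' (one e-wildcard assumed)."""
    free = [i for i, ch in enumerate(row) if ch in "2e"]
    e_pos = [i for i, ch in enumerate(row) if ch == "e"]
    matches = set()
    for bits in product("01", repeat=len(free)):
        cand = list(row)
        for i, b in zip(free, bits):
            cand[i] = b
        if e_pos and all(cand[i] == "0" for i in e_pos):
            continue  # violates 'at least one 1 here'
        matches.add("".join(cand))
    return matches

def row_size(row):
    """Number of bitstrings in a row, computed without expanding it."""
    twos, es = row.count("2"), row.count("e")
    return 2 ** twos * ((2 ** es - 1) if es else 1)

def powerset_bits(universe, facet):
    """Bitstrings (indexed by universe) of all subsets of facet."""
    members = [x for x in universe if x in facet]
    result = set()
    for r in range(len(members) + 1):
        for combo in combinations(members, r):
            result.add("".join("1" if x in combo else "0" for x in universe))
    return result

def difference_row(universe, new_facet, old_facet):
    """Row for P(new_facet) minus P(old_facet); assumes new_facet is not a subset of old_facet."""
    return "".join(
        ("2" if x in old_facet else "e") if x in new_facet else "0"
        for x in universe
    )

universe = "abcde"
first, second = set("abc"), set("bcd")
row = difference_row(universe, second, first)
```

Here `row` is the string `022e0`: the items shared by both facets are free, the items of the second facet outside the first carry the e-wildcard, and everything else is 0. Expanding the row gives exactly the four subsets of {b, c, d} that contain d.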
The format of Table 2 invites a further statistical analysis of . For instance [W3, Section 5], the number of bitstrings with any fixed number of 1s within a 012e-row is readily calculated. Furthermore, e.g., the number of frequent sets containing any fixed set is easy to obtain. To witness, if then with respect to this number is
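Counts of this kind can be read off a row without expanding it. The Python sketch below is an illustration under the same assumptions as before (a row is a string over 0, 1, 2, e with at most one e-wildcard); the function names are hypothetical and the formulas are elementary binomial counts, not quoted from [W3].

```python
from math import comb

def count_k_sets(row: str, k: int) -> int:
    """Number of bitstrings with exactly k ones inside the row."""
    ones, twos, es = row.count("1"), row.count("2"), row.count("e")
    need = k - ones  # ones still to be placed on the 2- and e-positions
    if need < 0:
        return 0
    unrestricted = comb(twos + es, need)
    # subtract the placements that leave every e-position at 0
    all_e_zero = comb(twos, need) if es else 0
    return unrestricted - all_e_zero

def count_supersets(row: str, fixed: set) -> int:
    """Number of bitstrings in the row whose 1s cover all positions in `fixed`."""
    if any(row[i] == "0" for i in fixed):
        return 0
    free_twos = sum(1 for i, ch in enumerate(row) if ch == "2" and i not in fixed)
    free_es = sum(1 for i, ch in enumerate(row) if ch == "e" and i not in fixed)
    e_total = row.count("e")
    if e_total == 0 or free_es < e_total:
        # no e-wildcard, or `fixed` already puts a 1 on it
        return 2 ** (free_twos + free_es)
    return 2 ** free_twos * (2 ** free_es - 1)
```

Summing `count_k_sets` over all rows of a compressed representation gives the number of faces of each cardinality, and summing `count_supersets` over all rows gives the number of faces containing a fixed set, because the rows are pairwise disjoint.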
3 The second toy example
Here comes a less structured binary table (Table 3) which hence induces a simplicial complex of frequent sets (again ) whose facets are harder to retrieve. In 3.1 we present an apparently novel (remarks to the contrary are welcome) method to calculate them. With the facets at hand, we compress the whole of in 3.2.
Table 3: The seven maximal frequent sets of this binary table are not obvious
3.1 By inspection the frequent set is clearly maximal. Although at this stage a second facet could again be found by inspection, let us launch our systematic procedure. For starters, for any put
Then is to be found in
All four (set filter) generators happen to be frequent, and so each extends to some (at least one) maximal frequent set. (Footnote 1: Specifically, we add arbitrary elements to until any further addition would yield an infrequent set. Thus one needs to check repeatedly whether any given itemset is frequent or not. This works efficiently with an idea that is called 'vertical layout' in the FSM literature. It seems to have been discovered independently (and in different contexts) in 1995, namely in [HKMT, p.151] and [W1, p.113].) For instance extends to . Consequently the next facet is to be found in
All four generators are frequent and the facet happens to be contained in all four set filters. One calculates that
Observe that the sets are infrequent, and a fortiori are their supersets. One can e.g. extend to the facet . Then
As to the last equality, observe that the sixteen cancellations are due to two set filters being contained in , two in , two in , two in ; and the remaining eight being infrequent. (Footnote 2: Generally speaking, these two calculations must be carried out repeatedly: (a) determine the minimal members in a family of sets, and (b) decide whether specific sets are infrequent by simply scanning the database once.) Upon extending to the facet one gets
Here all cancellations are due to infrequent sets; the details are left to the reader. Upon extending say to the facet one gets
Upon extending say to the facet one calculates
Since all ten set filter generators are infrequent, we conclude that .
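The procedure of 3.1 can be sketched in Python as follows. This is one possible reading of the method and not a definitive implementation: the function names and the database `toy_db` are hypothetical, frequency checks use tidsets in the spirit of the vertical layout mentioned in footnote 1, and the generator update is the standard incremental one. It rests on the observation that a frequent set outside all facets found so far must contain, for each such facet, at least one item outside it.

```python
def vertical_layout(transactions):
    """Map each item to its tidset: the indices of the transactions containing it."""
    tidsets = {}
    for tid, row in enumerate(transactions):
        for item in row:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

def minimal_members(family):
    """Inclusion-minimal sets of a family of frozensets."""
    kept = []
    for cand in sorted(set(family), key=len):
        if not any(k <= cand for k in kept):
            kept.append(cand)
    return kept

def find_facets(transactions, threshold):
    """All maximal frequent sets, found one at a time via set filter generators."""
    tidsets = vertical_layout(transactions)
    items = sorted(tidsets)
    all_tids = set(range(len(transactions)))

    def tids_of(itemset):
        tids = set(all_tids)
        for item in itemset:
            tids &= tidsets[item]
        return tids

    def frequent(itemset):
        return len(tids_of(itemset)) >= threshold

    def extend(seed):
        # add items one by one as long as the set stays frequent
        current, tids = set(seed), tids_of(seed)
        for item in items:
            if item not in current and len(tids & tidsets[item]) >= threshold:
                current.add(item)
                tids &= tidsets[item]
        return frozenset(current)

    if not frequent(frozenset()):
        return []
    facets = [extend(frozenset())]
    generators = [frozenset([x]) for x in items if x not in facets[0]]
    while True:
        # infrequent generators cannot lie below any further facet
        generators = [g for g in generators if frequent(g)]
        if not generators:
            return facets
        facet = extend(generators[0])
        facets.append(facet)
        outside = [x for x in items if x not in facet]
        candidates = set()
        for g in generators:
            if g - facet:          # g already escapes the new facet
                candidates.add(g)
            else:                  # force one item outside the new facet
                candidates.update(g | {x} for x in outside)
        generators = minimal_members(candidates)

# Hypothetical toy database (not Table 3 of this article).
toy_db = [frozenset(t) for t in ("abc", "abd", "abcd", "cd", "ab")]
```

For `toy_db` and threshold 2 the sketch returns the facets {a, b, c}, {a, b, d} and {c, d}. It stops once the only remaining generators, {a, c, d} and {b, c, d}, turn out to be infrequent, which mirrors the termination argument above.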
3.2 As in Section 2, applying the Facets-To-Faces algorithm to yields frequent sets, packed in seven 012e-rows:
Table 4: Compressed representation of
The conclusion of this preliminary draft is as follows. While the best way to find the maximal frequent sets remains debatable, our method of compression seems hard to beat, particularly when the facets are large. For instance, in [W3] it took Facets-To-Faces 1114 seconds to compress approximately frequent sets (contained in 70 random facets each of cardinality 300) into many 012e-rows.
[HKMT] M. Holsheimer, M. Kersten, H. Mannila, H. Toivonen, A perspective on databases and data mining, KDD-95 Proceedings.
[HPY] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, Manuscript (8268 citations on Google Scholar, no further bibliography).
[H] J. Heaton, Comparing dataset characteristics that favor the Apriori, Eclat, or FP-Growth Frequent Itemset mining algorithms, South-East Conference 2016, pages 1-7.
[W1] M. Wild, Computations with finite closure systems and implications, LNCS 959 (1995) 111-120.
[W2] M. Wild, ALLSAT compressed with wildcards: From CNFs to orthogonal DNFs by imposing the clauses one after another. Submitted.
[W3] M. Wild, ALLSAT compressed with wildcards: Partitionings and face-numbers of simplicial complexes. Submitted.