Propositional satisfiability (SAT) has become a core technology in many application domains, such as formal verification and planning, as well as in various new applications driven by the recent impressive progress in practical SAT solving. Propositional formulae in conjunctive normal form (CNF) are the standard input format for SAT solvers. This convenient form is derived from a general Boolean formula using the well-known Tseitin encoding [Tseitin68]. Two important flaws of this encoding have been identified and widely discussed in the literature. First, it is often argued that when arbitrary propositional formulae are encoded in CNF, structural properties of the original problem are not reflected in the CNF formula. Secondly, even though the translation is linear in the size of the original formula, encoding real-world problems can produce huge CNF formulae. Some instances exceed the capacity of the available memory, and even when an instance can be stored, the time needed to read it might exceed its solving time.
To address this problem, developing a more compact representation is clearly an interesting research issue. By a compact encoding of formulae, we mean a representation model that exploits structural properties to obtain a formula that is as compact as possible.
Two promising models have been proposed in recent years. The first, proposed by H. Dixon et al. [DixonGHLP04], uses group theory to represent several classical clauses by a single clause called an "augmented clause". The second, proposed by M. L. Ginsberg et al. [Ginsberg00] and called QPROP ("quantified propositional logic"), may be seen as propositional logic extended with quantification over finite domains, i.e. first-order logic limited to finite types and without function symbols. The difficulty lies in finding efficient techniques for solving formulae encoded in such models.
More recently, an original approach for compacting sets of binary clauses was proposed by J. Rintanen in [Rintanen06]. Binary clauses are ubiquitous in propositional formulae that represent real-world problems, ranging from model-checking problems in computer-aided verification to AI planning problems. In [Rintanen06], it is shown how, using auxiliary variables, constraint graphs that contain big cliques or bi-cliques of binary clauses can be represented more compactly than with the explicit quadratic representation. The main limitation of this approach lies in its restriction to particular sets of binary clauses whose constraint graph forms cliques or bi-cliques. Such regularities can be caused by the presence of an at-most-one constraint over a subset of variables, forbidding more than one of them to be true at a time.
In the data mining community, several models and techniques for discovering interesting patterns in large databases have been proposed in the last few years. The problem of mining frequent itemsets is well known and essential in data mining, knowledge discovery and data analysis. Since the first article of Agrawal [agrawal93] on association rules and itemset mining, the large number of works, challenges, datasets and projects shows the continuing interest in this problem (see [tiwari2010survey] for a recent survey).
Our goal in this work is to address the problem of finding a compact representation of CNF formulae. Our proposed Mining4SAT approach aims to discover hidden structures in arbitrary CNF formulae and to exploit them to reduce the overall size of the formula while preserving satisfiability. Mining4SAT thus makes SAT an exciting new application domain for the data mining task of finding frequent itemsets in 0-1 transaction databases [agrawal93].
Recently, a first constraint programming (CP) based data mining framework was proposed by Luc De Raedt et al. in [Raedt08] for itemset mining. This new framework offers a declarative and flexible representation model. It allows data mining problems to benefit from several generic and efficient CP solving techniques [GunsNR11]. This first study leads to the first CP approach for itemset mining displaying nice declarative opportunities while opening interesting perspectives to cross fertilization between data-mining, constraint programming and propositional satisfiability.
In this paper, we are particularly interested in the other side of this connection between the two research domains, namely how data mining can be helpful for SAT. We present the first data-mining approach for Boolean satisfiability. We show that itemset mining techniques are well suited to discovering interesting patterns in CNF formulae. Such patterns are then used to rewrite the CNF formula more compactly. We also show how sets of binary clauses can be compacted by our approach, and we prove that it automatically achieves reductions similar to those of [Rintanen06] on bi-cliques and cliques of binary clauses. It is also important to note that our Mining4SAT approach is incremental: it can be applied incrementally or in parallel to the subsets of any partition of the original CNF formula. This is particularly helpful for huge CNF formulae that cannot be stored entirely in memory.
2 Frequent Itemset Mining Problem
2.1 Preliminary Notations and Definitions
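Let $\Omega$ be a set of items. A set $I \subseteq \Omega$ is called an itemset. A transaction is a couple $(tid, I)$ where $tid$ is the transaction identifier and $I$ is an itemset. A transaction database $\mathcal{D}$ is a finite set of transactions over $\Omega$ in which no two different transactions have the same transaction identifier. We say that a transaction $(tid, I)$ supports an itemset $J$ if $J \subseteq I$.
The cover of an itemset $I$ in a transaction database $\mathcal{D}$ is the set of identifiers of the transactions in $\mathcal{D}$ supporting $I$: $\mathcal{C}(I, \mathcal{D}) = \{tid \mid (tid, J) \in \mathcal{D},\ I \subseteq J\}$. The support of an itemset $I$ in $\mathcal{D}$ is defined by: $\mathcal{S}(I, \mathcal{D}) = |\mathcal{C}(I, \mathcal{D})|$. Moreover, the frequency of $I$ in $\mathcal{D}$ is defined by: $\mathcal{F}(I, \mathcal{D}) = \frac{\mathcal{S}(I, \mathcal{D})}{|\mathcal{D}|}$.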
For example, let us consider the transaction database in Table 1.
Each transaction corresponds to the favorite writers of a library member.
For instance, we have and
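Let $\mathcal{D}$ be a transaction database over $\Omega$ and $\lambda$ a minimal support threshold. The frequent itemset mining problem consists of computing the following set: $\mathcal{FIM}(\mathcal{D}, \lambda) = \{I \subseteq \Omega \mid \mathcal{S}(I, \mathcal{D}) \geq \lambda\}$, where $\mathcal{S}(I, \mathcal{D})$ is the support of $I$ in $\mathcal{D}$.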
The problem of computing the number of frequent itemsets is #P-hard [Gunopulos2003]. The complexity class #P corresponds to the set of counting problems associated with decision problems in NP. For example, counting the number of models satisfying a CNF formula is a #P problem.
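The notions of cover, support, frequency and frequent itemset can be made concrete with a small brute-force sketch. It is only an illustration: the function names and the toy database below are assumptions (the database is not the one of Table 1), and the enumeration is exponential in the number of items, so it is not how a practical miner proceeds. The empty itemset is left out for readability.

```python
from itertools import combinations

def cover(itemset, db):
    """Identifiers of the transactions of db that contain every item of itemset."""
    return {tid for tid, items in db.items() if itemset <= items}

def support(itemset, db):
    """Number of transactions supporting itemset."""
    return len(cover(itemset, db))

def frequency(itemset, db):
    """Support divided by the number of transactions."""
    return support(itemset, db) / len(db)

def frequent_itemsets(db, min_support):
    """Brute force: every non-empty itemset whose support reaches min_support."""
    items = sorted(set().union(*db.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = support(candidate, db)
            if s >= min_support:
                result[candidate] = s
    return result

# Hypothetical toy database: transaction identifier -> itemset.
toy_db = {
    1: frozenset({"a", "b", "c"}),
    2: frozenset({"a", "b"}),
    3: frozenset({"a", "c"}),
    4: frozenset({"b", "c"}),
}
```

With a minimal support threshold of 2, the three singletons and the three pairs are frequent in this toy database, while the itemset containing all three items is supported by a single transaction.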
2.2 Maximal and Closed Frequent Itemsets
Let us now define two condensed representations of the set of all frequent itemsets: maximal and closed frequent itemsets.
Definition 1 (Maximal Frequent Itemset)
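Let $\mathcal{D}$ be a transaction database, $\lambda$ a minimal support threshold and $I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called maximal when for all $I' \supset I$, $I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ ($I'$ is not a frequent itemset).
We denote by $\mathcal{MAX}(\mathcal{D}, \lambda)$ the set of all maximal frequent itemsets in $\mathcal{D}$ with $\lambda$ as a minimal support threshold. For instance, in the previous example, we have .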
Definition 2 (Closed Frequent Itemset)
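Let $\mathcal{D}$ be a transaction database, $\lambda$ a minimal support threshold and $I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called closed when for all $I' \supset I$, $\mathcal{S}(I', \mathcal{D}) < \mathcal{S}(I, \mathcal{D})$.
We denote by $\mathcal{CLO}(\mathcal{D}, \lambda)$ the set of all closed frequent itemsets in $\mathcal{D}$ with $\lambda$ as a minimal support threshold.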
For instance, we have
. In particular, let us note that we have and . That explains why and are both closed. One can easily see that if all the closed (resp. maximal) frequent itemsets are computed, then all the frequent itemsets can be computed without using the corresponding database. Indeed, the frequent itemsets correspond to all the subsets of the closed (resp. maximal) frequent itemsets.
Clearly, the number of maximal (resp. closed) frequent itemsets never exceeds, and is often significantly smaller than, the number of frequent itemsets. Nonetheless, this number is not always polynomial in the size of the database [Yang2004]. In particular, the problem of counting the number of maximal frequent itemsets is #P-complete (see also [Yang2004]).
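As an illustration of the two condensed representations, the following sketch filters a table of frequent itemsets and their supports. The table is a hypothetical one (it corresponds to three transactions {a, b, c}, {a, b, c}, {a, b} with a threshold of 2), and the function names are assumptions. Checking only frequent supersets is enough for closedness, since a superset with the same support as a frequent itemset is itself frequent.

```python
def maximal_itemsets(frequent):
    """Frequent itemsets that have no frequent proper superset."""
    return {i for i in frequent if not any(i < j for j in frequent)}

def closed_itemsets(frequent):
    """Frequent itemsets whose proper supersets all have a strictly smaller support."""
    return {
        i for i in frequent
        if not any(i < j and frequent[j] == frequent[i] for j in frequent)
    }

# Hypothetical table of frequent itemsets with their supports.
example_frequent = {
    frozenset("a"): 3,
    frozenset("b"): 3,
    frozenset("c"): 2,
    frozenset("ab"): 3,
    frozenset("ac"): 2,
    frozenset("bc"): 2,
    frozenset("abc"): 2,
}
```

On this table the only maximal itemset is {a, b, c}, whereas both {a, b} and {a, b, c} are closed, which illustrates that the maximal itemsets form a subset of the closed ones.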
Many algorithms have been proposed for enumerating frequent closed itemsets. One can cite the Apriori-like algorithms, originally proposed in [agrawal93] for mining frequent itemsets for association rules. Such an algorithm proceeds by a level-wise search of the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$. It starts by computing the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$ of size one. Then, assuming the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$ of size $n$ are known, it computes a set of candidates of size $n + 1$ so that $I$ is a candidate if and only if all its proper subsets are in $\mathcal{FIM}(\mathcal{D}, \lambda)$. This procedure is iterated until no more candidates are found. This basic procedure is enhanced using properties such as anti-monotonicity, which allows us to reduce the search space: if $I \notin \mathcal{FIM}(\mathcal{D}, \lambda)$, then $I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ for all $I' \supseteq I$. In our experiments, we use LCM, one of the state-of-the-art algorithms for mining frequent closed itemsets, proposed by Takeaki Uno et al. in [UnoKA04]. In theory, the authors prove that LCM enumerates exactly the set of frequent closed itemsets in polynomial time per closed itemset in the total input size. Let us mention that LCM obtained the best implementation award of FIMI’2004 (Frequent Itemset Mining Implementations).
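The level-wise search can be sketched as follows. This is a minimal, unoptimised illustration of the Apriori idea (join, prune by anti-monotonicity, count), not the LCM algorithm used in the experiments; the function name and the input representation (a list of item collections) are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise enumeration of the frequent itemsets with their supports."""
    transactions = [frozenset(t) for t in transactions]

    def supp(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = set().union(*transactions)
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items if supp(frozenset([i])) >= min_support}
    frequent = {}
    size = 1
    while level:
        for itemset in level:
            frequent[itemset] = supp(itemset)
        size += 1
        # Join step: unions of two frequent itemsets of the previous level.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (anti-monotonicity): every subset of the previous size
        # must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = {c for c in candidates if supp(c) >= min_support}
    return frequent
```

For instance, on the four transactions {a, b, c}, {a, b}, {a, c}, {b, c} with threshold 2, the candidate {a, b, c} survives the prune step but is rejected when its support is counted.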
3 From CNF Formula to Transaction Database
We first introduce the satisfiability problem and some necessary notations. We consider the conjunctive normal form (CNF) representation of propositional formulae. A CNF formula $\Phi$ is a conjunction of clauses, where a clause is a disjunction of literals. A literal is a positive ($x$) or negated ($\neg x$) propositional variable. The two literals $x$ and $\neg x$ are called complementary. A CNF formula can also be seen as a set of clauses, and a clause as a set of literals. The size of the CNF formula $\Phi$ is defined as $|\Phi| = \sum_{c \in \Phi} |c|$, where $|c|$ is equal to the number of literals in $c$. We denote by $\bar{l}$ the complementary literal of $l$. More precisely, if $l = x$ then $\bar{l}$ is $\neg x$ and if $l = \neg x$ then $\bar{l}$ is $x$. Let us recall that any propositional formula can be translated to CNF using Tseitin’s linear encoding [Tseitin68]. We denote by $Var(\Phi)$ the set of propositional variables appearing in $\Phi$, while the set of literals of $\Phi$ is defined as $Lit(\Phi) = \bigcup_{x \in Var(\Phi)} \{x, \neg x\}$. An interpretation $\mathcal{B}$ of a propositional formula $\Phi$ is a function which associates a value $\mathcal{B}(x) \in \{0, 1\}$ ($0$ corresponds to false and $1$ to true) to the variables $x \in Var(\Phi)$. A model of a formula $\Phi$ is an interpretation $\mathcal{B}$ that satisfies the formula: $\mathcal{B}(\Phi) = 1$. The SAT problem consists in deciding whether a given CNF formula admits a model.
A CNF formula can be considered as a transaction database, called CNF database, where the items correspond to literals and the transactions to clauses. Complementary literals correspond to two different items.
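Definition 3 (CNF to $\mathcal{TD}$)
Let $\Phi = \{c_1, \dots, c_m\}$ be a CNF formula. The set of items is $\Omega = Lit(\Phi)$, the set of literals of $\Phi$, and the transaction database associated to $\Phi$ is defined as $\mathcal{TD}_\Phi = \{(tid_i, c_i) \mid 1 \leq i \leq m\}$.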
In this context, a frequent itemset corresponds to a frequent set of literals: the number of clauses containing these literals is greater than or equal to the minimal threshold. For instance, if we set the minimal threshold to 2, we get as a frequent itemset in the previous database. The set of maximal frequent itemsets is the smallest set of frequent sets of literals such that each frequent set of literals is included in at least one of its elements. For instance, the unique maximal frequent itemset in the previous example is (). Furthermore, the set of closed frequent itemsets is the smallest set of frequent sets of literals such that each frequent itemset is included in at least one of its elements having the same support. For instance, the set of the closed frequent itemsets is .
In the definition of a transaction database, we did not require the set of items of a transaction to be unique. Indeed, two different transactions can have the same set of items and different identifiers. A CNF formula may contain the same clause more than once, but in practice this does not provide any information about satisfiability. Thus, we can consider a CNF database as just a set of itemsets (sets of literals).
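The mapping from a CNF formula to its CNF database can be sketched as follows. The representation is an assumption made for the example: literals are non-zero integers in the DIMACS style (a negative integer denotes a negated variable), so that complementary literals are naturally two different items. Duplicate clauses are dropped, in line with the remark above.

```python
def cnf_to_transactions(clauses):
    """One transaction per distinct clause; the items are the literals of the clause."""
    seen = set()
    db = {}
    for clause in clauses:
        items = frozenset(clause)
        if items not in seen:  # a repeated clause carries no extra information
            seen.add(items)
            db[len(db) + 1] = items  # transaction identifiers 1, 2, 3, ...
    return db

def literal_set_support(literals, db):
    """Number of clauses of the CNF database containing all the given literals."""
    literals = frozenset(literals)
    return sum(1 for items in db.values() if literals <= items)

# Hypothetical formula; the last clause repeats the second one.
example_cnf = [[1, -2, 3], [1, -2, 4], [1, -2, -3], [1, -2, 4]]
example_db = cnf_to_transactions(example_cnf)
```

Here the set of literals {1, -2} appears in the three distinct clauses, so it is frequent for any threshold up to 3, while the literals 3 and -3 are counted separately.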
4 Mining-based Approach for Size-Reduction of CNF Formulae
In this section, we describe our mining based approach, called Mining4SAT, for reducing the size of CNF formulae. The key idea consists in searching for frequent sets of literals (sub-clauses) and substituting them with new variables using Tseitin’s encoding [Tseitin68].
4.1 Tseitin’s Encoding
Tseitin’s encoding consists in introducing fresh variables to represent sub-formulae in order to represent their truth values. Let us consider the following DNF formula (Disjunctive Normal Form: a disjunction of conjunctions):
A naive way of converting such a formula to a CNF formula consists in using the distributivity of disjunction over conjunction ($A \vee (B \wedge C) \equiv (A \vee B) \wedge (A \vee C)$):
Such a naive approach is clearly exponential in the worst case. In Tseitin’s transformation, fresh propositional variables are introduced to prevent such combinatorial explosion, mainly caused by the distributivity of disjunction over conjunction and vice versa. With additional variables, the obtained CNF formula is linear in the size of the original formula. However the equivalence is only preserved w.r.t satisfiability:
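The difference between the two conversions can be illustrated on a small DNF formula. The sketch below is an assumption-laden example rather than the formula discussed above: literals are non-zero integers, the DNF has three terms of three literals each, and the Tseitin-style variant introduces one fresh variable per term and encodes only the implication from the fresh variable to its term, which is sufficient to preserve satisfiability.

```python
from itertools import product

def naive_cnf(dnf):
    """Distribute disjunction over conjunction: one clause per way of
    picking one literal in each term of the DNF."""
    return [list(choice) for choice in product(*dnf)]

def tseitin_cnf(dnf, first_fresh):
    """One fresh variable t_i per term: the clause (t_1 v ... v t_m),
    plus the binary clauses (-t_i v l) for every literal l of term i."""
    fresh = list(range(first_fresh, first_fresh + len(dnf)))
    cnf = [list(fresh)]
    for t, term in zip(fresh, dnf):
        cnf.extend([-t, lit] for lit in term)
    return cnf

def size(cnf):
    """Total number of literal occurrences."""
    return sum(len(clause) for clause in cnf)

# Hypothetical DNF (1 & 2 & 3) v (4 & 5 & 6) v (7 & 8 & 9).
example_dnf = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
naive = naive_cnf(example_dnf)
encoded = tseitin_cnf(example_dnf, first_fresh=10)
```

On this example the naive conversion yields 27 ternary clauses (81 literals), whereas the encoding with fresh variables yields 10 clauses and 21 literals. With $m$ terms of $l$ literals each, the first grows as $l^m$ clauses and the second as $m \times l + 1$.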
4.2 A Size-Reduction Method
Let us consider the following CNF formula $\Phi$:
$$(x_1 \vee \dots \vee x_n \vee \alpha_1) \wedge \dots \wedge (x_1 \vee \dots \vee x_n \vee \alpha_k)$$
where $n \geq 1$, $k \geq 1$, $x_1, \dots, x_n$ are literals and $\alpha_1, \dots, \alpha_k$ are clauses. The number of literals in this formula can be reduced as follows:
$$(y \vee \alpha_1) \wedge \dots \wedge (y \vee \alpha_k) \wedge (x_1 \vee \dots \vee x_n \vee \neg y)$$
where $y$ is a fresh propositional variable. Indeed, $n \times k$ literals are replaced with $k + n + 1$ literals. Clearly, the reduction preserves satisfiability: every model of the formula obtained after reduction is a model of $\Phi$, and every model of $\Phi$ can be extended to a model of the reduced formula by giving $y$ the truth value of $x_1 \vee \dots \vee x_n$. Now, if we consider the CNF database corresponding to $\Phi$, $\{x_1, \dots, x_n\}$ is a frequent itemset whenever the minimal support threshold is less than or equal to $k$.
It is easy to see that, to reduce the number of literals, $n$ must be greater than or equal to 2. Indeed, if $n < 2$ then there is no reduction of the number of literals; on the contrary, their number is increased. Regarding the value of $k$, one can also see that such a transformation is interesting only when $n \times k > k + n + 1$, i.e. $k > \frac{n + 1}{n - 1}$. Thus, there are three cases: if $n = 2$ then $k \geq 4$, else if $n = 3$ then $k \geq 3$, and $k \geq 2$ otherwise. Therefore, the number of literals is always reduced when $k \geq 4$.
In the previous example, we illustrate how the problem of finding frequent itemsets can be used to reduce the size of a CNF formula. One can see that, in general, it is more interesting to consider a condensed representation of the frequent itemsets (closed or maximal) to reduce the number of literals. Indeed, by using a condensed representation, we still take all the frequent itemsets into account, and the number of fresh propositional variables and new clauses introduced (in our example, $y$ and $(x_1 \vee \dots \vee x_n \vee \neg y)$) is smaller than the number introduced by using all the frequent itemsets. For instance, in the previous formula, it is not interesting to introduce a fresh propositional variable for each subset of $\{x_1, \dots, x_n\}$.
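A single reduction step can be sketched as follows, assuming that the transformation replaces a set of $n$ literals shared by $k$ clauses with a fresh variable $y$ and adds one clause made of those $n$ literals and $\neg y$. Literals are represented as non-zero integers and all names are illustrative assumptions.

```python
def substitute(clauses, itemset, fresh_var):
    """Replace itemset by fresh_var in every clause containing it, then add
    the defining clause (itemset v -fresh_var)."""
    itemset = frozenset(itemset)
    reduced = []
    for clause in clauses:
        literals = frozenset(clause)
        if itemset <= literals:
            literals = (literals - itemset) | {fresh_var}
        reduced.append(sorted(literals, key=abs))
    reduced.append(sorted(itemset | {-fresh_var}, key=abs))
    return reduced

def num_literals(clauses):
    return sum(len(clause) for clause in clauses)

def is_reducing(n, k):
    """True when replacing n literals shared by k clauses removes literals:
    n * k occurrences are traded for k + n + 1."""
    return n * k > k + n + 1

# Hypothetical formula: the literals 1, 2, 3 are shared by three clauses.
example_clauses = [[1, 2, 3, 10], [1, 2, 3, 11], [1, 2, 3, 12]]
example_reduced = substitute(example_clauses, {1, 2, 3}, fresh_var=20)
```

In this example $n = k = 3$, so 9 literal occurrences are replaced with 7 and the formula shrinks from 12 to 10 literals. The helper `is_reducing` reproduces the case analysis: a pair of literals needs at least four clauses, a triple at least three, and larger sets at least two.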
Closed vs. Maximal
In Section 2.2, we introduced two condensed representations of the frequent itemsets: closed and maximal. The question is which condensed representation is better. We know that the set of maximal frequent itemsets is included in that of the closed ones. Thus, fewer fresh variables and new clauses are introduced when using the maximal frequent itemsets. However, there are cases where the use of the closed frequent itemsets is more suitable. For example, let us consider the following formula:
where , and . We assume that the frequent itemsets are only the subsets of . Therefore, is the unique maximal itemset and the closed itemsets are and . Let us start by using the closed frequent itemset in the reduction of the number of literals:
Now, by using , we get the following formula:
In this example, it is clearly more interesting to consider the closed frequent itemsets in our Mining4SAT approach.
In fact, a (closed) frequent itemset and one of its subsets (which can be closed) are both interesting if . Indeed, if we apply our transformation using , then the support of in the resulting formula is equal to , and we know that is interesting in the resulting formula if its support is greater to .
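Let $\mathcal{I}$ be a set of itemsets. Two itemsets $I$ and $I'$ of $\mathcal{I}$ overlap if $I \cap I' \neq \emptyset$. Moreover, $I$ and $I'$ are in the same overlap class if there exist $k$ itemsets $I_1, \dots, I_k$ of $\mathcal{I}$ such that $I = I_1$, $I' = I_k$ and for all $1 \leq i \leq k - 1$, $I_i$ and $I_{i+1}$ overlap.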
In our transformation, one can have some problems when two frequent itemsets overlap.
For example, if and are two frequent itemsets (3 is the minimal support threshold) such that
and , then if we apply our transformation using , then the support of is equal to (infrequent) in the resulting formula and vice versa. Thus, we can not use both of them in the transformation.
Let us note that the overlap notion can be seen as a generalization of the subset one. Let and be frequent itemsets such that they overlap. They are both interesting in our transformation if:
or . This comes from the fact that if we apply the transformation using (resp. ), then the support of (resp. ) is equal to (resp. ).
(resp. ) where if (resp. ), if (resp. ), otherwise. Indeed, in the previous cases, (resp. ) can be used in our transformation.
We now describe our Mining4SAT algorithm using the set of closed frequent itemsets. Let us note that the optimal transformation using the set of all the closed frequent itemsets can be obtained by an optimal transformation using separately the overlap classes of this set. Indeed, since any two distinct overlap classes do not share any literal, the reduction applied to a given formula using the elements of an overlap class does not affect the supports of the elements of the other classes. Moreover, one can easily compute the set of all the overlap classes of the set of closed frequent itemsets: let $G = (V, E)$ be an undirected graph such that $V$ is the set of closed frequent itemsets and $(I, I')$ is an edge of $G$ if and only if $I$ and $I'$ overlap; $C$ is an overlap class if and only if it is the set of vertices of a (maximal) connected component of $G$. For this reason, we restrict our attention here to the reductions that can be obtained using a single overlap class. The whole size-reduction process can be performed by iterating over all the overlap classes.
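The computation of the overlap classes as connected components can be sketched with a simple graph traversal. The quadratic neighbour search below is an assumption made for brevity; an index from literals to itemsets would avoid it on large inputs.

```python
def overlap_classes(itemsets):
    """Group itemsets into overlap classes: the connected components of the
    graph whose edges link itemsets sharing at least one item."""
    itemsets = [frozenset(i) for i in itemsets]
    unseen = set(range(len(itemsets)))
    classes = []
    while unseen:
        start = unseen.pop()
        component, stack = {start}, [start]
        while stack:
            current = stack.pop()
            # Itemsets not yet visited that share an item with the current one.
            neighbours = {j for j in unseen if itemsets[current] & itemsets[j]}
            unseen -= neighbours
            component |= neighbours
            stack.extend(neighbours)
        classes.append({itemsets[i] for i in component})
    return classes

# Hypothetical closed itemsets over integer literals.
example_classes = overlap_classes([{1, 2}, {2, 3}, {4, 5}, {3, 6}, {7}])
```

In this example {1, 2} and {3, 6} do not overlap directly but end up in the same class through {2, 3}, while {4, 5} and {7} each form a class of their own.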
Let $I$ be a closed frequent itemset. We denote by $\rho(I)$ the value that corresponds to the number of literals removed by applying our transformation with $I$ on a CNF formula.
Algorithm 1 takes as input a CNF formula $\Phi$ and an overlap class $C$, and returns $\Phi$ after applying size-reduction transformations. It iterates until there is no element left in $C$. In each iteration, it first selects one of the most interesting elements of $C$ (line 2): an element $I$ of $C$ such that there is no element $I' \in C$ satisfying $\rho(I') > \rho(I)$. Note that this element is not necessarily unique in $C$. This instruction makes Algorithm 1 a greedy algorithm, because it makes the locally optimal choice at each iteration. Then, it applies our transformation using $I = \{x_1, \dots, x_n\}$: it replaces the occurrences of $I$ with a fresh propositional variable $y$ (line 3), and it adds the clause $(x_1 \vee \dots \vee x_n \vee \neg y)$ to $\Phi$ (line 4). It next removes $I$ from $C$ (line 5) and replaces $I$ with $y$ in the other elements of $C$ that include it (line 6). The next instruction (line 7) removes the elements of $C$ that could increase the number of literals: the elements that overlap with $I$ and are not included in $I$. As explained before, an element of $C$ overlapping with $I$ does not necessarily increase the number of literals. Thus, by removing elements from $C$ only because they overlap with $I$, our algorithm can discard closed frequent itemsets that would decrease the number of literals. A partial solution to this problem consists in recomputing the closed frequent itemsets in the formula returned by Algorithm 1. The last instruction in the while loop (line 8) updates the supports of the elements remaining in $C$ following the new value of $\Phi$: the support of an element $I'$ remaining in $C$ changes only when $I'$ is included in $I$, and its new support is then equal to $\mathcal{S}(I') - \mathcal{S}(I) + 1$, where both supports are taken before the transformation (the clauses in which $I$ was replaced no longer contain $I'$, while the added clause does). This instruction also removes all the elements of $C$ that become uninteresting because of the new supports and sizes.
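The greedy selection can be illustrated with the simplified loop below. It is not Algorithm 1 itself: instead of maintaining the overlap class and the supports incrementally (lines 5 to 8), this sketch recomputes the gain of every remaining candidate directly on the current formula, which is simpler to state but more expensive. The gain formula assumes that $n$ literals shared by $k$ clauses are traded for $k + n + 1$ literals; all names are illustrative.

```python
def gain(itemset, clauses):
    """Literals saved by substituting itemset in the current formula."""
    n = len(itemset)
    k = sum(1 for clause in clauses if itemset <= clause)
    return n * k - (k + n + 1)

def greedy_reduce(clauses, candidates, first_fresh):
    """Repeatedly apply the substitution with the candidate of largest gain,
    stopping as soon as no remaining candidate saves literals."""
    clauses = [frozenset(c) for c in clauses]
    candidates = {frozenset(i) for i in candidates}
    fresh = first_fresh
    while candidates:
        best = max(candidates, key=lambda i: gain(i, clauses))
        if gain(best, clauses) <= 0:
            break  # the best candidate does not help, so none does
        candidates.discard(best)
        clauses = [(c - best) | {fresh} if best <= c else c for c in clauses]
        clauses.append(best | {-fresh})  # defining clause of the fresh variable
        fresh += 1
    return clauses

# Hypothetical input: {1, 2, 3} occurs in three clauses, its subset {1, 2} in four.
example_input = [[1, 2, 3, 10], [1, 2, 3, 11], [1, 2, 3, 12], [1, 2, 13]]
example_output = greedy_reduce(example_input, [{1, 2, 3}, {1, 2}], first_fresh=20)
```

On this input the itemset {1, 2, 3} is selected first (gain 2). Its subset {1, 2} then occurs in only two clauses, the untouched clause and the newly added one, so its gain becomes negative and the loop stops with 13 literals instead of 15.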
5 Application: A Compact Representation of Sets of Binary Clauses
Binary clauses (2-CNF formulae) are ubiquitous in CNF formulae encoding real-world problems. Some of these formulae contain more than 50% of binary clauses. However, in our size-reduction approach, binary clauses are not taken into account. Indeed, to reduce the size of the formula, we only search for itemsets of at least two literals. The extremely rare case where a binary clause representing a closed frequent itemset can be considered is when it appears at least four times in the formula, i.e. it subsumes at least 4 clauses. In this section, we first show how our mining-based approach can be used to achieve a compact representation of arbitrary sets of binary clauses. Then, we consider two interesting special cases corresponding to sets of binary clauses representing either a clique or a bi-clique.
5.1 Compacting arbitrary set of binary clauses
In order to reduce the size of a set of binary clauses, we only need to rewrite the formula and to slightly modify Algorithm 1.
Definition 4 (B-implications)
Let be a 2-CNF formula. We define , where . We call a B-implication.
Obviously, the formula and are equivalent and there exists several ways to rewrite as a conjunction of B-implications.
Let be a 2-CNF formula. We can rewrite as or as .
In the sequel, we use a lexicographic ordering on literals of . In the example 1, we obtain using the lexicographic ordering .
Definition 5 (2-CNF to )
Let be a 2-CNF formula and . The transaction database associated to is defined as .
Let us now describe our approach to compact a 2-CNF formula , called CNF2RED (for reducing the size of sets of binary clauses). First, after rewriting as , we build the transaction database . The set of closed frequent itemsets and its associated overlap classes are computed. The last step aims to reduce the size of the 2-CNF using a slightly modified version of the Algorithm 1. First the Algorithm 1 takes as input a formula and returns after reducing its size. Secondly, for an itemset , in line (4) of the Algorithm 1, we introduce a fresh variable and we add a bi-implication to .
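The rewriting step can be sketched under one plausible reading, which is an assumption of this example: a binary clause $(a \vee b)$, with $a$ before $b$ in the ordering, is read as the implication $\neg a \rightarrow b$, and the clauses sharing the same antecedent are grouped into a single implication whose consequent set forms one transaction. Literals are non-zero integers, ordered here by variable index with the positive literal first; this ordering is also an assumption.

```python
from collections import defaultdict

def b_implications(binary_clauses):
    """Group the clauses (a v b), a before b in the ordering, as
    -a -> {b, ...}; the returned dict maps each antecedent to its consequents."""
    groups = defaultdict(set)
    for first, second in binary_clauses:
        a, b = sorted((first, second), key=lambda lit: (abs(lit), lit < 0))
        groups[-a].add(b)
    return {antecedent: frozenset(cons) for antecedent, cons in groups.items()}

# Hypothetical 2-CNF formula.
example_2cnf = [(1, 3), (1, 4), (2, 3), (2, 4), (3, 5)]
example_td = b_implications(example_2cnf)
```

In this example the antecedents -1 and -2 share the consequent set {3, 4}, which is therefore an itemset of support 2 in the resulting transaction database.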
5.2 Special case of (bi-)clique of binary clauses
In [Rintanen06], J. Rintanen addressed the problem of representing big sets of binary clauses compactly. He particularly shows that constraint graphs arising from practically interesting applications (eg. AI planning) contain big cliques or bi-cliques of binary clauses. An identified bi-clique involving the two sets of literals and expresses the propositional formula , while a clique involving the literals expresses that at-most one literal from is ,
Bi-clique of binary clauses
Let us explain how a bi-clique can be compacted with CNF2RED method. Let a bi-clique of binary clauses. Considering the lexicographic ordering, corresponds exactly to . Obviously, the transaction database contains a single closed frequent itemset . Applying our algorithm leads to the following compact representation of . We obtain exactly the same gain as in [Rintanen06] ( binary clauses and one additional variable).
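The gain on a bi-clique can be checked on a small instance. The sketch assumes a bi-clique between two sets of literals of sizes $n$ and $m$, given as the $n \times m$ binary clauses $(x \vee y)$, and compacts it with one fresh variable $z$ into the $n + m$ clauses $(x \vee z)$ and $(\neg z \vee y)$. The helper names and the integer encoding of literals are assumptions.

```python
from itertools import product

def biclique(xs, ys):
    """Explicit representation: one binary clause per pair (x, y)."""
    return [[x, y] for x in xs for y in ys]

def compact_biclique(xs, ys, fresh_var):
    """Compact representation with one auxiliary variable z:
    (x v z) for every x, and (-z v y) for every y."""
    return [[x, fresh_var] for x in xs] + [[-fresh_var, y] for y in ys]

def satisfied(cnf, assignment):
    """assignment maps each variable to a bool; literals are non-zero integers."""
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in cnf
    )

# Hypothetical bi-clique between the literals {1, 2, 3} and {4, 5, 6}.
explicit = biclique([1, 2, 3], [4, 5, 6])
compact = compact_biclique([1, 2, 3], [4, 5, 6], fresh_var=7)
```

Here 9 binary clauses are replaced with 6. An assignment of the original variables satisfies the explicit clauses exactly when it can be extended by a value of the auxiliary variable that satisfies the compact ones, which can be verified by enumerating the 64 assignments.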
Clique of binary clauses
Let be a clique of binary clauses. The formula . If we take a closer look to , the closed frequent itemset with greatest value corresponds to . In the first rows of , is substituted by a fresh variable and a new set of binary clauses is added to it, leading to two subproblems of size . Obviously, the same treatment is done on the formula . Consequently the number of variables is defined by the following recurrence equation: , . The basic case is reached for , where the last fresh variable is introduced to represent the conjunction . For no fresh variable is introduced because no frequent closed itemset can leads to a reduction of the size of the formula. Consequently, from the solution of the previous recurrence equation, we obtain that our encoding is in auxiliary variables. Using the same reasoning, we also obtain the same complexity for the number of binary clauses. This corresponds to the complexity obtained in [Rintanen06].
The two special cases of clique and bi-clique of binary clauses considered in this section allow us to show that, when a constraint is not well encoded, our approach can be used to correct the encoding and to derive more efficient and compact encodings automatically.
6 Experiments
Table 2: Size reduction obtained by Mining4SAT (sample of instances)
|Instance||original formula size||reduced formula size||% removed literals|
|1dlx_c_iq57_a||190 MB||164 MB||12.47%|
|6pipe_6_ooo.*-as.sat03-413||11 MB||7.7 MB||19.64%|
|9dlx_vliw_at_b_iq6.*-*04-347||76 MB||65 MB||14.02%|
|abb313GPIA-9-c.*.sat04-317||21 MB||6.9 MB||63.92%|
|E05F18||3.7 MB||2.2 MB||43.48%|
|eq.atree.braun.11.unsat||120 KB||72 KB||27.93%|
|eq.atree.braun.12.unsat||144 KB||88 KB||27.66%|
|k2mul.miter.*-as.sat03-355||1.5 MB||1.3 MB||11.27%|
|korf-15||1.2 MB||752 KB||34.17%|
|rbcl_xits_08_UNSAT||1.1 MB||856 KB||16.42%|
|SAT_dat.k45||3.5 MB||2.6 MB||24.53%|
|traffic_b_unsat||18 MB||12 MB||26.53%|
|x1mul.miter.*-as.sat03-359||1.1 MB||928 KB||12.68%|
|9dlx_vliw_at_b_iq3||19 MB||15 MB||17.84%|
|9dlx_vliw_at_b_iq4||31 MB||26 MB||18.02%|
|AProVE07-09||2.8 MB||2.7 MB||4.51%|
|eq.atree.braun.10.unsat||96 KB||56 KB||28.30%|
|goldb-heqc-frg1mul||348 KB||328 KB||12.66%|
|goldb-heqc-x1mul||964 KB||896 KB||12.68%|
|minand128||7.7 MB||2.6 MB||65.28%|
|ndhf_xits_09_UNSAT||2.6 MB||2.1 MB||18.61%|
|rbcl_xits_07_UNSAT||868 KB||720 KB||16.49%|
|velev-pipe-o-uns-1.1-6||5.5 MB||4.4 MB||18.89%|
In this section, we present an experimental evaluation of our proposed approaches. Two kinds of experiments have been conducted. The first one deals with the size reduction of arbitrary CNF formulae using the Mining4SAT algorithm, while the second one attempts to reduce the size of the 2-CNF sub-formulae only, using the CNF2RED algorithm.
Both algorithms are tested on benchmarks taken from the SAT Challenge 2012. From the 600 instances of the application category submitted to this challenge, we selected 100 instances while taking at least one instance from each family. All tests were run on a Xeon 3.2GHz (2 GB RAM) cluster and the timeout was set to 4 hours.
In Table 2 and Table 3, the two size columns give the size in bytes of each SAT instance before and after reduction, and the last column gives the percentage of removed literals. To study the influence of our size-reduction approaches on the solving time, we also ran the SAT solver MiniSAT 2.2 on both the original instances and those obtained after reduction. Due to a lack of space, we only present a sample of the whole results. Our goal is to provide some insights into the general behavior of our reduction techniques.
Table 2 highlights the results obtained by the general Mining4SAT approach. In these experiments, and to allow possible reductions, we only search for frequent closed itemsets of size greater than or equal to 4. Consequently, binary clauses are not considered. As we can observe, our Mining4SAT reduction approach allows us to reduce the size by more than 20% on the majority of instances. Let us also note that the maximum (65.28%) is reached on the instance minand128: according to Table 2, its original size is 7.7 MB and its size after reduction is 2.6 MB. For the SAT solving time, the results depend on the instances. On some instances we can observe real improvements, whereas on others the performance becomes worse.
In Table 3, we present a sample of the results obtained by the CNF2RED algorithm on compacting only binary clauses. We observe a similar behavior as in the first experiment in terms of size reduction. However, we observe in general some improvements in terms of SAT solving time.
Table 3: Size reduction obtained by CNF2RED on binary clauses (sample of instances)
|Instance||original formula size||reduced formula size||% removed literals|
|velev-pipe-o-uns-1.1-6||5.5 MB||3.2 MB||43.23%|
|9dlx_vliw_at_b_iq2||11 MB||6 MB||42.56%|
|1dlx_c_iq57_a||190 MB||124 MB||36.52%|
|7pipe_k||14 MB||5.4 MB||59.66%|
|SAT_dat.k100.debugged||16 MB||13 MB||24.89%|
|IBM_FV_2004_rule_batch||9.7 MB||7.5 MB||25.56%|
|sokoban-sequential-p145-*.040-*||24 MB||14 MB||45.16%|
|openstacks-*-p30_1.085-*||30 MB||26 MB||17.25%|
|aaai10-planning-ipc5-*-12-step16||17 MB||12 MB||35.35%|
|k2fix_gr_rcs_w8.shuffled||3.4 MB||1.7 MB||54.83%|
|homer17.shuffled||20 KB||16 KB||39.86%|
|gripper13u.shuffled-as.sat03-395||524 KB||364 KB||35.03%|
|grid-strips-grid-y-3.045-*||52 MB||42 MB||23.48%|
7 Conclusion and Future Works
In this paper, we propose the first data-mining approach, called Mining4SAT, for reducing the size of Boolean formulae in conjunctive normal form (CNF). It can be seen as a preprocessing step that aims to discover hidden structural knowledge, which is then used to decrease the number of literals. Mining4SAT combines frequent itemset mining techniques, for discovering interesting substructures, with a Tseitin-based approach for a compact representation of CNF formulae using these substructures. Thus, we show in this work, inter alia, that frequent itemset mining techniques are well suited to discovering interesting patterns in CNF formulae.
Since we use a greedy algorithm in our approach, the formula obtained after transformation is not guaranteed to be optimal w.r.t. size. An important open question, which we will study in future work, is how to optimally use the closed frequent itemsets belonging to an overlap class. Integrating the reduction of sets of binary clauses into the general Mining4SAT approach is also an interesting research perspective.