Continuous improvements in SAT solver technology have scaled up and widened the class of real-world problems that can be solved in practice. As applications grow in number and complexity, the modeling phase, which translates them into propositional formulas in Conjunctive Normal Form (CNF) suitable for a satisfiability solver, becomes even more crucial. Modeling proceeds through several polynomial transformations and rewriting steps, starting from a high-level description, in a high-order language or full propositional logic, down to a low-level formulation, usually a formula in CNF. The whole process preserves propositional satisfiability thanks to the extension principle, which allows the introduction of new variables to represent sub-formulas or complex constraints. Among such constraints, cardinality and pseudo-boolean constraints, which express numerical bounds on discrete quantities, are the most popular, as they arise frequently in the encoding of many real-world problems including scheduling, logic synthesis or verification, product configuration and data mining. For these reasons, various approaches have addressed the issue of finding an efficient CNF encoding of cardinality constraints (e.g. [17, 4, 15, 14, 2, 11]) and pseudo-boolean constraints (e.g. [9, 5]). Efficiency refers both to the compactness of the representation (the size of the CNF formula) and to the ability of unit propagation on the CNF formula to achieve the same level of constraint propagation (generalized arc consistency) as on the original constraint. However, most of the proposed encodings do not take into account the interactions of the constraint with the remaining part of the propositional formula through the different logical connectives. To avoid combinatorial explosion, the Tseitin principle is usually used to translate a general propositional formula to CNF, using fresh propositional variables to represent sub-formulas and/or complex constraints.
Thanks to the improvement of Plaisted and Greenbaum, the polarity of the sub-formula is taken into account, leading to conditional constraints in which a fresh propositional variable either implies the sub-formula or is implied by it. When the sub-formula is a cardinality constraint, this translation leads to what we call a conditional cardinality constraint.
The translation of a single cardinality or pseudo-boolean constraint to SAT is a well-studied problem. We are aware of only one contribution that considers the interactions of such constraints with the remaining part of the formula involving them. Its authors describe how the encoding of linear constraints can be improved by taking into account the implication chains appearing in the formula. The resulting encodings are smaller and can propagate more strongly than separate encodings.
In this paper, we introduce a novel variant of cardinality constraints, called conditional cardinality constraints, in which a condition variable implies a cardinality constraint. Such a constraint expresses that no more than $k$ variables can be set to true when the condition is set to true. We first show that when the negation of the condition is added disjunctively to all the clauses resulting from the encoding of the cardinality constraint, most of the well-known encodings no longer maintain generalized arc consistency by unit propagation. We then address the issue of extending such encodings while maintaining generalized arc consistency. We also consider the particular case of conditional AtMostOne constraints, i.e., $k = 1$. An experimental evaluation is conducted on a SAT-based non-redundant association rules mining problem, showing the relevance of our proposed framework.
2 Technical Background and Preliminary Definitions
2.1 Preliminary Definitions and Notations
Let be a propositional language of formulas built in the standard way, using usual connectives (, , , , ) and a set of propositional variables. A propositional formula in CNF is a conjunction of clauses, where a clause is a disjunction of literals. A literal is a positive () or negated () propositional variable. A clause can be represented as a set of literals and a formula as a set of clauses. The two literals and are called complementary. We note the complementary literal of . For a set of literals , is defined as . For a clause , we note . A unit clause is a clause containing only one literal (called unit literal), while a binary clause contains exactly two literals. A Horn (resp. reverse Horn) clause is a clause with at-most one positive (resp. negative) literal. A positive (resp. negative) clause is a clause whose literals are all positive (resp. negative). An empty clause, denoted , is interpreted as false (unsatisfiable), whereas an empty CNF formula, denoted , is interpreted as true (satisfiable).
Let us recall that any general propositional formula can be translated to CNF using Tseitin’s linear encoding. This is done by introducing fresh variables that represent sub-formulas and carry their truth values. For example, given a propositional formula containing the variables $a$ and $b$, and a fresh variable $x$, one can add the definition $x \leftrightarrow (a \vee b)$ (called extension) to the formula while preserving satisfiability. Two decades after Tseitin’s seminal paper, Plaisted and Greenbaum presented an improved CNF translation that essentially produces a subset of Tseitin’s representation. They noticed that by keeping track of the polarities of sub-formulas, one can remove large parts of the Tseitin translation. For example, when the disjunction $a \vee b$ is a sub-formula with positive polarity, it is sufficient to add the formula $x \rightarrow (a \vee b)$, i.e., the clause $(\neg x \vee a \vee b)$.
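To make the difference between the two translations concrete, the block below spells out the clauses produced for a single disjunction. The names $x$, $a$ and $b$ are illustrative assumptions; only the shape of the clauses matters.

```latex
\begin{align*}
\text{Tseitin:}\quad
  & x \leftrightarrow (a \vee b)
    \;\equiv\; (\neg x \vee a \vee b) \wedge (x \vee \neg a) \wedge (x \vee \neg b) \\
\text{Plaisted--Greenbaum (positive polarity):}\quad
  & x \rightarrow (a \vee b)
    \;\equiv\; (\neg x \vee a \vee b)
\end{align*}
```

The two binary clauses are exactly the part of the definition that the polarity-aware translation drops when the sub-formula occurs only positively.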
The set of variables occurring in is denoted and its associated set of literals . A set of literals is complete if it contains one literal for each variable in , and fundamental if it does not contain complementary literals. A literal is called monotone or pure if does not appear in . An interpretation of a formula is a function which associates a truth value (false or true) to some of the variables . is complete if it assigns a value to every , and partial otherwise. An interpretation is alternatively represented by a complete and fundamental set of literals. A model of a formula is an interpretation that satisfies the formula, denoted . A formula is a logical consequence of a formula , denoted , iff every model of is a model of . The SAT problem consists in deciding whether a given CNF formula admits a model.
denotes the formula obtained from by assigning the truth-value . Formally, . This notation is extended to interpretations: given an interpretation , we define . denotes the formula closed under unit propagation, defined recursively as follows: (1) if does not contain any unit clause, (2) if contains two unit-clauses and , (3) otherwise, where is the literal appearing in a unit clause of . A clause is deduced by unit propagation from , noted , iff .
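The recursive definition of unit propagation above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the solver used in the paper: clauses are lists of non-zero integers in DIMACS style (a negative integer is a negated variable), and the names `unit_propagate` and `deduced_by_up` are hypothetical.

```python
def unit_propagate(clauses, assumptions=()):
    """Close a set of assumed literals under unit propagation.

    Returns the set of all literals that end up true, or None when some
    clause has all of its literals falsified (the empty clause is derived).
    """
    assigned = set(assumptions)
    if any(-lit in assigned for lit in assigned):
        return None  # the assumptions themselves are contradictory
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied
            pending = [lit for lit in clause if -lit not in assigned]
            if not pending:
                return None  # every literal is falsified: conflict
            if len(pending) == 1:
                assigned.add(pending[0])  # unit clause: its literal is forced
                changed = True
    return assigned


def deduced_by_up(clauses, clause):
    """True iff `clause` is deduced by unit propagation, i.e. asserting the
    negation of each of its literals leads to a conflict."""
    return unit_propagate(clauses, [-lit for lit in clause]) is None


print(sorted(unit_propagate([[1], [-1, 2], [-2, -3]])))  # [-3, 1, 2]
print(deduced_by_up([[-1, 2], [-2, 3]], [-1, 3]))        # True
```

The quadratic rescan of all clauses keeps the sketch short; real solvers use watched literals for the same closure.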
2.2 CNF Encodings of Cardinality Constraints: An Overview
2.2.1 Pigeon-Hole based Encoding:
In , the authors proposed a new encoding of the cardinality constraints , based on the Pigeon-Hole principle. They observed that the semantics of the cardinality constraint can be equivalently expressed as the problem of putting pigeons into holes. The first formulation, called , given in , is simply expressed by the following set of constraints:
The equations (2) encode the well-known pigeon-hole problem , where is the number of pigeons and is the number of holes ( expresses that pigeon is in hole ). Unfortunately, checking the satisfiability of a pigeon-hole formula is computationally hard. To maintain generalized arc consistency (GAC), the authors proposed an improvement obtained by breaking the symmetries between the variables involved in the pigeon-hole expression (equations (2) and (3)). By resolution between the clauses of the symmetry-breaking predicates and those of , the authors derived the following encoding, called :
Example 1
Let us consider the inequality . Using the pigeon-hole based encoding , we obtain the following CNF:
2.2.2 Sorting Networks based Encoding:
One of the most effective encodings for cardinality constraints is based on sorting networks . In this encoding, the cardinality constraint is translated into a single sorter with inputs and outputs (sorted in descending order) where the $(k+1)$-th output is forced to false. The idea behind this encoding is to sort the input variables into true variables followed by false variables. To satisfy the constraint , it is sufficient to set to false. In , the authors proved that the sorting networks based encoding maintains generalized arc consistency. Let us note the formula representing the sorting networks based circuit that takes as input the set of propositional variables and outputs a unary number represented by the set of propositional variables . The following formula defines the encoding:
As the outputs are sorted in descending order, by fixing to false, all the remaining variables must be propagated to false. Consequently, at most $k$ variables might be assigned to true. Let us note that the formula encoding the sorting network is a Horn formula, derived using a basic comparator between two propositional variables . Given two propositional variables and from , the comparator outputs two variables and from . The two-comparator, noted -, is defined by the following Horn formula:
This formula sorts the two variables and , resulting in two other variables and in descending order. For example, when (resp. ) is assigned to (resp. ), the output variable (resp. ) is assigned to (resp. ). For more details, we refer the reader to  and .
2.2.3 Sequential Unary Counter based Encoding:
The sequential counter based encoding of a cardinality constraint proposed by Carsten Sinz in  is another well-known encoding that preserves the generalized arc consistency property. It computes, for each propositional variable , the partial sums for increasing values of up to the final . The values of all the sums are represented as unary numbers of size equal to . The encoding is defined as follows:
The variables denote the digits of the partial sums in unary representation. The constraints (8) and (9) correspond to the case . The formula (12) is very important: it detects the inconsistency and preserves the GAC property at the same time. The other constraints propagate any change of a partial sum after any assignment of the variables. Let us note that the formula derived by the sequential unary counter based encoding is also a Horn formula.
3 Conditional Cardinality Constraints Encodings
In this section, we show how the cardinality constraint encodings of Section 2.2 can be effectively extended to encode conditional cardinality constraints while preserving the generalized arc consistency maintained by unit propagation. More precisely, for such a conditional cardinality constraint, maintaining GAC means that when the condition is assigned the truth value true, the encoding must maintain GAC on the cardinality constraint . On the other hand, when the cardinality constraint is falsified under the current assignment, the negation of the condition must be deduced by unit propagation.
An important observation that can be made from the SAT-based encodings of the cardinality constraint presented in Section 2.2 is that the obtained formula is Horn. Let us first introduce an important property, which gives the intuition behind the encodings we propose in this paper.
Let be a Horn formula. The sub-formula denotes the set of negative clauses of and the set of clauses of containing exactly one positive literal.
Proposition 1
Let be a Horn formula and an interpretation. iff such that .
Proof. ($\Rightarrow$) Let us consider the formula . Suppose that there is no clause such that .
Let be the set of unit literals of including the literals of . We can note that from only additional positive unit literals can be deduced by unit propagation (). So is a set of positive literals. is clearly a model of . In fact, each clause of is a satisfied clause (its positive literal is in ) or contains at least one negative literal. Indeed, propagating positive literals over leads to a formula where the remaining clauses contain a positive literal and at least one negative literal. The remaining clauses of are negative clauses before deleting from each clause the literals of . Then, by assigning the remaining variables to false, we obtain a model of the formula . As , this contradicts the assumption that is unsatisfiable.
($\Leftarrow$) From , we have . As , then .
Given a Horn formula , Proposition 1 expresses that unsatisfiability under any interpretation made of a set of positive literals is caused by a clause from . As a cardinality constraint is usually encoded as a Horn formula , to maintain GAC on the encoding of , one only needs to disjunctively add to .
3.1 Conditional AtMostOne Constraint Encodings
Let us first consider the conditional AtMostOne Constraint . Many encodings have been proposed to deal with the translation of AtMostOne constraint into CNF. Let us consider two standard encodings of this constraint.
3.1.1 Conditional AtMostOne Pairwise Encoding:
The classical pairwise encoding can be obtained by considering the set of all binary negative clauses built over the set of variables as described by the formula (13).
This naive formulation maintains generalized arc consistency and requires $O(n)$ variables and $O(n^2)$ clauses. The formula (14) encoding the conditional AtMostOne constraint is obtained by simply adding to all the clauses of the CNF formula (13) obtained by the pairwise encoding. It is straightforward to see that the obtained formula (14) maintains generalized arc consistency. Indeed, assigning any two literals and to true allows us to deduce by unit propagation. On the other hand, if is assigned to true, the conditional constraint reduces to a simple AtMostOne constraint, which preserves generalized arc consistency.
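The argument above can be checked mechanically on a small instance. The following is a minimal sketch, assuming DIMACS-style integer literals and a condition variable named `y` (a name chosen here for illustration); `conditional_amo_pairwise` and `unit_propagate` are hypothetical helpers, not code from the paper.

```python
from itertools import combinations


def unit_propagate(clauses, assumptions):
    """Return all literals made true by unit propagation, or None on conflict."""
    assigned = set(assumptions)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied
            pending = [lit for lit in clause if -lit not in assigned]
            if not pending:
                return None
            if len(pending) == 1:
                assigned.add(pending[0])
                changed = True
    return assigned


def conditional_amo_pairwise(xs, y):
    """Pairwise AtMostOne over xs, with -y added to every binary clause."""
    return [[-y, -a, -b] for a, b in combinations(xs, 2)]


xs, y = [1, 2, 3, 4], 5
cnf = conditional_amo_pairwise(xs, y)

# Any two variables set to true falsify the constraint, so -y is propagated.
print(all(-y in unit_propagate(cnf, pair) for pair in combinations(xs, 2)))  # True

# With y and one variable true, every other variable is propagated to false.
forced = unit_propagate(cnf, [y, 1])
print(sorted(lit for lit in forced if abs(lit) in xs))  # [-4, -3, -2, 1]
```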
3.1.2 Conditional AtMostOne Sequential Counter & Pigeon-Hole Encoding:
The second encoding of the AtMostOne constraint is represented by formula (15), obtained using the sequential counter . In , the authors showed that the same encoding is obtained using the pigeon-hole encoding described above and by applying an additional step of variable elimination by resolution. In contrast to the pairwise encoding (13), the one obtained by the sequential counter (15) is linear ($O(n)$ variables and clauses) thanks to the additional variables . Both encodings (13) and (15) are known to maintain generalized arc consistency.
However, with the sequential counter based encoding, by adding to all clauses of the formula (15) we obtain a new formulation of the conditional AtMostOne constraint (formula (16)) that does not maintain generalized arc consistency.
Indeed, assigning two literals from to true does not allow us to deduce by unit propagation. For example, by assigning and to true, the first two clauses from (16) become binary, so no unit literal is produced.
To maintain the generalized arc consistency for the conditional AtMostOne constraint using sequential counter or pigeon-hole based encoding, must be added to a subset of the clauses as depicted in the formula (17).
Proposition 2
The CNF formula (17) encoding using the sequential counter or pigeon-hole encoding maintains generalized arc consistency by unit propagation.
Proof. The proof of this proposition is a direct consequence of Proposition 1. In fact, the encoding of is a Horn formula. As a consequence, when more than one literal from is assigned to true, a clause among the negative clauses of the encoding becomes falsified. Consequently, to encode , it is sufficient to add to the two negative clauses as shown in Constraint (17). Indeed, suppose that we assign two arbitrary variables and (with ) to true. From the assignment of to true and the clause , we deduce a unit literal . Then, from the clause we deduce another unit literal . This chain of unit propagated literals continues until . Now if we assign to true, the clause allows us to deduce , as (propagated unit literal) and are assigned to true. Let us consider another case, where is assigned to true. Such an assignment allows us to deduce by unit propagation the literals . Then, assigning any other literal (with ) to true, we deduce the literal , thanks to the clause . Obviously, assigning to true leads to the classical encoding of the AtMostOne constraint, which for the sequential counter and pigeon-hole encodings preserves generalized arc consistency by unit propagation.
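The contrast between guarding every clause and guarding only the negative clauses can be reproduced on a small instance. The sketch below assumes the standard sequential counter clause set for AtMostOne, which may differ in indexing from formula (15); the variable numbering, the condition variable `y` and the helper names are hypothetical.

```python
from itertools import combinations


def unit_propagate(clauses, assumptions):
    """Return all literals made true by unit propagation, or None on conflict."""
    assigned = set(assumptions)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied
            pending = [lit for lit in clause if -lit not in assigned]
            if not pending:
                return None
            if len(pending) == 1:
                assigned.add(pending[0])
                changed = True
    return assigned


def amo_sequential(xs, aux):
    """Sequential counter AtMostOne over xs with len(xs) - 1 auxiliary variables.

    Returns (definite, negative): the clauses with one positive literal and
    the purely negative clauses, kept apart so they can be guarded separately.
    """
    n = len(xs)
    definite = []
    for i in range(n - 1):
        definite.append([-xs[i], aux[i]])           # x_i -> s_i
        if i > 0:
            definite.append([-aux[i - 1], aux[i]])  # s_{i-1} -> s_i
    negative = [[-xs[i], -aux[i - 1]] for i in range(1, n)]  # not (x_i and s_{i-1})
    return definite, negative


xs, aux, y = [1, 2, 3, 4], [5, 6, 7], 8
definite, negative = amo_sequential(xs, aux)
guard_all = [[-y] + c for c in definite + negative]       # in the spirit of (16)
guard_negative = definite + [[-y] + c for c in negative]  # in the spirit of (17)

pairs = list(combinations(xs, 2))
print(any(-y in unit_propagate(guard_all, p) for p in pairs))       # False
print(all(-y in unit_propagate(guard_negative, p) for p in pairs))  # True
```

Leaving the definite clauses unguarded is what lets the auxiliary variables be propagated from the inputs alone, so the falsified negative clause is reached while the condition is still unassigned.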
3.1.3 Conditional AtMostOne Sorting Networks Encoding:
The sorting network encoding of the AtMostOne conditional constraint is similar to the conditional AtMostK constraint described in Section 3.2.2. It is defined as:
Proposition 4 shows that the encoding, for any value of , maintains generalized arc consistency by unit propagation.
3.2 Conditional AtMostK Constraint Encodings
Let us now consider the general case of Conditional AtMostK Constraint.
3.2.1 Pigeon-Hole based Encoding of Conditional Cardinality:
In Subsection 2.2.1, we reviewed the pigeon hole based encoding of the cardinality constraint AtLeastK of the form proposed in . For clarity and consistency reasons, and as the constraint AtMostK can be equivalently rewritten as an AtLeastK constraint , for the pigeon hole based encoding, we consider the conditional AtLeastK constraint .
To preserve GAC, must be added only to a limited subset of the clauses of the encoding: only the positive clauses of constraint (3) are augmented with .
Proposition 3
The encoding preserves the generalized arc consistency of .
Proof. Let us note that the pigeon-hole based encoding of is a reverse-Horn formula. Proposition 1 can thus be adapted to the reverse-Horn case by considering assignments of variables to false and positive clauses. As a consequence, adding to the positive clauses is sufficient to maintain GAC by unit propagation. Let us sketch the proof using Example 1. The CNF encoding of the conditional constraint is obtained from the CNF formula encoding by disjunctively adding to the positive clauses (clauses on the right-hand side). As we can observe, the obtained formula remains in the reverse-Horn class. Let us show that by assigning any three variables among to to false, we deduce by unit propagation. Suppose that , and are assigned to false. From the second and third sets of clauses, we deduce by unit propagation , , , , and . Consequently, from the clause , we deduce . Let us consider another case, say , and are assigned to false. By unit propagation, we deduce , , , , , , , , . From the clause , we deduce . Similarly, any other assignment of three variables from to produces by unit propagation.
3.2.2 Sorting Networks based Encoding of Conditional Cardinality:
Let us now consider the sorting networks based encoding of the conditional AtMostK constraint . Using the sorting networks encoding of the AtMostK constraint (see Section 2.2.2), its conditional variant can be represented by which is equivalent to the CNF formula . As discussed in Section 2.2.2, the basic comparator of two propositional variables, -, is a building block of the sorting networks based encoding , i.e., is a conjunction of multiple formulas, each encoding a basic two-comparator component. Consequently, the conditional formula can be translated into CNF by adding to all the clauses of each basic two-comparator, which leads to multiple conditional two-comparators of the form -, written in clausal form as:
As we can see, assigning any input literal or of a conditional two-comparator does not allow us to deduce any literal by unit propagation, as all the clauses from (19) become binary. In fact, to maintain generalized arc consistency for the conditional AtMostK constraint using the sorting networks based encoding, must be disjunctively added only to the unit clause :
Proposition 4
The encoding preserves the generalized arc consistency of .
Proof. When is assigned to true, the simplified formula represents the AtMostK constraint encoded using sorting networks. Now, we consider two cases depending on the truth value of . In the first case, if is assigned to true, we deduce by unit propagation. Indeed, as the outputs are sorted in descending order, this means that the AtMostK constraint is violated; to satisfy the conditional AtMostK constraint, one must assign to false. In the second case, if the truth value of is false, the AtMostK constraint is satisfied, so the conditional constraint holds whatever the value of .
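The effect of guarding only the output unit clause can be illustrated as follows. This is a sketch under stated assumptions: for brevity it uses a simple bubble-style network of two-comparators rather than the sorting network of the cited encodings, each comparator is encoded with the three Horn clauses (input implies the larger output, both inputs imply the smaller output), and the condition variable `y` and all helper names are hypothetical. Only the deduction of the negated condition is checked here.

```python
from itertools import combinations


def unit_propagate(clauses, assumptions):
    """Return all literals made true by unit propagation, or None on conflict."""
    assigned = set(assumptions)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied
            pending = [lit for lit in clause if -lit not in assigned]
            if not pending:
                return None
            if len(pending) == 1:
                assigned.add(pending[0])
                changed = True
    return assigned


def sorter_clauses(n):
    """Bubble-style network of two-comparators over the inputs 1..n.

    Returns (clauses, outputs, last_var); outputs are in descending order.
    """
    wires = list(range(1, n + 1))
    last_var = n
    clauses = []
    for end in range(n - 1, 0, -1):
        for i in range(end):
            a, b = wires[i], wires[i + 1]
            hi, lo = last_var + 1, last_var + 2
            last_var += 2
            # a -> hi, b -> hi, (a and b) -> lo
            clauses += [[-a, hi], [-b, hi], [-a, -b, lo]]
            wires[i], wires[i + 1] = hi, lo
    return clauses, wires, last_var


n, k = 4, 2
comparators, outputs, last_var = sorter_clauses(n)
y = last_var + 1
bound = [-y, -outputs[k]]  # conditional version of "output k+1 is false"

guard_all = [[-y] + c for c in comparators] + [bound]
guard_output = comparators + [bound]

triples = list(combinations(range(1, n + 1), k + 1))
print(any(-y in unit_propagate(guard_all, t) for t in triples))     # False
print(all(-y in unit_propagate(guard_output, t) for t in triples))  # True
```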
4 Sequential Unary Counter based Encoding of Conditional Cardinality Constraint
We have shown in Subsection 3.1 how the conditional AtMostOne constraint can be encoded using the sequential counter-based encoding while preserving the GAC property. Let us now consider the general case of the sequential counter-based encoding of the conditional AtMostK constraint. The clauses allow us to propagate any assignment of to synchronize all the intermediate sequential counters, while the clauses allow us to detect any inconsistency of the AtMostK constraint. Indeed, if is added to all the clauses, the literals cannot be propagated from any assignment of the variables , which prevents this synchronization. To preserve the GAC property, we should add only to the clauses of as shown in the following formula:
Example 2
Let us consider the following constraint , which is encoded as follows:
Assume that we start by assigning to true; then the literals and are deduced by unit propagation. Next, if we assign to true, the literal is unit propagated. Finally, by assigning to true, which violates the constraint, the literal is propagated thanks to the last clause.
Proposition 5
The encoding preserves the generalized arc consistency of .
Proof. The sequential counter based encoding of the cardinality constraint is also a Horn formula. Consequently, we can apply Proposition 1 to conclude that must be added only to the negative clauses in order to preserve generalized arc consistency. The proof is a simple generalization of the reasoning sketched in Example 2.
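The general construction can be sketched and checked by brute force on small instances. The code below assumes Sinz's sequential counter clause set, whose indexing may differ from formulas (8) to (12), and guards only the overflow clauses with the negated condition; the variable numbering, the name `y` and the helper names are hypothetical.

```python
from itertools import combinations


def unit_propagate(clauses, assumptions):
    """Return all literals made true by unit propagation, or None on conflict."""
    assigned = set(assumptions)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            if any(lit in assigned for lit in clause):
                continue  # clause already satisfied
            pending = [lit for lit in clause if -lit not in assigned]
            if not pending:
                return None
            if len(pending) == 1:
                assigned.add(pending[0])
                changed = True
    return assigned


def conditional_at_most_k(n, k):
    """Sequential counter for y -> (x_1 + ... + x_n <= k), with 1 <= k < n.

    Variables: x_i is i, s_{i,j} is n + (i-1)*k + j, and y is the last index.
    Only the overflow clauses carry the extra literal -y.
    """
    def s(i, j):
        return n + (i - 1) * k + j

    y = n + (n - 1) * k + 1
    clauses = [[-1, s(1, 1)]]
    clauses += [[-s(1, j)] for j in range(2, k + 1)]
    for i in range(2, n):
        clauses.append([-i, s(i, 1)])
        clauses.append([-s(i - 1, 1), s(i, 1)])
        for j in range(2, k + 1):
            clauses.append([-i, -s(i - 1, j - 1), s(i, j)])
            clauses.append([-s(i - 1, j), s(i, j)])
        clauses.append([-y, -i, -s(i - 1, k)])  # overflow at position i
    clauses.append([-y, -n, -s(n - 1, k)])      # overflow at position n
    return clauses, y


n, k = 4, 2
cnf, y = conditional_at_most_k(n, k)
xs = range(1, n + 1)

# k + 1 true inputs falsify the constraint: -y must be propagated.
refuted = all(-y in unit_propagate(cnf, t) for t in combinations(xs, k + 1))

# With y true and k true inputs, every other input must be propagated to false.
pruned = all(
    all(-x in unit_propagate(cnf, (y,) + t) for x in xs if x not in t)
    for t in combinations(xs, k)
)
print(refuted, pruned)  # True True
```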
5 SAT-based Association Rules Mining: A Case Study
We now present an application case, the problem of mining non-redundant association rules, whose encoding involves many conditional AtMostOne constraints.
5.1 Association Rules Mining
Let be a finite non-empty set of symbols, called items. We use the letters , , , etc. to range over the elements of . An itemset over is defined as a subset of , i.e., . We use to denote the set of itemsets over and we use the capital letters , , , etc. to range over the elements of . A transaction is an ordered pair where is a natural number, called transaction identifier, and an itemset, i.e., . A transaction database is defined as a finite non-empty set of transactions () where each transaction identifier refers to a unique itemset. The cover of an itemset in a transaction database is defined as . The support of in corresponds to the cardinality of , i.e., . An itemset such that is a closed itemset if, for all itemsets with , .
Let us consider the transaction database depicted in Table 2. We have and while . The itemset is closed, while is not.
An association rule is a pattern of the form where (called the antecedent) and (called the consequent) are two disjoint itemsets. The interestingness predicate is defined using the notions of support and confidence. The support of an association rule in a transaction database , defined as , determines how often a rule is applicable to a given dataset, i.e., the occurrence frequency of the rule. The confidence of in is defined as the support of the rule divided by the support of its antecedent; it estimates the conditional probability of the consequent given the antecedent.
A valid association rule is an association rule with support and confidence greater than or equal to the minimum support threshold (minsupp) and minimum confidence threshold (minconf), respectively.
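The notions of cover, support, confidence, closed itemset and valid rule can be made concrete on a toy database. The sketch below is illustrative only: the database is hypothetical, the rule support is assumed to be normalized by the number of transactions (consistent with thresholds given in percentage), and all function names are chosen here.

```python
from fractions import Fraction


def cover(itemset, db):
    """Identifiers of the transactions that contain every item of the itemset."""
    return {tid for tid, items in db.items() if itemset <= items}


def support(itemset, db):
    """Cardinality of the cover of the itemset."""
    return len(cover(itemset, db))


def rule_support(x, y, db):
    """Fraction of transactions containing both antecedent and consequent."""
    return Fraction(support(x | y, db), len(db))


def rule_confidence(x, y, db):
    """Support of the rule divided by the support of its antecedent."""
    return Fraction(support(x | y, db), support(x, db))


def is_closed(itemset, db):
    """Closed: non-zero support, and every strict superset has smaller support.

    Adding a single item is enough to test, since support can only decrease
    when an itemset grows.
    """
    all_items = set().union(*db.values())
    base = support(itemset, db)
    return base >= 1 and all(
        support(itemset | {a}, db) < base for a in all_items - itemset
    )


def is_valid(x, y, db, minsupp, minconf):
    """Valid rule: support and confidence both reach their thresholds."""
    return rule_support(x, y, db) >= minsupp and rule_confidence(x, y, db) >= minconf


# Hypothetical transaction database: identifier -> itemset.
db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "b", "d"}, 4: {"c", "d"}}

print(rule_support({"a"}, {"b"}, db), rule_confidence({"a"}, {"b"}, db))  # 3/4 1
print(is_closed({"a"}, db), is_closed({"a", "b"}, db))                    # False True
print(is_valid({"c"}, {"d"}, db, Fraction(1, 4), Fraction(3, 5)))         # False
```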
Definition 1 (Mining Association Rules Problem)
The problem of mining association rules consists in computing
5.2 SAT-based Non-Redundant Association Rules Mining
To mine association rules, Boudane et al.  proposed a SAT-based approach. Boolean variables are introduced to represent the antecedent and the consequent of an association rule . Support and confidence constraints are expressed as 0/1 linear inequalities over the variables associated with transactions.
Let be a set of items, a transaction database, where , (resp. ) a minimum support (resp. confidence) threshold. Each item is associated with two Boolean variables and . (resp. ) is true if and only if (resp. ). Similarly to , to represent the cover of and , each transaction identifier is associated with two propositional variables and . (resp. ) are used to represent the cover of (resp. ). More precisely, given a Boolean interpretation , the corresponding association rule, denoted , is , the cover of is , and the cover of is . The SAT encoding of the association rules mining problem is defined by the constraints (26) to (31).
The two clauses of the formula (26) express that and are not empty sets. Formula (27) expresses . The formula (28) is used to represent the cover of the itemset corresponding to the left part of the candidate association rule. We know that the transaction identifier does not belong to if and only if there exists an item such that . This property is represented by constraint (28), expressing that is if and only if contains an item that does not belong to the transaction . In the same way, the formula (29) captures the cover of . To specify that the support of the candidate rule has to be greater than or equal to the fixed threshold (in percentage) and that the confidence is greater than or equal to , we use respectively the constraints (30) and (31), expressed as 0/1 linear inequalities.
To extend the mining task to the closed association rules, the following constraint is added to express that is a closed itemset :
This formula means that, for every item , if we have , which is encoded with the formula , then we get , which is encoded with .
Several contributions deal with the enumeration of a compact representation of association rules. Among such representations, one can cite the well-known Minimal Non-Redundant Association Rules [6, 12] defined as follows:
Definition 2 (Minimal Non-Redundant Rule)
An association rule is a minimal non-redundant rule iff there is no association rule different from s.t. () , () and () and .
Consider the rules given in Table 2. In this set of rules, is a minimal non-redundant rule while is not.
Minimal non-redundant association rules are the closed rules in which the antecedents are minimal w.r.t. set inclusion. The authors of  provided a characterization of the antecedents, called minimal generators.
Definition 3 (Minimal Generator)
Given a closed itemset . An itemset is a minimal generator of iff and there is no s.t. and .
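A direct check of the minimal generator property can be sketched as follows. This assumes the usual reading of the definition (the candidate is included in the closed itemset, has the same cover, and no strict subset of the candidate has that cover); the toy database and the function names are hypothetical.

```python
def cover(itemset, db):
    """Identifiers of the transactions that contain every item of the itemset."""
    return {tid for tid, items in db.items() if itemset <= items}


def is_minimal_generator(candidate, closed, db):
    """True iff candidate is a minimal generator of the closed itemset.

    Removing one item at a time is enough: if some strict subset kept the
    same cover, so would every subset obtained by dropping a single item
    outside it, because covers can only grow when items are removed.
    """
    target = cover(closed, db)
    if not candidate <= closed or cover(candidate, db) != target:
        return False
    return all(cover(candidate - {a}, db) != target for a in candidate)


# Hypothetical transaction database: identifier -> itemset.
db = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "b", "d"}, 4: {"c", "d"}}
closed = {"a", "b"}

print(is_minimal_generator({"a"}, closed, db))       # True
print(is_minimal_generator({"a", "b"}, closed, db))  # False
```

In the SAT encoding this test is not run explicitly; it is what the minimal generator constraint enforces on the antecedent variables.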
In , the authors proposed to extend the SAT-based encoding to enumerate the minimal non-redundant association rules. To this end, the SAT-based encoding of association rules mining is enhanced with a Boolean constraint expressing that each antecedent is a minimal generator. This constraint expressing that is defined as follows:
Using additional variables, this constraint is rewritten as:
As we can observe, the previous constraint (34) involves conditional AtMostOne constraints. We note , the conjunction of the formulas from (26) to (32) and (34), encoding the problem of minimal non redundant association rules. This encoding is used in our experimental evaluation to show the relevance of our proposed encoding.
In this section, we consider the encoding of minimal non-redundant association rules as described by the Boolean formula (see Subsection 5.2). To enumerate the set of models of the resulting CNF formula, we follow the approach of . The proposed model enumeration algorithm is based on a backtrack search DPLL-like procedure. In our experiments, the variable ordering heuristic focuses in priority on the variables of and , respectively, to select the one to assign next. The main strength of this approach lies in using the watched literals structure to perform unit propagation efficiently. Let us also note that the constraints (30) and (31), expressing respectively the frequency and the confidence, are managed dynamically without translation into CNF. Indeed, these constraints are handled and propagated on the fly, as is usually done in constraint programming. Each model of the propositional formula encoding the association rules mining task corresponds to an association rule obtained by considering the truth values of the propositional variables encoding the antecedent () and the consequent () of this rule.
For illustration purposes, in our experiments, (resp. ) denotes the formula , where the conditional AtMostOne constraints involved in the formula (34) are expressed using the sequential counter encoding (Section 3.1) that maintains (resp. does not maintain) the GAC property, expressed by the formula (17) (resp. (16)). For each dataset, the minimum support was varied from 5% to 100% in steps of 5%. The minimum confidence was varied in the same way. Thus, for each dataset, a set of 400 configurations is generated. All the experiments were run on Intel Xeon quad-core machines with 32GB of RAM running at 2.66 GHz. For each instance, we fix the timeout to 15 minutes of CPU time.
Table 3 describes our comparative results. We report in column 1 the name of the dataset and its characteristics in parenthesis: number of items (#items), number of transactions (#trans) and density. For each encoding, we report the number of solved configurations (), and the average solving time ( in seconds). For each unsolved configuration, the time is set to 900 seconds (time out). In the last row of Table 3, we provide the total number of solved configurations and the global average CPU time in seconds.
| data (#items, #trans, density) | #S | time(s) | #S | time(s) |
|---|---|---|---|---|
| Audiology (148, 216, 45%) | 21 | 855.11 | 22 | 854.87 |
| Zoo-1 (36, 101, 44%) | 141 | 582.79 | 400 | 0.25 |
| Tic-tac-toe (27, 958, 33%) | 395 | 12.7 | 400 | 0.16 |
| Anneal (93, 812, 45%) | 20 | 855.00 | 252 | 396.65 |
| Australian-credit (125, 653, 41%) | 60 | 765.02 | 288 | 301.96 |
| German-credit (112, 1000, 34%) | 82 | 715.54 | 331 | 203.508 |
| Heart-cleveland (95, 296, 47%) | 100 | 675.02 | 312 | 233.93 |
| Hepatitis (68, 137, 50%) | 102 | 670.51 | 345 | 165.28 |
| Hypothyroid (88, 3247, 49%) | 20 | 855.01 | 128 | 643.01 |
| kr-vs-kp (73, 3196, 49%) | 21 | 852.76 | 173 | 546.40 |
| Lymph (68, 148, 40%) | 21 | 852.75 | 400 | 16.57 |
| Mushroom (119, 8124, 18%) | 20 | 855.08 | 392 | 68.71 |
| Primary-tumor (31, 336, 48%) | 144 | 577.05 | 400 | 3.87 |
| Soybean (50, 650, 32%) | 63 | 758.26 | 400 | 0.72 |
| Vote (48, 435, 33%) | 243 | 353.44 | 400 | 25.34 |
| Splice-1 (287, 3190, 21%) | 363 | 90.68 | 380 | 168.83 |