A Mining-Based Compression Approach for Constraint Satisfaction Problems

05/14/2013 ∙ by Said Jabbour, et al. ∙ 0

In this paper, we propose an extension of our Mining for SAT framework to Constraint Satisfaction Problems (CSP). We consider n-ary extensional constraints (table constraints). Our approach aims to reduce the size of the CSP by exploiting the structure of the constraint graph and of its associated microstructure. More precisely, we apply itemset mining techniques to search for closed frequent itemsets on these two representations. Using Tseitin extension, we rewrite the whole CSP into another compressed CSP that is equivalent with respect to satisfiability. Our approach contrasts with the approach previously proposed by Katsirelos and Walsh, as we do not change the structure of the constraints.


1 Introduction

The table constraint has long been considered particularly important in constraint satisfaction problems (CSP). Indeed, one of the most used formulations of CSP consists in expressing each constraint in extension or as a relation among variables with associated finite domains. Many research works consider table constraints as the standard representation. Indeed, any constraint can be expressed using a set of allowed or forbidden tuples. However, the size of this kind of extensional constraint might be exponential in the worst case. In [6], Katsirelos and Walsh proposed for the first time a compression algorithm for large arity extensional constraints. The proposed algorithm attempts to capture the structure that may exist in a table constraint. The authors proposed an alternative representation of the set of tuples of a given relation by a set of compressed tuples. The proposed representation may lead to an exponential reduction in space complexity. However, the compressed tuples may be larger than the arity of the original constraint. Consequently, the obtained CSP does not follow the standard representation of the table constraint. The authors use decision trees to derive a set of compressed tuples.

In this paper, we present a new compression algorithm that combines both itemset mining techniques and Tseitin extension principles to derive a new compact representation of the table constraints. First, we show that our previous Mining for SAT approach can be extended to deal with the CSP by considering the constraint graph as a transaction database, where the transactions correspond to the constraints and the items to the variables of the CSP. The closed frequent itemsets correspond to the subsets of variables shared most often by the different constraints of the CSP. Using extension (auxiliary variables), we show how such constraints can be rewritten while preserving satisfiability. Secondly, we consider each table constraint individually and derive a new transaction database made of a sequence of tuples, i.e. a set of indexed tuples. More precisely, each value of a tuple is indexed with its position in the constraint. By enumerating closed frequent itemsets on such a transaction database, we are able to search for the largest rectangle in the table constraint. Similarly, with the extension principle, we show how such a constraint can be compressed while preserving the traditional representation.

2 Technical background and preliminary definitions

2.1 Frequent Itemset Mining Problem

Let $\Omega$ be a set of items. A set $I \subseteq \Omega$ is called an itemset. A transaction is a couple $(tid, I)$ where $tid$ is the transaction identifier and $I$ is an itemset. A transaction database $\mathcal{D}$ is a finite set of transactions over $\Omega$ in which no two different transactions have the same transaction identifier. We say that a transaction $(tid, I)$ supports an itemset $J$ if $J \subseteq I$.

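The cover of an itemset $I$ in a transaction database $\mathcal{D}$ is the set of identifiers of transactions in $\mathcal{D}$ supporting $I$: $\mathcal{C}(I, \mathcal{D}) = \{tid \mid (tid, J) \in \mathcal{D}, I \subseteq J\}$. The support of an itemset $I$ in $\mathcal{D}$ is defined by: $Supp(I, \mathcal{D}) = |\mathcal{C}(I, \mathcal{D})|$. Moreover, the frequency of $I$ in $\mathcal{D}$ is defined by: $\mathcal{F}(I, \mathcal{D}) = Supp(I, \mathcal{D}) / |\mathcal{D}|$.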
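The three notions above (cover, support, frequency) can be illustrated with a minimal Python sketch. The toy database and the function names are hypothetical and chosen only for illustration; they are not the library example of Table 1.

```python
from fractions import Fraction

# Hypothetical toy database: transaction identifier -> itemset.
DATABASE = {
    1: {"a", "b", "c"},
    2: {"a", "b"},
    3: {"a", "c"},
    4: {"b", "c"},
}


def cover(itemset, database):
    """Identifiers of the transactions whose itemset contains `itemset`."""
    wanted = set(itemset)
    return {tid for tid, items in database.items() if wanted <= items}


def support(itemset, database):
    """Number of transactions supporting `itemset`."""
    return len(cover(itemset, database))


def frequency(itemset, database):
    """Support divided by the number of transactions in the database."""
    return Fraction(support(itemset, database), len(database))


print(cover({"a", "b"}, DATABASE))      # transactions 1 and 2
print(support({"a", "b"}, DATABASE))    # 2
print(frequency({"a", "b"}, DATABASE))  # 1/2
```

An exact `Fraction` is used for the frequency only to keep the ratio readable; a float would serve equally well.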

For example, let us consider the transaction database in Table 1. Each transaction corresponds to the favorite writers of a library member. For instance, we have and
.

tid itemset
001
002
003
004
005
006
Table 1: An example of transaction database

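Let $\mathcal{D}$ be a transaction database over $\Omega$ and $\lambda$ a minimal support threshold. The frequent itemset mining problem consists of computing the following set: $\mathcal{FIM}(\mathcal{D}, \lambda) = \{I \subseteq \Omega \mid Supp(I, \mathcal{D}) \geq \lambda\}$.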

The problem of computing the number of frequent itemsets is #P-hard [3]. The complexity class #P corresponds to the set of counting problems associated with decision problems in NP. For example, counting the number of models satisfying a CNF formula is a #P problem.

Let us now define two condensed representations of the set of all frequent itemsets: maximal and closed frequent itemsets.

Definition 1 (Maximal Frequent Itemset)

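Let $\mathcal{D}$ be a transaction database, $\lambda$ a minimal support threshold and $I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called maximal when for all $I' \supset I$, $I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ ($I'$ is not a frequent itemset).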

We denote by the set of all maximal frequent itemsets in with as a minimal support threshold. For instance, in the previous example, we have .

Definition 2 (Closed Frequent Itemset)

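Let $\mathcal{D}$ be a transaction database, $\lambda$ a minimal support threshold and $I \in \mathcal{FIM}(\mathcal{D}, \lambda)$. $I$ is called closed when for all $I' \supset I$, $Supp(I, \mathcal{D}) > Supp(I', \mathcal{D})$.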

We denote by the set of all closed frequent itemsets in with as a minimal support threshold. For instance, we have
. In particular, let us note that we have and . That explains why and are both closed. One can easily see that if all the closed (resp. maximal) frequent itemsets are computed, then all the frequent itemsets can be computed without using the corresponding database. Indeed, the frequent itemsets correspond to all the subsets of the closed (resp. maximal) frequent itemsets.
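To make the difference between the two condensed representations concrete, here is a small brute-force sketch in Python. The database, the threshold and the function names are hypothetical; the enumeration is exponential and is only meant for tiny examples, not as a substitute for a dedicated miner.

```python
from itertools import combinations

# Hypothetical toy database: transaction identifier -> itemset.
DATABASE = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"a"}}


def support(itemset, database):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for items in database.values() if itemset <= items)


def frequent_itemsets(database, min_support):
    """Map every non-empty frequent itemset to its support (brute force)."""
    items = sorted(set().union(*database.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = support(candidate, database)
            if count >= min_support:
                result[candidate] = count
    return result


def maximal(frequent):
    """Frequent itemsets with no frequent strict superset."""
    return {i for i in frequent if not any(i < j for j in frequent)}


def closed(frequent):
    """Frequent itemsets with no strict superset of the same support."""
    return {
        i
        for i, count in frequent.items()
        if not any(i < j and frequent[j] == count for j in frequent)
    }


FREQUENT = frequent_itemsets(DATABASE, min_support=2)
print(sorted(map(sorted, maximal(FREQUENT))))  # [['a', 'b'], ['a', 'c']]
print(sorted(map(sorted, closed(FREQUENT))))   # [['a'], ['a', 'b'], ['a', 'c']]
```

In this toy database the itemset {a} is closed but not maximal: it has support 4, while each of its frequent supersets has support 2. Restricting the closedness test to frequent supersets is safe because a superset with the same support as a frequent itemset is itself frequent.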

Clearly, the number of maximal (resp. closed) frequent itemsets is significantly smaller than the number of frequent itemsets. Nonetheless, this number is not always polynomial in the size of the database [9]. In particular, the problem of counting the number of maximal frequent itemsets is #P-complete (see also [9]).

Many algorithms have been proposed for enumerating frequent closed itemsets. One can cite the Apriori-like algorithm, originally proposed in [1] for mining frequent itemsets for association rules. It proceeds by a level-wise search of the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$. Indeed, it starts by computing the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$ of size one. Then, assuming the elements of $\mathcal{FIM}(\mathcal{D}, \lambda)$ of size $n$ are known, it computes a set of candidates of size $n+1$ so that $I$ is a candidate if and only if all its subsets are in $\mathcal{FIM}(\mathcal{D}, \lambda)$. This procedure is iterated until no more candidates are found. Obviously, this basic procedure is enhanced using some properties, such as the anti-monotonicity property, that allow us to reduce the search space. Indeed, if $I \notin \mathcal{FIM}(\mathcal{D}, \lambda)$, then $I' \notin \mathcal{FIM}(\mathcal{D}, \lambda)$ for all $I' \supseteq I$. In our experiments, we consider one of the state-of-the-art algorithms, LCM, for mining frequent closed itemsets, proposed by Takeaki Uno et al. in [8]. In theory, the authors prove that LCM exactly enumerates the set of frequent closed itemsets within polynomial time per closed itemset in the total input size. Let us mention that the LCM algorithm obtained the best implementation award of FIMI'2004 (Frequent Itemset Mining Implementations).
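The level-wise search described above can be sketched in a few lines of Python. This is a simplified Apriori-style illustration, not the LCM algorithm used in the paper; the database and the function name are hypothetical.

```python
from itertools import combinations

# Hypothetical toy database: transaction identifier -> itemset.
DATABASE = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"a", "c"}, 4: {"a"}}


def apriori(database, min_support):
    """Level-wise enumeration of the non-empty frequent itemsets."""

    def support(itemset):
        return sum(1 for items in database.values() if itemset <= items)

    # Level 1: frequent itemsets of size one.
    items = set().union(*database.values())
    level = {
        frozenset([item])
        for item in items
        if support(frozenset([item])) >= min_support
    }
    frequent = set(level)
    size = 1
    while level:
        size += 1
        # Join step: unions of two frequent itemsets of the previous level.
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        # Prune step (anti-monotonicity): keep a candidate only if all its
        # subsets of the previous size are frequent.
        candidates = {
            c
            for c in candidates
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
    return frequent


print(sorted(map(sorted, apriori(DATABASE, min_support=2))))
```

On this toy database with threshold 2, the candidate {a, b, c} is pruned without any support computation because its subset {b, c} is not frequent.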

2.2 Constraint Satisfaction Problems: Preliminary definitions and notations

A constraint network is defined as a tuple . is a finite set of variables and is a function mapping a variable to a domain of values . We note the maximum size of the domains, and the set of all values. is a finite set of constraints . Each constraint of arity is defined as a couple where is the set of variables involved in and the set of allowed tuples i.e. iff the tuple satisfies the constraint . We define the size of the constraint network as where and . A solution to the constraint network is an assignment of all the variables satisfying all the constraints in . A CSP (Constraint Satisfaction Problem) is the problem of deciding if a constraint network admits a solution or not.

We denote the value of the variable in the tuple . Let and be two tuples (of values or variables), we define the non-commutative operator by . Let be a CSP instance, a constraint and a sequence of variables such that where is the set of variables of . We denote by the following set of tuples:

2.3 Tseitin’s Extension principle

To explain the Tseitin principles [7] at the basis of the linear transformation of general Boolean formulas to a formula in conjunctive normal form (CNF), let us introduce some necessary definitions and notations. A CNF formula $\Phi$ is a conjunction of clauses, where a clause is a disjunction of literals. A literal is a positive ($x$) or negated ($\neg x$) propositional variable. The two literals $x$ and $\neg x$ are called complementary. A CNF formula can also be seen as a set of clauses, and a clause as a set of literals. The size of the CNF formula $\Phi$ is defined as $|\Phi| = \sum_{c \in \Phi} |c|$, where $|c|$ is equal to the number of literals in $c$.

Tseitin’s encoding consists in introducing fresh variables to represent sub-formulae in order to represent their truth values. Let us consider the following DNF formula (Disjunctive Normal Form: a disjunction of conjunctions):

A naive way of converting such a formula to a CNF formula consists in using the distributivity of disjunction over conjunction ():

Such a naive approach is clearly exponential in the worst case. In Tseitin’s transformation, fresh propositional variables are introduced to prevent such combinatorial explosion, mainly caused by the distributivity of disjunction over conjunction and vice versa. With additional variables, the obtained CNF formula is linear in the size of the original formula. However the equivalence is only preserved w.r.t satisfiability:
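As a rough illustration of the size argument, the following Python sketch compares the naive distribution with a Tseitin-style encoding on a hypothetical DNF formula. Literals are non-zero integers (a negative integer denotes a negated variable), and the helper names are assumptions made for this example. The sketch introduces one fresh variable per conjunction and encodes only the implication from the fresh variable to its conjunction, which is sufficient to preserve satisfiability.

```python
from itertools import product


def naive_cnf(dnf):
    """Distribute disjunction over conjunction: one clause per choice of
    one literal in each conjunction of the DNF."""
    return [list(choice) for choice in product(*dnf)]


def tseitin_cnf(dnf, first_fresh):
    """One fresh variable t_i per conjunction: the clause (t_1 v ... v t_k)
    plus the binary clauses (-t_i v literal) for each literal of term i."""
    fresh = list(range(first_fresh, first_fresh + len(dnf)))
    clauses = [fresh]
    for t, term in zip(fresh, dnf):
        clauses.extend([-t, literal] for literal in term)
    return clauses


def size(cnf):
    """Size of a CNF formula: total number of literal occurrences."""
    return sum(len(clause) for clause in cnf)


# Hypothetical DNF: (x1 & x2 & x3) | (x4 & x5 & x6) | (x7 & x8 & x9).
DNF = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(size(naive_cnf(DNF)))        # 81: 27 clauses of 3 literals
print(size(tseitin_cnf(DNF, 10)))  # 21: 1 clause of 3 literals, 9 of 2
```

With k conjunctions of n literals each, the naive conversion produces n^k clauses, whereas the encoding with fresh variables produces 1 + k·n clauses.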

3 Compressing Table Constraints Networks

In this section, we propose two compression rules for table constraint networks. The first one, based on the constraint graph, aims to reduce the size of the constraint network by rewriting the constraints using the most shared variables. The second compression technique, based on the microstructure of the constraint network, aims to reduce the size of table constraints by exploiting common sub-tuples.

3.1 Constraint graph Based Compression

3.1.1 CSP instance as transactions database:

We describe the transactions database that we associate to a given constraints network. It is obtained by considering the set of variables as a set of items.

Definition 3

Let be a constraints network. The transactions database associated to , denoted , is defined over the set of items as follows:
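The idea of this definition, one transaction per constraint with the variables of its scope as items, can be sketched as follows in Python. The constraint names and scopes are hypothetical, and using the constraint name as the transaction identifier is a choice made for this illustration.

```python
# Hypothetical constraint scopes: constraint name -> tuple of variables.
SCOPES = {
    "c1": ("x1", "x2", "x3"),
    "c2": ("x1", "x2", "x4"),
    "c3": ("x2", "x5"),
}


def constraint_graph_transactions(scopes):
    """One transaction per constraint; its items are the scope variables."""
    return {name: frozenset(scope) for name, scope in scopes.items()}


def cover(itemset, database):
    """Names of the constraints whose scope contains all given variables."""
    return {tid for tid, items in database.items() if set(itemset) <= items}


TRANSACTIONS = constraint_graph_transactions(SCOPES)
print(sorted(cover({"x1", "x2"}, TRANSACTIONS)))  # ['c1', 'c2']
print(sorted(cover({"x2"}, TRANSACTIONS)))        # ['c1', 'c2', 'c3']
```

A frequent itemset of this database is a set of variables that occurs together in the scope of several constraints, which is what the rewriting rule of the next subsection exploits.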

3.1.2 Constraints Graph Rewriting Rule (CGR):

We provide a rewriting rule for reducing the size of a constraints network. It is mainly based on introducing new variables using Tseitin extension principle.

Definition 4 (CGR rule)

Let be a constraints network, a tuple of variables and a subset of constraints of such that for . In order to rewrite , we introduce a new variable and a set of new values such that and . Let be a bijection from to . We denote by the constraint network obtained by rewriting with respect to and :

  • ;

  • is a domain function defined as follows: if , and .

  • , where such that:

It is important to note that our rewriting rule achieves a weak form of pairwise consistency [5]. A constraint network is pairwise consistent (PWC) iff it has non-empty relations and any consistent tuple of a constraint $c$ can be consistently extended to any other constraint that intersects with $c$.

Definition 5 (Pairwise consistency)

[2, 5] Let be a constraints network. is pairwise consistent if and only if and , .

As pairwise consistency deletes tuples from a constraint relation, some values can be eliminated when they have lost all their supports. Consequently, domains can be filtered if generalized arc consistency (GAC) is applied in a second step.
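The prose characterization of pairwise consistency given above can be checked directly on small networks. The Python sketch below assumes a hypothetical representation in which each constraint is a pair (scope, set of allowed tuples); it is a naive check written for illustration, not a filtering algorithm.

```python
def is_pairwise_consistent(constraints):
    """True iff every relation is non-empty and every tuple of a constraint
    agrees, on the shared variables, with some tuple of every other one."""
    for scope_i, tuples_i in constraints.values():
        if not tuples_i:
            return False
        for scope_j, tuples_j in constraints.values():
            shared = [v for v in scope_i if v in scope_j]
            # Projections of the second relation onto the shared variables.
            projected_j = {
                tuple(t[scope_j.index(v)] for v in shared) for t in tuples_j
            }
            for t in tuples_i:
                if tuple(t[scope_i.index(v)] for v in shared) not in projected_j:
                    return False
    return True


# Hypothetical network: c1 and c2 share the variables x1 and x2.
NETWORK = {
    "c1": (("x1", "x2", "x3"), {(0, 0, 0), (0, 1, 0), (1, 1, 1)}),
    "c2": (("x1", "x2", "x4"), {(0, 0, 1), (0, 1, 1), (1, 0, 0)}),
}
FILTERED = {
    "c1": (("x1", "x2", "x3"), {(0, 0, 0), (0, 1, 0)}),
    "c2": (("x1", "x2", "x4"), {(0, 0, 1), (0, 1, 1)}),
}
print(is_pairwise_consistent(NETWORK))   # False
print(is_pairwise_consistent(FILTERED))  # True
```

In `NETWORK`, the tuple (1, 1, 1) of c1 cannot be extended to c2, and (1, 0, 0) of c2 cannot be extended to c1. Once both are removed, the value 1 has no support left for x1, which is the kind of value a subsequent GAC step would filter.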

As a side effect, our CGR rewriting rule maintains some weak form of PWC. Indeed, in Definition 4, when a sub-tuple , the tuple is then deleted and do not belong to the new constraint .
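The following Python sketch shows one possible reading of a CGR-style rewriting, under assumptions that should be stated clearly. The fresh values of the new variable are assumed to encode the sub-tuples over the shared variables that are allowed by every selected constraint, which is consistent with the remark that tuples with an unsupported sub-tuple are deleted. The size measure (arity times number of tuples), the data layout and all names are hypothetical and may differ from the paper's definitions.

```python
from itertools import product


def project(scope, tup, variables):
    """Restrict `tup` (a tuple over `scope`) to `variables`, in that order."""
    return tuple(tup[scope.index(v)] for v in variables)


def network_size(constraints):
    """Assumed size measure: arity times number of tuples, summed."""
    return sum(len(scope) * len(tuples) for scope, tuples in constraints.values())


def cgr_rewrite(domains, constraints, shared, selected, new_var):
    """Replace the variables `shared` by `new_var` in the selected constraints
    and add one constraint linking `new_var` to `shared`."""
    # Sub-tuples over `shared` allowed by every selected constraint.
    common = set.intersection(*(
        {project(scope, t, shared) for t in tuples}
        for scope, tuples in (constraints[name] for name in selected)
    ))
    # Bijection between common sub-tuples and fresh values.
    encode = {sub: f"v{i}" for i, sub in enumerate(sorted(common))}
    new_domains = dict(domains)
    new_domains[new_var] = set(encode.values())
    new_constraints = dict(constraints)
    new_constraints[f"link_{new_var}"] = (
        tuple(shared) + (new_var,),
        {sub + (value,) for sub, value in encode.items()},
    )
    for name in selected:
        scope, tuples = constraints[name]
        rest = tuple(v for v in scope if v not in shared)
        new_constraints[name] = (
            rest + (new_var,),
            {
                project(scope, t, rest) + (encode[project(scope, t, shared)],)
                for t in tuples
                if project(scope, t, shared) in encode  # others are deleted
            },
        )
    return new_domains, new_constraints


def solutions(domains, constraints, variables):
    """Brute force: satisfying assignments projected onto `variables`."""
    names = sorted(domains)
    found = set()
    for values in product(*(domains[n] for n in names)):
        assignment = dict(zip(names, values))
        if all(
            tuple(assignment[v] for v in scope) in tuples
            for scope, tuples in constraints.values()
        ):
            found.add(tuple(assignment[v] for v in variables))
    return found


DOMAINS = {v: {0, 1} for v in ("x1", "x2", "x3", "x4")}
CONSTRAINTS = {
    "c1": (("x1", "x2", "x3"), {(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 1, 1)}),
    "c2": (("x1", "x2", "x4"), {(0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0)}),
}
NEW_DOMAINS, NEW_CONSTRAINTS = cgr_rewrite(
    DOMAINS, CONSTRAINTS, ("x1", "x2"), ["c1", "c2"], "y"
)
print(network_size(CONSTRAINTS), network_size(NEW_CONSTRAINTS))  # 24 18
```

On this hypothetical instance the rewritten network has the same solutions over the original variables, and the tuples (1, 1, 1) of c1 and (1, 0, 0) of c2 disappear because their sub-tuples over (x1, x2) are not shared by both constraints.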

Example 1

Let be a constraints network, where , and where
and . Let be a tuple of variables such that and .By applying the CGR rule on , we obtain such that:

  • . We have . We define . Then .

    • ;

    • and

In this simple example, using one additional variable, we reduce the size of the constraint network from to . As we can observe, the value can be eliminated by GAC from the domain of .

3.1.3 Necessary and sufficient condition for size reduction

Let be a constraints network, and be a sub-tuple of variables corresponding to a frequent itemset of where the minimal support threshold is greater or equal to . Let be the set of constraints such that for . Suppose that the constraints network is pairwise consistent, in such a case, all the relations associated to each for contain the same number of tuples. Under such worst case hypothesis, the size of can be reduced by at least . Let us consider again the example 1. The reduction is at least . If we consider, the tuple eliminated by the application of the CGR rule. This results in subtracting from the second term of . Consequently, we obtain a reduction of .

Regarding the value of , one can see that the compression is interesting when i.e. . Indeed, if then there is no reduction. Thus, there are three cases : if , then , else if then , otherwise. Therefore, the constraint network is always reduced when . We obtain exactly the same condition as in our mining based compression approach of Propositional CNF formula [4]. This is not surprising, as CGR rule is an extension of our Mining4SAT approach [4] to CSP.

3.1.4 Closed vs. Maximal:

In [4], we introduced two condensed representations of the frequent itemsets: closed and maximal. We know that the set of maximal frequent itemsets is included in that of the closed ones. Thus, fewer fresh variables and new clauses are introduced when using the maximal frequent itemsets. However, there are cases where the use of the closed frequent itemsets is more suitable. The example given in [4] shows the benefit that can be obtained by considering frequent closed itemsets. In our Mining for CSP approach, we search for frequent closed itemsets.

3.1.5 Compression algorithm:

Given a constraint network, we first search for closed frequent itemsets (sets of variables) on its associated transactions database, and then we apply the above rewriting rule on the constraint network using the discovered itemsets of variables. For more details on our algorithm, we refer the reader to the Mining4SAT greedy algorithm [4], where the notion of overlap between itemsets is considered. The general compression problem can be stated as follows: given a set of frequent closed itemsets (sub-sequences of variables) and a constraints network, the question is to find an ordered sequence of operations (applications of the CGR rule) leading to a CSP of minimal size.

3.2 Microstructure Based Compression

In this section, we describe our compression approach for table constraints. First, we show how a table constraint can be translated to a transaction database. Secondly, we show how to compress the constraint using itemset mining techniques.

3.2.1 Table constraint as transactions database:

Obviously, a table constraint can be translated in a naive way to a transaction database. Indeed, one can define the set of items as the union of the domains of the variables in the scope of the constraint, and a transaction as the set of values involved in a tuple of the constraint. This naive representation is difficult to exploit in our context. Consider a frequent itemset of such a database. As the variables associated to the values of the itemset differ from one transaction (or tuple) to another, it is difficult to compress the constraint while using both classical tuples and compressed tuples [6]. To overcome this difficulty, we consider tuples as sequences, where each value is indexed by its position in the tuple.

Definition 6 (Indexed tuples)

Let be a constraint network, and a table constraint such that . Let a tuple of . We define as an indexed tuple associated to i.e. each value of the tuple is indexed with its position in the tuple.

Definition 7 (Inclusion, index)

Let be a table constraint with and a tuple of . We say that is a sub-tuple of , denoted , if such that . We define , while . We also define and .

Definition 8

Let be a constraints network, and a table constraint where . The transaction database associated to , denoted , is defined over the set of items as follows:

Example 2

Let be a constraints network, where , . Let a table constraint, such that and
. The transaction database associated to is defined as follows:

tid itemset
001
002
003
004
005
Table 2: a transaction database associated to

Let be an itemset of . We have , and .
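The translation of a table constraint into a transaction database of indexed tuples can be sketched as follows. The relation below is hypothetical (it is not the relation of Example 2), and representing an indexed value as a (position, value) pair is a choice made for this Python illustration.

```python
def indexed_tuple(tup):
    """Pair each value with its 1-based position in the tuple."""
    return frozenset(enumerate(tup, start=1))


def table_transactions(tuples):
    """One transaction per tuple of the relation, made of indexed values."""
    return {tid: indexed_tuple(t) for tid, t in enumerate(tuples, start=1)}


def cover(itemset, database):
    """Identifiers of the tuples that contain every indexed value given."""
    return {tid for tid, items in database.items() if set(itemset) <= items}


# Hypothetical relation of arity 3.
RELATION = [("a", "b", "c"), ("a", "b", "d"), ("b", "a", "c"), ("a", "b", "e")]
TRANSACTIONS = table_transactions(RELATION)

# The sub-tuple "a in position 1, b in position 2" occurs in tuples 1, 2, 4.
print(sorted(cover({(1, "a"), (2, "b")}, TRANSACTIONS)))  # [1, 2, 4]
```

Without the position index, the third tuple (b, a, c) would also contain the items a and b, although the values are attached to different variables. With the index, a frequent itemset together with its cover delimits a rectangle of the table, that is, a fixed set of columns and a set of rows that agree on them.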

3.2.2 Microstructure Rewriting Rule (MRR):

We now provide a rewriting rule for reducing the size of a table constraint.

Definition 9 (MRR rule)

Let be a constraints network and be a table constraint with and . Let be an itemset of and . In order to rewrite using , we introduce a new variable and a set of new values such that and . Let be a bijection from to . We denote by the constraints network obtained by rewriting with respect to and :

  • ;

  • is a domain function defined as follows: if , and .

  • , where such that:

Example 3

Let us consider again the example 2. Applying the rewriting rule to with respect to , and where and , we obtain the following two constraints:

  • ;

It is easy to see that in example 3, applying MRR rule leads to a constraint of greater size. In what follows, we introduce a necessary and sufficient condition for reducing the size of the table constraint.

3.2.3 Necessary and sufficient condition for size reduction

Let be a table constraint, the number of tuples in , and be a sub-tuple of values corresponding to a frequent itemset of where the minimal support threshold is greater or equal to . Let be the set of tuples such that for . The size of can be reduced by at least . Let us consider again the example 3. The reduction is at least . In this example, we increase the size of by one value. Indeed, and .

Regarding the value of , one can see that applying MRR rule is interesting when i.e. . In the previous example, no reduction is obtained as . (, the condition is not satisfied).

3.2.4 Compression algorithm of a table constraint:

Given a constraint network and one of its table constraints, we first search for closed frequent itemsets (sub-tuples of values) on the transaction database associated to this constraint, and then we apply the above rewriting rule on the table constraint using the discovered itemsets of values. Similarly to the constraint graph based compression algorithm, our microstructure based compression algorithm can be derived from the one defined in [4].

In summary, to compress a general CSP, our approach first applies the constraint graph based compression algorithm, followed by the microstructure based compression algorithm.

4 Conclusion and Future Works

In this paper, we propose a data-mining approach, called Mining4CSP, for reducing the size of constraint satisfaction problems when constraints are represented in extension. It can be seen as a preprocessing step that aims to discover hidden structural knowledge that is used to decrease the size of table constraints. Mining4CSP combines both frequent itemset mining techniques, for discovering interesting substructures, and a Tseitin-based approach, for a compact representation of table constraints using these substructures. Our approach is able to compact a CSP by considering both its associated constraint graph and its microstructure. This allows us to define a two-step algorithm. The first step, named coarse-grained compression, allows us to compact the constraint graph using patterns representing subsets of variables. The second step, named fine-grained compression, allows us to compact a given set of tuples of a given table constraint using patterns representing subsets of values. Finally, an experimental evaluation on CSP instances is a short-term perspective.

References

  • [1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD International Conference on Management of Data, pages 207–216, Baltimore, 1993. ACM Press.
  • [2] C. Beeri, R. Fagin, D. Maier, and M. Yannakakis. On the desirability of acyclic database schemes. Journal of the ACM, 30:479–513, 1983.
  • [3] Dimitrios Gunopulos, Roni Khardon, Heikki Mannila, Sanjeev Saluja, Hannu Toivonen, and Ram Sewak Sharma. Discovering all most specific sentences. ACM Trans. Database Syst., 28(2):140–174, June 2003.
  • [4] Saïd Jabbour, Lakhdar Sais, and Yakoub Salhi. Mining to compact CNF propositional formulae. CoRR, abs/1304.4415, 2013.
  • [5] P. Janssen, P. Jégou, B. Nouguier, and M.C. Vilarem. A filtering process for general constraint satisfaction problems: Achieving pairwise consistency using an associated binary representation. In Proceedings of IEEE Workshop on Tools for Artificial Intelligence, pages 420–427, 1989.
  • [6] George Katsirelos and Toby Walsh. A compression algorithm for large arity extensional constraints. In Proceedings of the 13th International Conference on Principles and Practice of Constraint Programming - CP 2007, volume 4741 of Lecture Notes in Computer Science, pages 379–393. Springer, 2007.
  • [7] G.S. Tseitin. On the complexity of derivations in the propositional calculus. In A.O. Slisenko, editor, Studies in Constructive Mathematics and Mathematical Logic, Part II, pages 115–125, 1968.
  • [8] Takeaki Uno, Masashi Kiyomi, and Hiroki Arimura. LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Roberto J. Bayardo Jr., Bart Goethals, and Mohammed Javeed Zaki, editors, FIMI, volume 126 of CEUR Workshop Proceedings. CEUR-WS.org, 2004.
  • [9] Guizhen Yang. The complexity of mining maximal frequent itemsets and maximal frequent patterns. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 344–353, New York, NY, USA, 2004. ACM.