Constraints are a natural, powerful means of representing and reasoning about combinatorial problems that impact all of our lives. Constraint solving is applied successfully in a wide variety of disciplines such as aviation, industrial design, banking, combinatorics and the chemical and steel industries, to name but a few examples.
A constraint satisfaction problem (CSP) is a set of decision variables, each with an associated domain of potential values, and a set of constraints. An assignment maps a variable to a value from its domain. Each constraint specifies allowed combinations of assignments of values to a subset of the variables. A solution to a CSP is an assignment to all the variables that satisfies all the constraints. Solutions are typically found for CSPs through systematic search of possible assignments to variables. During search, constraint propagation algorithms are used. These propagators make inferences, usually recorded as domain reductions, based on the domains of the variables constrained and the assignments that satisfy the constraints. If at any point these inferences result in any variable having an empty domain then search backtracks and a new branch is considered.
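The interplay of search, propagation and backtracking described above can be illustrated with a minimal sketch. It assumes binary constraints given as predicates and uses forward checking as a deliberately simple propagator; the names `propagate` and `solve` are illustrative and do not correspond to any particular solver.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Domains = Dict[str, Set[int]]
# (x, y, ok) means ok(value_of_x, value_of_y) must hold.
Constraint = Tuple[str, str, Callable[[int, int], bool]]


def propagate(domains: Domains, constraints: List[Constraint],
              var: str, value: int) -> Optional[Domains]:
    """Forward checking: after var=value, remove unsupported values from the
    other variable of every constraint on var. Returns None on an empty domain."""
    new = {v: set(d) for v, d in domains.items()}
    new[var] = {value}
    for x, y, ok in constraints:
        if x == var:
            new[y] = {b for b in new[y] if ok(value, b)}
            if not new[y]:
                return None
        elif y == var:
            new[x] = {a for a in new[x] if ok(a, value)}
            if not new[x]:
                return None
    return new


def solve(domains: Domains, constraints: List[Constraint],
          assignment: Optional[Dict[str, int]] = None) -> Optional[Dict[str, int]]:
    """Systematic search: assign variables one at a time, propagate, and
    backtrack whenever propagation empties a domain."""
    assignment = dict(assignment or {})
    unassigned = [v for v in domains if v not in assignment]
    if not unassigned:
        return assignment
    var = unassigned[0]
    for value in sorted(domains[var]):
        reduced = propagate(domains, constraints, var, value)
        if reduced is None:
            continue  # empty domain: abandon this branch
        result = solve(reduced, constraints, {**assignment, var: value})
        if result is not None:
            return result
    return None  # every value failed: backtrack to the caller
```

Real solvers differ mainly in how much inference `propagate` performs and in how domains are restored on backtracking, which are exactly the kinds of design decisions discussed in this paper.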
When implementing constraint solvers and modelling constraint problems, many design decisions have to be made – for example how much propagation to do, what data structures to use to enable the solver to backtrack, and when and how often to check whether a constraint is satisfied and which values could be removed. These decisions have so far been made mostly manually. Making the “right” decision often depends on the experience of the person making it.
We approach this problem using machine learning. Given a particular problem, we want to decide automatically which design decisions to make. This improves over the current state of the art in two ways. First, we do not require humans to more or less arbitrarily decide on something they may not have any experience with. Second, we can change design decisions for particular problems.
We demonstrate that we can approach machine learning as a “black box” and use generic techniques to increase the performance of the learned classifiers. The result is a system which is able to dynamically decide which implementation to use by looking at an unknown problem. The decision made is in general better than simply relying on a default choice and enables us to solve constraint problems faster.
We are addressing an instance of the Algorithm Selection Problem, which, given variable performance among a set of algorithms, is to choose the best candidate for a particular problem instance. Machine learning is an established method of addressing this problem [14, 16]. Particularly relevant to our work are the machine learning approaches that have been taken to configure, to select among, and to tune the parameters of solvers in the related fields of mathematical programming, propositional satisfiability (SAT), and constraints.
Minton's Multi-tac system configures a constraint solver for a particular instance distribution. It makes informed choices about aspects of the solver such as the search heuristic and the level of constraint propagation. The Adaptive Constraint Engine learns search heuristics from training instances. SATenstein configures stochastic local search solvers for solving SAT problems.
An algorithm portfolio consists of a collection of algorithms, which can be selected and applied in parallel to an instance, or in some (possibly truncated) sequence. This approach has recently been used with great success in SATzilla and CP Hydra. In earlier work, Borrett et al. employed a sequential portfolio of constraint solvers. Guerri and Milano use a decision-tree based technique to select among a portfolio of constraint- and integer-programming based solution methods for the bid evaluation problem. Similarly, Gent et al. investigate decision trees to choose whether to use lazy constraint learning or not.
Ansótegui et al. employ a genetic algorithm to tune the parameters of both local and systematic SAT solvers.
The alldifferent constraint requires all variables which it is imposed on to be pairwise different. For example, alldiff($x_1, x_2, x_3$) enforces $x_1 \neq x_2$, $x_1 \neq x_3$ and $x_2 \neq x_3$.
There are many different ways to implement the alldifferent constraint. The naïve version decomposes the constraint and enforces disequality on each pair of variables. More sophisticated versions (e.g. Régin's filtering algorithm) consider the constraint as a whole and are able to do more propagation. For example, an alldifferent constraint which involves four variables with the same three possible values each cannot be satisfied, but this knowledge cannot be derived when just considering the decomposition into pairs of variables.
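The four-variables, three-values example can be checked with a short sketch. It is not one of the implementations evaluated in this paper: `pairwise_prune` propagates the binary decomposition to a fixpoint, and `max_matching_size` uses a textbook augmenting-path bipartite matching as a stand-in for reasoning about the constraint as a whole. All names are illustrative.

```python
from typing import Dict, Set


def pairwise_prune(domains: Dict[str, Set[int]]) -> Dict[str, Set[int]]:
    """Propagate x != y for every pair of variables until nothing changes.
    A value is removed from y only when some other variable is fixed to it."""
    doms = {v: set(d) for v, d in domains.items()}
    changed = True
    while changed:
        changed = False
        for x in doms:
            if len(doms[x]) == 1:
                (val,) = doms[x]
                for y in doms:
                    if y != x and val in doms[y]:
                        doms[y].discard(val)
                        changed = True
    return doms


def max_matching_size(domains: Dict[str, Set[int]]) -> int:
    """Size of a maximum matching between variables and values.
    alldifferent is satisfiable iff this equals the number of variables."""
    match: Dict[int, str] = {}  # value -> variable currently using it

    def try_assign(var: str, seen: Set[int]) -> bool:
        for val in domains[var]:
            if val in seen:
                continue
            seen.add(val)
            # Take a free value, or re-route the variable that holds it.
            if val not in match or try_assign(match[val], seen):
                match[val] = var
                return True
        return False

    return sum(try_assign(v, set()) for v in domains)


# Four variables, three values each: the decomposition prunes nothing,
# but no matching covers all four variables.
domains = {v: {1, 2, 3} for v in ("w", "x", "y", "z")}
decomposition_result = pairwise_prune(domains)
matching = max_matching_size(domains)
```

Here `decomposition_result` equals the original domains, so the pairwise view sees no problem, whereas `matching` is smaller than the number of variables, which proves that the constraint cannot be satisfied.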
Even when the high-level decision of how much propagation to do has been made, a low-level decision has to be made on how to implement the constraint. For an in-depth survey of the decisions involved, see the empirical survey by Gent, Miguel and Nightingale.
We make both decisions and therefore combine the selection of an algorithm (the naïve implementation or the more sophisticated one) and the tuning of algorithm parameters (which one of the more sophisticated implementations to use).
3 The benchmark instances and solvers
We evaluated the performance of the different versions of the alldifferent constraint on two different sets of problem instances. The first one was used for learning classifiers, the second one only for the evaluation of the learned classifiers.
The set we used for machine learning consisted of 277 benchmark instances from 14 different problem classes. It was chosen to include as many instances as possible, regardless of our expectation of which version of the alldifferent constraint would perform best.
The set to evaluate the learned classifiers consisted of 1036 instances from 2 different problem classes that were not present in the set we used for machine learning. We chose this set for evaluation because the low number of different problem classes makes it unsuitable for training.
Our sources are Lecoutre’s XCSP repository and our own stock of CSP instances. The reference constraint solver used is Minion version 0.9 and its default implementation of the alldifferent constraint, gacalldiff. The experiments were run with binaries compiled with g++ version 4.4.3 and Boost version 1.40.0 on machines with 8-core Intel E5430 2.66GHz, 8GB RAM running CentOS with Linux kernel 2.6.18-164.6.1.el5 64-bit.
We imposed a time limit of 3600 seconds for each instance. The total number of instances that no solver could solve because of a timeout was 66 for the first set and 26 for the second set. We took the median CPU time of 3 runs for each problem instance.
Adapting the implementation decision to the problem instead of always choosing a standard implementation has the potential of achieving speedups of up to 2 on the first set of benchmark instances and speedups of up to 1.2 on the second set. A speedup of 2 means that the fastest version of alldifferent is twice as fast as the default version.
We ran the problems with 9 different versions of the alldifferent constraint – the naïve version which is equivalent to the binary decomposition and 8 different implementations of the more sophisticated version which does more propagation (see the survey by Gent, Miguel and Nightingale). The amount of search done by the 8 versions which implement the more sophisticated algorithm was the same.
The instances, the binaries to run them, and everything else required to reproduce our results are available from the primary author on request.
4 Instance attributes and their measurement
We measured 37 attributes of the problem instances. They describe a wide range of features such as constraint and variable statistics and a number of attributes based on the primal graph. The primal graph has a vertex for every CSP variable, and two vertices are connected by an edge iff the two variables are in the scope of a constraint together.
- Edge density
The number of edges in the primal graph divided by the number of pairs of distinct vertices.
- Clustering coefficient
For a vertex $v$, the set of neighbours of $v$ is $n(v)$. The edge density among the vertices $n(v)$ is calculated. The clustering coefficient is the mean average of this local edge density for all $v$. It is intended to be a measure of the local cliqueness of the graph. This attribute has been used with machine learning for a model selection problem in constraint programming.
- Normalised degree
The normalised degree of a vertex is its degree divided by $|V|$, the number of vertices in the primal graph. The minimum, maximum, mean and median normalised degree are used.
- Normalised standard deviation of degree
The standard deviation of vertex degree is normalised by dividing by the number of vertices $|V|$.
- Width of ordering
Each of our benchmark instances has an associated variable ordering. The width of a vertex $v$ in an ordered graph is its number of parents (i.e. neighbours that precede $v$ in the ordering). The width of the ordering is the maximum width over all vertices. The width of the ordering normalised by the number of vertices was used.
- Width of graph
The width of a graph is the minimum width over all possible orderings. This can be calculated in polynomial time, and is related to some tractability results. The width of the graph normalised by the number of vertices was used.
- Variable domains
The quartiles and the mean of the domain sizes of all variables.
- Constraint arity
The quartiles and the mean of the arity of all constraints (the number of variables constrained by it), normalised by the number of constraints.
- Multiple shared variables
The proportion of pairs of constraints that share more than one variable.
- Normalised mean constraints per variable
For each variable, we count the number of constraints on the variable. The mean average is taken, and this is normalised by dividing by the number of constraints.
- Ratio of auxiliary variables to other variables
Auxiliary variables are introduced by decomposition of expressions in order to be able to express them in the language of the solver. We use the ratio of auxiliary variables to other variables.
- Constraint tightness
The tightness of a constraint is the proportion of disallowed tuples. The tightness is estimated by sampling 1000 random tuples (that are valid w.r.t. variable domains) and testing whether each tuple satisfies the constraint. The tightness quartiles and the mean tightness over all constraints are used.
- Proportion of symmetric variables
In many CSPs, the variables form equivalence classes where the number and type of constraints a variable is in are the same. For example in the CSP , are all indistinguishable, as are and . The first stage of the algorithm used by Nauty detects this property. Given a partition of variables generated by this algorithm, we transform this into a number between 0 and 1 by taking the proportion of all pairs of variables which are in the same part of the partition.
- Alldifferent statistics
The size of the union of the domains of all variables in an alldifferent constraint divided by the number of variables. This is a measure of how many assignments to all variables that satisfy the constraint there are. We used the quartiles and the mean over all alldifferent constraints.
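Several of the primal-graph attributes in the list above can be computed with a few lines of code. The following is a minimal sketch rather than the tool used for the experiments. It assumes each constraint is given simply as the list of variables in its scope, and it assumes a local edge density of 0 for vertices with fewer than two neighbours. All function names are illustrative.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, Iterable, List, Optional, Set

Graph = Dict[str, Set[str]]


def primal_graph(scopes: List[List[str]]) -> Graph:
    """One vertex per variable; an edge joins two variables that share a constraint."""
    adj: Graph = {}
    for scope in scopes:
        for v in scope:
            adj.setdefault(v, set())
        for a, b in combinations(set(scope), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def edge_density(adj: Graph, vertices: Optional[Iterable[str]] = None) -> float:
    """Edges among `vertices` (default: all) divided by the number of vertex pairs."""
    vs = set(adj) if vertices is None else set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0  # assumption: no pairs means density 0
    edges = sum(1 for a, b in combinations(vs, 2) if b in adj[a])
    return edges / (n * (n - 1) / 2)


def clustering_coefficient(adj: Graph) -> float:
    """Mean over all vertices of the edge density among each vertex's neighbours."""
    return mean(edge_density(adj, adj[v]) for v in adj)


def ordering_width(adj: Graph, order: List[str]) -> int:
    """Maximum number of neighbours that precede a vertex in the given ordering."""
    pos = {v: i for i, v in enumerate(order)}
    return max(sum(1 for u in adj[v] if pos[u] < pos[v]) for v in order)


# A ternary constraint on a, b, c and a binary constraint on c, d.
g = primal_graph([["a", "b", "c"], ["c", "d"]])
density = edge_density(g)                      # 4 edges out of 6 possible pairs
clustering = clustering_coefficient(g)
normalised_width = ordering_width(g, ["a", "b", "c", "d"]) / len(g)
```

Dividing the width by `len(g)` mirrors the normalisation by the number of vertices described for the width attributes.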
In creating this set of attributes, we intended to cover a wide range of possible factors that affect the performance of different alldifferent implementations. Wherever possible, we normalised attributes that would be specific to problem instances of a particular size. This is based on the intuition that similar instances of different sizes are likely to behave similarly. Computing the features took 27 seconds per instance on average.
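Two of the non-graph attributes, the sampled constraint tightness and the proportion of symmetric variables, lend themselves to a short illustration. The sketch below assumes a constraint is available as a Python predicate over a tuple of values and that the partition of variables has already been computed elsewhere (for example by a tool such as Nauty); both function names are hypothetical.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple


def estimate_tightness(domains: Sequence[Sequence[int]],
                       satisfied: Callable[[Tuple[int, ...]], bool],
                       samples: int = 1000,
                       rng: Optional[random.Random] = None) -> float:
    """Estimate the proportion of disallowed tuples by sampling tuples
    uniformly from the Cartesian product of the variable domains."""
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    violated = 0
    for _ in range(samples):
        tup = tuple(rng.choice(d) for d in domains)
        if not satisfied(tup):
            violated += 1
    return violated / samples


def same_part_proportion(partition: List[List[str]]) -> float:
    """Proportion of all pairs of variables that lie in the same part."""
    n = sum(len(part) for part in partition)
    total_pairs = n * (n - 1) // 2
    if total_pairs == 0:
        return 0.0
    same = sum(len(part) * (len(part) - 1) // 2 for part in partition)
    return same / total_pairs


# x != y over domains {1,2,3}: exactly 3 of the 9 tuples are disallowed,
# so the estimate should be close to 1/3.
tightness = estimate_tightness([[1, 2, 3], [1, 2, 3]], lambda t: t[0] != t[1])

# Five variables split into parts of size 3 and 2: (3 + 1) of 10 pairs.
symmetry = same_part_proportion([["a", "b", "c"], ["d", "e"]])
```

Because tightness is estimated by sampling, the cost of this attribute is bounded by the number of samples rather than by the size of the constraint, which fits the concern with feature computation time raised above.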
5 Learning a problem classifier
Before we used machine learning on the set of training instances, we annotated each problem instance with the alldifferent implementation that had the best performance on it according to the following criteria. If the naïve alldifferent implementation took less CPU time than all the other ones, it was chosen, else the implementation which had the best performance in terms of search nodes per second was chosen. As all implementations except the naïve one explore the same search space, nodes per second is a more reliable measure of performance than only CPU time. If no solver was able to solve the instance, we assigned a “don’t know” annotation.
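The annotation rule can be written down compactly. The sketch below is one reading of the criteria above, under the assumptions that a timed-out run is recorded as `None` and that a timed-out implementation counts as slower than any run that finished. The implementation names in the example are hypothetical.

```python
from typing import Dict, Optional

DONT_KNOW = "don't know"


def annotate(cpu_time: Dict[str, Optional[float]],
             nodes_per_second: Dict[str, float],
             naive: str = "naive") -> str:
    """Label an instance with the implementation considered best on it.

    cpu_time maps implementation name -> CPU seconds, or None on timeout.
    nodes_per_second maps implementation name -> search nodes per second.
    """
    solved = {name: t for name, t in cpu_time.items() if t is not None}
    if not solved:
        return DONT_KNOW  # no implementation solved the instance
    others = {name: t for name, t in solved.items() if name != naive}
    # The naive version wins only if it beats every other implementation on CPU time.
    if naive in solved and all(solved[naive] < t for t in others.values()):
        return naive
    # Otherwise compare the sophisticated versions by nodes per second,
    # since they all explore the same search space.
    return max(others, key=lambda name: nodes_per_second[name])


label = annotate(
    {"naive": 10.0, "gac_a": 6.0, "gac_b": 4.0},
    {"naive": 50.0, "gac_a": 100.0, "gac_b": 150.0},
)
```

In this example `label` is `"gac_b"`, because the naïve version is not the fastest and `gac_b` processes the most nodes per second.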
We used the WEKA machine learning software through the R interface to learn classifiers. We used almost all of the WEKA classifiers that were applicable to our problem – algorithms which generate decision rules, decision trees, Bayesian classifiers, nearest neighbour and neural networks. Our selection is broad and includes all major machine learning methodologies. The specific classifiers we used are BayesNet, BFTree, ConjunctiveRule, DecisionTable, FT, HyperPipes, IBk, J48, J48graft, JRip, LADTree, MultilayerPerceptron, NBTree, OneR, PART, RandomForest, RandomTree, REPTree and ZeroR, all of which are described by Witten and Frank.
The problem of classifying problem instances here is different to normal machine learning classification problems. We do not particularly care about classifying as many instances as possible correctly; we rather care that the instances that are important to us are classified correctly. The higher the potential gain is for an instance, the more important it is to us. If, for example, the difference between making the right and the wrong decision means a difference in CPU time of 1%, we do not care whether the instance is classified correctly or not. If the difference is several orders of magnitude on the other hand, we really do want this instance to be classified correctly.
Based on this observation, we decided to measure the performance of the learned classifiers not in terms of the usual machine learning performance measures, but in terms of misclassification penalty. The misclassification penalty is the additional CPU time we require to solve a problem instance when choosing to solve it with a solver that is not the fastest one. If the selected solver was not able to solve the problem, we assumed the timeout of 3600 seconds minus the CPU time the fastest solver took to be the misclassification penalty. This gives only a lower bound on the true penalty, but the correct value cannot be estimated easily.
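The definition of the misclassification penalty translates directly into code. In this sketch a timed-out run is represented by `None`, and an instance that no solver finished is given a penalty of 0, which is an assumption made here rather than something stated above.

```python
from typing import Dict, Optional

TIMEOUT = 3600.0  # seconds, the time limit used in the experiments


def misclassification_penalty(cpu_time: Dict[str, Optional[float]],
                              chosen: str,
                              timeout: float = TIMEOUT) -> float:
    """Extra CPU time incurred by running `chosen` instead of the fastest solver.

    cpu_time maps solver name -> CPU seconds, or None if the solver timed out.
    """
    solved = [t for t in cpu_time.values() if t is not None]
    if not solved:
        return 0.0  # assumption: nothing to lose if no solver finishes
    best = min(solved)
    chosen_time = cpu_time[chosen]
    if chosen_time is None:
        # Lower bound: the chosen solver needed at least the full time limit.
        return timeout - best
    return chosen_time - best


times = {"naive": None, "gac": 100.0}
penalty = misclassification_penalty(times, "naive")  # 3600 - 100
```

Summing this quantity over all instances of a set gives the total misclassification penalty used to compare classifiers.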
We furthermore decided to assign the maximum misclassification penalty (or the maximum possible gain) as a cost to each instance as follows. To bias the WEKA classifiers towards the instances we care about most, we used the common technique of duplicating instances. Each instance appeared in the new data set a number of times that depends on its cost. The particular formula to determine how often each instance occurs was chosen empirically such that instances with a low cost are not disregarded completely, but instances with a high cost are much more important. Each instance will be in the data set used for training the machine learning classifiers at least once and at most 13 times for a theoretic maximum cost of 3600.
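Cost-sensitive duplication can be sketched as follows. The weighting function used here, one plus the rounded-up base-2 logarithm of the cost, is purely an assumption: it is chosen only because it yields at least one copy for every instance and 13 copies at a cost of 3600, and it should not be read as the formula used in the experiments.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def copies(cost: float) -> int:
    """Hypothetical weighting: logarithmic in the cost, never less than 1."""
    if cost <= 1.0:
        return 1  # low-cost instances are kept, but only once
    return 1 + math.ceil(math.log2(cost))


def duplicate(instances: Sequence[T], costs: Sequence[float]) -> List[T]:
    """Build a training set in which each instance is repeated according to its cost."""
    weighted: List[T] = []
    for instance, cost in zip(instances, costs):
        weighted.extend([instance] * copies(cost))
    return weighted


training_set = duplicate(["cheap", "expensive"], [0.5, 3600.0])
```

A logarithmic scale keeps the training set small while still letting instances with a cost of several orders of magnitude dominate those where the choice hardly matters.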
To achieve multi-level classification, each individual classifier below consists of a combination of classifiers. First we make the decision whether to use the alldifferent version equivalent to the binary decomposition or the other one, then, based on the previous decision, we decide which specific version of the alldifferent constraint to use.
Table 1 shows the total misclassification penalty for all classifiers with and without instance duplication on the first data set. It clearly shows that our cost model improves the performance significantly in terms of misclassification penalty for almost all classifiers.
For each classifier, we did stratified $n$-fold cross-validation – the original data set is split into $n$ parts of roughly equal size. Each of the $n$ partitions is in turn used for testing. The remaining $n-1$ partitions are used for training. In the end, every instance will have been used for both training and testing in different runs. Stratified cross-validation ensures that the ratio of the different classification categories in each subset is roughly equal to the ratio in the whole set. If, for example, about 50% of all problem instances in the whole data are solved fastest with the naïve implementation, it will be about 50% of the instances in each subset as well.
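The splitting scheme can be illustrated without any machine learning library. This is a minimal sketch, not the WEKA procedure: it deals the instances of each class round-robin over the folds and omits the random shuffling a real implementation would normally perform first. The function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def stratified_folds(labels: Sequence[str], k: int) -> List[List[int]]:
    """Split instance indices into k folds with roughly equal class ratios."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds: List[List[int]] = [[] for _ in range(k)]
    position = 0  # carried across classes so fold sizes stay balanced
    for label in sorted(by_class):
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds


def train_test_splits(labels: Sequence[str], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = stratified_folds(labels, k)
    for t in range(k):
        train = [i for j, fold in enumerate(folds) if j != t for i in fold]
        yield train, folds[t]


labels = ["naive"] * 6 + ["gac"] * 6
folds = stratified_folds(labels, 3)
```

With half of the twelve example instances labelled `naive`, each of the three folds contains two `naive` and two `gac` instances, mirroring the 50% example in the text.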
There are several problems we faced when generating the classifiers. First, we do not know which one of the machine learning algorithms is best suited for our classification problem; indeed we do not know whether the features of the problem instances we measured are able to capture the factors which affect the performance of each individual implementation at all. Second, the learned classifiers could be overfitted. We could evaluate the performance of each classifier on the second set of problem instances and compare it to the performance during machine learning to assess whether it might be overfitted. Even if we were able to reliably detect overfitting this way, it is not obvious how we would change or retrain the classifier to remove the overfitting. Instead, we decided to use all classifiers – for each machine learning algorithm the different classifiers created during the $n$-fold cross-validation and the classifiers created by each different machine learning algorithm.
We decided to use three-fold cross-validation as an acceptable compromise between trying to avoid overfitting and time required to compute and run the classifiers. We combine the decisions of the individual classifiers by majority vote (bagging and boosting).
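Combining the individual decisions by majority vote needs very little code. The sketch below treats each classifier's output as a label; the tie-breaking rule (prefer a given default label, otherwise the label seen first) is an assumption made for the example and is not specified above.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_vote(predictions: Sequence[str], default: Optional[str] = None) -> Optional[str]:
    """Return the label predicted by the largest number of classifiers."""
    if not predictions:
        return default
    counts = Counter(predictions)
    top = max(counts.values())
    winners = [label for label, count in counts.items() if count == top]
    if len(winners) > 1 and default in winners:
        return default  # assumed tie-break: fall back to the default choice
    return winners[0]   # otherwise the first label seen among the winners


decision = majority_vote(["gac_a", "naive", "gac_a"], default="gac_default")
```

Because only the most frequent label matters, a minority of badly performing classifiers cannot change the outcome as long as their errors do not coincide.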
Table 2 shows the overall performance of our meta-classifier compared to the best and worst individual classifier for each set and several other hypothetical classifiers. Our meta-classifier outperforms a classifier which always makes the default decision even on the second set of problem instances. This set is an extreme case because just making the default choice is almost always the best choice – the misclassification penalty for the default choice classifier is extremely low given the large number of instances. Even though there is only very little room for improvement, we achieve some of it.
Misclassification penalty [s]

| classifier | instance set 1, all features | instance set 1, cheap features | instance set 2, all features | instance set 2, cheap features |
|---|---|---|---|---|
| best classifier on set 1 | 0.998 | 0.994 | 131 | 220.3 |
| worst classifier on set 1 | 2304 | 2304 | 223 | 223 |
| best classifier on set 2 | 0.998 | 61.66 | 131 | 186 |
| worst classifier on set 2 | 1.34 | 1.44 | 621 | 610 |
It also shows that the classifiers we have learned on a data set that contains problem instances from many problem classes can be applied to a different data set with instances from different problem classes and still achieve a performance improvement. Based on this observation, we suggest that our meta-classifier is generally applicable.
Another observation we made is that the performance of the meta-classifier does not suffer even if a large number of the classifiers that it combines perform badly individually. This suggests that the classifiers complement each other – the set of instances that each one misclassifies are different for each classifier. Note also that the classifier which performs best on one set of instances is not necessarily the best performer on the other set of instances. The same observation can be made for the classifier with the worst performance on one of the instance sets. This means that we cannot simply choose “the best” classifier or discard “the worst” for a given set of training instances.
The time required to compute the features was 27 seconds per instance on average, and it took 0.2 seconds per instance on average to run the classifiers and combine their decisions. If we take this time into account, our system is slower than just using the default implementation. This is mostly because of the cost of computing all the features required to make the decision. We do however learn good classifiers in the sense that the decision they make is better than just using the standard implementation.
For the alldifferent constraint, there is little room for improvement to start with. We therefore need to focus on making a decision as quickly as possible. Most of the time required to make the decision is spent computing the features that the classifiers need. We removed the most expensive features – all the properties of the primal graph described in Section 4 apart from edge density.
The results for the reduced set of features are shown in Table 2 as well. The performance is not significantly worse and even better on the first set of instances, but the time required to compute all the features is only about 3 seconds per instance. On the first set of benchmarks, we solve each instance on average 8 seconds faster using our system (misclassification penalty of default decision minus that of our system divided by the number of instances in the set). We are therefore left with a performance improvement of an average of 5 seconds per instance. On the second set, we cannot reasonably expect a performance improvement – the perfect oracle classifier only achieves about 0.2 seconds per instance on average.
6 Conclusions and future work
We have applied machine learning to a complex decision problem in constraint programming. To facilitate this, we evaluated the performance of constraint solvers representing all the decisions on two large sets of problem instances. We have demonstrated that training a set of classifiers without intrinsic knowledge about each individual one and combining their decisions can improve performance significantly over always making a default decision. In particular, our combined classifier is almost as good as the best classifier in the set and much better than the worst classifier.
We have conclusively shown that we can improve significantly on default decisions suggested in the state-of-the-art literature using a relatively simple and generic procedure. We provide strong evidence for the general applicability of a set of classifiers learned on a training set to sets of new, unknown instances. We identified several problems with using machine learning to make constraint programming decisions and successfully solved them.
Our system achieves performance improvements even taking the time it takes to compute the features and run the learned classifiers into account. For atypical sets of benchmarks, where always making the default decision is the right choice in almost all of the cases, we are not able to compensate for this overhead, but we are confident that we can achieve a real speedup on average.
We have identified two major directions for future research. First, it would be beneficial to analyse the individual machine learning algorithms and evaluate their suitability for our decision problem. This would enable us to make a more informed decision about which ones to use for our purposes and may suggest opportunities for improving them.
Second, selecting which features of problem instances to compute is a non-trivial choice because of the different cost and benefit associated with each one. The classifiers we learned on the reduced set of features did not seem to suffer significantly in terms of performance. Being able to assess the benefit of each individual feature towards a classifier and contrast that to the cost of computing it would enable us to make decisions of equal quality cheaper.
The authors thank Peter Nightingale for helpful discussions about the alldifferent constraint and its implementations and descriptions of the problem features used in the analysis. Chris Jefferson provided further feature descriptions. We thank Jesse Hoey for useful discussions about machine learning and the anonymous reviewers for their feedback. In particular reviewer 2 provided helpful pointers to relevant machine learning research. Lars Kotthoff is supported by a SICSA studentship.
[1] Ansótegui, C., Sellmann, M., Tierney, K.: A gender-based genetic algorithm for the automatic configuration of algorithms. In: CP. pp. 142–157 (2009)
[2] Borrett, J., Tsang, E., Walsh, N.: Adaptive constraint satisfaction: The quickest first principle. In: ECAI. pp. 160–164 (1996)
[3] Dechter, R.: Constraint Processing. Elsevier Science (2003)
[4] Epstein, S., Freuder, E., Wallace, R., Morozov, A., Samuels, B.: The adaptive constraint engine. In: CP. pp. 525–542 (2002)
[5] Gent, I., Jefferson, C., Kotthoff, L., Miguel, I., Moore, N., Nightingale, P., Petrie, K.: Learning when to use lazy learning in constraint solving. In: ECAI (2010)
[6] Gent, I., Jefferson, C., Miguel, I.: Minion: A fast scalable constraint solver. In: ECAI. pp. 98–102 (2006)
[7] Gent, I., Miguel, I., Moore, N.: Lazy explanations for constraint propagators. In: PADL (2010)
[8] Gent, I., Miguel, I., Nightingale, P.: Generalised arc consistency for the alldifferent constraint: An empirical survey. Artif. Intell. 172(18), 1973–2000 (2008)
[9] Guerri, A., Milano, M.: Learning techniques for automatic algorithm portfolio selection. In: ECAI. pp. 475–479 (2004)
[10] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: An update. SIGKDD Explorations 11(1) (2009)
[11] Hutter, F., Hamadi, Y., Hoos, H., Leyton-Brown, K.: Performance prediction and automated tuning of randomized and parametric algorithms. In: CP. pp. 213–228 (2006)
[12] KhudaBukhsh, A., Xu, L., Hoos, H., Leyton-Brown, K.: SATenstein: Automatically building local search SAT solvers from components. In: IJCAI. pp. 517–524 (2009)
[13] Kotthoff, L.: Constraint solvers: An empirical evaluation of design decisions. CIRCA preprint (2009), http://www-circa.mcs.st-and.ac.uk/Preprints/solver-design.pdf
[14] Lagoudakis, M., Littman, M.: Reinforcement learning for algorithm selection. In: AAAI/IAAI. p. 1081 (2000)
[15] Lecoutre, C.: XCSP benchmarks. http://tinyurl.com/y6hpphs (May 2010)
[16] Leyton-Brown, K., Nudelman, E., Andrew, G., McFadden, J., Shoham, Y.: A portfolio approach to algorithm selection. In: IJCAI. pp. 1542–1543 (2003)
[17] McKay, B.: Practical graph isomorphism. In: Numerical mathematics and computing, Proc. 10th Manitoba Conf., Winnipeg/Manitoba 1980, Congr. Numerantium 30. pp. 45–87 (1981), see also http://cs.anu.edu.au/people/bdm/nauty
[18] Minton, S.: Automatically configuring constraint satisfaction programs: A case study. Constraints 1(1/2), 7–43 (1996)
[19] O’Mahony, E., Hebrard, E., Holland, A., Nugent, C., O’Sullivan, B.: Using case-based reasoning in an algorithm portfolio for constraint solving. In: 19th Irish Conference on AI (2008)
[20] Régin, J.C.: A filtering algorithm for constraints of difference in CSPs. In: AAAI. pp. 362–367 (1994)
[21] Rice, J.: The algorithm selection problem. Advances in Computers 15, 65–118 (1976)
[22] Witten, I., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2005)
[23] Xu, L., Hutter, F., Hoos, H., Leyton-Brown, K.: SATzilla: Portfolio-based algorithm selection for SAT. J. Artif. Intell. Res. (JAIR) 32, 565–606 (2008)